Innovecture Innsight

Managing the Data Explosion

Data Explosion
The rapid growth in the creation and consumption of rich media data has increased the cost and complexity of managing data within organizations. The traditional software solution used by enterprises is the relational database management system (RDBMS). While an RDBMS still meets most needs for storing large volumes of data, the current explosion of data has exposed its weaknesses in handling volumes on the scale of BigData.

An RDBMS provides a robust platform for managing data. Its strengths are:

• Provides a relational structure for easy storage and retrieval of information, especially in a transactional environment
• Maintains data integrity through the ACID properties
• Provides a reliable way to recover data after a system failure, using the transaction logs that the RDBMS creates

However, because of its I/O operations, the performance of an RDBMS suffers when it comes to managing BigData. Increasing traffic from social networks, sensor networks, Internet searches, military surveillance, and multimedia (especially video) is making data size a moving target. The data sizes to be managed are now moving beyond the petabyte range. With an RDBMS, horizontal scaling, sharding, and distributed databases provide some scalability at a manageable cost; beyond that point, we need to explore newer technologies designed specifically for handling BigData.
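
To make the sharding option concrete, the following is a minimal sketch of hash-based routing, assuming records are identified by a string key. The names shard_for and ShardedStore are hypothetical and used only for illustration. Each shard is modeled as an in-memory dictionary standing in for a separate database server.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash.

    A cryptographic hash is used only because it is stable across
    processes, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-value store that spreads records across several shards.

    Assumption: each dict stands in for an independent database server.
    """

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # Route the write to the shard that owns this key.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Route the read to the same shard; return None if the key is absent.
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Because the shard index depends on the number of shards, adding a server in this simple scheme would move most keys to a different shard. This is one reason why scaling an RDBMS this way helps only up to a point.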

Some of these new techniques are:
  • In-memory databases: In-memory databases reduce I/O and improve performance, but they are very expensive and offer weak support for the durability property of ACID.
  • Massively parallel processing (MPP) databases: MPP architectures use independent servers that execute in parallel in a shared-nothing mode. Most appliance-based solutions use this technique. Although performance is a major benefit, the cost of ownership needs to be watched.
  • BigTable: Designed by Google, BigTable is a compressed, high-performance database system for managing petabytes of information. It is proprietary to Google, although some vendors provide similar solutions in their products.
  • MapReduce: MapReduce is a framework patented by Google for processing huge datasets by partitioning the workload and distributing it to worker nodes in a cluster. The master node collects the response from each worker node to produce the output. MapReduce is typically applied where complex, large datasets need to be processed.
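
The MapReduce flow described in the list above can be sketched with a word count, the usual introductory example. This is a single-process illustration of the idea, not Google's implementation. The function names (partition, map_phase, shuffle, reduce_phase, run_job) are hypothetical, and the "workers" are simulated by ordinary function calls rather than nodes in a cluster.

```python
from collections import defaultdict


def partition(records, num_workers):
    """Split the input into one chunk per (simulated) worker node."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_phase(chunk):
    """Worker step: emit a (word, 1) pair for every word in the chunk."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_outputs):
    """Master step: group the emitted values by key across all workers."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, num_workers=3):
    chunks = partition(records, num_workers)
    # In a real cluster each chunk would be processed in parallel on its
    # own worker node; here the chunks are processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))
```

For example, run_job(["big data big", "data explosion"], num_workers=2) counts "big" twice, "data" twice, and "explosion" once, regardless of how the lines are split between workers.
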
While many solutions are available in the market today to manage the BigData problem, all of them are relatively new and have yet to establish themselves as mature solutions in the industry. The current trend is to combine these newer techniques with optimized hardware to deliver high database management performance.