Friday, September 5, 2014

DataWarehouse vs BigData

To be or not to be is the question to ask today. I'm not being Hamlet here, but with the evolution of Big Data and the current technology trends, I'm curious to discover the cases where it can overtake the conventional Data Warehouse, where it cannot and, most important, in which areas DW and Big Data can be implemented in conjunction.

The Data Warehousing concept has been in place for decades now, while Big Data, In Memory and the Cloud started gaining ground in the early 2000s and are in high demand today. Between then and now, the data on the network has grown exponentially, and the competitive analysis of this data has led to the evolution of new tools and BI concepts. Does that mean a slow sunset for DW? I'd say that's a myth: the business cases differ, and the ideal approach will surely be a blend of both.


Before I discuss when to use which approach, or how to use both, I want to compare the three main data storage techniques in a matrix. Though In Memory and Big Data storage work on pretty much similar principles, I list In Memory separately because tools like SAP HANA cost a fortune in comparison to Big Data implementations like Hadoop.

                                                | Data Warehouse  | Big Data   | In Memory
Total Implementation Cost                       | High            | Reasonable | High
Total Operational Cost                          | High            | Reasonable | High
Can be used for querying real-time data?        | No              | Yes        | Yes
Can effectively store unstructured data?        | No              | Yes        | Yes
Can effectively store frequently changing data? | Yes             | No         | No
Processing speed with large volume of data      | Relatively Slow | Fastest    | Faster

Certainly the cost of implementation will be lower if things are set up on the Cloud; however, I'll park that thought for now as it's more of a service platform than a storage technique.

When to implement a Data Warehouse?

A Data Warehouse is typically built on an RDBMS. It stores data at row-level granularity, which means you read, insert or update a whole record in every single operation. Therefore, in cases where your data is volatile and you need to retain the history of the changes, it is best stored in and read from a DW. For example, consider an insurance company XYZ whose data analysts refer to the last 10 years' worth of historical data on security bonds to perform analysis based on the security rating, price and other dimensions whose values have changed a lot in that time frame. Such a data set will work well with a DW, as your slowly changing dimensions can be stored and queried effectively if the DW structure is well planned. Of course there are challenges in integrating the data, but the end result is sweet.
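To make the slowly changing dimension idea concrete, here is a minimal sketch in plain Python of one common way to keep history: each change closes the current row and opens a new one with validity dates (often called a "Type 2" dimension). The `BondRatingRow` class, the function names and the sample bond ratings are all hypothetical and only illustrate the idea; a real warehouse would do this in database tables.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class BondRatingRow:
    """One version of a bond's rating in a hypothetical dimension table."""
    bond_id: str
    rating: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_rating_change(history: List[BondRatingRow], bond_id: str,
                        new_rating: str, change_date: date) -> None:
    """Close the current row for bond_id and append a new current row."""
    for row in history:
        if row.bond_id == bond_id and row.valid_to is None:
            if row.rating == new_rating:
                return  # nothing changed, keep the current row open
            row.valid_to = change_date
    history.append(BondRatingRow(bond_id, new_rating, change_date))


def rating_as_of(history: List[BondRatingRow], bond_id: str,
                 as_of: date) -> Optional[str]:
    """Return the rating that was valid on the given date, if any."""
    for row in history:
        if (row.bond_id == bond_id and row.valid_from <= as_of
                and (row.valid_to is None or as_of < row.valid_to)):
            return row.rating
    return None


# Made-up sample data: one bond downgraded twice over the years.
history: List[BondRatingRow] = []
apply_rating_change(history, "BOND-1", "AAA", date(2005, 1, 1))
apply_rating_change(history, "BOND-1", "AA", date(2009, 6, 1))
apply_rating_change(history, "BOND-1", "A", date(2013, 3, 1))

print(rating_as_of(history, "BOND-1", date(2010, 1, 1)))
```

Because no row is ever overwritten, an analyst can ask what the rating was on any past date, which is the kind of question the 10-year bond analysis needs answered.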

When to implement Big Data?

Big Data stores, on the other hand, often use columnar storage, i.e. data is stored by column rather than by row. It's the metadata that connects the dots between the related dimensions. Distributed and parallel processing makes it the fastest at computing calculations on a HUGE volume of data. Gartner describes the nature of Big Data with the 3 Vs: Volume (the size of the data), Variety (the type of the data) and Velocity (the rate at which the data flows in and out). And I'm not talking about GBs or TBs here but PBs and beyond, be it structured or unstructured. Therefore, if you are dealing with such a high volume of data where your object values do not change very often, Big Data is the preferred solution. Here is an example use case for our XYZ insurance company. Imagine they are running competitive marketing campaigns and analysis to sell more insurance policies. To identify new potential customers, they can analyse the unstructured data that is produced daily on social networks, mobile phones, commodity websites etc. Such a data set will go well with a Big Data solution. Another example is fraud analytics, where the probability of data changing is very low in a large real-time data set.
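The difference between row and column layouts can be sketched in a few lines of Python. This is only an illustration with made-up policy records, not how Hadoop or any particular product stores data: the same three records are held once as a list of rows and once as a dictionary of columns.

```python
# Row-oriented layout: each record is kept together.
rows = [
    {"policy_id": 1, "region": "North", "premium": 1200.0},
    {"policy_id": 2, "region": "South", "premium": 950.0},
    {"policy_id": 3, "region": "North", "premium": 1430.0},
]

# Column-oriented layout: all values of one attribute are kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_premium_row_store(records):
    """Every whole record is visited even though one field is needed."""
    return sum(record["premium"] for record in records)


def total_premium_column_store(cols):
    """Only the 'premium' column is read; the other columns are untouched."""
    return sum(cols["premium"])


print(total_premium_row_store(rows), total_premium_column_store(columns))
```

Both functions return the same total, but the column version touches only the one attribute it aggregates, which is why column layouts suit large analytical scans. Updating a single record is the reverse case: the row layout changes one entry, while the column layout has to touch every column.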

Can I use DW and Big Data together? 

Absolutely; in fact it is already being practiced by many market leaders. You just have to draw the line by understanding the nature of your data, its inflow and its consumption; don't decide by considering the volume alone. A possible use case is to use Big Data for your monthly or yearly snapshots, or as a staging area for your real-time data analysis, and then transform the results and load them into the warehouse. First know your data well and you'll be able to use the combination creatively.
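The staging idea can be sketched as follows, with loud assumptions: a plain Python list stands in for the raw events landing in the Big Data layer, an in-memory SQLite database stands in for the warehouse, and the `daily_mentions` table, its columns and the sample events are all hypothetical.

```python
import sqlite3
from collections import Counter

# Stand-in for raw, loosely structured events in the staging layer.
raw_events = [
    {"day": "2014-09-01", "channel": "social", "text": "looking for car insurance"},
    {"day": "2014-09-01", "channel": "social", "text": "any good home insurance?"},
    {"day": "2014-09-01", "channel": "mobile", "text": "quote request"},
    {"day": "2014-09-02", "channel": "social", "text": "switching insurer"},
]


def summarise(events):
    """Reduce raw events to one count per (day, channel)."""
    return Counter((event["day"], event["channel"]) for event in events)


def load_into_warehouse(conn, summary):
    """Load the compact summary into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_mentions ("
        "day TEXT, channel TEXT, mentions INTEGER, "
        "PRIMARY KEY (day, channel))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_mentions VALUES (?, ?, ?)",
        [(day, channel, count) for (day, channel), count in sorted(summary.items())],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
load_into_warehouse(conn, summarise(raw_events))
result = conn.execute(
    "SELECT day, channel, mentions FROM daily_mentions ORDER BY day, channel"
).fetchall()
print(result)
```

The heavy, unstructured volume stays in the staging layer, and only the small structured summary reaches the warehouse, where it can be joined with the existing dimensions.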

I know it is tempting to move along with the technology trends. However, there will be "some" people who'll push you to just run the shop as is, or to implement disparate solutions unnecessarily that are troublesome to integrate in future. You'll have to make a choice by foreseeing the nature and the path of growth of your organization.

Thanks for taking the time to read my blog; your comments and opinions are most welcome. In my upcoming posts I will be sharing proofs of concept using Big Data with the available open source tools.

4 comments :

  1. Appreciate your initiative; it is quite informative for techies. Would like to see more on Big Data and In-Memory.
    Let me know once the PoC using Big Data is posted. Keep blogging!

  2. Thanks Sunil! You'll find them here in upcoming weeks.

  3. Wonderful write-up Uday!
    I felt the part where you mention the blend of Big Data with DW is very valid in terms of gaining the required benefits from investing in the repository of tools and technologies.

    Keep it going!
