The Value of Data in Big Data Architectures

A new “V” for Big Data

Big Data, an enabler for many next-generation use cases, is defined in the literature by the three “V”s: Volume, Variety and Velocity. These characteristics describe a new set of requirements for which every company has to develop a strategy.

In many companies, a combination of an In-Memory Database (IMDB) and Hadoop has prevailed to meet these requirements. In principle, every combination of IBM, Oracle or SAP HANA with a Hadoop distribution from Hortonworks, Cloudera, IBM and several others is conceivable. Whether some combinations are more advantageous than others is not discussed in this blog post. What all possible combinations have in common is the large gap between in-memory and Hadoop licensing and hardware prices. A strategy that defines which data is stored where, and for how long, can make a difference of several hundred thousand euros per year. In addition to the often-mentioned three “V”s, companies therefore have to deal with another “V”: the value of data.
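To make that gap concrete, here is a back-of-envelope calculation. All figures are invented for illustration, not vendor quotes; real prices vary widely by vendor and contract:

```python
# Back-of-envelope comparison of the yearly storage cost for the same data
# set held in an in-memory database vs. a Hadoop cluster.
# All per-TB prices are illustrative assumptions, not vendor quotes.

DATA_TB = 10                  # assumed data volume in terabytes
IMDB_COST_PER_TB = 30_000     # assumed EUR/TB/year (license + RAM-heavy hardware)
HADOOP_COST_PER_TB = 1_000    # assumed EUR/TB/year (commodity disks, open source)

imdb_total = DATA_TB * IMDB_COST_PER_TB
hadoop_total = DATA_TB * HADOOP_COST_PER_TB

print(f"IMDB:   {imdb_total:>10,} EUR/year")
print(f"Hadoop: {hadoop_total:>10,} EUR/year")
print(f"Gap:    {imdb_total - hadoop_total:>10,} EUR/year")  # several hundred thousand EUR
```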

The value of data describes the monetary value of a record and has to be defined separately for each use case and each data source. For a manufacturing company, it is primarily important that the production processes work correctly; social media data matters only for supporting processes. For service companies, on the other hand, social media is very important, because a large share of their turnover is generated there through advertising.

The value of the data can be expressed in monetary terms by describing it in a kind of risk matrix. The central question for this matrix is: “What is the impact if a data record is not analyzed immediately?”
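One way to operationalize that question is a small scoring matrix. The following sketch is purely hypothetical: the impact classes, delay classes and euro amounts are made up for the example and would have to be defined per company and per data source:

```python
# Hypothetical risk matrix: monetary impact (EUR) if a single record from a
# given source is NOT analyzed immediately. The classes and amounts below
# are invented for illustration and must be defined per use case.

IMPACT_MATRIX = {
    # (business impact, tolerable analysis delay) -> value of one record in EUR
    ("critical",   "seconds"): 100.0,   # e.g. sensor alarm in a waterworks
    ("critical",   "hours"):    10.0,
    ("supporting", "hours"):     1.0,   # e.g. social media for a manufacturer
    ("supporting", "days"):      0.1,
}

def value_of_data(impact: str, delay: str) -> float:
    """Look up the assumed monetary value of one record."""
    return IMPACT_MATRIX.get((impact, delay), 0.0)

print(value_of_data("critical", "seconds"))   # 100.0
```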

Once you have determined the value of a data set, the question in Big Data architectures is which combination of expensive in-memory database and inexpensive storage like Hadoop should hold the data initially, and where it is kept in the following months until it is archived. This becomes even more complicated in a hybrid scenario of cloud and on-premises servers, because Hadoop and the IMDB can be located in two different data centers. Here I will describe an IMDB/Hadoop scenario within a single data center only.

There are basically three ways to control the data flow in an IMDB/Hadoop scenario (a sketch of the selection logic follows the list):

  1. Duplicate all data by loading it into Hadoop and the IMDB in parallel, and clean up the IMDB with housekeeping jobs from time to time
  2. Load all data into Hadoop and only specific (aggregated) data into the IMDB
  3. Load all data into the IMDB and archive it to Hadoop from time to time
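Which of the three flows fits a concrete data source depends on its value and latency requirements. The dispatch function below is only a sketch of that decision logic; the threshold and the decision criteria are assumptions, not recommendations:

```python
# Sketch of the routing decision: pick one of the three data flows based on
# the (assumed) record value and latency requirements. Thresholds are invented.

def choose_flow(value_eur: float, needs_realtime: bool, value_decays_fast: bool) -> int:
    """Return 1, 2 or 3 for the data-flow scenarios described below."""
    if needs_realtime and value_decays_fast:
        return 3   # scenario 3: IMDB first, archive to Hadoop over time
    if needs_realtime and value_eur >= 10.0:
        return 1   # scenario 1: duplicate into IMDB and Hadoop in parallel
    return 2       # scenario 2: raw data to Hadoop, only aggregates to the IMDB

print(choose_flow(100.0, needs_realtime=True, value_decays_fast=False))   # 1
```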

[Figure: data_flow — the three data-flow options between IMDB and Hadoop]

The first scenario is the most expensive and most complex model, since all data is initially stored redundantly and therefore has to be managed in two independent systems; managing here includes everything around backup, run & maintain, and so on. Only data with the highest risk should be processed this way. In return, it is ensured that real-time data ingestion and pattern recognition on the full history can run at the same time. This is interesting, for example, for waterworks, where sensor data must be interpreted immediately in order to respond promptly to pollution. In addition to the live limits, which can be evaluated directly in the IMDB, the values can be processed in parallel in the Hadoop cluster and compared with historical data, so that patterns indicating growing pollution can be found.
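A minimal sketch of such a dual-write pipeline, using plain Python lists as stand-ins for the real IMDB and Hadoop clients (the store objects and the 30-day retention are assumptions for illustration):

```python
import time

# Stand-ins for real clients (e.g. an IMDB SQL interface and an HDFS writer);
# plain lists keep the sketch self-contained and runnable.
imdb_store: list = []
hadoop_store: list = []

RETENTION_SECONDS = 30 * 24 * 3600   # assumed IMDB retention: 30 days

def ingest(record: dict) -> None:
    """Scenario 1: write every record to both systems in parallel."""
    record = {**record, "ts": time.time()}
    imdb_store.append(record)    # hot copy for real-time evaluation
    hadoop_store.append(record)  # cheap copy for historical pattern mining

def housekeeping(now=None) -> None:
    """Periodically purge aged records from the expensive in-memory tier."""
    cutoff = (now or time.time()) - RETENTION_SECONDS
    imdb_store[:] = [r for r in imdb_store if r["ts"] >= cutoff]

ingest({"sensor": "ph_probe_7", "value": 6.1})
housekeeping()
```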

The second scenario is suitable for data that is used in daily, weekly or monthly reports and linked with other information. The data is stored in Hadoop in its original form and can be accessed at any time for reporting. Since these types of reports are not time-critical, longer processing times can be tolerated. In addition, many IMDBs have connectors to Hadoop, so the data does not need to be physically transferred: the calculation is initiated from the IMDB, runs in Hadoop, and only the result is returned.
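A sketch of this second flow, again with in-memory stand-ins for the two systems; in practice the aggregation would run inside Hadoop (or be pushed down through the IMDB's Hadoop connector), but the shape of the flow is the same:

```python
from collections import defaultdict

hadoop_raw: list = []      # stand-in for raw records kept in Hadoop
imdb_aggregates: dict = {} # stand-in for the aggregate table in the IMDB

def ingest_raw(record: dict) -> None:
    """Scenario 2: every record lands in Hadoop in its original form."""
    hadoop_raw.append(record)

def nightly_aggregation() -> None:
    """Only aggregated figures are loaded into the expensive IMDB tier."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in hadoop_raw:
        sums[r["machine"]] += r["output"]
        counts[r["machine"]] += 1
    for machine in sums:
        imdb_aggregates[machine] = {
            "total": sums[machine],
            "avg": sums[machine] / counts[machine],
        }

ingest_raw({"machine": "press_1", "output": 480})
ingest_raw({"machine": "press_1", "output": 520})
nightly_aggregation()
print(imdb_aggregates["press_1"])   # {'total': 1000.0, 'avg': 500.0}
```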

The third scenario can be used for data that is needed for immediate decisions but whose value wears off quickly. Sensor data from last month, for example, has little significance on its own, but for the product currently coming off the line the same data decides between scrap and a good part. Such data can therefore be transferred from time to time to the lower-cost storage medium Hadoop for long-term archiving.
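The third flow is essentially an archiving job. The sketch below moves records older than an assumed cutoff from the in-memory tier to the cheap tier; the one-week threshold and the list stand-ins are again illustrative assumptions:

```python
import time

imdb_hot: list = []        # stand-in for the in-memory database
hadoop_archive: list = []  # stand-in for long-term storage in Hadoop

ARCHIVE_AFTER_SECONDS = 7 * 24 * 3600   # assumed: archive after one week

def archive_job(now=None) -> None:
    """Scenario 3: move aged records from the IMDB to Hadoop."""
    cutoff = (now or time.time()) - ARCHIVE_AFTER_SECONDS
    aged = [r for r in imdb_hot if r["ts"] < cutoff]
    hadoop_archive.extend(aged)                        # keep for later analysis
    imdb_hot[:] = [r for r in imdb_hot if r["ts"] >= cutoff]

imdb_hot.append({"sensor": "torque_3", "value": 41.2,
                 "ts": time.time() - 10 * 24 * 3600})  # ten-day-old record
archive_job()
print(len(imdb_hot), len(hadoop_archive))   # 0 1
```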

All of these approaches have their place and can substantially affect license and storage costs in a specific scenario. However, they are not a panacea and must be re-evaluated for each data source in each use case.
