Right data: Why Big Data sometimes just isn’t enough

Data journalism CSC Blogs

Big Data is everywhere!

The sources of Big Data are manifold and span several popular topics: sensors in vehicles (Connected Car) or production lines (Internet of Things, IoT), mobile communication, social media and many other areas that steadily produce enormous amounts of data. Collecting data, however, is not an end in itself. The underlying goal is to generate additional value from it through skillful exploration.

In recent years, the ability to store and process Big Data sets has improved significantly thanks to ongoing technical developments, e.g. in the Hadoop ecosystem. While Business Analysts have learned to derive KPIs and additional insights from the collected data, the step towards predictive applications is led by members of a new job profile: Data Scientists. Their field of expertise is to analyze the data sets, determine correlations among the different entities and, finally, use these findings to build and train predictive models. These models, developed in close collaboration with the corresponding business units, make it possible to derive predictions and, subsequently, to automate decisions. Hence, predictive analytics has the potential to change business processes and generate added value for customers.

However, the challenge is not just to collect a huge amount of data. To build meaningful predictive models, it is essential that the internally available data sets contain the right data in adequate quality.

Data need to fulfill certain criteria:

  • Domain specific: In general, the data stored in a Big Data platform originate from different sources and cover a variety of topics. To build a predictive model for a certain business case, only those entities that are related to the problem are of interest. The corresponding tables, members and files need to be easy to identify and to relate to one another across the whole system.
  • Data range: A sufficient amount of historical data has to be available. If the prediction, for example, requires the modeling of seasonal effects, a data set that covers only two months is not sufficient. The limiting factor is the mandatory parameter with the least available history. Furthermore, it is advantageous if the statistics available for different time intervals are comparable.
  • Documentation: The individual entities need to be well documented in order to avoid misconceptions. Where applicable, the documentation should contain information about physical units, default values, limits, and necessary transformations and calibrations. If the definition of an entity changes over time, both the time spans of the different versions and the differences between the definitions have to be documented.
  • Event frequency: The events to be predicted have to occur often enough in the data to train predictive models. While the sales of an article in the retail sector can be measured on a daily basis, the breakdown of a technical component in an assembly line is a rather rare event. The number of observations present in the collected data determines which predictive models are eligible for the task.
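The data range and event frequency criteria can be checked with a few lines of code before any modeling starts. The following is a minimal sketch, not a definitive implementation: the parameter names, dates and the two-year rule of thumb for seasonal effects are all assumptions made for illustration. It shows that the usable history is bounded by the mandatory parameter with the shortest record, and counts how many target events fall into that shared period.

```python
from datetime import date

# Hypothetical first-recorded dates for each mandatory parameter.
first_recorded = {
    "brake_temperature": date(2013, 1, 1),
    "mileage": date(2011, 6, 1),
    "brake_pad_wear": date(2015, 3, 1),
}
# Hypothetical dates on which the event to be predicted was observed.
event_dates = [date(2015, 4, 10), date(2015, 9, 2), date(2016, 1, 20)]


def usable_history_days(first_recorded, as_of):
    """Return the limiting parameter and the history (in days) shared by all."""
    # The parameter that started recording last limits the common history.
    limiting = max(first_recorded, key=first_recorded.get)
    return limiting, (as_of - first_recorded[limiting]).days


def covers_seasons(history_days, min_years=2):
    """Assumed rule of thumb: seasonal effects need at least min_years of data."""
    return history_days >= 365 * min_years


as_of = date(2016, 3, 1)
limiting, history_days = usable_history_days(first_recorded, as_of)
# Only events inside the shared history can be related to all parameters.
usable_events = [d for d in event_dates if first_recorded[limiting] <= d <= as_of]

print(limiting, history_days, covers_seasons(history_days), len(usable_events))
```

In this made-up case, the wear measurement is the limiting parameter, so the longer mileage and temperature histories do not help with modeling seasonal effects.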

The remainder of this post discusses the need to have the right data stored for predictive analytics projects. The example deals with the predictive maintenance of wearing parts of vehicles. However, the conclusions drawn from this example can easily be transferred to other projects that have a predictive focus.

Car Plant Worker With Machinery --- Image by © Monty Rakusen/cultura/Corbis

Modern vehicles are equipped with a variety of sensors that monitor and report the current state of the parts they control. A frequent business case is the prediction of the remaining lifetime of a wearing part. If the wear is higher than expected, the customer can be notified and an earlier service appointment can be suggested. Conversely, the service can be shifted to a later date if the wear is less pronounced than expected.

The added value for the business owner is an increase in the customer’s loyalty due to the personalized information that can be distributed on various communication channels, an increased sense of reliability that comes with a well-functioning vehicle (for the end customer), or an improvement in the planning ability for service appointments of a vehicle fleet (for a fleet manager in a freight forwarding business).

Building a model for this prediction requires both feature parameters (X_hist) and target parameters (Y_hist) to be provided to the supervised learning techniques. These Machine Learning methods then try to determine a relationship

f(X) = y

that describes the interrelation between the stored feature parameters and the observations from the historic data. Once the relationship is trained and thoroughly tested, the model can be used to predict the state based on newly recorded feature parameters.
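The train, test and predict cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data values are invented, a single feature is used, and ordinary least squares stands in for whichever supervised learning technique a real project would choose.

```python
# Hypothetical historic data: one feature (mileage since installation, in
# 1000 km) and the observed wear of the part (in mm) at replacement.
x_hist = [5.0, 10.0, 15.0, 20.0, 25.0]
y_hist = [0.6, 1.1, 1.4, 2.1, 2.4]


def fit_linear(x, y):
    """Fit f(x) = slope * x + intercept by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x_new):
    """Apply the trained relationship to a new feature value."""
    slope, intercept = model
    return slope * x_new + intercept


# Train on the first four records and hold out the last one for testing.
model = fit_linear(x_hist[:4], y_hist[:4])
test_error = abs(predict(model, x_hist[4]) - y_hist[4])

# Predict the wear for a newly recorded feature value.
y_new = predict(model, 30.0)
print(f"held-out error: {test_error:.2f} mm, predicted wear: {y_new:.2f} mm")
```

Note that the fit is only possible because `y_hist` exists: without recorded target values there is nothing to train against.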

Usually, the feature parameters that serve as input for Machine Learning methods can be derived by transforming and combining sensor data that measure properties contributing to the wear. Additional descriptive input parameters may come from a database containing information about the vehicle itself, such as mass or fabrication date. Typically, such measured or descriptive feature data is either already available or comparatively easy to obtain. The bigger challenge is access to target values, i.e. knowledge about the wear in the historical data that can then be related to the observed values of the feature parameters.

Traditionally, wearing parts of vehicles are replaced after a predefined mileage, and no record is kept of the state of the part at the point of disassembly. The workshop simply disposes of the part without sharing information about its final state with the manufacturer. Consequently, information about why, when and in what condition the part was replaced gets lost. Without this knowledge, the link between the wear (the observation) and the corresponding feature parameters cannot be established. Yet this link is essential for modeling the target parameter correctly within the predictive model.
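The effect of missing workshop feedback can be made concrete with a small sketch. All part IDs, field names and values below are hypothetical. The code pairs feature records with the wear observed at disassembly and reports which parts have no target value and therefore cannot be used for supervised training.

```python
# Hypothetical feature records derived from sensor data, keyed by part ID.
features = {
    "part-001": {"mileage_km": 42000, "mean_brake_temp_c": 180.0},
    "part-002": {"mileage_km": 38500, "mean_brake_temp_c": 210.0},
    "part-003": {"mileage_km": 45100, "mean_brake_temp_c": 165.0},
}

# Hypothetical workshop feedback: measured wear at disassembly. Only parts
# for which the workshop reported back appear here.
workshop_feedback = {
    "part-002": {"wear_mm": 9.5, "reason": "scheduled service"},
}


def build_training_set(features, feedback):
    """Pair feature records with observed wear; list parts without a label."""
    labeled, unlabeled = [], []
    for part_id, x in features.items():
        record = feedback.get(part_id)
        if record is None:
            unlabeled.append(part_id)
        else:
            labeled.append((part_id, x, record["wear_mm"]))
    return labeled, unlabeled


labeled, unlabeled = build_training_set(features, workshop_feedback)
coverage = len(labeled) / len(features)
print(f"{len(labeled)} labeled, {len(unlabeled)} unlabeled, coverage {coverage:.0%}")
```

If the workshop shares nothing at all, `workshop_feedback` is empty and the labeled set is empty as well, no matter how much sensor data has been collected.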

Another consequence of the missing feedback process is that the condition of the replacement part is unknown. Knowledge about its properties, such as its current state or type, is essential because it provides important parameters and defines the starting point for predictions. Without it, the quality of the predictions for the newly installed part decreases because the input parameters are of insufficient quality.

The example described above illustrates the challenge of collecting the right data. Too often, promising business cases are doomed because the required target parameters are either not available at all or do not meet the requirements. The chances of extracting additional benefit from the data sets are then reduced to applying unsupervised learning techniques, such as clustering methods, or to optimizing analytical models. Such analytical models are typically unknown, which in many cases is what originally led to the decision to consider Big Data & Analytics. This shows that for predictive analytics use cases, any data acquisition scheme must include detailed concepts for both input and target variables from the very beginning.
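As an illustration of the unsupervised fallback mentioned above, the following sketch groups parts by a single feature without using any target values. It is a deliberately simple one-dimensional k-means with an assumed deterministic initialization; the temperature values are invented. The clusters describe how the parts were used, but they do not say anything about the actual wear.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Simple 1-D k-means (k >= 2) with centers initialized evenly over the range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical mean brake temperatures (in degrees Celsius) per part.
temperatures = [160, 165, 170, 205, 210, 215]
centers, clusters = kmeans_1d(temperatures, k=2)
print(centers, clusters)
```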

CSC offers expertise in both Big Data techniques and Data Science, helping customers analyze current data and identify potential for advanced analytic methods. Teams can determine which information is missing, propose ways to extract or obtain it, and finally build predictive models.

During the deep dive into the data lake, the current status of the data quality is evaluated and discussed with the customer. Scenarios are developed that show how the business case can be implemented and which issues need to be addressed. Based on these evaluations, concepts are developed that demonstrate how the available Big Data can be extended to finally comprise the right data and ensure high data quality standards. This marks the starting point from which the development of the predictive model takes off and the original business case can be translated into action.

This work and the intense discussions about data content may lead to new and previously undiscovered business cases or service ideas. If these ideas are realized and result in additional productive applications, the initial investment can be distributed over a wider range of solutions. Thus, the sooner the Big Data storage includes the right data, the earlier that data can be used to generate added value.

