Cheat sheet: Best data ingestion tools for helping deliver analytic insights


Analytic insights have proven to be a strong driver of business growth, but the technologies and platforms used to develop these insights can be very complex and often require new skill sets. One of the first steps in developing analytic insights is loading relevant data into your analytics platform. Many enterprises stand up an analytics platform but don’t realize what it will take to ingest all that data.

Choosing the correct tools to ingest data can be challenging. DXC has significant experience loading data into today’s analytic platforms, and we can help you make the right choices. As part of our Analytics Platform Services, DXC offers a best-of-breed set of tools that run on top of your analytics platform, integrated to help you get analytic insights as quickly as possible.

To get an idea of what it takes to choose the right data ingestion tools, imagine this scenario: You just had a large Hadoop-based analytics platform turned over to your organization. Eight worker nodes, 64 CPUs, 2,048 GB of RAM, and 40 TB of data storage, all ready to energize your business with new analytic insights. But before you can begin developing your business-changing analytics, you need to load your data into your new platform.

Keep in mind, we are not talking about just a little data here. Typically, the larger and more detailed your data set, the more accurate your analytics. You will need to load master and transaction data such as products, inventory, clients, vendors, transactions, web logs, and an abundance of other data types. This data will often come from many different kinds of sources, such as text files, relational databases, log files, web service APIs, and perhaps even near-real-time event streams.

You have a few choices here. One is to purchase an ETL (Extract, Transform, Load) software package to help simplify loading your data. Many of the ETL packages popular in Hadoop circles will simplify ingesting data from various data sources. Of course, there are usually significant licensing costs associated with purchasing the software, but for many organizations, this is the right choice.

Another option is to use the common data ingestion utilities included with today’s Hadoop distributions to load your company’s data. Understanding the various tools and their use can be confusing, so here is a little cheat sheet of the more common ones:

  • Hadoop file system shell copy command – A standard part of Hadoop, it copies simple data files from a local directory into HDFS (Hadoop Distributed File System). It is sometimes used with a file upload utility to provide users the ability to upload data.
  • Sqoop – Efficiently transfers data from relational databases into Hadoop over a JDBC (Java Database Connectivity) connection.
  • Kafka – A high-throughput, low-latency platform for handling real-time data feeds, designed to avoid data loss through broker replication. It is often used as a durable queueing layer between data producers and the tools that load Hadoop (a short producer sketch follows this list).
  • Flume – A distributed application used to collect, aggregate, and load streaming data such as log files into Hadoop. Flume is sometimes used with Kafka to improve reliability.
  • Storm – A real-time streaming system that can process data as it is ingested, providing real-time analytics, ETL, and other processing. (Storm is not included in all Hadoop distributions.)
  • Spark Streaming – To a certain extent, this is the new kid on the block. Like Storm, Spark Streaming processes real-time streams of data. It supports the Java, Python, and Scala programming languages and can read data from Kafka, Flume, and user-defined data sources (see the Spark Streaming sketch after this list).
  • Custom development – Hadoop also supports custom data ingestion programs, which are often used when connecting to a web service or other programming API to retrieve data (a sketch of this approach also follows the list).
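
To make the Kafka item above concrete, here is a minimal producer sketch in Python using the kafka-python client. The broker address, topic name, and log file path are placeholder assumptions for illustration, not part of any specific DXC tooling.

    # Minimal Kafka producer sketch (assumes the kafka-python client is installed).
    # Broker address, topic name, and input file are illustrative placeholders.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="kafka01:9092")

    # Push each web-log line onto the "weblogs" topic as a raw byte message.
    with open("/var/log/httpd/access_log", "rb") as log_file:
        for line in log_file:
            producer.send("weblogs", value=line.rstrip())

    producer.flush()   # block until all queued messages are delivered
    producer.close()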
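
For the Spark Streaming item, here is a brief sketch that reads a Kafka topic of web-log lines and lands filtered records in HDFS. It assumes Spark 2.x with the older DStream Kafka integration (spark-streaming-kafka-0-8) on the classpath; the broker, topic, and HDFS path are again placeholders.

    # Spark Streaming sketch: consume a Kafka topic and write filtered lines to HDFS.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="web-log-ingest")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Read the "weblogs" topic directly from the Kafka brokers.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["weblogs"],
        kafkaParams={"metadata.broker.list": "kafka01:9092"},
    )

    # Messages arrive as (key, value) pairs; keep only server-error lines and land them in HDFS.
    errors = stream.map(lambda kv: kv[1]).filter(lambda line: " 500 " in line)
    errors.saveAsTextFiles("hdfs:///landing/weblogs/errors")

    ssc.start()
    ssc.awaitTermination()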

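Finally, a hypothetical custom-development sketch: it pulls records from a REST endpoint and then lands the resulting file in HDFS using the standard file system shell copy command from the first item in the list. The endpoint URL, file names, and HDFS path are made up for illustration.

    # Custom ingestion sketch: fetch records from a web service API, then copy to HDFS.
    # The API endpoint, local file, and HDFS path are hypothetical placeholders.
    import json
    import subprocess
    import requests

    resp = requests.get("https://api.example.com/v1/inventory", timeout=30)
    resp.raise_for_status()

    local_file = "/tmp/inventory.json"
    with open(local_file, "w") as f:
        # Assumes the endpoint returns a JSON array; write one document per line.
        for record in resp.json():
            f.write(json.dumps(record) + "\n")

    # Equivalent to running: hdfs dfs -put -f /tmp/inventory.json /landing/inventory/
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", local_file, "/landing/inventory/"],
        check=True,
    )
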
As you can see, there are many choices for loading your data. Very often the right choice is a combination of different tools and, in any case, there is a steep learning curve to ingesting data and getting it into your system.

DXC has streamlined the process by creating a Data Ingestion Framework that includes templates for each of the different ways to pull data. In those templates, we use common tools for tasks such as scheduling data ingestion. We also provide our customers with the necessary user documentation and training, so you can get up to speed and get your data into your system very quickly. This all leads to the next step, generating analytic insights, which is where the real value lies.

In addition, DXC’s Data Ingestion Framework integrates its error handling with our managed services support to reduce our clients’ costs of maintaining reliable data ingestion. We will discuss this framework in more detail in a future blog post.

Learn more about DXC’s analytics offerings.


Jim Coleman, a Solution Architect and Product Manager for the DXC Analytics Platform, is responsible for its strategy, roadmap, and feature definition. For the past 25 years, he has enjoyed working with large-scale enterprise data, focusing on analytics and business intelligence for the past 10 years. Jim has a Master’s degree in Computer Science from West Virginia University.

RELATED LINKS

Get a jumpstart on analytics

Why advanced analytics needs the cloud

From lakes to watersheds: A better approach to data management

Comments

  1. Mark Lenke says:

    Appreciate the introduction to this complex scenario.
