by Sai Yammada –
It’s a flood. A tidal wave. A torrent. Massive and unprecedented volumes of data are pouring into every business entity on the planet.
It’s a good news – bad news scenario.
On the one hand, that flood of data offers incredible value; it’s like a heap of freshly mined gold ore. Once properly processed, that data will yield invaluable nuggets of actionable, insightful information.
But on the other hand, many companies find themselves essentially drowning in that raging, unending flood. Over time, they have built up a huge reservoir of data.
What to do with that massive reservoir of data? The key to surviving and thriving amidst the Big Data deluge is to implement a data ingestion process that works for your specific business environment.
We often refer to Big Data as a homogenous whole. But in reality, that mass of data consists of many distinct data types from a number of disparate sources – a fact that greatly intensifies the challenge of data ingestion.
At most companies, data must be ingested from both internal sources (data generated inside the business) and external sources (data flowing in from outside it).
Incoming data streams also arrive in a variety of file formats. Files might be compressed using a variety of compression formats (such as ZIP, gzip, or a gzipped tarball), or they might be uncompressed, plain text files.
Incoming files might also represent a smorgasbord of different file types, such as comma-delimited, tab- or pipe-delimited, XML, JSON, etc.
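To give a rough sense of what handling that variety looks like at the edge of an ingestion pipeline, here is a small Python sketch using only the standard library. The file name, formats, and UTF-8 encoding below are hypothetical assumptions for illustration, not part of any particular product or pipeline:

```python
import csv
import gzip
import io
import json
import tarfile
import zipfile
from pathlib import Path


def open_text_streams(path: Path):
    """Yield readable text streams from a file that may be plain, .gz, .zip, or .tar.gz."""
    name = path.name.lower()
    if name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(path, "r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile():
                    yield io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
    elif name.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            yield fh
    elif name.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            for member in zf.namelist():
                with zf.open(member) as raw:
                    yield io.TextIOWrapper(raw, encoding="utf-8")
    else:
        with open(path, encoding="utf-8") as fh:
            yield fh


def parse_records(stream, fmt):
    """Parse one text stream into dict records; fmt is 'csv', 'tsv', 'pipe', or 'json'."""
    if fmt == "json":  # assumes one JSON object per line (JSON Lines)
        return [json.loads(line) for line in stream if line.strip()]
    delimiter = {"csv": ",", "tsv": "\t", "pipe": "|"}[fmt]
    return list(csv.DictReader(stream, delimiter=delimiter))


# Hypothetical usage: a gzipped, pipe-delimited export from one business unit.
for stream in open_text_streams(Path("sales_extract.psv.gz")):
    print(f"ingested {len(parse_records(stream, 'pipe'))} records")
```

A production pipeline would also have to deal with XML, character-set quirks, malformed records, and schema drift, but the point stands: format normalization has to happen somewhere before data from many sources can land in one common platform.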
Over the years, at most organizations, that incoming flood of data has settled into a collection of many small but separate ponds.
That’s not surprising, since most companies are made up of a number of distinct business units. So there’s been a natural tendency to distribute data across all business units, each having its own little pond of data.
One company I recently worked with, for example, had 50 business units, each with its own data system. That’s 50 ponds of data within a single company, each pond handled and managed differently.
That’s not a good thing!
Fifty little ponds of data within a single company, or even just a few, is highly problematic. This sort of environment fosters the very problems that many companies are now trying to escape: every pond handled and managed differently, and no single, consistent view of the organization’s data.
One solution is to bring all the little ponds of data into a single Hadoop data lake – an enterprise data platform.
That’s a solution that provides the opportunity to store essentially unlimited quantities of data in a scalable fashion. It’s a solution that also accommodates the wide range of data types that are flooding into every company in the age of Big Data. It’s even a very cost-effective solution.
And it’s a solution that many companies are turning to.
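To make the idea a little more concrete, here is a minimal sketch of landing files from one business unit’s pond into a raw zone of an HDFS-based data lake. The directory names are hypothetical, and it assumes a Hadoop client (the hdfs command-line tool) is already installed and configured against the cluster:

```python
import subprocess
from pathlib import Path

# Hypothetical local staging area for one business unit's extracts,
# and a hypothetical target path in the data lake's raw zone.
LOCAL_STAGING = Path("/data/staging/business_unit_07")
LAKE_TARGET = "/datalake/raw/business_unit_07"

# Create the target directory in HDFS if it doesn't already exist.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", LAKE_TARGET], check=True)

# Copy each staged file into the lake; -f overwrites a prior copy of the same file.
for path in sorted(LOCAL_STAGING.glob("*")):
    subprocess.run(["hdfs", "dfs", "-put", "-f", str(path), LAKE_TARGET], check=True)
    print(f"landed {path.name} in {LAKE_TARGET}")
```

In a real migration, each of those 50 ponds would get its own landing zone along these lines, so the lake becomes the single repository while the lineage back to each source system stays visible.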
But how do you merge all those little ponds of data, whether a few or hundreds, into a single data lake that serves as a single repository for the entire organization?
How do you get from where you are (lots of little ponds of data) to where you want to go (a single enterprise data platform, a data lake)?
Making that transition can be very challenging, even a bit scary, because the truth is that there’s really no one best approach for every company. There’s no silver-bullet solution to this problem.
There are many tools available to aid you in making that transition. But selecting the right tool requires careful consideration, and implementing the process requires precise planning.