by Sai Yammada –
It’s a flood. A tidal wave. A torrent. Massive and unprecedented volumes of data are pouring into every business entity on the planet.
It’s a good news – bad news scenario.
On the one hand, that flood of data offers incredible value; it’s like a heap of freshly mined gold ore. Once properly processed, that data will yield invaluable nuggets of actionable, insightful information.
But on the other hand, many companies find themselves essentially drowning in that raging, unending flood. Over time, they have built up a huge reservoir of data.
What to do with that massive reservoir of data? The key to surviving and thriving amidst the Big Data deluge is to implement a data ingestion process that works for your specific business environment.
We often refer to Big Data as a homogenous whole. But in reality, that mass of data consists of many distinct data types from a number of disparate sources – a fact that greatly intensifies the challenge of data ingestion.
At most companies, data must be ingested from both internal sources (data generated inside the business) and external sources (data flowing in from outside it).
Incoming data streams also arrive in a variety of file formats. Files might be compressed using a variety of compression formats (such as ZIP, gzip, or a gzipped tarball), or they might be uncompressed, plain text files.
Incoming files might also represent a smorgasbord of different file types, such as comma-delimited, tab- or pipe-delimited, XML, JSON, etc.
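To give a rough sense of what handling that variety looks like at the edge of an ingestion pipeline, here is a small Python sketch using only the standard library. The file name, formats, and UTF-8 encoding below are hypothetical assumptions for illustration, not part of any particular product or pipeline:

```python
import csv
import gzip
import io
import json
import tarfile
import zipfile
from pathlib import Path


def open_text_streams(path: Path):
    """Yield readable text streams from a file that may be plain, .gz, .zip, or .tar.gz."""
    name = path.name.lower()
    if name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(path, "r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile():
                    yield io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
    elif name.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            yield fh
    elif name.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            for member in zf.namelist():
                with zf.open(member) as raw:
                    yield io.TextIOWrapper(raw, encoding="utf-8")
    else:
        with open(path, encoding="utf-8") as fh:
            yield fh


def parse_records(stream, fmt):
    """Parse one text stream into dict records; fmt is 'csv', 'tsv', 'pipe', or 'json'."""
    if fmt == "json":  # assumes one JSON object per line (JSON Lines)
        return [json.loads(line) for line in stream if line.strip()]
    delimiter = {"csv": ",", "tsv": "\t", "pipe": "|"}[fmt]
    return list(csv.DictReader(stream, delimiter=delimiter))


# Hypothetical usage: a gzipped, pipe-delimited export from one business unit.
for stream in open_text_streams(Path("sales_extract.psv.gz")):
    print(f"ingested {len(parse_records(stream, 'pipe'))} records")
```

A production pipeline would also have to deal with XML, character-set quirks, malformed records, and schema drift, but the point stands: format normalization has to happen somewhere before data from many sources can land in one common platform.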
Over the years, at most organizations, that incoming flood of data has settled into a collection of many small but separate ponds.
That’s not surprising, since most companies are made up of a number of distinct business units. So there’s been a natural tendency to distribute data across all business units, each having its own little pond of data.
One company I recently worked with, for example, had 50 business units, each with its own data system. That’s 50 ponds of data within a single company, each pond handled and managed differently.
That’s not a good thing!
Fifty little ponds of data within a single company, or even just a few, is highly problematic. This sort of environment fosters the very problems that many companies are now trying to escape: every pond handled and managed differently, and no single, consistent view of the organization’s data.
One solution is to bring all the little ponds of data into a single Hadoop data lake – an enterprise data platform.
That’s a solution that provides the opportunity to store essentially unlimited quantities of data in a scalable fashion. It’s a solution that also accommodates the wide range of data types that are flooding into every company in the age of Big Data. It’s even a very cost-effective solution.
And it’s a solution that many companies are turning to.
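To make the idea a little more concrete, here is a minimal sketch of landing files from one business unit’s pond into a raw zone of an HDFS-based data lake. The directory names are hypothetical, and it assumes a Hadoop client (the hdfs command-line tool) is already installed and configured against the cluster:

```python
import subprocess
from pathlib import Path

# Hypothetical local staging area for one business unit's extracts,
# and a hypothetical target path in the data lake's raw zone.
LOCAL_STAGING = Path("/data/staging/business_unit_07")
LAKE_TARGET = "/datalake/raw/business_unit_07"

# Create the target directory in HDFS if it doesn't already exist.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", LAKE_TARGET], check=True)

# Copy each staged file into the lake; -f overwrites a prior copy of the same file.
for path in sorted(LOCAL_STAGING.glob("*")):
    subprocess.run(["hdfs", "dfs", "-put", "-f", str(path), LAKE_TARGET], check=True)
    print(f"landed {path.name} in {LAKE_TARGET}")
```

In a real migration, each of those 50 ponds would get its own landing zone along these lines, so the lake becomes the single repository while the lineage back to each source system stays visible.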
But how do you merge all those little ponds of data, whether a few or hundreds, into a single data lake that serves as a single repository for the entire organization?
How do you get from where you are (lots of little ponds of data) to where you want to go (a single enterprise data platform, a data lake)?
Making that transition can be very challenging, even a bit scary, because the truth is that there’s really no one best approach for every company. There’s no silver-bullet solution to this problem.
There are many tools available to aid you in making that transition. But selecting the right tool requires careful consideration, and implementing the process requires precise planning.