The Best Data Ingestion Tools for Migrating to a Hadoop Data Lake
by Sai Yammada
Is your company engulfed in a flood of data?
That’s the situation at many companies in this age of Big Data. Data has been flooding in at historically unprecedented rates in recent years. As a result, most companies have accumulated huge reservoirs of data – reservoirs that are growing ever larger as that incoming torrent of data continues unabated.
All of that data represents great opportunity. But it also presents something of a problem: What to do with it?
Migrating to a Hadoop Data Lake
One of the most popular solutions for managing that flood of data involves ingesting the data into a Hadoop data lake. Consolidating all enterprise data into a single Hadoop data lake solves a number of problems, and offers some very attractive benefits:
- It’s a very cost-effective solution
- It enables the scalable storage of virtually unlimited quantities of data
- It accommodates the very wide range of data types – structured and unstructured – that most companies are attempting to manage
But once the decision has been made to migrate to a Hadoop data lake, how is that task best accomplished?
RCG|enable™ Data is our Data Ingestion Framework: a fully integrated, highly scalable, distributed and secure solution for managing, preparing and delivering data from a vast array of sources, including social media, mobile devices, smart devices and enterprise systems.
The framework is vendor agnostic and supports data sources (structured, semi-structured and unstructured) and targets in traditional enterprise systems, external systems and multiple Hadoop distributions including Hortonworks, MapR and Cloudera. The framework is also a Cloudera-validated solution.
The RCG|enable™ Data service and framework eliminate the need for IT professionals to become experts in Hadoop ecosystem technologies and languages, and speed time to delivery at reduced cost by simplifying and standardizing data management and data workflows.
The following are currently some of the other most popular tools for the job:
- Apache NiFi (a.k.a. Hortonworks DataFlow)
Both Apache NiFi and StreamSets Data Collector (detailed below) are Apache-licensed open-source tools. Hortonworks offers a commercially supported variant, Hortonworks DataFlow (HDF).
NiFi processors are file-oriented and schema-less. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. So if one processor understands format A and another only understands format B, you may need to perform a data format conversion between those two processors.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community). It can be run standalone, or as a cluster using its own built-in clustering system.
- StreamSets Data Collector (SDC)
StreamSets takes a record-based approach. As data enters your pipeline (whether it’s JSON, CSV, etc.), it is parsed into a common format. This means that the responsibility of understanding the data format is no longer placed on each individual processor, and so any processor can be connected to any other processor. SDC also offers great flexibility: it runs standalone or in clustered mode atop Spark on YARN/Mesos, leveraging existing cluster resources you may have.
StreamSets was released to the open source community in 2015. It is vendor agnostic, and Hortonworks, Cloudera, and MapR are all supported.
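To make the record-based idea concrete, here is a minimal Python sketch. It is not StreamSets or NiFi code; the function names (`parse_json_lines`, `parse_csv`, `mask_field`) and the sample data are invented for illustration. The point it tries to show is that once every input has been parsed into a common record shape (plain dictionaries here), a downstream stage no longer needs to know which format the data arrived in.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]

def parse_json_lines(blob: str) -> List[Record]:
    """Origin stage (hypothetical): newline-delimited JSON -> generic records."""
    return [json.loads(line) for line in blob.splitlines() if line.strip()]

def parse_csv(blob: str) -> List[Record]:
    """Origin stage (hypothetical): CSV text with a header row -> generic records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(blob))]

def mask_field(records: List[Record], field: str) -> List[Record]:
    """Downstream stage: operates on records only, never on the raw format."""
    return [{**r, field: "***"} if field in r else r for r in records]

json_blob = (
    '{"user": "ann", "email": "ann@example.com"}\n'
    '{"user": "bob", "email": "bob@example.com"}'
)
csv_blob = "user,email\ncat,cat@example.com\n"

# The same downstream stage accepts the output of either parser.
masked = mask_field(parse_json_lines(json_blob) + parse_csv(csv_blob), "email")
for record in masked:
    print(record)
```

In a file-oriented design, by contrast, `mask_field` would receive the raw JSON or CSV blob and would have to parse it itself, which is why a conversion step is sometimes needed between two processors.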
- Gobblin
Gobblin is an open-source ingestion framework/toolset developed by LinkedIn. It is a flexible framework that ingests data into Hadoop from different sources such as databases, REST APIs, FTP/SFTP servers, filers, etc. It is an extensible framework that handles ETL and job scheduling equally well. Gobblin can run in standalone mode or in distributed mode on the cluster.
- Apache Sqoop
Sqoop is a common ingestion tool used to import data into Hadoop from any RDBMS. Sqoop provides an extensible Java-based framework that can be used to develop new Sqoop drivers for importing data into Hadoop. Sqoop runs on the MapReduce framework on Hadoop, and can also be used to export data from Hadoop to relational databases.
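As a rough illustration of what a Sqoop import looks like in practice, the Python sketch below assembles the argument list for a `sqoop import` run. The helper name `build_sqoop_import`, the JDBC URL, table, user and paths are all placeholders, not values from any real environment; the flags themselves (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import arguments. The sketch only builds and prints the command, so it runs without Sqoop installed.

```python
import shlex
from typing import List

def build_sqoop_import(jdbc_url: str, table: str, target_dir: str,
                       username: str, password_file: str,
                       num_mappers: int = 4) -> List[str]:
    """Assemble the argument list for a `sqoop import` run (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # avoids a password on the command line
        "--table", table,
        "--target-dir", target_dir,         # HDFS directory that receives the data
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]

# Placeholder values, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com:3306/sales",
    table="orders",
    target_dir="/data/raw/sales/orders",
    username="ingest_user",
    password_file="/user/ingest_user/.db_password",
)
print(shlex.join(cmd))
# On a host with Sqoop installed, this could be executed with:
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string keeps each parameter separate, which makes it easier to pass in per-table settings from a wrapper script.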
- Apache Flume
A Java-based ingestion tool, Flume is used when input data streams in faster than it can be consumed. Typically Flume is used to ingest streaming data into HDFS or Kafka topics, where it can act as a Kafka producer. Multiple Flume agents can also be used to collect data from multiple sources into a Flume collector.
- Apache Kafka
Kafka is a highly scalable messaging system that efficiently stores messages on disk, in the partitions of a Kafka topic. Producers publish messages to Kafka topics, and Kafka consumers consume them as they please.
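The publish/consume model described for Kafka can be sketched with a toy, in-memory model in Python. This is deliberately not the Kafka client API: `ToyTopic` and `ToyConsumer` are invented classes, and the partitioning rule is a simplification. What it tries to show is that messages are appended to partitioned logs, and that each consumer keeps its own read position (offset), so different consumers can read at their own pace.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class ToyTopic:
    """In-memory stand-in for a topic: append-only logs split into partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, message: str) -> Tuple[int, int]:
        """Append a message; the key selects the partition. Returns (partition, offset)."""
        # A byte sum is used instead of hash() so the choice is repeatable across runs.
        partition = sum(key.encode()) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

class ToyConsumer:
    """Tracks its own offset per partition, so it can read whenever it likes."""

    def __init__(self, topic: ToyTopic) -> None:
        self.topic = topic
        self.offsets: Dict[int, int] = defaultdict(int)

    def poll(self, max_messages: int) -> List[str]:
        """Return up to max_messages unread messages and advance the offsets."""
        out: List[str] = []
        for p, log in enumerate(self.topic.partitions):
            while self.offsets[p] < len(log) and len(out) < max_messages:
                out.append(log[self.offsets[p]])
                self.offsets[p] += 1
        return out

topic = ToyTopic(num_partitions=2)
for i in range(5):
    topic.publish(key=f"sensor-{i % 2}", message=f"reading-{i}")

fast = ToyConsumer(topic)
slow = ToyConsumer(topic)
print(fast.poll(max_messages=10))  # reads everything currently available
print(slow.poll(max_messages=2))   # reads only two messages for now
print(slow.poll(max_messages=10))  # continues from where it left off
```

Note that reading does not remove anything from the log; the messages stay in place and only the consumer's offset moves, which is what lets several consumers share one topic.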
Home-Grown Ingestion Patterns
Most organizations making the move to a Hadoop data lake put together custom scripts — either themselves or with the help of outside consultants — that are adapted to their specific environments. Frequently, custom data ingestion scripts are built upon a tool that’s available either open-source or commercially.
Common home-grown ingestion patterns include the following:
- FTP Pattern – When an enterprise has multiple FTP sources, an FTP pattern script can be highly efficient. Engineers pass input parameters to a script that imports data into an FTP staging area, aggregates files as needed (based on file size), compresses them if needed, and inserts them into the Hadoop Distributed File System (HDFS).
- NFS Share Pattern – NFS shares are very common in enterprises that are composed of many business units. This pattern is similar to the FTP pattern, except that the data source is an NFS share.
- Sqoop / Hive Pattern – Sqoop is one of the most common tools used to ingest data from relational database management systems. But when combined with a Hive table, it can also be very useful for bringing raw data into Hadoop and transforming it into different layers, using compression (Gzip/Snappy) and different file formats.
- Flume / Kafka Pattern – Flume can bring data into Kafka topics (using special Flafka sources/sinks) from different sources that do not provide prepackaged data. Kafka topics often feed Spark Streaming jobs, along with other streaming applications. Kafka and Flume are also used when creating a Lambda architecture in Hadoop.
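As one example of what such a home-grown script might look like, here is a minimal Python sketch of the FTP pattern's later steps: grouping staged files into batches by size, compressing each batch, and preparing the HDFS load. All function names, directory names and the size threshold are assumptions made for illustration. The sketch assumes the files have already been downloaded into a local staging directory and that they are line-oriented text, so they can be concatenated safely. It only builds the `hdfs dfs -put` commands rather than running them, so it works without a Hadoop client.

```python
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import List

def batch_by_size(files: List[Path], max_bytes: int) -> List[List[Path]]:
    """Group files so each batch stays at or under max_bytes where possible."""
    batches: List[List[Path]] = []
    current: List[Path] = []
    current_size = 0
    for f in sorted(files):
        size = f.stat().st_size
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += size
    if current:
        batches.append(current)
    return batches

def compress_batch(batch: List[Path], out_file: Path) -> Path:
    """Concatenate a batch into a single gzip file."""
    with gzip.open(out_file, "wb") as dst:
        for f in batch:
            with f.open("rb") as src:
                shutil.copyfileobj(src, dst)
    return out_file

def stage_and_compress(stage_dir: Path, out_dir: Path, max_bytes: int,
                       hdfs_dir: str) -> List[List[str]]:
    """Aggregate and compress staged files; return the HDFS load commands."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = [p for p in stage_dir.iterdir() if p.is_file()]
    commands: List[List[str]] = []
    for i, batch in enumerate(batch_by_size(files, max_bytes)):
        gz = compress_batch(batch, out_dir / f"batch-{i:05d}.gz")
        commands.append(["hdfs", "dfs", "-put", str(gz), hdfs_dir])
    return commands

# Small demonstration with throwaway files standing in for FTP downloads.
with tempfile.TemporaryDirectory() as tmp:
    stage = Path(tmp) / "ftp_stage"
    stage.mkdir()
    for n in range(3):
        (stage / f"part-{n}.csv").write_text("id,value\n1,a\n" * 10)
    for cmd in stage_and_compress(stage, Path(tmp) / "out",
                                  max_bytes=300, hdfs_dir="/data/raw/ftp"):
        print(" ".join(cmd))
```

Aggregating before loading matters because HDFS handles a few large files far better than many small ones. An NFS share variant would differ only in where the staged files come from.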
Ask the Right Questions…
Before making the move to a Hadoop data lake, it’s important to know about the tools that are available to help with the process. But in selecting the best tool for the data ingestion process, it’s also important to first answer a few key questions about your environments and your needs:
- What kind of data will you be dealing with (internal/external, structured/unstructured, operational, etc.)?
- Who is going to be the key stakeholder of the data?
- What is your existing data management architecture?
- Who is going to be the steward of the data?
All of the above are questions that should be answered before beginning the data ingestion process.
But the most important question to ask is this: Do we have the in-house skill set to successfully carry out this migration? Providing an honest answer to that question, and acting accordingly, will lay the foundation for a successful data ingestion process.