Retailers, Hadoop and Modern-Day Alchemy: Turning XML into Gold

by Steve Thompson

It’s just sitting there, isn’t it?

All of that vast quantity of XML data your retail business has accumulated—it’s just sitting there. You haven’t been able to do much with it. You don’t know what to do with it.  And it’s all just sitting there.

How can businesses get to, store, understand, and derive value from XML data? This is a problem that retail businesses, and businesses in many other industry verticals, face on a daily basis.

That’s about to change.

Alchemy 101?

A few centuries ago, what I’m about to tell you might have been called alchemy—turning something worthless into something of great value, like turning lead into gold. In a sense, that’s what we can now do with XML data.  We can take something that currently offers you very little value, and turn it into something that can provide massive value.

It’s not quite alchemy, though—at least not the medieval version of alchemy.

The difference is that we’re starting with something that already has great value: your XML data. You just haven’t been able to unlock that data. You haven’t been able to make it usable and actionable.

And that’s what we’re going to change.

To do that, we’re going to use the RCG|enable™ Data Ingestion Framework in combination with the RCG GDSN (Global Data Synchronization Network) XML Parser.

What is GDSN? It’s a network through which companies in various industries, such as retail, exchange price data, inventory data, confirmation data, etc. as XML messages. That data contains valuable insights and metrics. But how do you get to that value?

First, a bit of background…

A Relational Problem

Even as we ease into the age of Big Data, most retail businesses continue to rely on relational databases. RDBMS’s have served the world well for several decades, and continue to do so.

But the world has changed mightily since the advent of the RDBMS. Specifically, the quantities and types of data that we’re now able to accumulate have evolved significantly. Add to that the world of Big Data, where unstructured data, semi-structured data, and structured (relational) data are all mixed together. How do you handle that mix of data?

And much of that data, particularly semi-structured data as well as object-oriented data, just doesn’t fit into your relational database. The data, as is, simply doesn’t lend itself to categorization by tables, rows, columns, and indexes.

That’s why you have that huge mass of XML data just sitting there, with no real way to use it. Until now.

This is the high-level description of what we’re going to do with your GDSN XML data:

  • Ingest the XML data into a cheap Hadoop Big Data cluster using HDFS (by way of the RCG|enable™ Data Ingestion Framework)
  • Identify data items of interest in the XML files sitting in HDFS in the Hadoop cluster
  • Convert the XML into a Java-based POJO (Plain Old Java Object) using the JAXB APIs
  • Then, create a customized Object-to-Relational Mapping and POJO parsing to load those items into cheap relational Hive tables on the Hadoop cluster using dynamically generated SQL statements…

Sounds simple, doesn’t it? And really, it is simple, if you have the right tools for the job. Those tools are:

  • RCG|enable™ Data Ingestion Framework
  • The RCG GDSN XML Parser

Details for Techies

We begin by ingesting the XML data files themselves, from any location, through the RCG|enable™ Data Ingestion Framework, which brings them into the HDFS storage of the cheap Hadoop cluster. Any Hadoop distribution will work (MapR, Cloudera, Hortonworks, or plain vanilla Apache Hadoop).

Once the XML data file is stored in HDFS, the XML is converted into a Java-based POJO by way of the JAXB (Java Architecture for XML Binding) “unmarshalling” process. JAXB provides two main features: the ability to marshal Java objects into XML and the inverse, i.e. to unmarshal XML back into Java objects. Once the POJO has been created (it’s a very complex object that mirrors the GDSN XML schema), it can be parsed at will, and any data of interest can be pulled out. Any data can be reached; it’s a matter of digging into the structure and nested sub-structures of the object itself. The parser pulls apart the data, and then dynamically builds an SQL INSERT statement that loads that data into the Hive tables of interest.
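For the techies who want to see the shape of that last step, here is a minimal sketch in Java of mapping a POJO onto a dynamically generated INSERT statement. It is not the RCG GDSN XML Parser’s actual code. The TradeItem class is a hypothetical, drastically simplified stand-in for the object that JAXB unmarshalling would produce, and the trade_item table, its column names, and the HiveInsertBuilder class are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

// Hypothetical stand-in for a (much larger) JAXB-generated GDSN class.
class TradeItem {
    final String gtin;
    final String description;
    final double price;

    TradeItem(String gtin, String description, double price) {
        this.gtin = gtin;
        this.description = description;
        this.price = price;
    }
}

class HiveInsertBuilder {
    // Object-to-relational mapping: column name -> how to read that value from the POJO.
    // A LinkedHashMap keeps the column order stable between the SQL text and the values.
    static Map<String, Function<TradeItem, Object>> tradeItemMapping() {
        Map<String, Function<TradeItem, Object>> mapping = new LinkedHashMap<>();
        mapping.put("gtin", item -> item.gtin);
        mapping.put("description", item -> item.description);
        mapping.put("price", item -> item.price);
        return mapping;
    }

    // Builds the INSERT text with one "?" placeholder per mapped column.
    static <T> String buildInsertSql(String table, Map<String, Function<T, Object>> mapping) {
        StringJoiner columns = new StringJoiner(", ");
        StringJoiner placeholders = new StringJoiner(", ");
        for (String column : mapping.keySet()) {
            columns.add(column);
            placeholders.add("?");
        }
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    // Pulls the values out of the POJO in the same order as the columns.
    static <T> List<Object> extractValues(T pojo, Map<String, Function<T, Object>> mapping) {
        List<Object> values = new ArrayList<>();
        for (Function<T, Object> getter : mapping.values()) {
            values.add(getter.apply(pojo));
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, Function<TradeItem, Object>> mapping = tradeItemMapping();
        TradeItem item = new TradeItem("00012345678905", "Sample item", 4.99);
        System.out.println(buildInsertSql("trade_item", mapping));
        System.out.println(extractValues(item, mapping));
    }
}
```

The sketch emits ? placeholders instead of splicing values into the SQL text, on the assumption that the statement would be run through a JDBC PreparedStatement. Whether your Hive version accepts a column list in an INSERT ... VALUES statement is worth checking before relying on this form.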

Why do we use the RCG|enable™ Data Ingestion Framework? Because it’s written entirely in Java, it’s open source, and it’s highly extensible!

“RCG’s Data Ingestion Framework is a fully integrated, highly scalable, distributed and secure solution for managing, preparing and delivering data from a vast array of sources…” (To read more, see my previous blog, “Everybody Can Enjoy the Benefits of Cloud-Based Data Ingestion.”)

The Data Ingestion Framework makes it easy to load a Hadoop system with data—any data, in addition to our GDSN XML data. And you don’t have to be a Hadoop expert to use it. This framework supports all data types, including structured, semi-structured, and unstructured—including, of course, XML data. All with an easy-to-configure, drag-and-drop User Interface.

Once the above steps have been completed, the data will be stored in relational Hive tables that can be queried with SQL. The data will then be accessible to the data processing and business intelligence tools that are designed for a traditional RDBMS. Now the data can be queried using SQL and put to work by analytics and BI tools. Important business data patterns and information can be derived from it!

Imagine the power of your GDSN XML data (or any XML data, for that matter) once you can pick out the precious 10-30 pieces of data that matter to you, while the other hundreds or thousands of elements that aren’t currently needed stay stored for possible future use.
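As a rough illustration of picking a handful of items out of a much larger document, here is a small Java sketch that uses the JDK’s built-in XPath support rather than JAXB. That is a swapped-in technique chosen to keep the example dependency-free, not how the RCG GDSN XML Parser works. The element names in the sample XML are invented and are not the real GDSN schema, which also uses XML namespaces that this sketch ignores.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ItemsOfInterest {
    // xpathByColumn maps an output column name to the XPath of the value we care about.
    // Everything not listed is simply ignored and stays in the original XML.
    static Map<String, String> extract(String xml, Map<String, String> xpathByColumn) {
        try {
            // Parse the XML once, then evaluate every expression against the same document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> row = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : xpathByColumn.entrySet()) {
                // evaluate returns the node's text; a path that matches nothing yields "".
                row.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc));
            }
            return row;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract items from XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample document; real GDSN messages are far larger.
        String xml = "<tradeItem>"
                + "<gtin>00012345678905</gtin>"
                + "<description>Sample item</description>"
                + "<price currency=\"USD\">4.99</price>"
                + "<packaging><type>BOX</type></packaging>"
                + "</tradeItem>";
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("gtin", "/tradeItem/gtin");
        paths.put("price", "/tradeItem/price");
        paths.put("currency", "/tradeItem/price/@currency");
        System.out.println(extract(xml, paths));
    }
}
```

Keeping the list of paths in a map means the set of items of interest can grow later without touching the extraction code, while the untouched XML remains available for anything you decide you need in the future.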

At the very least, you’ll have a way to store all of that data very inexpensively, both in HDFS as the original XML files and in the relational, SQL-based Hive tables.

Not Just for Retailers…

It’s likely that most retail businesses have a flood of XML data that they aren’t using to any significant degree. The RCG GDSN XML Parser combined with RCG|enable™ Data Ingestion Framework provides a way to put that data to work, easily and inexpensively.

But countless enterprises in most industry verticals are experiencing the same problem: lots of XML data, and no real way to use it. This is a solution that will work for any type of business.

Any business, in fact, could customize the GDSN parser specifically for their needs, and use the RCG|enable™ Data Ingestion Framework to load the data into a Hadoop cluster. The data is then stored cheaply and is available to use for a variety of purposes, including analytics and business intelligence.

It’s not quite the same as turning lead into gold. But to derive value from a mass of data that’s currently just sitting there, not doing you a bit of good…perhaps that’s a modern-day version of alchemy.

