
Making Data Lakes Usable: Why we need a semantic layer – and why it should be open source

November 4, 2016

by Rick Skriletz

Big Data platforms such as Cloudera’s Data Hub, Hortonworks’ Data Platform, and MapR’s Converged Data Platform, along with offerings from IBM, Oracle, and Pivotal, promise to make it easy to bring diverse sets of application data together into one data cluster running on Hadoop. This collection of data sets is called a data lake.

It is a wonderful thing to be able to bring data together so easily with Hadoop’s schema-on-read approach rather than the schema-on-write approach required by traditional database systems. But we need to remember that the data arrives with all of its existing problems. Bringing data into the data lake does not, unfortunately, wash away the non-standard, non-integrated, redundant, and inconsistent data buried in application data.
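To make the distinction concrete, here is a minimal sketch of schema on read in plain Python rather than Hadoop itself. The record layout, the field names (cust_id, customer_id, revenue), and the read_with_schema helper are all hypothetical, chosen only for illustration. Nothing is validated when the records are stored; the schema is applied only when they are read.

```python
import json

# Raw records land in the "lake" exactly as the applications wrote them.
# Note the two applications disagree on the name of the customer key.
raw_lines = [
    '{"cust_id": "42", "name": "Acme Corp", "revenue": "1000.50"}',
    '{"customer_id": 42, "name": "ACME Corporation"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select each field and cast it, or use None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Hypothetical schema chosen by the reader, not enforced on write.
schema = {"cust_id": int, "name": str, "revenue": float}
rows = read_with_schema(raw_lines, schema)
```

Both records were accepted without complaint, but the second one comes back with no cust_id and no revenue because its source used a different field name. Loading was easy; the inconsistency is still there.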

For data lakes, with great ease of data access comes a great need for data management. But this is not what I want to talk about today (perhaps I’ll do so in a later blog post). I want to talk about how we can make it easy for users to access data in a data lake that contains non-integrated and redundant data.

Access Needed Data

Today, users get the data they need from a data warehouse, data mart, data extract, reporting database, or application, each of which is physically separate from the others. Consequently, users know the physical data they use very well.

In a data lake, however, this data coexists in a non-SQL data store, unless the data in the lake mimics its data sources. In either case, users should access the best data available for the intended use. That means a user shouldn’t have to choose which version of duplicated data values to use; they should simply get the data they need for the task at hand.

The Semantic Layer

This is what a semantic layer does. As Wikipedia states: “A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.”
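As a rough illustration of that mapping, the sketch below resolves a business term to a single physical source so the user never has to pick among duplicates. It is a toy model under stated assumptions, not an existing Hadoop or Apache API: the dataset names, column names, priority values, and the resolve function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    dataset: str   # hypothetical data set name in the lake
    column: str    # hypothetical physical column name
    priority: int  # lower number = more trusted source


# Business term -> candidate physical sources (illustrative values only).
SEMANTIC_MODEL = {
    "customer": [
        Source("crm_export", "cust_nm", 1),
        Source("billing_raw", "CUSTOMER_NAME", 2),
    ],
    "revenue": [Source("billing_raw", "AMT_USD", 1)],
}


def resolve(term, available_datasets):
    """Return the most trusted available source for a business term."""
    candidates = [
        source
        for source in SEMANTIC_MODEL.get(term, [])
        if source.dataset in available_datasets
    ]
    if not candidates:
        raise KeyError(f"no available source for business term {term!r}")
    return min(candidates, key=lambda source: source.priority)


best = resolve("customer", {"crm_export", "billing_raw"})
fallback = resolve("customer", {"billing_raw"})
```

Here a user asks for “customer” and gets the CRM copy when it is present, or the billing copy when it is not. The choice among duplicated values is made once, in the model, rather than by every user on every query.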

Unfortunately, semantic layers, like metadata, have been used by vendors to support the operation of their own product sets and features. To date, no semantic layer built with open source technology is available that makes accessing data stored in data lakes simple for users. Because much of the core technology for Hadoop is open source, I believe a Hadoop-based, open source semantic layer is needed.

The Need for an Open Source Semantic Layer

To unleash the power of Hadoop-based data lakes, a semantic layer is needed. My challenge to The Apache Software Foundation is to start a semantic layer project for Hadoop that integrates operationally with Apache’s Hadoop-based data security and governance projects.

After all, where data is concerned, ease of use is as important as ease of access.