by Rick Skriletz
Data and its use have always been a challenge. It begins with the data entered into, stored in, and used by application systems. Application systems, the core of the business, have always been the primary IT focus. The second need has been reporting and basic analysis of this data to understand the state of the business and its operations. It was from this second need that data marts and data warehouses arose.
Thus, data warehouses have always been about providing reporting and dashboards to the business. Yet data warehouses have not succeeded in meeting the data demands of the business. As I wrote in Four Fundamental Obstacles to Successful Analytics, BI is slow to get data to users, data and analytic technologies are complex, the platforms are expensive, and implementations have been performed with inconsistent technical discipline.
Users’ data demands have instead been addressed through a proliferation of independent reporting data marts, data warehouses containing data silos, and extracts of data. When users run their own reporting and analysis against these uncontrolled sources, different sources or silos produce different results. With no ‘single source of truth,’ variations in reported results and analytics continue to plague organizations.
Despite this lack of success as an enterprise data solution, data warehouses remained the focus of efforts to solve the problem until data lakes built on open-source Hadoop technologies were introduced. The value proposition was lower-cost data management software because of open source, lower-cost operations due to the use of commodity rather than proprietary hardware, and the ability to store all data regardless of source or type.
However, as with any new technology, using a data lake the way the previous technology was used fails to get the most out of it. The challenge for organizations developing a data lake, especially when there is a rush to capitalize on the cost savings it provides, is to avoid thinking of it as a cheaper data warehouse. Properly developed, it is much more.
Data Warehouses are for Reporting and Data Lakes are for Analytics
The expectations for a data lake need to focus on future use cases for data, not just current ones. This means using deep data sets that include history, third-party data enhancement, and a wide variety of data types and data structures. Data lake use cases include:
- Real-time event processing, alerts, notifications, and analytics
- Complex relationship analysis between actors, events, devices, and more
- Capture of data from Internet of Things (IoT) sensors and smart devices
- Application of machine learning (ML) to data events and streams
- Deep learning applied to automation of data-driven processes and real-time actions
- Use of artificial intelligence (AI) appliances to facilitate and automate business processes
- Advanced analytics, like data mining and predictive and prescriptive analytics
These use cases show how different a data lake is from a data warehouse. They also show why designing and architecting a data lake like a data warehouse is limiting. Success with a data lake requires methods and techniques suited to its real-time, analytic purpose.
Methods and techniques for a data lake begin with its architecture, data management process, governance, and operation. Problems can be architected into a data lake when a ‘lift and shift’ approach is used to move data warehouse functions into it.
The RCG|enable™ Data solutions accelerate and operationalize the methods and capabilities required for a successful data lake. These solutions give data lake use cases, methods, and techniques the attention they need, so that organizations end up with a data lake that realizes business value from data and analytics rather than a data swamp.