A retrospective on a prevalent practice in building data lakes
by Debashis Rana
As data lakes have gained popularity in the industry, a few styles of implementation have emerged. One of these involves going on a rampage of ingesting data into the lake – a simple copy of production data from all conceivable sources, if you will – with little thought about its use down the road, or, at best, getting stuck in pilot-use-case purgatory. Unfortunately, simply ingesting data and making it available doesn’t work; it creates yet another data asset that IT has to manage, care for, and feed going forward.
The true benefit of a data lake comes from making the data usable so that business decisions can be made. Analytics are the biggest value-add: they make the data suitable for business decision-making and move the needle toward the proverbial data-driven enterprise.
In the Beginning…
… there were just operational systems, i.e., applications that enabled the business processes that ran the business, and the data that lived in those systems. When insights were needed, users were relegated to running analytical reports on these systems, which slowed the business down and upset the operations folks.
Then came data warehouses that served as the single source of truth. But they took too long to build – the dreaded enterprise data model that all stakeholders must buy into, laborious data mapping from sources to targets, and tedious ETL development and testing were to blame, to name a few. And they were even harder to change when requirements changed. Not to mention, they could only hold tabular data; anything else was either converted to tabular form or stored using data types such as LONGs, BLOBs, and CLOBs, with specialized applications built to use it. The compromises were too many, and the returns were too few and far between to justify the expense.
Then came data lakes, which offered the flexibility to ingest anything and everything. Non-tabular data? Not a problem. Unstructured data? Ditto. So, we’re off to the races again. The lure of being able to ingest it all – recall the three V’s of Big Data that were all the rage not too long ago – is too tempting, given fairly mature distributed architectures, implemented on commodity hardware, Cloud adoption, and an open source movement to boot. But are we repeating old mistakes? Are we building yet another data store that simply inherits and amplifies all the problems of the past? Are we treating these contemporary platforms as yet another container to build traditional data warehouses? In a time where IT budgets are shrinking and the business is demanding more, do we need yet another “technology failure” badge of honor?
So, how do we reap the benefits of a data lake to finally reach the promised land, but avoid the landmines along the way? The key to the answer lies in how the data will be used — not technically, but from a business standpoint. And that business standpoint will have technical implications. While the 3 Vs are probably a dated (and tired) way to characterize Big Data (another terrible term in my opinion), they still offer some perspective on what we’re facing.
First, consider the sheer volume of data that one can collect into a data lake. One could say that more data is generally a good thing. However, we know that a human’s capacity to process information has limits, so large amounts of data are naturally a problem from a business standpoint. Just as IT teams are scratching their heads about how to store, manage and process more data than we have ever seen or even imagined, business users are wondering how they will understand all that data and make good use of it. From a technical standpoint, this calls for not just more storage, but also more forethought, planning, and organization of the data throughout its lifecycle. Elastic infrastructure, a variety of storage options, and updated archival policies and technologies are some of the tried and tested ways to manage volume. One word: scale. And scale doesn’t apply only to physical storage, but to the people and processes that live around it.
Next, consider the variety of data. The business user couldn’t care less about the format of the data as long as it is available to them when they need it, and in a shape and form that they understand. So, it is an “IT problem” to figure out how to consistently extract relevant pieces of information from a variety of sources, especially semi-structured and unstructured ones. It’s a problem that impacts both storage and processing capabilities. Data stores must be format-agnostic, and data processing must be sophisticated enough to handle various formats. And don’t forget the importance of efficiency, given the volume constraint we noted above. Format translators, text analytics, search engines, natural language understanding, image/audio/video recognition, and many more technologies have emerged and/or matured. The fundamental principle here is to pick one or more tools that are fit-for-purpose while ensuring scalable operation and sustainable support and maintainability. On that last point, the non-technical aspects – such as skills, process, and organization – are as important as the technical ones, if not more so.
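To make the format problem concrete, here is a minimal sketch (in Python, not tied to any particular product) of normalizing records from two different source formats into one common tabular shape. The file contents and field names are made up for illustration.

```python
# A minimal sketch of making variety an "IT problem": map records from
# different formats onto one common schema. All names and data are hypothetical.
import csv
import io
import json

def normalize(record):
    """Map source-specific field names onto a common schema."""
    return {
        "id": record.get("loan_id") or record.get("id"),
        "balance": float(record.get("outstanding_balance") or record.get("balance") or 0),
    }

# One source delivers CSV, another delivers JSON; both land in the same shape.
csv_source = io.StringIO("loan_id,outstanding_balance\nL-1,25000\nL-2,41000\n")
json_source = '[{"id": "L-3", "balance": 9800}]'

records = [normalize(r) for r in csv.DictReader(csv_source)]
records += [normalize(r) for r in json.loads(json_source)]

print(records)
```

Real lakes face far messier inputs, of course, but the principle is the same: the translation happens once, on the IT side, so the business sees one consistent shape.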
Velocity. This is an interesting one. First, the business perspective – when data is coming at you that fast, there must be “an equal and opposite reaction,” i.e., the business must have a nimble process that can absorb that fast data and do something about it within a timeframe that has parity with the speed of data. Many organizations are struggling with this since decades of “batch” data have shaped their thinking and business processes – and more importantly their expectations – to live with the last available data which could be a day (or more) old. So, dealing with the latest-and-greatest data also requires latest-and-greatest business processes. At the same time, this doesn’t let IT off the hook; the plumbing needs to be changed to accommodate fast data. Or, the roads need to be upgraded to highways — choose your analogy. Here, too, technological advancements that are too numerous to list have become mainstream and offer many solutions. As before, a holistic view that takes people, process and technology into consideration is what makes it not just successful, but also sustainable.
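As a toy illustration of what “parity with the speed of data” means, the sketch below reacts to each event as it arrives instead of waiting for yesterday’s batch file. The events, field names, and threshold are all hypothetical.

```python
# A toy contrast with batch thinking: act on each event as it arrives,
# rather than on the last available (day-old) extract. Data is invented.
def handle(event, threshold=100_000):
    """React at the speed of data: flag large loans the moment they appear."""
    if event["amount"] > threshold:
        return f"ALERT: review loan {event['id']}"
    return None

# Stand-in for a live event stream.
stream = [
    {"id": "L-1", "amount": 25_000},
    {"id": "L-2", "amount": 140_000},
]

alerts = [msg for e in stream if (msg := handle(e))]
print(alerts)
```

The plumbing in production would be a streaming platform rather than a list, but the business-side point stands: an alert is only useful if there is a nimble process ready to act on it.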
Another “V” that has sort of earned its place in the industry vernacular is veracity (a variant I’ve seen is “validity,” and I’m sure there are more): that is, the business must be able to trust the data that it receives. No questions asked. (Anyone who hasn’t lived the “prove to me that your report is right” scenario, please raise your hand.) As with variety, the business doesn’t care how; IT must just get the data right. Therefore, the fundamental principles of data management – which include data quality, metadata and master data – must still apply, but they must now apply at scale (volume), be flexible (variety), be extensible (velocity), and be auditable (veracity).
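One way to picture “auditable veracity” is a small set of named data-quality rules whose pass/fail counts are recorded at ingest time, so trust is measured rather than assumed. The rules and field names below are illustrative only.

```python
# Illustrative data-quality rules: each has a name, so results are auditable.
rules = {
    "balance_non_negative": lambda r: r["balance"] >= 0,
    "state_known": lambda r: r["state"] in {"FL", "LA", "TX"},
}

def audit(records):
    """Return pass/fail counts per rule, forming a simple audit record."""
    results = {name: {"pass": 0, "fail": 0} for name in rules}
    for r in records:
        for name, check in rules.items():
            results[name]["pass" if check(r) else "fail"] += 1
    return results

# Invented sample: one clean record, one that violates both rules.
sample = [
    {"balance": 25_000, "state": "FL"},
    {"balance": -10, "state": "GA"},
]
print(audit(sample))
```

A real implementation would persist these counts with lineage metadata; the point is that “prove to me that your report is right” gets answered with evidence, not assurances.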
But won’t this take too long? How is this going to work? It seems like being placed between the proverbial rock and a hard place, for sure.
Enter Agile Analytics
Before we start problem-solving, first a disclaimer about the term “agile.” Like any other industry buzzword, it has been used, overused, misused and abused. In this context, by agile we simply mean being nimble rather than trying to align with the formal definition of the term in a methodology. Hey, agile analytics has a better ring to it than nimble analytics!
OK, typical scenario: an impatient business user wants everything now. First, assure them that all these wonderful new data sources are being captured and stored in the data lake. We have found the term “ingest” to evoke a range of responses, so keep it simple. Better yet, you know your user best, so use your judgment.
Next, negotiate what they would like to have first versus later. Hey, IT budgets have been cut (yet again), so why don’t we focus on something that you need urgently, and work to make that happen. Then we’ll worry about the rest. (That is a thinly veiled version of starting to create the classical “product backlog.”) It’s tricky but it can be done. The key thing here is to be specific about the urgent request. The best place to start is the desired business outcome, for example, “I need to be able to advise my loan department as to what kind of loans we should be prioritizing for the next quarter based on our current portfolio, market conditions, external factors, and risk tolerance.” Sounds logical and great, but what does it mean? Decompose, decompose, decompose:
- Current portfolio – should come from the operational system and/or data warehouse, done. (If not, there’s a bigger problem that should be solved first.)
- Market conditions – ah, yes, our data lake does capture market data about loans and leases being made by banks similar to us, check.
- External factors – what do you mean by this? “Oh, I meant I want to reduce my exposure to small and midsize businesses in flood plains in Florida, Louisiana, and Texas. And that’s just one example.” Hmmm, I don’t think we capture that in the data lake currently, but let me check. Do you have a data source in mind for that? “Nope. But it would really help.” OK, made a note of it. Move on.
- Risk tolerance – what is that? “Oh, there’s a formula we use… there’s a spreadsheet that is created in my department which has the formula.” Perfect. Now we’re talking.
Give a shape to what should be delivered. “A dashboard would be perfect.” Can you sketch it out on this piece of paper for me? “Oh, I don’t know, maybe (while doodling) a list of counties in the 3 states I mentioned down the left-hand side and number of loans and outstanding balances across the top, followed by whether they’re in a flood plain, some market benchmarks, and the risk values… you know the formula we talked about?” Now we’re getting somewhere.
Then, talk about the specific data elements that will be needed to satisfy the focused request. Cross-reference the data needed against what is being ingested. There are obviously two possibilities here – either the data is there or it isn’t (say, whether a business sits within a designated flood plain). If it’s there, perfect. If not, two options:
- Can we get it as part of this focused effort?
- Or, can we put it on the backlog?
Either way, you have a resolution.
Now comes the fun part. What must be done to the “raw data” (the data you’ve been ingesting) before it can be delivered? This is where the “there’s a spreadsheet with the formula” comes in. And several other things, but let’s keep it simple. What if you could allow the users to make that doodle described above a reality by themselves? Enabling user self-service would eliminate the endless cycle of:
- Let me get the data architect involved…
- We’ll need to create a data model…
- We’ll have to come back to you for the rules…
Not possible, you say? Yes, it is possible. As a matter of fact, it’s here now, and organizations are using it very effectively.
Here’s an interesting term: self-service data curation. What is that? It’s a collaborative methodology, enabled by advanced technology, in which the business user and IT (typically an analyst) sit down together with the raw data and physically manipulate it, shaping it interactively to match that doodle. The process goes by several industry buzzwords such as data wrangling, data preparation, data harmonization, data munging (didn’t even know that last one was a formal word), etc. At a high level, they all do the same basic thing: they show real data and allow the user to “transform” it without writing a single line of code. How do they do it? The process varies, but generally involves predictive actions based on commonly known ETL patterns. Some of them also combine machine learning with pattern recognition and offer some pretty sophisticated features. Regardless, the business user (with minimal training on one of these tools) has effectively accomplished three things. He or she has:
- Given you requirements, including a prototype
- Developed the solution for you
- Effectively “passed” user acceptance testing
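For illustration only, here is a rough sketch of the curated dataset behind that doodle: counties down the side, loan counts, balances, a flood-plain flag, and a risk value across the top. The risk formula is a stand-in for the department’s spreadsheet formula, and all figures are invented.

```python
# A sketch of the curated output matching the doodle. Counties, balances,
# and the risk formula are all hypothetical stand-ins.
from collections import defaultdict

loans = [
    {"county": "Harris, TX", "balance": 250_000, "flood_plain": True},
    {"county": "Harris, TX", "balance": 120_000, "flood_plain": True},
    {"county": "Orleans, LA", "balance": 90_000, "flood_plain": True},
    {"county": "Travis, TX", "balance": 60_000, "flood_plain": False},
]

def risk(total_balance, flood_plain):
    # Hypothetical stand-in for the department's spreadsheet formula.
    return round(total_balance / 1_000_000 * (2.0 if flood_plain else 1.0), 3)

# Roll up loans by county: count, total balance, and flood-plain exposure.
rows = defaultdict(lambda: {"loans": 0, "balance": 0, "flood_plain": False})
for loan in loans:
    row = rows[loan["county"]]
    row["loans"] += 1
    row["balance"] += loan["balance"]
    row["flood_plain"] |= loan["flood_plain"]

for county, row in sorted(rows.items()):
    row["risk"] = risk(row["balance"], row["flood_plain"])
    print(county, row)
```

In a curation tool, the user builds the equivalent of this roll-up interactively; the code only shows the shape of the result that becomes the dashboard’s data source.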
Once the data curation is complete, the remaining pieces are comparatively simple, such as putting together a dashboard in a BI tool of your choice. The process described above isn’t fiction; it is real. We did this at a regional bank (yeah, the examples are a dead giveaway). We went from “zero to dashboard” in two weeks, and we added more sources and analytics in another two weeks. Truly nimble, if you ask me. Our experience has shown that agile analytics saves 50% of the time and effort in data integration, and that’s a conservative estimate.
Help Is Available
Some IT housekeeping is necessary before we wrap up. We’ve all seen multiple generations of self-service tools proliferate data and related artifacts in uncontrollable ways, until it eventually becomes IT’s problem to manage – a serious issue. So, how do we solve this? There are two key actions to take here:
- Ensure that best practices are followed during data curation, or have a way to enforce it after the fact (yes, like any other self-service technology, there’s more than one way to achieve a certain result)
- Have a robust way to operationalize it so that it can be managed at scale, with flexibility, extensibility, and auditability
Some of these tasks are technology-specific. And they can be time-consuming. This is where RCG Global Services can help. We have built accelerators that help operationalize self-service data curation technology and grouped them in our family of solutions under the heading of RCG|enable® Visualized Data Management. Contact us and we’ll be happy to help you.
In summary, the process of going from raw data to something the business can use is becoming an agile process. Traditional SDLC has been turned on its head, and the disruptors are here to stay.
Last but not least, consider this alternate scenario: “Just get me access to the data and I’ll know what to do with it.” We’ve all been there, too, right? A power user. A DIY mentality added to the impatience and the “I want it now.” Sure thing. Give it to them. Then check in frequently to see how they’re doing and whether agile analytics can help. With the monitoring tools available in most modern data lake platforms, one can easily replicate the user’s actions and, better yet, enhance them to create prototypes that you know (or think) will wow them. Once they see that things can be done quickly, they will come. After all, the data belongs to the business (all the enterprise data, anyway); IT is merely its custodian.
In the End…
… it’s not about just ingesting data. Sure, ingesting is important since without it nothing else can happen downstream. But ingesting alone is not enough. It’s about curating it into a form that the business can readily use without any further manipulation, and with complete trust. That’s the value. The faster this happens, the happier the users are. Agile analytics delivers curated data at the speed of business.
Simply put, don’t just ingest it, but also curate it… and then they will come.