How to Speed Data Integration by Transforming Data Transformation
Related Topics: Data & Analytics
by Debashis Rana –
Does the data integration process at your company sometimes seem tediously slow?
If your answer to that question would be yes, you’re in the clear majority. Most IT execs would likely say that the ETL/ELT process is a significant driver in expanding costs and extending project timelines. And most are on the lookout for ways to accelerate the ETL/ELT process.
If you’re looking to accelerate the data integration process at your company, you’ll want to read on. Because by transforming the data transformation process, you can expect to significantly reduce the time required for your typical data integration project.
Tried, True, Tested — and S-L-O-W
The traditional SDLC for a data integration project (building a data warehouse, for example) would follow this path:
- Capture business requirements, including the data and analytics that will be needed, and the ways users will interact with the data
- Convert the business requirements into data requirements, typically done by studying an existing solution or by interviewing users
- Identify the best sources of data by interviewing SMEs
- Profile the data, with the goal of gauging its quality and usability for the project
- Define the transformations (manipulations) that will be performed on input data to transform it to output data (data filtering, cleaning, enriching, joining, aggregating, pivoting, etc.)
The process generates documentation (often spreadsheets) that are typically known as “source-to-target mapping documents.” These are used to guide the development team, which then uses an ETL/ELT tool to develop the mappings and transformations.
The testing team then creates and executes test cases intended to ensure that the mappings and transformations conform to the design, as specified in the source-to-target documentation. Data validation testing usually involves writing and executing queries against sources and targets.
The process described above is certainly tried and true. It does work. And it is painfully slow.
Changes Are Afoot (and likely to cause a stumble)
Though the traditional process nutshelled above is slow, it’s quite typical for another wrench to be thrown into the works that slows the process even further: the dreaded change requests.
Once the code has been deployed, it’s a virtual certainty that some changes will be demanded by users. It’s the dreaded “I know that’s what I said, but that’s not what I meant” phenomenon.
A change requires that the entire process be repeated, frequently causing a team to stumble in its efforts to meet project timelines.
Today’s Users Demand Acceleration
The traditional SDLC process runs counter to current industry trends. Today’s users want access to data to occur much faster — even in real time.
But they also want data that has been cleaned, tested, and presented in the most usable state. (Don’t fall into the “provide me with the raw data, and I’ll deal with it” trap; that approach rarely works.)
The growing demand is for clean, usable data, delivered with lighting speed. Fulfilling that demand requires an acceleration of the traditional process.
But how can that best be accomplished? The key is in eliminating a redundancy that exists within the design, development, and testing process.
Speed Data Integration by “Skipping” Development
Consider that for most organizations, the preponderance of data is structured data that lives in Relational Database Management Systems. (The use of unstructured data is growing, but still minimal at most companies.) Accessing and manipulating data in an RDMS is done through Structured Query Language commands such as SELECT, INSERT, UPDATE, and DELETE. SQL has been in use for decades, and is well understood by technical users.
So what if we had a “code generating engine” that could leverage the functionality and familiarity of RDMS and SQL? Such an engine would be capable of:
- Reading the source-to-target mapping documents
- Generating SQL against the sources to fetch input data
- Applying required transformations to the input data to achieve the output data
- Generating SQL against the targets for sending the output data to them
This code-generating engine would use metadata to read, manipulate, and write data.
While the concept of a code-generating engine is certainly not new, the twist on the concept, in this case, is that code would be generated and executed on-the-fly, or dynamically. The net effect would be the de-facto elimination of the development step of the SDLC by automating development.
This would represent an evolution from a traditional development-based approach to a configuration-based approach; the metadata is essentially “configuring” what data to fetch, how to manipulate it, and where to send it.
The Impact Upon Testing
How would this configuration-based approach impact the testing process? A configuration-based approach would streamline testing considerably, adding to the total project time savings that would be realized.
We know that our engine would have the ability to generate and execute SELECT statements against the sources. Why not also command the engine to generate and execute SELECT statements against the targets? That would provide us with a basis for comparing data between the sources and the targets — essentially automating the data validation testing process.
The departure from prior efforts at automating the data validation testing process is that our engine would enable the use of the same design document for both generating and testing code.
Additional Time Savers
Our configuration-based code-generating engine would offer substantial time-savings by automating development and data validation testing. But additional time-saving benefits would be realized through the engine’s ability to:
- Maintain a complete audit trail of executions
- Log errors and perform the associated necessary actions (such as alerting the appropriate personnel)
- Compute and distribute analytics on executions
Once the engine’s core capabilities of code generation and execution have been accomplished, the above are easier to implement.
And, in fact, these additional benefits constitute “must haves.” They help to ensure that the code can be deployed into an operational IT environment where appropriate controls are necessary from multiple standpoints, including (but not limited to) maintainability and regulatory compliance.
The Real-World Benefits? Quite Impressive…
Departing from the realm of theory, how would the advantages of a configuration-based code-generating engine translate into real-world benefits?
At a minimum, companies deploying this approach could realistically expect to speed data integration by a factor of two, while also reducing the size of development teams. And doubling the speed at which data integration projects are completed would be a very welcome boon at most companies — even if it does require a bit of a twist to the good old SDLC.
The configuration-based code-generating engine described above is not mere theory; it exists in the form of RCG Global Service’s RCG|enable™ Metadata Transformation framework. Deployed in the field over a period of more than two years for many RCG clients, this proprietary framework has delivered acceleration rates as high as 85%.