Stelo Blog

Understanding the Future of Data Ingestion by Exploring the Past

Written by Jessica Sheridan | Nov 30, 2022 2:47:24 PM

The ETL data ingestion method (extract, transform, load) originated when database processing protocols were becoming more standardized. Storing and using data required these three steps. In recent years, cloud computing has made it possible to flip the last two steps, loading data first and transforming it afterward, in a method called ELT. This approach leverages data warehousing solutions to perform transformation in the destination database. For more information on how this terminology is used, take a look at our blog, Data Ingestion vs. ETL vs. ELT: What are the differences? Here, we’re focused on the future of data ingestion. To understand where we’re going, we have to look back.

For many years, companies would set up a single, monolithic database and perform transaction processing against it. There was plenty of interest in separating operational data stores from reporting stores, but in practice many different stakeholders analyzed data from the same database. Operational data and reporting data represent competing needs. For operational purposes, you need high-speed, rapid in-and-out transaction processing. For analysis and reporting, you need elaborate constructs, like data cubes and data consolidations, that can be very resource-intensive to produce, especially when the same system is also serving operational requirements. With a monolithic database structure in place and these goals in mind, data management evolved toward data warehousing. It made sense to, in essence, create replicas of the original database so each knowledge worker could work with data the way they needed to for their own job function.

Since then, the industry-wide perspective has continued to evolve. We recognize that the relational database was not an ideal way to store information for analysis. A replica of the current database isn’t enough because, generally speaking, it’s been optimized for transaction processing. Companies also need access to the history of changes that have occurred, which dramatically expands the amount of information being maintained. Further, instead of prioritizing transaction processing, many companies benefit from distinct messages (or transactions) that can be digested independently of the rest of the database. That brings us to the concept of self-defining messages.

Stakeholders don’t want to spend a lot of time looking up schema information, and they shouldn’t have to. In a self-defining format, normalized data is tagged and packaged so that each message carries its own description. A glacier is a great comparison. Each year, snow and ice accumulate in layers, so a core sample of the glacier offers a rich history of climate over long periods of time. Similarly, with delta lakes, information is layered in, so you don’t have to refer elsewhere to make sense of what you have. Today, self-defining messages are helping move data into this new form.
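As a rough illustration (a hypothetical shape, not Stelo’s actual message format), a self-defining change message might carry its column names and types alongside the row values, so a consumer can interpret it without consulting the source database’s schema:

```json
{
  "table": "orders",
  "operation": "UPDATE",
  "capturedAt": "2022-11-15T09:30:00Z",
  "columns": { "order_id": "INTEGER", "status": "VARCHAR(20)" },
  "before": { "order_id": 1001, "status": "PENDING" },
  "after": { "order_id": 1001, "status": "SHIPPED" }
}
```

Because the description travels with each message, every layer is readable on its own, much like a single band in the glacier’s core sample.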

With new methods come new interfaces for data repositories. When departing from a traditional SQL interface to access a destination, there are two options to consider: a message-based format or a custom interface. As a company that values future-proofing, Stelo opted for open-source code using Java connectors. We can quickly adapt to different destinations while maintaining the integrity of the data pipe. We work with a wide variety of adapters that can be prepackaged for delta lakes or completely custom. Stelo’s v6.1 sets up high-speed connections from the change data capture pipe to a variety of consumers and extends destination processing to self-defining messages encoded using JavaScript Object Notation (JSON) and carried over Kafka.
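To show how such a message might travel over Kafka, here is a minimal sketch in Java using the standard Apache Kafka producer client. The broker address, topic name, and message fields are illustrative assumptions; this is not Stelo’s adapter code or SQDR’s internal interface.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangeMessagePublisher {
    public static void main(String[] args) {
        // Basic producer configuration; the broker address is an assumption for this sketch.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // A self-defining change message: column names and types travel with the row values,
        // so a consumer can interpret it without looking up the source schema.
        String changeMessage = "{"
            + "\"table\":\"orders\","
            + "\"operation\":\"UPDATE\","
            + "\"columns\":{\"order_id\":\"INTEGER\",\"status\":\"VARCHAR(20)\"},"
            + "\"before\":{\"order_id\":1001,\"status\":\"PENDING\"},"
            + "\"after\":{\"order_id\":1001,\"status\":\"SHIPPED\"}"
            + "}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish the JSON message to a hypothetical "change-data" topic that downstream
            // consumers (for example, a delta lake loader) can read independently of the source.
            producer.send(new ProducerRecord<>("change-data", "orders:1001", changeMessage));
            producer.flush();
        }
    }
}
```

A consumer subscribed to that topic can then apply each change on its own schedule, without ever querying the source database for schema details.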

ETL and ELT coexist right now, and it’s likely that they will continue to coexist for some time. Stelo embraces both methodologies, and we’re committed to evolving as data management strategies and customer requirements evolve. More and more, people expect their systems to take the same set of data and deliver it to different users in the form that’s most useful for each of them. Objectively, that’s a reasonable ask. ETL offers these kinds of efficiencies, and if there’s more transformation work to be done after the data reaches its destination, there’s space for that too.

For more information on Stelo’s approach to data ingestion or additional details on SQDR v6.1, visit our Data Replication page.