For data to be useful, it needs to be up to date.
Changes in data over time often surface key insights that can drive business decisions. Data that changes tends to be more valuable than static data – but it’s also more difficult to keep accurate across systems.
This is where change data capture (CDC) comes in. With CDC, you can ensure that multiple databases – sources and destinations, publishers and subscribers – accurately reflect changes.
To help you identify how this might impact your business environment, we’ll break down what change data capture is, what role it plays in data ingestion, and what the best strategies are to carry it out.
Ready? Let’s dive in.
Let’s begin with a definition. Here’s how Syncsort defines it:
“CDC is the process that ensures that changes made over time in one dataset are automatically transferred to another dataset.”
This is an apt description. It’s worth noting that change data capture should also be reliable (meaning it ensures accuracy in addition to automation) and it may encompass more than two datasets (meaning changes may be passed to multiple subscribers).
With the definition clarified, let’s review how change data capture fits into data ingestion.
Change data capture is typically the first stage in data ingestion; in other words, if you’re using an ETL approach to data ingestion, change data capture corresponds to the extraction phase. It’s the stage at which changes are noted.
There are five commonly recognized methods of performing change data capture.

The first relies on timestamps (sometimes called audit columns). Under this approach, after data is extracted, a process runs to identify any rows that were modified since the most recent extraction. Any row with an updated timestamp is flagged as a change, and that data is then passed through and updated in the destination database.
This is a simple and fairly effective method of change data capture, but it can be problematic in certain scenarios. Most notably, if rows are deleted from a dataset, this approach won’t catch the change: once a row is gone, there’s no timestamp left to compare.
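As a minimal sketch of timestamp-based extraction (using SQLite; the table name, column names, and data here are hypothetical):

```python
import sqlite3

# Hypothetical source table with an application-maintained timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        updated_at TEXT  -- ISO-8601 timestamp set on every write
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Alice", "2024-01-01T10:00:00"),
        (2, "Bob",   "2024-01-05T12:30:00"),
        (3, "Carol", "2024-01-09T08:15:00"),
    ],
)

# Timestamp of the most recent extraction, stored by the CDC process.
last_extracted = "2024-01-03T00:00:00"

# Any row with a later timestamp is treated as a change to replicate.
changes = conn.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_extracted,),
).fetchall()
```

Note that a row deleted from `customers` would simply be absent from the query result, which is exactly the blind spot described above.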
A second change data capture strategy is to use database triggers. This method requires that the source database be configured to send notifications when changes to a row are made.
This makes real-time change data capture possible, but it comes with drawbacks, too. The biggest one is that it requires a significant expenditure of resources from the source; any applications that depend on the source may feel negative effects.
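A trigger-based setup can be sketched in SQLite, where triggers write every change into a side table that a CDC process can drain (the table and trigger names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Side table that accumulates change notifications.
    CREATE TABLE customer_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER,
        operation TEXT
    );

    -- Triggers fire inside the source database on every write,
    -- which is why this method taxes the source's resources.
    CREATE TRIGGER customers_insert AFTER INSERT ON customers
    BEGIN
        INSERT INTO customer_changes (customer_id, operation)
        VALUES (NEW.id, 'INSERT');
    END;

    CREATE TRIGGER customers_delete AFTER DELETE ON customers
    BEGIN
        INSERT INTO customer_changes (customer_id, operation)
        VALUES (OLD.id, 'DELETE');
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("DELETE FROM customers WHERE id = 1")

log = conn.execute(
    "SELECT customer_id, operation FROM customer_changes ORDER BY change_id"
).fetchall()
```

Unlike timestamps, this approach does capture deletes – but every trigger execution adds work to the source database’s own transactions.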
Another method of change data capture is to use snapshot comparisons (or the diff method, as it’s alternatively called).
Snapshot comparisons work by comparing entire datasets against previous versions of entire datasets and noting any changes. They typically are set to run at regular intervals.
This approach tends to be very reliable, but, because it requires processing the entire dataset, it can become a significant drain on resources, especially as the dataset grows. For this reason, it’s not very scalable.
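The diff itself is conceptually simple; a sketch over two in-memory snapshots keyed by primary key (hypothetical data) shows both the strength and the cost:

```python
# Two full snapshots of a dataset, keyed by primary key.
previous = {1: "Alice", 2: "Bob", 3: "Carol"}
current  = {1: "Alice", 2: "Bobby", 4: "Dave"}

# Rows present only in the new snapshot.
inserts = {k: current[k] for k in current.keys() - previous.keys()}

# Rows that vanished between snapshots -- deletes are caught,
# which timestamp-based CDC cannot do.
deletes = previous.keys() - current.keys()

# Rows present in both snapshots whose values differ.
updates = {
    k: current[k]
    for k in current.keys() & previous.keys()
    if current[k] != previous[k]
}
```

Every key in both snapshots is touched on every run, which is why the cost grows with the full dataset rather than with the number of changes.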
The fourth approach is dual writing. There’s some debate as to whether dual writing qualifies as change data capture, because it technically isn’t capturing changes and passing them on; it’s simply recording each change simultaneously in multiple places.
This approach is uncommon because, with no single transaction spanning both databases, it’s difficult to ensure that both writes succeed or fail together. If one succeeds and the other fails, the datasets drift apart and become unreliable.
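The failure mode can be shown in a few lines. In this sketch (the stores and failure are simulated, not real databases), the two writes share no transaction, so a failure on the second leaves the stores inconsistent:

```python
primary = {}
replica = {}

def dual_write(key, value, fail_replica=False):
    # No transaction spans the two stores: the first write can
    # commit while the second fails, and nothing rolls it back.
    primary[key] = value
    if fail_replica:
        raise ConnectionError("replica unreachable")
    replica[key] = value

dual_write("a", 1)           # both writes succeed
try:
    dual_write("b", 2, fail_replica=True)
except ConnectionError:
    pass  # primary now holds "b" but the replica does not
```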
Finally, the fifth approach to change data capture is to scrape logs to identify changes. This is called log-based CDC, and it works because nearly all transactional databases record transaction logs.
This is our preferred approach. It has a near-zero impact on source database resources, making it very scalable.
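Real transaction-log formats are database-specific, but the replay step can be sketched with a simplified, hypothetical log structure: the CDC process reads committed log entries and applies them to the destination without ever querying the source tables.

```python
# Simplified, hypothetical transaction-log entries: each records
# the operation, the row key, and (for writes) the new value.
transaction_log = [
    {"op": "insert", "key": 1, "value": "Alice"},
    {"op": "insert", "key": 2, "value": "Bob"},
    {"op": "update", "key": 2, "value": "Bobby"},
    {"op": "delete", "key": 1, "value": None},
]

def apply_log(log, replica):
    """Replay log entries against a destination dataset.

    The source tables are never touched, which is why log-based
    CDC has near-zero impact on the source database.
    """
    for entry in log:
        if entry["op"] == "delete":
            replica.pop(entry["key"], None)
        else:  # insert or update
            replica[entry["key"]] = entry["value"]
    return replica

replica = apply_log(transaction_log, {})
```

Because the log already records deletes, this method avoids the blind spot of timestamp-based CDC as well.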
To ensure changes can be processed by downstream consumers, there’s an increasing need for self-describing message content. Accordingly, data interchange formats such as JSON and message bus platforms such as Kafka are emerging as fundamental ingestion technologies.
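A self-describing change event might look like the following JSON payload before being published to a message bus topic (the field names here are illustrative, not a standard schema):

```python
import json

# Hypothetical self-describing change event: it carries the table,
# operation, key, and before/after images, so any subscriber can
# interpret it without out-of-band schema knowledge.
event = {
    "table": "customers",
    "op": "update",
    "key": {"id": 2},
    "before": {"id": 2, "name": "Bob"},
    "after": {"id": 2, "name": "Bobby"},
    "ts": "2024-01-09T08:15:00Z",
}

message = json.dumps(event)      # serialized payload for the bus
decoded = json.loads(message)    # what a subscriber would receive
```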
Hopefully, the information above has helped you to improve your understanding of change data capture and the strategies by which it’s implemented. If you’re looking to record change data through a data ingestion solution, let’s talk.
At StarQuest, we’re experts at data ingestion. Our powerful SQDR software can be utilized for replication and ingestion from an extensive range of data sources, and it powers real-time change data capture.
And, importantly, our customer service is regarded as among the best in the business, with one client calling it “the best vendor support I have ever encountered.”
If you’re looking for data ingestion for migration, data warehousing, application development, auditing, disaster recovery, or another use case – we can help.
Get in touch with us to discuss your data ingestion needs. We can set you up with a no-charge trial of our software using the DBMS of your choice, and help you take the first step toward a solution that will benefit your business.