What is change data capture?
Data everywhere is on the rise. Experts predict that, by 2025, the global volume of data will reach 181 zettabytes, or more than four times its pre-COVID levels in 2019. Data is inescapable in every aspect of life — and that's doubly true in business. In a world transformed by COVID, the world of business is a world of data.
But it can seem that for every problem data solves, another arises: Saturated and siloed data streams make it hard to create meaningful connections between datasets. New data gives us new opportunities to solve problems, but maintaining the freshness, quality, and relevance of data in data lakes and data warehouses is a never-ending effort. And, despite the proliferation of machine learning and automated solutions, much of our data analysis is still the product of inefficient, mundane, and manually intensive tasks.
When you boil it all down, organisations need to get the most value from their data, and they need to do it in the most scalable way possible. To support this objective, data integrators and engineers need a real-time data replication solution that helps them avoid data loss and ensure data freshness across use cases — something that will streamline their data modernisation initiatives, support real-time analytics use cases across hybrid and multi-cloud environments, and increase business agility.
Change data capture (CDC) makes it possible to replicate data from source applications to any destination quickly — without the heavy technical lift of extracting or replicating entire datasets. This ensures organisations always have access to the freshest data.
Change data capture definition
Change data capture refers to the process of identifying and capturing changes as they are made in a database or source application, then delivering those changes in real time to a downstream process, system, or data lake.
This advanced technology for data replication and loading reduces the time and resource costs of data warehousing programmes while facilitating real-time data integration across the enterprise. By detecting changed records in data sources in real time and propagating those changes to an ETL data warehouse, change data capture can sharply reduce the need for bulk-load updating of the warehouse.
CDC has become the most popular form of data replication because it sends only the most relevant data, putting less of a burden on the system. And because CDC only imports data that has changed — instead of replicating entire databases — it can dramatically speed data processing and enable real-time analytics.
What is data replication, and why does it matter?
Data replication is exactly what it sounds like: the process of simultaneously creating copies of and storing the same data in multiple locations. Putting this kind of redundancy in place for your database systems offers wide-ranging benefits, simultaneously improving data availability and accessibility as well as system resilience and reliability.
Data replication ensures that you always have an accurate backup in case of a catastrophe, hardware failure, or a system breach. And having a local copy of key datasets can cut down on latency and lag when global teams are working from the same source data in, for example, both Asia and North America.
When it comes to data analytics, there’s yet another layer to data replication. Data-driven organisations often replicate data from multiple sources into a data warehouse, where it powers business intelligence (BI) tools.
But, like any system with redundancy, data replication can have its drawbacks. When there are updates to data stored in multiple locations, it must be updated system-wide to avoid conflict and confusion. This can double (or triple, or more) the lift of data management over time, and creates a strain on resources, forcing data integrators and engineers to monitor multiple systems and databases, or to periodically replicate the full database from the source systems to all the other systems, applications, and data lakes or data warehouses that are using the same datasets.
CDC reduces this lift by only replicating new data or data that has been recently changed, giving users all the advantages of data replication with none of the drawbacks.
How does CDC work?
When new data is consistently pouring in and existing data is constantly changing, data replication becomes increasingly complicated. Because it works continuously instead of sending mass updates in bulk, CDC gives organisations faster updates and more efficient scaling as more data becomes available for analysis.
That said, not every implementation of CDC is identical — or provides identical benefits. Let’s look at three methods of CDC and examine the benefits and challenges of each:
Script-based CDC

It is possible to build a CDC solution at the application level by writing a SQL script that watches only key fields within a database. A change to the watched field (or fields) in the source table serves as the indicator that the row has changed. Changed rows can then be replicated to the destination in real time, or asynchronously during a scheduled bulk upload.
The script-based method is fairly straightforward, but building and maintaining a script may be challenging, particularly in a fast-paced or constantly changing data environment. An effective script might require changing the schema, such as adding a datetime field to indicate when the record was created or updated, adding a version number to log files, or including a boolean status indicator.
Because the script only looks at select fields, data integrity can suffer if the table schema changes. And while this approach is still less resource-intensive than many other replication methods, script-based CDC retrieves data from the source database directly, which puts an additional load on the system.
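The script-based approach above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module; the `customers` table, the `updated_at` column (the schema change the text mentions), and the timestamps are all hypothetical names chosen for the example.

```python
import sqlite3

# Source table extended with an `updated_at` timestamp, as script-based
# CDC typically requires. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        updated_at TEXT  -- ISO-8601 timestamp set on every insert/update
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada',   '2024-01-01T10:00:00')")
conn.execute("INSERT INTO customers VALUES (2, 'Grace', '2024-01-02T09:30:00')")

def capture_changes(conn, last_sync):
    """Return only the rows changed since the last replication run."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_sync,),
    )
    return cur.fetchall()

# Only the row modified after the last sync is replicated downstream.
changed = capture_changes(conn, "2024-01-01T12:00:00")
print(changed)  # [(2, 'Grace', '2024-01-02T09:30:00')]
```

Note that the query itself runs against the source database, which is exactly the extra load the paragraph above describes.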
Trigger-based CDC

Instead of writing a script at the application level, another CDC solution relies on database triggers: functions defined in the database that capture changes when specific events occur. Most triggers are activated when there is a change to the source table, using SQL syntax such as “BEFORE UPDATE” or “AFTER INSERT.”
This method gives developers control because they can define triggers to capture changes and then generate a changelog. And since the triggers are dependable and specific, data changes can be captured in near real time.
There are, however, some drawbacks to this approach. The first is obvious: since triggers must be defined for each table, there can be downstream issues when tables are replicated. Reliability can also suffer because triggers can be disabled, whether deliberately by users or temporarily to enable certain operations.
Moreover, with every transaction, a record of the change is created in a separate table, as well as in the database transaction log. Because it must go to the source database at intervals, trigger-based CDC puts an additional load on the system and may have a negative impact on latency.
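The trigger mechanism described above can be demonstrated with SQLite, which supports `AFTER INSERT` and `AFTER UPDATE` triggers. The `customers` and `changelog` table names are illustrative; a production changelog would also record timestamps and transaction identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Separate changelog table populated by the triggers below,
    -- as described in the text.
    CREATE TABLE changelog (
        op   TEXT,     -- 'INSERT' or 'UPDATE'
        id   INTEGER,
        name TEXT
    );

    CREATE TRIGGER customers_ai AFTER INSERT ON customers
    BEGIN
        INSERT INTO changelog VALUES ('INSERT', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_au AFTER UPDATE ON customers
    BEGIN
        INSERT INTO changelog VALUES ('UPDATE', NEW.id, NEW.name);
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")

# The changelog now holds one row per captured change, ready to replicate.
print(conn.execute("SELECT * FROM changelog").fetchall())
# [('INSERT', 1, 'Ada'), ('UPDATE', 1, 'Ada L.')]
```

Each write now costs two inserts (the row plus its changelog entry), which is the extra load on the source system that the paragraph above warns about.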
Log-based CDC

The most efficient and effective method of CDC relies on an existing feature of enterprise databases: the transaction log. In the typical enterprise database, all changes to the data are tracked in a transaction log. In the event of a disaster or a system crash, the data can be reconstructed by referencing these transaction logs.
A log-based CDC solution monitors the transaction log for changes. When those changes occur, it pushes them to the destination data warehouse in real time.
Because the transaction logs exist to ensure consistency, log-based CDC is exceptionally reliable and captures every change. And because the transaction logs exist separately from the database records, there is no need to write additional procedures that put more of a load on the system — which means the process has no performance impact on source database transactions. Best of all, continuous log-based CDC operates with exceptionally low latency, monitoring changes in the transaction log and streaming those changes to the destination or target system in real time.
But because log-based CDC exploits the advantages of the transaction log, it is also subject to that log’s limitations — and log formats are often proprietary. As a result, log-based CDC only works with databases whose transaction logs the CDC tool can read. Given all the advantages in reliability, speed, and cost, however, this is a minor drawback.
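The shape of a log-based CDC pipeline can be sketched as follows. Real transaction logs are proprietary binary formats (for example MySQL's binlog or PostgreSQL's write-ahead log) read through vendor APIs or connectors; this toy version uses a list of JSON change records purely for illustration, and the `lsn` (log sequence number) field, table, and operations are all assumed names.

```python
import json

# Stand-in for a real transaction log: one ordered change record per entry.
transaction_log = [
    '{"lsn": 1, "op": "INSERT", "row": {"id": 1, "name": "Ada"}}',
    '{"lsn": 2, "op": "UPDATE", "row": {"id": 1, "name": "Ada L."}}',
    '{"lsn": 3, "op": "DELETE", "row": {"id": 1}}',
]

def stream_changes(log, target, last_lsn=0):
    """Apply every log entry past the last-seen position to the target."""
    for line in log:
        event = json.loads(line)
        if event["lsn"] <= last_lsn:
            continue  # already replicated on a previous run
        key = event["row"]["id"]
        if event["op"] == "DELETE":
            target.pop(key, None)
        else:  # INSERT and UPDATE both upsert the row
            target[key] = event["row"]
        last_lsn = event["lsn"]
    return last_lsn

target = {}
pos = stream_changes(transaction_log, target)
print(pos, target)  # 3 {}
```

Because the reader only tails the log and never queries the source tables, the source database does no extra work — the property that makes log-based CDC so low-impact.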
CDC and ETL
Today, the average organisation draws from over 400 data sources. When you’re reliant on so many diverse sources, the data you get is bound to have different formats or rules. Moving it as-is from the data source to the target system via simple APIs or connectors would likely result in duplication, confusion, and other data errors.
ETL — which stands for Extract, Transform, Load — is an essential technology for bringing data from multiple different data sources into one centralised location. As the name implies, this technology extracts data from the source, transforms it to comply with the organisation’s standards and norms, then loads it into a data lake or data warehouse such as Amazon Redshift, Azure Synapse, or Google BigQuery.
Without ETL, it would be virtually impossible to turn vast quantities of data into actionable business intelligence. But when the process relies on bulk loading of the entire source database into the target system, it eats up a lot of system resources, making ETL occasionally impractical — particularly for large datasets.
That’s where CDC comes in. Because the CDC process only takes in the newest, freshest, most recently changed data, it takes a lot of pressure off the ETL system. Essentially, CDC optimises the ETL process.
At the same time, ETL can make up for the primary weakness of log-based CDC. Unlike CDC, ETL is not restrained by proprietary log formats. That means it can replicate data from any source — including those that can’t be replicated through log-based CDC.
In short, CDC and ETL are complementary technologies: CDC makes ETL more efficient, and ETL catches any data sources that log-based CDC can’t capture.
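How CDC slots into an ETL run can be sketched with a high-watermark pattern: extract only the rows changed since the last run, transform them, and load them into the warehouse. All names here (`source`, `warehouse`, the email-normalising transform) are illustrative, not any particular tool's API.

```python
# Hypothetical source rows and an in-memory stand-in for the warehouse.
source = [
    {"id": 1, "email": "Ada@Example.COM",   "updated_at": "2024-01-01T10:00:00"},
    {"id": 2, "email": "Grace@Example.com", "updated_at": "2024-01-03T08:00:00"},
]
warehouse = {}

def etl_incremental(rows, watermark):
    """Run one incremental ETL pass; return the new high-watermark."""
    changed = [r for r in rows if r["updated_at"] > watermark]   # Extract (CDC)
    for r in changed:
        clean = {**r, "email": r["email"].lower()}               # Transform
        warehouse[clean["id"]] = clean                           # Load
    return max((r["updated_at"] for r in changed), default=watermark)

new_mark = etl_incremental(source, "2024-01-02T00:00:00")
print(new_mark, sorted(warehouse))  # 2024-01-03T08:00:00 [2]
```

Only row 2 crosses the watermark, so only row 2 is transformed and loaded — the pressure CDC takes off a bulk-load ETL job.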
Use cases for CDC technology
Because CDC gives organisations real-time access to the freshest data, applications are virtually endless. With change data capture technology such as Talend CDC, organisations can meet some of their most pressing challenges:
Get the right data into the right hands in the right formats
Just having data isn’t enough — that data also needs to be accessible. CDC makes it easier to create, manage, and maintain data pipelines for use across an organisation. This means that all users have access to the most current and most correct data for business intelligence, reporting, and direct use in analytics and applications.
Increase data accuracy, quality, and reliability
The low-touch, real-time data replication of CDC removes the most common barriers to trusted data. The data lake or data warehouse is guaranteed to always have the most current, most relevant data. As a result, users can have more confidence in their analytics and data-driven decisions.
Improve regulatory compliance and adherence to privacy standards
Compliance with regulatory standards isn’t as easy as it sounds: when an organisation receives a request to remove personal information from its databases, the first step is to locate that information. If the person submitting the request has related records spread across multiple applications — for example, web forms, CRM, and in-product activity records — compliance can be a challenge.
By keeping records current and consistent, CDC makes it much easier to locate and manage these records, protecting both the business and the consumer.
Comprehensive enterprise data integration
Talend CDC helps customers achieve data health by giving data teams strong, secure data replication capabilities that increase data reliability and accuracy. Our proven, enterprise-grade replication helps businesses avoid data loss, ensure data freshness, and deliver on their desired business outcomes.
Talend's change data capture functionality works with a wide variety of source databases.
Talend’s data integration provides end-to-end support for all facets of data integration and management in a single unified platform. With an intuitive development environment, users can easily design, develop, and deploy processes for database conversion, data warehouse loading, real-time data synchronisation, or any other integration project.
Along with advanced runtime features like change data capture, Talend's data warehouse tools include support for sophisticated ETL testing, with features such as context management and remote job execution. The system also delivers enterprise class functionality such as workflow collaboration tools, real-time load balancing, and support for innovative mass volume storage technologies like Hadoop.
Along with our leading-edge functionality, Talend offers professional technical support from Talend data integration experts. For organisations launching master data management initiatives, Talend also offers an MDM solution that seamlessly integrates with Talend.
Learn more about Talend’s data integration solutions today, and start benefiting from the leading open source data integration tool.