Change data capture
In databases, change data capture is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data.
CDC is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.
CDC occurs often in data-warehouse environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.
System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system layers from application logic down to physical storage.
In a simplified CDC context, one computer system has data believed to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that would not change the design pattern logically. Multiple CDC solutions can exist in a single system.
1.1. Methodology Timestamps on rows
Tables whose changes must be captured may have a column that represents the time of last change. Column names such as LAST_UPDATE are common. Any row whose timestamp in that column is more recent than the last time data was captured is considered to have changed.
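As a minimal sketch, the timestamp approach reduces to a query against the last capture time. The table and column names here (customers, last_update) are illustrative only; Python's built-in sqlite3 module stands in for a real source database:

```python
import sqlite3

# Illustrative source table with a last-change timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, last_update) VALUES (?, ?, ?)",
    [(1, "Alice", "2005-06-15 09:30:00"),
     (2, "Bob",   "2005-05-01 11:00:00")],
)

# The capture job remembers when it last ran and selects only newer rows.
last_capture = "2005-06-01 00:00:00"
changed = conn.execute(
    "SELECT id, name FROM customers WHERE last_update > ? ORDER BY id",
    (last_capture,),
).fetchall()
# Only Alice's row was modified after the last capture point.
```

The capture job must persist its watermark (last_capture) durably; a lost or skewed watermark silently drops or duplicates changes.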
1.2. Methodology Version numbers on rows
Database designers give tables whose changes must be captured a column that contains a version number. Column names such as VERSION_NUMBER are common. When data in a row changes, its version number is updated to the current version. A supporting construct, such as a reference table holding the current version, is needed. When a change capture occurs, all data with the latest version number is considered to have changed. When the change capture is complete, the reference table is updated with a new version number.
Several major techniques exist for doing CDC with version numbers; the paragraph above describes just one.
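The reference-table variant described above can be sketched as follows; the names (orders, current_version) are illustrative, again using Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reference table holding the current capture version.
conn.execute("CREATE TABLE current_version (version INTEGER)")
conn.execute("INSERT INTO current_version VALUES (3)")
# Source table whose rows are stamped with the version of their last change.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, version_number INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 9.99, 2), (2, 24.50, 3), (3, 5.00, 3)])

# Rows stamped with the current version form the changed set.
(current,) = conn.execute("SELECT version FROM current_version").fetchone()
changed = conn.execute(
    "SELECT id FROM orders WHERE version_number = ? ORDER BY id", (current,)
).fetchall()

# After the capture completes, advance the reference table so subsequent
# changes are stamped with a new version.
conn.execute("UPDATE current_version SET version = version + 1")
```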
1.3. Methodology Use in optimistic locking
Version numbers can be useful with optimistic locking in ACID transactional or relational database management systems (RDBMS). For example, in read-then-update scenarios for CRUD applications, a row is first read along with the state of its version number; in a separate transaction, a SQL UPDATE statement is executed with an additional WHERE clause that includes the version number found in the initial read. If no record was updated, it usually means that the version numbers did not match because some other action or transaction had already updated the row and consequently its version number. Several object-relational mapping tools, including Hibernate, use this method to detect optimistic locking conflicts.
1.4. Methodology Status indicators on rows
This technique can either replace or complement timestamps and versioning. It can be configured as an alternative if, for example, a status column is set up on a table row to indicate that the row has changed. Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or a later date, still should not be updated on the target (for example, the data may require human validation).
1.5. Methodology Time/Version/Status on rows
This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system; however, the combination of time, version, and status provides a particularly powerful mechanism, and programmers should utilize them as a trio where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for production."
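The quoted capture logic translates directly into a query over the three columns; the products table and its values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, "
    "version TEXT, last_update TEXT, status TEXT)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    (1, "2.1", "2005-06-10 08:00:00", "READY"),
    (2, "2.1", "2005-06-12 14:00:00", "PENDING"),  # awaiting human validation
    (3, "2.0", "2005-06-20 10:00:00", "READY"),    # wrong version
    (4, "2.1", "2005-07-15 09:00:00", "READY"),    # outside the time window
])

# "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m.
# and 7/1/2005 12:00 a.m. where the status code indicates it is ready
# for production."
changed = conn.execute(
    "SELECT id FROM products WHERE version = '2.1' "
    "AND last_update >= '2005-06-01 00:00:00' "
    "AND last_update <  '2005-07-01 00:00:00' "
    "AND status = 'READY' ORDER BY id"
).fetchall()
```

Only row 1 satisfies all three conditions; each of the other rows is excluded by exactly one of them, which is what makes the trio non-redundant.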
1.6. Methodology Triggers on tables
May include a publish/subscribe pattern to communicate the changed data to multiple targets. In this approach, triggers log events that happen to the transactional table into another queue table that can later be "played back". For example, imagine an Accounts table: when transactions are taken against this table, triggers fire that store a history of the event, or even the deltas, in a separate queue table. The queue table might have a schema with the following fields: Id, TableName, RowId, TimeStamp, Operation. The data inserted for our Accounts example might be: 1, Accounts, 76, 11/02/2008 12:15am, Update. More complicated designs might log the actual data that changed. This queue table could then be "played back" to replicate the data from the source system to a target.
An example of this technique is the pattern known as the log trigger.
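A minimal sketch of the queue-table design above, using SQLite trigger syntax for illustration (real systems vary in trigger capabilities and syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);

-- Queue table with the Id, TableName, RowId, TimeStamp, Operation fields
-- described in the text.
CREATE TABLE change_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT, row_id INTEGER, ts TEXT, operation TEXT);

-- Trigger that logs each update against Accounts into the queue table.
CREATE TRIGGER accounts_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO change_queue (table_name, row_id, ts, operation)
    VALUES ('Accounts', NEW.id, datetime('now'), 'Update');
END;
""")

conn.execute("INSERT INTO accounts VALUES (76, 500.0)")
conn.execute("UPDATE accounts SET balance = 450.0 WHERE id = 76")

# The queue now holds one Update event against row 76, ready for replay.
event = conn.execute(
    "SELECT table_name, row_id, operation FROM change_queue"
).fetchone()
```

A replay process would read change_queue in id order and apply each operation to the target; logging the deltas themselves, as the text notes, makes that replay self-contained.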
1.7. Methodology Event programming
Coding the change capture into an application at appropriate points is another method that can provide intelligent discernment of changed data. Although this method involves programming, as opposed to more easily implemented "dumb" triggers, it may provide more accurate and desirable CDC, such as capturing changes only after a COMMIT, or only after certain columns change to certain values, which may be just what the target system is looking for.
1.8. Methodology Log scanners
Most database management systems maintain a transaction log that records changes made to the database contents and to metadata. By scanning and interpreting the contents of the database transaction log, one can capture the changes made to the database in a non-intrusive manner.
Using transaction logs for change data capture offers a challenge in that the structure, contents, and use of a transaction log are specific to a database management system. Unlike data access, no standard exists for transaction logs. Most database management systems do not document the internal format of their transaction logs, although some provide programmatic interfaces to them.
Other challenges in using transaction logs for change data capture include:
- Coordinating the reading of the transaction logs with the archiving of log files (database management software typically archives log files off-line on a regular basis).
- Eliminating uncommitted changes that the database wrote to the transaction log and later rolled back.
- Dealing with changes to the metadata of tables in the database.
- Dealing with changes to the format of the transaction logs between versions of the database management system.
- Translation between the physical storage formats that are recorded in the transaction logs and the logical formats typically expected by database users (e.g., some transaction logs save only minimal buffer differences that are not directly useful for change consumers).
CDC solutions based on transaction log files have distinct advantages that include:
- no need to change the database schema;
- low latency in acquiring changes;
- no need for programmatic changes to the applications that use the database;
- minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host);
- transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream includes changes made to all tables participating in the captured transaction.
2. Confounding factors
As often occurs in complex domains, the final solution to a CDC problem may have to balance many competing concerns.
2.1. Confounding factors Unsuitable source systems
Change data capture both increases in complexity and decreases in value if the source system saves metadata changes when the data itself is not modified. For example, some data models track the user who last viewed, but did not change, the data in the same structure as the data itself. This results in noise in the change data capture.
2.2. Confounding factors Push versus pull
- Push: the source process creates a snapshot of changes within its own process and delivers rows downstream. The downstream process uses the snapshot, creates its own subset, and delivers it to the next process.
- Pull: the target that is immediately downstream from the source prepares a request for data from the source. The downstream target then delivers the snapshot to the next target, as in the push model.
2.3. Confounding factors Alternatives
Sometimes the slowly changing dimension technique is used as an alternative method.