Synchronize databases more easily with open source tools

Li Zongwen

3 years ago

Change Data Capture (CDC) uses Server Agents to record insert, update, and delete activity applied to database tables. CDC provides details on changes in an easy-to-use relational format. For each modified row, it captures the column information and metadata needed to apply the changes to the target environment. A change table that mirrors the column structure of the tracked source table stores this information.
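To make the idea of a change table concrete, here is a minimal sketch using Python's standard sqlite3 module. It is an illustration only: the table names (customers, customers_changes) and the trigger-based capture are assumptions made for this example. Production CDC tools typically read the database's transaction log rather than relying on triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

-- Change table: same columns as the source, plus operation metadata.
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,  -- 'I' (insert), 'U' (update), or 'D' (delete)
    id INTEGER, name TEXT, email TEXT
);

-- Hypothetical capture mechanism: one trigger per operation type.
CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, id, name, email)
    VALUES ('I', NEW.id, NEW.name, NEW.email);
END;
CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, id, name, email)
    VALUES ('U', NEW.id, NEW.name, NEW.email);
END;
CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, id, name, email)
    VALUES ('D', OLD.id, OLD.name, OLD.email);
END;
""")

# Ordinary writes against the source table.
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")
conn.execute("UPDATE customers SET email = 'ada@example.org' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# The change table now holds one row per captured operation, in order.
changes = conn.execute(
    "SELECT op, id, name, email FROM customers_changes ORDER BY change_id"
).fetchall()
```

A downstream consumer can read customers_changes in change_id order and replay each operation against a target system.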

Capturing change data is no easy feat. However, the open source Apache SeaTunnel project is a data integration platform that provides CDC functionality, with a design philosophy and feature set that make these captures possible and that go above and beyond existing solutions.

CDC usage scenarios

Classic use cases for CDC are data synchronization and backups between heterogeneous databases. In one scenario, you might synchronize data between MySQL, PostgreSQL, MariaDB, and similar databases. In another, you might synchronize the data to a full-text search engine. With CDC, you can also create backups of data based on what CDC has captured.

When designed well, the data analysis system obtains data for processing by subscribing to changes in the target data tables. There's no need to embed the analysis process into the existing system.

Sharing data state between microservices

Microservices are popular, but sharing information between them is often complicated. CDC is a possible solution. Microservices can use CDC to obtain changes in other microservices' databases, acquire data status updates, and execute the corresponding logic.

Update cache

The concept of Command Query Responsibility Segregation (CQRS) is the separation of command activity from query activity. The two are fundamentally different:

A command writes data to a data source.
A query reads data from a data source.

The problem is, when does a read event happen in relation to when a write event happened, and what bears the burden of making those events occur?

It can be difficult to update a cache. You can use CDC to obtain data update events from a database and let those events control the refresh or invalidation of the cache.
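As a rough illustration of event-driven invalidation, the following Python sketch drops a cache entry whenever a change event arrives for its key. The ChangeEvent shape and the CdcBackedCache class are hypothetical names invented for this example, and a plain dictionary stands in for the database.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event as a CDC tool might deliver it."""
    table: str
    op: str               # "insert", "update", or "delete"
    key: Any
    row: Optional[dict]   # None for deletes


class CdcBackedCache:
    """Read-through cache whose entries are invalidated by change events."""

    def __init__(self, load_row: Callable[[str, Any], Optional[dict]]):
        self._load_row = load_row
        self._entries: Dict[Tuple[str, Any], Optional[dict]] = {}
        self.loads = 0  # how many times the database was actually read

    def get(self, table: str, key: Any) -> Optional[dict]:
        cache_key = (table, key)
        if cache_key not in self._entries:
            self.loads += 1
            self._entries[cache_key] = self._load_row(table, key)
        return self._entries[cache_key]

    def on_change(self, event: ChangeEvent) -> None:
        # Invalidate rather than refresh: the next read reloads the row.
        self._entries.pop((event.table, event.key), None)


# A dictionary stands in for the real database in this sketch.
database = {("users", 1): {"name": "Ada"}}
cache = CdcBackedCache(lambda table, key: database.get((table, key)))

first = cache.get("users", 1)                  # loaded from the database
database[("users", 1)] = {"name": "Grace"}     # the row changes upstream
stale = cache.get("users", 1)                  # still the cached value
cache.on_change(ChangeEvent("users", "update", 1, {"name": "Grace"}))
fresh = cache.get("users", 1)                  # reloaded after invalidation
```

Invalidating instead of writing the event's row into the cache keeps the sketch simple. Refreshing directly from the event payload is an equally reasonable design when the event carries the full row.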

A CQRS design usually uses two different storage instances to support business query and change operations. Because there are two stores, you need a way to keep their data consistent. Distributed transactions ensure strong consistency, at the cost of availability, performance, and scalability. You can also use CDC to ensure eventual consistency, which offers better performance and scalability at the cost of data latency. Currently, the industry can keep that latency in the millisecond range.

For example, you could use CDC to synchronize MySQL data to a full-text search engine such as Elasticsearch. In this architecture, Elasticsearch serves all queries, but when you want to modify data, you don't change Elasticsearch directly. Instead, you modify the upstream MySQL data, which generates a data update event. The Elasticsearch side consumes this event as it monitors the database, and the event prompts an update within Elasticsearch.
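The flow just described can be sketched in a few lines of Python. This is a toy model under stated assumptions: PrimaryStore stands in for MySQL with an append-only list playing the role of the change log, and SearchIndex stands in for the search engine. Neither class reflects a real MySQL or Elasticsearch API.

```python
class PrimaryStore:
    """Write side: every modification is appended to a change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # append-only stand-in for the database change log

    def upsert(self, doc_id, doc):
        op = "update" if doc_id in self.rows else "insert"
        self.rows[doc_id] = doc
        self.log.append({"op": op, "id": doc_id, "doc": doc})

    def delete(self, doc_id):
        if self.rows.pop(doc_id, None) is not None:
            self.log.append({"op": "delete", "id": doc_id, "doc": None})


class SearchIndex:
    """Read side: updated only by consuming change events."""

    def __init__(self):
        self.docs = {}
        self.consumed = 0  # number of log entries already applied

    def poll(self, store):
        for event in store.log[self.consumed:]:
            if event["op"] == "delete":
                self.docs.pop(event["id"], None)
            else:  # inserts and updates are both applied as upserts
                self.docs[event["id"]] = event["doc"]
        self.consumed = len(store.log)

    def search(self, word):
        return sorted(
            doc_id for doc_id, doc in self.docs.items()
            if word in doc["title"].lower()
        )


store = PrimaryStore()
index = SearchIndex()

store.upsert(1, {"title": "CDC basics"})
store.upsert(2, {"title": "Cache design"})
before_poll = index.search("cdc")   # index has not consumed events yet
index.poll(store)
after_poll = index.search("cdc")    # index now reflects the writes

store.delete(1)
index.poll(store)
after_delete = index.search("cdc")  # the delete has propagated
```

The gap between a write and the next poll is the data latency that eventual consistency trades for performance.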

In some CQRS systems, a similar method can be used to update the query view.

Pain points

CDC isn't a new concept, and various existing projects implement it. For many users, though, the existing solutions have some disadvantages.

Single table configuration

With some CDC software, you must configure each table separately. For example, to synchronize ten tables, you must write ten source SQL statements and ten sink SQL statements. To perform a transform, you also need to write the transform SQL.

A table's configuration can be written by hand, but only when the number of tables is small. When the number is large, type mapping or parameter configuration errors may occur, resulting in high operation and maintenance costs.
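One way to see why hand-written configuration scales poorly is to sketch the alternative: deriving each sink definition from a catalog of source schemas. The Python below is a hedged illustration, not how any particular CDC tool works. The catalog contents, the build_sink_ddl helper, and the small MySQL-to-PostgreSQL type table are all assumptions made for this example, and the mapping is deliberately incomplete.

```python
# Illustrative, non-exhaustive type mapping (an assumption for this sketch).
MYSQL_TO_POSTGRES = {
    "INT": "integer",
    "BIGINT": "bigint",
    "VARCHAR": "varchar",
    "DATETIME": "timestamp",
    "DOUBLE": "double precision",
}


def build_sink_ddl(table, columns):
    """Derive a sink CREATE TABLE statement from source column types."""
    parts = []
    for name, source_type in columns:
        if source_type not in MYSQL_TO_POSTGRES:
            # Fail loudly instead of guessing a type mapping.
            raise ValueError(
                f"no type mapping for {table}.{name} ({source_type})"
            )
        parts.append(f"{name} {MYSQL_TO_POSTGRES[source_type]}")
    return f"CREATE TABLE {table} ({', '.join(parts)})"


# Hypothetical source catalog: table name -> list of (column, type).
catalog = {
    "orders": [("id", "BIGINT"), ("total", "DOUBLE")],
    "users": [("id", "INT"), ("name", "VARCHAR")],
}

# One loop replaces one hand-written statement per table.
sink_ddl = {table: build_sink_ddl(table, cols) for table, cols in catalog.items()}
```

With ten or a hundred tables, the loop stays the same size, and an unmapped type is reported once at generation time rather than discovered as a runtime failure.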

Apache SeaTunnel is an easy-to-use data integration platform that aims to solve this problem.

Schema evolution is not supported

Some CDC solutions can emit DDL events but cannot forward them to the Sink so that it can apply the same changes. Even a CDC tool that can capture a DDL event may not be able to send it to the engine, because the engine cannot change the Type information of the transform based on that event (so the Sink cannot follow the DDL event and change accordingly).
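To show what following a DDL event means in practice, here is a small Python sketch in which row events and DDL events travel through the same stream and the sink adjusts its column list when a column is added. The event dictionaries and the run_pipeline function are hypothetical, and only the add_column action is handled.

```python
def run_pipeline(events):
    """Apply row and DDL events to a sink that keeps its schema in sync."""
    sink_columns = ["id", "name"]  # initial sink schema (assumed)
    sink_rows = []
    for event in events:
        if event["kind"] == "ddl" and event["action"] == "add_column":
            # Schema evolution: extend the sink schema and backfill old rows.
            sink_columns.append(event["column"])
            for row in sink_rows:
                row.setdefault(event["column"], None)
        elif event["kind"] == "row":
            # Project each incoming row onto the current sink schema.
            sink_rows.append(
                {col: event["data"].get(col) for col in sink_columns}
            )
        else:
            raise ValueError(f"unsupported event: {event}")
    return sink_columns, sink_rows


events = [
    {"kind": "row", "data": {"id": 1, "name": "Ada"}},
    {"kind": "ddl", "action": "add_column", "column": "email"},
    {"kind": "row", "data": {"id": 2, "name": "Grace", "email": "g@example.com"}},
]

columns, rows = run_pipeline(events)
```

If the DDL event were dropped before reaching the sink, the second row's email value would be silently discarded by the projection step, which is the failure mode described above.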

Too many links

On some CDC platforms, each synchronized table requires its own link (connection) to the source. When there are many tables, many links are required. This puts pressure on the source JDBC database and causes too many Binlog reads, which may result in the same log being parsed repeatedly.
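The cost of repeated log parsing can be illustrated with a toy comparison in Python. Both functions below are hypothetical and operate on an in-memory list that stands in for the Binlog. The first models one reader per table and the second models a single shared reader that routes events by table name.

```python
from collections import defaultdict


def per_table_readers(log, tables):
    """One link per table: every reader parses the whole log."""
    parsed = 0
    routed = defaultdict(list)
    for table in tables:
        for event in log:
            parsed += 1
            if event["table"] == table:
                routed[table].append(event)
    return parsed, dict(routed)


def shared_reader(log, tables):
    """One shared link: the log is parsed once and events are fanned out."""
    wanted = set(tables)
    parsed = 0
    routed = defaultdict(list)
    for event in log:
        parsed += 1
        if event["table"] in wanted:
            routed[event["table"]].append(event)
    return parsed, dict(routed)


log = [
    {"table": "orders", "id": 1},
    {"table": "users", "id": 7},
    {"table": "orders", "id": 2},
]
tables = ["orders", "users", "items"]

parsed_separately, routed_separately = per_table_readers(log, tables)
parsed_once, routed_once = shared_reader(log, tables)
```

Both approaches deliver the same events to each table, but the per-table version parses the log once per table, so its work grows with the number of tables multiplied by the log length.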

SeaTunnel CDC architecture goals

Apache SeaTunnel is an open source, high-performance, distributed framework for massive data integration. To tackle the problems that the CDC functions of existing data integration tools cannot solve, the community “reinvents the wheel” to develop a CDC platform with unique features. This architectural design is based on the strengths and weaknesses of existing CDC tools.

Apache SeaTunnel supports:

Lock-free parallel snapshots of historical data.
Log heartbeat detection and dynamic table addition.
Sub-database, sub-table, and multi-structure table reading.
Schema evolution.
All the basic CDC functions.
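As an illustration of the first item in the list above, one common way to parallelize a snapshot is to split a table into primary-key ranges and read the ranges concurrently. The Python sketch below shows only that splitting and parallel reading. The function names are hypothetical, a dictionary stands in for the table, and the sketch does not show how concurrent writes are reconciled without locks, nor does it claim to mirror SeaTunnel's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(min_key, max_key, chunk_size):
    """Return inclusive (start, end) key ranges covering [min_key, max_key]."""
    chunks = []
    start = min_key
    while start <= max_key:
        end = min(start + chunk_size - 1, max_key)
        chunks.append((start, end))
        start = end + 1
    return chunks


def read_chunk(table, key_range):
    """Read every row whose primary key falls inside the range."""
    start, end = key_range
    return [row for key, row in table.items() if start <= key <= end]


# A dictionary keyed by primary key stands in for the source table.
table = {key: {"id": key} for key in range(1, 11)}

chunks = split_into_chunks(1, 10, chunk_size=4)

# Each chunk is read by a separate worker; map() preserves chunk order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(lambda key_range: read_chunk(table, key_range), chunks))

snapshot = [row for part in parts for row in part]
```

Because the ranges do not overlap and together cover the whole key space, every row is read exactly once regardless of how many workers run.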

Apache SeaTunnel reduces operations and maintenance costs for users and can dynamically add tables.

For example, when you want to synchronize an entire database and add a new table later, you don't need to maintain it manually, change the job configuration, or stop and restart jobs.

Additionally, Apache SeaTunnel supports reading sub-databases, sub-tables, and multi-structure tables in parallel. It also allows schema evolution, DDL transmission, and engine changes that support schema evolution, so that schema changes can be passed on to Transform and Sink.

SeaTunnel CDC current status

Currently, CDC has the basic capabilities to support the incremental and snapshot phases. It also supports MySQL for real-time and offline use. The MySQL real-time test is complete, and the offline test is coming. Schema evolution is not supported yet because it involves changes to Transform and Sink. The dynamic discovery of new tables is not yet supported, and some interfaces have been reserved for multi-structure tables.
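The relationship between the snapshot and incremental phases can be sketched as follows. This Python example assumes a simplified model in which the snapshot is taken at a known log offset and every log event carries an offset. The sync function and the event fields are hypothetical and are not taken from SeaTunnel.

```python
def sync(source_rows, log, snapshot_offset):
    """Copy a snapshot, then replay log events recorded after it."""
    # Snapshot phase: copy the rows as they were at snapshot_offset.
    target = {key: dict(row) for key, row in source_rows.items()}

    # Incremental phase: apply only the events newer than the snapshot.
    for event in log:
        if event["offset"] <= snapshot_offset:
            continue  # already reflected in the snapshot
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:  # inserts and updates are both applied as upserts
            target[event["key"]] = event["row"]
    return target


# State of the source table when the snapshot was taken (offset 2).
snapshot_rows = {1: {"name": "Ada"}, 2: {"name": "Grace"}}

log = [
    {"offset": 1, "op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"offset": 2, "op": "insert", "key": 2, "row": {"name": "Grace"}},
    {"offset": 3, "op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"offset": 4, "op": "delete", "key": 2, "row": None},
    {"offset": 5, "op": "insert", "key": 3, "row": {"name": "Edsger"}},
]

target = sync(snapshot_rows, log, snapshot_offset=2)
```

Recording the offset at snapshot time is what lets the incremental phase pick up exactly where the snapshot left off, without missing or double-applying changes.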

Project outlook

As an Apache incubation project, the Apache SeaTunnel community is developing rapidly. The next community planning session has these main directions:

1. Expand and improve the connector and catalog ecosystem

We’re working to enhance many connector and catalog features, including:

Supporting more connectors, including TiDB, Doris, and Stripe.
Improving existing connectors in terms of usability and performance.
Supporting CDC connectors for real-time, incremental synchronization scenarios.

Anyone interested in connectors can review Umbrella.

2. Support for more data integration scenarios (SeaTunnel Engine)

There are pain points that existing engines cannot solve, such as the synchronization of an entire database, the synchronization of table structure changes, and the coarse granularity of task failure.

We’re working to solve these issues. Anyone interested in the CDC engine should look at issue 2272.

3. Easier to use (SeaTunnel Web)

We’re working to provide a web interface that makes operations easier and more intuitive. Through the web interface, we will make it possible to display Catalog, Connector, Job, and related information in the form of DAG/SQL. We’re also giving users access to the scheduling platform so they can easily handle task management.

Visit the web sub-project for more information on the web UI.

Wrap up

Database activity often must be carefully tracked to manage changes based on activities such as record updates, deletions, or insertions. Change Data Capture provides this capability. Apache SeaTunnel is an open source solution that addresses these needs and continues to evolve to offer more features. The project and community are active, and your participation is welcome.