Ingestion Modes
Log-based
In log-based incremental replication, Hevo uses database log files to identify any record alterations such as inserts, updates, and deletes.
Depending on the Source type, Hevo allows you to select one of the following Log modes:
| Log Mode | Source Type |
|---|---|
| BinLog | MySQL Source types |
| Change Tracking | SQL Server Source types |
| Logical Replication | PostgreSQL Source types |
| OpLog | MongoDB Source types |
| Redo Log | Oracle Source types |
BinLog (Binary Logging)
The BinLog mode is applicable for MySQL Source types. In this mode, data is read using MySQL's BinLog. This mode is useful when you want to replicate the complete database, as is, to the Destination. It replicates data very efficiently but gives you less control and manageability over data ingestion.
After a historical load of the initial state of the MySQL database, the BinLog is streamed continuously. The first time the data is ingested, the table definitions are read directly from the schema and not the BinLog. After that, all schema updates are read from the BinLog for near real-time replication at scale. This approach supports both deletions and table alterations, resulting in exact one-to-one replication. It does not require locks or affect the performance of the database, and it is consistent with the stream processing paradigm, allowing near real-time performance.
Note: The binary log files in your MySQL server store information about operations performed across all the databases. When you select specific databases while configuring your Pipeline, Hevo skips the logs for the other databases present on the server. This may sometimes make it seem that the Pipeline is not ingesting data, especially if the skipped databases have a large number of log entries or there are no updates for the selected tables or databases.
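If a Pipeline appears idle, you can confirm on the server that binary logging is configured as expected. A minimal sanity check from a MySQL client, assuming a session with sufficient privileges:

```sql
-- Sanity checks for log-based replication (illustrative only):
SHOW VARIABLES LIKE 'log_bin';        -- ON when binary logging is enabled
SHOW VARIABLES LIKE 'binlog_format';  -- ROW is the format typically required by log-based replication tools
SHOW BINARY LOGS;                     -- lists the binary log files currently on the server
```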
Change Tracking
The Change Tracking mode is applicable for SQL Server Source types. It makes use of the Change Tracking feature provided by SQL Server to track the changes made to the Source objects. It captures the records (rows) that were inserted, updated, or deleted in these objects.
Before selecting this mode, you must use an SQL client tool to enable Change Tracking at the database or object level. Once this is done, Hevo automatically sets Change Tracking as the query mode for the objects that you want to track. The Change Tracking period should be more than the replication frequency of your Hevo Pipeline to avoid any data loss.
Note: Automatic selection of Change Tracking as the query mode occurs only for new Pipelines.
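For reference, enabling Change Tracking from an SQL client typically looks like the sketch below; demo_db and dbo.orders are placeholder names, and the retention period should comfortably exceed your Pipeline's replication frequency:

```sql
-- Illustrative sketch; replace demo_db and dbo.orders with your own names.
-- The table must have a primary key for Change Tracking to be enabled.
ALTER DATABASE demo_db
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.orders
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = OFF);
```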
Logical Replication
Logical Replication is applicable for PostgreSQL Source types. In this mode, data is replicated using PostgreSQL Write Ahead Logs (WAL) set at the logical level (available in PostgreSQL version 9.4 and above). This mode is useful when you want to replicate the complete database as is.
While incremental ingestion is done from the logs, query modes must be defined to fetch the historical data from the Source tables. The query modes available for Logical Replication mode are Full Load, Delta-Timestamp, and Unique Incrementing Append Only.
Note: Hevo creates a new Replication Slot for the Pipeline, which may lead to higher disk consumption in your PostgreSQL Database.
Read Set up Log-based Incremental Replication for steps to set up WAL for logical replication.
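As a quick check of the prerequisites, assuming you can query the database with a suitably privileged role:

```sql
-- wal_level must be 'logical' for logical replication.
SHOW wal_level;

-- Lists replication slots, including the one Hevo creates for the Pipeline.
-- Inactive slots retain WAL files and can drive up disk consumption.
SELECT slot_name, plugin, active
FROM pg_replication_slots;
```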
OpLog
OpLog is applicable to the MongoDB Source. In this mode, data is polled using MongoDB's OpLog, a collection of individual, transaction-level details that help replicas sync data from the primary instance. Read more about MongoDB's OpLog and OpLog Alerts in Hevo.
Redo Log
This mode is applicable for Oracle Source types. It uses Oracle LogMiner to incrementally ingest data from the Oracle Redo Logs. This is the recommended mode for replicating data from an Oracle database Source. Read Using LogMiner. For Pipelines created after Release 1.96, Hevo supports the Redo Log ingestion mode for Oracle Database 19c and higher. Refer to the documentation for the Oracle Source variants for instructions to set up Redo Log for an Oracle database. Also read Pipeline failure due to Redo Log expiry for troubleshooting steps in case the Redo Log expires.
Note: This mode does not support user-defined data types for fields. Read Generic Oracle.
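As an illustrative pre-check, assuming a session with access to the v$ views; continuous Redo Log mining generally requires the database to run in ARCHIVELOG mode with supplemental logging enabled:

```sql
-- ARCHIVELOG indicates archived Redo Logs are retained for mining.
SELECT log_mode FROM v$database;

-- YES indicates minimal supplemental logging is enabled.
SELECT supplemental_log_data_min FROM v$database;
```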
XMIN
The XMIN ingestion mode is currently available for Early Access. To enable it, contact your Hevo account executive or the Support team. Alternatively, request early access to try it out.
The XMIN mode is applicable for PostgreSQL Source types. It uses PostgreSQL's system-generated incremental column, XMIN, to ingest data. Only new and updated records for the selected objects are ingested using the XMIN column.
The XMIN column stores the transaction ID of the INSERT transaction for each row in a table. This value is incremented whenever a row is created or updated. The XMIN ingestion mode reads this changed value to identify the inserts and updates made to the table and syncs the data with the Destination.
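You can inspect this system column yourself; a minimal sketch, with my_table as a placeholder name:

```sql
-- xmin is a hidden system column present on every PostgreSQL table row.
SELECT xmin, * FROM my_table LIMIT 5;
```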
Hevo uses the PostgreSQL function pg_current_snapshot() to identify the current XMIN value. This function returns a list of transaction IDs that are active in the current snapshot. With each run of the Pipeline, Hevo obtains the ID of the earliest transaction that is still active when the snapshot is taken. A transaction is considered active from the time it starts until the changes it makes are committed or discarded. Hevo marks the XMIN value, or ID, of the earliest active transaction as the current XMIN value. This is because the snapshot may also capture the IDs of uncommitted long-running transactions. Once these transactions are committed, their XMIN values lie between the earliest and latest transaction IDs in the snapshot. Hence, to avoid any data loss, Hevo considers the earliest transaction ID as the maximum limit while identifying the data to be ingested.
Hevo then ingests the range of records whose XMIN values are greater than the last_polled_id and less than the current XMIN value. Here, last_polled_id is equal to the maximum XMIN value available from the last run of the Pipeline.
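Conceptually, each poll resembles the sketch below rather than Hevo's exact query; pg_current_snapshot() and pg_snapshot_xmin() require PostgreSQL 13 or later, and my_table along with the literal IDs are placeholders:

```sql
-- 1. The earliest transaction ID still active in the current snapshot
--    becomes the current XMIN value:
SELECT pg_snapshot_xmin(pg_current_snapshot());

-- 2. Ingest records changed since the last poll. The xid values are
--    compared via a text cast because xid does not cast directly to bigint.
SELECT *
FROM my_table
WHERE xmin::text::bigint > 123456   -- last_polled_id from the previous run
  AND xmin::text::bigint < 234567;  -- current XMIN value from step 1
```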
When to use the XMIN ingestion mode?
XMIN-based replication is recommended when:
- Logical replication is not possible due to:
  - Unavailability of Write Ahead Logs (WAL): This can occur when there is a failure in the WAL archiving process, and the necessary logs are not preserved.
  - Inability to connect to the primary database instance: This is a prerequisite for replicating using WAL, because PostgreSQL does not allow logical replication from read replicas.
- There is no mechanism in place to identify the updates per row in a table, such as a timestamp column that captures the last updated time.
- Deleted rows in the Source need not be captured.
- The volume of incremental data is low. Refer to the Source Considerations section of your PostgreSQL variant for more information on the limitations of the XMIN ingestion mode while handling large volumes of data.
Limitations of the XMIN ingestion mode
- When a row is deleted in the Source table, its XMIN value is deleted as well. As a result, the XMIN ingestion mode cannot track deleted data.
- The XMIN column is not indexed. Therefore, the XMIN ingestion mode scans the whole table to identify updated rows, which may lead to slower data ingestion and increased processing overheads on your PostgreSQL database host.
- The XMIN ingestion mode cannot ingest large volumes of incremental data, as Hevo has an ingestion limit of around 2.5 million records with each poll. As a result, if more than 2.5 million records have the same XMIN value, the Pipeline may get stuck. In such a case, you need to contact Hevo Support to restart the Pipeline.
- If there are too many long-running transactions in the database, your Pipeline gets stuck at the earliest uncommitted transaction. Hevo can proceed with ingestion only once that transaction is committed.
Due to these limitations, Hevo recommends that you create the Pipeline in the Logical Replication mode, if possible.
Table
In the Table mode, your tables are read individually at a fixed frequency. Use this mode to fetch data from multiple tables in your database, while maintaining control over the ingestion for every table individually. You can fetch data using different query modes.
In Table mode:
- Hevo does not fetch Views automatically from your database. As a workaround, you can create individual Pipelines in the Custom SQL mode to fetch each View. However, some limitations may arise based on the type of data synchronization, the query mode, or the number of Events. Contact Hevo Support for more details.
- Hevo does not update the Destination tables for any deletes that may have occurred in the Source data. In log-based replication, deletes can be identified by the field hevo_marked_deleted being True for an Event. However, in the Table mode, data is fetched using SQL queries, which do not offer any mechanism to determine the deletes.
- For SQL Server Source types, Hevo automatically sets Change Tracking as the query mode if you have enabled Change Tracking on the Source objects.
Custom SQL
The Custom SQL mode allows you to fetch data in a structure different from how it is stored in your Source tables. In this mode, Hevo runs a custom query at a fixed frequency. Based on the query mode and the parameter/column name specified in the query mode configuration, Hevo fetches data from the Source tables or views.
Using Custom SQL as the ingestion mode, you can fetch data from one or more tables or views using the following query:
```sql
SELECT * FROM <table_1>, <table_2>, ..., <table_n>
```
Note:
- Replace the placeholder values in the command above with your own. For example, <table_1> with sales_data.
- Hevo allows you to execute one custom SQL query per Pipeline. If you want to run multiple queries, you must create a separate Pipeline for each query.
Let us say you want to fetch data from the view or table named some_table. You can then write the following query:

```sql
SELECT * FROM some_table
```
Hevo runs the following query to fetch the data periodically:
```sql
SELECT *
FROM some_table
WHERE updated_timestamp_column > last_polled_time
  AND updated_timestamp_column < Now() - delay
ORDER BY updated_timestamp_column ASC
LIMIT 500000
```
Note: Aliased columns cannot be used directly in the job configuration fields. Your query must be written as a table expression before the aliased column can be used. Moreover, if your query results in several columns with the same name, they must be aliased uniquely to disambiguate.
Suppose you have two tables with these columns:
```
user (id, name, updated_ts)
employee (user_id, dept_name, updated_ts)
```
And you want to fetch data using the following query, with the query mode as Delta - Timestamp and the timestamp column name as updated_ts (from the table employee):
```sql
SELECT u.id,
       u.name,
       u.updated_ts,
       e.user_id,
       e.dept_name,
       e.updated_ts
FROM user u
     INNER JOIN employee e
             ON u.id = e.user_id
```
Then, you must specify the query as:
```sql
SELECT *
FROM (SELECT u.id,
             u.name,
             u.updated_ts AS user_updated_ts,
             e.user_id,
             e.dept_name,
             e.updated_ts AS employee_updated_ts
      FROM user u
           INNER JOIN employee e
                   ON u.id = e.user_id) TABLE_ALIAS
```
with timestamp column name being employee_updated_ts.
The corresponding Hevo query would be:
```sql
SELECT *
FROM (SELECT u.id,
             u.name,
             u.updated_ts AS user_updated_ts,
             e.user_id,
             e.dept_name,
             e.updated_ts AS employee_updated_ts
      FROM user u
           INNER JOIN employee e
                   ON u.id = e.user_id) TABLE_ALIAS
WHERE employee_updated_ts > last_polled_time
  AND employee_updated_ts < Now() - delay
ORDER BY employee_updated_ts ASC
LIMIT 5000000
```
Pipelines created in the Custom SQL mode do not have any primary keys defined by default, even if the selected Source columns have them. You need to manually define the primary keys to avoid duplicates, even if Auto Mapping is enabled.
You can either do this by setting the primary keys as part of creating transformations or by creating them in the Destination table manually. Read Handling of Updates.
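For instance, defining the key manually in the Destination might look like the sketch below; the table and column names are placeholders, and the exact syntax and enforcement vary by Destination warehouse:

```sql
-- Illustrative only; some warehouses treat primary keys as informational.
ALTER TABLE my_destination_table ADD PRIMARY KEY (id);
```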