Types of Data Synchronization

Data Synchronization is categorized based on the timeline of the data that Hevo ingests.

Historical Data

Historical load is the one-time, initial load of the data that already existed in the Source before the Pipeline was created.

All Sources in Hevo, except the ones that depend purely on webhooks, support loading of historical data. How far back in time Hevo can replicate data may depend on the constraints imposed by the Source. Read Sources for more information.

Enabling historical load for a Pipeline

Historical data is ingested the first time you run the Pipeline. Some Sources provide the option to select or deselect the Load Historical Data Advanced setting during Source configuration. If this option is deselected, Events older than the Pipeline creation date are not loaded.

Note: Hevo does not allow you to choose the tables or collections for which the historical load is performed. To disable historical load for specific tables or collections, you can skip the individual objects on the Pipeline Overview page.
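
The effect of this setting can be pictured as a filter on Event timestamps relative to the Pipeline creation time. The snippet below is a minimal, illustrative sketch; the function and field names (events_to_ingest, created_at, load_historical_data) are assumptions and not part of Hevo's actual implementation.

```python
from datetime import datetime, timezone

def events_to_ingest(events, pipeline_created_at, load_historical_data):
    """Illustrative only: when the Load Historical Data setting is
    deselected, Events older than the Pipeline creation date are skipped."""
    if load_historical_data:
        return list(events)
    return [e for e in events if e["created_at"] >= pipeline_created_at]

# Example: a Pipeline created on 2024-01-01 with the setting deselected.
pipeline_created_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = [
    {"id": 1, "created_at": datetime(2023, 12, 15, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2024, 2, 10, tzinfo=timezone.utc)},
]
print(events_to_ingest(events, pipeline_created_at, load_historical_data=False))
# Only the Event from 2024-02-10 is ingested.
```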

Historical data ingestion methods

Hevo uses one of three methods for ingesting historical data, depending on the Source.

Ingesting historical data

Hevo ingests your historical data using the following steps:

  1. A Historical load task is created for each table or collection in the database, or object in the Source.

  2. Hevo starts ingesting historical data using either the Recent Data First or Earliest Data First method and performs a one-time ingestion of all the Events in the Source (see the sketch after this list). For a few Sources, such as LinkedIn Ads and Instagram Business, Hevo allows you to specify the historical sync duration while setting up the Source in Hevo. Refer to the respective Source document for more details.

  3. Once all the data is ingested, Hevo displays the status Historical Load Ingested, and the historical load is never run again unless you restart it.
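
As a rough mental model, the two ingestion orders differ only in how rows are sorted by timestamp before they are read in batches. The sketch below is an illustrative assumption; the modified_at column, the batching, and the in-memory rows are hypothetical and not Hevo internals.

```python
from datetime import datetime

def historical_batches(rows, order="recent_first", batch_size=2):
    """Illustrative only: yield historical rows in batches, either
    newest-first (Recent Data First) or oldest-first (Earliest Data First)."""
    newest_first = (order == "recent_first")
    ordered = sorted(rows, key=lambda r: r["modified_at"], reverse=newest_first)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

rows = [
    {"id": 1, "modified_at": datetime(2023, 1, 5)},
    {"id": 2, "modified_at": datetime(2023, 6, 20)},
    {"id": 3, "modified_at": datetime(2024, 3, 1)},
]
for batch in historical_batches(rows, order="recent_first"):
    print([r["id"] for r in batch])   # [3, 2] first, then [1]
```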

In a historical load, wherever primary keys are defined in the Source, Hevo uses these primary keys to replicate data to the Destination. In other cases, Hevo asks you to provide a set of unique columns for each table or collection during the Pipeline creation process.
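
Conceptually, the primary key (or the set of unique columns you provide) is what lets a re-ingested row overwrite its earlier copy in the Destination instead of creating a duplicate. The upsert below is a simplified sketch under that assumption; the in-memory destination dictionary and the key_columns parameter are illustrative, not Hevo's actual merge logic.

```python
def upsert(destination, rows, key_columns):
    """Illustrative only: merge rows into the destination keyed on the
    primary key columns (or the unique columns chosen at Pipeline creation)."""
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        destination[key] = row   # new key -> insert, existing key -> update
    return destination

dest = {}
upsert(dest, [{"order_id": 7, "status": "created"}], key_columns=["order_id"])
upsert(dest, [{"order_id": 7, "status": "shipped"}], key_columns=["order_id"])
print(dest)   # the re-ingested row replaces the earlier copy
```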

Prioritization of historical data loads

Log replication (BinLog, WAL, OpLog, Change Tracking) takes precedence over historical data loading. Once the log-based replication completes, historical data ingestion starts.

To avoid overwriting updated data with older data, historical load and log replication never run in parallel. While log replication is running, all historical loads are put in the QUEUED status, and vice versa.

In Hevo, every Pipeline job has a maximum run time of one hour. Therefore, all historical loads run for an hour and then wait for the log replication to run before resuming. The ingestion always resumes from the position where the last run stopped.
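
This interplay can be pictured as a simple alternation: log replication runs first, historical loads wait in the QUEUED status, and each load then runs within the one-hour cap, keeping track of its position so the next cycle resumes from it. The loop below is a conceptual sketch only; the one-hour cap and the QUEUED status come from this section, but the class, the function names, and the checkpoint structure are assumptions.

```python
import time

MAX_RUN_SECONDS = 60 * 60   # every Pipeline job runs for at most one hour

class HistoricalLoad:
    """Illustrative only: a per-table historical load with a resumable position."""
    def __init__(self, table, total_batches):
        self.table, self.total_batches = table, total_batches
        self.position = 0                 # where the last run stopped
        self.status = "QUEUED"

    @property
    def done(self):
        return self.position >= self.total_batches

    def ingest_next_batch(self):
        self.position += 1                # stand-in for reading one batch

def run_cycle(run_log_replication, historical_loads, budget=MAX_RUN_SECONDS):
    """Log replication runs first; historical loads stay QUEUED meanwhile,
    then each runs within the time budget and resumes from its position."""
    run_log_replication()                 # log replication takes precedence
    for job in historical_loads:
        started = time.monotonic()
        job.status = "RUNNING"
        while not job.done and time.monotonic() - started < budget:
            job.ingest_next_batch()
        job.status = "INGESTED" if job.done else "QUEUED"

loads = [HistoricalLoad("orders", total_batches=3)]
run_cycle(lambda: None, loads)            # a real cycle would tail BinLog/WAL here
print(loads[0].status, loads[0].position)
```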


Incremental Data

Incremental data is the changed data that is fetched continuously, for example, through log-based jobs for databases, daily synchronization jobs for SaaS Sources, or Webhook-based jobs.

Incremental load updates only the new or modified data in the Source. After the initial load, Hevo loads most of the objects using incremental updates. Hevo uses a variety of mechanisms to capture the changes in the Source data, depending on how the Source provides these changes. During incremental load, Hevo maintains an internal position, which lets Hevo track the exact point where the last successful load stopped.

Incremental load is efficient for the user as it updates only the changed data, instead of re-ingesting the entire data for the objects.
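
The key idea is the internal position, such as an offset, log sequence number, or timestamp, that marks exactly where the last successful load stopped, so the next run fetches only what changed after it. The sketch below is an illustrative assumption rather than Hevo's implementation; here the position is simply the latest modified_at timestamp seen so far.

```python
from datetime import datetime

def incremental_load(source_rows, position):
    """Illustrative only: fetch rows modified after the saved position and
    return the new position so the next run continues from there."""
    changed = [r for r in source_rows if r["modified_at"] > position]
    new_position = max((r["modified_at"] for r in changed), default=position)
    return changed, new_position

rows = [
    {"id": 1, "modified_at": datetime(2024, 5, 1)},
    {"id": 2, "modified_at": datetime(2024, 5, 3)},
]
position = datetime(2024, 5, 2)
changed, position = incremental_load(rows, position)
print([r["id"] for r in changed], position)   # only row 2; position advances to 2024-05-03
```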


Data Refresh

Data refresh refers to the process of re-ingesting data from the Source and loading it to the Destination again in order to keep the data updated and fresh. The data refresh period is usually defined at the Source, for example, the past 30 days in the case of PostgreSQL; however, for a few Sources, you can define this setting in Hevo. Hevo performs the data refresh task on every run of the Pipeline.

Data refresh is important for marketing-oriented Sources such as Marketo, Facebook, and LinkedIn. Such Sources use a conversion or attribution window to track a conversion (a purchase, sign-up, or any other user action) and attribute it to a click on an ad or a post.

A conversion or attribution window is the number of days within which a person who clicks on your ad subsequently takes an action on it, such as a purchase or sign-up.

For example, let us assume a prospect clicks a LinkedIn product ad on Day 1 and converts, that is, signs up for the product, on Day 10. The conversion is attributed to the click Event of Day 1; therefore, that record is updated with the attribution information and the Day 10 timestamp. Now, suppose the data refresh period is 2 days. The data refresh on Day 11 then picks up all the records with timestamps in the past two days. As a result, the modified record of Day 1, carrying the attribution information, also gets picked up and loaded to the Destination, thereby capturing the conversion and the attribution correctly.
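
In code terms, a data refresh simply re-reads every record whose timestamp falls inside the refresh window and loads it to the Destination again, which is how the Day 1 click record (updated on Day 10) gets picked up. The example below is a hedged sketch using hypothetical field names (updated_on, attributed); it is not an actual Source API.

```python
from datetime import date, timedelta

def records_to_refresh(records, today, refresh_days):
    """Illustrative only: select records updated within the refresh window."""
    cutoff = today - timedelta(days=refresh_days)
    return [r for r in records if r["updated_on"] >= cutoff]

records = [
    # Click on Day 1, attribution written on Day 10, so updated_on = Day 10.
    {"id": "click-1", "updated_on": date(2024, 4, 10), "attributed": True},
    {"id": "click-2", "updated_on": date(2024, 4, 2), "attributed": False},
]
today = date(2024, 4, 11)                       # Day 11 run of the Pipeline
print(records_to_refresh(records, today, refresh_days=2))
# Only click-1 is re-ingested, carrying its new attribution to the Destination.
```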

Note: After each data refresh, the number of ingested Events displayed in the Pipeline Activity section could be greater than the actual number of Events in your Source. This is because the ingestion count also includes any Events that are re-ingested due to changes in the Source data, even though the number of records or rows in the Source has not changed. However, for Google Sheets, the entire data set (changed and unchanged) gets re-ingested on each run of the Pipeline, as the Google Sheets API provides no way to identify just the changed data.



Last updated on Sep 02, 2024
