Data Replication
Every organization uses various applications for its business. These applications generate a large amount of transactional data, which the organization stores in different formats in different locations.
For example, a company may be marketing its products using Facebook Ads and mailers and resolving its customers’ issues via Intercom. Data from both these Sources is a critical aid in making insightful, data-driven business decisions and strategies. Therefore, it needs to be collated and made analysis-ready. Before this can happen, the data may need to be cleaned or organized in some way. This entire process of fetching the Source data, collating or preparing it, and then loading it to a Destination is called Data Replication and can be done through data Pipelines. It involves three key stages: Data Ingestion, Data Transformation, and Data Loading.
Data Ingestion
The process of fetching existing and new data from a Source application or database is called Data Ingestion. The data is ingested using either of these two techniques:
- Poll-based ingestion

  This type of ingestion applies to SaaS and database Sources. In poll-based ingestion, Hevo ingests the data according to the Pipeline schedule you specify. Read Scheduling a Pipeline. Hevo ingests the data via the Source APIs for SaaS Sources and via SQL queries for database Sources.
- Push-based ingestion

  This type of ingestion applies to webhook-based Sources. In push-based ingestion, the Source sends (pushes) the data to Hevo as soon as it is created or modified. Because the ingestion is triggered from the Source, there is no schedule for you to define.
Read Data Ingestion to know more about the settings you can define to control the amount and type of data that is ingested.
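The sketch below illustrates the difference between the two ingestion styles in plain Python. It is not Hevo's code; the sample records, field names, and cursor logic are assumptions made purely for illustration.

```python
# Illustrative sketch of the two ingestion styles (not Hevo's internal code).
import json
from datetime import datetime, timezone

# --- Poll-based ingestion -------------------------------------------------
# On each scheduled run, the pipeline keeps a cursor (e.g. the latest
# updated_at value seen so far) and asks the Source only for newer records.

def fetch_changed_records(updated_since: str) -> list[dict]:
    """Stand-in for a Source API call or a SQL query such as:
    SELECT * FROM orders WHERE updated_at > :updated_since
    """
    sample = [
        {"id": 1, "status": "shipped", "updated_at": "2024-05-01T10:00:00Z"},
        {"id": 2, "status": "pending", "updated_at": "2024-05-01T11:30:00Z"},
    ]
    return [r for r in sample if r["updated_at"] > updated_since]

def poll_once(cursor: str) -> tuple[list[dict], str]:
    records = fetch_changed_records(cursor)
    if records:
        cursor = max(r["updated_at"] for r in records)  # advance the cursor
    return records, cursor

# --- Push-based ingestion -------------------------------------------------
# The Source posts Events to a webhook URL; ingestion happens whenever a
# payload arrives, so there is no schedule to define.

def handle_webhook(body: bytes) -> dict:
    event = json.loads(body)
    event["received_at"] = datetime.now(timezone.utc).isoformat()
    return event

if __name__ == "__main__":
    events, new_cursor = poll_once("2024-05-01T00:00:00Z")
    print(f"polled {len(events)} events, cursor advanced to {new_cursor}")
    print(handle_webhook(b'{"id": 3, "status": "created"}'))
```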
Data Transformation
Different Sources maintain data in different formats and structures. These may often not be the same as how the tables are defined in your Destination database. Or, the JSON Events ingested from the Source may need to be parsed before they can be successfully loaded to the Destination. The process of converting or modifying the ingested data to cleanse and prepare it for loading into the Destination is called Data Transformation. Read Transformations.
You can transform data by writing Python Code-based Transformations or using Drag and Drop Transformations.
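As an illustration of the kind of cleansing a code-based Transformation might perform, the sketch below is written as plain Python operating on a dictionary of Event properties. The field names are hypothetical, and this is not Hevo's actual Transformation interface; refer to Transformations for that.

```python
# A minimal, illustrative cleansing step on an Event's properties.

def transform_properties(properties: dict) -> dict:
    props = dict(properties)

    # Normalize an email field so downstream joins are case-insensitive.
    if props.get("email"):
        props["email"] = props["email"].strip().lower()

    # Flatten a nested address object into top-level columns.
    address = props.pop("address", None) or {}
    for key, value in address.items():
        props[f"address_{key}"] = value

    # Drop a field that should not reach the Destination.
    props.pop("internal_notes", None)
    return props

if __name__ == "__main__":
    event = {
        "email": "  Jane.Doe@Example.COM ",
        "address": {"city": "Pune", "zip": "411001"},
        "internal_notes": "do not load",
    }
    print(transform_properties(event))
```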
Data may also get transformed based on configurations you specify for loading the data, such as:
- The parsing strategy you define for JSON data may result in new, child Events being created.
- The Name Sanitization feature, which sanitizes table and column names to adhere to warehouse-acceptable formats (sketched after this list).
- The way the ingestion process handles duplicate columns in your Source files.
- Datatype promotion, which is performed on the Destination table schema to accommodate a wider range of Source data types.
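For example, name sanitization might behave roughly like the sketch below. The rules shown (lowercasing, replacing unsupported characters with underscores) are assumptions made for illustration; the exact warehouse-acceptable format depends on the Destination.

```python
# Illustrative sketch of table/column name sanitization under assumed rules.
import re

def sanitize_name(name: str) -> str:
    sanitized = name.strip().lower()
    sanitized = re.sub(r"[^a-z0-9]+", "_", sanitized)  # replace disallowed characters
    sanitized = sanitized.strip("_")
    if sanitized and sanitized[0].isdigit():
        sanitized = f"_{sanitized}"                     # names must not start with a digit
    return sanitized or "_"

if __name__ == "__main__":
    for raw in ["Order ID", "Customer-Email", "2021 Revenue ($)"]:
        print(f"{raw!r} -> {sanitize_name(raw)!r}")
```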
Data Loading
Once the data has been ingested and transformed, it can be loaded to the Destination. Data loading involves the following considerations:
- Identifying where to load: You can configure the Destination while creating your Pipeline or directly create one. The Destination configuration you specify can be modified post-Pipeline creation. Read Modifying a Pipeline.
- Defining the load schedule: You can define the desired schedule for loading the data to the Destination. This schedule applies to all your Pipelines that use this Destination. Read Scheduling Data Load for a Destination for the steps to do this.
- Mapping the data objects and fields: By default, Hevo automatically maps the objects and fields to the Destination using the Auto Mapping feature. However, you can also disable this feature and handle the mapping manually. Read Auto Mapping Event Types.
- Deduplicating the data: Hevo loads your data using the primary keys defined in the Destination table to prevent duplication of records. If primary keys are absent, the data is loaded using the append rows feature. Use the Append only feature when you want to track every update at a granular level. Where primary keys are present but not enforced by the Destination data warehouse, Hevo uses a query and metadata-based approach to load only unique data. Read Data Loading. A conceptual sketch of these two loading modes follows this list.
- Optimizing the cost, time, and Events quota consumption: Many factors across the data replication process determine the overall cost of loading the data to the Destination. For example, the frequency at which data is ingested can impact the cost, as some Events may be re-ingested needlessly. Read Events Usage.
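The sketch below contrasts the two loading modes conceptually: primary-key-based loading overwrites an existing record with the same key, while append-only loading keeps every update as a new row. This is not Hevo's implementation; the in-memory table structures and field names are assumptions made for illustration.

```python
# Conceptual contrast: upsert on a primary key vs. append-only loading.

def load_with_primary_key(table: dict, events: list[dict], key: str) -> dict:
    """Upsert: a later Event with the same key replaces the existing row."""
    for event in events:
        table[event[key]] = event
    return table

def load_append_only(table: list, events: list[dict]) -> list:
    """Append: every Event becomes a new row, preserving the full history."""
    table.extend(events)
    return table

if __name__ == "__main__":
    events = [
        {"id": 1, "status": "created"},
        {"id": 1, "status": "shipped"},   # update to the same record
        {"id": 2, "status": "created"},
    ]
    print(load_with_primary_key({}, events, key="id"))  # 2 unique rows, no duplicates
    print(load_append_only([], events))                 # 3 rows, every update retained
```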