Generic PostgreSQL

Generic PostgreSQL Setup Guide

PostgreSQL is a highly stable object-relational database management system (ORDBMS) used as the primary data store or data warehouse for many web, mobile, geospatial, and analytics applications. It provides various replication methods, which makes it suitable for use in high-availability and data distribution scenarios.

You can ingest data from your PostgreSQL database using Hevo Pipelines and replicate it to a Destination of your choice.

Getting Started

Setup

Prerequisites

The IP address or hostname and port number of your PostgreSQL server are available.
The PostgreSQL server version is 10.0 or higher.
Hevo’s IP address(es) for your region is added to your PostgreSQL database IP Allowlist.
Log-based incremental replication is enabled on the primary instance of your PostgreSQL database server.

Note: PostgreSQL does not support logical replication on read replicas.
A non-administrative database user for Hevo is created in your PostgreSQL database instance. You must have the Superuser or CREATE ROLE privileges to add a new user.
The SELECT, USAGE, and CONNECT privileges are granted to the database user.

Create a Database User and Grant Privileges

1. Create a database user (Optional)

Perform the following steps to create a user in your PostgreSQL database:

Connect to your PostgreSQL database as an admin user with an SQL client tool, such as psql.
Run the following command to create a user in your database:
```
CREATE USER <database_username> WITH LOGIN PASSWORD '<password>';
```
Note: Replace the placeholder values in the command above with your own. For example, <database_username> with hevouser.

2. Grant privileges to the user

The following table lists the privileges that the database user for Hevo requires to Log in and ingest data from your PostgreSQL database:

Privilege Name	Allows Hevo to
CONNECT	Connect to the specified database.
USAGE	Access the objects in the specified schema.
SELECT	Select rows from the database tables.
ALTER DEFAULT PRIVILEGES	Access new tables created in the specified schema after Hevo has connected to the PostgreSQL database.
REPLICATION	Access the WALs.

Grant privileges to the database user connecting to the PostgreSQL database as follows:

Connect to your PostgreSQL database instance.
(Optional) View existing users and their privileges in your PostgreSQL instance:
```
psql -U postgres

\du
```
If the privileges are not assigned, perform the next step. Else, proceed to Step 2.

Grant the required privileges to the user:

GRANT CONNECT ON DATABASE <database_name> TO <database_username>;
GRANT USAGE ON SCHEMA <schema_name> TO <database_username>;
GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO <database_username>;

(Optional) Alter the schema to grant SELECT privileges on tables created in the future to your database user:

Note: Grant this privilege only if you want Hevo to replicate data from tables created in the schema after the Pipeline is created.
```
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <database_username>;
```
Run the following command to grant your database user permission to read from the WALs:
```
ALTER ROLE <database_username> WITH REPLICATION;
```

Note: Replace the placeholder values in the commands above with your own. For example, <database_username> with hevouser.

Set up Logical Replication for Incremental Data

Hevo supports data replication from PostgreSQL servers using the pgoutput plugin (available on PostgreSQL version 10.0 and above). For this, Hevo identifies the incremental data from publications, which are defined to track changes generated by all or some database tables. A publication identifies the changes generated by the tables from the Write Ahead Logs (WALs) set at the logical level.

Perform the following steps to enable logical replication on the PostgreSQL server:

1. Update the database configuration file

Log in to your PostgreSQL primary database instance host as the root or admin user.
Locate your database configuration file, postgresql.conf, and modify it with your favorite text editor. For example, vim.

Note: The postgresql.conf file is generally found in the /etc/postgresql/<version>/main/ directory on a UNIX-like system.

Update the values of the following parameters:

Parameter	Value	Description
`max_replication_slots`	Any whole number between 1 and 10	The number of clients that can connect to the server. Default value: 10.
`max_wal_senders`	10	The number of processes that can simultaneously transmit the WAL. Default value: 10. You must set this value to at least twice the number of read replicas connecting to the primary database instance.
`wal_level`	logical	The amount of information being written to the WAL log. For example, minimal, replica, or logical. Default value: replica. The value logical is required to enable log-based replication.
`wal_sender_timeout`	0	The time, in seconds, after which PostgreSQL terminates the replication connections due to inactivity. Default value: 60 seconds. You must set the value to 0 so that the connections are never terminated and your Pipeline does not fail.

Note: If any of the parameters do not exist in the configuration file, you should add them, and then, update their values.

2. Update the client authentication file

Log in to your PostgreSQL primary database instance host as the root or admin user.
Locate your client authentication file, pg_hba.conf, and open it in your favorite text editor.

Note: The pg_hba.conf file is generally found in the database cluster’s data directory.
In the pg_hba.conf file, assign permission to the database user to connect from an external network to the PostgreSQL server:
```
local replication <database_username> peer

host replication <database_username> 127.0.0.1/0 md5

host replication <database_username> ::1/0 md5
```
Note: Replace the placeholder value in the statements above with your intended user. For example, <database_username> with hevouser.

3. Restart the PostgreSQL service

You must restart your PostgreSQL service after editing the configuration files for the changes to take effect. To do this:

Log in to your PostgreSQL primary database instance host as the root or admin user.
Run the following command in a Terminal window:
```
sudo systemctl restart postgresql.service
```

4. Create a publication for your database tables

In PostgreSQL 10 onwards, the data to be replicated is identified via publications. A publication must be defined on the primary database instance and can include some or all the database tables. The publication is a group of tables that tracks and determines the set of changes generated by those tables from the Write-Ahead Logs (WALs).

To define a publication:

Note: You must define a publication with the insert, update, and delete privileges.

Connect to your PostgreSQL primary database instance as a Superuser using an SQL client tool, such as psql.
Run one of the following commands to create a publication based on your requirements:

Note: You can create multiple distinct publications whose names do not start with a number in a single database.
- On one or more database tables:
```
CREATE PUBLICATION <publication_name> FOR TABLE <table_1>, <table_4>, <table_5>;
```
- On all database tables:
  
  Note: You can run this command only as a Superuser.
```
CREATE PUBLICATION <publication_name> FOR ALL TABLES;
```
(Optional) Run the following command to add table(s) to or remove them from a publication:

Note: You can modify a publication only if it is not defined on all tables and if you have ownership rights on the table(s) being added or removed.
```
ALTER PUBLICATION <publication_name> ADD/DROP TABLE <table_name>;
```

Note: Replace the placeholder values in the commands above with your own. For example, <publication_name> with hevo_publication.

Allowlist Hevo IP addresses for your region

You must add Hevo’s IP address for your region to the database IP allowlist, enabling Hevo to connect to your PostgreSQL database. To do this:

Log in to your PostgreSQL primary database instance host as the root or admin user.
Locate and edit the database configuration file, postgresql.conf. This file is generally found in the path: /etc/postgresql/<version>/main/.
Update the listen_addresses parameter to add the Hevo IP address for your region or * to allow all IPs to connect.
Locate and edit the client authentication file, pg_hba.conf. This file is generally found in the path: /etc/postgresql/<version>/main/.
Scroll to the bottom, where the connection information is listed, and add the Hevo IP address for your region or 0.0.0.0 to allow all IPs to connect.
```
# TYPE  DATABASE        USER                    ADDRESS             METHOD
  host  all             <database_username>     <Hevo IP address>   md5
```
Note: Replace all placeholder values in the statement above with your own. For example, <Hevo IP address> with 13.235.131.126 if your Hevo account is in the India region.

Configure PostgreSQL as a Source in your Pipeline

Perform the following steps to configure your PostgreSQL Source:

Click PIPELINES in the Navigation Bar.
Click the Edge tab in the Pipelines List View and click + CREATE EDGE PIPELINE.
On the Create Pipeline page, under Source Configuration, do the following:
1. In the Selection screen, select PostgreSQL.
2. In the PostgreSQL screen, specify the following:
  - Source Name: A unique name for your Source, not exceeding 255 characters. For example, PostgreSQL Source.
  - In the Connect to your PostgreSQL section:
    - Database Host: The PostgreSQL host’s IP address or DNS. For example, 10.123.10.001 or postgresql.westeros.inc.
      
      Note: For URL-based hostnames, exclude the http:// or https:// part. For example, if the hostname URL is https://hevo.database.westeros.inc, enter hevo.database.westeros.inc.
    - Database Port: The port on which your PostgreSQL server listens for connections. Default value: 5432.
    - Database User: The user who has permission only to read data from your database tables. This user can be the one you created in the Create a database user step of the Getting Started section or an existing user. For example, hevouser.
    - Database Password: The password of your database user.
    - Database Name: The database from where you want to replicate data. For example, dvdrental.
    - Publication Key: The name of the publication added to your PostgreSQL Source database to track the changes in your database tables. This key can be the publication you created in the Create a publication for your database tables step of the Getting Started section or an existing publication.
  - Log Monitoring
    - Max WAL Size (in GB): The maximum allowable size of the Write-Ahead Logs that you want Hevo to monitor. Specify a number greater than 1.
      
      Note: Your Pipelines will be disabled when the size of the WAL being monitored reaches the set maximum value.
    - Alert Threshold (%): The percentage limit for the WAL, whose size Hevo is monitoring. An alert is sent when this threshold is reached. Specify a value between 50 to 80. For example, if you set the Alert Threshold to 50, Hevo sends a notification when the WAL size is at 50% of the Max WAL Size specified above.
    - Send Email: Enable this option to send an email when the WAL size has reached the specified Alert Threshold percentage.
      
      If this option is turned off, Hevo does not send an email alert.
  - Additional Settings
    - Connect through SSH: Enable this option to connect to Hevo using an SSH tunnel instead of directly connecting your PostgreSQL database host to Hevo. This provides an additional level of security to your database by not exposing your PostgreSQL setup to the public.
      
      If this option is turned off, you must configure your Source to accept connections from Hevo’s IP address.
    - Use SSL: Enable this option to use an SSL-encrypted connection. Specify the following:
      - CA File: The file containing the SSL server certificate authority (CA).
      - Client Certificate: The client’s public key certificate file.
      - Client Key: The client’s private key file.
Click TEST & CONTINUE to test the connection to your PostgreSQL Source. Once the test is successful, you can proceed to set up your Destination.

Data Type Mapping

Hevo maps the PostgreSQL Source data type internally to a unified data type, referred to as the Hevo Data Type, in the table below. This data type is used to represent the Source data from all supported data types in a lossless manner.

The following table lists the supported PostgreSQL data types and the corresponding Hevo data type to which they are mapped:

PostgreSQL Data Type	Hevo Data Type
- INT_2 - SHORT - SMALLINT - SMALLSERIAL	SHORT
- BIT(1) - BOOL	BOOLEAN
- BIT(M), M>1 - BYTEA - VARBIT	BYTEARRAY
- INT_4 - INTEGER - SERIAL	INTEGER
- BIGSERIAL - INT_8 - OID	LONG
- FLOAT_4 - REAL	FLOAT
- DOUBLE_PRECISION - FLOAT_8	DOUBLE
- BPCHAR - CIDR - CITEXT - DATERANGE - ENUM - HSTORE - INET - INT_4_RANGE - INT_8_RANGE - INTERVAL - LTREE - MACADDR - MACADDR_8 - NUMRANGE - TEXT - TSRANGE - TSTZRANGE - UUID - VARCHAR - XML	VARCHAR
- TIMESTAMPTZ	TIMESTAMPTZ (Format: YYYY-MM-DDTHH:mm:ss.SSSSSSZ)
- JSON - JSONB - POINT	JSON
- DATE	DATE
- TIME	TIME
- TIMESTAMP	TIMESTAMP
- MONEY - NUMERIC	DECIMAL

At this time, the following PostgreSQL data types are not supported by Hevo:

TIMETZ
Arrays
Any other data type not listed in the table above.

Note: If any of the Source objects ingested by Hevo contain unsupported data types, Hevo fails such objects and does not load their data to the Destination.

Handling of Deletes

In a PostgreSQL database for which the WAL level is set to logical, Hevo uses the database logs for data replication. As a result, Hevo can track all operations, such as insert, update, or delete, that take place in the database. Hevo replicates delete actions in the database logs to the Destination table by setting the value of the metadata column, __hevo_is_deleted__ to True.

Source Considerations

If you add a column with a default value to a table in PostgreSQL, entries with it are created in the WAL only for the rows that are added or updated after the column is added. As a result, in the case of log-based Pipelines, Hevo cannot capture the column value for the unchanged rows. To capture those values, you need to:
- Restart the historical load for the respective object.
- Run a query in the Destination to add the column and its value to all rows.
Any table included in a publication must have a replica identity configured. PostgreSQL uses it to track the UPDATE and DELETE operations. Hence, these operations are disallowed on tables without a replica identity. As a result, Hevo cannot track the updates or deletes for such tables.

By default, PostgreSQL picks the table’s primary key as the replica identity. If your table does not have a primary key, you must either define one or set the replica identity as FULL, which records the changes to all the columns in a row.

Limitations

Hevo does not support data replication from foreign tables, temporary tables, and views.
If your Source table has indexes (indices) and or constraints, you must recreate them in your Destination table, as Hevo does not replicate them. It only creates the existing primary keys.
Hevo does not set the __hevo_is_deleted__ field to True for data deleted from the Source table using the TRUNCATE command. This action could result in a data mismatch between the Source and Destination tables.
You cannot select Source objects that Hevo marks as inaccessible for data ingestion during object configuration in the Pipeline. Following are some of the scenarios in which Hevo marks the Source objects as inaccessible:
- The object is not included in the publication (key) specified while configuring the Source.
- The publication is defined with a row filter expression. For such publications, only those rows for which the expression evaluates to FALSE are not published to the WAL. For example, suppose a publication is defined as follows:
```
CREATE PUBLICATION active_employees FOR TABLE employees WHERE (active IS TRUE);
```
  In this case, as Hevo cannot determine the changes made in the departments object, it marks the object as inaccessible.
- The publication specified in the Source configuration does not have the privileges to publish the changes from the UPDATE and DELETE operations. For example, suppose a publication is defined as follows:
```
CREATE PUBLICATION insert_only FOR TABLE employees WITH (publish = 'insert');
```
  In this case, as Hevo cannot identify the new and updated data in the employees table, it marks the object as inaccessible.