Elasticsearch is a distributed, RESTful search and analytics engine that centrally stores your data so you can search, index, and analyze data of all shapes and sizes. Because Elasticsearch relies on indices to search and fetch documents from your data, it preempts operations that may cause memory issues and stops them by raising exceptions. Hevo parses some of these exceptions and recommends corrective actions. Read Configuration Changes in Elasticsearch for details.
Hevo connects to your Elasticsearch cluster using the Elasticsearch Transport Client and synchronizes the data available in the cluster to your preferred data warehouse using indices. Currently, Hevo supports the following variants:
- Generic Elasticsearch
- AWS Elasticsearch
Prerequisites
- Elasticsearch version greater than 7.0. View versions.
- There is at least one sortable field in each document. To be sortable, the fields can be of any of these types: unsigned_long, long, integer, short, byte, float, double, half_float, scaled_float, date, and date_nanos.
- The database username and password are available if your Elasticsearch host uses Native Realm authentication.
- You are assigned the Team Administrator, Team Collaborator, or Pipeline Administrator role in Hevo, to create the Pipeline.
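The sortable-field prerequisite can be checked against an index mapping before you create the Pipeline. The following is a minimal sketch, not part of Hevo: the function name and the sample mapping are hypothetical, and it inspects only top-level fields in the "properties" section of a mapping. It confirms that a sortable field is mapped, not that every document has a value for it.

```python
# Field types that the prerequisites list as sortable.
SORTABLE_TYPES = {
    "unsigned_long", "long", "integer", "short", "byte", "float",
    "double", "half_float", "scaled_float", "date", "date_nanos",
}


def sortable_fields(properties):
    """Return the names of top-level fields whose mapped type is sortable."""
    return sorted(
        name
        for name, spec in properties.items()
        if spec.get("type") in SORTABLE_TYPES
    )


# Hypothetical "properties" section of an index mapping.
example_properties = {
    "title": {"type": "text"},
    "updated_at": {"type": "date"},
    "views": {"type": "long"},
}

print(sortable_fields(example_properties))
```

An empty result for an index suggests that the index does not meet the prerequisite as mapped.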
Perform the following steps to configure your Elasticsearch Source:
Retrieve the Hostname
- For self-hosted or cloud-based Elasticsearch databases, contact your system admin for the database hostname and port.
- For AWS Elasticsearch services, contact your service provider.
(Optional) Obtain Username and Password
The Elastic Stack security features authenticate users by using realms and one or more token-based authentication services. Currently, Hevo’s Elasticsearch integration supports only Native Realm authentication.
If you do not have the username and password, contact your system administrator to obtain them.
(Optional) Connect to Elasticsearch hosted inside a Virtual Private Cloud (VPC)
Hevo connects to your Elasticsearch instance hosted inside a VPC using a reverse proxy server set up on Amazon EC2. The server routes all requests that Hevo makes to ingest data to your Elasticsearch instance inside the VPC.
To enable Hevo to connect to your Elasticsearch instance configured inside a VPC, you need to:
- Set up an EC2 instance.
- Whitelist Hevo’s IP addresses, and launch the EC2 instance.
- Retrieve the public Endpoint and connect to the EC2 instance.
- Configure a reverse proxy server in the EC2 instance.
The steps below use NGINX Open Source as the reverse proxy server. You can also use another web server, such as Apache or Caddy.
1. Set up the EC2 instance
- Open the EC2 Management Console in your AWS account, and launch an EC2 instance, for example, one named NGINX_Elasticsearch.
- Configure the network settings for the instance so that it is in the same VPC as your Elasticsearch database. Also, retain the default setting for Auto-assign Public IP, to assign a public IP and DNS to the instance. Read Configure Instance Details.
2. Whitelist Hevo’s IP addresses and launch the instance
- Configure the security group settings to whitelist Hevo’s IP addresses of your region for the HTTP and HTTPS protocol types. Read Configure Security Group.
- Review the instance settings, and in the pop-up window that is displayed, create a key pair or use an existing one. A key pair, which consists of a public key and a private key, allows you to connect to your instance securely. Read Review Instance and Launch.
- Click Download Key Pair to download the created key pair and save it in a secure location.
- Click Launch Instances.
3. Retrieve the public Endpoint and connect to the EC2 instance
- In the Launch Status page, click View Instances to retrieve the public endpoint of the instance you created. This could be the public IPv4 address or DNS.
- Connect to the EC2 instance using one of the available methods, such as SSH or EC2 Instance Connect. Read Connect to your Linux instance.
- Install NGINX Open Source in the EC2 instance. Read Installing NGINX.
- Perform the following steps to edit the NGINX configuration, and add your Elasticsearch instance public endpoint and port number:
  - Navigate to the configuration file directory. For example, /etc/nginx.
  - Edit the configuration file in the /etc/nginx/conf.d directory, and add the following information:
      server {
          # Port on which the proxy accepts requests from Hevo
          listen 443;
          location / {
              # Forward every request to the Elasticsearch endpoint
              proxy_pass http://<elasticsearch-services-endpoint>:443;
          }
      }
  - Save the file and restart the NGINX service. For example,
      $ sudo service nginx restart
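After NGINX restarts, you may want to confirm that the proxy forwards requests to the cluster before configuring the Source. The sketch below is an illustration, not a Hevo tool. The helper names are hypothetical. It assumes the proxy accepts plain HTTP on port 443, as in the server block above, and that the cluster's root endpoint can be read without credentials. If security is enabled, the request needs an Authorization header, which this sketch omits.

```python
import json
import urllib.request


def build_base_url(host, port=443, use_https=False):
    """Build the base URL of the reverse proxy (hypothetical helper)."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}"


def cluster_version(info):
    """Extract the version string from the root endpoint's JSON response."""
    return info["version"]["number"]


def fetch_cluster_info(base_url, timeout=5):
    """Send GET / through the proxy and return the parsed JSON response."""
    with urllib.request.urlopen(base_url + "/", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, replacing the placeholder with the public DNS or IP of your EC2 instance:
#   info = fetch_cluster_info(build_base_url("<ec2-public-endpoint>"))
#   print(cluster_version(info))
```

A response that contains a version number indicates that the proxy reached the cluster. A timeout usually points to the security group or the proxy_pass target.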
Perform the following steps to configure Elasticsearch as the Source in Hevo:
- Click PIPELINES in the Navigation Bar.
- Click + CREATE PIPELINE in the Pipelines List View.
- In the Select Source Type page, select Elasticsearch.
- In the Configure your Elasticsearch Source page, specify the following:
  - Pipeline Name: A unique name for your Pipeline, not exceeding 255 characters.
  - Database Host: The Elasticsearch database host’s IP address or DNS. Provide the public IP address or DNS of the EC2 instance as retrieved in Step 3 if your Elasticsearch database is hosted inside a VPC.
    Note: For URL-based hostnames, you can exclude the protocol part (http:// or https://).
  - Database Port: The port on which your Elasticsearch server listens for connections. Default value: 9200.
    Note: For an Elasticsearch database hosted inside a VPC, this port number is 443.
  - Database User (Optional): The authenticated user that can read the tables in your database.
  - Database Password (Optional): The password for the database user.
  - Connection Options: Select one of the following options to specify how Hevo must access your database instance:
    - Connect through SSH: Enable this option to connect to Hevo using an SSH tunnel, instead of directly connecting your Elasticsearch database host to Hevo. This provides an additional level of security to your database by not exposing your Elasticsearch setup to the public. Read Connecting Through SSH.
      If this option is disabled, you must whitelist Hevo’s IP addresses to allow Hevo to connect to your Elasticsearch host.
      Note: This option does not apply to an AWS Elasticsearch Source. To connect to that Source, you must set up a reverse proxy server.
    - Connect through HTTPS: Enable this option if your cluster is configured to use HTTPS. Contact your administrator if you do not have this information. Keep this option disabled to connect using HTTP.
  - Advanced Settings:
    - Load Historical Data: If this option is enabled, the entire table data is fetched during the first run of the Pipeline. If disabled, Hevo loads only the data that was written in your database after the time of creation of the Pipeline.
    - Include New Tables in the Pipeline: Applicable for all Pipeline modes except Custom SQL.
      If enabled, Hevo automatically ingests data from tables created in the Source after the Pipeline has been built. These may include completely new tables or previously deleted tables that have been re-created in the Source.
      If disabled, new and re-created tables are not ingested automatically. They are added in SKIPPED state in the objects list, in the Pipeline Overview page. You can update their status to INCLUDED to ingest data.
      You can change this setting later.
- Click TEST & CONTINUE. This button is enabled once you specify all the mandatory fields.
- Proceed to configuring the data ingestion and setting up the Destination.
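The notes on the Database Host and Database Port fields can be summarized in a small helper. This is a sketch for illustration only: the function names and the example hostname are hypothetical, and Hevo does not expose such a function. It strips the protocol from a URL-based hostname and picks the port described above, 9200 by default and 443 for a database hosted inside a VPC.

```python
DEFAULT_PORT = 9200    # default stated for the Database Port field
VPC_PROXY_PORT = 443   # port stated for a database hosted inside a VPC


def strip_protocol(host):
    """Drop a leading http:// or https:// and any trailing slash."""
    lowered = host.lower()
    for prefix in ("https://", "http://"):
        if lowered.startswith(prefix):
            host = host[len(prefix):]
            break
    return host.rstrip("/")


def source_settings(host, inside_vpc=False):
    """Return the host and port values to enter in the Source form."""
    port = VPC_PROXY_PORT if inside_vpc else DEFAULT_PORT
    return {"host": strip_protocol(host), "port": port}


# Hypothetical hostname, used only to show the call.
print(source_settings("https://search.example.com/"))
```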
Data Replication
For Teams Created | Default Ingestion Frequency | Minimum Ingestion Frequency | Maximum Ingestion Frequency | Custom Frequency Range (in Hrs) |
Before Release 2.21 | 15 Mins | 15 Mins | 24 Hrs | 1-24 |
After Release 2.21 | 6 Hrs | 30 Mins | 24 Hrs | 1-24 |
Note: The custom frequency must be set in hours as an integer value. For example, 1, 2, or 3 but not 1.5 or 1.75.
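The rule for custom frequencies can be expressed as a short check. The function below is a hypothetical sketch, assuming the 1-24 hour range from the table above. It is not Hevo's own validation.

```python
def is_valid_custom_frequency(hours):
    """Return True if hours is a whole number of hours in the 1-24 range."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(hours, bool) or not isinstance(hours, int):
        return False
    return 1 <= hours <= 24


for candidate in (1, 2, 3, 1.5, 1.75):
    print(candidate, is_valid_custom_frequency(candidate))
```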
- Historical Data: In the first run of the Pipeline, Hevo ingests all the data available in your Elasticsearch database.
- Incremental Data: Once the historical load is complete, all new and updated data is synchronized with your Destination as per the ingestion frequency.
  Note: A maximum of 500 Events are ingested in each call to the database, to optimize the processing load on your cluster. Contact Hevo Support if you want to modify this limit.
Source Considerations
- Elasticsearch does not expose individual document modifications. Therefore, Hevo needs at least one incrementing field of a sortable type, and the identity column is used as the tiebreaker if more than one document has the same value in the sortable field. The _id field created by default is used if none is specified.
  Note: The _id field is used for sorting only in Elasticsearch versions 7.6 and below. For versions above 7.6, you should refer to Elasticsearch’s documentation for the appropriate changes.
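The combination of a sortable field, a tiebreaker, and a fixed batch size corresponds to the sort and search_after options of the Elasticsearch search API. The sketch below builds such a request body. It is an assumption about how paging of this kind can be written, not Hevo's actual implementation. The function name and the updated_at field are hypothetical, and the default of 500 mirrors the per-call limit mentioned in the Data Replication section.

```python
def build_page_query(sort_field, last_sort_values=None,
                     tiebreaker="_id", page_size=500):
    """Build a search body that pages through documents in a stable order."""
    body = {
        "size": page_size,
        # Sort on the sortable field first, then on the tiebreaker so that
        # documents with the same sort value still have a fixed order.
        "sort": [{sort_field: "asc"}, {tiebreaker: "asc"}],
    }
    if last_sort_values is not None:
        # Resume after the sort values of the last document already read.
        body["search_after"] = list(last_sort_values)
    return body


first_page = build_page_query("updated_at")
next_page = build_page_query("updated_at", [1700000000000, "doc-42"])
print(first_page)
print(next_page)
```

The search_after values come from the sort array that Elasticsearch returns with the last hit of the previous page. As the note above states, sorting on _id applies only to versions 7.6 and below, so a different tiebreaker field may be needed on later versions.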
Limitations
- Only Native Realm authentication is supported.
- Hevo currently does not support deletes. Therefore, any data deleted in the Source may continue to exist in the Destination.
- Hevo does not support the replication of hidden objects.
Revision History
Refer to the following table for the list of key updates made to this page:
Date | Release | Description of Change |
Mar-05-2024 | 2.21 | Updated the ingestion frequency table in the Data Replication section. |
Jan-16-2024 | NA | Updated section, Source Considerations to add information about _id field being used for sorting only in specific Elasticsearch versions. |
Jul-21-2023 | NA | Updated section, Limitations to add information about Hevo not supporting replication for hidden objects. |
Nov-22-2022 | NA | Updated section, Limitations to add information about Hevo not capturing deletes. |
Aug-24-2022 | NA | Updated sections, Data Replication and Configure Elasticsearch Connection Settings to restructure the content for better understanding and coherence. |
Jun-09-2022 | NA | Added a reference to the Configuration Changes in Elasticsearch page in the Overview section. |
Apr-11-2022 | 1.86 | Added a note in the Connection Settings about setting up a reverse proxy server for connecting to an AWS Elasticsearch Source. |
Feb-21-2022 | 1.82 | Added section, (Optional) Connect to Elasticsearch hosted inside a Virtual Private Cloud (VPC) |
Jan-03-2022 | 1.79 | Updated the description of the Include New Tables in the Pipeline advance setting in the Configure Elasticsearch Connection Settings section. |
Jul-26-2021 | 1.68 | Added a note for the Database Host field. |
Jul-12-2021 | 1.67 | Added the field Include New Tables in the Pipeline under Source configuration settings. |
Jun-01-2021 | 1.64 | Updated the Configure Elasticsearch Connection Settings section to include the Connect Through HTTPS setting. |