Amazon S3 - Hevo Data

Amazon S3 Setup Guide

Prerequisites

An active AWS account and an S3 bucket from which data is to be ingested exist.
You are logged as an IAM user with permission to:
- Create an IAM policy.
- Create an IAM role (to generate the IAM role-based credentials) or create an IAM user (to generate the access credentials).
The IAM role-based credentials or access credentials are available to authenticate Hevo on your AWS account.
You are assigned the Team Administrator, Team Collaborator, or Pipeline Administrator role in Hevo, to create the Pipeline.

Create an IAM Policy

Create an IAM policy with the ListBucket and GetObject permissions. These permissions are required for Hevo to access data from your S3 bucket.

To do this:

Log in to the AWS IAM Console.
In the left navigation pane, under Access management, click Policies.
On the Policies page, click Create policy.

On the Specify permissions page, click JSON, and in the Policy editor section, paste the following JSON statements:

JSON tab

Note: Replace the placeholder values in the commands below with your own. For example, <your_bucket_name> with Hevo-S3-bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<your_bucket_name>",
                "arn:aws:s3:::<your_bucket_name>/*"
            ]
        }
    ]
}

The JSON statements allow Hevo to access and ingest data from the bucket name you specify.

JSON statements

At the bottom of the page, click Next.
On the Review and create page, specify the Policy name, and at the bottom of the page, click Create policy.

Obtain Amazon S3 Credentials

You must generate either access credentials or IAM role-based credentials and assign them the IAM policy to access and ingest your S3 data.

Generate IAM role-based credentials

To generate your IAM role-based credentials, you need to create an IAM role for Hevo and assign the policy that you created in Step 1 above, to the role. Use the Amazon Resource Name (ARN) and external ID from this role while creating your Pipeline.

1. Create an IAM role and assign the IAM policy

Log in to the AWS IAM Console.
In the left navigation pane, under Access Management, click Roles.
On the Roles page, click Create role.
In the Select trusted entity section, select AWS account.
In the An AWS account section, select Another AWS account, and in the Account ID field, specify Hevo’s Account ID, 393309748692.

This account ID enables you to assign a role to Hevo and ingest data from your S3 bucket for replicating it to your desired Destination.
In the Options section, select the Require external ID check box, specify an External ID of your choice, and click Next.
On the Add Permissions page, search and select the policy that you created in Step 1 above, and at the bottom of the page, click Next.
On the Name, review, and create page, specify the Role name and Description of your choice, and at the bottom of the page, click Create role.

You are redirected to the Roles page.

2. Obtain the ARN and external ID

On the Roles page of your IAM console, search and click the role that you created above.
On the <Role name> page, Summary section, click the copy icon below the ARN field and save it securely like any other password.
In the Trust relationships tab, copy the external ID corresponding to the sts:ExternalId field. For example, hevo-role-external-id in the image below.

You can use this ARN and external ID while configuring your Pipeline.

Generate access credentials

Your access credentials include the access key and the secret access key. To generate these, you need to create an IAM user for Hevo and assign the policy you created in Step 1 above, to it.

Note: The secret key is associated with an access key and is visible only once. Therefore, you must make sure to save the details or download the key file for later use.

1. Create an IAM user and assign the IAM policy

Log in to the AWS IAM Console.
In the left navigation pane, under Access management, click Users.
On the Users page, click Create user.
On the Specify user details page, specify the User name, and click Next.
On the Set permissions page, Permissions options section, click Attach policies directly.
On the Permissions policies section, search and select the check box corresponding to the policy that you created in Step 1 above, and at the bottom of the page, click Next.
At the bottom of the Review and create page, click Create user.

2. Generate the access keys

On the Users page of your IAM console, click the user that you created above.
On the <User name> page, click the Security credentials tab.
In the Access keys section, click Create access key.
On the Access key best practices & alternatives page, select Command Line Interface (CLI).
At the bottom of the page, select the I understand the above…. check box and click Next.
(Optional) Specify a description for the access key.
Click Create access key.
On the Retrieve access keys page, Access key section, click the copy icon in the Access key and Secret access key fields and save the keys securely like any other password.

Optionally, click Download .csv file to save the keys on your local machine.

Note: Once you leave this page, you cannot view these keys again.

You can use these keys while configuring your Pipeline.

Configure Amazon S3 as a Source

Perform the following steps to configure S3 as the Source in your Pipeline:

Click PIPELINES in the Navigation Bar.
Click + CREATE PIPELINE in the Pipelines List View.
On the Select Source Type page, select S3.
On the Select Destination Type page, select the type of Destination you want to use.
On the Configure your S3 Source page, specify the following:
- Pipeline Name: A unique name for the Pipeline, not exceeding 255 characters.
- Source Setup: The credentials needed to allow Hevo to access data from your S3 account. Select one of the following setup methods:
  - Connect using IAM Role:
    - IAM Role ARN: The ARN that you retrieved above.
    - External ID: The external ID that you retrieved above.
    - Bucket Name: The name of the bucket from which you want to ingest data.
    - Bucket Region: The AWS region where the bucket is located.
  - Connect using Access Credentials:
    - Access Key ID: The access key that you retrieved above.
    - Secret Access Key: The secret access key that you retrieved above.
    - Bucket Name: The name of the bucket from which you want to ingest data.
    - Bucket Region: The AWS region where the bucket is located.
Click TEST & CONTINUE.
In the Data Root section, specify the following. The data root signifies the directories or files that contain your data. By default, the files are listed from the root directory.
- Select the folders from which you want to ingest data.
  
  Note: If Hevo cannot retrieve the list of files from your S3 bucket, it displays the Path Prefix field. In this situation, you must specify the prefix of the path for the directory that contains your data. To specify the path prefixes for multiple files, you can click the Plus ( ) icon.
- File Format: The format of the data file in the selected folders. Hevo supports AVRO, CSV, JSON, JSONL, TSV, and XML formats.
  
  Note: You can select only one file format at a time. If your Source data is in a different format, you can export the data to either of the supported formats and then ingest the files.
  
  Based on the format you select, you must specify some additional settings:
  - Field Delimiter: The character on which the fields in each line are separated. For example, \t or ,.
    
    Note: This field is visible only for CSV data.
  - Create Events from child nodes: If enabled, Hevo loads each node present under the root node in the XML file as a separate Event. If disabled, Hevo combines and loads all nodes present in the XML file as a single Event.
    
    Note: This option is visible only for XML data.
  - Header Row: The row number in your CSV file whose data you want Hevo to use as column headers. Hevo starts ingesting data from the specified header row in your CSV file, thus skipping all the rows before it. Default value: 1.
    
    If you set the header row to 0, Hevo automatically generates the column headers during ingestion. Refer to the Example to understand this behavior.
    
    Note: This field is visible only for CSV data.
  - Include compressed files: If enabled, Hevo also ingests the compressed files of the selected file format from the folders. Hevo supports the tar.gz and zip compression types only. If disabled, Hevo does not ingest any compressed files present in the selected folders.
    
    Note: This option is visible for all supported data formats.
  - Create Event Types from folders: If enabled, Hevo ingests each subfolder as a separate Event Type. By default, data from all selected folders is ingested. However, if you do not want Hevo to ingest data from certain folders, you can skip them from the Pipeline Overview page.
    
    If disabled, Hevo merges the subfolders into their parent folders and ingests them as one Event Type. Default setting: Disabled.
    
    Note:
    - By default, Hevo allows the ingestion of subfolders as separate Event Types only if the parent folder contains no more than 500 subfolders and files. If you need to increase this limit, contact Hevo Support.
    - This option is visible for all supported data formats.
  - Convert date/time format fields to timestamp: If enabled, Hevo converts the date/time format within the files of selected folders to timestamp. For example, 07/11/2022, 12:39:23 to 1667804963. If disabled, Hevo ingests the datetime fields in the same format.
    
    Note: This option is visible for all supported data formats.
- Click CONFIGURE SOURCE.
Proceed to configuring the data ingestion and setting up the Destination.