Share

Clustering in BigQuery

Google BigQuery is a fully-managed, petabyte-scale data warehouse that allows businesses to analyze massive datasets. However, at times, you may want to access only a subset of the data, such as the data for July. You can retrieve this data efficiently by structuring it based on common access patterns with clustered tables.

Clustered tables organize and sort your data into blocks based on the contents of the column(s) in a table’s schema. These clustering columns, known as cluster keys, are used to collocate related data. The order in which you specify these keys determines the sort order of data. Clustering helps you narrow down searches, thus enhancing the performance of your queries and reducing their costs. For example, queries that aggregate data by cluster keys do not scan the entire table or partition. They only search for data in the blocks identified by the cluster keys.


Creating BigQuery Clustered Tables

BigQuery allows the clustering of both non-partitioned and partitioned tables. However, you can add cluster keys only while creating tables. Hence, you will need to recreate an existing table and add cluster keys by which you want to group and sort its data. You can select up to four fields from your Source tables as cluster keys.

You can create a clustered table in one of the following ways:

1. Disable Auto Mapping for the Event Type

  1. In the Pipeline List View, click on the Pipeline you want to modify.

  2. In the Tools Bar, click the Schema Mapper icon.

  3. In the Schema Mapper page, click on the Event Type for which you want to create a clustered Destination table.

    Schema Mapper Page

  4. If Auto Mapping was disabled during Pipeline creation, as the Event Types are UNMAPPED, skip to the Create a clustered table for your Event Type step. Else, go to the next step.

    Unmapped Events

  5. Click the More () icon in the Event Types section and disable Auto Mapping for your Event Type.

    Disable Auto Mapping

    If Auto Mapping was enabled earlier, and you turn it off now, the existing mapping is not affected. The Event Type still carries the MAPPED status.

2. Create a clustered table for the Event Type

  1. In the Mapping Summary section of the Schema Mapper page, do one of the following:

    • If you disabled Auto Mapping during Pipeline creation, click CREATE TABLE & MAP and skip to the next step.

      Create Clustered Table for New Event Type

    • If you disabled Auto Mapping for an existing Event Type:

      1. Click the arrow next to the Destination table name and select Modify Destination Table.

        Modify Destination Table for Mapped Event Type

      2. In the Destination Table drop-down, click + Create Table & Map.

        Create Table and Map

  2. In the mapping page that appears, do the following:

    Non-partitioned Clustered Table

    • Specify a unique Destination Table Name.

    • Select the No Partition Partition Style from the drop-down to create a non-partitioned table. Read Create a partitioned table for the Event Type if you want to combine clustering with partitioning.

    • For each Source Field, specify the corresponding Destination Field Name. If you do not want to map a particular Source field, deselect the check box next to its name.

    • Select a data type for the Destination field from the Data Type drop-down.

      Note: Only data types compatible with the Source field are available for selection in the drop-down.

    • Optionally, in the Primary Key column, select the check box next to the field you want to mark as a primary key.

    • In the Cluster Key column, select the Source fields by which you want to cluster your data. The order in which you specify the fields determines the order in which your data is clustered in the Destination table. Use the Up or Down arrow next to each Source Field name to change its placement in the list.

      Note:

      • You can select the same Source field as the cluster and primary keys.

      • You can select up to four fields as your cluster keys.

  3. Click CREATE TABLE & MAP to create a clustered table mapped to your Event Type.

3. Restart Historical Load for the Event Type

Note: Skip this step if Auto Mapping was disabled for the Pipeline and you mapped the Event Type manually. Hevo ingests historical data for such Event Types in the first run of the Pipeline.

You need to copy the data already ingested for your Event Type to the new clustered table, as Hevo loads only the new and updated data to it. For this, you must restart the historical load for it.

To do this:

  1. In the Pipeline Objects list, click the More () icon for your Event Type and select Restart Historical Load.

    Restart Historical Load for Event Type

  2. In the popup dialog, select the check box to agree to the cost implications and then click RESTART HISTORICAL LOAD. The object is queued for ingestion.

    Confirm Restart


Considerations while Creating Clustered Tables

  • BigQuery allows up to four clustering keys for a table. You can combine clustering with partitioning to further classify your data.

  • You cannot change the cluster keys for a table. If you want to use a different set of Source fields as cluster keys, you must create a new table.

  • The cluster key column in the Destination table contains NULL values if the column is deleted from the Source table.

  • Source fields of array and struct data types cannot be specified as cluster keys.


Last updated on Apr 22, 2024

Tell us what went wrong

Skip to the section