Managing query performance speed and low execution cost can be challenging, primarily when operating on large tables (100T+) with a dozen to hundreds of repeated fields.
Repeated fields are an elegant way of representing and accessing denormalized data without expensive joins, typical for the relational schema. While they perform great, BigQuery does not provide a partitioning or clustering option for the repeated field, which results in potentially costly a full data scan. Take a table with a repeated x1 column as an example; assume that table in question stores 1TB in x1 column; the query would use complete 1TB regardless of x1 filter criteria. Practically it means that using repeated fields on a large table can be very costly to run. You can mitigate the query cost with Flex Slot Reservation billing project mode; however, this could come at the expense of query performance due to slots starvation. Having all of these in mind, I will be exploring various optimization techniques.
To deep dive into various optimization techniques, let me create a test repeated table with 100M records (459.7 GB) and 6 repeated fields x1 to x6.
Once the table got created (in my case, it took 41min), we are ready to start testing various access patterns. I will be using Goliath data explorer to check the query execution timeline with query slot usage.
SQL Query Optimization
Let start with a benchmarking query with IN operator directly on the repeated field.
This first query took 1.7s and peaked at 1.75k slots.
The second query uses CROSS JOIN UNNEST to match data in a repeated dataset.
This second query took 1.9s and peaked at 2k slots.
The final query uses EXISTS clause on the unnested x1 repeated field.
The third query took 1.5 seconds to execute and peaked at 550 slots on demand.
The last query performed way better than the first two; all queries sadly use 149.8 GB (total x1column data storage size). So far, we’ve used the on-demand billing mode to run the queries. Let's run the fastest query using the 500 Reservation Slots pool.
The 10.1s execution time is the worst for the same query that smoothly runs 1.5 seconds with the on-demand billing. What it shows is confirming the unpredictable query performance nature of reservations due to slots starvation. Practically it means that only running a query on-demand billing mode guarantees the best performance. In the remaining part of this article, I will explore both cost and execution time optimization.
Currently, BigQuery does not support indexes, but nothing is stopping us from implementing one. Each indexed column would use the dedicated table, which has to join with master table, to do so let introduce two extra column batch_id and seq, where the seq column represents 0–63 sequential values within unique batch_id. Since repeated fields are super fast, we have to make up for extra JOIN using bitset data compaction and clustering both on the master table and index tables.
During the indexing process, each distinct value and batch_id combination produces bitset where a bit is populated at seq + 1 position when the value exists in a repeated set.
The following tables show the indexing of 8 records with repeated values.
You would export data to the Bitsy Cloud Function trigger bucket to bootstrap the process.
The bitsy indexing rule for test repeated table:
- $fragment dynamically expands with TableRoot and indexing column and data type name.
- $TimePath expands with current year/month/day/hour
BqTail data ingestion rule
The indexing data loading process for our test data (459.7GB) took under 6 min end to end; the 6 new index tables got created.
Let test our index tables. I’ll start with a rewritten query using the index table.
The indexing table runs faster than the original query, but most importantly, it cost only 1.57 GB, instead of 149.8 GB !!!. Since the index table uses value clustering, most 1.57 GB come from using batch_id, seq in the master table.
While testing data with different cardinality and distribution, all queries provided substantial cost reduction with an index. The index also tends to reduce or be inline with execution time for high cardinality low distribution dataset. Using directly repeated fields hold strong performance lead with low cardinality and high data distribution.
The following table shows the result of experimentation.
During our experimentation, slots usage peak ranged between 550 and 5000; when running a query on large production tables with multi repeated fields, it can easily peek at 12k slots and use TB of data just for just one query run.
For offline, not time-critical jobs using directly native repeated fields with Flex Slot Reservations billing might be a reasonable option. In all other cases using the bitset-index can come handy to reduce substantially cost while managing good query performance. With 5, 100, 200 repeated values per record, cost got reduced by 3, 50, 100 times respectively.