Storing raw-level data in a database is becoming increasingly important. Not only can it serve as a source for data analytics, but data scientists can also crunch it in whatever way serves the business. On top of that, when dealing with hundreds of terabytes of data daily, cost-effective real-time processing can be challenging.
Batch and streaming are the two main approaches to processing large data sets. The first deals with bounded data, whose size is well known before the job starts; the latter copes with unbounded streams, where the data size is unknown. In practice, an unbounded stream is still divided into finite chunks with various windowing strategies, such as session-based or time-based windows. Another notable difference is that a batch process has a start and end time, whereas streaming is a continuous process that may never terminate.
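To make the time-based windowing strategy concrete, here is a minimal sketch that assigns timestamped events to fixed, non-overlapping windows. The function name, window size, and event format are illustrative assumptions, not part of any framework's API:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size_s=60):
    """Group (timestamp_seconds, value) events into fixed,
    non-overlapping time windows of window_size_s seconds.
    Each window is keyed by its start timestamp."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_size_s)
        windows[window_start].append(value)
    return dict(windows)

events = [(3, "a"), (59, "b"), (61, "c"), (130, "d")]
print(assign_fixed_windows(events))
# events land in the 0s, 60s, and 120s windows respectively
```

Session windows work similarly, except the boundary is derived from a gap of inactivity between events rather than from fixed clock intervals.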
Most data processing frameworks support both methods. Batch mode is available for data stored in any database or file system, whereas in streaming mode data is typically shipped continuously via a message bus.
BigQuery is a high-performance database and an excellent option for both processing and storing raw and aggregated data at large scale.
In commonly adopted data processing methodologies, reading data from a database, including BigQuery, falls into batch mode only. But what if you wanted to build close-to-real-time aggregates?
One option is to use a pub/sub solution as the message bus, which allows you to use the streaming capabilities of Apache Beam or another data processing platform. Another option is to use the BigQuery time range decorator to carve finite time-window chunks out of an unbounded stream of continuously loaded raw data.
The BigQuery time range decorator allows windowing over data ingested within the last 7 days, with the following syntax:
SELECT * FROM [PROJECT_ID:DATASET.TABLE@<timeFromInMs>-<timeToInMs>]
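A small helper can assemble the decorated query string for a given window. This is a minimal sketch; the project, dataset, and table names are placeholders, and the time bounds are milliseconds since the Unix epoch, as the decorator expects:

```python
def decorated_query(project, dataset, table, start_ms, end_ms, columns="*"):
    """Build a Legacy SQL query over a BigQuery time range decorator.
    start_ms and end_ms are milliseconds since the Unix epoch; note that
    decorators can only reach back 7 days."""
    return (f"SELECT {columns} "
            f"FROM [{project}:{dataset}.{table}@{start_ms}-{end_ms}]")

# Example: a one-minute window of hypothetical raw event data
q = decorated_query("my-project", "events", "raw",
                    1546300800000, 1546300860000)
print(q)
```

The resulting string can be submitted as a Legacy SQL job through the BigQuery REST API or client libraries.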
Since this feature is only available for Legacy (non-partitioned, non-clustered) tables, the table layout design has to address this limitation.
The following is an example of a flexible data layout, where each table shares the same schema.
Windowing data with range decorators requires state management, where a successfully processed batch triggers an update of the next window's lower bound.
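The state-management loop can be sketched as follows. This is an illustrative in-memory version only; in a real deployment the lower bound would be persisted in durable storage (for example a metadata table) so that a crash does not lose or replay windows. All names here are assumptions, not part of any library:

```python
class WindowState:
    """Tracks the lower bound of the next time window to process."""
    def __init__(self, start_ms):
        self.lower_bound_ms = start_ms

    def next_window(self, now_ms, max_size_ms=60_000):
        # Cap the window at "now" so we never query future data.
        upper = min(now_ms, self.lower_bound_ms + max_size_ms)
        return self.lower_bound_ms, upper

    def commit(self, upper_ms):
        # Advance only after the batch succeeded; a failed batch leaves
        # the lower bound untouched so the same window is retried.
        self.lower_bound_ms = upper_ms

def run_once(state, process_batch, now_ms):
    lo, hi = state.next_window(now_ms)
    if lo >= hi:
        return None  # nothing new to process yet
    process_batch(lo, hi)  # e.g. run the decorated query for [lo, hi)
    state.commit(hi)
    return lo, hi
```

Each successful `run_once` call moves the lower bound forward, so consecutive invocations produce contiguous, non-overlapping windows over the ingested data.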
One of the projects providing an implementation of the highlighted process is BigQuery Windowed Tables. It can be integrated with any data processing platform that can issue REST calls to build SQL dynamically.
Here are a few favorite choices of mine:
I’ve been using both tools extensively, implementing various ETL jobs ranging from creating Bloom filters to producing near-real-time aggregates over dozens of billions of rows daily. Apache Beam is a great choice when the data transformation source or destination lives outside of BigQuery.
In other scenarios, as long as the problem can be addressed with a series of SQL statements run directly in BigQuery, Potents IO greatly simplifies data transformation workflows. Another important factor is cost: in this case, data never leaves the BigQuery ecosystem, so no additional data transfer is needed, nor an extra processing cluster.