Managing query speed and keeping execution costs low can be challenging, especially when operating on large tables (100 TB+) with dozens to hundreds of repeated fields.

Repeated fields are an elegant way of representing and accessing denormalized data without the expensive joins typical of a relational schema. While they perform well, BigQuery does not provide a partitioning or clustering option for repeated fields, which can result in a costly full data scan. Take a table with a repeated x1 column as an example; assume the table in question stores 1 TB in the x1 column: a query would scan the complete 1 TB regardless…


Data Lake Platform

At Viant, the cloud ad server stack produces 1.7M transaction log files daily, resulting in 70 billion records taking up 80 TB of data;
each server rotates log files every 3 minutes and uploads them to Cloud Storage.

Our goal was to ingest all log data into 100+ tables within a few minutes, in a timely and cost-effective manner, with the ability for various ETL processes to incrementally process incoming data every 15 minutes. Since we collect and analyze raw logs, we use time range decorators extensively to incrementally aggregate and transform incoming data.

BigQuery ecosystem

BigQuery is…


In the previous article, I discussed automation techniques leveraging security and developer-centric multi-stack builds and deployments. In this one, I will focus on automating application state setup. Before deep diving, let's first define what application state is.

Application state is the data driving your application logic, which may originate from various sources, such as databases, data stores, a message bus, configuration files, or even external data APIs. Setting it up can be a time-consuming and challenging process, contributing to substantial development overhead.

For example, to build a patch for an application glitch, you may need to reproduce it first in a dedicated environment…


Developing software involves many repetitive tasks. Whether you rebuild/redeploy an app, restart services, or reset application state, these tasks add up to a substantial amount of non-productive time. It is no secret that the best way to improve productivity is automation.
For example, using Docker to build your application and bundling app services with docker-compose provides an excellent level of automation. But even in that case, some tasks remain: triggering a build, restarting services, or resetting application state.

Some basic automation techniques may utilize bash aliasing or scripting.
The first one allows you…


Storing raw-level data in a database is becoming increasingly important. Not only can it be a source for data analytics, but it can also be crunched by data scientists in whatever way serves the business. On top of that, when dealing with hundreds of terabytes of data daily, cost-effective real-time processing can be challenging.

Batch and streaming are two alternatives for processing large data sets. While the former deals with bounded data, whose size is well known before the job starts, the latter copes with unbounded streams, where the data size is unknown. In reality, the unbounded data stream is…
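The bounded vs. unbounded distinction can be sketched in a few lines of Go (the record type and sums here are purely illustrative, not part of any real pipeline):

```go
package main

import "fmt"

// processBounded handles a batch job: the full data set (and its size)
// is known before processing starts.
func processBounded(records []int) int {
	total := 0
	for _, r := range records {
		total += r
	}
	return total
}

// processUnbounded handles a stream: records arrive on a channel and the
// total size is unknown until the producer closes it.
func processUnbounded(in <-chan int) int {
	total := 0
	for r := range in {
		total += r
	}
	return total
}

func main() {
	fmt.Println(processBounded([]int{1, 2, 3})) // batch of known size

	in := make(chan int)
	go func() {
		for _, r := range []int{4, 5, 6} {
			in <- r
		}
		close(in) // in a real stream, the end is never known in advance
	}()
	fmt.Println(processUnbounded(in))
}
```

The key difference is visible in the signatures: a slice has a length up front, while a channel only reveals its end when it is closed.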


Go tries to distance itself from object-oriented terminology; at the same time, it can give you all the object-oriented programming flavors and benefits.
For instance:

  1. C++ or Java classes are called structs.
  2. Class methods are called receivers.
  3. Inheritance is called embedding.
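These three mappings can be illustrated with a short sketch (the `Animal`/`Dog` names are illustrative):

```go
package main

import "fmt"

// Animal plays the role of a base class: in Go, it is a struct.
type Animal struct {
	Name string
}

// Speak is a method declared with a receiver (here, an Animal value).
func (a Animal) Speak() string {
	return a.Name + " makes a sound"
}

// Dog "inherits" from Animal by embedding it: Animal's fields and
// methods are promoted to Dog.
type Dog struct {
	Animal
}

func main() {
	d := Dog{Animal{Name: "Rex"}}
	fmt.Println(d.Speak()) // Speak is promoted from the embedded Animal
}
```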

In this post, however, I will focus on Abstract Class functionality. It can be helpful whenever there is a need to build default or high-level functionality, leaving specialized implementation details to subclasses.

Let's start with the classic object-oriented animal example.

The first implementation

Before looking into the implementation, let me explain how embedding works. When embedding a…


Many of us have faced the challenge of data handling during integration and end-to-end test automation. In this article, I look into various data organization patterns, data loading strategies, and data validation requirements.

A typical use case may require ingesting data into a dozen tables; the data can also be cross-database referenced: for instance, a permission rule can be stored in one database, while the actual users and resources live in another. To make things more complicated, tables and datasets can use sequences, auto-increment, or even randomly generated IDs. Another factor to consider is application architecture; for instance, some low latency…


Have you ever found developing end-to-end tests for your ETL application challenging or difficult? Not anymore: this article presents a practical walkthrough of various tools and technologies that tame testing complexity.

End to end testing

End-to-end (e2e) testing is the most comprehensive testing methodology: it tests the entire application in an environment that closely imitates production, with all network communication and data store interactions.

ETL

ETL (extract, transform, and load) is an application that transforms data from a source to a target data store. A data store can be any RDBMS, NoSQL database, or even data files on a local or remote file…
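The extract → transform → load flow can be sketched minimally in Go; here the source slice and target map are illustrative stand-ins for real data stores:

```go
package main

import (
	"fmt"
	"strings"
)

// extract reads raw records from a source (a slice standing in for a
// file or database).
func extract() []string {
	return []string{"alice,30", "bob,25"}
}

// transform reshapes each raw "name,age" record into the target shape.
func transform(raw []string) map[string]string {
	out := make(map[string]string, len(raw))
	for _, line := range raw {
		parts := strings.SplitN(line, ",", 2)
		out[parts[0]] = parts[1]
	}
	return out
}

// load writes the transformed records into the target data store
// (a map standing in for a real table).
func load(target, records map[string]string) {
	for k, v := range records {
		target[k] = v
	}
}

func main() {
	target := map[string]string{}
	load(target, transform(extract()))
	fmt.Println(target["alice"]) // 30
}
```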

Adrian Witas
