ETL end to end testing with Docker, NoSQL, RDBMS and Big Query.

Adrian Witas
May 24, 2018 · 9 min read

Have you ever found developing end to end tests for your ETL application challenging or difficult? Not anymore: this article presents a practical walkthrough of tools and technologies that tame the complexity of such testing.

End to end testing

End to end testing (e2e) is the most comprehensive testing methodology: it tests the entire application in an environment that closely imitates production, with all network communication and data store interactions.

ETL

ETL (extract, transform and load) is an application that transforms data from a source to a target data store. A data store can be any RDBMS, NoSQL database, or even data files on a local or remote file system. Typical ETL processes run without human intervention and are managed either by a scheduler or as part of a system workflow. Quality assurance of this type of application is just as important as for UI or web-based applications; that said, the testing focus moves heavily towards data.

Docker

Docker is a lightweight technology to create, ship, and run containers, where a container is one of many isolated instances running, as part of operating-system-level virtualization in Linux, on a single host. While it feels like a VM, it is in fact far more streamlined and lightweight; it is possible to run many more containers than VMs on the same hardware. Additionally, a container typically runs only one process, so if a system uses MySQL and Aerospike, there need to be two separate containers. The most important benefit from the end to end testing perspective is that it decouples applications from the underlying hardware, allowing an application and its services to be built and deployed quickly anywhere, on an engineer's laptop or on a dedicated staging box.

ETL application

For the sake of this article, I have developed a simple ETL application: https://github.com/adrianwit/dstransfer and separately e2e testing workflows: https://github.com/adrianwit/dstransfer-e2e.

The application uses SQL-based transformation between two data stores and provides REST endpoints to initiate a transfer and to check job status:

Data transfer:

curl -d "@transfer.json" -X POST http://localhost:8080/v1/api/transfer

@transfer.json

{
  "Mode": "insert",
  "Source": {
    "DriverName": "mysql", ... other connection details
    "Query": "SELECT * FROM source_table"
  },
  "Dest": {
    "DriverName": "bigquery", ... other connection details
    "Table": "target_table"
  }
}

Job status check:

curl http://127.0.0.1:8080/v1/api/tasks

System and application environment setup.

The e2e testing environment is a collection of resources and services hosting the actual application, closely mirroring production. Practically speaking, most applications use some database or caching services and some configuration assets.

Let's assume our application uses Aerospike, MySQL, and Big Query as data stores. We have to make these resources and services ready before the actual testing takes place. One of the first problems to address is the ability to run these services on an engineer's laptop in a similar way as on the production system.

The Docker service comes in handy here. It allows running an image for the required vendor and version. While both Aerospike and MySQL provide various docker images, Google Big Query is only available as a cloud service. In the latter case, to avoid resource contention, each developer should use a BigQuery project or dataset dedicated to e2e testing.
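
For reference, outside of the generated workflow the containerized data stores could also be started by hand with plain docker commands; the container names below are illustrative and the image tags should match whatever versions production uses:

docker run -d --name e2e_aerospike -p 3000:3000 aerospike/aerospike-server
docker run -d --name e2e_mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=dev mysql:5.7

In practice the testing workflow described later starts and inspects these containers for you, so the commands above only illustrate what gets automated.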

Data store schema and static data setup.

Having outlined the application environment preparation, another important task is to determine the initial data store state. It is common practice to use a fresh-start approach, which involves both schema creation and loading static data (i.e., dictionary tables). In this step, an engineer defines a database schema script and an initial data set in either CSV or JSON format.
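
As a simple illustration (the file names and values here are hypothetical, not taken from the repository), a MySQL schema script and a static data file for a dictionary table could look like this:

@schema.sql
CREATE TABLE IF NOT EXISTS dummy_type (
  id   INT PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);

@dummy_type.json
[
  { "id": 1, "name": "basic" },
  { "id": 2, "name": "premium" }
]

The init task of the generated workflow applies scripts and data files like these to each data store before any test runs.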

Application build and deployment.

Another critical e2e testing task is automation of application deployment.
While there is no one unified way of building and deploying an application, as there are many build systems and programming languages, one option to address this concern is to create an application image that can be deployed and managed consistently across various environments.

The Docker service is the perfect tool to both build and deploy an application image on any platform. We just need to define a Dockerfile to specify how to create the application image and a docker-compose.yaml to manage application deployment. Since our ETL application has both of these files, the build and deployment can be accomplished with the following commands.

  • Building app
cd /tmp/ && git clone https://github.com/adrianwit/dstransfer 
cd dstransfer
docker build -t adrianwit/dstransfer:0.1.0 .
  • Starting app
cd /tmp/dstransfer/config/ && docker-compose up  -d
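
For reference, a minimal Dockerfile and docker-compose.yaml for a small Go service like this one might look as follows; this is a simplified sketch, and the actual files in the dstransfer repository may differ:

@Dockerfile
FROM golang:1.10 AS builder
WORKDIR /go/src/github.com/adrianwit/dstransfer
COPY . .
RUN CGO_ENABLED=0 go build -o /dstransfer

FROM alpine:3.7
COPY --from=builder /dstransfer /usr/local/bin/dstransfer
EXPOSE 8080
ENTRYPOINT ["dstransfer"]

@config/docker-compose.yaml
version: '3'
services:
  dstransfer:
    image: adrianwit/dstransfer:0.1.0
    ports:
      - "8080:8080"

The exposed port matches the REST endpoint used earlier (http://localhost:8080).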

Test plan

For each use case, one is expected to define both the input and the expected output together with the transformation rules (a transfer request); the test then validates the expected data set against the actual data loaded into the target system. For ETL, testing focuses mainly on various aspects of data validation, including: structure, type, correctness, and completeness.

Besides typical data transformation validation, a test plan should also account for testing database constraints like NOT NULL and UNIQUE, and for additional exception handling scenarios. For instance, if an app uses compressed JSON data files, the test plan should include at least malformed JSON and corrupted data file cases. An additional level may also include scenarios where a source or destination data store is down or unavailable (e.g., invalid credentials).

End to end testing implementation

Having outlined the testing tasks, one has to introduce the means to implement them.
I will use Endly, an e2e testing framework, for the job. It is declarative, and it works with applications written in any language, providing comprehensive automation and testing capabilities.

Testing project

The first challenging aspect of e2e testing is to create an initial testing project. Endly provides a convenient way to create a test project with its testing workflow generator: http://endly-external.appspot.com/

In the generator, the “external application URL” template is the most flexible choice. It scans the entered 3rd party source code origin (git/svn) for docker dependencies like Dockerfile and docker-compose.yaml to customize the application workflow. In addition, the generator allows multi data store selection with various testing options.


Let's quickly examine the generated project structure:

https://github.com/adrianwit/dstransfer-e2e/tree/master/initial

The first run

Before executing the endly runner, create localhost, MySQL and Big Query credentials using the following link: https://github.com/viant/endly/tree/master/doc/secrets

Additionally, make sure you have the docker service running.

To run e2e testing project workflows use the following:

cd ~/Download/dstransfer/e2e
endly -t='init,test'

After the first run, the 3 data stores are ready for use and loaded with dummy data, and the application is built and deployed as a docker container. This can be verified with the following commands:

docker ps
  • Aerospike
docker exec -it endly_db1  aql
show sets
SELECT * FROM dummy_type
SELECT * FROM db2.dummy_type
  • MySQL
docker exec -it endly_db3 mysql -uroot -p 
USE db3;
SELECT * FROM dummy_type
  • Application status
curl http://127.0.0.1:8080/status

Initial regression workflow

Before designing a test plan, let's take a closer look at the auto-generated regression workflow structure.

  • each use case stores its setup, expected data, and test instructions in a dedicated folder under regression/use_cases/001_xxxxx
https://github.com/adrianwit/dstransfer-e2e/tree/master/initial/dstransfer/e2e/regression
  • regression.csv defines the regression test workflow using the neatly format.

The workflow performs the following tasks:

1. Registers the data store drivers, sets the database IP by inspecting the corresponding docker container, and optionally sets the initial data state: data.yaml

2. Checks if a skip.txt file exists in order to skip a specific use case

3. Sets the initial test data for all data stores if regression/use_cases/xxx/prepare/$db is defined with the corresponding table data (see the sketch after this list)

4. Runs a dummy REST test

5. Verifies data in the data stores only if expected data is defined in regression/use_cases/xxx/expect/$db
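
As an illustration of step 3 (the file path and values here are hypothetical), seeding the MySQL dummy_type table for a given use case is just a matter of dropping a dataset file into that use case's prepare folder:

@regression/use_cases/001_xxxxx/prepare/db3/dummy_type.json
[
  { "id": 1, "name": "dummy 1" },
  { "id": 2, "name": "dummy 2" }
]

The same convention drives step 5: a file with the expected rows placed under regression/use_cases/001_xxxxx/expect/db3/ triggers data verification for that table.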

Test design

The very first step of test design is to define the application input. Technically it can be any data structure; for the sake of this showcase, let's use the following user abstraction with nested struct, date, float, string and int data types.

An example user instance in JSON format:

{
  "id": 1,
  "name": "user 1",
  "visited": "2018-01-15 08:02:23 UTC",
  "perf": {
    "rank": 100,
    "score": 6.5
  },
  "quiz": {
    "1": {
      "id": 1,
      "score": 10,
      "taken": "2018-01-10 16:02:01 UTC"
    },
    "2": {
      "id": 2,
      "score": 3,
      "taken": "2018-01-15 08:02:23 UTC"
    }
  }
}

The corresponding schemas have also been defined, for Big Query and for MySQL:

-- Big Query schema
CREATE OR REPLACE TABLE users (
  id INT64 NOT NULL,
  name STRING NOT NULL,
  visited TIMESTAMP,
  perf STRUCT<
    rank INT64,
    score FLOAT64
  >,
  quiz ARRAY<STRUCT<
    key STRING,
    value STRUCT<
      id INT64,
      score INT64,
      taken TIMESTAMP
    >>>
);

-- MySQL schema
CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  perf_rank INT,
  perf_score DECIMAL(7, 2),
  quiz TEXT,
  visited TIMESTAMP
);

The auto-generated regression workflow already takes care of data preparation and data verification, so the remaining task is to adjust the data and the actual testing procedure with the following steps:

1. Sending a transfer request followed by an “ok” status check, i.e.:

@transfer.json

{
  "Method": "post",
  "Request": "$request",
  "URL": "http://localhost:8080/v1/api/transfer",
  "Expect": {
    "Status": "ok"
  }
}

@request.json

{
  "Mode": "insert",
  "Source": {
    "DriverName": "bigquery",
    "Query": "SELECT * FROM users"
  },
  "Dest": {
    "DriverName": "mysql",
    "Table": "users"
  }
}

2. Waiting for job completion by periodically running a task status request:

@wait.json

{
  "Method": "get",
  "URL": "http://localhost:8080/v1/api/task/${taskId}",
  "Expect": {
    "Status": "done"
  },
  "Variables": [
    {
      "Name": "status",
      "From": "Status"
    }
  ],
  "Repeat": 10,
  "SleepTimeMs": 3000,
  "Exit": "$status:!/running/"
}

Use cases

The final list of ETL use cases includes:

https://github.com/adrianwit/dstransfer-e2e/tree/master/final/dstransfer/e2e/regression/use_cases
  1. Test a basic transfer within the same data store type
  2. Stress test a transfer of 40 000 records
  3. Transfer data from NoSQL to Big Query with the following SQL:
SELECT id, name, visited, perf, ARRAY(quiz) AS quiz FROM users

Note that neither the NoSQL store nor BigQuery supports UNIQUE constraints, so validation is performed with the fromQuery and indexBy directives (see the sketch after this list)

4. Transfer data from NoSQL to RDBMS with the following SQL:

SELECT id, name, TIMESTAMP(visited) AS visited, perf.rank AS perf_rank, perf.score AS perf_score, TO_JSON_STRING(quiz) AS quiz  FROM users

5. Transfer data from Big Query to RDBMS with the following SQL:

SELECT id, name, FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', visited) AS visited, perf.score AS perf_score, perf.rank AS perf_rank, TO_JSON_STRING(quiz) AS quiz FROM users

6. Test a transfer error caused by invalid target data store credentials

7. Test a transfer error caused by a NOT NULL constraint violation
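
For the validation mentioned in use case 3, here is a sketch of how an expected dataset using the fromQuery and indexBy directives might look; the exact file names and queries in the repository may differ:

@regression/use_cases/003_xxxxx/expect/bigquery/users.json
[
  {
    "@fromQuery@": "SELECT id, name FROM users",
    "@indexBy@": "id"
  },
  { "id": 1, "name": "user 1" },
  { "id": 2, "name": "user 2" }
]

indexBy tells the validator how to match expected rows against actual ones in the absence of a unique key, while fromQuery defines the query used to fetch the actual data.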

Working with endly

Preparing the system and data stores, and building and deploying the application, is a fairly time-consuming process. To streamline test development, endly allows running either all or only selected tasks/use cases.

Typical endly usage scenarios include:

  • Running all tasks
endly 
  • System, data store initialization and application build and deployment.
endly -t=init
  • Running all tests
endly -t=test
  • Running individual test
endly -t=test -i=rdbms_basic
  • Running individual test with debug logs
endly -t=test -d=true -i=nosql_to_bigquery
Endly runner output

End to end testing made easy

Many avoid end to end testing due to its intrinsic complexity. As shown in this article, with the right tools this process does not need to be so overwhelming. Once we have the system, data store, and application workflows defined, we can focus on the actual regression testing. Docker and Endly have been great tools for facilitating the testing process. All in all, the initial time investment in harnessing the e2e methodology not only contributes toward better quality software but, most importantly, boosts an engineer's confidence in their deliverables.

End to end testing — next steps

Finally, if you are interested in exploring more end to end testing techniques, here is the list of open source ETL projects using endly.
