Data testing strategy reinvented

Many of us have faced the challenge of handling data during integration and end-to-end test automation. In this article, I look into various data organization patterns, data loading strategies, and data validation requirements.

A typical use case may require ingesting data into a dozen tables; the data can also be cross-database referenced: for instance, a permission rule can be stored in one database, while the actual users and resources live in another. To make things more complicated, tables and datasets can use sequences, auto-increment, or even randomly generated IDs. Another factor to consider is application architecture; for instance, some low-latency applications never interact with a database directly, but instead use a remote centralized caching system that is refreshed from the source database periodically. Finally, as more and more use cases are added over time, the test dataset tends to grow bigger and bigger, so data maintenance may become yet another challenge.

Data organization

I use cohesion to rank various data organization patterns. In a quality assurance context, it measures how well data is organized from a use-case perspective. The following list is ranked from worst to best.

1. Coincidental cohesion is when data has no clear connection with the corresponding use cases. For instance, a backup taken from a production system is used as input for various use cases.

2. Communicational cohesion is when data is defined in one place, where each dataset record carries a description of the corresponding use case, but the actual test plan is stored in a different location. In this scenario, it can be quite time-consuming to identify the input data for an individual use case when there are hundreds of use cases.

3. Functional cohesion is when all data is grouped together because it all contributes to a single well-defined use case.

Setting test data state

Setting the data state in a test context concerns how and when data is seeded for individual use cases.

While it is possible to use raw SQL INSERT statements to load data into a database, this approach may couple test data to a particular database vendor. On top of that, it would not work with NoSQL datastores. An alternative that works well with any kind of database is to use CSV or JSON files as database input. An additional benefit of this strategy is that the same data format can be reused for expected dataset validation.
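The CSV-as-input idea can be sketched in a few lines (a generic Python/SQLite illustration; the `load_csv` helper is invented for this example and is not dsunit's or endly's API):

```python
import csv
import io
import sqlite3

def load_csv(conn, table, csv_text):
    """Build INSERT statements from a CSV header row and load every record.

    Test-only helper: table/column names are interpolated directly into SQL,
    so it must never be used with untrusted input.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0
    columns = list(rows[0].keys())
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.executemany(sql, [[row[c] for c in columns] for row in rows])
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
loaded = load_csv(conn, "users", "id,name\n1,alice\n2,bob\n")
```

Because the loader is driven purely by the header row, the very same CSV text can later serve as the expected dataset when validating the table's contents.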

  1. Setting data prior to test execution.
    In this approach, data for all use cases is loaded into a database before the actual tests take place. This method is sometimes also called fresh start, and involves both database creation and data loading. One challenging aspect of this approach is key generation and maintaining functional cohesion. For instance, if the data uses static IDs, an engineer may be responsible for manually assigning IDs across use cases to avoid collisions, which adds unnecessary overhead.

One way to address this concern is to generate IDs dynamically; however, in this scenario, the testing framework needs the ability to reference the generated IDs back during the data validation process.
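Such a reference-back mechanism can be sketched as a registry that allocates IDs under symbolic keys, which the expected datasets then resolve (an illustrative Python sketch; `IdRegistry` and the key naming are invented, and a simple counter stands in for a real database sequence):

```python
import itertools

class IdRegistry:
    """Allocates IDs and remembers them under symbolic keys, so that
    expected datasets can reference the generated values later."""

    def __init__(self):
        self._sequence = itertools.count(1)  # stand-in for a DB sequence
        self._ids = {}

    def allocate(self, key):
        self._ids[key] = next(self._sequence)
        return self._ids[key]

    def resolve(self, key):
        return self._ids[key]

registry = IdRegistry()
user_id = registry.allocate("user.alice")  # used while seeding data
# later, the expected dataset resolves the same symbolic key
expected = {"user_id": registry.resolve("user.alice"), "role": "admin"}
```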

  2. Resetting data for each use case.

In this strategy, each test use case applies the fresh-start method. Test state setup in this approach is greatly simplified, which guarantees good repeatability and predictability. While this method may seem perfect for applications that connect directly to a database and have a small data footprint, recreating the database and reloading a large amount of data can cause serious delays, especially since it may take hundreds of use cases to reach reasonable end-to-end test coverage.
Applications using a caching tier introduce another testing complexity in this strategy, which may require hooks for cache refresh/invalidation and for refreshing the caching system's snapshot of the source database.
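The per-use-case fresh start can be expressed as a test fixture (a minimal Python/SQLite sketch; `fresh_db`, the schema, and the seed rows are all invented for the example):

```python
import sqlite3

SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"
SEED = [(1, "alice"), (2, "bob")]

def fresh_db():
    """Fresh start: recreate the schema and reload seed data from scratch,
    so every use case begins from an identical, predictable state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?)", SEED)
    return conn

def test_rename_user():
    conn = fresh_db()  # this test's mutations cannot leak into other tests
    conn.execute("UPDATE users SET name = 'carol' WHERE id = 1")
    name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
    assert name == "carol"
```

With an in-memory database the fixture is cheap; the delays discussed above appear when `fresh_db` has to recreate a real database and reload a large dataset for every one of hundreds of use cases.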

I believe there is no one universal strategy that fits all applications; I personally like a hybrid approach that combines a fresh start before test execution with individual test case data setup.

Reducing multi-table ingestion complexity

Dealing with a large data model can be a huge testing bottleneck. Not only does it require an engineer to know the ins and outs of an application's data model, but worse, it takes a lot of time to design test data that goes into dozens of tables. Database views are a great way to simplify data access, hiding complex joins across numerous entities. A similar concept can be used for data ingestion: in this method, we specify a mapping that routes data points from a master, view-like table to the various underlying tables, triggered by the presence of specific columns. In practice, most data model tables can have reasonable default values, so besides auto-generated values, only a handful of columns may need to be set up per use case. Some database testing frameworks, like dsunit, provide multi-table mapping support.
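The routing idea can be sketched as follows (illustrative Python; the mapping format and `route` function are invented for this example and are not dsunit's actual syntax). Each underlying table declares the columns it owns, and a flat, view-like record is split by column presence:

```python
# Invented mapping: each underlying table lists the columns it owns.
MAPPING = {
    "users":       ["user_id", "user_name"],
    "permissions": ["user_id", "resource", "role"],
}

def route(record, mapping):
    """Split one flat, view-like record into per-table rows; a table
    receives a row only when the record carries at least one of its columns."""
    rows = {}
    for table, columns in mapping.items():
        present = {c: record[c] for c in columns if c in record}
        if present:
            rows[table] = present
    return rows

record = {"user_id": 1, "user_name": "alice", "resource": "doc1", "role": "admin"}
rows = route(record, MAPPING)
```

The engineer designs one flat record per use case; omitted columns fall back to table defaults, and only the tables actually touched by the record receive a row.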

Data test validation

Data test validation can be as challenging as data ingestion. Similarly to setting the test state, data validation can be performed as part of each individual use case, or in bulk after all tests have run. The latter is an especially viable option for applications that persist data asynchronously.

Typical data validation involves comparing an expected dataset with the actual data in the database. On top of that, a comprehensive data assertion mechanism should allow:

  • real-time transformation/casting of incompatible data types
  • contains, range, and regular expression predicates on any data point (at any depth level)
  • a consistent way of testing unordered collections
  • testing data in tables without a primary key
  • testing data produced by SQL queries
  • user-defined functions for dynamic, more complex cases, e.g. reading database sequence values
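A few of these requirements, partial matching, regular expressions, and unordered collections, can be sketched with a recursive matcher (illustrative Python; the `~regex` directive below is invented for the example and is not any framework's actual syntax):

```python
import re

def matches(expected, actual):
    """Recursively compare an expected (possibly partial) structure with
    actual data; expected strings starting with '~' act as regex predicates."""
    if isinstance(expected, dict):
        # partial match: only the keys listed in expected are checked
        return all(k in actual and matches(v, actual[k]) for k, v in expected.items())
    if isinstance(expected, list):
        # unordered match: every expected item must pair with a distinct actual item
        if len(expected) != len(actual):
            return False
        remaining = list(actual)
        for item in expected:
            index = next((i for i, c in enumerate(remaining) if matches(item, c)), None)
            if index is None:
                return False
            del remaining[index]
        return True
    if isinstance(expected, str) and expected.startswith("~"):
        return re.search(expected[1:], str(actual)) is not None
    return expected == actual

ok = matches([{"name": "~^al"}, {"name": "bob"}],
             [{"name": "bob", "id": 2}, {"name": "alice", "id": 1}])
```

Here the expected rows omit the `id` column entirely and still match, the two rows match regardless of order, and `~^al` matches `alice` by regular expression.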

Testing tools

An ideal testing framework should address all the discussed concerns, namely by:

  • allowing functional-cohesion-based data organization
  • supporting various ingestion strategies, including multi-table mapping
  • providing complex data testing capabilities with complete and partial data strategies

While there are many great open source frameworks providing some of the discussed functionality, endly has been designed to comprehensively address the data organization and management challenges.