A dummies' guide to synthesizing useful test data

Synthetic test data is key to both speed and quality in software development. Production data only seems better because we are lazy: it may cover as little as 30% of the test cases we really need, and in our experience the total cost of a production data approach is higher than that of a synthetic one.
Synthetic test data has advantages over production data, such as better control over edge cases. It also enables rapid, repeatable generation of data designed for a specific purpose, like performance testing. These reasons alone should be enough to warrant implementing it, and GDPR compliance comes as a bonus.

Implementing it, however, seems like a bigger problem, and plenty of projects have failed. This talk takes the format of a conversation between a developer and a tester, forming a guide through the journey to synthetic data as experienced in the Norwegian Labor and Welfare Administration (NAV). Torstein Skarra, Head of Test at NAV, will play the role of a sceptical test leader, while Jon Christian, tech lead in the register machine learning team, will elaborate on the technical side of the solutions.

With more than 200 applications at NAV, developed over the last five decades, it seems like this would require a lot of duct tape and glue. However, it surprised us how far you can get with a basic understanding of statistics. The result: statistical synthetic test data that actually works in a complicated world.

We’ll go through how you can categorize your data into simple models, dependent models, and historical models, and how to create synthetic data from each of them. All these models build upon existing open source projects and are quite simple to set up, so you too can get high-quality synthetic test data.
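To give a flavour of the simplest category, here is a minimal sketch of a "simple model": each column is treated independently, its empirical value distribution is estimated from (anonymized) source data, and new rows are sampled from those distributions. The column names and example records are hypothetical illustrations, not NAV's actual schema or tooling.

```python
# Sketch of a per-column "simple model": fit categorical distributions
# from source rows, then sample synthetic rows from them.
# Hypothetical columns ("status", "region") for illustration only.
import random
from collections import Counter

source_rows = [
    {"status": "active", "region": "oslo"},
    {"status": "active", "region": "bergen"},
    {"status": "inactive", "region": "oslo"},
    {"status": "active", "region": "oslo"},
]

def fit_simple_model(rows):
    """Estimate an independent categorical distribution per column."""
    model = {}
    for column in rows[0]:
        counts = Counter(row[column] for row in rows)
        values = list(counts)
        weights = [counts[v] for v in values]
        model[column] = (values, weights)
    return model

def sample_rows(model, n, seed=42):
    """Sample n synthetic rows; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [
        {col: rng.choices(vals, weights)[0]
         for col, (vals, weights) in model.items()}
        for _ in range(n)
    ]

model = fit_simple_model(source_rows)
synthetic = sample_rows(model, 5)
```

A dependent model would replace the per-column sampling with conditional distributions (e.g. region sampled given status), and a historical model would add ordering over time; both are extensions of the same fit-then-sample idea.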