Generating synthetic data

There are multiple approaches to generating a synthetic dataset. The generation method could:

Replicate statistical properties of real data (distribution, mean, range)
Add noise to real data (perturbing, shuffling, substituting)
Use machine learning approaches (learn patterns in the real data, then generate a new dataset)

Different methods will achieve different levels of fidelity and privacy.
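
As a rough illustration of the first two approaches, here is a minimal sketch using only the Python standard library. The function names (`synthesise_from_summary`, `perturb`), the sample ages and the choice of a normal distribution are all assumptions made for this example; they are not taken from any of the tools listed below, and real data will often need a different distribution.

```python
import random
import statistics


def synthesise_from_summary(real, n, seed=0):
    """Draw n values from a normal distribution fitted to `real`.

    Assumes the data is roughly normal; values are clipped to the
    observed range so the synthetic data keeps the same min/max bounds.
    """
    rng = random.Random(seed)
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    lo, hi = min(real), max(real)
    return [min(max(rng.gauss(mu, sigma), lo), hi) for _ in range(n)]


def perturb(real, scale, seed=0):
    """Add zero-mean Gaussian noise to each real value, then shuffle the order."""
    rng = random.Random(seed)
    noisy = [x + rng.gauss(0, scale) for x in real]
    rng.shuffle(noisy)
    return noisy


# Hypothetical "real" ages, for illustration only
real_ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 44]

synthetic_ages = synthesise_from_summary(real_ages, n=1000)
noisy_ages = perturb(real_ages, scale=2.0)

# Compare the mean of the real data with the mean of the synthetic data
print(round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
```

The first function only sees summary statistics, so it tends to offer more privacy but lower fidelity. The second starts from the real records, so it tends to stay closer to the original data but gives weaker privacy protection, especially when the noise `scale` is small.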

Before generating synthetic data

Before generating a synthetic dataset, it is important to check whether an existing synthetic dataset already meets your fidelity, quality and privacy requirements. See synthetic-datasets-inventory.md for some existing synthetic datasets. If no suitable dataset exists, consider which generation methodology best suits your use case, and re-use existing methods created by others where possible. Existing software and tools can help automate the generation of synthetic datasets; some are listed below.

Software

The ONS methodology working paper on synthetic data is a good place to start. In Chapter 3, they give an overview of synthetic data software, describing web-based tools and software packages. They cover advantages and disadvantages of each tool separately, as well as providing an at-a-glance comparison table and a software decision chart. The tools and software they cover:

SimPop (R package)
Synthpop (R package)
Sms (R package)
Web-based tools such as Mockaroo
Faker (Python)

A report by ADR UK and UKRI titled 'Accelerating public policy research with easier, safer synthetic data' has an accompanying Python notebook that makes it easy for a researcher to generate low-fidelity synthetic data. Also see their related blog post.

The Alan Turing Institute has a project called QUiPP (Quantifying Utility and Preserving Privacy). See their GitHub repository for software pipelines that generate synthetic data, and this page which explains the project.

TAPAS is a Python Toolbox for Adversarial Privacy Auditing of Synthetic Data. See also their pre-print.

The Synthetic Population Catalyst (SPC) makes it easier for researchers to work with synthetic population data in England.

Synthea™ is a Synthetic Patient Population Simulator based on US data sources: US Census demographics, CDC rates and NIH reports.

Tofu is a Python library for generating synthetic UK Biobank data.

synthcity is a Python library for generating and evaluating synthetic tabular data based on use-case and data modality. Read a summary report of this library here.

DataSynthesizer generates synthetic data that simulates a given dataset, applying differential privacy techniques.
