Scottish Longitudinal Study
Development & Support Unit
How are synthetic data created?
The two synthetic products that are available from the SLS are produced in different ways:
1. Synthetic spine data
The synthetic SLS ‘spine’ dataset is generated from the 2011 Scotland’s Census Teaching File, available from the National Records of Scotland, together with a series of 2001 to 2011 transition probabilities for key demographic variables taken from the SLS.
The variables included are:
- Age (10 year groups)
- Marital Status
- General Health
- Approximated Social Grade
A series of algorithms is used, first to estimate the number of individuals in each age group undergoing each longitudinal state transition (e.g. Never married in 2001 to Married in 2011, or Good health in 2001 to Good health in 2011), and then to allocate these changes (or non-changes) to the appropriate number of individuals in the Census dataset. The result is a new, plausible, SLS-like dataset that includes data from both 2001 (synthetic) and 2011 (real) for every individual.
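The two steps described above (estimate how many individuals undergo each transition, then allocate those transitions to individuals) can be illustrated with a minimal Python sketch. This is not the SLS algorithm. It assumes, for simplicity, that the transition probabilities have already been re-expressed as the distribution of the 2001 state given the age group and the 2011 state, and it handles a single variable. The function names, field names (`age_group`, `marital_2011`, `marital_2001`) and the probabilities used are hypothetical.

```python
import math
import random


def allocate_counts(n, probs):
    """Split n individuals across states in proportion to probs.

    Uses largest-remainder rounding so the counts are whole numbers
    that sum exactly to n (probs is assumed to sum to 1).
    """
    raw = {state: n * p for state, p in probs.items()}
    # Small tolerance guards against floating-point values such as 2.9999999.
    counts = {state: math.floor(v + 1e-9) for state, v in raw.items()}
    shortfall = n - sum(counts.values())
    # Hand the remaining individuals to the states with the largest remainders.
    by_remainder = sorted(raw, key=lambda s: raw[s] - counts[s], reverse=True)
    for state in by_remainder[:shortfall]:
        counts[state] += 1
    return counts


def assign_2001_states(people, back_probs, seed=0):
    """Attach a synthetic 2001 marital status to each 2011 record.

    people: list of dicts with 'age_group' and 'marital_2011' (hypothetical names).
    back_probs: {(age_group, marital_2011): {marital_2001: probability}}.
    """
    rng = random.Random(seed)
    # Group record indices by (age group, 2011 state).
    groups = {}
    for i, person in enumerate(people):
        key = (person["age_group"], person["marital_2011"])
        groups.setdefault(key, []).append(i)

    result = [dict(person) for person in people]
    for key, indices in groups.items():
        # Step 1: estimate how many individuals in this group had each 2001 state.
        counts = allocate_counts(len(indices), back_probs[key])
        # Step 2: allocate those states at random to individuals in the group.
        labels = [state for state, c in counts.items() for _ in range(c)]
        rng.shuffle(labels)
        for i, label in zip(indices, labels):
            result[i]["marital_2001"] = label
    return result
```

For example, with eight records aged 30-39 who are Married in 2011 and an assumed even split between Never married and Married in 2001, four records receive each 2001 state. Which individuals receive which state depends on the random seed.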
For more detailed information, see ‘A Synthetic Longitudinal Study for the United Kingdom’. The data can be accessed here.
2. Bespoke synthetic extracts
Bespoke synthetic extracts are produced using the R package synthpop in response to user requests.
Variables are synthesised one at a time using sequential regression modelling. Each synthetic variable is modelled separately, taking into account its relationship to the other variables in the real dataset. As a result, analysis of the full synthetic dataset will usually give results very similar to those from the same analysis of the real data.
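The idea of sequential synthesis can be sketched in a few lines of Python. This is an illustration only, not synthpop code (synthpop is an R package with its own choice of models). It assumes two hypothetical numeric variables, `x` and `y`, a simple linear regression as the model, and residual resampling to add noise. The first variable has no predictors and is resampled from the observed values. The second is generated from a model fitted to the real data but applied to the synthetic values of the first.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns intercept, slope, residuals."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return intercept, slope, residuals


def synthesise(real_x, real_y, n, seed=0):
    """Generate n synthetic (x, y) pairs, one variable at a time."""
    rng = random.Random(seed)
    # Variable 1: no predictors yet, so sample observed values with replacement.
    syn_x = rng.choices(real_x, k=n)
    # Variable 2: fit the model on the REAL data ...
    intercept, slope, residuals = fit_line(real_x, real_y)
    # ... then predict from the SYNTHETIC x values, adding a resampled residual
    # so the synthetic y values keep realistic scatter around the fitted line.
    syn_y = [intercept + slope * a + rng.choice(residuals) for a in syn_x]
    return syn_x, syn_y
```

With more variables, the same pattern repeats: each new variable is modelled on the variables already synthesised, so relationships in the real data carry over to the synthetic data without any real record being copied in full.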
The synthetic data are produced from the user’s extract by staff at the SLS-DSU. This can be a complex task, and users are expected to work with staff to facilitate it. See ‘How to access synthetic data’ for details.