Scottish Longitudinal Study
Development & Support Unit
Census 2011 data quality – what is learned through matching with earlier Census records and other sources
Cecilia Macintyre (National Records of Scotland)
Gillian Raab (University of Edinburgh)
9 October 2015
The data quality of Census 2011 has been investigated as outlined in the NRS Quality Assurance Process with details of comparisons for population and households reported in a quality assurance pack. http://www.scotlandscensus.gov.uk/quality-assurance
The quality assurance of the individual questions was investigated at various stages in the process, and quality issues were summarised alongside the published metadata. An example of the metadata for one of the variables to be studied is given here: http://www.scotlandscensus.gov.uk/variables-classification/marital-and-civil-partnership-status. Additional quality information is being developed for more expert users who will access the microdata either through the safe settings or through the SLS setting.
Issues with the raw census data were identified by checks applied throughout the processing stages. The aim was to identify and correct the majority of the errors prior to processing, but priority was given to the key variables used in processing, including age, sex and numeric variables such as number of rooms. Following data loading, records in error were identified by validation checks and corrected manually.
During later stages of the processing, in particular edit and imputation and the creation of derived variables, additional errors were identified when combinations of variables were considered. This work was carried out while the release 2 tables were being published, so the action taken depended on whether the data had already been published and on the scale of the issue.
The actions taken included:
– checking of individual records and implementation of changes to the database (restricted to unpublished data)
– replacement of inconsistent and missing values using the imputation process
– documentation of data issues for publication in the metadata
A summary of the findings is published as part of the metadata, but more information is recorded in local memos for project documentation purposes. This information will be accessible to the project team, as a member of the Census 2011 team is leading this project. The aim of this research is to identify what additional information about data quality can be gleaned from longitudinal comparisons, with a view to publishing it for users. The study will also provide insights into the data quality and how it appears to have been affected by the stages of processing.
This will include recommendations on better practice that could be incorporated into the plans for processing Census 2021.
The aim of this project is to explore and understand more fully the potential data quality issues with a subset of the questions asked in Census 2011. The questions to be studied are those where there are known relationships with answers to questions in an earlier Census or with other data available for SLS members. The quality assurance of Census 2011 to date has relied on examining relationships internal to the Census 2011 data and on making aggregate comparisons with external sources, but the linkage with previous Censuses and the availability of vital events and NHSCR data will add to the comparisons that can be made. We have selected variables which should be consistent over time, either because they should not change, or because only certain changes would be allowed if the data were accurate. For example, extra qualifications can be added, but old ones should not be removed.
Examples of the variables which will be studied include
- Country of birth
- Marital status
- Ever worked
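The kinds of consistency rule described above (values that should never change, and values that may only change in one direction) could be checked along the following lines. This is only an illustrative sketch: the field names `country_of_birth`, `ever_worked` and `qualifications`, and the record layout, are hypothetical and do not reflect the actual SLS or Census variable names or coding.

```python
def check_consistency(earlier, later):
    """Compare one person's records from two Censuses.

    Both arguments are dicts with hypothetical keys; returns a list of
    descriptions of any rule that the pair of records breaks.
    """
    problems = []

    # Rule type 1: the value should not change at all.
    if earlier["country_of_birth"] != later["country_of_birth"]:
        problems.append("country_of_birth changed")

    # Rule type 2: only certain changes are allowed.
    # Someone who had worked by the earlier Census cannot later be "never worked".
    if earlier["ever_worked"] and not later["ever_worked"]:
        problems.append("ever_worked reverted")

    # Qualifications can be added, but old ones should not be removed.
    lost = set(earlier["qualifications"]) - set(later["qualifications"])
    if lost:
        problems.append("qualifications removed: " + ", ".join(sorted(lost)))

    return problems
```

An empty list means the two records are consistent under these rules; any flagged pair would still need investigation, since the error could lie in either Census.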
To illustrate the type of analysis we will carry out, we use the example of age, which in most cases is derived from date of birth. The linked SLS data will allow age at the 2011 Census to be calculated from the following sources, though not all of them will be available for every SLS member:
- Birth records (baby, mother and father)
- Marriage and civil partnerships records (record for each partner)
- Widowhoods and Deaths (the latter only post-2011 Census date – so limited)
- 1991, 2001 and 2011 Censuses
For those with more than two age values (excluding the 2011 age) we will derive a consensus age from the most common value across the sources. This will enable us to compare the 2011 value with the consensus and to obtain some measure of the pattern of expected differences. We will also be able to look at the quality of the imputed ages at the 2011 Census.
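The consensus-age comparison described above might be sketched as follows. The function names, the source labels and the handling of ties (treated here as "no consensus") are assumptions made for illustration; the project description does not specify them.

```python
from collections import Counter


def consensus_age(source_ages, min_sources=3):
    """Most common age at the 2011 Census implied by the non-2011 sources.

    source_ages maps a (hypothetical) source label to the implied age,
    or None where that source is unavailable. Returns None if there are
    fewer than min_sources values ("more than two" means at least three)
    or if the most common value is tied (an assumption, not from the source).
    """
    values = [age for age in source_ages.values() if age is not None]
    if len(values) < min_sources:
        return None
    ranked = Counter(values).most_common()
    top_age, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between the leading values: no clear consensus
    return top_age


def age_difference(census_2011_age, source_ages):
    """2011 Census age minus the consensus age, or None if no consensus."""
    consensus = consensus_age(source_ages)
    if consensus is None:
        return None
    return census_2011_age - consensus
```

Tabulating `age_difference` across SLS members, separately for reported and imputed 2011 ages, would give one view of the pattern of differences.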
We have included some background variables from the 2011 Census, e.g. socio-economic and health measures, to inform these analyses.
Analysis of other variables will proceed in a similar manner, depending on what data are available in each case.
Web sites for Quality Assurance findings: