Final report: Annex 2: Stakeholders' processes, systems and data 2A: Overview
Version Control
Date of Issue: 14th June 2005
Version Number: 1.0

Version | Date | Issued by | Status
1.0 | 14/06/2005 | PJ Maycock | Final report
Metadata
Coverage:
Creator: UK Office for National Statistics, General Register Office, Citizen Information Project Team
Date Issued: 13/6/05
Language: English
Publisher: Office for National Statistics, 1 Drummond Gate, London, SW1V 2QQ
Status: Approved by Project Manager
Subject: Data quality, sharing and processing
Subject.category:
Title: Citizen Information Project: Annex 2A: Overview
Contents
1. Preface
2. Related documents
3. Data quality framework
4. Stakeholder processes and systems
5. Data quality assessment
6. Data trial results
7. Current data sharing
1. Preface

1.1.1 The Citizen Information Project Final Report recommends the creation of an adult population register that will deliver benefits by sharing basic contact information (name, address, date of birth etc.) across the public sector. The report recommends that the development of a population register be implemented as part of the ID Cards Scheme by utilising the National Identity Register (NIR), and that in the interim a range of short term data sharing initiatives be explored further.

2. Related documents

2.1.1 Annex 2: Stakeholder processes, systems and data comprises the following documents:
- Annex 2A: Overview (this document)
- Annex 2B: Data quality framework
- Annex 2C: Stakeholder profiles
- Annex 2D: Data trial comparative results
- Annex 2E: Data trial comparative results: Appendices
- Annex 2F: Current data sharing across government
- Annex 2G: Other data quality initiatives

2.1.2 This document summarises the key conclusions from the review of key stakeholders' business processes, systems and data.
3. Data quality framework

3.1.1 Understanding data quality was essential to developing the CIP business cases, both in terms of assessing the quality of existing datasets (for technical options based on existing datasets) and in defining CIP business requirements within the proposed solution of ID Cards.

[Figure: data quality framework diagram. Elements shown: current and planned opportunities; data assessment (questionnaire, data trial); business requirements (ID Cards, short term solutions); business cases (costs, benefits); data quality framework (coverage, validity, currency, completeness, formatting, verification, uniqueness); processing metrics (address cleansing, matching).]

3.1.2 The data quality framework enabled an objective and qualitative definition of the quality of personal contact details. The data characteristics are independent of the data usage and provide a powerful tool for assessing fitness for purpose when considering use of a dataset for a purpose other than the reason it was collected.

3.1.3 It is recommended that the data quality framework be adopted by organisations in assessing and developing CIP Stage 1 data sharing opportunities, and within Stages 2 and 3 in assessing the detailed requirements and benefits arising from data exchange between stakeholders and NIR.

3.1.4 Address currency, i.e. the probability of the citizen being currently resident at the address on the database, is critical to many business cases and is the most difficult to assess, due to the period between the citizen moving and advising any government department.

3.1.5 A theoretical approach has been developed based on assessments of the probability of a citizen moving and the probability of them updating or confirming their address.

3.1.6 ONS migration statistics identify that approximately 10% of the population move each year. Address update frequency is dependent on two categories of events:
- Events that are independent of the citizen moving, e.g. renewal of passport. A probabilistic approach has been developed for business processes involving a limited number of regular events or where the probability of events can be defined.
- Events that are a consequence of moving, e.g. address updated in order to continue to obtain benefit etc. The number of address updates per year, although a useful indicator of address currency, is not definitive, as address currency is also a function of update history.

4. Stakeholder processes and systems

4.1.1 Detailed discussions were held with those organisations making substantial use of contact details within their business processes. Other organisations have been visited or contacted and informed of the scope of CIP.

4.1.2 From a series of workshops an appreciation of stakeholders' business processes was gained, which enabled opportunities for benefits to be identified. The information was also used to assess address currency where there were only limited processes and events, e.g. passport and driving licence processes. There was insufficient opportunity to assess the many processes within HMRC and DWP.

4.1.3 The understanding gained of current systems, data models and data definitions was used to inform the selection of technical options in the early part of the project, and particularly the assessment of options involving existing systems, e.g. DWP, DVLA.
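The probabilistic treatment of address currency outlined in 3.1.5 and 3.1.6 can be illustrated with a minimal sketch. It assumes, purely for illustration, a constant move rate of 10% per year (the ONS figure quoted in 3.1.6), a regular event that confirms the address once every `event_period_years`, and records spread evenly across that cycle. The function names, the `update_on_move` parameter and the 10-year period used in the note below are hypothetical and are not taken from the CIP analysis.

```python
import math


def currency_after(years, move_rate=0.10, update_on_move=0.0):
    """Probability that an address confirmed `years` ago is still current.

    move_rate      -- assumed annual probability of moving (must be below 1)
    update_on_move -- assumed probability that a mover tells this organisation
    """
    # Only moves that go unreported make the stored address stale.
    stale_rate = move_rate * (1.0 - update_on_move)
    return (1.0 - stale_rate) ** years


def mean_currency(event_period_years, move_rate=0.10, update_on_move=0.0):
    """Average currency across a dataset refreshed by one regular event.

    Assumes record ages are uniform between 0 and event_period_years, so the
    result is the average of (1 - stale_rate) ** t over that interval.
    """
    stale_rate = move_rate * (1.0 - update_on_move)
    if stale_rate == 0.0:
        return 1.0  # every move is reported, so every address stays current
    log_keep = math.log(1.0 - stale_rate)
    return ((1.0 - stale_rate) ** event_period_years - 1.0) / (
        event_period_years * log_keep
    )
```

Under these assumptions, `mean_currency(10)` evaluates to roughly 0.62 for a ten-yearly event with no other updates, and raising `update_on_move` shows how updates that are a consequence of moving lift the average. The figures are illustrative only and depend entirely on the assumed rates.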
5. Data quality assessment

5.1.1 A summary of the data quality of contact details within the main central government departments is shown below. This is a summary of the more detailed results of the data trial and the data quality questionnaires, which assess all the data items and characteristics defined within the data quality framework. The questionnaires enabled organisations not able to participate in the data trial to provide an assessment of their data quality at the level at which they had information.

Dataset | Citizen records | Estimated duplicates | Name verification | Address verification | Up to date address | Address validity
DfES Student Loans | 5m | < 2% | High | Initially high > low | Low | High
DVLA (Drivers) | 40m | 0.17% | High | Nil | ~ 62% | High
DVLA (Vehicles) | 18m | 9 - | Medium | Nil | 90-95% | High
DWP (DCI) | 84m | ~ 0.07 as per NIRS2 | Medium | Low | Medium | High
GRO / GROS (Births) | 10m | 0.66% (GRO) | Not applicable | Nil | Not updated | Low
HMRC (CID) | 60m | - | Low | Nil | Medium | High
HMRC (NIRS2) | 72m | 0.07% | Low | Low | Medium | High
UKPS (Main) | 70m | 9 Passport renewals | High | Low | ~ 56% | High
UKPS (PASS) | 24m | 9 Passport renewals | High | Low | 70% > 56% | High
Identity Cards (Requirements) | 40 / 48m (adults) | 0% | High | Low | 90-95% | High

Key: Data trial results; Quality questionnaire response; Target.
6. Data trial results

6.1.1 The detailed analysis of stakeholder datasets provided:
- Evidence of the extent of duplicate identities (e.g. NINO, driving licence no. etc.) referring to a single citizen.
- The extent of occurrence of nominated dates of birth (primarily 1st January) within HMRC and DVLA datasets. This will be a significant characteristic where matching between datasets is required, as date of birth is the most significant field when matching identities and establishing uniqueness in the absence of a unique id.
- Awareness of business processes, data formatting and data definition issues specific to each dataset, e.g. requirement for children to have their own passport; use of default padding characters within DVLA postcodes.
- Evidence that the address quality, in terms of completeness, consistency and formatting, is high for DVLA, HMRC and UKPS, but low for GRO / GRO(S). Address cleansing provides only marginal improvement (good addresses are made better), but can be used to identify tentative matches between datasets where in many cases a cursory manual inspection enables the address to be rationalised.

6.1.2 The comparison of the sample datasets and the correlation and interpretation of the results with other data provided insights into the relative coverage profiles (as below) and demographics. These results enable stakeholders seeking to benefit from data within one of these organisations to identify the coverage profiles available and appreciate the demographic variations.

[Figure: coverage profiles by year of birth (1900 to 2000) for Census 2001, DVLA (Drivers), GRO + GROS (Births), HMRC (NIRS2) and UKPS (PASS). Annotations: Census is the most accurate estimate of the whole population; includes emigrants and excludes children; GRO + GRO(S) includes everyone born in Scotland from 1974 and in England and Wales from 1993; DVLA only includes those with a driving licence; UKPS (PASS) includes new and renewed UK passports since 1998 (60% of total UK passports).]
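As an illustration of the kind of dataset matching discussed in this section, the sketch below classifies a pair of records into the tiers used in the trial: date of birth, name and address; date of birth and name; and date of birth and surname only (the "grey" match). It is a simplified sketch under stated assumptions, not the matching software used in the data trial: the `Record` fields, the crude `norm` normalisation (standing in for address cleansing and fuzzy matching) and all function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    forename: str
    surname: str
    dob: date
    address: str


def norm(text):
    """Crude normalisation: upper-case and keep only letters and digits."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


def classify(a, b):
    """Return 'full', 'moved', 'grey' or 'none' for a pair of records."""
    if a.dob != b.dob or norm(a.surname) != norm(b.surname):
        return "none"
    if norm(a.forename) != norm(b.forename):
        return "grey"   # date of birth and surname agree, forename differs
    if norm(a.address) != norm(b.address):
        return "moved"  # date of birth and full name agree, address differs
    return "full"       # date of birth, name and address all agree


def compare(left, right):
    """Count match types over every pair drawn from two sample datasets."""
    counts = Counter()
    for a in left:
        for b in right:
            kind = classify(a, b)
            if kind != "none":
                counts[kind] += 1
    return counts


def different_address_share(counts):
    """Share of date-of-birth-and-name matches whose addresses differ."""
    matched = counts["full"] + counts["moved"]
    return counts["moved"] / matched if matched else 0.0
```

Comparing every pair is only practical for small samples; a larger exercise would first group records by date of birth. Records carrying a nominated date of birth such as 1st January would also need separate handling, since they weaken the date of birth test on which every tier depends.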
6.1.3 Demographic profiles have a good correlation with the census profiles, but highlight the significantly different local profiles that can occur and the difficulty of accurately extrapolating profiles from the small sample sizes used within the data trial.

[Figure: Example demographic profiles from HMRC dataset trial. Percentage of demographic by year of birth (1900 to 2000); series: UK census profile, s1: Name, s4: London, s6: Wales.]

6.1.4 The matching of the datasets using a range of criteria enabled the following to be derived:
- The commonality between datasets, i.e. proportion of citizens occurring in two datasets, and conversely the nature and extent of non-matching identities, e.g. citizens with a driving licence but without a passport (see the first figure below).
- Citizens with different addresses in different datasets, based on exact and fuzzy matching of:
  - Date of birth, name elements and addresses
  - Date of birth and name elements
  - Date of birth and surname

[Figure: Matching of DVLA (Drivers) and UKPS (extrapolated PASS data). Number of records in sample dataset by year of birth (1900 to 2000); series: Census 2001, Matching identities, UKPS (adjusted), UKPS all records (adjusted), Combined UKPS-DVLA dataset, DVLA (Drivers).]

[Figure: Coverage and matching profile of DVLA (Drivers) and HMRC (NIRS2) datasets. Number of records in sample dataset by year of birth (1900 to 2000); series: Census 2001, Grey match (dob, surname), Full match (dob, name, address), DVLA (Drivers), HMRC (NIRS2), Citizens with different addresses in each dataset.]

6.1.5 These results showed that, when comparing pairs of datasets (i.e. permutations of DVLA, HMRC and UKPS), between 9 and 15% of all records matching on date of birth and name have different addresses, i.e. 9-15% of people have moved and only advised one of the organisations. The unknown element is the extent of records which match on address due to the citizen having advised none of the organisations of an address change.

6.1.6 Consideration was given to the effects of scaling up the data trial results on the matching profile (increased level of false matches). At least 90% of the UK population have a unique date of birth and surname, and this fact was used to provide an upper bound on extrapolation effects of an additional 10% of the matched records.

6.1.7 The conclusion from the trial was that no one dataset met the CIP requirements. The variation in dataset scope, the diversity of data definitions and the manual effort required to resolve grey, false and missed matches were significant contributory factors in the rejection of the technical option based on combining existing datasets and the adoption of the ID Cards solution, based on confirming contact details with each citizen via a registration process.

6.1.8 For ID Cards to achieve the 90-95% address currency needed to realise the defined benefits requires:
- Citizens to be strongly motivated (by incentive and penalty) to maintain their addresses within 3-6 months of moving
- Update processes (particularly address) to be easy to use and accessible
- The capture of as many citizen interactions as possible, i.e. automatic updates from stakeholder systems to NIR

7. Current data sharing

7.1.1 Data sharing may be categorised as:
- Bulk transfer of data or verification of details on a regular basis
- With or without the consent of the citizen
- Case by case transfer of data, e.g. investigating fraud

7.1.2 CIP have produced an inventory of existing and planned cross-government data sharing within the first two categories above (shown below).

7.1.3 This information will enable:
- the range of different legislative gateways to be researched and provide input to any longer term proposals for rationalisation of these gateways
- gap analysis of potential data sharing opportunities that might result in benefit and efficiencies.

[Figure: Contact data sharing and verification within the public sector. Matrix of data suppliers against the same list of organisations: DfES, DVLA, DWP, Electoral Reg, GRO, IND, HMRC, LA, NHS, ONS, UKPS, Courts (DCA), MoD, Police, CRB, Credit Refs, Student Loans, Audit Commission and Other. Key: V = verification of data provided (Yes/No); C = data shared with consent of citizen; the key also distinguishes current data sharing from data sharing initiatives in progress.]