New technologies of data collections supporting transformation from traditional to combined and register-based censuses (PL) Janusz Dygaszewicz Central Statistical Office of Poland THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat
Long-term history (tradition) of Polish statistics 1789 - first countrywide population census in Poland (based on registers!! ) The register of nobility and burghers The parish register of births and deaths 1918 - establishment of Central Statistical Office (CSO) Almost 100 years of continued existence and development of CSO Poland. 2
In 2011 - Mixed Model for Population and Housing Census Combining Census a combination of data from administrative sources (full survey covering basic demographic variables) with data acquired from ad-hoc 20% sample survey. 3 3
Data collection channels in 2011 Census Round Administrative Sources Including spatial data reference registers Self-enumeration by Internet CAII (CAWI) Computer Assisted Internet Interview Telephone Interview Face-to-face Interview with respondents executed by the census enumerators CATI - Computer Assisted Telephone Interview (Call Center) Registered on hand-held terminals with usage GPS and GIS service CAPI - Computer Assisted Personal Interview 4 4
CAxI CAXI CAxI CAII/CAWI - Computer Assisted Internet Interview, CAPI - Computer Assisted Personal Interview, CATI - Computer Assisted Telephone Interviewing. 5 5
NSP 2011 full scale survey (short form - 15 questions) Full-scale survey: 1) Data from administrative register Master Record 2) Data acquired using the CAII method 3) Data supplemented using CATI and CAPI method Administrative and non-administrative systems The CAII method Data sumplementation (CAPI and CATI) 6 6
NSP 2011 sample survey (long form - about 100 questions) Sample survey: 1) Data from administrative register Master Record 2) Data acquired using: The CAII method The CAPI method 3) Data supplemented using CATI method The CAPI method Sample survey Administrative and non-administrative systems The CAII method The CATI method The supplementation of data 7 7
Census organisation 8 8
The project timetable National Population and Housing Census 2011 2007 Start of preparation IV-V.2010 Trial census NSP 2011 IV-VI.2011 NSP 2011 IX-X.2009 Trial Census PSR 2010 Agriculture Census 2010 IX-X.2010 PSR 2010 2013 End of project 9 9
Organization NSP 2011 16 central controllers 32 managers of the WCZS and the WCC 571 voivodship controllers 642 statistical interviewers around 2800 gmina leaders around 18000 census enumerators 10 10
DATA ACQUISITION 11 11
On-line channels for data collection System Architecture Census Completeness Management Map Server Operational Microdata Base CAPI CAII CATI 12 12
The architecture of a CAII/CAWI method ZKS OBM Online electronic questionnaire system Offline questionnaire Downloading the application file Internet Browser Online Email Electronic media Online method Offline method 13 13
Self-enumeration by Internet filling the questionnaire by the respondent Identification Used to confirm the identity of the respondent. Entering identification data in a questionnaire(f.ex.: PIN, NIP, first name, last name) or additional authentication qualities (f.ex. a place of birth, mother s maiden name) Establishing a password which jointly with PIN was the basis of authentication within 14 days 14 14
Self-enumeration by internet course of action in completing the survey by person In case of full scope survey person fulfilled his/her questionnaire, and could also specify adults who lived in the same apartment (proxy) In case of sample survey the dwellings questionnaires were completed in the first place. Later were fulfilled personal questionnaires. After authorization the census form was available for 14 days for multiple access. 15 15
Typical person using selfenumeration A man Age: about 24 years old A city inhabitant Secondary degree 16 16
17 17
CATI Computer Assisted Telephone Interview scheduled as the first or the second (following CAII) channel of collecting data; working posts of telephone interviewers located in separated Call Center studies; telephone interviewers provided with professional equipment. 18 1 8
CATI in National Census 2011-16 distributed Call Centers - 320 SIP Stations - 642 telephone interviewers/consultants - technology Interactive Inteligence - utility applications prepared by CSO in Poland The system worked from 8.04 to 30.06 (84 days) in hours 8.00 20.00/2 shifts Hotline 0 800 800 800 or (22) 44 44 777 19 19
The most significant functionality of Call Center Confirming the identity of the interviewer/census enumerator Hotline Interviewing Arranging visits by census enumerators 20 20
CAPI Computer Assisted Personal Interview the third channel of data collection in the case of failure to obtain a complete set of data via CAII and CATI channels direct interviews in households (first or second channel) where such a way of proceeding results from adopted methodology or whose members has not expressed consent for a telephone survey 2 21 1
The architecture of a CAPI method ZKS OBM Dispatching application - server - Communication server Map server Management of a terminal WAN CSO Dedicated APN Mobile network Dispatching application - client - Mobile application Cryptographic SIM card Module GPS 22 22
Field enumeration management Supervisor Regional (NUTS2) level Responsibilities: Address Point and Census Area management Enumerator monitoring Census Progress Localization and trail Emergency situation management Providing help for enumerators Providing necessary information to enumerators 23 23
24 24
25 25
HH - Mobile terminal with GPS HTC Touch Pro2 Screen touch-screen size 3,6 resolution 480 x 800 pixels sliding, tilting - convenient usage sliding, 5-rows QWERTY keyboard GSM/GPRS/EDGE/UMTS/HSPA GPS module camera - 3,2 MP Windows Mobile 6.5 26 26
Enumerator Visiting all assigned holdings Filling electronic questionnaires Daily synchronisation Contact with the supervisor in terms of task scheduling Adding newly identified holdings 27 27
Enumerator 28 28
Enumerator 29 29
Enumerator Alarm procedure In emergency situations, enumerators have a possibility of sending an alarm signal to their supervisors Alarm notice is sent to the supervisor application and via SMS to the supervisor 30 30
ADMINISTRATIVE REGISTERS 31 31
The use of administrative sources in censuses The usage of administrative sources in the census: direct source of research data, source of information to create a list of entities covered by the census frame (address-housing survey), in addition, a source of information for : imputation, data estimation, comparison the quality of the data. 32 32
The conditions of using administrative data by the Polish official statistics Legal issues, related to the access and use of data, including especially the identifiable and personal microdata Public relations public trust and acceptance of administrative data linkage and re-use, primarily collected for other purposes, in official statistics Organisational issues, related to the statistical survey management process and to the work organisation changes Methodological issues methods ensuring optimal quality of output data, including definition consistency of the official statistics system with the information systems of public administration Technical issues, concerning the use of new technologies, including especially the information and communication technologies 33
Registers - data acquisition Data Owners: Ministry of Finance, Ministry of Interior and Administration, Ministry of Justice, Agricultural Social Insurance Fund, National Health Fund, Agency for Restructuring and Modernisation of Agriculture, Agricultural and Food Quality Inspection, Agency for Geodesy and Cartography, State Fund for Rehabilitation of Disabled Persons, County Offices, Commune Offices, Regional Offices, Telcoms, Energy Suppliers, Office For Foreigners, Social Insurance Institution, Housing Managers, 34 34
Key administrative sources used in the Polish official statistics population registration system tax system economic activity information system farming activity information system social security system social insurance system health insurance system real estate information system education information system vehicle and vehicle owners information system 35
Data quality -measures- 1. Measuring the quality of administrative registers timeliness of data methodological compatibility completness identification standards used in the registry usefulness compatibility of data in administrative sources to data obtained in the study/survey 2. Measuring the quality in processing of data registers excessive coverage error rate incomplete coverage error rate subjective indicator of completness objective indicator of completness imputation rate data correction index integration data from various sources index 36 36
CENSUS Data Processing Infrastructure 37 37
Data processing Stages: Data acqusition, Source data collection and preparation, Source data loading and correction, Census Frame preparation, Master Record generation, CAxI data colection, Golden Record generation, Export to Analitycal Microdatabase, Data processing, Information Analisys, Product generation, Dissemination. 38
Architecture solution Registry 1 Registry 2 Registry n XML TXT Stage I Preparatory works Files ETL Tools CAXI Stage II Data acquisition Statistical Files Operational Microdata Base Questionnaires Golden Record Stage III Results compilation Analytical Microdata Base SDMX Metadata Metadata Metadata XML Metadata server Portal 39 39
Administrative sources The process of data processing started from collecting administrative sources from data administrators registers in the field defined by the appropriate legal Acts. The Polish statistics has the right to access all unit data stored in information sources of the public and commercial sectors. The obtained data include necessary identifiers and personal data supporting the process of merging (linking) 40 unit data from different sources.
Operational Microdata Base The basic structure of data in the OMB is a layer. It is a set of records, each of which relates to one census unit (a person, a dwelling, a household). The records include the values of census attributes derived from source data collected from respondents or defined in a different way (e.g. in the process of imputation). Layer can differ between one another with a set of attributes whose values are presented in a given layer. In the first step of processing, before beginning the census, on the basis of source sets and the census frame the layer referred to as a Master Record was created, consisting of the initial value of the selected subset of census attributes. The values from this layer were transferred to the CAPI, CATI and CAII processes for personalization electronics questionnaires. 42 42
Operational Microdata Base After completing the census with the use of the CAPI, CATI and CAII processes, on the basis of information collected in the processes proper layers in the database are created. The layer which have already been saved in the system can serve as the basis for creating new, internal layer in which new attributes (derived from the existing ones) can be added. Collection and processing of data was done in the OMB, which contained personalized data on the census unit, together with the value of characteristics obtained from administrative sources and from other channels (CAII, CATI, CAPI) 43
Operational Microdata Base The Operational MicroData Base (OMB) - system included hardware-system-tool infrastructure (computer hardware, system software, tool software) and application software (computer programs that are the result of programming work). This base enabled the inclusion of data transmitted in electronic form through four informational channels by entities and to conduct further data processing. In the OMB processes connected with the control, correction, and linking of data, up to their complete cleansing took place. Next, depersonalized data (as the Golden Records) were transferred to the Analytical Microdata Base (AMB). 44 44
Data acquisition Registry 1 XML TXT CAXI Questionaries Registry 2 Registry n Files ETL Tools Statistical Files Operational Microdata Base Golden Record Analitycal Microdata Base SDMX Metadata Metadata Metadata XML Metadata server Portal 45 45
Files format: Data acquisition Flat files, XML files, Local Databases XML files integration, 46 46
TRANSFORMATION TO STATISTICAL REGISTER 47 47
Source data collection and preparation Registers loading into data laboratory envroiment, Denormalization, Standarization, Deduplication, Validation, Data completion, Vocabulary validation and automatic correction, Statistical files (register) generation, 48 48
Source data collection and preparation Registry 1 XML CAXI TXT Questionaries Registry 2 Registry n Files ETL Tools Statistical Files Operational Microdata Base Golden Record Analitycal Microdata Base SDMX Metadata Metadata Metadata XML Metadata server Portal 49 49
Transform data Data processing in the production environment consisting of: profiling create a raport on the data quality, unification/standardization of data, parsing (separation) or combining variables, standardization with schemes, conversion, validation, deduplication, data integration. 50 50
ETL process scheme PROFILING A Register B Register C Register E X T R A C T D A T A T R A N S F O R M D A T A STANDARDIZATION PARSING CONVERSION VALIDATION DEDUPLICATION INTEGRATION STATISTICAL REGISTER L O A D D A T A ANALYTICAL BASE 51 51
The data transformation model from administrative sources into statistical data sets Preliminarypreparation of an administrative register 1) Importing 2) Mapping 3) Simple deduplication 4) Denormalisation Data transition The processing of identification and address variables The processing of substantive variables Validation and adjustment Integration Complex deduplication The selection of statistical variable value frommany registers STATISTICAL DATA SET 52
The data transformation model from administrative sources into statistical data sets Preliminary preparation of an administrative register Importing data sets Mapping excluding excessive variables from further processing, and labelling the remaining ones in accordance with an agreed standard Simple de-duplication removing identical records from the data set De-normalisation creating a flat table with a specific substantive content 53
The data transformation model from administrative sources into statistical data sets Data transition cleaning obtaining a set of coherent data, containing standardised variables which are accurate in substantive terms Validation and adjustment verifying the processing outcome based on suitable rules, i.e. checking the meeting requirements for data coherence, accuracy and standardisation generating a data quality improvement report further processing or re-cleaning the data set 54
The data transformation model from administrative sources into statistical data sets Data set integration with the reference set (e.g. with a previously given list of units individuals, real estate, farms, enterprises the statistical business register or an administrative register) Complex de-duplication eliminating recurrent units parallel but not identical obtaining one record for a given entity without losing other information Selection of the statistical variable value from multiple registers 55
The data transformation model from administrative sources into statistical data sets A statistical data set comprises a certain population determined in the reference table, together with a number of entity-specific descriptive variables, the values of which represent the synthesis of information included in multiple registers the outcome of data quality improvement, upon transformation of the source data set into a statistical one, may be illustrated statistically 56
Census Frame preparation Citizens, buildings and dwelling list preparing, Citizens, buildings and dwelling list and statistical data integration, Census Frame preparing. Goal Frame Preparation, Random Sample preparation, 57 57
CAxI Registry 1 XML CAXI TXT Questionaries Registry 2 Registry n Files ETL Tools Statistical Files Operational Microdata Base Golden Record Analitycal Microdata Base SDMX Metadata Metadata Metadata XML Metadata server Portal 58 58
GOLDEN RECORD 59 59
Golden Record generation Registry 1 XML TXT CAXI Questionaries Registry 2 Registry n Files ETL Tools Statistical Files Operational Microdata Base Golden Record Analitycal Microdata Base SDMX Metadata Metadata Metadata XML Metadata server Portal 60 60
Golden Record generation Integration with Census Frame and CAxI data, Validation, Correction, Operational Imputation, Transfer proper values to Golden Record, Registers 1..n CAxI OMB Layers Golden Record AMB 61
Export to Analitycal Microdata Base Transition Tables Preparing, Golden Records anonymisation, Transfer to Analitycal Microdatabase, 62 62
Export to Analitycal Microdata Base Registry 1 XML CAXI TXT Questionaries Registry 2 Registry n Files ETL Tools Statistical Files Operational Microdata Base Golden Record Analitycal Microdata Base SDMX Metadata Metadata Metadata XML Metadata server Portal 63 63
The main objectives of AMB Preparation and dissemination census data for statistical analysis Dissemination of analytical and reporting platform for development of census data collected and preparation of statistical products for national and international users. Supporting for process of creation and dissemination statistical products 64 64
Analytical Microdata Base cont. In the Analytical Microdata Base took place the following processes: data integration, validation, automatic correction, imputation, calibration, creation a new secondary variables and new statistical units (i.e. families and households). ETL processes in OMB and AMB were repeatedly executed until the approval by the methodologists. In the next step, processes concerning creation multidimensional objects - OLAP Cubes (domestic needs), and Hypercube (60 HC) and Quality Hypercube (in accordance with international requirements - Eurostat) were made. 65 65
Concept solution GOLDEN RECORD NSP 2011 ETL - transfer to base - validation - correction - imputation - weighting - calibration - transformation - quality indicators, - calculation of the secondary variables, APPLICATION OF INTERNAL USERS ANALYTICAL MODEL MULTIDIMENTIONAL STRUCTURES APPLICATION OF EXTERNAL USERS REPORTS, ANALYSES DISSEMINATION DATA SDMX STATISTICAL PRODUCTS FILES TO EUROSTAT REPOSITORY OF DATA AND METADATA 66 66
Data processing in AMB In order to realize the process of "Data Processing, a series of ETL jobs were divided into the following steps (using the SAS Data Integration Studio tools): S00: load source data from Golden Record NSP 2011 (8 tables: Buildings, Dwellings, Collective living quarters, Persons, Emigrants after 2002, Homeless, Households, Families), S01: copy the relevant objects, Households, Families, S02: execution of the automatic data correction, S03: Validation tables of Golden Record, S04: execution of manual data correction, S05: weighing data S06: data calibration and imputation 67 67
Data processing in AMB S07: control reports S08: calculation of derived variables S09: calculating rules on the basis of dictionaries S09a: adding the secondary variables S10: transformation data to the analytical model (facts tables and dimension tables, cubs for national purposes) S11: creation and loading hypercubes for Eurostat S12: validation of hypercubes and creation quality hypercubes for Eurostat S13: transformation of operational metadata S14: enumeration of quality indicators S15: preparation aggregates for the Geostatistics Portal 68 68
Instead of a conclusion Census in 2002 Census 2011 180 thousands of census enumerators 120 mln of questionnaires 18 thousands of census enumerators 0 questionnaires 1 000 tons of papers At the end shredding census questionnaires 0 tons of papers ca. 50 mln less better data the more reliable results statistical surveys in the future 70
Methodological CSO work on the use of administrative registers for the 2021 National Census (NSP 2021) conducted within: Improvement of the use of administrative sources (ESS.VIP ADMIN WP6 Pilot studies and applications) 1.X.2015-30.IX.2016 Improvement of the use of administrative sources (ESS.VIP ADMIN WP6 Pilot studies and applications) 03.10.2016-02.04.2018 Initial identification of administrative registers as data sources for the 2021 National Census (NSP 2021) Improvement of: method for assessing the quality of administrative registers assumptions for creating identifiers for linking data from administrative registers assumptions for getting variable values using many administrative registers assumptions to derive new variables from the administrative registers methods of building a list of address and dwelling linking information about people, dwellings and buildings methods of data integration from different sources methods of data imputation and calibration using data from administrative registers and surveys methods of generalizing results based on combinations of data from different sources at different levels, territorial cross-section methods of assessing the quality of estimates - generalized results methods of improving the quality of the spatial location of buildings in a statistical frame. 71
Thank you for your attention e-mail: j.dygaszewicz@stat.gov.pl 72