Cleaning the data: Who should do What, When? José Antonio Mejía Inter American Development Bank SDS/POV MECOVI Program February 28, 2001


1 Cleaning the data: Who should do What, When? José Antonio Mejía Inter American Development Bank SDS/POV MECOVI Program February 28, 2001

2 Precious resource Better to answer these questions than to have to wonder: Who? Where? With what? When? Why? Integrated surveys cannot afford to lose observations. The integrity of the data must be preserved.

3 Who does What? Data quality control has been pushed back as close to production as possible. The best person to correct errors is the informant. There is still some work that needs to be done at the central office so that the final data files are ready for analysis and distribution: definitely in terms of organization, and to some extent in terms of data quality.

4 Do the job while the questionnaires are still at hand. Go back to them, check and re-check them. The work done at the central office should be one of CLEANING, not of CHANGING. Cleaning is based on facts, on reviewing the questionnaires. Changing is altering the data. I don't think that is correct.

5 Data quality checks after data entry (Who: Data manager in central office) Gather all files with household data and verify that all households are included without duplication. A good system of household identifiers should facilitate this process. Convert household-based files into thematic files that are more useful for analysis. Files should be converted to the format of the software that will be used for analysis. Convert to the formats of popular statistical packages (SAS, Stata, SPSS). Always keep a master version of the files in ASCII.
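
The household coverage check described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the identifier format (cluster number plus household number) and the function name `check_household_ids` are hypothetical and do not come from the presentation.

```python
from collections import Counter


def check_household_ids(expected_ids, file_ids):
    """Compare the household ids found in a data file with the sample list.

    Returns ids that appear more than once, ids that are in the sample
    but not in the file, and ids in the file that are not in the sample.
    """
    counts = Counter(file_ids)
    duplicated = sorted(hh for hh, n in counts.items() if n > 1)
    missing = sorted(set(expected_ids) - set(counts))
    unexpected = sorted(set(counts) - set(expected_ids))
    return {"duplicated": duplicated, "missing": missing, "unexpected": unexpected}


# Hypothetical identifiers: cluster number, then household number in the cluster.
expected = ["001-01", "001-02", "001-03", "002-01"]
in_file = ["001-01", "001-02", "001-02", "002-01", "003-07"]

report = check_household_ids(expected, in_file)
print(report)
```

Each id the report lists is a reason to go back to the questionnaires, not a licence to edit the file directly.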

6 Data quality checks after data entry Check structural consistency of files, so that different thematic files with household data can be matched with each other, with individual data and with community data. Compile basic univariate statistics for each variable. Frequencies for qualitative variables. For quantitative variables, the minimum, maximum and mean values. Issues with logical consistency have been taken care of by a good data entry program and concurrent data entry.
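
As a rough sketch of the checks on this slide, the snippet below tabulates frequencies for a qualitative variable, computes the minimum, maximum and mean for a quantitative one, and lists individual records whose household id has no match in the household file. The variable names (`hhid`, `roof`, `age`) and the helper names are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def frequencies(values):
    """Frequency table for a qualitative variable."""
    return dict(Counter(values))


def summary(values):
    """Minimum, maximum and mean of a quantitative variable, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
    }


def unmatched_keys(child_rows, parent_rows, key):
    """Keys present in the child file that have no match in the parent file."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


# Hypothetical thematic files: one household file, one individual file.
households = [
    {"hhid": "001-01", "roof": "tile"},
    {"hhid": "001-02", "roof": "thatch"},
    {"hhid": "002-01", "roof": "tile"},
]
individuals = [
    {"hhid": "001-01", "age": 34},
    {"hhid": "001-01", "age": 7},
    {"hhid": "001-02", "age": None},
    {"hhid": "009-99", "age": 61},
]

roof_freq = frequencies([h["roof"] for h in households])
age_stats = summary([p["age"] for p in individuals])
orphans = unmatched_keys(individuals, households, "hhid")
print(roof_freq, age_stats, orphans)
```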

7 Data quality checks after data entry Any remaining issues of logical consistency should be left to the analysts, because there is no consensus on how to identify or treat outliers and missing observations. It is best to give out the raw data and let each analyst perform whatever cleaning she or he thinks best.

8 Data quality checks after data entry The statistical office, as an analyst, can do some more data cleaning. If the office chooses to make public its corrections, imputed values, treatment of outliers and missing values, and aggregated variables, it should always do so with the proper labels and tags, plus a well documented account of what was done. And, of course, these data should be distributed in addition to, not instead of, the original data.
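
One way to publish imputed values "in addition to, not instead of" the originals is to leave the original variable untouched and add a separate imputed variable plus a flag. The sketch below assumes median imputation and the suffixes `_imp` and `_flag` purely for illustration; the presentation does not prescribe any imputation method or naming scheme.

```python
from statistics import median


def add_imputed_column(rows, var):
    """Return copies of the rows with an imputed version of `var` and a flag.

    The original variable is never modified. Median imputation is used
    here only as an example of a documented, tagged correction.
    """
    observed = [r[var] for r in rows if r[var] is not None]
    fill = median(observed)
    out = []
    for r in rows:
        new = dict(r)
        was_missing = r[var] is None
        new[var + "_imp"] = fill if was_missing else r[var]
        new[var + "_flag"] = 1 if was_missing else 0  # 1 = value was imputed
        out.append(new)
    return out


# Hypothetical household income records with one missing value.
rows = [
    {"hhid": "001-01", "income": 120},
    {"hhid": "001-02", "income": None},
    {"hhid": "001-03", "income": 80},
]

result = add_imputed_column(rows, "income")
for r in result:
    print(r)
```

Because the flag travels with the data, a user who disagrees with the imputation can drop the flagged values and apply a different assumption.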

9 Data cleaning Minimize dirty data: focus on training, field work, and data entry. Document changes: keep program files with an explanation of all changes made to the raw data, and of the construction of variables. Maintain original data: give users the option to change assumptions. Use robust estimation techniques: use statistics that are relatively insensitive to outliers (e.g. median instead of mean, etc.).
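
A small numeric illustration of why the median is considered robust: a single keying error moves the mean a long way but leaves the median where it was. The figures are invented for the example.

```python
from statistics import mean, median

# Invented monthly incomes for five households.
incomes = [90, 100, 110, 120, 130]

# Same data with one keying error: 130 entered as 13000.
with_error = incomes[:-1] + [13000]

clean_mean, clean_median = mean(incomes), median(incomes)
dirty_mean, dirty_median = mean(with_error), median(with_error)

print("clean:", clean_mean, clean_median)
print("with error:", dirty_mean, dirty_median)
```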

10 Distribution Organize data by section or section sub-part, in one-to-one correspondence with the questionnaire. The file naming convention should be informative, transparent and well documented. Add variable and value labels. Links between variables in the questionnaire and in the data files should be clear and well documented. Distribute original data files with good and complete documentation: users are analysts, not detectives.

11 Key messages Extremely important steps! Bad data is just bad data. Good data badly organized is bad data. Try to catch errors early. Documentation is essential. Cleaning is very different from changing. Learn from mistakes.


AMERICAN JOURNAL OF POLITICAL SCIENCE GUIDELINES FOR PREPARING REPLICATION FILES Version 1.0, March 25, 2015 William G. Jacoby AJPS, South Kedzie Hall, 368 Farm Lane, S303, East Lansing, MI 48824 ajps@msu.edu (517) 884-7836 AMERICAN JOURNAL OF POLITICAL SCIENCE GUIDELINES FOR PREPARING REPLICATION FILES Version 1.0, March 25,

More information

Concepts of Usability. Usability Testing. Usability concept ISO/IS What is context? What is context? What is usability? How to measure it?

Concepts of Usability. Usability Testing. Usability concept ISO/IS What is context? What is context? What is usability? How to measure it? Concepts of Usability Usability Testing What is usability? How to measure it? Fang Chen ISO/IS 9241 Usability concept The extent to which a product can be used by specified users to achieve specified goals

More information

2/6/2018. ECE 220: Computer Systems & Programming. Function Signature Needed to Call Function. Signature Include Name and Types for Inputs and Outputs

2/6/2018. ECE 220: Computer Systems & Programming. Function Signature Needed to Call Function. Signature Include Name and Types for Inputs and Outputs University of Illinois at Urbana-Champaign Dept. of Electrical and Computer Engineering ECE 220: Computer Systems & Programming C Functions and Examples Signature Include Name and Types for Inputs and

More information

CLEANING DATA IN PYTHON. Data types

CLEANING DATA IN PYTHON. Data types CLEANING DATA IN PYTHON Data types Prepare and clean data Cleaning Data in Python Data types In [1]: print(df.dtypes) name object sex object treatment a object treatment b int64 dtype: object There may

More information

Basic Stata Tutorial

Basic Stata Tutorial Basic Stata Tutorial By Brandon Heck Downloading Stata To obtain Stata, select your country of residence and click Go. Then, assuming you are a student, click New Educational then click Students. The capacity

More information

Session 10: Coding and Data Management for Household Interview Variables (Coding/Encoding Data using Excel and SPSS)

Session 10: Coding and Data Management for Household Interview Variables (Coding/Encoding Data using Excel and SPSS) Training on Socioeconomic Monitoring (SocMon) Methodology for Evaluation of Socioeconomics and Marine Resources Utilization at Selected Coastal Communities in Myanmar Mawlamyine University, Mon State and

More information

Intermediate Programming, Spring 2017*

Intermediate Programming, Spring 2017* 600.120 Intermediate Programming, Spring 2017* Misha Kazhdan *Much of the code in these examples is not commented because it would otherwise not fit on the slides. This is bad coding practice in general

More information

Liquibase Version Control For Your Schema. Nathan Voxland April 3,

Liquibase Version Control For Your Schema. Nathan Voxland April 3, Liquibase Version Control For Your Schema Nathan Voxland April 3, 2014 nathan@liquibase.org @nvoxland Agenda 2 Why Liquibase Standard Usage Tips and Tricks Q&A Why Liquibase? 3 You would never develop

More information

Corel Ventura 8 Introduction

Corel Ventura 8 Introduction Corel Ventura 8 Introduction Training Manual A! ANZAI 1998 Anzai! Inc. Corel Ventura 8 Introduction Table of Contents Section 1, Introduction...1 What Is Corel Ventura?...2 Course Objectives...3 How to

More information

Economic and Social Council

Economic and Social Council United Nations Economic and Social Council Distr.: General 27 January 2014 ECE/CES/2014/1 Original: English Economic Commission for Europe Conference of European Statisticians Sixty-second plenary session

More information

PSS718 - Data Mining

PSS718 - Data Mining Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the

More information

CATCH ERRORS BEFORE THEY HAPPEN. Lessons for a mature data governance practice

CATCH ERRORS BEFORE THEY HAPPEN. Lessons for a mature data governance practice CATCH ERRORS BEFORE THEY HAPPEN Lessons for a mature data governance practice A guide to working with cross-departmental teams to establish proactive data governance for your website or mobile app. 2 Robust

More information

GUIDE TO USING THE 2014 AND 2015 CURRENT POPULATION SURVEY PUBLIC USE FILES

GUIDE TO USING THE 2014 AND 2015 CURRENT POPULATION SURVEY PUBLIC USE FILES GUIDE TO USING THE 2014 AND 2015 CURRENT POPULATION SURVEY PUBLIC USE FILES INTRODUCTION Tabulating estimates of health insurance coverage, income, and poverty from the redesigned survey TECHNICAL BRIEF

More information

DATA CLEANING & DATA MANIPULATION

DATA CLEANING & DATA MANIPULATION DATA CLEANING & DATA MANIPULATION WESLEY WILLETT INFO VISUAL 340 ANALYTICS D 13 FEB 2014 1 OCT 2014 WHAT IS DIRTY DATA? BEFORE WE CAN TALK ABOUT CLEANING,WE NEED TO KNOW ABOUT TYPES OF ERROR AND WHERE

More information

User-Centered Design Process

User-Centered Design Process KAIST Fall 2018 CS408E/F: Computer Science Project User-Centered Design Process 2018.08.27 Juho Kim CS408 Project-oriented course in which students design, develop, test, validate, and present a software

More information

Using NHGIS: An Introduction

Using NHGIS: An Introduction Using NHGIS: An Introduction August 2014 Funding provided by the National Science Foundation and National Institutes of Health. Project support provided by the Minnesota Population Center at the University

More information