Preparing for Data Analysis

Prof. Andrew Stokes, March 27, 2018

Managing your data
- Entering the data into a database
- Reading the data into a statistical computing package
- Checking the data for errors and inconsistencies
- Data cleaning
- Preparing the data for analysis

Cautionary Note
- Never alter the raw data directly!
- All manipulations should be performed in the statistical computing software

Data checks
- Range checks to identify out-of-range values (e.g. age)
- Cross-checks to identify inconsistencies between values (e.g. males that are pregnant)
- Cross-tabulations
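
A minimal sketch of the three checks, written in Python for illustration (the slides assume R). The records, field names, and the 0 to 120 age range are hypothetical.

```python
from collections import Counter

# Hypothetical records; field names are illustrative, not from the slides.
records = [
    {"id": 1, "age": 34, "sex": "F", "pregnant": 1},
    {"id": 2, "age": 210, "sex": "M", "pregnant": 0},
    {"id": 3, "age": 45, "sex": "M", "pregnant": 1},
]

AGE_MIN, AGE_MAX = 0, 120  # assumed plausible range for age in years


def range_check(rows, field, low, high):
    """Return ids of rows whose value falls outside [low, high]."""
    return [r["id"] for r in rows if not (low <= r[field] <= high)]


def cross_check(rows):
    """Return ids of rows recorded as both male and pregnant."""
    return [r["id"] for r in rows if r["sex"] == "M" and r["pregnant"] == 1]


out_of_range = range_check(records, "age", AGE_MIN, AGE_MAX)
inconsistent = cross_check(records)

# Cross-tabulation of sex by pregnancy status: counts per combination.
crosstab = Counter((r["sex"], r["pregnant"]) for r in records)
```

The checks only report ids; they do not change any values, so the raw data stay untouched.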

Data checks continued
- Scatter plots and box plots to compare groups and identify outliers
- Proportion of responses that are missing, "Other" or "Don't Know"
- Consistency across questions that elicit similar information
- Note: these preliminary checks can be run before you have finished collecting your data
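
The proportion of missing, "Other" and "Don't Know" responses could be tallied roughly as follows. This is an illustrative Python sketch (the slides assume R); the response list and the use of `None` for a missing answer are assumptions.

```python
from collections import Counter

# Hypothetical responses to one question; None marks a missing answer.
responses = ["Yes", "No", None, "Don't Know", "Other", "Yes", None, "No"]


def response_shares(values, flagged=("Other", "Don't Know")):
    """Return the share of missing and flagged responses among all values."""
    counts = Counter("Missing" if v is None else v for v in values)
    total = len(values)
    return {key: counts[key] / total for key in ("Missing", *flagged)}


shares = response_shares(responses)
```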

Data cleaning
- The goal of data cleaning is to resolve problems that were identified during the data checking process
- The aim is to make the data as high quality as possible for analysis
- If a problem cannot be resolved, the incorrect data can be assigned a missing code
- Remember: never alter the raw data file. All changes should be made in the R script
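
One way to honor the "never alter the raw data" rule is to clean a copy in code. The sketch below is in Python for illustration (the slides assume R); the records, the age bounds, and the choice of `None` as the missing code are assumptions.

```python
import copy

MISSING = None  # assumed missing code; a study might use a sentinel such as -9

# Hypothetical raw records as read from the database.
raw = [{"id": 1, "age": 34}, {"id": 2, "age": 210}]


def clean(rows, field, low, high):
    """Return a cleaned copy; out-of-range values become MISSING."""
    cleaned = copy.deepcopy(rows)  # the raw rows are never modified
    for row in cleaned:
        value = row[field]
        if value is not None and not (low <= value <= high):
            row[field] = MISSING
    return cleaned


cleaned = clean(raw, "age", 0, 120)
```

Because the change lives in the script, rerunning it on the untouched raw file reproduces the cleaned dataset.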

Variable naming and coding conventions
- Variable names should be descriptive (e.g. birthwt for birth weight)
- Questions that have categories of answers should be assigned numeric values in a systematic fashion
- For example, yes/no questions should be coded 0/1 (0=No, 1=Yes)
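
A small sketch of systematic 0/1 coding, in Python for illustration (the slides assume R). The variable name `smoker` and the raw answers are hypothetical; unrecognized answers are left as `None` so they can be reviewed rather than guessed.

```python
YES_NO_CODES = {"no": 0, "yes": 1}  # 0=No, 1=Yes


def code_yes_no(answer):
    """Map a yes/no answer to 0/1; missing or unrecognized answers give None."""
    if answer is None:
        return None
    return YES_NO_CODES.get(answer.strip().lower())


# Hypothetical raw answers with inconsistent case and spacing.
smoker_raw = ["Yes", "no", " YES ", None, "maybe"]
smoker = [code_yes_no(a) for a in smoker_raw]
```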

Routinely back up your data and code
- Use a systematic approach to naming backup files (e.g. with the date of backup included in the filename)
- In addition to backing up your data, you should routinely create backups of your R scripts
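
A dated naming scheme could look like the sketch below, written in Python for illustration (the slides assume R). The `name_YYYY-MM-DD.ext` pattern and the function names are assumptions, not a convention stated in the slides.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_name(path, on=None):
    """Build a backup filename with the ISO date inserted before the suffix."""
    on = on or date.today()
    p = Path(path)
    return f"{p.stem}_{on.isoformat()}{p.suffix}"


def back_up(path, backup_dir, on=None):
    """Copy a data file or script into backup_dir under its dated name."""
    src = Path(path)
    dst = Path(backup_dir) / backup_name(src, on)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copies contents and file metadata
    return dst
```

For example, `backup_name("survey.csv", date(2018, 3, 27))` gives `survey_2018-03-27.csv`. ISO dates sort chronologically in a file listing.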

Preparing data for analysis
- The goal of this step is to create a final analytic dataset
- The raw data as entered into the database is usually not sufficient
- You'll both be creating new variables and recoding existing ones
- In some cases, you may also need to merge multiple sources of data
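
Merging two sources on a shared identifier might be sketched as follows, in Python for illustration (the slides assume R). The two tables and their fields are hypothetical, and the sketch assumes each id appears at most once in the second source.

```python
# Hypothetical sources sharing an "id" column.
demographics = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
labs = [{"id": 1, "glucose": 5.4}, {"id": 3, "glucose": 6.1}]


def left_merge(left, right, key):
    """Keep every row of `left`; add fields from `right`, None when unmatched."""
    lookup = {row[key]: row for row in right}  # assumes unique keys in `right`
    right_fields = {f for row in right for f in row if f != key}
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        merged.append({**row, **{f: match.get(f) for f in right_fields}})
    return merged


analytic = left_merge(demographics, labs, "id")
```

Participants without a match keep their row with the added fields set to missing, so the number of rows in the analytic dataset is easy to verify.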

Organizing your code
- You can have one script both for data pre-processing and analysis
- Or you can have two or more scripts (recommended)
  - The first script can be used to generate the final analytic data set
  - The second script can then be used to implement analyses

Data dictionary
- The data dictionary provides a map between the questionnaire and the data files
- It is a record of how the data are structured and will be a useful resource as you prepare for analysis
- It should contain the following information:
  - Name and description of each variable
  - Data type (e.g. numeric or text; if numeric, continuous, binary or categorical)
  - Coding (e.g. 0=No, 1=Yes)
  - Question number to which the variable relates
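
A data dictionary can also be kept in machine-readable form and used to check the data. The sketch below is in Python for illustration (the slides assume R); the entries, question numbers, and the `undocumented_values` helper are hypothetical.

```python
# Hypothetical dictionary entries carrying the four items listed above.
DATA_DICTIONARY = {
    "birthwt": {
        "description": "Birth weight in grams",
        "type": "numeric, continuous",
        "coding": None,
        "question": "Q12",
    },
    "smoker": {
        "description": "Current smoker",
        "type": "numeric, binary",
        "coding": {0: "No", 1: "Yes"},
        "question": "Q7",
    },
}


def undocumented_values(rows, dictionary):
    """List (row index, variable, problem) for anything the dictionary lacks."""
    problems = []
    for i, row in enumerate(rows):
        for name, value in row.items():
            entry = dictionary.get(name)
            if entry is None:
                problems.append((i, name, "not in dictionary"))
            elif (entry["coding"] is not None and value is not None
                  and value not in entry["coding"]):
                problems.append((i, name, f"uncoded value {value!r}"))
    return problems


rows = [{"birthwt": 3200, "smoker": 1}, {"birthwt": 2900, "smoker": 2}]
problems = undocumented_values(rows, DATA_DICTIONARY)
```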

Creating new variables
- Calculated variables may combine information from two or more individual variables
- For example, body mass index (BMI) is weight in kilograms divided by height in meters squared
- When you compute such a variable, you may first have to convert the raw variables into the correct units
- Some variables may use external data
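
The BMI calculation with a unit conversion step could be sketched as follows, in Python for illustration (the slides assume R). The function names are hypothetical; the conversion factors are the standard definitions of the pound and the inch.

```python
KG_PER_LB = 0.45359237  # kilograms per pound
M_PER_IN = 0.0254       # meters per inch


def lb_to_kg(pounds):
    """Convert weight in pounds to kilograms."""
    return pounds * KG_PER_LB


def in_to_m(inches):
    """Convert height in inches to meters."""
    return inches * M_PER_IN


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2


# Raw values recorded in pounds and inches are converted before computing BMI.
bmi_metric = bmi(70.0, 1.75)
bmi_converted = bmi(lb_to_kg(154.0), in_to_m(69.0))
```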

Checking your composite variables
- After calculating composite variables, check for validity of the responses (e.g. plot the data)
- For example, seemingly reliable weight and height data may produce unrealistic BMI values
- Sometimes data errors may only appear upon checking the range of calculated variables
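
A range check on the calculated variable might look like the sketch below, in Python for illustration (the slides assume R). The records and the plausibility bounds of 10 and 80 are assumptions chosen for the example. The second record has its height entered in centimeters, an error that surfaces only once BMI is computed.

```python
# Hypothetical records; the second height was entered in cm instead of m.
people = [
    {"id": 1, "weight_kg": 70.0, "height_m": 1.75},
    {"id": 2, "weight_kg": 82.0, "height_m": 175.0},
]


def flag_implausible_bmi(rows, low=10.0, high=80.0):
    """Return (id, bmi) for rows whose BMI falls outside assumed bounds."""
    flagged = []
    for row in rows:
        value = row["weight_kg"] / row["height_m"] ** 2
        if not (low <= value <= high):
            flagged.append((row["id"], value))
    return flagged


flagged = flag_implausible_bmi(people)
```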

Coding & Re-coding
Reasons to consider categorizing data:
- Grouping values is a form of simplification or dimension reduction
- Can help identify non-linear associations
- Will be necessary when some categories include too few observations to be analyzed separately

Re-coding continued
- Pool like groups for variables that are already categorical (the important principle here is that the risk of the outcome should be similar in each of the combined groups)
- Divide continuous data into quartiles (four groups) or quintiles (five groups) with equal numbers of observations
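
Splitting a continuous variable into groups of equal size can be done by rank. The sketch below is in Python for illustration (the slides assume R); the ages are hypothetical. Note the assumption that ties are broken by sort order, so tied values can land in adjacent groups.

```python
def quantile_groups(values, n_groups=4):
    """Assign each value a group 1..n_groups with (near-)equal group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(values) + 1
    return groups


# Hypothetical ages; n_groups=4 gives quartiles, n_groups=5 gives quintiles.
ages = [23, 67, 45, 31, 52, 38, 74, 59]
age_quartile = quantile_groups(ages, n_groups=4)
```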

Re-coding continued
- Other times cutpoints will be based on established rules or guidelines (e.g. NHLBI/WHO BMI categories for normal weight, overweight and obese)
- If no standard cutpoints are available, another approach is to study a histogram of the data and choose cutpoints based on natural breaks
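
Guideline-based cutpoints can be applied with a lookup. The sketch below is in Python for illustration (the slides assume R) and uses the standard adult BMI cutpoints of 18.5, 25 and 30; the underweight category is included for completeness although the slide does not list it.

```python
from bisect import bisect_right

# Standard adult BMI cutpoints; each category includes its lower bound.
CUTPOINTS = [18.5, 25.0, 30.0]
LABELS = ["underweight", "normal weight", "overweight", "obese"]


def bmi_category(bmi):
    """Return the BMI category, or None when BMI is missing."""
    if bmi is None:
        return None
    return LABELS[bisect_right(CUTPOINTS, bmi)]


categories = [bmi_category(b) for b in (17.0, 22.9, 25.0, 31.4, None)]
```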

Planning your analysis
- What are you measuring?
- What comparisons do you want to make?
- What is your outcome and how should your outcome variable be constructed?
- What are your predictor variables? How should they be constructed?

Steps in the analysis
- Two primary levels of analysis:
  - Descriptive analysis (summary of population)
  - Multivariate analysis
- Note: simple bivariate analyses are often presented before multivariate ones to show associations without statistical adjustment for other variables

Descriptive analysis (quantitative)
- Description of quantitative data is Table 1 in most papers
- Describes characteristics of population
- Why is this important?

Example from Stokes 2014

Bivariate analysis
- Comparing outcomes between groups
- t-tests for continuous outcomes
- Chi-squared tests for categorical outcomes
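
To make the chi-squared test concrete, the sketch below computes Pearson's statistic for a 2x2 table by hand, in Python for illustration (the slides assume R, where a built-in test would normally be used). The counts are hypothetical, no continuity correction is applied, and the p-value formula holds only for one degree of freedom.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows are exposure groups, columns are outcome yes/no.
table = [[10, 20], [20, 10]]
stat = chi_squared_statistic(table)
p = p_value_df1(stat)
```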

Multivariate analysis
- Used to adjust for confounding
- Can be used to investigate effect modification and mediators
- Linear regression for continuous outcomes
- Logistic regression for dichotomous outcomes
- Other common models: ordinal logit, Poisson, negative binomial, Cox proportional hazards

Upcoming deadlines (due Sun 5 PM)
- Methods Section
- Data Dictionary
- Table Shells