Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

Similar documents
Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Statistics Lecture 6. Looking at data one variable

Table of Contents (As covered from textbook)

DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

CHAPTER 3: Data Description

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus

WELCOME! Lecture 3 Thommy Perlinger

Chapter 2 Describing, Exploring, and Comparing Data

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Chapter 6: DESCRIPTIVE STATISTICS

MATH11400 Statistics Homepage

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

STA 570 Spring Lecture 5 Tuesday, Feb 1

Name: Stat 300: Intro to Probability & Statistics Textbook: Introduction to Statistical Investigations

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 6: Comparing Two Means Section 6.1: Comparing Two Groups Quantitative Response

Preparing for Data Analysis

AP Statistics Prerequisite Packet

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

1.3 Graphical Summaries of Data

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

An Introduction to Minitab Statistics 529

Exploratory Data Analysis

Preparing for Data Analysis

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

We will start at 2:05 pm! Thanks for coming early!

Topic (3) SUMMARIZING DATA - TABLES AND GRAPHICS

Mineração de Dados Aplicada

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

Data Preprocessing. Slides by: Shree Jaswal

Univariate descriptives

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Averages and Variation

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

Kernel Density Estimation (KDE)

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Basic Medical Statistics Course

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

3 Graphical Displays of Data

No. of blue jelly beans No. of bags

3 Graphical Displays of Data

To make sense of data, you can start by answering the following questions:

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

MHPE 494: Data Analysis. Welcome! The Analytic Process

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Organizing Your Data. Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Chapter 2: Understanding Data Distributions with Tables and Graphs

CHAPTER 2 DESCRIPTIVE STATISTICS

MATH 117 Statistical Methods for Management I Chapter Two

Basic Statistical Terms and Definitions

Learning Objectives for Data Concept and Visualization

Nuts and Bolts Research Methods Symposium

Tabular & Graphical Presentation of data

Chapter 1. Looking at Data-Distribution

Age & Stage Structure: Elephant Model

NCSS Statistical Software

Stat Day 6 Graphs in Minitab

Basic concepts and terms

CS6220: DATA MINING TECHNIQUES

Chpt 3. Data Description. 3-2 Measures of Central Tendency /40

What are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

Chapter 2 Modeling Distributions of Data

Lecture 1: Exploratory data analysis

Name Date Types of Graphs and Creating Graphs Notes

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.

Business Analytics Nanodegree Syllabus

Quantitative - One Population

BIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA

Lecture 3 Questions that we should be able to answer by the end of this lecture:

STA 490H1S Initial Examination of Data

Bar Charts and Frequency Distributions

Understanding Statistical Questions

Univariate Statistics Summary

Lecture Notes 3: Data summarization

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Lecture 3 Questions that we should be able to answer by the end of this lecture:

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

Data Mining and Analytics. Introduction

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Bar Graphs and Dot Plots

Chapter 3 Analyzing Normal Quantitative Data

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Overview. Frequency Distributions. Chapter 2 Summarizing & Graphing Data. Descriptive Statistics. Inferential Statistics. Frequency Distribution

SAN FRANCISCO MUNICIPAL TRANSPORTATION AGENCY BOARD OF DIRECTORS. RESOLUTION No

DATA CLEANING & DATA MANIPULATION

Transcription:

OrderNum ProdID Name OrderId Cust Name Date 1 42 Gum 1 Joe 8/21/2017 2 999 NullFood 2 Arthur 8/14/2017 2 42 Towel 2 Arthur 8/14/2017 1/31/18 Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis Slides by: Joseph E. Gonzalez, Deb Nolan, & Joe Hellerstein jegonzal@berkeley.edu deborah_nolan@berkeley.edu hellerstein@berkeley.edu? Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Kinds of Data Price OrderNum ProdID Name OrderId Cust Name Date Quantity 1 42 Gum x 1 Joe 8/21/2017 Temperature 2 999 NullFood 2 Arthur 8/14/2017 Date 2 42 Towel Categorical Data Quantitative Data Numbers with meaning ratios or intervals. Ordinal Nominal Categories with orders but no Categories with no consistent meaning if specific ordering. magnitudes or intervals Preferences Political Affiliation Level of education Product Type Cal Id Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Granularity -- how fine/coarse is each datum Rec. 1 Rec. 2 Rec. 3 Fine Grained Rec. 1 Rec. 2 Rec. 3 Rec. 1 Coarse Grained Group By manipulating granularity Key Data A 3 B 1 C 4 A 1 B 5 C 9 A 2 B 6 C 5 Split into Groups A 3 A 1 A 2 B 1 B 5 B 6 C 4 C 9 C 5 A 6 B 12 C 18 Merge Results A 6 B 12 C 18 Pivot A kind of Group By Operation Key Key R C Data A U 3 B V 1 A V 1 C V 9 A U 2 B V 6 Split into Groups A U 3 A U 2 A V 1 B V 1 B V 6 C V 9 A U 5 A V 1 B V 7 C V 9 U A 5 1 B 5 7 C 4 9 D 5 V Need to address missing values Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Need to Filter Data Population Need to Filter Data Population Need more data 1

Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Temporality -- how is the data situated in time Temporality Data changes à When was the data collected! What is the meaning of a the time and date fields? When the event happened? When the data was collected or was entered into the system? Date the data was copied into a database (look for many matching timestamps) Time depends on where! (Time zones & daylight savings) Learn to use datetime python library Multiple string representation (depends on region): 07/08/09? Are there strange null values? January 1 st 1970, January 1 st 1900 Is there periodicity? Diurnal patterns Unix Time / POSIX Time Time measured in seconds since January 1 st 1970 Minus leap seconds Unix time follows Coordinated Universal Time (UTC) International time standard Measured at 0 degrees latitude Similar to Greenwich Mean Time (GMT) No daylight savings Time codes Time Zones: San Francisco (UTC-8) without daylight savings Key Data Properties to Consider in EDA Structure -- the shape of a data file Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Temporality -- how is the data situated in time Faithfulness -- how well does the data capture reality https://en.wikipedia.org/wiki/coordinated_universal_time Key Data Properties to Consider in EDA Structure -- the shape of a data file Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Temporality -- how is the data situated in time Faithfulness -- how well does the data capture reality Faithfulness: Do I trust this data? Does my data contain unrealistic or incorrect values? Examples? Dates in the future for events in the past Locations that don t exist Negative counts Misspellings of names Large outliers Does my data violate obvious dependencies? E.g., age and birthday don t match Was the data entered by hand? Spelling errors, fields shifted Did the form require fields or provide default values? Are there obvious signs of curb stoning (data falsification): Repeated names, fake looking email addresses, repeated use of uncommon names or fields. 2

Signs that your data may not be faithful Missing Values/Default values: (0, -1, 999, 12345, NaN, Null, 1970, 1900, others?) Soln 1: Drop records with missing values à implications on your sample! Soln 2: Impute missing values à Bias your conclusions Time Zone Inconsistencies Soln 1: convert to a common timezone (e.g., UTC) Soln 2: convert to the timezone of the location useful in modeling behavior. Duplicated Records or Fields Soln: identify and eliminate (use primary key) à implications on sample? Spelling Errors Soln: Apply corrections or drop records not in a dictionary à implications on sample? Units not specified or consistent Solns: Infer units, check values are in reasonable ranges for data Truncated data (early excel limits: 65536 Rows, 255 Columns) Soln: be aware of consequences in analysis à how did truncation affect sample? Others How do you do EDA? Examine data and meta-data: What is the date, size, organization, and structure of the data? Examine each field/attribute/dimension individually Examine pairs of related dimensions Stratifying earlier analysis: break down grades by major Along the way: Visualize/summarize the data Validate assumptions about data and collection process Identify and address anomalies Apply data transformations and corrections Record everything you do! (why?) Berkeley Crime Data Visualization and EDA Which helps identify key patterns in the data? Visualizing Univariate Relationships Quantitative Data Histograms, Box Plots, Rug Plots, Smoothed Interpolations (KDE Kernel Density Estimators) Look for spread, shape, modes, outliers, unreasonable values Nominal & Ordinal Data Bar plots (sorted by frequency or oridinal dimension) Look for skew, frequent and rare categories, or invalid categories Consider grouping categories and repeating analysis Histograms, Rug Plots, and KDE Interpolation Describes distribution of data relative prevalence of values Histogram relative frequency of values Tradeoff of bin sizes Rug Plot Shows the actual data locations Smoothed density estimator Tradeoff of bandwidth parameter (more on this later) Smoothed Density Estimator Second Mode Histogram Rug Main Mode Skew Right Outliers Gap 3

1/31/18 Box Charts Useful for summarizing distributions and comparing multiple distributions Outliers Wisker Outliers are 1.5 * IQR away from lower and upper quartiles. Bar Charts Used to compare nominal and ordinal data. Consider sorting by category or frequency Titanic Passenger Manifest Upper Quartile Interquartile Range (IQR) 50% of Data Median Lower Quartile Class Visualizing Multivariate Relationships http://bit.ly/ds100-sp18-eda Conditioning on a range of values (e.g., ages in groups) and construct side by side box-plots or bar charts Comparison of Restaurant Bills Comparison of Restaurant Bills Titanic Survivors Quick Break Survived Sex Titanic Survivors Quick Break Scope: Do you have a full picture? 4

Berkeley Police Public Datasets Question: For this analysis we will not begin with a detailed question but instead a rough goal of understanding Police activity. Berkeley Police Data Demo Examine Two Data Sets: Call data Stop data Today we will work through the basic process of data loading, some preliminary cleaning, and exploratory data analysis. Call Data Description Data pulled from Public Safety Server using data created for Berkeley s Crime View Community page. Displays incidents reported for the last 180 days along with time, date, day of week and block level location information. The dataset reflects crimes as they have been reported to the BPD based on preliminary information supplied by the reporting parties. Preliminary crime classifications may change based on follow-up investigations. Not all calls for police service are included (e.g. Animal Bite). The information provided on this site is intended for use by the community to enhance their awareness of crimes occurring in their neighborhoods and the entire City. The data should not be used for in-depth crime analysis as the initial information is subject to change. Stops Data Description This data was extracted from the Department s Public Safety Server and covers the data beginning January 26, 2015. On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014). Under that order, officers were required to provide certain data after making all vehicle detentions (including bicycles) and pedestrian detentions (up to five persons). This data set lists stops by police in the categories of traffic, suspicious vehicle, pedestrian and bicycle stops. Incident number, date and time, location and disposition codes are also listed in this data. Address data has been changed from a specific address, where applicable, and listed as the block where the incident occurred. Disposition codes were entered by officers who made the stop. These codes included the person(s) race, gender, age (range), reason for the stop, enforcement action taken, and whether or not a search was conducted. Caution about EDA With enough data, if you look hard enough you will find something interesting Important to differentiate inferential conclusions about world from exploratory analysis of data 5