Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis

Similar documents
Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Statistics Lecture 6. Looking at data one variable

Table of Contents (As covered from textbook)

DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

CHAPTER 3: Data Description

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus

WELCOME! Lecture 3 Thommy Perlinger

Chapter 2 Describing, Exploring, and Comparing Data

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Chapter 6: DESCRIPTIVE STATISTICS

MATH11400 Statistics Homepage

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

STA 570 Spring Lecture 5 Tuesday, Feb 1

Name: Stat 300: Intro to Probability & Statistics Textbook: Introduction to Statistical Investigations

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 6: Comparing Two Means Section 6.1: Comparing Two Groups Quantitative Response

Preparing for Data Analysis

AP Statistics Prerequisite Packet

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

We will start at 2:05 pm! Thanks for coming early!

Data Preprocessing. Slides by: Shree Jaswal

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

1.3 Graphical Summaries of Data

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

An Introduction to Minitab Statistics 529

Exploratory Data Analysis

Preparing for Data Analysis

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

Univariate descriptives

Topic (3) SUMMARIZING DATA - TABLES AND GRAPHICS

Mineração de Dados Aplicada

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

Averages and Variation

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

MHPE 494: Data Analysis. Welcome! The Analytic Process

Kernel Density Estimation (KDE)

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Basic Medical Statistics Course

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

Learning Objectives for Data Concept and Visualization

3 Graphical Displays of Data

3 Graphical Displays of Data

No. of blue jelly beans No. of bags

To make sense of data, you can start by answering the following questions:

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

CS6220: DATA MINING TECHNIQUES

Business Analytics Nanodegree Syllabus

What are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram

Organizing Your Data. Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Chapter 2: Understanding Data Distributions with Tables and Graphs

CHAPTER 2 DESCRIPTIVE STATISTICS

MATH 117 Statistical Methods for Management I Chapter Two

Basic Statistical Terms and Definitions

Tabular & Graphical Presentation of data

Nuts and Bolts Research Methods Symposium

Chapter 1. Looking at Data-Distribution

Age & Stage Structure: Elephant Model

NCSS Statistical Software

Stat Day 6 Graphs in Minitab

Basic concepts and terms

Data Mining and Analytics. Introduction

Chpt 3. Data Description. 3-2 Measures of Central Tendency /40

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester

Data Preprocessing in Python. Prof.Sushila Aghav

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

Data Science. Data Analyst. Data Scientist. Data Architect

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Lecture 1: Exploratory data analysis

Chapter 2 Modeling Distributions of Data

Name Date Types of Graphs and Creating Graphs Notes

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.

Quantitative - One Population

Lecture 3 Questions that we should be able to answer by the end of this lecture:

BIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA

3. Data Preprocessing. 3.1 Introduction

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

2. Data Preprocessing

STA 490H1S Initial Examination of Data

Lecture Notes 3: Data summarization

Lecture 3 Questions that we should be able to answer by the end of this lecture:

Bar Charts and Frequency Distributions

Understanding Statistical Questions

Univariate Statistics Summary

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

Transcription:

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis Slides by: Joseph E. Gonzalez, Deb Nolan, & Joe Hellerstein jegonzal@berkeley.edu deborah_nolan@berkeley.edu hellerstein@berkeley.edu?

Last Lecture Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Kinds of Data Quantitative Data Categorical Data Numbers with meaning ratios or intervals. Ordinal Nominal OrderNum ProdID Name 1 42 Gum 2 999 NullFood 2 42 Towel x OrderId Cust Name Date 1 Joe 8/21/2017 2 Arthur 8/14/2017 OrderNum ProdID Name OrderId Cust Name Date 1 42 Gum 1 Joe 8/21/2017 2 999 NullFood 2 Arthur 8/14/2017 2 42 Towel 2 Arthur 8/14/2017 Examples: Price Quantity Temperature Date Categories with orders but no consistent meaning if magnitudes or intervals Examples: Preferences Level of education Categories with no specific ordering. Examples: Political Affiliation Product Type Cal Id

Last Lecture Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Granularity -- how fine/coarse is each datum Rec. 1 Rec. 1 Rec. 2 Rec. 3 Rec. 1 Rec. 2 Rec. 3 Fine Grained Coarse Grained

Group By manipulating granularity Key Data A 3 B 1 C 4 A 1 B 5 C 9 Split into Groups A 3 A 1 A 2 B 1 B 5 B 6 Aggregate Function Aggregate Function A 6 B 12 Merge Results A 6 B 12 C 18 A 2 B 6 C 5 C 4 C 9 C 5 Aggregate Function C 18

Pivot A kind of Group By Operation Key R Key C Data A U 3 A U 3 A U 2 Aggregate A U 5 Function B V 1 C U 4 A V 1 B U 5 C V 9 A U 2 B V 6 Split into Groups A V 1 B U 5 B V 1 B V 6 C U 4 C V 9 Aggregate A V 1 Function Aggregate Function B U 5 Aggregate B V 7 Function Aggregate C U 4 Function Aggregate C V 9 Function U A 5 1 B 5 7 C 4 9 D 5 V Need to address missing values D U 5 D U 5 Aggregate Function D U 5

Last Lecture Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Need to Filter Population Need to Filter Population Need more data Data Data

Last Lecture Started discussing exploratory data analysis Structure -- the shape of a data file (how is it organized) Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Temporality -- how is the data situated in time

Temporality Data changes à When was the data collected! What is the meaning of a the time and date fields? When the event happened? When the data was collected or was entered into the system? Date the data was copied into a database (look for many matching timestamps) Time depends on where! (Time zones & daylight savings) Learn to use datetime python library Multiple string representation (depends on region): 07/08/09? Are there strange null values? January 1 st 1970, January 1 st 1900 Is there periodicity? Diurnal patterns

Unix Time / POSIX Time Time measured in seconds since January 1 st 1970 Minus leap seconds Unix time follows Coordinated Universal Time (UTC) International time standard Measured at 0 degrees latitude Similar to Greenwich Mean Time (GMT) No daylight savings Time codes Time Zones: San Francisco (UTC-8) without daylight savings https://en.wikipedia.org/wiki/coordinated_universal_time

Key Data Properties to Consider in EDA Structure -- the shape of a data file Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Temporality -- how is the data situated in time Faithfulness -- how well does the data capture reality

Key Data Properties to Consider in EDA Structure -- the shape of a data file Granularity -- how fine/coarse is each datum Scope -- how (in)complete is the data Temporality -- how is the data situated in time Faithfulness -- how well does the data capture reality

Faithfulness: Do I trust this data? Does my data contain unrealistic or incorrect values? Examples? Dates in the future for events in the past Locations that don t exist Negative counts Misspellings of names Large outliers Does my data violate obvious dependencies? E.g., age and birthday don t match Was the data entered by hand? Spelling errors, fields shifted Did the form require fields or provide default values? Are there obvious signs of curb stoning (data falsification): Repeated names, fake looking email addresses, repeated use of uncommon names or fields.

Signs that your data may not be faithful Missing Values/Default values: (0, -1, 999, 12345, NaN, Null, 1970, 1900, others?) Soln 1: Drop records with missing values à implications on your sample! Soln 2: Impute missing values à Bias your conclusions Time Zone Inconsistencies Soln 1: convert to a common timezone (e.g., UTC) Soln 2: convert to the timezone of the location useful in modeling behavior. Duplicated Records or Fields Soln: identify and eliminate (use primary key) à implications on sample? Spelling Errors Soln: Apply corrections or drop records not in a dictionary à implications on sample? Units not specified or consistent Solns: Infer units, check values are in reasonable ranges for data Truncated data (early excel limits: 65536 Rows, 255 Columns) Soln: be aware of consequences in analysis à how did truncation affect sample? Others

How do you do EDA? Examine data and meta-data: What is the date, size, organization, and structure of the data? Examine each field/attribute/dimension individually Examine pairs of related dimensions Stratifying earlier analysis: break down grades by major Along the way: Visualize/summarize the data Validate assumptions about data and collection process Identify and address anomalies Apply data transformations and corrections Record everything you do! (why?)

Visualization and EDA

Berkeley Crime Data Which helps identify key patterns in the data?

Visualizing Univariate Relationships Quantitative Data Histograms, Box Plots, Rug Plots, Smoothed Interpolations (KDE Kernel Density Estimators) Look for spread, shape, modes, outliers, unreasonable values Nominal & Ordinal Data Bar plots (sorted by frequency or oridinal dimension) Look for skew, frequent and rare categories, or invalid categories Consider grouping categories and repeating analysis

Histograms, Rug Plots, and KDE Interpolation Describes distribution of data relative prevalence of values Histogram relative frequency of values Tradeoff of bin sizes Rug Plot Shows the actual data locations Smoothed density estimator Tradeoff of bandwidth parameter (more on this later) Smoothed Density Estimator Second Mode Histogram Main Mode Skew Right Gap Outliers Rug

Box Charts Useful for summarizing distributions and comparing multiple distributions Outliers Wisker Outliers are 1.5 * IQR away from lower and upper quartiles. Upper Quartile Interquartile Range (IQR) 50% of Data Median Lower Quartile

Bar Charts Used to compare nominal and ordinal data. Consider sorting by category or frequency Titanic Passenger Manifest Class Sex Survived

Visualizing Multivariate Relationships Conditioning on a range of values (e.g., ages in groups) and construct side by side box-plots or bar charts Comparison of Restaurant Bills Titanic Survivors

http://bit.ly/ds100-sp18-eda Comparison of Restaurant Bills Titanic Survivors

Quick Break

Quick Break Scope: Do you have a full picture?

Berkeley Police Data Demo

Berkeley Police Public Datasets Question: For this analysis we will not begin with a detailed question but instead a rough goal of understanding Police activity. Examine Two Data Sets: Call data Stop data Today we will work through the basic process of data loading, some preliminary cleaning, and exploratory data analysis.

Call Data Description Data pulled from Public Safety Server using data created for Berkeley s Crime View Community page. Displays incidents reported for the last 180 days along with time, date, day of week and block level location information. The dataset reflects crimes as they have been reported to the BPD based on preliminary information supplied by the reporting parties. Preliminary crime classifications may change based on follow-up investigations. Not all calls for police service are included (e.g. Animal Bite). The information provided on this site is intended for use by the community to enhance their awareness of crimes occurring in their neighborhoods and the entire City. The data should not be used for in-depth crime analysis as the initial information is subject to change.

Stops Data Description This data was extracted from the Department s Public Safety Server and covers the data beginning January 26, 2015. On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014). Under that order, officers were required to provide certain data after making all vehicle detentions (including bicycles) and pedestrian detentions (up to five persons). This data set lists stops by police in the categories of traffic, suspicious vehicle, pedestrian and bicycle stops. Incident number, date and time, location and disposition codes are also listed in this data. Address data has been changed from a specific address, where applicable, and listed as the block where the incident occurred. Disposition codes were entered by officers who made the stop. These codes included the person(s) race, gender, age (range), reason for the stop, enforcement action taken, and whether or not a search was conducted.

Caution about EDA With enough data, if you look hard enough you will find something interesting Important to differentiate inferential conclusions about world from exploratory analysis of data