DSC 201: Data Analysis & Visualization

Similar documents
DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus

Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

MANIPULATING TIME SERIES DATA IN PYTHON. Compare Time Series Growth Rates

Chapter 3 - Displaying and Summarizing Quantitative Data

Python for Data Analysis

Table of Contents (As covered from textbook)

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Part I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Preprocessing in Python. Prof.Sushila Aghav

DSC 201: Data Analysis & Visualization

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

MATH11400 Statistics Homepage

Chapter 2: Descriptive Statistics

Thomas Vincent Head of Data Science, Getty Images

Few s Design Guidance

DSC 201: Data Analysis & Visualization

Statistics Lecture 6. Looking at data one variable

Pandas IV: Time Series

STA 570 Spring Lecture 5 Tuesday, Feb 1

Bar Charts and Frequency Distributions

3 Graphical Displays of Data

Statistical Graphs & Charts

Exploratory Data Analysis EDA

Multivariate Data & Tables and Graphs. Agenda. Data and its characteristics Tables and graphs Design principles

Regression III: Advanced Methods

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Applied Regression Modeling: A Business Approach

Chapter 6: DESCRIPTIVE STATISTICS

DSC 201: Data Analysis & Visualization

Math 227 EXCEL / MEGASTAT Guide

8. MINITAB COMMANDS WEEK-BY-WEEK

Data Mining and Analytics. Introduction

MANIPULATING TIME SERIES DATA IN PYTHON. How to use Dates & Times with pandas

Chapter 2 Describing, Exploring, and Comparing Data

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

Chapter 2 - Graphical Summaries of Data

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

JMP 10 Student Edition Quick Guide

Pre-Calculus Multiple Choice Questions - Chapter S2

3 Graphical Displays of Data

Exploratory Data Analysis with R. Matthew Renze Iowa Code Camp Fall 2013

刘淇 School of Computer Science and Technology USTC

8 Organizing and Displaying

Applied Regression Modeling: A Business Approach

DSC 201: Data Analysis & Visualization

Data Mining: Exploring Data

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

Multivariate Data & Tables and Graphs

Overview. Frequency Distributions. Chapter 2 Summarizing & Graphing Data. Descriptive Statistics. Inferential Statistics. Frequency Distribution

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data

3 Visualizing quantitative Information

Learning Objectives for Data Concept and Visualization

CHAPTER 2 Modeling Distributions of Data

Data Science with Python Course Catalog

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Topic (3) SUMMARIZING DATA - TABLES AND GRAPHICS

Pandas 4: Time Series

Lecture 6: Chapter 6 Summary

University of Florida CISE department Gator Engineering. Visualization

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

Brief Guide on Using SPSS 10.0

Visualizing Data: Freq. Tables, Histograms

DATA STRUCTURE AND ALGORITHM USING PYTHON

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Multivariate Data & Tables and Graphs. Agenda. Data and its characteristics Tables and graphs Design principles

Visual Encoding Design

JMP Book Descriptions

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

Slides Prepared by JOHN S. LOUCKS St. Edward s s University Thomson/South-Western. Slide

Practical 2: Using Minitab (not assessed, for practice only!)

CHAPTER 3: Data Description

Chapter 2 Organizing and Graphing Data. 2.1 Organizing and Graphing Qualitative Data

UNIT 1A EXPLORING UNIVARIATE DATA

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Python With Data Science

Slides by. John Loucks. St. Edward s University. Slide South-Western, a part of Cengage Learning

Data 8 Final Review #1

Nonparametric Approaches to Regression

SPSS. (Statistical Packages for the Social Sciences)

WELCOME! Lecture 3 Thommy Perlinger

Lesson 18-1 Lesson Lesson 18-1 Lesson Lesson 18-2 Lesson 18-2

MATH1635, Statistics (2)

Visual Analytics. Visualizing multivariate data:

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Test Bank for Privitera, Statistics for the Behavioral Sciences

Applied Regression Modeling: A Business Approach

Exploratory data analysis with one and two variables

STA 490H1S Initial Examination of Data

Transcription:

DSC 201: Data Analysis & Visualization Exploratory Data Analysis Dr. David Koop

Python Support for Time The datetime package - Has date, time, and datetime classes -.now() method: the current datetime - Can access properties of the time (year, month, seconds, etc.) Converting from strings to datetimes: - datetime.strptime: good for known formats - dateutil.parser.parse: good for unknown formats Converting to strings - str(dt) or dt.strftime(<format>) Differences between times - datetime.timedelta: can get number of days/hours/etc. - deal with issues with different length months, etc. 2

Time Zones Why? Coordinated Universal Time (UTC) is the standard time (basically equivalent to Greenwich Mean Time (GMT) Other time zones are UTC +/- a number in [1,12] Dartmouth is UTC-5 (aka US/Eastern) 3

Python, Pandas, and Time Zones Time series in pandas are time zone native The pytz module keeps track of all of the time zone parameters - even Daylight Savings Time Localize a timestamp using tz_localize - ts = pd.timestamp("1 Dec 2016 12:30 PM") ts = ts.tz_localize("us/eastern") Convert a timestamp using tz_convert - ts.tz_convert("europe/budapest") Operations involving timestamps from different time zones become UTC 4

Resampling resample method Can downsample or upsample data If downsample, need to provide method for combining data (mean/ sum) If upsample, need to fill in missing data Interpolation is useful here 5

Downsampling rng = pd.date_range('1/1/2010',periods=100,freq='d') ts = pd.series(np.random.randn(len(rng)),index=rng) - 2010-01-01 0.649916 2010-01-02-0.856086 2010-01-03 0.225041 2010-01-04-0.510606 ts.resample('m').mean() - 2010-01-31-0.152096 2010-02-28 0.247770 2010-03-31-0.008501 2010-04-30 0.432483 Freq: M, dtype: float64 6

Upsampling ts.resample('12h').mean() - 2010-01-01 00:00:00 0.649916 2010-01-01 12:00:00 NaN 2010-01-02 00:00:00-0.856086 2010-01-02 12:00:00 NaN 2010-01-03 00:00:00 0.225041 Now, we need to fill in missing data (NaN) Options: fillna(), ffill(), bfill() Interpolation! ts.resample('12h').sum().interpolate() - 2010-01-01 00:00:00 0.649916 2010-01-01 12:00:00-0.103085 2010-01-02 00:00:00-0.856086 2010-01-02 12:00:00-0.315523 2010-01-03 00:00:00 0.225041 7

Window Functions Idea: want to aggregate over a window of time, calculate the answer, and then slide that window ahead. Repeat. rolling: smooth out data In old versions of pandas (like the book uses), this used to be rolling_count, rolling_sum, rolling_mean Specify the window size, then an aggregation method Can also specify the window Result is set to the right edge of window (change with center=true) 8

Food Inspections Example Questions: - When was a location was last inspected? - How many inspections are done per month/year/etc.? - Is there any trend in the number of inspections over time? Notebook posted on course web page 9

Assignment 5 http://www.cis.umassd.edu/~dkoop/dsc201/assignment5.html Aggregation, resampling, and visualization of time series data Part 3: Multi-level index and groupbys 10

What is Exploratory Data Analysis? "Detective work" to summarize and explore datasets Includes: - Data acquisition and input - Data cleaning and wrangling ( tidying ) - Data transformation and summarization - Data visualization 11

Exploratory Data Analysis John W. Tukey - Born in New Bedford - 1977: Highly influential book Emphasis on value of visualization in discovering trends, relationships From a review of the book: Tukey favors analysis of data with little more than pencil and paper. Specifically, there is no need for a calculator, a computer, or a lettering guide to do the analyses he proposes [R.M. Church, 1979] 12

Supporting Exploration Data Computation Data Products Perception & Cognition Knowledge Specification Exploration Reflective thought requires the ability to store temporary results, to make inferences from stored knowledge, and to follow chains of reasoning backward and forward, sometimes backtracking when a promising line of thought proves to be unfruitful. The process takes time. Donald A. Norman 13

Comparison with Older Statistical Methods Older method: - Take data and a probable model and learn model parameters - Good models are useful, help understand phenomena - What happens when we pick the wrong model, or don't know which one to pick? EDA: - Postpone the model assumptions, let the data speak first - Usually involves graphical techniques - Tukey used pen-and-paper approaches - Today, we can do much of this via computer, but insight may still take time 14

Types of EDA Univariate vs. multivariate Non-graphical vs. graphical 15

Univariate Non-graphical EDA Categorical Data: Quantitative Data: 16

Univariate Non-graphical EDA Categorical Data: - Frequency counts, proportions - Groupings Quantitative Data: - Distribution - Summary statistics: mean, median, mode, variance, standard deviation, quantiles 17

Univariate Graphical EDA 18

Univariate Graphical EDA Histograms - Aggregation of data - Choose number of bins - Bin width makes a difference! Frequency 0 5 15 25 5 0 5 10 15 20 25 Stem-and-leaf plots: X - 1 710340 2 06 3 030223459 4 028907 5 00798273487 6 128 7 7897 8 345 9 1 Frequency Frequency 0 5 10 15 0 2 4 6 8 5 0 5 10 15 20 25 X 5 0 5 10 15 20 25 X 19

Univariate Graphical EDA Boxplots - Show distribution X 2 4 6 8 - Multiple summary statistics can be read from the chart - Also provides a general shape of the data - Best for unimodal data 20

Multivariate non-graphical EDA 21

Multivariate non-graphical EDA Crosstabs and Pivot Tables - What is in the data? Count Correlation and covariance - How related are different columns? - How do they change together? 22

Multivariate graphical EDA 23

Multivariate Graphical EDA Scatterplots: look for correlation - Usually put outcome on y-axis - Can encode other variables Side-by-side boxplots Parallel coordinates Grouped bar charts 24