Introduction to Data Science

CS 491, DES 430, IE 444, ME 444, MKTG 477
UIC Innovation Center, Fall 2017 and Spring 2018
Instructors: Charles Frisbie, Marco Susani, Michael Scott and Ugo Buy
Author: Ugo Buy

What is data science?
- Discipline seeking to extract knowledge and insights from large amounts of raw data
  - Examples: predict income level from age; predict gender of a Twitter user from colors chosen in tweets, etc.
- AKA data analytics
- Multidisciplinary in nature, mostly borrowing from:
  - Statistics
  - Computer science (databases, machine learning, data mining, parallel computing)
  - Data visualization
- Wide array of applications:
  - Medical sciences (healthcare)
  - Finance (market predictions)
  - Logistics, etc.

Drew Conway's Venn diagram
- Multidisciplinary convergence:
  - Math and statistics
  - Domain knowledge
  - Computer science
- More detailed descriptions make explicit the role of HCI and UX in data science
  - HCI = Human-Computer Interaction
  - UX = User Experience

Our learning objectives
- Overarching pedagogical goal: learn how to extract knowledge from mobility and transportation datasets
  - Public datasets: UIC library, Bureau of Transportation Statistics, Chicago Data Portal, etc.
  - BMW datasets (hopefully)
- Specific learning objectives:
  - Learn the basics of statistical learning
    - Input variables (aka features or predictors) vs. responses (aka outcomes or output variables)
    - Distinguish different prediction methods: regression and classification
    - Regression = predicted variable is continuous (e.g., predict vehicle value based on family income, etc.)
    - Classification = predicted variable is discrete (e.g., fraudulent vs. legit transaction, male vs. female user)
  - Learn how to visualize analysis results (Professor Susani)
    - Box plots, scatter plots, histograms, etc.

Resources
- Statistical learning: An Introduction to Statistical Learning
  - PDF available from http://www-bcf.usc.edu/~gareth/isl/
- Computer science: various languages with built-in support for statistical analysis, e.g.,
  - R: https://www.r-project.org/
  - Hadoop: http://hadoop.apache.org/

Public and UIC datasets
1. SimplyAnalytics database (UIC Library)
   - EASI → Census Data → Employment
   - EASI → Census Data → Vehicles
2. Chicago Data Portal (public)
   - https://data.cityofchicago.org/
   - Transportation data
   - Similar sites for NYC, LA, SFO, etc.
   - Counties sometimes have similar sites
3. National Transit Database (public)
   - https://www.transit.dot.gov/ntd
4. Reference USA database (public)
   - http://resource.referenceusa.com/
   - Use advanced search
   - Location and number of gas stations, car rental companies, etc.
5. Bureau of Transportation Statistics (BTS)
   - https://www.bts.gov/
   - Intermodal transportation database
   - Data on commercial aviation
   - Data on transportation economics
   - Asset Inventory Module (aka vehicles)

What we do with datasets of interest
- We extract information by means of statistical analysis
- Paradigm:
1. Formulate a hypothesis (i.e., ask a question)
   - Example: Is there a correlation between urban traffic density and air pollution?
2. Apply statistical learning methods to the dataset
   - Compute correlation indices between input and output variables, e.g., using regression analysis
3. Analyze the statistical data to validate or refute the initial hypothesis
   - Null hypothesis: no significant correlation between input and output variables (variables are independent of each other)
   - Alternative hypothesis: variables are in fact correlated (e.g., when input is high, output is likely to be low)

Correlation ≠ Causality
- Ultimate goal of correlation analysis: establish causal relationships between different variables
  - If two variables are correlated, there could be a causal relationship between the variables
  - ... or not
- Analysis of beach communities shows high correlation between ice cream sales and shark attacks
  - But nobody is suggesting cutting ice cream sales as a way of preventing shark attacks
  - Ice cream sales and shark attacks are correlated but not causally related
- Source: https://m.xkcd.com/552/

Basic statistics definitions
- Average (aka mean value): given a set of n values, their average μ is the sum of the values divided by the number n of values that were added together
  - Assume dataset = (12, 18, 6, 20, 24), then average μ = (12+18+6+20+24)/5 = 16
- Median: given a set of n values, median M is the value in the middle once the values are sorted
  - Dataset above → M = 18
  - Often more useful than the average, because the average is sometimes affected by outliers
- Variance: average of the squared differences of the values from the mean, denoted by σ²
  - Indication of how spread out the values are around the average
  - Sets (5, 10, 10, 15) and (9, 10, 10, 11) have the same μ = 10, but their variances are different (12.5 vs. 0.5)
- Standard deviation: the square root of the variance, denoted by σ
  - How much you should expect a random value to differ from the mean
  - σ ≈ 3.536 and σ ≈ 0.707 for the two sets above
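The definitions above can be checked in a few lines of R. This is only a sketch: pop_var() is a helper name introduced here for illustration, and the first vector is the data set whose sum appears in the worked average, (12+18+6+20+24)/5. Note that R's built-in var() and sd() divide by n − 1 rather than n, so they do not reproduce the 12.5 and 0.5 figures exactly.

```r
# Population variance as defined on the slide: the average squared
# deviation from the mean (divides by n). pop_var is a hypothetical helper.
pop_var <- function(v) mean((v - mean(v))^2)

values <- c(12, 18, 6, 20, 24)
mean(values)    # 16
median(values)  # 18

a <- c(5, 10, 10, 15)
b <- c(9, 10, 10, 11)
pop_var(a)        # 12.5
pop_var(b)        # 0.5
sqrt(pop_var(a))  # about 3.536
sqrt(pop_var(b))  # about 0.707

# Built-in var() and sd() are sample estimates (divide by n - 1),
# so they come out somewhat larger on the same data.
var(a)  # 50 / 3, about 16.67
sd(a)   # about 4.08
```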

How do statistics help us?
- Plotting wage data (response variable) with respect to age (input variable) or year (input variable)
- Blue lines represent averages for each age and year value
  - Help make sense of data
- Source: ISLR, page 2

The key goal: Express output as a function of input + some error
- Given an input variable X, estimate response variable Y as a function of X + some error ε: $Y = f(X) + \varepsilon$
- See how f may help understand the relation between input and output variables
- Population = 30 people with different incomes and education
- Source: ISLR, page 16

The inference problem
- Given a response variable Y and a set of input variables X_i:
  - Which input variables will affect the response?
  - What is the relationship between the response and each input variable?
  - Can the relationship be modeled as a linear function or is it more complex?
- We will consider linear relationships first
- Example: different advertising markets
- Source: ISLR, page 16

Simple linear regression
- Statistical model assuming that a single input variable is linearly related to the response variable
- Basic assumption: the relation between input and output can be drawn as a line
  - Actual relation drawn as a line: $Y \approx \beta_0 + \beta_1 X$
  - Could be true or false, but a good starting point for analyzing CAT datasets
  - Linear prediction from n observations: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
  - Goal: try to get predicted values as close as possible to actual values

Drawing the line
- What is the line that best fits our observations?
  - Must come up with predicted intercept and slope values β0 and β1
- Least squares method: minimize the sum of the squared errors between observed and predicted values
  - Residual (error of one observation) is the difference between observed and predicted value: $e_i = y_i - \hat{y}_i$
  - Minimize RSS = Residual Sum of Squares when choosing β0 and β1: $\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$
  - Good news: you'll never have to do the calculation of β0 and β1 yourself
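For the curious, the least squares calculation is short enough to sketch by hand. The example below uses simulated data (the numbers are made up for illustration, not taken from any course dataset) and the standard closed-form estimates for simple linear regression, then checks them against R's lm().

```r
set.seed(1)
# Hypothetical data: 50 observations scattered around the line y = 2 + 0.5 x
x <- runif(50, min = 0, max = 100)
y <- 2 + 0.5 * x + rnorm(50, sd = 3)

# Closed-form least squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Residual sum of squares of the fitted line
rss <- sum((y - (b0 + b1 * x))^2)

# lm() minimizes the same RSS, so it should give the same coefficients
fit <- lm(y ~ x)
coef(fit)
c(b0, b1)
```

In practice lm() does this work, which is why the manual calculation is never needed.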

The numbers for TV ad problem
- Advertising dataset (from http://www-bcf.usc.edu/~gareth/isl/data.html)
- Predicted slope β1 = 0.0475
  - Sales predicted to increase by 47.5 units of product for every $1,000 spent on TV advertising
- Predicted intercept β0 = 7.03
  - Sales without TV advertising predicted to be 7,030 units
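As a worked example of reading these two numbers, the snippet below plugs them into the fitted line. The budget of $100,000 is a made-up input chosen for illustration; the coefficients are the ones quoted above, with sales in thousands of units and TV spending in thousands of dollars.

```r
b0 <- 7.03    # intercept, in thousands of units
b1 <- 0.0475  # slope, thousands of units per $1,000 of TV advertising

tv_budget <- 100                        # hypothetical budget, in $1,000s
predicted_sales <- b0 + b1 * tv_budget  # in thousands of units
predicted_sales                         # 11.78, i.e. about 11,780 units
```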

How good of a prediction?
- Must validate the linear model assumption, but how?
1. Residual Standard Error (RSE): square root of the ratio of RSS to the degrees of freedom, n − 2 for n observations:
   $\mathrm{RSE} = \sqrt{\mathrm{RSS} / (n - 2)}$
   - RSE is an absolute measure of the lack of fit of the linear prediction (= 3.26 for TV ad data; prediction off by about 3,260 units on average)
2. R² statistic: normalized measure of fit (values between 0 and 1), the proportion of variability of Y that is explained by X:
   $R^2 = 1 - \mathrm{RSS} / \mathrm{TSS}$, where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
   - Values close to 1 indicate high correlation; values close to 0 indicate low correlation
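Both quality measures can be recomputed from the residuals of a fitted model. The sketch below uses simulated data (made up for illustration) and the definitions used in ISLR and by R's summary(): RSE = sqrt(RSS / (n − 2)) and R² = 1 − RSS / TSS.

```r
set.seed(2)
# Hypothetical data with a clear linear trend
x <- runif(40, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(40, sd = 1.5)

fit <- lm(y ~ x)
n   <- length(y)

rss <- sum(resid(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)  # total sum of squares

rse <- sqrt(rss / (n - 2))   # residual standard error
r2  <- 1 - rss / tss         # R-squared

# The same two numbers as reported by summary()
s <- summary(fit)
c(rse, s$sigma)
c(r2, s$r.squared)
```

For a single predictor, r2 also equals cor(x, y)^2.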

Analyzing public datasets
- Decide whether certain features may affect each other (e.g., urban pollution vs. population density)
- Select features of interest (X and Y)
- Regress one feature on the other, using R or another statistical analysis package
- Check the null hypothesis (X and Y are not correlated)
  - If the null hypothesis is true, slope β1 will be zero or close to zero
  - How close to zero?
  - t-statistic: estimated slope β1 divided by its standard error, i.e., the slope's distance from zero measured in standard errors
  - p-value: probability of observing a t-statistic at least this extreme if the null hypothesis were true; reject the null hypothesis for a p-value less than 5%
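A minimal sketch of the hypothesis check in R, assuming simulated data in place of a real public dataset: the t-statistic is the slope estimate divided by its standard error, and the two-sided p-value comes from the t distribution with the model's residual degrees of freedom. Both are also printed in the coefficient table of summary().

```r
set.seed(3)
# Hypothetical features: y depends on x, plus noise
x <- rnorm(60)
y <- 0.8 * x + rnorm(60)

fit   <- lm(y ~ x)
coefs <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-statistic for the slope: estimate divided by its standard error
t_stat <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = fit$df.residual, lower.tail = FALSE)

# Reject the null hypothesis of zero slope at the 5% level?
p_val < 0.05
```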

The values for the TV ad dataset
- Source: ISLR, pages 68 and 69

The language R
- Programming language for statistical computing and graphics
- Named after the initial letter of its founders' first names, Ross Ihaka and Robert Gentleman
- Relatively easy syntax
- Lots of built-in analysis methods (both for regression and classification)
- Basic language has a command line interface; various GUI-based systems exist (e.g., Rattle, RStudio, etc.)
  - GUI tools usually include a command-line window
- Target platform: standalone computer (vs. Hadoop)
- Freely available on MS Windows, Linux, and Mac OS X platforms (GNU GPL terms)
  - Quite extensible → packages
- Software, documentation and reference materials available at https://cran.r-project.org/

R: Basic commands
- Most commands execute built-in and user-defined functions
- Syntax: function_name(arg1, arg2, ...)
  - Example: sqrt is a 1-argument function returning the argument's square root
  - sqrt(9) → 3
- Values returned by functions can be saved in variables
  - x = sqrt(9)
  - Now x equals 3
- Function c() concatenates its arguments into a vector of values, e.g.,
  - c(10, 20, 30, 40) → 10 20 30 40
- Functions length(), mean(), median(), var() and sd() take a vector of values and return its length, mean, median, variance and standard deviation (var() and sd() use the sample formulas, dividing by n − 1)
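The commands on this slide can be tried directly at the R prompt. A small sketch, with the expected results shown as comments:

```r
x <- sqrt(9)            # save the returned value; x is now 3
x

v <- c(10, 20, 30, 40)  # build a vector from four values
length(v)               # 4
mean(v)                 # 25
median(v)               # 25
var(v)                  # sample variance: 500 / 3, about 166.67
sd(v)                   # square root of var(v), about 12.91
```

Both = and <- assign a value to a variable; <- is the more common style in R code.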

R: Matrix commands
- Matrix: a table of numbers (2-dimensional matrix)
  - R representation of CAT spreadsheets
- Create a matrix with function: matrix(elements, row_number, column_number)
- Typically assign the matrix to a variable to remember it
- Matrix element access by values or sets of values for row and column
  - Use the name of the matrix + row index and column index in square brackets, e.g.,
  - y[3,2] returns the second element in the third row of y
  - Ranges possible for row and column index
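A short sketch of matrix creation and indexing, using the numbers 1 to 12 as placeholder contents. Note that matrix() fills the table column by column unless told otherwise.

```r
# 4 rows, 3 columns, filled column by column:
# column 1 holds 1..4, column 2 holds 5..8, column 3 holds 9..12
y <- matrix(1:12, 4, 3)

y[3, 2]       # second element in the third row: 7
y[1:2, ]      # range of rows: the first two rows, all columns
y[, c(1, 3)]  # set of columns: columns 1 and 3, all rows
dim(y)        # 4 3
```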

R: Read data from spreadsheets
- Function read.csv() loads a spreadsheet into R
  - Input: Comma-Separated Values (csv) spreadsheet
  - Output: a data frame (a 2-dimensional table with named columns)
- Function dim() returns dimensions
- Function names() returns column names
- Function cor() returns the correlation coefficient r (in simple linear regression, r² equals R²)
- Use dollar sign $ to denote a column by symbolic name
  - Syntax: data_name$column_name
- Alternatively,
  - Use attach() function (makes the columns accessible by name alone)
  - Use numeric indices
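To keep the example self-contained, the sketch below first writes a tiny CSV file to a temporary location and then reads it back. The column names and values are made up for illustration; with a real dataset, only the read.csv() call and the path would change.

```r
# Create a small hypothetical CSV file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TV,Sales", "100,12", "200,17", "50,9", "150,15"), csv_path)

ads <- read.csv(csv_path)  # first line of the file becomes the column names
class(ads)                 # "data.frame"
dim(ads)                   # 4 2
names(ads)                 # "TV" "Sales"

ads$Sales                  # one column, selected by name
ads[2, 1]                  # numeric indices: row 2, column 1 (200)
cor(ads$TV, ads$Sales)     # correlation between the two columns
```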

R: Graphic display tools
- Function plot() opens a window with a scatter plot of 2 features
- Function hist() shows a histogram of 1 feature

R: Statistical learning tools
- Function lm() computes a linear model
  - Funny syntax uses tilde character: var = lm(response_var ~ input1 + input2)
- Function summary(var) returns summary data
- Function abline(var) draws the fitted regression line on the current plot (use after plot())
  - Beware of switching the order of response and predictors between lm() and plot(): lm() takes response ~ predictor, while plot() takes the predictor (x) first
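The plotting and modeling functions are typically used together. The sketch below runs on a simulated data frame (the column names TV and Sales and all values are made up for illustration) and draws to a null graphics device so that it also runs without a display; in an interactive session the pdf(NULL) and dev.off() lines can be dropped.

```r
set.seed(4)
# Hypothetical data frame with one predictor (TV) and one response (Sales)
ads <- data.frame(TV = runif(100, min = 0, max = 300))
ads$Sales <- 7 + 0.05 * ads$TV + rnorm(100, sd = 3)

fit <- lm(Sales ~ TV, data = ads)  # response on the left of the tilde
summary(fit)                       # coefficients, t values, p-values, RSE, R-squared

pdf(NULL)                          # null device: nothing is written to disk
plot(ads$TV, ads$Sales,            # predictor first (x axis), response second (y axis)
     xlab = "TV", ylab = "Sales")
abline(fit, col = "blue")          # add the fitted line to the scatter plot
hist(ads$Sales)                    # histogram of a single feature
dev.off()
```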

R: Statistical outputs

R: Some of your friends (use wisely)
- Help: type a function name preceded by a question mark to get its documentation (e.g., ?lm, ?read.csv, etc.)
- Function write.csv() saves an object to a file
  - Syntax: write.csv(object.name, "file.name")
- Function subset() allows you to select rows and columns based on conditions on the values stored, e.g.,
  - selected.data = subset(original.data, RunTime >= 10 | RunTime < 5, select=c(RunTime, ...))
  - See http://www.statmethods.net/management/subset.html
- Function merge() allows you to perform database JOIN operations on multiple spreadsheets
- All the functions shown in the previous slides
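A small sketch tying subset(), merge() and write.csv() together. The two data frames, their column names (ID, RunTime, Label) and their values are made up for illustration.

```r
# Two hypothetical tables sharing an ID column
runs   <- data.frame(ID = 1:5, RunTime = c(3, 7, 12, 4, 15))
labels <- data.frame(ID = c(1, 3, 5), Label = c("a", "b", "c"))

# Keep rows where RunTime is at least 10 OR below 5, and two of the columns
selected <- subset(runs, RunTime >= 10 | RunTime < 5, select = c(ID, RunTime))
selected  # rows with ID 1, 3, 4 and 5

# Database-style JOIN on the shared ID column (inner join by default)
joined <- merge(selected, labels, by = "ID")
joined    # rows with ID 1, 3 and 5

# Save the result to a file and read it back
out_path <- tempfile(fileext = ".csv")
write.csv(joined, out_path, row.names = FALSE)
read.csv(out_path)
```

By default merge() keeps only IDs present in both tables; its all, all.x and all.y arguments give the outer-join variants.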

References
- ISLR: http://www-bcf.usc.edu/~gareth/isl/
- R Language System: https://www.r-project.org/
- Hadoop Language System: http://hadoop.apache.org/
- Advertising dataset: http://www-bcf.usc.edu/~gareth/isl/data.html
- Nice R GUI #1: https://rattle.togaware.com (Rattle runs on Windows or Linux)
- Nice R GUI #2: https://www.rstudio.com