Introduction to Data Science
CS 491, DES 430, IE 444, ME 444, MKTG 477
UIC Innovation Center, Fall 2017 and Spring 2018
Instructors: Charles Frisbie, Marco Susani, Michael Scott and Ugo Buy
Author: Ugo Buy
What is data science?
- Discipline seeking to extract knowledge and insights from large amounts of raw data
  - Examples: Predict income level from age; predict gender of Twitter user from colors chosen in tweets, etc.
- AKA Data Analytics
- Multidisciplinary in nature, mostly borrowing from:
  - Statistics
  - Computer Science (databases, machine learning, data mining, parallel computing)
  - Data Visualization
- Wide array of applications
  - Medical sciences (healthcare)
  - Finance (market predictions)
  - Logistics, etc.
Drew Conway's Venn diagram
- Multidisciplinary convergence:
  - Math and statistics
  - Domain knowledge
  - Computer science
- Detailed descriptions make explicit the role of HCI and UX in data science
  - HCI = Human-Computer Interaction
  - UX = User Experience
Our learning objectives
- Overarching pedagogical goal: Learn how to extract knowledge from mobility and transportation datasets
  - Public datasets: UIC library, Bureau of Transportation Statistics, Chicago Data Portal, etc.
  - BMW datasets (hopefully)
- Specific learning objectives:
  - Learn the basics of statistical learning
    - Input variables (aka features or predictors) vs. responses (aka outcomes or output variables)
    - Distinguish different prediction methods: regression and classification
    - Regression = predicted variable is continuous (e.g., predict vehicle value based on family income, etc.)
    - Classification = predicted variable is discrete (e.g., fraudulent vs. legit transaction, male vs. female user)
  - Learn how to visualize analysis results (Professor Susani)
    - Box plots, scatter plots, histograms, etc.
Resources
- Statistical learning: An Introduction to Statistical Learning (ISLR)
  - PDF available from http://www-bcf.usc.edu/~gareth/isl/
- Computer Science: Various languages with built-in support for statistical analysis, e.g.,
  - R: https://www.r-project.org/
  - Hadoop: http://hadoop.apache.org/
Public and UIC datasets
1. SimplyAnalytics database (UIC Library)
  - EASI → Census Data → Employment
  - EASI → Census Data → Vehicles
2. Chicago Data Portal (public)
  - https://data.cityofchicago.org/
  - Transportation data
  - Similar sites for NYC, LA, SFO, etc.
  - Counties sometimes have similar sites
3. National Transit Database (public)
  - https://www.transit.dot.gov/ntd
4. Reference USA database (public)
  - http://resource.referenceusa.com/
  - Use advanced search
  - Location and number of gas stations, car rental companies, etc.
5. Bureau of Transportation Statistics (BTS)
  - https://www.bts.gov/
  - Intermodal transportation database
  - Data on commercial aviation
  - Data on transportation economics
  - Asset Inventory Module (aka vehicles)
What we do with datasets of interest
- We extract information by means of statistical analysis
- Paradigm:
1. Formulate a hypothesis (i.e., ask a question)
  - Example: Is there a correlation between urban traffic density and air pollution?
2. Apply statistical learning methods to the dataset
  - Compute correlation indices between input and output variable, e.g., using regression analysis
3. Analyze statistical data to validate or refute the initial hypothesis
  - Null hypothesis: No significant correlation between input and output variables (variables are independent of each other)
  - Alternative hypothesis: Variables are in fact correlated (e.g., when input is high, output is likely to be low)
Correlation ≠ Causality
- Ultimate goal of correlation analysis: Establish causal relationships between different variables
  - If two variables are correlated, there could be a causal relationship between the variables
  - ... or not
- Analysis of beach communities shows high correlation between ice cream sales and shark attacks
  - But nobody is suggesting cutting ice cream sales as a way of preventing shark attacks
  - Ice cream sales and shark attacks are correlated but not causally related
- Source: https://m.xkcd.com/552/
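The ice cream example can be illustrated with a small simulation in R, the language introduced later in these slides. This is only a sketch on made-up data: the variable names, the coefficients, and the assumption that temperature is the common driver are all hypothetical and not taken from any real dataset.

```r
# Hypothetical simulation: temperature drives both quantities,
# and neither quantity influences the other.
set.seed(42)
temperature = runif(200, min = 10, max = 35)           # made-up daily temperatures
ice_cream   = 20 * temperature + rnorm(200, sd = 40)   # made-up sales figures
sharks      = 0.3 * temperature + rnorm(200, sd = 1)   # made-up attack index

# The two outcomes are strongly correlated with each other...
raw_cor = cor(ice_cream, sharks)

# ...but once the effect of temperature is removed from both,
# the leftover (residual) parts are nearly uncorrelated.
res_ice   = residuals(lm(ice_cream ~ temperature))
res_shark = residuals(lm(sharks ~ temperature))
adjusted_cor = cor(res_ice, res_shark)

print(c(raw = raw_cor, adjusted = adjusted_cor))
```

A high raw correlation combined with a near-zero adjusted correlation is what one would expect when a third variable explains both.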
Basic statistics definitions
- Average (aka mean value): Given a set of n values, their average μ is the sum of the values divided by the number n of values that were added together
  - Assume dataset = (12, 18, 6, 20, 24), then average μ = (12+18+6+20+24)/5 = 16
- Median: Given a set of n values sorted in order, median M is the value in the middle
  - Dataset above → M = 18
  - Often more useful than average, because the average is sometimes affected by outliers
- Variance: Average of the squared differences of the values from the mean, denoted by σ²
  - Indication of how spread out values are around the average
  - Sets (5, 10, 10, 15) and (9, 10, 10, 11) have the same μ = 10, but their variances are different (12.5 vs. 0.5)
- Standard deviation: The square root of the variance, denoted by σ
  - How much you should expect a random value to differ from the mean
  - σ ≈ 3.536 and σ ≈ 0.707 for the two sets above
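These definitions can be checked with a few lines of R, the language introduced later in these slides. The sketch below uses the five values summed on the slide (12, 18, 6, 20, 24). The helper pop_var() is a hypothetical name introduced here to match the slide's definition of variance. Note that R's built-in var() and sd() divide by n - 1 rather than n, so they return slightly larger values.

```r
x = c(12, 18, 6, 20, 24)
mu = mean(x)      # 16
m  = median(x)    # 18

# Variance as defined on the slide: average squared difference from the mean.
# pop_var is a helper defined here, not a built-in R function.
pop_var = function(v) mean((v - mean(v))^2)

a = c(5, 10, 10, 15)
b = c(9, 10, 10, 11)

var_a = pop_var(a)    # 12.5
var_b = pop_var(b)    # 0.5
sd_a  = sqrt(var_a)   # about 3.536
sd_b  = sqrt(var_b)   # about 0.707

# Built-in var() uses the sample formula (divide by n - 1):
sample_var_a = var(a)   # 50/3, about 16.67
sample_var_b = var(b)   # 2/3, about 0.667
```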
How do statistics help us?
- Plotting wage data (response variable) with respect to age (input variable) or year (input variable)
- Blue lines represent averages for each age and year value
- Help make sense of data!
- Source: ISLR, page 2
The key goal: Express output as a function of input + some error
- Given an input variable X, estimate response variable Y as a function of X + some error ε: Y = f(X) + ε
- See how f may help understand the relation between input and output variables
- Population = 30 people with different incomes and education
- Source: ISLR, page 16
The inference problem
- Given a response variable Y, and a set of input variables Xᵢ
  - Which input variables will affect the response?
  - What is the relationship between the response and each input variable?
  - Can the relationship be modeled as a linear function or is it more complex?
- We will consider linear relationships first
- Example: different advertising markets
- Source: ISLR, page 16
Simple linear regression
- Statistical model assuming that a single input variable is linearly related to the response variable
- Basic assumption: The relation between input and output is arranged as a line
  - Actual relation drawn as a line: Y ≈ β₀ + β₁X
  - Could be true or false, but a good starting point for analyzing CAT datasets
  - Linear prediction from n observations: ŷ = β₀ + β₁x, with β₀ and β₁ estimated from the data
  - Goal: Try to get predicted values as close as possible to actual values
Drawing the line
- What is the line that best fits our observations?
  - Must come up with predicted intercept and slope values β₀ and β₁
- Least squares method: Minimize the sum of the squared errors between observed and predicted values
  - Residual (error of one observation): difference between observed and predicted value, eᵢ = yᵢ − ŷᵢ
  - Minimize RSS = Residual Sum of Squares = e₁² + e₂² + ... + eₙ² when choosing β₀ and β₁
  - Good news: You'll never have to do the calculation of β₀ and β₁ yourself
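To make the least squares idea concrete, here is a minimal sketch in R, the language introduced later in these slides. The five data points are hypothetical toy values, not taken from the Advertising dataset. The closed-form expressions for the slope and intercept are the standard ones for simple linear regression, and lm() is R's built-in function that performs the same minimization.

```r
# Hypothetical toy data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for a single predictor
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
b0 = mean(y) - b1 * mean(x)                                      # intercept

# Residuals and RSS for the fitted line
y_hat = b0 + b1 * x
rss   = sum((y - y_hat)^2)

# Any other line gives a larger RSS, e.g. the same slope with a shifted intercept
rss_shifted = sum((y - (b0 + 0.1 + b1 * x))^2)

# lm() finds the same line
fit = lm(y ~ x)
print(coef(fit))   # intercept and slope, matching b0 and b1
```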
The numbers for the TV ad problem
- Advertising dataset (from http://www-bcf.usc.edu/~gareth/isl/data.html)
- Predicted slope β₁ = 0.0475
  - Sales to increase by 47.5 units of product for every $1,000 spent in TV advertising
- Predicted intercept β₀ = 7.03
  - Sales without TV advertising predicted to be 7,030 units
How good of a prediction?
- Must validate linear model assumption, but how?
1. Residual Standard Error (RSE): Square root of RSS divided by the degrees of freedom, RSE = √(RSS / (n − 2)), where n is the number of observations
  - RSE is an absolute measure of lack of fit of the linear prediction (= 3.26 for TV ad data; prediction off by 3,260 units on average)
2. R² statistic: Normalized measure of fit (values between 0 and 1)
  - Proportion of variability of Y that is explained by X: R² = 1 − RSS/TSS, where TSS = Σ (yᵢ − ȳ)² is the total sum of squares
  - Values close to 1 indicate high correlation; values close to 0 indicate low correlation
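Both quality measures can be computed by hand from the residuals and compared with what R reports. The sketch below follows the ISLR definitions, RSE = sqrt(RSS / (n - 2)) and R² = 1 - RSS/TSS, on hypothetical toy data (not the Advertising dataset).

```r
# Hypothetical toy data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

fit = lm(y ~ x)
n   = length(y)

rss = sum(residuals(fit)^2)      # residual sum of squares
tss = sum((y - mean(y))^2)       # total sum of squares

rse = sqrt(rss / (n - 2))        # residual standard error
r2  = 1 - rss / tss              # R-squared

# The same numbers as reported by summary()
rse_from_summary = summary(fit)$sigma
r2_from_summary  = summary(fit)$r.squared

# With a single predictor, R-squared equals the squared correlation
r2_from_cor = cor(x, y)^2
```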
Analyzing public datasets
- Decide whether certain features may affect each other (e.g., urban pollution vs. population density)
- Select features of interest (X and Y)
- Regress one feature over the other (e.g., using R or another statistical analysis package)
- Check the null hypothesis (X and Y are not correlated)
  - If the null hypothesis is true, slope β₁ will be zero or close to zero
  - How close to zero?
  - t-statistic: Estimated slope β₁ divided by its standard error, i.e., the slope's distance from zero in normalized units
  - p-value: Probability of observing a t-statistic at least this extreme if the null hypothesis were true; reject the null hypothesis for p-value less than 5%
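As a sketch of the last step, the t-statistic and p-value for the slope can be read from the coefficient table that summary() produces for a fitted lm() model, or recomputed by hand. The data below are hypothetical toy values; the 5% threshold is the one stated on the slide.

```r
# Hypothetical toy data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

fit   = lm(y ~ x)
coefs = summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

slope = coefs["x", "Estimate"]
se    = coefs["x", "Std. Error"]

# t-statistic: slope measured in units of its standard error
t_stat = slope / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
n = length(y)
p_value = 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Reject the null hypothesis (slope = 0) at the 5% level?
reject_null = p_value < 0.05
print(c(t = t_stat, p = p_value))
```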
The values for the TV ad dataset
- Source: ISLR, pages 68 and 69
The language R
- Programming language for statistical computing and graphics
- Named after the initial letter of its founders' names, Ross Ihaka and Robert Gentleman
- Relatively easy syntax
- Lots of built-in analysis methods (both for regression and classification)
- Basic language has a command-line interface; various GUI-based systems exist (e.g., Rattle, RStudio, etc.)
  - GUI tools usually include a command-line window
- Target platform: standalone computer (vs. Hadoop)
- Freely available on MS Windows, Linux, and Mac OS X platforms (GNU GPL terms)
  - Quite extensible → Packages
- Software, documentation and reference materials available at https://cran.r-project.org/
R: Basic commands
- Most commands execute built-in and user-defined functions
- Syntax: function_name(arg1, arg2, ...)
  - Example: sqrt is a 1-argument function returning the argument's square root
  - sqrt(9) → 3
- Values returned by functions can be saved in variables
  - x = sqrt(9)
  - Now x equals 3
- Function c() concatenates its arguments into a vector of values, e.g.,
  - c(10, 20, 30, 40) → 10 20 30 40
- Functions length(), mean(), median(), var(), sd() take a vector of values and return what their names suggest (var() = variance, sd() = standard deviation)
R: Matrix commands
- Matrix: A table of numbers (2-dimensional matrix)
  - R representation of CAT spreadsheets
- Create a matrix with function: matrix(elements, row_number, column_number)
- Typically assign the matrix to a variable to remember it
- Matrix element access by values or sets of values for row and column
  - Use the name of the matrix + row index and column index in square brackets, e.g.,
  - y[3,2] returns the second element in the third row of y
  - Ranges possible for row and column index
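A short sketch of these matrix commands follows. The values 1 to 12 are arbitrary; note that matrix() fills the table column by column unless told otherwise.

```r
# A 4-row, 3-column matrix holding the numbers 1..12, filled column by column
y = matrix(1:12, nrow = 4, ncol = 3)

elem       = y[3, 2]    # third row, second column: 7
first_rows = y[1:2, ]   # rows 1 and 2, all columns (a range as row index)
third_col  = y[, 3]     # all rows of column 3: 9 10 11 12
shape      = dim(y)     # 4 3

print(y)
```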
R: Read data from spreadsheets
- Function read.csv() loads a spreadsheet into R
  - Input: Comma-Separated Values (csv) spreadsheet
  - Output: A data frame (a 2-dimensional table with named columns)
- Function dim() returns dimensions
- Function names() returns column names
- Function cor() returns the correlation index (for a single predictor, its square equals R²)
- Use dollar sign $ to denote a column by symbolic name
  - Syntax: table_name$column_name
- Alternatively,
  - Use attach() function (sets default table)
  - Use numeric indices
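The sketch below exercises these functions without needing a file on disk: read.csv() also accepts the spreadsheet contents as a string through its text argument. The column names budget and sales and all the values are hypothetical; with a real file one would pass the file name instead.

```r
# Hypothetical CSV contents; with a real file use read.csv("some_file.csv")
csv_text = "budget,sales
100,12
200,17
300,20
400,27"

ads = read.csv(text = csv_text)

shape     = dim(ads)       # 4 rows, 2 columns
col_names = names(ads)     # "budget" "sales"
sales     = ads$sales      # column selected by name
second    = ads[2, 1]      # numeric indices: row 2, column 1 -> 200
r         = cor(ads$budget, ads$sales)   # correlation between the two columns

print(ads)
```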
R: Graphic display tools
- Function plot() opens a window with a scatter plot of 2 features
- Function hist() shows a histogram of 1 feature
R: Statistical learning tools
- Function lm() computes a linear model
  - Funny syntax uses tilde character: var = lm(response_var~input1+input2)
- Function summary(var) returns summary data
- Function abline(var) draws the fitted regression line on the current plot (use after plot())
  - Beware of switching the order of response and predictor between lm() and plot(): lm(y ~ x) names the response first, while plot(x, y) names it second
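Put together, a typical session might look like the sketch below. The data frame and its column names (budget, sales) are hypothetical toy values; predict() is a standard R function not shown on the slide, included here only to illustrate using the fitted model.

```r
# Hypothetical toy data in a data frame
d = data.frame(budget = c(1, 2, 3, 4, 5),
               sales  = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Response on the left of the tilde, predictor on the right
fit = lm(sales ~ budget, data = d)
print(summary(fit))          # coefficients, t-statistics, p-values, RSE, R-squared

# Predictor first, response second: the reverse of the lm() order
plot(d$budget, d$sales)
abline(fit)                  # adds the fitted line to the scatter plot

# Predicted response for a new input value
new_sales = predict(fit, newdata = data.frame(budget = 6))
```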
R: Statistical outputs
R: Some of your friends (use wisely)
- Help: Type a function name preceded by a question mark to get the function's documentation (e.g., ?lm, ?read.csv, etc.)
- Function write.csv() saves an object to a file
  - Syntax: write.csv(object.name, file.name)
- Function subset() allows you to select rows and columns based on conditions on the values stored, e.g.,
  - selected.data = subset(original.data, RunTime >= 10 | RunTime < 5, select=c(RunTime, ...))
  - See http://www.statmethods.net/management/subset.html
- Function merge() allows you to perform database JOIN operations on multiple spreadsheets
- All the functions shown in the previous slides
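A minimal sketch of subset(), merge() and write.csv() on made-up tables follows. The data frames runs and labels, their columns, and their values are hypothetical; the output goes to a temporary file so nothing in the working directory is overwritten.

```r
# Hypothetical table of run times
runs = data.frame(id = 1:5, RunTime = c(3, 7, 12, 4, 10))

# Keep rows where RunTime is at least 10 OR below 5; keep two named columns
selected = subset(runs, RunTime >= 10 | RunTime < 5, select = c(id, RunTime))
# selected now holds the rows with id 1, 3, 4, 5

# Hypothetical second table sharing the id column
labels = data.frame(id = c(1, 3, 5), label = c("a", "b", "c"))

# Inner JOIN on id: only ids present in both tables survive
joined = merge(selected, labels, by = "id")

# Save the result as a CSV file (temporary path used for illustration)
out_file = tempfile(fileext = ".csv")
write.csv(joined, out_file, row.names = FALSE)
print(joined)
```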
References
- ISLR: http://www-bcf.usc.edu/~gareth/isl/
- R Language System: https://www.r-project.org/
- Hadoop: http://hadoop.apache.org/
- Advertising dataset: http://www-bcf.usc.edu/~gareth/isl/data.html
- Nice R GUI #1: https://rattle.togaware.com (Rattle runs on Windows or Linux)
- Nice R GUI #2: https://www.rstudio.com