Introduction to Data Science
- Jerome McGee
- 6 years ago
1 Introduction to Data Science
CS 491, DES 430, IE 444, ME 444, MKTG 477
UIC Innovation Center, Fall 2017 and Spring 2018
Instructors: Charles Frisbie, Marco Susani, Michael Scott and Ugo Buy
Author: Ugo Buy
2 What is data science?
Discipline seeking to extract knowledge and insights from large amounts of raw data (AKA Data Analytics)
Examples: Predict income level from age; predict gender of a Twitter user from colors chosen in tweets, etc.
Multidisciplinary in nature, mostly borrowing from:
- Statistics
- Computer Science (databases, machine learning, data mining, parallel computing)
- Data Visualization
Wide array of applications:
- Medical sciences (healthcare)
- Finance (market predictions)
- Logistics, etc.
3 Drew Conway's Venn diagram
Multidisciplinary convergence:
- Math and statistics
- Domain knowledge
- Computer science
Detailed descriptions make explicit the role of HCI and UX in data science:
- HCI = Human-Computer Interaction
- UX = User Experience
4 Our learning objectives
Overarching pedagogical goal: Learn how to extract knowledge from mobility and transportation datasets
- Public datasets: UIC library, Bureau of Transportation Statistics, Chicago Data Portal, etc.
- BMW datasets (hopefully)
Specific learning objectives:
Learn the basics of statistical learning
- Input variables (aka features or predictors) vs. responses (aka outcomes or output variables)
- Distinguish different prediction methods: regression and classification
- Regression = predicted variable is continuous (e.g., predict vehicle value based on family income)
- Classification = predicted variable is discrete (e.g., fraudulent vs. legitimate transaction, male vs. female user)
Learn how to visualize analysis results (Professor Susani)
- Box plots, scatter plots, histograms, etc.
5 Resources
Statistical learning: An Introduction to Statistical Learning (ISLR); PDF available online
Computer Science: Various languages with built-in support for statistical analysis, e.g., R; Hadoop
6 Public and UIC datasets
1. SimplyAnalytics database (UIC Library)
- EASI → Census Data → Employment
- EASI → Census Data → Vehicles
2. Chicago Data Portal (public)
- Transportation data
- Similar sites for NYC, LA, SFO, etc.
- Counties sometimes have similar sites
3. National Transit Database (public)
4. Reference USA database (public)
- Use advanced search
- Location and number of gas stations, car rental companies, etc.
5. Bureau of Transportation Statistics (BTS)
- Intermodal transportation database
- Data on commercial aviation
- Data on transportation economics
- Asset Inventory Module (aka vehicles)
7 What we do with datasets of interest
We extract information by means of statistical analysis. Paradigm:
1. Formulate a hypothesis (i.e., ask a question)
- Example: Is there a correlation between urban traffic density and air pollution?
2. Apply statistical learning methods to the dataset
- Compute correlation indices between input and output variables, e.g., using regression analysis
3. Analyze statistical data to validate or refute the initial hypothesis
- Null hypothesis: No significant correlation between input and output variables (variables are independent of each other)
- Alternative hypothesis: Variables are in fact correlated (e.g., when input is high, output is likely to be low)
8 Correlation ≠ Causality
Ultimate goal of correlation analysis: Establish causal relationships between different variables
- If two variables are correlated, there could be a causal relationship between the variables
- ... or not
Analysis of beach communities shows high correlation between ice cream sales and shark attacks
- But nobody is suggesting cutting ice cream sales as a way of preventing shark attacks
- Ice cream sales and shark attacks are correlated but not causally related
Source: https://m.xkcd.com/552/
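A small simulation can make the ice cream and shark attack point concrete. The sketch below is not from the slides: it assumes a hypothetical third variable (daily temperature) that drives both quantities, and all numbers are made up for illustration.

```r
# Simulated data: temperature is an ASSUMED common cause (not in the slides).
set.seed(7)
temperature <- runif(200, 10, 35)                           # degrees Celsius
ice_cream   <- 20 + 3 * temperature + rnorm(200, sd = 10)   # daily sales
sharks      <- 0.2 * temperature + rnorm(200, sd = 1)       # attack index

# The two outcomes are strongly correlated with each other...
raw_cor <- cor(ice_cream, sharks)

# ...but once the effect of temperature is removed from both,
# the leftover variation is essentially unrelated.
adj_cor <- cor(resid(lm(ice_cream ~ temperature)),
               resid(lm(sharks ~ temperature)))

print(c(raw = raw_cor, adjusted = adj_cor))
```

A high raw correlation alongside a near-zero adjusted correlation is what a common cause looks like, as opposed to one variable causing the other.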
9 Basic statistics definitions
Average (aka mean value): Given a set of n values, their average $\mu$ is the sum of the values divided by n
- Assume dataset = (15, 18, 6, 20, 24); then $\mu = (15 + 18 + 6 + 20 + 24)/5 = 83/5 = 16.6$
Median: Given a set of n values, the median M is the value in the middle once the values are sorted
- Dataset above → M = 18
- Often more useful than the average, because the average is sometimes affected by outliers
Variance: Average of the squared differences of the values from the mean, denoted by $\sigma^2$
- Indication of how spread out values are around the average
- Sets (5, 10, 10, 15) and (9, 10, 10, 11) have the same $\mu = 10$, but their variances are different (12.5 vs. 0.5)
Standard deviation: The square root of the variance, denoted by $\sigma$
- How much you should expect a random value to differ from the mean
- $\sigma = \sqrt{12.5} \approx 3.54$ and $\sigma = \sqrt{0.5} \approx 0.71$ for the two sets above
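The definitions above can be checked in R with the slide's own numbers. One caveat worth knowing: R's built-in var() and sd() divide by n - 1 (sample variance), not by n as in the definition on this slide, so the helper pop_var() below is a hypothetical function written only for this illustration.

```r
# Dataset from the slide
x <- c(15, 18, 6, 20, 24)
avg <- mean(x)       # sum 83 divided by 5
med <- median(x)     # middle of the sorted values 6, 15, 18, 20, 24

# Variance as defined on the slide: mean of squared deviations (divide by n).
# pop_var is a helper defined here, not a built-in R function.
pop_var <- function(v) mean((v - mean(v))^2)

a <- c(5, 10, 10, 15)
b <- c(9, 10, 10, 11)
print(c(pop_var(a), pop_var(b)))              # 12.5 and 0.5
print(sqrt(c(pop_var(a), pop_var(b))))        # standard deviations

# Built-in var() divides by n - 1, so it is larger: 50 / 3 for set a.
print(var(a))
```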
10 How do statistics help us?
Plotting wage data (response variable) with respect to age (input variable) or year (input variable)
Blue lines represent averages for each age and year value
- They help make sense of the data
Source: ISLR, page 2
11 The key goal: Express output as a function of input + some error
Given an input variable X, estimate response variable Y as a function of X plus some error $\varepsilon$: $Y = f(X) + \varepsilon$
See how f may help understand the relation between input and output variables
Population = 30 people with different incomes and education
Source: ISLR, page 16
12 The inference problem
Given a response variable Y and a set of input variables $X_i$:
- Which input variables will affect the response?
- What is the relationship between the response and each input variable?
- Can the relationship be modeled as a linear function or is it more complex?
We will consider linear relationships first
Example: different advertising markets
Source: ISLR, page 16
13 Simple linear regression
Statistical model assuming that a single input variable is linearly related to the response variable
Basic assumption: The relation between input and output can be drawn as a line
- Actual relation drawn as a line: $Y \approx \beta_0 + \beta_1 X$
- Could be true or false, but a good starting point for analyzing CAT datasets
- Linear prediction from n observations: $\hat{y}_i = \beta_0 + \beta_1 x_i$ for $i = 1, \dots, n$
- Goal: Try to get predicted values as close as possible to actual values
14 Drawing the line
What is the line that best fits our observations?
- Must come up with predicted intercept and slope values $\beta_0$ and $\beta_1$
Least squares method: Minimize the sum of the squared errors between observed and predicted values
- Residual (error of one observation) is the difference between observed and predicted value: $e_i = y_i - \hat{y}_i$
- Minimize RSS = Residual Sum of Squares when choosing $\beta_0$ and $\beta_1$: $RSS = e_1^2 + e_2^2 + \dots + e_n^2$
- Good news: You'll never have to do the calculation of $\beta_0$ and $\beta_1$ yourself
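Although R does the calculation for you, it can help to see once what least squares computes. The sketch below uses a tiny made-up dataset (the values are illustrative, not from the course) and the standard closed-form formulas for simple linear regression, then compares the result with lm().

```r
# Made-up observations (illustrative values only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for one input variable
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

# Residuals and residual sum of squares for the fitted line
y_hat <- b0 + b1 * x
rss   <- sum((y - y_hat)^2)

# lm() minimizes the same RSS, so it should return the same coefficients
fit <- lm(y ~ x)
print(c(intercept = b0, slope = b1))
print(coef(fit))
```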
15 The numbers for TV ad problem
Advertising dataset (sales measured in thousands of units, TV budget in thousands of dollars)
Predicted slope $\beta_1 = 0.0475$
- Sales to increase by 47.5 units of product for every $1,000 spent in TV advertising
Predicted intercept $\beta_0 = 7.03$
- Sales without TV advertising predicted to be 7,030 units
16 How good of a prediction?
Must validate linear model assumption, but how?
1. Residual Standard Error (RSE): Square root of the ratio of RSS to the degrees of freedom (n - 2 for n observations): $RSE = \sqrt{RSS/(n-2)}$
- RSE is an absolute measure of the lack of fit of the linear prediction (= 3.26 for TV ad data; prediction off by 3,260 units on average)
2. $R^2$ statistic: Normalized measure of fit (values between 0 and 1): Proportion of variability of Y that is explained by X
- $R^2 = (TSS - RSS)/TSS = 1 - RSS/TSS$, where $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares
- Values close to 1 indicate high correlation; values close to 0 indicate low correlation
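Both fit measures can be computed by hand from a fitted model and compared with what summary() reports. This sketch assumes the ISLR definitions, RSE = sqrt(RSS / (n - 2)) and R² = 1 - RSS/TSS, and uses made-up data.

```r
# Made-up observations (illustrative values only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.5)

fit <- lm(y ~ x)
n   <- length(y)

rss <- sum(resid(fit)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares

rse <- sqrt(rss / (n - 2))        # residual standard error
r2  <- 1 - rss / tss              # proportion of variability explained

# summary() stores the same two quantities as sigma and r.squared
s <- summary(fit)
print(c(rse = rse, reported = s$sigma))
print(c(r2 = r2, reported = s$r.squared))
```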
17 Analyzing public datasets
Decide whether certain features may affect each other (e.g., urban pollution vs. population density)
Select features of interest (X and Y)
Regress one feature over the other (e.g., using R or another statistical analysis package)
Check the null hypothesis (X and Y are not correlated)
- If the null hypothesis is true, slope $\beta_1$ will be zero or close to zero
- How close to zero?
- t-statistic: Slope $\beta_1$ divided by its standard error, i.e., how many standard errors the slope is away from zero
- p-value: Probability of observing a t-statistic at least this extreme if the null hypothesis were true; reject the null hypothesis for a p-value less than 5%
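The null-hypothesis check can be sketched in R as follows. The data are simulated (an assumption for illustration, with a true slope of 0.5), and the t-statistic and p-value are recomputed by hand from the slope and its standard error to show where the numbers in summary() come from.

```r
# Simulated data: the true relationship is assumed for illustration only
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 2)

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients   # one row per coefficient

slope   <- coefs["x", "Estimate"]
std_err <- coefs["x", "Std. Error"]

# t-statistic: how many standard errors the slope is away from zero
t_stat <- slope / std_err

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Reject the null hypothesis (slope = 0) at the 5% level?
reject_null <- p_value < 0.05
print(c(t = t_stat, p = p_value))
```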
18 The values for the TV ad dataset
Source: ISLR, pages 68 and 69
19 The language R
Programming language for statistical computing and graphics
Named after the initial letter of its founders' names, Ross Ihaka and Robert Gentleman
Relatively easy syntax
Lots of built-in analysis methods (both for regression and classification)
Basic language has a command-line interface; various GUI-based systems exist (e.g., Rattle, RStudio, etc.)
- GUI tools usually include a command-line window
Target platform: standalone computer (vs. Hadoop)
Freely available on MS Windows, Linux, and Mac OS X platforms (GNU GPL terms)
- Quite extensible → Packages
Software, documentation and reference materials are available online
20 R: Basic commands
Most commands execute built-in and user-defined functions
Syntax: function_name(arg1, arg2, ...)
- Example: sqrt is a 1-argument function returning the argument's square root
- sqrt(9) → 3
Values returned by functions can be saved in variables
- x = sqrt(9)
- Now x equals 3
Function c() concatenates its arguments into a vector of values, e.g.,
- c(10, 20, 30, 40)
Functions length(), mean(), median(), var(), sd() take a vector of values and return the corresponding statistic
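The commands on this slide can be typed at the R prompt as is. A short session might look like the following sketch; the vector name v is arbitrary.

```r
x <- sqrt(9)              # call a function and save the result; x is now 3
v <- c(10, 20, 30, 40)    # build a vector of four values

n_values <- length(v)     # number of elements
v_mean   <- mean(v)       # average
v_median <- median(v)     # middle value
v_var    <- var(v)        # sample variance (divides by n - 1)
v_sd     <- sd(v)         # sample standard deviation, sqrt of var(v)

print(c(n_values, v_mean, v_median, v_var, v_sd))
```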
21 R: Matrix commands
Matrix: A table of numbers (2-dimensional matrix)
- R representation of CAT spreadsheets
Create a matrix with function: matrix(elements, row_number, column_number)
Typically assign the matrix to a variable to remember it
Matrix element access by values or sets of values for row and column
- Use name of matrix + row index and column index in square brackets, e.g.,
- y[3,2] returns the second element in the third row of y
- Ranges possible for row and column index
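As a quick illustration of matrix creation and indexing, the sketch below builds a 4-by-3 matrix from the numbers 1 to 12. Note that R fills matrices column by column by default, which affects where each number lands.

```r
# 12 elements arranged in 4 rows and 3 columns, filled column by column:
# column 1 holds 1..4, column 2 holds 5..8, column 3 holds 9..12
y <- matrix(1:12, 4, 3)

elem      <- y[3, 2]      # third row, second column
first_two <- y[1:2, ]     # a range of rows, all columns
third_col <- y[, 3]       # all rows of the third column
shape     <- dim(y)       # number of rows and columns

print(y)
print(elem)
```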
22 R: Read data from spreadsheets
Function read.csv() loads a spreadsheet into R
- Input: Comma-Separated Values (CSV) spreadsheet
- Output: A data frame (a 2-dimensional table with named columns)
Function dim() returns dimensions
Function names() returns column names
Function cor() returns the correlation index (in simple linear regression, its square equals the R-squared statistic)
Use dollar sign $ to denote a column by symbolic name
- Syntax: data_frame_name$column_name
Alternatively,
- Use attach() function (sets the default data frame, so columns can be used by name)
- Use numeric indices
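A minimal sketch of loading and inspecting a CSV file follows. To keep it self-contained, it first writes a tiny made-up file to a temporary location; the column names tv and sales and all values are hypothetical.

```r
# Create a small CSV file to read back (made-up values, hypothetical columns)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("tv,sales",
             "100,11.8",
             "200,16.5",
             "300,21.3",
             "50,9.4"), csv_path)

ads <- read.csv(csv_path)     # returns a data frame

shape     <- dim(ads)         # rows and columns
col_names <- names(ads)       # column names taken from the header line
r         <- cor(ads$tv, ads$sales)   # columns selected with $

second_sale <- ads[2, "sales"]        # row 2 of column "sales"
same_value  <- ads[2, 2]              # same cell using numeric indices

print(ads)
print(r)
```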
23 R: Graphic display tools
Function plot() opens a window with a scatter plot of 2 features
Function hist() shows a histogram of 1 feature
24 R: Statistical learning tools
Function lm() computes a linear model
- Funny syntax uses the tilde character: var = lm(response_var ~ input1 + input2)
Function summary(var) returns summary data
Function abline(var) draws the fitted regression line on the current plot (use after plot())
- Beware of switching the order of response and predictors between lm() and plot()
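Putting lm(), summary(), plot() and abline() together might look like the sketch below. The data are simulated stand-ins for an advertising dataset (column names and coefficients are assumptions for illustration). Note the argument order: the formula is response ~ predictor, while plot() takes the predictor (x axis) first.

```r
# Simulated data (assumed for illustration; not the real Advertising dataset)
set.seed(42)
tv    <- runif(50, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(50, sd = 3)
ads   <- data.frame(tv = tv, sales = sales)

# When run as a script, draw to a null device instead of opening a window
if (!interactive()) pdf(NULL)

fit <- lm(sales ~ tv, data = ads)   # formula order: response ~ predictor
print(summary(fit))                 # coefficients, RSE, R-squared, p-values

plot(ads$tv, ads$sales,             # plot order: predictor (x), response (y)
     xlab = "TV budget", ylab = "Sales")
abline(fit)                         # add the fitted line to the scatter plot
```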
25 R: Statistical outputs
26 R: Some of your friends (use wisely)
Help: Type a function name preceded by a question mark to get the function's documentation (e.g., ?lm, ?read.csv, etc.)
Function write.csv() saves an object to a file
- Syntax: write.csv(object.name, "file.name")
Function subset() allows you to select rows and columns based on conditions on the values stored, e.g.,
- selected.data = subset(original.data, RunTime >= 10 | RunTime < 5, select=c(RunTime, ...))
Function merge() allows you to perform database JOIN operations on multiple spreadsheets
All the functions shown in the previous slides
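The following sketch shows subset(), merge() and write.csv() working together on two small made-up tables. The table and column names (runs, cars, id, RunTime, Driver, Model) are hypothetical and chosen only for this illustration.

```r
# Two made-up tables sharing an "id" column (hypothetical names and values)
runs <- data.frame(id      = 1:5,
                   RunTime = c(3, 7, 12, 4, 15),
                   Driver  = c("a", "b", "c", "d", "e"))
cars <- data.frame(id    = c(1, 3, 5),
                   Model = c("X", "Y", "Z"))

# Keep rows where RunTime is at least 10 OR below 5, and only two columns
selected <- subset(runs, RunTime >= 10 | RunTime < 5,
                   select = c(id, RunTime))

# Database-style JOIN: keep rows whose id appears in both tables
joined <- merge(selected, cars, by = "id")

# Save the result; row.names = FALSE avoids an extra index column
out_path <- tempfile(fileext = ".csv")
write.csv(joined, out_path, row.names = FALSE)

print(selected)
print(joined)
```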
27 References
- ISLR: An Introduction to Statistical Learning
- R Language System
- Hadoop Language System
- Advertising dataset
- Nice R GUI #1: Rattle (runs on Windows or Linux)
- Nice R GUI #2
More informationDATA STRUCTURE AND ALGORITHM USING PYTHON
DATA STRUCTURE AND ALGORITHM USING PYTHON Common Use Python Module II Peter Lo Pandas Data Structures and Data Analysis tools 2 What is Pandas? Pandas is an open-source Python library providing highperformance,
More informationChapter 2: Modeling Distributions of Data
Chapter 2: Modeling Distributions of Data Section 2.2 The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE Chapter 2 Modeling Distributions of Data 2.1 Describing Location in a Distribution
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.
More informationCHAPTER 2 Modeling Distributions of Data
CHAPTER 2 Modeling Distributions of Data 2.2 Density Curves and Normal Distributions The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers Density Curves
More informationGetting Started. Slides R-Intro: R-Analytics: R-HPC:
Getting Started Download and install R + Rstudio http://www.r-project.org/ https://www.rstudio.com/products/rstudio/download2/ TACC ssh username@wrangler.tacc.utexas.edu % module load Rstats %R Slides
More informationComputing With R Handout 1
Computing With R Handout 1 The purpose of this handout is to lead you through a simple exercise using the R computing language. It is essentially an assignment, although there will be nothing to hand in.
More informationDS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University
DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 22 2019 Outline Practical issues in Linear Regression Outliers Categorical variables Lab
More informationStatistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.
Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975. SPSS Statistics were designed INTRODUCTION TO SPSS Objective About the
More informationEXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression
EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression OBJECTIVES 1. Prepare a scatter plot of the dependent variable on the independent variable 2. Do a simple linear regression
More informationElementary Statistics
1 Elementary Statistics Introduction Statistics is the collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing
More informationData Mining and Analytics. Introduction
Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data
More information1 RefresheR. Figure 1.1: Soy ice cream flavor preferences
1 RefresheR Figure 1.1: Soy ice cream flavor preferences 2 The Shape of Data Figure 2.1: Frequency distribution of number of carburetors in mtcars dataset Figure 2.2: Daily temperature measurements from
More informationPackage OLScurve. August 29, 2016
Type Package Title OLS growth curve trajectories Version 0.2.0 Date 2014-02-20 Package OLScurve August 29, 2016 Maintainer Provides tools for more easily organizing and plotting individual ordinary least
More informationPython for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT
Python for Data Analysis Prof.Sushila Aghav-Palwe Assistant Professor MIT Four steps to apply data analytics: 1. Define your Objective What are you trying to achieve? What could the result look like? 2.
More informationEcon 3790: Business and Economics Statistics. Instructor: Yogesh Uppal
Econ 3790: Business and Economics Statistics Instructor: Yogesh Uppal Email: yuppal@ysu.edu Chapter 8: Interval Estimation Population Mean: Known Population Mean: Unknown Margin of Error and the Interval
More informationExpectation Maximization (EM) and Gaussian Mixture Models
Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation
More informationData 8 Final Review #1
Data 8 Final Review #1 Topics we ll cover: Visualizations Arrays and Table Manipulations Programming constructs (functions, for loops, conditional statements) Chance, Simulation, Sampling and Distributions
More informationHomework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)
Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Fall 2014, Prakash Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in
More informationStatistical Good Practice Guidelines. 1. Introduction. Contents. SSC home Using Excel for Statistics - Tips and Warnings
Statistical Good Practice Guidelines SSC home Using Excel for Statistics - Tips and Warnings On-line version 2 - March 2001 This is one in a series of guides for research and support staff involved in
More informationThemes in the Texas CCRS - Mathematics
1. Compare real numbers. a. Classify numbers as natural, whole, integers, rational, irrational, real, imaginary, &/or complex. b. Use and apply the relative magnitude of real numbers by using inequality
More information