Assignment 4 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

1. Which among the following techniques can be used to aid decision making when those decisions depend upon some available data?

(a) descriptive statistics
(b) inferential statistics
(c) predictive analytics
(d) prescriptive analytics

Sol. (a), (b), (c), (d)
The techniques listed above allow us to visualise and understand data, infer properties of data sets and compare data sets, as well as build models of the processes generating the data in order to make predictions about unseen data. In different scenarios, each of these processes can be used to help in decision making.

2. In a popular classification algorithm it is assumed that the entire input space can be divided into axis-parallel rectangles with each such rectangle being assigned a class label. This assumption is an example of

(a) language bias
(b) search bias

Sol. (a)
The mentioned constraints have to do with what forms the classification models can take rather than the search process to find the specific model given some data. Hence, the above assumption is an example of language bias.

3. Suppose we trained a supervised learning algorithm on some training data and observed that the resultant model gave no error on the training data. Which among the following conclusions can you draw in this scenario?

(a) the learned model has overfit the data
(b) it is possible that the learned model will generalise well to unseen data
(c) it is possible that the learned model will not generalise well to unseen data
(d) the learned model will definitely perform well on unseen data
(e) the learned model will definitely not perform well on unseen data

Sol. (b), (c)
Consider the (rare) situation where the given data is absolutely pristine, and our choice of algorithm and parameter selection allow us to come up with a model which exactly matches the process generating the data. In such a situation we can expect 100% accuracy on the training data as well as good performance on unseen data. As an example, consider the case where we create some synthetic data having a linear relationship between the inputs and the output and then apply linear regression to model the data. The more plausible situation is of course that we used a very complex model which ended up overfitting the training data. As we have seen, such models may achieve high accuracy on the training data but generally perform poorly on unseen examples.

4. You are faced with a five-class classification problem, with one class being the class of interest, i.e., you care more about correctly classifying data points belonging to that one class than the others. You are given a data set to use as training data. You analyse the data and observe the following properties. Which among these properties do you think are unfavourable from the point of view of applying general machine learning techniques?

(a) all data points come from the same distribution
(b) there is class imbalance, with many more data points belonging to the class of interest than the other classes
(c) the given data set has some missing values
(d) each data point in the data set is independent of the other data points

Sol. (c)
Option (a) is favourable, since it is an implicit assumption we make when we try applying supervised learning techniques. Option (b) indicates class imbalance, which is usually an issue when it comes to classification. However, note that the imbalance is in favour of the class of interest. This means that there are a large number of examples for the class which we are interested in, which should allow us to model this class well. Option (c) is of course unfavourable as it would require us to handle missing data and, depending on the extent of the data that is missing, could severely affect the performance of any classifier we come up with. Finally, option (d) is favourable since, once again, it is an assumption about the data that we implicitly make.

5. You are given the following four training instances:

x_1 = -1, y_1 = 0.0319
x_2 = 0, y_2 = 0.8692
x_3 = 1, y_3 = 1.966
x_4 = 2, y_4 = 3.0343

Modelling this data using the ordinary least squares regression form of y = f(x) = b_0 + b_1 x, which of the following parameters (b_0, b_1) would you use to best model this data?

(a) (1, 1)
(b) (1, 2)
(c) (2, 1)
(d) (2, 2)

Sol. (a)
Let us consider the model with parameters (1, 1). We can find the sum of squared errors using this parameter setting as

(0 - 0.0319)^2 + (1 - 0.8692)^2 + (2 - 1.966)^2 + (3 - 3.0343)^2 ≈ 0.02

Repeating the same calculation for the other three options, we find that the minimum squared error is obtained when the parameters are (1, 1). Thus, this is the most suitable parameter setting for modelling the given data.
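As an illustrative cross-check (not part of the original assignment), the short Python sketch below evaluates the sum of squared errors for each of the four candidate parameter pairs on the question 5 data; the pair (1, 1) gives the smallest error.

# Sum of squared errors for each candidate (b0, b1) on the question 5 data.
data = [(-1, 0.0319), (0, 0.8692), (1, 1.966), (2, 3.0343)]
candidates = [(1, 1), (1, 2), (2, 1), (2, 2)]
for b0, b1 in candidates:
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in data)
    print(f"(b0, b1) = ({b0}, {b1}): SSE = {sse:.4f}")
# (1, 1) yields an SSE of about 0.02; the other pairs are far worse.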

6. You are given the following five training instances:

x_1 = 1, y_1 = 3.4
x_2 = 1.5, y_2 = 4.7
x_3 = 2, y_3 = 6.15
x_4 = 2.25, y_4 = 6.4
x_5 = 4, y_5 = 10.9

Using the derivation results for the parameters in ordinary least squares, calculate the values of b_0 and b_1. (Note that in the expression for b_1, the last term in the denominator is (Σ_{i=1}^{N} x_i)^2.)

(a) b_0 = 7.4, b_1 = 0.7
(b) b_0 = 2.484, b_1 = 0.969
(c) b_0 = 0.969, b_1 = 2.484
(d) b_0 = 1, b_1 = 2.5

Sol. (c)
According to the derivations we saw in the lectures, we have

b_1 = (N Σ x_i y_i - Σ x_i Σ y_i) / (N Σ x_i^2 - (Σ x_i)^2), where each sum runs over i = 1, ..., N,

and

b_0 = ȳ - b_1 x̄

Using the supplied data, with N = 5, we have:

Σ x_i = 1 + 1.5 + 2 + 2.25 + 4 = 10.75
Σ y_i = 3.4 + 4.7 + 6.15 + 6.4 + 10.9 = 31.55
Σ x_i y_i = 1(3.4) + 1.5(4.7) + 2(6.15) + 2.25(6.4) + 4(10.9) = 80.75
Σ x_i^2 = 1^2 + 1.5^2 + 2^2 + 2.25^2 + 4^2 = 28.3125
(Σ x_i)^2 = 10.75^2 = 115.5625
x̄ = Σ x_i / N = 10.75 / 5 = 2.15
ȳ = Σ y_i / N = 31.55 / 5 = 6.31

Using the above calculations, we have

b_1 = (5 × 80.75 - 31.55 × 10.75) / (5 × 28.3125 - 115.5625) = 64.5875 / 26 ≈ 2.484

Using this value of b_1, we have

b_0 = ȳ - b_1 x̄ = 6.31 - 2.484 × 2.15 ≈ 0.969
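The same closed-form calculation can be reproduced in a few lines of Python; this is offered only as an illustrative cross-check of the arithmetic above, not as part of the assignment.

# Ordinary least squares via the closed-form expressions used in question 6.
x = [1, 1.5, 2, 2.25, 4]
y = [3.4, 4.7, 6.15, 6.4, 10.9]
n = len(x)

sum_x = sum(x)                                   # 10.75
sum_y = sum(y)                                   # 31.55
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 80.75
sum_x2 = sum(xi ** 2 for xi in x)                # 28.3125

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)
print(round(b1, 3), round(b0, 3))                # 2.484 0.969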

7. Recall the regression output obtained when using Microsoft Excel. In the third table, there are columns for t Stat and P-value. Is the probability value shown here for a one-sided test or a two-sided test?

(a) one-sided test
(b) two-sided test

Sol. (b)
Recall that the null hypothesis used here is H_0: b_i = 0, where b_i is the coefficient corresponding to the i-th variable (i > 0), and b_0 is the intercept.

8. In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that

(a) this feature has a strong effect on the model (should be retained)
(b) this feature does not have a strong effect on the model (should be ignored)
(c) it is not possible to comment on the importance of this feature without additional information

Sol. (c)
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.

9. Assuming that for a specific problem both linear regression and regression using the K-NN approach give the same level of performance, which technique would you prefer if response time, i.e., the time taken to make a prediction given an input data point, is a major consideration?

(a) linear regression
(b) K-NN
(c) both are equally suitable

Sol. (a)
As we have seen, in K-NN, for each input data point for which we want to make a prediction, we have to search the entire training data set for the neighbours of that point. Depending upon the dimensionality of the data as well as the size of the training set, this process can take some time. On the other hand, in linear regression, once the model has been learned, we need to perform only a single simple calculation to make a prediction for a given data point.
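The difference in prediction cost can be made concrete with a rough Python sketch. The synthetic data, its size (50,000 points, 20 features) and the brute-force neighbour search are illustrative assumptions, not details taken from the assignment.

# Rough illustration of per-query prediction cost: linear model vs brute-force K-NN.
import time
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 20))                 # synthetic training data
y_train = X_train @ rng.normal(size=20) + rng.normal(size=50_000)
queries = rng.normal(size=(100, 20))                    # 100 points to predict
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]    # model fitted once, offline

t0 = time.perf_counter()
preds_lr = queries @ w                                  # one dot product per query
t1 = time.perf_counter()
for q in queries:                                       # scan all training points per query
    idx = np.argsort(((X_train - q) ** 2).sum(axis=1))[:5]
    y_pred_knn = y_train[idx].mean()
t2 = time.perf_counter()
print(f"linear regression: {t1 - t0:.4f} s, 5-NN: {t2 - t1:.4f} s for 100 queries")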

10. You are given the following five training instances:

x_1 = 2, x_2 = 1, y = 4
x_1 = 6, x_2 = 3, y = 2
x_1 = 2, x_2 = 5, y = 2
x_1 = 6, x_2 = 7, y = 3
x_1 = 10, x_2 = 7, y = 3

Using the K-nearest neighbour technique for performing regression, what will be the predicted y value corresponding to the input point (x_1 = 3, x_2 = 6), for K = 2 and for K = 3?

(a) K = 2, y = 3; K = 3, y = 2.33
(b) K = 2, y = 3; K = 3, y = 2.66
(c) K = 2, y = 2.5; K = 3, y = 2.33
(d) K = 2, y = 2.5; K = 3, y = 2.66

Sol. (c)
When K = 2, the nearest points are (x_1 = 2, x_2 = 5) and (x_1 = 6, x_2 = 7). Taking the average of the outputs of these two points, we have y = (2 + 3)/2 = 2.5. Similarly, when K = 3, we additionally consider the point (x_1 = 6, x_2 = 3) to get the output y = (2 + 3 + 2)/3 ≈ 2.33.
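The same nearest-neighbour calculation can be checked with a short Python sketch; this is an illustration only, as the assignment expects the calculation by hand.

# K-NN regression for the query point (3, 6) using the question 10 data.
import math

points = [((2, 1), 4), ((6, 3), 2), ((2, 5), 2), ((6, 7), 3), ((10, 7), 3)]
query = (3, 6)

# Rank training points by Euclidean distance to the query point.
ranked = sorted(points, key=lambda p: math.dist(p[0], query))

for k in (2, 3):
    neighbours = ranked[:k]
    prediction = sum(y for _, y in neighbours) / k
    print(k, [xy for xy, _ in neighbours], round(prediction, 2))
# K = 2 -> neighbours (2, 5) and (6, 7), prediction 2.5
# K = 3 -> additionally (6, 3), prediction 2.33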

Weka-based programming assignment questions

The following questions are based on using Weka. Go through the tutorial on Weka before attempting these questions. You can download the data sets used in this assignment here.

Data set 1

This is a synthetic data set to get you started with Weka. This data set contains 100 data points. The input is 3-dimensional (x1, x2, x3) with one output variable (y). This data is in the arff format, which can directly be used in Weka.

Tasks

For this data set, you will need to apply linear regression with regularisation, attribute selection and collinear attribute elimination disabled (other parameters are to be left at their default values). Use 10-fold cross-validation for evaluation.

Data set 2

This is a modified version of the prostate cancer data set from the ESL textbook, in the arff format. It contains nine numeric attributes, with the attribute lpsa being treated as the target attribute. This data set is provided in two files. Use the test file provided for evaluation (rather than the cross-validation method).

Tasks

For this data set, you will need to apply linear regression. First apply linear regression with regularisation, attribute selection and collinear attribute elimination disabled. Next, enable regularisation and try out different values of the parameter, and observe whether better performance can be obtained with suitable values of the regularisation parameter. Note that to apply regularisation you should normalise the data (except the target variable). First normalise the test set and save the normalised data. Next, open the training data, normalise it and then run the regression algorithm (with the normalised test set supplied for evaluation purposes).

Data set 3

This is the Parkinsons Telemonitoring data set taken from the UCI machine learning repository. Given are two files, one for training and one for testing. The files are in csv format and need to be converted into the arff format before algorithms are applied to them. The last variable, PPE, is the target variable.

Tasks

For this data set, first apply linear regression with regularisation, attribute selection and collinear attribute elimination disabled. Note the performance and compare with the performance obtained by applying K-nearest neighbour regression. To run K-NN, select the function IBk under the lazy folder. Leave all parameters set to their default values, except for the K parameter (KNN in the interface).

11. What is the best linear fit for data set 1?

(a) y = 5*x1 + 6*x2 + 4*x3 + 12
(b) y = 3*x1 + 7*x2 - 2.5*x3 - 16
(c) y = 3*x1 + 7*x2 + 2.5*x3 - 16
(d) y = 2.5*x1 + 7*x2 + 4*x3 + 16

Sol. (b)
The data set used here is generated using the function specified by option (b), with no noise added; hence running linear regression on this data set should allow us to recover the exact parameters and observe 100% accuracy.

12. Which of the following ridge regression parameter values leads to the lowest root mean squared error for the prostate cancer data (data set 2) on the supplied test set?

(a) 0
(b) 2
(c) 4
(d) 8
(e) 16
(f) 32

Sol. (e)
Among the given choices of the ridge regression parameter, we observe the best performance (lowest error) at the value of 16.
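The ridge-parameter sweep described above can also be sketched outside Weka. The snippet below uses scikit-learn's Ridge (whose alpha plays the role of Weka's ridge parameter, although the two tools do not scale it identically) and a bundled stand-in dataset, since the prostate cancer files are not distributed with this document; it illustrates the procedure rather than reproducing the assignment's numbers.

# Sweep the ridge parameter and report test RMSE (scikit-learn, stand-in data).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalise the attributes (but not the target), as the assignment instructs.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for alpha in [0, 2, 4, 8, 16, 32]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"ridge parameter {alpha:>2}: test RMSE = {rmse:.3f}")
# Plotting the error against the ridge parameter typically gives a concave
# upwards (U-shaped) curve: the error first falls and then rises again.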

13. If a curve is plotted with the error on the y-axis and the ridge regression parameter on the x-axis, then, based on your observations in the previous question, which of the following most closely resembles the curve?

(a) straight line passing through the origin
(b) concave downwards function
(c) concave upwards function
(d) line parallel to the x-axis

Sol. (c)
In general, as the ridge regression parameter is increased, the error begins to decrease, but after a certain point, further increase in the parameter drives up the error.

14. Considering data set 3, is the performance of K-nearest neighbour regression (where the value of the parameter K is varied between 1 and 25) comparable to the performance of linear regression (without regularisation)?

(a) no
(b) yes

Sol. (b)
At K = 5, the performance is even better.
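For question 14, the comparison between linear regression and K-NN regression can be sketched in the same way; scikit-learn's LinearRegression and KNeighborsRegressor stand in for Weka's LinearRegression and IBk, and a bundled stand-in dataset replaces the Parkinsons Telemonitoring files, which are not distributed with this document.

# Compare linear regression with K-NN regression for several values of K.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
lr_rmse = mean_squared_error(y_test, lr.predict(X_test)) ** 0.5
print(f"linear regression: test RMSE = {lr_rmse:.3f}")

for k in (1, 5, 25):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    knn_rmse = mean_squared_error(y_test, knn.predict(X_test)) ** 0.5
    print(f"{k}-NN regression: test RMSE = {knn_rmse:.3f}")
# On the actual Parkinsons data, the two methods give comparable errors,
# which is what the answer to question 14 asserts.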