Assignment No: 2. Assessment as per Schedule. Specifications Readability Assignments


Darcy Moody
5 years ago

Assessment: Specifications, Readability, Assignments, Assessment as per Schedule, Oral, Total
Date of Performance:...
Expected Date of Completion:...
Actual Date of Completion:
Assignment No:

Title of the Assignment: Supervised Learning - Regression

Objective of the Assignment: Generate a proper 2-D data set of N points. Split the data set into a training data set and a test data set.
i. Perform linear regression analysis with the Least Squares Method.
ii. Plot the graphs for Training MSE and Test MSE and comment on curve fitting and generalization error.
iii. Verify the effect of data set size and the bias-variance tradeoff.
iv. Apply cross validation and plot the graphs for errors.
v. Apply subset selection and plot the graphs for errors.
vi. Describe your findings.

Prerequisite: Basic concepts of RStudio, the concept of supervised learning, parameters of linear regression.

Theory:

Part 1. Least Squares Method for Linear Regression.

This method is used to find the coefficients (model parameters) in linear regression. The least squares method is simple and widely applicable. It is discussed here for simple (univariate) linear regression.

Consider that we are given some sample data (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n). Our target is to find a simple linear regression model between the independent variable X and the dependent variable Y:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Because this equation is linear in $\beta_0$ and $\beta_1$, it is called linear regression. $\beta_0$ and $\beta_1$ are called the parameters of the linear regression. These parameters can be found by two different methods:
1. Least Squares Method.
2. Maximum Likelihood Estimation.

Both methods lead to the same formulae for $\beta_0$ and $\beta_1$ (for maximum likelihood, under the usual assumption of normally distributed errors). The target, in turn, is to find the values of $\beta_0$ and $\beta_1$ that give the linear regression model (regression line) best fitting the given sample data.

Let the regression equation be

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{y}$ is the value predicted by the regression line.

Residuals or Errors: A residual is the difference between the actual value of Y given in the training data and the value $\hat{y}$ predicted by the linear regression:

$$e_i = y_i - \hat{y}_i$$

The least squares method finds the values of $\beta_0$ and $\beta_1$ that give the regression line for which the Sum of Squares of all Residuals or Errors (SSE) is minimum:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
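The least squares fit described above can be sketched in a few lines. This sketch assumes the standard closed-form solution for simple linear regression, $\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. It is written in Python rather than R for illustration only; the function names and the sample points are hypothetical, not part of the assignment.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) minimizing SSE for the line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares estimates for one predictor.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared residuals of the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample lying exactly on y = 1 + 2x, so the fit should
# recover b0 = 1 and b1 = 2 with zero residual error.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_least_squares(xs, ys)
fit_sse = sse(xs, ys, b0, b1)
```

Any other line through these points has a larger SSE, which is what "best fitting" means here.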

To get the minimum value of SSE with respect to $\beta_0$ and $\beta_1$, the partial derivatives of SSE with respect to $\beta_0$ and $\beta_1$ must both be equal to 0.

Part 2. Divide the data into a training data set and a test data set for different sizes. For example, if the data contains 1000 records, then the following can be the different cases for training data and test data. Apply linear regression to each case.

Case   Size of training data   Size of test data   Find Training MSE   Find Test MSE
1      100                     900                 Tr_MSE[1]           Ts_MSE[1]
2      200                     800                 Tr_MSE[2]           Ts_MSE[2]
3      300                     700                 Tr_MSE[3]           Ts_MSE[3]
4      400                     600                 Tr_MSE[4]           Ts_MSE[4]
5      500                     500                 Tr_MSE[5]           Ts_MSE[5]
6      600                     400                 Tr_MSE[6]           Ts_MSE[6]
7      700                     300                 Tr_MSE[7]           Ts_MSE[7]
8      800                     200                 Tr_MSE[8]           Ts_MSE[8]
9      900                     100                 Tr_MSE[9]           Ts_MSE[9]

Plot the graph for:
1. Y axis: Size of training dataset 100, 200, 300, 400, 500, 600, 700, 800, 900. X axis: Training_MSE

2. Y axis: Size of dataset 100, 200, 300, 400, 500, 600, 700, 800, 900. X axis: Test_MSE

Ideal graph for linear regression:

Fig. 1: Ideal Learning Curve

NOTE: Observe the difference between the observed graph and the ideal graph. If the nature of the training and the characteristics of the data set match well with the capacity (complexity) of the underlying model, good generalization can result. In such a case the learning curves, i.e. the plots of E_out and E_in versus size N, will be ideal as shown in Fig. 1.

Part 3. Include theory for K-fold cross validation.

Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then, when training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation.
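The idea of withholding part of the data, applied to the nine training sizes of Part 2, can be sketched as follows. This is a minimal Python illustration (the assignment itself expects RStudio), not a reference solution: the synthetic data generator, its true line y = 3 + 2x, the noise level, and all function names are assumptions made for the example.

```python
import random


def fit_least_squares(xs, ys):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def mse(xs, ys, b0, b1):
    """Mean squared error of the line on the given points."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, seed=0):
    """Hypothetical 2-D data: y = 3 + 2x plus Gaussian noise (assumed)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


xs, ys = make_data(1000)
sizes = list(range(100, 1000, 100))  # 100, 200, ..., 900
tr_mse, ts_mse = [], []
for size in sizes:
    # The points are generated in random order, so a prefix split
    # already acts as a random training/test split.
    b0, b1 = fit_least_squares(xs[:size], ys[:size])
    tr_mse.append(mse(xs[:size], ys[:size], b0, b1))  # Tr_MSE[case]
    ts_mse.append(mse(xs[size:], ys[size:], b0, b1))  # Ts_MSE[case]
```

The lists `tr_mse` and `ts_mse` hold one value per case and can be passed, together with `sizes`, to any plotting routine to draw the two learning curves.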

K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.

A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.

For a dataset of 1000 records, apply 10-fold cross validation. In each case find the MSE (Mean Square Error). Save them and plot against k.
Y axis: Size of training dataset 100, 200, 300, 400, 500, 600, 700, 800, 900. X axis: MSE

Conclusion: Thus we have studied that linear regression consists of finding the best-fitting straight line through the points. The goodness of fit of the regression equation to the sample data points depends on the amount of SSE. The more tightly the data points are clustered about the regression line, the more accurately the line describes the relationship between the dependent and the independent variables.
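The 10-fold procedure from Part 3 can be sketched as below. This is an illustrative Python outline under stated assumptions, not the required RStudio solution: the synthetic data (true line y = 3 + 2x with Gaussian noise), the interleaved fold assignment, and the helper names `make_folds` and `k_fold_mse` are all hypothetical choices.

```python
import random


def fit_least_squares(xs, ys):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def mse(xs, ys, b0, b1):
    """Mean squared error of the line on the given points."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def k_fold_mse(xs, ys, k=10, seed=0):
    """Return the test MSE of each of the k holdout rounds."""
    folds = make_folds(len(xs), k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        b0, b1 = fit_least_squares([xs[j] for j in train_idx],
                                   [ys[j] for j in train_idx])
        errors.append(mse([xs[j] for j in test_idx],
                          [ys[j] for j in test_idx], b0, b1))
    return errors


# Hypothetical data set of 1000 records.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

fold_errors = k_fold_mse(xs, ys, k=10)
cv_mse = sum(fold_errors) / len(fold_errors)  # average error over the k trials
```

With 1000 records and k = 10, each round trains on 900 records and tests on the remaining 100, and every record is tested exactly once.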

Assignment: The following observations are given by one company to find the relationship between the age of a person and his/her typing speed. Consider the following sample data:

Age of typist (x)   Words per minute (y)

a. Find the regression line that best predicts the value of the dependent variable.
b. Find SST, SSE, SSR and the coefficient of determination ($r^2$).
c. Explain the pictorial relationship between SST, SSE and SSR.
d. Decide whether the regression equation is a good fit or not for the given sample data.
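The quantities asked for in (a), (b) and (d) can be computed along the following lines. This Python sketch assumes that SSR denotes the regression (explained) sum of squares, so that SST = SSR + SSE for a least squares line with an intercept and $r^2$ = SSR / SST. The ages and typing speeds in it are made-up placeholders, not the company's observations, and `regression_sums` is a hypothetical helper name.

```python
def regression_sums(xs, ys):
    """Fit y = b0 + b1 * x by least squares and return the sums of squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_mean) ** 2 for y in ys)             # total variation
    ssr = sum((yh - y_mean) ** 2 for yh in y_hat)        # explained by the line
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat)) # left in the residuals
    return {"b0": b0, "b1": b1, "SST": sst, "SSR": ssr, "SSE": sse,
            "r2": ssr / sst}


# Placeholder data only: ages (x) and words per minute (y).
ages = [20, 30, 40, 50, 60]
wpm = [60, 55, 52, 45, 40]
result = regression_sums(ages, wpm)
```

A value of `result["r2"]` close to 1 means SSE is small relative to SST, which is the criterion for a good fit used in the conclusion above.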
