Basics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis
Kurt-Erik Häggblom

9. Linear regression with latent variables
  9.1 Principal component regression (PCR)
  9.2 Partial least-squares regression (PLS)

[mostly from Varmuza and Filzmoser (2009) and the PLS-toolbox manual by Wise et al. (2006)]

KEH Basics of Multivariate Modelling and Data Analysis 1
9.1 Principal component regression (PCR)

9.1.1 Overview

The case with many correlated regressor variables (independent variables that are collinear) is notoriously difficult in classical multiple linear regression (MLR), such as ordinary least-squares (OLS) regression.
- Usually it is necessary to select a subset of variables to reduce the number of regressors and the collinearity.
- In general, this is a very difficult task.

However, principal component regression (PCR) is a way of avoiding/simplifying the variable selection task. PCR is a combination of
- principal component analysis (PCA)
- multiple linear regression (MLR), usually OLS
where the PC scores are used as regressor variables.
- Since the scores are orthogonal, the multicollinearity problem is avoided.
- Since often only a few PCs are retained, the number of regressor variables is small.
9.1.2 Calculating regression coefficients

Multiple linear regression (MLR)

If the data are mean-centred, the regression model in MLR is

    y = Xb + e

Using OLS, the regression coefficients are determined by minimizing e^T e. This gives a solution that can be expressed as

    b_OLS = (X^T X)^{-1} X^T y

The problem here is the inverse (X^T X)^{-1}:
- for collinear X data, it is very sensitive to small errors in X, which means that b_OLS is also very sensitive to small errors
- the number of observations must be larger than the number of variables.
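The slides show no code, so the following NumPy sketch is my own illustration (toy data, not the course material). It computes the OLS solution above and shows why collinearity is a problem: appending an almost identical copy of a regressor makes the normal-equation matrix X^T X nearly singular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean-centred toy data: 50 observations, 3 regressors.
X = rng.standard_normal((50, 3))
X -= X.mean(axis=0)
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true + 0.01 * rng.standard_normal(50)
y -= y.mean()

# OLS: b = (X^T X)^{-1} X^T y, via a linear solve instead of an explicit inverse.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Collinearity: duplicate the first column up to tiny noise; the
# normal-equation matrix becomes nearly singular (huge condition number).
X_coll = np.column_stack([X, X[:, 0] + 1e-8 * rng.standard_normal(50)])
cond = np.linalg.cond(X_coll.T @ X_coll)
```

With well-conditioned X the solve recovers the true coefficients; with the near-duplicate column, any inverse-based solution becomes extremely sensitive to small perturbations.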
Principal component regression (PCR)

In PCR, a principal component analysis (PCA) is first done:

    X = T P^T + E,   T = X P

The latter expression is inserted into the linear regression model:

    y = Xb + e = (T P^T + E) b + e = T g + e_PCR

where g = P^T b and e_PCR = E b + e. Minimization of e_PCR^T e_PCR by OLS gives

    g = (T^T T)^{-1} T^T y

Here the inverse is well-conditioned, and it always exists, because
- the columns of T are orthogonal
- the number of principal components is never larger than the number of observations.

In terms of b, the solution for the linear regression model can be expressed as

    b_PCR = P g = P (T^T T)^{-1} T^T y
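The same computation can be sketched in a few lines of NumPy (again my own illustration with assumed toy data, not the course's MATLAB code): PCA via the SVD, regression on the orthogonal scores, and back-transformation b_PCR = P g.

```python
import numpy as np

rng = np.random.default_rng(1)

# Collinear X: 4 measured variables driven by 2 underlying factors.
factors = rng.standard_normal((60, 2))
X = factors @ rng.standard_normal((2, 4)) + 0.01 * rng.standard_normal((60, 4))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.5, -1.0, 2.0])
y -= y.mean()

# PCA via the SVD: loadings P = first k right singular vectors,
# scores T = X P (orthogonal columns).
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k].T
T = X @ P

# Regression on the scores: g = (T^T T)^{-1} T^T y, then b_PCR = P g.
g = np.linalg.solve(T.T @ T, T.T @ y)
b_pcr = P @ g
y_hat = X @ b_pcr
```

Because T^T T is diagonal and well-conditioned, the solve is stable even though X itself is nearly rank-deficient.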
9.1.3 Selecting principal components

The problem

Overfitting of a regression model is strongly related to the collinearity problem. For PCR this means that
- PCR is less susceptible to overfitting than MLR, because it directly addresses the collinearity problem
- a PCR model could nevertheless become overfitted through the retention of too many principal components.

Therefore, an important part of PCR is the determination of the optimal number of PCs to retain in the model.

Another problem is that
- PCA determines and ranks PCs to explain as much as possible of the variance of the regressor variables x_j
- in PCR, we want PCs that give the best possible prediction of the dependent variable y; such PCs have a high correlation with y.

Therefore, some variable selection technique has to be applied also in PCR; it is not necessarily best to choose the highest-ranked PCs according to the PCA.
Cross-validation

As for PCA and MLR, cross-validation is an important tool for variable selection. This means that the data have to be split observation-wise into
- a modelling (or training) data set
- a test data set.

The prediction residual error on the test set observations is then determined as a function of the (number of) PCs retained in the PCR model.

This procedure is usually repeated several times using different selections of observation subsets for training and test sets, such that each sample in the original data set is part of a test set at least once. A good rule of thumb is to use the square root of the number of observations for the number of repetitions of the cross-validation procedure, up to a maximum of ten repetitions.

A plot of the total composite prediction error over all test sets as a function of the (number of) PCs retained in the model is used to determine the optimal (number of) PCs.
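A minimal sketch of such a cross-validation (illustrative only; the contiguous-block fold scheme and the toy data are my assumptions): the RMSECV is computed for each candidate number of PCs, and the data are mean-centred using training-set statistics only, so no test information leaks into the model.

```python
import numpy as np

def pcr_coef(X, y, k):
    """PCR regression vector using the first k principal components."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k].T
    T = X @ P
    return P @ np.linalg.solve(T.T @ T, T.T @ y)

def rmsecv(X, y, k, n_splits=5):
    """Root mean squared error of cross-validation, contiguous blocks."""
    n = len(y)
    sq_err = []
    for test in np.array_split(np.arange(n), n_splits):
        train = np.setdiff1d(np.arange(n), test)
        # centre with training-set means only, then apply to the test block
        xm, ym = X[train].mean(axis=0), y[train].mean()
        b = pcr_coef(X[train] - xm, y[train] - ym, k)
        pred = (X[test] - xm) @ b + ym
        sq_err.extend((y[test] - pred) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

# Toy data: y depends on a low-variance factor, so the first PC alone
# (dominated by the high-variance factor) predicts poorly.
rng = np.random.default_rng(2)
factors = rng.standard_normal((80, 2)) * np.array([3.0, 1.0])
X = factors @ rng.standard_normal((2, 6)) + 0.05 * rng.standard_normal((80, 6))
y = factors[:, 1] + 0.05 * rng.standard_normal(80)

errors = {k: rmsecv(X, y, k) for k in range(1, 7)}
```

The resulting dictionary plays the role of the RMSECV curve: plotting errors against k reveals where adding further PCs stops paying off.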
9.1.4 PCR application using the PLS-toolbox

We shall apply PCR to a Slurry-Fed Ceramic Melter (SFCM) system (Wise et al., 1991), where nuclear waste from fuel reprocessing is combined with glass-forming materials.

Data from the process, consisting of temperatures at 20 locations within the melter and the molten glass level, are shown in the figure. It is apparent that there is a great deal of correlation in the data; many of the variables appear to follow a sawtooth pattern.

We shall develop a PCR model that will enable estimation of the level of molten glass using temperature measurements.
Starting and loading data

The SFCM temperature and molten glass level data are stored in the file plsdata.mat. The file contains 300 calibration or training samples (xblock1 and yblock1) and 200 test samples (xblock2 and yblock2). We will load the data into MATLAB and delete a few samples that are known to be outliers.
Preprocessing of data

Now that the data are loaded, we need to decide how to preprocess the data for modelling. Because the predictor (temperature) variables with the greatest variance in this data set also appear to be correlated with the molten glass level, we choose to mean-centre (rather than autoscale) the data. The scaling of Y is irrelevant if there is only one Y variable, as in this case.
A preliminary model
Cross-validation

We must now decide how to cross-validate the model. We will choose to split the data into ten contiguous block-wise subsets, and to calculate all twenty PCs.
A new model
Choice of principal components

Now that the PCR model and the cross-validation results have been computed, one can view the cross-validation results in various ways. A common plot used to analyse cross-validation results is an RMSECV plot (root mean squared error of cross-validation).
Note how the RMSECV has several local minima and a global minimum at eleven PCs. Two rules of thumb:
- do not include a PC unless it improves the RMSECV by at least 2%
- use a model with the lowest possible complexity among close alternatives.

Here the rules suggest that a model with six PCs would be the best choice.
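One way to encode these two rules of thumb in code (my reading of them, not the toolbox's exact logic; the RMSECV values below are hypothetical): move to a larger model only when its RMSECV beats the currently chosen model's by at least 2%.

```python
def choose_n_pcs(rmsecv, min_improvement=0.02):
    """Smallest model honouring the rules of thumb.

    rmsecv[i] is the RMSECV for i + 1 components. A candidate with k
    components replaces the current choice only if its RMSECV is at least
    `min_improvement` (fractionally) below the current choice's RMSECV.
    """
    chosen = 1
    for k in range(2, len(rmsecv) + 1):
        if rmsecv[k - 1] < (1.0 - min_improvement) * rmsecv[chosen - 1]:
            chosen = k
    return chosen

# Toy curve: clear gains up to 3 PCs, then a long shallow plateau where
# each further PC improves the error by well under 2%.
errs = [1.00, 0.60, 0.50, 0.495, 0.492, 0.491]
n_pcs = choose_n_pcs(errs)
```

For this curve the shallow improvements after the third component never clear the 2% threshold, so the low-complexity model with 3 PCs is chosen even though the global minimum lies further right.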
Model suggested by CV

Choose the desired number of PCs by clicking on the corresponding line.
Returning to the RMSECV plot for the final PCR model, we note that some PCs in the final model (specifically, PCs 2, 4 and 5) result in an increase in the model's estimated prediction error. This suggests that these specific PCs, although they help explain variation in the X variables (temperatures), are not useful for prediction of the molten glass level.
Saving the model
The model can be saved
- to the MATLAB workspace: the name can be changed; the model will be lost after the session unless saved with the MATLAB save command
- to disk: the model is exported to a file.
9.2 Partial least-squares regression (PLS)

9.2.1 Overview

PLS stands for Projection to Latent Structures by means of Partial Least Squares and is a method to relate a matrix X to a vector y or to a matrix Y.

Essentially, the model structures of PLS and PCR are the same:
- The x-data are first transformed into a set of (a few) latent variables (components).
- The latent variables are used for regression (by OLS) with one or several dependent variables.

The regression criterion (most often applied) is maximum covariance between the scores of dependent variables and regressor variables (i.e. latent variables). Maximum covariance combines high variance of the X scores with high correlation with the Y scores.
Relationship with MLR and PCR

PLS is related to both MLR (OLS) and PCR/PCA. To see this, consider the linear regression model

    y = Xb + e

The best prediction/estimation of y using X is ŷ = Xb.

MLR by OLS maximizes the correlation between y and ŷ, as seen from

    b_OLS = arg max_b r_(y,ŷ) = arg max_b (y^T ŷ) / [(y^T y)(ŷ^T ŷ)]^(1/2)
          = arg max_b (b^T X^T y) / [(y^T y)(b^T X^T X b)]^(1/2)

whose maximizing direction is b ∝ (X^T X)^{-1} X^T y.

PCR also maximizes the correlation between y and ŷ, but with the constraint b = Pg, dim(g) = k ≤ p = dim(b), where P is the PCA loading matrix that maximizes the variance of the columns in T = XP. This is seen from

    g_PCR = arg max_g (g^T T^T y) / [(y^T y)(g^T T^T T g)]^(1/2)

whose maximizing direction is g ∝ (T^T T)^{-1} T^T y.
As a summary of this, we can say that
- MLR (OLS) gives the prediction ŷ = X b_OLS = X I b_OLS, i.e. the identity matrix I can be considered as a loading matrix
- PCR/PCA gives the prediction ŷ = X P g_PCR, where P is the loading matrix that maximizes the variance of the columns in T = XP.

The idea in PLS is to determine a loading matrix W for the prediction ŷ = X W h_PLS, where both W and h_PLS are determined in such a way that they help maximize the correlation between y and ŷ. This solution is
- different from MLR (OLS) because of the constraint b = W h, dim(h) = k ≤ p = dim(b)
- different from PCR/PCA because the loading matrix W is determined to maximize the correlation between y and ŷ.

We have also said that PLS maximizes the covariance between the dependent variable(s) and the scores. When the scores are constrained to have a given constant variance (e.g. = 1), this is equivalent to maximizing the correlation.
9.2.2 One dependent variable

The main purpose of PLS is to determine a linear model for prediction (estimation) of one or several dependent variables from a set of predictor (independent) variables. The modelling procedure is here outlined for one dependent variable.

The first PLS-component is calculated as the latent variable which has the maximum covariance between the scores and the modelled property y:

    w_1 = arg max_{w, ||Xw|| = 1} y^T X w,    t_1 = X w_1

Next, the information (variance) of this component is removed from the X data. This process is called peeling or deflation. It is a projection of the X space onto a (hyper-)plane that is orthogonal to the direction of the found component. The resulting matrix after deflation is

    X_1 = X - t_1 t_1^T X = (I - t_1 t_1^T) X

which, as required, satisfies t_1^T X_1 = 0.
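The extraction-plus-deflation loop can be sketched in NumPy. This is my own illustration using the common NIPALS-style normalization (w proportional to X^T y with unit-norm w and unit-length scores, a standard variant of the criterion above): the deflation X_l = (I - t t^T) X_{l-1} makes all score vectors mutually orthogonal.

```python
import numpy as np

def pls1_components(X, y, k):
    """First k PLS1 weight/score pairs via NIPALS-style deflation."""
    Xd = X.copy()
    W, T = [], []
    for _ in range(k):
        w = Xd.T @ y                   # direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xd @ w
        t /= np.linalg.norm(t)         # unit-length score
        Xd = Xd - np.outer(t, t @ Xd)  # deflation: X_l = (I - t t^T) X_{l-1}
        W.append(w)
        T.append(t)
    return np.array(W).T, np.array(T).T

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(40)
y -= y.mean()

W, T = pls1_components(X, y, 3)
```

After each deflation the new score is built from a matrix orthogonal to all previous scores, which is why the columns of T come out uncorrelated while the weight vectors need not be orthogonal to the loadings.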
The next PLS-component is derived from the residual matrix X_1, again with maximum covariance between the scores and y. This procedure is continued to produce sufficiently many PLS-components. The final choice of the number of components to retain in the model can be made as for PCR (mainly using cross-validation).

Some comments
- In the standard versions of PLS, the scores of the PLS-components are uncorrelated; the loading vectors, however, are in general not orthogonal.
- Because PLS-components are developed as latent variables possessing a high correlation with y, the optimum number of PLS-components is usually smaller than the optimum number of PCA-components in PCR.
- However, PLS models may be less stable than PCR models because less X variance is contained.
- The more components are used, the more similar PCR and PLS models become.
A complicating aspect of most PLS algorithms is the stepwise calculation of components. After a component is computed, the residual matrices for X (and Y) are determined. The next PLS-component is calculated from the residual matrices, and therefore its parameters (scores, loadings, weights) do not relate to the original matrices. However, equations exist that relate the PLS parameters to the original data and that also provide the regression coefficients b_PLS of the final model for the original data.
9.2.3 Many dependent variables

If there is more than one dependent variable, the dependent data are stored in a matrix Y. The basic regression model is then

    Y = X B + E

where B is a matrix of regression coefficients and E is a matrix of residuals. Here columns b_j and e_j in B and E correspond to column y_j in Y.

If the dependent variables are considered to be mutually independent (uncorrelated), PLS (or any other regression method) for one dependent variable can be applied to each variable y_j, one at a time. MLR using OLS then gives the solution B = (X^T X)^{-1} X^T Y.

If the dependent variables are correlated, it is best to deal with them jointly. PLS, which is called PLS2 when there is more than one dependent variable, is then very suitable.
Main idea

In PLS2, both X and Y are decomposed into scores and loadings:
- p_l and t_l, l = 1, ..., k, are loadings and scores of latent variables for X
- q_l and u_l, l = 1, ..., k, are loadings and scores of latent variables for Y
in such a way that the covariance between the X and Y scores is maximized.

The regression is performed between the X and Y scores. The regression coefficients can be transformed to allow direct estimation of Y from X.
Mathematical development

There are many variants of PLS algorithms. It is e.g. possible to specify that
- the loading vectors are orthogonal (the Eigenvector algorithm)
- the scores are uncorrelated (i.e. orthogonal) and the loadings non-orthogonal (algorithms such as Kernel, NIPALS, SIMPLS, O-PLS).

Here a PLS2 method producing uncorrelated scores is outlined. X and Y are modelled by linear latent variables as

    X = T P^T + E_X   and   Y = U Q^T + E_Y

Instead of the loadings P and Q, new loading vectors w_l and c_l that satisfy

    t_l = X_{l-1} w_l   and   u_l = Y c_l,   l = 1, ..., k

are introduced. Here X_0 = X, and X_l, l >= 1, is an updated (deflated) version of X.

For the first PLS-component, the vectors w_1 and c_1 are determined by solving

    [w_1, c_1] = arg max_{w,c} (Xw)^T (Yc),   subject to ||Xw|| = 1, ||Yc|| = 1
The vectors w_1 and c_1 can be found by solving eigenvalue problems:
- w_1 is the eigenvector corresponding to the largest eigenvalue of X^T Y Y^T X
- c_1 is the eigenvector corresponding to the largest eigenvalue of Y^T X X^T Y.

The first component scores are given by t_1 = X w_1 and u_1 = Y c_1.

The first component loadings for X are given by p_1 = X^T t_1 (= X^T X w_1). The deflated matrix is given by

    X_1 = X - t_1 t_1^T X = (I - t_1 t_1^T) X

The first component loadings for Y are not needed for the regression. The deflated matrix Y_1 could be calculated similarly, but it is not needed.

The second PLS-components are calculated similarly from X_1 and Y. The procedure is repeated to obtain sufficiently many components. When all components are known, the regression coefficients are given by

    B = W (P^T W)^{-1} C^T,   W = [w_1 ... w_k],   C = [c_1 ... c_k]
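The whole procedure fits in a short NumPy sketch (an illustration under assumed toy data, not a reference implementation). The vectors w_l and c_l are taken as the dominant left/right singular vectors of X_{l-1}^T Y, which is equivalent to the two eigenvalue problems above; for the final coefficients, the Y-side vectors used in B = W (P^T W)^{-1} C^T are taken as the regression of Y on the unit-length scores, so they carry the scaling that makes B directly usable for prediction.

```python
import numpy as np

def pls2(X, Y, k):
    """PLS2 with uncorrelated scores; only X is deflated."""
    Xd = X.copy()
    W, P, C = [], [], []
    for _ in range(k):
        # Dominant singular pair of X_{l-1}^T Y <=> top eigenvectors of
        # X^T Y Y^T X (for w) and Y^T X X^T Y (for c).
        u_l, _, _ = np.linalg.svd(Xd.T @ Y)
        w = u_l[:, 0]
        t = Xd @ w
        t /= np.linalg.norm(t)         # unit-length score
        p = Xd.T @ t                   # X loadings, p_l = X_{l-1}^T t_l
        c = Y.T @ t                    # Y-side coefficients (regression of
                                       # Y on the orthonormal score t_l)
        Xd = Xd - np.outer(t, t @ Xd)  # deflation: X_l = (I - t t^T) X_{l-1}
        W.append(w); P.append(p); C.append(c)
    W, P, C = np.array(W).T, np.array(P).T, np.array(C).T
    return W @ np.linalg.solve(P.T @ W, C.T)   # B = W (P^T W)^{-1} C^T

rng = np.random.default_rng(4)
factors = rng.standard_normal((50, 3))
X = factors @ rng.standard_normal((3, 5))    # exactly rank-3 X
X -= X.mean(axis=0)
Y = X @ rng.standard_normal((5, 2))          # Y exactly linear in X
Y -= Y.mean(axis=0)

B = pls2(X, Y, k=3)
```

With Y exactly linear in a rank-3 X, three components capture all of the predictive structure, so X @ B reproduces Y up to numerical precision.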
9.2.4 Geometric illustration [based on UMETRICS material]

Data consist of n observations and
- a set of p independent variables (inputs) in an n x p matrix X
- a set of q dependent variables (outputs) in an n x q matrix Y.

Each variable has a coordinate axis:
- p coordinates for X data
- q coordinates for Y data
- illustration here for p = q = 3.

Each observation is represented by
- one point in the X space
- one point in the Y space.

The mean value of each variable in both data sets is here denoted by a red dot in the two coordinate systems (it is not an observation). Data are here mean-centred.
The first PLS-component is
- a line in the X space
- a line in the Y space
calculated to
- approximate the points well in X and Y
- yield a good correlation between the projections t_1 and u_1 (the scores).

The vector directions are w_1 and c_1; the coordinates are t_1 and u_1.

The t_1-u_1 plot shows how well the first PLS-component models the data:
- points on the line are modelled exactly
- points not on the line may be modelled by other PLS-components.
The second PLS-component is also represented by lines in the X and Y spaces, calculated to
- approximate the points well
- provide a good correlation
in such a way that
- the X-lines are orthogonal
- the Y-lines need not be orthogonal.

These lines, with directions w_2 and c_2 and coordinates t_2 and u_2, improve the approximation and correlation as much as possible.

The second projection coordinates
- usually correlate less well than the first projection coordinates
- may correlate better than the first projection coordinates if there is a strong structure in X that is not related to (or present in) Y.
The first two PLS-components form planes in the X and Y spaces.

The variability around the X-plane can be used to calculate a tolerance interval within which new observations will (should) be located. An observation outside this interval implies that the model may not be valid for that observation.

Plotting successive pairs of latent variables against each other gives a good picture of the correlation structure. The plot in the SE corner indicates that there is almost no information left in the k:th pair of latent variables (t_k, u_k).
9.2.5 Evaluation and diagnostics

The PLS result and the data can be analysed and evaluated in many ways.

Detection of outliers

The techniques for outlier detection used in PCA can also be used in PLS. Since PLS uses a Y block in addition to the X block, one can also look for outliers in the prediction of Y.

The figure illustrates a way of plotting the prediction error against a score-related parameter ("leverage"):
- on the y-axis, the prediction error is autoscaled; the error unit is standard deviations
- on the x-axis, the leverage defines the influence of a given observation on the model; it is proportional to Hotelling's T^2.

Four of the marked observations are very clear outliers. These outliers should be removed from the model-building data.
Cross-validation

The standard techniques for selecting the number of latent variables based on cross-validation can be used. In addition, one can consider how much of the Y variation each LV describes, expressed e.g. in terms of the score sums of squares u_l^T u_l.

Relationships between observations

Relationships between observations can be studied by various score plots, e.g.
- u_1 vs. t_1, u_2 vs. t_2: a linear relationship with high correlation is desired
- t_2 vs. t_1, u_2 vs. u_1: no correlation is desired.
Variable interpretations

There are many ways to analyse the contribution and importance of variables in the PLS model, e.g.
- loadings on the LVs (the p's)
- Q residuals
- Hotelling's T^2 statistic
- regression coefficients
- VIP scores ("Variable Importance in Projection").

In the PLS-toolbox, these plots are obtained via Loadings plots.
9.2.6 PLS application using the PLS-toolbox

We shall apply PLS to the same Slurry-Fed Ceramic Melter (SFCM) system that was used in the PCR application (section 9.1.4). Except for the analysis startup, the model-building steps for PLS are exactly the same as for PCR, including cross-validation for the choice of the number of PLS-components.
Cross-validation results

Model-building combined with cross-validation produces the result shown. The variance captured by the model for each number of latent variables (LVs) is shown for
- the X block (% of t^T t)
- the Y block (% of u^T u).

The suggested model is (apparently) based on the X block variance. Note the small Y block variance for LV 4: LV 4 does not contain information about Y. This suggests that 3 LVs would be sufficient for predictive purposes.
Another (better) way to select the number of LVs is to consider the prediction error based on cross-validation, which can be quantified by an RMSECV plot:
- the figure suggests 3 or 4 LVs
- based on the rule of thumb (at least 2% improvement), 3 LVs is sufficient.
Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,
More informationCSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo
CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..
More informationDimension Reduction CS534
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More informationResources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.
Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationFEA and Multivariate Statistical Data Analysis of Polypropylene Tube Forming Process
FEA and Multivariate Statistical Data Analysis of Polypropylene ube Forming Process Galini Gavrilidou, Mukesh Jain Department of Mechanical Engineering, McMaster University, Hamilton, On, LS L7, Canada
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use
More information7. Collinearity and Model Selection
Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among
More informationPredict Outcomes and Reveal Relationships in Categorical Data
PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,
More information2017 ITRON EFG Meeting. Abdul Razack. Specialist, Load Forecasting NV Energy
2017 ITRON EFG Meeting Abdul Razack Specialist, Load Forecasting NV Energy Topics 1. Concepts 2. Model (Variable) Selection Methods 3. Cross- Validation 4. Cross-Validation: Time Series 5. Example 1 6.
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming
More informationUsing the DATAMINE Program
6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection
More informationUNIT 1: NUMBER LINES, INTERVALS, AND SETS
ALGEBRA II CURRICULUM OUTLINE 2011-2012 OVERVIEW: 1. Numbers, Lines, Intervals and Sets 2. Algebraic Manipulation: Rational Expressions and Exponents 3. Radicals and Radical Equations 4. Function Basics
More informationStudy Guide. Module 1. Key Terms
Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation
More informationCS 195-5: Machine Learning Problem Set 5
CS 195-5: Machine Learning Problem Set 5 Douglas Lanman dlanman@brown.edu 26 November 26 1 Clustering and Vector Quantization Problem 1 Part 1: In this problem we will apply Vector Quantization (VQ) to
More informationRSM Split-Plot Designs & Diagnostics Solve Real-World Problems
RSM Split-Plot Designs & Diagnostics Solve Real-World Problems Shari Kraber Pat Whitcomb Martin Bezener Stat-Ease, Inc. Stat-Ease, Inc. Stat-Ease, Inc. 221 E. Hennepin Ave. 221 E. Hennepin Ave. 221 E.
More informationGeneral Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
More informationMultiresponse Sparse Regression with Application to Multidimensional Scaling
Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,
More informationPrincipal Component Analysis
Copyright 2004, Casa Software Ltd. All Rights Reserved. 1 of 16 Principal Component Analysis Introduction XPS is a technique that provides chemical information about a sample that sets it apart from other
More information( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components
Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues
More informationFMRI data: Independent Component Analysis (GIFT) & Connectivity Analysis (FNC)
FMRI data: Independent Component Analysis (GIFT) & Connectivity Analysis (FNC) Software: Matlab Toolbox: GIFT & FNC Yingying Wang, Ph.D. in Biomedical Engineering 10 16 th, 2014 PI: Dr. Nadine Gaab Outline
More informationSYS 6021 Linear Statistical Models
SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are
More informationCSE 481C Imitation Learning in Humanoid Robots Motion capture, inverse kinematics, and dimensionality reduction
1 CSE 481C Imitation Learning in Humanoid Robots Motion capture, inverse kinematics, and dimensionality reduction Robotic Imitation of Human Actions 2 The inverse kinematics problem Joint angles Human-robot
More information2016 Stat-Ease, Inc. & CAMO Software
Multivariate Analysis and Design of Experiments in practice using The Unscrambler X Frank Westad CAMO Software fw@camo.com Pat Whitcomb Stat-Ease pat@statease.com Agenda Goal: Part 1: Part 2: Show how
More informationMODERN FACTOR ANALYSIS
MODERN FACTOR ANALYSIS Harry H. Harman «ö THE pigj UNIVERSITY OF CHICAGO PRESS Contents LIST OF ILLUSTRATIONS GUIDE TO NOTATION xv xvi Parti Foundations of Factor Analysis 1. INTRODUCTION 3 1.1. Brief
More informationCHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY
23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationLecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017
Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last
More information3D Geometry and Camera Calibration
3D Geometry and Camera Calibration 3D Coordinate Systems Right-handed vs. left-handed x x y z z y 2D Coordinate Systems 3D Geometry Basics y axis up vs. y axis down Origin at center vs. corner Will often
More informationExam Review: Ch. 1-3 Answer Section
Exam Review: Ch. 1-3 Answer Section MDM 4U0 MULTIPLE CHOICE 1. ANS: A Section 1.6 2. ANS: A Section 1.6 3. ANS: A Section 1.7 4. ANS: A Section 1.7 5. ANS: C Section 2.3 6. ANS: B Section 2.3 7. ANS: D
More information2014 Stat-Ease, Inc. All Rights Reserved.
What s New in Design-Expert version 9 Factorial split plots (Two-Level, Multilevel, Optimal) Definitive Screening and Single Factor designs Journal Feature Design layout Graph Columns Design Evaluation
More informationWorkload Characterization Techniques
Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/
More informationExploring association among quality indicators of the In-service Education Information Service
Exploring association among quality indicators of the In-service Education Information Service LUNG-HSING KUO, HUNG-JEN YANG, TSUNG-JUNG TSAI, FONG-CHING SU National Kaohsiung Normal University No.116,
More informationComputer Experiments: Space Filling Design and Gaussian Process Modeling
Computer Experiments: Space Filling Design and Gaussian Process Modeling Best Practice Authored by: Cory Natoli Sarah Burke, Ph.D. 30 March 2018 The goal of the STAT COE is to assist in developing rigorous,
More informationDesign of Fault Diagnosis System of FPSO Production Process Based on MSPCA
2009 Fifth International Conference on Information Assurance and Security Design of Fault Diagnosis System of FPSO Production Process Based on MSPCA GAO Qiang, HAN Miao, HU Shu-liang, DONG Hai-jie ianjin
More information3 Feature Selection & Feature Extraction
3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy
More informationRobust Regression. Robust Data Mining Techniques By Boonyakorn Jantaranuson
Robust Regression Robust Data Mining Techniques By Boonyakorn Jantaranuson Outline Introduction OLS and important terminology Least Median of Squares (LMedS) M-estimator Penalized least squares What is
More informationDI TRANSFORM. The regressive analyses. identify relationships
July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,
More informationChapter 9 Robust Regression Examples
Chapter 9 Robust Regression Examples Chapter Table of Contents OVERVIEW...177 FlowChartforLMS,LTS,andMVE...179 EXAMPLES USING LMS AND LTS REGRESSION...180 Example 9.1 LMS and LTS with Substantial Leverage
More informationA NEW VARIABLES SELECTION AND DIMENSIONALITY REDUCTION TECHNIQUE COUPLED WITH SIMCA METHOD FOR THE CLASSIFICATION OF TEXT DOCUMENTS
A NEW VARIABLES SELECTION AND DIMENSIONALITY REDUCTION TECHNIQUE COUPLED WITH SIMCA METHOD FOR THE CLASSIFICATION OF TEXT DOCUMENTS Ahmed Abdelfattah Saleh University of Brasilia, Brasil ahmdsalh@yahoo.com
More informationLouis Fourrier Fabien Gaie Thomas Rolf
CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted
More informationSection 2.1: Intro to Simple Linear Regression & Least Squares
Section 2.1: Intro to Simple Linear Regression & Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 Regression:
More informationData preprocessing Functional Programming and Intelligent Algorithms
Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationInformation Driven Healthcare:
Information Driven Healthcare: Machine Learning course Lecture: Feature selection I --- Concepts Centre for Doctoral Training in Healthcare Innovation Dr. Athanasios Tsanas ( Thanasis ), Wellcome Trust
More informationCSC 411: Lecture 14: Principal Components Analysis & Autoencoders
CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Raquel Urtasun & Rich Zemel University of Toronto Nov 4, 2015 Urtasun & Zemel (UofT) CSC 411: 14-PCA & Autoencoders Nov 4, 2015 1 / 18
More informationChapter 1. Using the Cluster Analysis. Background Information
Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationFinal Report: Kaggle Soil Property Prediction Challenge
Final Report: Kaggle Soil Property Prediction Challenge Saurabh Verma (verma076@umn.edu, (612)598-1893) 1 Project Goal Low cost and rapid analysis of soil samples using infrared spectroscopy provide new
More informationFEATURE or input variable selection plays a very important
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 1101 Feature Selection Using a Piecewise Linear Network Jiang Li, Member, IEEE, Michael T. Manry, Pramod L. Narasimha, Student Member,
More informationSAS/STAT 15.1 User s Guide The HPPLS Procedure
SAS/STAT 15.1 User s Guide The HPPLS Procedure This document is an individual chapter from SAS/STAT 15.1 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute Inc.
More informationSPSS INSTRUCTION CHAPTER 9
SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can
More informationA Course in Machine Learning
A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling
More informationCSC 411: Lecture 14: Principal Components Analysis & Autoencoders
CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 14-PCA & Autoencoders 1 / 18
More informationCOMP61011 Foundations of Machine Learning. Feature Selection
OMP61011 Foundations of Machine Learning Feature Selection Pattern Recognition: The Early Days Only 200 papers in the world! I wish! Pattern Recognition: The Early Days Using eight very simple measurements
More informationLISA: Explore JMP Capabilities in Design of Experiments. Liaosa Xu June 21, 2012
LISA: Explore JMP Capabilities in Design of Experiments Liaosa Xu June 21, 2012 Course Outline Why We Need Custom Design The General Approach JMP Examples Potential Collinearity Issues Prior Design Evaluations
More informationData Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures
More informationQQ normality plots Harvey Motulsky, GraphPad Software Inc. July 2013
QQ normality plots Harvey Motulsky, GraphPad Software Inc. July 213 Introduction Many statistical tests assume that data (or residuals) are sampled from a Gaussian distribution. Normality tests are often
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationFeature Selection in Knowledge Discovery
Feature Selection in Knowledge Discovery Susana Vieira Technical University of Lisbon, Instituto Superior Técnico Department of Mechanical Engineering, Center of Intelligent Systems, IDMEC-LAETA Av. Rovisco
More informationBootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping
Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationComputer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging
Prof. Daniel Cremers 8. Boosting and Bagging Repetition: Regression We start with a set of basis functions (x) =( 0 (x), 1(x),..., M 1(x)) x 2 í d The goal is to fit a model into the data y(x, w) =w T
More informationLecture 7: Linear Regression (continued)
Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions
More information