Basics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis
Kurt-Erik Häggblom

9. Linear regression with latent variables
  9.1 Principal component regression (PCR)
  9.2 Partial least-squares regression (PLS)

[mostly from Varmuza and Filzmoser (2009) and the PLS-toolbox manual by Wise et al. (2006)]

KEH Basics of Multivariate Modelling and Data Analysis 1
9.1 Principal component regression (PCR)

9.1.1 Overview

The case with many correlated regressor variables (independent variables that are collinear) is notoriously difficult in classical multiple linear regression (MLR), such as ordinary least-squares (OLS) regression.
- Usually it is necessary to select a subset of variables to reduce the number of regressors and the collinearity.
- In general, this is a very difficult task.

However, principal component regression (PCR) is a way of avoiding/simplifying the variable selection task. PCR is a combination of
- principal component analysis (PCA)
- multiple linear regression (MLR), usually OLS
where the PC scores are used as regressor variables.
- Since the scores are orthogonal, the multicollinearity problem is avoided.
- Since often only a few PCs are retained, the number of regressor variables is small.
9.1.2 Calculating regression coefficients

Multiple linear regression (MLR)

If the data are mean-centred, the regression model in MLR is

    y = Xb + e

Using OLS, the regression coefficients are determined by minimizing e^T e. This gives a solution that can be expressed as

    b_OLS = (X^T X)^{-1} X^T y

The problem here is the inverse (X^T X)^{-1}:
- for collinear X data, it is very sensitive to small errors in X, which means that b_OLS is also very sensitive to small errors
- the number of observations must be larger than the number of variables.
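The slides show no code, so the following NumPy sketch is my own illustration (toy data, not the course material). It computes the OLS solution above and shows why collinearity is a problem: appending an almost identical copy of a regressor makes the normal-equation matrix X^T X nearly singular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean-centred toy data: 50 observations, 3 regressors.
X = rng.standard_normal((50, 3))
X -= X.mean(axis=0)
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true + 0.01 * rng.standard_normal(50)
y -= y.mean()

# OLS: b = (X^T X)^{-1} X^T y, via a linear solve instead of an explicit inverse.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Collinearity: duplicate the first column up to tiny noise; the
# normal-equation matrix becomes nearly singular (huge condition number).
X_coll = np.column_stack([X, X[:, 0] + 1e-8 * rng.standard_normal(50)])
cond = np.linalg.cond(X_coll.T @ X_coll)
```

With well-conditioned X the solve recovers the true coefficients; with the near-duplicate column, any inverse-based solution becomes extremely sensitive to small perturbations.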
Principal component regression (PCR)

In PCR, a principal component analysis (PCA) is first done:

    X = T P^T + E,   T = X P

The latter expression is inserted into the linear regression model:

    y = Xb + e = (T P^T + E) b + e = T g + e_PCR

where g = P^T b and e_PCR = E b + e. Minimization of e_PCR^T e_PCR by OLS gives

    g = (T^T T)^{-1} T^T y

Here the inverse is well-conditioned, and it always exists, because
- the columns of T are orthogonal
- the number of principal components is never larger than the number of observations.

In terms of b, the solution for the linear regression model can be expressed as

    b_PCR = P g = P (T^T T)^{-1} T^T y
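The same computation can be sketched in a few lines of NumPy (again my own illustration with assumed toy data, not the course's MATLAB code): PCA via the SVD, regression on the orthogonal scores, and back-transformation b_PCR = P g.

```python
import numpy as np

rng = np.random.default_rng(1)

# Collinear X: 4 measured variables driven by 2 underlying factors.
factors = rng.standard_normal((60, 2))
X = factors @ rng.standard_normal((2, 4)) + 0.01 * rng.standard_normal((60, 4))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.5, -1.0, 2.0])
y -= y.mean()

# PCA via the SVD: loadings P = first k right singular vectors,
# scores T = X P (orthogonal columns).
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k].T
T = X @ P

# Regression on the scores: g = (T^T T)^{-1} T^T y, then b_PCR = P g.
g = np.linalg.solve(T.T @ T, T.T @ y)
b_pcr = P @ g
y_hat = X @ b_pcr
```

Because T^T T is diagonal and well-conditioned, the solve is stable even though X itself is nearly rank-deficient.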
9.1.3 Selecting principal components

The problem

Overfitting of a regression model is strongly related to the collinearity problem. For PCR this means that
- PCR is less susceptible to overfitting than MLR, because it directly addresses the collinearity problem
- a PCR model could nevertheless become overfitted through the retention of too many principal components.

Therefore, an important part of PCR is the determination of the optimal number of PCs to retain in the model.

Another problem is that
- PCA determines and ranks PCs to explain as much as possible of the variance of the regressor variables x_j
- in PCR, we want PCs that give the best possible prediction of the dependent variable y; such PCs have a high correlation with y.

Therefore, some variable selection technique has to be applied also in PCR; it is not necessarily best to choose the highest-ranked PCs according to the PCA.
Cross-validation

As for PCA and MLR, cross-validation is an important tool for variable selection. This means that the data have to be split observation-wise into
- a modelling (or training) data set
- a test data set.

The prediction residual error on the test set observations is then determined as a function of the (number of) PCs retained in the PCR model.

This procedure is usually repeated several times using different selections of observation subsets for training and test sets, such that each sample in the original data set is part of a test set at least once. A good rule of thumb is to use the square root of the number of observations for the number of repetitions of the cross-validation procedure, up to a maximum of ten repetitions.

A plot of the total composite prediction error over all test sets as a function of the (number of) PCs retained in the model is used to determine the optimal (number of) PCs.
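A minimal sketch of such a cross-validation (illustrative only; the contiguous-block fold scheme and the toy data are my assumptions): the RMSECV is computed for each candidate number of PCs, and the data are mean-centred using training-set statistics only, so no test information leaks into the model.

```python
import numpy as np

def pcr_coef(X, y, k):
    """PCR regression vector using the first k principal components."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k].T
    T = X @ P
    return P @ np.linalg.solve(T.T @ T, T.T @ y)

def rmsecv(X, y, k, n_splits=5):
    """Root mean squared error of cross-validation, contiguous blocks."""
    n = len(y)
    sq_err = []
    for test in np.array_split(np.arange(n), n_splits):
        train = np.setdiff1d(np.arange(n), test)
        # centre with training-set means only, then apply to the test block
        xm, ym = X[train].mean(axis=0), y[train].mean()
        b = pcr_coef(X[train] - xm, y[train] - ym, k)
        pred = (X[test] - xm) @ b + ym
        sq_err.extend((y[test] - pred) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

# Toy data: y depends on a low-variance factor, so the first PC alone
# (dominated by the high-variance factor) predicts poorly.
rng = np.random.default_rng(2)
factors = rng.standard_normal((80, 2)) * np.array([3.0, 1.0])
X = factors @ rng.standard_normal((2, 6)) + 0.05 * rng.standard_normal((80, 6))
y = factors[:, 1] + 0.05 * rng.standard_normal(80)

errors = {k: rmsecv(X, y, k) for k in range(1, 7)}
```

The resulting dictionary plays the role of the RMSECV curve: plotting errors against k reveals where adding further PCs stops paying off.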
9.1.4 PCR application using the PLS-toolbox

We shall apply PCR to a Slurry-Fed Ceramic Melter (SFCM) system (Wise et al., 1991), where nuclear waste from fuel reprocessing is combined with glass-forming materials.

Data from the process, consisting of temperatures at 20 locations within the melter and the molten glass level, are shown in the figure. It is apparent that there is a great deal of correlation in the data; many of the variables appear to follow a sawtooth pattern.

We shall develop a PCR model that will enable estimation of the level of molten glass using temperature measurements.
Starting and loading data

The SFCM temperature and molten glass level data are stored in the file plsdata.mat. The file contains 300 calibration or training samples (xblock1 and yblock1) and 200 test samples (xblock2 and yblock2). We will load the data into MATLAB and delete a few samples that are known to be outliers.
Preprocessing of data

Now that the data are loaded, we need to decide how to preprocess the data for modelling. Because the predictor (temperature) variables with the greatest variance in this data set also appear to be correlated with the molten glass level, we choose to mean-centre (rather than autoscale) the data. The scaling of Y is irrelevant if there is only one Y variable, as in this case.
A preliminary model
Cross-validation

We must now decide how to cross-validate the model. We will choose to split the data into ten contiguous block-wise subsets, and to calculate all twenty PCs.
A new model
Choice of principal components

Now that the PCR model and the cross-validation results have been computed, one can view the cross-validation results in various ways. A common plot used to analyse cross-validation results is an RMSECV plot (root mean squared error of cross-validation).
Note how the RMSECV has several local minima and a global minimum at eleven PCs. Two rules of thumb:
- do not include a PC unless it improves the RMSECV by at least 2%
- use a model with the lowest possible complexity among close alternatives.

Here the rules suggest that a model with six PCs would be the best choice.
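One way to encode these two rules of thumb in code (my reading of them, not the toolbox's exact logic; the RMSECV values below are hypothetical): move to a larger model only when its RMSECV beats the currently chosen model's by at least 2%.

```python
def choose_n_pcs(rmsecv, min_improvement=0.02):
    """Smallest model honouring the rules of thumb.

    rmsecv[i] is the RMSECV for i + 1 components. A candidate with k
    components replaces the current choice only if its RMSECV is at least
    `min_improvement` (fractionally) below the current choice's RMSECV.
    """
    chosen = 1
    for k in range(2, len(rmsecv) + 1):
        if rmsecv[k - 1] < (1.0 - min_improvement) * rmsecv[chosen - 1]:
            chosen = k
    return chosen

# Toy curve: clear gains up to 3 PCs, then a long shallow plateau where
# each further PC improves the error by well under 2%.
errs = [1.00, 0.60, 0.50, 0.495, 0.492, 0.491]
n_pcs = choose_n_pcs(errs)
```

For this curve the shallow improvements after the third component never clear the 2% threshold, so the low-complexity model with 3 PCs is chosen even though the global minimum lies further right.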
Model suggested by CV

Choose the desired number of PCs by clicking on the corresponding line.
Returning to the RMSECV plot for the final PCR model, we note that some PCs in the final model (specifically, PCs 2, 4 and 5) result in an increase in the model's estimated prediction error. This suggests that these specific PCs, although they help explain variation in the X variables (temperatures), are not useful for prediction of the molten glass level.
Saving the model
The model can be saved
- to the MATLAB workspace: the name can be changed; the model will be lost after the session unless saved with the MATLAB save command
- to disk: the model is exported to a file.
9.2 Partial least-squares regression (PLS)

9.2.1 Overview

PLS stands for Projection to Latent Structures by means of Partial Least Squares and is a method to relate a matrix X to a vector y or to a matrix Y.

Essentially, the model structures of PLS and PCR are the same:
- The x-data are first transformed into a set of (a few) latent variables (components).
- The latent variables are used for regression (by OLS) with one or several dependent variables.

The regression criterion (most often applied) is maximum covariance between the scores of dependent variables and regressor variables (i.e. latent variables). Maximum covariance combines high variance of the X scores with high correlation with the Y scores.
Relationship with MLR and PCR

PLS is related to both MLR (OLS) and PCR/PCA. To see this, consider the linear regression model

    y = Xb + e

The best prediction/estimation of y using X is ŷ = Xb.

MLR by OLS maximizes the correlation between y and ŷ, as seen from

    b_OLS = arg max_b r_(y,ŷ) = arg max_b (y^T ŷ) / [(y^T y)(ŷ^T ŷ)]^(1/2)
          = arg max_b (b^T X^T y) / [(y^T y)(b^T X^T X b)]^(1/2)

whose maximizing direction is b ∝ (X^T X)^{-1} X^T y.

PCR also maximizes the correlation between y and ŷ, but with the constraint b = Pg, dim(g) = k ≤ p = dim(b), where P is the PCA loading matrix that maximizes the variance of the columns in T = XP. This is seen from

    g_PCR = arg max_g (g^T T^T y) / [(y^T y)(g^T T^T T g)]^(1/2)

whose maximizing direction is g ∝ (T^T T)^{-1} T^T y.
As a summary of this, we can say that
- MLR (OLS) gives the prediction ŷ = X b_OLS = X I b_OLS, i.e. the identity matrix I can be considered as a loading matrix
- PCR/PCA gives the prediction ŷ = X P g_PCR, where P is the loading matrix that maximizes the variance of the columns in T = XP.

The idea in PLS is to determine a loading matrix W for the prediction ŷ = X W h_PLS, where both W and h_PLS are determined in such a way that they help maximize the correlation between y and ŷ. This solution is
- different from MLR (OLS) because of the constraint b = W h, dim(h) = k ≤ p = dim(b)
- different from PCR/PCA because the loading matrix W is determined to maximize the correlation between y and ŷ.

We have also said that PLS maximizes the covariance between the dependent variable(s) and the scores. When the scores are constrained to have a given constant variance (e.g. = 1), this is equivalent to maximizing the correlation.
9.2.2 One dependent variable

The main purpose of PLS is to determine a linear model for prediction (estimation) of one or several dependent variables from a set of predictor (independent) variables. The modelling procedure is here outlined for one dependent variable.

The first PLS-component is calculated as the latent variable which has the maximum covariance between the scores and the modelled property y:

    w_1 = arg max_{w, ||Xw|| = 1} y^T X w,    t_1 = X w_1

Next, the information (variance) of this component is removed from the X data. This process is called peeling or deflation. It is a projection of the X space onto a (hyper-)plane that is orthogonal to the direction of the found component. The resulting matrix after deflation is

    X_1 = X - t_1 t_1^T X = (I - t_1 t_1^T) X

which, as required, satisfies t_1^T X_1 = 0.
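The extraction-plus-deflation loop can be sketched in NumPy. This is my own illustration using the common NIPALS-style normalization (w proportional to X^T y with unit-norm w and unit-length scores, a standard variant of the criterion above): the deflation X_l = (I - t t^T) X_{l-1} makes all score vectors mutually orthogonal.

```python
import numpy as np

def pls1_components(X, y, k):
    """First k PLS1 weight/score pairs via NIPALS-style deflation."""
    Xd = X.copy()
    W, T = [], []
    for _ in range(k):
        w = Xd.T @ y                   # direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xd @ w
        t /= np.linalg.norm(t)         # unit-length score
        Xd = Xd - np.outer(t, t @ Xd)  # deflation: X_l = (I - t t^T) X_{l-1}
        W.append(w)
        T.append(t)
    return np.array(W).T, np.array(T).T

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(40)
y -= y.mean()

W, T = pls1_components(X, y, 3)
```

After each deflation the new score is built from a matrix orthogonal to all previous scores, which is why the columns of T come out uncorrelated while the weight vectors need not be orthogonal to the loadings.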
The next PLS-component is derived from the residual matrix X_1, again with maximum covariance between the scores and y. This procedure is continued to produce sufficiently many PLS-components. The final choice of the number of components to retain in the model can be made as for PCR (mainly using cross-validation).

Some comments
- In the standard versions of PLS, the scores of the PLS-components are uncorrelated; the loading vectors, however, are in general not orthogonal.
- Because PLS-components are developed as latent variables possessing a high correlation with y, the optimum number of PLS-components is usually smaller than the optimum number of PCA-components in PCR.
- However, PLS models may be less stable than PCR models because less X variance is contained.
- The more components are used, the more similar PCR and PLS models become.
A complicating aspect of most PLS algorithms is the stepwise calculation of components. After a component is computed, the residual matrices for X (and Y) are determined. The next PLS-component is calculated from the residual matrices, and therefore its parameters (scores, loadings, weights) do not relate to the original matrices. However, equations exist that relate the PLS parameters to the original data and that also provide the regression coefficients b_PLS of the final model for the original data.
9.2.3 Many dependent variables

If there is more than one dependent variable, the dependent data are stored in a matrix Y. The basic regression model is then

    Y = X B + E

where B is a matrix of regression coefficients and E is a matrix of residuals. Here columns b_j and e_j in B and E correspond to column y_j in Y.

If the dependent variables are considered to be mutually independent (uncorrelated), PLS (or any other regression method) for one dependent variable can be applied to each variable y_j, one at a time. MLR using OLS then gives the solution B = (X^T X)^{-1} X^T Y.

If the dependent variables are correlated, it is best to deal with them jointly. PLS, which is called PLS2 when there is more than one dependent variable, is then very suitable.
Main idea

In PLS2, both X and Y are decomposed into scores and loadings:
- p_l and t_l, l = 1, ..., k, are loadings and scores of latent variables for X
- q_l and u_l, l = 1, ..., k, are loadings and scores of latent variables for Y
in such a way that the covariance between the X and Y scores is maximized.

The regression is performed between the X and Y scores. The regression coefficients can be transformed to allow direct estimation of Y from X.
Mathematical development

There are many variants of PLS algorithms. It is e.g. possible to specify that
- the loading vectors are orthogonal (the Eigenvector algorithm)
- the scores are uncorrelated (i.e. orthogonal) and the loadings non-orthogonal (algorithms such as Kernel, NIPALS, SIMPLS, O-PLS).

Here a PLS2 method producing uncorrelated scores is outlined. X and Y are modelled by linear latent variables as

    X = T P^T + E_X   and   Y = U Q^T + E_Y

Instead of the loadings P and Q, new loading vectors w_l and c_l that satisfy

    t_l = X_{l-1} w_l   and   u_l = Y c_l,   l = 1, ..., k

are introduced. Here X_0 = X, and X_l, l >= 1, is an updated (deflated) version of X.

For the first PLS-component, the vectors w_1 and c_1 are determined by solving

    [w_1, c_1] = arg max_{w,c} (Xw)^T (Yc),   subject to ||Xw|| = 1, ||Yc|| = 1
The vectors w_1 and c_1 can be found by solving eigenvalue problems:
- w_1 is the eigenvector corresponding to the largest eigenvalue of X^T Y Y^T X
- c_1 is the eigenvector corresponding to the largest eigenvalue of Y^T X X^T Y.

The first component scores are given by t_1 = X w_1 and u_1 = Y c_1.

The first component loadings for X are given by p_1 = X^T t_1 (= X^T X w_1). The deflated matrix is given by

    X_1 = X - t_1 t_1^T X = (I - t_1 t_1^T) X

The first component loadings for Y are not needed for the regression. The deflated matrix Y_1 could be calculated similarly, but it is not needed.

The second PLS-components are calculated similarly from X_1 and Y. The procedure is repeated to obtain sufficiently many components. When all components are known, the regression coefficients are given by

    B = W (P^T W)^{-1} C^T,   W = [w_1 ... w_k],   C = [c_1 ... c_k]
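The whole procedure fits in a short NumPy sketch (an illustration under assumed toy data, not a reference implementation). The vectors w_l and c_l are taken as the dominant left/right singular vectors of X_{l-1}^T Y, which is equivalent to the two eigenvalue problems above; for the final coefficients, the Y-side vectors used in B = W (P^T W)^{-1} C^T are taken as the regression of Y on the unit-length scores, so they carry the scaling that makes B directly usable for prediction.

```python
import numpy as np

def pls2(X, Y, k):
    """PLS2 with uncorrelated scores; only X is deflated."""
    Xd = X.copy()
    W, P, C = [], [], []
    for _ in range(k):
        # Dominant singular pair of X_{l-1}^T Y <=> top eigenvectors of
        # X^T Y Y^T X (for w) and Y^T X X^T Y (for c).
        u_l, _, _ = np.linalg.svd(Xd.T @ Y)
        w = u_l[:, 0]
        t = Xd @ w
        t /= np.linalg.norm(t)         # unit-length score
        p = Xd.T @ t                   # X loadings, p_l = X_{l-1}^T t_l
        c = Y.T @ t                    # Y-side coefficients (regression of
                                       # Y on the orthonormal score t_l)
        Xd = Xd - np.outer(t, t @ Xd)  # deflation: X_l = (I - t t^T) X_{l-1}
        W.append(w); P.append(p); C.append(c)
    W, P, C = np.array(W).T, np.array(P).T, np.array(C).T
    return W @ np.linalg.solve(P.T @ W, C.T)   # B = W (P^T W)^{-1} C^T

rng = np.random.default_rng(4)
factors = rng.standard_normal((50, 3))
X = factors @ rng.standard_normal((3, 5))    # exactly rank-3 X
X -= X.mean(axis=0)
Y = X @ rng.standard_normal((5, 2))          # Y exactly linear in X
Y -= Y.mean(axis=0)

B = pls2(X, Y, k=3)
```

With Y exactly linear in a rank-3 X, three components capture all of the predictive structure, so X @ B reproduces Y up to numerical precision.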
9.2.4 Geometric illustration [based on UMETRICS material]

Data consist of n observations and
- a set of p independent variables (inputs) in an n x p matrix X
- a set of q dependent variables (outputs) in an n x q matrix Y.

Each variable has a coordinate axis:
- p coordinates for X data
- q coordinates for Y data
- illustration here for p = q = 3.

Each observation is represented by
- one point in the X space
- one point in the Y space.

The mean value of each variable in both data sets is here denoted by a red dot in the two coordinate systems (it is not an observation). Data are here mean-centred.
The first PLS-component is
- a line in the X space
- a line in the Y space
calculated to
- approximate the points well in X and Y
- yield a good correlation between the projections t_1 and u_1 (the scores).

The vector directions are w_1 and c_1; the coordinates are t_1 and u_1.

The t_1-u_1 plot shows how well the first PLS-component models the data:
- points on the line are modelled exactly
- points not on the line may be modelled by other PLS-components.
The second PLS-component is also represented by lines in the X and Y spaces, calculated to
- approximate the points well
- provide a good correlation
in such a way that
- the X-lines are orthogonal
- the Y-lines need not be orthogonal.

These lines, with directions w_2 and c_2 and coordinates t_2 and u_2, improve the approximation and correlation as much as possible.

The second projection coordinates
- usually correlate less well than the first projection coordinates
- may correlate better than the first projection coordinates if there is a strong structure in X that is not related to (or present in) Y.
The first two PLS-components form planes in the X and Y spaces.

The variability around the X-plane can be used to calculate a tolerance interval within which new observations will (should) be located. An observation outside this interval implies that the model may not be valid for that observation.

Plotting successive pairs of latent variables against each other gives a good picture of the correlation structure. The plot in the SE corner indicates that there is almost no information left in the k:th pair of latent variables (t_k, u_k).
9.2.5 Evaluation and diagnostics

The PLS result and the data can be analysed and evaluated in many ways.

Detection of outliers

The techniques for outlier detection used in PCA can also be used in PLS. Since PLS uses a Y block in addition to the X block, one can also look for outliers in the prediction of Y.

The figure illustrates a way of plotting the prediction error against a score-related parameter ("leverage"):
- on the y-axis, the prediction error is autoscaled; the error unit is standard deviations
- on the x-axis, the leverage defines the influence of a given observation on the model; it is proportional to Hotelling's T^2.

Four of the marked observations are very clear outliers. These outliers should be removed from the model-building data.
Cross-validation

The standard techniques for selecting the number of latent variables based on cross-validation can be used. In addition, one can consider how much of the Y variation each LV describes, expressed e.g. in terms of the score sums of squares u_l^T u_l.

Relationships between observations

Relationships between observations can be studied by various score plots, e.g.
- u_1 vs. t_1, u_2 vs. t_2: a linear relationship with high correlation is desired
- t_2 vs. t_1, u_2 vs. u_1: no correlation is desired.
Variable interpretations

There are many ways to analyse the contribution and importance of variables in the PLS model, e.g.
- loadings on the LVs (the p's)
- Q residuals
- Hotelling's T^2 statistic
- regression coefficients
- VIP scores ("Variable Importance in Projection").

In the PLS-toolbox, these plots are obtained via Loadings plots.
9.2.6 PLS application using the PLS-toolbox

We shall apply PLS to the same Slurry-Fed Ceramic Melter (SFCM) system that was used in the PCR application (section 9.1.4). Except for the analysis startup, the model-building steps for PLS are exactly the same as for PCR, including cross-validation for the choice of the number of PLS-components.
Cross-validation results

Model-building combined with cross-validation produces the result shown. The variance captured by the model for each number of latent variables (LVs) is shown for
- the X block (% of t^T t)
- the Y block (% of u^T u).

The suggested model is (apparently) based on the X block variance. Note the small Y block variance for LV 4: LV 4 does not contain information about Y. This suggests that 3 LVs would be sufficient for predictive purposes.
Another (better) way to select the number of LVs is to consider the prediction error based on cross-validation, which can be quantified by an RMSECV plot:
- the figure suggests 3 or 4 LVs
- based on the rule of thumb (at least 2% improvement), 3 LVs is sufficient.
Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,
More informationCSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo
CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..
More informationDimension Reduction CS534
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More informationResources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.
Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationFEA and Multivariate Statistical Data Analysis of Polypropylene Tube Forming Process
FEA and Multivariate Statistical Data Analysis of Polypropylene ube Forming Process Galini Gavrilidou, Mukesh Jain Department of Mechanical Engineering, McMaster University, Hamilton, On, LS L7, Canada
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use
More information7. Collinearity and Model Selection
Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among
More informationPredict Outcomes and Reveal Relationships in Categorical Data
PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,
More information2017 ITRON EFG Meeting. Abdul Razack. Specialist, Load Forecasting NV Energy
2017 ITRON EFG Meeting Abdul Razack Specialist, Load Forecasting NV Energy Topics 1. Concepts 2. Model (Variable) Selection Methods 3. Cross- Validation 4. Cross-Validation: Time Series 5. Example 1 6.
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming
More informationUsing the DATAMINE Program
6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection
More informationUNIT 1: NUMBER LINES, INTERVALS, AND SETS
ALGEBRA II CURRICULUM OUTLINE 2011-2012 OVERVIEW: 1. Numbers, Lines, Intervals and Sets 2. Algebraic Manipulation: Rational Expressions and Exponents 3. Radicals and Radical Equations 4. Function Basics
More informationStudy Guide. Module 1. Key Terms
Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation
More informationCS 195-5: Machine Learning Problem Set 5
CS 195-5: Machine Learning Problem Set 5 Douglas Lanman dlanman@brown.edu 26 November 26 1 Clustering and Vector Quantization Problem 1 Part 1: In this problem we will apply Vector Quantization (VQ) to
More informationRSM Split-Plot Designs & Diagnostics Solve Real-World Problems
RSM Split-Plot Designs & Diagnostics Solve Real-World Problems Shari Kraber Pat Whitcomb Martin Bezener Stat-Ease, Inc. Stat-Ease, Inc. Stat-Ease, Inc. 221 E. Hennepin Ave. 221 E. Hennepin Ave. 221 E.
More informationGeneral Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
More informationMultiresponse Sparse Regression with Application to Multidimensional Scaling
Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,
More informationPrincipal Component Analysis
Copyright 2004, Casa Software Ltd. All Rights Reserved. 1 of 16 Principal Component Analysis Introduction XPS is a technique that provides chemical information about a sample that sets it apart from other
More information( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components
Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues
More informationFMRI data: Independent Component Analysis (GIFT) & Connectivity Analysis (FNC)
FMRI data: Independent Component Analysis (GIFT) & Connectivity Analysis (FNC) Software: Matlab Toolbox: GIFT & FNC Yingying Wang, Ph.D. in Biomedical Engineering 10 16 th, 2014 PI: Dr. Nadine Gaab Outline
More informationSYS 6021 Linear Statistical Models
SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are
More informationCSE 481C Imitation Learning in Humanoid Robots Motion capture, inverse kinematics, and dimensionality reduction
1 CSE 481C Imitation Learning in Humanoid Robots Motion capture, inverse kinematics, and dimensionality reduction Robotic Imitation of Human Actions 2 The inverse kinematics problem Joint angles Human-robot
More information2016 Stat-Ease, Inc. & CAMO Software
Multivariate Analysis and Design of Experiments in practice using The Unscrambler X Frank Westad CAMO Software fw@camo.com Pat Whitcomb Stat-Ease pat@statease.com Agenda Goal: Part 1: Part 2: Show how
More informationMODERN FACTOR ANALYSIS
MODERN FACTOR ANALYSIS Harry H. Harman «ö THE pigj UNIVERSITY OF CHICAGO PRESS Contents LIST OF ILLUSTRATIONS GUIDE TO NOTATION xv xvi Parti Foundations of Factor Analysis 1. INTRODUCTION 3 1.1. Brief
More informationCHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY
23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationLecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017
Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last
More information3D Geometry and Camera Calibration
3D Geometry and Camera Calibration 3D Coordinate Systems Right-handed vs. left-handed x x y z z y 2D Coordinate Systems 3D Geometry Basics y axis up vs. y axis down Origin at center vs. corner Will often
More informationExam Review: Ch. 1-3 Answer Section
Exam Review: Ch. 1-3 Answer Section MDM 4U0 MULTIPLE CHOICE 1. ANS: A Section 1.6 2. ANS: A Section 1.6 3. ANS: A Section 1.7 4. ANS: A Section 1.7 5. ANS: C Section 2.3 6. ANS: B Section 2.3 7. ANS: D
More information2014 Stat-Ease, Inc. All Rights Reserved.
What s New in Design-Expert version 9 Factorial split plots (Two-Level, Multilevel, Optimal) Definitive Screening and Single Factor designs Journal Feature Design layout Graph Columns Design Evaluation
More informationWorkload Characterization Techniques
Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/
More informationExploring association among quality indicators of the In-service Education Information Service
Exploring association among quality indicators of the In-service Education Information Service LUNG-HSING KUO, HUNG-JEN YANG, TSUNG-JUNG TSAI, FONG-CHING SU National Kaohsiung Normal University No.116,
More informationComputer Experiments: Space Filling Design and Gaussian Process Modeling
Computer Experiments: Space Filling Design and Gaussian Process Modeling Best Practice Authored by: Cory Natoli Sarah Burke, Ph.D. 30 March 2018 The goal of the STAT COE is to assist in developing rigorous,
More informationDesign of Fault Diagnosis System of FPSO Production Process Based on MSPCA
2009 Fifth International Conference on Information Assurance and Security Design of Fault Diagnosis System of FPSO Production Process Based on MSPCA GAO Qiang, HAN Miao, HU Shu-liang, DONG Hai-jie ianjin
More information3 Feature Selection & Feature Extraction
3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy
More informationRobust Regression. Robust Data Mining Techniques By Boonyakorn Jantaranuson
Robust Regression Robust Data Mining Techniques By Boonyakorn Jantaranuson Outline Introduction OLS and important terminology Least Median of Squares (LMedS) M-estimator Penalized least squares What is
More informationDI TRANSFORM. The regressive analyses. identify relationships
July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,
More informationChapter 9 Robust Regression Examples
Chapter 9 Robust Regression Examples Chapter Table of Contents OVERVIEW...177 FlowChartforLMS,LTS,andMVE...179 EXAMPLES USING LMS AND LTS REGRESSION...180 Example 9.1 LMS and LTS with Substantial Leverage
More informationA NEW VARIABLES SELECTION AND DIMENSIONALITY REDUCTION TECHNIQUE COUPLED WITH SIMCA METHOD FOR THE CLASSIFICATION OF TEXT DOCUMENTS
A NEW VARIABLES SELECTION AND DIMENSIONALITY REDUCTION TECHNIQUE COUPLED WITH SIMCA METHOD FOR THE CLASSIFICATION OF TEXT DOCUMENTS Ahmed Abdelfattah Saleh University of Brasilia, Brasil ahmdsalh@yahoo.com
More informationLouis Fourrier Fabien Gaie Thomas Rolf
CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted
More informationSection 2.1: Intro to Simple Linear Regression & Least Squares
Section 2.1: Intro to Simple Linear Regression & Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 Regression:
More informationData preprocessing Functional Programming and Intelligent Algorithms
Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationInformation Driven Healthcare:
Information Driven Healthcare: Machine Learning course Lecture: Feature selection I --- Concepts Centre for Doctoral Training in Healthcare Innovation Dr. Athanasios Tsanas ( Thanasis ), Wellcome Trust
More informationCSC 411: Lecture 14: Principal Components Analysis & Autoencoders
CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Raquel Urtasun & Rich Zemel University of Toronto Nov 4, 2015 Urtasun & Zemel (UofT) CSC 411: 14-PCA & Autoencoders Nov 4, 2015 1 / 18
More informationChapter 1. Using the Cluster Analysis. Background Information
Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationFinal Report: Kaggle Soil Property Prediction Challenge
Final Report: Kaggle Soil Property Prediction Challenge Saurabh Verma (verma076@umn.edu, (612)598-1893) 1 Project Goal Low cost and rapid analysis of soil samples using infrared spectroscopy provide new
More informationFEATURE or input variable selection plays a very important
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 1101 Feature Selection Using a Piecewise Linear Network Jiang Li, Member, IEEE, Michael T. Manry, Pramod L. Narasimha, Student Member,
More informationSAS/STAT 15.1 User s Guide The HPPLS Procedure
SAS/STAT 15.1 User s Guide The HPPLS Procedure This document is an individual chapter from SAS/STAT 15.1 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute Inc.
More informationSPSS INSTRUCTION CHAPTER 9
SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can
More informationA Course in Machine Learning
A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling
More informationCSC 411: Lecture 14: Principal Components Analysis & Autoencoders
CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 14-PCA & Autoencoders 1 / 18
More informationCOMP61011 Foundations of Machine Learning. Feature Selection
OMP61011 Foundations of Machine Learning Feature Selection Pattern Recognition: The Early Days Only 200 papers in the world! I wish! Pattern Recognition: The Early Days Using eight very simple measurements
More informationLISA: Explore JMP Capabilities in Design of Experiments. Liaosa Xu June 21, 2012
LISA: Explore JMP Capabilities in Design of Experiments Liaosa Xu June 21, 2012 Course Outline Why We Need Custom Design The General Approach JMP Examples Potential Collinearity Issues Prior Design Evaluations
More informationData Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures
More informationQQ normality plots Harvey Motulsky, GraphPad Software Inc. July 2013
QQ normality plots Harvey Motulsky, GraphPad Software Inc. July 213 Introduction Many statistical tests assume that data (or residuals) are sampled from a Gaussian distribution. Normality tests are often
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationFeature Selection in Knowledge Discovery
Feature Selection in Knowledge Discovery Susana Vieira Technical University of Lisbon, Instituto Superior Técnico Department of Mechanical Engineering, Center of Intelligent Systems, IDMEC-LAETA Av. Rovisco
More informationBootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping
Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationComputer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging
Prof. Daniel Cremers 8. Boosting and Bagging Repetition: Regression We start with a set of basis functions (x) =( 0 (x), 1(x),..., M 1(x)) x 2 í d The goal is to fit a model into the data y(x, w) =w T
More informationLecture 7: Linear Regression (continued)
Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions
More information