Basics of Multivariate Modelling and Data Analysis

Kurt-Erik Häggblom

9. Linear regression with latent variables
9.1 Principal component regression (PCR)
9.2 Partial least-squares regression (PLS)

[mostly from Varmuza and Filzmoser (2009) and the PLS-toolbox manual by Wise et al. (2006)]

9. Linear regression with latent variables

9.1 Principal component regression (PCR)

9.1.1 Overview

The case with many correlated regressor variables (independent variables that are collinear) is notoriously difficult in classical multiple linear regression (MLR) such as ordinary least-squares (OLS) regression. Usually it is necessary to select a subset of variables to reduce the number of regressors and the collinearity. In general, this is a very difficult task. However, principal component regression (PCR) is a way of avoiding or simplifying the variable selection task.

PCR is a combination of
- principal component analysis (PCA)
- multiple linear regression (MLR), usually OLS
where the PC scores are used as regressor variables:
- since the scores are orthogonal, the multicollinearity problem is avoided
- since often only a few PCs are retained, the number of regressor variables is small.

9.1.2 Calculating regression coefficients

Multiple linear regression (MLR)

If the data are mean-centred, the regression model in MLR is

    y = Xb + e

Using OLS, the regression coefficients b are determined by minimizing e^T e. This gives a solution that can be expressed as

    b_OLS = (X^T X)^{-1} X^T y

The problem here is the inverse (X^T X)^{-1}:
- for collinear X data, it is very sensitive to small errors in X, which means that b_OLS is also very sensitive to small errors
- the number of observations must be larger than the number of variables.
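
To make the collinearity problem concrete, the following base-MATLAB sketch (not part of the course material; all data are made up) fits an OLS model to two nearly collinear regressors and shows how a tiny perturbation of X changes the coefficients drastically.

    % OLS on nearly collinear data: a minimal, illustrative sketch
    rng(1);                                 % reproducibility
    n  = 50;
    x1 = randn(n,1);
    X  = [x1, x1 + 1e-4*randn(n,1)];        % two almost identical regressors
    y  = x1 + 0.1*randn(n,1);
    X  = X - mean(X);  y = y - mean(y);     % mean-centre, as assumed by y = X*b + e
    b_ols = (X'*X) \ (X'*y);                % b_OLS = (X^T X)^{-1} X^T y
    Xp = X;  Xp(1,2) = Xp(1,2) + 1e-6;      % perturb a single entry of X slightly
    b_pert = (Xp'*Xp) \ (Xp'*y);
    disp([b_ols b_pert])                    % the two coefficient vectors differ wildly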

Principal component regression

In PCR, a principal component analysis (PCA) is first done:

    T = XP,   X = TP^T + E

The latter expression is inserted into the linear regression model:

    y = Xb + e = (TP^T + E)b + e = Tg + e_PCR

where g = P^T b and e_PCR = Eb + e. Minimization of e_PCR^T e_PCR by OLS gives

    g = (T^T T)^{-1} T^T y

Here the inverse is well-conditioned, and it always exists, because
- the columns of T are orthogonal
- the number of principal components is never larger than the number of observations.

In terms of b, the solution for the linear regression model can be expressed as

    b_PCR = Pg = P(T^T T)^{-1} T^T y
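
The following sketch (illustrative only, with made-up data; the PLS-toolbox has its own routines) computes b_PCR in base MATLAB by taking the first k PCA loadings from the SVD of the mean-centred X.

    % PCR via the SVD: a minimal sketch
    rng(1);
    n = 50; k = 2;                                    % observations, retained PCs (assumed)
    X = randn(n,3);
    X = [X, X(:,1)+X(:,2), X(:,2)+1e-3*randn(n,1)];   % add collinear columns
    y = X(:,1) - 2*X(:,4) + 0.1*randn(n,1);
    Xc = X - mean(X);  yc = y - mean(y);              % mean-centred data
    [~,~,V] = svd(Xc,'econ');
    P = V(:,1:k);                                     % PCA loadings, k components
    T = Xc*P;                                         % scores, T = X*P
    g = (T'*T) \ (T'*yc);                             % g = (T^T T)^{-1} T^T y
    b_pcr = P*g;                                      % b_PCR = P*g, in the original variables
    yhat  = Xc*b_pcr + mean(y);                       % fitted values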

9.1.3 Selecting principal components

The problem

Overfitting of a regression model is strongly related to the collinearity problem. For PCR this means that
- PCR is less susceptible to overfitting than MLR, because it directly addresses the collinearity problem
- a PCR model could nevertheless become overfitted through the retention of too many principal components.
Therefore, an important part of PCR is the determination of the optimal number of PCs to retain in the model.

Another problem is that
- PCA determines and ranks PCs to explain as much as possible of the variance of the regressor variables x_j
- in PCR, we want PCs that give the best possible prediction of the dependent variable y; such PCs have a high correlation with y.
Therefore, some variable selection technique has to be applied also in PCR; it is not necessarily best to choose the highest-ranked PCs according to the PCA.

Cross-validation

As for PCA and MLR, cross-validation is an important tool for variable selection. This means that the data have to be split observation-wise into
- a modelling (or training) data set
- a test data set.
The prediction residual error on the test-set observations is then determined as a function of the (number of) PCs retained in the PCR model.

This procedure is usually repeated several times using different selections of observation subsets for the training and test sets, such that each sample in the original data set is part of a test set at least once. A good rule of thumb is to use the square root of the number of observations for the number of repetitions of the cross-validation procedure, up to a maximum of ten repetitions.

A plot of the total composite prediction error over all test sets, as a function of the (number of) PCs retained in the model, is used to determine the optimal (number of) PCs; a sketch of such a loop is given below.
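
A minimal plain-MATLAB sketch of this procedure (made-up stand-in data; software such as the PLS-toolbox automates it): repeated random splits, with the prediction error accumulated for every candidate number of PCs.

    % RMSECV for PCR by repeated train/test splits: an illustrative sketch
    rng(1);
    n = 60; p = 8;
    X = randn(n,p);  y = X(:,1) + 0.5*X(:,2) + 0.1*randn(n,1);   % stand-in data
    kmax  = p;
    ntest = round(n/5);                        % hold out ~20% per repetition
    nrep  = min(10, round(sqrt(n)));           % sqrt(n) repetitions, at most ten
    press = zeros(kmax,1);                     % accumulated squared prediction errors
    for r = 1:nrep
        idx   = randperm(n);
        test  = idx(1:ntest);
        train = idx(ntest+1:end);
        mx = mean(X(train,:));  my = mean(y(train));
        Xt = X(train,:) - mx;   yt = y(train) - my;
        [~,~,V] = svd(Xt,'econ');
        for k = 1:kmax
            P = V(:,1:k);  T = Xt*P;
            b = P*((T'*T)\(T'*yt));            % PCR coefficients with k PCs
            e = (y(test) - my) - (X(test,:) - mx)*b;
            press(k) = press(k) + sum(e.^2);
        end
    end
    rmsecv = sqrt(press/(nrep*ntest));         % RMSECV as a function of the number of PCs
    plot(1:kmax, rmsecv), xlabel('number of PCs'), ylabel('RMSECV')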

9.1.4 PCR application using the PLS-toolbox

We shall apply PCR to a Slurry-Fed Ceramic Melter (SFCM) system (Wise et al., 1991), where nuclear waste from fuel reprocessing is combined with glass-forming materials. Data from the process, consisting of temperatures at 20 locations within the melter and the molten glass level, are shown in the figure. It is apparent that there is a great deal of correlation in the data; many of the variables appear to follow a sawtooth pattern. We shall develop a PCR model that will enable estimation of the molten glass level using the temperature measurements.

Starting and loading data

The SFCM temperature and molten glass level data are stored in the file plsdata.mat. The file contains 300 calibration or training samples (xblock1 and yblock1) and 200 test samples (xblock2 and yblock2). We will load the data into MATLAB and delete a few samples that are known to be outliers.
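
A sketch of this step, treating the data blocks as plain matrices. The file and variable names are from the notes; the outlier indices below are placeholders, since the actual indices are not listed here.

    % Load the SFCM data and remove known outlier samples (indices are hypothetical)
    load plsdata                       % provides xblock1, yblock1, xblock2, yblock2
    outliers = [73 167 188];           % placeholder indices of the known outliers
    xblock1(outliers,:) = [];          % delete the outlier rows from X ...
    yblock1(outliers,:) = [];          % ... and the corresponding rows from y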

Preprocessing of data

Now that the data are loaded, we need to decide how to preprocess them for modelling. Because the predictor (temperature) variables with the greatest variance in this data set also appear to be correlated with the molten glass level, we choose to mean-centre (rather than autoscale) the data. The scaling of Y is irrelevant if there is only one Y variable, as in this case.
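
In plain MATLAB, this preprocessing choice amounts to subtracting column means without dividing by standard deviations (a sketch; the PLS-toolbox applies it via its preprocessing menu, and xblock1/yblock1 are assumed to be plain matrices in the workspace).

    % Mean-centring (no autoscaling): a minimal sketch
    X  = xblock1;  y = yblock1;        % calibration blocks from plsdata.mat
    mx = mean(X);
    Xc = X - mx;                       % mean-centred temperatures
    yc = y - mean(y);                  % centred level; scaling is irrelevant for one y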

A preliminary model

[PLS-toolbox screenshots of building a preliminary PCR model.]

Cross-validation

We must now decide how to cross-validate the model. We choose to split the data into ten contiguous block-wise subsets, and to calculate all twenty PCs.
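
The chosen scheme corresponds to the following indexing in plain MATLAB (a sketch with stand-in data; the model fitting inside the loop is the same as in the earlier RMSECV sketch).

    % Ten contiguous blocks, up to twenty PCs: a sketch of the CV scheme
    X = randn(300,20);  y = randn(300,1);   % stand-ins for the SFCM calibration data
    n = size(X,1);  nsplit = 10;  kmax = 20;
    edges = round(linspace(0, n, nsplit+1));
    for s = 1:nsplit
        test  = (edges(s)+1):edges(s+1);    % one contiguous block as the test set
        train = setdiff(1:n, test);
        % fit PCR models with 1..kmax PCs on X(train,:), y(train) and
        % accumulate the squared prediction errors on X(test,:), y(test)
    end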

A new model

[PLS-toolbox screenshots of building the cross-validated PCR model.]

Choice of principal components

Now that the PCR model and the cross-validation results have been computed, one can view the cross-validation results in various ways. A common plot used to analyse cross-validation results is an RMSECV plot (root mean squared error of cross-validation).

Note how the RMSECV has several local minima and a global minimum at eleven PCs. Two rules of thumb:
- do not include a PC unless it improves the RMSECV by at least 2%
- use a model with the lowest possible complexity among close alternatives.
Here the rules suggest that a model with six PCs would be the best choice.
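
The 2% rule is easy to apply programmatically; a sketch with a made-up RMSECV vector (chosen so that the rule stops at six PCs, as in the example):

    % Applying the "at least 2% improvement" rule to an RMSECV curve
    rmsecv = [1.00 0.80 0.70 0.66 0.60 0.52 0.515 0.51 0.50 0.49 0.485];  % made-up values
    k = 1;
    while k < numel(rmsecv) && rmsecv(k+1) < 0.98*rmsecv(k)
        k = k + 1;              % add a PC only while RMSECV improves by at least 2%
    end
    fprintf('Suggested number of PCs: %d\n', k)   % prints 6 for these values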

Model suggested by CV

Choose the desired number of PCs by clicking on the corresponding line in the plot.

Returning to the RMSECV plot for the final PCR model, we note that some PCs in the final model (specifically, PCs 2, 4 and 5) result in an increase in the model's estimated prediction error. This suggests that these specific PCs, although they help explain variation in the X variables (temperatures), are not useful for prediction of the molten glass level.

Saving the model

[PLS-toolbox screenshots of the model-saving dialog.]

The model can be exported
- to the MATLAB workspace: the name can be changed, but the model will be lost after the session unless saved with the MATLAB save command
- to disk, where it is saved as a file.

9.2 Partial least-squares regression (PLS)

9.2.1 Overview

PLS stands for Projection to Latent Structures by means of Partial Least Squares and is a method to relate a matrix X to a vector y or to a matrix Y. Essentially, the model structures of PLS and PCR are the same:
- The x-data are first transformed into a set of (a few) latent variables (components).
- The latent variables are used for regression (by OLS) with one or several dependent variables.
The regression criterion (most often applied) is maximum covariance between the scores of the dependent variables and the scores of the regressor variables (i.e. the latent variables). Maximum covariance combines high variance of the X scores with high correlation with the Y scores.

Relationship with MLR and PCR

PLS is related to both MLR (OLS) and PCR/PCA. To see this, consider the linear regression model

    y = Xb + e

The best prediction/estimation of y using X is ŷ = Xb.

MLR by OLS maximizes the correlation between y and ŷ,

    b_OLS = argmax_b r_{y,ŷ}

as seen from

    r_{y,ŷ} = y^T ŷ / [(y^T y)^{1/2} (ŷ^T ŷ)^{1/2}] = b^T X^T y / [(y^T y)^{1/2} (b^T X^T X b)^{1/2}]

whose maximizer, with the scaling that minimizes e^T e, is b_OLS = (X^T X)^{-1} X^T y.

PCR also maximizes the correlation between y and ŷ, but with the constraint

    b = Pg,   dim(g) = k < p = dim(b)

where P is the PCA loading matrix that maximizes the variance of the columns in T = XP. This is seen from

    g_PCR = argmax_g r_{y,ŷ} = argmax_g g^T T^T y / [(y^T y)^{1/2} (g^T T^T T g)^{1/2}]

whose maximizer, with the OLS scaling, is g_PCR = (T^T T)^{-1} T^T y.

As a summary of this, we can say that
- MLR (OLS) gives the prediction ŷ = Xb_OLS = XIb_OLS, i.e. the identity matrix I can be considered as a loading matrix
- PCR/PCA gives the prediction ŷ = XPg_PCR, where P is the loading matrix that maximizes the variance of the columns in T = XP.
The idea in PLS is to determine a loading matrix W for the prediction ŷ = XWh_PLS, where both W and h_PLS are determined in such a way that they help maximize the correlation between y and ŷ. This solution is
- different from MLR (OLS) because of the constraint b = Wh, dim(h) = k < p = dim(b)
- different from PCR/PCA because the loading matrix W is determined to maximize the correlation between y and ŷ.
We have also said that PLS maximizes the covariance between the dependent variable(s) and the scores. When the scores are constrained to have a given constant variance (e.g. = 1), this is equivalent to maximizing the correlation.

9.2.2 One dependent variable

The main purpose of PLS is to determine a linear model for prediction (estimation) of one or several dependent variables from a set of predictor (independent) variables. The modelling procedure is here outlined for one dependent variable.

The first PLS component is calculated as the latent variable which has the maximum covariance between the scores t_1 = Xw_1 and the modelled property y:

    w_1 = argmax_{w, ||Xw||=1} y^T Xw

Next, the information (variance) of this component is removed from the X data. This process is called peeling or deflation. It is a projection of the X space onto a (hyper)plane that is orthogonal to the direction of the found component. The resulting matrix after deflation is

    X_1 = X - t_1 t_1^T X = (I - t_1 t_1^T) X

(note that t_1^T t_1 = 1 because of the constraint ||Xw_1|| = 1), which, as required, has the property X_1^T t_1 = 0.
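
The following base-MATLAB sketch (made-up data) implements this procedure for k components. The scores are scaled to unit length so the deflation takes exactly the form above, and the final coefficients are mapped back to the original variables with b = W(P^T W)^{-1} q.

    % PLS1 with deflation: a minimal, illustrative sketch
    rng(1);
    n = 50; p = 6; k = 3;                    % assumed sizes and number of components
    X = randn(n,p);  y = X(:,1) + 0.5*X(:,3) + 0.1*randn(n,1);
    Xd = X - mean(X);  yd = y - mean(y);     % mean-centred working copies
    W = zeros(p,k);  P = zeros(p,k);  T = zeros(n,k);
    for a = 1:k
        w = Xd'*yd;                          % direction of maximum covariance with y
        w = w/norm(Xd*w);                    % scale w so that t = Xd*w has unit length
        t = Xd*w;
        P(:,a) = Xd'*t;                      % loading p_a = X^T t (with t^T t = 1)
        W(:,a) = w;  T(:,a) = t;
        Xd = Xd - t*(t'*Xd);                 % deflation: X_a = (I - t t^T) X_{a-1}
    end
    q = T'*yd;                               % inner OLS of y on the orthonormal scores
    b_pls = W*((P'*W)\q);                    % coefficients for the original (centred) X
    yhat  = (X - mean(X))*b_pls + mean(y);   % fitted values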

The next PLS component is derived from the residual matrix X_1, again with maximum covariance between the scores and y. This procedure is continued to produce sufficiently many PLS components. The final choice of the number of components to retain in the model can be made as for PCR (mainly using cross-validation).

Some comments
- In the standard versions of PLS, the scores of the PLS components are uncorrelated; the loading vectors are in general not orthogonal.
- Because PLS components are developed as latent variables possessing a high correlation with y, the optimum number of PLS components is usually smaller than the optimum number of PCA components in PCR.
- However, PLS models may be less stable than PCR models because less X variance is contained. The more components are used, the more similar PCR and PLS models become.

A complicating aspect of most PLS algorithms is the stepwise calculation of components. After a component is computed, the residual matrices for X (and Y) are determined. The next PLS component is calculated from the residual matrices, and therefore its parameters (scores, loadings, weights) do not relate to the original matrices. However, equations exist that relate the PLS parameters to the original data X and that also provide the regression coefficients b_PLS of the final model for the original data.

9.2.3 Many dependent variables

If there is more than one dependent variable, the dependent data are stored in a matrix Y. The basic regression model is then

    Y = XB + E

where B is a matrix of regression coefficients and E is a matrix of residuals. Here columns b_j and e_j in B and E correspond to column y_j in Y.

If the dependent variables are considered to be mutually independent (uncorrelated), PLS (or any other regression method) for one dependent variable can be applied to each variable y_j, one at a time. MLR using OLS then gives the solution B = (X^T X)^{-1} X^T Y.

If the dependent variables are correlated, it is best to deal with them jointly. PLS, which is called PLS2 when there is more than one dependent variable, is then very suitable.

Main idea

In PLS2, both X and Y are decomposed into scores and loadings:
- p_l and t_l, l = 1, ..., k, are loadings and scores of latent variables for X
- q_l and u_l, l = 1, ..., k, are loadings and scores of latent variables for Y
in such a way that the covariance between the X and Y scores is maximized. The regression is performed between the X and Y scores. The regression coefficients can be transformed to allow direct estimation of Y from X.

Mathematical development

There are many variants of PLS algorithms. It is e.g. possible to specify that
- the loading vectors are orthogonal (the Eigenvector algorithm)
- the scores are uncorrelated (i.e. orthogonal) and the loadings are non-orthogonal (algorithms such as Kernel, NIPALS, SIMPLS, O-PLS).
Here a PLS2 method producing uncorrelated scores is outlined.

X and Y are modelled by linear latent variables as

    X = TP^T + E_X   and   Y = UQ^T + E_Y

Instead of the loadings P and Q, new loading vectors w_l and c_l that satisfy

    t_l = X_{l-1} w_l   and   u_l = Y c_l,   l = 1, ..., k

are introduced. Here X_0 = X, and X_l, l ≥ 1, is an updated (deflated) version of X.

For the first PLS component, the vectors w_1 and c_1 are determined by solving

    [w_1, c_1] = argmax (Xw)^T Yc   subject to ||Xw|| = 1, ||Yc|| = 1

The vectors w_1 and c_1 can be found by solving eigenvalue problems:
- w_1 is the eigenvector corresponding to the largest eigenvalue of X^T Y Y^T X
- c_1 is the eigenvector corresponding to the largest eigenvalue of Y^T X X^T Y.
The first-component scores are given by t_1 = Xw_1 and u_1 = Yc_1.
The first-component loadings for X are given by

    p_1 = X^T t_1 (= X^T X w_1)

The first-component loadings for Y are not needed for the regression.
The deflated matrix is given by

    X_1 = X - t_1 t_1^T X = (I - t_1 t_1^T) X

The deflated matrix Y_1 could be calculated similarly, but it is not needed. The second PLS component is calculated similarly from X_1 and Y. The procedure is repeated to obtain sufficiently many components. When all components are known, the regression coefficients are given by

    B = W (P^T W)^{-1} C^T,   W = [w_1 ... w_k],   C = [c_1 ... c_k]
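
A base-MATLAB sketch of this PLS2 outline (made-up data). Here w_1 is obtained as the dominant left singular vector of X^T Y, which is equivalent to the eigenvalue problem above; for the regression step, C stores the least-squares loadings Y^T t, which point in the same direction as the eigenvector c.

    % PLS2 with uncorrelated scores: a minimal, illustrative sketch
    rng(1);
    n = 60; p = 6; m = 2; k = 2;                 % observations, X vars, Y vars, components
    X = randn(n,p);
    Y = [X(:,1)+0.2*randn(n,1), X(:,2)-X(:,3)+0.2*randn(n,1)];
    mx = mean(X);  my = mean(Y);
    Xd = X - mx;   Y0 = Y - my;
    W = zeros(p,k); C = zeros(m,k); P = zeros(p,k); T = zeros(n,k);
    for a = 1:k
        [u,~,~] = svd(Xd'*Y0);                   % dominant left singular vector of X^T Y
        w = u(:,1);  w = w/norm(Xd*w);           % scale so that t = Xd*w has unit length
        t = Xd*w;
        P(:,a) = Xd'*t;                          % X loadings, p_a = X^T t
        C(:,a) = Y0'*t;                          % Y loadings from regression of Y on t
        W(:,a) = w;  T(:,a) = t;
        Xd = Xd - t*(t'*Xd);                     % deflate X; Y need not be deflated
    end
    B = W*((P'*W)\C');                           % B = W (P^T W)^{-1} C^T
    Yhat = (X - mx)*B + my;                      % direct estimation of Y from X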

9.2.4 Geometric illustration [based on UMETRICS material]

Data consist of n observations and
- a set of p independent variables (inputs) in an n × p matrix X
- a set of q dependent variables (outputs) in an n × q matrix Y.
Each variable has a coordinate axis:
- p coordinates for the X data
- q coordinates for the Y data
(the illustration is for p = q = 3).
Each observation is represented by
- one point in the X space
- one point in the Y space.
The mean value of each variable in both data sets is here denoted by a red dot in the two coordinate systems (it is not an observation). Data are here mean-centred.

The first PLS component is
- a line in the X space
- a line in the Y space
calculated to
- approximate the points well in X and Y
- yield a good correlation between the projections t_1 and u_1 (the scores).
The vector directions are w_1 and c_1; the coordinates are t_1 and u_1.
The t_1-u_1 plot shows how well the first PLS component models the data:
- points on the line are modelled exactly
- points not on the line may be modelled by other PLS components.

The second PLS component is also represented by lines in the X and Y spaces, calculated to
- approximate the points well
- provide a good correlation
in such a way that
- the X lines are orthogonal
- the Y lines may be orthogonal.
These lines, with directions w_2 and c_2 and coordinates t_2 and u_2, improve the approximation and correlation as much as possible.
The second projection coordinates
- usually correlate less well than the first projection coordinates
- may correlate better than the first projection coordinates if there is a strong structure in X that is not related to (or present in) Y.

The first two PLS components form planes in the X and Y spaces.
- The variability around the X plane can be used to calculate a tolerance interval within which new observations will (should) be located. An observation outside this interval implies that the model may not be valid for this data.
- Plotting successive pairs of latent variables against each other gives a good picture of the correlation structure.
- The plot in the SE corner indicates that there is almost no information left in the k:th pair of latent variables t_k, u_k.

9.2.5 Evaluation and diagnostics

The PLS result and the data can be analysed and evaluated in many ways.

Detection of outliers

The techniques for outlier detection used in PCA can also be used in PLS. Since PLS uses a Y block in addition to the X block, one can also look for outliers in the prediction of Y. The figure illustrates a way of plotting the prediction error against a score-related parameter ("leverage"):
- on the y-axis, the prediction error is autoscaled, so the error unit is standard deviations
- on the x-axis, the leverage defines the influence of a given observation on the model; it is proportional to Hotelling's T².
Four of the marked observations are very clear outliers. These outliers should be removed from the model-building data.

Cross-validation

The standard techniques for selecting the number of latent variables based on cross-validation can be used. In addition, one can consider how much of the Y variation each LV describes, expressed e.g. as the score product u^T u of that LV relative to the total.

Relationships between observations

Relationships between observations can be studied by various score plots, e.g.
- u_1 vs. t_1, u_2 vs. t_2: a linear relationship with high correlation is desired
- t_2 vs. t_1, u_2 vs. u_1: no correlation is desired.

Variable interpretations

There are many ways to analyse the contribution and importance of variables in the PLS model, e.g.
- loadings on the LVs (the p's)
- Q residuals
- Hotelling's T² statistic
- regression coefficients
- VIP scores ("Variable Importance in Projection").
In the PLS-toolbox, these plots are obtained via Loadings plots.

9.2.6 PLS application using the PLS-toolbox

We shall apply PLS to the same Slurry-Fed Ceramic Melter (SFCM) system that was used in the PCR application (section 9.1.4). Except for the analysis startup, the model-building steps for PLS are exactly the same as for PCR, including cross-validation for the choice of the number of PLS components.

Cross-validation results

Model building combined with cross-validation produces the result shown. The variance captured by the model for each number of latent variables (LVs) is shown for
- the X block (% t^T t)
- the Y block (% u^T u).
The suggested model is (apparently) based on the X block variance. Note the small Y block variance for LV 4: LV 4 does not contain information about Y. This suggests that 3 LVs would be sufficient for predictive purposes.

Another (better) way to select the number of LVs is to consider the prediction error based on cross-validation, which can be quantified by an RMSECV plot:
- the figure suggests 3 or 4 LVs
- based on the rule of thumb (at least 2% improvement), 3 LVs are sufficient.
