Multivariate Analysis Multivariate Calibration part 2


1 Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com

2 Linear Latent Variables An essential concept in multivariate data analysis is the mathematical combination of several variables into a new variable that has a certain desired property. In chemometrics such a new variable is often called a latent variable. A linear latent variable has the form u = b_1 x_1 + b_2 x_2 + … + b_m x_m, where u is the score (the value of the linear latent variable), b_j are the loadings (coefficients describing the influence of the variables on the score), and x_j are the variables (features).

3 Calibration Linear models: y = b_0 + b_1 x_1 + b_2 x_2 + … + b_j x_j + … + b_m x_m + e, where b_0 is called the intercept, b_1 to b_m the regression coefficients, m the number of variables, and e the residual (error term). Often mean-centered data are used, and then b_0 becomes zero. This model corresponds to a linear latent variable.

4 Calibration The parameters of a model are estimated from a calibration set (training set) containing the values of the x-variables and y for n samples (objects). The resulting model is evaluated with a test set (with known values for x and y). Because modeling and prediction of the y-data is a defined aim of the data analysis, this type of data treatment is called supervised learning.

5 Calibration All regression methods aim at the minimization of residuals, for instance minimization of the sum of the squared residuals. It is essential to focus on minimal prediction errors for new cases (the test set), not only for the calibration set from which the model has been created. It is relatively easy to create a model, especially with many variables and possibly nonlinear features, that fits the calibration data very well; however, it may be useless for new cases. This effect of overfitting is a crucial topic in model creation.

6 Calibration Regression can be performed directly with the values of the variables (ordinary least-squares regression, OLS), but in the most powerful methods, such as principal component regression (PCR) and partial least-squares regression (PLS), it is done via a small set of intermediate linear latent variables (the components). This approach has important advantages:
- Data with highly correlating x-variables can be used
- Data sets with more variables than samples can be used
- The complexity of the model can be controlled by the number of components; thus overfitting can be avoided and maximum prediction performance for test set data can be approached.

7 Calibration Depending on the type of data, different methods are available:

Number of x-variables | Number of y-variables | Name         | Methods
1                     | 1                     | Simple       | OLS, robust regression
Many                  | 1                     | Multiple     | PLS, PCR, multiple OLS, robust regression, Ridge regression, Lasso regression
Many                  | Many                  | Multivariate | PLS2, Canonical Correlation Analysis (CCA)

8 Calibration OLS model with three variables: y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3

9 Calibration

10 Calibration OLS model using only x_1 and x_2: y = b_0 + b_1 x_1 + b_2 x_2

11 Performance of Regression Models Any model for prediction makes sense only if appropriate criteria are defined and applied to measure the performance of the model. For models based on regression, the residuals (prediction errors) e_i = y_i − ŷ_i are the basis for performance measures, with y_i the given (experimental, "true") value and ŷ_i the predicted (modeled) value of an object i. An often-used performance measure estimates the standard deviation of the prediction errors (standard error of prediction, SEP).

12 Performance of Regression Models Using the same objects for calibration and test should be strictly avoided. Depending on the size of the data set (the number of objects available) and on the effort of work, different strategies are possible. The following levels are ordered by typical application to data sets of decreasing size and also by decreasing reliability of the results.

13 Performance of Regression Models 1. If data from many objects are available, a split into three sets is best:
i. Training set (ca. 50% of the objects) for creating models
ii. Validation set (ca. 25% of the objects) for optimizing the model to obtain good prediction performance
iii. Test set (prediction set, approximately 25%) for testing the final model to obtain a realistic estimate of the prediction performance for new cases
The three sets are treated separately. Applications in chemistry rarely allow this strategy because of the too small number of objects available.

14 Performance of Regression Models 2. The data are split into a calibration set used for model creation and optimization and a test set (prediction set) to obtain a realistic estimate of the prediction performance for new cases:
i. The calibration set is divided into a training set and a validation set by cross validation (CV) or bootstrap
ii. First the optimum complexity (for instance the optimum number of PLS components) of the model is estimated, and then a model is built from the whole calibration set applying the found optimum complexity; this model is applied to the test set.

15 Performance of Regression Models 3. CV or bootstrap is used to split the data set into different calibration sets and test sets:
i. A calibration set is used as described in (2) to create an optimized model, and this is applied to the corresponding test set
ii. All objects are in principle used in a training set, a validation set, and a test set; however, an object is never simultaneously used for model creation and for testing
iii. This strategy (double CV, double bootstrap, or a combination of CV and bootstrap) is applicable to a relatively small number of objects; furthermore, the process can be repeated many times with different random splits, resulting in a high number of test-set-predicted values.

16 Performance of Regression Models Mostly the split of the objects into training, validation, and test sets is performed by simple random sampling. More sophisticated procedures related to experimental design and the theory of sampling are available. The Kennard-Stone algorithm claims to set up a calibration set that is representative of the population, to cover the x-space as uniformly as possible, and to give more weight to objects outside the center. This aim is reached by selecting objects with maximum distances (for instance Euclidean distances) in the x-space.
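The Kennard-Stone selection described above can be sketched in a few lines. This is an illustrative Python/NumPy version (function name and toy data are our own, not from the slides): start with the two mutually most distant objects, then repeatedly add the object whose minimum distance to the already-selected set is largest.

```python
import numpy as np

def kennard_stone(X, k):
    """Select k representative rows of X by the Kennard-Stone algorithm:
    start with the two most distant points, then repeatedly add the point
    whose minimum distance to the selected set is largest."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # pairwise Euclidean distances in the x-space
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # start with the two mutually most distant objects
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    while len(selected) < k:
        remaining = [p for p in range(n) if p not in selected]
        # each candidate's distance to its closest already-selected object
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected

# toy x-space: points on a line; the extremes are picked first
X = np.array([[0.0], [1.0], [2.0], [10.0]])
picked = kennard_stone(X, 3)
```

Note how the selection favors objects far from the center, as the slide states.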

17 Performance of Regression Models The methods CV and bootstrap are called resampling strategies. For small data sets they are applied to obtain a reasonably high number of predictions. The larger the data set and the more reliable the data are, the better the prediction performance can be estimated: size of the data × friendliness of the data × uncertainty of the performance measure = constant.

18 Overfitting and Underfitting The more complex (the "larger") a model is, the better it is capable of fitting given data (the calibration set). The prediction error for the calibration set in general decreases with increasing complexity of the model. An appropriately complicated model can fit almost any data with almost zero deviations (residuals) between the experimental (true) y and the modeled (predicted) ŷ. Such models are not necessarily useful for new cases, because they are probably overfitted.

19 Overfitting and Underfitting Image source: Dr. Frank Dieterle

20 Overfitting and Underfitting Image source: holehouse.org

21 Overfitting and Underfitting Prediction errors for new cases are high for small models (underfitting, low complexity, too simple models) but also for overfitted models. Determination of the optimum complexity of a model is an important but not always easy task, because the minimum of the measures for the prediction error for test sets is often not well marked. In chemometrics, the complexity is typically controlled by the number of PLS or PCA components (latent variables), and the optimum complexity is estimated by CV. CV or bootstrap allows an estimation of the prediction error for each object of the calibration set at each considered model complexity.

22 Performance Criteria The basis of all performance criteria are the prediction errors (residuals) y_i − ŷ_i. The classical standard deviation of the prediction errors is widely used as a measure of the spread of the error distribution and is called the standard error of prediction (SEP), defined by

SEP = sqrt( (1/(z−1)) Σ_{i=1}^{z} (y_i − ŷ_i − bias)² )

with

bias = (1/z) Σ_{i=1}^{z} (y_i − ŷ_i)

where y_i are the given (experimental, "true") values, ŷ_i are the predicted (modeled) values, and z is the number of predictions.
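The definitions of bias and SEP above translate directly into code. A minimal Python/NumPy sketch (toy data are our own):

```python
import numpy as np

def bias_and_sep(y, y_pred):
    """Bias = arithmetic mean of the prediction errors; SEP = standard
    deviation of the errors around the bias (denominator z - 1)."""
    e = np.asarray(y, float) - np.asarray(y_pred, float)
    z = e.size
    bias = e.mean()
    sep = np.sqrt(((e - bias) ** 2).sum() / (z - 1))
    return bias, sep

y      = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.2])
bias, sep = bias_and_sep(y, y_pred)
```

Here the errors are (−0.1, 0.1, −0.2, −0.2), so the bias is −0.1 and the SEP measures the spread around that systematic offset.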

23 Performance Criteria The bias is the arithmetic mean of the prediction errors and should be near zero. A systematic error (a nonzero bias) may appear if, for instance, a calibration model is applied to data that have been produced by another instrument. In the case of a normal distribution, about 95% of the prediction errors are within the tolerance interval ±2·SEP. The measure SEP and the tolerance interval are given in the units of y.

24 Performance Criteria The standard error of calibration (SEC) is similar to SEP but applied to predictions of the calibration set. The mean squared error (MSE) is the arithmetic mean of the squared errors:

MSE = (1/z) Σ_{i=1}^{z} (y_i − ŷ_i)²

MSEC refers to results from a calibration set, MSECV to results obtained in CV, and MSEP to results from a prediction/test set. MSE minus the squared bias gives the squared SEP: SEP² = MSE − bias².

25 Performance Criteria The root mean squared error (RMSE) is the square root of the MSE and can again be given for calibration (RMSEC), CV (RMSECV), or prediction/test (RMSEP):

RMSE = sqrt( (1/z) Σ_{i=1}^{z} (y_i − ŷ_i)² )

MSE is preferably used during the development and optimization of models but is less useful for practical applications because it does not have the units of the predicted property. A similar widely used measure is the predicted residual error sum of squares (PRESS), the sum of the squared errors:

PRESS = Σ_{i=1}^{z} (y_i − ŷ_i)² = z · MSE
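These error measures are one-liners in practice. A Python/NumPy sketch (toy data are our own) that also makes the relation PRESS = z · MSE visible:

```python
import numpy as np

def error_measures(y, y_pred):
    """MSE (mean of squared errors), RMSE (its square root) and
    PRESS (sum of squared errors, i.e. z * MSE)."""
    e = np.asarray(y, float) - np.asarray(y_pred, float)
    z = e.size
    mse = (e ** 2).sum() / z
    return mse, np.sqrt(mse), (e ** 2).sum()

y      = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.2])
mse, rmse, press = error_measures(y, y_pred)
```

Unlike MSE, the RMSE has the units of the predicted property, which is why it is preferred for reporting.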

26 Performance Criteria Correlation measures between the experimental y and the predicted ŷ are frequently used to characterize the model performance. Most often used is the squared Pearson correlation coefficient.

27 Criteria for Models with Different Numbers of Variables The model should not contain too small a number of variables, because this leads to poor prediction performance. On the other hand, it should also not contain too large a number of variables, because this results in overfitting and thus again poor prediction performance. The adjusted R-square is

R²_adj = 1 − ((n − 1)/(n − m − 1)) (1 − R²)

where n is the number of objects, m is the number of regressor variables (not counting the intercept), and R² is called the coefficient of determination, expressing the proportion of variance that is explained by the model.

28 Criteria for Models with Different Numbers of Variables Another, equivalent representation for R²_adj is

R²_adj = 1 − (RSS/(n − m − 1)) / (TSS/(n − 1))

with the residual sum of squares (RSS), the sum of the squared residuals,

RSS = Σ_{i=1}^{n} (y_i − ŷ_i)²

and the total sum of squares (TSS), the sum of the squared differences to the mean ȳ of y,

TSS = Σ_{i=1}^{n} (y_i − ȳ)²
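The RSS/TSS representation can be checked numerically. A Python/NumPy sketch (toy data and function name are our own):

```python
import numpy as np

def r2_adjusted(y, y_pred, m):
    """R^2 and adjusted R^2 from RSS and TSS; m = number of regressor
    variables (not counting the intercept), n = number of objects."""
    y = np.asarray(y, float)
    y_pred = np.asarray(y_pred, float)
    n = y.size
    rss = ((y - y_pred) ** 2).sum()          # residual sum of squares
    tss = ((y - y.mean()) ** 2).sum()        # total sum of squares
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (rss / (n - m - 1)) / (tss / (n - 1))
    return r2, r2_adj

# four objects, one regressor variable (m = 1)
r2, r2_adj = r2_adjusted([1.0, 2.0, 3.0, 4.0],
                         [1.1, 1.9, 3.2, 3.8], m=1)
```

Since R²_adj penalizes extra variables, it is always at most R² for m ≥ 1.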

29 Cross Validation (CV) It is the most used resampling strategy to obtain a reasonably large number of predictions. It is also often applied to optimize the complexity of a model (the optimum number of PLS or PCA components) and to split the data into calibration sets and test sets.

30 Cross Validation (CV) The procedure of CV applied to model optimization:
- The available set with n objects is randomly split into s segments (parts) of approximately equal size; the number of segments can be 2 to n, and often values between 4 and 10 are used
- One segment is left out as a validation set
- The other s − 1 segments are used as a training set to create models of increasing complexity (for instance 1, 2, 3, …, a_max PLS components)
- The models are separately applied to the objects of the validation set, resulting in predicted values connected to the different model complexities
- This procedure is repeated so that each segment is a validation set once
- The result is a matrix with n rows and a_max columns containing the values ŷ_CV (predicted by CV) for all objects and all considered model complexities
- From this matrix and the known y-values, a residual matrix is computed
- An error measure (for instance MSECV) is calculated from the residuals, and the lowest MSECV or a similar criterion indicates the optimum model complexity.
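The procedure above can be sketched end to end. This is an illustrative Python/NumPy version (the PCR implementation, segment count, and simulated data are our own assumptions): MSECV is computed for each candidate number of principal components, and the complexity with the lowest MSECV would be chosen.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcr_fit_predict(Xtr, ytr, Xte, a):
    """Principal component regression with a components:
    mean-center, project on the first a PCs, regress y on the scores."""
    xm, ym = Xtr.mean(axis=0), ytr.mean()
    _, _, Vt = np.linalg.svd(Xtr - xm, full_matrices=False)
    P = Vt[:a].T                                   # loading vectors
    q = np.linalg.lstsq((Xtr - xm) @ P, ytr - ym, rcond=None)[0]
    return ym + (Xte - xm) @ P @ q

def mse_cv(X, y, a, s=4):
    """MSECV for model complexity a using s random segments."""
    idx = rng.permutation(len(y))
    err = np.empty(len(y))
    for seg in np.array_split(idx, s):             # each segment once
        tr = np.setdiff1d(idx, seg)                # remaining s-1 segments
        err[seg] = y[seg] - pcr_fit_predict(X[tr], y[tr], X[seg], a)
    return float((err ** 2).mean())

# simulated data with two dominant latent directions driving y
n, m = 60, 5
T = rng.normal(size=(n, 2)) * np.array([3.0, 2.0])
X = T @ rng.normal(size=(2, m)) + 0.1 * rng.normal(size=(n, m))
y = T[:, 0] - T[:, 1] + 0.1 * rng.normal(size=n)
mse_by_a = {a: mse_cv(X, y, a) for a in (1, 2, 3)}
```

With two informative latent directions, one component underfits, so MSECV drops sharply from a = 1 to a = 2.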

31 Cross Validation (CV) K. Baumann, TrAC 2003, 22(6), 395

32 Cross Validation (CV) A single CV gives n predictions. For many data sets in chemistry n is too small for a visualization of the error distribution. The performance measure depends on the split of the objects into segments. It is therefore recommended to repeat the CV with different random splits into segments (repeated CV) and to summarize the results.

33 Cross Validation (CV) If the number of segments is equal to the number of objects (each segment contains only one object), the method is called leave-one-out CV or full CV. Randomization of the sequence of objects is pointless here, therefore only one CV run is necessary, resulting in n predictions. The number of created models is n, which may be time-consuming for large data sets. Depending on the data, full CV may give too optimistic results, especially if pairwise similar objects are in the data set, for instance from duplicate measurements. Full CV is easier to apply than repeated CV or bootstrap, and in many cases it gives a reasonable first estimate of the model performance.

34 Bootstrap As a verb, bootstrap has survived into the computer age, being the origin of the phrase "booting a computer." "Boot" is a shortened form of the word "bootstrap," and refers to the initial commands necessary to load the rest of the computer's operating system. Thus, the system is "booted," or "pulled up by its own bootstraps"

35 Bootstrap Within multivariate analysis, the bootstrap is a resampling method that can be used as an alternative to CV, for instance to estimate the prediction performance of a model or to estimate the optimum complexity. In general, the bootstrap can be used to estimate the distribution of model parameters. Basic ideas of bootstrapping are resampling with replacement, and using calibration sets with the same number of objects, n, as in the available data set. A calibration set is obtained by randomly selecting objects and copying (not moving) them into the calibration set.
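The idea of estimating the distribution of a model parameter by resampling with replacement can be sketched for the simplest case, the OLS slope. This is a Python/NumPy illustration (simulated data and the percentile interval are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_slopes(x, y, n_boot=1000):
    """Draw calibration sets of size n with replacement, refit the OLS
    slope each time, and collect the estimates (its distribution)."""
    n = len(x)
    slopes = np.empty(n_boot)
    for k in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resampling with replacement
        xb, yb = x[idx], y[idx]            # objects are copied, not moved
        xc, yc = xb - xb.mean(), yb - yb.mean()
        slopes[k] = (xc @ yc) / (xc @ xc)
    return slopes

# simulated line y = 1 + 2x with noise
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=30)
slopes = bootstrap_slopes(x, y)
lo, hi = np.percentile(slopes, [2.5, 97.5])   # percentile interval
```

The spread of the collected slopes approximates the sampling distribution of the parameter, and the percentiles give an empirical confidence interval.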

36 Ordinary Least-Squares Regression

37 Simple OLS y = b_0 + b·x + e, where b and b_0 are the regression parameters (regression coefficients): b_0 is the intercept and b is the slope. Since the data will in general not follow a perfect linear relation, the vector e contains the residuals (errors) e_1, e_2, …, e_n.

38 Simple OLS The predicted (modeled) property ŷ_i for sample i and the prediction error e_i are calculated by

ŷ_i = b_0 + b·x_i
e_i = y_i − ŷ_i

The ordinary least-squares (OLS) approach minimizes the sum of the squared residuals Σ e_i² to estimate the model parameters b and b_0:

b = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²
b_0 = ȳ − b·x̄

For mean-centered data (x̄ = 0, ȳ = 0), b_0 = 0 and

b = Σ_{i=1}^{n} x_i y_i / Σ_{i=1}^{n} x_i² = xᵀy / xᵀx
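The closed-form expressions above can be verified on data that lie exactly on a line. A Python/NumPy sketch (toy data are our own):

```python
import numpy as np

def ols_simple(x, y):
    """Least-squares slope and intercept from the closed-form solution:
    b = sum((x-xbar)(y-ybar)) / sum((x-xbar)^2), b0 = ybar - b*xbar."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    b0 = y.mean() - b * x.mean()
    return b0, b

# data on the exact line y = 1 + 2x, so all residuals are zero
b0, b = ols_simple([0, 1, 2, 3], [1, 3, 5, 7])
```

For perfectly linear data the estimates recover the true intercept and slope.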

39 Simple OLS The described model best fits the given (calibration) data but is not necessarily optimal for predictions. The least-squares approach can become very unreliable if outliers are present in the data. Assumptions for obtaining reliable estimates:
- Errors are only in y, not in x
- Residuals are uncorrelated and normally distributed with mean 0 and constant variance σ² (homoscedasticity).

40 Simple OLS The following scatterplots are plots of x_i (measurement) vs. i (observation number), with the sample mean marked by a red horizontal line. The measurement is plotted on the vertical axis; the observation number on the horizontal axis.
- Unbiased: the average of the observations in every thin vertical strip is the same all the way across the scatterplot
- Biased: the average of the observations changes, depending on which thin vertical strip you pick
- Homoscedastic: the variation (σ) of the observations is the same in every thin vertical strip all the way across the scatterplot
- Heteroscedastic: the variation (σ) of the observations in a thin vertical strip changes, depending on which vertical strip you pick
Panels: (a) unbiased and homoscedastic; (b) unbiased and heteroscedastic; (c) biased and homoscedastic; (d) biased and heteroscedastic; (e) unbiased and heteroscedastic; (f) biased and homoscedastic.

41 Simple OLS Besides estimating the regression coefficients, it is also of interest to estimate the variation of the measurements around the fitted regression line. This means that the residual variance σ² has to be estimated:

s_e² = (1/(n−2)) Σ_{i=1}^{n} (y_i − ŷ_i)² = (1/(n−2)) Σ_{i=1}^{n} e_i²

The denominator n−2 is used here because two parameters are necessary for a fitted straight line, and this makes s_e² an unbiased estimator for σ². Confidence intervals for intercept and slope are

b_0 ± t_{n−2;p} s_{b0}
b ± t_{n−2;p} s_b

with the standard deviations of b_0 and b

s_{b0} = s_e sqrt( Σ_{i=1}^{n} x_i² / (n Σ_{i=1}^{n} (x_i − x̄)²) )
s_b = s_e / sqrt( Σ_{i=1}^{n} (x_i − x̄)² )

where t_{n−2;p} is the p-quantile of the t-distribution with n−2 degrees of freedom, with for instance p = 0.975 for a 95% confidence interval.
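The estimators s_e², s_{b0}, and s_b can be implemented directly from these formulas. A Python/NumPy sketch (function name and toy data are our own); on data lying exactly on a line, all three quantities vanish:

```python
import numpy as np

def ols_standard_errors(x, y):
    """s_e^2 with denominator n-2, and the standard deviations of the
    intercept (s_b0) and slope (s_b) for simple OLS."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = x.size
    xc = x - x.mean()
    b = (xc @ (y - y.mean())) / (xc @ xc)
    b0 = y.mean() - b * x.mean()
    e = y - (b0 + b * x)                    # residuals
    se2 = (e @ e) / (n - 2)                 # residual variance estimate
    s_b = np.sqrt(se2 / (xc @ xc))
    s_b0 = np.sqrt(se2 * (x @ x) / (n * (xc @ xc)))
    return se2, s_b0, s_b

# exact line y = 1 + 2x: residuals are zero, so all estimates are zero
se2, s_b0, s_b = ols_standard_errors([0, 1, 2, 3], [1, 3, 5, 7])
```

Multiplying s_{b0} and s_b by the appropriate t-quantile then yields the confidence intervals from the slide.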

42 Simple OLS Confidence interval for the residual variance σ²:

(n−2) s_e² / χ²_{n−2;1−p} < σ² < (n−2) s_e² / χ²_{n−2;p}

where χ²_{n−2;1−p} and χ²_{n−2;p} are the appropriate quantiles of the chi-square distribution (ref 1, table) with n−2 degrees of freedom (e.g., p = 0.025 for a 95% confidence interval). Null hypothesis H_0: b_0 = 0, with the test statistic

T_{b0} = b_0 / s_{b0}

H_0 is rejected at the significance level α if |T_{b0}| > t_{n−2;1−α/2}. The test for b = 0 is equivalent: T_b = b / s_b.

43 Simple OLS Often it is of interest to obtain a confidence interval for the prediction at a new x value:

ŷ ± s_e sqrt(2 F_{2,n−2;p}) sqrt( 1/n + (x − x̄)² / Σ_{i=1}^{n} (x_i − x̄)² )

with F_{2,n−2;p} the p-quantile of the F-distribution with 2 and n−2 degrees of freedom. The best predictions are possible in the mid part of the range of x, where most information is available.

44 Simple OLS Using the open source software R (cf. "An Introduction to R"):

> x=c(1.5,2,2.5,2.9,3.4,3.7,4,4.2,4.6,5,5.5,5.7,6.6)
> y=c(3.5,6.1,5.6,7.1,6.2,7.2,8.9,9.1,8.5,9.4,9.5,11.3,11.1)
> res<-lm(y~x) # linear model for y on x; the symbol ~
>              # allows constructing a formula for the relation
> plot(x,y)
> abline(res)

45 Simple OLS > summary(res)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                   **
x                                       e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: ... on 11 degrees of freedom
Multiple R-squared: ..., Adjusted R-squared: ...
F-statistic: ... on 1 and 11 DF, p-value: 1.049e-06

Notes on the output:
- The Estimate column contains b_0 and b; the Std. Error column contains s_{b0} and s_b; the t value column contains T_{b0} and T_b
- The first quartile (1Q) is the value below which 25% of the residuals lie
- The residual standard error is s_e
- Multiple R-squared, in the case of univariate x and y, is the same as the squared Pearson correlation coefficient, and it is a measure of model fit
- The F-statistic tests whether all parameters are zero, against the alternative that at least one regression parameter is different from zero; since the p-value is ~0, at least one of intercept or slope contributes to the regression model
- Since the p-values are much smaller than a reasonable significance level, e.g. α = 0.05, both intercept and slope are important in our regression model.

46 Multiple OLS

y_1 = b_0 + b_1 x_11 + b_2 x_12 + … + b_m x_1m + e_1
y_2 = b_0 + b_1 x_21 + b_2 x_22 + … + b_m x_2m + e_2
…
y_n = b_0 + b_1 x_n1 + b_2 x_n2 + … + b_m x_nm + e_n

or y = Xb + e (multiple linear regression model), with X of size n × (m+1), which includes in its first column n values of 1. The residuals are calculated by e = y − ŷ. The regression coefficients b = (b_0, b_1, …, b_m)ᵀ result from the OLS estimation minimizing the sum of squared residuals eᵀe:

b = (XᵀX)⁻¹ Xᵀ y
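The matrix solution b = (XᵀX)⁻¹Xᵀy can be sketched in Python/NumPy (toy data are our own; lstsq is used instead of an explicit inverse for numerical stability, which gives the same solution for full-rank X):

```python
import numpy as np

def ols_multiple(X, y):
    """Multiple OLS: prepend a column of ones for the intercept and
    solve the least-squares problem (equivalent to (X'X)^-1 X'y)."""
    X1 = np.column_stack([np.ones(len(X)), X])   # first column of 1s
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return b

# five objects, two x-variables, exact relation y = 1 + 2*x1 - x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
b = ols_multiple(X, y)
```

For an exact linear relation, the coefficient vector (b_0, b_1, b_2) is recovered exactly.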

47 Multiple OLS Confidence Intervals and Statistical Tests The following assumptions must be fulfilled: the errors e are independent and n-dimensionally normally distributed, with mean vector 0 and covariance matrix σ²·I_n. An unbiased estimator for the residual variance σ² is

s_e² = (1/(n−m−1)) Σ_{i=1}^{n} (y_i − ŷ_i)² = (1/(n−m−1)) (y − Xb)ᵀ(y − Xb)

The null hypothesis b_j = 0 against the alternative b_j ≠ 0 can be tested with the test statistic

z_j = b_j / (s_e sqrt(d_j))

where d_j is the jth diagonal element of (XᵀX)⁻¹. The distribution of z_j is t_{n−m−1}, and thus a large absolute value of z_j will lead to a rejection of the null hypothesis. An F-test can also be constructed to test the null hypothesis b_0 = b_1 = … = b_m = 0 against the alternative b_j ≠ 0 for any j = 0, 1, …, m.

48 Multiple OLS Using the open source software R:

> T=c(80,93,100,82,90,99,81,96,94,93,97,95,100,85,86,87)
> V=c(8,9,10,12,11,8,8,10,12,11,13,11,8,12,9,12)
> y=c(2256,2340,2426,2293,2330,2368,2250,2409,2364,2379,2440,2364,2404,2317,2309,2328)
> res<-lm(y~T+V) # linear model for y on T and V
> summary(res)

Call:
lm(formula = y ~ T + V)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                             e-12 ***
T                                       e-08 ***
V                                             **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: ... on 13 degrees of freedom
Multiple R-squared: 0.927, Adjusted R-squared: ...
F-statistic: 82.5 on 2 and 13 DF, p-value: 4.1e-08

49 Hat Matrix The hat matrix H combines the observed and predicted y-values, ŷ = H·y, and so it "puts the hat on y". The hat matrix is defined as

H = X (XᵀX)⁻¹ Xᵀ

The diagonal elements h_ii of the n × n matrix H reflect the influence of each value y_i on its own prediction ŷ_i.
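The hat matrix and its diagonal (the leverages) can be computed directly from the definition. A Python/NumPy sketch (toy data are our own); an object far from the center of the x-space gets the largest leverage:

```python
import numpy as np

# design matrix with intercept column for four x values; x = 10 is extreme
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 10.0])])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix H = X (X'X)^-1 X'
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = H @ y                             # "puts the hat on y"
h = np.diag(H)                            # leverage of each object
```

Two useful checks: H is idempotent (H·H = H), and its trace equals the number of fitted parameters.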

50 Multivariate OLS Multivariate linear regression relates several y-variables with several x-variables:

Y = XB + E

In terms of a single y-variable, y_j = X b_j + e_j, with the OLS estimator

b_j = (XᵀX)⁻¹ Xᵀ y_j

resulting in

B = (XᵀX)⁻¹ Xᵀ Y  and  Ŷ = XB

Matrix B consists of q loading vectors, each defining a direction in the x-space for a linear latent variable which has maximum Pearson correlation coefficient between y_j and ŷ_j for j = 1, …, q. The regression coefficients for all y-variables can be computed at once, however only for noncollinear x-variables and if m < n (m x-variables, n observations, and q y-variables). Alternative methods are PLS2 and CCA.

51 Variable Selection (Feature selection) For multiple regression, all available variables x_1, x_2, …, x_m were used to build a linear model for the prediction of the y-variable. This is useful as long as the number m of regressor variables is small, say not more than 10. OLS regression is no longer computable if the regressor variables are highly correlated or if the number of objects is lower than the number of variables. PCR and PLS can handle such data.

52 Variable Selection Arguments against the use of all available regressor variables: using all variables will produce a better fit of the model for the training data, since the residuals become smaller and thus the R² measure increases; however, we are usually not interested in maximizing the fit for the training data but in maximizing the prediction performance for the test data. Reduction of the regressor variables can avoid the effects of overfitting. Moreover, a regression model with a high number of variables is practically impossible to interpret.

53 Variable Selection Univariate and Bivariate Selection Methods Criteria for the elimination of regressor variables:
- A considerable percentage of the variable values is missing or below a small threshold
- All or nearly all variable values are equal
- The variable includes many and severe outliers
- Compute the correlation between pairs of regressor variables; if the correlation is high (positive or negative), exclude the variable having the larger sum of (absolute) correlation coefficients with all remaining regressor variables.

54 Variable Selection Criteria for the identification of potentially useful regressor variables: high variance of the variable; high (absolute) correlation coefficient with the y-variable.

55 Variable Selection Stepwise Selection Methods add or drop one variable at a time.
- Forward selection: start with the empty model (or with preselected variables) and add the variable that optimizes a criterion; continue to add variables until a stopping rule becomes active
- Backward elimination: start with the full model …
- Both directions are also possible.

56 Variable Selection An often-used version of stepwise variable selection works as follows:
- Select the variable with the highest absolute correlation coefficient with the y-variable; the number of selected variables is m_0 = 1
- Add each of the remaining x-variables separately to the selected variable; the number of variables in each subset is m_1 = 2
- Calculate

F = ((RSS_0 − RSS_1) / (m_1 − m_0)) / (RSS_1 / (n − m_1 − 1))

with RSS being the sum of the squared residuals, Σ_{i=1}^{n} (y_i − ŷ_i)²
- Consider the added variable which gives the highest F, and if the decrease of RSS is significant, take this variable as the second selected one. Significance: F > F_{m_1−m_0, n−m_1−1; 0.95}
- Forward selection would continue in the same way until no significant change occurs. Disadvantage: a selected variable cannot be removed later on
- Usually the better strategy is to continue with a backward step; then another forward step is done, followed by another backward step, and so on, until no significant change of RSS occurs or a defined maximum number of variables is reached.
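The F statistic for entering one variable can be sketched directly from the formula above. This is an illustrative Python/NumPy version (helper names and simulated data are our own): a variable that truly drives y gets a large F, an unrelated variable a small one.

```python
import numpy as np

def rss_fit(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return float(((y - X1 @ b) ** 2).sum())

def f_to_enter(X_sel, x_new, y):
    """F = ((RSS0 - RSS1)/(m1 - m0)) / (RSS1/(n - m1 - 1)) for adding
    one variable to the currently selected set X_sel."""
    n, m0 = len(y), X_sel.shape[1]
    m1 = m0 + 1
    rss0 = rss_fit(X_sel, y)                            # without x_new
    rss1 = rss_fit(np.column_stack([X_sel, x_new]), y)  # with x_new
    return ((rss0 - rss1) / (m1 - m0)) / (rss1 / (n - m1 - 1))

rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)                   # unrelated to y
y = 3.0 * x1 + rng.normal(scale=0.5, size=50)
empty = np.empty((50, 0))                  # start from the empty model
f1 = f_to_enter(empty, x1, y)              # relevant variable: large F
f2 = f_to_enter(empty, x2, y)              # irrelevant variable: small F
```

Comparing each candidate's F against the F-distribution quantile then decides whether the decrease of RSS is significant.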

57 Variable Selection Best-Subset/All-Subsets Regression allows excluding complete branches of the tree of all possible subsets, and thus finding the best subset for data sets with a moderate number of variables (leaps-and-bounds algorithm or regression-tree methods). Criteria for model selection:
- Adjusted R²
- Akaike's information criterion (AIC): AIC = n·log(RSS/n) + 2m
- Bayes information criterion (BIC)
- Mallow's Cp.
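The AIC formula makes the complexity penalty explicit: at equal fit (equal RSS), each additional variable adds 2 to the criterion. A Python/NumPy sketch (the numeric values are our own toy inputs):

```python
import numpy as np

def aic(rss_value, n, m):
    """Akaike's information criterion AIC = n*log(RSS/n) + 2m;
    smaller is better, and the 2m term penalizes extra variables."""
    return n * np.log(rss_value / n) + 2 * m

# same RSS but one extra variable: AIC prefers the smaller model
a_small = aic(10.0, 100, 2)
a_big   = aic(10.0, 100, 3)
```

A larger model is therefore only selected if its RSS reduction outweighs the 2-per-variable penalty.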

58 Variable Selection Branch-and-bound example over the subsets of x_1, x_2, x_3:

x_1 + x_2 + x_3
x_1 + x_2 (AIC = 10)   x_1 + x_3   x_2 + x_3 (AIC = 20)
x_1 (AIC > 8)   x_2 (AIC > 18)   x_3 (AIC > 18)

Since we want to select the model which gives the smallest value of the AIC, the complete branch below x_2 + x_3 can be ignored, because any submodel in this branch is worse (AIC > 18) than the model x_1 + x_2 with AIC = 10.

59 Variable Selection Variable Selection Based on PCA or PLS Models These methods form new latent variables by using linear combinations of the regressor variables, b_1 x_1 + b_2 x_2 + … + b_m x_m. The coefficients (loadings) reflect the importance of an x-variable for the new latent variable. The absolute size of the coefficients can be used as a criterion for variable selection.

60 Variable Selection Genetic Algorithms (GAs) Natural Computation Method

61 Variable Selection Starting from a population of chromosomes (each gene marks whether a variable is selected):
- Delete chromosomes with poor fitness (selection)
- Create new chromosomes from pairs of good chromosomes (crossover)
- Change a few genes randomly (mutation)
The result is a new (better) population.

62 Variable Selection Crossover: two chromosomes are cut at a random position and the parts are connected in a crossover scheme, resulting in two new chromosomes.
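Single-point crossover as described above is a few lines of code. A Python sketch (the 0/1 encoding of selected variables and the toy chromosomes are our own illustration):

```python
import random

random.seed(0)

def crossover(a, b):
    """Single-point crossover: cut both chromosomes at the same random
    position and swap the tails, giving two new chromosomes."""
    cut = random.randrange(1, len(a))   # cut strictly inside the chromosome
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

# chromosomes encode which of 6 variables are selected (1) or not (0)
c1, c2 = crossover([1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0])
```

The two offspring together contain exactly the genes of the two parents, just recombined.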

63 Variable Selection Cluster Analysis of Variables Cluster analysis tries to identify homogeneous groups in the data. If it is applied to the correlation matrix of the regressor variables, one may obtain groups of strongly related variables, while variables in different groups have weak correlation.


More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

22s:152 Applied Linear Regression

22s:152 Applied Linear Regression 22s:152 Applied Linear Regression Chapter 22: Model Selection In model selection, the idea is to find the smallest set of variables which provides an adequate description of the data. We will consider

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Multiresponse Sparse Regression with Application to Multidimensional Scaling Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

Statistics Lab #7 ANOVA Part 2 & ANCOVA

Statistics Lab #7 ANOVA Part 2 & ANCOVA Statistics Lab #7 ANOVA Part 2 & ANCOVA PSYCH 710 7 Initialize R Initialize R by entering the following commands at the prompt. You must type the commands exactly as shown. options(contrasts=c("contr.sum","contr.poly")

More information

Regression Analysis and Linear Regression Models

Regression Analysis and Linear Regression Models Regression Analysis and Linear Regression Models University of Trento - FBK 2 March, 2015 (UNITN-FBK) Regression Analysis and Linear Regression Models 2 March, 2015 1 / 33 Relationship between numerical

More information

OLS Assumptions and Goodness of Fit

OLS Assumptions and Goodness of Fit OLS Assumptions and Goodness of Fit A little warm-up Assume I am a poor free-throw shooter. To win a contest I can choose to attempt one of the two following challenges: A. Make three out of four free

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes. Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department

More information

Exam Review: Ch. 1-3 Answer Section

Exam Review: Ch. 1-3 Answer Section Exam Review: Ch. 1-3 Answer Section MDM 4U0 MULTIPLE CHOICE 1. ANS: A Section 1.6 2. ANS: A Section 1.6 3. ANS: A Section 1.7 4. ANS: A Section 1.7 5. ANS: C Section 2.3 6. ANS: B Section 2.3 7. ANS: D

More information

Serial Correlation and Heteroscedasticity in Time series Regressions. Econometric (EC3090) - Week 11 Agustín Bénétrix

Serial Correlation and Heteroscedasticity in Time series Regressions. Econometric (EC3090) - Week 11 Agustín Bénétrix Serial Correlation and Heteroscedasticity in Time series Regressions Econometric (EC3090) - Week 11 Agustín Bénétrix 1 Properties of OLS with serially correlated errors OLS still unbiased and consistent

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Leveling Up as a Data Scientist. ds/2014/10/level-up-ds.jpg

Leveling Up as a Data Scientist.   ds/2014/10/level-up-ds.jpg Model Optimization Leveling Up as a Data Scientist http://shorelinechurch.org/wp-content/uploa ds/2014/10/level-up-ds.jpg Bias and Variance Error = (expected loss of accuracy) 2 + flexibility of model

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model Topic 16 - Other Remedies Ridge Regression Robust Regression Regression Trees Outline - Fall 2013 Piecewise Linear Model Bootstrapping Topic 16 2 Ridge Regression Modification of least squares that addresses

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

IQR = number. summary: largest. = 2. Upper half: Q3 =

IQR = number. summary: largest. = 2. Upper half: Q3 = Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Data transformation in multivariate quality control

Data transformation in multivariate quality control Motto: Is it normal to have normal data? Data transformation in multivariate quality control J. Militký and M. Meloun The Technical University of Liberec Liberec, Czech Republic University of Pardubice

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 13: The bootstrap (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 30 Resampling 2 / 30 Sampling distribution of a statistic For this lecture: There is a population model

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

Bivariate (Simple) Regression Analysis

Bivariate (Simple) Regression Analysis Revised July 2018 Bivariate (Simple) Regression Analysis This set of notes shows how to use Stata to estimate a simple (two-variable) regression equation. It assumes that you have set Stata up on your

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Subset Selection in Multiple Regression

Subset Selection in Multiple Regression Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Minitab 17 commands Prepared by Jeffrey S. Simonoff Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

SLStats.notebook. January 12, Statistics:

SLStats.notebook. January 12, Statistics: Statistics: 1 2 3 Ways to display data: 4 generic arithmetic mean sample 14A: Opener, #3,4 (Vocabulary, histograms, frequency tables, stem and leaf) 14B.1: #3,5,8,9,11,12,14,15,16 (Mean, median, mode,

More information

Two-Stage Least Squares

Two-Stage Least Squares Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes

More information

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 6: DESCRIPTIVE STATISTICS Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Information Criteria Methods in SAS for Multiple Linear Regression Models

Information Criteria Methods in SAS for Multiple Linear Regression Models Paper SA5 Information Criteria Methods in SAS for Multiple Linear Regression Models Dennis J. Beal, Science Applications International Corporation, Oak Ridge, TN ABSTRACT SAS 9.1 calculates Akaike s Information

More information

Chapter 4: Analyzing Bivariate Data with Fathom

Chapter 4: Analyzing Bivariate Data with Fathom Chapter 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Applied Statistics and Econometrics Lecture 6

Applied Statistics and Econometrics Lecture 6 Applied Statistics and Econometrics Lecture 6 Giuseppe Ragusa Luiss University gragusa@luiss.it http://gragusa.org/ March 6, 2017 Luiss University Empirical application. Data Italian Labour Force Survey,

More information

CREATING THE ANALYSIS

CREATING THE ANALYSIS Chapter 14 Multiple Regression Chapter Table of Contents CREATING THE ANALYSIS...214 ModelInformation...217 SummaryofFit...217 AnalysisofVariance...217 TypeIIITests...218 ParameterEstimates...218 Residuals-by-PredictedPlot...219

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART V Credibility: Evaluating what s been learned 10/25/2000 2 Evaluation: the key to success How

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Variable Selection - Chapters 8 in Dielman Artin Department of Statistical Science October 23, 2009 Outline Introduction 1 Introduction 2 3 4 Variable Selection Model

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

One Factor Experiments

One Factor Experiments One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal

More information

Product Catalog. AcaStat. Software

Product Catalog. AcaStat. Software Product Catalog AcaStat Software AcaStat AcaStat is an inexpensive and easy-to-use data analysis tool. Easily create data files or import data from spreadsheets or delimited text files. Run crosstabulations,

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

BIOL 458 BIOMETRY Lab 10 - Multiple Regression

BIOL 458 BIOMETRY Lab 10 - Multiple Regression BIOL 458 BIOMETRY Lab 10 - Multiple Regression Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable, but several

More information

SYS 6021 Linear Statistical Models

SYS 6021 Linear Statistical Models SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are

More information

More Summer Program t-shirts

More Summer Program t-shirts ICPSR Blalock Lectures, 2003 Bootstrap Resampling Robert Stine Lecture 2 Exploring the Bootstrap Questions from Lecture 1 Review of ideas, notes from Lecture 1 - sample-to-sample variation - resampling

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

3 Nonlinear Regression

3 Nonlinear Regression 3 Linear models are often insufficient to capture the real-world phenomena. That is, the relation between the inputs and the outputs we want to be able to predict are not linear. As a consequence, nonlinear

More information

The Truth behind PGA Tour Player Scores

The Truth behind PGA Tour Player Scores The Truth behind PGA Tour Player Scores Sukhyun Sean Park, Dong Kyun Kim, Ilsung Lee May 7, 2016 Abstract The main aim of this project is to analyze the variation in a dataset that is obtained from the

More information

10601 Machine Learning. Model and feature selection

10601 Machine Learning. Model and feature selection 10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior

More information

Introduction to hypothesis testing

Introduction to hypothesis testing Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests

More information

Introduction to mixed-effects regression for (psycho)linguists

Introduction to mixed-effects regression for (psycho)linguists Introduction to mixed-effects regression for (psycho)linguists Martijn Wieling Department of Humanities Computing, University of Groningen Groningen, April 21, 2015 1 Martijn Wieling Introduction to mixed-effects

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

mcssubset: Efficient Computation of Best Subset Linear Regressions in R

mcssubset: Efficient Computation of Best Subset Linear Regressions in R mcssubset: Efficient Computation of Best Subset Linear Regressions in R Marc Hofmann Université de Neuchâtel Cristian Gatu Université de Neuchâtel Erricos J. Kontoghiorghes Birbeck College Achim Zeileis

More information

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated

More information

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017 Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last

More information

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2017

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2017 CPSC 340: Machine Learning and Data Mining Feature Selection Fall 2017 Assignment 2: Admin 1 late day to hand in tonight, 2 for Wednesday, answers posted Thursday. Extra office hours Thursday at 4pm (ICICS

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

Machine Learning. Topic 4: Linear Regression Models

Machine Learning. Topic 4: Linear Regression Models Machine Learning Topic 4: Linear Regression Models (contains ideas and a few images from wikipedia and books by Alpaydin, Duda/Hart/ Stork, and Bishop. Updated Fall 205) Regression Learning Task There

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information

Descriptive Statistics, Standard Deviation and Standard Error

Descriptive Statistics, Standard Deviation and Standard Error AP Biology Calculations: Descriptive Statistics, Standard Deviation and Standard Error SBI4UP The Scientific Method & Experimental Design Scientific method is used to explore observations and answer questions.

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Artificial Neural Networks (Feedforward Nets)

Artificial Neural Networks (Feedforward Nets) Artificial Neural Networks (Feedforward Nets) y w 03-1 w 13 y 1 w 23 y 2 w 01 w 21 w 22 w 02-1 w 11 w 12-1 x 1 x 2 6.034 - Spring 1 Single Perceptron Unit y w 0 w 1 w n w 2 w 3 x 0 =1 x 1 x 2 x 3... x

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information