AN INTRODUCTION TO MULTIVARIATE ADAPTIVE REGRESSION SPLINES FOR THE CANE INDUSTRY


By YL EVERINGHAM, J SEXTON
School of Engineering and Physical Sciences, James Cook University
yvette.everingham@jcu.edu.au

KEYWORDS: MARS, CCS, GIS, Precision Agriculture, Data Mining.

Abstract

Industries strive to find the balance between increased productivity and the future sustainability of production. To this end, the sugarcane industry maintains records from each farm about CCS (commercial cane sugar content, %), total cane yield, cane varieties and growing conditions throughout each region. A challenge the cane industry faces is how to accurately extract useful information from this vast array of data so as to better understand and improve the production system. Data mining methods have been developed to search large data sets for hidden patterns. This paper introduces a powerful data mining method known as Multivariate Adaptive Regression Splines (MARS). By applying the MARS methodology to CCS production data from the Herbert district, a model was produced for the 2005 harvest period. This model produced a north-south geographic separation between low and high CCS producing farms, in line with recorded CCS values. The model was also able to identify farm groupings that lowered modelled CCS values relative to other farms. A brief investigation of the isolated effects of variety was also conducted.

Introduction

Due to advances in technology, it is much easier to collect and store masses of data. For many industries worldwide, this means that the data and infrastructure exist to support enquiries into their production systems in pursuit of industry sustainability. The purpose of this paper is therefore to introduce a well-established statistical analysis method that can identify and extract vital pieces of information from very large and complex data sets. This process is known as data mining (Hastie et al., 2001). The technique introduced in this paper is called MARS, which stands for Multivariate Adaptive Regression Splines (Steinberg et al., 1999). The technique was first developed by Friedman (1991) (see also Friedman and Roosen, 1995). In close collaboration with Friedman, a company called Salford Systems developed a graphical user interface that made MARS much more accessible than Friedman's original FORTRAN code (Steinberg et al., 1999). From this point forward, all discussion of MARS refers to Salford Systems' MARS version 3.0 Pro (Salford Systems, 2010). The strength of the MARS methodology over competing models is discussed in a number of publications (Leathwick et al., 2006; Lee et al., 2006; Muñoz and Felicísimo, 2004). Classical statistical analyses, such as mixed linear models, find it much more challenging to perform well when the number of potential predictors is very high. MARS models can also be linked with GIS packages (Muñoz and Felicísimo, 2004). Identifying spatial trends in productivity variables would be invaluable to the effective implementation of precision farming models, which may in turn increase productivity, profitability and environmental sustainability through improved systems knowledge.

After providing an overview of the MARS methodology, this paper applies the MARS method to a simulated data set to introduce the reader to the key concepts that underlie the methodology. We then apply the MARS method to a cane productivity data set from the Herbert region to demonstrate its workings on data relevant to the Australian sugar industry.

MARS

MARS is a regression technique that can model the relationship between a desired response (also called the target variable) and multiple predictor variables. The main strength of MARS is its ability to detect, and enhance the interpretability of, complex interactions between a target variable and a set of predictor variables. The MARS model is represented by simple linear functions that combine additively and/or interactively. The ability of MARS to simplify complicated relationships is pertinent to the Australian sugar industry because sugarcane productivity is affected by complex interactions of biological, environmental and management conditions.

MARS (Steinberg et al., 1999) models the contribution of each predictor variable to the target variable using a sequence of piece-wise linear regression splines [1]. A spline is a flexible curve that is fixed at various points, or knots. The curve mimics the relationship between the target and predictor variable, and the knot(s) identify regions where the relationship between the response and the predictor variable changes. A linear regression spline describes the relationship in each region by a straight-line function of the form Y = ax + b, in which the highest power of the predictor variable x is one. Since the linear regression spline is built up from different linear functions, it is said to be built piece-wise; hence the term piece-wise linear regression spline. We illustrate this concept with a simple polynomial relationship in Figure 1.

Fig. 1 An example of a piecewise linear regression spline, the fundamental building block of the MARS model.

[1] It is possible, and in some cases beneficial, to approximate data using higher-order polynomial splines (in x², x³, etc.), for example when continuous derivatives are desired. However, the MARS technique (Friedman, 1991; Friedman and Roosen, 1995) defines splines in terms of linear polynomial basis functions for simplicity: accuracy is not greatly diminished, and fitting higher-order polynomials can lead to higher variance at the end points (Friedman, 1991).
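To make the spline idea in Figure 1 concrete, the short Python sketch below (not part of the original paper) approximates a smooth curve by a piecewise linear spline fixed at a handful of knots; the curve and the knot locations are invented purely for illustration.

    import numpy as np

    # A smooth "true" relationship, chosen only for illustration.
    def f(x):
        return 0.1 * (x - 2) * (x - 5) * (x - 9)

    # Illustrative knot locations. Between consecutive knots the spline is a
    # straight line, and at each knot it is fixed to the underlying curve.
    knots = np.array([0.0, 3.0, 6.2, 10.0])

    x = np.linspace(0.0, 10.0, 201)
    spline = np.interp(x, knots, f(knots))

    # The largest gap between the spline and the curve indicates how closely
    # the piecewise linear approximation tracks the polynomial.
    print(np.abs(spline - f(x)).max())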

MARS does not present relationships in terms of the original variables, but reclassifies the target-predictor relationships into a set of basis functions (BFs) that use hockey-stick functions (Figure 2) to represent the calculated splines. The bend in the hockey stick represents the knot point, the place where a particular independent variable becomes influential. For example, two basis functions to represent the data about knot 3 in Figure 1 would be

BF1 = max(0, x - 6.2)

and its mirror,

BF2 = max(0, 6.2 - x).

By creating mirror images, MARS can express a variable with a slope before and/or after the knot point. In this case the knot point has been identified at x = 6.2. BF1 is read as "BF1 equals the larger of 0 and x - 6.2 for a given value of x". Mathematically, BF1 = 0 for x ≤ 6.2 and increases linearly for x > 6.2 (Figure 2a). Similarly, BF2 = 0 for x ≥ 6.2 and decreases linearly towards zero for x < 6.2 (Figure 2b).

Fig. 2 (a) Graphical representation of BF1 = max(0, x - 6.2). (b) Graphical representation of BF2 = max(0, 6.2 - x).

Each knot point is defined using these mirrored pairs. A selection of basis functions can then be linearly combined to model the response, with coefficients defining the contribution of each basis function and hence the individual slopes.
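As a minimal Python sketch of the mirrored pair about the knot at x = 6.2, and of how a fitted model linearly combines such basis functions (the intercept and coefficients below are invented purely to show the form of the model):

    import numpy as np

    x = np.linspace(0.0, 10.0, 101)

    # Mirrored hockey-stick basis functions about the knot at x = 6.2 (Figure 2).
    bf1 = np.maximum(0.0, x - 6.2)   # zero up to the knot, rises linearly after it
    bf2 = np.maximum(0.0, 6.2 - x)   # zero from the knot onward, positive before it

    # A MARS-style prediction is a linear combination of basis functions.
    # The intercept and coefficients here are illustrative, not fitted values.
    y_hat = 3.0 + 1.5 * bf1 - 0.8 * bf2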

The examples that follow demonstrate how the MARS model is built, but first we need to understand how categorical variables are handled by MARS. Categorical predictor variables in MARS are treated slightly differently than in traditional linear regression models. A linear regression model using a categorical predictor variable x with K levels, which might represent K different varieties of sugarcane, creates K - 1 dummy variables. MARS, however, is capable of merging different levels together based on how they contribute to the target variable.

For example, say we wish to approximate a target-predictor relationship such as:

Y_i = F(x_i) + e_i

The subscript i indicates that the values refer to the ith case in the data set being modelled. Y_i and x_i are the scalar values of the target and predictor variables for the ith case, respectively. F(x_i) is a function of x that describes the relationship between x and Y. The term e_i is called the ith residual and represents the difference between the ith target value Y_i and the approximating function F(x_i). If we assume x is a categorical variable with four distinct levels of classification, then in a standard regression the equation may be represented as:

Y_i = a_0 + a_1 x_1i + a_2 x_2i + a_3 x_3i + e_i

where

(x_1i, x_2i, x_3i) = (1, 0, 0) if i ∈ group 1
(x_1i, x_2i, x_3i) = (0, 1, 0) if i ∈ group 2
(x_1i, x_2i, x_3i) = (0, 0, 1) if i ∈ group 3
(x_1i, x_2i, x_3i) = (0, 0, 0) if i ∈ group 4

The notation i ∈ group 1 is read as "the ith observational unit is an element of group 1"; a_0 is the intercept, and a_1, a_2 and a_3 are regression coefficients. Thus x_1i, x_2i and x_3i collectively represent the four separate levels of the categorical variable x. Suppose MARS produced the following prediction of the response:

Ŷ_i = a_0 + a_1 BF1_i + a_2 BF2_i + a_3 BF3_i

where MARS reports BF1_i = (x in 1, 3), BF2_i = (x in 2, 4) and BF3_i = (x in 1). This is interpreted as:

BF1_i = 1 if case i ∈ group 1 or group 3, and 0 if case i ∈ group 2 or group 4
BF2_i = 1 if case i ∈ group 2 or group 4, and 0 if case i ∈ group 1 or group 3
BF3_i = 1 if case i ∈ group 1, and 0 if case i ∈ group 2, 3 or 4

Therefore the categorical variable x with four levels contributes a_0 + a_1 + a_3 units to Y when case i is from group 1, a_0 + a_1 units when case i is from group 3, and a_0 + a_2 units when case i is from group 2 or 4.

MARS employs a forward/backward stepwise approach to determine the knot points in the data set, which in turn define each basis function. Initially the model is over-fitted by selecting more basis functions than are actually needed to describe the target variable. This model is subsequently pruned back to an optimal model. During the pruning stage, basis functions are removed one at a time from the over-fitted model based on a residual sum of squares criterion (Steinberg et al., 1999). The model is refitted after each deletion, and each reduced model is assessed on the GCV (generalised cross-validation) criterion (Craven and Wahba, 1979), a weighted mean squared error criterion used to prevent overfitting. A good model has a GCV R² score approaching 1, while a model with a GCV R² score close to zero is considered poor; a model is therefore considered optimal if it maximises the GCV R². This concept is relatively simple to visualise on a univariate scale. However, MARS (Salford Systems, 2010) must extend it to situations with multiple independent variables, which may combine linearly and/or interact with each other to affect the response variable. The simulated example in the next section illustrates this concept.
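The paper does not print the GCV formula, but the usual definition (Craven and Wahba, 1979; Friedman, 1991) is a mean squared error inflated by a complexity penalty, and the sketch below follows that form. Note that in MARS the parameter count should be an effective number of parameters that charges extra for each knot; that refinement is omitted here for brevity.

    import numpy as np

    def gcv(y, y_hat, n_params):
        """Generalised cross-validation score: the mean squared error divided
        by a penalty term that shrinks as model complexity grows."""
        n = len(y)
        rss = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)
        return (rss / n) / (1.0 - n_params / n) ** 2

    def gcv_r2(y, y_hat, n_params):
        """GCV analogue of R^2: one minus the model's GCV relative to that of
        the intercept-only model. Scores near 1 indicate a good model."""
        y = np.asarray(y, dtype=float)
        baseline = gcv(y, np.full(len(y), y.mean()), 1)
        return 1.0 - gcv(y, y_hat, n_params) / baseline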

Simulated example

Data set-up and model parameters

This section approximates a known, predefined function to demonstrate the key features of the MARS method. Consider a target variable Y and three predictor variables: two continuous (x_1 and x_2) and one categorical (x_3) with three distinct levels. The target-predictor relationship is described by equation 1:

Y_i = 7 + x_1i(x_1i - 10) - 0.5 x_2i x_4i + 3 x_2i - 10 x_4i + 14 x_5i + e_i,  i = 1, ..., 46  (eq. 1)

Here e_i represents a noise contribution that follows the standard normal distribution, and (x_4i, x_5i) = (1, 0) if x_3i = 1, (x_4i, x_5i) = (0, 1) if x_3i = 2, and (x_4i, x_5i) = (0, 0) if x_3i = 3. As outlined in Table 1, variable x_1 ranges from 1 to 10 in 46 even increments; the value of x_1 for any given case represents a random selection from this range without replacement (each of the 46 values of x_1 can be chosen only once). Variable x_2 was calculated as x_2 = e^a, where a ranges from -5 to +4 in 46 even increments; the value of x_2 for any given case likewise represents a random selection from this range without replacement.

Table 1 Independent variables and target variable, generated according to equation 1.
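Equation 1 and Table 1 translate directly into a short simulation script along the following lines. The random assignment of cases to the three levels of x_3 is an assumption, since the paper does not state how the levels were allocated.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 46

    # 46 even increments, drawn without replacement (i.e. randomly ordered).
    x1 = rng.permutation(np.linspace(1.0, 10.0, n))
    x2 = rng.permutation(np.exp(np.linspace(-5.0, 4.0, n)))  # x2 = e^a

    x3 = rng.integers(1, 4, size=n)   # levels 1..3; allocation scheme assumed
    x4 = (x3 == 1).astype(float)      # dummy codes implied by equation 1
    x5 = (x3 == 2).astype(float)

    e = rng.standard_normal(n)        # standard normal noise

    # Equation 1.
    y = 7 + x1 * (x1 - 10) - 0.5 * x2 * x4 + 3 * x2 - 10 * x4 + 14 * x5 + e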

MARS outputs

Using Y as the target variable, x_1 and x_2 as continuous predictors and x_3 as a categorical predictor, MARS produced the output shown in Figures 3 and 4. The model contains seven basis functions (Figure 4), involving all three predictor variables, and its GCV R² is reported in the MARS output (Figure 3).

Fig. 3 Screen capture of MARS output produced by modelling Y as a function of x_1, x_2 and x_3. This equation models the target response by a series of linear combinations of basis functions (Salford Systems, 2010).

Fig. 4 Screen capture of MARS basis functions (Salford Systems, 2010).

Basis function one (BF1) equals 0 when x_2 is at or below its knot, and equals x_2 minus the knot value otherwise. Note that the knot point represents the smallest value of x_2 (Table 1) and that the mirror basis function does not appear in the equation: this is how MARS enters a variable linearly. Figure 5 is a graphical representation of the relationship between Y and the x_2 variable, together with the MARS model approximation to that relationship, ŷ_i = a_0 + a_1 BF1_i for i = 1 to 46, where a_0 and a_1 are the intercept and coefficient that MARS reports for BF1.

Fig. 5 The relationship between Y and x_2 (points) and ŷ, the contribution of basis function one (BF1) to the MARS model (line).

Figure 4 is a screen capture of the output from the MARS software (Salford Systems, 2010). Notation such as (x_3 in (2)) is not standard mathematical notation; it simply means that x_3 is in group 2. Basis function two is interpreted as BF2 = 1 if x_3 = 2 and zero otherwise. Recall that variable x_3 = 2 when x_4 = 0 and x_5 = 1 (Table 1). Therefore the contribution of BF2 to the MARS model is directly comparable to the contribution of the variable x_5 in equation 1: the term 14 x_5i (i = 1 to 46) in equation 1 corresponds to the coefficient multiplying (x_3i in 2) in Figure 3.
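In code, a categorical basis function such as BF2 = (x_3 in 2) is simply an indicator, and a merged-level basis function such as (x in 1, 3) is a set-membership test. A minimal sketch:

    import numpy as np

    x3 = np.array([1, 2, 3, 2, 1, 3])              # a categorical predictor

    bf2 = (x3 == 2).astype(float)                  # BF2 = (x3 in 2)
    bf_merged = np.isin(x3, [1, 3]).astype(float)  # merged levels: (x in 1, 3)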

In a similar fashion, BF6 = 1 if x_3 = 1, and BF6 = 0 otherwise. Variable x_3 is equal to one when x_4 = 1 and x_5 = 0, so the contribution of BF6 to the MARS model is directly comparable to the contribution of the variable x_4 in equation 1: the term -10 x_4i (i = 1 to 46) in equation 1 corresponds to the coefficient multiplying (x_3i in 1) in Figure 3.

Basis functions four, five and eight collectively represent the contributions of x_1 to the MARS model. Basis functions four and five are a mirror-image pair with a knot point at x_1 = 4, and basis function eight defines a second knot point at x_1 = 7. In this way the MARS model defines three separate regions in the relationship between the target variable Y and the variable x_1. Figure 6 is a graphical representation of the relationship between Y and x_1, together with the total contribution of the x_1 variable to the MARS model (ŷ_i).

Fig. 6 Relationship between Y and x_1 (points) and the collective contribution of basis functions 4, 5 and 8 to the MARS model (ŷ_i, line). The knot points separate the three segments of the MARS model, with knot 1 = 4 and knot 2 = 7.

Basis function 10 is an interaction term between BF1 and BF6, equivalent to the interaction between x_2 and x_4 in equation 1. Recall that equation 1 contains the term -0.5 x_2i x_4i, which contributes -0.5 x_2i whenever x_4i = 1 (i = 1 to 46). This is similar to the corresponding term displayed in Figure 4: a coefficient multiplying the BF1 hinge in x_2, active only for cases with x_3i = 1 (i = 1 to 46).
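An interaction basis function is literally the product of its two parent basis functions, so it is non-zero only where both parents are. The sketch below forms BF10 from BF1 and BF6 as described above, taking the knot for BF1 as e^-5, the smallest simulated value of x_2; the level assignments for x_3 are illustrative.

    import numpy as np

    x2 = np.exp(np.linspace(-5.0, 4.0, 46))   # the simulated x2 values
    x3 = np.resize([1, 2, 3], 46)             # illustrative level assignments

    bf1 = np.maximum(0.0, x2 - np.exp(-5.0))  # hinge in x2, knot at min(x2)
    bf6 = (x3 == 1).astype(float)             # indicator for x3 = 1

    bf10 = bf1 * bf6                          # interaction: non-zero only when x3 = 1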

If Ŷ_0 and ŷ_0 represent the intercepts of equation 1 and of the MARS model respectively, the two intercepts can be compared directly: Ŷ_0 = 7, since x_1 = x_2 = x_4 = x_5 = 0; in the MARS model, ŷ_0 = 2.3, since x_1 = x_2 = 0 and x_3 = 3. The difference between the intercepts is caused by the random error added to the model.

Case study: Modelling CCS in the Herbert district

Data set-up and model parameters

To model the commercially recoverable sugar content of sugarcane (CCS), data were collected from approximately 750 farms across the Herbert district for the 2005 harvest season (July to December). Only blocks that recorded CCS (as a percentage; CCS05) and a subset of predictor variables believed to influence CCS levels were included. Blocks that were fallowed during the 2005 harvest season, or that represented outlying cases, were excluded from the data set. The final data set collated 8998 individual blocks and included the potential predictor variables farm of origin (F1), month of harvest (M05), cane variety (VC), crop class (C), soil type (SC) and two geographic locator variables (X and Y). All predictor variables were categorical except for the continuous predictors month of harvest, X and Y. Refer to Table 2 for a summary.

Farm of origin refers to the farm that controls the individual block, and can account for effects such as individual management style and farmer experience. Together with month of harvest, farm of origin has previously proven to be an influential predictor of CCS and total cane yield (Lawes et al., 2002). This study compared 40 varieties of cane in use in the Herbert district and identified six crop classes (plant crop and ratoons 1-5). Soil in the district was classified as one of six types: alluvial, hill-slope, terrace-loamy, clay, Seymour or sandy.

Table 2 Model set-up describing the target variable, the predictor variables and which variables are treated as categorical.

Variable           Symbol   Target   Predictor   Categorical
CCS                CCS05    YES
Farm               F1                YES         YES
Soil type          SC                YES         YES
Cane variety       VC                YES         YES
Crop class         C                 YES         YES
Month of harvest   M05               YES
Locator 1          X                 YES
Locator 2          Y                 YES
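As a sketch of assembling the block-level data set with the variable roles in Table 2, using pandas; the file name is hypothetical, and the column names simply follow the paper's symbols.

    import pandas as pd

    # Hypothetical input file holding one row per block.
    blocks = pd.read_csv("herbert_blocks_2005.csv")

    target = "CCS05"
    categorical = ["F1", "SC", "VC", "C"]   # farm, soil type, variety, crop class
    continuous = ["M05", "X", "Y"]          # month of harvest and the two locators

    # Mark the categorical predictors so that the modelling step treats them
    # as levels rather than as numbers.
    blocks[categorical] = blocks[categorical].astype("category")

    X = blocks[categorical + continuous]
    y = blocks[target]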

MARS allows users to specify many options within the model, including a maximum number of basis functions, the level of interactions between variables and a minimum number of observations between knot points (Figure 7). We specified a maximum of 16 basis functions and allowed only two-way interactions between variables, as higher-level interactions are unreliable given the inherent variability of block productivity data (Lawes and Lawn, 2005). We set V to 10 folds as part of the V-fold cross-validation procedure. This means the data are divided into 10 roughly equal subsets; one subset is held out and used to test the predictive capability of a model built using the remaining nine subsets, and the process is repeated until every subset has been left out.

Fig. 7 Screen capture of the options and limits tab (Salford Systems, 2010).
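The V-fold procedure just described can be sketched as follows; here fit and score are stand-ins for the MARS engine's training and evaluation steps, which the Salford product performs internally.

    import numpy as np

    def v_fold_score(fit, score, X, y, v=10, seed=0):
        """Divide the cases into v roughly equal subsets, hold each subset out
        once, train on the remainder, and average the held-out scores."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), v)
        scores = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            model = fit(X[train], y[train])
            scores.append(score(model, X[held_out], y[held_out]))
        return np.mean(scores)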

MARS results

Overall, the model's GCV R² score is reported in the model summary (Figure 8). The optimal MARS model (Figure 9) contained 10 terms (Figure 10), based on the productivity variables farm of origin, month of harvest and cane variety. Figure 11 graphically displays the strong correlation between the MARS-predicted CCS values and the actual CCS recorded for each farm. Note that subset 1 in BF9 is different from subset 1 in BF1: the subsets pertain only to the variable in the respective basis function. The same is true for all subsets listed in Figure 10.

Fig. 8 Screen capture of the model summary (Salford Systems, 2010).

Fig. 9 Screen capture of the MARS model for CCS production of the Herbert district, 2005 (Salford Systems, 2010).

Fig. 10 MARS basis functions for CCS production of the Herbert district, 2005 (Salford Systems, 2010).

Fig. 11 Correlation between raw data and predicted values.

Six subsets of farms were identified by the model. Subsets 1 and 2 contribute directly to CCS, with subset 1 farms reducing the predicted CCS value more than subset 2 farms. These farms occur most frequently in the central eastern sector of the Herbert district (graph not shown). Farms that appear in both subsets 1 and 2 reduce the predicted CCS value the most and cluster in the northern arm of the district (Figure 12(a)).

Fig. 12 (a) Farms that occur in subsets one and two and therefore decrease the response the most (relative to other farms). (b) Spatial spread of residuals. (c) Spatial spread of recorded CCS (%) in 2005. (d) Spatial spread of CCS (%) calculated by the MARS model.

Subsets 3 and 4 contribute to the model differently depending on the month of harvest. Relative to other farms, predicted CCS values for subset 3 farms increase if the harvest occurs before October, while predicted CCS values for subset 4 farms decrease if the harvest occurs after October. The relationship between month of harvest and modelled CCS values was temporal rather than spatial, with modelled CCS values peaking in October and decreasing thereafter. Finally, farm subsets 5 and 6 interact with two subsets of cane varieties (Table 3). Interactions with cane variety affected predicted CCS values in the district's north, while the high CCS predictions of the south appeared independent of cane variety. Figure 12(b) shows the spatial variability of the residuals. To assess the performance of the MARS model, Figure 12(c) shows the actual recorded CCS values and Figure 12(d) shows the MARS-modelled CCS values, which present similarly to the actual values, at least on a spatial scale.

Table 3 Cane variety subsets.

Variety subset 1: ARGOS, CASS, MIX, EXP, MIDA, Q107, Q124, Q138, Q142, Q152, Q157, Q158, Q162, Q164, Q166, Q167, Q172, Q179, Q181, Q187, Q215, Q216, Q114, Q99, Q119
Variety subset 2: Q115, Q117, Q120, Q127, Q135, Q165, Q170, Q174, Q186, Q190, Q194, Q195, Q200, Q204, Q96

It is possible to investigate the effect that a specific variable (e.g. cane variety) has on CCS by a three-step procedure. The first step builds the MARS model using all predictors except the variable of interest, here cane variety. The second step computes the residuals from this model. Finally, these residuals are compared against the predictor variable of interest. For example, Figure 13 shows how six cane varieties affect CCS after the effects of the other variables described in Table 2 have been removed.

Fig. 13 The effect of cane variety on CCS residuals. Higher residuals represent a higher CCS percentage. Bold horizontal lines represent the median value for each variety. Boxes represent the spread from the 25th to the 75th percentile of cases, and dashed vertical whiskers represent ±1.5 times the interquartile range. Points represent cases beyond 1.5 times the interquartile range and may be considered outliers. All varieties were considered in the model, but for visual purposes only six varieties are displayed.
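The three-step procedure translates directly to code. In this sketch, fit_model is a hypothetical stand-in for whatever regression engine builds the variety-free model (the paper uses the MARS GUI for this step):

    import pandas as pd

    def variety_effect(blocks, fit_model):
        """Step 1: model CCS05 from every predictor except variety (VC).
        Step 2: compute the residuals of that model.
        Step 3: summarise the residuals by variety."""
        predictors = [c for c in blocks.columns if c not in ("CCS05", "VC")]
        model = fit_model(blocks[predictors], blocks["CCS05"])
        residuals = blocks["CCS05"] - model.predict(blocks[predictors])
        return residuals.groupby(blocks["VC"]).describe()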

The MARS model that excluded cane varieties reported a GCV R² of 0.50, only 5% lower than that of the full model, which contained cane variety. We stress that this investigation was done purely for illustrative purposes, and that more rigorous investigations over a longer temporal period, also considering the removal of additional confounding variables, need to be performed before comprehensive conclusions can be drawn about the relationship between cane variety and CCS.

Conclusion

MARS is an effective tool for modelling complex multivariate data sets as relatively simple linear additive models, and it is capable of handling both quantitative and categorical predictors. This case study demonstrated how to apply MARS to a real-world data set containing CCS production measures from the Herbert district for 2005. Furthermore, the model gave some insight into the variables that influence CCS productivity (farm of origin, month of harvest and cane variety). We also briefly discussed how the effects that predictor variables have on CCS productivity can be removed. Given sufficient data, the MARS approach could therefore assist researchers to remove the effects of different biological, climatological or management conditions in different years. This would allow researchers to compare productivity maps across different years on a level playing field, which in turn could identify farms with consistently low or consistently high productivity once effects such as variety and month of harvest have been removed. Under an emerging precision farming regime, the MARS program offers enormous utility.

Acknowledgements

The authors extend sincere thanks to Mr Lawrence Di Bella from the HCPSL for providing access to these data and for assisting Mr Daniel Zamykal during the early stages of his PhD, the topic of which motivated this manuscript. The authors are also grateful to staff from Salford Systems for their support, to Mr Daniel Zamykal for preparing the data analysed in this manuscript, and to Ms Madalyn Casey for her assistance with this manuscript. This research was funded by the Australian Government through the Sugar Research and Development Corporation, with in-kind support from James Cook University.

REFERENCES

Craven P, Wahba G (1979) Smoothing noisy data with spline functions. Numerische Mathematik 31.

Friedman JH (1991) Multivariate adaptive regression splines (with discussion). Annals of Statistics 19.

Friedman JH, Roosen CB (1995) An introduction to multivariate adaptive regression splines. Statistical Methods in Medical Research 4.

Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning: Data mining, inference and prediction. Springer Series in Statistics. (Springer-Verlag: New York).

Leathwick J, Elith J, Hastie T (2006) Comparative performance of generalised additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling 199.

Lawes RA, Basford KE, McDonald LM, Lawn RJ, Wegener MK (2002) Factors affecting cane yield and commercial cane sugar in the Tully district. Australian Journal of Experimental Agriculture 42.

Lawes RA, Lawn RJ (2005) Applications of industry information in sugarcane production systems. Field Crops Research 92.

Lee T, Chiu C, Chou Y, Lu C (2006) Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis 50.

Muñoz J, Felicísimo ÁM (2004) Comparison of statistical methods commonly used in predictive modeling. Journal of Vegetation Science 15.

Salford Systems (2010) Salford Predictive Modeler. Salford Systems (accessed 25 August 2010).

Steinberg D, Colla PL, Martin K (1999) MARS user guide. (Salford Systems: San Diego, CA).


More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Lecture 8. Divided Differences,Least-Squares Approximations. Ceng375 Numerical Computations at December 9, 2010

Lecture 8. Divided Differences,Least-Squares Approximations. Ceng375 Numerical Computations at December 9, 2010 Lecture 8, Ceng375 Numerical Computations at December 9, 2010 Computer Engineering Department Çankaya University 8.1 Contents 1 2 3 8.2 : These provide a more efficient way to construct an interpolating

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information