AN INTRODUCTION TO MULTIVARIATE ADAPTIVE REGRESSION SPLINES FOR THE CANE INDUSTRY


By YL EVERINGHAM, J SEXTON
School of Engineering and Physical Sciences, James Cook University
yvette.everingham@jcu.edu.au

KEYWORDS: MARS, CCS, GIS, Precision Agriculture, Data Mining.

Abstract

Industries strive to find the balance between increased productivity and the future sustainability of production. To this end, the sugarcane industry maintains records from each farm about CCS (commercial cane sugar content, %), total cane yield, cane varieties and growing conditions throughout each region. A challenge the cane industry faces is how to accurately extract useful information from this vast array of data so as to better understand and improve the production system. Data mining methods have been developed to search large data sets for hidden patterns. This paper introduces a powerful data mining method known as Multivariate Adaptive Regression Splines (MARS). By applying the MARS methodology to CCS production data from the Herbert district, a model was produced for the 2005 harvest period. This model produced a north-south geographic separation between low and high CCS producing farms, in line with recorded CCS values. The model was also able to identify farm groupings that lowered modelled CCS values relative to other farms. A brief investigation of the isolated effects of variety was also conducted.

Introduction

Due to advances in technology, it is much easier to collect and store masses of data. For many industries worldwide, this means that the data and infrastructure exist to support enquiries into their production systems in pursuit of industry sustainability. The purpose of this paper is therefore to introduce a well-established statistical analysis method that can identify and extract vital pieces of information from very large and complex data sets. This process is known as data mining (Hastie et al., 2001). The technique introduced in this paper is called MARS, which stands for Multivariate Adaptive Regression Splines (Steinberg et al., 1999). The technique was first developed by Friedman (1991) (see also Friedman and Roosen, 1995). In close collaboration with Friedman, a company called Salford Systems developed a graphical user interface that made MARS much more accessible than Friedman's original FORTRAN code (Steinberg et al., 1999). From this point forward, all discussion of MARS refers to Salford Systems' MARS version 3.0 Pro (Salford Systems, 2010). The strength of the MARS methodology over competing models is discussed in a number of publications (Leathwick et al., 2006; Lee et al., 2006; Muñoz and Felicísimo, 2004). Classical statistical analyses, such as mixed linear models, find it much more challenging to perform well when the number of potential predictors is very high. MARS models can also be linked with GIS packages (Muñoz and Felicísimo, 2004). Identifying spatial trends in productivity variables would be invaluable to the effective implementation of precision farming models, which may in turn increase productivity, profitability and environmental sustainability through improved systems knowledge.

After providing an overview of the MARS methodology, this paper applies the MARS method to a simulated data set to introduce the reader to the key concepts that underlie the methodology. We then apply the MARS method to a cane productivity data set from the Herbert region to demonstrate its workings on data relevant to the Australian sugar industry.

MARS

MARS is a regression technique that can model the relationship between a desired response (also called the target variable) and multiple predictor variables. The main strength of MARS is its ability to detect, and enhance the interpretability of, complex interactions between a target variable and a set of predictor variables. The MARS model is represented by simple linear functions that combine additively and/or interactively. The ability of MARS to simplify complicated relationships is pertinent to the Australian sugar industry because sugarcane productivity is affected by complex interactions of biological, environmental and management conditions.

MARS (Steinberg et al., 1999) models the contribution of each predictor variable to the target variable using a sequence of piece-wise linear regression splines [1]. A spline is a flexible curve that is fixed at various points, or knots. The curve mimics the relationship between the target and predictor variable, and the knot(s) identify regions where the relationship between the response and the predictor variable changes. A linear regression spline describes the relationship in each region by a straight-line function of the form Y = ax + b, in which the highest power of the predictor variable x is one. Since the linear regression spline is built up from different linear functions, it is said to be built piece-wise; hence the term piece-wise linear regression spline. We illustrate this concept with a simple polynomial relationship in Figure 1.

Fig. 1 An example of a piecewise linear regression spline, the fundamental building block of the MARS model.

[1] It is possible, and in some cases beneficial, to approximate data using higher-order polynomial splines (in x², x³, etc.), for example when continuous derivatives are desired. However, the MARS technique (Friedman, 1991; Friedman and Roosen, 1995) defines splines in terms of linear polynomial basis functions for simplicity: accuracy is not greatly diminished, and fitting higher-order polynomials can lead to higher variance at the end points (Friedman, 1991).
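To make the spline idea in Figure 1 concrete, the short Python sketch below (not part of the original paper) approximates a smooth curve by a piecewise linear spline fixed at a handful of knots; the curve and the knot locations are invented purely for illustration.

    import numpy as np

    # A smooth "true" relationship, chosen only for illustration.
    def f(x):
        return 0.1 * (x - 2) * (x - 5) * (x - 9)

    # Illustrative knot locations. Between consecutive knots the spline is a
    # straight line, and at each knot it is fixed to the underlying curve.
    knots = np.array([0.0, 3.0, 6.2, 10.0])

    x = np.linspace(0.0, 10.0, 201)
    spline = np.interp(x, knots, f(knots))

    # The largest gap between the spline and the curve indicates how closely
    # the piecewise linear approximation tracks the polynomial.
    print(np.abs(spline - f(x)).max())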

MARS does not present relationships in terms of the original variables, but reclassifies the target-predictor relationships into a set of basis functions (BFs) that use hockey-stick functions (Figure 2) to represent the calculated splines. The bend in the hockey stick represents the knot point, the place where a particular independent variable becomes influential. For example, two basis functions to represent the data about knot 3 in Figure 1 would be

BF1 = max(0, x - 6.2)

and its mirror,

BF2 = max(0, 6.2 - x).

By creating mirror images, MARS can express a variable with a slope before and/or after the knot point. In this case the knot point has been identified at x = 6.2. BF1 is read as "BF1 equals the larger of 0 and x - 6.2 for a given value of x". Mathematically, BF1 = 0 for x ≤ 6.2 and increases linearly for x > 6.2 (Figure 2a). Similarly, BF2 = 0 for x ≥ 6.2 and decreases linearly towards zero for x < 6.2 (Figure 2b).

Fig. 2 (a) Graphical representation of BF1 = max(0, x - 6.2). (b) Graphical representation of BF2 = max(0, 6.2 - x).

Each knot point is defined using these mirrored pairs. A selection of basis functions can then be linearly combined to model the response, with coefficients defining the contribution of each basis function and hence the individual slopes.
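As a minimal Python sketch of the mirrored pair about the knot at x = 6.2, and of how a fitted model linearly combines such basis functions (the intercept and coefficients below are invented purely to show the form of the model):

    import numpy as np

    x = np.linspace(0.0, 10.0, 101)

    # Mirrored hockey-stick basis functions about the knot at x = 6.2 (Figure 2).
    bf1 = np.maximum(0.0, x - 6.2)   # zero up to the knot, rises linearly after it
    bf2 = np.maximum(0.0, 6.2 - x)   # zero from the knot onward, positive before it

    # A MARS-style prediction is a linear combination of basis functions.
    # The intercept and coefficients here are illustrative, not fitted values.
    y_hat = 3.0 + 1.5 * bf1 - 0.8 * bf2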

The examples that follow demonstrate how the MARS model is built, but first we need to understand how categorical variables are handled by MARS. Categorical predictor variables in MARS are treated slightly differently than in traditional linear regression models. A linear regression model using a categorical predictor variable x with K levels, which might represent K different varieties of sugarcane, creates K - 1 dummy variables. MARS, however, is capable of merging different levels together based on how they contribute to the target variable.

For example, say we wish to approximate a target-predictor relationship such as:

Y_i = F(x_i) + e_i

The subscript i indicates that the values refer to the ith case in the data set being modelled. Y_i and x_i are the scalar values of the target and predictor variables for the ith case, respectively. F(x_i) is a function of x that describes the relationship between x and Y. The term e_i is called the ith residual and represents the difference between the ith target value Y_i and the approximating function F(x_i). If we assume x is a categorical variable with four distinct levels of classification, then in a standard regression the equation may be represented as:

Y_i = a_0 + a_1 x_1i + a_2 x_2i + a_3 x_3i + e_i

where

(x_1i, x_2i, x_3i) = (1, 0, 0) if i ∈ group 1
(x_1i, x_2i, x_3i) = (0, 1, 0) if i ∈ group 2
(x_1i, x_2i, x_3i) = (0, 0, 1) if i ∈ group 3
(x_1i, x_2i, x_3i) = (0, 0, 0) if i ∈ group 4

The notation i ∈ group 1 is read as "the ith observational unit is an element of group 1"; a_0 is the intercept, and a_1, a_2 and a_3 are regression coefficients. Thus x_1i, x_2i and x_3i collectively represent the four separate levels of the categorical variable x. Suppose MARS produced the following prediction of the response:

Ŷ_i = a_0 + a_1 BF1_i + a_2 BF2_i + a_3 BF3_i

where MARS reports BF1_i = (x in 1, 3), BF2_i = (x in 2, 4) and BF3_i = (x in 1). This is interpreted as:

BF1_i = 1 if case i ∈ group 1 or group 3, and 0 if case i ∈ group 2 or group 4
BF2_i = 1 if case i ∈ group 2 or group 4, and 0 if case i ∈ group 1 or group 3
BF3_i = 1 if case i ∈ group 1, and 0 if case i ∈ group 2, 3 or 4

Therefore the categorical variable x with four levels contributes a_0 + a_1 + a_3 units to Y when case i is from group 1, a_0 + a_1 units when case i is from group 3, and a_0 + a_2 units when case i is from group 2 or 4.

MARS employs a forward/backward stepwise approach to determine the knot points in the data set, which in turn define each basis function. Initially the model is over-fitted by selecting more basis functions than are actually needed to describe the target variable. This model is subsequently pruned back to an optimal model. During the pruning stage, basis functions are removed one at a time from the over-fitted model based on a residual sum of squares criterion (Steinberg et al., 1999). The model is refitted after each deletion, and each reduced model is assessed on the GCV (generalised cross-validation) criterion (Craven and Wahba, 1979), a weighted mean squared error criterion used to prevent overfitting. A good model has a GCV R² score approaching 1, while a model with a GCV R² score close to zero is considered poor; a model is therefore considered optimal if it maximises the GCV R². This concept is relatively simple to visualise on a univariate scale. However, MARS (Salford Systems, 2010) must extend it to situations with multiple independent variables, which may combine linearly and/or interact with each other to affect the response variable. The simulated example in the next section illustrates this concept.
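The paper does not print the GCV formula, but the usual definition (Craven and Wahba, 1979; Friedman, 1991) is a mean squared error inflated by a complexity penalty, and the sketch below follows that form. Note that in MARS the parameter count should be an effective number of parameters that charges extra for each knot; that refinement is omitted here for brevity.

    import numpy as np

    def gcv(y, y_hat, n_params):
        """Generalised cross-validation score: the mean squared error divided
        by a penalty term that shrinks as model complexity grows."""
        n = len(y)
        rss = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)
        return (rss / n) / (1.0 - n_params / n) ** 2

    def gcv_r2(y, y_hat, n_params):
        """GCV analogue of R^2: one minus the model's GCV relative to that of
        the intercept-only model. Scores near 1 indicate a good model."""
        y = np.asarray(y, dtype=float)
        baseline = gcv(y, np.full(len(y), y.mean()), 1)
        return 1.0 - gcv(y, y_hat, n_params) / baseline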

Simulated example

Data set-up and model parameters

This section approximates a known, predefined function to demonstrate the key features of the MARS method. Consider a target variable Y and three predictor variables: two continuous (x_1 and x_2) and one categorical (x_3) with three distinct levels. The target-predictor relationship is described by equation 1:

Y_i = 7 + x_1i(x_1i - 10) - 0.5 x_2i x_4i + 3 x_2i - 10 x_4i + 14 x_5i + e_i,  i = 1, ..., 46  (eq. 1)

Here e_i represents a noise contribution that follows the standard normal distribution, and (x_4i, x_5i) = (1, 0) if x_3i = 1, (x_4i, x_5i) = (0, 1) if x_3i = 2, and (x_4i, x_5i) = (0, 0) if x_3i = 3. As outlined in Table 1, variable x_1 ranges from 1 to 10 in 46 even increments; the value of x_1 for any given case represents a random selection from this range without replacement (each of the 46 values of x_1 can be chosen only once). Variable x_2 was calculated as x_2 = e^a, where a ranges from -5 to +4 in 46 even increments; the value of x_2 for any given case likewise represents a random selection from this range without replacement.

Table 1 Independent variables and target variable, generated according to equation 1.
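Equation 1 and Table 1 translate directly into a short simulation script along the following lines. The random assignment of cases to the three levels of x_3 is an assumption, since the paper does not state how the levels were allocated.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 46

    # 46 even increments, drawn without replacement (i.e. randomly ordered).
    x1 = rng.permutation(np.linspace(1.0, 10.0, n))
    x2 = rng.permutation(np.exp(np.linspace(-5.0, 4.0, n)))  # x2 = e^a

    x3 = rng.integers(1, 4, size=n)   # levels 1..3; allocation scheme assumed
    x4 = (x3 == 1).astype(float)      # dummy codes implied by equation 1
    x5 = (x3 == 2).astype(float)

    e = rng.standard_normal(n)        # standard normal noise

    # Equation 1.
    y = 7 + x1 * (x1 - 10) - 0.5 * x2 * x4 + 3 * x2 - 10 * x4 + 14 * x5 + e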

MARS outputs

Using Y as the target variable, x_1 and x_2 as continuous predictors and x_3 as a categorical predictor, MARS produced the output shown in Figures 3 and 4. The model contains seven basis functions (Figure 4), involving all three predictor variables, and its GCV R² is reported in the MARS output (Figure 3).

Fig. 3 Screen capture of MARS output produced by modelling Y as a function of x_1, x_2 and x_3. This equation models the target response by a series of linear combinations of basis functions (Salford Systems, 2010).

Fig. 4 Screen capture of MARS basis functions (Salford Systems, 2010).

Basis function one (BF1) equals 0 when x_2 is at or below its knot, and equals x_2 minus the knot value otherwise. Note that the knot point represents the smallest value of x_2 (Table 1) and that the mirror basis function does not appear in the equation: this is how MARS enters a variable linearly. Figure 5 is a graphical representation of the relationship between Y and the x_2 variable, together with the MARS model approximation to that relationship, ŷ_i = a_0 + a_1 BF1_i for i = 1 to 46, where a_0 and a_1 are the intercept and coefficient that MARS reports for BF1.

Fig. 5 The relationship between Y and x_2 (points) and ŷ, the contribution of basis function one (BF1) to the MARS model (line).

Figure 4 is a screen capture of the output from the MARS software (Salford Systems, 2010). Notation such as (x_3 in (2)) is not standard mathematical notation; it simply means that x_3 is in group 2. Basis function two is interpreted as BF2 = 1 if x_3 = 2 and zero otherwise. Recall that variable x_3 = 2 when x_4 = 0 and x_5 = 1 (Table 1). Therefore the contribution of BF2 to the MARS model is directly comparable to the contribution of the variable x_5 in equation 1: the term 14 x_5i (i = 1 to 46) in equation 1 corresponds to the coefficient multiplying (x_3i in 2) in Figure 3.
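In code, a categorical basis function such as BF2 = (x_3 in 2) is simply an indicator, and a merged-level basis function such as (x in 1, 3) is a set-membership test. A minimal sketch:

    import numpy as np

    x3 = np.array([1, 2, 3, 2, 1, 3])              # a categorical predictor

    bf2 = (x3 == 2).astype(float)                  # BF2 = (x3 in 2)
    bf_merged = np.isin(x3, [1, 3]).astype(float)  # merged levels: (x in 1, 3)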

In a similar fashion, BF6 = 1 if x_3 = 1, and BF6 = 0 otherwise. Variable x_3 is equal to one when x_4 = 1 and x_5 = 0, so the contribution of BF6 to the MARS model is directly comparable to the contribution of the variable x_4 in equation 1: the term -10 x_4i (i = 1 to 46) in equation 1 corresponds to the coefficient multiplying (x_3i in 1) in Figure 3.

Basis functions four, five and eight collectively represent the contributions of x_1 to the MARS model. Basis functions four and five are a mirror-image pair with a knot point at x_1 = 4, and basis function eight defines a second knot point at x_1 = 7. In this way the MARS model defines three separate regions in the relationship between the target variable Y and the variable x_1. Figure 6 is a graphical representation of the relationship between Y and x_1, together with the total contribution of the x_1 variable to the MARS model (ŷ_i).

Fig. 6 Relationship between Y and x_1 (points) and the collective contribution of basis functions 4, 5 and 8 to the MARS model (ŷ_i, line). The knot points separate the three segments of the MARS model, with knot 1 = 4 and knot 2 = 7.

Basis function 10 is an interaction term between BF1 and BF6, equivalent to the interaction between x_2 and x_4 in equation 1. Recall that equation 1 contains the term -0.5 x_2i x_4i, which contributes -0.5 x_2i whenever x_4i = 1 (i = 1 to 46). This is similar to the corresponding term displayed in Figure 4: a coefficient multiplying the BF1 hinge in x_2, active only for cases with x_3i = 1 (i = 1 to 46).
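An interaction basis function is literally the product of its two parent basis functions, so it is non-zero only where both parents are. The sketch below forms BF10 from BF1 and BF6 as described above, taking the knot for BF1 as e^-5, the smallest simulated value of x_2; the level assignments for x_3 are illustrative.

    import numpy as np

    x2 = np.exp(np.linspace(-5.0, 4.0, 46))   # the simulated x2 values
    x3 = np.resize([1, 2, 3], 46)             # illustrative level assignments

    bf1 = np.maximum(0.0, x2 - np.exp(-5.0))  # hinge in x2, knot at min(x2)
    bf6 = (x3 == 1).astype(float)             # indicator for x3 = 1

    bf10 = bf1 * bf6                          # interaction: non-zero only when x3 = 1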

If Ŷ_0 and ŷ_0 represent the intercepts of equation 1 and of the MARS model respectively, the two intercepts can be compared directly: Ŷ_0 = 7, since x_1 = x_2 = x_4 = x_5 = 0; in the MARS model, ŷ_0 = 2.3, since x_1 = x_2 = 0 and x_3 = 3. The difference between the intercepts is caused by the random error added to the model.

Case study: Modelling CCS in the Herbert district

Data set-up and model parameters

To model the commercially recoverable sugar content of sugarcane (CCS), data were collected from approximately 750 farms across the Herbert district for the 2005 harvest season (July to December). Only blocks that recorded CCS (as a percentage; CCS05) and a subset of predictor variables believed to influence CCS levels were included. Blocks that were fallowed during the 2005 harvest season, or that represented outlying cases, were excluded from the data set. The final data set collated 8998 individual blocks and included the potential predictor variables farm of origin (F1), month of harvest (M05), cane variety (VC), crop class (C), soil type (SC) and two geographic locator variables (X and Y). All predictor variables were categorical except for the continuous predictors month of harvest, X and Y. Refer to Table 2 for a summary.

Farm of origin refers to the farm that controls the individual block, and can account for effects such as individual management style and farmer experience. Together with month of harvest, farm of origin has previously proven to be an influential predictor of CCS and total cane yield (Lawes et al., 2002). This study compared 40 varieties of cane in use in the Herbert district and identified six crop classes (plant crop and ratoons 1-5). Soil in the district was classified as one of six types: alluvial, hill-slope, terrace-loamy, clay, Seymour or sandy.

Table 2 Model set-up describing the target variable, the predictor variables and which variables are treated as categorical.

Variable           Symbol   Target   Predictor   Categorical
CCS                CCS05    YES
Farm               F1                YES         YES
Soil type          SC                YES         YES
Cane variety       VC                YES         YES
Crop class         C                 YES         YES
Month of harvest   M05               YES
Locator 1          X                 YES
Locator 2          Y                 YES
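As a sketch of assembling the block-level data set with the variable roles in Table 2, using pandas; the file name is hypothetical, and the column names simply follow the paper's symbols.

    import pandas as pd

    # Hypothetical input file holding one row per block.
    blocks = pd.read_csv("herbert_blocks_2005.csv")

    target = "CCS05"
    categorical = ["F1", "SC", "VC", "C"]   # farm, soil type, variety, crop class
    continuous = ["M05", "X", "Y"]          # month of harvest and the two locators

    # Mark the categorical predictors so that the modelling step treats them
    # as levels rather than as numbers.
    blocks[categorical] = blocks[categorical].astype("category")

    X = blocks[categorical + continuous]
    y = blocks[target]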

MARS allows users to specify many options within the model, including a maximum number of basis functions, the level of interactions between variables and a minimum number of observations between knot points (Figure 7). We specified a maximum of 16 basis functions and allowed only two-way interactions between variables, as higher-level interactions are unreliable given the inherent variability of block productivity data (Lawes and Lawn, 2005). We set V to 10 folds as part of the V-fold cross-validation procedure. This means the data are divided into 10 roughly equal subsets; one subset is held out and used to test the predictive capability of a model built using the remaining nine subsets, and the process is repeated until every subset has been left out.

Fig. 7 Screen capture of the options and limits tab (Salford Systems, 2010).
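The V-fold procedure just described can be sketched as follows; here fit and score are stand-ins for the MARS engine's training and evaluation steps, which the Salford product performs internally.

    import numpy as np

    def v_fold_score(fit, score, X, y, v=10, seed=0):
        """Divide the cases into v roughly equal subsets, hold each subset out
        once, train on the remainder, and average the held-out scores."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), v)
        scores = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            model = fit(X[train], y[train])
            scores.append(score(model, X[held_out], y[held_out]))
        return np.mean(scores)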

MARS results

Overall, the model's GCV R² score is reported in the model summary (Figure 8). The optimal MARS model (Figure 9) contained 10 terms (Figure 10), based on the productivity variables farm of origin, month of harvest and cane variety. Figure 11 graphically displays the strong correlation between the MARS-predicted CCS values and the actual CCS recorded for each farm. Note that subset 1 in BF9 is different from subset 1 in BF1: the subsets pertain only to the variable in the respective basis function. The same is true for all subsets listed in Figure 10.

Fig. 8 Screen capture of the model summary (Salford Systems, 2010).

Fig. 9 Screen capture of the MARS model for CCS production of the Herbert district, 2005 (Salford Systems, 2010).

Fig. 10 MARS basis functions for CCS production of the Herbert district, 2005 (Salford Systems, 2010).

Fig. 11 Correlation between raw data and predicted values.

Six subsets of farms were identified by the model. Subsets 1 and 2 contribute directly to CCS, with subset 1 farms reducing the predicted CCS value more than subset 2 farms. These farms occur most frequently in the central eastern sector of the Herbert district (graph not shown). Farms that appear in both subsets 1 and 2 reduce the predicted CCS value the most and cluster in the northern arm of the district (Figure 12(a)).

Fig. 12 (a) Farms that occur in subsets one and two and therefore decrease the response the most (relative to other farms). (b) Spatial spread of residuals. (c) Spatial spread of recorded CCS (%) in 2005. (d) Spatial spread of CCS (%) calculated by the MARS model.

Subsets 3 and 4 contribute to the model differently depending on the month of harvest. Relative to other farms, predicted CCS values for subset 3 farms increase if the harvest occurs before October, while predicted CCS values for subset 4 farms decrease if the harvest occurs after October. The relationship between month of harvest and modelled CCS values was temporal rather than spatial, with modelled CCS values peaking in October and decreasing thereafter. Finally, farm subsets 5 and 6 interact with two subsets of cane varieties (Table 3). Interactions with cane variety affected predicted CCS values in the district's north, while the high CCS predictions of the south appeared independent of cane variety. Figure 12(b) shows the spatial variability of the residuals. To assess the performance of the MARS model, Figure 12(c) shows the actual recorded CCS values and Figure 12(d) shows the MARS-modelled CCS values, which present similarly to the actual values, at least on a spatial scale.

Table 3 Cane variety subsets.

Variety subset 1: ARGOS, CASS, MIX, EXP, MIDA, Q107, Q124, Q138, Q142, Q152, Q157, Q158, Q162, Q164, Q166, Q167, Q172, Q179, Q181, Q187, Q215, Q216, Q114, Q99, Q119
Variety subset 2: Q115, Q117, Q120, Q127, Q135, Q165, Q170, Q174, Q186, Q190, Q194, Q195, Q200, Q204, Q96

It is possible to investigate the effect that a specific variable (e.g. cane variety) has on CCS by a three-step procedure. The first step builds the MARS model using all predictors except the variable of interest, here cane variety. The second step computes the residuals from this model. Finally, these residuals are compared against the predictor variable of interest. For example, Figure 13 shows how six cane varieties affect CCS after the effects of the other variables described in Table 2 have been removed.

Fig. 13 The effect of cane variety on CCS residuals. Higher residuals represent a higher CCS percentage. Bold horizontal lines represent the median value for each variety. Boxes represent the spread from the 25th to the 75th percentile of cases, and dashed vertical whiskers represent ±1.5 times the interquartile range. Points represent cases beyond 1.5 times the interquartile range and may be considered outliers. All varieties were considered in the model, but for visual purposes only six varieties are displayed.
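The three-step procedure translates directly to code. In this sketch, fit_model is a hypothetical stand-in for whatever regression engine builds the variety-free model (the paper uses the MARS GUI for this step):

    import pandas as pd

    def variety_effect(blocks, fit_model):
        """Step 1: model CCS05 from every predictor except variety (VC).
        Step 2: compute the residuals of that model.
        Step 3: summarise the residuals by variety."""
        predictors = [c for c in blocks.columns if c not in ("CCS05", "VC")]
        model = fit_model(blocks[predictors], blocks["CCS05"])
        residuals = blocks["CCS05"] - model.predict(blocks[predictors])
        return residuals.groupby(blocks["VC"]).describe()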

The MARS model that excluded cane varieties reported a GCV R² of 0.50, only 5% lower than that of the full model, which contained cane variety. We stress that this investigation was done purely for illustrative purposes, and that more rigorous investigations over a longer temporal period, also considering the removal of additional confounding variables, need to be performed before comprehensive conclusions can be drawn about the relationship between cane variety and CCS.

Conclusion

MARS is an effective tool for modelling complex multivariate data sets as relatively simple linear additive models, and it is capable of handling both quantitative and categorical predictors. This case study demonstrated how to apply MARS to a real-world data set containing CCS production measures from the Herbert district for 2005. Furthermore, the model gave some insight into the variables that influence CCS productivity (farm of origin, month of harvest and cane variety). We also briefly discussed how the effects that predictor variables have on CCS productivity can be removed. Given sufficient data, the MARS approach could therefore assist researchers to remove the effects of different biological, climatological or management conditions in different years. This would allow researchers to compare productivity maps across different years on a level playing field, which in turn could identify farms with consistently low or consistently high productivity once effects such as variety and month of harvest have been removed. Under an emerging precision farming regime, the MARS program offers enormous utility.

Acknowledgements

The authors extend sincere thanks to Mr Lawrence Di Bella from the HCPSL for providing access to these data and for assisting Mr Daniel Zamykal during the early stages of his PhD, the topic of which motivated this manuscript. The authors are also grateful to staff from Salford Systems for their support, to Mr Daniel Zamykal for preparing the data analysed in this manuscript, and to Ms Madalyn Casey for her assistance with this manuscript. This research was funded by the Australian Government through the Sugar Research and Development Corporation, with in-kind support from James Cook University.

REFERENCES

Craven P, Wahba G (1979) Smoothing noisy data with spline functions. Numerische Mathematik 31.

Friedman JH (1991) Multivariate adaptive regression splines (with discussion). Annals of Statistics 19.

Friedman JH, Roosen CB (1995) An introduction to multivariate adaptive regression splines. Statistical Methods in Medical Research 4.

Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning: Data mining, inference and prediction. Springer Series in Statistics. (Springer-Verlag: New York).

Leathwick J, Elith J, Hastie T (2006) Comparative performance of generalised additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling 199.

Lawes RA, Basford KE, McDonald LM, Lawn RJ, Wegener MK (2002) Factors affecting cane yield and commercial cane sugar in the Tully district. Australian Journal of Experimental Agriculture 42.

Lawes RA, Lawn RJ (2005) Applications of industry information in sugarcane production systems. Field Crops Research 92.

Lee T, Chiu C, Chou Y, Lu C (2006) Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis 50.

Muñoz J, Felicísimo ÁM (2004) Comparison of statistical methods commonly used in predictive modeling. Journal of Vegetation Science 15.

Salford Systems (2010) Salford Predictive Modeler. Salford Systems (accessed 25 August 2010).

Steinberg D, Colla PL, Martin K (1999) MARS user guide. (Salford Systems: San Diego, CA).


More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Lecture 8. Divided Differences,Least-Squares Approximations. Ceng375 Numerical Computations at December 9, 2010

Lecture 8. Divided Differences,Least-Squares Approximations. Ceng375 Numerical Computations at December 9, 2010 Lecture 8, Ceng375 Numerical Computations at December 9, 2010 Computer Engineering Department Çankaya University 8.1 Contents 1 2 3 8.2 : These provide a more efficient way to construct an interpolating

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information