Evolution of Regression II: From OLS to GPS to MARS Hands-on with SPM


1 Evolution of Regression II: From OLS to GPS to MARS. Hands-on with SPM. March 2013. Dan Steinberg and Mikhail Golovnya, Salford Systems

2 Course Outline
Today's webinar: hands-on companion session
o Regression problem: quick overview
o Classical OLS: the starting point
o RIDGE/LASSO/GPS regularized regression
o MARS adaptive non-linear regression
Follow-up webinars
o CART regression trees
o Random Forest ensembles
o TreeNet stochastic gradient boosting
o Hybrid TreeNet/GPS models

3 Regression Challenges
Preparation of data: errors, missing values, etc.
Determination of predictors to include in the model
o Hundreds, thousands, even tens or hundreds of thousands available
Transformation or coding of predictors
o Conventional approaches consider logarithm, power, inverse, etc.
Detecting and modeling important interactions
Possibly huge numbers of records
Complexity of the underlying relationship
Lack of external knowledge

4 Boston Housing Data Set
Concerns housing values in the Boston area
Harrison, D. and D. Rubinfeld, "Hedonic Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, v5, 81-102, 1978
Combined information from 10 separate governmental and educational sources to produce this data set
506 census tracts in the City of Boston for the year 1970
Goal: study the relationship between quality-of-life variables and property values
o MV: median value of owner-occupied homes in the tract ($1,000s)
o CRIM: per capita crime rate
o NOX: concentration of nitric oxides (parts per 10 million)
o AGE: percent of homes built before 1940
o DIS: weighted distance to centers of employment
o RM: average number of rooms per house
o LSTAT: % lower-status population
o RAD: accessibility to radial highways
o CHAS: borders the Charles River (0/1)
o INDUS: percent non-retail business
o TAX: property tax rate per $10,000
o PT: pupil-teacher ratio
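Outside SPM, the same dataset can be examined with a few lines of Python. A minimal sketch, assuming a local copy of the data in a hypothetical "boston.csv" with the column names above:

```python
# Load and inspect the Boston housing data; "boston.csv" is a
# hypothetical local file with the columns listed on the slide.
import pandas as pd

df = pd.read_csv("boston.csv")
print(df.shape)              # expect (506, 13): 506 census tracts
print(df.dtypes)             # quick metadata: variable names and types
print(df["MV"].describe())   # target: median home value ($1,000s)
```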

5 Open Data and First Look
File > Open displays recently opened files
Options for opening different SPM files
Specific file-format options for major database formats (SAS, SPSS, R, Excel, CSV)
ODBC connectivity to RDBMS

6 Quick Metadata and View Data
Click View Data

7 Ready To Model
From the Menu, choose Model
From the Toolbar, choose the modeling icon
From the Activity Window (below left), select Model
The Model dialog starts with its tab in red until a target has been chosen

8 Setting Up the Regression
Select MV as the target (dependent variable)
Select allowed predictors (check none = select all)
Analysis Type: Regression
Analysis Method: MARS
Click, Shift-click, and highlight to select groups of predictors, then click Select Predictors
This is all that is required to run an SPM analysis:
o Open file
o Select target
o Click START
The rest is just options and controls

9 Classical Regression Results
20% random test partition
Out-of-the-box regression; no attempt to perfect it
Test MSE =
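For readers without SPM, a rough open-source equivalent of this baseline run (20% random test partition, untuned OLS) can be sketched with scikit-learn; "boston.csv" is again the hypothetical local file:

```python
# OLS baseline: 20% random test partition, out-of-the-box regression.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("boston.csv")
X, y = df.drop(columns="MV"), df["MV"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
ols = LinearRegression().fit(X_tr, y_tr)
print("Test MSE:", mean_squared_error(y_te, ols.predict(X_te)))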

10 Residual Box Plots (test data vs. learn data)

11 Fraction of MSE Due to Largest Residuals
The largest residual accounts for 13.4% of the total MSE
We consider a lift (%MSE / %residuals) > 20 to be evidence of extreme outliers
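The lift diagnostic itself is simple arithmetic. A sketch, continuing from the OLS example above:

```python
# Share of test MSE contributed by the single largest residual, and the
# "lift" (%MSE / %residuals) outlier diagnostic from the slide.
import numpy as np

resid = y_te.values - ols.predict(X_te)   # residuals from the OLS fit
sq = np.sort(resid**2)[::-1]              # squared residuals, largest first
share = sq[0] / sq.sum()                  # fraction of MSE from top residual
lift = share / (1 / len(sq))              # %MSE divided by %residuals
print(f"largest residual: {share:.1%} of MSE, lift = {lift:.1f}")
```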

12 BATTERY PARTITION: Rerun 80/20 Learn/Test 100 Times
Note that partition sizes are constant
All three partitions change each cycle
Mean MSE = 23.80
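A sketch of the same experiment with the scikit-learn setup above: rerun the 80/20 split 100 times and average the test MSE, so a single lucky or unlucky split is not over-read:

```python
# BATTERY PARTITION analog: 100 cycles, fixed partition sizes,
# partition membership changes each cycle via the random seed.
import numpy as np

mses = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    m = LinearRegression().fit(X_tr, y_tr)
    mses.append(mean_squared_error(y_te, m.predict(X_te)))
print("mean test MSE over 100 partitions:", np.mean(mses))
```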

13 BATTERY BOOTSTRAP
Other options to obtain measures of what could reasonably be expected when predicting from other samples drawn from the same population
Model Setup > Bootstrap Options > Results
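A bootstrap analog, continuing from the previous sketches: resample the learn partition with replacement, refit, and read off percentiles of the test-sample MSE:

```python
# BATTERY BOOTSTRAP analog: bootstrap replicates of the learn set,
# evaluated on the fixed test partition.
import numpy as np
from sklearn.utils import resample

boot_mses = []
rng = np.random.RandomState(0)
for b in range(200):
    Xb, yb = resample(X_tr, y_tr, random_state=rng)   # bootstrap draw
    m = LinearRegression().fit(Xb, yb)
    boot_mses.append(mean_squared_error(y_te, m.predict(X_te)))
print("5th/95th percentile test MSE:", np.percentile(boot_mses, [5, 95]))
```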

14 MARS Modeling
Select MARS from Analysis Method
Click Regression for Analysis Type
Select MV as the target
Can explicitly select all other variables as allowed predictors

15 MARS Modeling Options
We will start with the MARS defaults
The most important controls are:
o Max Basis Functions
o Degree of Interaction
Defaults: Max BF = 15, Max Interaction Level = 1
You will always want to experiment, but can start with this relatively simple setup
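MARS is not part of scikit-learn; the open-source py-earth package is a reasonable stand-in for experimenting with the same defaults. A sketch, continuing from the data split above (py-earth's behavior is not identical to SPM's MARS engine):

```python
# MARS via py-earth (pip install sklearn-contrib-py-earth), using the
# slide's defaults: at most 15 basis functions, no interactions.
from pyearth import Earth

mars = Earth(max_terms=15, max_degree=1)
mars.fit(X_tr, y_tr)
print(mars.summary())    # basis functions and pruning trace
print("Test MSE:", mean_squared_error(y_te, mars.predict(X_te)))
```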

16 Quick MARS Result
Observe that the sample partition is as expected: 92 records in the test partition
We are guaranteed that this partition will not change if we do not change the random number seed in preferences
Two models are identified as optimal: one based on test-sample performance (prediction squared error) and one based on an internal penalty on model complexity (no reference to the test sample)

17 Summary Displays Optimal Model Details
Observe that the results displayed are for the 15-basis-function model
MSE on the test sample is remarkably lower
We have access to similar results for all other models of any size (later)

18 MARS Selector: Access to All Available Models
Access to 15 models, starting with the largest, with 15 basis functions (regressors)
Progressively smaller models, all the way down to no model at all (which might be best)
Double-click on any row to bring up its details

19 Residual Box Plots: MARS Model (test data vs. learn data)

20 How Did MARS Do So Well?
By transforming the predictors
Curves and Surfaces displays the transforms
All are essentially broken-line regressions

21 Remaining Graphs

22 Remaining Graphs
Most predictors are subjected to rather simple transforms
Here we see either one line segment (no transform) or two segments, allowing a change of slope in the regression at some critical point (knot)
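These hinge ("broken line") transforms are easy to build by hand, which helps demystify the basis functions MARS searches over. A sketch with a single hypothetical knot t on LSTAT, continuing from the split above:

```python
# Two-segment broken-line regression on one predictor, fit by hand.
import numpy as np
from sklearn.linear_model import LinearRegression

x = X_tr["LSTAT"].values
t = np.median(x)                                   # hypothetical knot
basis = np.column_stack([np.maximum(0, x - t),     # right-of-knot hinge
                         np.maximum(0, t - x)])    # left-of-knot hinge
seg = LinearRegression().fit(basis, y_tr)
# The two hinge coefficients are the slopes on either side of the knot.
print(seg.coef_, seg.intercept_)
```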

23 MARS Model with ONE Predictor
Use LSTAT as the lone predictor
Look at the graph for each size of model
Repeatedly double-click on different rows of the selector table

24 Variations of the MARS Model

25 MARS Uses a Forward Growing and a Backward Pruning Strategy
Forward stepwise knot placement table (columns: BasFn(s), GCV, IndBsFns, EfPrms, Variable, Knot); predictors enter in the order LSTAT, RM, DIS, CRIM, PT, TAX, RAD, NOX, B
Classic Output reports the forward-stepping progress of model construction
Each step involves a predictor and a knot where the regression slope changes
A predictor broken via a knot contributes TWO predictors to the model (two regions)
See our MARS course for a detailed discussion of all elements above

26 Backstepping
We backstep from the largest model to the smallest possible
Each step eliminates a regressor, which may be a region of a raw variable
Backstep all the way to the null model
Evaluate the test-sample performance of each model extracted from the backstepping sequence
If no test data are available, MARS also offers an internal method based on a penalty placed on model complexity
Report the optimal model, but give access to all models in the sequence

27 Model Details and Selection
Reveals backstepping; starts with the largest model

28 MARS Model as Code
The Basis Function tab displays the model detail
Includes code that would allow the model to be rebuilt on the same or new data

29 Can We Improve the Model?
28 built-in MARS experiments designed to help you find the right settings and modeling strategy

30 BATTERY BASIS

31 BATTERY BOOTSTRAP: MARS Model
Test MSE: 5th percentile = 12.82, 95th percentile = 19.89
Median number of basis functions = 8

32 Bootstrap MSE Profiles: MARS Models

33 Bootstrapped GPS Coefficient Bounds
Coefficient-bounds table (columns: Predictor, Coefficient, Order, RawVarImp, Lower, Upper, % Selected) for Constant, CHAS, CRIM, ZN, INDUS, NOX, RM, AGE, DIS, RAD, TAX, PT, B, LSTAT
Request Bootstrap to obtain bounds on GPS coefficient estimates
May take considerable time on large data sets

34 MARS Observations
MARS does its own variable selection: you specify what is allowed; MARS decides what to actually use
MARS automatically handles missing values
o Creates missing value indicators (MVIs) and allows them into the model
o Predictors with missings are interacted with the MVI to create a special sub-branch of the model just for records without missings for that variable
MARS automatically detects interactions
o Forward stepping permits construction of an interaction
o Interactions are of basis functions (regions of predictors)
MARS transforms all variables before it uses them
MARS can model the 0/1 binary target well
Better to do advance variable selection when there are many predictors (e.g., use a CART decision tree or TreeNet gradient boosting)
MARS is not at its best with many predictors having missing values

35 Motivation for Regularized Regression
Unsatisfactory regression results from data modeling physical processes (1970) when predictors were correlated:
o Resulting coefficient estimates could change dramatically with small changes in the data
o Some coefficient values were judged to be much too large
o Frequent appearance of coefficients with the wrong sign
o The problem is severe with substantial multicollinearity, but always present
The solution proposed by Hoerl and Kennard (Technometrics, 1970) was Ridge regression
o First proposed by Hoerl in 1962, just for stabilization of coefficients
o Initially very poorly received by the statistics profession
Intentionally biased, but yields both more satisfactory coefficient estimates and superior generalization
o Better performance (test MSE) on previously unseen data
Shrinkage of regression coefficients toward zero
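For comparison outside SPM, a ridge fit is one line in scikit-learn. A sketch continuing from the earlier OLS example, with an arbitrary (untuned) penalty weight:

```python
# Ridge regression: OLS plus an L2 penalty that shrinks coefficients
# toward zero. alpha=1.0 is arbitrary and would normally be tuned.
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("Test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
print("coefficient size:", abs(ridge.coef_).sum(),
      "vs OLS:", abs(ols.coef_).sum())   # shrinkage in action
```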

36 Lasso Regularized Regression
Introduced by Tibshirani in 1996, explicitly as an improvement on RIDGE regression
Least Absolute Shrinkage and Selection Operator
Desire to gain the stability and lower variance of ridge regression while also performing variable selection
Especially in the context of many possible predictors: looking for a simple, stable, low-predictive-variance model
Historical note: the Lasso was inspired by related work in 1993 by Leo Breiman (of CART and RandomForests fame), the non-negative garrote. Breiman's simulation studies showed the potential for improved prediction via selection and shrinkage

37 Getting Ready for GPS (Generalized PathSeeker) Regularized Regression
Select the GPS engine
Specify regression
20% random test sample

38 GPS Main Results Window
Displays the best test-sample results for each size of model developed
Could contain thousands of models, each of a different size
We can simply go to the best model based on test-sample results (MSE = 21.361)
Could elect to go with the ultra-small 3-variable model above (MSE = 22.66)

39 Ridge Regression vs. OLS (panels: Ridge Regression, Classical Regression)

40 Bootstrap Resampling of Run
GPS test MSE percentiles: 5th = , 95th = 22.90
OLS test MSE percentiles: 5th = , 95th = 35.77

41 GPS Terminology (Elastic Net Extended)
All variations of GPS regularized regression are described by a single real number, the elasticity:
o 0: searches for a very small number of coefficients in the model (COMPACT)
o 1: selects predictors and differentially shrinks (LASSO)
o 2: no variable selection, just shrinkage (RIDGE)
The RIDGE elasticity will not be true ridge regression if we limit the number of predictors allowed in the model
Elasticity can take on any value between 0 and 2 inclusive
o By contrast, the Elastic Net can vary only between 1 and 2 in our terminology
o Think of elasticity as controlling the premium we put on obtaining a smaller model (fewer coefficients)
o Elasticities < 1 can be essential when dealing with many predictors (text mining, bioinformatics, post-processing of large ensemble models)

42 GPS Penalized Loss Function
RIDGE penalty based on Σβ² (squared coefficients)
LASSO penalty based on Σ|β| (absolute values)
COMPACT penalty based on Σ|β|⁰ (a count of nonzero coefficients)
Each elasticity fits a model that minimizes the residual sum of squared errors plus the penalty
The penalty is a to-be-estimated constant times one of these functions of the β vector
Intermediate elasticities are mixtures; e.g., we could have a 50/50 mix of RIDGE and LASSO
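scikit-learn's ElasticNet is the closest open analog of such a mixture. A sketch of the 50/50 RIDGE/LASSO mix mentioned above, continuing from the earlier data split (alpha is arbitrary here and would normally be tuned along a path):

```python
# A 50/50 ridge/lasso penalty mix via ElasticNet's l1_ratio.
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_tr, y_tr)
nonzero = (enet.coef_ != 0).sum()     # lasso component selects variables
print(f"{nonzero} of {X_tr.shape[1]} predictors kept;",
      "test MSE:", mean_squared_error(y_te, enet.predict(X_te)))
```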

43 Regularized Regression: Theory
OLS regression: minimize Mean Squared Error
Regularized regression: minimize Mean Squared Error + λ (Model Complexity)
o Ridge: complexity = sum of squared coefficients
o Lasso: complexity = sum of absolute coefficients
o Best subsets: complexity = number of coefficients
Any regularized regression approach tries to balance model performance and model complexity
λ is the regularization parameter
o λ = ∞: zero-coefficient solution
o λ = 0: OLS solution
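The three penalties can be written in one penalized least-squares objective. A LaTeX rendering consistent with the slide (λ is the regularization constant):

```latex
\hat{\beta} = \arg\min_{\beta}
  \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
  + \lambda\, P(\beta),
\qquad
P(\beta) =
\begin{cases}
  \sum_j \beta_j^{2} & \text{RIDGE (elasticity 2)}\\
  \sum_j |\beta_j|   & \text{LASSO (elasticity 1)}\\
  \sum_j \mathbf{1}[\beta_j \neq 0] & \text{COMPACT (elasticity 0)}
\end{cases}
```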

44 GPS Model Setup
Advanced Setup allows specification of a grid of elasticities; here, 21 evenly spaced between 0 and 2
The default generates 4 elasticities
Recommend using the other default settings

45 GPS Modeling Details: Specific Elasticities
Four different modeling strategies are deployed above
Each follows a forward stepwise algorithm to build up the model
The performance of each strategy at selected points along its path is displayed
Can review coefficients and all performance measures for every model

46 Details for the 3-Coefficient Models
Ridge will not necessarily show the same number of coefficients as the others until close to the end of the path
Double-click on the Ridged Lasso column header to bring up more details, as shown below

47 Sentinel Solutions Detail
Along the path followed by GPS for every elasticity, we identify the solution (coefficient vector) that is best for each performance measure
No attention is paid to model size here, so you might still prefer to select a model from the graphical display

48 GPS Is Forward Stepping
Classical ridge regression can be estimated just like an ordinary regression (just solve modified normal equations), but it needs a grid search over the key shrinkage parameter
The classical lasso was solved via quadratic programming in a backward-stepping algorithm
The innovation in Friedman's PathSeeker (2004) and then Generalized PathSeeker (2008) was to convert the difficult backward-stepping algorithm into forward stepping
o Imagine a problem with 10,000 predictors: we would prefer to forward step to a small solution
Forward-stepping algorithms are applied to all elasticities

49 How to Forward Step
At any stage of model development, choose between:
o Adding a new variable to the model
o Updating an existing model variable's coefficient
Step sizes are small; initial coefficients for any model are very small and are updated in very small increments
This explains why the Ridge elasticity can have solutions with fewer than all the variables
o Technically, ridge does not select variables; it only shrinks
o In practice, it can only add one variable per step
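A conceptual sketch of this add-or-update loop (not SPM's actual GPS implementation): starting from the zero solution, repeatedly nudge the coefficient whose change most reduces the learn-sample residual sum of squares, in tiny increments; a variable enters the model the first time it is nudged:

```python
# Incremental forward-stagewise stepping on standardized predictors.
import numpy as np

def forward_steps(X, y, n_steps=500, delta=0.01):
    X = np.asarray(X, float); y = np.asarray(y, float)
    beta = np.zeros(X.shape[1])              # zero-coefficient start
    for _ in range(n_steps):
        resid = y - X @ beta
        grad = -2 * X.T @ resid              # gradient of RSS w.r.t. beta
        j = np.argmax(np.abs(grad))          # best add-or-update candidate
        beta[j] -= delta * np.sign(grad[j])  # tiny update (adds j if new)
    return beta

beta = forward_steps((X_tr - X_tr.mean()) / X_tr.std(), y_tr - y_tr.mean())
print("variables in model:", (beta != 0).sum())
```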

50 Steps and Points
Rarely need to change these settings
GPS will update any path up to 5,000 times
We will collect and display performance details for 200 points from the 5,000 steps
Performance evaluation is time consuming
Skew determines from where along the path we select points to report
Steps should be changed only if the classic output reports that the path was not completed
Points can be increased if you want more options for model selection
Points can be decreased if you care most about speed and intend to accept the best model results

51 Further Modeling Controls
If you are working with a very large number of potential predictors, it is essential to limit the number that can eventually be included in the GPS model
GPS begins by considering ALL of your variables as potential predictors and will continue to do so during each forward step
Once the limit of allowed variables in the model has been reached, GPS no longer checks the excluded variables in further model updates
This makes for dramatic speed-ups in model construction and also a dramatic reduction in the RAM required for modeling
The GPS correlation control can be used to induce somewhat different paths and models, and has been empirically shown to be useful in discovering high-performance models. See BATTERY MAXCORR for automated grid searches over this control

52 Try GPS with Just ONE Predictor
Just RIDGE is selected for display, with the RED beam positioned at point #1
Progress involves nothing more than refining the coefficient until optimal

53 Four Points Along the Path
At point #1 the coefficient = , at point #80 = -0.2309, at point #136 = , at point #187 =

54 Path Building Process
Path diagram: zero-coefficient model -> a variable is added -> sequence of 1-variable models -> a variable is added -> sequence of 2-variable models -> a variable is added -> sequence of 3-variable models -> final OLS solution (from λ = ∞ down to λ = 0)
Variable Selection Strategy
The elasticity parameter controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive
o Elasticity = 2: fast approximation of Ridge regression; introduces variables as quickly as possible and then jointly varies the magnitude of the coefficients; lowest degree of compression
o Elasticity = 1: fast approximation of Lasso regression; introduces variables sparingly, letting the currently active variables develop their coefficients; good degree of compression versus accuracy
o Elasticity = 0: fast approximation of Stepwise regression; introduces new variables only after the currently active variables are fully developed; excellent degree of compression but may lose accuracy

55 Points Versus Steps
Diagram: several paths, each with a different number of steps, run from the zero solution to the OLS solution; a fixed grid of points is extracted from each
Point Selection Strategy
Each path will have a different number of steps
To facilitate model comparison among different paths, the point selection strategy extracts a fixed collection of models into the points grid
o This eliminates some of the original irregularity among individual paths and facilitates model extraction and comparison

56 Path Points on Boston Data
Path development shown at points 30, 100, 150, and 190
Each path uses a different variable selection strategy and separate coefficient updates

57 Regularized Logistic Regression
All the same GPS ideas apply
Specify Logistic Binary analysis
Specify the optimality criterion
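As an open analog, scikit-learn's LogisticRegression supports the same penalties. A sketch with an L1 penalty on a hypothetical 0/1 recode of MV (above/below the median), continuing from the earlier split:

```python
# L1-regularized logistic regression: the same selection-plus-shrinkage
# ideas with a logistic loss; C is the inverse penalty strength.
from sklearn.linear_model import LogisticRegression

y_bin = (y_tr > y_tr.median()).astype(int)   # hypothetical binary target
logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
logit.fit(X_tr, y_bin)
print("predictors kept:", (logit.coef_ != 0).sum())
```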

58 How To Select a Best Model
Regularized regression was originally invented to help modelers obtain more intuitively acceptable models
Can think of the process as a search engine generating predictive models
The user decides based on:
o Lack of complexity of the model
o Acceptability of the coefficients (magnitude, signs, predictors included)
Clearly, this can be set to automatic mode; the criterion could well be performance on test data

59 What's Next
Evolution of Regression, Part III
o Regression trees (decision trees for regression)
o Ensemble models (ensembles of decision trees, the Bagger)
o Random Forest (the Bagger perfected)
o TreeNet (gradient-boosted decision tree regression)
o TreeNet/GPS hybrid models (TreeNet post-processed by GPS models)
Hands-on session, Part IV: live walk-through of examples using
o Regression trees
o Ensemble models
o Random Forest
o TreeNet
o TreeNet/GPS hybrid models

60 Salford Predictive Modeler (SPM)
Download a current version from our website
The version will run without a license key for 10 days
Request a license key from unlock@salford-systems.com
Request a configuration to meet your needs
o Data handling capacity
o Data mining engines made available
Salford Systems 2012
