Adaptive Regression in SAS/IML
David Katz, David Katz Consulting, Ashland, Oregon

ABSTRACT
Adaptive Regression algorithms allow the data to select the form of a model in addition to estimating the parameters. Friedman's procedure exploits computational shortcuts in Adaptive Regression, obtaining the power of Neural Networks with a fraction of the resources. Proc IML allows us to explore these tools in a flexible environment, with exciting results. This paper describes an ongoing project that implements this approach. We review the algorithm, discuss the programming techniques, and give some examples of useful applications.

ADAPTIVE METHODS AND SPLINES
Standard Multiple Linear Regression searches for a model of the form y = Σ_i w_i x_i. Since the x_i can be transformed using any a priori transformation, this includes such variants as polynomial regression, which are often informally referred to as nonlinear models; however, the model is linear in the transformed variables, so the standard least squares methods still apply. A problem arises when there are many variables in the analysis. Searching through many possible transformations becomes impractical, as the number of parameters we need to estimate grows rapidly and outstrips the number of parameters our data can support. This has motivated the development of a class of models called Adaptive Dictionary Methods. These are of the form

y = Σ_i w_i g_i(x)   (1)

where the g_i are nonlinear functions estimated from the data vector x = (x1, x2, x3, ..., xn). For example, each g_i could be defined as a moving average of x. Once the g_i have been defined, the w_i are estimated by least squares. The g_i are often called features, reflecting the intuitive notion that they represent notable features of the data that can be recognized and then combined as in (1). An example of a useful feature is an interaction term. Adaptive methods are useful when the main goal of the analysis is prediction rather than hypothesis testing. They avoid the strong assumptions needed for Logistic Regression or Multiple Regression, such as linearity.
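Once the features g_i have been chosen, estimating the w_i in (1) is an ordinary least squares problem. A minimal sketch in Python/NumPy (the particular features below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical features g_0(x)=1, g_1(x)=x, g_2(x)=sin(2*pi*x).
# Once the g_i are fixed, the weights w_i in (1) are ordinary
# least squares estimates.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
G = np.column_stack([np.ones_like(x), x, np.sin(2 * np.pi * x)])
y = 1.0 + 2.0 * x + 0.5 * np.sin(2 * np.pi * x)   # noise-free for clarity

w, *_ = np.linalg.lstsq(G, y, rcond=None)
# w recovers the generating weights (1.0, 2.0, 0.5)
```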
We choose a model which minimizes the prediction error when the model is applied to a new data sample. This is done with a holdout sample, via cross-validation, or via generalized cross-validation (see below). The main issues in Adaptive Dictionary Methods are:
1) Selecting the form of the g_i. We need a suitable set of functions, i.e. functions that are flexible enough to fit the data but can be estimated in a feasible manner. Generally these basis functions are parameterized, and we need a procedure for estimating the parameters.
2) The number of features g_i needs to be appropriate to the data we are fitting. This is similar to the familiar process of Stepwise variable selection. We add or delete basis functions (terms) to create new candidate models, and evaluate these models for the best balance of fit and stability. However, the theoretical basis is different, since here we are not making any assumption about the form of the underlying process that produced the data.

In his 1990 paper, Friedman proposed an Adaptive Dictionary method he called Multivariate Adaptive Regression Splines. Splines are piecewise polynomial functions with some added constraints to assure continuity. The most commonly used splines are one-dimensional piecewise cubic polynomials which are constrained to be continuous and to have continuous first and second derivatives at the points where the pieces meet. These points are called knots. Splines are useful because they are flexible; like polynomials of arbitrary degree, they can uniformly approximate any function (over a compact set) with similar continuity requirements. Unlike polynomials, splines are of limited degree, but add flexibility by using more knots and thus more pieces. They have the desirable characteristic that a sum of, for example, cubic splines is again a cubic spline. Splines can be generalized to dimensions > 1 in a number of ways. Friedman used tensor product splines, which take the product of univariate splines in distinct dimensions. That is, g = Π_j s_j, where the s_j are univariate splines from distinct dimensions. Combining this with (1) we have

y = Σ_i w_i Π_j s_ij(x)   (2)

as the form of the models.
Friedman's procedure constructs these models in a computationally efficient way. First observe that piecewise linear splines can easily be smoothed to cubic. This reduces the problem to finding models of form (2) with s now representing piecewise linear splines. In fact, for many applications, the linear spline representation is accurate enough, and it is easier to compute. Next, observe that linear splines have a convenient basis. Consider the functions of the form (x - a)+ and (x - a)-, where (x)+ = x when x >= 0 and 0 otherwise, and (x)- = x when x <= 0 and 0 otherwise. Piecewise linear functions can be represented by linear combinations of functions having this form. If these primitive basis
functions are denoted b_i, then it follows that (2) can be rewritten as

y = Σ_i w_i b_i(x)   (3)

Each b_i can be characterized by its dimension (the input variable used in its definition), by its sign orientation, and by its knot position. To determine which basis functions to use, we start by selecting the best model that uses just a single pair of b_i. One of the pair will be oriented positively, and one negatively, and they will share the same knot. We find this pair by stepping through each variable; each data point in our training sample provides a potential knot position projected onto that variable. Using least squares, we fit each possible model in this set, and select the one with the best fit, usually based on finding the lowest MSE. We then add additional basis function terms to the model, one pair at a time. In addition to searching the univariate basis functions, we also test child basis functions. These are built by using a basis function already in the model as one factor, and one of the primitive b_i (not already a factor in the parent) as another factor, thus testing many tensor product basis functions for possible inclusion in the model. For example, suppose we have selected an initial pair of primitive basis functions for inclusion in the model:

b_1 = (x2 - a)+
b_2 = (x2 - a)-

so that the initial model is estimated as w_1 b_1 + w_2 b_2. The initial step of the search has determined that this is the best model of this form over all choices of raw variables and knots in the search. In this example, we have chosen raw variable 2, and the knot a is a value of x2 that appears in the data. When we search for another basis function, we evaluate models of the form (3) with four terms. The additional pair of terms could be another primitive pair like the first two, but we also test products such as (x2 - a)+ (xj - a2)+ for some other variable xj. This is called a child of the first basis function. We are building up a tree of basis functions with the intercept term at the root of the tree.
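The primitive pair and the construction of a child can be made concrete with a small sketch; the values of x2, x4, and the knots below are hypothetical, chosen only to make the arithmetic visible:

```python
import numpy as np

def pos_part(u):
    return np.maximum(u, 0.0)   # (u)+ : u where u >= 0, else 0

def neg_part(u):
    return np.minimum(u, 0.0)   # (u)- : u where u <= 0, else 0

# Hypothetical data for two input variables.
x2 = np.array([0.0, 1.0, 3.0, 4.0])
x4 = np.array([2.0, 2.0, 5.0, 1.0])
a, a2 = 1.0, 2.0

b1 = pos_part(x2 - a)    # (x2 - a)+ , selected in the forward step
b2 = neg_part(x2 - a)    # (x2 - a)- , its reflected partner
# A child multiplies a basis function already in the model by a new
# primitive in a different variable, giving a tensor product:
child = b1 * pos_part(x4 - a2)   # (x2 - a)+ (x4 - a2)+
```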
The greedy algorithm reduces the enormous search space of all possible tensor products to those which have one factor already in the model, a factor already shown to be of interest for prediction. In this respect, the algorithm is similar to the CART and CHAID procedures, but it can produce better results when the data contain additive structure. We are clearly performing a large number of multiple regressions to search all these possibilities. Friedman makes this feasible by developing formulae that avoid recomputing the sums of squares and cross products from scratch each time. As we move from one candidate knot to the next, we can update the SSCP matrix efficiently by computing only the change in this matrix. It is even possible to simplify the matrix inversion step by using results from the previous knot.

This type of forward search is often called a greedy algorithm, in that it takes the best choice at each step. Sometimes this can lead us astray, as when an early choice leads us in a suboptimal direction. One method often used with greedy algorithms is to continue the forward search beyond what seems necessary, and to follow it with a backwards pass in which the model is simplified: we select the term which adds the least to the model and delete it. We can then select the best model found in the backwards search by using the Generalized Cross-Validation (GCV) estimate of model fit. This statistic adjusts the MSE to reflect the complexity of each model, providing a brake on overfitting and estimating the MSE which would be expected under cross-validation. The model with the lowest GCV is selected as the best. Thus GCV holds a position analogous to the Cp statistic in multiple regression; both are guides to the number of terms to include in the model.

IMPLEMENTATION
SAS/IML provides an excellent tool for exploring these ideas. Proc IML implements a matrix language distinct from the DATA step and macro languages of SAS, but well integrated with the rest of the SAS System.
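The backward pass and the GCV criterion can be sketched as follows. This is a simplified reading of the procedure, not the paper's SAS/IML code: the effective-parameter formula charging an extra penalty per knot follows Friedman's suggestion, and the function names are mine.

```python
import numpy as np

def gcv(rss, n, n_terms, penalty=3.0):
    """GCV = (RSS/n) / (1 - C/n)^2, where the effective parameter count
    C charges each coefficient plus an extra penalty per knot, so more
    complex models must earn their keep."""
    c_eff = n_terms + penalty * (n_terms - 1) / 2.0
    return (rss / n) / (1.0 - c_eff / n) ** 2

def backward_pass(B, y, penalty=3.0):
    """Greedy backward deletion over the columns of the design matrix B:
    repeatedly drop the term whose removal increases RSS the least, and
    return the visited subset with the lowest GCV.  Column 0 is the
    intercept and is never dropped."""
    n = len(y)
    def rss(cols):
        w, *_ = np.linalg.lstsq(B[:, cols], y, rcond=None)
        r = y - B[:, cols] @ w
        return float(r @ r)
    cols = list(range(B.shape[1]))
    best_cols, best_gcv = cols[:], gcv(rss(cols), n, len(cols), penalty)
    while len(cols) > 1:
        # the least useful non-intercept column is the cheapest to delete
        drop = min(cols[1:], key=lambda c: rss([k for k in cols if k != c]))
        cols.remove(drop)
        g = gcv(rss(cols), n, len(cols), penalty)
        if g < best_gcv:
            best_cols, best_gcv = cols[:], g
    return best_cols, best_gcv

# Toy usage: one informative column plus one pure-noise column.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
B = np.column_stack([np.ones(200), x, rng.normal(size=200)])
y = 2.0 * x + 0.1 * rng.normal(size=200)
kept, score = backward_pass(B, y)
```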
Tools are included for importing and exporting SAS datasets into the SAS/IML workspace. Within this workspace, matrices are manipulated using a command language closely related to matrix notation. Since SAS/IML can manipulate matrices easily, the basic algorithm translates readily to this language. IML provides operators for matrix multiplication and also for elementwise multiplication. There are built-in functions like SOLVE for linear systems, and even TRISOLV for triangular systems, which turns out to be particularly relevant for Friedman's procedure. Another useful feature is the ability to specify submatrices easily, which proved useful for the incremental formulae. In IML, let i and j be arrays of indices; then A[i,j] denotes the appropriate submatrix of A.

Because SAS/IML is interpreted, it is easy to make changes and see the effect on the results. However, for the same reason, there are some challenges with the speed of the calculation. I addressed the speed issue by using a technique known as subsampling. When there are very large datasets, it is probably not essential to test every single point as a potential knot. Skipping points will still provide an excellent approximation in most cases. The number of points to skip becomes a parameter of the SAS/IML program.

Another challenge was the lack of arrays of matrices in SAS/IML. I needed to track lists of knots for each variable. The lists were of varying length, so it would have been wasteful of memory to allocate a single 2-dimensional matrix. The solution was to use the SAS/IML execute command, which enables you to execute an expression that you construct on the fly. In this case, I created a character array of matrix names called cutlistnames. These names were used in a constructed assignment expression [listing not recoverable from the source] that assigned to the name at a given index the value in (for example) cutlistptr3, which in this case was an array of arbitrary length. Thus cutlistnames acted as a virtual array of arrays of potentially different lengths.
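Putting the pieces above together, the forward stage of the search can be sketched in Python. This is a simplified, additive (degree-1) rendering with subsampled knots, not the paper's SAS/IML implementation, and it refits each candidate from scratch rather than using Friedman's incremental SSCP updates; the function names are mine.

```python
import numpy as np

def hinge_pair(x, a):
    """Reflected pair of linear-spline primitives sharing knot a."""
    return np.maximum(x - a, 0.0), np.minimum(x - a, 0.0)

def forward_pass(X, y, max_terms=7, skip=5):
    """Starting from the intercept, repeatedly add the reflected hinge
    pair (searched over all variables and a subsample of observed knot
    positions) that most reduces the training MSE."""
    n, p = X.shape
    B = np.ones((n, 1))                  # design matrix: intercept only
    terms = [("intercept",)]
    while B.shape[1] + 2 <= max_terms:
        best = None                      # (mse, variable, knot)
        for v in range(p):
            for a in np.unique(X[:, v])[::skip]:   # skip points for speed
                hp, hn = hinge_pair(X[:, v], a)
                Bc = np.column_stack([B, hp, hn])
                w, *_ = np.linalg.lstsq(Bc, y, rcond=None)
                mse = np.mean((y - Bc @ w) ** 2)
                if best is None or mse < best[0]:
                    best = (mse, v, a)
        _, v, a = best
        hp, hn = hinge_pair(X[:, v], a)
        B = np.column_stack([B, hp, hn])
        terms += [("pos", v, a), ("neg", v, a)]
    return B, terms

# Toy usage: y is a kinked function of the first variable only.
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(150, 2))
y = np.abs(X[:, 0])
B, terms = forward_pass(X, y)
```

A full implementation would also search child (tensor-product) candidates built from terms already in the model, as described above.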
The main loop for the Adaptive Regression in SAS/IML looks something like this:

[code listing not recoverable from the source]

The first term is the intercept. The children of the intercept term are of degree 1. Children of other basis functions are of the next higher degree. The maximum degree searched is a parameter of the program run, as is the maximum number of basis functions. When this maximum is reached, the program proceeds to backwards selection. Friedman suggests that the final model should have no more than half the basis functions in the maximal model. If not, the procedure likely needs to be rerun with a higher value for maxterms.

The integration of the SAS Macro facility was another advantage of using SAS/IML. The %print macro is a simple tool for debugging. The verbose variable is a positional parameter which can be set to enable various levels of print messages. It defaults to verbose=1, which means the message only prints if the global verbose flag is 1 or greater. Macros like %searchthruknots and %addbestbasis simply make the code more readable, while avoiding the overhead of an interpreted function call to a SAS/IML module. Here is the code for the %print macro:

[code listing not recoverable from the source]

Another handy macro for SAS/IML debugging is given below. It was helpful where large matrices might be involved, so that the built-in print function would produce more information than needed.

[code listing not recoverable from the source]

REPRESENTING THE TREE
As described above, the search for the best model involves building a tree of basis functions. Each basis function is a tensor product of the primitive splines. We represented each of these products as a pair of row vectors. One vector showed the knot locations, and the other showed the type of primitive, positive or negative. So for an analysis with four input variables we would have basis functions such as:

cut   cut   none  none
1.3   8     .     .

This is the basis function (x1 - 1.3)+ (x2 - 8)+. Another example shows how the negatively oriented primitives are represented:

cut   cut   none  none
5     7.5   .     .
This represents (x1 - 5)- (x2 - 7.5)-. By combining these rows, we obtained two matrices that represented the entire growing model. This representation made it possible to perform the search using SAS/IML commands.

ENHANCING THE ALGORITHM
There are some cases where a priori knowledge of the problem domain makes it possible to modify the algorithm to fit the analysis. In one case, I suspected that one variable might be interacting with any of the others, but it seemed very unlikely that those other variables would interact with each other. It was a simple matter to add a filter to the search for new basis functions. This ensured a model of the form I required, while saving a great deal of calculation. In many cases we have knowledge of the sign a given parameter should have. Because our code controls the details of the search, we can incorporate sign checks as required.
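The two-row representation can be made concrete with a small sketch; the encoding below (+1/-1/0 for positive cut, negative cut, and "none") is my rendering of the scheme, not the paper's exact layout:

```python
import numpy as np

def eval_basis(X, knots, types):
    """Evaluate the tensor-product basis function encoded by one row of
    knot locations and one row of primitive types: +1 for a positive
    cut, -1 for a negative cut, 0 when the variable is not a factor."""
    b = np.ones(X.shape[0])
    for j, (a, t) in enumerate(zip(knots, types)):
        if t == 1:
            b *= np.maximum(X[:, j] - a, 0.0)
        elif t == -1:
            b *= np.minimum(X[:, j] - a, 0.0)
        # t == 0: this variable does not appear in the product
    return b

# Two observations on four input variables (hypothetical values).
X = np.array([[2.0, 9.0, 0.0, 0.0],
              [1.0, 7.0, 0.0, 0.0]])
# (x1 - 1.3)+ (x2 - 8)+ : cut, cut, none, none
b = eval_basis(X, knots=[1.3, 8.0, 0.0, 0.0], types=[1, 1, 0, 0])
```

Stacking the knot rows and type rows of every basis function gives the two matrices that represent the growing model.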
These modifications simply restrict the search space, and require no other changes to the algorithm or the theory.

EXPERIENCES USING THE ALGORITHM
DIRECT MODELING: ZIP MODEL
The ability of Adaptive Regression to fit nonlinear functions and interactions helped particularly in a recent project for a catalog company. They were looking for zip codes which responded better than expected to their rented lists. An obvious approach would be to look at the response rate by zip code from their existing mailings and rank the zip codes by the actual results. Assuming a sufficient sample size, this would provide an excellent ranking. However, many of the zip codes had insufficient history, so the estimated response rate would not be stable. In these cases, we would naturally fall back on other things we know about these zip codes, i.e. zip demographics or proximity to a retail store. The best results would be obtained by using a combination of these approaches, weighting the actual observed response more heavily for zip codes with more history. An Adaptive Regression model of degree 2 found the expected interactions with the count of historical data, and provided a better ranking of zip codes on the validation sample which had been held out for this purpose.

IMPROVING ON LOGISTIC REGRESSION
In a pilot project for a large mailer, we compared the results of Logistic Regression with those from Adaptive Regression. The data included a regression score developed using internal purchase history and a file of external behavior from a large data aggregator. Our goal was to develop a combined score. Preliminary explorations suggested that there was interaction between the internal score and the external data; namely, the external data told us more, and so raised the total score more, when the internal score was low. A straightforward logistic run showed model inadequacy: the logits were far from linear. An adaptive regression with maxdeg=2 performed much better. The adaptive regression algorithm automatically found the interactions we expected. The gains chart below compares the results.
We held out a validation sample of data which was not used in the analysis; the results for this validation sample are reported in order to avoid an evaluation based on overfitted models. The gains chart is a display of the cumulative responses versus the cumulative catalogs mailed, mailing the best first according to each model. It is closely related to the ROC curve. A higher curve indicates a model with more discrimination.

[Figure: Gains Chart - Validation Sample. Cumulative Orders versus Cumulative Circulation for the Adaptive Regression and Logistic models.]
BIBLIOGRAPHY
Cherkassky and Mulier, Learning from Data, Wiley, 1998.
Friedman, Jerome, "Multivariate Adaptive Regression Splines," SLAC PUB-4960, 1990.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
David Katz
David Katz Consulting
257 Siskiyou Blvd. #06
Ashland, Oregon 97520
(5) 82-7
Email: david@davidkatzconsulting.com
Web: www.davidkatzconsulting.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.