Adaptive Regression in SAS/IML


David Katz, David Katz Consulting, Ashland, Oregon

ABSTRACT

Adaptive Regression algorithms allow the data to select the form of a model in addition to estimating the parameters. Friedman's procedure exploits computational shortcuts in Adaptive Regression, obtaining the power of Neural Networks with a fraction of the resources. Proc IML allows us to explore these tools in a flexible environment, with exciting results. This paper describes an ongoing project that implements this approach. We review the algorithm, discuss the programming techniques, and give some examples of useful applications.

ADAPTIVE METHODS AND SPLINES

Standard Multiple Linear Regression searches for a model of the form y = Σ_j w_j x_j. Since the x_j can be transformed using any a priori transformation, this includes such variants as polynomial regression, which are often informally referred to as nonlinear models; however, the model is linear in the transformed variables, so the standard least squares methods still apply.

A problem arises when there are many variables in the analysis. Searching through many possible transformations becomes impractical: the number of parameters we need to estimate grows rapidly and soon outstrips what our data can support. This has motivated the development of a class of models called Adaptive Dictionary Methods. These are of the form

    y = Σ_j w_j g_j(x)    (1)

where the g_j are nonlinear functions estimated from the data vector x = (x1, x2, x3, ..., xn). For example, each g_j could be defined as a moving average of the x_i. Once the g_j have been defined, the w_j are estimated by least squares. The g_j are often called features, reflecting the intuitive notion that they represent notable features of the data that can be recognized and then combined as in (1). An example of a useful feature is an interaction term.

Adaptive methods are useful when the main goal of the analysis is prediction rather than hypothesis testing. They avoid the strong assumptions needed for Logistic Regression or Multiple Regression, such as linearity. We choose the model which minimizes the prediction error when the model is applied to a new data sample; this is done with a holdout sample, via cross-validation, or via generalized cross-validation (see below).

The main issues in Adaptive Dictionary Methods are:

1) Selecting the form of the g_j. We need a suitable set of functions, i.e. functions that are flexible enough to fit the data but can be estimated in a feasible manner. Generally these basis functions are parameterized, and we need a procedure for estimating the parameters.

2) Choosing a number of features g_j appropriate to the data we are fitting. This is similar to the familiar process of stepwise variable selection: we add or delete basis functions (terms) to create new candidate models, and evaluate those models for the best balance of fit and stability. However, the theoretical basis is different, since here we are not making any assumption about the form of the underlying process that produced the data.

In his 1990 paper, Friedman proposed an Adaptive Dictionary method he called Multivariate Adaptive Regression Splines. Splines are piecewise polynomial functions with added constraints to assure continuity. The most commonly used splines are one-dimensional piecewise cubic polynomials constrained to be continuous and to have continuous first and second derivatives at the points where the pieces meet; these points are called knots. Splines are useful because they are flexible: like polynomials of arbitrary degree, they can uniformly approximate any function (over a compact set) with similar continuity requirements. Unlike polynomials, splines are of limited degree, but add flexibility by using more knots and thus more pieces. They have the desirable characteristic that a sum of, for example, cubic splines is also a cubic spline.

Splines can be generalized to dimensions > 1 in a number of ways. Friedman used tensor product splines, which take the product of univariate splines in distinct dimensions. That is, g_j = Π_k s_jk, where the s_jk are univariate splines from distinct dimensions. Combining this with (1), we have

    y = Σ_j w_j Π_k s_jk(x)    (2)

as the form of the models. Friedman's procedure constructs these models in a computationally efficient way. First observe that piecewise linear splines can easily be smoothed to cubic. This reduces the problem to finding models of form (2) with the s_jk now representing piecewise linear splines. In fact, for many applications the linear spline representation is accurate enough, and it is easier to compute.

Next, observe that linear splines have a convenient basis. Consider functions of the form (x - a)+ and (x - a)-, where x+ = x when x >= 0 and 0 otherwise, and x- = x when x <= 0 and 0 otherwise. Piecewise linear functions can be represented by linear combinations of functions having this form.
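These primitive hinge functions are easy to construct as data columns with elementwise operations. A minimal IML sketch (the vector x and the knot a are illustrative):

    proc iml;
    x = {1, 3, 5, 7, 9};            /* one input variable, as a column    */
    a = 4;                          /* candidate knot                     */
    bpos = (x - a) # (x >  a);      /* (x - a)+ : zero to the left of a   */
    bneg = (x - a) # (x <= a);      /* (x - a)- : zero to the right of a  */
    print x bpos bneg;
    quit;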

If these primitive basis functions are denoted b, then it follows that (2) can be rewritten as

    y = Σ_j w_j Π_k b_jk(x)    (3)

where each factor b_jk is one of the primitives. Each primitive can be characterized by its dimension (the input variable x_i used in its definition), by its sign orientation, and by its knot position.

To determine which basis functions to use, we start by selecting the best model that uses just a single pair of primitives. One of the pair will be oriented positively and one negatively, and they will share the same knot. We find this pair by stepping through each variable; each data point in our training sample, projected onto that variable, provides a potential knot position. Using least squares, we fit each possible model in this set and select the one with the best fit, usually based on finding the lowest MSE.

We then add additional basis function terms to the model, one pair at a time. In addition to searching the univariate basis functions, we also test child basis functions. These are built by using a basis function already in the model as one factor, and one of the primitives (not already a factor in the parent) as another factor, thus testing many tensor product basis functions for possible inclusion in the model. For example, suppose we have selected an initial pair of primitive basis functions for inclusion in the model:

    b1 = (x2 - a)+
    b2 = (x2 - a)-

so that the initial model is estimated as w1*b1 + w2*b2. The initial step of the search has determined that this is the best model of this form over all choices of raw variables and knots in the search. In this example, we have chosen raw variable 2, and the knot a is a value of x2 that appears in the data. When we search for another basis function, we evaluate models of the form (3) with four terms. The additional pair of terms could be another primitive pair like the first two, but we also test products such as (x2 - a)+ (x1 - a2)+. Such a product is called a child of the first basis function. We are building up a tree of basis functions with the intercept term at the root of the tree. The greedy algorithm reduces the enormous search space of all possible tensor products to those which have one factor already in the model, a factor that has therefore already shown itself to be of interest for prediction. In this respect, the algorithm is similar to the CART and CHAID procedures, but it can produce better results when the data contain additive characteristics.

We are clearly performing a large number of multiple regressions to search all these possibilities. Friedman makes this feasible by developing formulae that avoid recomputing the sums of squares and cross products from scratch each time: as we move from one candidate knot to the next, we update the SSCP matrix efficiently by computing only the change in this matrix. It is even possible to simplify the matrix inversion step by using results from the previous knot.

This type of forward search is often called a greedy algorithm, in that it takes the best choice at each step. Sometimes this can lead us astray, as when an early choice leads us in a suboptimal direction. One method often used with greedy algorithms is to extend the forward search beyond what seems necessary, and to follow it with a backwards pass in which the model is simplified: we select the term which adds the least to the model and delete it. We then select the best model found in the backwards search using the Generalized Cross-Validation estimate of model fit. This statistic adjusts the MSE to reflect the complexity of each model, providing a brake on overfitting, and estimates the MSE which would be expected via cross-validation. The model with the lowest Generalized Cross-Validation is selected as the best. Thus Generalized Cross-Validation holds a position analogous to the C_p statistic in multiple regression: both are guides to the number of terms to include in the model.
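To make the forward step concrete, here is a deliberately naive IML sketch of the search for the first pair of basis functions. It refits every candidate from scratch rather than using Friedman's incremental SSCP updates, skips knots at the extremes to keep the normal equations nonsingular, and assumes a hypothetical dataset train with inputs x1, x2 and response y; the subsampling interval of 5 is likewise illustrative:

    proc iml;
    use train;                                   /* hypothetical dataset     */
    read all var {x1 x2} into X;
    read all var {y} into y;
    close train;
    n = nrow(X);
    best = 1e300;                                /* best MSE found so far    */
    do v = 1 to ncol(X);                         /* step through variables   */
       do i = 1 to n by 5;                       /* subsampling: skip points */
          a = X[i, v];                           /* candidate knot           */
          if a > min(X[,v]) & a < max(X[,v]) then do;
             bpos = (X[,v] - a) # (X[,v] >  a);  /* (x_v - a)+               */
             bneg = (X[,v] - a) # (X[,v] <= a);  /* (x_v - a)-               */
             D = j(n, 1, 1) || bpos || bneg;     /* intercept + the new pair */
             w = solve(D` * D, D` * y);          /* least squares fit        */
             mse = ssq(y - D * w) / n;
             if mse < best then do;
                best = mse;  bestv = v;  besta = a;
             end;
          end;
       end;
    end;
    print bestv besta best;
    quit;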
IMPLEMENTATION

SAS/IML provides an excellent tool for exploring these ideas. Proc IML implements a matrix language distinct from the data step and macro languages of SAS, but well integrated with the rest of the SAS system. Tools are included for importing and exporting SAS datasets into the SAS/IML workspace, where they are manipulated using a command language closely related to matrix notation.

Since SAS/IML can manipulate matrices easily, the basic algorithm translates readily to this language. IML provides operators for matrix multiplication and also for elementwise multiplication. There are built-in functions like Solve for linear systems, and even Trisolve for triangular systems, which turns out to be particularly relevant for Friedman's procedure. Another useful feature is the ability to specify submatrices easily, which proved valuable for the incremental formulae: in IML, if i and j are arrays of indices, then A[i,j] denotes the corresponding submatrix of A.

Because SAS/IML is interpreted, it is easy to make changes and see the effect on the results. However, for the same reason, there are some challenges with the speed of the calculation. I addressed the speed issue by using a technique known as subsampling. When there are very large datasets, it is probably not essential to test every single point as a potential knot; skipping points will still provide an excellent approximation in most cases. The number of points to skip becomes a parameter of the SAS/IML program.

Another challenge was the lack of arrays of matrices in SAS/IML. I needed to track lists of knots for each variable. The lists were of varying length, so it would have been wasteful of memory to allocate a single 2-dimensional matrix. The solution was to use the SAS/IML execute command, which enables you to execute an expression that you construct on the fly. In this case, I created a character array of matrix names called cutlistnames. These names were used in an expression that assigns, to the matrix named at index x, the value in cutlistptr3, which in this case was an array of arbitrary length. Thus cutlistnames acted as a virtual array of arrays of potentially different lengths, as sketched below.
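A minimal reconstruction of the technique (the names cutlistnames and cutlistptr3 come from the text; the values and the per-variable naming scheme are made up):

    proc iml;
    cutlistnames = {"cutlist1" "cutlist2" "cutlist3"};  /* one name per variable */
    x = 3;                               /* variable whose knot list changed     */
    cutlistptr3 = {0.5 1.7 4.2 9.0};     /* new knot list, any length            */
    /* build the assignment statement as a string, then execute it              */
    call execute(concat(cutlistnames[x], " = cutlistptr3;"));
    print cutlist3;                      /* the matrix now exists under that name */
    quit;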

The main loop for the Adaptive Regression in SAS/IML looks something like this, in outline:

    nterms = 1;                   /* start with the intercept                       */
    do while (nterms < maxterms);
       %searchthruknots           /* score every (parent, variable, knot) candidate */
       %addbestbasis              /* add the winning pair of basis functions        */
       nterms = nterms + 2;
    end;
    /* backwards pass: repeatedly delete the weakest term, keep the best GCV model */

The first term is the intercept. The children of the intercept term are of degree 1; children of other basis functions are of the next higher degree. The maximum degree searched is a parameter of the program run, as is the maximum number of basis functions. When this maximum is reached, the program proceeds to backwards selection. Friedman suggests that the final model should have no more than half the basis functions in the maximal model; if not, the procedure likely needs to be rerun with a higher value for maxterms.

The integration of the SAS Macro facility was another advantage of using SAS/IML. The %print macro is a simple tool for debugging. The verbose variable is a positional parameter which can be set to enable various levels of print messages. It defaults to verbose=1, which means the message only prints if the global verbose flag is 1 or greater. Macros like %searchthruknots and %addbestbasis simply make the code more readable, while avoiding the overhead of an interpreted function call to a SAS/IML module. The %print macro looks essentially like this (the global flag is a macro variable, here called &gverbose):

    %macro print(msg, verbose);
       %if %length(&verbose) = 0 %then %let verbose = 1;  /* default level */
       %if &verbose <= &gverbose %then %do;
          print &msg;
       %end;
    %mend print;

Another handy macro for SAS/IML debugging prints only a matrix's dimensions. It was helpful where large matrices might be involved, so that the built-in print function would produce more information than needed. In outline (the name %printdim is illustrative):

    %macro printdim(m);
       print "&m" (nrow(&m))[label="rows"] (ncol(&m))[label="cols"];
    %mend printdim;

REPRESENTING THE TREE

As described above, the search for the best model involves building a tree of basis functions, each of which is a tensor product of the primitive splines. We represented each of these products as a pair of row vectors: one showing the knot locations, and the other showing the type of each primitive, positive or negative. So for an analysis with four input variables we would have basis functions such as:

    type:   cut+   cut+   none   none
    knot:    1.3      8      .      .

This is the basis function (x1 - 1.3)+ (x2 - 8)+. Another example shows how the negatively oriented primitives are represented:

    type:   cut-   cut-   none   none
    knot:      5    7.5      .      .

which represents (x1 - 5)- (x2 - 7.5)-. By combining these rows, we obtained two matrices that represented the entire growing model. This representation made it possible to perform the search using SAS/IML commands.

ENHANCING THE ALGORITHM

There are some cases where a priori knowledge of the problem domain makes it possible to modify the algorithm to fit the analysis. In one case, I suspected that one variable might be interacting with any of the others, but it seemed very unlikely that those other variables would interact with each other. It was a simple matter to add a filter to the search for new basis functions (see the sketch at the end of this section). This ensured a model of the form I required, while saving a great deal of calculation. In many cases we have knowledge of the sign a given parameter should have; because our code controls the details of the search, we can incorporate sign checks as required. These modifications simply restrict the search space, and require no other changes to the algorithm or the theory.
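A minimal sketch of such a filter, written as an IML function module; the module name and calling convention are hypothetical. Here variable 1 is the only variable permitted to interact with the others:

    /* parentvars: row vector of variables already in the parent basis  */
    /* function (empty for the intercept); v: candidate new variable.   */
    start allowChild(parentvars, v);
       if ncol(parentvars) = 0 then return(1);         /* main effects: always allowed */
       if v = 1 | any(parentvars = 1) then return(1);  /* x1 takes part in the product */
       return(0);                                      /* block all other interactions */
    finish allowChild;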

EXPERIENCES USING THE ALGORITHM

DIRECT MODELING: ZIP MODEL

The ability of Adaptive Regression to fit nonlinear functions and interactions helped particularly in a recent project for a catalog company. They were looking for zip codes which responded better than expected to their rented lists. An obvious approach would be to look at the response rate by zip code in their existing mailings and rank the zip codes by the actual results. Assuming a sufficient sample size, this would provide an excellent ranking. However, many of the zip codes had insufficient history, so the estimated response rate would not be stable. In those cases, we would naturally fall back on other things we know about these zip codes, i.e. zip demographics or proximity to a retail store. The best results would be obtained by using a combination of these approaches, weighting the actual observed response more heavily for zip codes with more history. An Adaptive Regression model of degree 2 found the expected interactions with the count of historical data, and provided a better ranking of zip codes on the validation sample which had been held out for this purpose.

IMPROVING ON LOGISTIC REGRESSION

In a pilot project for a large mailer, we compared the results of Logistic Regression with those from Adaptive Regression. The data included a regression score developed using internal purchase history, and a file of external behavior from a large data aggregator. Our goal was to develop a combined score. Preliminary explorations suggested that there was interaction between the internal score and the external data; namely, the external data told us more, and so raised the total score more, when the internal score was low. A straightforward logistic run showed model inadequacy: the logits were far from linear. An adaptive regression with maxdeg=2 performed much better; the adaptive regression algorithm automatically found the interactions we expected.

The gains chart below compares the results. We held out a validation sample of data which was not used in the analysis; the results for this validation sample are reported in order to avoid an evaluation based on overfitted models. The gains chart is a display of the cumulative responses versus the cumulative catalogs mailed, mailing the best first according to each model. It is closely related to the ROC curve; a higher curve indicates a model with more discrimination.

[Figure: Gains Chart - Validation Sample. Cumulative Orders (0 to 2500) versus Cumulative Circulation (0 to 140,000), with the Adaptive Regression curve above the Logistic curve.]
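The gains chart itself is straightforward to compute in IML: order the sample by model score, best first, and accumulate the responses. A minimal sketch, assuming column vectors score and resp (1 = order, 0 = no order) from the validation sample:

    proc iml;
    score = {0.9, 0.2, 0.7, 0.4, 0.8};   /* toy model scores              */
    resp  = {1,   0,   0,   0,   1  };   /* toy response indicators       */
    call sortndx(idx, score, 1, 1);      /* index sorting score descending */
    cumOrders = cusum(resp[idx]);        /* cumulative orders, best first  */
    cumCirc   = t(1:nrow(resp));         /* cumulative pieces mailed       */
    print cumCirc cumOrders;
    quit;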

BIBLIOGRAPHY

Cherkassky, Vladimir, and Filip Mulier, Learning from Data, Wiley, 1998.

Friedman, Jerome, "Multivariate Adaptive Regression Splines", SLAC PUB-4960, 1990.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

David Katz
David Katz Consulting
257 Siskiyou Blvd. #06
Ashland, Oregon 97520
(5) 82-7
Email: david@davidkatzconsulting.com
Web: www.davidkatzconsulting.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.