Macro to compute best transform variable for the model

Similar documents
Submitting SAS Code On The Side

A Format to Make the _TYPE_ Field of PROC MEANS Easier to Interpret Matt Pettis, Thomson West, Eagan, MN

SD10 A SAS MACRO FOR PERFORMING BACKWARD SELECTION IN PROC SURVEYREG

Tales from the Help Desk 6: Solutions to Common SAS Tasks

To conceptualize the process, the table below shows the highly correlated covariates in descending order of their R statistic.

PROC SQL vs. DATA Step Processing. T Winand, Customer Success Technical Team

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

PROC SQL VS. DATA STEP PROCESSING JEFF SIMPSON SAS CUSTOMER LOYALTY

SAS Macro Dynamics - From Simple Basics to Powerful Invocations Rick Andrews, Office of the Actuary, CMS, Baltimore, MD

Using SAS Macro to Include Statistics Output in Clinical Trial Summary Table

Arthur L. Carpenter California Occidental Consultants, Oceanside, California

Introduction. Getting Started with the Macro Facility CHAPTER 1

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

Clinical Data Visualization using TIBCO Spotfire and SAS

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

A Macro for Systematic Treatment of Special Values in Weight of Evidence Variable Transformation Chaoxian Cai, Automated Financial Systems, Exton, PA

Introduction / Overview

A Quick and Gentle Introduction to PROC SQL

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

PharmaSUG Paper TT11

Base and Advance SAS

Virtual Accessing of a SAS Data Set Using OPEN, FETCH, and CLOSE Functions with %SYSFUNC and %DO Loops

ABSTRACT INTRODUCTION TRICK 1: CHOOSE THE BEST METHOD TO CREATE MACRO VARIABLES

Two useful macros to nudge SAS to serve you

WHAT ARE SASHELP VIEWS?

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

Your Own SAS Macros Are as Powerful as You Are Ingenious

Using Templates Created by the SAS/STAT Procedures

Better Metadata Through SAS II: %SYSFUNC, PROC DATASETS, and Dictionary Tables

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

Tips for Mastering Relational Databases Using SAS/ACCESS

Essential ODS Techniques for Creating Reports in PDF Patrick Thornton, SRI International, Menlo Park, CA

SAS Online Training: Course contents: Agenda:

ABSTRACT: INTRODUCTION: WEB CRAWLER OVERVIEW: METHOD 1: WEB CRAWLER IN SAS DATA STEP CODE. Paper CC-17

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

Taming a Spreadsheet Importation Monster

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Quick and Efficient Way to Check the Transferred Data Divyaja Padamati, Eliassen Group Inc., North Carolina.

Get Started Writing SAS Macros Luisa Hartman, Jane Liao, Merck Sharp & Dohme Corp.

Going Under the Hood: How Does the Macro Processor Really Work?

Automatically Generating a Customized Table for Summarizing the Activities in Clinics

Unlock SAS Code Automation with the Power of Macros

= %sysfunc(dequote(&ds_in)); = %sysfunc(dequote(&var));

SAS PROGRAM EFFICIENCY FOR BEGINNERS. Bruce Gilsen, Federal Reserve Board

SAS PROGRAM EFFICIENCY FOR BEGINNERS. Bruce Gilsen, Federal Reserve Board

Bruce Gilsen, Federal Reserve Board

Files Arriving at an Inconvenient Time? Let SAS Process Your Files with FILEEXIST While You Sleep

Exploring DATA Step Merge and PROC SQL Join Techniques Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

Automatic Indicators for Dummies: A macro for generating dummy indicators from category type variables

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

Tasks Menu Reference. Introduction. Data Management APPENDIX 1

INTRODUCTION TO PROC SQL JEFF SIMPSON SYSTEMS ENGINEER

Using PROC REPORT to Cross-Tabulate Multiple Response Items Patrick Thornton, SRI International, Menlo Park, CA

Validation Summary using SYSINFO

Macros to Report Missing Data: An HTML Data Collection Guide Patrick Thornton, University of California San Francisco, SF, California

Ranking Between the Lines

Conditional Processing Using the Case Expression in PROC SQL

Conditional Processing in the SAS Software by Example

Cover the Basics, Tool for structuring data checking with SAS Ole Zester, Novo Nordisk, Denmark

SAS Macro Language: Reference

Open Problem for SUAVe User Group Meeting, November 26, 2013 (UVic)

Using SAS Macros to Extract P-values from PROC FREQ

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step Mike Zdeb, FSL, University at Albany School of Public Health, Rensselaer, NY

PharmaSUG Paper PO12

Efficiency Programming with Macro Variable Arrays

Streamline Table Lookup by Embedding HASH in FCMP Qing Liu, Eli Lilly & Company, Shanghai, China

Customized Flowcharts Using SAS Annotation Abhinav Srivastva, PaxVax Inc., Redwood City, CA

In this paper, we will build the macro step-by-step, highlighting each function. A basic familiarity with SAS Macro language is assumed.

ABSTRACT INTRODUCTION MACRO. Paper RF

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Different Methods for Accessing Non-SAS Data to Build and Incrementally Update That Data Warehouse

Simple Rules to Remember When Working with Indexes

SAS Macro Technique for Embedding and Using Metadata in Web Pages. DataCeutics, Inc., Pottstown, PA

Simplifying Your %DO Loop with CALL EXECUTE Arthur Li, City of Hope National Medical Center, Duarte, CA

Information Criteria Methods in SAS for Multiple Linear Regression Models

Developing Data-Driven SAS Programs Using Proc Contents

Analysis of Complex Survey Data with SAS

T.I.P.S. (Techniques and Information for Programming in SAS )

If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC

Advanced Visualization using TIBCO Spotfire and SAS

PharmaSUG China. Systematically Reordering Axis Major Tick Values in SAS Graph Brian Shen, PPDI, ShangHai

How to Create Data-Driven Lists

Seminar Series: CTSI Presents

Creating Macro Calls using Proc Freq

A SAS Macro for Producing Benchmarks for Interpreting School Effect Sizes

Using MACRO and SAS/GRAPH to Efficiently Assess Distributions. Paul Walker, Capital One

Posters. Workarounds for SASWare Ballot Items Jack Hamilton, First Health, West Sacramento, California USA. Paper

SAS CURRICULUM. BASE SAS Introduction

Implementing external file processing with no record delimiter via a metadata-driven approach

Paper HOW-06. Tricia Aanderud, And Data Inc, Raleigh, NC

CREATING A SUMMARY TABLE OF NORMALIZED (Z) SCORES

Paper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA

Hash Objects for Everyone

PROGRAMMING ROLLING REGRESSIONS IN SAS MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA

RUN_MACRO Run! With PROC FCMP and the RUN_MACRO Function from SAS 9.2, Your SAS Programs Are All Grown Up

Transcription:

Paper 3103-2015 Macro to compute best transform variable for the model Nancy Hu, Discover Financial Service ABSTRACT This study is intended to assist Analysts to generate the best of variables using simple arithmetic operators (square root, log, loglog, exp and rcp) and such as monthly amount paid, daily number of received customer service calls, or annual number total sales for a specific product. During a statistical data modeling process, Analysts are often confronted with the task of computing derived variables using the existing available variables. The advantage of this methodology is that the new variables may be more significant than the original ones. This paper gives a new way to compute all the possible variables using a set of math transformation. The codes include many SAS features that are very useful tools for SAS programmers to incorporate in their future codes Such as %SYSFUNC, SQL, %INCLUDE, CALL SYMPUT, %MACRO, SORT, CONTENTS, MERGE, MACRO _NULL_, as well as %DO %TO and many more. I demonstrate the syntax and mechanics of the macro using following examples. INTRODUCTION These codes are used for logistical regression model. Table 1, sale data age_of_house. sold indiator, loation, bed_room, style, listdaystosale and location. sold indicator size_sqft location bed_rooms bath_room style listdaystosale location age_of_house yes 2393 chicago 3 2 condo 130 suburbur 50 no 3628 Houston 5 3 family 200 city 20 yes 2200 Washtington 3 2 condo 100 city 60 no 2000 chicago 3 2 town house 120 city 60 yes 1890 Houston 4 3 family 100 suburbur 87 yes 1600 Washtington 2 1 town house 80 city 70 yes 2400 chicago 3 1 family 110 suburbur 50 yes 2500 Houston 4 2 family 120 suburbur 35 yes 3600 Washtington 3 2 town house 130 city 50 yes 3500 chicago 6 2 250 suburbur 25 1

Price family yes 3400 Houston 6 3 family 160 city 25 yes 843 Washtington 2 1 condo 60 suburbur 50 %MACRO LOG_FUNC(VN,LO_LIM=,HI_LIM=,D=&DELTA); /* FUNCTION VNL=LOG(VN-LO_LIM+D) */ %LET VNN1 = &VN_6&SX1; IF &VN < 0 THEN &VNN1 = 0; ELSE &VNN1 = LOG(&VN+&D); IF &VN > &HI_LIM THEN &VNN1 = LOG(&HI_LIM+&D); IF LENG <= 34 THEN VARLAB1 = SUBSTR(VARLAB,1,LENG) ' (LOG)'; ELSE VARLAB1 = SUBSTR(VARLAB,1,34) ' (LOG)'; %MEND LOG_FU ln_sqrt_ft 4 3 2 1 0 y = 0.0002x + 2.8119 R² = 0.9481 0 1000 2000 3000 4000 squre_fee ln_sqrt_ft Linear (ln_sqrt_ft) Figure 1-a, see log transformation. %MACRO LLG_FUNC(VN,LO_LIM,HI_LIM=,D=&DELTA); /* FUNCTION VNL=LOG(LOG(VN+&D+1)) */ %LET VNN1 = &VN_6&SX2; IF &VN < 0 THEN &VNN1 = LOG(LOG(3)); ELSE &VNN1 = LOG(LOG(&VN+&D+2)); IF &VN > &HI_LIM THEN &VNN1 =LOG(LOG(&HI_LIM+&D+2)); IF LENG <= 33 THEN VARLAB2 = SUBSTR(VARLAB,1,LENG) ' (LLOG)'; 2

Price ELSE VARLAB2 = SUBSTR(VARLAB,1,33) " (LLOG)"; %MEND LLG_FUNC; %MACRO EXP_FUNC(VN,LO_LIM=,HI_LIM=,A=0,B=); /* FUNCTION VNE=EXP(A*VN) */ ; %LET VNN1 = &VN_6&SX3; %LET RANGE = %SYSEVALF(&B-&A); %IF &RANGE = 0 %THEN %LET RANGE = 1; IF &VN < 0 THEN &VNN1 = -1; ELSE &VNN1 = -EXP((-1) * &VN / &RANGE ); IF &V N > &HI_LIM THEN &VNN1 = -EXP((-1) * &HI_LIM / &RANGE ); &VNN1 = ROUND(&VNN1,0.00001); IF LENG <= 34 THEN VARLAB3 = SUBSTR(VARLAB,1,LENG) ' (EXP)'; ELSE VARLAB3 = SUBSTR(VARLAB,1,34) ' (EXP)'; %MEND EXP_FUNC; %MACRO SQR_FUNC(VN,LO_LIM=,HI_LIM=,A=0); %LET VNN1 = &VN_6&SX4;IF &VN < 0 THEN &VNN1 = 0; ELSE &VNN1 = SQRT(&VN+&A); IF &VN > &HI_LIM THEN &VNN1 = SQRT(&HI_LIM+&A); IF LENG <= 33 THEN VARLAB4=SUBSTR(VARLAB,1,LENG) ' (SQRT)'; ELSE VARLAB4 = SUBSTR(VARLAB,1,33) ' (SQRT)'; %MEND SQR_FUNC; rcp_age_of_house y = -7E+06x + 60703 R² = 0.5281 500000 400000 300000 200000 100000 0-0.06-0.05-0.04-0.03-0.02-0.01 0 Age of The House price Linear (price) Figure 1-b, RCP function. %MACRO RCP_FUNC(VN,LO_LIM=,HI_LIM=,D=&DELTA); 3

%VNAME_N(&VN,VN_6) %LET VNN1 = &VN_6&SX5; IF &VN < 0 THEN &VNN1 = -1; ELSE &VNN1 = -1 / (&VN+&D); IF &VN > &HI_LIM THEN &VNN1 = -1 / (&HI_LIM+&D); IF LENG <= 34 THEN VARLAB5 = SUBSTR(VARLAB,1,LENG) ' (RCP)'; ELSE VARLAB5 = SUBSTR(VARLAB,1,34) ' (RCP)'; %MEND RCP_FUNC; %MACRO MAPP(DATA_IN=,DATA_OUT=,VARLIST=,PCNTILE=); %GLOBAL VNN1; DATA &DATA_OUT; LENGTH VARLAB VARLAB1-VARLAB5 $40; SET &DATA_IN; %DO II=1 %TO &N_VAR; %IF (&&VAR&II NE &BAD AND &&VAR&II NE &WEIGHT_) %THEN %DO; CALL LABEL(&&VAR&II,VARLAB); LENG=LENGTH(VARLAB);%LOG_FUNC(&&VAR&II,LO_LIM=&&P&II._&LEFT_PCT,H I_LIM=&&P&II._&RITE_PCT); %LET VAR&II._1=&VNN1; %LLG_FUNC(&&VAR&II,LO_LIM=&&P&II._&LEFT_PCT,HI_LIM=&&P&II._&RITE_ PCT); %LETVAR&II._2=&VNN1; %EXP_FUNC(&&VAR&II,LO_LIM=&&P&II._&LEFT_PCT, HI_LIM=&&P&II._&RITE_PCT,B=&&P&II._&RITE_PCT); %LET VAR&II._3=&VNN1; %SQR_FUNC(&&VAR&II,LO_LIM=&&P&II._&LEFT_PCT,HI_LIM=&&P&II._&RITE_ PCT); %LET VAR&II._4=&VNN1; %RCP_FUNC(&&VAR&II,LO_LIM=&&P&II._&LEFT_PCT,HI_LIM=&&P&II._&RITE_ PCT); %LET VAR&II._5=&VNN1; %DO JJ=1 %TO 6; CALL SYMPUT("LAB&&II._&JJ",VARLAB&JJ); %IF &DROP_ALL=1 %THEN %DO; DROP &VARLIST LENG VARLAB VARLAB1-VARLAB5; %ELSE %DO; DROP LENG VARLAB VARLAB1-VARLAB5; %MEND MAPP; This mapp macro can transfer one variable into 5 new derived variables with 5 new label see table 1A. 4

Example 2, Table 1A, Original varible log loglog sqrt exp rcp a a_l a_ll a_s a_e a_r b b_l b_ll b_s b_e b_r %MACRO SLCTVAR(SDATA,INDSN,DROPFILE,NUMKP=1,SLEVEL=0.05); DATA SUBSET; SET &INDSN; %IF &PORTION NE %THEN %DO; IF RANUNI(374521) <= &PORTION THEN OUTPUT; %DO II=1 %TO &N_VAR; %IF &&P&II._&RITE_PCT NE. %THEN %DO; PROC LOGISTIC DATA=SUBSET(KEEP=&BAD &&VAR&II &&VAR&II._1 &&VAR&II._2 &&VAR&II._3 &&VAR&II._4 &&VAR&II._5 &WEIGHT_) OUTEST=EST1 NOPRINT; MODEL &BAD =&&VAR&II &&VAR&II._1 &&VAR&II._2 &&VAR&II._3 &&VAR&II._4 &&VAR&II._5 / SELECTION=FORWARD STOP=&NUMKP SLE=&SLEVEL; %IF &WEIGHT_ NE %THEN %DO; WEIGHT &WEIGHT_; DATA ONEVAR(KEEP=DROPVAR); LENGTH DROPVAR $8; SET EST1(WHERE=(_TYPE_='PARMS')); ARRAY VNNS &&VAR&II._1 &&VAR&II._2 &&VAR&II._3 &&VAR&II._4 &&VAR&II._5; DO K=1 TO DIM(VNNS); IF VNNS{K} <=.Z THEN DO; CALL VNAME(VNNS{K},DROPVAR); OUTPUT; END; END; %ELSE %DO; DATA ONEVAR(KEEP=DROPVAR); LENGTH DROPVAR $8; ARRAY VNNS &&VAR&II &&VAR&II._1 &&VAR&II._2 &&VAR&II._3 &&VAR&II._4 &&VAR&II._5; DO K=1 TO DIM(VNNS); VNNS{K} = 0; CALL VNAME(VNNS{K}, DROPVAR); OUTPUT; END; STOP; 5

PROC APPEND BASE=DROPLIST DATA=ONEVAR FORCE; DATA _NULL_; SET DROPLIST END=LAST; FILE "DROPFILE.dat"; IF _N_=1 THEN PUT '%MACRO DROPLIST;'; PUT DROPVAR; IF LAST THEN PUT '%MEND DROPLIST;'; %INCLUDE "dropfile.dat"; DATA &SDATA; SET &INDSN; LABEL %DO II=1 %TO &N_VAR; &&VAR&II._1="&&LAB&II._1" &&VAR&II._2="&&LAB&II._2" &&VAR&II._3="&&LAB&II._3" &&VAR&II._4="&&LAB&II._4" &&VAR&II._5="&&LAB&II._5" %STR(;); DROP %DROPLIST; %MEND SLCTVAR; Example 2, dropplist of this macro. droplist a_l b_s c_e d_r %macro lgt_plot(dsn,bad,varname); data temp; set &dsn (keep=&bad &varname) end=last;retain n_bads 0; format odds 6.2 ; length odds_a $5; level=&varname; if &bad then n_bads=n_bads+1; if last then do; call symput('n_bads',trim(left(n_bads))); call symput('n_obs',trim(left(_n_))); odds=(n_bads)/(_n_-n_bads+0.1); odds_a=trim(left(odds)); call symput('o_odds', odds_a); end; 6

run; drop n_bads odds; proc means data=temp noprint; var &varname &bad;class level; output out=temp2 min=min min2 sum=&varname &bad max=max max2 ; run; sql noprint; select count(*) into :macro1 from temp; quit; data temp3; set temp2(firstobs=2) end=last; format odds 3. logodds log_b_g 6.2 info 7.5; retain tot_info 0; odds=(&bad+0.1)/(s_wt-&bad+0.1); logodds=log(odds); WOE=logodds; log_b_g=logodds; ); p_good=(s_wt&bad+0.1)/(&n_obs&n_bads+0.1);p_bad=(&bad+0.1)/(&n_bads+0.1 if info =. then info = 0; tot_info = tot_info + info; if last then do; if tot_info <= 0.00001 then tot_info = 0.00001; tot_info = round(tot_info,0.00001); call symput('tot_info',tot_info); end; run; data temp3; set temp3; format info_rt 4.1; info_rt = info/(&tot_info); proc means data=temp3 noprint; var logodds &varname; output out=null css=logodds &varname; run; proc corr data=temp3 noprint outp=corr1; var logodds &varname; run; Data out.corr1; set corr1; length var $32 corr_a $5; format corr &varname 4.2; if _type_='corr' and logodds = 1; var="&varname"; corr=&varname; info=&tot_info; corr_a=trim(left(corr)); 7

run; call symput("&varname",trim(left(corr_a))); keep var corr info; %mend lgt_plot(dsn,bad,varname); Example3, final variable list and related measurement variable information value correlation size_sqft_r 0.2 0.45 location 0.1 0.35 bed_rooms_l 0.3 0.2 bath_room 0.5 0.6 style 0.6 0.6 listdaystosale 0.7 0.5 age_of_house 0.1 0.2 price 0.1 0.1 This final list of significant variable list can use to plug in the final logistical model. This example shows the ranking order of the derived variables with dependent variable. CONCLUSION This method of computing the derived variables can be useful for modeling purposes and save time during the variable creation stage. These macros are very straightforward but the user needs a strong understanding of SAS programming especially in SAS MACRO in order to fully take advantage of the codes Although not shown in the examples above, the macro will work for nonlinear variable (i.e. linear from the alternative model. In addition to normal bug fixing, future work on the macro may include improving the overall coding efficiency and enhancing the handling of missing values. Fitting complex models and/or models to large data sets is the chief challenge because of the iterative nature of the macro. My hope is that in the not-too-distance future, the parametric will be built-in component. REFERENCES SAS Institute Inc. 2011. SAS/CONNECT 9.3 User's Guide. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/connref/63066/pdf/default/connref.pdf SAS Institute Inc. 2012. SAS 9.3 SQL Procedure User s Guide. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/sqlproc/63043/pdf/default/sqlproc.pdf SAS Institute Inc. 2012. SAS 9.3 Language Reference: Concepts. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/lrcon/65287/pdf/default/lrcon.pdf 8

Dickstein, Craig, Pass, Ray, & Davis, Michael. 2007. DATA Step vs. PROC SQL: What s a neophyte to do?. Proceedings of the SAS Global Forum 2007 Conference. Cary, NC. Available at http://www2.sas.com/proceedings/forum2007/237-2007.pdf Lafler, Kirk Paul. 2014. Powerful and Hard-to-Find PROC SQL Features. Proceedings of the SAS Global Forum 2014 Conference. Cary, NC. Available at http://support.sas.com/resources/papers/proceedings14/1240-2014.pdf ACKNOWLEDGMENTS We appreciate the feedback, suggestions and encouragement from my co-worker, Ming Zhang, Alicia, Anani and Joel. We also thank my Manager, Mamta Kakkar and Yang and Kotha, for their support and patience. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the authors at nanyhu@discover.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. The views and opinions expressed here are those of the authors and do not necessarily reflect the views and opinions of Discover Bank. Discover Bank is not, by means of this article, providing technical, business, or other professional advice or services and is not endorsing any of the software, techniques, approaches, or solutions presented herein. This article is not a substitute for professional advice or services and should not be used as a basis for decisions that could impact your business. 9