A Side of Hash for You To Dig Into


Shan Ali Rasul, Indigo Books & Music Inc., Toronto, Ontario, Canada

ABSTRACT
Within the realm of Customer Relationship Management (CRM) there is always a need to segment customers in order to gain insight into customer behaviour and to align customers with current and ongoing marketing strategies. This paper discusses a macro, written in SAS version 9.2, that uses the DATA step hash object for customer segmentation.

INTRODUCTION
There are many techniques for segmenting customer-level data; this paper presents one such method and shows how to harness the power of hashing. A number of introductory articles from past SAS conferences outline the basic definition and syntax of the DATA step hash object. This paper highlights a macro developed to segment customers on a particular variable, in this case the Recency, Frequency and Monetary (RFM) score. The paper does not cover the process of building an RFM score.

SEGMENTATION METHODOLOGY
The objective is to segment active customers into categories of behaviour critical to the success of your company, based on a metric such as RFM (Curry et al. 2000). Almost every company has its own definition of what constitutes an active customer; in this paper, active customers are persons who have purchased goods or services from your company within the past 12 months (Curry et al. 2000). The segmentation here is based on each active customer's most recent RFM score. For each active customer a separate Recency score, Frequency score and Monetary score is created; each of the three is a decile score ranging from 1 to 10.
The three scores are then summed into a single RFM score per customer, ranging from 3 to 30. How those scores are created is outside the scope of this paper; the focus is on using the RFM scores to segment customers into a standard customer pyramid (Curry et al. 2000) with the DATA step hash object. According to Curry et al. (2000), a standard customer pyramid is formed by clustering customers into four categories based on their RFM score: Top, Big, Medium and Small. The four categories are as follows:

Top Customers - the top 1% of active customers in terms of RFM score
Big Customers - the next 4% of active customers in terms of RFM score
Medium Customers - the next 15% of active customers in terms of RFM score
Small Customers - the remaining 80% of active customers in terms of RFM score

With the help of DATA step hash tables, the RFM score is sorted in descending order from the highest score of 30 to the lowest score of 3, and the sorted score is then used to categorize active customers into the Top 1%, Big 4%, Medium 15% and Small 80%. Figure 1 below illustrates this idea visually.
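The pyramid cutoffs described above can be sketched outside SAS as well. The following is a minimal Python analogue of the same logic (the function and variable names are invented for illustration; the paper's own implementation uses the DATA step hash object):

```python
# Illustrative sketch of the customer-pyramid cutoffs: sort RFM scores
# descending and slice at the cumulative 1%, 5% and 20% marks.
def pyramid_segments(rfm_scores):
    """Split customers into Top 1% / Big 4% / Medium 15% / Small 80%."""
    ranked = sorted(rfm_scores, reverse=True)  # descending, like the ordered hash table
    count = len(ranked)
    n1 = int(count * 0.01)   # top 1%
    n2 = int(count * 0.05)   # cumulative 5%  (1% + next 4%)
    n3 = int(count * 0.20)   # cumulative 20% (5% + next 15%)
    return {
        "Top":    ranked[:n1],
        "Big":    ranked[n1:n2],
        "Medium": ranked[n2:n3],
        "Small":  ranked[n3:],
    }

# 100 hypothetical customers with RFM scores cycling through 3..30
scores = [(i % 28) + 3 for i in range(100)]
segs = pyramid_segments(scores)
print({k: len(v) for k, v in segs.items()})
# → {'Top': 1, 'Big': 4, 'Medium': 15, 'Small': 80}
```

Note that the second and third cutoffs are cumulative fractions (0.05 and 0.20, not 0.04 and 0.15), exactly as in Step 9 of the macro.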

Figure 1. Customer Segmentation Pyramid (based on RFM scores): the pyramid runs from the highest RFM score of 30 at the apex down to the lowest score of 3 at the base, with the Top (1% of customers), Big (4%), Medium (15%) and Small (80%) segments stacked top to bottom.

DATA STEP HASH OBJECT AND THE 'HASHS' MACRO
The DATA step hash object is implemented using a DATA step and object-oriented programming syntax, i.e. ObjectName.Method(parameters) (Parman 2006). The hash macro, pet-named hashs, was created to take any customer-related data set with a pre-defined metric, such as an RFM score or net sales, and split it into the four customer pyramid segments mentioned above. Note that the macro is dynamic enough that the segment percentages can be easily changed. There are three parts to the entire process. Part A uses the hashs macro to split the data into the four segments. Part B puts the segments together into one data set and creates a field that identifies each of the four segments. Part C applies general procedures such as PROC FREQ and PROC MEANS to validate the results of the segmentation exercise.

PART A
Part A is made up of Steps 1-15. Walking through each step clarifies the hash syntax and the workings of the hashs macro.

Step 1: All variables from the data source that are used to define the hash table are declared in the LENGTH statement.
Step 2: Initiate the hash table g and automatically load the table in descending key order.
Step 3: DEFINEKEY is the method the hash object uses to define the hash table's key (Parman 2006).
Step 4: DEFINEDATA is the method the hash object uses to define the values returned when the hash table is searched (Parman 2006).

Step 5: DEFINEDONE is the method that concludes the declaration of the hash table. Once DEFINEDONE is called, DEFINEKEY and DEFINEDATA can no longer be called (Secosky et al. 2006).
Step 6: Initialize the variables used in the hash object to missing values (Parman 2006).
Step 7: Initialize the hash iterator object.
Step 8: The NUM_ITEMS attribute supplies the number of items in the hash object (SAS 9.2 Language Reference).
Step 9: Create the variables n, n2, n3 and n4, which mark the cutoffs for the top 1%, next 4%, next 15% and bottom 80% of the customer segmentation pyramid. Note that to capture the next 4% segment the fraction must be set to 0.05, since the top 1% of customers has already been accounted for. Similarly, the next 15% uses 0.20, the extra 5% covering the top 1% and the next 4% of the total customer population.
Step 10: The FIRST method points to the first item in the hash object (Secosky et al. 2006), i.e. the item with the largest key.
Step 11: The first DO loop iterates from the first observation through the top 1% of observations in the hash table, reading them into the data set onepct. The NEXT method moves to the next item in the hash object.
Step 12: The second DO loop iterates from the observation after the top 1% through the cumulative 5% mark, reading the next 4% of customers into the data set fourpct. The NEXT method moves to the next item in the hash object.
Step 13: The third DO loop iterates from the observation after the cumulative 5% mark through the cumulative 20% mark, reading the next 15% of customers into the data set fifteenpct. The NEXT method moves to the next item in the hash object.
Step 14: The LAST method fetches the lowest-valued item, since the table was ordered in descending order by RFM score in Step 2.
Step 15: The last DO loop iterates from the observation after the cumulative 20% mark through the remaining 80% of observations, reading them into the data set eightypct. The PREV method moves to the previous item in the hash object, so this segment is walked from the bottom of the table upward.

PART B
The four data sets created in Part A are concatenated, and a field called segments is created that labels each segment as 'Top', 'Big', 'Medium' or 'Small'.

PART C
The results of Parts A and B are verified by running some general validation procedures on the final data set. The section titled RESULTS shows the output of the procedures applied in Part C.
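The FIRST/NEXT walk of Steps 10-13 and the LAST/PREV walk of Steps 14-15 can be mimicked with a plain index into a descending-sorted list. The following Python sketch (names invented for illustration, with 100 hypothetical (score, id) pairs) shows why the bottom segment comes out in ascending score order:

```python
# Emulate the hash iterator traversal: FIRST/NEXT for the top three
# segments, then LAST/PREV walking backwards for the bottom segment.
customers = sorted(((s, s) for s in range(1, 101)), reverse=True)  # like Step 2
count = len(customers)
n, n2, n3, n4 = int(count*.01), int(count*.05), int(count*.20), count  # Step 9

pos = 0                          # Step 10: hi.first() → largest key
onepct, fourpct, fifteenpct, eightypct = [], [], [], []
for i in range(1, n + 1):        # Step 11
    onepct.append(customers[pos]); pos += 1        # hi.next()
for i in range(n + 1, n2 + 1):   # Step 12
    fourpct.append(customers[pos]); pos += 1
for i in range(n2 + 1, n3 + 1):  # Step 13
    fifteenpct.append(customers[pos]); pos += 1

pos = count - 1                  # Step 14: hi.last() → smallest key
for i in range(n3 + 1, n4 + 1):  # Step 15: walk upward with hi.prev()
    eightypct.append(customers[pos]); pos -= 1

print(len(onepct), len(fourpct), len(fifteenpct), len(eightypct))
# → 1 4 15 80
```

Because Step 15 starts at the last (lowest-scoring) item and moves backwards, eightypct collects the Small segment from the lowest RFM score upward, while the other three data sets are collected from the highest score downward.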

THE HASH MACRO & OTHER DATA STEPS

PART A: create the hashs macro, which uses hash tables to create the four RFM score segments. The numbered comments correspond to Steps 1-15 above.

    %macro hashs (dataset, order, seg_var, id_var, seg1, seg2, seg3);
    data onepct fourpct fifteenpct eightypct;  /* data sets that will store the four RFM score segments */
       length &id_var 8 &seg_var 8;            /* 1: define variables used in the hash object */
       if _n_=1 then do;
          declare hash g (dataset: "&dataset", ordered: "&order");
                                               /* 2: hash table initiated, ORDERED set to descending */
          g.definekey("&seg_var","&id_var");   /* 3: identify fields to use as keys */
          g.definedata("&seg_var","&id_var");  /* 4: identify fields to use as data */
          g.definedone();                      /* 5: complete hash table definition */
          call missing (&id_var, &seg_var);    /* 6: initialize the variables to fill */
          declare hiter hi('g');               /* 7: hash iterator object initiated */
       end;
       count=g.num_items;                      /* 8: number of items in the hash object */
       n =int(count * &seg1);                  /* 9: top 1% of customers */
       n2=int(count * &seg2);                  /*    cumulative 5% of customers */
       n3=int(count * &seg3);                  /*    cumulative 20% of customers */
       n4=int(count);                          /*    all customers */
       hi.first();                             /* 10: point iterator to first item in the hash table */
       do i=1 to n;                            /* 11: iterate through items 1 to n */
          output onepct;                       /*     output sent to onepct data set */
          hi.next();                           /*     point iterator to next item */
       end;
       do i=n+1 to n2;                         /* 12: iterate through items n+1 to n2 */
          output fourpct;                      /*     output sent to fourpct data set */
          hi.next();
       end;
       do i=n2+1 to n3;                        /* 13: iterate through items n2+1 to n3 */
          output fifteenpct;                   /*     output sent to fifteenpct data set */
          hi.next();
       end;
       hi.last();                              /* 14: point iterator to last item in the hash table */
       do i=n3+1 to n4;                        /* 15: iterate through items n3+1 to n4 */
          output eightypct;                    /*     output sent to eightypct data set */
          hi.prev();                           /*     point iterator to previous item */
       end;
    run;
    %mend hashs;

    %hashs(rfm_score_dataset, descending, rfm_score, customer_no, .01, .05, .20);

PART B: merge the segmented data sets into one data set and label each segment.

    data regroup;
       set onepct(in=a) fourpct(in=b) fifteenpct(in=c) eightypct(in=d);
       length segments $10.;
       if a then segments='1. Top';
       else if b then segments='2. Big';
       else if c then segments='3. Medium';
       else if d then segments='4. Small';
       drop n n2 n3 n4 i count;
    run;

PART C: summarize the customer segments created above by RFM score. See the RESULTS section for the output of the code below.

    proc freq data=regroup noprint;
       tables segments / nocol norow nocum out=reg;
    run;

    proc print data=reg noobs;
       title 'RFM Score Segments';
       format count comma12. percent 8.;
    run;

    proc means data=regroup max min mean median maxdec=2 missing chartype;
       class segments;
       var rfm_score;
       title 'RFM Score Segments Statistics';
    run;

RESULTS

RFM Score Segments

    segments     COUNT   PERCENT
    1. Top           8         1
    2. Big          35         4
    3. Medium      131        15
    4. Small       699        80

RFM Score Segments Statistics

The MEANS Procedure
Analysis Variable : rfm_score

    segments    N Obs   Maximum   Minimum     Mean   Median
    1. Top          8     30.00     28.00    29.13    29.00
    2. Big         35     28.00     26.00    27.17    27.00
    3. Medium     131     26.00     23.00    24.92    25.00
    4. Small      699     23.00      3.00    12.04    10.00

CONCLUSIONS
In summary, this paper has outlined a method of using a hash macro to segment customers by RFM score into the customer pyramid segments 'Top', 'Big', 'Medium' and 'Small'. The exercise can be repeated to build customer pyramid segments for two time periods, e.g. comparing customers and the segments they belong to for the years 2009 and 2008. Comparing the two periods yields a migration matrix, giving a dynamic view of customer behaviour over time (Curry et al. 2000). The migration matrix can help answer questions such as: compared with last year, how many customers were newly acquired, upgraded, dropped to inactive status, stayed within the same pyramid level, or were revived from inactive to active? The migration matrix can also inform a contact plan for ongoing customer behaviour; forecasting customers per pyramid segment and tying that forecast to contact capacity and marketing budgets for next year could yield an optimal targeting strategy.

REFERENCES
Curry, J. and Curry, A. (2000). The Customer Marketing Method. New York, NY: Free Press.
Parman, B. (2006). "How to Implement the SAS DATA Step Hash Object." Proceedings of the 2006 SouthEast SAS Users Group Conference. Available at http://analytics.ncsu.edu/sesug/2006/sc19_06.pdf
SAS 9.2 Language Reference: Dictionary, Second Edition, NUM_ITEMS Attribute. Available at http://support.sas.com/documentation/cdl/en/lrdict/62618/html/default/a002588144.htm
Secosky, J. and Bloom, J. (2006). "Getting Started with the DATA Step Hash Iterator." Proceedings of the 2006 South Central SAS Users Group Conference. Available at http://www.scsug.org/scsugproceedings/2006/secoskyjason.pdf

ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at shan_ali_rasul@yahoo.ca