Data Preparation for Analytics Dr. Gerhard Svolba SAS-Austria Vienna http://sascommunity.org/wiki/gerhard_svolba
Agenda Data Preparation for Analytics General Thoughts Data Structures for Analytics Case Study: Building an efficient one-row-persubject data mart Data Management with the SAS Language and with SAS Tools Closing Thoughts
Some Words on Data Preparation Is for techies Is just coding Is boring Consumes up to 80 % of the project Was not in the focus of data mining literature so far Is something that SAS can excellently do Is vital to the quality of the project
The Analysis Process: From Raw Data to Actionable Results Data Sources Data Preparation Analytic Modeling Results and Actions Different Data Sources Relational Models, Star Schemes Merges, Denormalisation Derived Variables Transpositions, Aggregations Modeling, Parameter Estimation, Tuning, Predictions, Classifications, Clustering Usage of Results Profiling Interpretations Data Availability + Adequate Preparation Clever + Modeling = Good Results
Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance
Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance
Challenging a Business Question Can you calculate the probability that a customer will cancel the usage of product X? Questions: Cancellation or decline in usage? How do we treat usage of more advanced products? Include non-volontary cancellation? How quickly can retention measures take place? Do you want to have probabilites per customer or a list of the 10.000 high risk customers or risk classes in general?
Business, Statistican and IT: The Optimal Triangle!? I have formulated my business question. Are there any reasons, not to be back with results in 2 days? Analytic Data Preparaton I would like to have an analytic database with a high number of attributes and an environment to perform both, data management and analysis. Simon Retention Manager Name me the list of attributes and derived variables that you will use in your final model and which I have to deliver periodically. Daniele Quantitative Expert Elias IT Expert
Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance
Business Questions Analytical Models Event prediction (Churn, Fraud, Delinquency, Response, ) Value prediction (Purchase Size, Claim Amount, ) Clustering (Segmentation, ) Market Basket Analysis (Association Analysis, ) Time Series Forecasting
Analysis Subjects and Multiple Observations Analysis subjects are entities that are being analyzed and the analysis results are interpreted in their context. Multiple observations per analysis subject Repeated measurements over time Multiple observations because of hierarchical relationships
Main Types of Data Marts One-Row-per- Subject Data Mart Multiple-Row-per- Subject Data Mart Longitudinal Data Mart
Data Mart Structures Data Mart Structure for the Analysis Structure of the source data: Multiple observations per analysis subject exist? NO YES One-Row-per-Subject Data Mart Data Mart with one row per subject is created. Information of multiple observations has to be aggregated or transposed per analysis subject Multiple-Row-per-Subject Data Mart (Key-Value Table) Data Mart with one row per multiple observations is created. Variables at the analysis subject level are duplicated for each repetition
The One-Row-Per-Subject Data Mart Required by many statistical methods Regression Analysis, Neural Networks, Decision Trees, Survival analysis, Cluster analysis, Most prominent data mart structure in data mining Event prediction (Churn, Fraud, Delinquency, Response, ) Value prediction (Purchase Size, Claim Amount, ) Segmentation (Clustering, )
The One-Row-Per-Subject Paradigm
Transposing Data to One-Row-Per-Subject PROC TRANSPOSE DATA = long OUT = wide(drop = _name_) PREFIX = weight; BY id ; VAR weight; ID time; RUN;
Clever Aggregations Interval Data Static Aggregation Correlation of Values Course over Time Concentration of Values Categorical Data Frequency Counts Concatenated Frequencies Total and Distinct Counts
Correlation of Values How do the monthly values per subject correlate with the overall mean per month?
Measures for the Course over Time PROC REG DATA = longitud NOPRINT OUTEST=Est_LongTerm(KEEP = CustID month RENAME = (month=longterm)); MODEL usage = month; BY CustID; RUN; PROC REG DATA = longitud NOPRINT OUTEST=Est_ShortTerm(KEEP = CustID month RENAME = (month=shortterm)); MODEL usage = month; BY CustID; WHERE month in (5 6); RUN;
Concentration of Values Concentration = proportion of the sum of the top 50 % sub-hierarchies / the total sum over all sub hierarchies
Categorical Variables: Frequency Counts Source Data Absolute and Relative Frequencies Counts and Distinct Counts
Categorical Variables: Concatenated Frequencies Account_ Cumulative Cumulative RowPct Frequency Percent Frequency Percent --------------------------------------------------------------- 0_100_0_0 12832 30.61 12832 30.61 100_0_0_0 9509 22.69 22341 53.30 50_0_0_50 4898 11.69 27239 64.98 33_0_0_67 1772 4.23 29011 69.21 0_0_100_0 1684 4.02 30695 73.23 67_0_0_33 1426 3.40 32121 76.63 0_0_50_50 861 2.05 32982 78.69 50_0_50_0 681 1.62 33663 80.31
Hierarchies: Aggregating Up, Copying Down
Main Types of Data Marts One-Row-per- Subject Data Mart Multiple-Row-per- Subject Data Mart Longitudinal Data Mart
The standard form for longitudinal data The inter-leaved longitudinal data mart
The cross-sectional dimension data mart
Longitudinal Data Marts Aggregating data at the right level Transactional Data Finest Granularity Most Appropriate Aggregation Level Defining cross sectional groups Alignment of time values Creation of event indicators and input variables
Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance
Case Study: Business Question Predict customers that have a high probability to leave the company Derive target variable Cancellation YES/NO from the monthly value segment history (Entry 8. LOST ) Create a one-row-per-subject data mart for data mining analysis
Case Study: Data and Data Model Customer data: demographic and customer baseline data Account data: information customer accounts Leasing data: data on leasing information Call Center data: data on Call center contacts Score data: data of value segment scores
Considerations for Predictive Modeling Only consider customers that are still active at this point in time Snapshot Date: June 30th, 2003 Do not use input data after this point in time Define Cancellation YES/NO based on values in the target window Observation Window Offset Target Window April 2003 May 2003 June 2003 July 2003 August 2003
Using Data from the CALLCENTER Table %let snapdate = 30JUN2003 d; PROC FREQ DATA = callcenter NOPRINT; TABLE CustID / OUT = CallCenterComplaints (DROP = Percent RENAME = (Count = Complaints)); WHERE Category = 'Complaint' and datepart(date) <= &snapdate; RUN;
Using Data from the SCORES-Table %let snapdate = 30JUN2003 d; DATA ScoreFuture(RENAME = (ValueSegment = FutureValueSegment)) ScoreActual ScoreLastMonth(RENAME = (ValueSegment = LastValueSegment)); SET Scores; DATE = INTNX('MONTH',Date,0,'END'); DROP Date; IF Date = &snapdate THEN OUTPUT ScoreActual; ELSE IF Date = INTNX('MONTH',&snapdate,-1) THEN OUTPUT ScoreLastMonth; ELSE IF Date = INTNX('MONTH',&snapdate,2) THEN OUTPUT ScoreFuture; RUN; Observation Window Offset Target Window April 2003 May 2003 June 2003 July 2003 August 2003
DATA CustomerMart; ATTRIB /* Customer Baseline */ CustID FORMAT = 8. LABEL = "Customer ID" Birthdate FORMAT = DATE9. LABEL = "Date of Birth" Alter FORMAT = 8. LABEL = "Age (years)" Gender FORMAT = $6. LABEL = "Gender" MaritalStatus FORMAT = $10. LABEL = "Marital Status" Title FORMAT = $10. LABEL = "Academic Title" HasTitle FORMAT = 8. LABEL = "Has Title? 0/1" Branch FORMAT = $5. LABEL = "Branch Name"; MERGE Customer (IN = InCustomer) AccountSum (IN = InAccounts) AccountTypes LeasingSum (IN = InLeasing) CallCenterContacts (IN = InCallCenter) CallCenterComplaints ScoreFuture ScoreActual ScoreLastMonth; BY CustID; IF InCustomer;
/* Customer Baseline */ HasTitle = (Title ne ""); Alter = (&Snapdate-Birthdate)/365.25; CustomerMonths = (&Snapdate- CustomerSince)/(365.25/12); /* Accounts */ HasAccounts = InAccounts; LoanPct = Loan / BalanceSum * 100; SavingAccountPct = SavingAccount / BalanceSum * 100; FundsPct = Funds / BalanceSum * 100; /* Leasing */ HasLeasing = InLeasing; /* Call Center */ HasCallCenter = InCallCenter; ComplaintPct = Complaints / Calls *100; /* Value Segment */ Cancel = (FutureValueSegment = '8. LOST'); ChangeValueSegment = (ValueSegment = LastValueSegment); RUN;
Screenshots of the Resulting Data Mart
The Power of the SAS Language for Data Preparation
SAS Data Step vs. SQL Transposing a table PROC TRANSPOSE DATA = sampsio.assocs(obs=21) OUT = assoc_tp (DROP = _name_); BY customer; ID Product; RUN;
Changing between longitdutinal data structures Standard Form Interleaved
Selection of the first and last line per customer CustID Month Value FirstValue LastValue 1 7 45 1 0 1 8 34 0 0 1 9 5 0 1 2 7 34 1 0 2 8 32 0 0 2 9 44 0 1 3 7 56 1 0 3 8 54 0 0 3 9 32 0 1
Creating a sequential number and summing items CustID Date Points PurchNo Cum_Poi 1 10.03.2004 45 1 45 1 04.04.2004 10 2 55 1 20.04.2004 20 3 75 1 16.05.2004 18 4 93 1 01.06.2004 5 5 98 2 01.02.2004 10 1 10 2 19.03.2004 30 2 40 3 05.08.2004 4 1 4 3 16.08.2004 16 2 20 3 31.08.2004 12 3 32 3 10.09.2004 20 4 52
Copyingomitteddata 32 9 54 8 56 7 m 46 3 44 9 32 8 34 7 w 37 2 5 9 34 8 45 7 m 26 1 Value Month Gender Age CustID 32 9 m 46 3 54 8 m 46 3 56 7 m 46 3 44 9 w 37 2 32 8 w 37 2 34 7 w 37 2 5 9 m 26 1 34 8 m 26 1 45 7 m 26 1 Value Month Gender Age CustID
Shift down a column for k positions Date 01.01.2004 01.02.2004 01.03.2004 01.04.2004 01.05.2004 01.06.2004 Value 45 34 5 34 32 44 Value_ PrevDay. 45 34 5 34 32
Analytic Data Preparation in SAS Enterprise Miner
Analytic Data Preparation in SAS Enterprise Miner Input Data Source Node Sample Node Data Partition Node Metadata Node Filter Node Transform Variables Node Impute Node SAS Code Node Principal Components Node
Working with Multiple-Row-per-Subject Data in SAS Enterprise Miner Time Series Node Association Node Merge Node
Extending SAS Enterprise Miner with Data Preparation for Analytics Nodes Extension Nodes Correlation TrendRegression Concentration CategoryCount
Analytic Data Preparation in SAS Enterprise Miner - Summary Documentation of the data preparation process as a whole in the process flow diagram Definition of variables metadata that are used in the analysis Automatic Creation of Dummy-Variables for categorical data Powerful data transformations in the filter node, transform variables node and impute node. Possibility to perform association analysis and time-series-analysis Creation of score code from all nodes in SAS Enterprise Miner as SAS datastep code
Analytic Data Preparation in SAS Forecast Studio
Analytic Data Preparation in SAS Forecast Studio Convert transactional data into time series data Treatment of missing values Alignment of date values in intervals Aggregation of data on various levels Handling of events Allows users to create event definitions, assign events to selected series in the project and delete events Users can specify event duration, shape, and recurrence options. Pre-defined common events and holidays are available for inclusion in the forecasting models
Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance
SAS Data Integration Studio
SAS Data Integration Studio
SAS Data Integration Studio General Features Comprehensive transformation library Graphical user interface, drag & drop, wizard driven Multi-developer support: Check-in/check-out, change management Impact analysis Documentation of the data management process Data Mining Model Scoring Register Model in SAS metadata repository Apply SAS Enterprise Miner model to new data source Creates the target table definition in the metadata
SAS DI Studio Analytic Features Data Mining Model Scoring Register Model in SAS metadata repository Apply SAS Enterprise Miner model to new data source Creates the target table definition in the metadata Forecast Analysis Transformation Allows to perform time series analysis Based on HPF (high performance forecasting) procedure Allows to integrate the forecast step into the data flow process in the metadata
Summary Analytic Data Preparation is a discipline, not a incommodious necessity Analytic Data Preparation is more than just coding The One-Row-Per-Subject Paradigm Central in data mining and predictive modeling Do not stop with simple transpose, summing or averaging Clever aggregations can be the key success factor Predictive Modeling and Historic Data: Which data are you allowed to use for modeling? SAS combines powerful data management functionality and market leading analytics in one integrated package SAS Tools like SAS DI-Studio, SAS Enterprise Miner and SAS Forecast Studio assist you in analytic data preparation
The commercial break
"It is exciting to see a book completely devoted to data preparation in SAS." Jin Li Statistician Capital One Financial Services This is a must read for anyone, who prepares data for data mining and analytics. Not only you receive ideas and suggestions for important derived variables for your datamart, you also get a lot of insight on the business rationale behind the scenes. The author does a great job in explaining you step by step the world of data preparation for data mining and shows a lot of example code and macros. Christine Hallwirth MHE & Partners Gerhard Svolba's book "Data Preparation for Analytics" is an "owner's manual" for data miners, business analysts and all who prefer to be in charge of and responsible for their own datasets. This book is for those who are not afraid of data, who understand data, and for whom rolling up their sleeves and getting their hands into data is an integral part of analytics and predictive modeling. Lessia Shajenko (American Bank)
Recommended Reading Data Preparation for Analytics by Gerhard Svolba SAS-Press (#60502) Business Rationale Concepts Coding Examples
Downloads Data Preparation for Analytics http://www.sas.com/apps/pubscat/bookdetails.jsp?pc=60502 http://support.sas.com/publishing/bbu/companion_site/60502.html http://www2.sas.com/proceedings/sugi31/078-31.pdf http://sascommunity.org/wiki/data_preparation_for_analytics
Questions and Contact Dr. Gerhard Svolba Email: gerhard.svolba@aut.sas.com