Data Preparation for Analytics

Similar documents
The Consequences of Poor Data Quality on Model Accuracy

SAS Enterprise Miner Extension Nodes by Gerhard Svolba

Data Acquisition and Integration

Now, Data Mining Is Within Your Reach

SAS Enterprise Miner : Tutorials and Examples

Enterprise Miner Tutorial Notes 2 1

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Tessera Rapid Modeling Environment: Production-Strength Data Mining Solution for Terabyte-Class Relational Data Warehouses

DATA MINING AND WAREHOUSING

Data Mining Overview. CHAPTER 1 Introduction to SAS Enterprise Miner Software

Data Science. Data Analyst. Data Scientist. Data Architect

SAS Factory Miner 14.2: User s Guide

Data Analysis and Data Science

Learning Objectives for Data Concept and Visualization

Data Warehouses Chapter 12. Class 10: Data Warehouses 1

Guide Users along Information Pathways and Surf through the Data

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

OLAP Introduction and Overview

Taking Your Application Design to the Next Level with Data Mining

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

ENTERPRISE MINER: 1 DATA EXPLORATION AND VISUALISATION

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

SESUG 2014 IT-82 SAS-Enterprise Guide for Institutional Research and Other Data Scientists Claudia W. McCann, East Carolina University.

SAS (Statistical Analysis Software/System)

Note: In the presentation I should have said "baby registry" instead of "bridal registry," see

The DMSPLIT Procedure

SAS Enterprise Miner 7.1

Overview. Data Mining for Business Intelligence. Shmueli, Patel & Bruce

INTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...

2997 Yarmouth Greenway Drive, Madison, WI Phone: (608) Web:

Predictive Modeling with SAS Enterprise Miner

Tasks Menu Reference. Introduction. Data Management APPENDIX 1

TDWI strives to provide course books that are content-rich and that serve as useful reference documents after a class has ended.

Introduction to Data Mining and Data Analytics

Choosing the Right Procedure

Slice Intelligence!

Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?

Oracle9i Data Mining. Data Sheet August 2002

Lasso Your Business Users by Designing Information Pathways to Optimize Standardized Reporting in SAS Visual Analytics

SAS IT Resource Management Forecasting. Setup Specification Document. A SAS White Paper

SAS Data Integration Studio 3.3. User s Guide

Analytics and Visualization

Oracle Big Data Discovery

SAS Publishing SAS. Forecast Studio 1.4. User s Guide

DB2 SQL Class Outline

Using PROC SQL to Generate Shift Tables More Efficiently

Types of Data Mining

SAS Infrastructure for Risk Management 3.4: User s Guide

KNOWLEDGE DISCOVERY AND DATA MINING

SAS (Statistical Analysis Software/System)

Introducing SAS Model Manager 15.1 for SAS Viya

DATA WAREHOUSE- MODEL QUESTIONS

SAS (Statistical Analysis Software/System)

Define the data source for the Organic Purchase Analysis project.

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

Certkiller.A QA

A Simple Time Series Macro Scott Hanson, SVP Risk Management, Bank of America, Calabasas, CA

Data Management - 50%

Enterprise Miner Software: Changes and Enhancements, Release 4.1

Employer Reporting. User Guide

Data Science Course Content

Data Quality Control for Big Data: Preventing Information Loss With High Performance Binning

Data Quality Control: Using High Performance Binning to Prevent Information Loss

Data warehouses Decision support The multidimensional model OLAP queries

SAS ENTERPRISE GUIDE USER INTERFACE

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

CS490D: Introduction to Data Mining Prof. Chris Clifton

ABSTRACT INTRODUCTION MACRO. Paper RF

Data Warehouse and Mining

SAS E-MINER: AN OVERVIEW

Lecture 18. Business Intelligence and Data Warehousing. 1:M Normalization. M:M Normalization 11/1/2017. Topics Covered

Reducing Credit Union Member Attrition with Predictive Analytics

Base and Advance SAS

STAT 3304/5304 Introduction to Statistical Computing. Introduction to SAS

JMP Clinical. Release Notes. Version 5.0

Data warehouse and Data Mining

CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING

SAS Web Report Studio 3.1

Techdata Solution. SAS Analytics (Clinical/Finance/Banking)

EMC Ionix ControlCenter (formerly EMC ControlCenter) 6.0 StorageScope

Paper HOW-06. Tricia Aanderud, And Data Inc, Raleigh, NC

Data Preprocessing. Slides by: Shree Jaswal

Choosing the Right Procedure

BASICS BEFORE STARTING SAS DATAWAREHOSING Concepts What is ETL ETL Concepts What is OLAP SAS. What is SAS History of SAS Modules available SAS

The Definitive Guide to Preparing Your Data for Tableau

Mapping Clinical Data to a Standard Structure: A Table Driven Approach

A Cross-national Comparison Using Stacked Data

An Introduction to Data Mining in Institutional Research. Dr. Thulasi Kumar Director of Institutional Research University of Northern Iowa

Optimizing Your Analytics Life Cycle with SAS & Teradata. Rick Lower

JMP Book Descriptions

Customer Clustering using RFM analysis

Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

Data Mining Concepts & Techniques

Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help Get help interpreting a table

BC Hydro Customer Classification Model Using SAS 9.2

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

DC Area Business Objects Crystal User Group (DCABOCUG) Data Warehouse Architectures for Business Intelligence Reporting.

Inline Processing Engine User Guide. Release: August 2017 E

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

Transcription:

Data Preparation for Analytics Dr. Gerhard Svolba SAS-Austria Vienna http://sascommunity.org/wiki/gerhard_svolba

Agenda Data Preparation for Analytics General Thoughts Data Structures for Analytics Case Study: Building an efficient one-row-persubject data mart Data Management with the SAS Language and with SAS Tools Closing Thoughts

Some Words on Data Preparation Is for techies Is just coding Is boring Consumes up to 80 % of the project Was not in the focus of data mining literature so far Is something that SAS can excellently do Is vital to the quality of the project

The Analysis Process: From Raw Data to Actionable Results Data Sources Data Preparation Analytic Modeling Results and Actions Different Data Sources Relational Models, Star Schemes Merges, Denormalisation Derived Variables Transpositions, Aggregations Modeling, Parameter Estimation, Tuning, Predictions, Classifications, Clustering Usage of Results Profiling Interpretations Data Availability + Adequate Preparation Clever + Modeling = Good Results

Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance

Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance

Challenging a Business Question Can you calculate the probability that a customer will cancel the usage of product X? Questions: Cancellation or decline in usage? How do we treat usage of more advanced products? Include non-volontary cancellation? How quickly can retention measures take place? Do you want to have probabilites per customer or a list of the 10.000 high risk customers or risk classes in general?

Business, Statistican and IT: The Optimal Triangle!? I have formulated my business question. Are there any reasons, not to be back with results in 2 days? Analytic Data Preparaton I would like to have an analytic database with a high number of attributes and an environment to perform both, data management and analysis. Simon Retention Manager Name me the list of attributes and derived variables that you will use in your final model and which I have to deliver periodically. Daniele Quantitative Expert Elias IT Expert

Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance

Business Questions Analytical Models Event prediction (Churn, Fraud, Delinquency, Response, ) Value prediction (Purchase Size, Claim Amount, ) Clustering (Segmentation, ) Market Basket Analysis (Association Analysis, ) Time Series Forecasting

Analysis Subjects and Multiple Observations Analysis subjects are entities that are being analyzed and the analysis results are interpreted in their context. Multiple observations per analysis subject Repeated measurements over time Multiple observations because of hierarchical relationships

Main Types of Data Marts One-Row-per- Subject Data Mart Multiple-Row-per- Subject Data Mart Longitudinal Data Mart

Data Mart Structures Data Mart Structure for the Analysis Structure of the source data: Multiple observations per analysis subject exist? NO YES One-Row-per-Subject Data Mart Data Mart with one row per subject is created. Information of multiple observations has to be aggregated or transposed per analysis subject Multiple-Row-per-Subject Data Mart (Key-Value Table) Data Mart with one row per multiple observations is created. Variables at the analysis subject level are duplicated for each repetition

The One-Row-Per-Subject Data Mart Required by many statistical methods Regression Analysis, Neural Networks, Decision Trees, Survival analysis, Cluster analysis, Most prominent data mart structure in data mining Event prediction (Churn, Fraud, Delinquency, Response, ) Value prediction (Purchase Size, Claim Amount, ) Segmentation (Clustering, )

The One-Row-Per-Subject Paradigm

Transposing Data to One-Row-Per-Subject PROC TRANSPOSE DATA = long OUT = wide(drop = _name_) PREFIX = weight; BY id ; VAR weight; ID time; RUN;

Clever Aggregations Interval Data Static Aggregation Correlation of Values Course over Time Concentration of Values Categorical Data Frequency Counts Concatenated Frequencies Total and Distinct Counts

Correlation of Values How do the monthly values per subject correlate with the overall mean per month?

Measures for the Course over Time PROC REG DATA = longitud NOPRINT OUTEST=Est_LongTerm(KEEP = CustID month RENAME = (month=longterm)); MODEL usage = month; BY CustID; RUN; PROC REG DATA = longitud NOPRINT OUTEST=Est_ShortTerm(KEEP = CustID month RENAME = (month=shortterm)); MODEL usage = month; BY CustID; WHERE month in (5 6); RUN;

Concentration of Values Concentration = proportion of the sum of the top 50 % sub-hierarchies / the total sum over all sub hierarchies

Categorical Variables: Frequency Counts Source Data Absolute and Relative Frequencies Counts and Distinct Counts

Categorical Variables: Concatenated Frequencies Account_ Cumulative Cumulative RowPct Frequency Percent Frequency Percent --------------------------------------------------------------- 0_100_0_0 12832 30.61 12832 30.61 100_0_0_0 9509 22.69 22341 53.30 50_0_0_50 4898 11.69 27239 64.98 33_0_0_67 1772 4.23 29011 69.21 0_0_100_0 1684 4.02 30695 73.23 67_0_0_33 1426 3.40 32121 76.63 0_0_50_50 861 2.05 32982 78.69 50_0_50_0 681 1.62 33663 80.31

Hierarchies: Aggregating Up, Copying Down

Main Types of Data Marts One-Row-per- Subject Data Mart Multiple-Row-per- Subject Data Mart Longitudinal Data Mart

The standard form for longitudinal data The inter-leaved longitudinal data mart

The cross-sectional dimension data mart

Longitudinal Data Marts Aggregating data at the right level Transactional Data Finest Granularity Most Appropriate Aggregation Level Defining cross sectional groups Alignment of time values Creation of event indicators and input variables

Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance

Case Study: Business Question Predict customers that have a high probability to leave the company Derive target variable Cancellation YES/NO from the monthly value segment history (Entry 8. LOST ) Create a one-row-per-subject data mart for data mining analysis

Case Study: Data and Data Model Customer data: demographic and customer baseline data Account data: information customer accounts Leasing data: data on leasing information Call Center data: data on Call center contacts Score data: data of value segment scores

Considerations for Predictive Modeling Only consider customers that are still active at this point in time Snapshot Date: June 30th, 2003 Do not use input data after this point in time Define Cancellation YES/NO based on values in the target window Observation Window Offset Target Window April 2003 May 2003 June 2003 July 2003 August 2003

Using Data from the CALLCENTER Table %let snapdate = 30JUN2003 d; PROC FREQ DATA = callcenter NOPRINT; TABLE CustID / OUT = CallCenterComplaints (DROP = Percent RENAME = (Count = Complaints)); WHERE Category = 'Complaint' and datepart(date) <= &snapdate; RUN;

Using Data from the SCORES-Table %let snapdate = 30JUN2003 d; DATA ScoreFuture(RENAME = (ValueSegment = FutureValueSegment)) ScoreActual ScoreLastMonth(RENAME = (ValueSegment = LastValueSegment)); SET Scores; DATE = INTNX('MONTH',Date,0,'END'); DROP Date; IF Date = &snapdate THEN OUTPUT ScoreActual; ELSE IF Date = INTNX('MONTH',&snapdate,-1) THEN OUTPUT ScoreLastMonth; ELSE IF Date = INTNX('MONTH',&snapdate,2) THEN OUTPUT ScoreFuture; RUN; Observation Window Offset Target Window April 2003 May 2003 June 2003 July 2003 August 2003

DATA CustomerMart; ATTRIB /* Customer Baseline */ CustID FORMAT = 8. LABEL = "Customer ID" Birthdate FORMAT = DATE9. LABEL = "Date of Birth" Alter FORMAT = 8. LABEL = "Age (years)" Gender FORMAT = $6. LABEL = "Gender" MaritalStatus FORMAT = $10. LABEL = "Marital Status" Title FORMAT = $10. LABEL = "Academic Title" HasTitle FORMAT = 8. LABEL = "Has Title? 0/1" Branch FORMAT = $5. LABEL = "Branch Name"; MERGE Customer (IN = InCustomer) AccountSum (IN = InAccounts) AccountTypes LeasingSum (IN = InLeasing) CallCenterContacts (IN = InCallCenter) CallCenterComplaints ScoreFuture ScoreActual ScoreLastMonth; BY CustID; IF InCustomer;

/* Customer Baseline */ HasTitle = (Title ne ""); Alter = (&Snapdate-Birthdate)/365.25; CustomerMonths = (&Snapdate- CustomerSince)/(365.25/12); /* Accounts */ HasAccounts = InAccounts; LoanPct = Loan / BalanceSum * 100; SavingAccountPct = SavingAccount / BalanceSum * 100; FundsPct = Funds / BalanceSum * 100; /* Leasing */ HasLeasing = InLeasing; /* Call Center */ HasCallCenter = InCallCenter; ComplaintPct = Complaints / Calls *100; /* Value Segment */ Cancel = (FutureValueSegment = '8. LOST'); ChangeValueSegment = (ValueSegment = LastValueSegment); RUN;

Screenshots of the Resulting Data Mart

The Power of the SAS Language for Data Preparation

SAS Data Step vs. SQL Transposing a table PROC TRANSPOSE DATA = sampsio.assocs(obs=21) OUT = assoc_tp (DROP = _name_); BY customer; ID Product; RUN;

Changing between longitdutinal data structures Standard Form Interleaved

Selection of the first and last line per customer CustID Month Value FirstValue LastValue 1 7 45 1 0 1 8 34 0 0 1 9 5 0 1 2 7 34 1 0 2 8 32 0 0 2 9 44 0 1 3 7 56 1 0 3 8 54 0 0 3 9 32 0 1

Creating a sequential number and summing items CustID Date Points PurchNo Cum_Poi 1 10.03.2004 45 1 45 1 04.04.2004 10 2 55 1 20.04.2004 20 3 75 1 16.05.2004 18 4 93 1 01.06.2004 5 5 98 2 01.02.2004 10 1 10 2 19.03.2004 30 2 40 3 05.08.2004 4 1 4 3 16.08.2004 16 2 20 3 31.08.2004 12 3 32 3 10.09.2004 20 4 52

Copyingomitteddata 32 9 54 8 56 7 m 46 3 44 9 32 8 34 7 w 37 2 5 9 34 8 45 7 m 26 1 Value Month Gender Age CustID 32 9 m 46 3 54 8 m 46 3 56 7 m 46 3 44 9 w 37 2 32 8 w 37 2 34 7 w 37 2 5 9 m 26 1 34 8 m 26 1 45 7 m 26 1 Value Month Gender Age CustID

Shift down a column for k positions Date 01.01.2004 01.02.2004 01.03.2004 01.04.2004 01.05.2004 01.06.2004 Value 45 34 5 34 32 44 Value_ PrevDay. 45 34 5 34 32

Analytic Data Preparation in SAS Enterprise Miner

Analytic Data Preparation in SAS Enterprise Miner Input Data Source Node Sample Node Data Partition Node Metadata Node Filter Node Transform Variables Node Impute Node SAS Code Node Principal Components Node

Working with Multiple-Row-per-Subject Data in SAS Enterprise Miner Time Series Node Association Node Merge Node

Extending SAS Enterprise Miner with Data Preparation for Analytics Nodes Extension Nodes Correlation TrendRegression Concentration CategoryCount

Analytic Data Preparation in SAS Enterprise Miner - Summary Documentation of the data preparation process as a whole in the process flow diagram Definition of variables metadata that are used in the analysis Automatic Creation of Dummy-Variables for categorical data Powerful data transformations in the filter node, transform variables node and impute node. Possibility to perform association analysis and time-series-analysis Creation of score code from all nodes in SAS Enterprise Miner as SAS datastep code

Analytic Data Preparation in SAS Forecast Studio

Analytic Data Preparation in SAS Forecast Studio Convert transactional data into time series data Treatment of missing values Alignment of date values in intervals Aggregation of data on various levels Handling of events Allows users to create event definitions, assign events to selected series in the project and delete events Users can specify event duration, shape, and recurrence options. Pre-defined common events and holidays are available for inclusion in the forecasting models

Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Efficient and Clever SAS Coding Documentation and Maintainance

SAS Data Integration Studio

SAS Data Integration Studio

SAS Data Integration Studio General Features Comprehensive transformation library Graphical user interface, drag & drop, wizard driven Multi-developer support: Check-in/check-out, change management Impact analysis Documentation of the data management process Data Mining Model Scoring Register Model in SAS metadata repository Apply SAS Enterprise Miner model to new data source Creates the target table definition in the metadata

SAS DI Studio Analytic Features Data Mining Model Scoring Register Model in SAS metadata repository Apply SAS Enterprise Miner model to new data source Creates the target table definition in the metadata Forecast Analysis Transformation Allows to perform time series analysis Based on HPF (high performance forecasting) procedure Allows to integrate the forecast step into the data flow process in the metadata

Summary Analytic Data Preparation is a discipline, not a incommodious necessity Analytic Data Preparation is more than just coding The One-Row-Per-Subject Paradigm Central in data mining and predictive modeling Do not stop with simple transpose, summing or averaging Clever aggregations can be the key success factor Predictive Modeling and Historic Data: Which data are you allowed to use for modeling? SAS combines powerful data management functionality and market leading analytics in one integrated package SAS Tools like SAS DI-Studio, SAS Enterprise Miner and SAS Forecast Studio assist you in analytic data preparation

The commercial break

"It is exciting to see a book completely devoted to data preparation in SAS." Jin Li Statistician Capital One Financial Services This is a must read for anyone, who prepares data for data mining and analytics. Not only you receive ideas and suggestions for important derived variables for your datamart, you also get a lot of insight on the business rationale behind the scenes. The author does a great job in explaining you step by step the world of data preparation for data mining and shows a lot of example code and macros. Christine Hallwirth MHE & Partners Gerhard Svolba's book "Data Preparation for Analytics" is an "owner's manual" for data miners, business analysts and all who prefer to be in charge of and responsible for their own datasets. This book is for those who are not afraid of data, who understand data, and for whom rolling up their sleeves and getting their hands into data is an integral part of analytics and predictive modeling. Lessia Shajenko (American Bank)

Recommended Reading Data Preparation for Analytics by Gerhard Svolba SAS-Press (#60502) Business Rationale Concepts Coding Examples

Downloads Data Preparation for Analytics http://www.sas.com/apps/pubscat/bookdetails.jsp?pc=60502 http://support.sas.com/publishing/bbu/companion_site/60502.html http://www2.sas.com/proceedings/sugi31/078-31.pdf http://sascommunity.org/wiki/data_preparation_for_analytics

Questions and Contact Dr. Gerhard Svolba Email: gerhard.svolba@aut.sas.com