SAS Enterprise Miner: What does the future hold?
David Duling, EM Development Director, SAS Inc.
Sascha Schubert, Product Manager Data Mining, SAS International

Topics for Discussion
- EM 4.2/SAS 9.0: AF/SCL Architecture
- EM 5.0/SAS 9.1: 3-tier Architecture
- Demo of the Alpha EM 5.0 Java UI

EM: Two Paths for Two Goals
- Evolutionary Development of Data Mining Functionality
  - Keep up the quality
  - Upgrade release for current sites
  - Stay on top of the market
- Revolutionary Development of Data Mining Architecture
  - Address scalability and performance
  - Address the limitations of current architecture
  - Make new architecture future-proof
Copyright 2002, SAS Institute Inc. All rights reserved.

Time Line: Project Mercury + DM
[Timeline figure. Dates on the axis: Apr 02, Jun 02, Nov 02, Feb 03.]
- SAS V9, EM 4.2 (Evolutionary Release): milestones EA, LA, GA
- SAS V9.1, EM 5.0 (Revolutionary Release): milestones DP, EA, LA, GA

Goals for EM 4.2
- Maintain current product
- Fix known defects
- Evolve beta tools to production status
- Interactive Grouping
- Improve scalability (parallel processing)

EM 4.2: Evolve Beta Tools to Production Status
- Memory Based Reasoning
- DM Neural
- Two-Stage Model
- Time Series
- Link Analysis
- J-Score, XML

Interactive Grouping Node
- Was developed as part of Credit Scoring Solution
- Will be fully integrated in EM 4.2 / 5.0
- Used to calculate weights of evidence
  - Also useful for general interactive grouping
- Interactive grouping of variables into natural groups in relation to target
  - Now possible for class and interval variables

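The weight-of-evidence calculation mentioned on this slide can be illustrated with a minimal sketch. It assumes the common credit-scoring convention WOE = ln(share of non-events / share of events) per group; the slide does not state which sign convention the node uses, and the class name, method name, and counts below are hypothetical.

```java
class WoeSketch {
    // Weight of evidence per group: ln(share of non-events / share of events).
    // Assumption: every group has at least one event and one non-event;
    // empty cells would need smoothing or regrouping before this step.
    static double[] weightsOfEvidence(long[] nonEvents, long[] events) {
        long totalNonEvents = 0;
        long totalEvents = 0;
        for (int i = 0; i < events.length; i++) {
            totalNonEvents += nonEvents[i];
            totalEvents += events[i];
        }
        double[] woe = new double[events.length];
        for (int i = 0; i < events.length; i++) {
            double nonEventShare = (double) nonEvents[i] / totalNonEvents;
            double eventShare = (double) events[i] / totalEvents;
            woe[i] = Math.log(nonEventShare / eventShare);
        }
        return woe;
    }

    public static void main(String[] args) {
        // Made-up counts for three groups of one grouped input variable.
        long[] nonEvents = {400, 300, 300};
        long[] events = {20, 30, 50};
        double[] woe = weightsOfEvidence(nonEvents, events);
        for (int i = 0; i < woe.length; i++) {
            System.out.printf("group %d: WOE = %.4f%n", i + 1, woe[i]);
        }
    }
}
```

Under this convention a positive value marks a group with relatively fewer events than the population, a negative value a group with relatively more.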
Publishing Enterprise Miner Models via the Open Meta Server
[Diagram. Components: Enterprise Miner, Open Meta Server, WWW Server, WWW clients. Labels: Save, Register, Read, HTTP/JSP, Search Models, Retrieve Models, Reports, Score code.]

Mining Model Repository
- SAS Code, C Code, Java Code
- Statistics, Charts, Reports
- Input and Output Variables described in XML
- Process flow report in HTML format
- Fit and assessment statistics in SAS data sets
- Cscore code
- Cscore meta information stored in XML
- Fit and assessment statistics stored in CSV
- Target and input data set info stored in text
- Formats, score, and macro code as SAS code
- Metadata info about the model in a SAS catalog

Performance and Scalability
- XOT
  - Enables parallel input (read) of partitioned data sets
  - Using XOT for data I/O
- TK (Threaded Kernel)
  - Multi-threading, making use of multiple CPUs
  - TK for PROC DMDB, PROC DMINE (Vsel), PROC DMREG
  - Optional for all listed procedures

Scale-Up: Proc DMINE
[Chart: time versus number of threads (0 to 8) on Stones (S64), 64-bit Solaris, 8 CPUs. Series: XOT-TK, Unthreaded.]

Benchmarking TK (Proc DMDB)

Data set                                      Single-threaded       Multi-threaded (4 threads)
                                              real (s)   cpu (s)    real (s)   cpu (s)
100K obs, 100 interval vars                     7.77       7.77       1.95       4.82
100K obs, 50 interval vars, 50 class vars      26.80      26.81       1.95       4.82
100K obs, 50 class vars                        22.69      22.68      12.48      29.00
5M obs, 2 interval vars                         6.50       6.50       1.51       4.92

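To read the benchmark in terms of speedup, the real times can be divided directly. The sketch below does this for two of the configurations (100K obs with 100 interval vars, and 5M obs with 2 interval vars); the class and method names are hypothetical, and only the timings come from the slide.

```java
class SpeedupSketch {
    // Speedup = single-threaded real time / multi-threaded real time.
    static double speedup(double singleSeconds, double multiSeconds) {
        return singleSeconds / multiSeconds;
    }

    // Parallel efficiency = speedup / number of threads (1.0 means ideal scaling).
    static double efficiency(double singleSeconds, double multiSeconds, int threads) {
        return speedup(singleSeconds, multiSeconds) / threads;
    }

    public static void main(String[] args) {
        int threads = 4;
        // Real times in seconds, as reported on the slide.
        double[][] runs = {
            {7.77, 1.95},  // 100K obs, 100 interval vars
            {6.50, 1.51},  // 5M obs, 2 interval vars
        };
        for (double[] run : runs) {
            System.out.printf("speedup %.2fx, efficiency %.2f%n",
                    speedup(run[0], run[1]), efficiency(run[0], run[1], threads));
        }
    }
}
```

Multi-threaded cpu time is not used here: it sums the work of all threads, so it is expected to exceed the real time.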
EM 5.0 The Future of Enterprise Miner
Plans for EM 5.0
- Create a new 3-tier architecture
  - SAS server
    - Batch and interactive modes
    - Use existing tools and expertise
  - Java foundation services
    - Metadata services
    - Configuration management
  - Java client
    - API: integration projects
    - GUI: Swing-based
- Data Mining from everywhere

Goals for EM 5.0
- Create a new EM 5.0
  - SAS server
    - Batch and interactive modes
    - Use existing tools and expertise
  - Java middleware
    - Metadata services
    - Configuration management
  - Java client
    - API: integration projects
    - GUI: Swing-based
- New procedures
  - PATH: production
  - ARBOR: production (replace split)
  - TAXONOMY: experimental
  - SVM: experimental
- Production version of MFC Tree viewer
  - PROC ARBOR
  - IOM procedure interface for interactive training
- Production Model Repository
  - EM 5.0 model registration
  - EM 4.2 model registration
  - Web GUI
  - Warehouse Admin.
  - Scoring

Current AF / SCL Architecture
[Diagram labels: Project persistence, SAS Server, Data Persistence, SAS Version 8.2, EM 4.x classes, SAS Version 8.2, SAS EM Client]
- SAS AF/SCL Infrastructure
- Project stored locally on the Windows client as well as the SAS installation
- EM models trained on EM server (single threaded)

Distributed Architecture in EM 5.0: Data Mining
[Diagram labels: Compute Server, Project Data Persistence, SAS System, Metadata Persistence, EM 5.0 Java API, EM 5.0 Java UI, Java EM Client, Middleware Server, EM 5.0 Java Middleware]

Distributed Architecture in EM 5.0: Reporting
[Diagram labels: Project Data Persistence, Compute Server, SAS System, Metadata Persistence, EM 5.0 Java API, EM 5.0 Java UI, Middleware Server, EM 5.0 Java Middleware, JSP Server, SAS Open Metadata Server, Web Client]

Distributed Architecture in EM 5.0: Warehousing
[Diagram labels: Compute Server, Project Data Persistence, SAS System, Metadata Persistence, EM 5.0 Java API, EM 5.0 Java UI, Middleware Server, EM 5.0 Java Middleware, JSP Server, SAS Open Metadata Server, Web Client, Data Builder Java Client]

EM 5.0 Configuration Options
- Stand-alone client
  - SAS server, Java middleware, GUI on the same machine
- Client server
  - SAS server, Java middleware server, clients connect through Java GUI
- Distributed computing
  - All components on different machines, users connect from anywhere

Reasons for n-tier Architecture
[Two diagrams. Labels of the first: Client 1, Client 2, SAS Server, OMS. Labels of the second: Client 1, Client 2, SAS Server, EM Server, OMS.]
- Central administration
  - Easier thin-client deployment
  - Reduce client footprint
  - Offers centralized location for file storage
  - Improved security control of all login processes
  - Easier configuration
  - More persistence options controlled by administrator
- Better resource monitoring
  - Who's using the system
  - How many processes are running

New GUI Based on Java Swing
- Improved graphics
- Deployed through the web allowing multiple user access
- Platform independent
- Server independent
- Configurable
- On-line help
- Extendable
- XML import/export of diagrams
- Start and stop processes

Sample EM 5.0 Results
[Screenshots: Exploratory Plots, Assessment Plots]

Interactive Tree Results Viewer
EM 5.0 Reporting
- SPK = SAS Publish and Subscribe
- SAS distributes a package reader
- Tables stored as CSV files => activate MS Excel
- Can be registered in OMS and Model Repository

Enhanced Performance
- Uses MP CONNECT technologies to distribute mining processes across multiple CPUs, providing the ability to run nodes in parallel.
- DMINE and DMREG procedures have been reengineered to take advantage of the TK and XOT frameworks of V9.
- Supports Stop Processing of an EM process.

EM 5.0 Performance
[Diagram labels: User 1, User 2; Middleware (IOM user session: user1, IOM user session: user2, IOM process session: user2); SAS Server (SAS: Train Model 1, SAS: Train Model 2, tk 1, tk 2, tk 3, tk 4); Server Operating System (CPU, CPU, CPU, CPU)]
- GUI sessions get dedicated SAS/IOM workspace
- Model training gets dedicated SAS/IOM workspace
- Parallel branches in process flow run in dedicated SAS/IOM workspaces
- XOT procedures with SPDS libname engine start multiple data read threads
- TK-enabled procedures start multiple computational threads

Event                                  Threads   Total
User 1 connects                            1        1
User 2 connects                            1        2
User 2 starts process                      1        3
User 2 disconnects                        -1        2
Process starts model 1 training            1        3
Process starts model 2 training            1        4
Model 2 starts four threads running        4        8
Model 2 completes                         -4        4
Process completes                         -3        1
User 2 reconnects                          1        2

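The Total column in the event table is a running sum of the Threads column. A minimal sketch of that bookkeeping follows; the class and method names are hypothetical, and the event names and thread changes are taken from the slide.

```java
class ThreadTallySketch {
    // Returns the running total after each change in thread count.
    static int[] runningTotals(int[] deltas) {
        int[] totals = new int[deltas.length];
        int total = 0;
        for (int i = 0; i < deltas.length; i++) {
            total += deltas[i];
            totals[i] = total;
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] events = {
            "User 1 connects", "User 2 connects", "User 2 starts process",
            "User 2 disconnects", "Process starts model 1 training",
            "Process starts model 2 training", "Model 2 starts four threads running",
            "Model 2 completes", "Process completes", "User 2 reconnects"
        };
        int[] deltas = {1, 1, 1, -1, 1, 1, 4, -4, -3, 1};
        int[] totals = runningTotals(deltas);
        for (int i = 0; i < events.length; i++) {
            System.out.printf("%-38s %3d %3d%n", events[i], deltas[i], totals[i]);
        }
    }
}
```

The process keeps running after User 2 disconnects, which is why the total only drops by one at that point.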
EM 5.0 Batch Processing
- Java API/UI for batch processing
  - Runs in middleware
  - Opens existing workspace and starts training process
  - Loads XML diagram files
- XML files API
  - Save entire diagrams as XML files
  - Mail from one user to another
  - Scheduled execution
  - %EM5(xmlfile=) macro for running diagrams
- Data set API
  - Nodes data set: all nodes and properties
  - Connections data set: flow of logic from one node to another
  - Actions data set: nodes and actions to perform on nodes
  - Workspace data set: library and file locations
  - Variables meta data sets: input, target, rejected, etc.
  - %EM5(nodes=,connect=, ) macro for running diagrams

EM 5.0 Batch Processing
- Compatible with all EM5 file structures
- Run the same diagram from UI or batch
- Automate model training from diagrams built in the GUI
- All SAS language capabilities
- Encapsulates EM processing
- BATCH.SAS always created for every node
- Automate creation of new diagrams
- Distribute diagrams
- Consulting: initial setup and delivery
- May include results, or not

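EM 5.0 Batch Processing
- API to allow Java programs to call EM

    // Add a data source node and a regression node to the workspace.
    String ids_id = myworkspace.addnode("Datasource");
    String reg_id = myworkspace.addnode("Regression");
    // Connect data source to regression, then run the flow up to the regression node.
    myworkspace.connectnode(ids_id, reg_id);
    myworkspace.runnode(reg_id);
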
Integrated with OMS and Data Builder
- OMS persists metadata about SAS servers, EM project locations, results packages, and data dictionaries for training tables
- Scoring processes as well as input/output data sets can be defined and exchanged with other SAS companion products through registration of EM metadata and processes within the SAS OMR.

Other Major Enhancements
- New mining algorithms:
  - Support Vector Machines: popular algorithm for general classification problems
  - Web Path Analysis: provides efficient and scalable mining of frequent paths from click-stream data.
  - Taxonomy: supports hierarchical associations to populate rules at different levels in the hierarchy.
- Improved decision tree algorithm to enable interactive training on the server and provide improved performance of disk-resident data.

New Procedures
- PROC PATH
- PROC SVM
- PROC ARBOR
- PROC TAXONOMY

New Path node (production)
- PROC PATH: a new procedure to mine frequent paths from preprocessed click stream data
- Features:
  - Efficient, scalable and fast
  - Path completion: reintroduce missing requests (e.g., back button clicks)
  - Detecting path breaks: identify separate subpaths
  - Generating longest contiguous sub-paths
  - Correctly handling page reload requests

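Path completion can be illustrated with a generic sketch. The slides do not describe how PROC PATH does it, so the following is an assumption: when a request is not linked from the current page, walk back through the visit history to the most recent page that does link to it, and reinsert the pages passed on the way as back-button steps. The class name, method names, and the three-link site graph are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PathCompletionSketch {
    static boolean linksTo(Map<String, Set<String>> links, String from, String to) {
        return links.getOrDefault(from, Collections.emptySet()).contains(to);
    }

    // Reinserts back-button revisits that are missing from the logged requests.
    static List<String> complete(List<String> observed, Map<String, Set<String>> links) {
        List<String> completed = new ArrayList<>();
        Deque<String> history = new ArrayDeque<>(); // top of stack = current page
        for (String page : observed) {
            if (!history.isEmpty() && !linksTo(links, history.peek(), page)) {
                List<String> backSteps = new ArrayList<>();
                boolean found = false;
                Iterator<String> it = history.iterator(); // current page first
                it.next(); // skip the current page itself
                while (it.hasNext() && !found) {
                    String earlier = it.next();
                    backSteps.add(earlier);
                    found = linksTo(links, earlier, page);
                }
                if (found) {
                    for (String revisited : backSteps) {
                        history.pop();            // leave the page we backed out of
                        completed.add(revisited); // log the inferred revisit
                    }
                }
                // If no earlier page links to the request, it is appended unchanged.
            }
            history.push(page);
            completed.add(page);
        }
        return completed;
    }

    // Hypothetical site structure: A links to B and C, B links to D.
    static Map<String, Set<String>> demoLinks() {
        Map<String, Set<String>> links = new HashMap<>();
        links.put("A", new HashSet<>(Arrays.asList("B", "C")));
        links.put("B", new HashSet<>(Arrays.asList("D")));
        return links;
    }

    public static void main(String[] args) {
        System.out.println(complete(Arrays.asList("A", "B", "D", "C"), demoLinks()));
    }
}
```

For the logged sequence A, B, D, C the sketch infers two back steps (to B, then to A) before C, since only A links to C.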
Path Analysis
- Improved customer experience
  - Tuning web-site structure based on browsing patterns
- Build customer relationships
  - Customizing content at individual or segment level
- Real-time target marketing
  - Cross-sell, up-sell product recommendations
  - Ad/Rebate placement
- Predict site abandonment
  - Browsing behavior as input to predictive modeling
  - Segmentation based on browsing behavior

Support Vector Machines (experimental)
- Supervised learning tool for creating functions from a set of labeled training data
  - A binary classifier
  - A general regression function
- Applications
  - Suitable for general classification problems
  - Text categorization
  - Biosequence analysis; micro arrays

SVM
- Classification is achieved by a linear or nonlinear separating surface in the input space of the dataset.
- Linear SVMs operate by finding a hyperplane in the space of possible inputs that splits the positive examples from the negative examples. The split is chosen to have the largest distance from the hyperplane to the nearest of the positive and negative examples.
- If the training examples are not linearly separable, SVMs work by mapping the training data into a higher-dimensional feature space using an appropriate kernel function.

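Two ideas from this slide can be checked numerically: the margin of a separating hyperplane w·x + b = 0, and the way a kernel evaluates a dot product in a higher-dimensional feature space without building that space. The sketch below is illustrative only and is not the PROC SVM implementation; the class name, the sample points, and the choice of a degree-2 polynomial kernel are assumptions.

```java
class SvmSketch {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Geometric margin: smallest signed distance y * (w.x + b) / ||w|| over all examples.
    // A positive result means the hyperplane separates the two classes.
    static double margin(double[] w, double b, double[][] xs, int[] ys) {
        double norm = Math.sqrt(dot(w, w));
        double smallest = Double.POSITIVE_INFINITY;
        for (int i = 0; i < xs.length; i++) {
            smallest = Math.min(smallest, ys[i] * (dot(w, xs[i]) + b) / norm);
        }
        return smallest;
    }

    // Degree-2 polynomial kernel on 2-D inputs: K(x, z) = (x.z)^2.
    static double polyKernel(double[] x, double[] z) {
        double d = dot(x, z);
        return d * d;
    }

    // Explicit 3-D feature map whose dot product equals polyKernel.
    static double[] featureMap(double[] x) {
        return new double[] {x[0] * x[0], Math.sqrt(2.0) * x[0] * x[1], x[1] * x[1]};
    }

    public static void main(String[] args) {
        double[][] xs = {{1, 1}, {2, 2}, {-1, -1}};
        int[] ys = {1, 1, -1};
        System.out.println("margin = " + margin(new double[] {1, 1}, 0.0, xs, ys));

        double[] x = {1, 2};
        double[] z = {3, 4};
        System.out.println("kernel = " + polyKernel(x, z));
        System.out.println("feature-space dot = " + dot(featureMap(x), featureMap(z)));
    }
}
```

An SVM trainer searches for the w and b that maximize this margin; the kernel identity is what lets it do so in the mapped space while only ever computing K on the original inputs.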
Other new Nodes/Procedures
- Taxonomy: hierarchical associations (exp)
- ARBOR: replacement for SPLIT
  - Supports client/server interactive training
    - As an interactive procedure
    - As an engine for a client-side Windows application
  - Improved performance of disk-resident data
  - Documented at the level of SAS/STAT procedures
- All procedures will use a dynamic DMDB
  - No permanent physical DMDB data set is created

Early Adopters for EM 5
- Looking for early adopters in SeUGI time frame
- 5 to 20 sites worldwide, recommended from local offices
- Different regions and different industries
- Following scenarios

Early Adopters for EM 5
- Following scenarios desired
  - Site that distributes the EM Java thin client to multiple, geographically dispersed users to test the 3-tier architecture
  - Small to medium sized firm to evaluate EM 5.0 running entirely on a local client
  - Site to test the Java API to integrate EM analytics and scoring services into site-specific mining applications
  - Site to test EM analytical deployment and the Model Repository
  - Sites with excellent statistical/AI modeling skills and applications to evaluate the new algorithms (SVM, Path analysis node, Interactive Tree, Hierarchical Associations)

EM 5.0 Summary
- Delivered as a modern, distributed client-server system for data mining
- Enables wide area collaboration on data mining projects and extensive integration opportunities
- SAS server uses new parallel and multi-processing features of the SAS V9.0 system and includes an API for running data mining processes and for adding new data mining tools.
- Java middleware manages SAS server sessions, user identity, metadata, and report delivery.
- Data mining sessions can be created and managed through a Java API.
- The user interface is based on Java Swing libraries containing advanced graphics and visualization techniques
- New mining algorithms

EM Summary
- Provide renowned data mining functionality based on modern future-proof architecture
- Clear differentiation between data processing, meta data management and flexible user interface
- Architecture open for integration with other SAS and 3rd-party applications
- Ensure backward compatibility by parallel maintenance of traditional AF solution

Other Data Mining Presentations at SeUGI
- Wed, 16:25, TKC, Distributed Data Mining with SAS Enterprise Miner
- Wed, 11:40, Analytical Expertise stream, SAS Text Miner
- Wed, 17:05, TKC, SAS Text Mining
- Analytical Demo Station in TKC

DEMO