High-Performance Statistical Modeling
Koen Knapen
Academic Day, March 27th, 2014
SAS Tervuren
The Routes (Roots) of Confusion
- How do I get HP procedures? Just add HP??
- Single-machine mode
- Distributed mode
- Distributed-Alongside
- Scalability
- REG vs. HPREG
- GENMOD vs. HPGENSELECT
- Symmetric vs. Asymmetric Mode
support.sas.com/statistics/papers
Part 1: General Considerations
GENERAL CONSIDERATIONS
Execution Modes
Single-Machine Mode
- Executes entirely on the server where SAS is installed
- Also called client mode or SMP (symmetric multiprocessing) mode
Distributed Mode
- Major computations are done on an appliance (blade server)
- Also called MPP (massively parallel processing) mode
Single-Machine Mode
SAS Server

proc hpgenselect data=a2013;
   class c:;
   model ypoisson = x: c:;
   selection method=stepwise;
run;

The HPA procedure determines the number of concurrent threads based on the number of CPUs (cores) on the server.
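The thread count does not have to be left to the default. The following is a minimal sketch, assuming the PERFORMANCE statement shared by the HP procedures and its NTHREADS= and DETAILS options; the value 4 is an arbitrary illustration, and the data set and variables are the ones from the slide.

```sas
/* Sketch: the slide's model with an explicit thread count. */
proc hpgenselect data=a2013;
   class c:;
   model ypoisson = x: c:;
   selection method=stepwise;
   /* NTHREADS=4 is an illustrative value, not a recommendation. */
   /* DETAILS requests a table of timing information.            */
   performance nthreads=4 details;
run;
```

Omitting the PERFORMANCE statement gives the default behavior described above, where the procedure picks the thread count from the number of cores.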
Appliance - Racks of Blades and Software
- Multi-socket, multi-core platform
- Commodity blade
- Chassis of blades
An appliance (blade server) is a tightly integrated, homogeneous cluster of computers arranged in racks. The individual computers in each rack are called nodes or blades. Database appliances include database software.
Database Appliance
Controller and worker nodes
- A table is stored in parts across multiple worker nodes
- SQL queries operate in parallel on the different parts of the table
GENERAL CONSIDERATIONS
Data Access Features
- Client-data (or local-data) method: data are moved from the SAS server to the distributed computing environment.
- Alongside-the-database method: data are stored in a distributed DBMS and are read in parallel from the DBMS into a SAS analytic process that runs on the database appliance.
- Alongside-HDFS method (HDFS: Hadoop Distributed File System)
- Alongside-LASR method: data are loaded from a SAS LASR Analytic Server that runs on the appliance.
Availability
AVAILABILITY
High-Performance Analytical Products

High-Performance Analytics Product      Associated MVA Product
SAS High-Performance Statistics         SAS/STAT
SAS High-Performance Econometrics       SAS/ETS
SAS High-Performance Optimization       SAS/OR
SAS High-Performance Data Mining        SAS Enterprise Miner
SAS High-Performance Text Mining        SAS Text Miner
SAS High-Performance Forecasting        SAS High-Performance Forecasting

MVA products include single-machine mode operation of HP procedures as part of the MVA product license.
AVAILABILITY
SAS High-Performance Product Offerings
Release 13.1, available in December with SAS 9.4M

High-Performance Statistics: HPLOGISTIC, HPREG, HPLMIXED, HPNLMOD, HPSPLIT, HPGENSELECT, HPQUANTSELECT, HPFMM
High-Performance Data Mining: HPREDUCE, HPNEURAL, HPFOREST, HP4SCORE, HPDECIDE, HPCLUS, HPSVM, HPBNET
High-Performance Text Mining: HPTMINE, HPTMSCORE
High-Performance Optimization: OPTLSO, select features in OPTMILP, OPTLP, OPTMODEL
High-Performance Econometrics: HPCOUNTREG, HPSEVERITY, HPQLIM, HPPANEL, HPCOPULA, HPCDM
High-Performance Forecasting: HPFORECAST
HPTIMEDATA, HPCANDISC, HPPRINCOMP
Common Set: HPDS2, HPDMDB, HPSAMPLE, HPSUMMARY, HPIMPUTE, HPBIN, HPCORR
Part 2: High-Performance Statistical Modeling
HIGH-PERFORMANCE STATISTICAL MODELING
General Design Principles for HPA Procedures
1. Support single-machine and distributed modes
2. Use multithreading to exploit all CPUs
3. Support a variety of data sources
4. Require syntactical consistency across modes
5. Require syntactical consistency across HPA procedures
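Syntactical consistency across modes (principle 4) can be illustrated with a hedged sketch: the modeling statements stay as they were in the single-machine example, and only the execution settings change. The host name and install path below are hypothetical placeholders, and the sketch assumes that the GRIDHOST and GRIDINSTALLLOC environment variables and the NODES= option of the PERFORMANCE statement are how the appliance is addressed at your site.

```sas
/* Hypothetical appliance settings; replace with your site's values. */
option set=GRIDHOST="hpa.example.com";
option set=GRIDINSTALLLOC="/opt/TKGrid";

/* Modeling statements are unchanged from single-machine mode. */
proc hpgenselect data=a2013;
   class c:;
   model ypoisson = x: c:;
   selection method=stepwise;
   /* Requesting nodes asks for distributed execution;  */
   /* 4 is an illustrative value.                       */
   performance nodes=4;
run;
```

Running this in distributed mode requires the corresponding High-Performance Analytics product license, as noted under Availability.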
HIGH-PERFORMANCE STATISTICAL MODELING
Design Principles for High-Performance Statistical Procedures
1. Focus on prediction and not post-fit inference
2. Standardize and improve syntax where needed
3. Support model selection where appropriate
4. Combine functionality from SAS/STAT procedures when appropriate
5. Provide new functionality within HPA framework when viable
HIGH-PERFORMANCE STATISTICAL MODELING
Functionality of HPGENSELECT Procedure
Fits generalized linear models
- Distributions: Normal, Poisson, Tweedie, ...
- Link functions: log, logit, ...
- Linear predictors: effects involving continuous and classification variables
Provides model building
- Forward, backward, stepwise methods
- Multiple criteria for choosing model: AIC, AICC, SBC
- Splitting of classification effects
Writes DATA step code for computing predicted values
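As a hedged illustration of the features listed above, the sketch below fits a Poisson model with a log link, runs stepwise selection, and writes scoring code. It assumes the DIST= and LINK= options of the MODEL statement and the FILE= option of the CODE statement; the file name hpgenselect_score.sas and the data sets newdata and scored are hypothetical.

```sas
/* Fit and select a Poisson model, then write DATA step scoring code. */
proc hpgenselect data=a2013;
   class c:;
   model ypoisson = x: c: / dist=poisson link=log;
   selection method=stepwise;
   code file="hpgenselect_score.sas";   /* hypothetical file name */
run;

/* Apply the generated code to new observations. */
data scored;                            /* hypothetical output data set */
   set newdata;                         /* hypothetical input data set  */
   %include "hpgenselect_score.sas";
run;
```

Because the generated code is ordinary DATA step code, scoring does not require rerunning the procedure.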
HIGH-PERFORMANCE STATISTICAL MODELING
GENMOD or HPGENSELECT?
GENMOD
- Fits models with moderate-to-large data
- Offers rich set of methods for statistical inference
  - GEE methods for correlated responses
  - Bayesian inference
  - Exact conditional regression
  - Wide array of postfitting analysis: contrasts, estimates, tests, ...
HPGENSELECT
- Fits and builds models with large-to-massive data
- Designed for large-data tasks such as predictive model building
Performance Comparisons
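Scalable Percentage
The run time on 1 CPU splits into a Not Scalable part and a Scalable part: $t_s$ is the scalable time and $t_1$ is the total time.
Scalable Percentage = $100 \, t_s / t_1$ = 60%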
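Amdahl's Law
1 CPU: Not Scalable 40%, Scalable 60% (scalable time $t_s$, total time $t_1$)
2 CPUs: Not Scalable 57%, Scalable 43% (scalable time $\tfrac{1}{2} t_s$, total time $t_2$)
Speedup = $t_1 / t_2$ = 1.43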
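The 1.43 figure can be checked with a short worked derivation. The symbols $p$ (scalable fraction), $n$ (number of CPUs), $t_n$ (run time on $n$ CPUs) and $S(n)$ (speedup) are introduced here for illustration, and the derivation assumes the scalable part divides perfectly across CPUs.

```latex
% Scalable fraction, from the 60% scalable percentage
p = \frac{t_s}{t_1} = 0.60

% Run time on n CPUs: the serial part is unchanged, the scalable part is divided by n
t_n = (1 - p)\, t_1 + \frac{p\, t_1}{n}

% Speedup
S(n) = \frac{t_1}{t_n} = \frac{1}{(1 - p) + p / n}

% Two CPUs
S(2) = \frac{1}{0.40 + 0.30} = \frac{1}{0.70} \approx 1.43

% Shares of the two-CPU run time
\frac{0.40}{0.70} \approx 57\% \ \text{(not scalable)}, \qquad
\frac{0.30}{0.70} \approx 43\% \ \text{(scalable)}

% Upper bound as the number of CPUs grows
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 2.5
```

Under these assumptions a job that is 60% scalable can never run more than 2.5 times faster, however many CPUs are added.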
HIGH-PERFORMANCE STATISTICAL MODELING
Scalability and Big Data
Amdahl's law implies a limit to scalability, because every job has some unavoidable serial component:
- Reading data with a single I/O controller in single-machine mode
- Establishing connections to an appliance and database in distributed mode
HIGH-PERFORMANCE STATISTICAL MODELING
Benefits
1. High-performance procedures in SAS/STAT deliver modeling methods and scalability for a wide range of problem sizes.
2. If you have SAS/STAT, you can run these procedures in single-machine mode and exploit all the cores.
3. As your problem size grows, you can take full advantage of all the cores and huge amounts of memory available in distributed computing environments.