High-Performance Statistical Modeling

High-Performance Statistical Modeling
Koen Knapen
Academic Day, March 27th, 2014, SAS Tervuren

The Routes (Roots) of Confusion
- How do I get HP procedures?
- Just add HP??
- Single-machine mode
- Distributed mode
- Distributed-Alongside
- Scalability
- REG vs. HPREG
- GENMOD vs. HPGENSELECT
- Symmetric vs. Asymmetric Mode
- support.sas.com/statistics/papers

Part 1: General Considerations

GENERAL CONSIDERATIONS: Execution Modes
Single-Machine Mode
- Executes entirely on the server where SAS is installed
- Also called client mode or SMP (symmetric multiprocessing) mode
Distributed Mode
- Major computations are done on an appliance ("blade server")
- Also called MPP (massively parallel processing) mode

Single-Machine Mode (SAS Server)

    proc hpgenselect data=a2013;
       class c:;
       model ypoisson = x: c: ;
       selection method=stepwise;
    run;

The HPA procedure determines the number of concurrent threads based on the number of CPUs (cores) on the server.
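The thread count does not have to be left to the default. The following is a minimal sketch, reusing the data set and variable names from the slide, of how the PERFORMANCE statement could be used to set it explicitly. The DISTRIBUTION=POISSON option and the value NTHREADS=4 are assumptions added for illustration and are not part of the original slide.

```sas
/* Sketch only: the slide's model, run in single-machine mode. */
proc hpgenselect data=a2013;
   class c:;
   /* DISTRIBUTION=POISSON is an assumption suggested by the response name */
   model ypoisson = x: c: / distribution=poisson;
   selection method=stepwise;
   /* NTHREADS= overrides the default thread count (4 is an arbitrary value); */
   /* DETAILS requests a timing breakdown of the procedure steps              */
   performance nthreads=4 details;
run;
```

Rerunning the same step with different NTHREADS= values and comparing the timing tables is one simple way to see how much of a job actually scales on a given server.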

Appliance: Racks of Blades and Software
- Multi-socket, multi-core platform
- Commodity blade
- Chassis of blades
An appliance (blade server) is a tightly integrated, homogeneous cluster of computers arranged in racks. The individual computers in each rack are called nodes or blades. Database appliances include database software.

Database Appliance
[Figure: a controller and several worker nodes]
- A table is stored in parts across multiple worker nodes.
- SQL queries operate in parallel on the different parts of the table.

GENERAL CONSIDERATIONS: Data Access Features
- Client-data (or local-data) method: data are moved from the SAS server to the distributed computing environment.
- Alongside-the-database method: data are stored in a distributed DBMS and are read in parallel from the DBMS into a SAS analytic process that runs on the database appliance.
- Alongside-HDFS method: data are read from HDFS (Hadoop Distributed File System).
- Alongside-LASR method: data are loaded from a SAS LASR Analytic Server that runs on the appliance.
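As a rough illustration of the client-data method, the sketch below runs the earlier HPGENSELECT step in distributed mode while the input stays an ordinary SAS data set on the client. The host name, the install path, and the node count are hypothetical placeholders, and the step assumes a licensed and configured SAS High-Performance Analytics environment.

```sas
/* Sketch only: client-data method. The input is a local SAS data set, */
/* and the procedure moves it to the distributed environment.          */
proc hpgenselect data=a2013;
   class c:;
   model ypoisson = x: c: ;
   selection method=stepwise;
   /* NODES= requests distributed mode (4 is an arbitrary value).         */
   /* HOST= and INSTALL= are hypothetical placeholders for the appliance  */
   /* head node and the high-performance analytics install location.      */
   performance nodes=4 host="hpa.example.com" install="/opt/hpa/TKGrid" details;
run;
```

Apart from the PERFORMANCE statement, the step is the same as in single-machine mode, which is the syntactical consistency across modes that the design principles call for.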

Availability

AVAILABILITY: High-Performance Analytical Products

High-Performance Analytics Product      Associated MVA Product
SAS High-Performance Statistics         SAS/STAT
SAS High-Performance Econometrics       SAS/ETS
SAS High-Performance Optimization       SAS/OR
SAS High-Performance Data Mining        SAS Enterprise Miner
SAS High-Performance Text Mining        SAS Text Miner
SAS High-Performance Forecasting        SAS High-Performance Forecasting

MVA products include single-machine mode operation of HP procedures as part of the MVA product license.

AVAILABILITY: SAS High-Performance Product Offerings
Release 13.1, available in December with SAS 9.4M

High-Performance Statistics: HPLOGISTIC, HPREG, HPLMIXED, HPNLMOD, HPSPLIT, HPGENSELECT, HPQUANTSELECT, HPFMM, HPCANDISC, HPPRINCOMP
High-Performance Data Mining: HPREDUCE, HPNEURAL, HPFOREST, HP4SCORE, HPDECIDE, HPCLUS, HPSVM, HPBNET
High-Performance Text Mining: HPTMINE, HPTMSCORE
High-Performance Optimization: OPTLSO, select features in OPTMILP, OPTLP, OPTMODEL
High-Performance Econometrics: HPCOUNTREG, HPSEVERITY, HPQLIM, HPPANEL, HPCOPULA, HPCDM
High-Performance Forecasting 2: HPFORECAST, HPTIMEDATA
Common Set: HPDS2, HPDMDB, HPSAMPLE, HPSUMMARY, HPIMPUTE, HPBIN, HPCORR

Part 2: High-Performance Statistical Modeling

HIGH-PERFORMANCE STATISTICAL MODELING: General Design Principles for HPA Procedures
1. Support single-machine and distributed modes
2. Use multithreading to exploit all CPUs
3. Support a variety of data sources
4. Require syntactical consistency across modes
5. Require syntactical consistency across HPA procedures

HIGH-PERFORMANCE STATISTICAL MODELING: Design Principles for High-Performance Statistical Procedures
1. Focus on prediction and not post-fit inference
2. Standardize and improve syntax where needed
3. Support model selection where appropriate
4. Combine functionality from SAS/STAT procedures when appropriate
5. Provide new functionality within the HPA framework when viable

HIGH-PERFORMANCE STATISTICAL MODELING: Functionality of the HPGENSELECT Procedure
- Fits generalized linear models
  - Distributions: Normal, Poisson, Tweedie, ...
  - Link functions: log, logit, ...
  - Linear predictors: effects involving continuous and classification variables
- Provides model building
  - Forward, backward, and stepwise methods
  - Multiple criteria for choosing a model: AIC, AICC, SBC
  - Splitting of classification effects
- Writes DATA step code for computing predicted values
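The features listed above can be sketched in a single step. The example below is illustrative only: it reuses the data set and variable names from the earlier slide, while the Poisson distribution, the log link, the AICC criterion, the file name hpgenselect_score.sas, and the data set newdata are all assumptions made for the sake of the example.

```sas
/* Sketch only: model building and score-code generation with HPGENSELECT. */
proc hpgenselect data=a2013;
   /* SPLIT lets the design columns of a classification effect */
   /* enter or leave the model independently                   */
   class c: / split;
   /* Assumed distribution and link function for a count response */
   model ypoisson = x: c: / distribution=poisson link=log;
   /* Stepwise selection; the final model is chosen by AICC */
   selection method=stepwise(choose=aicc);
   /* Write DATA step code for computing predicted values (hypothetical file name) */
   code file='hpgenselect_score.sas';
run;

/* Apply the generated score code to a hypothetical data set named newdata */
data scored;
   set newdata;
   %include 'hpgenselect_score.sas';
run;
```

Because the score code is plain DATA step code, predictions can be computed wherever the DATA step runs, without rerunning the procedure.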

HIGH-PERFORMANCE STATISTICAL MODELING: GENMOD or HPGENSELECT?
GENMOD
- Fits models with moderate-to-large data
- Offers a rich set of methods for statistical inference
  - GEE methods for correlated responses
  - Bayesian inference
  - Exact conditional regression
- Wide array of postfitting analysis: contrasts, estimates, tests, ...
HPGENSELECT
- Fits and builds models with large-to-massive data
- Designed for large-data tasks such as predictive model building

Performance Comparisons

Scalable Percentage
[Figure: total run time $t_1$ on one CPU, divided into a Not Scalable part and a Scalable part $t_s$]
$\text{Scalable Percentage} = 100 \times t_s / t_1 = 60\%$

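Amdahl's Law
[Figure: run-time bars for 1 CPU and 2 CPUs]
- 1 CPU: Not Scalable 40%, Scalable 60% ($t_s$); total run time $t_1$
- 2 CPUs: Not Scalable 57%, Scalable 43% ($\tfrac{1}{2} t_s$); total run time $t_2$
$\text{Speedup} = t_1 / t_2 = 1.43$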
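The numbers on this slide can be checked by hand. The working below assumes, as the slide does, that the scalable part $t_s$ is halved exactly when a second CPU is added and that the non-scalable part is unchanged.

```latex
\begin{align*}
t_s &= 0.6\,t_1, \qquad t_1 - t_s = 0.4\,t_1 \\
t_2 &= (t_1 - t_s) + \tfrac{1}{2}\,t_s = 0.4\,t_1 + 0.3\,t_1 = 0.7\,t_1 \\
\text{Speedup} &= \frac{t_1}{t_2} = \frac{1}{0.7} \approx 1.43 \\
\text{Not scalable share of } t_2 &= \frac{0.4}{0.7} \approx 57\%, \qquad
\text{Scalable share of } t_2 = \frac{0.3}{0.7} \approx 43\%
\end{align*}
```

Doubling the CPUs therefore gives a speedup of 1.43 rather than 2, and the non-scalable part grows from 40% to 57% of the run time.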

HIGH-PERFORMANCE STATISTICAL MODELING: Scalability and Big Data
Amdahl's law implies a limit to scalability, because every job has some unavoidable serial component:
- Reading data with a single I/O controller in single-machine mode
- Establishing connections to an appliance and database in distributed mode
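The limit can be made explicit with the usual general form of Amdahl's law. The symbols $p$ (scalable fraction, $t_s / t_1$) and $n$ (number of CPUs) are introduced here for illustration and do not appear on the slides; the formula assumes the scalable part parallelizes perfectly.

```latex
\begin{align*}
\text{Speedup}(n) &= \frac{t_1}{t_n} = \frac{1}{(1 - p) + p/n} \\
\lim_{n \to \infty} \text{Speedup}(n) &= \frac{1}{1 - p} \\
p = 0.6: \quad \text{Speedup}(2) &= \frac{1}{0.4 + 0.3} \approx 1.43, \qquad
\lim_{n \to \infty} \text{Speedup}(n) = \frac{1}{0.4} = 2.5
\end{align*}
```

With a 60% scalable percentage, no number of CPUs can make the job more than 2.5 times faster, which is why reducing the serial component matters as much as adding cores.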

HIGH-PERFORMANCE STATISTICAL MODELING: Benefits
1. High-performance procedures in SAS/STAT deliver modeling methods and scalability for a wide range of problem sizes.
2. If you have SAS/STAT, you can run these procedures in single-machine mode and exploit all the cores.
3. As your problem size grows, you can take full advantage of all the cores and huge amounts of memory available in distributed computing environments.
