The STREAM Benchmark. John D. McCalpin, Ph.D. IBM eServer Performance. 2005-01-27

History
Scientific computing was largely based on the vector paradigm from the late 1970s through the 1980s. E.g., the classic DAXPY loop:
      REAL*8 X(N),Y(N)
      DO I=1,N
         X(I) = X(I) + Q*Y(I)
      END DO
If all data comes from memory, this requires 12 Bytes per FLOP: each iteration moves three 8-Byte operands (two loads and one store) for two FLOPs, i.e., 3*8 Bytes / 2 FLOPs.

History (cont'd)
Vector machines like the Cray X/MP, Cray Y/MP, Cray C90, CDC Cyber 205, ETA-10, and various Japanese vector systems were built to this target of 12 Bytes/s of peak memory bandwidth per peak FLOP/s. It was easy to sustain 30%-40% of peak performance for vectorizable code, and >90% on LINPACK.

History (cont'd)
1990: "The Attack of the Killer Micros": HP PA-RISC (aka "Snakes") and IBM RS/6000 (aka RIOS, aka POWER, aka POWER1). These systems provided performance roughly matching Cray vector machines on many applications, but fell very short on others. Users were confused.

The Genesis of STREAM
I was an assistant professor at the University of Delaware in 1990. My research was on large-scale ocean circulation modelling: vectorizable codes that ran well on vector machines. The RS/6000-320 gave about 4% of Cray Y/MP performance. The STREAM benchmark was developed as a proxy for the computational kernels in my ocean models.

STREAM Design
STREAM is designed to measure sustainable memory bandwidth for contiguous, long-vector memory accesses. Four kernels are tested:
COPY:  a(i) = b(i)
SCALE: a(i) = q*b(i)
ADD:   a(i) = b(i) + c(i)
TRIAD: a(i) = b(i) + q*c(i)
Arrays should all be much larger than the caches.
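As a rough illustration, the four kernels look like the sketch below. This is a minimal sketch following the formulas on this slide, not the official stream.c (which adds timing, OpenMP directives, result validation, and distinct destination arrays); the array size N and the scalar q are arbitrary assumptions, and N must be chosen much larger than the last-level cache for the result to reflect memory bandwidth.

```c
#include <stddef.h>

#define N 20000000                 /* assumed size; must greatly exceed cache capacity */
static double a[N], b[N], c[N];

void stream_kernels(double q)
{
    size_t i;
    for (i = 0; i < N; i++) a[i] = b[i];              /* COPY  */
    for (i = 0; i < N; i++) a[i] = q * b[i];          /* SCALE */
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];       /* ADD   */
    for (i = 0; i < N; i++) a[i] = b[i] + q * c[i];   /* TRIAD */
}
```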

STREAM Design (cont'd)
The code is set up with dependencies between the kernels to try to prevent excessive optimization. Recent versions (in FORTRAN and C) have self-contained validation code and OpenMP directives for parallelization. The code is NUMA-friendly, with "first touch" memory placement. An MPI version is also available (FORTRAN).
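The "first touch" placement works roughly as in the fragment below (an illustrative sketch, not taken from the benchmark source): each OpenMP thread initializes the array elements it will later stream through, so the operating system allocates those pages on that thread's local memory node.

```c
#include <omp.h>

/* Illustrative first-touch initialization: with a static schedule, each
 * thread writes (and therefore places) the pages it will later read in
 * the measurement loops, provided those loops use the same schedule. */
void first_touch_init(double *a, double *b, double *c, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 2.0;
        c[i] = 1.0;
    }
}
```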

STREAM History
STREAM was maintained at my FTP site at the University of Delaware from 1991-1996. The STREAM web/ftp sites moved to the University of Virginia in 1996, when I moved to industry (SGI). Current run rules have several publication classes:
standard: no source code changes to kernels
tuned: modified source code
MPI: uses MPI source code
32-bit: uses 32-bit data items and arithmetic

Is STREAM Useful?
Yes, if its limitations are properly understood. STREAM is intended as one component of a performance model that includes both processor and memory timings, e.g., T_total = T_cpu + T_memory. Such models are intrinsically application dependent, and often strongly dependent on cache size.

The Big Question
Q: How should one think about composite figures of merit based on such a collection of low-level measurements?
A: Composite figures of merit must be based on time rather than rate, i.e., weighted harmonic means of rates. Why? Combining rates in any other way fails to have a Law of Diminishing Returns.
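A small numerical illustration of the diminishing-returns argument (an invented two-component workload, not from the talk): the time-based composite is total work divided by total time, which is a weighted harmonic mean of the component rates, and it saturates as one rate grows without bound; an arithmetic mean of the same rates does not.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical workload: 1 unit of compute work + 1 unit of memory work. */
    double mem_rate = 1.0;                            /* fixed memory rate */
    double cpu_rates[] = { 1.0, 2.0, 4.0, 8.0, 1e6 };

    for (int k = 0; k < 5; k++) {
        double r = cpu_rates[k];
        /* total work / total time == weighted harmonic mean of the rates */
        double harmonic   = 2.0 / (1.0 / r + 1.0 / mem_rate);
        double arithmetic = (r + mem_rate) / 2.0;     /* no diminishing returns */
        printf("cpu rate %10.0f  harmonic %6.3f  arithmetic %12.1f\n",
               r, harmonic, arithmetic);
    }
    return 0;
}
```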

Does Peak GFLOPS predict SPECfp_rate2000?
[Scatter plot: Peak MFLOPS (y-axis) vs. SPECfp_rate2000 per CPU (x-axis)]

Does Sustained Memory Bandwidth predict SPECfp_rate2000?
[Scatter plot: sustained bandwidth in GB/s per CPU (y-axis) vs. SPECfp_rate2000 per CPU (x-axis)]

A Simple Composite Model
Assume the time to solution is composed of a compute time inversely proportional to peak GFLOPS plus a memory transfer time inversely proportional to sustained memory bandwidth. Assume x Bytes/FLOP to get:
"Balanced GFLOPS" = 1 "effective FP op" / [ (1 FP op / Peak GFLOPS) + (x Bytes / Sustained GB/s) ]
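A rough numerical sketch of this metric follows; the function name and example inputs are hypothetical (peak GFLOPS would come from the processor data sheet, sustained GB/s from a STREAM run, and x from an analysis of the target application).

```c
#include <stdio.h>

/* Composite "balanced GFLOPS" from the model above: time per effective
 * FP op is the sum of a compute term and a memory-transfer term.
 * With GFLOPS and GB/s as units, both terms come out in nanoseconds. */
double balanced_gflops(double peak_gflops, double sustained_gbs,
                       double x_bytes_per_flop)
{
    double t_compute = 1.0 / peak_gflops;                 /* ns per FP op */
    double t_memory  = x_bytes_per_flop / sustained_gbs;  /* ns per FP op */
    return 1.0 / (t_compute + t_memory);                  /* effective GFLOPS */
}

int main(void)
{
    /* Hypothetical machine: 8 GFLOPS peak, 4 GB/s sustained, 2 Bytes/FLOP. */
    printf("balanced GFLOPS = %.2f\n", balanced_gflops(8.0, 4.0, 2.0));
    return 0;
}
```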

Does this Revised Metric predict SPECfp_rate2000?
[Scatter plot of optimized SPECfp_rate2000 estimates: estimated rate per CPU (y-axis) vs. measured SPECfp_rate2000 per CPU (x-axis)]

Statistical Metrics
[Bar chart: R squared and (Std Error)/Mean for the Peak GFLOPS, SWIM, BW, and Optimal predictors]

What about other applications?
Effectiveness of caches varies by application area. Requirements for interconnect performance also vary by application area: some apps are short-message dominated, some are long-message dominated. Composite models can be tuned to specific application areas if the application properties are known.

Caches Reduce BW Demand
[Chart not recoverable from the transcription; visible values: 20.0 (truncated), 10.0, 2.00]

An Example Model tuned for CFD
Analyze applications and pick reasonable values:
"Balanced GFLOPS" = 1 "effective FP op" / [ (1 FP op / LINPACK GFLOPS) + (2 Bytes / STREAM GB/s) + (0.1 Bytes / Network GB/s) ]
Two cases tested:
Assume long messages (network BW tracks PTRANS)
Assume short messages (network BW tracks GUPS)
The relative time contributions will quickly identify applications that are poorly balanced for the target workload.
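A minimal sketch of this CFD-tuned model, keeping the three time terms separate so their relative contributions can be inspected. The 2 Bytes/FLOP and 0.1 Bytes/FLOP coefficients are the values assumed on this slide; the benchmark inputs below are invented for illustration only.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical benchmark results for one system. */
    double linpack_gflops = 40.0;   /* compute rate                      */
    double stream_gbs     = 25.0;   /* sustained memory bandwidth        */
    double network_gbs    = 1.0;    /* PTRANS- or GUPS-derived bandwidth */

    /* Time per effective FP op in nanoseconds, one term per component. */
    double t_compute = 1.0 / linpack_gflops;
    double t_memory  = 2.0 / stream_gbs;      /* 2 Bytes per FLOP   */
    double t_network = 0.1 / network_gbs;     /* 0.1 Bytes per FLOP */
    double t_total   = t_compute + t_memory + t_network;

    printf("balanced GFLOPS = %.1f\n", 1.0 / t_total);
    printf("time fractions: compute %.2f, memory %.2f, network %.2f\n",
           t_compute / t_total, t_memory / t_total, t_network / t_total);
    return 0;
}
```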

Comparing p655 cluster vs p690 SMP (assumes long messages)
p655 cluster: 45 GFLOPS; p690 SMP: 23 GFLOPS.
[Stacked bar chart: Compute Time, Memory Time, and Interconnect Time for the p655+ cluster and the p690+ SMP]

Comparing p655 cluster vs p690 SMP (assumes short messages)
p655 cluster: 13 GFLOPS; p690 SMP: 21 GFLOPS.
[Stacked bar chart: Compute Time, Memory Time, and Interconnect Time for the p655+ cluster and the p690+ SMP]

Summary
The composite methodology is:
Simple to understand
Based on the components of the HPC Challenge Benchmark
Based on a mathematically correct model of performance
Much work remains on documenting the work requirements of various application areas in relation to the component microbenchmarks.