HPC Benchmarking
Presentations:
- Jack Dongarra, University of Tennessee & ORNL: "The HPL Benchmark: Past, Present & Future"
- Mike Heroux, Sandia National Laboratories: "The HPCG Benchmark: Challenges It Presents to Current & Future Systems"
- Mark Adams, LBNL: "HPGMG: A Supercomputer Benchmark & Metric"
- David A. Bader, Georgia Institute of Technology: "Graph500: A Challenging Benchmark for High Performance Data Analytics"

The HPL Benchmark: Past, Present & Future
Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
6/21/16

Confessions of an Accidental Benchmarker
Appendix B of the Linpack Users' Guide (1979) was designed to help users extrapolate execution time for the Linpack software package.
The first benchmark report is from 1977, covering machines from the Cray 1 to the DEC PDP-10.

Started 37 Years Ago
We have seen a factor of 6x10^9: from 14 Mflop/s to 93 Pflop/s.
In the late 70s the fastest computer ran LINPACK at 14 Mflop/s; today with HPL we are at 93 Pflop/s.
That is nine orders of magnitude, doubling roughly every 14 months, with about 7 orders of magnitude increase in the number of processors, plus algorithmic improvements.
The benchmark began in the late 70s, a time when floating point operations were expensive compared to other operations and data movement.
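
As a sanity check on those figures, the arithmetic is easy to reproduce; a quick Python sketch using only the two endpoints quoted on the slide:

    import math

    flops_1979 = 14e6        # 14 Mflop/s, late-1970s LINPACK
    flops_2016 = 93e15       # 93 Pflop/s, HPL in 2016
    years = 37

    factor = flops_2016 / flops_1979     # ~6.6e9, the "factor of 6 x 10^9"
    doublings = math.log2(factor)        # ~32.6 doublings
    print(f"{factor:.1e}x in {years} years")
    print(f"doubling every {years * 12 / doublings:.1f} months")   # ~13.6, i.e. about every 14 months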

Linpack Benchmark Over Time
In the beginning there was the Linpack 100 Benchmark (1977):
- n = 100 (80 KB), a size that would fit on all the machines
- Fortran; 64-bit floating point arithmetic
- No hand optimization (only compiler options)
Linpack 1000 (1986):
- n = 1000 (8 MB); wanted to see higher performance levels
- Any language; 64-bit floating point arithmetic
- Hand optimization OK
Linpack TPP (1991) (Top500: 1993):
- Any size (n as large as you can run; n = 12x10^6 needs about 1.2 PB)
- Any language; 64-bit floating point arithmetic
- Hand optimization OK
- Strassen's method not allowed (it confuses the operation count and rate)
- Reference implementation available
In all cases results are verified by checking:
- the operation count, taken as 2/3 n^3 + O(n^2) for the factorization plus 2 n^2 for the solve
- the scaled residual ||Ax - b|| / (||A|| ||x|| n eps) = O(1)
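
For the TPP/HPL case the bookkeeping behind those numbers is easy to reproduce. A small Python sketch, using the nominal flop count above and the Sunway TaihuLight figures from the #1-systems table later in the talk (the script itself is only illustrative):

    # Nominal LINPACK/HPL bookkeeping (numbers taken from the Sunway TaihuLight
    # row of the #1-systems table later in this talk).
    n = 12_288_000                              # matrix order
    bytes_for_A = 8 * n * n                     # 64-bit words -> ~1.2 PB
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2     # factorization + solve, lower-order terms dropped

    time_seconds = 3.7 * 3600                   # ~3.7-hour run
    rate = flops / time_seconds                 # flop/s

    print(f"memory for A: {bytes_for_A / 1e15:.2f} PB")     # ~1.21 PB
    print(f"rate        : {rate / 1e15:.0f} Pflop/s")        # ~93 Pflop/s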

LINPACK to HPL to TOP500
The LINPACK Benchmark report (ANL TM-23, 1984), "Performance of Various Computers Using Standard Linear Equations Software", listed about 70 systems. Over time the LINPACK Benchmark went through a number of changes:
- It began with a Fortran code, run as is with no changes, N = 100 (Table 1).
- Later N = 1000 was introduced (Table 2), with hand coding allowed for optimization and parallelism. A timing harness was provided to generate the matrix and check the solution. The basic algorithm, Gaussian elimination with partial pivoting, remained the same.
- In 1989 we started putting together Table 3 ("Toward Peak Performance") of the LINPACK benchmark report: N allowed to be any size, a timing harness provided to generate the matrix and check the solution, and the list reports R_max, N_max, and R_peak.
- In 2000 we put together an optimized implementation of the benchmark, called High Performance LINPACK (HPL). It just needs an optimized version of the BLAS and MPI.

TOP500
In 1986 Hans Meuer started a list of supercomputers around the world, ranked by peak performance. Hans approached me in 1992 about putting our lists together into the TOP500. The first TOP500 list appeared in June 1993.

Rules for HPL and TOP500
- The algorithm is Gaussian elimination with partial pivoting.
- Excludes the use of fast matrix multiply algorithms like Strassen's method.
- Excludes algorithms which compute a solution in a precision lower than full precision (64-bit floating point arithmetic) and refine the solution using an iterative approach (a sketch of the kind of scheme this excludes follows below).
- The authors of the TOP500 reserve the right to independently verify submitted LINPACK results, and to exclude from the list computers which are not valid or not general purpose in nature.
- Any computer designed specifically to solve the LINPACK benchmark problem, or which has as its major purpose the goal of a high TOP500 ranking, will be disqualified.
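
To make the third rule concrete, here is a minimal sketch (Python with NumPy/SciPy; not HPL code, just an illustration) of the kind of mixed-precision scheme that rule excludes: factor in 32-bit arithmetic, then iteratively refine the solution toward 64-bit accuracy.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    # Factor once in single precision (the fast part)...
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

    # ...then iteratively refine toward full 64-bit accuracy.
    for _ in range(5):
        r = b - A @ x                                     # residual in double precision
        x += lu_solve((lu, piv), r.astype(np.float32))    # correction from the float32 factors

Schemes like this can run considerably faster than a full 64-bit factorization, which is exactly why the rules single them out.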

#1 System on the Top500 Over the Past 24 Years (18 machines in that club)

Top500 List     | Computer                                   | R_max (Tflop/s) | N_max      | Hours | MW
6/93 (1)        | TMC CM-5/1024                              | 0.060           | 52,224     | 0.4   |
11/93 (1)       | Fujitsu Numerical Wind Tunnel              | 0.124           | 31,920     | 0.1   | 1.
6/94 (1)        | Intel XP/S140                              | 0.143           | 55,700     | 0.2   |
11/94-11/95 (3) | Fujitsu Numerical Wind Tunnel              | 0.170           | 42,000     | 0.1   | 1.
6/96 (1)        | Hitachi SR2201/1024                        | 0.220           | 138,240    | 2.2   |
11/96 (1)       | Hitachi CP-PACS/2048                       | 0.368           | 103,680    | 0.6   |
6/97-6/00 (7)   | Intel ASCI Red                             | 2.38            | 362,880    | 3.7   | 0.85
11/00-11/01 (3) | IBM ASCI White, SP Power3 375 MHz          | 7.23            | 518,096    | 3.6   |
6/02-6/04 (5)   | NEC Earth-Simulator                        | 35.9            | 1,000,000  | 5.2   | 6.4
11/04-11/07 (7) | IBM BlueGene/L                             | 478             | 1,000,000  | 0.4   | 1.4
6/08-6/09 (3)   | IBM Roadrunner, PowerXCell 8i 3.2 GHz      | 1,105           | 2,329,599  | 2.1   | 2.3
11/09-6/10 (2)  | Cray Jaguar XT5-HE 2.6 GHz                 | 1,759           | 5,474,272  | 17.3  | 6.9
11/10 (1)       | NUDT Tianhe-1A, X5670 2.93 GHz + NVIDIA    | 2,566           | 3,600,000  | 3.4   | 4.0
6/11-11/11 (2)  | Fujitsu K computer, SPARC64 VIIIfx         | 10,510          | 11,870,208 | 29.5  | 9.9
6/12 (1)        | IBM Sequoia BlueGene/Q                     | 16,324          | 12,681,215 | 23.1  | 7.9
11/12 (1)       | Cray XK7 Titan, AMD + NVIDIA Kepler        | 17,590          | 4,423,680  | 0.9   | 8.2
6/13-11/15 (6)  | NUDT Tianhe-2, Intel IvyBridge & Xeon Phi  | 33,862          | 9,960,000  | 5.4   | 17.8
6/16            | Sunway TaihuLight System                   | 93,014          | 12,288,000 | 3.7   | 15.4

Run Times for HPL on Top500 Systems
[Chart: distribution (0-100%) of HPL run times, in buckets from 1 hour up to 61 hours, for each Top500 list from June 1993 through June 2013.]

Run Times for HPL on Top500 Systems
[Chart: distribution of HPL run times, in hours, for each Top500 list from June 1993 through November 2015, with run times ranging from about 1 hour up to about 102 hours.]

Over the Course of the Run
You can't just start the run and then stop it early: the performance varies over the course of the run.
[Plot: performance in Pflop/s (0 to about 1.2) versus time in hours (0 to 20) over the course of an HPL run.]

HPL: Reducing Execution Time
R_max = (1 / (T_2 - T_1)) * integral from T_1 to T_2 of G(t) dt, where G(t) is the instantaneous rate in Gflop/s.
[Plot of G(t) over the run: timing only the initial section overestimates performance, the middle section gives a rough estimate, and the end section underestimates it.]
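
The averaging in that formula is straightforward to reproduce. A small NumPy sketch over a made-up G(t) trace (the shape of the trace is purely illustrative) shows how a window taken only from the early part of the run overestimates the whole-run average:

    import numpy as np

    # Made-up instantaneous rate G(t) in Gflop/s over a 20-hour run: high early, tailing off late.
    t = np.linspace(0.0, 20.0, 1201)          # hours, sampled once a minute
    G = 1.1e6 * (1.0 - (t / 20.0) ** 3)       # Gflop/s (1.1e6 Gflop/s = 1.1 Pflop/s)

    def avg_rate(T1, T2):
        """Average of G(t) over [T1, T2] via the trapezoid rule, per the formula above."""
        m = (t >= T1) & (t <= T2)
        tm, Gm = t[m], G[m]
        return np.sum(0.5 * (Gm[1:] + Gm[:-1]) * np.diff(tm)) / (T2 - T1)

    print(f"whole run   : {avg_rate(0.0, 20.0) / 1e6:.2f} Pflop/s")
    print(f"first 3 hrs : {avg_rate(0.0, 3.0) / 1e6:.2f} Pflop/s")   # early window overestimates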

How to Capture Performance?
Determine the section of the run where the computation and communication reflect what a completed run looks like.
[Plot: performance in Pflop/s versus time in hours over the course of an HPL run.]

LINPACK Benchmark: Still Learning Things
We use a backward error residual to check the correctness of the solution; this is the classical Wilkinson error bound. If the residual is small, O(1), then the software is doing the best it can, independent of the conditioning of the matrix. We say O(1) is OK; the code actually allows the residual to be up to O(10). For large problems we noticed the residual was getting smaller.
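
A minimal sketch of that kind of scaled-residual check, in Python/NumPy, using the ||Ax - b|| / (||A|| ||x|| n eps) formulation quoted earlier (HPL's actual scaling and acceptance threshold differ in detail):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    A = rng.standard_normal((n, n))
    b = A @ rng.standard_normal(n)

    x = np.linalg.solve(A, b)          # stand-in for the benchmark's GE/PP solve
    eps = np.finfo(np.float64).eps

    # Scaled backward-error residual: O(1) or smaller for a correctly working
    # solver, independent of how well or badly conditioned A is.
    r = np.linalg.norm(b - A @ x, np.inf)
    scale = np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf) * n * eps
    print("scaled residual:", r / scale)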

LINPACK Benchmark: Still Learning Things (continued)
The current criterion might be about O(10^3) too lax, which allows errors in the last 10-12 bits of the mantissa to go undetected. We believe this has to do with the rounding errors from collective operations when done in parallel, i.e. the matrix-vector products and norms. A better formulation of the residual might be: [the proposed formula was not captured in the transcription].

HPL - Bad Things
- The LINPACK benchmark is 37 years old; the TOP500 (HPL) is 23 years old.
- Floating-point intensive: performs O(n^3) floating point operations while moving only O(n^2) data.
- No longer so strongly correlated to real apps.
- Reports peak flops (although hybrid systems see only 1/2 to 2/3 of peak).
- Encourages poor choices in architectural features.
- Overall usability of a system is not measured.
- Used as a marketing tool.
- Decisions on acquisition made on one number.
- Benchmarking for days wastes a valuable resource.