Emerging Heterogeneous Technologies for High Performance Computing


MURPA (Monash Undergraduate Research Projects Abroad)
Emerging Heterogeneous Technologies for High Performance Computing
Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, University of Manchester

Simulation: The Third Pillar of Science. The traditional scientific and engineering paradigm: (1) do theory or a paper design; (2) perform experiments or build a system. Its limitations: too difficult (e.g., building large wind tunnels), too expensive (e.g., building a throw-away passenger jet), too slow (e.g., waiting for climate or galactic evolution), too dangerous (weapons, drug design, climate experimentation). The computational science paradigm adds: (3) use high-performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.

Why Turn to Simulation? When the problem is too complex, too large or too small, too expensive, or too dangerous to do any other way. [Embedded animation: Taurus_to_Taurus_60per_30deg.mpeg]

Why Turn to Simulation? Climate and weather modeling; data-intensive problems (data mining, oil reservoir simulation); problems with large length and time scales (cosmology).

Computational Science (figure; source: Steven E. Koonin, DOE)

High-Performance Computing Today. In the past decade, the world has experienced one of the most exciting periods in computer development. Microprocessors have become smaller, denser, and more powerful. The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.

Technology Trends: Microprocessor Capacity. Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months, i.e., 2x transistors per chip every 1.5 years. This is called Moore's Law. Microprocessors have become smaller, denser, and more powerful. And it is not just processors: bandwidth, storage, etc. follow the same trend, with 2x memory and processor speed and half the size, cost, and power every 18 months.

Moore's Law is Alive and Well. (Chart: transistor counts in thousands, 1970-2010, growing exponentially on a log scale. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović; slide from Kathy Yelick.)

Transistors into Cores

Today's Multicores: 99% of Top500 systems are based on multicore processors. Examples: Sun Niagara2 (8 cores), IBM Power 7 (8 cores), Intel Westmere (10 cores), Fujitsu Venus (16 cores), Nvidia Kepler (2688 CUDA cores), Intel Xeon Phi (60 cores), AMD Interlagos (16 cores), IBM BG/Q (18 cores).

But Clock Frequency Scaling Has Been Replaced by Scaling Cores per Chip: 15 years of exponential growth (~2x per year) has ended. (Chart: transistors in thousands, frequency in MHz, and core counts, 1970-2010. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović; slide from Kathy Yelick.)

Performance Has Also Slowed, Along with Power. Power is the root cause of all this: a hardware issue just became a software problem. (Chart: transistors in thousands, frequency in MHz, power in W, and core counts, 1970-2010. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović; slide from Kathy Yelick.)

Power Cost of Frequency. Power ∝ Voltage² × Frequency (V²F), and Frequency ∝ Voltage, so Power ∝ Frequency³.
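A quick numeric sketch of that cubic relationship (illustrative only, with hypothetical numbers): running a core 20% slower cuts its power roughly in half, which is why two slower cores can beat one fast core within the same power budget.

```python
# Illustrative sketch of the Power ~ Frequency^3 model quoted above,
# assuming voltage scales linearly with frequency (not measured data).
def relative_power(freq_scale):
    """Power relative to baseline when frequency (and voltage) is scaled."""
    return freq_scale ** 3

print(relative_power(0.8))       # one core at 0.8x clock: ~0.51x the power
print(2 * relative_power(0.8))   # two cores at 0.8x clock: ~1.6x the work for ~1.02x the power
```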

Look at the Fastest Computers. The strategic importance of supercomputing: essential for scientific discovery; critical for national security; a fundamental contributor to the economy and competitiveness through use in engineering and manufacturing. Supercomputers are the tool for solving the most challenging problems through simulation.

Example of a Typical Parallel Machine. Cores are grouped into a chip/socket; several chips/sockets, each possibly with an attached GPU, form a node/board; nodes/boards are grouped into a cabinet; and cabinets are connected through a switch to form the full system. Shared-memory programming is used between processes on a board, and a combination of shared-memory and distributed-memory programming between nodes and cabinets.
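As a hedged illustration of that hybrid model (my sketch, not code from the talk): each MPI rank below stands in for one node and leans on NumPy's multithreaded BLAS for the shared-memory work on its board, while MPI handles the distributed-memory communication between nodes. Assumes mpi4py and NumPy are installed; the file name is hypothetical.

```python
# hybrid_sketch.py (hypothetical): run with e.g. "mpiexec -n 4 python hybrid_sketch.py".
# Distributed memory across ranks (nodes) via MPI; shared memory within a rank
# via NumPy's threaded BLAS. Assumes the number of ranks divides n.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 2048
local_rows = n // size
A_local = np.random.rand(local_rows, n)   # this "node's" block of matrix rows

x = np.empty(n)
if rank == 0:
    x = np.random.rand(n)
comm.Bcast(x, root=0)                     # distributed-memory communication

y_local = A_local @ x                     # shared-memory (threaded) compute on the board

y = np.empty(n) if rank == 0 else None
comm.Gather(y_local, y, root=0)           # collect the distributed result

if rank == 0:
    print("||A x|| =", np.linalg.norm(y))
```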

What do you mean by performance? What is an xflop/s? An xflop/s is a rate of execution: some number of floating-point operations per second. Whenever this term is used here, it refers to 64-bit floating-point operations, where the operations are either additions or multiplications. What is the theoretical peak performance? The theoretical peak is based not on actual performance from a benchmark run, but on a paper computation of the theoretical peak rate of execution of floating-point operations for the machine. It is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Xeon 5570 quad-core at 2.93 GHz can complete 4 floating-point operations per cycle, giving a theoretical peak of 11.72 Gflop/s per core, or 46.88 Gflop/s for the socket.
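The peak number quoted on that slide can be reproduced directly (a minimal sketch; the function name is mine, and the 4 flops/cycle figure is the one stated above):

```python
# Theoretical peak = flops per cycle x clock rate x number of cores.
def theoretical_peak_gflops(flops_per_cycle, clock_ghz, cores):
    return flops_per_cycle * clock_ghz * cores

print(theoretical_peak_gflops(4, 2.93, 1))   # 11.72 Gflop/s per core
print(theoretical_peak_gflops(4, 2.93, 4))   # 46.88 Gflop/s for the quad-core socket
```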

The TOP500: H. Meuer, H. Simon, E. Strohmaier, and J. Dongarra. A listing of the 500 most powerful computers in the world. Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem; TPP performance, i.e., rate as a function of problem size). Updated twice a year: at SC'xy in the States in November and at the meeting in Germany in June. All data available from www.top500.org

Performance Development of HPC Over the Last 20 Years (1993-2013). (Chart: Sum, N=1, and N=500 on a log scale from 100 Mflop/s to 100 Pflop/s. In 1993: Sum = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s. In 2013: Sum = 224 PFlop/s, N=1 = 33.9 PFlop/s, N=500 = 96.62 TFlop/s. N=500 trails N=1 by roughly 6-8 years. For comparison: my laptop, ~70 Gflop/s; my iPad 2 and iPhone 4s, ~1.02 Gflop/s.)
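For a rough sense of that growth rate (my arithmetic from the chart's endpoints, not a figure from the talk): the #1 system went from 59.7 Gflop/s to 33.9 Pflop/s over those 20 years, which works out to roughly 1.9x per year.

```python
# Rough growth-rate estimate from the chart's N=1 endpoints (1993 -> 2013).
n1_1993_gflops = 59.7
n1_2013_gflops = 33.9e6        # 33.9 Pflop/s expressed in Gflop/s
years = 20

total_factor = n1_2013_gflops / n1_1993_gflops
annual_factor = total_factor ** (1 / years)
print(f"total growth ~{total_factor:,.0f}x, i.e. ~{annual_factor:.2f}x per year")
```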

TOP500 Editions (41 so far, over 20 years). (Chart: performance per list edition on a log scale.)

Processors Used in the Top500 Systems: Intel 80% (402), AMD 10% (50), IBM 8.4% (42).

Countries' Share (absolute counts): US 252, China 66, Japan 30, UK 29, France 23, Germany 19, Australia 5.

Top Systems in Australia (slide dated 9/11/2013)
Rank | Computer | Site | Manufacturer | Segment | Cores | Rmax (Gflop/s) | % of Peak
27 | Fujitsu PRIMERGY CX250 S1, Xeon E5-2670 8C 2.600GHz, InfB FDR | National Computational Infrastructure, ANU | Fujitsu | Research | 53,504 | 978,600 | 88
39 | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | Victorian Life Sciences Computation Initiative | IBM | Research | 65,536 | 715,551 | 85
289 | Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz, InfB FDR, NVIDIA 2050 | CSIRO | Xenon Systems | Government | 7,308 | 133,700 | 59
320 | Sun Blade x6048, Xeon X5570 2.93 GHz, InfB QDR | National Computational Infrastructure National Facility | Oracle | Academic | 11,936 | 126,400 | 90
460 | xSeries x3650M3 Cluster, Xeon X5650 6C 2.660GHz, InfB | Defence | IBM | Government | 10,776 | 101,998 | 89

Customer Segments

The Petaflops Club: 26 systems with a Linpack Rmax above 1 Pflop/s (United States 11, Japan 4, Germany 2, France 3, United Kingdom 2, China 3, Italy 1).
Name | Rmax (Pflop/s) | Country | Vendor: architecture
Tianhe-2 (MilkyWay-2) | 33.86 | China | NUDT: Hybrid Intel/Intel/Custom
Titan | 17.59 | United States | Cray: Hybrid AMD/Nvidia/Custom
Sequoia | 17.17 | United States | IBM: BG-Q/Custom
K Computer | 10.51 | Japan | Fujitsu: Sparc/Custom
Mira | 8.59 | United States | IBM: BG-Q/Custom
Stampede | 5.17 | United States | Dell: Hybrid Intel/Intel/IB
JUQUEEN | 5.01 | Germany | IBM: BG-Q/Custom
Vulcan | 4.29 | United States | IBM: BG-Q/Custom
SuperMUC | 2.90 | Germany | IBM: Intel/IB
Tianhe-1A | 2.57 | China | NUDT: Hybrid Intel/Nvidia/Custom
Pangea | 2.10 | France | Bull: Intel/IB
Fermi | 1.79 | Italy | IBM: BG-Q/Custom
DARPA Trial Subset | 1.52 | United States | IBM: Intel/IB
Spirit | 1.42 | United States | SGI: Intel/IB
Curie thin nodes | 1.36 | France | Bull: Intel/IB
Nebulae | 1.27 | China | Dawning: Hybrid Intel/Nvidia/IB
Yellowstone | 1.26 | United States | IBM: BG-Q/Custom
Blue Joule | 1.25 | United Kingdom | IBM: BG-Q/Custom
Pleiades | 1.24 | United States | SGI: Intel/IB
Helios | 1.24 | Japan | Bull: Intel/IB
TSUBAME 2.0 | 1.19 | Japan | NEC/HP: Hybrid Intel/Nvidia/IB
Cielo | 1.11 | United States | Cray: AMD/Custom
DiRAC | 1.07 | United Kingdom | IBM: BG-Q/Custom
Hopper | 1.05 | United States | Cray: AMD/Custom
Tera-100 | 1.05 | France | Bull: Intel/IB
Oakleaf-FX | 1.04 | Japan | Fujitsu: Sparc/Custom

Performance of Countries. (Chart: total performance in Tflop/s, 2000-2012, on a log scale from 1 to 100,000 Tflop/s, with curves added in turn for the US, the EU, Japan, China, and Others.)

June 2013: The TOP10 (plus #500)
Rank | Site | Computer | Country | Cores | Rmax (Pflop/s) | % of Peak | Power (MW) | Mflops/W
1 | National University of Defense Technology | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 70 | 17.8 | 1905
2 | DOE/OS Oak Ridge Nat Lab | Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 66 | 8.3 | 2120
3 | DOE/NNSA L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + Custom | USA | 1,572,864 | 16.3 | 81 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE/OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2066
6 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 67 | 3.3 | 806
7 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
8 | DOE/NNSA L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
9 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.42 | 848
10 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT, Intel (6c) + Nvidia Fermi GPU (14c) + Custom | China | 186,368 | 2.57 | 55 | 4.04 | 636
500 | US Navy DSRC | Cray XT5 | USA | 12,720 | 0.096 | 79 | |

Comparison: ~1 Gflop/s (phone/tablet) vs. ~70 Gflop/s (laptop) vs. 33,000,000 Gflop/s (the #1 system): roughly a factor of 500,000 between a laptop and the fastest machine.

Tianhe-2 (Milkyway-2), China, 2013: the 34-PFLOPS system. Developed in cooperation between NUDT and Inspur for the National Supercomputer Center in Guangzhou. Peak performance of 54.9 PFLOPS. Its 16,000 nodes contain 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores, housed in 162 cabinets with a 720 m² footprint. Total of 1.404 PB memory (88 GB per node). Each Xeon Phi board uses 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock. Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches). 12.4 PB parallel storage system. 17.6 MW power consumption under load, or 24 MW including (water) cooling. 4,096 SPARC V9-based Galaxy FT-1500 processors in the front-end system.

ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors, in 4,352 ft² (404 m²). System specifications: peak performance of 27 PF (24.5 Pflop/s GPU + 2.6 Pflop/s AMD); 18,688 compute nodes, each with a 16-core AMD Opteron CPU, an NVIDIA Tesla K20x GPU, and 32 + 6 GB of memory; 512 service and I/O nodes; 200 cabinets; 710 TB total system memory; Cray Gemini 3D torus interconnect; 9 MW peak power.

Titan: Cray XK7 System. Compute node: 1.45 TF, 38 GB. Board: 4 compute nodes, 5.8 TF, 152 GB. Cabinet: 24 boards, 96 nodes, 139 TF, 3.6 TB. System: 200 cabinets, 18,688 nodes, 27 PF, 710 TB.
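Those per-level figures compose multiplicatively; a small check (my arithmetic, using only the numbers from the two Titan slides above):

```python
# Rough arithmetic check of the Titan hierarchy figures quoted above.
system_peak_pf = 24.5 + 2.6              # GPU peak + CPU peak, in Pflop/s
nodes = 18_688

per_node_tf = system_peak_pf * 1000 / nodes
print(f"per node    ~{per_node_tf:.2f} TF")      # ~1.45 TF, matching the slide
print(f"per board   ~{per_node_tf * 4:.1f} TF")  # 4 compute nodes per board -> ~5.8 TF
print(f"per cabinet ~{per_node_tf * 96:.0f} TF") # 96 nodes per cabinet -> ~139 TF
```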

Sequoia, USA, 2012: BlueGene strikes back. Built by IBM for NNSA and installed at LLNL. 20,123.7 TFLOPS peak performance. Blue Gene/Q architecture with 1,572,864 PowerPC A2 cores in total; 98,304 nodes in 96 racks occupy 280 m². 1,572,864 GB of DDR3 memory. 5-D torus interconnect; 768 I/O nodes. 7,890 kW power, or 2.07 GFLOPS/W. Achieves 16,324.8 TFLOPS in HPL (#1 in June 2012), about 14 PFLOPS in HACC (a cosmology simulation), and 12 PFLOPS in the Cardioid code (electrophysiology).

Japanese K Computer. Linpack run with 705,024 cores (88,128 CPUs) at 10.51 Pflop/s, drawing 12.7 MW; the run took 29.5 hours. Fujitsu plans to have a 100 Pflop/s system in 2014.

China: the first Chinese supercomputer to use a Chinese processor. Sunway BlueLight MPP, built on the ShenWei SW1600 processor: 16 cores, 65 nm, fabbed in China, 125 Gflop/s peak. It is #14 on the list with 139,364 cores, 0.796 Pflop/s Rmax and 1.07 Pflop/s peak, and a power efficiency of 741 Mflops/W. Coming soon: the Loongson (Godson) processor, 8 cores, 65 nm; the Loongson 3B runs at 1.05 GHz with a peak performance of 128 Gflop/s.

Industrial Use of Supercomputers. Of the 500 fastest supercomputers worldwide, industrial use is ~50%: aerospace, automotive, biology, CFD, database, defense, digital content creation, digital media, electronics, energy, environment, finance, gaming, geophysics, image processing/rendering, information processing service, information service, life science, media, medicine, pharmaceutics, research, retail, semiconductor, telecom, weather and climate research, weather forecasting.

Critical Issues at Peta- and Exascale for Algorithm and Software Design:
Synchronization-reducing algorithms: break the fork-join model.
Communication-reducing algorithms: use methods that attain lower bounds on communication.
Mixed-precision methods: 2x the speed of operations and 2x the speed of data movement (a sketch of mixed-precision iterative refinement follows below).
Autotuning: today's machines are too complicated; build smarts into the software so it adapts to the hardware.
Fault-resilient algorithms: implement algorithms that can recover from failures.
Reproducibility of results: today we can't guarantee this; we understand the issues, but some of our colleagues have a hard time with this.
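As a hedged illustration of the mixed-precision item (my sketch, not code from the talk): do the expensive O(n³) factorization in fast single precision, then recover double-precision accuracy with a few cheap O(n²) refinement steps. Assumes NumPy and SciPy are available and that the matrix is reasonably well conditioned.

```python
# Minimal mixed-precision iterative refinement sketch (illustrative, not from the talk).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    # Expensive O(n^3) step in single precision: ~2x the op rate, half the data moved.
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in double precision
        d = lu_solve((lu, piv), r.astype(np.float32))    # correction via the single-precision factors
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))
```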

Conclusions. For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware. This strategy needs to be rebalanced: barriers to progress are increasingly on the software side. Moreover, the return on investment is more favorable for software: hardware has a half-life measured in years, while software has a half-life measured in decades. The high-performance ecosystem is out of balance across hardware, OS, compilers, software, algorithms, and applications. There is no Moore's Law for software, algorithms, and applications.