HPC as a Driver for Computing Technology and Education


Tarek El-Ghazawi
The George Washington University, Washington, D.C., USA

NOW (July 2015): The TOP 10 Systems

Rank; site; computer; cores; Rmax (PFlops); % of peak; power (MW); MFlops/Watt
1. National Super Computer Center in Guangzhou, China; Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + custom; 3,120,000; 33.9; 62; 17.8; 1,905
2. DOE/OS Oak Ridge National Lab, USA; Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + custom; 560,640; 17.6; 65; 8.3; 2,120
3. DOE/NNSA Lawrence Livermore National Lab, USA; Sequoia, BlueGene/Q (16c) + custom; 1,572,864; 17.2; 85; 7.9; 2,063
4. RIKEN Advanced Institute for Computational Science, Japan; K computer, Fujitsu SPARC64 VIIIfx (8c) + custom; 705,024; 10.5; 93; 12.7; 827
5. DOE/OS Argonne National Lab, USA; Mira, BlueGene/Q (16c) + custom; 786,432; 8.16; 85; 3.95; 2,066
6. CSCS, Switzerland; Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + custom; 115,984; 6.27; 81; 2.3; 2,726
7. KAUST, Saudi Arabia; Shaheen II, Cray XC30, Xeon 16C + custom; 196,608; 5.54; 77; 4.5; 1,146
8. TACC, USA; Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB; 204,900; 5.17; 61; 4.5; 1,489
9. Forschungszentrum Juelich (FZJ), Germany; JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + custom; 458,752; 5.01; 85; 2.30; 2,178
10. DOE/NNSA Lawrence Livermore National Lab, USA; Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + custom; 393,216; 4.29; 85; 1.97; 2,177
500 (422). Software company, USA; HP Cluster; 18,896; 0.309; 48

HPC is a Top National Priority!
Executive Order from the White House: Establishment of the National Strategic Computing Initiative (NSCI), 29 July 2015

National Strategic Computing Initiative
Five strategic themes of the NSCI:
1) Create systems that can apply exaflops of computing power to exabytes of data
2) Keep the United States at the forefront of HPC capabilities
3) Improve HPC application developer productivity
4) Make HPC readily available
5) Establish hardware technology for future HPC systems

Future/Investments: International Exascale HPC Programs

Country; funding; years; remarks
- European Union; 700M; 2014-20; public-private partnership commitment through the European Technology Platform for HPC (ETP4HPC), 143.4M of it in 2014-15
- European Union; 74M; 2011-16; dedicated FP7 exascale projects
- India; $2B; 2014-20; led by IISc (Indian Institute of Science) and ISRO (Indian Space Research Organisation), targeting a 132 ExaFLOP/s machine
- India; $750M; 2014-19; C-DAC (Centre for Development of Advanced Computing) to set up 70 supercomputers over 5 years
- Japan; $1.38B; 2013-20; Post-K computer to be installed at RIKEN, tentatively based on the extreme-SIMD chip PACS-G
- China; funding not listed; due to the U.S. Department of Commerce ban, will use Chinese parts to upgrade the current #1 system

Why is HPC Important?
- Critical for economic competitiveness (highlighted by Minister Daoudi) because of its wide applications, through simulations and intensive data analyses
- Drives computer hardware and software innovations for future conventional computing
- Is becoming ubiquitous, i.e., all computing/information technology is turning parallel!
Is that why it is turning into an international HPC muscle-flexing contest?

Why is HPC Important? (1) Competitiveness
From the traditional product cycle (Design, Build, Test) to the simulation-driven cycle (Design, Model, Simulate, Build)

Why is HPC Important? Competitiveness: HPC Application Examples
- Molecular dynamics (HIV-1 protease with an inhibitor drug): a 2 ns simulation takes 2 weeks on a desktop, 6 hours on a supercomputer
- Gene sequence alignment (phylogenetic analysis): 32 days on a desktop, 1.5 hours on a supercomputer
- Car crash simulations: a 2-million-element simulation takes 4 days on a desktop, 25 minutes on a supercomputer
- Understanding the fundamental structure of matter: requires a billion-billion (10^18) calculations per second

Why is HPC Important? (2) The HPC of Today is the Conventional Computing of Tomorrow
- The ASCI Red supercomputer: 9,000 chips for 3 TeraFLOPS in 1997
- The Intel 80-core chip: 1 chip and 1 TeraFLOPS in 2007

Why is HPC Important? (3) HPC Concepts are Becoming Ubiquitous
- Sony PS3: uses the Cell processor
- Samsung Galaxy S6: 8 cores
- Roadrunner: the fastest supercomputer in 2008, also built on Cell processors
- Tile64: a 64-CPU chip that could be in your future laptop
HPC is ubiquitous! All computing is becoming HPC. Can we afford to be bystanders?

How Did We Get Here: Supercomputers in Recent History

Computer; processor; # processors; year; Rmax (TFlops)
- Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P; 3,120,000; 2013-present; 33,862
- Titan: Cray XK7, Opteron 16-core 2.2 GHz, Nvidia K20x; 560,640; 2012; 17,600
- K computer, Japan: SPARC64 VIIIfx 2.0 GHz; 705,024; 2011; 10,510
- Tianhe-1A, China: Intel EM64T Xeon X56xx (Westmere-EP) 2.93 GHz (11.72 GFlops) + NVIDIA GPU, FT-1000 8C; 186,368; 2010; 2,566
- Jaguar, Cray: Cray XT5-HE, Opteron six-core 2.6 GHz; 224,162; 2009; 1,759
- Roadrunner, IBM: PowerXCell 8i 3.2 GHz (12.8 GFlops); 122,400; 2008; 1,026
- BlueGene/L eServer Blue Gene Solution, IBM: PowerPC 440 700 MHz (2.8 GFlops); 212,992; 2007; 478
- BlueGene/L eServer Blue Gene Solution, IBM: PowerPC 440 700 MHz (2.8 GFlops); 131,072; 2005; 280
- BlueGene/L beta system, IBM: PowerPC 440 700 MHz (2.8 GFlops); 32,768; 2004; 70.7
- Earth Simulator, NEC: NEC 1000 MHz (8 GFlops); 5,120; 2002; 35.8
- ASCI White SP, IBM: POWER3 375 MHz (1.5 GFlops); 8,192; 2001; 7.2
- ASCI White SP, IBM: POWER3 375 MHz (1.5 GFlops); 8,192; 2000; 4.9
- ASCI Red, Intel: Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops); 9,632; 1999; 2.4
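To make the trend in this table concrete, here is a minimal sketch (not from the original slides) that estimates the average doubling time of the #1 system's Linpack performance from two of the rows above, ASCI Red (1999) and Tianhe-2 (2013).

```c
/* Minimal sketch: average doubling time of the #1 system's Rmax, using
 * ASCI Red (2.4 TFlops, 1999) and Tianhe-2 (33,862 TFlops, 2013) from
 * the table above. Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double rmax_1999 = 2.4;      /* TFlops, Intel ASCI Red        */
    double rmax_2013 = 33862.0;  /* TFlops, Tianhe-2 (MilkyWay-2) */
    double years = 2013 - 1999;

    double growth = rmax_2013 / rmax_1999;           /* ~14,000x overall   */
    double per_year = pow(growth, 1.0 / years);      /* annual growth rate */
    double doubling_months = 12.0 * log(2.0) / log(per_year);

    printf("overall growth: %.0fx, %.1fx per year, doubling every ~%.0f months\n",
           growth, per_year, doubling_months);
    return 0;
}
```

Under these assumptions the leading system roughly doubled its performance every 12 months, faster than transistor density alone would suggest, because processor counts, core counts, and accelerators grew as well.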

How Did We Get Here: Supercomputers in Recent History
See: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer

How Did We Get Here: Supercomputers in Recent History
(Chart: performance over time, from TeraFLOPS to PetaFLOPS, spanning vector machines, massively parallel processors, and MPPs with multicores and heterogeneous accelerators, first discrete and then integrated; HPCC era from 1993; 2008-2011 marks the end of Moore's Law in clocking.)

NOW (July 2015): The TOP 10 Systems (repeat of the table shown earlier)

How to Make Progress
- Launch a competitive funding cycle or a large national project
- Pose a system challenge: ~33.8 PFLOPS at 17.8 MW is about 2 GFLOPS/Watt; to reach exascale within the same total power budget we would need roughly 56 GFLOPS/Watt, about a 30x improvement in energy efficiency
- Pose an application challenge (or several)
- Let the community compete for government funding with innovative ideas
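As a quick check on the efficiency arithmetic in the system challenge above, the following minimal sketch (not from the slides) derives GFLOPS/Watt for the July 2015 #1 system and the efficiency an exaflop machine would need at the same power budget.

```c
/* Minimal sketch: energy-efficiency arithmetic behind the system challenge.
 * Values are taken from the TOP 10 table above (Tianhe-2, July 2015). */
#include <stdio.h>

int main(void) {
    double rmax_flops  = 33.9e15;   /* Tianhe-2 Rmax: 33.9 PFLOPS */
    double power_watts = 17.8e6;    /* Tianhe-2 power: 17.8 MW    */
    double exa_flops   = 1.0e18;    /* exascale target            */

    double today_gf_per_w  = rmax_flops / power_watts / 1e9;
    double needed_gf_per_w = exa_flops  / power_watts / 1e9;

    printf("today:  %.1f GFLOPS/Watt\n", today_gf_per_w);        /* ~1.9       */
    printf("needed: %.1f GFLOPS/Watt (%.0fx better)\n",
           needed_gf_per_w, needed_gf_per_w / today_gf_per_w);   /* ~56, ~30x  */
    return 0;
}
```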

Challenges: The End of Moore's Law
The phenomenon of exponential improvement in processors was observed in 1965 by Intel co-founder Gordon Moore. It has several popular formulations:
- The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same. Wrong, not anymore!
- The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity. OK, for now.
- The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same. OK, for now.

No Faster Clocking, but More Cores?
(Chart source: Ed Davis, Intel)

Accelerators and Dealing with the Moore's Law Challenge Through Parallelism

Device; fab. process (nm); frequency (GHz); # cores; peak SP FP (GFlops); peak DP FP (GFlops); peak power (W); DP Flops/W; memory BW (GB/s); memory type
- PowerXCell 8i: 65 nm; 3.2 GHz; 1 + 8 cores; 204 SP; 102.4 DP; 92 W; 1.11; 25.6 GB/s; XDR
- Nvidia Kepler K40: 28 nm; 0.75 GHz; 2,880 cores; 4,290 SP; 1,430 DP; 235 W; 6.1; 288 GB/s; GDDR5
- Intel Xeon Phi 7120P: 22 nm; 1.24 GHz; 61 cores (244 threads); 2,417 SP; 1,208 DP; 300 W; 4.0; 352 GB/s; GDDR5
- Intel Xeon E5-2697 v2 (12-core, 2.7 GHz): 22 nm; 2.7 GHz; 12 cores; 518.4 SP; 259.2 DP; 130 W; 1.99; 59.7 GB/s; DDR3-1866
- AMD Opteron 6370P (Interlagos): 32 nm; 2.5 GHz; 16 cores; 320 SP; 160 DP; 99 W; 1.62; 42.7 GB/s; DDR3-1333
- Xilinx XC7VX1140T: 28 nm; 801 SP; 241 DP; 43 W; 5.6
- Xilinx XCUV440: 20 nm; 1,306 SP; 402 DP; 80 W*; 5.0*
- Altera Stratix V GSB8: 28 nm; 604 SP; 296 DP; 59 W; 5.0

Accelerators/Heterogeneous Computing: FPGAs, Cell, GPUs, Xeon Phi, microprocessors

Application; speedup; cost savings; power savings; size savings
- DNA Match: 8,723x; 22x; 779x; 253x
- DES Breaker: 38,514x; 96x; 3,439x; 1,116x

El-Ghazawi et al., "The Promise of HPRCs," IEEE Computer, February 2008

A General Execution Model for Heterogeneous Computers
(Diagram: a host microprocessor (µP) transfers control and input data to an accelerator, e.g. a GPU, Cell B.E., FPGA, ClearSpeed, or Intel Xeon Phi; when the accelerated region completes, output data and control are transferred back to the host.)
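The slides show this model only as a diagram; the sketch below expresses it with OpenMP target offload, one of several possible interfaces and not the one the talk prescribes, to make the transfers of control and data explicit. The array names and sizes are illustrative only.

```c
/* Minimal sketch of the host/accelerator execution model, written with
 * OpenMP target offload (one possible interface). The host maps input
 * data to the device, control moves to the accelerated region, and the
 * results and control return to the host. Falls back to the host CPU
 * when no offload device is available. */
#include <stdio.h>

#define N 1000000

static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Input data (x, y) travel to the accelerator; y travels back. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %.1f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```

The same pattern underlies CUDA, OpenCL, and OpenACC programs; what varies is how explicitly the programmer manages the partitioning and data movement listed in the challenges on the next slide.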

Challenges for Accelerators
1. The application must lend itself to the 90-10 rule, and different accelerators suit different types of computations
2. The programmer partitions the code across the CPU and the accelerator
3. The programmer co-schedules the CPU and the accelerator, and ensures good utilization of the expensive accelerator resources
4. The programmer explicitly transfers data between the CPU and the accelerator
5. Accelerators are fast compared to the host-accelerator link, and the transfer overhead can render the use of the accelerator useless or even harmful
6. Multiple programming paradigms are needed
7. A new accelerator means learning and porting to a new programming interface
8. Changing the ratio of CPUs to accelerators also requires substantial reprogramming unless the accelerators are virtualized

Challenges for Advancing or for Exascale (DoE ASCAC Subcommittee Report, Feb 2014)
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design and Decision
9. Resilience and Correctness
10. Scientific Productivity
Data movement and/or programming related

Exascale Technological Challenges
- The Power Wall: frequency scaling is no longer possible; power increases rapidly
- The Memory Wall: the gap between processor speed and memory speed is widening
- The Interconnect Wall: available bandwidth per compute operation is dropping, and the power needed for data movement is increasing
- The Programmability Wall, the Resilience Wall, ...
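To see the memory wall concretely, here is a minimal sketch (not from the slides) that times a STREAM-like triad loop; on most current machines the reported bandwidth, not the processor's peak FLOP rate, is what limits this loop. Array size and timing method are illustrative.

```c
/* Minimal sketch illustrating the memory wall: a STREAM-like triad loop
 * is limited by memory bandwidth, not by peak FLOP rate. Uses POSIX
 * clock_gettime for timing. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)   /* 16M doubles per array, ~128 MB each */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* 2 flops per 24 bytes touched */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    double gflops = 2.0 * N / 1e9;
    printf("triad: %.2f GB/s, %.2f GFLOP/s\n", gbytes / secs, gflops / secs);

    free(a); free(b); free(c);
    return 0;
}
```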

The Data Movement Challenge
(Charts: bandwidth density vs. system distance, and energy vs. system distance. [Source: ASCAC 14])
- Locality matters a lot: cost (energy and time) increases rapidly with distance
- Locality should be exploited at short distances, and is needed even more at far distances

Data Movement and the Hierarchical Locality Challenge

Locality is Not Flat Anymore: Chip and System
(Series of figures illustrating hierarchical locality within a chip, e.g. the Tile64 many-core processor, and across an extreme-scale system, e.g. the Cray XC40.)

What Does That Mean for Programmers? Exploiting Hierarchical Locality
- At the machine level and at the chip level
- Hierarchical tiled data structures
- Hierarchical locality exploitation with the RTS (runtime system)
- MPI+X
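The slides list tiled data structures without code, so the following is a minimal sketch (not from the talk) contrasting a naive strided traversal with a tiled (blocked) one; the tile size B is a hypothetical, cache-dependent parameter.

```c
/* Minimal sketch of tiling for locality: the same transpose-style copy,
 * written naively and then blocked into B x B tiles so that each tile
 * stays resident in one level of the memory hierarchy. */
#include <stdio.h>
#include <stdlib.h>

#define N 2048
#define B 64          /* hypothetical tile size; assumed to divide N evenly */

/* Naive version: the write dst[j][i] strides through memory, so almost
 * every access misses in cache once the matrices exceed cache capacity. */
void transpose_naive(double dst[N][N], double src[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Tiled version: work proceeds tile by tile, so the source and destination
 * tiles both fit in cache and are reused before eviction. The same idea
 * nests for deeper hierarchies (registers, L1, L2, on-chip tile, node, ...). */
void transpose_tiled(double dst[N][N], double src[N][N]) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)
                for (size_t j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}

int main(void) {
    /* Each matrix is ~32 MB, so allocate on the heap. */
    double (*src)[N] = malloc(sizeof(double[N][N]));
    double (*dst)[N] = malloc(sizeof(double[N][N]));
    if (!src || !dst) return 1;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            src[i][j] = (double)(i * N + j);

    transpose_tiled(dst, src);
    printf("dst[1][0] = %.0f\n", dst[1][0]);   /* expect 1 */
    free(src); free(dst);
    return 0;
}
```

Nesting the same blocking across registers, caches, on-chip tiles, nodes, and the MPI+X layer is one common reading of "hierarchical tiled data structures" and "hierarchical locality exploitation with the RTS".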

General Implications
- A short-term programming challenge, but a golden opportunity for smart programmers
- New hardware advances are needed first, and they will influence software
- They may be silicon based, or nanotechnologies such as IBM's carbon-nanotube transistors (9 nm); either may keep things the way they are on the software side for a while

General Implications: The Longer Run
- Long-term hardware technology may move toward nanophotonics for computing or quantum computing
- Many of the new hardware computing innovations may show up first as discrete accelerators, then as on-chip accelerators, and then move closer to the processor's internal circuitry (the datapath)

The Longer Term
- The bad news: as the limits of silicon are approached, we may see departures from conventional methods of computing that dramatically change the way we conceive software
- The good news: history has shown that good ideas from the past get resurrected in new ways

Conclusions
- Graduating an intelligent IT workforce can be a golden egg for countries like Morocco
- You can teach skills, but it is imperative to teach and stress concepts in the curriculum
- Stress parallelism; stress locality
- See the recommendations by IEEE/NSF and SIAM for incorporating parallelism into Computer Science, Computer Engineering, and Computational Science and Engineering curricula, and add locality
- For the very long term: there is nothing better than good foundations in physics and math, even for CS and CE majors

Conclusions (cont.)
- Integrate the teaching of soft skills, as President Ouaouicha said:
  - Communication
  - Entrepreneurship and marketing, individually and in groups
  - Patenting and legal matters