High-Performance Computing - and Why Learn About It?
Tarek El-Ghazawi, The George Washington University, Washington D.C., USA
Outline
- What is High-Performance Computing?
- Why is High-Performance Computing Important?
- Advances in Performance and Architectures
- Heterogeneous Accelerated Computing
- Advances in Parallel Programming
- What is Next: The HPCS Program (near term)
- What is Next: Exascale and DOE
- Conclusions
What are Supercomputing and Parallel Architectures?
- Also called high-performance computing and parallel computing
- Research and innovation in architecture, programming, and applications associated with computer systems that are orders of magnitude faster (10x-1000x or more) than modern desktop and laptop computers
- Supercomputers achieve speed through massive parallelism (parallel architectures!), e.g., many processors working together
- http://www.collegehumor.com/video:1828443
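As an illustration of why massive parallelism (and not just a faster clock) is the route to these 10x-1000x speedups, here is a quick sketch of Amdahl's law, which is not on the slide but bounds how far adding processors can take you; the 95%-parallel fraction below is a made-up example value:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Ideal speedup when only `parallel_fraction` of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with 1000 processors, a 5% serial fraction keeps speedup under 20x.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 1))  # 6.9, 16.8, 19.6
```

This is why supercomputer applications are engineered to keep the serial fraction as close to zero as possible.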
Outline
- What is High-Performance Computing?
- Why is High-Performance Computing Important?
- Advances in Performance and Architectures
- Heterogeneous Accelerated Computing
- Advances in Parallel Programming
- What is Next: The HPCS Program (near term)
- What is Next: Exascale and DOE
- Conclusions
Why is HPC Important?
- Critical for economic competitiveness because of its wide range of applications (through simulations and intensive data analyses)
- Drives computer hardware and software innovations for future conventional computing
- Is becoming ubiquitous, i.e., all computing/information technology is turning parallel!
- Is that why it is turning into an international HPC muscle-flexing contest?
Why is HPC Important?
- The traditional engineering cycle: Design -> Build -> Test
- With HPC: Design -> Model -> Simulate -> Build
Why is HPC Important? National and Economic Competitiveness: HPC Application Examples
- Molecular dynamics (HIV-1 protease inhibitor drug simulation for 2 ns): 2 weeks on a desktop vs. 6 hours on a supercomputer
- Gene sequence alignment (phylogenetic analysis): 32 days on a desktop vs. 1.5 hours on a supercomputer
- Car crash simulations (2 million elements): 4 days on a desktop vs. 25 minutes on a supercomputer
- Understanding the fundamental structure of matter: requires a billion billion calculations per second
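The desktop-vs-supercomputer comparisons above work out to the following speedup factors (simple unit-conversion arithmetic on the slide's own numbers):

```python
def speedup(desktop_hours, super_hours):
    """How many times faster the supercomputer run is."""
    return desktop_hours / super_hours

# Numbers from the slide, converted to hours.
md = speedup(2 * 7 * 24, 6)        # drug simulation: 2 weeks vs 6 hours
phylo = speedup(32 * 24, 1.5)      # phylogenetic analysis: 32 days vs 1.5 hours
crash = speedup(4 * 24, 25 / 60)   # car crash: 4 days vs 25 minutes
print(md, phylo, round(crash, 1))  # 56.0 512.0 230.4
```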
Why is HPC Important? National and Economic Competitiveness
- Industrial competitiveness: computational models that run on HPC are not only for the design of NASA space shuttles; they can also help with business intelligence (e.g., IBM and Watson)
- Designing effective shapes and/or materials for potato chips and Clorox bottles
HPC Technology of Today is Conventional Computing of Tomorrow: Multi/Many-Cores in Desktops and Laptops
- The ASCI Red supercomputer: 9,000 chips for 3 TeraFLOPS in 1997
- Intel 80-core chip: 1 chip, 1 TeraFLOPS in 2007
- Intel 72-core chip (Xeon Phi KNL): 1 chip, 3 TeraFLOPS in 2016
Why is HPC Important? HPC is Ubiquitous!
All computing is becoming HPC; can we afford to be bystanders?
- Sony PS3: uses the Cell processor
- The Roadrunner (fastest supercomputer in '08): also uses Cell processors
- iPhone 7: 4 cores at 2.34 GHz
- Xeon Phi KNL: a 72-CPU chip
Why is this happening? The End of Moore's Law in Clocking
The phenomenon of exponential improvement in processors was observed in 1965 by Intel co-founder Gordon Moore. Three common formulations:
- The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same: wrong, not anymore!
- The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity: OK, for now
- The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same: OK, for now
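The transistor-count formulation compounds quickly, which a one-line formula makes concrete; the ten-year horizon below is an arbitrary example, not a figure from the slide:

```python
def transistor_growth(years, doubling_period_months=24):
    """Multiplicative growth in transistor count after `years`,
    given a fixed doubling cadence in months."""
    return 2 ** (years * 12 / doubling_period_months)

# A decade at a 24-month cadence gives 2^5 = 32x more transistors;
# the faster 18-month cadence would give about 101x.
print(transistor_growth(10))      # 32.0
print(transistor_growth(10, 18))  # ~101.6
```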
No faster clocking but more Cores? Source: Ed Davis, Intel 12
Cores and Power Efficiency Source: Ed Davis, Intel 13
Comparative View of Processors and Accelerators
(fields: fabrication process; clock; cores; peak SP GFlops; peak DP GFlops; peak power; DP GFlops/W; memory BW; memory type)
- IBM PowerXCell 8i: 65 nm; 3.2 GHz; 1 + 8 cores; 204 SP; 102.4 DP; 92 W; 1.11; 25.6 GB/s; XDR
- NVIDIA Fermi Tesla M2090: 40 nm; 1.3 GHz; 512 cores; 1330 SP; 665 DP; 225 W; 2.9; 177 GB/s; GDDR5
- NVIDIA Kepler K20X: 28 nm; 0.73 GHz; 2688 cores; 3950 SP; 1310 DP; 235 W; 5.6; 250 GB/s; GDDR5
- NVIDIA Kepler K80: 28 nm; 0.88 GHz; 2x2496 cores; 8749 SP; 2910 DP; 300 W; 9.7; 480 GB/s; GDDR5
- Intel Xeon Phi 5110P (KNC): 22 nm; 1.05 GHz; 60 cores (240 threads); 1011 DP; 225 W; 4.5; 320 GB/s; GDDR5
- Intel Xeon Phi 7290 (KNL): 14 nm; 1.7 GHz; 72 cores (288 threads); ~3500 DP; 245 W; 14.3; 115.2 GB/s; DDR4
- Intel Xeon E7-8870: 32 nm; 2.4 (2.8) GHz; 10 cores; 202.6 SP; 101.3 DP; 130 W; 0.78; 42.7 GB/s; DDR3-1333
- AMD Opteron 6176 SE: 45 nm; 2.5 GHz; 12 cores; 240 SP; 120 DP; 140 W; 0.86; 42.7 GB/s; DDR3-1333
- Xilinx V6 SX475T (FPGA): 40 nm; 98.8 DP; 50 W; 3.3
- Altera Stratix V GSB8 (FPGA): 28 nm; 210 DP; 60 W; 3.5
Most Power Efficient Architectures: Green 500 https://www.top500.org/green500/lists/2016/11/ 15
Outline
- What is High-Performance Computing?
- Why is High-Performance Computing Important?
- Advances in Performance and Architectures
- Heterogeneous Accelerated Computing
- Advances in Parallel Programming
- What is Next: The HPCS Program (near term)
- What is Next: Exascale and DOE
- Conclusions
How is the Supercomputing Race Conducted? TOP500 Supercomputers and LINPACK
- The TOP500 list is published in November and in June
- Rmax: maximal LINPACK performance achieved
- Rpeak: theoretical peak performance
- In the TOP500 list, computers are ordered first by their Rmax value
- For equal performance (Rmax) on different computers, the order is by Rpeak
- For sites with the same performance, the order is by memory size and then alphabetical
- Check www.top500.org for more information
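The ordering rules above amount to a four-key sort. A sketch in Python, with hypothetical system entries (the names and figures are invented for illustration, not TOP500 data):

```python
# Hypothetical entries; fields mirror the ordering criteria described above.
systems = [
    {"name": "SystemB", "rmax": 93.0, "rpeak": 125.4, "memory_tb": 1310},
    {"name": "SystemA", "rmax": 33.9, "rpeak": 54.9,  "memory_tb": 1404},
    {"name": "SystemC", "rmax": 33.9, "rpeak": 54.9,  "memory_tb": 1404},
    {"name": "SystemD", "rmax": 33.9, "rpeak": 60.0,  "memory_tb": 1000},
]

def top500_order(entry):
    # Descending Rmax, then descending Rpeak, then descending memory size,
    # and finally alphabetical by name.
    return (-entry["rmax"], -entry["rpeak"], -entry["memory_tb"], entry["name"])

ranking = sorted(systems, key=top500_order)
print([s["name"] for s in ranking])  # ['SystemB', 'SystemD', 'SystemA', 'SystemC']
```

SystemD outranks SystemA and SystemC despite the identical Rmax because its Rpeak is higher; the A-vs-C tie falls through to the alphabetical rule.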
Top 10 Supercomputers, November 2016 (www.top500.org), ranks 1-5 (cores; Rmax in PFlops)
1. Sunway TaihuLight, National Supercomputing Center in Wuxi, China: Sunway MPP, Sunway SW26010 260C 1.45 GHz, Sunway interconnect, NRCPC; 10,649,600 cores; 93.0
2. Tianhe-2 (MilkyWay-2), National University of Defense Technology, China: TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P; 3,120,000 cores; 33.9
3. Titan, Oak Ridge National Laboratory, United States: Cray XK7, 16-core Opteron 2.2 GHz, Gemini interconnect, NVIDIA K20X; 560,640 cores; 17.6
4. Sequoia, Lawrence Livermore National Laboratory, United States: BlueGene/Q, 16-core Power BQC, custom interconnect; 1,572,864 cores; 16.3
5. Cori, DOE/SC/LBNL/NERSC, United States: Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect, Cray Inc.; 622,336 cores; 14.0
Top 10 Supercomputers, November 2016 (www.top500.org), ranks 6-10 (cores; Rmax in PFlops)
6. Oakforest-PACS, Joint Center for Advanced High Performance Computing, Japan: PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path, Fujitsu; 556,104 cores; 13.6
7. K Computer, RIKEN Advanced Institute for Computational Science, Japan: SPARC64 VIIIfx 2.0 GHz, Tofu interconnect; 795,024 cores; 10.5
8. Piz Daint, Swiss National Supercomputing Centre (CSCS), Switzerland: Cray XC30, Xeon E5-2670 8C 2.6 GHz, Aries interconnect, NVIDIA K20x, Cray Inc.; 206,720 cores; 9.8
9. Mira, Argonne National Laboratory, United States: BlueGene/Q, 16-core Power BQC, custom interconnect; 786,432 cores; 8.16
10. Trinity, DOE/NNSA/LANL/SNL, United States: Cray XC40, Xeon E5-2698v3 16C 2.3 GHz, Aries interconnect, Cray Inc.; 301,056 cores; 8.1
History
Source: top500.org. Also see: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer
Supercomputers - History (processor; # processors/cores; year; Rmax in TFlops)
- Sunway TaihuLight: Sunway MPP, Sunway SW26010 260C 1.45 GHz; 10,649,600; 2016; 93,014
- Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P; 3,120,000; 2013; 33,862
- Titan: Cray XK7, 16-core Opteron 2.2 GHz, NVIDIA K20X; 560,640; 2012; 17,600
- K Computer, Japan: SPARC64 VIIIfx 2.0 GHz; 705,024; 2011; 10,510
- Tianhe-1A, China: Intel EM64T Xeon X56xx (Westmere-EP) 2930 MHz (11.72 GFlops) + NVIDIA GPU + FT-1000 8C; 186,368; 2010; 2,566
- Jaguar, Cray: Cray XT5-HE, six-core Opteron 2.6 GHz; 224,162; 2009; 1,759
- Roadrunner, IBM: PowerXCell 8i 3200 MHz (12.8 GFlops); 122,400; 2008; 1,026
- BlueGene/L, eServer Blue Gene Solution, IBM: PowerPC 440 700 MHz (2.8 GFlops); 212,992; 2007; 478
- BlueGene/L, eServer Blue Gene Solution, IBM: PowerPC 440 700 MHz (2.8 GFlops); 131,072; 2005; 280
- BlueGene/L beta-system, IBM: PowerPC 440 700 MHz (2.8 GFlops); 32,768; 2004; 70.7
- Earth Simulator, NEC: NEC 1000 MHz (8 GFlops); 5,120; 2002; 35.8
- IBM ASCI White, SP: POWER3 375 MHz (1.5 GFlops); 8,192; 2001; 7.2
- IBM ASCI White, SP: POWER3 375 MHz (1.5 GFlops); 8,192; 2000; 4.9
- Intel ASCI Red: Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops); 9,632; 1999; 2.4
Historical Analysis
[Chart: performance (TeraFLOPS to PetaFLOPS) over time, showing the succession of architectural eras]
- Vector machines
- Massively parallel processors (1993-, the HPCC era)
- MPPs with multicores and heterogeneous accelerators, first discrete then integrated (2008-2011), as Moore's Law in clocking ended
- Tons of lightweight cores (2016)
DARPA High-Productivity Computing Systems (HPCS)
- Launched in 2002, targeting next-generation supercomputers by 2010
- Not only performance, but productivity, where Productivity = f(execution time, development time)
- More generally, Productivity = utility/cost
- Addresses everything, hardware and software
HPCS Structure
- Each team is led by a company and includes university research groups
- Three phases:
  - Phase I, research concepts: SGI, HP, Cray, IBM, and Sun
  - Phase II, R&D: Cray, IBM, Sun
  - Phase III, deployment: Cray, IBM
- GWU worked with SGI in Phase I and IBM in Phase II
IBM, Sun & Cray's Efforts on HPCS (vendor: project; hardware architecture; language)
- IBM: PERCS; PowerPC; X10
- Sun: Hero; Rock, a multi-core SPARC; Fortress
- Cray: Cascade; Chapel
HPCS at IBM, Sun & Cray
- IBM PERCS (Productive, Easy-to-use, Reliable Computing System): Power architecture
- Sun Hero: multi-core Rock SPARC
- Cray Cascade
What is New in HPCS
- Architecture: lots of parallelism on the chip; intelligent and transactional memory; innovative co-processing (streaming, PIM, computations migrating to data instead of data moving to computations)
- Programming: PGAS programming models; parallel MATLAB and other simple interfaces; multiple types of parallelism and locality; transactions
- Reliability: self-healing
- More proprietary features
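PGAS (Partitioned Global Address Space) models expose a single global address space in which every datum has affinity to one thread, so the programmer can see which accesses are local and which are remote. A toy Python model of the idea (not a real PGAS runtime such as UPC; the block distribution and the `ToyPGASArray` name are illustrative assumptions):

```python
class ToyPGASArray:
    """Toy model of a PGAS array: one global index space, block-distributed
    so each element has affinity to exactly one 'thread' (place)."""

    def __init__(self, n, n_threads):
        self.n, self.n_threads = n, n_threads
        self.block = -(-n // n_threads)  # ceiling division: block size per thread
        self.data = [0] * n              # one logical global array

    def owner(self, i):
        # Block distribution: thread t owns indices [t*block, (t+1)*block).
        return i // self.block

    def get(self, i, my_thread):
        # Any thread can read any index, but locality is visible.
        kind = "local" if self.owner(i) == my_thread else "remote"
        return self.data[i], kind

a = ToyPGASArray(100, 4)             # 100 elements over 4 threads, 25 each
print(a.owner(10), a.owner(30))      # 0 1
print(a.get(30, 0)[1])               # remote: thread 0 reading thread 1's partition
```

Real PGAS languages let the compiler and runtime exploit this affinity information, keeping most accesses local while still allowing one-sided access to remote data.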
What is Next: Exascale and DOE
The DOE Exascale Computing Project goals:
- Deliver 50x the performance of today's systems (20 PF)
- Operate within a 20-30 MW power budget
- Be sufficiently resilient (MTTI > 1 week)
- Provide a software stack supporting a wide range of applications
[Figure: growth of supercomputing capability; modified from singularity.com]
Source: https://energy.gov/sites/prod/files/2013/09/f2/20130913-SEAB-DOE-Exascale-Initiative.pdf
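Taken together, the first two goals pin down a power-efficiency target. The arithmetic below assumes the low end of the 20-30 MW envelope:

```python
target_flops = 50 * 20e15    # 50x today's 20 PF systems = 1 exaFLOPS
power_budget_w = 20e6        # low end of the 20-30 MW envelope

# Required system-wide efficiency in GFLOPS per watt.
gflops_per_watt = target_flops / power_budget_w / 1e9
print(gflops_per_watt)       # 50.0
```

For scale, the Xeon Phi 7290 (KNL) in the earlier comparison table delivers about 14.3 DP GFLOPS/W, so an exascale system at 20 MW needs roughly 3.5x better efficiency than that chip alone, before counting interconnect and memory power.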
Technical Challenges on The Road to Exascale Bill Dally, Technical Challenges on the Road to Exascale http://developer.download.nvidia.com/gtc/pdf/gtc2012/presentationpdf/billdally_nvidia_sc12.pdf 29
Technical Challenges on the Road to Exascale: The High Cost of Data Movement
Fetching operands costs more than computing on them.
[Chart: picojoules per 64-bit operation for a DP FLOP, a register access, 1 mm / 5 mm / 15 mm on-chip wires, off-chip/DRAM, local interconnect, and cross-system communication, comparing 2008 (45 nm) with projected 2018 (11 nm)]
Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. Courtesy: John Shalf, PGAS 2015
Three pre-exascale supercomputers under the DOE CORAL initiative
- Summit: Oak Ridge; delivery 2017-18; IBM; nodes with multiple IBM POWER9 CPUs and multiple NVIDIA Volta GPUs; 40+ TFLOPS per node; Mellanox dual-rail EDR InfiniBand; Rpeak 150 PFLOPS; ~3,400 nodes; ~10 MW
- Sierra: Livermore; delivery 2017-18; IBM; same POWER9 + Volta node architecture; Mellanox dual-rail EDR InfiniBand; Rpeak 120-150 PFLOPS; ~10 MW
- Aurora: Argonne; delivery 2018-19; Cray; Intel Knights Hill many-core CPUs; 3+ TFLOPS per node; Intel Omni-Path; Rpeak 180 PFLOPS; 50,000+ nodes; ~13 MW
- Contract budgets: $325M for Summit and Sierra combined; $200M for Aurora
Aurora Highlights
Available data:
- Cray Shasta compute platform
- Intel Knights Hill many-core CPUs (3rd-generation many-core, 10 nm node)
- 3+ TFLOPS per node; 50,000+ nodes; 180 PFLOPS; 13 MW
- Intel Omni-Path (2nd generation) with silicon photonics; 500+ TB/s bisection bandwidth; 2.5+ PB/s aggregate node link bandwidth
Prediction for the next generation:
- 1 processor per node: one 100-core CPU capable of 4.5 TFLOPS peak, or 3+ TFLOPS sustained
- Dual Omni-Path, 400 Gb/s aggregate bandwidth per node
- 50,000 nodes; 4 nodes per blade (12,500 blades); 16 blades per chassis (~782 chassis); 6 chassis per group (~130 groups)
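The packaging figures above are mutually consistent, as a quick check shows: the per-node peak and the blade/chassis/group counts all follow from the stated system numbers (the group count comes out near, not exactly, 130):

```python
import math

nodes = 50_000
rpeak = 180e15                    # 180 PFLOPS system peak

# Per-node peak: this is where the "3+ TFLOPS per node" figure comes from.
print(rpeak / nodes / 1e12)       # 3.6 TFLOPS per node

blades = nodes / 4                # 4 nodes per blade  -> 12,500 blades
chassis = math.ceil(blades / 16)  # 16 blades/chassis  -> 782 chassis
groups = chassis / 6              # 6 chassis per group -> ~130 groups
print(blades, chassis, round(groups, 1))
```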
GWU HPCL Facility 33
Historical Highlights of the Facility
- ~50 tons of cooling, 2,000 sq ft of elevated floor, 0.25 MW of power
- Small experimental parallel systems representing a wide spectrum of architectural ideas
- Systems with GPU accelerators from Cray and ACI
- A system with Intel Phi accelerators from ACI
- Systems with FPGA accelerators from SRC, SGI, Cray and Starbridge
- Homegrown clusters with InfiniBand and Myrinet
- Many experimental boards and workstations from Xilinx, Intel, ...
GW Cray XE6m/XK7m
- 1,856 processor cores
- Based on 12-core 64-bit AMD Opteron 6100 Series processors and 16-core AMD Bulldozer processors
- 32 NVIDIA K20 GPUs
- 64 GB registered ECC DDR3 SDRAM per compute node
- 1 Gemini routing and communications ASIC per two compute nodes
Conclusions
- HPC is critical for economic competitiveness at all levels, and it is turning into an international race!
- Advances in HPC today are the advances in conventional computing tomorrow
- HPC is ubiquitous, as all computing turns into HPC
- Multicore and heterogeneous accelerator architectures are attracting rising attention but lack software infrastructure and hardware support; they will require new programming models and OS support, an opportunity for leadership in research
Light Reading
- http://spectrum.ieee.org/computing/hardware/ibm-reclaims-supercomputer-lead, 2005
- http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer, 2010
- http://spectrum.ieee.org/computing/hardware/chinas-homegrown-supercomputers, 2012