HPC as a Driver for Computing Technology and Education
Tarek El-Ghazawi
The George Washington University, Washington D.C., USA
NOW - July 2015: The TOP 10 Systems
Rank | Site | Computer | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1 | National Super Computer Center in Guangzhou, China | Tianhe-2 NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS Oak Ridge Nat Lab, USA | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA L Livermore Nat Lab, USA | Sequoia, BlueGene/Q (16c) + Custom | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci, Japan | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS Argonne Nat Lab, USA | Mira, BlueGene/Q (16c) + Custom | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | KAUST, Saudi | Shaheen II, Cray XC30, Xeon 16C + Custom | 196,608 | 5.54 | 77 | 4.5 | 1146
8 | TACC, USA | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | 204,900 | 5.17 | 61 | 4.5 | 1489
9 | Forschungszentrum Juelich (FZJ), Germany | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | 458,752 | 5.01 | 85 | 2.30 | 2178
10 | DOE / NNSA LLNL, USA | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | 393,216 | 4.29 | 85 | 1.97 | 2177
500 (422) | Software Comp, USA | HP Cluster | 18,896 | .309 | 48 | |
HPC is a Top National Priority!
Executive Order from the White House: Establishment of a National Strategic Computing Initiative (NSCI), 29 July 2015
National Strategic Computing Initiative
Five strategic themes of the NSCI:
1) Create systems that can apply exaflops of computing power to exabytes of data
2) Keep the United States at the forefront of HPC capabilities
3) Improve HPC application developer productivity
4) Make HPC readily available
5) Establish hardware technology for future HPC systems
Future/Investments - International Exascale HPC Programs
Country | Funding | Year(s) | Remarks
European Union | 700M | 2014-20 | Private-Public Partnership commitment through European Tech Platform for HPC (ETP4HPC); 143.4M in 2014-15
European Union | 74M | 2011-6 | Dedicated FP7 Exascale projects
India | $2B | 2014-20 | Led by IISc (Indian Institute of Science) and ISRO (Indian Space Research Organization). Targeting a 132 ExaFLOP/s machine
India | $750M | 2014-19 | C-DAC (Center for Development of Advanced Computing) to set up 70 supercomputers over 5 years
Japan | $1.38B | 2013-20 | Post-K computer to be installed at RIKEN; tentatively based on Extreme SIMD chip PACS-G
China | - | | Due to U.S./DoC ban, will use Chinese parts to upgrade current #1 system
Why is HPC Important?
- Critical for economic competitiveness (highlighted by Minister Daoudi) because of its wide applications (through simulations and intensive data analyses)
- Drives computer hardware and software innovations for future conventional computing
- Is becoming ubiquitous, i.e. all computing/information technology is turning parallel!
- Is that why it is turning into an international HPC muscle-flexing contest?
Why is HPC Important? (1) Competitiveness
[Figure: two product development cycles. Without HPC: Design, Build, Test. With HPC: Design, Model, Simulate, Build]
Why is HPC Important? Competitiveness - HPC Application Examples
- Molecular Dynamics (HIV-1 Protease Inhibitor Drug): simulation for 2 ns takes 2 weeks on a desktop, 6 hours on a supercomputer
- Gene Sequence Alignment (Phylogenetic Analysis): 32 days on a desktop, 1.5 hours on a supercomputer
- Car Crash Simulations (2 million elements simulation): 4 days on a desktop, 25 minutes on a supercomputer
- Understanding Fundamental Structure of Matter: requires a billion-billion calculations per second
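The desktop and supercomputer run times quoted on this slide can be turned into speedup factors. The following is a minimal sketch, assuming "2 weeks" means 14 days of continuous running; the dictionary and function names are illustrative and not from the talk.

```python
# Run times quoted on the slide, converted to hours.
# Assumption: "2 weeks" is 14 days of continuous computation.
HOURS_PER_DAY = 24

run_times_hours = {
    "Molecular dynamics (2 ns)": (14 * HOURS_PER_DAY, 6.0),
    "Phylogenetic analysis": (32 * HOURS_PER_DAY, 1.5),
    "Car crash (2M elements)": (4 * HOURS_PER_DAY, 25 / 60),
}


def speedup(desktop_hours, supercomputer_hours):
    """Ratio of desktop run time to supercomputer run time."""
    return desktop_hours / supercomputer_hours


speedups = {
    name: speedup(desktop, hpc)
    for name, (desktop, hpc) in run_times_hours.items()
}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster on the supercomputer")
```

Under these assumptions the three examples work out to roughly 56x, 512x and 230x.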
Why is HPC Important? (2) HPC of Today is Conventional Computing for Tomorrow
- The ASCI Red Supercomputer: 9000 chips for 3 TeraFLOPs in 1997
- Intel 80 Core Chip: 1 chip and 1 TeraFLOPs in 2007
Why is HPC Important? (3) HPC Concepts are Becoming Ubiquitous
- Sony PS3: uses the Cell processors!
- Samsung S6: 8 cores
- The Road Runner: was the fastest supercomputer in 08; uses Cell processors!
- Tile64: a 64 CPU chip; can be in your future laptop!
HPC is ubiquitous! All computing is becoming HPC. Can we become bystanders?
How Did we Get Here - Supercomputers in Recent History
Computer | Processor | # Pr. | Year | Rmax (TFlops)
Tianhe-2 (MilkyWay-2) | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3120000 | 2013-till now | 33,862
Titan | Cray XK7, Opteron 16 Cores, 2.2GHz, Nvidia K20X | 560640 | 2012 | 17,600
K-Computer, Japan | SPARC64 VIIIfx 2.0GHz | 705024 | 2011 | 10,510
Tianhe-1A, China | Intel EM64T Xeon X56xx (Westmere-EP) 2930 MHz (11.72 Gflops) + NVIDIA GPU, FT-1000 8C | 186368 | 2010 | 2,566
Jaguar, Cray | Cray XT5-HE Opteron Six Core 2.6 GHz | 224162 | 2009 | 1,759
Roadrunner, IBM | PowerXCell 8i 3200 MHz (12.8 GFlops) | 122400 | 2008 | 1,026
BlueGene/L - eserver Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 212992 | 2007 | 478
BlueGene/L - eserver Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 131072 | 2005 | 280
BlueGene/L beta-system, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 32768 | 2004 | 70.7
Earth-Simulator / NEC | NEC 1000 MHz (8 GFlops) | 5120 | 2002 | 35.8
IBM ASCI White, SP | POWER3 375 MHz (1.5 GFlops) | 8192 | 2001 | 7.2
IBM ASCI White, SP | POWER3 375 MHz (1.5 GFlops) | 8192 | 2000 | 4.9
Intel ASCI Red | Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops) | 9632 | 1999 | 2.4
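The first and last entries of this history (2.4 TFlops for Intel ASCI Red in 1999, 33,862 TFlops for Tianhe-2 in 2013) give a rough feel for the growth rate of the top system. This is a back-of-the-envelope sketch that assumes smooth exponential growth between those two points, which the table itself does not claim; the variable names are illustrative.

```python
import math

# Rmax of the top system at the two ends of the table (TFlops).
rmax_1999_tflops = 2.4      # Intel ASCI Red
rmax_2013_tflops = 33862.0  # Tianhe-2
years = 2013 - 1999

# Overall growth and the implied doubling time, assuming smooth
# exponential growth between the two data points.
growth = rmax_2013_tflops / rmax_1999_tflops
doublings = math.log2(growth)
doubling_time_years = years / doublings

print(f"Growth over {years} years: about {growth:,.0f}x")
print(f"Implied doubling time: about {doubling_time_years:.2f} years")
```

With these two data points the top-system Rmax grows by about four orders of magnitude, which corresponds to a doubling roughly once a year.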
How Did we Get Here - Supercomputers in recent History
See: http://spectrum.ieee.org/tech-talk/computing/hardware/chinabuilds-worlds-fastest-supercomputer
How Did we Get Here - Supercomputers in Recent History
[Figure: performance (TeraFLOPS to PetaFLOPS) vs. time. Eras shown: Vector Machines; Massively Parallel Processors; MPPs with Multicores and Heterogeneous Accelerators (Discrete, then Integrated). Time markers: 1993 - HPCC; 2008-2011 - End of Moore's Law in Clocking!]
How to Make Progress
- Launch a competitive funding cycle or a large national project
- Pose a system challenge
  - ~33.8 PFLOPS / 17.8 MW provides about 2 GF/Watt
  - To get to Exascale using the same total power we need about 56 GF/Watt (1 EFLOPS / 17.8 MW), roughly a 30x improvement
- Pose an application challenge(s)
- Let the community compete for government funding with innovative ideas
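The energy-efficiency arithmetic behind the system challenge can be worked through directly from the figures on this slide. This is a small sketch, assuming "Exascale" means 10^18 floating-point operations per second delivered within the same 17.8 MW power envelope; the variable names are illustrative.

```python
# Figures quoted on the slide for the current #1 system.
rmax_flops = 33.8e15   # 33.8 PFLOPS
power_watts = 17.8e6   # 17.8 MW

# Assumption: "Exascale" means 1e18 FLOP/s at the same total power.
exascale_flops = 1e18

current_gf_per_watt = rmax_flops / power_watts / 1e9
required_gf_per_watt = exascale_flops / power_watts / 1e9
improvement_factor = required_gf_per_watt / current_gf_per_watt

print(f"Current efficiency:  {current_gf_per_watt:.2f} GF/Watt")
print(f"Required efficiency: {required_gf_per_watt:.2f} GF/Watt")
print(f"Improvement needed:  {improvement_factor:.1f}x")
```

Under these assumptions the current system delivers about 1.9 GF/Watt, and an exaflop in the same power budget needs about 56 GF/Watt, close to a 30x improvement.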
Challenges - The End of Moore's Law
The phenomenon of exponential improvements in processors was observed in 1965 by Intel co-founder Gordon Moore. Three common readings of it:
- The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same: Wrong, not anymore!
- The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity: Ok, for now
- The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same: Ok, for now
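To see what "doubles every 18-24 months" implies, the compounding can be sketched as follows. This is an illustration only: the ten-year horizon is an arbitrary choice and the function name is hypothetical.

```python
def growth_factor(years, doubling_months):
    """Growth after `years` if a quantity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


# Arbitrary ten-year horizon, at both ends of the 18-24 month range.
slow = growth_factor(10, 24)   # doubling every 24 months
fast = growth_factor(10, 18)   # doubling every 18 months

print(f"10 years, doubling every 24 months: {slow:.0f}x")
print(f"10 years, doubling every 18 months: {fast:.0f}x")
```

Over a decade the two ends of the range differ by about a factor of three (32x versus roughly 100x), which is why the exact doubling period matters for long-term projections.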
No faster clocking but more Cores?
Source: Ed Davis, Intel
Accelerators and Dealing with the Moore's Law Challenge Through Parallelism
Device | Fab. Process (nm) | Freq (GHz) | # Cores | Peak SPFP (GFlops) | Peak DPFP (GFlops) | Peak Power (W) | DP GFlops/W | Memory BW (GB/s) | Memory type
PowerXCell 8i | 65 | 3.2 | 1 + 8 | 204 | 102.4 | 92 | 1.11 | 25.6 | XDR
Nvidia Kepler K40 | 28 | 0.75 | 2880 | 4290 | 1430 | 235 | 6.1 | 288 | GDDR5
Intel Xeon Phi 7120P | 22 | 1.24 | 61 (244 threads) | 2417 | 1208 | 300 | 4.0 | 352 | GDDR5
Intel Xeon 12-core 2.7 GHz E5-2697v2 | 22 | 2.7 | 12 | 518.4 | 259.2 | 130 | 1.99 | 59.7 | DDR3-1866
AMD Opteron 6370P Interlagos | 32 | 2.5 | 16 | 320 | 160 | 99 | 1.62 | 42.7 | DDR3-1333
Xilinx XC7VX1140T | 28 | - | - | 801 | 241 | 43 | 5.6 | - | -
Xilinx XCUV440 | 20 | - | - | 1306 | 402 | 80* | 5.0* | |
Altera Stratix V GSB8 | 28 | - | - | 604 | 296 | 59 | 5.0 | - | -
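The efficiency column in this table is the peak double-precision rate divided by the peak power. The sketch below recomputes it from the DPFP GFlops and peak power values listed on the slide; the dictionary layout and names are illustrative, and the starred Xilinx XCUV440 figures are used as given.

```python
# (peak DP GFlops, peak power in W) as listed on the slide.
devices = {
    "PowerXCell 8i": (102.4, 92),
    "Nvidia Kepler K40": (1430, 235),
    "Intel Xeon Phi 7120P": (1208, 300),
    "Intel Xeon E5-2697v2": (259.2, 130),
    "AMD Opteron 6370P": (160, 99),
    "Xilinx XC7VX1140T": (241, 43),
    "Xilinx XCUV440": (402, 80),
    "Altera Stratix V GSB8": (296, 59),
}

# Energy efficiency in DP GFlops per watt.
efficiency = {name: gflops / watts for name, (gflops, watts) in devices.items()}

# Print the devices from most to least efficient.
for name in sorted(efficiency, key=efficiency.get, reverse=True):
    print(f"{name:24s} {efficiency[name]:5.2f} DP GFlops/W")

most_efficient = max(efficiency, key=efficiency.get)
```

The recomputed values agree with the table to the precision shown there, with the GPU and the FPGAs well ahead of the conventional multicore processors.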
Accelerators/Heterogeneous Computing
Accelerators attached to the microprocessor: FPGAs, Cell, GPUs, Phi
Application | Speedup | Cost Savings | Power Savings | Size Savings
DNA Match | 8723 | 22x | 779x | 253x
DES Breaker | 38514 | 96x | 3439x | 1116x
Source: El-Ghazawi et al., The Promise of HPRCs. IEEE Computer, February 2008
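A General Execution Model for Heterogeneous Computers
[Figure: the microprocessor (µP) in the PC transfers control and input data to an accelerator (GPU, CELL B.E., FPGA, Clearspeed, Intel Xeon Phi); the accelerator sends output data back and transfers control back to the microprocessor]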
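The execution model on this slide (hand over control and input data, compute on the accelerator, return output data and control) can be captured in a simple timing model. This is a sketch under stated assumptions: transfers and computation do not overlap, the link has a fixed bandwidth, and all numbers and function names are hypothetical.

```python
def offload_time(cpu_time_s, kernel_speedup, bytes_moved,
                 link_bytes_per_s, control_overhead_s=0.0):
    """Time to run a kernel on an accelerator under a simple model.

    Assumes input and output transfers are not overlapped with compute.
    """
    transfer_s = bytes_moved / link_bytes_per_s
    compute_s = cpu_time_s / kernel_speedup
    return control_overhead_s + transfer_s + compute_s


def offload_pays_off(cpu_time_s, kernel_speedup, bytes_moved,
                     link_bytes_per_s, control_overhead_s=0.0):
    """True when offloading beats running the kernel on the CPU."""
    accel_s = offload_time(cpu_time_s, kernel_speedup, bytes_moved,
                           link_bytes_per_s, control_overhead_s)
    return accel_s < cpu_time_s


# Hypothetical kernel: 1 s on the CPU, 20x faster on the accelerator,
# over a link that moves 8e9 bytes per second.
heavy = offload_pays_off(1.0, 20, bytes_moved=8e9, link_bytes_per_s=8e9)
light = offload_pays_off(1.0, 20, bytes_moved=0.8e9, link_bytes_per_s=8e9)
print("Moving 8 GB:  ", "worth it" if heavy else "not worth it")
print("Moving 0.8 GB:", "worth it" if light else "not worth it")
```

In this toy model the same 20x kernel speedup is wiped out when the data transfer alone takes as long as the CPU run, and pays off once the transferred volume is small relative to the computation.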
Challenges for Accelerators
1. Application must lend itself to the 90-10 rule, and different accelerators suit different types of computations
2. Programmer partitions the code across the CPU and accelerator
3. Programmer co-schedules CPU and accelerator, and ensures good utilization of the expensive accelerator resources
4. Programmer explicitly transfers data between CPU and accelerator
5. Accelerators are fast compared to the link, and the transfer overhead can render the use of the accelerator useless or harmful
6. Multiple programming paradigms are needed
7. New accelerator means learning/porting to a new programming interface
8. Changing the ratio of CPUs to accelerators also requires substantial programming unless accelerators are virtualized
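The 90-10 rule in the first challenge can be related to Amdahl's law, which is not named on the slide but is the standard way to bound the overall gain when only part of a program is accelerated. The sketch assumes that 90% of the run time sits in the code that moves to the accelerator; the function name is illustrative.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only a fraction is sped up."""
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / kernel_speedup)


# Assumption: 90% of the run time is in the part moved to the accelerator.
for kernel_speedup in (10, 100, 1000):
    total = overall_speedup(0.9, kernel_speedup)
    print(f"kernel {kernel_speedup:5d}x faster -> program {total:.2f}x faster")
```

However fast the accelerator is, the remaining 10% of the run time on the CPU caps the overall speedup below 10x, which is why the application has to fit the rule in the first place.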
Challenges for Advancing or for Exascale
DoE ASCAC Subcommittee Report, Feb 2014
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity
Legend: data movement and/or programming related
Exascale Technological Challenges
- The Power Wall: frequency scaling is no longer possible, power increases rapidly
- The Memory Wall: gap between processor speed and memory speed is widening
- The Interconnect Wall: available bandwidth per compute operation is dropping; power needed for data movement is increasing
- Programmability Wall, Resilience Wall, ...
The Data Movement Challenge
[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]
- Locality matters a lot: cost (energy and time) rapidly increases with distance
- Locality should be exploited at short distances, and is needed even more at far distances
Data Movement and the Hierarchical Locality Challenge
Locality is Not Flat Anymore: Chip and System
Locality is Not Flat in Extreme Scale: Chip and System
[Figure: Cray XC40]
Locality in Extreme Scale: Chip and System Perspectives
[Figures: Tile64 (chip); Cray XC40 (system)]
What Does that Mean for Programmers
- Exploiting hierarchical locality at machine level and chip level
- Hierarchical tiled data structures
- Hierarchical locality exploitation with RTS
- MPI+X
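One way to picture a hierarchical tiled data structure is a two-level traversal of a 2D index space: large outer tiles, each split into small inner tiles. The sketch below is illustrative only; mapping outer tiles to the machine level and inner tiles to the chip level is an assumption made for the example, and the function names are hypothetical.

```python
def tiles(extent, tile):
    """Yield (start, stop) ranges covering [0, extent) in chunks of `tile`."""
    for start in range(0, extent, tile):
        yield start, min(start + tile, extent)


def hierarchical_order(rows, cols, outer, inner):
    """Visit a rows x cols index space outer tile by outer tile,
    and inner tile by inner tile within each outer tile."""
    order = []
    for r0, r1 in tiles(rows, outer):            # outer tiles (e.g. per node)
        for c0, c1 in tiles(cols, outer):
            for ir0, ir1 in tiles(r1 - r0, inner):   # inner tiles (e.g. per core)
                for ic0, ic1 in tiles(c1 - c0, inner):
                    for r in range(r0 + ir0, r0 + ir1):
                        for c in range(c0 + ic0, c0 + ic1):
                            order.append((r, c))
    return order


order = hierarchical_order(8, 8, outer=4, inner=2)
print(order[:8])
```

Every element is visited exactly once, but neighbours within an inner tile are touched back to back, so the traversal keeps work close together at both levels of the hierarchy.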
General Implications
- Short-term programming challenge
- Golden opportunity for smart programmers
- New hardware advances are needed first, and they will influence software
- May be silicon based, may be nano technologies like carbon nano-tube transistors by IBM (9nm); may keep things the way they are from the software side for a while
General Implications - Longer Run
- Long-term hardware technology may move toward
  - Nano-photonics for computing
  - Quantum Computing
- Many of the new hardware computing innovations may show first as discrete accelerators, then as on-chip accelerators, then move closer to the processor internal circuitry (data path)
Longer term
- The bad news: as the limits of silicon are approached, we may see departures from conventional methods of computing, which may dramatically change the way we conceive software
- The good news: history has shown that good ideas from the past get resurrected in new ways
Conclusions
- Graduating an intelligent IT workforce can be a golden egg for countries like Morocco
- You can teach skills, but it is imperative to teach and stress concepts in the curriculum
  - Stress Parallelism
  - Stress Locality
  - See the recommendations by IEEE/NSF and SIAM for incorporating parallelism in Computer Science, Computer Engineering, and Computational Science and Engineering curricula, and add locality
- For the very long term: there is nothing better than having good foundations in Physics and Math, even for CS and CE majors
Conclusions cont.
- Integrate teaching soft skills, as President Ouaouicha said
  - Communications
  - Entrepreneurship and marketing, individually and in groups
  - Patenting and legal