Slide No 1 CSE5351: Parallel Processing, Part 1B
Slide No 2 State of the Art in Supercomputing
Several of the following slides (some in modified form) are courtesy of Dr. Jack Dongarra, Distinguished Professor of Computer Science at the University of Tennessee.
Slide No 3 Look at the Fastest Computers
Strategic importance of supercomputing:
- Essential for scientific discovery
- Critical for national security
- Fundamental contributor to the economy and competitiveness through use in engineering and manufacturing
- Supercomputers are the tool for solving the most challenging problems through simulations
Slide No 4 The TOP500 List
- Compiled by H. Meuer, H. Simon, E. Strohmaier, and J. Dongarra
- A listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK benchmark (solving Ax = b for a dense problem)
- Updated twice a year: at the SC conference in the States in November, and at the ISC meeting in Germany in June
- All data available from www.top500.org
Slide No 5 Performance Development of HPC over the Last 23 Years, from the Top500
Chart (1994-2015, log scale from 100 Mflop/s to 1 Eflop/s): the aggregate (SUM) performance grew from 1.17 TFlop/s to 362 PFlop/s; the #1 system (N=1) from 59.7 GFlop/s to 33.9 PFlop/s; the #500 system (N=500) from 400 MFlop/s to 166 TFlop/s. For comparison: my laptop is about 70 Gflop/s, my iPhone about 4 Gflop/s.
Slide No 6 State of Supercomputing in 2015
- Pflop/s computing is fully established, with 67 systems.
- Three technology architecture possibilities, or "swim lanes", are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator, e.g. GPUs (88 systems)
  - Special-purpose lightweight cores (e.g. IBM BG, Knights Landing)
- Interest in supercomputing is now worldwide and growing in many new markets (over 50% of Top500 computers are in industry).
- Exascale projects exist in many countries and regions.
- Intel processors hold the largest share (86%), followed by AMD (4%).
Slide No 7 July 2015: The TOP 10 Systems

| Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt |
|---|---|---|---|---|---|---|---|---|
| 1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905 |
| 2 | DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120 |
| 3 | DOE / NNSA Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063 |
| 4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827 |
| 5 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066 |
| 6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726 |
| 7 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146 |
| 8 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489 |
| 9 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178 |
| 10 | DOE / NNSA Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177 |
| 500 | (422) Software Comp | HP Cluster | USA | 18,896 | .309 | 48 | | |
Slide No 8 Seven Top500 Systems in Australia

| Rank | Name | Computer | Site | Manufacturer | Total Cores | Rmax [Gflops] | Rpeak [Gflops] |
|---|---|---|---|---|---|---|---|
| 15 | C01N | SuperBlade SBI-7127RG-E/SGI ICE X, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, Intel Xeon Phi 7120P/NVIDIA M2090 | Tulip Trading | Supermicro/SGI | 265,440 | 3,521,000 | 4,470,408 |
| 57 | Magnus | Cray XC40, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect | Pawsey SC Centre, WA | Cray Inc. | 35,712 | 1,097,558 | 1,485,619 |
| 71 | | Fujitsu PRIMERGY CX250 S1, Xeon E5-2670 8C 2.600GHz, Infiniband FDR | ANU | Fujitsu | 53,504 | 978,600 | 1,112,883 |
| 100 | Avoca | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | Victorian Life Sci Comp Initiative | IBM | 65,536 | 715,551 | 838,861 |
| 199 | CSIRO GPU Cluster | Nitro G16 3GPU, Xeon E5-2650 8C 2GHz, Infiniband FDR, Nvidia K20m | CSIRO | Xenon Systems | 6,875 | 335,300 | 472,497.5 |
| 406 | Sukuriputo Okane | SGI ICE X, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, NVIDIA 2090 | C01N | SGI | 46,400 | 192,370 | 1,305,928 |
| 408 | Galaxy | Cray XC30, Intel Xeon E5-2692v2 10C 3.000GHz, Aries interconnect | Pawsey SC Centre, WA | Cray Inc. | 9,440 | 192,100 | 226,560 |
Slide No 9 Accelerators (53 systems)
Chart (2006-2013) of accelerator counts in the Top500: Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2), IBM PowerXCell 8i (0), NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16).
By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 1 Italy, 1 Poland, 1 Australia, 2 Brazil, 1 Saudi Arabia, 1 South Korea, 1 Spain, 2 Switzerland, 1 UK.
Slide No 10 Processors / Systems
Pie chart of processor families across Top500 systems: Intel SandyBridge (55%), Intel Nehalem (23%), AMD x86_64 (10%), PowerPC (4%), Power (4%), Intel Core (2%), Sparc (1%), Others (1%).
Slide No 11 Vendors / System Share
HP: 196 systems (39%); IBM: 164 (33%); Cray Inc.: 48 (9%); SGI: 17 (3%); Bull: 14 (3%); Fujitsu: 8 (2%); Dell: 8 (2%); NUDT: 4 (1%); Hitachi: 4 (1%); NEC: 4 (1%); Others: 33 (6%).
Slide No 12 Countries Share
Absolute counts: US: 267; China: 63; Japan: 28; UK: 23; France: 22; Germany: —
Slide No 13 Performance Development in Top500
Chart (1994-2020, log scale from 100 Mflop/s to 1 Eflop/s): the historical SUM, N=1, and N=500 curves extrapolated forward, with the trend reaching 1 Eflop/s around 2020.
Slide No 14 Today's #1 System

| Systems | 2014-15: Tianhe-2 | 2020-2022 | Difference Today & Exa |
|---|---|---|---|
| System peak | 55 Pflop/s | 1 Eflop/s | ~20x |
| Power | 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1) ~15x |
| System memory | 1.4 PB (1.024 PB CPU + .384 PB CoP) | 32-64 PB | ~50x |
| Node performance | 3.43 TF/s (.4 CPU + 3 CoP) | 1.2 or 15 TF/s | O(1) |
| Node concurrency | 24 cores CPU + 171 cores CoP | O(1k) or 10k | ~5x - ~50x |
| Node interconnect BW | 6.36 GB/s | 200-400 GB/s | ~40x |
| System size (nodes) | 16,000 | O(100,000) or O(1M) | ~6x - ~60x |
| Total concurrency | 3.12M (12.48M threads, 4/core) | O(billion) | ~100x |
| MTTF | Few / day | O(<1 day) | O(?) |
Slide No 15 Exascale System Architecture, with a cap of $200M and 20 MW

| Systems | 2014-15: Tianhe-2 | 2020-2022 | Difference Today & Exa |
|---|---|---|---|
| System peak | 55 Pflop/s | 1 Eflop/s | ~20x |
| Power | 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1) ~15x |
| System memory | 1.4 PB (1.024 PB CPU + .384 PB CoP) | 32-64 PB | ~50x |
| Node performance | 3.43 TF/s (.4 CPU + 3 CoP) | 1.2 or 15 TF/s | O(1) |
| Node concurrency | 24 cores CPU + 171 cores CoP | O(1k) or 10k | ~5x - ~50x |
| Node interconnect BW | 6.36 GB/s | 200-400 GB/s | ~40x |
| System size (nodes) | 16,000 | O(100,000) or O(1M) | ~6x - ~60x |
| Total concurrency | 3.12M (12.48M threads, 4/core) | O(billion) | ~100x |
| MTTF | Few / day | Many / day | O(?) |
Slide No 16 ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
Footprint: 4,352 ft² (404 m²)
SYSTEM SPECIFICATIONS:
- Peak performance of 27 PF (24.5 Pflop/s GPU + 2.6 Pflop/s AMD)
- 18,688 compute nodes, each with: a 16-core AMD Opteron CPU, an NVIDIA Tesla K20x GPU, and 32 + 6 GB memory
- 512 service and I/O nodes
- 200 cabinets
- 710 TB total system memory
- Cray Gemini 3D torus interconnect
- 9 MW peak power
Slide No 17 Summary
Major challenges are ahead for extreme computing:
- Parallelism
- Hybrid architectures
- Fault tolerance
- Power
- and many others not discussed here
We will need completely new approaches and technologies to reach the exascale level.
Slide No 19 To be published in the January 2011 issue of The International Journal of High Performance Computing Applications. www.exascale.org
"We can only see a short distance ahead, but we can see plenty there that needs to be done." (Alan Turing, 1912-1954)
Slide No 20 Technology Trends: Microprocessor Capacity
Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months.
- 2x transistors/chip every 1.5 years: called "Moore's Law"
- Microprocessors have become smaller, denser, and more powerful.
- Not just processors: bandwidth, storage, etc. also improve. Roughly 2x memory and processor speed, and half the size, cost, and power, every 18 months.
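The doubling rule above can be turned into a quick projection. This is a minimal sketch of the slide's stated law (doubling every 18 months); the function name and starting values are illustrative, not from the slide.

```python
def moore_projection(start_count: float, years: float,
                     doubling_months: float = 18.0) -> float:
    """Project device count after `years`, doubling every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return start_count * 2.0 ** doublings

# Doubling every 18 months means one doubling per 1.5 years...
assert moore_projection(1000, 1.5) == 2000.0
# ...and roughly a 100x increase per decade (2^6.67 ~ 102).
print(round(moore_projection(1.0, 10.0)))  # ~102
```

This is why the log-scale transistor charts on the following slides appear as straight lines.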
Slide No 21 Moore's Law is Alive and Well
Chart (1970-2010, log scale): transistor counts (in thousands) continue to grow exponentially.
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.
Slide No 22 But Clock Frequency Scaling Has Been Replaced by Scaling Cores / Chip
Chart (1970-2010, log scale): transistor counts (in thousands) keep rising, but clock frequency (MHz) has flattened. Fifteen years of exponential frequency growth (~2x per year) has ended; core counts per chip now grow instead.
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.
Slide No 23 Performance Has Also Slowed, Along with Power
Chart (1970-2010, log scale): transistors (in thousands), frequency (MHz), power (W), and core counts. Power is the root cause of all this; a hardware issue just became a software problem.
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.
Slide No 24 Power Cost of Frequency
Power ≈ Voltage² × Frequency (V²F)
Frequency ∝ Voltage
Therefore: Power ∝ Frequency³
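The cubic relationship is the whole argument for multicore, and a tiny model makes it concrete. This is a sketch of the slide's power model only (dynamic power ~ V²F with voltage scaling with frequency); the constants are illustrative, not measured.

```python
def relative_power(freq_ratio: float) -> float:
    """Power relative to baseline when frequency scales by freq_ratio.

    Power ~ V^2 * F and V ~ F, so power ~ F^3 (the slide's model).
    """
    return freq_ratio ** 3

# One core at full frequency: relative throughput 1.0, relative power 1.0.
# Two cores at half frequency: same aggregate throughput (2 * 0.5),
# but power is 2 * (0.5)^3 = 0.25.
print(relative_power(0.5) * 2)  # 0.25
```

Same work, a quarter of the power: this is why scaling cores per chip replaced scaling the clock.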
Slide No 25 (repeats Slide No 24)
Slide No 26 Looking at the Gordon Bell Prize
(Recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing.)
- 1 GFlop/s; 1988; Cray Y-MP; 8 processors. Static finite element analysis.
- 1 TFlop/s; 1998; Cray T3E; 1,024 processors. Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method.
- 1 PFlop/s; 2008; Cray XT5; 1.5×10⁵ processors. Superconductive materials.
- 1 EFlop/s; ~2018; ?; 1×10⁷ processors (10⁹ threads).
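The milestones above land a factor of 1000 apart, once per decade. A quick sanity check of the implied annual growth rate, as a sketch (the computation is mine, only the milestone dates come from the slide):

```python
# 1 Gflop/s (1988) -> 1 Tflop/s (1998) -> 1 Pflop/s (2008) is 1000x per
# decade. The equivalent per-year growth factor:
annual_growth = 1000 ** (1 / 10)
print(round(annual_growth, 2))  # 2.0
```

Roughly a doubling every year, which is consistent with the combined effect of Moore's Law and growing parallelism shown on the earlier Top500 trend slides.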
Slide No 28 Hardware and System Software Scalability
Barriers:
- Fundamental assumptions of system software architecture did not anticipate exponential growth in parallelism.
- The number of components and the MTBF change the game.
Technical focus areas:
- System hardware scalability
- System software scalability
- Applications scalability
Technical gap:
- 1000x improvement in system software scaling
- 100x improvement in system software reliability
Chart: average number of cores per supercomputer (Top20 of the Top500), rising toward 100,000.
Slide No 31 Commodity plus Accelerator (today, 88 of the Top500 systems)
Commodity CPU: Intel Xeon, 8 cores at 3 GHz, 8 cores × 4 ops/cycle = 96 Gflop/s (DP)
Accelerator (GPU): Nvidia K20X Kepler, 2688 CUDA cores (192 CUDA cores/SMX) at 0.732 GHz, 2688 × 2/3 ops/cycle = 1.31 Tflop/s (DP), 6 GB memory
Interconnect: PCI-e Gen2/3, 16 lanes, 64 Gb/s (8 GB/s, about 1 GW/s)
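The peak rates quoted above are simple products. A minimal sketch of that arithmetic, using only the figures on this slide (cores × clock × ops per cycle):

```python
def peak_gflops(cores: int, ghz: float, ops_per_cycle: float) -> float:
    """Theoretical peak rate: cores x clock (GHz) x flops per cycle."""
    return cores * ghz * ops_per_cycle

xeon = peak_gflops(8, 3.0, 4)            # commodity CPU side
k20x = peak_gflops(2688, 0.732, 2 / 3)   # K20X: 2/3 DP op/cycle per CUDA core
print(round(xeon), round(k20x))          # 96 1312  (i.e. ~1.31 Tflop/s)
```

Note the imbalance the slide highlights: the GPU's 1.31 Tflop/s sits behind an 8 GB/s PCI-e link, so keeping the accelerator fed is the hard part.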
Slide No 32 Recent Developments
- US DOE is planning to deploy O(100) Pflop/s systems in 2017-2018 ($525M in hardware). Oak Ridge Lab and Lawrence Livermore Lab are to receive IBM- and Nvidia-based systems; Argonne Lab is to receive an Intel-based system. After this: exaflops.
- The US Department of Commerce is preventing some Chinese groups from receiving Intel technology, citing concerns about nuclear research being done with the systems (February 2015). On the blockade list: National SC Center Guangzhou (site of Tianhe-2), National SC Center Tianjin (site of Tianhe-1A), National University for Defense Technology (the developer), and National SC Center Changsha (location of NUDT).
- For the first time, fewer than 50% of the Top500 systems are in the U.S.: 231 systems are U.S.-based; China is #3 with 37.
Slide No 33 Today's Multicores: all of the Top500 systems are based on multicore
- Intel Haswell (18 cores)
- Intel Xeon Phi (60 cores)
- IBM Power 8 (12 cores)
- AMD Interlagos (16 cores)
- Nvidia Kepler (2688 CUDA cores)
- IBM BG/Q (18 cores)
- Fujitsu Venus (16 cores)
- ShenWei (16 cores)
Slide No 34 Problem with Processors
As we put more processing power on the multicore chip, one of the problems is getting the data to the cores. The next generation will be more integrated: a 3D design with a photonic network.
Slide No 35 Peak Performance - Per Core (we are here)
Floating-point operations per cycle per core. Most recent processors have FMA (fused multiply-add), i.e. x ← x + y·z in one cycle.
- Intel Xeon (earlier models) and AMD Opteron, SSE2: 2 flops/cycle DP, 4 flops/cycle SP
- Intel Xeon Nehalem ('09) and Westmere ('10), SSE4: 4 flops/cycle DP, 8 flops/cycle SP
- Intel Xeon Sandy Bridge ('11) and Ivy Bridge ('12), AVX: 8 flops/cycle DP, 16 flops/cycle SP
- Intel Xeon Haswell ('13) and Broadwell ('14), AVX2: 16 flops/cycle DP, 32 flops/cycle SP
- Xeon Phi (per core): 16 flops/cycle DP, 32 flops/cycle SP
- Intel Xeon Skylake ('15): 32 flops/cycle DP, 64 flops/cycle SP
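The list above doubles the per-core flops/cycle with each SIMD generation. A minimal sketch turning those figures into per-core DP peaks; the flops/cycle values come from the slide, but the 2.5 GHz clock is an illustrative assumption of mine, not a quoted spec.

```python
# DP flops per cycle per core, per the slide.
DP_FLOPS_PER_CYCLE = {
    "SSE2 (early Xeon, Opteron)": 2,
    "SSE4 (Nehalem, Westmere)": 4,
    "AVX (Sandy Bridge, Ivy Bridge)": 8,
    "AVX2 (Haswell, Broadwell)": 16,
    "Skylake-era": 32,
}

ASSUMED_GHZ = 2.5  # hypothetical clock, just to get concrete numbers

for gen, flops_per_cycle in DP_FLOPS_PER_CYCLE.items():
    peak = flops_per_cycle * ASSUMED_GHZ  # Gflop/s per core
    print(f"{gen}: {peak:.0f} Gflop/s DP per core")
```

At a fixed clock, per-core peak grew 16x across these generations purely from wider SIMD plus FMA; code that does not vectorize sees none of it.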
Slide No 36 Memory Transfer (It's All About Data Movement)
Example on my laptop, with one level of memory:
- Intel Core i7-4850HQ (Haswell), 2.3 GHz (Turbo Boost 3.5 GHz), 2 cores, 56 GFLOP/s per core
- Cache: 6 MB
- Main memory: 8 GB, 25.6 GB/s
(Latency is omitted here.) The model is simplified (see next slide), but it provides an upper bound on performance: we will never go faster than what the model predicts. (And, of course, we can go slower.)
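The upper bound the slide describes can be written as min(compute peak, bandwidth × flops per byte). A minimal sketch using the laptop numbers on the slide; the kernel intensities below are my illustrative assumptions:

```python
PEAK_GFLOPS = 2 * 56.0   # 2 cores x 56 Gflop/s (from the slide)
BANDWIDTH_GBS = 25.6     # main-memory bandwidth (from the slide)

def attainable_gflops(flops_per_byte: float) -> float:
    """Upper bound for a kernel of the given arithmetic intensity."""
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * flops_per_byte)

# daxpy (y = a*x + y): 2 flops per 24 bytes moved (read x and y, write y),
# so intensity = 2/24 flop/byte: hopelessly memory-bound.
print(round(attainable_gflops(2 / 24), 2))  # 2.13 Gflop/s, ~2% of peak
# A well-blocked matrix-matrix multiply has intensity well above
# 112/25.6 = 4.375 flop/byte and can reach the compute peak instead.
print(attainable_gflops(10.0))              # 112.0
```

This is exactly why dense linear algebra libraries are built around matrix-matrix operations: data movement, not flops, sets the bound for everything else.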