CSE5351: Parallel Processing. Part 1B. UTA Copyright (c) Slide No 1
1 Slide No 1 CSE5351: Parallel Processing Part 1B
2 Slide No 2 State of the Art in Supercomputing. Several of the following slides (or modified versions of them) are courtesy of Dr. Jack Dongarra, a distinguished professor of Computer Science at the University of Tennessee.
3 Slide No 3 Look at the Fastest Computers. Strategic importance of supercomputing:
- Essential for scientific discovery
- Critical for national security
- Fundamental contributor to the economy and competitiveness through use in engineering and manufacturing
Supercomputers are the tool for solving the most challenging problems through simulations.
4 Slide No 4 The Top500 List: H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP benchmark (Ax=b, dense problem)
- Updated twice a year: at SC xy in the States in November, and at the meeting in Germany in June
- All data available from www.top500.org
[Chart axes: Rate (TPP performance) vs. Size.]
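Since the yardstick is LINPACK, Rmax converts directly into an expected benchmark runtime: solving a dense n x n system Ax=b costs about 2/3 n^3 + 2n^2 floating-point operations. A minimal sketch in C (the matrix size and rate below are illustrative values, not taken from the list):

```c
#include <stdio.h>

/* Estimate HPL (LINPACK) runtime from problem size and sustained rate.
 * Operation count for dense LU factorization + solve: 2/3*n^3 + 2*n^2. */
int main(void) {
    double n    = 1.0e7;     /* illustrative matrix dimension */
    double rmax = 33.9e15;   /* sustained rate in flop/s (Tianhe-2 class) */
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    printf("%.2e flops -> about %.1f hours at %.1f Pflop/s\n",
           flops, flops / rmax / 3600.0, rmax / 1e15);
    return 0;
}
```

At this scale a single benchmark submission occupies the full machine for several hours.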
5 Slide No 5 Performance Development of HPC over the Last 23 Years (from the Top500). [Chart: log performance scale from 100 Mflop/s to 1 Eflop/s, with SUM, N=1 (33.9 PFlop/s), and N=500 (166 TFlop/s) trend lines.] My laptop: 70 Gflop/s. My iPhone: 4 Gflop/s.
6 Slide No 6 State of Supercomputing in 2015
- Pflops computing fully established with 67 systems.
- Three technology architecture possibilities or "swim lanes" are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs) (88 systems)
  - Special-purpose lightweight cores (e.g. IBM BG, Knights Landing)
- Interest in supercomputing is now worldwide, and growing in many new markets (over 50% of Top500 computers are in industry).
- Exascale projects exist in many countries and regions.
- Intel processors have the largest share, 86%, followed by AMD at 4%.
7 Slide No 7 July 2015: The TOP 10 Systems (Rank. Site | Computer | Country | Cores | Rmax [Pflops]):
1. National Super Computer Center in Guangzhou | Tianhe-2 (NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom) | China | 3,120,000 | 33.9
2. DOE / OS Oak Ridge Nat Lab | Titan (Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom) | USA | 560,640 | 17.6
3. DOE / NNSA Livermore Nat Lab | Sequoia (BlueGene/Q (16c) + Custom) | USA | 1,572,864 | 17.2
4. RIKEN Advanced Inst for Comp Sci | K computer (Fujitsu SPARC64 VIIIfx (8c) + Custom) | Japan | 705,024 | 10.5
5. DOE / OS Argonne Nat Lab | Mira (BlueGene/Q (16c) + Custom) | USA | 786,432 | 8.6
6. Swiss CSCS | Piz Daint (Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom) | Switzerland | 115,984 | 6.3
7. KAUST | Shaheen II (Cray XC40, Xeon 16C + Custom) | Saudi Arabia | 196,608 | 5.5
8. Texas Advanced Computing Center | Stampede (Dell, Intel (8c) + Intel Xeon Phi (61c) + IB) | USA | 204,900 | 5.2
9. Forschungszentrum Juelich (FZJ) | JuQUEEN (BlueGene/Q, Power BQC 16C 1.6GHz + Custom) | Germany | 458,752 | 5.0
10. DOE / NNSA Livermore Nat Lab | Vulcan (BlueGene/Q, Power BQC 16C 1.6GHz + Custom) | USA | 393,216 | 4.3
500 (422). Software Comp | HP Cluster | USA | 18,... cores
8 Slide No 8 Seven Top500 Systems in Australia (Rank | Name | Computer | Site | Manufacturer):
15 | C01N | SuperBlade SBI-7127RG-E/SGI ICE X, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, Intel Xeon Phi 7120P/NVIDIA | Tulip Trading | Supermicro/SGI
71 | Magnus | Cray XC40, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect | Pawsey SC Centre, WA | Cray Inc.
... | ... | Fujitsu PRIMERGY CX250 S1, Xeon E5 8C 2.600GHz, Infiniband FDR | ANU | Fujitsu
100 | Avoca | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | Victorian Life Sci Comp Initiative | IBM
... | CSIRO GPU Cluster | Nitro G16 3GPU, Xeon E5 8C 2GHz, Infiniband FDR, Nvidia K20m | CSIRO | Xenon Systems
406 | Sukuriputo Okane | SGI ICE X, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, NVIDIA 2090 | ... | SGI
408 | Galaxy | Cray XC30, Intel Xeon E5-2692v2 10C 3.000GHz, Aries interconnect | Pawsey SC Centre, WA | Cray Inc.
9 Slide No 9 Accelerator Systems. Accelerators (53 systems): Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2), IBM PowerXCell 8i (0), NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16). By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 1 Italy, 1 Poland, 1 Australia, 2 Brazil, 1 Saudi Arabia, 1 South Korea, 1 Spain, 2 Switzerland, 1 UK.
10 Slide No 10 Processors / Systems. [Pie chart of processor family share across systems; labels: Intel SandyBridge, Intel Nehalem, AMD x86_64, PowerPC, Power, Intel Core, Sparc, Others; slices: 55%, 23%, 10%, 4%, 4%, 2%, 1%, 1%.]
11 Slide No 11 Vendors / System Share: HP 196 systems (39%), IBM 164 (33%), Cray Inc. 48 (9%), SGI 17 (3%), Bull 14 (3%), Fujitsu 8 (2%), Dell 8 (2%), NUDT 4 (1%), Hitachi 4 (1%), NEC 4 (1%), Others 33 (6%).
12 Slide No 12 Countries Share. Absolute counts: US: 267, China: 63, Japan: 28, UK: 23, France: 22, Germany: ...
13 Slide No 13 Performance Development in Top500. [Chart: log performance scale from 1 Gflop/s to 1 Eflop/s, with N=1 and N=500 trend lines.]
14 Slide No 14 Today's #1 System: Tianhe-2 vs. an exascale system (Tianhe-2 value → exascale target, difference):
- System peak: 55 Pflop/s → 1 Eflop/s (~20x)
- Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1); ~15x in efficiency)
- System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)
- Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF/s (O(1))
- Node concurrency: 24 cores CPU + 171 cores CoP → O(1k) or 10k (~5x - ~50x)
- Node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x)
- System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
- Total concurrency: 3.12 M (12.48M threads, 4/core) → O(billion) (~100x)
- MTTF: Few / day → O(<1 day) (O(?))
15 Slide No 15 Exascale System Architecture with a cap of $200M and 20 MW. [Same comparison table as the previous slide, except the MTTF row reads: Few / day → Many / day (O(?)).]
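The "difference" factors in these two tables are just ratios of the corresponding columns, e.g.:

```latex
\frac{1\,\text{Eflop/s}}{55\,\text{Pflop/s}} \approx 18 \;(\sim 20\times),\qquad
\frac{50\,\text{Gflops/W}}{3\,\text{Gflops/W}} \approx 17 \;(\sim 15\times),\qquad
\frac{10^{9}\,\text{threads}}{12.48\times 10^{6}\,\text{threads}} \approx 80 \;(\sim 100\times).
```

Note that reaching an exaflop at only ~20x the power of Tianhe-2 is exactly what forces the ~15x jump in energy efficiency: holding the power budget near 20 MW is the binding constraint.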
16 Slide No 16 ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors. 4,352 ft² (404 m²). SYSTEM SPECIFICATIONS:
- Peak performance of 27 PF (24.5 Pflop/s GPU + 2.6 Pflop/s AMD)
- 18,688 compute nodes, each with: 16-core AMD Opteron CPU, NVIDIA Tesla K20x GPU, 32 + 6 GB memory
- 512 service and I/O nodes
- 200 cabinets
- 710 TB total system memory
- Cray Gemini 3D torus interconnect
- 9 MW peak power
17 Slide No 17 Summary. Major challenges are ahead for extreme computing: parallelism, hybrid designs, fault tolerance, power, and many others not discussed here. We will need completely new approaches and technologies to reach the exascale level.
18 Slide No 19 To be published in the January 2011 issue of The International Journal of High Performance Computing Applications. "We can only see a short distance ahead, but we can see plenty there that needs to be done." Alan Turing (1912-1954)
19 Slide No 20 Technology Trends: Microprocessor Capacity. Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months. 2X transistors/chip every 1.5 years became known as "Moore's Law". Microprocessors have become smaller, denser, and more powerful. And not just processors: bandwidth, storage, etc. 2X memory and processor speed, and ½ the size, cost, and power, every 18 months.
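Doubling every 18 months compounds quickly; the growth factor after t years is

```latex
G(t) = 2^{t/1.5}, \qquad G(10) = 2^{10/1.5} \approx 100,
```

i.e., roughly two orders of magnitude per decade, which matches the trend chart on the next slide.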
20 Slide No 21 Moore's Law is Alive and Well. [Chart: transistor counts (in thousands) over time, log scale from 1 to 10^7.] Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.
21 Slide No 22 But Clock Frequency Scaling Has Been Replaced by Scaling Cores / Chip. 15 years of exponential growth (~2x/year) has ended. [Chart: transistors (in thousands), frequency (MHz), and cores over time, log scale.] Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.
22 Slide No 23 Performance Has Also Slowed, Along with Power. Power is the root cause of all this: a hardware issue just became a software problem. [Chart: transistors (in thousands), frequency (MHz), power (W), and cores over time, log scale.] Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.
23 Slide No 24 Power Cost of Frequency. Power ∝ Voltage² × Frequency (V²F); Frequency ∝ Voltage; therefore Power ∝ Frequency³.
24 Slide No 25 Power Cost of Frequency (same content as Slide No 24).
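Spelling the argument out, with an illustrative two-core trade-off that is implied by (but not printed on) the slide:

```latex
P \propto V^2 f, \qquad f \propto V \;\Longrightarrow\; P \propto f^3.
```

So running two cores at 0.8x the frequency costs about 2 × 0.8³ ≈ 1.02 of the original power while offering up to 1.6x the throughput: the case for multicore.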
25 Slide No 26 Looking at the Gordon Bell Prize (recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing):
- 1 GFlop/s; 1988; Cray Y-MP; 8 processors. Static finite element analysis.
- 1 TFlop/s; 1998; Cray T3E; 1024 processors. Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method.
- 1 PFlop/s; 2008; Cray XT5; 1.5 x 10^5 processors. Superconductive materials.
- 1 EFlop/s; ~2018; ?; 1 x 10^7 processors (10^9 threads).
26 Slide No 28 Hardware and System Software Scalability.
Barriers: fundamental assumptions of system software architecture did not anticipate exponential growth in parallelism; the number of components and MTBF changes the game.
Technical focus areas: system hardware scalability, system software scalability, applications scalability.
Technical gap: 1000x improvement in system software scaling; 100x improvement in system software reliability.
[Chart: average number of cores per supercomputer (Top20 of the Top500), scale 0 to 100,000.]
27 Slide No 31 Commodity plus Accelerator. Today, 88 of the Top500 systems:
- Commodity: Intel Xeon, 8 cores, 3 GHz, 8 x 4 ops/cycle = 96 Gflop/s (DP).
- Accelerator (GPU): Nvidia K20X "Kepler", 2688 CUDA cores (192 CUDA cores/SMX), 0.732 GHz, 2688 x 2/3 ops/cycle = 1.31 Tflop/s (DP), 6 GB memory.
- Interconnect: PCI-e Gen2/3, 16 lanes, 64 Gb/s (8 GB/s) = 1 GW/s.
A quick check of these peak numbers appears below.
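These peak figures are just products of cores, clock rate, and operations per cycle; a short check in C using the numbers from the slide:

```c
#include <stdio.h>

int main(void) {
    /* Commodity CPU: 8 cores x 3 GHz x 4 DP ops/cycle. */
    double cpu = 8 * 3.0e9 * 4;                  /* = 96 Gflop/s    */
    /* Nvidia K20X: 2688 CUDA cores x 0.732 GHz x 2/3 DP ops/cycle. */
    double gpu = 2688 * 0.732e9 * (2.0 / 3.0);   /* ~= 1.31 Tflop/s */
    /* PCIe link: 8 GB/s = 1 Gword/s for 8-byte doubles. */
    double words_per_sec = 8.0e9 / 8.0;

    printf("CPU peak: %.0f Gflop/s, GPU peak: %.2f Tflop/s\n",
           cpu / 1e9, gpu / 1e12);
    /* Flops the GPU can execute per double crossing the PCIe link: */
    printf("GPU flops per word transferred: %.0f\n", gpu / words_per_sec);
    return 0;
}
```

The last line is the point of the slide: the GPU can execute on the order of a thousand floating-point operations in the time one operand arrives over PCIe, so data movement, not arithmetic, limits accelerated codes.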
28 Slide No 32 Recent Developments.
- US DOE is planning to deploy O(100) Pflop/s systems for $525M in hardware. Oak Ridge Lab and Lawrence Livermore Lab are to receive IBM- and Nvidia-based systems; Argonne Lab is to receive an Intel-based system. After this: exaflops.
- The US Dept of Commerce is preventing some China groups from receiving Intel technology, citing concerns about nuclear research being done with the systems (February 2015). On the blockade list: National SC Center Guangzhou, site of Tianhe-2; National SC Center Tianjin, site of Tianhe-1A; National University for Defense Technology, the developer; National SC Center Changsha, location of NUDT.
- For the first time, < 50% of the Top500 are in the U.S.: 231 of the systems are U.S.-based; China is #3 w/37.
29 Slide No 33 Today's Multicores. All of the Top500 systems are based on multicores:
- Intel Haswell (18 cores)
- Intel Xeon Phi (60 cores)
- IBM Power 8 (12 cores)
- AMD Interlagos (16 cores)
- Nvidia Kepler (2688 CUDA cores)
- IBM BG/Q (18 cores)
- Fujitsu Venus (16 cores)
- ShenWei (16 cores)
30 Slide No 34 Problem with Processors. As we put more processing power on the multicore chip, one of the problems is getting the data to the cores. The next generation will be more integrated: a 3D design with a photonic network.
31 Slide No 35 Peak Performance - Per Core ("We are here": floating-point operations per cycle per core). Most recent computers have FMA (fused multiply-add), i.e. x ← x + y*z in one cycle:
- Intel Xeon (earlier models) and AMD Opteron, SSE2: 2 flops/cycle DP, 4 flops/cycle SP
- Intel Xeon Nehalem ('09) and Westmere ('10), SSE4: 4 flops/cycle DP, 8 flops/cycle SP
- Intel Xeon Sandy Bridge ('11) and Ivy Bridge ('12), AVX: 8 flops/cycle DP, 16 flops/cycle SP
- Intel Xeon Haswell ('13) and Broadwell ('14), AVX2: 16 flops/cycle DP, 32 flops/cycle SP
- Xeon Phi (per core): 16 flops/cycle DP, 32 flops/cycle SP
- Intel Xeon Skylake ('15): 32 flops/cycle DP, 64 flops/cycle SP
The factor-of-two steps are unpacked below.
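Each factor-of-two step comes from widening the SIMD registers or adding FMA throughput; for Haswell, as a worked example:

```latex
\underbrace{4}_{\text{doubles per 256-bit AVX2 register}}
\times
\underbrace{2}_{\text{flops per FMA}}
\times
\underbrace{2}_{\text{FMA units per core}}
= 16 \ \text{flops/cycle (DP)}.
```

Code that cannot be vectorized, or that has no multiply-add pairs, forfeits most of this peak.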
32 Slide No 36 Memory Transfer (It's All About Data Movement). Example on my laptop, with one level of memory: 56 GFLOP/sec/core x 2 cores (Intel Core i7-4850HQ, Haswell, 2.3 GHz, Turbo Boost 3.5 GHz); cache (6 MB); main memory (8 GB) at 25.6 GB/sec. (Omitting latency here.) The model IS simplified (see next slide) but it provides an upper bound on performance as well, i.e., we will never go faster than what the model predicts. (And, of course, we can go slower.)
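These two numbers define the machine balance: peak flops divided by memory bandwidth gives the arithmetic intensity a kernel needs to be compute-bound rather than memory-bound. A minimal sketch in C using the laptop figures from the slide:

```c
#include <stdio.h>

int main(void) {
    double peak = 56.0e9 * 2;   /* 56 Gflop/s per core x 2 cores     */
    double bw   = 25.6e9;       /* main-memory bandwidth in bytes/s  */

    /* Arithmetic intensity needed to reach peak (roofline ridge point). */
    double flops_per_byte   = peak / bw;
    double flops_per_double = flops_per_byte * 8.0;

    printf("balance: %.1f flops/byte, i.e. ~%.0f flops per double loaded\n",
           flops_per_byte, flops_per_double);
    return 0;
}
```

For scale: a dot product performs 2 flops per 16 bytes loaded (0.125 flops/byte), far below the ~4.4 flops/byte balance point, so it runs at memory speed, nowhere near 112 Gflop/s.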