Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

1 Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, University of Manchester, 12/3/09

2
- Take a look at high performance computing
- What's driving HPC
- Issues with power consumption
- Future Trends

3 [Figure: TPP performance; labels: Rate, Size.]

4 [Figure: performance on a log scale from 100 Mflop/s to 100 Pflop/s; annotations: "My Laptop", "6-8 years".]

5 Looking at the Gordon Bell Prize (recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing)
- 1 GFlop/s; 1988; Cray Y-MP; 8 processors
  - Static finite element analysis
- 1 TFlop/s; 1998; Cray T3E; 1024 processors
  - Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
- 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors
  - Superconductive materials
- 1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads)
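
The three Gordon Bell milestones are spaced exactly a decade and a factor of 1000 apart. As a rough sketch of what that implies (the function names below are illustrative and not from the slides), the annual growth factor, the doubling time, and the extrapolated exaflop year can be computed as:

```python
import math

# Milestones taken from the slide: (year, sustained flop/s).
milestones = [(1988, 1e9), (1998, 1e12), (2008, 1e15)]


def annual_growth(y0, f0, y1, f1):
    """Geometric-mean growth factor per year between two milestones."""
    return (f1 / f0) ** (1.0 / (y1 - y0))


def doubling_time_years(growth_per_year):
    """Years needed for performance to double at a given annual growth factor."""
    return math.log(2.0) / math.log(growth_per_year)


for (y0, f0), (y1, f1) in zip(milestones, milestones[1:]):
    g = annual_growth(y0, f0, y1, f1)
    print(f"{y0}-{y1}: x{g:.2f} per year, "
          f"doubling every {doubling_time_years(g):.2f} years")

# Extrapolate the 1988-2008 trend forward from the 2008 petaflop result.
g = annual_growth(1988, 1e9, 2008, 1e15)
year_exaflop = 2008 + math.log(1e18 / 1e15) / math.log(g)
print(f"Trend reaches 1 EFlop/s around {year_exaflop:.0f}")
```

This is pure trend extrapolation; it says nothing about whether the hardware and software can actually sustain that rate.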

6 Performance Development in Top500 [Figure: performance on a log scale from 100 Mflop/s to 1 Eflop/s, with Gordon Bell winners marked.]

8 [Figure: Efficiency (0% to 100%) versus TOP500 ranking.]

9 [Figure: Efficiency (0% to 100%) versus TOP500 ranking.]

10 Countries / System Share [Pie chart. Legend entries: United States, United Kingdom, France, Germany, Canada, Austria, New Zealand, Sweden, Russia, Italy, Others. Slice labels: 55%, 9%, 6%, 6%, 4%, 3%, 2%, 2%, 2%, 1%, 1%, 1%, 1%, 7%.]

11 In The Netherlands: 3 Systems on Top500
Rank | Site | Cores | Rmax [Tflop/s] | Rmax/Rpeak | Power [MW] | Processor | System Model
93 | SARA | … | … | …% | .55 | POWER6 | IBM pseries 575
184 | Banking | … | … | …% | … | Intel Xeon Nehalem | IBM xseries Cluster
… | ASTRON/U Groningen | … | … | …% | .13 | PowerPC 440 | IBM BlueGene/L

12 Customer Segments [Figure: number of systems (0 to 500) per year, 1993 to 2009, split into Others, Government, Vendor, Classified, Academic, Research, Industry.]

13 Of the 500 fastest supercomputers worldwide, industrial use is > 60%.

15
Rank | Site | Computer | Country | Cores | Rmax [Tflops] | % of Peak | Power [MW] | Flops/Watt
1 | DOE / OS Oak Ridge Nat Lab | Jaguar / Cray, Cray XT5 sixcore 2.6 GHz | USA | 224,… | … | … | …95 | 151
2 | DOE / NNSA Los Alamos Nat Lab | Roadrunner / IBM, BladeCenter QS22/LS21 | USA | 122,400 | 1,042 | 76 | 2.… | …
3 | NSF / NICS / U of Tennessee | Kraken / Cray, Cray XT5 sixcore 2.6 GHz | USA | 98,928 | 831 | 81 | … | …
4 | Forschungszentrum Juelich (FZJ) | Jugene / IBM, Blue Gene/P Solution | Germany | 294,912 | 825 | 82 | 2.… | …
5 | National SC Center in Tianjin / NUDT | Tianhe-1 / NUDT, TH-1 / IntelQC + AMD ATI Radeon 4870 | China | 71,… | … | … | … | …
6 | NASA / Ames Research Center/NAS | Pleiades / SGI, SGI Altix ICE 8200EX | USA | 56,320 | 544 | 82 | 2.… | …
7 | DOE / NNSA Lawrence Livermore NL | BlueGene/L / IBM, eServer Blue Gene Solution | USA | 212,… | … | … | … | …
8 | DOE / OS Argonne Nat Lab | Intrepid / IBM, Blue Gene/P Solution | USA | 163,… | … | … | … | …
9 | NSF TACC/U. of Texas | Ranger / Sun, SunBlade x6420 | USA | 62,… | … | … | … | …
10 | DOE / NNSA Sandia Nat Lab | Sun / SunBlade 6275 | USA | 41,… | … | … | … | …

16
Rank | Site | Computer | Country | Cores | Rmax [Tflops] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE / OS Oak Ridge Nat Lab | Jaguar / Cray, Cray XT5 sixcore 2.6 GHz | USA | 224,… | … | … | …0 | 251
2 | DOE / NNSA Los Alamos Nat Lab | Roadrunner / IBM, BladeCenter QS22/LS21 | USA | 122,400 | 1,042 | 76 | 2.… | …
3 | NSF / NICS / U of Tennessee | Kraken / Cray, Cray XT5 sixcore 2.6 GHz | USA | 98,… | … | … | …09 | 269
4 | Forschungszentrum Juelich (FZJ) | Jugene / IBM, Blue Gene/P Solution | Germany | 294,912 | 825 | 82 | 2.… | …
5 | National SC Center in Tianjin / NUDT | Tianhe-1 / NUDT, TH-1 / IntelQC + AMD ATI Radeon 4870 | China | 71,… | … | … | …48 | 380
6 | NASA / Ames Research Center/NAS | Pleiades / SGI, SGI Altix ICE 8200EX | USA | 56,320 | 544 | 82 | 2.… | …
7 | DOE / NNSA Lawrence Livermore NL | BlueGene/L / IBM, eServer Blue Gene Solution | USA | 212,… | … | … | … | …
8 | DOE / OS Argonne Nat Lab | Intrepid / IBM, Blue Gene/P Solution | USA | 163,… | … | … | … | …
9 | NSF TACC/U. of Texas | Ranger / Sun, SunBlade x6420 | USA | 62,… | … | … | … | …
10 | DOE / NNSA Sandia Nat Lab | Sun / SunBlade 6275 | USA | 41,… | … | … | … | …

17 Recently upgraded to a 2 Pflop/s system with more than 224K cores using AMD's 6-core chip.
- Peak performance: … PF
- System memory: 300 TB
- Disk space: 10 PB
- Disk bandwidth: 240+ GB/s
- Interconnect bandwidth: 374 TB/s

19
- University of Tennessee's National Institute for Computational Sciences
- Housed at ORNL, operated for the NSF, named Kraken
- Number 3 on the Top500
Just upgraded to 1 Pflop/s peak: 99,072 cores, AMD 2.6 GHz 6-core chip, with 129 TB memory

20
- IBM BG/P - 72 racks with 32 nodecards x 32 compute nodes (total 73,728)
  - Compute node: 4-way SMP processor
  - Processor type: 32-bit PowerPC 450 core, 850 MHz
  - Processors: 294,912
  - Overall peak performance: 1 Pflop/s
  - Linpack: … Tflop/s
  - Main memory: 2 Gbytes per node (aggregate 144 TB)
  - I/O nodes: 600
  - Networks: three-dimensional torus (compute nodes)
- Power consumption:
  - max. 35 kW per rack
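
The BG/P figures on this slide can be cross-checked with a few multiplications. The sketch below uses only numbers from the slide, plus one assumption that is not stated there: 4 floating-point operations per core per cycle for the PowerPC 450.

```python
# Values taken from the slide.
RACKS = 72
NODECARDS_PER_RACK = 32
NODES_PER_NODECARD = 32
CORES_PER_NODE = 4             # "4-way SMP processor"
CLOCK_HZ = 850e6               # 850 MHz
MEM_PER_NODE_GIB = 2           # "2 Gbytes per node"
MAX_POWER_PER_RACK_KW = 35

# ASSUMPTION (not on the slide): each core retires 4 flops per cycle.
FLOPS_PER_CYCLE = 4

nodes = RACKS * NODECARDS_PER_RACK * NODES_PER_NODECARD
cores = nodes * CORES_PER_NODE
peak_flops = cores * CLOCK_HZ * FLOPS_PER_CYCLE
memory_tib = nodes * MEM_PER_NODE_GIB / 1024
max_power_mw = RACKS * MAX_POWER_PER_RACK_KW / 1000

print(f"compute nodes : {nodes:,}")
print(f"cores         : {cores:,}")
print(f"peak          : {peak_flops / 1e15:.3f} Pflop/s")
print(f"memory        : {memory_tib:.0f} TB (binary)")
print(f"max power     : {max_power_mw:.2f} MW")
```

Under that assumption the node count, processor count, aggregate memory, and the rounded 1 Pflop/s peak all agree with the slide.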

21
- Tianhe-1
- Hybrid system, commodity + GPUs
- Theoretical peak 1.21 Pflop/s
- Linpack Benchmark at … Tflop/s
- 2560 nodes, each node: 2 Intel Quadcore Xeon
- …,120 AMD ATI 4870 GPUs (each 10 cores)
  - 71,680 cores
  - Infiniband connected
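
The Tianhe-1 core total can be decomposed into CPU and GPU contributions. This is a small consistency check built only from the slide's own numbers; the variable names are illustrative, and the "10 cores per GPU" counting convention is the slide's.

```python
# Values taken from the slide.
NODES = 2560
XEONS_PER_NODE = 2
CORES_PER_XEON = 4       # "Quadcore"
CORES_PER_GPU = 10       # "each 10 cores", as counted on the slide
TOTAL_CORES = 71_680

cpu_cores = NODES * XEONS_PER_NODE * CORES_PER_XEON
gpu_cores = TOTAL_CORES - cpu_cores

# The GPU count is derived from the slide's total, not read from it.
gpus = gpu_cores // CORES_PER_GPU
gpus_per_node = gpus / NODES

print(f"CPU cores: {cpu_cores:,}")
print(f"GPU cores: {gpu_cores:,} on {gpus:,} GPUs ({gpus_per_node:.0f} per node)")
```

The derived GPU count is what the 71,680-core total implies if every remaining core sits on a 10-core GPU.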

22 Performance of Top20 Over 10 Years [Figure: y-axis in Pflop/s.]

24 Moore's Law is Alive and Well [Figure: Transistors (in thousands), log scale from 1.E+00 to 1.E+07.]

25 But Clock Frequency Scaling Replaced by Scaling Cores / Chip [Figure: Transistors (in thousands), Frequency (MHz), and Cores, log scale from 1.E+00 to 1.E+07.]

26 Performance Has Also Slowed, Along with Power [Figure: Transistors (in thousands), Frequency (MHz), Power (W), and Cores, log scale from 1.E+00 to 1.E+07.]

27 [Figure: Frequency.]

28 [Figure: Frequency.]

29
- Number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
  - Need to deal with systems with millions of concurrent threads
  - Future generations will have billions of threads!
  - Need to be able to easily replace inter-chip parallelism with intra-chip parallelism
- Number of threads of execution doubles every 2 years

30
- Barriers
  - Fundamental assumptions of system software architecture did not anticipate exponential growth in parallelism
  - Number of components and MTBF changes the game
- Technical Focus Areas
  - System Hardware Scalability
  - System Software Scalability
  - Applications Scalability
- Technical Gap
  - 1000x improvement in system software scaling
  - 100x improvement in system software reliability
[Figure: Average Number of Cores Per Supercomputer, axis from 0 to 100,000.]

31
- Have been planning this for years.
- Started in June 2008
- Independent from the Green500, but we try to learn from each other.
- Collect power consumption for:
  - Linpack as workload
  - Including all essential parts of a system (processor, memory, & interconnect)
  - Excluding features related to machine room (most disk, UPS, ...)
- Analyze these data carefully!
- Rule of thumb: 1 MW ~ 1000 homes

32 [Figure: Power [kW] versus TOP500 rank.]

33
- To rank objects by size one needs extensive properties:
  - Weight or volume
  - Performance: Flop/s (Rmax (TOP500))
- A larger system should have a larger Rmax.
  - Power consumption: Watts
- The ratio of 2 extensive properties is an intensive one:
  - (weight/volume = density)
  - Performance / Power Consumption = Power_efficiency
- One cannot rank objects with densities BY SIZE:
  - Density does not tell anything about size of an object
  - The density of lead compared to the density of wood does not tell you if one is heavier or larger than the other.
- Linpack / Power will always sort smaller systems before larger ones!
[Figure: Power Efficiency [MFlops/Watt] (0 to 600) versus TOP500 rank (0 to 200).]
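
The extensive-versus-intensive distinction can be illustrated with a small sketch. The three systems below are invented for illustration only (they are not TOP500 entries). The point is that sorting by Rmax and sorting by Rmax / Power can produce opposite orders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    rmax_tflops: float   # extensive: grows with machine size
    power_mw: float      # extensive: grows with machine size

    @property
    def mflops_per_watt(self) -> float:
        # Intensive ratio. Tflop/s per MW equals Mflop/s per W,
        # because numerator and denominator both scale by 1e6.
        return self.rmax_tflops / self.power_mw


# HYPOTHETICAL systems, chosen only to show the effect.
systems = [
    System("large", rmax_tflops=1000.0, power_mw=5.0),
    System("medium", rmax_tflops=100.0, power_mw=0.4),
    System("small", rmax_tflops=10.0, power_mw=0.02),
]

by_size = sorted(systems, key=lambda s: s.rmax_tflops, reverse=True)
by_efficiency = sorted(systems, key=lambda s: s.mflops_per_watt, reverse=True)

print("by Rmax      :", [s.name for s in by_size])
print("by MFlops/W  :", [s.name for s in by_efficiency])
```

With these made-up numbers the efficiency ranking is the exact reverse of the size ranking, which is the situation the slide warns about.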

35
Rank Top 500 | Rank Green 500 | Site | Cores | Rmax | Rmax/Rpeak | Power [MW] | Processor | System Model
… | … | SARA | … | … | …% | .55 | POWER6 | IBM pseries 575
… | … | Banking | … | … | …% | .25 | Intel Xeon Nehalem | IBM xseries Cluster
… | … | ASTRON/U Groningen | … | … | …% | .13 | PowerPC 440 | IBM BlueGene/L

37 [Figure legend: (8+1) core, Embedded, Quadcore, Dualcore.]

38
- DOE Exascale Steering Committee
- ANL, LANL, LBNL, LLNL, SNL, ORNL + PNL, BNL
- Charter: Decadal plan to provide exascale applications and technologies for DOE mission
- ~100 people
- Climate Science (11/08)
- High Energy Physics (12/08)
- Nuclear Physics (1/09)
- Fusion Energy (3/09)
- Nuclear Energy (5/09)
- Biology (8/09)
- Basic Energy Science (8/09)
- Joint National Security (10/09)
- Computer Science
- Mathematics
- Computer Architecture
Strong science case for the continued escalation of high-end computing.

39 Systems
System peak | 2 Pflop/s | … Pflop/s | 1 Eflop/s
System memory | 0.3 PB | 5 PB | 10 PB
Node performance | 125 Gflop/s | 400 Gflop/s | 1-10 Tflop/s
Node memory BW | 25 GB/s | 200 GB/s | >400 GB/s
Node concurrency | 12 | O(100) | O(1000)
Interconnect BW | 1.5 GB/s | 25 GB/s | 50 GB/s
System size (nodes) | 18,… | …,000 | O(10^6)
Total concurrency | 225,000 | O(10^8) | O(10^9)
Storage | 15 PB | 150 PB | 300 PB
IO | 0.2 TB/s | 10 TB/s | 20 TB/s
MTTI | days | days | O(1 day)
Power | 7 MW | ~10 MW | ~20 MW
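
The first and last columns of this table imply a large jump in energy efficiency. A minimal sketch of that arithmetic, using only the peak and power values shown (the helper name is illustrative):

```python
PFLOPS = 1e15
EFLOPS = 1e18
MW = 1e6


def gflops_per_watt(peak_flops: float, power_watts: float) -> float:
    """Peak energy efficiency in Gflop/s per watt."""
    return peak_flops / power_watts / 1e9


# First column: 2 Pflop/s at 7 MW. Last column: 1 Eflop/s at ~20 MW.
first = gflops_per_watt(2 * PFLOPS, 7 * MW)
last = gflops_per_watt(1 * EFLOPS, 20 * MW)
improvement = last / first

# Node count implied by the first column's concurrency figures.
implied_nodes = 225_000 / 12

print(f"first column : {first:.3f} Gflop/s per W")
print(f"last column  : {last:.1f} Gflop/s per W")
print(f"improvement  : {improvement:.0f}x")
print(f"implied nodes: {implied_nodes:,.0f}")
```

So peak performance grows 500x while the power budget grows less than 3x. The gap has to come from efficiency per watt.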

40
- Must rethink the design of our software
  - Another disruptive technology, similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software

41
- Steepness of the ascent from terascale to petascale to exascale
- Extreme parallelism and hybrid design
  - Preparing for million/billion way parallelism
- Tightening memory/bandwidth bottleneck
  - Limits on power/clock speed implication on multicore
  - Reducing communication will become much more intense
  - Memory per core changes, byte-to-flop ratio will change
- Necessary Fault Tolerance
  - MTTF will drop
  - Checkpoint/restart has limitations
Software infrastructure does not exist today
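
Why a falling MTTF limits checkpoint/restart can be sketched with a simple model. None of the following is from the slides: it assumes independent, exponentially distributed node failures (so system MTTF is node MTTF divided by node count) and uses Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTTF). The per-node MTTF and checkpoint cost are hypothetical.

```python
import math


def system_mttf_hours(node_mttf_hours: float, n_nodes: int) -> float:
    """MTTF of the whole machine when any node failure interrupts the job.

    Assumes independent, exponentially distributed node failures.
    """
    return node_mttf_hours / n_nodes


def young_interval_hours(checkpoint_cost_hours: float, mttf_hours: float) -> float:
    """Young's first-order estimate of compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mttf_hours)


def checkpoint_overhead(checkpoint_cost_hours: float, mttf_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.

    Rework after a failure is not counted, so this is a lower bound.
    """
    interval = young_interval_hours(checkpoint_cost_hours, mttf_hours)
    return checkpoint_cost_hours / (interval + checkpoint_cost_hours)


# HYPOTHETICAL inputs, for illustration only.
NODE_MTTF_HOURS = 5 * 365 * 24    # 5 years per node
CHECKPOINT_COST_HOURS = 0.25      # 15 minutes to write one checkpoint

for n_nodes in (10_000, 100_000, 1_000_000):
    mttf = system_mttf_hours(NODE_MTTF_HOURS, n_nodes)
    interval = young_interval_hours(CHECKPOINT_COST_HOURS, mttf)
    overhead = checkpoint_overhead(CHECKPOINT_COST_HOURS, mttf)
    print(f"{n_nodes:>9,} nodes: MTTF {mttf:7.3f} h, "
          f"checkpoint every {interval:5.2f} h, overhead {overhead:5.1%}")
```

Young's formula assumes the checkpoint cost is small compared with the MTTF. In this toy setup that assumption already fails at a million nodes, where the system MTTF is shorter than the time to write a single checkpoint.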

42
- Hardware has changed dramatically while software ecosystem has remained stagnant
- Previous approaches have not looked at co-design of multiple levels in the system software stack (OS, runtime, compiler, libraries, application frameworks)
- Need to exploit new hardware trends (e.g., manycore, heterogeneity) that cannot be handled by existing software stack, memory per socket trends
- Emerging software technologies exist, but have not been fully integrated with system software, e.g., UPC, Cilk, CUDA, HPCS
- Community codes unprepared for sea change in architectures
- No global evaluation of key missing components

43 Build an international plan for developing the next generation open source software for scientific high-performance computing

44
- We believe this needs to be an international collaboration for various reasons including:
  - The scale of investment
  - The need for international input on requirements
  - US, Europeans, Asians, and others are working on their own software that should be part of a larger vision for HPC.
  - No global evaluation of key missing components
  - Hardware features are uncoordinated with software development

45
- SC08 (Austin TX) meeting to generate interest
- Funding from DOE's Office of Science & NSF Office of Cyberinfrastructure and sponsorship by Europeans and Asians
- US meeting (Santa Fe, NM) April 6-8, 2009
  - 65 people
  - NSF's Office of Cyberinfrastructure funding
- European meeting (Paris, France) June 28-29, 2009
  - 70 people
  - Outline Report
- Asian meeting (Tsukuba Japan) October 18-20, 2009
  - Draft roadmap
  - Refine Report
- SC09 (Portland OR) BOF to inform others
  - Public Comment
  - Draft Report presented
- Oxford April

48
- For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
- This strategy needs to be rebalanced - barriers to progress are increasingly on the software side.
- Moreover, the return on investment is more favorable to software.
  - Hardware has a half-life measured in years, while software has a half-life measured in decades.
- High Performance Ecosystem out of balance
  - Hardware, OS, Compilers, Software, Algorithms, Applications
- No Moore's Law for software, algorithms and applications
