Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester


1 Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, University of Manchester. 12/24/09

2 Take a look at high performance computing; what's driving HPC; future trends

3 Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
Limitations:
- Too expensive -- build a throw-away passenger jet.
- Too slow -- wait for climate or galactic evolution.
- Too difficult -- build large wind tunnels.
- Too dangerous -- weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.


5 Strategic importance of supercomputing
- Essential for scientific discovery
- Critical for national security
- Fundamental contributor to the economy and competitiveness through use in engineering and manufacturing
Supercomputers are the tool for solving the most challenging problems through simulations.

6 [Figure: TPP performance; labels: Rate, Size]

7 [Figure: performance over time on a log scale from 100 Mflop/s to 100 Pflop/s, with a "6-8 years" annotation and a "My Laptop" marker]

8 Looking at the Gordon Bell Prize (recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing)
- 1 GFlop/s; 1988; Cray Y-MP; 8 processors
  - Static finite element analysis
- 1 TFlop/s; 1998; Cray T3E; 1024 processors
  - Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
- 1 PFlop/s; 2008; Cray XT5; 1.5 x 10^5 processors
  - Superconductive materials
- 1 EFlop/s; ~2018; ?; 1 x 10^7 processors (10^9 threads)
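The three achieved milestones on this slide are each a factor of 1000 apart and ten years apart. As a rough illustration of what that implies per year, the sketch below computes the compound annual growth factor and the doubling time between consecutive milestones. The function names are illustrative and not from the talk; only the (year, flop/s) pairs come from the slide.

```python
import math

# Milestones as listed on the slide: (year, sustained flop/s).
milestones = [(1988, 1e9), (1998, 1e12), (2008, 1e15)]

def annual_growth(y0, f0, y1, f1):
    """Compound annual growth factor between two (year, flop/s) points."""
    return (f1 / f0) ** (1.0 / (y1 - y0))

def doubling_time_years(growth):
    """Years needed to double at a given annual growth factor."""
    return math.log(2) / math.log(growth)

if __name__ == "__main__":
    for (y0, f0), (y1, f1) in zip(milestones, milestones[1:]):
        g = annual_growth(y0, f0, y1, f1)
        t = doubling_time_years(g)
        print(f"{y0}-{y1}: x{g:.2f} per year, doubling about every {t:.2f} years")
```

A factor of 1000 per decade works out to roughly a doubling every year, which is the pace the ~2018 exaflop projection assumes will continue.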

9 [Figure: Performance Development in Top500, log scale from 100 Mflop/s to 1 Eflop/s; series: SUM, N=1, N=500, with Gordon Bell winners overlaid]


11 [Figure: Countries / System Share. Shares shown: 55%, 9%, 6%, 6%, 4%, 3%, 2%, 2%, 2%, 1%, 1%, 1%, 1%, 7%. Legend entries recovered: United States, United Kingdom, France, Germany, Canada, Austria, New Zealand, Sweden, Russia, Italy, Others]

12 [Figure: Customer Segments. Number of systems (0 to 500) per year, 1993 to 2009. Legend: Others, Government, Vendor, Classified, Academic, Research, Industry]

13 Of the 500 fastest supercomputers worldwide, industrial use is > 60%

14 Rank; Site; Computer; Country; Procs; Rmax [Pflops]; % of Peak; Power [MW]; Flops/Watt
1; DOE / OS Oak Ridge Nat Lab; Jaguar / Cray, Cray XT5 sixcore 2.6 GHz; USA; 224,
2; DOE / NNSA Los Alamos Nat Lab; Roadrunner / IBM, BladeCenter QS22/LS21; USA; 122,
3; NSF / NICS / U of Tennessee; Kraken / Cray, Cray XT5 sixcore 2.6 GHz; USA; 98,
4; Forschungszentrum Juelich (FZJ); Jugene / IBM, Blue Gene/P Solution; Germany; 294,
5; National SC Center in Tianjin / NUDT; Tianhe-1 / NUDT, TH-1 / Intel QC + AMD ATI Radeon 4870; China; 71,
6; NASA / Ames Research Center/NAS; Pleiades / SGI, SGI Altix ICE 8200EX; USA; 56,
7; DOE / NNSA Lawrence Livermore NL; BlueGene/L, IBM eServer Blue Gene Solution; USA; 212,
8; DOE / OS Argonne Nat Lab; Intrepid / IBM, Blue Gene/P Solution; USA; 163,
9; NSF TACC/U. of Texas; Ranger / Sun, SunBlade x6420; USA; 62,
10; DOE / NNSA Sandia Nat Lab; Sun / SunBlade 6275; USA; 41,

15 Same top-10 table as slide 14, with the last column given in MFlops/Watt.

16 Recently upgraded to a 2.3 Pflop/s system with more than 224K processor cores using AMD's 6-core chip.
- Peak performance: 2.3 PF
- System memory: 300 TB
- Disk space: 10 PB
- Disk bandwidth: 240+ GB/s
- Interconnect bandwidth: 374 TB/s
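The figures on this slide can be turned into simple balance ratios, which is one way to read the later remarks about byte-to-flop ratios and checkpointing. The sketch below is only arithmetic on the listed values; it assumes decimal units (1 TB = 10^12 bytes) and treats the "240+ GB/s" disk bandwidth as exactly 240 GB/s, so the dump time is an optimistic upper bound on speed. Variable and function names are illustrative.

```python
# Figures as listed on the slide for the upgraded system.
PEAK_FLOPS = 2.3e15        # 2.3 Pflop/s peak
MEMORY_BYTES = 300e12      # 300 TB system memory (decimal TB assumed)
DISK_BW = 240e9            # 240+ GB/s disk bandwidth, taken as 240 GB/s
INTERCONNECT_BW = 374e12   # 374 TB/s interconnect bandwidth

def per_peak_flop(quantity):
    """Ratio of a capacity or bandwidth to peak flop/s."""
    return quantity / PEAK_FLOPS

def memory_dump_seconds():
    """Time to write all of memory to disk at the listed disk bandwidth."""
    return MEMORY_BYTES / DISK_BW

if __name__ == "__main__":
    print(f"memory bytes per peak flop/s:        {per_peak_flop(MEMORY_BYTES):.3f}")
    print(f"interconnect bytes/s per peak flop/s: {per_peak_flop(INTERCONNECT_BW):.3f}")
    print(f"full-memory dump to disk:             {memory_dump_seconds() / 60:.1f} minutes")
```

Under these assumptions the machine holds about 0.13 bytes of memory per peak flop/s, and writing all of memory to disk takes about 21 minutes.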

18 University of Tennessee's National Institute for Computational Sciences
- Housed at ORNL, operated for the NSF, named Kraken
- Number 3 on the Top500
- Just upgraded to 1 Pflop/s peak
- 99,072 cores, AMD 2.6 GHz 6-core chip, with 129 TB memory

19 IBM BG/P: 72 racks with 32 nodecards x 32 compute nodes (total 73,728)
- Compute node: 4-way SMP processor
- Processor type: 32-bit PowerPC 450 core, 850 MHz
- Processors: 294,912
- Overall peak performance: 1 Pflop/s
- Linpack: Tflop/s
- Main memory: 2 Gbytes per node (aggregate 144 TB)
- I/O nodes: 600
- Networks: three-dimensional torus (compute nodes)
- Power consumption: max. 35 kW per rack
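The totals on this slide follow from the per-rack and per-node figures, and a short calculation makes the relationship explicit. In the sketch below, the value of 4 floating-point operations per cycle per core is an assumption not stated on the slide (it corresponds to a dual FPU with fused multiply-add); the aggregate memory uses binary units (1 TB = 1024 GB), which is what reproduces the slide's 144 TB. The total power line simply multiplies the per-rack maximum by the rack count.

```python
# Per-unit figures from the slide.
RACKS = 72
NODECARDS_PER_RACK = 32
NODES_PER_NODECARD = 32
CORES_PER_NODE = 4            # 4-way SMP
CLOCK_HZ = 850e6              # 850 MHz
MEM_PER_NODE_GB = 2
MAX_KW_PER_RACK = 35

# Assumption (not on the slide): 4 flops per cycle per core.
FLOPS_PER_CYCLE = 4

nodes = RACKS * NODECARDS_PER_RACK * NODES_PER_NODECARD
cores = nodes * CORES_PER_NODE
peak_flops = cores * CLOCK_HZ * FLOPS_PER_CYCLE
memory_tb = nodes * MEM_PER_NODE_GB / 1024      # binary TB
max_power_mw = RACKS * MAX_KW_PER_RACK / 1000

if __name__ == "__main__":
    print(f"compute nodes: {nodes:,}")
    print(f"cores:         {cores:,}")
    print(f"peak:          {peak_flops / 1e15:.3f} Pflop/s")
    print(f"memory:        {memory_tb:.0f} TB")
    print(f"max power:     {max_power_mw:.2f} MW")
```

With that assumption the computed peak is about 1.003 Pflop/s, consistent with the 1 Pflop/s stated on the slide.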

20 Tianhe-1
- Hybrid system, commodity + GPUs
- Theoretical peak 1.21 Pflop/s
- Linpack Benchmark at Tflop/s
- 2560 nodes, each node: 2 Intel Quadcore Xeon
- ,120 AMD ATI 4870 GPUs (each 10 cores)
- 71,680 cores
- Infiniband connected

21 [Figure: Performance of Top20 Over 10 Years, in Pflop/s]

22 Blue Waters (NCSA/Illinois): 10 Pflop/s peak; 1 Pflop/s sustained, in 2010
Kraken (NICS/U of Tennessee): 1 Pflop/s peak
Ranger (TACC/U of Texas): 504 Tflop/s peak
Campuses across the U.S. (several sites): Tflop/s peak


26 Rank; Site; Manufacturer; Computer; Cores
5; National SuperComputer, Tianjin; NUDT; NUDT TH-1 Cluster, Xeon, ATI Radeon HD 4870
Shanghai Supercomputer Center; Dawning; Dawning 5000A, QC Opteron 1.9 GHz, Windows
Computer Network Information, CAS; Lenovo; DeepComp 7000, HS21/x3950 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Telecommunication Company; HP; Cluster Platform 3000 BL480c, Xeon
Nanjing University; IBM; BladeCenter HS22 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Logistic Services; HP; Cluster Platform 3000 BL460c G1, Xeon
Network Company; IBM; BladeCenter HS22 Cluster, Xeon
Telecommunication Company; HP; Cluster Platform 3000 BL460c
CNPC Sichuan Geophysical; IBM; BladeCenter HS21 Cluster, Xeon
Telecommunication Company; HP; Cluster Platform 3000 BL460c G
Telecommunication Company; HP; Cluster Platform 3000 BL460c G6, Xeon
Institute of Engineering Mechanics; HP; Cluster Platform 3000 BL460c G1, Xeon
Petroleum Company; IBM; BladeCenter HS21 Cluster, Xeon
China Petroleum University; IBM; BladeCenter LS22, Opteron; 3072


28 Loongson (academic name: Godson, also known as Dragon chip) is a family of general-purpose MIPS-compatible CPUs developed at the Institute of Computing Technology, Chinese Academy of Sciences. The chief architect is Professor Weiwu Hu. The 65 nm Loongson 3 (Godson-3) is able to run at a clock speed between 1.0 and 1.2 GHz, with 4 CPU cores (10 W) first and 8 cores later (20 W), and it is expected to debut in [...]. Will use this chip as the basis for a petascale system in [...].

29 Most likely a hybrid design
- Think standard multicore chips and accelerators (GPUs)
- Today accelerators are attached
- Next generation more integrated
- Intel's Larrabee?
  - 8, 16, 32, or 64 x86 cores
- AMD's Fusion in 2011
  - Multicore with embedded ATI graphics
- Nvidia's plans?

30 [Figure: Different Classes of Chips (Home, Games / Graphics, Business, Scientific), built from many floating-point cores + 3D stacked memory]

31 Moore's Law is Alive and Well
[Figure: Transistors (in Thousands) over time, log scale from 1.E+00 to 1.E+07]

32 But Clock Frequency Scaling Replaced by Scaling Cores / Chip
[Figure: Transistors (in Thousands), Frequency (MHz), and Cores over time, log scale from 1.E+00 to 1.E+07]

33 Performance Has Also Slowed, Along with Power
[Figure: Transistors (in Thousands), Frequency (MHz), Power (W), and Cores over time, log scale from 1.E+00 to 1.E+07]

34 [Figure: Frequency]

35 [Figure: Frequency]

36 Number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
- Need to deal with systems with millions of concurrent threads
- Future generations will have billions of threads
- Need to be able to easily replace inter-chip parallelism with intra-chip parallelism
Number of threads of execution doubles every 2 years.
[Figure: Average Number of Cores Per Supercomputer, axis from 0 to 100,000]
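The doubling rule on this slide is a plain exponential, so it is easy to project forward. The sketch below isolates that one factor, a count that doubles every 2 years, and ignores any growth in the number of chips per system. The function names and the starting counts in the example are illustrative, not figures from the talk.

```python
import math

def projected_count(count_now, years, doubling_period=2.0):
    """Count after `years` if it doubles every `doubling_period` years."""
    return count_now * 2 ** (years / doubling_period)

def years_to_reach(count_now, target, doubling_period=2.0):
    """Years until `count_now` grows to `target` at the same doubling rate."""
    return doubling_period * math.log2(target / count_now)

if __name__ == "__main__":
    # Ten years of doubling every 2 years is five doublings.
    print(f"growth over 10 years: x{projected_count(1, 10):.0f}")
    # Illustrative: from a million threads to a billion threads.
    print(f"1e6 -> 1e9 threads: {years_to_reach(1e6, 1e9):.1f} years")
```

Doubling every 2 years gives a factor of 32 per decade, so moving from millions to billions of threads on that factor alone would take about 20 years. Reaching it sooner requires the number of chips or threads per core to grow as well.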

37 Must rethink the design of our software
- Another disruptive technology
- Similar to what happened with cluster computing and message passing
- Rethink and rewrite the applications, algorithms, and software

38 DOE Exascale Steering Committee
- ANL, LANL, LBNL, LLNL, SNL, ORNL + PNL, BNL
- Charter: decadal plan to provide exascale applications and technologies for DOE mission needs
Workshops @ ~100 people
- Climate Science (11/08)
- High Energy Physics (12/08)
- Nuclear Physics (1/09)
- Fusion Energy (3/09)
- Nuclear Energy (5/09)
- Biology (8/09)
- Basic Energy Science (8/09)
- Joint National Security (10/09)
- Computer Science
- Mathematics
- Computer Architecture
Strong science case for the continued escalation of high-end computing.

39 Exascale systems are likely feasible by 2017±2
- 10-100 million processing elements (cores or mini-cores), with chips perhaps as dense as 1,000 cores per socket; clock rates will grow more slowly
- 3D packaging likely
- Large-scale optics-based interconnects
- PB of aggregate memory
- Hardware- and software-based fault management
- Heterogeneous cores
- Performance per watt: stretch goal of 100 GF/watt of sustained performance >> 10-100 MW exascale system
- Power, area and capital costs will be significantly higher than for today's fastest systems
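The link between the performance-per-watt goal and the power of an exascale system is a single division, sketched below. The 100 GF/watt value is the stretch goal from the slide; the 10 GF/watt case is an illustrative comparison point and not a figure from the talk. The function name is hypothetical.

```python
EXAFLOP = 1e18  # 1 Eflop/s sustained

def system_power_mw(sustained_flops, gflops_per_watt):
    """Power in megawatts needed to sustain a rate at a given efficiency."""
    watts = sustained_flops / (gflops_per_watt * 1e9)
    return watts / 1e6

if __name__ == "__main__":
    for efficiency in (100, 10):
        mw = system_power_mw(EXAFLOP, efficiency)
        print(f"{efficiency:>3} GF/watt -> {mw:.0f} MW for 1 Eflop/s")
```

At the stretch goal an exaflop system draws 10 MW; at one tenth of that efficiency it draws 100 MW.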

40 Steepness of the ascent from terascale to petascale to exascale
Extreme parallelism and hybrid design
- Preparing for million/billion-way parallelism
Tightening memory/bandwidth bottleneck
- Limits on power/clock speed and their implications for multicore
- Reducing communication will become much more intense
- Memory per core changes; byte-to-flop ratio will change
Necessary fault tolerance
- MTTF will drop
- Checkpoint/restart has limitations
Software infrastructure does not exist today
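One way to see why a falling MTTF limits checkpoint/restart is a first-order model. This model is not from the talk: the sketch assumes independent, exponentially distributed node failures, so system MTTF is node MTTF divided by node count, and it uses Young's approximation for the checkpoint interval, sqrt(2 x checkpoint time x MTTF). The 5-year node MTTF and 15-minute checkpoint time are illustrative assumptions.

```python
import math

def system_mttf_hours(node_mttf_hours, nodes):
    """System MTTF assuming independent exponential node failures."""
    return node_mttf_hours / nodes

def young_interval_hours(checkpoint_hours, mttf_hours):
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_hours * mttf_hours)

def overhead_fraction(checkpoint_hours, mttf_hours):
    """Approximate share of time spent on checkpoints plus lost work."""
    tau = young_interval_hours(checkpoint_hours, mttf_hours)
    return checkpoint_hours / tau + tau / (2.0 * mttf_hours)

if __name__ == "__main__":
    node_mttf = 5 * 365 * 24      # assumption: 5 years per node, in hours
    checkpoint = 0.25             # assumption: 15 minutes per checkpoint
    for nodes in (1_000, 10_000, 100_000):
        mttf = system_mttf_hours(node_mttf, nodes)
        tau = young_interval_hours(checkpoint, mttf)
        lost = overhead_fraction(checkpoint, mttf)
        print(f"{nodes:>7} nodes: MTTF {mttf:6.2f} h, "
              f"interval {tau:5.2f} h, overhead {lost:.0%}")
```

The approximation is only meaningful while the checkpoint time is much smaller than the MTTF. An overhead estimate near or above 100% signals that the machine would make little or no forward progress under these assumptions, which is the limitation the slide points to.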

41 For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware. This strategy needs to be rebalanced: barriers to progress are increasingly on the software side. Moreover, the return on investment is more favorable to software.
- Hardware has a half-life measured in years, while software has a half-life measured in decades.
High-performance ecosystem out of balance
- Hardware, OS, compilers, software, algorithms, applications
- No Moore's Law for software, algorithms and applications

42 Top500
- Hans Meuer, Prometeus
- Erich Strohmaier, LBNL/NERSC
- Horst Simon, LBNL/NERSC
