An Overview of High Performance Computing

1 IFIP Working Group 10.3 on Concurrent Systems
An Overview of High Performance Computing
Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
1/3/2006

2 Overview
- Look at fastest computers
  - From the Top500
- Some of the changes that face us
  - Hardware
  - Software
  - Algorithms

3 H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem)
- Updated twice a year
  - SC'xy in the States in November
  - Meeting in Germany in June
- Started in 1993, 13 years of data
- All data available from
[chart: TPP performance, Rate versus Size]
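To make the yardstick concrete, here is a minimal sketch of what a LINPACK-style measurement does: solve a dense system Ax=b by Gaussian elimination with partial pivoting, time the solve, and convert the time into a flop rate. The function names (`solve_dense`, `linpack_flops`, `measure_rate`), the pure-Python solver, and the small problem size are assumptions made for illustration. The operation count 2/3 n^3 + 2 n^2 is the nominal count conventionally credited to the classic LINPACK benchmark, and the real benchmark relies on tuned libraries and far larger n.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows of n floats, b is a list of n floats.
    Both are copied, so the caller's data is left untouched.
    """
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Nominal operation count credited for one dense n x n solve."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def measure_rate(n, seed=0):
    """Return (flop/s, max residual) for one random n x n problem."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    residual = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    return linpack_flops(n) / elapsed, residual


if __name__ == "__main__":
    rate, residual = measure_rate(100)
    print(f"n=100: {rate / 1e6:.1f} Mflop/s, max residual {residual:.2e}")
```

The rate printed by an interpreted solver like this one will be orders of magnitude below what the hardware can deliver, which is exactly why the list reports Rmax from an optimized implementation.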

4 Performance Development
[chart: Top500 performance over time, log scale from 100 Mflop/s to 1 Pflop/s; series SUM (reaching 2.3 PF/s), N=1 (starting at 59.7 GF/s), N=500 (starting at 0.4 GF/s), and "My Laptop"; systems labelled along N=1: Fujitsu 'NWT' NAL, Intel ASCI Red Sandia, IBM ASCI White LLNL, NEC Earth Simulator, IBM BlueGene/L]

5 Architecture/Systems Continuum
Tightly Coupled
- Custom processor with custom interconnect
  - Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L
  - Best processor performance for codes that are not cache friendly
  - Good communication performance
  - Simpler programming model
  - Most expensive
- Commodity processor with custom interconnect
  - SGI Altix (Intel Itanium 2); Cray XT3, XD1 (AMD Opteron)
  - Good communication performance
  - Good scalability
- Commodity processor with commodity interconnect
  - Clusters: Pentium, Itanium, Opteron, Alpha; GigE, Infiniband, Myrinet, Quadrics
  - NEC TX7, IBM eserver, Dawning
  - Best price/performance (for codes that work well with caches and are latency tolerant)
  - More complex programming model
Loosely Coupled

6 Architecture/Systems Continuum (continued)
Same continuum as the previous slide, overlaid with a chart.
[chart: share of Top500 systems that are Custom, Hybrid, and Commodity, 0% to 100%, Jun-93 to Jun-04]

7 Commodity Processors
- Intel Pentium Nocona: 3.6 GHz, peak = 7.2 Gflop/s
  - Linpack 100 = 1.8 Gflop/s
  - Linpack 1000 = 4.2 Gflop/s
- Intel Itanium: peak = 6.4 Gflop/s
  - Linpack 100 = 1.7 Gflop/s
  - Linpack 1000 = 5.7 Gflop/s
- AMD Opteron: 2.6 GHz, peak = 5.2 Gflop/s
  - Linpack 100 = 1.6 Gflop/s
  - Linpack 1000 = 3.9 Gflop/s
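The figures on this slide relate to each other by simple arithmetic: peak rate is clock rate times floating-point operations per cycle, and a Linpack result can be read as a fraction of peak. The sketch below recomputes both from the slide's numbers. The dictionary layout and function names are assumptions for illustration, and the Itanium clock rate is left as `None` because it is not given here.

```python
# Figures copied from the slide, all rates in Gflop/s.
# A clock of None means the slide text does not give one.
PROCESSORS = {
    "Intel Pentium Nocona": {"ghz": 3.6, "peak": 7.2, "linpack100": 1.8, "linpack1000": 4.2},
    "Intel Itanium": {"ghz": None, "peak": 6.4, "linpack100": 1.7, "linpack1000": 5.7},
    "AMD Opteron": {"ghz": 2.6, "peak": 5.2, "linpack100": 1.6, "linpack1000": 3.9},
}


def flops_per_cycle(entry):
    """Peak flops issued per clock cycle, or None if the clock is unknown."""
    if entry["ghz"] is None:
        return None
    return entry["peak"] / entry["ghz"]


def efficiency(entry, benchmark):
    """Fraction of peak achieved on the named benchmark."""
    return entry[benchmark] / entry["peak"]


def summarize():
    """Return (name, flops/cycle, Linpack 100 efficiency, Linpack 1000 efficiency)."""
    return [
        (name, flops_per_cycle(e), efficiency(e, "linpack100"), efficiency(e, "linpack1000"))
        for name, e in PROCESSORS.items()
    ]


if __name__ == "__main__":
    for name, fpc, e100, e1000 in summarize():
        fpc_text = "n/a" if fpc is None else f"{fpc:.1f}"
        print(f"{name}: {fpc_text} flops/cycle, "
              f"Linpack 100 at {e100:.0%} of peak, Linpack 1000 at {e1000:.0%} of peak")
```

On all three processors the small n=100 problem runs at roughly a quarter to a third of peak, while n=1000 gets much closer, which is one reason the problem size matters when quoting Linpack numbers.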

8 Architectures / Systems
[chart: number of systems (0 to 500) by architecture over time: SIMD, Single Proc., SMP, Constellations, Cluster (360), MPP]
- Cluster: commodity processors & commodity interconnect
- Constellation: # of procs/node ≥ nodes in the system

9 Interconnects / Systems
[chart: number of systems by interconnect over time: Others, Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband, Myrinet (101), Gigabit Ethernet (249), N/A]

10 Processor Types
[chart: number of systems (0 to 500) by processor type over time: SIMD, Sparc, Vector, MIPS, Alpha, HP, AMD, IBM Power, Intel]

11 Processors Used in Each of the 500 Systems
- Intel IA-32: 41%
- Intel EM64T: 16%
- IBM Power: 15%
- AMD x86_64: 11%
- Intel IA-64: 9%
- HP PA-RISC: 3%
- Cray: 2%
- HP Alpha: 1%
- NEC: 1%
- Sun Sparc: 1%
- Hitachi SR8000: 0%
91% = 66% Intel, 15% IBM, 11% AMD

12 26th List: The TOP10
(rank. manufacturer, computer; installation site; country)
1. IBM, BlueGene/L eserver Blue Gene; DOE/NNSA/LLNL; USA
2. IBM, BGW eserver Blue Gene; IBM Thomas Watson; USA
3. IBM, ASC Purple Power5 p; DOE/NNSA/LLNL; USA
4. SGI, Columbia Altix, Itanium/Infiniband; NASA Ames; USA
5. Dell, Thunderbird Pentium/Infiniband; Sandia; USA
6. Cray, Red Storm Cray XT3 AMD; Sandia; USA
7. NEC, Earth-Simulator SX; Earth Simulator Center; Japan
8. IBM, MareNostrum PPC 970/Myrinet; Barcelona Supercomputer Center; Spain
9. IBM, eserver Blue Gene; ASTRON University Groningen; Netherlands
10. Cray, Jaguar Cray XT3 AMD; Oak Ridge National Lab; USA

13 Customer Segments / Performance
[chart: share of performance (0% to 100%) by customer segment over time: Research, Government, Industry, Academic, Vendor, Classified]

14 Countries / Performance
[chart: share of performance (0% to 100%) by country]
- US: 68%
- Japan: 6.0%
- UK: 5.4%
- Germany: 3.0%
- China: 2.6%
- France: 1.0%
- Others: 14%

15 Asian Countries / Systems
[chart: number of systems by country: Others 17, India 4, Korea 7, China 17, Japan]

16 Concurrency Levels of the Top500
[chart: number of systems by processor-count range; ranges shown include 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k]

17 Concurrency Levels of the Top500
[chart: maximum, average, and minimum # processors per system, log scale, from Jun-93 onward; the maximum reaches 131K]

18 IBM BlueGene/L: #1, 131,072 Processors
Total of 18 systems, all in the Top100
1.6 MWatts (1600 homes); 43,000 ops/s/person

Packaging hierarchy (two peak figures per level, as on the slide):
- Chip (BlueGene/L Compute ASIC, 2 processors): 2.8/5.6 GF/s, 4 MB (cache)
- Compute Card (2 chips, 2x1x1), 4 processors: 5.6/11.2 GF/s, 1 GB DDR
- Node Board (32 chips, 4x4x2), 16 Compute Cards, 64 processors: 90/180 GF/s, 16 GB DDR
- Rack (32 Node boards, 8x8x16), 2048 processors: 2.9/5.7 TF/s, 0.5 TB DDR
- System (64 racks, 64x32x32), 131,072 procs: 180/360 TF/s, 32 TB DDR

The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).

Fastest Computer: BG/L, 700 MHz, 131K proc, 64 racks
- Peak: 367 Tflop/s
- Linpack: 281 Tflop/s, 77% of peak (13K sec, about 3.6 hours; n=1.8M)
Full system total of 131,072 processors
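The BlueGene/L numbers can be cross-checked with a few multiplications. The sketch below rebuilds the processor count level by level, derives the system peak from the per-processor rate, and recomputes the Linpack efficiency and run time. The names are illustrative assumptions, and the last function uses the nominal 2/3 n^3 + 2 n^2 operation count, which is an assumption brought in here and not stated on the slide.

```python
# Multiplier contributed by each packaging level, taken from the slide.
LEVELS = [
    ("chip", 2),          # processors per chip
    ("compute card", 2),  # chips per compute card
    ("node board", 16),   # compute cards per node board
    ("rack", 32),         # node boards per rack
    ("system", 64),       # racks in the full system
]

GF_PER_PROCESSOR = 2.8   # peak Gflop/s per processor (5.6 GF/s per 2-processor chip)
LINPACK_TFLOPS = 281.0   # reported Linpack result
RUN_SECONDS = 13_000.0   # reported run time ("13K sec")
MATRIX_ORDER = 1.8e6     # reported problem size ("n=1.8M")


def processors_by_level():
    """Cumulative processor count at each packaging level."""
    counts, total = {}, 1
    for name, factor in LEVELS:
        total *= factor
        counts[name] = total
    return counts


def system_peak_tflops():
    """System peak in Tflop/s from the per-processor peak."""
    return processors_by_level()["system"] * GF_PER_PROCESSOR / 1000.0


def linpack_efficiency():
    """Reported Linpack rate as a fraction of the computed peak."""
    return LINPACK_TFLOPS / system_peak_tflops()


def implied_rate_tflops():
    """Rate implied by the rounded n and run time, using the nominal flop count."""
    flops = 2.0 * MATRIX_ORDER**3 / 3.0 + 2.0 * MATRIX_ORDER**2
    return flops / RUN_SECONDS / 1e12


if __name__ == "__main__":
    for name, count in processors_by_level().items():
        print(f"{name}: {count:,} processors")
    print(f"peak: {system_peak_tflops():.0f} Tflop/s")
    print(f"efficiency: {linpack_efficiency():.0%}")
    print(f"run time: {RUN_SECONDS / 3600:.1f} hours")
    print(f"rate implied by rounded n and time: {implied_rate_tflops():.0f} Tflop/s")
```

The implied rate comes out somewhat above the reported 281 Tflop/s, which is unsurprising given that both n and the run time are quoted to two significant figures.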

19 Performance Projection
[chart: Top500 performance extrapolated forward, log scale from 100 Mflop/s to 1 Eflop/s; series SUM, N=1, N=500, and "My Laptop"; DARPA HPCS marked on the projection]

20 A PetaFlop Computer by the End of the Decade
10 companies are working on building a Petaflop system by the end of the decade:
- Cray, IBM, Sun
- Dawning, Galactic, Lenovo (Chinese companies)
- Hitachi, NEC, Fujitsu (Japanese "Life Simulator", 10 Pflop/s)
- Bull

21 Flops per Gross Domestic Product
Based on the November 2005 Top500
[bar chart; countries shown: Israel, USA, New Zealand, Switzerland, Russia, UK, Netherlands, Australia, South Korea, Saudi Arabia, China, Japan, Ireland, Spain, Germany, Taiwan, Canada, Brazil]

22 KFlop/s per Capita (Flops/Pop)
Based on the November 2005 Top500
[bar chart; countries shown: United States, Switzerland, Israel, New Zealand, United Kingdom, Netherlands, Australia, Japan, Germany, Spain, Canada, Korea (South), Sweden, Saudi Arabia, France, Taiwan, Italy, Brazil, Mexico, China, Russia, India]
- Hint: Peter Jackson had something to do with this
- Has nothing to do with the 47.2 million sheep in NZ

23 KFlop/s per Capita (Flops/Pop)
Based on the November 2005 Top500
[same bar chart as the previous slide, with the New Zealand hint answered]
- Hint: Peter Jackson had something to do with this: WETA Digital (Lord of the Rings)
- Has nothing to do with the 47.2 million sheep in NZ

24 Fuel Efficiency: GFlops/Watt
Top 20 systems, based on processor power rating only (3,>100,>800)
[bar chart of GFlops/Watt; systems shown: BlueGene/L Blue Gene; ASC Purple p5 1.9 GHz; Columbia, SGI Altix 1.5 GHz; Thunderbird, Pentium 3.6 GHz; Red Storm Cray XT3, 2.0 GHz; Earth-Simulator; MareNostrum PPC 970, 2.2 GHz; Blue Gene; Jaguar, Cray XT3, 2.4 GHz; Thunder, Intel Itanium2 1.4 GHz; Blue Gene; Blue Gene; Cray XT3, 2.6 GHz; Apple XServe, 2.0 GHz; Cray X1E (4GB); Cray X1E (2GB); ASCI Q, Alpha 1.25 GHz; IBM p; System X 2.3 GHz Apple XServe]

25 Power Density (W/cm^2)
[chart: power density, log scale from 1 to 10,000 W/cm^2, by year ('70 to '10) for Intel processors including the 4004, 8008, 486, and Pentium; reference levels marked: Hot Plate, Nuclear Reactor, Rocket Nozzle, Sun's Surface]
Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W)
Cube relationship between the cycle time and power.
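The cube relationship can be illustrated with a small calculation. The sketch below assumes the common first-order model in which dynamic power scales as capacitance times voltage squared times frequency, and in which supply voltage is scaled in proportion to frequency, so power scales with the cube of clock frequency (the inverse of cycle time). That model, the function names, and the idealized perfectly parallel throughput are assumptions for illustration and are not taken from the slide.

```python
def relative_power(freq_scale, cores=1):
    """Power relative to one core at the baseline clock.

    Assumes power per core scales with the cube of the clock frequency.
    """
    if freq_scale <= 0 or cores < 1:
        raise ValueError("freq_scale must be positive and cores at least 1")
    return cores * freq_scale**3


def relative_throughput(freq_scale, cores=1):
    """Ideal throughput relative to one core at the baseline clock.

    Assumes the work splits perfectly across cores, which real codes rarely do.
    """
    if freq_scale <= 0 or cores < 1:
        raise ValueError("freq_scale must be positive and cores at least 1")
    return cores * freq_scale


if __name__ == "__main__":
    for freq_scale, cores in [(1.0, 1), (2.0, 1), (0.8, 1), (0.8, 2)]:
        print(f"{cores} core(s) at {freq_scale:.1f}x clock: "
              f"power {relative_power(freq_scale, cores):.2f}x, "
              f"ideal throughput {relative_throughput(freq_scale, cores):.2f}x")
```

Under these assumptions, doubling the clock costs about eight times the power, while two cores clocked 20% slower deliver up to 1.6 times the throughput for roughly the original power budget. That trade is the motivation for moving to multiple cores instead of faster clocks.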

26 Lower Voltage; Increase Clock Rate & Transistor Density
- We have seen an increasing number of gates on a chip and increasing clock speed.
- Heat is becoming an unmanageable problem: Intel processors > 100 Watts.
- We will not see the dramatic increases in clock speeds in the future.
- However, the number of gates on a chip will continue to increase.
- Intel Yonah doubles the processing power on a per watt basis.

27 Free Lunch For Traditional Software
(It just runs twice as fast every 18 months with no change to the code!)
[chart: operations per second for serial code at 3 GHz, 1 core; 6 GHz, 1 core; 12 GHz, 1 core; 24 GHz, 1 core; with a second region for additional operations per second if code can take advantage of concurrency]
From Craig Mundie, Microsoft

28 No Free Lunch For Traditional Software
(Without highly concurrent software it won't get any faster!)
[chart: operations per second for serial code stay flat at 3 GHz as chips go from 1 core to 2, 4, and 8 cores; additional operations per second are available only if code can take advantage of concurrency; the earlier free-lunch path (6 GHz, 12 GHz, 24 GHz, 1 core) is shown for comparison]
From Craig Mundie, Microsoft
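The contrast between the two slides can be put into numbers. The sketch below compares the "free lunch" path, where clock rate doubles every 18 months, with the multicore path, where clock rate stays fixed and only the parallel part of a program speeds up. The second function uses Amdahl's law, which is brought in here as an illustrative model and is not named on the slides; the function names and the parallel fractions are likewise assumptions.

```python
def free_lunch_speedup(months, doubling_period_months=18.0):
    """Speedup of unchanged serial code if clock rate doubles every period."""
    if months < 0 or doubling_period_months <= 0:
        raise ValueError("months must be non-negative and the period positive")
    return 2.0 ** (months / doubling_period_months)


def multicore_speedup(cores, parallel_fraction):
    """Amdahl's law: speedup on `cores` cores at a fixed clock rate.

    parallel_fraction is the share of run time that can use all cores.
    """
    if cores < 1 or not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("cores must be at least 1 and the fraction within [0, 1]")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Three doublings: 3 GHz to 24 GHz on one path, 1 core to 8 cores on the other.
    print(f"free lunch after 54 months: {free_lunch_speedup(54):.1f}x")
    for fraction in (0.0, 0.5, 0.9, 1.0):
        print(f"8 cores, {fraction:.0%} parallel: "
              f"{multicore_speedup(8, fraction):.2f}x")
```

Purely serial code gains nothing from eight cores, and even code that is 90% parallel falls well short of the eightfold gain that three clock doublings would have delivered for free.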

29 CPU Desktop Trends
- Relative processing power will continue to double every 18 months
- 256 logical processors per chip in late [year missing]
[chart: hardware threads per chip and cores per processor chip, by year]

30 Commodity Processor Trends
Bandwidth/Latency is the Critical Issue, not FLOPS ("Got Bandwidth?")
- Single-chip floating-point performance: annual increase 59%
  - Typical value in 2005: 4 GFLOP/s
  - Typical value in 2010: 32 GFLOP/s
  - Typical value in 2020: 3300 GFLOP/s
- Front-side bus bandwidth: annual increase 23%
  - Typical value in 2005: 1 GWord/s = 0.25 word/flop
  - Typical value in 2010: 3.5 GWord/s = 0.11 word/flop
  - Typical value in 2020: 27 GWord/s = 0.008 word/flop
- DRAM latency: annual increase (5.5%)
  - Typical value in 2005: 70 ns = 280 FP ops = 70 loads
  - Typical value in 2010: 50 ns = 1600 FP ops = 170 loads
  - Typical value in 2020: 28 ns = 94,000 FP ops = 780 loads
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 2004, National Academies Press, Washington DC
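The derived quantities on this slide follow directly from the raw rates: words per flop is bus bandwidth divided by floating-point rate, and the cost of one DRAM access in lost work is the latency multiplied by the floating-point rate or by the bus bandwidth. The sketch below recomputes them; the dictionary layout and function names are assumptions for illustration, and the inputs are the slide's own rounded figures.

```python
# Raw figures from the slide: Gflop/s, GWord/s, and DRAM latency in ns.
TRENDS = {
    2005: {"gflops": 4.0, "gwords": 1.0, "latency_ns": 70.0},
    2010: {"gflops": 32.0, "gwords": 3.5, "latency_ns": 50.0},
    2020: {"gflops": 3300.0, "gwords": 27.0, "latency_ns": 28.0},
}


def words_per_flop(entry):
    """Memory words deliverable per floating-point operation."""
    return entry["gwords"] / entry["gflops"]


def flops_per_latency(entry):
    """Floating-point operations that fit into one DRAM latency.

    Nanoseconds times Gflop/s gives a plain operation count.
    """
    return entry["latency_ns"] * entry["gflops"]


def loads_per_latency(entry):
    """Words the bus could have delivered during one DRAM latency."""
    return entry["latency_ns"] * entry["gwords"]


if __name__ == "__main__":
    for year, entry in sorted(TRENDS.items()):
        print(f"{year}: {words_per_flop(entry):.3f} word/flop, "
              f"{flops_per_latency(entry):,.0f} FP ops and "
              f"{loads_per_latency(entry):,.0f} loads per DRAM latency")
```

The recomputed values agree with the slide for 2005 and land close to it for 2010 and 2020, where the slide's figures are rounded. The trend is the point: each memory access costs hundreds of floating-point operations in 2005 and tens of thousands by 2020.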

32 Future Challenge: Developing the Ecosystem for HPC
From the NRC Report on "The Future of Supercomputing":
- Hardware, software, algorithms, tools, networks, institutions, applications, and people who solve supercomputing applications can be thought of collectively as a multifaceted ecosystem.
- Research investment in HPC should be informed by the ecosystem point of view: progress must come on a broad front of interrelated technologies, rather than in the form of individual breakthroughs.
A supercomputer ecosystem is a continuum of computing platforms, system software, algorithms, tools, networks, and the people who know how to exploit them to solve computational science applications.

33 Real Crisis With HPC Is With The Software
- Our ability to configure a hardware system capable of 1 PetaFlop (10^15 ops/s) is without question just a matter of time and $$.
- A supercomputer application and its software are usually much more long-lived than the hardware.
  - Hardware life is typically five years at most; apps: years.
  - Fortran and C are the main programming models (still!!)
- The REAL CHALLENGE is Software
  - Programming hasn't changed since the 70's
  - HUGE manpower investment
    - MPI: is that all there is?
    - Often requires HERO programming
  - Investment in the entire software stack is required (OS, libs, etc.)
- Software is a major cost component of modern technologies.
  - The tradition in HPC system procurement is to assume that the software is free.
  - SOFTWARE COSTS (over and over)

34 Summary of Current Unmet Needs
- Performance / Portability
- Fault tolerance
- Memory bandwidth/latency
- Adaptability: some degree of autonomy to self optimize, test, or monitor
  - Able to change mode of operation: static or dynamic
- Better programming models
  - Global shared address space
  - Visible locality
- Maybe coming soon (incremental, yet offering real benefits):
  - Global Address Space (GAS) languages (UPC, Co-Array Fortran, Titanium, Chapel)
    - Minor extensions to existing languages
    - More convenient than MPI
    - Have performance transparency via explicit remote memory references
- What's needed is a long-term, balanced investment in hardware, software, algorithms and applications

Collaborators / Support http://www.top500.

35 Collaborators / Support
Top500 Team:
- Erich Strohmaier, NERSC
- Hans Meuer, Mannheim
- Horst Simon, NERSC
Slides are available at my website
