Future Trends in High Performance Computing

1 Future Trends in High Performance Computing Horst Simon Lawrence Berkeley National Laboratory and UC Berkeley Seminar at RWTH Aachen, Germany March 23, 2009

3 Berkeley Lab Mission Solve the most pressing and profound scientific problems facing humankind Basic science for a secure energy future Understand living systems to improve the environment, health, and energy supply Understand matter and energy in the universe Build and safely operate leading scientific facilities for the nation Train the next generation of scientists and engineers

4 Key Message Computing is changing more rapidly than ever before, and scientists have the unprecedented opportunity to change computing directions

5 Overview Turning point in 2004 Current trends and what to expect until 2014 Long term trends until 2019

6 Supercomputing Ecosystem (2005) Commercial Off The Shelf technology (COTS) Clusters 12 years of legacy MPI applications base From my presentation at ISC 2005

8 Traditional Sources of Performance Improvement are Flat-Lining (2004). New constraints: 15 years of exponential clock rate growth has ended. Moore's Law reinterpreted: how do we use all of those transistors to keep performance increasing at historical rates? Industry response: the number of cores per chip doubles every 18 months instead of clock frequency! Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith

9 Supercomputing Ecosystem (2005) Commercial Off The Shelf technology (COTS) - 2008: PCs and desktop systems are no longer the economic driver; architecture and programming model are about to change. Clusters. 12 years of legacy MPI applications base

10 Overview Turning point in 2004 Current trends and what to expect until 2014 Long term trends until 2019

11 Roadrunner Breaks the Pflop/s Barrier: 1,026 Tflop/s on LINPACK reported on June 9, 2008; 6,948 dual-core Opterons + 12,960 Cell BE; 80 TByte of memory; IBM built, installed at LANL

12 Cray XT5 at ORNL -- 1 Pflop/s in November 2008
Jaguar                          Total     XT5     XT4
Peak Performance (Tflop/s)      1,645     1,...   ...
AMD Opteron Cores               181,...   ...     ...,328
System Memory (TB)              ...       ...     ...
Disk Bandwidth (GB/s)           ...       ...     ...
Disk Space (TB)                 10,750    10,...  ...
Interconnect Bandwidth (TB/s)   ...       ...     ...
The systems will be combined after acceptance of the new XT5 upgrade. Each system will be linked to the file system through 4x-DDR InfiniBand.

13 Cores per Socket

14 Performance Development [Top500 performance development chart; the #1 system stands at 1.1 PFlop/s]

15 Performance Development Projection

16 Concurrency Levels [Top500 chart, June 1993 through a projected June 2015: maximum, average, and minimum number of processors per system, on a log scale from 1 to 1,000,000. ISC'08, Dresden; Jack's Notebook]

17 Moore's Law reinterpreted: Number of cores per chip will double every two years. Clock speed will not increase (possibly decrease). Need to deal with systems with millions of concurrent threads. Need to deal with inter-chip parallelism as well as intra-chip parallelism.
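To put numbers on this (a back-of-the-envelope sketch; the 2009 starting point of 4 cores per chip is an assumption, not from the slides), core counts grow as

    N(t) = N_0 * 2^((t - t_0) / 2)

so N(2019) = 4 * 2^((2019 - 2009)/2) = 4 * 2^5 = 128 cores per chip. A machine built from tens of thousands of such chips, with a few hardware threads per core, is already at O(10^6) concurrent threads, which is why both intra-chip and inter-chip parallelism have to be managed.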

18 Multicore comes in a wide variety: multiple parallel general-purpose processors (GPPs) and multiple application-specific processors (ASPs). Examples: Sun Niagara (8 GPP cores, 32 threads); Intel IXP2800 network processor (1 GPP core + 16 ASPs, 128 threads); IBM Cell (1 GPP, 2 threads + 8 ASPs); Cisco CRS-1 (188 Tensilica GPPs); Picochip DSP (1 GPP core + 248 ASPs). For perspective, the Intel 4004 (1971) was a 4-bit processor with 2312 transistors, ~100 KIPS, in 10-micron PMOS on an 11 mm² chip; today 1000s of processor cores fit per die. "The Processor is the new Transistor" [Rowen]

19 What's Next? Source: Jack Dongarra, ISC 2008

20 Likely Trajectory - Collision or Convergence? [Chart, after Justin Rattner, Intel, ISC 2008: with parallelism on one axis and programmability on the other, CPUs move from multi-threading to multi-core to many-core, while GPUs move from fixed function to partially programmable to fully programmable; the two trajectories converge on a future processor by 2012.]

21 Trends for the next five years, up to 2014: After a period of rapid architectural change we will likely settle on a future standard processor architecture. A good bet: Intel will continue to be a market leader. The impact of this disruptive change on software and systems architecture is not yet clear.

22 Impact on Software: We will need to rethink and redesign our software. A challenge similar to the 1990 to 1995 transition to clusters and MPI??

23 Likely Future Scenario (2014). System: cluster + many-core node. Programming model: MPI + ?, where the "?" is not message passing. Hybrid & many-core technologies will require new approaches: PGAS, autotuning, ? After Don Grice, IBM, Roadrunner Presentation, ISC 2008

24 Why MPI will persist: Obviously MPI will not disappear in five years. By 2014 there will be 20 years of legacy software in MPI. New systems are not sufficiently different to lead to a new programming model.

25 What will be the ? in MPI+? Likely candidates are: PGAS languages, autotuning, a wildcard from the commercial space.

27 What's Wrong with MPI Everywhere? One MPI process per core is wasteful of intra-chip latency and bandwidth. Weak scaling: the success model of the cluster era, but there is not enough memory per core. Heterogeneity: one MPI process per CUDA thread-block?
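A minimal sketch of the usual alternative to MPI everywhere, one MPI process per multicore chip with threads inside it, using only standard MPI-2 and OpenMP calls (the summed series is an arbitrary stand-in for real work):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* One MPI process per chip/node; request thread support for OpenMP. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local_sum = 0.0, global_sum = 0.0;

        /* Intra-chip parallelism: threads share the node's memory directly,
           so no per-core MPI buffers or intra-node messages are needed. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; i++)
            local_sum += 1.0 / (1.0 + i);

        /* Inter-chip parallelism: one message per node instead of per core. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }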

28 PGAS Languages. Global address space: any thread may directly read/write remote data. Partitioned: data is designated as local or global. [Diagram: processors p0, p1, ..., pn, each holding private variables l and a partition of the shared space with variables x and y.] Implementation issues: on distributed memory, reading a remote array or structure is explicit, not a cache fill; on shared memory, caches are allowed but not required. No less scalable than MPI! Permits sharing, whereas MPI rules it out!
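PGAS compilers typically lower these remote reads and writes to the one-sided put/get operations discussed on the next slide. As a rough C analogy (a sketch using MPI-2 one-sided windows rather than an actual PGAS language; the ring-neighbor read is invented for illustration), each process exposes its partition and any process can read it without the owner posting a receive:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process owns one element of the "global" array. */
        double mine = 100.0 + rank;
        MPI_Win win;
        MPI_Win_create(&mine, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* One-sided remote read: fetch the ring neighbor's element with no
           matching receive (and no other action) on the owner's side. */
        double remote;
        MPI_Win_fence(0, win);
        MPI_Get(&remote, 1, MPI_DOUBLE, (rank + 1) % nprocs, 0, 1,
                MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        printf("rank %d read %.1f from rank %d\n", rank, remote,
               (rank + 1) % nprocs);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }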

29 Performance Advantage of One-Sided Communication. The put/get operations in PGAS languages (remote read/write) are one-sided: no required interaction from the remote processor. This makes them faster for pure data transfers than two-sided send/receive. [Charts: 8-byte roundtrip latency (usec) and flood bandwidth for 4 KB messages as percent of hardware peak, comparing MPI ping-pong with GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

30 Autotuning: Write programs that write programs. Automate search across a complex optimization space: generate a space of implementations and search it. Performance far beyond current compilers; performance portability for diverse architectures! Past successes: PHiPAC, ATLAS, FFTW, Spiral, OSKI. [Chart: Mflop/s for a finite element problem (Im, Yelick, Vuduc, 2005); the best variant, 4x2 register blocking, far outperforms the reference implementation.]
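A toy version of the generate-and-search loop (hypothetical kernel, block sizes, and timing harness; real autotuners such as ATLAS or OSKI emit and compile many code variants rather than varying a runtime parameter):

    #include <stdio.h>
    #include <time.h>

    #define N 1024
    static double A[N][N], x[N], y[N];

    /* One candidate implementation per blocking factor bs. */
    static void kernel(int bs) {
        for (int jb = 0; jb < N; jb += bs)
            for (int i = 0; i < N; i++)
                for (int j = jb; j < jb + bs; j++)
                    y[i] += A[i][j] * x[j];
    }

    int main(void) {
        const int space[] = {8, 16, 32, 64, 128, 256};  /* search space */
        int best = 0;
        double best_t = 1e30;

        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            for (int j = 0; j < N; j++) A[i][j] = 1.0;
        }
        /* Empirical search: run every variant, keep the fastest. */
        for (int k = 0; k < 6; k++) {
            clock_t t0 = clock();
            for (int rep = 0; rep < 20; rep++) kernel(space[k]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("block %3d: %.3f s\n", space[k], t);
            if (t < best_t) { best_t = t; best = space[k]; }
        }
        printf("selected block size: %d\n", best);
        return 0;
    }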

31 Multiprocessor Efficiency and Scaling (auto-tuned stencil kernel; Oliker et al., paper in IPDPS'08). [Charts: performance scaling (total Gflop/s vs. core count, single and double precision) and power efficiency (Mflop/s/Watt) on AMD Opteron, Intel, Sun, Cell, and G80 platforms; stacked bars show the cumulative effect of optimizations (NUMA-aware placement, padding, reordering, thread/cache blocking, prefetch, SIMD, cache bypass) over the naive code, with gains of roughly 2.3x to 4.6x.]

32 Autotuning for Scalability and Performance Portability

33 The Likely HPC Ecosystem in 2014: GPU = future many-core driven by commercial applications; MPI + (autotuning, PGAS, ??); next-generation clusters with many-core or hybrid nodes

34 Overview Turning point in 2004 Current trends and what to expect until 2014 Long term trends until 2019

35 DARPA Exascale Study: commissioned by DARPA to explore the challenges for Exaflop computing (Kogge et al.). Two models for future performance growth. Simplistic: ITRS roadmap; power for memory grows linearly with the number of chips; power for interconnect stays constant. Fully scaled: same as simplistic, but memory and router power grow with peak flops per chip.
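Schematically, in my notation rather than the study's, with N chips of per-chip peak F relative to a base peak F_0:

    Simplistic:   P(N) = N * (p_proc + p_mem) + P_net            (p_mem and P_net fixed)
    Fully scaled: P(N) = N * p_proc + (F / F_0) * (N * p_mem + P_net)

In the fully scaled model, memory and router power track the growing per-chip peak, which is what produces the pessimistic totals on the next two slides.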

36 We won't reach Exaflops with this approach (chart from Peter Kogge, DARPA Exascale Study)

37 ...and the power costs will still be staggering [chart: system power in MW, reaching toward 1000 MW; from Peter Kogge, DARPA Exascale Study]

38 Extrapolating to Exaflop/s in 2018 Source: David Turek, IBM

39 Processor Technology Trend. 1990s: R&D in computing hardware dominated by desktop/COTS; had to learn how to use COTS technology for HPC. Now: R&D investments moving rapidly to consumer electronics/embedded processing; must learn how to leverage embedded processor technology for future HPC systems.

40 Consumer Electronics has Replaced PCs as the Dominant Market Force in Design!! [Timeline: Apple introduces the iPod; iPod+iTunes exceeds 50% of Apple's net profit; Apple introduces a cell phone (iPhone).]

41 Green Flash: Ultra-Efficient Climate Modeling. A project by Shalf, Oliker, Wehner and others at LBNL. An alternative route to exascale computing: target specific machine designs to answer a scientific question, using new technologies driven by the consumer market.

42 Green Flash: Ultra-Efficient Climate Modeling. We present an alternative route to exascale computing. Exascale science questions are already identified; our idea is to target specific machine designs to each of these questions. This is possible because of new technologies driven by the consumer market. We want to turn the process around: ask "What machine do we need to answer a question?", not "What can we answer with that machine?" Caveat: we present here a feasibility design study; the goal is to influence the HPC industry by evaluating a prototype design.

43 Design for Low Power: More Concurrency. Cubic power improvement with lower clock rate due to V^2F. Slower clock rates enable use of simpler cores. Simpler cores use less area (lower leakage) and reduce cost. Tailor the design to the application to reduce waste. This is how iPhones and MP3 players are designed to maximize battery life and minimize cost. [Scale of core power: Tensilica DP 0.09W; PPC450 3W; Intel Core2 15W; Power5 120W.]
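The cubic claim follows from the textbook dynamic-power relation (a general CMOS approximation, not specific to these cores):

    P_dyn ~ C * V^2 * f,   and since achievable clock f scales roughly with supply voltage V:   P_dyn ~ f^3

Halving the clock therefore cuts dynamic power by about 8x while only halving per-core throughput, a net ~4x gain in flops per watt, which is then recovered by instantiating many more simple, slower cores.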

44 Green Flash Strawman System Design. We examined three different approaches (in 2008 technology) to the computation, a .015° x .02° x 100L climate model requiring 10 Pflop/s sustained (~200 Pflop/s peak). AMD Opteron: commodity approach; lower efficiency for scientific applications offset by cost efficiencies of the mass market. BlueGene: generic embedded processor core, with a customized system-on-chip (SoC) to improve power efficiency for scientific applications. Tensilica XTensa: customized embedded core w/SoC provides further power-efficiency benefits while maintaining programmability.
Processor                        Clock    Peak/Core (Gflops)  Cores/Socket  Sockets  Cores  Power   Cost (2008)
AMD Opteron                      2.8 GHz  ...                 ...           ...K     1.7M   179 MW  $1B+
IBM BG/P                         850 MHz  ...                 ...           ...K     3.0M   20 MW   $1B+
Green Flash / Tensilica XTensa   650 MHz  ...                 ...           ...K     4.0M   3 MW    $75M

45 Climate System Design Concept: Strawman Design Study. Targets: 10 PF sustained, ~120 m2, <3 MWatts, <$75M. VLIW CPU: 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle; synthesizable at 650 MHz in commodity 65nm; 1 mm2 core, larger with instruction cache, data cache, data RAM, and DMA interface; 0.25 mW/MHz. Double-precision SIMD FP: 4 ops/cycle (2.7 GFLOPs). Vectorizing compiler, cycle-accurate simulator, debugger GUI (existing part of the Tensilica Tool Set). 8-channel DMA for streaming from on/off-chip DRAM. Nearest-neighbor 2D communications grid. [System diagram: 32 processors per 65nm chip (~7W including comms), with a 32K instruction cache per core, a master processor, external DRAM interfaces, optional 8 MB embedded DRAM, and a communications link controller; 8 DRAMs per processor chip, ~50 GB/s; 32 chip + memory clusters per board; 32 boards per rack at ~25 KW including power + comms.]

46 Summary on Green Flash: Exascale computing is vital for numerous key scientific areas. We propose a new approach to high-end computing that enables transformational changes for science. Research effort: study feasibility and share insight w/ community. This effort will augment high-end general-purpose HPC systems. Choose the science target first (climate in this case). Design systems for applications (rather than the reverse). Leverage power-efficient embedded technology. Design hardware, software, and scientific algorithms together using hardware emulation and auto-tuning. Achieve exascale computing sooner and more efficiently. Applicable to a broad range of exascale-class applications.

47 Summary: Major challenges are ahead for extreme computing: power, parallelism, and many others not discussed here. We will need completely new approaches and technologies to reach the Exascale level. This opens up a unique opportunity for science applications to lead extreme-scale systems development.
