Trends in HPC Architectures


1 Trends in HPC Architectures Norbert Eicker, Institute for Advanced Simulation, Jülich Supercomputing Centre PRACE/LinkSCEEM-2 CyI 2011 Winter School, Nicosia, Cyprus

2 Forschungszentrum Jülich (FZJ) Slide 2

3 GCS: Gauss Centre for Supercomputing Germany's Tier-0/1 Supercomputing Complex Association with Garching and Stuttgart A single joint scientific governance Germany's representative in PRACE More information: Slide 3

4 Outline Today's common architectures Clusters MPP Accelerators Programming, Communication Exascale Challenges Energy Resiliency Applications Ideas on the way to Exascale BG/Q, QPACE, DEEP Conclusions Slide 4

5 Current HPC systems Today's supercomputers are (massively) parallel The TOP500 list gives an interesting (historical) overview Updated twice a year Based on the Linpack benchmark Solve a dense system of linear equations Ranks the 500 most powerful systems Cluster computers have dominated for several years Constellations are a kind of cluster, too Hard to distinguish MPPs from Clusters Less dominant in the TOP50 Slide 5

6 TOP 500 Architectures by Systems Slide 6

7 TOP 500 Architectures by Performance Slide 7

8 Cluster ingredients The processor is the heart Provides compute power Memory / Storage is the brain Nowadays many hierarchies (caches, DRAM, SSD, HD, tape) Networks are the nerves Link the nodes to each other Most often more than one (MPI, administration, I/O) Software is the soul No Cluster-awareness without middleware Often felt to be part of MPI, but more (process management, etc.) Open source was an important prerequisite Balance is more important than single components Slide 8

9 Cluster Computers TOP500 systems: At least 1000 cores Mostly standard processors: 78.4% Intel EM64T, 11.4% AMD, 8.0% Power Typically more than 1 OS image Additional software required: MPI, middleware Powerful interconnect Basis for scalability Cluster computers use COTS Slide 9

10 Cluster Interconnects Main differentiator against MPPs Non-proprietary Two classes of Cluster-Systems Capability Clusters Huge bandwidth, small latency Capacity Clusters Less powerful interconnect Embarrassingly parallel applications Gigabit Ethernet dominates the lower half of the TOP500 Mostly no real HPC applications Widely used in departmental HPC Significantly lower Linpack efficiency Slide 10

11 TOP500 Networks Slide 11

12 JuRoPA 2208 compute nodes 2 Intel Nehalem-EP quad-core processors 2.93 GHz SMT (Simultaneous Multithreading) 24 GB memory (DDR3, 1066 MHz) IB ConnectX QDR HCA (MT26428) / QNEM 17,664 cores, 207 TF peak Sun Microsystems Blade SB6048 Infiniband QDR with non-blocking Fat Tree topology ParaStation Cluster-OS Slide 12

13 JuRoPA Slide 13

14 HPC-FF 1080 compute nodes 2 Intel Nehalem-EP quad-core processors 2.93 GHz SMT (Simultaneous Multithreading) 24 GB memory (DDR3, 1066 MHz) IB ConnectX QDR HCA (MT26428) 8640 cores, 101 TF peak Bull NovaScale R422-E2 Infiniband QDR with non-blocking Fat Tree topology ParaStation Cluster-OS Slide 14
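The quoted core counts and peak ratings for JuRoPA and HPC-FF follow directly from the node configuration. Below is a minimal sketch of that arithmetic (my own illustration, not from the slides), assuming 4 double-precision flops per cycle and core, which is what the SSE add and multiply units of Nehalem-EP deliver:

```c
#include <stdio.h>

/* Back-of-the-envelope peak for JuRoPA / HPC-FF style nodes.
 * Assumes 4 DP flops/cycle/core (SSE add + multiply on Nehalem-EP). */
int main(void)
{
    const double ghz            = 2.93;   /* clock in GHz            */
    const int    flops_per_clk  = 4;      /* DP flops per core cycle */
    const int    cores_per_node = 2 * 4;  /* 2 sockets x 4 cores     */

    int nodes[] = { 2208, 1080 };         /* JuRoPA, HPC-FF          */
    const char *name[] = { "JuRoPA", "HPC-FF" };

    for (int i = 0; i < 2; i++) {
        long   cores = (long)nodes[i] * cores_per_node;
        double tflop = cores * ghz * flops_per_clk / 1000.0; /* GF -> TF */
        printf("%-7s %6ld cores  %.0f TF peak\n", name[i], cores, tflop);
    }
    return 0;
}
```

This reproduces the figures on the slides: 17,664 cores / 207 TF for JuRoPA and 8,640 cores / 101 TF for HPC-FF.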

15 HPC-FF Slide 15

16 Overall design schematic view The Future of Cluster-Computing Slide 16

17 Massively Parallel Processing MPP Main differentiators Interconnect Proprietary on MPP Integration Proprietary cooling, software, etc. Mainly two lines Cray XT-series AMD Opteron processors, Cray Seastar 3-D torus fabric Hard to distinguish from a Cluster from HW point of view Similar performance / scalability characteristics IBM BlueGene family (BG/L, BG/P) PowerPC 4x0-series processors, 3-D torus fabric + more Trade node-performance for energy-efficiency & balance Very scalable codes required More in the top 10% of TOP500 Slide 17

18 JUGENE: Jülich s Scalable Petaflop System IBM Blue Gene/P JUGENE 32-bit PowerPC 450 core 850 MHz, 4-way SMP 72 racks, 294,912 procs 1 Petaflop/s peak 144 TByte main memory connected to a Global Parallel File System (GPFS) with 5 PByte online disk capacity and up to 25 PByte offline tape capacity Torus network First Petaflop system in Europe Slide 18

19 The Jülich Dualistic Concept 2004: Constellation systems found unable to scale The portfolio of applications can be (very roughly) divided into two to three parts: Highly scalable codes, sparse matrix-vector like or dominated Highly complex codes, adaptive grids or coordinate based, all-to-all or more intricate communication patterns, large memory, less scalable Embarrassingly parallel codes, parameter studies Not our main focus: Farming, Grids, Clouds At that time JSC was unable to serve highly scalable codes JSC decided to adapt its hardware roadmap to this situation Slide 19

20 Jülich Dual Concept Hardware General-Purpose line: IBM Power 4+ JUMP, 9 TFlop/s (2004) Highly-Scalable line: IBM Blue Gene/L JUBL, 45 TFlop/s (2005/6); IBM Blue Gene/P JUGENE, 223 TFlop/s (2007/8) Shared file server: GPFS; since 2009 GPFS and Lustre Slide 20

21 Use by Science Field JUROPA ~ 200 Projects JUGENE ~40 Projects Slide 21

22 Balance Compute-power vs. Bandwidths Measure bandwidth in Bytes / Flop Memory aims for 1 Byte / Flop Not reached for most machines today (JuRoPA ~0.5 B/Flop) BlueGene trades compute-power for balance Only 850 MHz clock Memory wall Bandwidth not expected to grow with compute power Limited by # of connectors (optical links might help) Network aims for 10% of the memory-bandwidth System bus (PCIe) shares the same pins on the package An algorithm's surface/volume ratio determines the required bandwidth Slide 22
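The Byte/Flop argument can be made concrete with a simple kernel. The sketch below is my own illustration (not from the talk): it derives the machine balance of a JuRoPA-like node from the numbers given above, assuming three DDR3-1066 channels per socket, and compares it with the requirement of a DAXPY-like loop, which needs about 12 Bytes per Flop and is therefore heavily memory-bound:

```c
#include <stdio.h>

/* Machine balance vs. algorithmic requirement (illustrative numbers).
 * daxpy: y[i] = a*x[i] + y[i]  -> 2 flops, 24 bytes of traffic per element. */
int main(void)
{
    double peak_gflops   = 93.8;              /* one node: 8 cores * 2.93 GHz * 4 flops   */
    double mem_bw_gbytes = 2 * 3 * 8 * 1.066; /* 2 sockets * 3 DDR3-1066 channels * 8 B   */

    double machine_balance = mem_bw_gbytes / peak_gflops;  /* B/Flop offered  */
    double daxpy_balance   = 24.0 / 2.0;                   /* B/Flop required */

    printf("machine balance : %.2f B/Flop\n", machine_balance);
    printf("daxpy needs     : %.1f B/Flop -> memory-bound by ~%.0fx\n",
           daxpy_balance, daxpy_balance / machine_balance);
    return 0;
}
```

The resulting machine balance of roughly 0.55 B/Flop matches the ~0.5 B/Flop quoted for JuRoPA on the slide.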

23 Balance Compute-power vs. Latencies Measure latencies not in absolute time, but in operations Memory latencies are hidden by caches Today complex hierarchies (e.g. Nehalem L1/L2/L3) The algorithm has to exploit this via memory locality First level of parallelism Network latencies O(1) µsec Several thousand FP operations Hide them in the algorithm (asynchronous comm.) Latency Wall No significant progress expected for interconnects Algorithms might have to be adapted / changed Slide 23
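One concrete way to hide the O(1) µs network latency behind computation is to restructure a halo exchange so that the interior of the domain is processed while the boundary data is in flight. A minimal sketch, assuming hypothetical application kernels compute_interior() and compute_boundary():

```c
#include <mpi.h>

/* Placeholders for the application's kernels (hypothetical names). */
void compute_interior(void);
void compute_boundary(void);

/* Sketch: overlap interior computation with the halo exchange so that
 * the network latency is hidden behind useful work. */
void timestep(double *halo_out, double *halo_in, int n,
              int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    /* post the halo exchange first ... */
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, comm, &req[1]);

    compute_interior();     /* ... work that needs no remote data */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary();     /* now the halo data has arrived */
}
```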

24 Rationale Can the next generation of cluster computers compete with proprietary solutions like Blue Gene or Cray? Blue Gene /P to /Q gives a factor of 20 in compute speed at the same energy envelope and cost within 4 years Cray is more dependent on processor development Standard processor speed will increase by about a factor of 4, at most 8, in 4 years Clusters need to utilize accelerators Current accelerators are tightly coupled to the interconnect Integrated processors are not expected before 2015 Slide 24

25 Accelerators FPGA Field Programmable Gate Array Programmable hardware The algorithm is transformed into logical circuits Significant effort to program VHSIC (Very High Speed Integrated Circuit) Hardware Description Language is significantly different from C or Fortran Only promising for selected applications Commodity FPUs are already very good at Multiply/Add Promising for non-FP applications (genome analysis) or non-pipelined FP-operations (astrophysics) We had a Cray XD1 equipped with FPGAs GRAPE: first sustained PFlop/s system ever (not in the TOP500) Slide 25

26 Accelerators ClearSpeed Put as many FPUs on a chip as possible Accompany them with fast memory Programming with standard C Pitfalls: Manual splitting of programs into host & accelerator parts Manual data-transfer between host & accelerator Not commodity Commodity kills the Performance-Star At some point GPUs became more powerful Slide 26

27 Accelerators Cell First heterogeneous multi-core 50 to 80 W at 3.2 GHz 1 PowerPC CPU (PPE) w/ 32 kB L1 caches (D/I) 8 SPEs w/ 256 kB private memory (Local Store) each SPE can do 4 FMAs per cycle 204.8 / 102.4 GFlop/s (SP/DP) at 3.2 GHz 512 kB on-chip shared L2 cache 204.8 GB/s EIB bandwidth 25.6 GB/s memory bandwidth Unfortunately killed by STI IBM claims its features will reappear in future Power designs Slide 27

28 Accelerators GPGPU Modern GPUs are basically powerful FPUs Excellent price-performance ratio Surfing the wave of gaming Still missing some features: Double Precision IEEE rounding ECC memory PCIe host capabilities NVIDIA, AMD (ATI); Intel announced MIC Slide 28

29 GPU-Accelerated Cluster [Schematic: compute nodes (CN), each with a directly attached GPU, connected by a flat InfiniBand fabric] Flat topology Simple management of resources Static assignment of accelerators to CPUs Accelerators cannot act autonomously Slide 29

30 GPU-Accelerated Cluster Explicit programming of GPUs required Applications have to be adapted CUDA (NVIDIA), OpenCL (AMD), TBB (Intel) Unclear which paradigm will survive Might be hidden in the future (compiler, global paradigm) PGI claims to support CUDA with their compilers Severe interference with considerations on balance Increase node-performance by a factor of ~10 Memory-bandwidth limited by PCIe Competing use of the PCIe bus No direct communication from the GPU: data travels GPU-mem → CPU-mem → IB → CPU-mem → GPU-mem Latency penalties Slide 30
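The missing direct path between GPU memory and the HCA means every message crosses host memory and the PCIe bus twice. A hedged sketch of what that staging looked like on a 2011-era system, using the CUDA runtime API and MPI (function and buffer names are my own, purely illustrative):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch of GPU-to-GPU communication on an accelerated cluster:
 * device -> host staging buffer -> InfiniBand -> host -> device.
 * Every hop adds latency and competes for PCIe bandwidth. */
void exchange(double *d_send, double *d_recv, double *h_buf,
              size_t n, int peer, MPI_Comm comm)
{
    size_t bytes = n * sizeof(double);

    cudaMemcpy(h_buf, d_send, bytes, cudaMemcpyDeviceToHost);   /* PCIe hop 1 */
    MPI_Sendrecv_replace(h_buf, (int)n, MPI_DOUBLE,
                         peer, 0, peer, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_buf, bytes, cudaMemcpyHostToDevice);   /* PCIe hop 2 */
}
```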

31 Accelerated Cluster-node internal structure [Schematic: two CPUs with DDR3 memory, coupled via QPI/HT; the IB HCA and two GPUs with their GDDR5 memory all attach via PCIe through the southbridge] No direct communication from GPU to HCA Data passed via the CPU's memory GPUs and HCA compete for scarce PCIe resources Hard to find kernels to off-load: complex operations, communication, limited bandwidth, ... Slide 31

32 ExaScale Systems PetaFlop (10^15) systems are up and running Sustained PetaFlop for a broader range of applications coming soon (BlueWaters, etc.) History shows: each scale-up (factor 1000) takes ~10 years Look at the problems to expect for the next step: ExaFlop (10^18) Power consumption (are ~100 MW acceptable?) Resiliency What about I/O? How to program such a beast Programming models Do current algorithms still work out? Slide 32

33 ExaScale Challenges Energy Power consumption will increase in the future What is the critical limit? JSC has 5 MW, potential of 10 MW 1 MW costs roughly 1 M€ / year 20 MW expected to be the critical limit Are ExaScale systems a Large Scale Facility? The LHC uses 100 MW Energy efficiency Cooling uses a significant fraction (PUE > 1.2 today, target 1.0) Hot cooling water (40 °C and more) might help Free cooling: use outside air to cool the water Heat recycling: use waste heat for heating, cooling, etc. Slide 33
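The rule of thumb from the slide (roughly 1 M€ per MW and year) combined with the PUE makes the cooling overhead easy to quantify. A small illustrative calculation; the PUE values and the 5 MW IT load are assumptions for the example, not measured figures:

```c
#include <stdio.h>

/* Yearly electricity cost for a machine drawing `it_power_mw` of IT power,
 * using the rule of thumb of roughly 1 MEUR per MW and year.
 * PUE = total facility power / IT power; the values below are illustrative. */
int main(void)
{
    double it_power_mw      = 5.0;            /* e.g. a 5 MW IT load   */
    double meur_per_mw_year = 1.0;            /* ~1 MEUR / MW / year   */
    double pue[]            = { 1.5, 1.2, 1.05 };

    for (int i = 0; i < 3; i++) {
        double total_mw = it_power_mw * pue[i];
        printf("PUE %.2f: %4.1f MW facility power, ~%.1f MEUR/year "
               "(%.2f MEUR of that for cooling etc.)\n",
               pue[i], total_mw, total_mw * meur_per_mw_year,
               (total_mw - it_power_mw) * meur_per_mw_year);
    }
    return 0;
}
```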

34 ExaScale Challenges Resiliency Ever increasing number of components O(10000) nodes O(100000) DIMMs of RAM Each component's MTBF will not increase Optimistic: it remains constant Realistic: smaller structures and lower voltages decrease it Global MTBF will decrease Critical limit? 1 day? 1 hour? The time to write a checkpoint! How to handle failures Try to anticipate failures via monitoring Software must help to handle failures: checkpoints, process-migration, transactional computing Slide 34
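How often to checkpoint follows from the global MTBF and the time needed to write one checkpoint. Young's classic approximation, t_opt = sqrt(2 * t_checkpoint * MTBF), gives a feeling for the numbers; the sketch below is my own illustration and the input values (10 minutes per checkpoint, three MTBF scenarios) are assumptions:

```c
#include <stdio.h>
#include <math.h>

/* Young's approximation for the optimal checkpoint interval:
 *   t_opt = sqrt(2 * t_checkpoint * MTBF)
 * As the system-wide MTBF shrinks, the machine spends an ever larger
 * fraction of its time just writing checkpoints. */
int main(void)
{
    double t_ckpt = 600.0;                      /* 10 min to write a checkpoint */
    double mtbf[] = { 86400.0, 3600.0, 600.0 }; /* 1 day, 1 hour, 10 minutes    */

    for (int i = 0; i < 3; i++) {
        double t_opt = sqrt(2.0 * t_ckpt * mtbf[i]);
        printf("MTBF %6.0f s -> checkpoint every %5.0f s, overhead ~%4.1f %%\n",
               mtbf[i], t_opt, 100.0 * t_ckpt / t_opt);
    }
    return 0;
}
```

With a one-day MTBF the overhead stays around 6 %, at a one-hour MTBF it approaches 30 %, which is why the slide asks where the critical limit lies.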

35 ExaScale Challenges Applications Ever increasing levels of parallelism Thousands of nodes, hundreds of cores, dozens of registers Automatic parallelization vs. explicit exposure How large are coherency domains? How many languages do we have to learn? MPI + X most probably not sufficient 1 process / core makes orchestration of processes harder GPUs require explicit handling today (CUDA, OpenCL) What is the future paradigm MPI + X + Y? PGAS + X (+Y)? PGAS: UPC, Co-Array Fortran, X10, Chapel, Fortress, ... Which applications are inherently scalable enough at all? Slide 35
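The "MPI + X" discussion usually starts from a hybrid MPI/OpenMP setup, with one MPI process per node spawning one thread per core instead of one process per core. A minimal sketch of that pattern (my own example, not from the talk):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Minimal "MPI + X" sketch: MPI between nodes, OpenMP threads inside a node.
 * Avoids one MPI process per core, at the price of a second programming model. */
int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```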

36 Exascale Software Initiatives IESP: International Exascale Software Project Led by DOE (ANL/ORNL, Beckman/Dongarra) EESI: European Exascale Software Initiative (EDF, France) FP7: ICT Objective HPC Platforms with Exascale Performance PRACE Second implementation phase: scaling applications FET Flagship Initiative Supercomputing (technology beyond 2020; BSC, INRIA, JSC) G8: Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues Slide 36

37 ExaScale Innovation Center (EIC) Joint lab with IBM Böblingen Researchers located in Jülich Collaborating with scientists from the IBM lab in Yorktown Heights Energy efficiency Explore new cooling concepts on the basis of QPACE Future I/O Investigate I/O concepts for BlueGene/Q Programming Models Tools and algorithms for the ExaScale Slide 37

38 ExaCluster Laboratory (ECL) Joint lab with Intel Braunschweig and the ParTec Cluster Competence Center Researchers located in Jülich Challenges in system management software Improve scalability of ParaStation Reliable computing on unreliable components Development of a Cluster of Accelerators Prototype Intel MIC architecture (Knights Ferry aka Larrabee) Innovative interconnect architectures (3D-Torus) Slide 38

39 Some developments Target: Arrive at ExaScale at the end of the decade Have to enter the road today Unclear which road(s) will lead to ExaScale Maybe there has to be a completely new road? Some considerations Proprietary vs. commodity designs Are CPU designs at O(10^9) $ affordable for HPC? Are Clusters still capable? Do we need new ideas? Let's have a look at some interesting projects QPACE, BlueGene/Q, DEEP Slide 39

40 QPACE Name and Provenience QPACE: QCD PArallel computing on CEll Design of a massively parallel QCD prototype (with suitability for other applications in mind) Enhanced Cell BE processor Custom network processor (based on FPGA) Main development within the German special research focus SFB/TR 55 Hadron Physics, led by the University of Regensburg in cooperation with IBM Two installations: University of Wuppertal (3.x racks), JSC (1 rack + 3 racks owned by the University of Regensburg) #1 to #3 in the June 2010 Green500 list Systems have to be in the TOP500 list, ranked by energy efficiency (#4 achieved less than 500 MFlops / W) Slide 40

41 QPACE vs. RoadRunner Both based on Cell technology RoadRunner: #1 in the Top500 11/2008 First sustained Linpack PFlop/s system ever Accelerated node design Accelerator as co-processor QPACE: Special purpose system (QCD) Highly energy efficient Accelerator node design Network directly connected to the accelerator CPU Slide 41

42 QPACE Network processor Fast I/O fabric 2 FlexIO links to the CBE = 6 GB/s 6 x 10 GbE links to the network = 6 GB/s 1 GigE link for I/O Fast proprietary internal bus for the high-speed links Serial interfaces and config / status registers attached to the Device Control Register (DCR) Bus Slide 42

43 Jülich Installation Slide 43

44 BlueGene/Q IBM Sequoia Third generation BlueGene Projected for 2012 PowerPC processor w/ 16+1 cores 4-way SMT In-order design Thread-level speculation Transactional memory Integrated memory-controller 1 GB / core 32 compute nodes / drawer Water-cooled 5-D torus optical fabric 32 drawers / rack Slide 44

45 BlueGene/Q IBM Sequoia Sequoia to be installed at LLNL 96 racks / 1.6 million cores / 1.6 PB memory 20 PFlops / 6 MW It's already on its way: the first prototype system is in the TOP500 and No. 1 in the Green500 by MFlops/W (ahead of QPACE) JSC does some research on BlueGene/Q within EIC How to do the I/O at ExaScale Try to save servers by attaching storage directly to the I/O-nodes Slide 45

46 Accelerated Cluster vs. Cluster of Accelerators Cluster with Accelerators Each node has a classical host CPU Accompanied by one or more Accelerators Communication typically via main memory PCIe bus turns out to be a bottleneck Cluster of Accelerators Node consists of an Accelerator directly connected to the network Impossible with (most) current accelerators Accelerator requires a host CPU to boot Unable to talk directly to the network Accelerator not capable of running general purpose code (OS) See QPACE as a first example Slide 46

47 Some considerations on Scalability Only few applications are capable of scaling to O(300k) cores Sparse matrix-vector codes Highly regular communication patterns Well suited for BG/P Most applications have more complex kernels Complicated communication patterns Less capable of exploiting accelerators In fact: Highly scalable apps are dominated by highly scalable kernels Less scalable apps are dominated by less scalable kernels But there might be highly scalable kernels in them, too! How to improve their scalability? Slide 47

48 The ideal world [Schematic: cluster nodes (CN) and accelerators (Acc) all attached to the same low-latency fabric] Go for more capable accelerators (e.g. MIC) Attach all nodes to a low-latency fabric All nodes might act autonomously Dynamic assignment of cluster nodes and accelerators IB can be assumed to be as fast as PCIe, apart from latency Ability to off-load more complex (including parallel) kernels: communication between CPU and accelerator becomes less frequent, with larger messages, i.e. less sensitive to latency Slide 48

49 Proposal for a new Architecture DEEP [Schematic: a Cluster part of compute nodes (CN) on InfiniBand, connected through booster interfaces (BI) to a Booster part of booster nodes (BN)] Slide 49

50 Conclusions Clusters have dominated main-stream HPC in the last decade Surfing the commodity wave Proprietary systems at the highest end Accelerators will be required Surfing the next wave (gaming) Still unclear how to attach the network The road to ExaScale is unclear Are Clusters capable of reaching this goal? Is the Cluster idea expandable to ExaScale? New ideas in HPC-Architectures might be required Slide 50

51 Conclusions ExaScale introduces new challenges Energy Resiliency Input / Output Applications There will be ExaScale systems Sooner or later Unclear: how they will look, how general purpose they will be, how many applications are capable of making use of them Slide 51

52 Thank you Slide 52
