The BlueGene/L Supercomputer: Delivering Large Scale Parallelism

1 The BlueGene/L Supercomputer: Delivering Large Scale Parallelism José E. Moreira IBM T. J. Watson Research Center August 2004

2 Outline BlueGene/L high-level overview BlueGene/L system architecture and technology overview BlueGene/L software architecture BlueGene/L preliminary performance Conclusions 2

3 Outline BlueGene/L high-level overview BlueGene/L system architecture and technology overview BlueGene/L software architecture BlueGene/L preliminary performance Conclusions 3

4 The BlueGene/L project Executive summary A research partnership with Lawrence Livermore National Laboratory delivery of a 65,536-node machine in 2004/ Tflops peak, 32 Tbytes memory capacity 4096-node (8192-processor) prototype with pass 1 chips in operation Full system software stack and applications running Support for our Blue Gene science program in investigating the mechanisms of biologically important processes, particularly protein folding node system being planned for IBM Watson IBM working with collaborators to better understand application behavior on BlueGene/L and broaden the scope of the machine ASTRON in the Netherlands is getting a 6144-node machine Argonne National Laboratory is getting a 1024-node machine (more later) 4

5 BlueGene/L philosophy Some applications display very high levels of parallelism: they can be executed efficiently on tens of thousands of processors if: low-latency, high-bandwidth interconnect is present machine is reliable enough good compilers and programming models are available machine is manageable (simple to build and operate) Vastly improved price-performance for a class of applications by choosing simple low-power building block - highest possible single-threaded performance is not relevant, aggregate is! Scalable architecture and appropriate package can lead to a high density, low power, massively parallel system BlueGene/L = cellular architecture + aggressive packaging + scalable software 5

6 Compute density 2 Pentium 4 per 1U, 42 1U/rack: 84 processors/rack 2 Pentium 4 per blade (7U), 14 blades/chassis, 6 chassis/frame: 168 processors/rack BlueGene/L: 2 dual-CPU chips/compute card, 16 compute cards/node card, 16 node cards/midplane, 2 midplanes/rack: 2048 processors/rack 6
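
To make the rack arithmetic explicit, a tiny back-of-the-envelope calculation in C (a sketch; the variable names are just illustrative):

    #include <stdio.h>

    int main(void) {
        /* BlueGene/L packaging, per the compute-density slide above */
        int cpus_per_chip               = 2;
        int chips_per_compute_card      = 2;
        int compute_cards_per_node_card = 16;
        int node_cards_per_midplane     = 16;
        int midplanes_per_rack          = 2;

        int processors_per_rack = cpus_per_chip * chips_per_compute_card *
                                  compute_cards_per_node_card *
                                  node_cards_per_midplane * midplanes_per_rack;

        printf("BlueGene/L processors per rack: %d\n", processors_per_rack); /* 2048 */
        return 0;
    }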

7 BlueGene/L cellular architecture Start with a simple building block (or cell) that can be replicated ad infinitum as necessary aggregate performance is important Cell should have low power/high density characteristics, and should also be cheap to build and easy to interconnect Build a homogeneous collection of these cells, each with its own operating system image All cells have the same computational and communications capabilities (interchangeable from OS or application view) Integrated connection hardware provides a straightforward path to scalable systems with thousands/millions of cells 7

8 BlueGene/L hierarchical system software Organize the cells hierarchically form processing sets (psets) consisting of a collection of compute nodes under control of a master node From a system perspective, BlueGene/L will look like a 1024-way cluster of independent processing sets - a manageable size In BlueGene/L, each processing set contains 64 compute nodes which can be used exclusively for running user application processes Each processing set is under control of one Linux image that executes on a 65th node the I/O node From an application perspective, a flat view is maintained an application consists of a set of communicating compute processes, each running on a dedicated compute node 8
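
The pset arithmetic can be made concrete with a small sketch (illustrative only; pset_of is a hypothetical helper, not a BlueGene/L API):

    #include <stdio.h>

    /* 64 compute nodes share one I/O node per pset, so the full machine
     * collapses to 65,536 / 64 = 1,024 psets: the 1024-way cluster view. */
    enum { COMPUTE_NODES = 65536, NODES_PER_PSET = 64 };

    int pset_of(int compute_rank) {
        return compute_rank / NODES_PER_PSET;   /* 0 .. 1023 */
    }

    int main(void) {
        printf("total psets: %d\n", COMPUTE_NODES / NODES_PER_PSET);   /* 1024 */
        printf("rank 40000 belongs to pset %d\n", pset_of(40000));     /* 625  */
        return 0;
    }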

9 Outline BlueGene/L high-level overview BlueGene/L system architecture and technology overview BlueGene/L software architecture BlueGene/L preliminary performance Conclusions 9

10 BlueGene/L fundamentals A large number of nodes (65,536) Low-power (20W) nodes for density High floating-point performance System-on-a-chip technology Nodes interconnected as 64x32x32 three-dimensional torus Easy to build large systems, as each node connects only to six nearest neighbors full routing in hardware Bisection bandwidth per node is proportional to n^2/n^3 (i.e., 1/n for an n x n x n torus) Auxiliary networks for I/O and global operations Applications consist of multiple processes with message passing Strictly one process/node Minimum OS involvement and overhead 10
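
The six-neighbor property of the torus is easy to picture with a short sketch (illustrative code, not BlueGene/L system software): a node at coordinates (x, y, z) reaches its neighbors by stepping plus or minus one in each dimension, with wraparound at the edges.

    #include <stdio.h>

    /* 3D-torus neighbor computation for the 64 x 32 x 32 machine. */
    enum { NX = 64, NY = 32, NZ = 32 };

    static int wrap(int c, int n) { return (c + n) % n; }  /* torus wraparound */

    int main(void) {
        int x = 0, y = 31, z = 5;   /* an arbitrary node */
        printf("x neighbors: (%d,%d,%d) and (%d,%d,%d)\n",
               wrap(x - 1, NX), y, z, wrap(x + 1, NX), y, z);
        printf("y neighbors: (%d,%d,%d) and (%d,%d,%d)\n",
               x, wrap(y - 1, NY), z, x, wrap(y + 1, NY), z);
        printf("z neighbors: (%d,%d,%d) and (%d,%d,%d)\n",
               x, y, wrap(z - 1, NZ), x, y, wrap(z + 1, NZ));
        return 0;
    }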

11 BlueGene/L interconnection networks 3 Dimensional Torus Interconnects all compute nodes (65,536) Virtual cut-through hardware routing 1.4Gb/s on all 12 node links (2.1 GB/s per node) Communications backbone for computations 350/700 GB/s bisection bandwidth 11
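
Those bisection figures follow from the link rate: a cut across the longest (64-node) dimension of the 64x32x32 torus is crossed by 2 x 32 x 32 links in each direction (the factor of 2 comes from the torus wraparound). A quick check, assuming that accounting:

    #include <stdio.h>

    int main(void) {
        /* 64x32x32 torus cut across the longest dimension */
        double links_one_way = 2.0 * 32 * 32;   /* 2048 links per direction    */
        double link_gbps     = 1.4;             /* Gb/s per link, per slide 11 */
        double one_way_GBs   = links_one_way * link_gbps / 8.0;   /* ~358 GB/s */
        printf("bisection: ~%.0f GB/s one way, ~%.0f GB/s bidirectional\n",
               one_way_GBs, 2.0 * one_way_GBs);  /* ~350 / ~700 GB/s */
        return 0;
    }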

12 BlueGene/L fundamentals (continued) Machine should be dedicated to execution of applications, not system management Avoid asynchronous events (e.g., daemons, interrupts) Avoid complex operations on compute nodes The I/O node: an offload engine System management functions are performed in a (N+1)th node I/O (and other complex operations) are shipped from compute node to I/O node for execution Number of I/O nodes adjustable to needs (N=64 for BG/L) This separation between application and system functions allows compute nodes to focus on application execution Communication to the I/O nodes must be through a separate tree interconnection network to avoid polluting the torus 12

13 BlueGene/L interconnection networks 3 Dimensional Torus Interconnects all compute nodes (65,536) Virtual cut-through hardware routing 1.4 Gb/s on all 12 node links (2.1 GB/s per node) Communications backbone for computations 350/700 GB/s bisection bandwidth Global Tree One-to-all broadcast functionality Reduction operations functionality 2.8 Gb/s of bandwidth per link Latency of tree traversal in the order of 2 µs Interconnects all compute and I/O nodes (1024) 13

14 BlueGene/L interconnection networks 3 Dimensional Torus Interconnects all compute nodes (65,536) Virtual cut-through hardware routing 1.4 Gb/s on all 12 node links (2.1 GB/s per node) Communications backbone for computations 350/700 GB/s bisection bandwidth Global Tree One-to-all broadcast functionality Reduction operations functionality 2.8 Gb/s of bandwidth per link Latency of tree traversal in the order of 2 µs Interconnects all compute and I/O nodes (1024) Gigabit Ethernet Incorporated into every node ASIC Active in the I/O nodes (1:64) All external comm. (file I/O, control, user interaction, etc.) 14
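
From the application's point of view, the tree network is reached through ordinary MPI collectives such as broadcast and reduction; a minimal, generic example (standard MPI calls, nothing BlueGene/L-specific in the source itself):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) value = 42;
        /* One-to-all broadcast: maps naturally onto the tree network. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        local = (double)rank;
        /* Reduction: also a tree operation on BlueGene/L. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("sum of ranks = %.0f\n", global);
        MPI_Finalize();
        return 0;
    }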

15 BlueGene/L fundamentals (continued) Machine monitoring and control should also be offloaded Machine boot, health monitoring Performance monitoring Concept of a service node Separate nodes dedicated to machine control and monitoring Non-architected from a user perspective Control network A separate network that does not interfere with application or I/O traffic Secure network, since operations are critical By offloading services to other nodes, we let the compute nodes focus solely on application execution without interference BG/L is the extreme opposite of time sharing: everything is space shared 15

16 BlueGene/L control system Control and monitoring are performed by a set of processes executing on the service nodes Non-architected network (Ethernet + JTAG) supports non-intrusive monitoring and control Core Monitoring and Control System Controls machine initialization and configuration Performs machine monitoring Two-way communication with node operating system Integration with database for system repository 16

17 BlueGene/L system architecture overview [Diagram: console, scheduler, front-end nodes, file servers, and the service node (DB2 + MMCS) attach over the functional Gigabit Ethernet to the I/O nodes; each pset (Pset 0 through Pset 1023) pairs one I/O node running Linux and ciod with compute nodes C-Node 0 through C-Node 63 running CNK, linked by the tree and torus networks; the service node reaches the hardware through a separate control Gigabit Ethernet, IDo chips, JTAG, and I2C] 17

18 BlueGene/L compute chip [Diagram: two 440 CPUs, each with a double FPU, 32k/32k L1 caches, and prefetch buffers, attached via the PLB (4:1) to a multiported SRAM buffer with locks; a shared L3 directory for EDRAM (with ECC) in front of a 4MB multibank EDRAM L3 cache; DDR control with ECC to a 144-bit-wide external DDR (256MB) at 5.6 GB/s; link buffers and routing for 6 bidirectional 1.4 Gb/s torus links and the tree; Gbit Ethernet and JTAG access] 18
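
A back-of-the-envelope peak-performance estimate, assuming (as commonly cited for this chip) that each double FPU can retire two fused multiply-adds, i.e. four flops per cycle per CPU:

    #include <stdio.h>

    int main(void) {
        double clock_ghz      = 0.7;    /* 700 MHz production parts              */
        int    flops_per_core = 4;      /* assumed: 2 FMAs/cycle on the double FPU */
        int    cores_per_node = 2;
        int    nodes          = 65536;

        double node_gflops = clock_ghz * flops_per_core * cores_per_node;   /* 5.6 */
        printf("peak per node: %.1f Gflop/s\n", node_gflops);
        printf("peak full system: %.0f Tflop/s\n",
               node_gflops * nodes / 1000.0);                               /* ~367 */
        return 0;
    }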

19 BlueGene/L 19

20 Complete BlueGene/L system at LLNL 20

21 BlueGene/L system 21

22 Outline BlueGene/L high-level overview BlueGene/L system architecture and technology overview BlueGene/L software architecture BlueGene/L preliminary performance Conclusions 22

23 Familiar software environment Fortran, C, C++ with MPI Full language support Automatic SIMD FPU exploitation Linux development environment User interacts with system through FE nodes running Linux compilation, job submission, debugging Compute Node Kernel provides look and feel of a Linux environment POSIX system calls (with restrictions) Tools support for debuggers, hardware performance monitors, trace based visualization 23
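
In practice that familiar environment means an ordinary MPI program with ordinary POSIX-style I/O compiles and runs unchanged; a minimal sketch (the output file name is just illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Ordinary C I/O: on BlueGene/L this is function-shipped by CNK to
         * the pset's I/O node (see the file-I/O slides later in the talk). */
        if (rank == 0) {
            FILE *f = fopen("hello.out", "w");   /* illustrative file name */
            if (f) {
                fprintf(f, "hello from %d MPI processes\n", nprocs);
                fclose(f);
            }
        }

        MPI_Finalize();
        return 0;
    }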

24 Scalability through simplicity Performance Strictly space sharing - one job (user) per electrical partition of machine, one process per compute node Dedicated processor for each application level thread Guaranteed, deterministic execution Physical memory directly mapped to application address space no TLB misses, page faults Efficient, user mode access to communication networks No protection necessary because of strict space sharing Multi-tier hierarchical organization system services (I/O, process control) offloaded to IO nodes, control and monitoring offloaded to service node No daemons interfering with application execution System manageable as a cluster of IO nodes Reliability, availability, serviceability Reduce software errors - simplicity of software, extensive run time checking option Ability to detect, isolate, possibly predict failures 24

25 BlueGene/L software architecture User applications execute exclusively on the compute nodes and see only the application volume as exposed by the user-level APIs The outside world interacts only with the I/O nodes and processing sets (I/O node + compute nodes) they represent, through the operational surface functionally, the machine behaves as a cluster of I/O nodes Internally, the machine is controlled through the service nodes in the control surface goal is to hide this surface as much as possible 25

26 BlueGene/L software architecture rationale We view the system as a cluster of 1,024 I/O nodes, thereby reducing a 65,536-node machine to a manageable size From a system perspective, user processes are controlled from the I/O node: Process management (start, signaling, kill) Process debugging Authentication, authorization System management daemon Traditional LINUX operating system User processes actually execute on compute nodes Simple single-process (two threads) kernel on compute node Also possible two single-threaded processes per compute node One processor/thread provides fast, predictable execution User process extends into I/O node for complex operations Application model is a collection of private-memory processes communicating through messages This is an MPI machine! 26

27 Software stack in BlueGene/L compute node CNK controls all access to hardware, and enables bypass for application use User-space libraries and applications can directly access torus and tree through bypass As a policy, user-space code should not directly touch hardware, but there is no enforcement of that policy Application code can use both processors in a compute node [Stack diagram: application code and user-space libraries above CNK, with a bypass path from user space straight to the BG/L ASIC] 27

28 File I/O in BlueGene/L [Diagram: the application and libc run above CNK on the compute node's BG/L ASIC; ciod runs on Linux with NFS/IP on the I/O node's BG/L ASIC; compute and I/O nodes are joined by the tree network, and the I/O node reaches the file server over Ethernet] 28

29 File I/O in BlueGene/L [Step 1: the application calls fscanf] 29

30 File I/O in BlueGene/L [Step 2: libc turns the fscanf into a read system call into CNK] 30

31 File I/O in BlueGene/L [Step 3: CNK ships the read request as tree packets through the BG/L ASIC to ciod on the I/O node] 31

32 File I/O in BlueGene/L [Step 4: ciod issues the read to Linux, which forwards it via NFS over IP and Ethernet to the file server] 32

33 File I/O in BlueGene/L [Step 5: the file server's reply returns through NFS to ciod on the I/O node] 33

34 File I/O in BlueGene/L [Step 6: the data travels back as tree packets to CNK and is delivered to the application] 34
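
The function-shipping idea in the preceding slides can be summarized in a small self-contained sketch. All names here (io_request, cnk_read, ciod_serve) are illustrative, not the real CNK/ciod interfaces, and the tree-packet transport is stood in for by a direct call:

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Illustrative request format; the real CNK/ciod protocol differs. */
    struct io_request { int op; int fd; size_t count; };

    /* "I/O node" side: ciod performs the real POSIX read on behalf of the
     * compute process and would return the data as tree packets. */
    static ssize_t ciod_serve(const struct io_request *req, char *reply, size_t max) {
        size_t n = req->count < max ? req->count : max;
        return read(req->fd, reply, n);
    }

    /* "Compute node" side: CNK does not execute read() locally; it packs the
     * arguments into a request and ships it to the I/O node. */
    static ssize_t cnk_read(int fd, char *buf, size_t count) {
        struct io_request req = { 0 /* READ */, fd, count };
        return ciod_serve(&req, buf, count);   /* stands in for tree send/receive */
    }

    int main(void) {
        char buf[64];
        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
        if (fd < 0) return 1;
        ssize_t n = cnk_read(fd, buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("read %zd bytes: %s", n, buf); }
        close(fd);
        return 0;
    }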

35 Modes of operation of BlueGene/L compute node Coprocessor mode: CPU 0 executes main thread of user application while CPU 1 is an off-load engine (asymmetric mode) CPU 1 runs much of message-passing protocol Computations can also be off-loaded with an asynchronous coroutine model (co_start / co_join) Preferred mode of operation for communication-intensive and memory bandwidth-intensive codes Need careful sequence of cache line flush, invalidate, and copy operations to deal with lack of L1 cache coherence in hardware handled in MPI library for communications Virtual node mode: CPU0 and CPU1 each behave as an independent virtual node, with half the memory of physical node (symmetric mode) Two MPI processes on each physical node, one bound to each processor Private memory semantics lack of L1 coherence not a problem Intra-node (physical) communication is performed through non-L1-coherent memory region 35

36 The coroutine model

main processor:
    int main(argc, argv) {
        hdl = co_start(foo);   /* launch foo on the coprocessor (CPU 1) */
        rc = co_join(hdl);     /* wait for foo and collect its result   */
    }

coprocessor:
    int foo(x) {
        return y;
    }

36

37 The BlueGene/L MPICH2 organization [Diagram: message-passing and process-management sides of MPICH2; MPI point-to-point, datatype, topology, and collectives sit on the Abstract Device Interface, with CH3 channels (socket, MM, bgltorus); the bgltorus message layer drives the torus, tree, and GI hardware through the Torus Packet Layer, Tree Packet Layer, and GI Device; process management goes through PMI (mpd, uniprocessor, simple) and the CIO Protocol] 37

38 Outline BlueGene/L high-level overview BlueGene/L system architecture and technology overview BlueGene/L software architecture BlueGene/L preliminary performance Conclusions 38

39 BlueGene/L status at a glance Status as of today: 4,096-compute-node system at IBM Rochester, MN 1st pass chips, good for software development and testing, running at 500 MHz It is #4 in the TOP500 list of June 2004 (after Earth Simulator, ASCI Thunder, and ASCI Q) 2048-node system at IBM T. J. Watson, NY 2nd pass chips, production quality, running at 700 MHz, is #8 in the TOP500 What does that mean? Each node is dual processor, so BlueGene/L already has as many processors as ASCI Q and ASCI White 16 TF of peak (@500 MHz) is faster than ASCI White All this on 40 sq. ft. of floor space! 4 racks as opposed to 100+ for White and Q! LLNL machine will be 16 times bigger and 40% faster! 39

40 ASCI White, delivered by IBM to LLNL

41 BlueGene/L in pictures (4-rack system) 41

42 Initial benchmark results Multi-node (MPI) benchmarks NAS benchmarks ASCI-class benchmarks and applications Comparison with other systems NOTE: Some results are for 500 MHz, others for 700 MHz. Some results are for weak scaling, others for strong scaling. Some results were measured by IBM, others by our collaborators If you are not confused now, you will be by the end of this talk! 42

43 Noise measurements 43

44 Measured MPI Send Bandwidth and Latency [Plot: bandwidth at 700 MHz vs. message size (bytes) when sending to 1 through 6 neighbors simultaneously; latency at 500 MHz fits a linear model with terms proportional to Manhattan distance and midplane hops, in µs] 44

45 Variation of bandwidth (500 MHz) 45

46 LINPACK summary (500 MHz) [Chart: LINPACK performance in Gflops vs. number of nodes] Single-node 80% of peak 512-node 70% of peak, 1435 Gflop/s, #73 on Nov 2003 TOP500 list, same level of performance as a cluster of Xeon 3.06 GHz with 800 processors 4096-node Achieved Gflop/s Continues to sustain better than 70% of peak (improvements in code) Achieved #4 in June 2004 TOP500 The 700 MHz system achieved 8650 Gflop/s and is #8 in the TOP500 Has been a great test case/driver for the hardware and system software
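
The 70%-of-peak figure for 512 nodes checks out against the raw numbers, assuming 4 flops per cycle per processor at 500 MHz:

    #include <stdio.h>

    int main(void) {
        double node_peak_gflops = 2 * 4 * 0.5;   /* 2 CPUs x 4 flops x 0.5 GHz = 4 Gflop/s */
        double peak_512         = 512 * node_peak_gflops;   /* 2048 Gflop/s */
        double linpack_512      = 1435.0;                   /* measured, per the slide */
        printf("512-node efficiency: %.0f%% of peak\n",
               100.0 * linpack_512 / peak_512);              /* ~70 */
        return 0;
    }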

47 NAS Parallel Benchmarks Class C on 256 nodes [Chart: Mop/s/node for BG/L vs. T3E on BT, CG, EP, FT, IS, LU, MG, SP] All NAS Parallel Benchmarks run successfully on 256 nodes (and many other configurations) Used class C (largest class with published results) No tuning / code changes Compared 500 MHz BG/L and 450 MHz Cray T3E All BG/L benchmarks were compiled with GNU and XL compilers Report best result (GNU for IS) BG/L is a factor of two/three faster on five benchmarks (BT, FT, LU, MG, and SP), a bit slower on one (EP) 47

48 SPPM on fixed machine size (BG/L 700 MHz) [Chart: SPPM computational performance, rate in points/sec/node vs. grid dimension, for 32 BG/L nodes (virtual node mode and coprocessor mode) and 32 p655 processors] 48

49 SPPM on fixed grid size (BG/L 700 MHz) [Chart: SPPM scaling on a 128**3 real*8 grid, rate in points/sec/node vs. number of BG/L nodes / p655 processors, comparing p655, BG/L virtual node mode, and BG/L coprocessor mode] 49

50 SAGE on fixed machine size (BG/L 700 MHz) [Chart: SAGE computational performance on the timing_h input, rate in cells/node/sec vs. cells per node, for 8 BG/L nodes (virtual node and coprocessor modes) and p655 processors] 50

51 Effect of mapping on SAGE [Chart: SAGE scaling (timing_h), processing rate in cells/sec/cpu vs. number of processors, for random, default, and heuristic task mappings] 51

52 Sweep3D results [Chart: aggregate rate in cells/sec vs. number of ASCI Q processors / BG/L physical nodes, comparing ASCI Q with BG/L (700 MHz) in virtual node and coprocessor modes] 52

53 Miranda results (by LLNL) [Chart: Miranda scaling on BG/L (500 MHz) and MCR (Charles Crab), running time in seconds vs. processor count] 53

54 ParaDis on 700 MHz BG/L and MCR 54

55 HPC Challenge: Random Access [Chart: MPI Random Access updates in GUP/s (more is better) for Cray X1/ORNL (64), HP Alpha/PSC (128), HP Itanium2/OSC (128), and BG/L/YKT (64)] 55

56 HPC Challenge: Latency [Chart: random-ring latency in microseconds (less is better) for Cray X1/ORNL (64), HP Alpha/PSC (128), HP Itanium2/OSC (128), and BG/L/YKT (64)] 56

57 Outline BlueGene/L high-level overview BlueGene/L system architecture and technology overview BlueGene/L software architecture BlueGene/L preliminary performance Conclusions 57

58 Conclusions BlueGene/L represents a new level of performance scalability and density for scientific computing Third-generation machine architecture Leverages system-on-a-chip technology and nearest-neighbor interconnect Very low latency, high-bandwidth communication, with small n_1/2 (problem size for achieving half of maximum performance) BlueGene/L system software stack with Linux-like personality for applications Custom solution (CNK) on compute nodes for highest performance Linux solution on I/O nodes for flexibility and functionality MPI is the default programming model, others are being investigated BlueGene/L is testing software approaches for management/operation of very large scale machines Hierarchical organization for management Flat organization for programming Mixed conventional/special-purpose operating systems Encouraging performance results! Many challenges ahead, particularly in scalability and reliability, as we grow to bigger system sizes 58
