IBM Blue Gene/Q solution

1 IBM Blue Gene/Q solution Pascal Vezolle

2 Broad IBM Technical Computing portfolio
Hardware
  Blue Gene/Q
  Power Systems
  x86 systems: iDataPlex and Intelligent Cluster
  GPGPU / Intel MIC
  PureFlex Systems
  Storage
Software
  Platform Computing
  IBM HPC stack
  Big Data solution
  HPC Cloud solution

3 Reasons for IBM Blue Gene/Q
1. Ultra-scalability for breakthrough science ( peak)
   System can scale to 256 racks and beyond (>262,144 nodes)
   Keeps the bytes/flops ratio
2. High power efficiency, smallest footprint, low TCO (Total Cost of Ownership)
3. Superior reliability: run an application across the whole machine, low maintenance
4. Standard programming model (MPI and threads)
   Generalized communication runtime layer allows flexibility of programming model
   Familiar Linux execution environment with support for most POSIX system calls
   Familiar programming models: MPI, OpenMP/threads, POSIX I/O
5. Low-latency, high-bandwidth, extensible inter-processor communication system
6. Hierarchical I/O (transparent to the applications)
7. Open-source and standards-based programming environment
   Red Hat Linux distribution on service, front-end, and I/O nodes
   Lightweight Compute Node Kernel (CNK) on compute nodes ensures scaling with no OS jitter and enables reproducible runtime results
8. Key foundation for exascale exploration
   Innovative hardware for threading transition

4 Annualized TCO of HPC Systems (Cabot Partners)

5 Blue Gene/Q Hardware
Chip: 16+2 µP cores
Single Chip Module
Compute Node: chip module, memory
Node Card: 32 compute nodes + optical modules, link chips; 5D torus
Midplane: 16 node cards
I/O drawer: 8 I/O cards w/16 GB, 8 PCIe Gen2 x8
Rack: 2 midplanes; 1, 2, or 4 I/O drawers
System: e.g. 96 racks (Sequoia)
1 BG/Q rack: 16k cores, 16 TB, 200 TFlops peak, ~80 kW max
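The per-rack figures follow from the packaging hierarchy. The sketch below multiplies it out; the 1.6 GHz clock and the 8 floating-point operations per cycle per core are taken from the compute-chip and QPX slides of this deck, and the function and constant names are illustrative only.

```python
# Packaging figures quoted on the hardware slide.
NODES_PER_NODE_CARD = 32
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
USER_CORES_PER_NODE = 16
MEMORY_PER_NODE_GB = 16

# Taken from the compute-chip and QPX slides of the same deck.
CLOCK_GHZ = 1.6
FLOPS_PER_CYCLE_PER_CORE = 8  # 4-wide SIMD, fused multiply-add


def system_totals(racks=1):
    """Return node count, core count, memory (TB) and peak (TFlops) for `racks` racks."""
    nodes = racks * MIDPLANES_PER_RACK * NODE_CARDS_PER_MIDPLANE * NODES_PER_NODE_CARD
    cores = nodes * USER_CORES_PER_NODE
    memory_tb = nodes * MEMORY_PER_NODE_GB / 1024
    peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE / 1000
    return {
        "nodes": nodes,
        "cores": cores,
        "memory_tb": memory_tb,
        "peak_tflops": peak_tflops,
    }


if __name__ == "__main__":
    print(system_totals(1))   # one rack
    print(system_totals(96))  # a 96-rack system such as Sequoia
```

One rack comes out at 1,024 nodes, 16,384 cores and 16 TB, with a computed peak slightly above the rounded 200 TFlops quoted on the slide.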

6 Blue Gene/Q Node-Board Assembly
Blue Gene/Q node board: 32 compute nodes, 512 cores, 2048 threads
Figure labels:
  Fiber-optic ribbons (36x, 12 fibers each), connecting link chips to connectors at the front
  Compute card with one node (32x)
  Water hoses
  48-fiber connectors
  Redundant, hot-pluggable power-supply assemblies


7 Inter-Processor Communication
Integrated 5D torus
  Virtual cut-through routing
  Hardware assists for collective and barrier functions
  FP addition support in network
  RDMA (integrated on-chip Message Unit)
2 GB/s raw bandwidth on all 10 links in each direction, i.e. 4 GB/s bidirectional
1.8 GB/s user bandwidth after protocol overhead
5D nearest-neighbor exchange measured at 1.76 GB/s per link (98% efficiency)
Network performance
  All-to-all: 97% of peak
  Bisection: > 93% of peak
  Nearest-neighbor: 98% of peak
  Collective: FP reductions at 94.6% of peak
Hardware latency
  Nearest: 80 ns
  Farthest: 3 µs (96-rack 20 PF system, 31 hops)
Additional 11th link for communication to I/O nodes
  BQC chips in separate enclosure
  I/O nodes run Linux, mount file system
  I/O nodes drive PCIe Gen2 x8 (4+4 GB/s) to IB/10G Ethernet, which connects to the file system and the outside world
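A small sketch of how the ten links and the 31-hop worst case arise in a 5D torus. The torus shape 16 x 16 x 16 x 12 x 2 used for the 96-rack example is an assumption (it is not stated on the slide); it multiplies out to 98,304 nodes, matching 96 racks of 1,024 nodes. Function names are illustrative.

```python
def torus_neighbors(coord, shape):
    """Return the 2 * len(shape) link endpoints of `coord` on a torus of size `shape`.

    In a dimension of size 2 the +1 and -1 links reach the same node,
    so the list may contain duplicates.
    """
    neighbors = []
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            n = list(coord)
            n[dim] = (coord[dim] + step) % size  # wrap around the torus
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, shape):
    """Minimal hop count between nodes a and b, taking the shorter way around each ring."""
    return sum(min((x - y) % s, (y - x) % s) for x, y, s in zip(a, b, shape))


def max_hops(shape):
    """Largest minimal hop count between any two nodes of the torus."""
    return sum(s // 2 for s in shape)


if __name__ == "__main__":
    shape = (16, 16, 16, 12, 2)  # assumed shape for a 96-rack system
    print(len(torus_neighbors((0, 0, 0, 0, 0), shape)))  # one entry per link
    print(max_hops(shape))                                # worst-case hop count
    print(round(1.76 / 1.8, 3))                           # measured / user bandwidth
```

Under the assumed shape, the worst case is 8 + 8 + 8 + 6 + 1 hops, which agrees with the 31 hops quoted for the farthest node.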


8 This turns one of the 4x1x1x1 partitions into:
  One 2x1x1x1
  Two 1x1x1x1
For any dimension, a torus can be formed by using all of the midplanes in that line, or by wrapping the links and only using one midplane in that line.
Additionally, using more than one midplane, but fewer than the entire line, means that the remaining midplanes will set their link chips to wrap/passthrough for that dimension and can only achieve a torus of size 1.
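The rule for a single dimension can be written down as a short sketch. It encodes only what the slide states; the function names are hypothetical and this is not the control-system API.

```python
def can_wrap_as_torus(line_length, used):
    """A torus along one dimension needs the whole line of midplanes or exactly one."""
    if not 1 <= used <= line_length:
        raise ValueError("used must be between 1 and line_length")
    return used in (1, line_length)


def split_partial_line(line_length, used):
    """Block sizes along one dimension when `used` midplanes (more than one,
    fewer than the whole line) form a single block.

    Per the slide, the remaining midplanes can then only be used as size-1 pieces.
    """
    if not 1 < used < line_length:
        raise ValueError("rule applies only when 1 < used < line_length")
    return [used] + [1] * (line_length - used)


if __name__ == "__main__":
    # A line of 4 midplanes with a 2-midplane block: one 2x1x1x1 and two 1x1x1x1.
    print(split_partial_line(4, 2))
```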


9 I/O on Blue Gene/Q
Diagram: BG/Q compute racks, BG/Q I/O, switch, RAID storage and file servers (links labeled PCI_E, IB, IB)
BlueGene Classic I/O with GPFS clients on the logical I/O nodes
  Similar to BG/L and BG/P
  Uses InfiniBand switch
  Uses DDN RAID controllers and file servers
BG/Q I/O nodes are not shared between compute partitions
  I/O nodes bridge data from function-shipped I/O calls to the parallel file system client
Components balanced to allow a specified minimum compute partition size to saturate the entire storage array I/O bandwidth
Data path (diagram):
  Compute node: application (fscanf), libc (read), CNK, cn packets, BG ASIC
  Optical network
  I/O node (full Red Hat Linux): BG ASIC, cn packets, CIOS (read), Linux FS, IP
  Ethernet or InfiniBand
  File server

10 BlueGene/Q Compute chip
System-on-a-Chip design: integrates processors, memory and networking logic into a single chip
360 mm² Cu-45 technology (SOI)
  11 metal layers
16 user + 1 service processors
  plus 1 redundant processor
  all processors are symmetric
  each 4-way multi-threaded
  64 bits, 1.6 GHz
  L1 I/D cache = 16 kB/16 kB
  L1 prefetch engines
  each processor has Quad FPU (4-wide double precision, SIMD)
  peak performance 204.8 GFLOPS (16 cores x 1.6 GHz x 8 flops per cycle)
Central shared L2 cache: 32 MB eDRAM
  multiversioned cache supports transactional memory and speculative execution
  supports scalable atomic operations
Dual memory controller
  16 GB external DDR3 memory
  42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3), 2 channels each with chip kill protection
Chip-to-chip networking
  5D torus topology + external link: 5 x 2 + 1 high-speed serial links
  each 2 GB/s send + 2 GB/s receive
  DMA, remote put/get, collective operations
External (file) I/O, when used as I/O chip
  PCIe Gen2 x8 interface (4 GB/s Tx + 4 GB/s Rx)
  re-uses 2 serial links
  interface to Ethernet or InfiniBand cards
Diagram labels: full crossbar switch, DDR-3 controller (x2), external DDR3 (x2), test, DMA, network, PCI Express
2011 IBM Corporation
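The chip-level numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes a 16-byte-wide data path per DDR3 channel, which is not stated on the slide; everything else is taken from this deck. Names are illustrative.

```python
USER_CORES = 16
CLOCK_GHZ = 1.6
FLOPS_PER_CYCLE_PER_CORE = 8      # 4-wide SIMD fused multiply-add (QPX slide)

DDR3_CHANNELS = 2
DDR3_TRANSFER_RATE_GT_S = 1.333   # "1.333 GHz DDR3"
DDR3_CHANNEL_WIDTH_BYTES = 16     # assumption: 128-bit data path per channel

TORUS_LINKS = 10
LINK_GB_S_PER_DIRECTION = 2.0


def node_peak_gflops():
    """Peak double-precision rate of the 16 user cores."""
    return USER_CORES * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE


def memory_bandwidth_gb_s():
    """Aggregate DDR3 bandwidth over both channels."""
    return DDR3_CHANNELS * DDR3_CHANNEL_WIDTH_BYTES * DDR3_TRANSFER_RATE_GT_S


def torus_bandwidth_gb_s():
    """Raw torus bandwidth per node, send plus receive, over all ten links."""
    return TORUS_LINKS * LINK_GB_S_PER_DIRECTION * 2


if __name__ == "__main__":
    peak = node_peak_gflops()
    mem = memory_bandwidth_gb_s()
    print(f"node peak: {peak:.1f} GFLOPS")
    print(f"memory bandwidth: {mem:.2f} GB/s ({mem / peak:.3f} bytes/flop)")
    print(f"torus bandwidth: {torus_bandwidth_gb_s():.0f} GB/s raw")
```

Under the assumed channel width, the computed memory bandwidth lands within rounding of the 42.6 GB/s quoted on the slide.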

11 PowerPC A2 processor + QPX 256 bit vector unit
A2
  Embedded 64-bit PowerPC compliant
  Up to 2 instructions issued per cycle
    One floating-point instruction (AXU)
    One integer/load/store/control instruction
  4 SMT threads issuing to two pipelines
    Impact of memory access latency reduced
    At most one instruction issued per thread
  4 x 32 register sets
  In-order execution
  AXU port allows for unique BGQ-style floating point
QPX
  4-wide double precision SIMD
  Load alignment: new module that supports a multitude of alignments
  4R/2W register file, 32x32 bytes per thread
  32 B (256 bits) datapath to/from cache
  8 concurrent floating-point operations (FMA) + load + store
Diagram labels: 256, RF (x4), 64, A2, MAD0, MAD1, MAD2, MAD3, Permute

12 Hardware/Software Co-Design on Blue Gene/Q
Helps programmers take advantage of the multi-core environment and cope with an exploding number of hardware threads (64 per node)
Exploiting a large number of threads is a challenge for all future architectures. This is a key component of the BGQ research.
Novel hardware and software is utilized in BGQ
  Scalable atomic instructions
    Enable development of lock-less producer-consumer queues with N producers and 1 or more consumers
  Hardware wake-up mechanism
  Support for OpenMP/MPI and other hybrid programming models
  Multi-versioned cache
    Transactional Memory
    Speculative Execution
  List-based prefetching
    Allows efficient use of cache for broader applications

13 BG/Q Node Features
Atomic operations (L2)
  Pipelined at L2 rather than retried as in commodity processors
  Avoids L2-to-PU roundtrip cycles of lwarx/stwcx queue locking
  Low latency even under high contention
  Accelerates software operations: locking, barriers
  Efficient work queue management, with multiple producers and consumers
  Chart: barrier speed using different synchronizing hardware (number of processor cycles vs. number of threads); series: atomic, no invalidates; atomic, invalidates; lwarx/stwcx. Caption: improvement in atomics under contention
Wake-up unit
  Allows SMT threads to sleep waiting for an event
  Faster OpenMP work hand-off; lowers messaging latency
  Allows active threads to better utilize core resources
  Reduces wasted core cycles in polling and spin loops
Multiversioning cache (L2)
  Transactional Memory eliminates need for locks
  Speculative Execution allows OpenMP threading for sections with data dependencies
  Diagram: thread flow using transactions (single MPI task, user-defined parallelism, user-defined transaction start, user-defined transaction end, hardware-detected dependency conflict, rollback, parallelization completion synchronization point)
IBM Confidential 10/4/2012

14 Wakeup Unit
Used in conjunction with the PowerPC wait instruction
  When a hardware thread is in a wait state, it stops executing and the other hardware threads benefit from the additional available cycles
Sends a wakeup signal to a hardware thread
Configurable wakeup conditions:
  Wakeup address compare
  Messaging Unit activity
  Interrupt sources
Allows active threads to better utilize core resources
Kernel provides application interfaces to utilize the wakeup unit
Reduces wasted core cycles in polling and spin loops
Thread guard pages
  CNK uses the Wakeup Unit to provide memory protection between stack/heap
  Detects violation, but cannot prevent it

15 Transactional Memory
Performance optimization for critical regions
Software threads enter transactional memory mode
  Memory accesses are tracked
  Writes are not visible outside of the thread until committed
Perform calculation without locking
Hardware automatically detects memory contention conflicts between threads
  If conflict:
    TM hardware detects conflict
    Kernel decides whether to roll back the transaction or let the thread continue
    If rollback, the compiler runtime decides whether to serialize or retry
  If no conflicts, threads can commit their memory
Threads can commit out-of-order
XL Compiler only
Diagram labels: single MPI task, user-defined parallelism, user-defined transaction start, user-defined transaction end, hardware-detected dependency conflict, rollback, parallelization completion synchronization point
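The control flow on this slide (speculate, detect conflict, roll back, then retry or serialize) can be imitated in software. The sketch below is only an analogy: on Blue Gene/Q the tracking and conflict detection are done by hardware, and the decisions by the kernel and the XL compiler runtime. The class, the function names and the retry limit are hypothetical.

```python
import threading


class SharedCell:
    """Toy stand-in for a memory location updated inside transactions."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0      # bumped on every commit; used to detect conflicts
        self.rollbacks = 0
        self.serialized = 0
        self._commit_lock = threading.Lock()  # emulates the atomic commit step


def transactional_update(cell, update, max_retries=3):
    """Apply `update` optimistically; fall back to serialized execution."""
    for _ in range(max_retries):
        seen_version = cell.version        # read the version before the value
        new_value = update(cell.value)     # speculative work, not yet visible
        with cell._commit_lock:
            if cell.version == seen_version:   # nobody committed in between
                cell.value = new_value
                cell.version += 1
                return "committed"
            cell.rollbacks += 1                # conflict: discard and retry
    with cell._commit_lock:                    # give up speculating: serialize
        cell.value = update(cell.value)
        cell.version += 1
        cell.serialized += 1
        return "serialized"


def run_demo(n_threads=4, increments=500):
    """Increment one cell from several threads and return the final value."""
    cell = SharedCell()

    def worker():
        for _ in range(increments):
            transactional_update(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value


if __name__ == "__main__":
    print(run_demo())
```

No update is lost regardless of how often threads conflict, because a rolled-back attempt never becomes visible and the serialized fallback always makes progress.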


16 Speculative Execution
Similar to Transactional Memory, except for ordered thread commit and a different usage model
Leverages existing OpenMP parallelization
  However, the compiler does not need to guarantee that there is no array overlap
  Should allow the compiler to do a much better job of auto-parallelizing
Total work is subdivided into work units without locking
If work units collide in memory:
  SE hardware detects the collision
  Kernel rolls back the transaction
  Runtime decides whether to retry or serialize
XL Compiler only
This is a DD2 only feature
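Ordered commit can be illustrated with a sequential software model: every work unit runs against a snapshot, and the units are then committed in loop order, re-executing any unit that read a location an earlier unit wrote. This is a sketch under simplifying assumptions, not the hardware or XL runtime mechanism; here re-execution in commit order plays the role of the retry/serialize decision, and all names are hypothetical.

```python
class Recorder:
    """Tracks the read set and buffers the writes of one work unit."""

    def __init__(self, memory):
        self.memory = memory
        self.reads = set()
        self.writes = {}

    def load(self, index):
        if index in self.writes:          # a unit sees its own buffered writes
            return self.writes[index]
        self.reads.add(index)
        return self.memory[index]

    def store(self, index, value):
        self.writes[index] = value        # invisible to others until commit


def run_speculatively(memory, body, iterations):
    """Run body(i, recorder) for each i; return the number of rollbacks."""
    snapshot = list(memory)
    attempts = []
    for i in range(iterations):           # speculative phase (conceptually parallel)
        rec = Recorder(snapshot)
        body(i, rec)
        attempts.append(rec)

    written = set()
    rollbacks = 0
    for i, rec in enumerate(attempts):    # ordered commit phase
        if rec.reads & written:           # read something an earlier unit wrote
            rollbacks += 1
            rec = Recorder(memory)        # roll back, re-run on committed state
            body(i, rec)
        for index, value in rec.writes.items():
            memory[index] = value
        written.update(rec.writes)
    return rollbacks


if __name__ == "__main__":
    data = [1, 1, 1, 1, 1]
    # Loop-carried dependence: a[i + 1] += a[i]
    n = run_speculatively(
        data, lambda i, m: m.store(i + 1, m.load(i) + m.load(i + 1)), 4
    )
    print(data, n)
```

A loop with independent iterations commits without any rollback, while the dependent loop in the usage example still produces the same result as sequential execution, at the cost of re-running the conflicting units.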

17 BlueGene/Q L1P unit: prefetcher
Normal mode: stream prefetching
  In response to observed memory traffic, adaptively balances resources to prefetch cache lines (128 B wide)
  From 16 streams x 2 lines deep through 4 streams x 8 deep
Additional: 4 list-based prefetching engines
  One per thread
  Activated by program directives, e.g. bracketing a complex set of loops
  Used for repeated memory reference patterns in arbitrarily long code segments
  Record pattern on first iteration of loop; play back for subsequent iterations
  On subsequent passes, list can be adaptively refined for missing or extra cache misses (async events)
Diagram: miss addresses compared against list addresses to produce prefetched addresses. Caption: list-based "perfect" prefetching has tolerance for missing or extra cache misses
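A toy model of record-and-playback with tolerance for extra and missing misses. The slide does not describe the matching mechanism, so the lookahead window and the resynchronization rule below are assumptions made for illustration; the function names are hypothetical.

```python
def record(miss_stream):
    """First pass over the loop: remember the sequence of miss addresses."""
    return list(miss_stream)


def playback(recorded, miss_stream, window=4):
    """Later pass: count how many misses the recorded list anticipated.

    An observed miss is matched against the next `window` list entries.
    Extra misses (not in the window) leave the list position unchanged;
    missing misses are skipped when a later list entry matches.
    Returns (covered, refined_list), where refined_list is the pattern
    observed on this pass, to be used for the next one.
    """
    position = 0
    covered = 0
    observed = []
    for address in miss_stream:
        observed.append(address)
        lookahead = recorded[position:position + window]
        if address in lookahead:
            position += lookahead.index(address) + 1  # resynchronize with the list
            covered += 1
    return covered, observed


if __name__ == "__main__":
    pattern = record("abcdefghi")
    # Second pass with four extra misses (x, y, z, k) interleaved.
    covered, refined = playback(pattern, "abxcdyzefghki")
    print(covered, "of", len(pattern), "list entries matched")
```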

18 Blue Gene/Q Software High-Level Goals & Philosophy
Facilitate extreme scalability
  Extremely low noise on compute nodes
High reliability: a corollary of scalability
Standards-based when possible, leverage other IBM HPC
Open source where possible
Facilitate high performance for unique hardware:
  Quad FPU, DMA unit, list-based prefetcher
  TM (Transactional Memory), SE (Speculative Execution)
  Wakeup Unit, scalable atomic operations
Optimize MPI and native messaging performance
Optimize thread performance

19 Summary Blue Gene/Q
1. Ultra-scalability for breakthrough science
   System can scale to 256 racks and beyond (>262,144 nodes)
   Cluster: typically a few racks ( nodes) or less
2. Lowest Total Cost of Ownership
   Highest total power efficiency, smallest footprint
   Typically 2 orders of magnitude better reliability
3. Broad range of application reach
   Familiar programming models
   Easy porting from other environments
4. Foundation for exascale exploration
   Scalability, density, MTBF, TCO, programming models,
