Blue Gene: A Next Generation Supercomputer (BlueGene/P)
1 Blue Gene: A Next Generation Supercomputer (BlueGene/P) Presented by Alan Gara (chief architect) representing the Blue Gene team IBM Corporation
2 Outline of Talk
- A brief sampling of applications on BlueGene/L.
- A detailed look at the next-generation BlueGene (BlueGene/P).
- Future challenges, motivated by a look at computing ~10 to 15 years out.
- Insight into the future-generation BlueGene/Q machine.
3 Blue Gene Roadmap
Performance: providing unmatched sustained $/perf and Watts/perf for scalable applications. The primary IBM system research vehicle, which influences our more traditional PowerPC product line. SoC design, Cu-08 (9SF) technology.
- Blue Gene/L (PPC 440, 0.7 GHz): scalable to 360 TFlops. LA: 12/04, GA 6/. 1GB version available 1Q06.
- Blue Gene/P (PPC 450, 0.85 GHz): scalable to 1 PFlops.
- Blue Gene/Q (Power architecture): scales to 10s of PFlops.
Top500 list (June 2007), vendor and installation:
1. IBM: BlueGene/L, DOE/NNSA/LLNL
2. Cray: ORNL
3. Cray/Sandia: Sandia (Red Storm)
4. IBM: BlueGene/L at Watson
5. IBM: BlueGene/L at Stony Brook/BNL
6. IBM: ASC Purple, LLNL
7. IBM: BlueGene/L at RPI
8. Dell: NCSA
9. IBM: Barcelona PowerPC blades
10. SGI: Leibniz Rechenzentrum
4 Car-Parrinello Molecular Dynamics (CPMD): studying the effect of dopants on SiO2/Si boundaries
Simulations from first principles to understand the physics and chemistry of current technology and to guide the design of next-generation materials.
- Characterization of materials currently under experimental test.
- Formation of a non-abrupt SiO2/Si interface correctly predicted from scratch.
- When nitrogen and hafnium are introduced during the simulation process, detrimental defects are revealed.
5 Blue Brain
EPFL to simulate the neocortical column. Our understanding of the brain is limited by insufficient information and complexity.
- Overcome limitations of neuroscientific experimentation.
- Inform experimental design and theory.
- Enable scientific discovery for understanding brain function and diseases.
Finally feasible (although by no means finished): 8096 processors (BG/L), 100,000 morphologically complex neurons in real time.
[Figure: unfolded human neocortex, ~35 cm x 50 cm, total area ~1570 cm2, thickness ~3 mm, ~1 million columns; each column ~300 µm across with ~10,000 neurons. Courtesy of Henry Markram, EPFL]
6 POP2 0.1-degree benchmark
Comparison point for the same system (node) size: 71% of time in solver. Projected BlueGene/P: 20% of time in solver. Courtesy of M. Taylor, John Dennis.
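The solver fraction bounds how much speeding up the rest of the application can help. An Amdahl-style ceiling makes the contrast concrete (an illustrative sketch; the slide itself gives only the two fractions, and treating the solver as the non-scaling portion is our assumption):

```python
def amdahl_ceiling(non_scaling_fraction, speedup_of_rest=float("inf")):
    """Upper bound on overall speedup when a fraction of runtime does not scale.

    Total time goes from 1 to
    non_scaling_fraction + (1 - non_scaling_fraction) / speedup_of_rest.
    """
    rest = (1.0 - non_scaling_fraction) / speedup_of_rest
    return 1.0 / (non_scaling_fraction + rest)

# 71% of time in the solver caps everything-else optimizations at ~1.4x overall;
# with only 20% in the solver the ceiling rises to 5x.
print(amdahl_ceiling(0.71))
print(amdahl_ceiling(0.20))
```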
7 Carbon footprint for Courtesy of John Dennis
8 BlueGene/P in Focus
9 BlueGene/P Architectural Highlights
Scaled performance through density and frequency bump:
- 2x performance through doubling the processors/node.
- 1.2x from frequency bump due to technology.
Enhanced function:
- 4-way SMP.
- DMA, remote put/get, user-programmable memory prefetch.
- Greatly enhanced 64-bit performance counters (including the 450 core).
Hold BlueGene/L packaging as much as possible:
- Improve networks through higher-speed signaling on the same wires.
- Improve power efficiency through aggressive power management.
Higher signaling rate:
- 2.4x higher bandwidth, improved latency for the Torus and Tree networks.
- 10x higher bandwidth for Ethernet I/O.
72Ki nodes in 72 racks should hit 1.00 PF peak.
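The 1.00 PF figure follows directly from the per-node arithmetic. A quick sketch (the 4 flops/cycle/core figure, from a dual-pipeline FPU issuing fused multiply-adds, is our assumption and is not spelled out on this slide):

```python
def peak_gflops_per_node(cores, ghz, flops_per_cycle_per_core):
    # peak = cores x clock (GHz) x flops issued per cycle per core
    return cores * ghz * flops_per_cycle_per_core

node_gf = peak_gflops_per_node(cores=4, ghz=0.85, flops_per_cycle_per_core=4)
system_pf = node_gf * 72 * 1024 / 1e6  # 72 racks x 1024 nodes/rack

print(node_gf)    # 13.6 GF/node
print(system_pf)  # just over 1.00 PF
```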
10 BGP comparison with BGL
Node properties (BG/L | BG/P):
- Node processors / frequency: 2x 440 PowerPC, 0.7GHz | 4x 450 PowerPC, 0.85GHz (target)
- Coherency: software managed | SMP
- L1 cache (private): 32KB/processor | 32KB/processor
- L2 cache (private): 14-stream prefetching | 14-stream prefetching
- L3 cache size (shared): 4MB | 8MB
- Main store/node: 512MB/1GB | 2GB
- Main store bandwidth: 5.6GB/s (16B wide) | 13.6GB/s (2x16B wide)
- Peak performance: 5.6GF/node | 13.6GF/node
Torus network (BG/L | BG/P):
- Bandwidth: 6*2*175MB/s = 2.1GB/s | 6*2*425MB/s = 5.1GB/s
- Hardware latency (nearest neighbor): 200ns (32B packet), 1.6us (256B packet) | 160ns (32B packet), 500ns (256B packet)
- Hardware latency (worst case): 6.4us (64 hops) | 5us (64 hops)
Collective network (BG/L | BG/P):
- Bandwidth: 2*350MB/s = 700MB/s | 2*0.85GB/s = 1.7GB/s
- Hardware latency (round trip, worst case): 5.0us | 4us
System properties, 72k nodes (BG/L | BG/P):
- Peak performance: 410TF | 1PF
- Total power: 1.7MW | 2.7MW
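One way to read the table is that BG/P roughly doubles resources while holding BG/L's machine balance. A small sketch computing bytes-per-flop from the table's own numbers (the dictionary layout is ours, purely for illustration):

```python
bgl = {"peak_gf": 5.6, "mem_bw_gbs": 5.6, "torus_bw_gbs": 2.1}
bgp = {"peak_gf": 13.6, "mem_bw_gbs": 13.6, "torus_bw_gbs": 5.1}

def balance(node):
    # bytes moved per peak flop, for main memory and for the torus
    return (node["mem_bw_gbs"] / node["peak_gf"],
            node["torus_bw_gbs"] / node["peak_gf"])

print(balance(bgl))  # ~ (1.0, 0.375) bytes/flop
print(balance(bgp))  # ~ (1.0, 0.375) bytes/flop -- balance preserved
```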
11 BlueGene/P node
[Block diagram: four PPC 450 cores, each with an FPU, private L1, and a prefetching L2 (7GB/s data paths into L2; 14GB/s read and 14GB/s write from each L2), feeding through two multiplexing switches into two 4MB eDRAM L3 slices and two DDR-2 controllers (13.6GB/s external DDR2 DRAM bus, 2*16B at 425MHz). A DMA module allows remote direct put/get, with 4 symmetric ports for the collective, torus and global barrier networks: torus 6*3.4Gb/s bidirectional, collective 3*6.8Gb/s bidirectional, plus an arbiter, a JTAG control network, and 10Gb Ethernet to a 10Gb physical layer (shares I/O with the torus).]
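The 13.6GB/s external memory figure is consistent with the bus description: two 16-byte-wide DDR-2 interfaces at a 425MHz data rate. A sketch of the arithmetic (reading the 425 figure as the per-pin data rate in MHz is our interpretation):

```python
def ddr_bandwidth_gbs(controllers, bytes_wide, mhz):
    # bandwidth = interfaces x width (bytes) x data rate (MHz) / 1000
    return controllers * bytes_wide * mhz / 1000.0

print(ddr_bandwidth_gbs(controllers=2, bytes_wide=16, mhz=425))  # 13.6 GB/s
```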
12 IBM System Blue Gene/P Solution: Expanding the Limits of Breakthrough Science
Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
Packaging hierarchy:
- Chip: 4 processors, 13.6 GF/s, 8 MB eDRAM.
- Compute card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 GB DDR; supports 4-way SMP.
- Node card: 32 chips (4x4x2), 32 compute and 0-2 I/O cards; 435 GF/s, 64 GB.
- Rack: 32 node cards, 1024 chips, 4096 procs, cabled 8x8x16; 14 TF/s, 2 TB.
- System: 1 to 72 or more racks; 1 PF/s, 144 TB+.
Front-end node / service node: System p servers, Linux SLES10. HPC software: compilers, GPFS, ESSL, LoadLeveler.
13 Blue Gene/P Interconnection Networks
3-dimensional torus:
- Interconnects all compute nodes; the communications backbone for computations.
- Adaptive cut-through hardware routing.
- 3.4 Gb/s on all 12 node links (5.1 GB/s per node).
- 0.5 µs latency between nearest neighbors, 5 µs to the farthest. MPI: 3 µs latency for one hop, 10 µs to the farthest.
- 1.7/2.6 TB/s bisection bandwidth, 188 TB/s total bandwidth (72k machine).
Collective network:
- Interconnects all compute and I/O nodes (1152).
- One-to-all broadcast functionality; reduction-operations functionality.
- 6.8 Gb/s of bandwidth per link.
- Latency of one-way tree traversal 2 µs (MPI 5 µs).
- ~62 TB/s total binary-tree bandwidth (72k machine).
Low-latency global barrier and interrupt:
- Latency of one way to reach all 72K nodes 0.65 µs (MPI 1.6 µs).
Other networks:
- 10Gb functional Ethernet: I/O nodes only.
- 1Gb private control Ethernet: provides JTAG access to hardware; accessible only from the Service Node system.
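On a wrap-around torus the farthest node is half of each dimension away, which is why the slide quotes a per-hop latency plus a farthest-node figure. A sketch (the single-rack 8x8x16 shape comes from the packaging slide; treating one rack as a standalone torus is an illustrative assumption):

```python
def max_torus_hops(dims):
    # farthest node on a wrap-around torus: half of each dimension, rounded down
    return sum(d // 2 for d in dims)

def node_torus_bw_gbs(links=6, directions=2, mbs_per_link=425):
    # each node drives 6 links, bidirectional, 425 MB/s (3.4 Gb/s) per direction
    return links * directions * mbs_per_link / 1000.0

print(max_torus_hops((8, 8, 16)))  # 16 hops across one rack
print(node_torus_bw_gbs())         # 5.1 GB/s per node
```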
14 BlueGene/P Software
BG/L applications easily port to 4-way virtual-node BG/P, and may gain performance through new BG/P features:
- Programming model changes: mixed OpenMP + MPI supported (OpenMP across the 4-way node). Virtual node mode supported as in BG/L; in BG/P, 4 MPI tasks/node. pthreads supported. Number of threads limited to the number of cores (4).
- DMA engine enables effective offloading of messaging and increases the value of overlapping computation with communication. The messaging library utilizes the DMA and is built around put/get functionality.
- HPC toolkit will enable access to performance counters (BG/P has processor counters).
- BG/L model of a high-performance kernel on compute nodes and Linux on I/O nodes. Working on supporting dynamic linking on the high-performance kernel.
The above also enables new applications for BG/P.
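The programming-model choices above amount to deciding how a node's four cores are split between MPI tasks and threads. A sketch of the trade-off (the three mode names and their splits are our labeling of the options the slide describes, not launcher syntax from the slide):

```python
# (MPI tasks per node, threads per task): every mode uses all 4 cores
modes = {
    "smp":  (1, 4),  # one MPI task, OpenMP/pthreads across the node
    "dual": (2, 2),  # hybrid split
    "vn":   (4, 1),  # virtual node mode: 4 MPI tasks/node, as ported from BG/L
}

def cores_used(mode):
    tasks, threads = modes[mode]
    return tasks * threads

for m in modes:
    print(m, modes[m], cores_used(m))
```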
15 Future Challenges: Insights for BlueGene/Q
16 Challenges for the future (June 15, 2005)
We can get an understanding of the challenges by projecting to issues in 2023 (the Exaflop era):
- Power is a fundamental problem that is pervasive at many system levels (compute, memory, disk).
- Memory cost and performance are not keeping pace with compute potential.
- Network performance will be both costly (bandwidth) and will not scale well to Exaflops (latency).
- Ease of use, to extract the promised performance from compute, will be a main focus. Big peak Flops is mainly a power problem.
- Reliability at the Exaflop scale will require a holistic approach at the architecture level. This results both from a lessening of the underlying silicon technology and from the sheer number of logic elements.
[Chart: Supercomputer peak performance (flops) vs. year introduced, 1E+5 to 1E+17, from ENIAC and UNIVAC (vacuum tubes), through IBM 701/704/7090/Stretch (transistors), CDC 6600/7600 and ILLIAC IV (ICs), CDC STAR-100 and CRAY-1 (vectors), X-MP/Y-MP/Cyber 205/SX-2/SX-3 (parallel vectors), and the MPPs (i860, Delta, T3D, CM-5, Paragon, CP-PACS, NWT, T3E, ASCI Red, Blue Pacific, ASCI White, SX-5, Earth, ASCI Purple, Red Storm) up to Blue Gene/L; doubling time = 1.5 yr. Eras: current/past (performance growth through exponential processor performance growth), near term (performance through exponential growth in parallelism), long term (power cost = system cost).]
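The chart's 1.5-year doubling time is what places the Exaflop around 2023: growing from ~1 PF to 1 EF is a factor of 1000, i.e. about ten doublings. A sketch of the extrapolation (taking ~2008 as the 1 PF baseline is our reading of the roadmap, not a figure from this slide):

```python
import math

def years_to_grow(factor, doubling_years=1.5):
    # time for peak performance to grow by `factor` at a fixed doubling time
    return doubling_years * math.log2(factor)

# 1 PF -> 1 EF is a factor of 1000: roughly 15 years, landing near 2023
print(years_to_grow(1000))
```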
17 Extrapolating an Exaflop in 2023
Columns: BlueGene/L (2005) | Exaflop, directly scaled | Exaflop, educated guess | assumption for educated guess.
- Node peak perf: 5.6GF | 20TF | 20TF | same node count (64k)
- Number of hardware threads/node: … | … | … | assume 3.5GHz, 3-D packaging
- System power in compute chip: 1 MW | 4 GW | 50 MW | 80x improvement (very optimistic)
- Link bandwidth (each unidirectional 3-D link): 1.4Gbps | 5Tbps | 1Tbps | not possible to maintain bandwidth ratio
- Wires per unidirectional 3-D link: … wires | … wires | 100 wires | a large wire count will eliminate high density and drive links onto cables, where they are 100x more expensive
- Pins in network on node: 24 pins | 6,000 pins | 1,200 pins | 20Gbps differential assumed
- Power in network: 100 KW | 38 MW | 8 MW | 10mW/Gbps assumed
- Memory bandwidth/node: 5.6GB/s | 20TB/s | 2TB/s | not possible to maintain external bandwidth/flop
- L2 cache/node: 4 MB | 16 GB | 500 MB | about 6-7 technology generations
- Data pins associated with memory/node: 128 pins | 32,000 pins | 4,000 pins | 5Gbps per pin
- Power in memory I/O (not DRAM): 12.8 KW | 50 MW | 6 MW | 5mW/Gbps assumed
- Total problem size (QCD example): 64^3x… | …^3x… | …^3x256 | approximately equal time to science
- QCD CG single-iteration time: 2.3 msec | 9.4 usec | 15 usec | requires (1) fast global sum, (2) hardware offload for messaging (driverless messaging)
- Memory footprint/node: 2.7 MB | 42 MB | 42 MB | memory footprint is no problem
Power associated with external memory will force high-efficiency computing to reside inside the chip (or chip stack). Network scaling will be both a latency and a bandwidth problem: bandwidth is a cost problem, and latency will require hardware offload to avoid nearly all software layers. Processing in a node will be done via thousand(s) of hardware units, each only somewhat faster than today's.
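The network-power rows follow from the stated 10mW/Gbps assumption. A sketch reproducing the educated-guess figure (the per-node rate of 6 links x 2 directions x 1 Tbps combines the table's link-bandwidth row with the usual 3-D torus link count; that combination is our reconstruction):

```python
def network_power_mw(nodes, gbps_per_node, mw_per_gbps=10.0):
    # total link power: nodes x per-node signaling rate x energy cost per Gbps
    watts = nodes * gbps_per_node * mw_per_gbps / 1000.0
    return watts / 1e6

# 6 links x 2 directions x 1 Tbps = 12,000 Gbps per node, over 64k nodes
print(network_power_mw(nodes=64 * 1024, gbps_per_node=12_000))  # ~8 MW
```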
18 System Power Efficiency
[Chart: GFLOP/s per Watt vs. year, contrasting power-efficient design focus (QCDSP Columbia, QCDOC Columbia/IBM, Blue Gene/L, BG/P) with single-thread-focus and commodity-driven designs (NASA SGI, Cray XT3, ASCI White, NCSA Xeon, LLNL Itanium 2, Power 3, SX-8, ASCI Q, ECMWF p690 Power 4+, Earth Simulator, Red Storm, Thunderbird, Purple, Fujitsu Bioserver).]
Blue Gene/P: a large peak power-efficiency advantage, but we still need dramatic improvement to enable computing in the future.
19 The Power Problem
[Figure: field-effect transistor with thick gate oxide vs. scaled gate oxide; 1.2 nm oxynitride.]
Oxide thickness is near the limit. Traditional CMOS scaling has ended: density improvements will continue, but power efficiency from technology will only improve very slowly. CMOS alone will no longer enable faster computers at similar power. The solution is not known! Architecture can help to some extent (witness the better power efficiency of commodity processors from simplification); new circuits can also help. This problem needs to be addressed now.
If power efficiency does not improve (system | 250 TF | 1 PF | 10 PF | 100 PF | 1000 PF):
- BlueGene/L: 1.0 MWatt | 2.5 MWatt | 25 MWatt | 250 MWatt | 2.5 GWatt
- Earth Simulator: 100 MWatt | 200 MWatt | 2 GWatt | 20 GWatt | 200 GWatt
- MareNostrum: 5 MWatt | 15 MWatt | 150 MWatt | 1.5 GWatt | 15 GWatt
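Beyond the first column, each row is just linear scaling of power with peak performance at a fixed GFLOP/s-per-Watt. A sketch using the BlueGene/L row's 1 PF point as the baseline (the choice of baseline is ours, for illustration):

```python
def projected_power_mw(target_pf, base_pf, base_mw):
    # if GFLOP/s per Watt stays flat, power scales linearly with peak performance
    return base_mw * (target_pf / base_pf)

# BlueGene/L row: 2.5 MW at 1 PF scales to 25 MW at 10 PF and 2.5 GW at 1000 PF
print(projected_power_mw(10, base_pf=1.0, base_mw=2.5))
print(projected_power_mw(1000, base_pf=1.0, base_mw=2.5))
```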
20 Summary/Conclusion
- BlueGene/L has achieved an application reach far broader than expected (or targeted in the design).
- Partnership and collaboration have been critical to exploiting BlueGene/L.
- BlueGene/P is an architectural evolution from BlueGene/L. Enhancements such as a hardware DMA engine promise the same or better per-node scaling on BlueGene/P. BlueGene/P offers a fully coherent 4-way node with a software stack designed to exploit parallelism, and approximately 2-3x speedup with respect to BlueGene/L for the same node count.
Future trends:
- Power will be a severe constraint in the future (and now).
- Large systems will have millions of threads, each similar in performance to today's.
- The challenges of power will apply to all systems (commercial and HPC). Market forces in the commodity commercial world could push in a different direction, potentially not well aligned with HPC.
- Reliability of future systems will require a holistic approach to reach extreme levels of scalability.
- Latency in networks will become a pinch point for capability computing.
More informationHigh Performance MPI on IBM 12x InfiniBand Architecture
High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction
More informationHigh Performance Computing Course Notes HPC Fundamentals
High Performance Computing Course Notes 2008-2009 2009 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
More informationHow то Use HPC Resources Efficiently by a Message Oriented Framework.
How то Use HPC Resources Efficiently by a Message Oriented Framework www.hp-see.eu E. Atanassov, T. Gurov, A. Karaivanova Institute of Information and Communication Technologies Bulgarian Academy of Science
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationPower Challenges in Extreme Scale Computing
Power Challenges in Extreme Scale Computing Hans Jacobson IBM T. J. Watson Research Center hansj@us.ibm.com ECTC Plenary Session - June, 20 Extreme Scale Computing What is extreme scale computing? Exascale
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationPerformance and Power Co-Design of Exascale Systems and Applications
Performance and Power Co-Design of Exascale Systems and Applications Adolfy Hoisie Work with Kevin Barker, Darren Kerbyson, Abhinav Vishnu Performance and Architecture Lab (PAL) Pacific Northwest National
More informationTowards Massively Parallel Simulations of Massively Parallel High-Performance Computing Systems
Towards Massively Parallel Simulations of Massively Parallel High-Performance Computing Systems Robert Birke, German Rodriguez, Cyriel Minkenberg IBM Research Zurich Outline High-performance computing:
More informationMulti-Core Microprocessor Chips: Motivation & Challenges
Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationThe Cray Rainier System: Integrated Scalar/Vector Computing
THE SUPERCOMPUTER COMPANY The Cray Rainier System: Integrated Scalar/Vector Computing Per Nyberg 11 th ECMWF Workshop on HPC in Meteorology Topics Current Product Overview Cray Technology Strengths Rainier
More informationSteve Scott, Tesla CTO SC 11 November 15, 2011
Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationTitan - Early Experience with the Titan System at Oak Ridge National Laboratory
Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid
More informationSystem Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited
System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited 2018 IEEE 68th Electronic Components and Technology Conference San Diego, California May 29
More informationIntroduction of Fujitsu s next-generation supercomputer
Introduction of Fujitsu s next-generation supercomputer MATSUMOTO Takayuki July 16, 2014 HPC Platform Solutions Fujitsu has a long history of supercomputing over 30 years Technologies and experience of
More informationIntroduction. Summary. Why computer architecture? Technology trends Cost issues
Introduction 1 Summary Why computer architecture? Technology trends Cost issues 2 1 Computer architecture? Computer Architecture refers to the attributes of a system visible to a programmer (that have
More informationCS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING Top 10 Supercomputers in the World as of November 2013*
CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING 2014 COMPUTERS : PRESENT, PAST & FUTURE Top 10 Supercomputers in the World as of November 2013* No Site Computer Cores Rmax + (TFLOPS) Rpeak (TFLOPS)
More informationThe TOP500 Project of the Universities Mannheim and Tennessee
The TOP500 Project of the Universities Mannheim and Tennessee Hans Werner Meuer University of Mannheim EURO-PAR 2000 29. August - 01. September 2000 Munich/Germany Outline TOP500 Approach HPC-Market as
More informationTOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology
TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology BY ERICH STROHMAIER COMPUTER SCIENTIST, FUTURE TECHNOLOGIES GROUP, LAWRENCE BERKELEY
More informationOverview. CS 472 Concurrent & Parallel Programming University of Evansville
Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University
More informationIBM HPC DIRECTIONS. Dr Don Grice. ECMWF Workshop November, IBM Corporation
IBM HPC DIRECTIONS Dr Don Grice ECMWF Workshop November, 2008 IBM HPC Directions Agenda What Technology Trends Mean to Applications Critical Issues for getting beyond a PF Overview of the Roadrunner Project
More informationThe Center for Computational Research & Grid Computing
The Center for Computational Research & Grid Computing Russ Miller Center for Computational Research Computer Science & Engineering SUNY-Buffalo Hauptman-Woodward Medical Inst NSF, NIH, DOE NIMA, NYS,
More informationInfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014
InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, June 2014 TOP500 Performance Trends 38% CAGR 78% CAGR Explosive high-performance
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationPreparing GPU-Accelerated Applications for the Summit Supercomputer
Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership
More information