Roadmapping of HPC interconnects
MIT Microphotonics Center, Fall Meeting, Nov. 21, 2008
Alan Benner, bennera@us.ibm.com
Outline
- Top500 Systems, Nov. 2008
  - Review of the most recent list & its implications for interconnect design
- Review of various high-end machine designs
  - RoadRunner: hybrid Opteron & Cell blades
  - Cray XT3/4/5
  - Blue Gene & Blue Gene/P
  - Ranger: SunBlade x6420
  - Power 575
- Summary: Systems & Interconnect Characteristics
Top500 list, Nov. 2008
- The 11/08 list includes 2 machines at >1 PFLOP/s; the top 6 machines together exceed 4 PFLOP/s.
- The top 6 systems are each *quite* different from each other -- an aggregation of outliers.
- Countries: the top 9 systems are in the US, #10 is in China (#11 is Germany, #13 is India, #14 is France).
- Source: H. Meuer, E. Strohmaier, J. Dongarra, H. Simon, Top500.org SC08 BOF presentation
Top500 list, development over time
- Steady development over time: 95%/year CAGR for N=500, 88%/yr for N=1.
- Roughly half of the improvement comes from faster cores & more cores per chip; roughly half from more chips & better interconnects.
- Note: N=500 grows slightly faster than N=1 -- quicker adoption of best practices?
- "If you have a job that runs a week on this new Roadrunner system and you took that job on the fastest computer 10 years ago, you would only be half done today." - Herb Schultz, IBM's director of marketing for deep computing
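As a rough illustration of what an 88-95%/year CAGR implies over a decade (a back-of-envelope sketch; the 10-year horizon comes from the quote above, everything else is simple arithmetic):

    # Growth factor over 10 years at the Top500 CAGRs quoted on this slide.
    for cagr in (0.88, 0.95):
        factor = (1.0 + cagr) ** 10     # compounded over 10 years
        print(f"CAGR {cagr:.0%}: ~{factor:,.0f}x in 10 years")
    # ~550x to ~800x: a job that takes one week on a 2008-era system would take
    # on the order of a decade on the fastest machine of 1998 -- the same order
    # of magnitude as the "only half done today" quote.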
Top500 list - some other interesting statistics
- Total number of CPU cores in all 500 systems: 3.12 million
- Total power: between 91 & 150 MegaWatts
  - This is the first year that electrical power was explicitly measured; the data is very incomplete.
  - By comparison: New York City uses 10,000-15,000 MegaWatts.
  - Note: 1 MW costs ~$1M/year at $0.114/kWh, roughly the US average rate.
- Total system value, at an average of $500 to $1,000 per core (including memory, storage, interconnect, packaging, ...): $1.5B to $3B
  - Not the whole IT market, by a long shot, but a significant slice.
- The top 6 of the 500 systems comprise nearly 25% of the full list's performance (~4 PF out of 16.7 PF).
- System sizes roughly follow an inverse power law (more small systems, fewer large systems), but with significant outliers.
[Histogram: system sizes by # of cores (bins from ~4,000 to ~52,000+ cores), frequency on a log scale]
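A quick check of the power-cost rule of thumb above (a minimal sketch; the $0.114/kWh rate is from the slide, the hours-per-year figure and rounding are mine):

    # 1 MW of continuous load for a year at roughly the US-average electricity rate.
    rate_per_kwh = 0.114            # $/kWh, from the slide
    kw = 1000                       # 1 MW expressed in kW
    hours_per_year = 24 * 365       # 8,760 hours
    annual_cost = kw * hours_per_year * rate_per_kwh
    print(f"1 MW for one year: ~${annual_cost:,.0f}")   # ~$999,000, i.e. ~$1M/year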
Top500 List, November 2008 - Interconnects in Top500 (by number of systems)
- Measured by the number of systems using a particular interconnect, Gigabit Ethernet is the leader, with ~56% of systems; InfiniBand is 2nd, at 28%; all others are negligible.
- Only 3 significant networks remain: Gigabit Ethernet, InfiniBand, & proprietary (IBM Blue Gene/L or /P, or Cray XT4/XT5).
  - Myrinet, Quadrics, SP Switch, etc. have nearly disappeared -- the market has matured to ~3-4 players.
[Pie chart: interconnects in the Top500 by # of systems -- Gigabit Enet, IB, BG & XT4/5]
Top500 List, November 2008 - Interconnects in Top500 (by performance share)
- ...But measured by total performance share, the higher-performance networks account for much more.
- Again, 3-4 HPC interconnect options:
  - High/Middle: IBM Blue Gene / Blue Gene/P & Cray XT4 / XT5
  - Middle/High: InfiniBand SDR/DDR (no QDR showing yet)
  - Low: Gigabit Ethernet (no 10G yet -- cost/performance not good enough)
[Pie charts: interconnects in the Top500 by # of systems vs. by total performance share, where performance share = # cores * Linpack Ops/s per core]
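A minimal sketch of how the performance-share metric on this slide can be tallied from list data (the entries below are made-up examples purely to illustrate the aggregation, not actual Top500 records):

    # Tally Linpack performance share by interconnect family.
    # Each entry: (interconnect, Rmax in TFLOP/s). Values are illustrative only.
    systems = [
        ("InfiniBand", 1105.0),
        ("Proprietary", 478.2),
        ("Gigabit Ethernet", 18.6),
        ("InfiniBand", 433.2),
    ]
    totals = {}
    for interconnect, rmax in systems:
        totals[interconnect] = totals.get(interconnect, 0.0) + rmax
    grand_total = sum(totals.values())
    for interconnect, share in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{interconnect:18s} {100.0 * share / grand_total:5.1f}%")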
Interconnects: # of systems vs. total performance, 2008
- The left column below shows interconnect share by # of systems; the right column shows share by performance.
- Gigabit Ethernet dominates among smaller systems, which have fewer processors and fewer links.
- Note: the # of links scales super-linearly with system size, so the share of interconnect BW (links) is higher for Proprietary and InfiniBand than the performance column suggests.

  Interconnect         By # of systems   By performance
  Gigabit Ethernet          56.4%            29.18%
  InfiniBand                28.2%            38.82%
  Proprietary                8.4%            24.42%
  Cray Interconnect          1.2%             2.12%
  Myrinet                    2.0%             2.06%
  SP Switch                  2.0%             1.35%
  Others                     1.8%             2.05%
Value of a Cluster Network: Network-Dependent Scalability, 2005
[Scatter plot: Linpack actual max performance (GF) vs. theoretical peak performance (GF) on log-log axes, for Ethernet-, Myrinet-, and InfiniBand-interconnected systems of ~400 to ~8,000 CPUs. Data: www.top500.org, Nov. 2005]

  Average efficiency (Actual/Peak) across all system sizes:
  Network        Link rate     Latency    Efficiency
  Gig Ethernet   (1+1) Gb/s    10-20 µs     54.1%
  Myrinet-2000   (2+2) Gb/s    5-6 µs       64.1%
  IB 4x SDR      (8+8) Gb/s    5-6 µs       73.2%

- Application = Linpack; nodes: dual-processor Intel Xeon (2.6-3.6 GHz), Opteron (2.2-2.6 GHz), or Power (2.3 GHz).
- Systems interconnected with higher-performance networks get better use out of their processors on parallel applications.
- The benefit of the cluster network grows with system size: ~25% difference at 1,000 CPUs, >100% at >2,000.
- Benefits will also, of course, be application-dependent: embarrassingly-parallel codes depend less on the network; tightly-coupled apps depend more than Linpack does.
Value of a Cluster Network: Network-Dependent Scalability, 2008
[Scatter plot: Linpack actual max performance (GF) vs. theoretical peak performance (GF) on log-log axes, for Ethernet- and InfiniBand-interconnected systems of ~2,000 to ~32,000 CPU cores. Data: www.top500.org, Nov. 2008]

  Average efficiency (Actual/Peak) across all system sizes:
  Network        Link rate       Latency   Efficiency
  Gig Ethernet   (1+1) Gb/s      8-12 µs     51.0%
  IB 4x DDR      (16+16) Gb/s    1-2 µs      76.0%

- Application = Linpack; nodes: various -- Intel Xeon, Opteron, or Power.
- Three years later, all the numbers are bigger, but the trends & effects are the same as in 2005, or amplified.
  - Fewer imbalanced systems, but still a few outliers.
- IB-linked systems (higher BW, lower latency) get ~1.5x the performance from the same # of CPU cores.
  - Similar benefits in energy efficiency & cost-efficiency.
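A minimal sketch of the efficiency metric used on these two slides and of where the ~1.5x figure comes from (the example Rmax/Rpeak values are hypothetical; the 51%/76% averages are from the table above):

    # Linpack efficiency = actual performance (Rmax) / theoretical peak (Rpeak).
    def efficiency(rmax_gf, rpeak_gf):
        return rmax_gf / rpeak_gf

    # Hypothetical system, just to show the metric:
    print(f"example system: {efficiency(76_000, 100_000):.0%} efficient")

    # With identical nodes (same Rpeak), delivered performance tracks the ratio
    # of the average efficiencies in the 2008 table above:
    gige_eff, ib_ddr_eff = 0.51, 0.76
    print(f"IB 4x DDR vs. GigE on the same cores: ~{ib_ddr_eff / gige_eff:.2f}x")  # ~1.49x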
Outline
- Top500 Systems, Nov. 2008
  - Review of the most recent list & its implications for interconnect design
- Review of various high-end machine designs
  - RoadRunner: hybrid Opteron & Cell blades
  - Cray XT3/4/5
  - Blue Gene & Blue Gene/P
  - Ranger: SunBlade x6420
  - Power 575
- Summary: Systems & Interconnect Characteristics
Roadrunner at a glance: statistics as of 11/2008
- Cluster of 18 Connected Units (CUs)
  - 12,960 IBM PowerXCell 8i accelerators
  - 6,480 AMD dual-core Opterons (compute)
  - 432 AMD dual-core Opterons (I/O)
  - 36 AMD dual-core Opterons (management)
  - 1.41 Petaflop/s peak (PowerXCell)
  - 46.6 Teraflop/s peak (Opteron compute)
  - 1.105 Petaflop/s sustained Linpack
- InfiniBand 4x DDR fabric
  - 2-stage fat-tree; all-optical cables
  - Full bi-directional bisection BW: 384 GB/s per CU, 3.3 TB/s system-wide
  - Non-disruptive expansion to 24 CUs (1/3 bigger)
- 103 TB aggregate memory: 51.8 TB Opteron, 51.8 TB Cell
- 432 GB/s peak file-system I/O: 216x2 10G Ethernets to Panasas
- RHEL & Fedora Linux; SDK for Multicore Acceleration; xCAT cluster management
  - System-wide GigE management network
- 2.48 MW power (Linpack): 445 Megaflop/s per Watt -- the most power-efficient system other than Cell-only systems
- Other: 294 racks; 5,500 ft²; 500,000 lbs.; >55 miles of InfiniBand cables
- Hybrid "TriBlade" nodes (Opteron + Cell); operated by Los Alamos National Security, LLC for NNSA
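A quick consistency check of the power-efficiency figure above (a back-of-envelope sketch using only numbers from this slide):

    # Roadrunner power efficiency: sustained Linpack divided by Linpack power draw.
    sustained_pflops = 1.105          # PFLOP/s sustained Linpack
    power_mw = 2.48                   # MW drawn during the Linpack run
    mflops = sustained_pflops * 1e9   # 1 PFLOP/s = 1e9 MFLOP/s
    watts = power_mw * 1e6
    print(f"~{mflops / watts:.0f} MFLOP/s per Watt")   # ~446, matching the ~445 on the slide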
Roadrunner packaging & topology
- 2-layer Clos-style network, using 288-port IB switches for both leaf and core -- 6 levels of switch, altogether.
- All interconnect cables are optical.
  - Copper could have worked for some, but optical is easier to deal with, more reliable, & lower power.
  - Homogeneity of technology is a huge plus.
[Diagram: Connected Units 1-18, each with an ISR9288 IB 4x switch and 96 optical uplinks to 8 core ISR9288 IB 4x DDR switches (some ports unused); each CU mixes I/O + compute, compute, service + compute, and switch + compute racks]
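Not from the slides: a rough, generic sketch of how a two-level folded-Clos (fat-tree) built from radix-R switches scales, using the standard counting argument rather than Roadrunner's exact wiring (real systems typically deploy far fewer leaves than this maximum and taper the uplinks):

    # Generic 2-level folded-Clos (fat-tree) sizing with radix-R switches.
    # Each leaf switch splits its R ports: R/2 down to nodes, R/2 up to the core.
    def two_level_fat_tree(radix):
        down_per_leaf = radix // 2                 # node-facing ports per leaf
        max_leaves = radix                         # each core switch reaches every leaf once
        max_nodes = down_per_leaf * max_leaves     # R^2 / 2 at full bisection
        core_switches = down_per_leaf              # R/2 core switches needed
        return max_nodes, core_switches

    nodes, cores = two_level_fat_tree(288)
    print(f"radix-288 switches: up to {nodes:,} node ports behind {cores} core switches")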
Roadrunner blades, racks, switches, & cables
[Photos, taken during the build in Poughkeepsie, NY: hybrid Opteron + Cell blades in BladeCenter combination racks; 288-port IB 4x DDR core switches; active optical cables, (20+20) Gb/s each]
Cray XT3/XT4/XT5
- AMD Opteron quad-core sockets, connected to SeaStar / SeaStar2 / SeaStar2+ bridge/router/DMA ASICs.
- 3-D torus topology (6-port routers) allows <6 meter cables.
  - Little or no fiber, except to storage.
- Photos: Dave Bullock / eecue, eecue.com
Blue Gene & Blue Gene/P
- SoC chip: 4 CPU cores + memory interface + router.
- 3-D torus topology (plus extra low-BW networks).
  - No optics needed in this generation.
[Photo: Blue Gene/P]
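Not from the slides: a minimal sketch of the 6-neighbor structure that the Cray XT and Blue Gene torus networks both rely on; the coordinates and dimensions below are illustrative only.

    # Neighbors of a node at (x, y, z) in an X x Y x Z 3-D torus.
    # Every node has exactly 6 links (+/-1 in each dimension, with wrap-around),
    # which is why 6-port routers suffice and cables can stay short.
    def torus_neighbors(x, y, z, X, Y, Z):
        return [
            ((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
        ]

    # Example: a "corner" node in an 8 x 8 x 8 torus still has 6 neighbors,
    # because the edges wrap around.
    print(torus_neighbors(0, 0, 0, 8, 8, 8))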
Ranger, UT Austin - SunBlade x6420, InfiniBand
- Opteron quad-core blades.
- Top-of-rack IB leaf switches form the 1st level of switching.
- Core IB switch: 3,456 ports to the leaf switches, at 4x DDR (20+20) Gb/s each, using 12x connectors.
- 3-level Clos, built with 24-port DDR switch chips.
  - Still all copper -- *heavy* & large cables.
  - QDR will need optical cables.
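A quick check, not on the slide itself, that a 3-level folded-Clos of 24-port chips can present 3,456 ports; this is the standard k-ary fat-tree count with k = 24:

    # A 3-stage folded-Clos (fat-tree) built from k-port switch chips exposes
    # up to k^3 / 4 end-node ports at full bisection bandwidth.
    k = 24                       # ports per DDR switch chip
    ports = k ** 3 // 4
    print(f"{k}-port chips -> up to {ports:,} ports")   # 3,456, matching Ranger's core switch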
BlueFire, NCAR - Power 575 (P5-575)
- Rack-mount drawer, 16 two-core MCMs per drawer.
- InfiniBand network runs directly to the core switches.
  - Copper is used in this machine; active optical is possible.
- Water cooling to all processors.
  - ~40% savings in power-delivery efficiency.
  - Other advantages: better reliability, density, and reduced impact on the data-center environment / temperature.
Outline
- Top500 Systems, Nov. 2008
  - Review of the most recent list & its implications for interconnect design
- Review of various high-end machine designs
  - RoadRunner: hybrid Opteron & Cell blades
  - Cray XT3/4/5
  - Blue Gene & Blue Gene/P
  - Ranger: SunBlade x6420
  - Power 575
- Summary: Systems & Interconnect Characteristics
Take-home messages
- Supercomputer and HPC architecture is still heterogeneous (i.e., interesting):
  - Processors: Intel / AMD / Power, ...
  - Co-processors: vector units, Cell processors, FPGAs, GPUs, ...
  - Networks: torus, Clos, mixtures, ...
  - Scalability design
- Networks are still heterogeneous as well (with some signs of maturing).
- Overall system design -- topology (torus/Clos/...), packaging (blades/drawers/racks/...), and usage (convenience of installation, ...) -- all affect the use of optics vs. copper.
- Active optical cabling makes system design much easier; steady & fast progress toward more optics.
- The limit of 10 [meters * gigabits/sec] as the copper-to-optics cross-over point is still pretty valid (see the sketch below).
  - CTR-I used the same number, in different units: 10 [kilometers * megabits/sec].
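A minimal sketch of applying the 10 [Gb/s x meters] rule of thumb from the last bullet; the example link rates and cable lengths are illustrative assumptions, and real designs also weigh cost, weight, and power:

    # Rule of thumb from the talk: when (data rate) x (cable length) exceeds
    # roughly 10 Gb/s x meters, optics starts to win over copper.
    CROSSOVER_GBPS_M = 10.0

    def prefer_optics(rate_gbps, length_m):
        return rate_gbps * length_m > CROSSOVER_GBPS_M

    # Illustrative link/length combinations (assumptions, not recommendations):
    for rate, length in [(1.0, 5.0),      # GigE within a rack row
                         (20.0, 3.0),     # IB 4x DDR between racks
                         (20.0, 0.4)]:    # IB 4x DDR within a chassis
        verdict = "optics" if prefer_optics(rate, length) else "copper is fine"
        print(f"{rate:4.1f} Gb/s x {length:3.1f} m -> {verdict}")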
Appendix: The downside of Massive Parallelism
A few real-world scenarios
...and the Upside of Massive Parallelism: More Insight
...and the Upside of Massive Parallelism: More Insight
For example: weather simulation
[Images: weather-simulation output at increasing resolution, circa ~1995, ~2000, and ~2005]