
Roadmapping of HPC interconnects
MIT Microphotonics Center, Fall Meeting, Nov. 21, 2008
Alan Benner, bennera@us.ibm.com

Outline
Top500 Systems, Nov. 2008
- Review of the most recent list & its implications for interconnect design
Review of various high-end machine designs
- RoadRunner: Hybrid Opteron & Cell blades
- Cray XT3/4/5
- Blue Gene & Blue Gene/P
- Ranger: SunBlade x6420
- Power 575
Summary: Systems & Interconnect Characteristics

Top500 list, Nov. 2008
- The 11/08 list includes 2 machines at >1 PFLOP/s; the top 6 machines together exceed 4 PFLOP/s.
- The top 6 systems are each *quite* different from each other -- an aggregation of outliers.
- Countries: the top 9 are in the US, #10 is in China (#11 is Germany, #13 is India, #14 is France).
Source: H. Meuer, E. Strohmaier, J. Dongarra, H. Simon, Top500.org SC08 BOF presentation

Top500 list: development over time
- Steady development over time: ~95%/year CAGR for the N=500 system, ~88%/year for N=1.
- Roughly half of the improvement comes from faster cores & more cores per chip; roughly half from more chips & better interconnects.
- Note: N=500 grows slightly faster than N=1. Quicker adoption of best practices?
"If you have a job that runs a week on this new Roadrunner system and you took that job on the fastest computer 10 years ago, you would only be half done today."
-- Herb Schultz, IBM's director of marketing for deep computing
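As a rough consistency check (not from the slides themselves), the Schultz anecdote implies a particular 10-year speedup, which can be compared against the quoted 88-95%/year growth rates. A back-of-the-envelope sketch, with all values assumed for illustration:

```python
# Back-of-the-envelope check: "started 10 years ago, only half done today"
# means the older machine would need ~20 years for a job Roadrunner
# finishes in one week. All values here are assumptions for illustration.
old_runtime_years = 20.0
new_runtime_years = 7 / 365.25
implied_speedup = old_runtime_years / new_runtime_years  # roughly 1,000x

# Compound the quoted per-year growth rates over 10 years for comparison.
for cagr in (0.88, 0.95):
    ten_year_factor = (1 + cagr) ** 10
    print(f"{cagr:.0%}/yr for 10 years -> ~{ten_year_factor:,.0f}x "
          f"(anecdote implies ~{implied_speedup:,.0f}x)")
```

Both come out near three orders of magnitude, so the quote is broadly in line with the measured Top500 growth rates.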

Top500 list: some other interesting statistics
- Total number of CPU cores across all 500 systems: 3.12 million.
- Total power: between 91 and 150 MegaWatts.
  - This is the first year that electrical power was explicitly measured; the data is very incomplete.
  - By comparison, New York City uses 10,000-15,000 MegaWatts.
  - Note: 1 MW costs roughly $1M/year at $0.114/kWh, roughly the US average rate.
- Total system value, at an average of $500 to $1,000 per core (including memory, storage, interconnect, packaging, ...): $1.5B to $3B.
  - Not the whole IT market, by a long shot, but a significant slice.
- The top 6 of the 500 systems comprise nearly 25% of the full list's performance (~4 PF out of 16.7 PF).
- System sizes roughly follow an inverse power law (more small systems, fewer large systems), but with significant outliers.
[Figure: histogram of system sizes by # of cores, frequency on a log scale, roughly 4,000 to >52,000 cores per system]
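A minimal sketch verifying the "1 MW is $1M/year" rule of thumb at the quoted $0.114/kWh rate; continuous 24x7 operation is the only added assumption:

```python
# Rule-of-thumb check: 1 MW of continuous load at ~$0.114/kWh
# costs roughly $1M per year.
RATE_USD_PER_KWH = 0.114        # quoted rough US average electricity rate
HOURS_PER_YEAR = 365.25 * 24    # continuous operation assumed

kwh_per_year = 1_000 * HOURS_PER_YEAR            # 1 MW = 1,000 kW
annual_cost = kwh_per_year * RATE_USD_PER_KWH    # ~ $1.0M
print(f"1 MW for a year: {kwh_per_year:,.0f} kWh -> ${annual_cost:,.0f}")
```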

Top500 List, November 2008: Interconnects in the Top500
- Measured by the number of systems using a particular interconnect, Gigabit Ethernet is the leader, with ~56% of systems using it; InfiniBand is 2nd, at 28%; the others are negligible.
- Only 3 significant networks: Gigabit Ethernet, InfiniBand, & proprietary (IBM Blue Gene/L / Blue Gene/P or Cray XT4/XT5).
  - Myrinet, Quadrics, SP Switch, etc. have nearly disappeared -- the market has matured to ~3-4 players.
[Chart: interconnects in the Top500 by # of systems -- Blue Gene & XT4/5, InfiniBand, Gigabit Ethernet]

Top500 List, November 2008: Interconnects in the Top500 (continued)
- ...But measured by total performance share, the higher-performance networks show up more strongly.
- Again, 3-4 HPC interconnect options:
  - High/Middle: IBM Blue Gene / Blue Gene/P & Cray XT4 / XT5
  - Middle/High: InfiniBand SDR/DDR (no QDR showing yet)
  - Low: Gigabit Ethernet (no 10G yet -- cost/performance not good enough)
[Charts: interconnects in the Top500 by # of systems and by total performance share, where performance share = # cores * Linpack ops/s per core]

Interconnects: # of systems vs. total performance (# cores) in 2008
- The table below shows each interconnect's share by # of systems and by performance.
- Gigabit Ethernet dominates in the smaller systems, with fewer processors and fewer links.
- Note: the # of links scales super-linearly with system size, so the share of interconnect bandwidth (links) is even higher for Proprietary and InfiniBand than the performance column suggests.

  Interconnect        By # of systems   By performance
  Gigabit Ethernet    56.4%             29.18%
  InfiniBand          28.2%             38.82%
  Proprietary          8.4%             24.42%
  Cray Interconnect    1.2%              2.12%
  Myrinet              2.0%              2.06%
  SP Switch            2.0%              1.35%
  Others               1.8%              2.05%

Value of a Cluster Network: Network-Dependent Scalability, 2005
[Figure: Linpack theoretical peak vs. actual max performance (GF), 2005, for Ethernet-, Myrinet-, and InfiniBand-interconnected systems of roughly 400 to 8,000 CPUs. Data: www.top500.org, Nov. 2005]
Average efficiency (actual/peak) across all system sizes:
  Gigabit Ethernet: (1+1) Gbps, 10-20 µs    54.1%
  Myrinet-2000:     (2+2) Gb/s, 5-6 µs      64.1%
  IB-4x-SDR:        (8+8) Gbps, 5-6 µs      73.2%
Application = Linpack. Nodes: dual-processor Intel Xeon (2.6-3.6 GHz), Opteron (2.2-2.6 GHz), or Power (2.3 GHz).
- Systems interconnected with higher-performance networks get better use out of their processors on parallel applications.
- The benefit of the cluster network grows with system size: ~25% difference at 1,000 CPUs, >100% at >2,000 CPUs.
- Benefits will also, of course, be application-dependent: embarrassingly-parallel codes depend less on the network; tightly-coupled apps depend on it more than Linpack does.
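The "average efficiency" figures above are simply Linpack Rmax/Rpeak ratios averaged per interconnect family. A minimal sketch of that calculation; the records below are hypothetical placeholders, not entries from the Nov. 2005 list:

```python
from collections import defaultdict

# Hypothetical Top500-style records: (interconnect, Rmax in GF, Rpeak in GF).
systems = [
    ("Gigabit Ethernet",  2700.0, 5000.0),
    ("Gigabit Ethernet",  1600.0, 3000.0),
    ("Myrinet-2000",      3200.0, 5000.0),
    ("InfiniBand 4x SDR", 3700.0, 5000.0),
]

efficiencies = defaultdict(list)
for interconnect, rmax, rpeak in systems:
    efficiencies[interconnect].append(rmax / rpeak)  # per-system Linpack efficiency

for interconnect, effs in efficiencies.items():
    print(f"{interconnect:18s} average efficiency {sum(effs) / len(effs):.1%}")
```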

Value of a Cluster Network: Network-Dependent Scalability, 2008
[Figure: Linpack theoretical peak vs. actual max performance (GF), 2008, for Ethernet- and InfiniBand-interconnected systems of roughly 2,000 to 32,000 CPU cores. Data: www.top500.org, Nov. 2008]
Average efficiency (actual/peak) across all system sizes:
  Gigabit Ethernet: (1+1) Gbps, 8-12 µs     51.0%
  IB-4x-DDR:        (16+16) Gbps, 1-2 µs    76.0%
Application = Linpack. Nodes: various -- Intel Xeon, Opteron, or Power.
- Three years later, all the numbers are bigger, but the trends & effects are the same as, or amplified from, 2005.
  - Fewer imbalanced systems, but still a few outliers.
- IB-linked systems (higher bandwidth, lower latency) get ~1.5x the performance from the same # of CPU cores.
  - Similar benefits in energy efficiency & cost-efficiency.

Outline
Top500 Systems, Nov. 2008
- Review of the most recent list & its implications for interconnect design
Review of various high-end machine designs
- RoadRunner: Hybrid Opteron & Cell blades
- Cray XT3/4/5
- Blue Gene & Blue Gene/P
- Ranger: SunBlade x6420
- Power 575
Summary: Systems & Interconnect Characteristics

Roadrunner at a glance: statistics as of 11/2008
Cluster of 18 Connected Units (CUs)
- 12,960 IBM PowerXCell 8i accelerators
- 6,480 AMD dual-core Opterons (compute)
- 432 AMD dual-core Opterons (I/O)
- 36 AMD dual-core Opterons (management)
- 1.41 Petaflop/s peak (PowerXCell)
- 46.6 Teraflop/s peak (Opteron compute)
- 1.105 Petaflop/s sustained Linpack
InfiniBand 4x DDR fabric
- 2-stage fat-tree; all-optical cables
- Full bi-directional bisection bandwidth: 384 GB/s per CU, 3.3 TB/s for the system
- Non-disruptive expansion to 24 CUs (1/3 bigger)
103 TB aggregate memory
- 51.8 TB Opteron
- 51.8 TB Cell
432 GB/s peak file-system I/O
- 216 x 2 10G Ethernets to Panasas
RHEL & Fedora Linux
SDK for Multicore Acceleration
xCAT cluster management
- System-wide Gigabit Ethernet management network
2.48 MW power (Linpack)
- 445 Megaflop/s per Watt -- the most power-efficient system other than Cell-only systems
Other
- 294 racks, 5,500 ft², 500,000 lbs.
- >55 miles of InfiniBand cables
[Image: TriBlade hybrid node]
Operated by Los Alamos National Security, LLC for NNSA
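The 445 Megaflop/s-per-Watt figure follows directly from the sustained-Linpack and power numbers quoted above; a one-line check:

```python
# Power efficiency implied by the quoted Roadrunner numbers.
linpack_flops_per_s = 1.105e15   # 1.105 Petaflop/s sustained Linpack
power_watts = 2.48e6             # 2.48 MW during the Linpack run

mflops_per_watt = linpack_flops_per_s / power_watts / 1e6
print(f"{mflops_per_watt:.0f} Megaflop/s per Watt")   # ~446
```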

Roadrunner Packaging & Topology
- 2-layer Clos-style network, using 288-port IB switches for both leaf and core -- 6 levels of switching altogether.
- All interconnect cables are optical.
  - Copper could have worked for some, but optical is easier to deal with, more reliable, & lower power.
  - Homogeneity of technology is a huge plus.
[Diagram: 18 Connected Units, each with ISR9288 IB 4x switches and 96 optical uplinks to 8 core ISR9288 IB 4x DDR switches (576 ports, some unused); each CU mixes I/O + compute (x8), compute (x6), service + compute, and switch + compute nodes]
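For a sense of scale, a non-blocking two-level folded Clos (fat tree) built from radix-288 switches can connect up to 288 * 144 = 41,472 endpoints, comfortably above Roadrunner's node count. The sketch below is generic port-count arithmetic, not a description of Roadrunner's exact wiring:

```python
def two_level_fat_tree_capacity(radix: int) -> int:
    """Max endpoints of a non-blocking 2-level folded Clos: each leaf switch
    uses half its ports for hosts and half for uplinks, and each core switch
    can reach at most `radix` leaf switches (one port per leaf)."""
    hosts_per_leaf = radix // 2
    max_leaves = radix
    return hosts_per_leaf * max_leaves

print(two_level_fat_tree_capacity(288))  # 41,472 endpoints at full bisection
```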

Roadrunner Blades, Racks, Switches, & Cables
- Hybrid blades: Opteron + Cell combination, in BladeCenter racks
- Core switches: 288-port IB 4x DDR
- Active optical cables, (20+20) Gb/s each
[Photos taken during the build in Poughkeepsie, NY]

Cray XT3/XT4/XT5
- AMD Opteron quad-core sockets, each connected to a SeaStar / SeaStar2 / SeaStar2+ bridge/router/DMA ASIC.
- 3-D torus topology (6-port routers) allows <6-meter cables.
  - Little or no fiber, except to storage.
Photos: Dave Bullock / eecue, eecue.com
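The 6-port SeaStar router matches the 3-D torus: every node has exactly six nearest neighbors (±x, ±y, ±z), with wraparound links closing each dimension. A small sketch of that neighbor calculation, using an assumed (not actual) torus shape:

```python
def torus_neighbors(node, dims):
    """Six nearest neighbors of `node` in a 3-D torus of shape `dims`,
    with wraparound in every dimension."""
    x, y, z = node
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# Example with an assumed 8 x 8 x 8 torus; note the wraparound in z.
print(torus_neighbors((0, 0, 7), (8, 8, 8)))
```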

Blue Gene & Blue Gene/P
- SoC chip: 4 CPU cores + memory interface + router.
- 3-D torus topology (plus extra low-bandwidth networks).
  - No optics needed in this generation.
[Photo: Blue Gene/P]

Ranger, UT Austin: SunBlade x6420, InfiniBand
- Opteron quad-core blades.
- Top-of-rack IB leaf switches form the 1st level of switching.
- Core IB switch: 3,456 ports to the leaf switches, at 4x DDR (20+20) Gb/s each, using 12x connectors.
- 3-level Clos, built with 24-port DDR switch chips.
- Still all copper -- *heavy* & large cables; QDR will need optical cables.
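The 3,456-port figure is consistent with standard fat-tree arithmetic: a non-blocking folded Clos built from radix-24 chips supports 2 * (24/2)^3 = 3,456 endpoints across 3 levels. A quick check using the generic formula (not taken from the slides):

```python
def fat_tree_capacity(radix: int, levels: int) -> int:
    """Max endpoints of a non-blocking folded Clos (fat tree) with the given
    number of switch levels: 2 * (radix/2) ** levels."""
    return 2 * (radix // 2) ** levels

print(fat_tree_capacity(24, 3))    # 3,456 -- matches Ranger's core switch
print(fat_tree_capacity(288, 2))   # 41,472 -- cf. the Roadrunner sketch above
```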

BlueFire, NCAR: P5-575
- Rack-mount drawers, 16 two-core MCMs per drawer.
- InfiniBand network runs directly to the core switches.
  - Copper is used in this machine; active optical is possible.
- Water-cooling to all processors.
  - ~40% savings in power-delivery efficiency.
  - Other advantages: better reliability, density, and reduced impact on the data-center environment / temperature.

Outline
Top500 Systems, Nov. 2008
- Review of the most recent list & its implications for interconnect design
Review of various high-end machine designs
- RoadRunner: Hybrid Opteron & Cell blades
- Cray XT3/4/5
- Blue Gene & Blue Gene/P
- Ranger: SunBlade x6420
- Power 575
Summary: Systems & Interconnect Characteristics

Take-home messages
- Supercomputer and HPC architecture is still heterogeneous (i.e., interesting):
  - Processors: Intel / AMD / Power, ...
  - Co-processors: vector units, Cell processors, FPGAs, GPUs, ...
  - Networks: torus, Clos, mixtures, ...
  - Scalability design
- Networks are still heterogeneous as well (with some signs of maturing).
- Overall system design -- topology (torus/Clos/...), packaging (blades/drawers/racks/...), and usage (convenience of installation, ...) -- all affect the use of optics vs. copper.
- Active optical cabling makes system design much easier. Steady & fast progress towards more optics.
- The limit of 10 [meters * gigabits/sec] as the cross-over point is still pretty valid (see the sketch below).
  - CTR-I used the same number, in different units: 10 [kilometers * megabits/sec].
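A minimal sketch of the cross-over rule quoted in the last bullet: prefer optics once the bandwidth-distance product exceeds roughly 10 Gb/s * m. The threshold comes from the slide; the example link parameters are illustrative assumptions:

```python
CROSSOVER_GBPS_METERS = 10.0   # rule of thumb from the slide: 10 [m * Gb/s]

def prefer_optics(link_gbps: float, length_m: float) -> bool:
    """True if the bandwidth-distance product suggests an optical link."""
    return link_gbps * length_m > CROSSOVER_GBPS_METERS

# Illustrative links (parameters assumed, not from the slides):
for gbps, meters in [(1, 5), (20, 0.3), (20, 6), (40, 2)]:
    medium = "optics" if prefer_optics(gbps, meters) else "copper"
    print(f"{gbps:>3} Gb/s over {meters:>4} m -> {medium}")
```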

Appendix: The Downside of Massive Parallelism
- A few real-world scenarios

...and the Upside of Massive Parallelism: More Insight

...and the Upside of Massive Parallelism: More Insight
For example: weather simulation
[Images: simulation output at roughly 1995, 2000, and 2005 resolution]