FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS


1 FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS, A.K.A. DFSSSP ROUTING
Torsten Hoefler and Jens Domke
All images used belong to their producers.

2 MOTIVATION FOR BETTER ROUTING
New challenging problems need more bandwidth
  Scientific computing
  Big data analytics
  Distributed graph problems
InfiniBand offers advanced routing (kind of)
  IB spec mandates deadlock-free, deterministic, distributed routing
  OpenSM is the standard routing software
    Optimal algorithms for fat-tree and torus networks
    MinHop for others (can produce credit loops)
  Commercial alternatives exist
T. Hoefler and J. Domke Slide 2 of 28

3 FORGET FULL BISECTION BANDWIDTH
Expensive topologies do not guarantee high bandwidth
Deterministic oblivious routing cannot reach full bandwidth!
  See Valiant's lower bound
  Random routing is asymptotically optimal but loses locality
InfiniBand routing:
  Deterministic oblivious, destination-based, simple
  Linear forwarding table (LFT) at each switch
  LID mask control (LMC) enables multiple addresses per port
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

4 BUT MY VENDOR SAID NON-BLOCKING
Two communications: 1 → 6, 4 → 14
Full bisection bandwidth network
No full bandwidth observed!
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

5 SO HOW BAD IS CONGESTION?
CHiC Supercomputer: slightly aged but reflects routing
  566 nodes, full bisection IB fat-tree
  No endpoint congestion!
Effective bisection bandwidth:
  Microbenchmarks (yet to be seen in practice)
  Lower bound! (~ GigE speed)
  Reality?
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

6 BUT I HAVE A CLEVER SUBNET MANAGER!
OpenSM (IB) routing algorithms:
  minhop (finds minimal paths, balances the number of routes locally at each switch)
  updn/dnup (uses Up*/Down* turn control, limits choice but routes contain no credit loops)
  ftree (fat-tree optimized routing, no credit loops)
  dor/torus-2qos (dimension order routing for k-ary n-cubes, DOR might generate credit loops)
  lash (uses DOR and breaks credit loops with virtual lanes)
It's clever if you have a fat-tree or a torus
  But beware if you add or remove one link!
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

7 WHY ARBITRARY TOPOLOGIES?
Many networks grow over time or fulfill more than one purpose
  Fat-trees and butterflies are hard to grow
  Torus networks may have undesirable properties
IB supports arbitrary topologies!
Hybrid networks exist

8 EFFECTIVE BISECTION BANDWIDTH
A measure for global network performance
  Considers routing
  Can be measured with a benchmark
  More realistic than bisection bandwidth
Effective Bisection Bandwidth (ebb) benchmark
  Divide network into equal partitions A and B (combinations)
  Find one peer in B for each node in A (pairings)
  Huge number of patterns
  Statistics converge fast (~1000 measurements)
  Implemented in Netgauge/eBB (download and try!)
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks
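The pattern construction described on this slide (split the nodes into two equal halves, then give every node in one half exactly one peer in the other) can be sketched in a few lines. This is only an illustration, not Netgauge's implementation: the function name `random_bisection_pattern` and the use of integer node ranks are assumptions made for the example.

```python
import random

def random_bisection_pattern(nodes, rng):
    """Split nodes into equal halves A and B and pair each node in A with one peer in B.

    Hypothetical helper for illustration; not part of Netgauge.
    """
    shuffled = list(nodes)
    if len(shuffled) % 2 != 0:
        raise ValueError("an equal bisection needs an even number of nodes")
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]
    # After a uniform shuffle, zipping the two halves yields a random pairing.
    return list(zip(part_a, part_b))

# One random pattern for 8 nodes; a benchmark would draw many such patterns
# and average the bandwidth measured for each.
pattern = random_bisection_pattern(range(8), random.Random(42))
```

Each returned pair would then exchange data simultaneously; repeating this for many random patterns gives the statistics the slide refers to.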

9 ACHIEVING HIGH BANDWIDTH
Model network as $G = (V_P \cup V_C, E)$
Path $r(u,v)$ is a path between $u, v \in V_P$
Routing $R$ consists of $P(P-1)$ paths
Edge load $l(e)$ = number of paths on $e \in E$
Edge forwarding index $\pi(G,R) = \max_{e \in E} l(e)$
$\pi(G,R)$ is an upper bound to congestion!
Goal is to find $R$ that minimizes $\pi(G,R)$
  Shown to be NP-hard in the general case
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
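To make the definitions on this slide concrete, here is a small sketch that computes the edge load and the edge forforwarding index for a given set of paths. The example topology (three endpoints on one switch `"s"`) and the choice to count each link direction separately are assumptions for illustration only.

```python
from collections import Counter

def edge_loads(paths):
    """Count how many paths traverse each directed edge (u, v)."""
    load = Counter()
    for path in paths:
        for u, v in zip(path, path[1:]):
            load[(u, v)] += 1
    return load

def edge_forwarding_index(paths):
    """Maximum edge load over all edges used by the routing."""
    loads = edge_loads(paths)
    return max(loads.values()) if loads else 0

# Hypothetical routing: endpoints 0, 1, 2 attached to a single switch "s",
# giving P(P-1) = 6 paths for P = 3 endpoints.
example_paths = [
    [0, "s", 1], [0, "s", 2],
    [1, "s", 0], [1, "s", 2],
    [2, "s", 0], [2, "s", 1],
]
pi = edge_forwarding_index(example_paths)
```

In this toy case every endpoint link carries two paths in each direction, so the index evaluates to 2.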

10 A SIMPLE HEURISTIC
Keep it simple, greedily minimize $\pi(G,R)$
SSSP routing starts an SSSP run at each node
  Finds paths with minimal edge load l(e)
  Updates routing tables in reverse
    Essentially SDSP
  Updates l(e) between runs
    Strives for global balancing
An example (image source: penny-arcade.com)
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
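The heuristic on this slide can be sketched as one Dijkstra run per destination, with edge weights that grow with the load accumulated in earlier runs. This is a minimal sketch, not the OpenSM implementation: the weight `1 + load`, the function name `sssp_routing`, and the three-switch example topology are all assumptions chosen for illustration.

```python
import heapq
from collections import defaultdict
from itertools import count

def sssp_routing(adj, endpoints):
    """Greedy load-balancing routing sketch.

    adj: dict mapping each node to a list of its neighbours (undirected links).
    Returns (next_hop, load), where next_hop[(node, dest)] is the forwarding
    entry and load[(u, v)] is the number of paths on directed edge u -> v.
    """
    load = defaultdict(int)
    next_hop = {}
    tie = count()  # tie-breaker so the heap never has to compare node objects
    for dest in endpoints:
        # Single-destination shortest paths: search outward from dest, but
        # weight each edge in the direction traffic will flow (v -> u).
        dist, parent = {dest: 0}, {}
        heap = [(0, next(tie), dest)]
        while heap:
            d, _, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v in adj[u]:
                nd = d + 1 + load[(v, u)]  # assumed weight: 1 + current load
                if v not in dist or nd < dist[v]:
                    dist[v], parent[v] = nd, u
                    heapq.heappush(heap, (nd, next(tie), v))
        for node, hop in parent.items():
            next_hop[(node, dest)] = hop
        # Update edge loads with the paths just created, before the next run.
        for src in endpoints:
            node = src
            while node != dest:
                load[(node, parent[node])] += 1
                node = parent[node]
    return next_hop, load

# Hypothetical topology: a triangle of switches, one endpoint per switch.
adj = {
    "h0": ["s0"], "h1": ["s1"], "h2": ["s2"],
    "s0": ["h0", "s1", "s2"],
    "s1": ["h1", "s0", "s2"],
    "s2": ["h2", "s0", "s1"],
}
next_hop, load = sssp_routing(adj, ["h0", "h1", "h2"])
max_load = max(load.values())
```

The `next_hop` entries for a fixed destination play the role of that destination's column in each switch's forwarding table.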

11 STEP 1/3
Step 1: Source-node 0
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

12 STEP 2/3
Step 2: Source-node 1
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

13 STEP 3/3
Step 3: Source-node 2
$\pi(G,R) = 2$
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

14 EVALUATION - ODIN
[Figures: Simulation; Benchmark (Netgauge; pattern: ebb)]
Simulation predicts 5% improvement
Benchmark shows 18% improvement!
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

15 EVALUATION - DEIMOS
[Figures: Simulation; Benchmark (Netgauge; pattern: ebb)]
Simulation predicts 23% improvement
Benchmark shows 40% improvement!
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

16 IT WORKS, IS THAT ALL? JUST SSSP?
Shown to run well on real systems in practice
  Odin (128 nodes, 18% ebb speedup)
  Deimos (726 nodes, 40% ebb speedup)
  Lomonosov¹ (~4.5k nodes, ~10-20% Graph500 speedup)
Unfortunately not! SSSP routing may create credit loops
  On certain topologies
  To be proven if some topologies are loop-free
  Problematic in production environments (and interesting in theory)
¹ Lomonosov experiments were executed by Anton Korzh and Alexander Naumov

17 WHAT ARE CREDITS AND WHY DO THEY LOOP?
IB uses credit-based flow control
  Egress messages are sent only if a receive buffer is available
Very similar to deadlocks in wormhole-routed systems
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
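A toy model may help to see why credits can loop. The sketch below is a deliberate simplification and not how InfiniBand hardware is implemented: each link has a number of free receive-buffer slots (credits), and packets waiting behind a link can only move on if the link they need next has a free credit. The names `can_progress`, `free_credits`, and `needs_next` are hypothetical.

```python
def can_progress(free_credits, needs_next):
    """Return True if at least one queued packet can move.

    free_credits[link]: free receive-buffer slots at the far end of `link`.
    needs_next[link]:   the link that packets waiting behind `link` must take
                        next, or None if they have reached their destination.
    """
    return any(
        nxt is None or free_credits[nxt] > 0
        for nxt in needs_next.values()
    )

# Three links in a ring, every receive buffer full, each waiting on the next:
# nobody can send, so nobody frees a buffer. This is a credit loop.
full = {"A": 0, "B": 0, "C": 0}
credit_loop = {"A": "B", "B": "C", "C": "A"}
deadlocked = not can_progress(full, credit_loop)

# Same full buffers, but packets behind C are delivered locally, which
# eventually returns credits upstream and lets the others drain.
no_loop = {"A": "B", "B": "C", "C": None}
drains = can_progress(full, no_loop)
```

The first configuration is stuck forever, while the second can make progress because the dependency chain ends at a destination.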

18 DEAL WITH CREDIT LOOPS
Prevent (Up*/Down*, turn-based routing)
  Limits routing options
Resolve (LASH, use VLs to break cycles)
  Consumes additional buffers
Ignore (MinHop, DOR)
  Potential resolution: packet timeouts
  Discouraged by IB specification
Others: Bubble Routing etc.
  Not supported by current specification
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

19 USING VLS TO AVOID DEADLOCKS
Pioneered with LASH, example:
Deadlock!
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

20 USING VLS TO AVOID DEADLOCKS
Pioneered with LASH, example:
2 VLs resolve deadlock
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

21 DEADLOCK-FREE SSSP ROUTING
Perform normal SSSP
Detect cycles
Break cycle by adding new VL; rinse, repeat
VLs are expensive, how many do we need?
  NP-hard problem
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
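The detect-and-break loop on this slide can be sketched on a channel dependency graph, where each directed link is a node and a path that uses link (a, b) followed by link (b, c) adds a dependency between them. This is a sketch under assumptions, not the DFSSSP code in OpenSM: the rule "break the cycle at the dependency induced by the fewest paths" is an assumed heuristic, the helper names are hypothetical, and paths are assumed to be simple (no link used twice).

```python
from collections import defaultdict

def dependencies(path):
    """Channel dependencies of one path: consecutive pairs of directed links."""
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))

def find_cycle(graph):
    """Return one cycle as a list of dependency edges, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph.get(node, ()):
            if color[nxt] == GREY:  # back edge closes a cycle
                nodes = stack[stack.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

def assign_layers(paths):
    """Give each path a virtual-lane index so every lane's graph is acyclic."""
    layer_of = [0] * len(paths)
    layer = 0
    while True:
        members = [i for i, l in enumerate(layer_of) if l == layer]
        if not members:
            return layer_of
        while True:
            graph, users = defaultdict(set), defaultdict(list)
            for i in members:
                for a, b in dependencies(paths[i]):
                    graph[a].add(b)
                    users[(a, b)].append(i)
            cycle = find_cycle(graph)
            if cycle is None:
                break
            # Assumed heuristic: move the paths behind the least-used
            # dependency of the cycle to the next virtual lane.
            weakest = min(cycle, key=lambda e: len(users[e]))
            for i in users[weakest]:
                layer_of[i] = layer + 1
            members = [i for i in members if layer_of[i] == layer]
        layer += 1

# Hypothetical ring of four switches with two-hop clockwise paths.
ring_paths = [["a", "b", "c"], ["b", "c", "d"], ["c", "d", "a"], ["d", "a", "b"]]
layers = assign_layers(ring_paths)
lanes_needed = max(layers) + 1
```

For this ring, moving a single path to a second lane is enough to break the cycle, so two virtual lanes suffice in the example.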

22 IS IT PRACTICAL? DO WE HAVE ENOUGH VLS?
Usually yes!
  But it depends on the topology and network size
  IB supports up to 15 VLs
  Current hardware has 8 VLs
[Figure: needed number of virtual lanes for LASH and DFSSSP on CHiC, Deimos, Odin, JUROPA, Ranger, Tsubame]
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

23 MPI BENCHMARKS
MPI All-to-All on 128 nodes (1 core per node)
[Figure: runtime in s vs. elements in buffer (#floats); series: MinHop, LASH, DFSSSP]
47% faster execution (compared to MinHop)
Same performance for SSSP and DFSSSP
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

24 EBB AND NAS PARALLEL BENCHMARKS
Measured on Deimos at TU Dresden
1 MPI process per node (except for 1024)
[Figure: BT, class C solver; Gflop/s (total) vs. number of MPI processes; series: MinHop, LASH, SSSP, DFSSSP]
[Figure: Netgauge, ebb; ebb in MiByte/s vs. number of MPI processes; series: MinHop, LASH, SSSP, DFSSSP]
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

25 IS IT PRACTICAL? WHAT ABOUT LARGE SCALE?
Up to 64% higher ebb (depending on #nodes and topology)
Runtime can be an issue! (for systems w/ #nodes > 5,000)
[Figure: eff. bisection bandwidth on CHiC, Deimos, JUROPA, Odin, Ranger, Tsubame; series: MinHop, Up*/Down*, FatTree, LASH, DOR, SSSP, DFSSSP]
[Figure: runtime in s on CHiC, Deimos, JUROPA, Odin, Ranger, Tsubame; series: MinHop, Up*/Down*, FatTree, LASH, DOR, SSSP, DFSSSP]
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

26 HOW DO I GET/USE DFSSSP?
OpenFabrics Alliance integrated DFSSSP into the official OFED stack in May, management/opensm tar.gz
Start OpenSM with option -R dfsssp, or set routing_engine dfsssp in the configuration file
Configure QoS for switches/HCAs to use the VLs equally
(Please don't use DFSSSP in opensm /15)
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

27 HOW DO I GET/USE DFSSSP?
QoS config example for OpenSM:
  qos TRUE
  qos_max_vls 8
  qos_high_limit 4
  qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
  qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
  qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
Enable SL queries for path setup in Open MPI:
  Configure Open MPI with --with-openib --enable-openib-dynamic-sl
  Run your application with --mca btl_openib_ib_path_record_service_level 1
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

28 CONCLUSION
DFSSSP can offer:
  Same or better eff. bisection bandwidth (compared to other routing algorithms)
  Huge increase in application performance and speed-up of MPI collectives (especially for irregular topologies)
  Usability on arbitrary topologies and efficient usage of VLs
  Free and easy to use (integration in OFED)
  Enhancement for other technologies (similar to IB)


More information

Chapter 7 Slicing and Dicing

Chapter 7 Slicing and Dicing 1/ 22 Chapter 7 Slicing and Dicing Lasse Harju Tampere University of Technology lasse.harju@tut.fi 2/ 22 Concentrators and Distributors Concentrators Used for combining traffic from several network nodes

More information

Op#miza#on and Tuning of Hybrid, Mul#rail, 3D Torus Support and QoS in MVAPICH2

Op#miza#on and Tuning of Hybrid, Mul#rail, 3D Torus Support and QoS in MVAPICH2 Op#miza#on and Tuning of Hybrid, Mul#rail, 3D Torus Support and QoS in MVAPICH2 MVAPICH2 User Group (MUG) Mee#ng by Hari Subramoni The Ohio State University E- mail: subramon@cse.ohio- state.edu h

More information

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,

More information

Parallel Computing Interconnection Networks

Parallel Computing Interconnection Networks Parallel Computing Interconnection Networks Readings: Hager s book (4.5) Pacheco s book (chapter 2.3.3) http://pages.cs.wisc.edu/~tvrdik/5/html/section5.html#aaaaatre e-based topologies Slides credit:

More information

Interconnection Networks: Routing. Prof. Natalie Enright Jerger

Interconnection Networks: Routing. Prof. Natalie Enright Jerger Interconnection Networks: Routing Prof. Natalie Enright Jerger Routing Overview Discussion of topologies assumed ideal routing In practice Routing algorithms are not ideal Goal: distribute traffic evenly

More information

Non-Blocking Collectives for MPI

Non-Blocking Collectives for MPI Non-Blocking Collectives for MPI overlap at the highest level Torsten Höfler Open Systems Lab Indiana University Bloomington, IN, USA Institut für Wissenschaftliches Rechnen Technische Universität Dresden

More information

InfiniBand Administration Tools (IBADM)

InfiniBand Administration Tools (IBADM) Release Notes InfiniBand Administration Tools (IBADM) 2 Copyright 2006., Inc. All Rights Reserved. InfiniBand Administration Tools (IBADM) Release Notes Document Number:, Inc. 2900 Stender Way Santa Clara,

More information

Interconnection Network Project EE482 Advanced Computer Organization May 28, 1999

Interconnection Network Project EE482 Advanced Computer Organization May 28, 1999 Interconnection Network Project EE482 Advanced Computer Organization May 28, 1999 Group Members: Overview Tom Fountain (fountain@cs.stanford.edu) T.J. Giuli (giuli@cs.stanford.edu) Paul Lassa (lassa@relgyro.stanford.edu)

More information

InfiniBand and Mellanox UFM Fundamentals

InfiniBand and Mellanox UFM Fundamentals InfiniBand and Mellanox UFM Fundamentals Part Number: MTR-IB-UFM-OST-A Duration: 3 Days What's in it for me? Where do I start learning about InfiniBand? How can I gain the tools to manage this fabric?

More information

Power-aware or green HPC is receiving

Power-aware or green HPC is receiving G r e e n H p c Software and Hardware Techniques for Power-Efficient HPC Networking Although most large-scale systems are designed with the network as a central component, the interconnection network s

More information

CS575 Parallel Processing

CS575 Parallel Processing CS575 Parallel Processing Lecture three: Interconnection Networks Wim Bohm, CSU Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

More information

Oak Ridge National Laboratory Computing and Computational Sciences

Oak Ridge National Laboratory Computing and Computational Sciences Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

EE 6900: Interconnection Networks for HPC Systems Fall 2016

EE 6900: Interconnection Networks for HPC Systems Fall 2016 EE 6900: Interconnection Networks for HPC Systems Fall 2016 Avinash Karanth Kodi School of Electrical Engineering and Computer Science Ohio University Athens, OH 45701 Email: kodi@ohio.edu 1 Acknowledgement:

More information

USING HIGH PERFORMANCE NETWORK INTERCONNECTS IN DYNAMIC ENVIRONMENTS

USING HIGH PERFORMANCE NETWORK INTERCONNECTS IN DYNAMIC ENVIRONMENTS 12 th ANNUAL WORKSHOP 2016 USING HIGH PERFORMANCE NETWORK INTERCONNECTS IN DYNAMIC ENVIRONMENTS Vangelis Tasoulas Simula Research Laboratory [ April 7 th, 2016 ] ACKNOWLEDGEMENTS Feroz Zahid, Ernst Gunnar

More information

c-through: Part-time Optics in Data Centers

c-through: Part-time Optics in Data Centers Data Center Network Architecture c-through: Part-time Optics in Data Centers Guohui Wang 1, T. S. Eugene Ng 1, David G. Andersen 2, Michael Kaminsky 3, Konstantina Papagiannaki 3, Michael Kozuch 3, Michael

More information

Towards Efficient MapReduce Using MPI

Towards Efficient MapReduce Using MPI Towards Efficient MapReduce Using MPI Torsten Hoefler¹, Andrew Lumsdaine¹, Jack Dongarra² ¹Open Systems Lab Indiana University Bloomington ²Dept. of Computer Science University of Tennessee Knoxville 09/09/09

More information

ABySS Performance Benchmark and Profiling. May 2010

ABySS Performance Benchmark and Profiling. May 2010 ABySS Performance Benchmark and Profiling May 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource - HPC

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

INTERCONNECTION NETWORKS LECTURE 4

INTERCONNECTION NETWORKS LECTURE 4 INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source

More information

CSC630/CSC730: Parallel Computing

CSC630/CSC730: Parallel Computing CSC630/CSC730: Parallel Computing Parallel Computing Platforms Chapter 2 (2.4.1 2.4.4) Dr. Joe Zhang PDC-4: Topology 1 Content Parallel computing platforms Logical organization (a programmer s view) Control

More information

Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg

Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems Scott Marshall and Stephen Twigg 2 Problems with Shared Memory I/O Fairness Memory bandwidth worthless without memory

More information

Mellanox Infiniband Foundations

Mellanox Infiniband Foundations Mellanox Infiniband Foundations InfiniBand Trade Association (IBTA) Founded in 1999 Actively markets and promotes InfiniBand from an industry perspective through public relations engagements, developer

More information

Non-blocking Collective Operations for MPI

Non-blocking Collective Operations for MPI Non-blocking Collective Operations for MPI - Towards Coordinated Optimization of Computation and Communication in Parallel Applications - Torsten Hoefler Open Systems Lab Indiana University Bloomington,

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

Checklist for Selecting and Deploying Scalable Clusters with InfiniBand Fabrics

Checklist for Selecting and Deploying Scalable Clusters with InfiniBand Fabrics Checklist for Selecting and Deploying Scalable Clusters with InfiniBand Fabrics Lloyd Dickman, CTO InfiniBand Products Host Solutions Group QLogic Corporation November 13, 2007 @ SC07, Exhibitor Forum

More information

Performance monitoring in InfiniBand networks

Performance monitoring in InfiniBand networks Performance monitoring in InfiniBand networks Sjur T. Fredriksen Department of Informatics University of Oslo sjurtf@ifi.uio.no May 2016 Abstract InfiniBand has quickly emerged to be the most popular interconnect

More information

Deadlock-free Routing in InfiniBand TM through Destination Renaming Λ

Deadlock-free Routing in InfiniBand TM through Destination Renaming Λ Deadlock-free Routing in InfiniBand TM through Destination Renaming Λ P. López, J. Flich and J. Duato Dept. of Computing Engineering (DISCA) Universidad Politécnica de Valencia, Valencia, Spain plopez@gap.upv.es

More information

From Routing to Traffic Engineering

From Routing to Traffic Engineering 1 From Routing to Traffic Engineering Robert Soulé Advanced Networking Fall 2016 2 In the beginning B Goal: pair-wise connectivity (get packets from A to B) Approach: configure static rules in routers

More information

C H A P T E R InfiniBand Commands Cisco SFS 7000 Series Product Family Command Reference Guide OL

C H A P T E R InfiniBand Commands Cisco SFS 7000 Series Product Family Command Reference Guide OL CHAPTER 4 This chapter documents the following commands: ib sm db-sync, page 4-2 ib pm, page 4-4 ib sm, page 4-9 ib-agent, page 4-13 4-1 ib sm db-sync Chapter 4 ib sm db-sync To synchronize the databases

More information

Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective

Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective A. Vishnu M. Koop A. Moody A. R. Mamidala S. Narravula D. K. Panda Computer Science and Engineering The Ohio State University {vishnu,

More information

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters Torsten Hoefler With lots of help from the AUS Team and Bill Kramer at NCSA! All used images belong

More information

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality

More information