FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS

Size: px

Start display at page:

Download "FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS"

Juliet Parker
5 years ago
Views:

1 FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS AKA. DFSSSP ROUTING Torsten Hoefler and Jens Domke All used images belong to the producers.

MOTIVATION FOR BETTER ROUTING New challenging problems need more bandwidth

InfiniBand offers advanced routing (kind of) IB spec mandates deadlock-free

Optimal algorithms for fat-tree and tori networks MinHop for others (can

2 MOTIVATION FOR BETTER ROUTING New challenging problems need more bandwidth Scientific computing Big data analytics Distributed graph problems InfiniBand offers advanced routing (kind of) IB spec mandates deadlock-free deterministic distributed routing OpenSM is the standard routing software Optimal algorithms for fat-tree and tori networks MinHop for others (can produce credit loops) Commercial alternatives exist T. Hoefler and J. Domke Slide 2 of 28

3 FORGET FULL BISECTION BANDWIDTH Expensive topologies do not guarantee high bandwidth Deterministic oblivious routing cannot reach full bandwidth! See Valiant s lower bound Random routing is asymptotically optimal but looses locality InfiniBand routing: Deterministic oblivious, destination-based, simple Linear forwarding table (LFT) at each switch LID mask control (LMC) enables multiple addresses per port Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks T. Hoefler and J. Domke Slide 3 of 28

4 BUT MY VENDOR SAID NON-BLOCKING Two communications 1 6, 4 14 Full bisection bandwidth network No full bandwidth observed! Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks T. Hoefler and J. Domke Slide 4 of 28

5 SO HOW BAD IS CONGESTION? CHiC Supercomputer: slightly aged but reflects routing 566 nodes, full bisection IB fat-tree no endpoint congestion! effective Bisection Bandwidth: Microbenchmarks (yet to be seen in practice) Lower Bound! (~ GigE Speed) Reality? Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks T. Hoefler and J. Domke Slide 5 of 28

6 BUT I HAVE A CLEVER SUBNET MANAGER! OpenSM (IB) routing algorithms: minhop (finds minimal paths, balances number of routes local at each switch) updn/dnup (uses Up*/Down* turn-control, limits choice but routes contain no credit loops) ftree (fat-tree optimized routing, no credit loops) dor/torus-2qos (dimension order routing for k-ary n-cubes, DOR might generate credit loops) lash (uses DOR and breaks credit loops with virtual lanes) It s clever if you have a fat-tree or a torus But beware if you add or remove one link! T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 6 of 28

Fat-trees and butterflies are hard to grow Tori networks may have

7 WHY ARBITRARY TOPOLOGIES? Many networks grow over time or fulfill more than one purpose Fat-trees and butterflies are hard to grow Tori networks may have undesirable properties IB supports arbitrary topologies! Hybrid networks exist: T. Hoefler and J. Domke Slide 7 of 28

EFFECTIVE BISECTION BANDWIDTH A measure for global network performance Considers routing Can be measured with a benchmark More realistic then bisection bandwidth Effective Bisection Bandwidth (ebb)

8 EFFECTIVE BISECTION BANDWIDTH A measure for global network performance Considers routing Can be measured with a benchmark More realistic then bisection bandwidth Effective Bisection Bandwidth (ebb) benchmark Divide network into equal partitions A and B combinations Find one peer in B for each node in A pairings Huge number of patterns Statistics converge fast (~1000 measurements) Implemented in Netgauge/eBB (download and try!) Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks T. Hoefler and J. Domke Slide 8 of 28

upper bound to congestion! Goal is to find R that minimizes ¼(G,R) Shown to be NP-hard in the general case T.

9 ACHIEVING HIGH BANDWIDTH Model network as G=(V P [V C, E) Path r(u,v) is a path between u,v 2 V P Routing R consists of P(P-1) paths Edge load l(e) = number of paths on e 2 E Edge forwarding index ¼(G,R)=max e2e l(e) ¼(G,R) is an upper bound to congestion! Goal is to find R that minimizes ¼(G,R) Shown to be NP-hard in the general case T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 9 of 28

10 A SIMPLE HEURISTIC Keep it simple, greedily minimize ¼(G,R) SSSP routing starts an SSSP run at each node Finds paths with minimal edge-load l(e) Updates routing tables in reverse Essentially SDSP Updates l(e) between runs Strives for global balancing An example Source: penny-arcade.com T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 10 of 28

11 STEP 1/3 Step 1: Source-node 0: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 11 of 28

12 STEP 2/3 Step 2: Source-node 1: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 12 of 28

13 STEP 3/3 Step 3: Source-node 2: ¼(G,R)=2 T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 13 of 28

14 EVALUATION - ODIN Simulation Benchmark (Netgauge; pattern: ebb) Simulation predicts 5% improvement Benchmark shows 18% improvement! T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 14 of 28

15 EVALUATION - DEIMOS Simulation Benchmark (Netgauge; pattern: ebb) Simulation predicts 23% improvement Benchmark shows 40% improvement! T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks T. Hoefler and J. Domke Slide 15 of 28

16 IT WORKS, IS THAT ALL? JUST SSSP? Shown to run well on real systems in practice Odin (128 nodes, 18% ebb speedup) Deimos (726 nodes, 40% ebb speedup) Lomonosov 1 (~4.5k nodes, ~10-20% Graph500 speedup) Unfortunately not! SSSP Routing may create credit loops On certain topologies To be proven if some topologies are loop-free Problematic in production environments (and interesting in theory ) 1 Lomonosov experiments were executed by Anton Korzh and Alexander Naumov T. Hoefler and J. Domke Slide 16 of 28

17 WHAT ARE CREDITS AND WHY DO THEY LOOP? IB uses credit-based flow-control egress messages sent only if receive-buffer available Very similar to deadlocks in wormhole-routed systems Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 17 of 28

18 DEAL WITH CREDIT LOOPS Prevent (UP*/Down*, turn-based routing) Limits routing options Resolve (LASH, use VLs to break cycles) Consumes additional buffers Ignore (MinHop, DOR) Potential resolution: packet timeouts Discouraged by IB specification Others: Bubble Routing etc. Not supported by current specification Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 18 of 28

19 USING VLS TO AVOID DEADLOCKS Pioneered with LASH, example: Deadlock! Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 19 of 28

20 USING VLS TO AVOID DEADLOCKS Pioneered with LASH, example: 2 VLs resolve deadlock Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 20 of 28

21 DEADLOCK-FREE SSSP ROUTING Perform normal SSSP Detect cycles Break cycle by adding new VL, rinse, repeat VLs are expensive, how many do we need? NP-hard problem Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 21 of 28

22 Needed #(virtual lanes) IS IT PRACTICAL? DO WE HAVE ENOUGH VLS? Usually yes! But it depends on the topology and network size IB supports up to 15 VLs Current hardware has 8 VLs LASH DFSSSP 0 ChiC Deimos Odin JUROPA Ranger Tsubame Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 22 of 28

Runtime in s MPI BENCHMARKS 0.08 0.07 0.06 MPI All-to-All on 128 nodes (1 core per node) MinHop LASH DFSSSP 0.05 0.04 0.03 0.02 0.

23 Runtime in s MPI BENCHMARKS MPI All-to-All on 128 nodes (1 core per node) MinHop LASH DFSSSP Elements in buffer (#floats) 47% faster execution (compared to MinHop) Same performance for SSSP and DFSSSP Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 23 of 28

24 Gflop/s (total) ebb in MiByte/s EBB AND NAS PARALLEL BENCHMARKS MinHop LASH SSSP DFSSSP Measured on Deimos at TU Dresden 1 MPI process per node (except for 1024) BT, class C solver MinHop LASH SSSP DFSSSP Number of MPI-processes Netgauge, ebb Number of MPI-processes Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 24 of 28

25 Eff. bisection bandwidth Runtime in s IS IT PRACTICAL? WHAT ABOUT LARGE SCALE? Up to 64% higher ebb (depending on #nodes and topology) Runtime can be an issue! (for systems w/ #nodes > 5.000) CHiC Deimos JUROPA Odin Ranger Tsubame CHiC Deimos JUROPA Odin Ranger Tsubame MinHop Up*/Down* FatTree LASH DOR SSSP DFSSSP MinHop Up*/Down* FatTree LASH DOR SSSP DFSSSP Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 25 of 28

26 HOW DO I GET/USE DFSSSP? OpenFabrics Alliance integrated DFSSSP into the official OFED stack in May, management/opensm tar.gz Start OpenSM with option -R dfsssp or set routing_engine dfsssp in configuration file Configure QoS for switches/hcas to equally use VLs (Please don t use DFSSSP in opensm /15) Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 26 of 28

27 HOW DO I GET/USE DFSSSP? QoS config example for OpenSM: qos TRUE qos_max_vls 8 qos_high_limit 4 qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64 qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4 qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7 Enable SL queries for path setup in Open MPI: Configure Open MPI with --with-openib --enable-openib-dynamic-sl Run your application with --mca btl_openib_ib_path_record_service_level 1 Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies T. Hoefler and J. Domke Slide 27 of 28

28 CONCLUSION DFSSSP can offer: Same or better eff. bisection bandwidth (compared to other routing algorithms) Huge increase in application performance and speed-up of MPI-collectives (especially for irregular topologies) Usability on arbitrary topologies and efficient usage of VLs Free and easy to use (integration in OFED) Enhancement for other technologies (similar to IB) T. Hoefler and J. Domke Slide 28 of 28

Optimized Routing for Large- Scale InfiniBand Networks

Optimized Routing for Large- Scale InfiniBand Networks Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine Open Systems Lab Indiana University 1 Effect of Network Congestion CHiC Supercomputer: 566 nodes,