Adaptive Routing for Data Center Bridges


Adaptive Routing for Data Center Bridges. Cyriel Minkenberg (1), Mitchell Gusat (1), German Rodriguez (2). (1) IBM Research - Zurich, (2) Barcelona Supercomputing Center

Overview
Data center network consolidation
Need for Ethernet congestion management
Summary of IEEE 802.1Qau standardization effort
Adaptive routing based on 802.1Qau
Simulation results
Conclusions 2

I/O Consolidation in the Data Center
Today's data centers comprise multiple networks: expensive to build and maintain (TCO), and an inefficient use of resources (h/w, space, power). The goal is convergence of all DC traffic (IPC, SAN, LAN) onto a single wire.
Conventional Ethernet philosophy: keep it simple, low cost, good enough. It is not optimized for performance (frame drops, throughput, latency, jitter, service-level differentiation), so specialized networks filled the gap: Fibre Channel, InfiniBand, Myrinet, QsNet, PCIe AS.
Convergence Enhanced Ethernet (CEE), a.k.a. Data Center Bridging (DCB): enhance Ethernet to enable I/O consolidation in the data center and save cost, power, overhead, and space, yielding a consolidated ICTN. 3

The need for L2 congestion management
Network congestion can lead to severe performance degradation.
Lossy networks suffer from the avalanche effect: high load leads to drops, drops lead to retransmissions, retransmissions increase the load, which causes even more drops.
Lossless networks suffer from saturation tree congestion: link-level flow control (LL-FC) can cause congestion to roll back from switch to switch.
Congestion management (CM) is needed to deal with long-term (sustained) congestion; dropping and LL-FC are ill suited to this task, dealing only with short-term (transient) congestion. CM pushes congestion from the core towards the edge of the network.
[Figure: hotspot scenario in a binary-labeled multistage network; plot of aggregate throughput with and without CM between hot-spot start and end] 4

IEEE 802.1Qau congestion management framework
1. Congestion point (CP) = switch output queue. Sample the output queue length every n bytes received, relative to an equilibrium length Q_eq. Compute the feedback F_b = Q_off - W*Q_delta, where Q_off = Q_eq - Q_now and Q_delta = Q_old - Q_now.
2. Feedback channel. Convey backward congestion notifications (BCN) from the CP to the sources of offending traffic. Notifications contain congestion information: source address = switch MAC, destination address = source MAC of the sampled frame, feedback = Q_off and Q_delta, quantized with respect to F_b,max = (1+2W)*Q_eq using 6-8 bits.
3. Reaction point (RP). Use rate limiters (RL) at the edge to shape flows causing congestion; separately enqueue rate-limited flows. Reduce the rate limit multiplicatively when F_b < 0; autonomously increase the rate limit based on byte counting or a timer (QCN; similar to BIC-TCP). Release the rate limiter when the limit returns to full link capacity.
[Figure: end nodes with NICs and rate limiters, switches, and BCNs flowing back from the congestion point to the sources] 5
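As a concreteness check, here is a minimal Python sketch of the CP feedback computation and the RP reaction described above. The variable names mirror the slide; the linear 6-bit quantization and the decrease gain g_d are illustrative assumptions, not the exact encodings of the 802.1Qau/QCN specification.

```python
# Sketch of the 802.1Qau/QCN congestion-point feedback and reaction-point
# decrease described above. The 6-bit linear quantization and g_d value are
# illustrative assumptions, not the standard's exact encoding.

def compute_feedback(q_now: int, q_old: int, q_eq: int, w: float = 2.0):
    """Return (f_b, quantized) for one sampled frame at a congestion point."""
    q_off = q_eq - q_now            # distance from the equilibrium queue length
    q_delta = q_old - q_now         # how fast the queue is growing or shrinking
    f_b = q_off - w * q_delta       # negative => congestion
    f_b_max = (1 + 2 * w) * q_eq    # maximum feedback magnitude
    # Quantize the (negative) feedback magnitude into 6 bits (0..63).
    quantized = min(63, round(63 * max(0.0, -f_b) / f_b_max))
    return f_b, quantized

def react(rate_gbps: float, quantized: int, g_d: float = 1 / 128) -> float:
    """Reaction-point multiplicative decrease on negative feedback."""
    return rate_gbps * (1.0 - g_d * quantized)

if __name__ == "__main__":
    f_b, q6 = compute_feedback(q_now=120_000, q_old=100_000, q_eq=66_000)
    print(f_b, q6)          # negative f_b triggers a congestion notification
    print(react(10.0, q6))  # the source rate limiter slows the hot flow
```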

Congestion management vs. adaptive routing
CM solves congestion by reducing the injection rate. This is useful for saturation tree congestion, where many innocent flows suffer because of the backlog of some hot flows, but it does not exploit path diversity, even though typical data center topologies (fat tree, mesh, torus) offer high path diversity.
Adaptive routing (AR) approach: allow multi-path routing; by default route on the shortest path (lowest latency); detect downstream congestion by means of BCNs. In case of congestion, first try to reroute hot flows on alternative paths; only if no uncongested alternative exists, reduce the send rate. 6

Adaptive routing based on 802.1Qau CM
Concept: upstream switches snoop congestion notifications, annotate routing tables with congestion information, and modify routing decisions to route to the least congested port among those enabled for a given destination. The receiver checks frame order and performs resequencing if needed.
Routing table: maps a destination MAC to one or more switch port numbers, listed in order of preference, e.g., shortest path first.
Congestion table: maps a key <destination MAC, switch port number> to a congestion entry comprising the following fields:
  congested (boolean): flag indicating whether the port is congested
  local (boolean): flag indicating whether the congestion is local or remote
  fbcount (integer): number of notifications received
  feedback (integer): feedback severity value
7
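A minimal Python sketch of the two tables, assuming plain dictionaries; the field names follow the congestion-entry table above, and the MAC address and port numbers are made-up examples.

```python
# Minimal sketch of the per-switch routing and congestion tables described
# above, using plain Python containers. The MAC string is a placeholder;
# a real switch would use hardware tables.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class CongestionEntry:
    congested: bool = False   # is this <destination, port> pair congested?
    local: bool = False       # congestion detected locally vs. via remote BCN
    fbcount: int = 0          # number of notifications received so far
    feedback: int = 0         # accumulated feedback severity value

# destination MAC -> eligible output ports, in order of preference
routing_table: Dict[str, List[int]] = {"00:aa:bb:cc:dd:06": [2, 3, 1]}

# (destination MAC, output port) -> congestion state
congestion_table: Dict[Tuple[str, int], CongestionEntry] = {}

# Example: mark port 2 toward that destination as remotely congested.
congestion_table[("00:aa:bb:cc:dd:06", 2)] = CongestionEntry(
    congested=True, local=False, fbcount=1, feedback=5)
```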

Updating congestion table entries
Upon reception (remote) or generation (local) of a congestion notification corresponding to a flow with destination MAC d on local switch port p with feedback value fb:
  if (entry[d,p].local || !entry[d,p].congested) entry[d,p].local = local
  entry[d,p].congested = true
  entry[d,p].fbcount++
  entry[d,p].feedback += entry[d,p].fbcount * fb
  if (!entry[d,p].local) restart timer
When the timer expires (remote) or the queue level drops below Q_eq/2 (local):
  entry[d,p].congested = false
  entry[d,p].local = false
  entry[d,p].fbcount = 0
  entry[d,p].feedback = 0
  stop timer
Notes: local congestion can be overridden by remote, but not vice versa. Weighting the feedback by the notification count ensures that newer notifications receive larger weight, reduces the effect of stale information and false positives, and makes frequent notifications lead to a high congestion level. QCN does not provide positive feedback; hence remote entries must be expired using a timer, while local entries are expired using a threshold on the local queue (Q_eq/2). 8
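The update and expiry rules above as a small runnable Python sketch. Entries are plain dicts mirroring the congestion-table fields; the condition that lets remote congestion override local (but not vice versa) is reconstructed from the note on the slide, and timer handling is reduced to an explicit expire() call.

```python
# Runnable sketch of the congestion-table update and expiry rules above.
# Entries are plain dicts with the fields congested/local/fbcount/feedback;
# timer handling is reduced to explicit expire() calls for brevity.
from collections import defaultdict

def new_entry():
    return {"congested": False, "local": False, "fbcount": 0, "feedback": 0}

table = defaultdict(new_entry)   # keyed by (destination MAC, port)

def on_notification(d, p, fb, local):
    """Reception (remote) or generation (local) of a congestion notification."""
    e = table[(d, p)]
    # Local congestion may be overridden by remote, but not vice versa.
    if e["local"] or not e["congested"]:
        e["local"] = local
    e["congested"] = True
    e["fbcount"] += 1
    e["feedback"] += e["fbcount"] * fb   # newer notifications weigh more
    if not e["local"]:
        pass  # (re)start the expiry timer for remote entries here

def expire(d, p):
    """Timer expired (remote) or local queue dropped below Q_eq/2 (local)."""
    table[(d, p)] = new_entry()

on_notification("00:aa:bb:cc:dd:06", 2, fb=5, local=False)
on_notification("00:aa:bb:cc:dd:06", 2, fb=5, local=True)  # cannot override remote
print(table[("00:aa:bb:cc:dd:06", 2)])
```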

Routing decisions & configuration
For a frame destined to MAC address d: try the eligible ports in order of preference and select the first port not flagged as congested; if all ports are flagged as congested, route to the port with the minimum feedback value.
To ensure productive, loop-free, and deadlock-free routing, for each destination node n construct a directed acyclic graph connecting all other nodes to node n.
[Figure: example topology with switches S1-S12 and hosts H1-H6] 9
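A short Python sketch of this port-selection rule, under the assumption that the congestion table stores the accumulated feedback value per <destination, port> pair as above; "H6" and the port numbers are placeholder examples.

```python
# Sketch of the per-frame port selection rule above: take the first eligible
# port (in preference order) that is not flagged congested; if all are
# congested, fall back to the one with the smallest accumulated feedback.
from typing import Dict, List, Tuple

def select_port(dest: str,
                routing_table: Dict[str, List[int]],
                congestion_table: Dict[Tuple[str, int], dict]) -> int:
    ports = routing_table[dest]          # preference order, shortest path first
    for p in ports:
        e = congestion_table.get((dest, p))
        if e is None or not e["congested"]:
            return p                     # first uncongested eligible port
    # All eligible ports are congested: pick the least congested one.
    return min(ports, key=lambda p: congestion_table[(dest, p)]["feedback"])

routing_table = {"H6": [2, 3, 1]}
congestion_table = {("H6", 2): {"congested": True, "feedback": 15}}
print(select_port("H6", routing_table, congestion_table))   # -> 3 (port 2 is congested)
```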

Routing of congestion notifications
Congestion notifications need to be routed on all alternative paths leading to the CP. When generated at the CP, a notification is routed to the port on which the sampled frame arrived; in upstream switches, the notification is routed on a random port leading to the sampled flow's source.
[Figure: example topology with switches S1-S12 and hosts H1-H4, showing notification paths and port numbers] 10

Simulation experiments
Artificial topologies: highlight problems of CM without AR; test basic functionality of the proposed AR scheme.
Traffic patterns:
Uniform Bernoulli traffic: test saturation throughput and latency-throughput characteristics.
Scenarios with congestion solvable through re-routing (no end-point contention): demonstrate AR functionality.
Scenarios with congestion not solvable through re-routing (end-point contention): demonstrate fallback CM functionality. 11

Test topology
[Figure: topology with switches S1-S12 and hosts H1-H6; links loaded at 95%]
Three flows (1→4, 2→5, 3→6) of 10 Gb/s each. Shortest-path routing leads to contention in S7; adaptive routing should be able to provide 10 Gb/s to each flow. Dashed lines = end-point contention; solid = no end-point contention. 12

Results for uniform Bernoulli traffic
[Plots: throughput vs. load and latency vs. load, comparing no CM, CM only, and AR] 13

Results with 3-flow congestion scenarios
[Plots: flow throughput without AR, flow throughput with AR (without end-point contention), flow throughput with AR (with end-point contention), and hot queue length (CM vs. AR)]
Without end-point contention: without AR, mean throughput/flow is limited to 417 MB/s; with AR, full throughput/flow is achieved for each flow.
With CM, the hot queue is controlled around equilibrium; with AR, the hot spot disappears owing to re-routing. Reset spikes every 250 ms are clearly visible.
With end-point contention: hot queue lengths are controlled, indicating that CM still works well in conjunction with AR. 14

Applications
Simulated two applications from the NAS Parallel Benchmarks: Conjugate Gradient (CG), 128 nodes, and Fast Fourier Transform (FT), 128 nodes. Both are characterized by heavy network loading during communication phases, large messages (CG: 750 KB, FT: 130 KB), multiple sub-phases (CG: 5, FT: 127), and symmetric permutation patterns in each sub-phase (each node sends to a distinct peer node in a pairwise fashion).
Simulation methodology: a combination of the Dimemas MPI trace-replay engine and the Venus detailed network simulator, coupled as a parallel distributed event simulation (PDES) via a client-server socket interface. A trace of the application run is collected on a real machine (MareNostrum). In the co-simulation, Dimemas replays the application trace and forwards each remote communication message to Venus; Venus performs flit-/frame-level network simulation of the message and returns it to Dimemas once it is completely received at the destination NIC. Visualization uses Paraver. 15

CG and FT communication patterns
[Figures: communication pattern (task vs. time) and traffic volume per node pair for Conjugate Gradient (sub-phases 1-5) and Fast Fourier Transform, 128 tasks each]
CG phase 5 is highly sensitive to modulo-radix routing. 16

Simulation environment
[Diagram: simulation tool flow connecting the MPI application run, MPI trace, .map file (mapping), .routes file (routes), and topology (via xgft, routereader, map2ned); the Dimemas interface (client) and Venus interface (server) are coupled by a co-simulation socket; their config files cover #buses, bandwidth, latency, eager threshold (Dimemas) and adapter & switch architecture, bandwidth, delay, segmentation, buffer size (Venus); outputs are statistics, a Paraver node trace, and a Paraver network trace for Paraver visualization, analysis, and validation]
Goal: understand the application's usage of physical communication resources and facilitate optimal communication subsystem design. 17

Extended generalized fat tree (XGFT) topology
Multi-path: one path via each top-level switch.
Self-routing: the usual static, oblivious routing method selects the path based on the label of the source (s-mod-k) or destination (d-mod-k) node; this can lead to significant contention.
Route optimization: the problem of assigning paths to connections with a minimum number of conflicts; non-oblivious offline route optimization takes the traffic pattern into account.
Example, XGFT(3;3,3,3;1,3,3): for node 0, 0/1 mod 3 = port 0 and 0/3 mod 3 = port 0; for node 10, 10/1 mod 3 = port 1 and 10/3 mod 3 = port 0; for destination 26, 26/1 mod 3 = port 2 and 26/3 mod 3 = port 2. 18
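A small Python sketch reproducing the slide's modulo-routing example. The per-level divisor k**(level-1) is inferred from the worked numbers for XGFT(3;3,3,3;1,3,3) and is an assumption about the general rule; the label is taken from the source (s-mod-k) or the destination (d-mod-k).

```python
# Sketch of the oblivious s-mod-k / d-mod-k self-routing example above.
# For the k-ary tree of the slide (XGFT(3;3,3,3;1,3,3), k = 3), the upward
# port chosen at level l is (label // k**(l-1)) % k; the per-level divisor
# is inferred from the slide's worked example.

def up_port(label: int, level: int, k: int = 3) -> int:
    return (label // k ** (level - 1)) % k

# Reproduce the slide's numbers: nodes 0 and 10, destination 26.
for label in (0, 10, 26):
    print(label, [up_port(label, lvl) for lvl in (1, 2)])
# 0  -> ports [0, 0]
# 10 -> ports [1, 0]
# 26 -> ports [2, 2]
```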

Results
[Plots: slowdown factor vs. topology (2-level (16,8), 2-level (8,16), 3-level, 4-level, and 7-level fat trees) for CG with linear placement, CG with random placement, and FT with random placement, comparing random, d-mod-m, s-mod-m, and optimized routing, as well as no AR, CM, AR_switch, and AR with Qeq = 33k and Qeq = 6k]
CG with linear task placement: random, d-mod-m, and s-mod-m all perform about equally well; the optimized ("colored") routing is almost ideal; CM by itself is terrible; AR performs significantly better than the oblivious algorithms (25 to 45% better than d-mod-m; within 25% of ideal).
CG with random task placement: AR performs best of all schemes.
FT with random task placement: the smaller Qeq performs better, about 10 to 25% better than without AR. 19

Conclusions
The IEEE 802.1Qau framework defines congestion management for Ethernet data center bridging. It addresses saturation tree congestion and provides a temporal reaction by selectively rate-limiting sources, which can improve but also harm performance. It provides sufficient hooks to implement fully adaptive routing.
Snoop-based adaptive routing (our proposal) snoops notifications to annotate routing tables with congestion information and takes advantage of the multi-path capabilities present in many HPC/DC topologies. It significantly reduces latency and increases saturation throughput under uniform traffic, efficiently reroutes specific flows in case of congestion, interoperates seamlessly with the underlying CM scheme, and demonstrates its usefulness for selected parallel benchmarks on a realistic topology. 20

21

The IEEE 802.1Qau Approach
Detection: congestion point = output queue. Sample the output queue length every n bytes received, relative to the equilibrium length Q_eq. Compute the offset Q_off = Q_eq - Q_now, the delta Q_delta = Q_old - Q_now, and the feedback F_b = Q_off - W*Q_delta, with F_b,max = (1+2W)*Q_eq.
Signaling: the CP sends a notification frame directly to the source of the sampled frame (BCN). Source address = switch MAC; destination address = source MAC of the sampled frame; feedback = Q_off and Q_delta, quantized with respect to F_b,max using 6-8 bits; congestion point ID.
Reaction: instantiate a rate limiter and separately enqueue rate-limited flows. Reduce the rate limit multiplicatively when F_b < 0; autonomously increase the rate limit based on byte counting or a timer (QCN; similar to BIC-TCP). Release the rate limiter when the limit returns to full link capacity. 22

Extended Generalized Fat Trees (XGFTs)
XGFT(h; m_1, ..., m_h; w_1, ..., w_h): h = height = number of levels - 1. Levels are numbered 0 through h, with level 0 holding the compute nodes and levels 1 through h the switch nodes. m_i = number of children per node at level i (0 < i <= h); w_i = number of parents per node at level i-1 (0 < i <= h). Number of level-0 nodes = Π_i m_i; number of level-h nodes = Π_i w_i.
[Figure: example XGFT(3; 3,2,2; 2,2,3) with nodes labeled by per-level index tuples] 23
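A tiny Python helper illustrating the counting rules above; the two example parameter sets are the XGFTs that appear on these slides.

```python
# Counting rules for XGFT(h; m_1..m_h; w_1..w_h): the number of level-0
# (compute) nodes is the product of the m_i, and the number of level-h
# (top) switches is the product of the w_i.
from math import prod

def xgft_counts(m, w):
    assert len(m) == len(w), "m and w must both have h entries"
    return prod(m), prod(w)   # (compute nodes, top-level switches)

print(xgft_counts((3, 2, 2), (2, 2, 3)))   # XGFT(3;3,2,2;2,2,3) -> (12, 12)
print(xgft_counts((3, 3, 3), (1, 3, 3)))   # XGFT(3;3,3,3;1,3,3) -> (27, 9)
```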

Conclusions
A layer-2 DCB congestion management scheme is being defined by IEEE 802.1Qau: QCN. Congestion notifications are generated by switches based on the queue length offset and delta; rate limiting is applied at the sources (multiplicative decrease, binary increase). It addresses saturation tree congestion.
We proposed an adaptive routing scheme based on QCN: snoop notifications, annotate the routing table, and prefer the least-congested path. It takes advantage of the multi-path capabilities present in many HPC/DC topologies, significantly reduces latency and increases saturation throughput under uniform traffic, efficiently reroutes specific flows in case of congestion, and interoperates seamlessly with the underlying CM scheme. We demonstrated its usefulness for selected parallel benchmarks on a realistic topology. 24

DC requirements not met by conventional Ethernet
Fine-granularity link-level flow control (802.1Qbb): PAUSE (802.3x) is too coarse, stopping/starting the entire link at once. New draft PAR: Priority-based Flow Control, http://www.ieee802.org/1/files/public/docs2007/new-cn-pelissier-draft-pfc-par-5c-070913.pdf
Bandwidth management (802.1Qaz): support for bandwidth management, virtual channels/lanes, deadlock-free routing, and QoS; 802.1p is not flexible enough. New PAR: Enhanced Transmission Selection for Bandwidth Sharing Between Traffic Classes, http://www.ieee802.org/1/files/public/docs2007/new-cm-thaler-trans-select-par-0709-v2.pdf
Multi-pathing: spanning tree provides a single path from any source to any sink, which is not suitable for typical multi-path ICTN topologies.
Congestion management: IEEE 802.1Qau. 25

Uniform Bernoulli traffic
[Plots: latency vs. load and throughput vs. load for Topology 1 and Topology 3] 26

Flow throughput without adaptive routing (no end-point contention, CM only)
[Plots: per-flow throughput for Topology 1, Topology 2, and Topology 3]
Mean throughput/flow is limited to 417 and 625 MB/s, respectively. There is significant unfairness in Topology 3; QCN has no inherent fairness criterion. 27

Flow throughput with adaptive routing (no end-point contention)
[Plots: per-flow throughput for Topology 1, Topology 2, and Topology 3]
Full throughput/flow (1187 MB/s) is achieved for each flow in each topology. Spikes occur when the AR timer expires, i.e., at the reset of congestion entries every 250 ms. 28

Hot queue length (CM vs. AR, no end-point contention)
[Plots: hot queue length for Topology 1, Topology 2, and Topology 3]
With CM, the hot queue is controlled around equilibrium; with AR, the hot spot disappears owing to re-routing. Reset spikes every 250 ms are clearly visible. 29

Flow throughput with adaptive routing (with end-point contention)
[Plots: per-flow throughput for Topology 1, Topology 2, and Topology 3]
End-point contention is the limiting factor; AR does not help here. However, CM still works properly: mean throughput/flow is limited to 417 and 625 MB/s, respectively, as in the CM-only case. 30

Hot queue length with adaptive routing (with end-point contention)
[Plots: hot queue length for Topology 1, Topology 2, and Topology 3]
Hot queue lengths are controlled, indicating that CM still works well in conjunction with AR. 31