Congestion Management in Lossless Interconnects: Challenges and Benefits
José Duato
Technical University of Valencia (SPAIN)

Outline
- Why is congestion management required? Benefits
- Congestion and congestion management strategies
- Challenges
- Enhancing reactive congestion management
- Congestion management & adaptive routing
- HOL blocking elimination techniques
- Hybrid congestion management strategy
HPC Advisory Council Workshop. March 21-23, 2011 - Congestion Management

Current role of interconnection networks
- For three decades the goal of computer architects has been to keep the processors busy, to reach top performance
- Interconnects were usually cheap, and never a bottleneck
- Now, global system performance in large systems is limited by the interconnection network (e.g. Tianhe-1A)
- Network latency directly impacts application performance, and network saturation increases latency by orders of magnitude
- Saturation should be avoided at all costs

Conflicting interests: cost vs. performance
- Saturation was traditionally avoided by overdimensioning the interconnection network, but this is becoming very expensive
- No overdimensioning: danger when working with high traffic loads (close to the saturation point)
- Network performance (throughput, latency) should be good under very different traffic patterns and load scenarios
- Traffic load may vary significantly over time, reaching saturation
- At saturation, network performance drops dramatically

Network throughput at saturation
[Figure: network throughput plot, annotated with the points where HS starts and HS ends. HS = traffic injected to the hot-spot destination.]

Should we currently care about congestion?
[Diagram labels:]
- Growing processor speed; growing link speed
- Power consumption increases; power management
- Processor prices drop (demand); relative interconnect cost increases; smaller networks
- Bandwidth decreases; saturation point reached with lower traffic load
- Congestion probability grows; performance degradation; congestion management strategies

Benefits
- Stable performance when the network reaches saturation
  - No performance drop
  - Delivers maximum achievable throughput
- Reacts quickly when power management has turned some components off and demand suddenly increases
  - Prevents performance degradation due to power management
  - Enables more aggressive power saving strategies without risk
- Helps to maintain performance when faults occur and fault tolerance techniques enable alternative paths
  - Alternative paths may become congested (fewer resources are available)

Contention
- Several packets from different flows request the same output port in a switch
- One packet makes progress, the others wait
[Figure: network contention]

Congestion
- Persistent contention
- Buffers containing packets belonging to flows involved in contention become full
[Figure: persistent network contention]

Congestion propagation
- In lossless networks, congestion is quickly propagated by flow control, forming congestion trees
- Congestion trees may cause Head-of-Line (HOL) blocking
- Congestion propagation may reach the sources
- Congestion affects packets belonging to flows that do not cause congestion
[Figure labels: flow control; persistent network contention]

Congestion trees
[Figure: congestion tree structure, with the congestion tree root, branches and leaves labelled]

Traditional solution
- Overdimensioning the network
  - Many more components than really necessary
  - Offered network bandwidth is much higher than the bandwidth requested by end nodes
[Figure: overdimensioned network]

Traditional solution
- Overdimensioning the network
- Advantage: low link utilization, hence low latency
[Figure: latency vs. injected traffic, divided into a working zone and a congestion zone]

Classical congestion management strategies
- Proactive congestion management (congestion prevention)
  - Path setup before data transmission
  - Used in ATM, computer networks (QoS)
  - High overhead, high setup latencies, poor link utilization (not suitable for HPC)

Classical congestion management techniques
- Reactive congestion management (congestion recovery)
  - Injection limitation techniques using closed-loop feedback
  - Does not scale well with network size and link bandwidth
    - Notification delay (proportional to distance / number of hops)
    - Link and buffer capacity (proportional to clock frequency)
  - May produce traffic oscillations (closed-loop system with pure delay)

Other approaches
- Adaptive routing
  - May help to delay the occurrence of congestion
  - Useless when heavy congestion arises
  - Problems regarding in-order packet delivery
- Packet dropping
  - Not suitable for most current HPC parallel applications

Challenges
- To develop congestion management techniques that react locally and immediately when congestion arises
- To make congestion management techniques truly scalable
- To achieve coordination among end nodes without explicit communication among them
- To eliminate instabilities and oscillatory responses
- To minimize the number of extra resources needed to handle congestion
- To make congestion management compatible with adaptive routing

Enhancing reactive congestion management
- Stop injecting packets for a while when a BECN (Backward Explicit Congestion Notification) is received
- Do not change the injection rate again until feedback from previous changes is received, to prevent oscillations
- Source nodes can dynamically adjust their injection rate to the available bandwidth without communicating among themselves
- Inject exactly one packet when a BECN is received
- New contenders are automatically detected and the injection rate is reduced
- Slightly reduce the above rate to slowly eliminate the congestion tree

Congestion management & adaptive routing
- Existing congestion management techniques do not work correctly with adaptive routing
  - Injection rate is adjusted for a certain congestion, but now packets may follow a different path (unstable behavior)
  - Adaptive routing may spread congestion over more links
- Never use adaptive routing for congested packets when the congestion point is at an end node
  - In this case, adaptive routing does not help, spreads congestion over more links, and increases HOL blocking
- Use adaptive routing otherwise (more research needed here)

HOL blocking elimination techniques
- Key idea: the real problem is not the congestion itself, but its negative effect (HOL blocking)
- By eliminating HOL blocking, congestion becomes harmless
- In general, different buffers are required at each port for separating packet flows

HOL blocking example
[Figure: eight switches (Sw. 1 to Sw. 8) and two destinations (Dst. 1, Dst. 2), with congested and non-congested flows. Link labels: 33 % on most links, 66 % on one link, 100 % at Dst. 1. Source labels: Sending (33 %), Stopped, Sending (33 %).]

Real-life HOL blocking example
- The A-31 highway metaphor
- The flow is affected by the bottleneck of the A-31 highway
[Figure: road map showing the A-31 and A-43 highways and the bottleneck. Map source: Google Maps]

HOL blocking elimination techniques
- VOQnet (Virtual Output Queuing at network level)
  - A separate queue at each input port for every destination
  - Packets with the same destination are stored in the same queue
  - Completely eliminates HOL blocking
  - Number of required buffer resources increases at least quadratically with network size!!!

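The quadratic growth can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a k-ary n-tree (k^n end nodes, n·k^(n-1) switches with 2k ports each) and counts every port of every switch, so it gives an upper estimate rather than figures from the talk. The function name is hypothetical.

```python
def voqnet_queue_count(k: int, n: int) -> int:
    """Total VOQnet queues in a k-ary n-tree (illustrative upper estimate).

    Assumptions (not from the slides): the tree has k**n end nodes and
    n * k**(n - 1) switches, each with 2 * k ports, and every port holds
    one queue per destination.
    """
    endpoints = k ** n                 # one queue per destination at each port
    switches = n * k ** (n - 1)
    ports_per_switch = 2 * k
    return switches * ports_per_switch * endpoints


# With N = k**n end nodes the total is 2 * n * N**2, i.e. at least quadratic in N.
for k, n in [(4, 4), (16, 2)]:
    print(f"{k}-ary {n}-tree: {voqnet_queue_count(k, n)} queues")
```

Under these assumptions both 256-node trees need hundreds of thousands of queues, which is why the remaining techniques look for a much smaller number of queues per port.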
HOL blocking elimination techniques
- VOQsw (Virtual Output Queuing at switch level) & DAMQs (Dynamically Allocated Multi-Queues)
  - A separate queue at every input port for every output port
  - Packets requesting the same output are stored in the same queue
  - Better than nothing, but does not eliminate HOL blocking completely. Effectiveness depends on traffic pattern
- Virtual Channels
  - Performance depends on channel (queue) assignment

HOL blocking elimination techniques
- DBBM (Destination-Based Buffer Management)
  - Several groups of destinations are defined
  - A separate queue for each group at every port
  - Packets with destinations in the same group are stored in the same queue
- OBQA (Output-Based Queue Assignment)
  - Suitable for fat-trees with DESTRO routing
  - Queue assignment linked with topology & routing algorithm
  - Reduces HOL blocking with the minimum number of queues per port

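The DBBM grouping can be sketched in a few lines. The slide does not say how destinations are assigned to groups; the modulo mapping below is an assumption made for illustration, and both function names are hypothetical.

```python
from collections import defaultdict


def dbbm_queue(destination: int, num_queues: int) -> int:
    """Map a destination to a queue (assumed rule: destination modulo queue count)."""
    return destination % num_queues


def group_by_queue(destinations, num_queues):
    """Show which destinations end up sharing each queue at a port."""
    queues = defaultdict(list)
    for destination in destinations:
        queues[dbbm_queue(destination, num_queues)].append(destination)
    return dict(queues)


# Destinations 1, 5 and 9 share queue 1, so a hot spot at destination 5
# can only block packets for 1 and 9, not packets in the other queues.
print(group_by_queue([0, 1, 2, 5, 9], num_queues=4))
```

With this mapping HOL blocking is confined to the group that contains the congested destination, which is a reduction rather than a complete elimination.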
OBQA description
- Logical input port organization
- Each input port has a number of queues (q) smaller than the switch radix
- OBQA assigns packets to queues using this formula:
  Selected_Queue = Requested_Output_Port MOD q

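The formula translates directly into code. This is a minimal sketch; the function name and the 8-port, 4-queue example are chosen for illustration only.

```python
def obqa_select_queue(requested_output_port: int, q: int) -> int:
    """Selected_Queue = Requested_Output_Port MOD q, as given on the slide."""
    if q <= 0:
        raise ValueError("q must be a positive number of queues")
    return requested_output_port % q


# Example: a switch with 8 output ports and q = 4 queues per input port.
assignment = [obqa_select_queue(port, 4) for port in range(8)]
print(assignment)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Packets requesting output ports that differ by a multiple of q share a queue, so HOL blocking is reduced but not ruled out.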
OBQA evaluation
- Uniform traffic simulation results
[Figure: network latency (cycles) vs. normalized generated traffic. Panels: 4-ary 4-tree, 8x8 switches (configuration #2); 16-ary 2-tree, 32x32 switches (configuration #3)]

HOL blocking elimination techniques
- RECN (Regional Explicit Congestion Notification) & FBICM (Flow-Based Implicit Congestion Management)
- Key differences with respect to previous techniques:
  - Explicitly identifies congested points
  - Congestion information storage
  - Dynamic queue allocation for congested flows

Principles of RECN-like solutions
- Congestion becomes harmless if the HOL blocking produced by congested packets is completely eliminated
- HOL blocking produced by congested packets is completely eliminated if they are buffered separately
- Non-congested packets can share queues without suffering significant HOL blocking
- Congested packets can be separately buffered by using a small number of queues per port
- Congested packets must be explicitly identified (i.e. packets belonging to flows contributing to create some congestion)
- Precise identification of congested packets is based on previous knowledge of the location of existing congestion points

RECN basic procedure
- Congested points are detected at any input or output switch port of the network
- The routes to detected congested points are progressively notified to input and output ports crossed by congested flows
- After receiving a notification, a port dynamically allocates a CAM line to store the location of the congested point, and a set-aside queue (SAQ) to store congested packets
- A packet arriving at a port will be stored in a SAQ if it will pass through the congested point associated with that SAQ
- A packet arriving at a port will be stored in the standard ("cold") queue if its route does not match any CAM entry
- SAQs can be dynamically deallocated, and later allocated for other congested points

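The per-port behavior described above can be sketched as follows. This is a simplified model, not the RECN hardware design: routes are modelled as tuples of turns, a congested point is stored as the sequence of turns leading to it from this port, and the longest-match rule and all class and method names are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CamLine:
    path_to_congestion: tuple               # turns from this port to the congested point
    saq: deque = field(default_factory=deque)


class RecnPort:
    def __init__(self, max_saqs: int):
        self.max_saqs = max_saqs
        self.cold_queue = deque()            # standard queue for non-congested packets
        self.cam = []                        # one CAM line per allocated SAQ

    def notify(self, path) -> bool:
        """Handle a congestion notification; True if a SAQ exists for it afterwards."""
        path = tuple(path)
        if any(line.path_to_congestion == path for line in self.cam):
            return True
        if len(self.cam) >= self.max_saqs:
            return False                     # no free SAQ at this port
        self.cam.append(CamLine(path))
        return True

    def enqueue(self, packet_id, remaining_route) -> str:
        """Store a packet in the matching SAQ, or in the cold queue if nothing matches."""
        route = tuple(remaining_route)
        matches = [line for line in self.cam
                   if route[:len(line.path_to_congestion)] == line.path_to_congestion]
        if matches:
            # Assumed tie-break: prefer the longest (most specific) match.
            best = max(matches, key=lambda line: len(line.path_to_congestion))
            best.saq.append(packet_id)
            return "SAQ"
        self.cold_queue.append(packet_id)
        return "cold"

    def release_empty(self) -> int:
        """Deallocate SAQs that hold no packets; returns how many were freed."""
        before = len(self.cam)
        self.cam = [line for line in self.cam if line.saq]
        return before - len(self.cam)


port = RecnPort(max_saqs=2)
port.notify((4, 3))                          # congested point two turns away
print(port.enqueue("a", (4, 3, 1)))          # SAQ: the route crosses the congested point
print(port.enqueue("b", (4, 2, 1)))          # cold: the route avoids it
```

Keeping the packet that crosses the congested point out of the cold queue is what prevents it from blocking the packet that does not.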
How RECN works
1. A congestion point forms
2. The cold queue fills over a threshold
3. Internal notification to each input sending packets to the congested output
4. New SAQs are allocated for packets addressed to the congested output port
5. Notifications are sent when the SAQs fill over a threshold
6. A new SAQ is allocated for the congested output at each notified output
7. Internal notifications are sent when the SAQs receive packets and the occupancy is over a threshold
8. At the end, congestion tree packets are completely stored in SAQs
9. A cold flow may share some network resources with a branch of the congestion tree. Cold packets are never stored in SAQs, so they never share a queue with congested packets

RECN basics
- RECN achieves efficiency and scalability in source routing environments
- Packet header information: the routing information is included in the packet header and in congestion notifications (turnpool), and it is used at each hop (turnpointer)
[Figure: packet header with a turnpool holding per-hop turns (e.g. +4, +3) and a turnpointer marking the current position]

CAM structure
[Figure: CAM with one line per SAQ (SAQ 0 to SAQ n-1). Fields of each line: v (valid), turnpool and bit mask (congested point), b (blocked), Xoff. Labelled: Xon/Xoff flow control]

Reception of packets after SAQ allocation
- The incoming packet is stored in SAQ 0
[Figure: the turnpointer and turnpool in the header of an incoming packet are compared with the CAM lines of SAQ 0 to SAQ n-1; the header matches the CAM line of SAQ 0, so the packet does not go to the cold queue]

FBICM operation
- Key features
  - Effective HOL blocking elimination in networks with distributed routing
  - Implicit identification of congestion points, detecting flows heading to them just by inspecting the packet destination
  - Congestion information is based on destinations instead of turnpools, and it represents any congested point in the network
  - New CAM structure; new detection, propagation and resource management policies

FBICM operation
- IQ switch architecture
  - Normal Flow Queues (NFQ)
  - Congested Flow Queues (CFQ)
  - Separate non-congested flows from congested flows

FBICM operation
- CAM structure
  - Congested flow identification fields: Congested port, Hops to reach, Destination list, Next CFQ (Congested Flow Queue)
  - Flow control fields: Stop & Go, Sent Stop, Receiving control

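A CAM line with these fields, and the destination-based classification it enables, might be modelled as follows. This is a sketch under assumptions: the field types, the single `stop` flag standing in for the flow control fields, and the `select_queue` helper are illustrative and not taken from the FBICM design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FbicmCamLine:
    cong_port: str                           # congested port, e.g. "P6"
    hops: int                                # hops to reach the congested point
    destination_list: set = field(default_factory=set)
    next_cfq: Optional[int] = None           # index of the next CFQ, if any
    stop: bool = False                       # simplified stand-in for the flow control fields


def select_queue(destination: int, cam_lines):
    """Pick a queue by inspecting only the packet destination.

    Returns ("CFQ", index) if some CAM line lists the destination,
    otherwise ("NFQ", None).
    """
    for index, line in enumerate(cam_lines):
        if destination in line.destination_list:
            return ("CFQ", index)
    return ("NFQ", None)


cam = [FbicmCamLine(cong_port="P6", hops=1, destination_list={12, 13})]
print(select_queue(12, cam))                 # ('CFQ', 0): congested flow
print(select_queue(5, cam))                  # ('NFQ', None): non-congested flow
```

Because the lookup needs only the destination, it works with distributed routing, where the packet header carries no explicit route.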
FBICM operation: congestion detection
- NFQ threshold exceeded: a primary CFQ and a CAM line are allocated
- New CAM line information: Active; Cong_Port: P6; Hops: 1; Destination_list; NextCFQ: null
[Figure: Switch 1 and Switch 2, each with ports P4 to P7, showing the NFQ, CFQ and CAM]

FBICM operation: packet processing
- Congested packets are stored in the CFQ
- HOL blocking is avoided
- CAM line information: Active; Cong_Port: P6; Hops: 1; Destination_list; NextCFQ: null
[Figure: Switch 1 and Switch 2, each with ports P4 to P7, showing the NFQ, CFQ and CAM]

FBICM operation: congestion information propagation
1. The threshold is exceeded in the CFQ (external Stop); a new CAM line, CAM 1, is allocated as a copy of CAM 0
2. A congested packet reaches the output port and matches the congestion information in CAM 1 (internal Stop)
3. A new CAM line and CFQ are allocated (CAM 2), updating the congested point information
4. A congestion tree branch is formed; congestion tree resources will be released dynamically
CAM line information:
- CAM 0: Active; Cong_Port: P6; Hops: 1; Destination_list; NextCFQ: null
- CAM 1: Active; Cong_Port: P6; Hops: 1; Destination_list; NextCFQ: 0; Stop: true
- CAM 2: Active; Cong_Port: P6; Hops: 2; Destination_list; NextCFQ: 1; Stop: true
[Figure: Switch 1 and Switch 2, each with ports P4 to P7, showing the NFQs, CFQs and CAM lines involved]

FBICM evaluation
[Figure: network throughput vs. load (Config. 1). Panels: random uniform traffic, BMIN 64 x 64; congested traffic, BMIN 64 x 64]

FBICM evaluation
[Figure: network throughput vs. time (Config. 1). Panels: real traffic (CF = 20), BMIN 64 x 64; real traffic (CF = 40), BMIN 64 x 64]

Hybrid congestion management strategy
- Key ideas
  - Use FBICM to quickly and locally eliminate HOL blocking, propagating congestion information and allocating buffers as necessary
  - Use reactive congestion management to slowly eliminate congestion, deallocating FBICM buffers whenever possible
- Use of FBICM provides immediate response and allows reactive congestion management to be tuned for slow reaction, thus avoiding oscillations
- Reactive congestion management drastically reduces FBICM buffer requirements (just one buffer per port)

Conclusions
- Interconnect performance: key when power management is enabled
- Goal: to achieve the best possible behavior with a limited number of resources
- Congestion (HOL blocking): a serious menace
- DBBM and OBQA reduce HOL blocking in specific scenarios with a small set of resources (ad-hoc techniques)
- Reactive congestion management: does not scale well, but can be improved
- RECN: efficiently eliminates HOL blocking in source routing networks
- FBICM: efficiently eliminates HOL blocking in distributed deterministic routing networks
- Hybrid congestion management: the mechanisms help each other

Acknowledgements
- Pedro García (University of Castilla La Mancha) developed the research on RECN and FBICM under my guidance and prepared part of the slides in this presentation
- The techniques to enhance reactive congestion management are being developed in collaboration with Simula Research Laboratory (Oslo)

Thanks!! Any questions?