Congestion Management in HPC Interconnection Networks
Pedro J. García, Universidad de Castilla-La Mancha (Spain)
Outline
- Why may congestion become a problem?
- Should we care about congestion in current HPC systems?
- How can congestion be managed?
- Challenges
Why may congestion become a problem?
- For three decades, the goal of computer architects has been to keep the processors busy (top performance)
- Interconnects were usually cheap, and never a bottleneck
- Now, global system performance in large systems is limited by the interconnection network
- Network saturation leads to congestion situations that may drastically degrade network performance
Contention
- Several packets from different flows request the same output port in a switch
- One packet makes progress; the others wait
[Figure: network contention at a switch output port]
Congestion
- Persistent contention, arising mainly when the network is in a saturation state
- Buffers containing packets belonging to flows involved in contention become full
[Figure: persistent network contention]
Congestion propagation
- In saturated lossless networks, congestion is quickly propagated by flow control, forming congestion trees
[Figure: flow control propagating persistent network contention]
Congestion propagation
- Congestion tree structure: a root (the original congested point), branches, and leaves
- Congestion propagation may reach the sources
[Figure: congestion tree with root, branches, and leaves]
Congestion trees and Head-of-Line blocking
- Congestion trees may cause Head-of-Line (HoL) blocking: non-congested packets advance at the same speed as congested ones
- Thus, congestion affects sources that do not cause congestion
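The HoL-blocking effect above can be illustrated with a toy model of one switch input port: a "hot" destination is congested and cannot drain, while a "cold" destination is free. All names and structures here are illustrative assumptions, not an actual switch implementation.

```python
from collections import deque

def drain(queue, blocked):
    """Forward packets from the head of `queue`; stop at the first packet
    whose destination is blocked. Returns the packets that got through."""
    delivered = []
    while queue and queue[0] not in blocked:
        delivered.append(queue.popleft())
    return delivered

# Single FIFO: hot-spot packets at the head block the cold packet behind them.
fifo = deque(["hot", "hot", "cold"])
print(drain(fifo, blocked={"hot"}))   # [] -> the cold packet is HoL-blocked

# Per-destination queues (VOQ-style): the cold packet advances independently.
queues = {"hot": deque(["hot", "hot"]), "cold": deque(["cold"])}
delivered = [p for q in queues.values() for p in drain(q, blocked={"hot"})]
print(delivered)                      # ['cold'] -> no HoL blocking
```

With a single queue, the non-congested packet is stuck behind congested ones; separating flows into different queues removes the dependency, which is the idea behind the HoL-blocking elimination techniques discussed later.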
Network performance at saturation
[Figure: throughput over time with a hot spot; HS = traffic injected to the hot-spot destination; performance collapses between "HS starts" and "HS ends"]
- At saturation, network performance drops dramatically due to congestion situations
Should we currently care about congestion?
- Conflicting interests: cost vs. performance
- Saturation was traditionally avoided by overdimensioning the interconnection network
Network overdimensioning
- Many more components than really necessary
- Offered network bandwidth is much higher than the bandwidth requested by end nodes
Network overdimensioning
- Advantage: low link utilization, so congestion is unlikely
[Figure: latency vs. injected traffic; the network operates in the working zone, far from the saturation zone]
- Disadvantages: expensive (processors get cheaper relative to interconnects); power consumption increases (growing link speed)
Should we currently care about congestion?
- Conflicting interests: cost vs. performance
- Saturation was traditionally avoided by overdimensioning the interconnection network, which is currently not suitable
- No network overdimensioning, then?
Network not overdimensioned
- Only the components strictly necessary to interconnect all the processing nodes
- Offered network bandwidth decreases
Network not overdimensioned
- Advantages: cheaper, less power consumption
[Figure: latency vs. injected traffic; the working zone now lies close to the saturation zone]
- Disadvantage: high link utilization, so congestion is likely
Should we currently care about congestion?
- Conflicting interests: cost vs. performance
- Saturation was traditionally avoided by overdimensioning the interconnection network; currently not suitable
- Without overdimensioning, there is danger when working with high traffic loads (close to the saturation point)
- Network performance (throughput, latency) should be good under very different traffic patterns and load scenarios
- Traffic load may vary significantly over time, reaching saturation
- Some strategy to deal with congestion is required
The big picture
[Diagram: causal chain]
- Growing processor speed and growing link speed increase power consumption, calling for power management
- Processor prices drop (demand), so the relative interconnect cost increases, leading to smaller networks
- Smaller networks: bandwidth decreases, and the saturation point is reached with lower traffic loads
- Hence congestion probability grows, performance degradation follows, and congestion management strategies are needed
Benefits of congestion management
- Stable performance when the network reaches saturation: no performance drop, and the maximum achievable throughput is delivered
- Quick reaction when power management has turned some components off and demand suddenly increases: prevents performance degradation due to power management, and enables more aggressive power-saving strategies without risk
- Helps to keep performance when faults occur and fault-tolerance techniques enable alternative paths, since those alternative paths may become congested (fewer resources are available)
How can congestion be managed? Different approaches to congestion management:
- Packet dropping
- Proactive techniques
- Reactive techniques
- HoL blocking elimination techniques
- Hybrid techniques
- Related techniques
Packet dropping
- Packets in congested buffers are discarded
- Suitable for computer networks (like the Internet), but not for most current HPC parallel applications:
  - Both congested and non-congested packets may be discarded
  - Discarded packets must be retransmitted, increasing final packet latency
Proactive congestion management
- A.k.a. congestion prevention: path setup before data transmission [1]
- Used in ATM and computer networks (QoS)
- Optimal performance requires knowing in advance the resource requirements of each transmission and the network status
- Knowledge about network status is not always available
- High overhead, high setup latencies, poor link utilization (not suitable for HPC)
[1] P. Yew, N. Tzeng, D.H. Lawrie, Distributing Hot Spot Addressing in Large Scale Multiprocessors, IEEE Transactions on Computers, 36(4): 388-395, 1987.
Reactive congestion management
- A.k.a. congestion recovery: injection limitation techniques (injection throttling) using closed-loop feedback
- Does not scale well with network size and link bandwidth:
  - Notification delay (proportional to distance / number of hops)
  - Link and buffer capacity (proportional to clock frequency)
- May produce traffic oscillations (closed-loop system with pure delay)
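The oscillation problem of closed-loop throttling can be seen in a toy discrete-time model: the source reacts to congestion state observed several steps ago, so it keeps overshooting and undershooting the link capacity. All constants (delay, capacity, rate factors) are illustrative assumptions, not values from any real network.

```python
# Toy model: a source adjusts its injection rate based on DELAYED feedback.
DELAY = 5          # feedback delay in time steps (models notification latency)
CAPACITY = 1.0     # normalized link capacity

rate, history = 1.5, []
for t in range(40):
    history.append(rate)
    # The source sees the congestion state from DELAY steps ago.
    congested_then = history[max(0, t - DELAY)] > CAPACITY
    rate = rate * 0.8 if congested_then else rate * 1.1

# Count how many times the injected load crosses the capacity line:
crossings = sum(1 for a, b in zip(history, history[1:])
                if (a - CAPACITY) * (b - CAPACITY) < 0)
print(f"load crossed capacity {crossings} times")  # repeated oscillation
```

With zero delay the loop would settle near the capacity; with the pure delay it keeps oscillating, which is exactly why reactive schemes tuned for fast reaction become unstable.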
Reactive congestion management: example, the InfiniBand FECN/BECN mechanism [2]
- Two bits in the packet header are reserved for congestion notification
- If a switch port is considered congested, the Forward Explicit Congestion Notification (FECN) bit is set in the header of packets crossing that port
- Upon reception of such a FECN-marked packet, the destination returns to the source a packet (Congestion Notification Packet, CNP) whose header has the Backward Explicit Congestion Notification (BECN) bit set
- Any source receiving a BECN-marked packet then reduces its packet injection rate for that traffic flow
[2] E.G. Gran, M. Eimot, S.A. Reinemo, T. Skeie, O. Lysne, L. Huse, G. Shainer, First experiences with congestion control in InfiniBand hardware, in Proceedings of IPDPS 2010, pp. 1-12.
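The FECN/BECN closed loop described above can be sketched as follows. Packet fields, the congestion test, and the rate-reduction factor are illustrative assumptions, not the actual InfiniBand wire encoding.

```python
# Sketch of the FECN/BECN loop: switch marks, destination echoes, source throttles.

class Source:
    def __init__(self, rate=1.0):
        self.rate = rate
    def on_becn(self):
        self.rate *= 0.5           # throttle injection for this flow (toy factor)

def switch_forward(packet, port_congested):
    if port_congested:
        packet["fecn"] = True      # set FECN on packets crossing a congested port
    return packet

def destination_receive(packet):
    if packet.get("fecn"):
        return {"becn": True}      # return a CNP with the BECN bit set
    return None

src = Source()
pkt = switch_forward({"payload": "data"}, port_congested=True)
cnp = destination_receive(pkt)
if cnp and cnp["becn"]:
    src.on_becn()
print(src.rate)                    # 0.5 -> the source halved its injection rate
```

Note that the whole loop (mark, deliver, echo, throttle) must traverse the network, which is the source of the notification delay discussed on the previous slide.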
HoL blocking elimination techniques
- Key idea: the real problem is not congestion itself, but its negative effect (HoL blocking)
- By eliminating HoL blocking, congestion becomes harmless
Example of HoL blocking due to congestion
- Should congested flows be throttled?
[Figure: example network with sources Src. 0-3, switches Sw. 1-8, and destinations Dst. 1-2; congested flows toward Dst. 1 saturate a link (100%), while non-congested flows toward Dst. 2 are stopped behind them despite spare capacity]
Example of real-life HoL blocking: the A-31 highway metaphor
[Figure: map showing how the A-43 flow is affected by the bottleneck of the A-31 highway. Map source: Google Maps]
HoL blocking elimination techniques
- In general, these techniques rely on having different queues at each port to separate different packet flows
- They differ mainly in the criteria used to map packets to queues and in the number of required queues per port
HoL blocking elimination techniques: VOQnet (Virtual Output Queuing at network level) [3]
- A separate queue at each input port for every destination: packets with the same destination are stored in the same queue
- Selected_Queue = Packet_Destination
- Completely eliminates HoL blocking
- However, the number of required buffer resources increases at least quadratically with network size!
[3] W. Dally, P. Carvey, L. Dennison, Architecture of the Avici terabit switch/router, in Proceedings of 6th Hot Interconnects, 1998, pp. 41-50.
HoL blocking elimination techniques: VOQsw (Virtual Output Queuing at switch level) [4] and DAMQs (Dynamically Allocated Multi-Queues) [5]
- A separate queue at every input port for every output port: packets requesting the same output port are stored in the same queue
- Selected_Queue = Requested_Output_Port
- Better than nothing, but they do not completely eliminate HoL blocking
- Effectiveness depends on topology and traffic pattern
[4] T. Anderson, S. Owicki, J. Saxe, C. Thacker, High speed switch scheduling for local area networks, ACM Transactions on Computer Systems, vol. 11 (4), pp. 319-352, November 1993.
[5] Y. Tamir, G. Frazier, Dynamically allocated multi-queue buffers for VLSI communication switches, IEEE Transactions on Computers, vol. 41 (6), June 1992.
HoL blocking elimination techniques: DBBM (Destination-Based Buffer Management) [6]
- Several groups of destinations are defined, with a separate queue for each group at every port (q queues per port)
- Packets with destinations in the same group are stored in the same queue
- Selected_Queue = Packet_Destination MOD q
- Does not completely eliminate HoL blocking
- Effectiveness depends on the number of queues, topology, and traffic pattern
[6] T. Nachiondo, J. Flich, J. Duato, Buffer management strategies to reduce HoL blocking, IEEE Transactions on Parallel and Distributed Systems, vol. 21 (6), pp. 739-753, 2010.
HoL blocking elimination techniques: OBQA (Output-Based Queue Assignment) [7]
- Suitable for fat-trees with DESTRO routing: queue assignment is linked to the topology and routing algorithm
- Reduces HoL blocking with the minimum number of queues per port (q)
- Selected_Queue = Requested_Output_Port MOD q, with q smaller than half the switch radix
- Does not completely eliminate HoL blocking
- Effectiveness depends on the number of queues
[7] J. Escudero-Sahuquillo, P. J. García, F. J. Quiles, J. Duato, An efficient strategy for reducing head-of-line blocking in fat-trees, in LNCS vol. 6272, pp. 413-427, Proceedings of 16th International Euro-Par Conference (II), Ischia, Italy, Sept. 2010.
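The four queue-mapping rules above (VOQnet, VOQsw/DAMQ, DBBM, OBQA) differ only in the mapping function applied at each input port. A compact sketch (function names are illustrative):

```python
def voqnet_queue(dest, out_port, q):   # VOQnet: one queue per destination
    return dest                        # -> buffer cost grows with network size

def voqsw_queue(dest, out_port, q):    # VOQsw/DAMQ: one queue per output port
    return out_port

def dbbm_queue(dest, out_port, q):     # DBBM: destinations grouped modulo q
    return dest % q

def obqa_queue(dest, out_port, q):     # OBQA: output ports grouped modulo q
    return out_port % q

# With q = 4, DBBM maps destinations 3 and 7 to the same queue, so a hot
# spot at destination 3 can still HoL-block traffic to destination 7:
print(dbbm_queue(3, 0, 4), dbbm_queue(7, 0, 4))   # 3 3
```

The trade-off is visible in the signatures: VOQnet needs as many queues as destinations, while DBBM and OBQA cap the count at q queues per port and accept some residual HoL blocking among flows sharing a queue.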
Performance comparison
[Figure: uniform-traffic simulation results; network latency (cycles) vs. normalized generated traffic, for a 4-ary 4-tree with 8x8 switches and a 16-ary 2-tree with 32x32 switches]
HoL blocking elimination techniques: RECN (Regional Explicit Congestion Notification) [8] and FBICM (Flow-Based Implicit Congestion Management) [9]
- RECN has been proposed for source-based routing networks, FBICM for distributed table-based routing networks
- The key difference with respect to previous techniques is that they completely and dynamically isolate congested flows
- Basics: explicit identification of congested flows, storage of congestion information, and dynamic queue allocation to isolate congested flows
[8] P. J. García, J. Flich, J. Duato, I. Johnson, F. J. Quiles, F. Naven, Efficient, scalable congestion management for interconnection networks, IEEE Micro, vol. 26 (5), pp. 52-66, September 2006.
[9] J. Escudero-Sahuquillo, P. J. García, F. J. Quiles, J. Flich, J. Duato, Cost-effective congestion management for interconnection networks using distributed deterministic routing, in Proceedings of ICPADS 2010, Shanghai, China, December 2010.
RECN/FBICM basic procedure
- Congested points are detected at any port of the network by measuring queue occupancy
- The location of any detected congested point is stored in a control memory (a CAM line) at any port forwarding packets towards that point:
  - RECN: an explicit route is stored
  - FBICM: a list of destinations is stored, to implicitly locate the point
- A special queue associated with the CAM line is also allocated, to exclusively store packets addressed to that congested point
- Congestion information is progressively notified to any port in other switches crossed by congested flows, where new CAM lines and special queues are allocated
- A packet arriving at a port is stored in the standard queue only if its routing information does not match any CAM line
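The input-port logic of the procedure above can be sketched as follows: a packet goes to a dynamically allocated "special" queue only if its route matches a CAM line pointing at a known congested point; otherwise it shares the single standard queue. Data structures, names, and the route-prefix match are illustrative assumptions (RECN-style explicit routes), not the hardware design.

```python
class Port:
    """Toy model of a RECN-style input port."""
    def __init__(self):
        self.cam = []                 # routes of detected congested points
        self.special = {}             # one special queue per CAM line
        self.standard = []            # single queue for non-congested traffic

    def notify_congestion(self, route):
        if route not in self.cam:     # allocate a CAM line plus special queue
            self.cam.append(route)
            self.special[route] = []

    def enqueue(self, packet):
        for route in self.cam:        # match packet route against CAM lines
            if packet["route"].startswith(route):
                self.special[route].append(packet)
                return "special"
        self.standard.append(packet)  # non-congested packets share one queue
        return "standard"

port = Port()
port.notify_congestion("sw8/out2")                 # congested point reported
print(port.enqueue({"route": "sw8/out2/dst1"}))    # special  -> isolated
print(port.enqueue({"route": "sw6/out0/dst9"}))    # standard -> no HoL blocking
```

Because special queues exist only while a CAM line is allocated, the number of extra queues stays small and independent of network size, which is what makes the approach scalable.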
RECN/FBICM queue requirements
- Non-congested packets can share queues without suffering significant HoL blocking, so only one standard queue per port is needed
- Special queues are allocated and deallocated on demand, so congested packets can be separately buffered using a small number of special queues per port
- HoL blocking produced by congested packets is eliminated in a scalable way
RECN/FBICM drawbacks
- In scenarios with many different congested points, some ports may run out of special queues
- The need for CAMs at switch ports increases implementation cost and required silicon area per port
Hybrid congestion management strategies: example, combining injection throttling and FBICM [10]
- FBICM quickly and locally eliminates HoL blocking, propagating congestion information and allocating queues as necessary
- Reactive congestion management slowly eliminates congestion, deallocating FBICM queues whenever possible
- FBICM provides an immediate response, allowing the reactive mechanism to be tuned for slow reaction and thus avoiding oscillations
- Reactive congestion management drastically reduces FBICM buffer requirements (just one or two special queues per port)
[10] J. Escudero-Sahuquillo, E. G. Gran, P. J. García, J. Flich, T. Skeie, O. Lysne, F. J. Quiles, J. Duato, Combining Congested-Flow Isolation and Injection Throttling in HPC Interconnection Networks, to appear in Proceedings of ICPP 2011.
Performance comparison
[Figure: hot-spot scenario simulation results; normalized network throughput vs. time, for a 4-ary 3-tree with 1 hot spot and a 4-ary 3-tree with 4 hot spots]
Related techniques
- Adaptive routing / traffic balancing:
  - May help to delay the onset of congestion, but is useless once heavy congestion arises
  - Problems regarding in-order packet delivery
  - Existing congestion management techniques do not work correctly with adaptive routing (congested points may vary)
  - Adaptive routing may spread congestion over more links
- Virtual channels: performance depends on channel (queue) assignment
Challenges
- To develop congestion management techniques that react locally and immediately when congestion arises
- To make congestion management techniques truly scalable
- To achieve coordination among end nodes without explicit communication among them
- To eliminate instabilities and oscillatory responses
- To minimize the number of extra resources needed to handle congestion
- To make congestion management compatible with adaptive routing
Acknowledgements
- Jose Duato (Universitat Politecnica de Valencia), who generously gave us the main ideas behind our congestion management proposals
- Jose Flich (Universitat Politecnica de Valencia) and Jesus Escudero-Sahuquillo (Universidad de Castilla-La Mancha), who have developed all our congestion management proposals alongside me
- The technique combining reactive congestion management and FBICM has been developed in collaboration with Simula Research Laboratory (Oslo)
Thanks! Any questions?