Congestion in InfiniBand Networks


Philip Williams
Stanford University, EE382C

Abstract

The InfiniBand Architecture (IBA) is a relatively new industry-standard networking technology suited for inter-processor and I/O communication. One of the challenges facing IBA networks is how to deal with congestion. The most recent version of the IBA specification includes a Congestion Control Annex (CCA) that relies on end-to-end Explicit Congestion Notification (ECN) packet marking to resolve congestion through traffic injection rate throttling. However, setting the CCA parameters to control congestion in a stable and efficient manner requires some experimentation. There are alternatives to ECN for dealing with congestion, including adaptive routing and Virtual Output Queuing (VOQ). Due to the specific requirements of IBA, adaptive routing and VOQ are tricky to implement, but both can be done. Once implemented, adaptive routing can be effective at preventing congestion by routing around it, and VOQ can be effective at eliminating congestion spreading by allowing unaffected flows to bypass it. However, these two techniques do not address the root cause of congestion as ECN does. This paper examines the advantages and disadvantages of each technique, and suggests that the best approach may be to combine them, in order to both avoid congestion where possible and recover from it when necessary.

1. InfiniBand Architecture

InfiniBand is an industry-standard architecture designed for high bandwidth, low latency, scalability, and reliability. It is particularly suited to SANs for high-performance clusters. Because scalability and industry-wide versatility are defining characteristics of InfiniBand, many design choices, such as topology and routing, are left intentionally unspecified by the IBA in order to accommodate a wide variety of applications. However, certain characteristics of IBA are universal for all applications. The primary goal is high throughput with low latency.
End-to-end latency is typically 10 µs or less. Because of this competitive requirement, InfiniBand traffic is typically required to be lossless; that is, packets cannot be dropped. Dropping and resending a packet would incur too much latency for InfiniBand applications. (Note that there are provisions in a section of the IBA specification [3] that allow setting a Switch Lifetime Limit and a Head of Queue Lifetime Limit to discard packets if they remain in a switch for too long. However, this is not desirable behavior, and most of the research regarding congestion control in InfiniBand networks aims to avoid packet dropping, so lossless traffic is assumed.)

2. Congestion

Another characteristic explicitly specified by IBA is credit-based flow control. Downstream switches send credits to upstream switches to inform them that buffer space is available. If many switches concurrently request the same resource, there may be no buffer space available; the sending switch must hold on to its packet and wait. This is the definition of congestion. Congestion increases latency by requiring packets to wait before being sent to the next node in the network. With lossless traffic, congestion that is initially concentrated at a single hot spot can spread throughout the entire network. A full input buffer causes the upstream switch to wait; if the input buffer of the upstream switch fills up, then the next upstream switch must wait, and so on. This

Figure 1. Head of Line (HOL) Blocking

can continue all the way back to the source nodes. This is particularly undesirable because it can impact traffic flows that are not even destined for the original congested link, due to a phenomenon known as Head of Line (HOL) blocking, which reduces throughput. This is illustrated by an example from [1], shown in Figure 1. Assume the packets from nodes 5 and 6 arrive first, then node 2's, then node 0's. Nodes 5 and 6 both send to port 4, overloading its capacity. When node 2 also attempts to send to port 4, there are no credits available, so it must wait. Since node 0's packet arrived after node 2's, it must wait until node 2's packet is transmitted; thus node 0 is blocked from sending to port 7, even though port 7 is available. The inter-switch link shared by the 2-to-4 and 0-to-7 flows remains idle, despite the fact that there is bandwidth available. Congestion can now spread upstream from node 2 and node 0. If the congestion is not resolved, this pattern of congestion spreading, also known as tree saturation, can ultimately result in total network collapse. In this example, the traffic flows destined for port 4 are hot flows; they are directly causing the congestion. The traffic flows destined for port 7 are cold flows; they contribute to the congestion, but are not directly causing it.

3. Congestion Control

There are numerous ways to deal with congestion.

a) TOPOLOGY - Overprovision the network. Build the network with excess resources so that packets never have to contend for them. This method aims to avoid congestion altogether.

b) ROUTING - Route adaptively. Examine the network state during routing and route around congested areas. This method aims to detect congestion when it occurs, and avoid adding to it.

c) FLOW CONTROL - Create virtual channels. Use multiple virtual channels for each input port that allow the allocation of the channel bandwidth to be decoupled from the allocation of the channel state [6].
This method eliminates HOL blocking, allowing cold flows to bypass congestion.

d) FLOW CONTROL - Reduce the amount of traffic. Decrease the rate at which traffic is injected into congested nodes, giving them time to process and transmit the backed-up traffic. This method attacks the source of the congestion itself, alleviating congestion for both hot flows and cold flows.

The advantages, disadvantages, and applicability to IBA of each of these strategies are discussed in the following sections.

4. Overprovisioning

The simplest way to deal with congestion is to overprovision the network. If an excess of resources (switch nodes, inter-switch links, and buffer space) is provided by the network, packets will rarely contend for the same links or buffer space, and congestion will not occur. Obviously, this is not a cost-effective solution. The network must be designed to handle the maximum possible traffic; most of the time the available resources will not be utilized. It is also not a scalable solution. As [2] points out, as the number of nodes in a network increases, the fraction of each node's traffic required for congestion to form decreases proportionally, which implies that the amount of overprovisioning would need to increase. Since scalability is one of the key desired attributes of IBA, overprovisioning is not an acceptable solution.

5. Adaptive Routing

With adaptive routing, the routing algorithm examines the network state and adapts to it by choosing a path that is underutilized. This technique detects congested links, then routes

Figure 2. Average packet latency vs. accepted traffic; various percentages of adaptive routing; 32-switch network, uniform traffic

around them. To implement adaptive routing, a network must have enough path diversity that alternate paths are available at any congested link. This requirement may or may not be met by a particular InfiniBand network, since topology is completely unspecified in the IBA specification; users are free to implement any network topology, regular or irregular. Assuming a network topology is chosen that has sufficient path diversity for adaptive routing, another requirement must be met: the routing tables must provide multiple paths for each destination, so that the adaptive routing algorithm can choose among them. This is a problem for IBA. Within an InfiniBand subnet, every port is assigned a unique Local ID (LID). When a packet is injected into the subnet, its destination is specified by a Destination LID (DLID). At each node, the packet's DLID is compared with the node's LID to determine whether the packet has arrived at its destination. IBA specifies that each switch contain a single routing table, and that each routing table list a single output port for each DLID. Thus, the routing algorithm does not have any choice as to which path to take; there is only one option. To enable adaptive routing, [4] proposes a solution that takes advantage of an IBA provision called LID Mask Control (LMC). IBA allows each port to be identified with a consecutive range of LIDs, rather than just a single LID. The LMC indicates how many of the least significant bits of the LID should be ignored when comparing a packet's DLID with the present node's LID. Each destination node will accept all packets destined for any address within its LID range.
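The LID-range matching just described can be sketched in a few lines. This is a minimal illustration, not code from the IBA specification; the function name and the forwarding-table layout are hypothetical. A destination accepts any DLID that equals its base LID once the LMC least-significant bits are masked off, while a switch can still map each individual DLID in that range to a different output port.

```python
def lid_matches(dlid: int, base_lid: int, lmc: int) -> bool:
    """A destination port accepts a packet if the DLID equals its base LID
    after ignoring the LMC least-significant bits (sketch of LMC matching)."""
    mask = ~((1 << lmc) - 1) & 0xFFFF   # LIDs are 16-bit values
    return (dlid & mask) == (base_lid & mask)

# A port with base LID 0x0040 and LMC = 2 owns LIDs 0x0040..0x0043.
assert lid_matches(0x0042, base_lid=0x0040, lmc=2)
assert not lid_matches(0x0044, base_lid=0x0040, lmc=2)

# A switch, which treats every LID as distinct, can map each DLID in the
# range to a different output port, yielding up to 2**LMC alternative paths.
forwarding_table = {0x0040: 1, 0x0041: 2, 0x0042: 3, 0x0043: 1}  # DLID -> port
```

The route selection logic can then pick among the ports listed for the destination's DLID range, e.g. the least loaded one.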
Since switches treat every LID as distinct, regardless of the LMC, each switch's forwarding table can list a range of DLIDs for each destination, and each DLID within that range can be associated with a different output port. The result is that a range of paths, all leading to the same destination node, becomes available. The table addressing logic can be designed in such a way that it simultaneously returns all of the available DLIDs to the route selection logic, which can then choose which of the output ports to use. [4] evaluated the performance of this adaptive routing technique by simulating various irregular network topologies with a number of different traffic patterns. Average packet latency and throughput were measured, and performance with adaptive routing was compared with purely deterministic routing. Figure 2 illustrates the latency and throughput improvement achieved with adaptive routing vs. deterministic routing under uniform traffic. (The percentages indicate what percentage of traffic was flagged as available for adaptive routing; packets that were not flagged were limited to a single, deterministic path.) There is a positive linear relationship between the amount of traffic routed adaptively and the saturation point. Figure 3 demonstrates the factor of throughput improvement achieved with adaptive routing compared to deterministic routing. Under uniform traffic, adaptive routing significantly outperformed deterministic routing: throughput increased by 9%-31% on average, depending on network size. However, with 15% hot spot traffic (15% of the source nodes send their traffic to a randomly

Figure 3. Throughput increase factor; 100% adaptive routing vs. 0% adaptive routing; various network sizes and traffic patterns

selected hot spot destination), throughput only improved by a modest 1%-6% with adaptive routing. The researchers attribute this mediocre improvement to limitations of the particular routing algorithm that was used. The simulation used up*/down* routing, which is well-suited for deadlock-free routing in irregular networks but does not allow a very high degree of adaptivity. The algorithm was able to avoid congested areas when they were of low intensity and randomly dispersed throughout the network, but unable to do so when they were highly concentrated in hot spots. Despite this disappointing result, adaptive routing has potential as a congestion avoidance mechanism. At the very least, it can be utilized to avoid congestion under benign traffic. It's possible that another adaptive routing algorithm could be chosen that might yield better results for hot spot traffic. [7] mentions alternative routing algorithms that are also well-suited for irregular networks: adaptive trail routing [8] and minimal adaptive routing [9]. Examination of these routing algorithms and their degree of adaptivity is beyond the scope of this paper, and a possible area for future research.

6. Virtual Channels/Virtual Output Queuing

Virtual channels can be utilized to relieve HOL blocking. This entails creating multiple virtual channels for each input port. If a flow is allocated a virtual channel and then becomes blocked, it does not hold on to the physical link; the link remains available for other virtual channels to utilize. Thus, cold flows can bypass congestion. Note that virtual channels do not relieve hot flows, but as long as the cold flows are taken care of, congestion spreading can be controlled, and network collapse will not occur. If there are enough virtual channels, Virtual Output Queuing (VOQ) can be implemented.
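The virtual-channel idea can be illustrated with a deliberately simplified toy model (not an IBA simulator; here a "packet" is just its destination port number). With a single FIFO per input, a blocked head packet stalls everything behind it; with one virtual channel per flow, the cold flow proceeds.

```python
from collections import deque

def transmittable(queues, blocked_ports):
    """Return the packets that can advance this cycle. From each queue only
    the head packet is eligible, and it advances only if its destination
    port still has credits (toy model of HOL blocking)."""
    sent = []
    for q in queues:
        if q and q[0] not in blocked_ports:
            sent.append(q.popleft())
    return sent

# One shared FIFO: the cold packet (port 7) is stuck behind the hot one (port 4).
single_fifo = [deque([4, 7])]
assert transmittable(single_fifo, blocked_ports={4}) == []   # HOL blocking

# Separate virtual channels per flow: the cold flow bypasses the congestion.
virtual_channels = [deque([4]), deque([7])]
assert transmittable(virtual_channels, blocked_ports={4}) == [7]
```

The hot packet still waits in both cases; virtual channels only keep it from dragging unrelated flows down with it.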
VOQ organizes input buffers into a set of queues such that for every input port, there is a distinct buffer assigned to each output port. This provides better HOL blocking relief than an implementation with fewer virtual channels. IBA does specify the use of Virtual Lanes (VLs), but they are not quite the same as the virtual channels described above. VLs are intended to provide Quality of Service (QoS), not dynamic congestion control. The effect of VLs is that separate virtual networks are created, each of which can be used by a different class of traffic. At the source node, each packet is assigned a Service Level (SL), which cannot be modified. SLs are mapped to VLs at each switch. Thus, when the VL for a particular SL becomes congested, packets belonging to the other SLs can still be transmitted on their virtual networks. There are many aspects of this VL specification which are problematic for the implementation of VOQ in IBA. First, the switch cannot dynamically allocate any packet to any VL; this allocation is deterministically selected by the switch's SL-to-VL mapping table, and a packet's SL can only be assigned at the source node. Second, VLs cannot be mapped to specific output ports of the switch; they are mapped to each packet's SL. Third, the packet is stored in a particular VL when it arrives at the switch, before the output port has even been determined. [5] describes a clever strategy for implementing VOQ in InfiniBand networks that

Figure 4. Average packet latency vs. accepted traffic; 8-switch network, hot spot traffic (1 hot spot, with distinct curves for varying amounts of traffic)

overcomes all of these difficulties. If the SLs assigned at the source node and the SL-to-VL mapping tables in each switch are carefully constructed, the result is that VLs can be mapped to specific output ports, thus implementing VOQ. To achieve this, the network topology must be examined, and every combination (4-tuple) of switch, input port, output port, and output port at the next switch must be determined. Next, the paths between all possible source/destination pairs are traversed. For every 4-tuple visited during this traversal, a locally unique SL is assigned. Locally unique means that all of the 4-tuples that share the same switch, input port, and output port must have distinct SLs. If locally unique SLs can be assigned to every 4-tuple, and the number of VLs is greater than or equal to the number of output ports on each switch, then VOQ can be fully implemented. When VOQ is implemented for every path in the network, it is referred to as full VOQ. If infinitely many distinct SLs and VLs could be used, then full VOQ would always be realizable. However, IBA only allows a maximum of 15 SLs and 15 VLs (plus one VL that is reserved for subnet management). In addition, all 15 SLs and VLs might not be available for VOQ; some of them might be reserved for other purposes. For a given network, only partial VOQ might be possible. In the case of partial VOQ, maximal performance can be achieved by weighting the 4-tuples according to the number of paths in the entire network that include them, then implementing VOQ for the 4-tuples with the highest weights. Similar to the simulations described in Section 5 to evaluate adaptive routing, [5] evaluated the performance of VOQ by simulating various irregular network topologies with a number of different traffic patterns.
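The path-weighting step for partial VOQ described above can be sketched as follows. This is only an illustration of the idea in [5]; the data layout and names are hypothetical. Each end-to-end path is represented as the list of (switch, input port, output port, next input port) 4-tuples it traverses; the heaviest 4-tuples are covered first when SLs and VLs are scarce.

```python
from collections import Counter

def weight_4tuples(paths):
    """Weight each (switch, in_port, out_port, next_in_port) 4-tuple by the
    number of end-to-end paths that traverse it (sketch of partial-VOQ
    prioritization; heavier 4-tuples get VOQ treatment first)."""
    weights = Counter()
    for path in paths:          # each path is the list of 4-tuples it visits
        weights.update(path)
    return weights

# Two hypothetical paths that share their last hop through switch S2.
paths = [
    [("S1", 0, 2, 1), ("S2", 1, 3, 0)],
    [("S1", 1, 2, 1), ("S2", 1, 3, 0)],
]
weights = weight_4tuples(paths)
# ("S2", 1, 3, 0) lies on both paths, so it is the first VOQ candidate.
assert weights.most_common(1)[0] == (("S2", 1, 3, 0), 2)
```

Sorting the counter and assigning locally unique SLs down the ranking until the SL/VL budget is exhausted yields the partial-VOQ configuration.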
Average packet latency and throughput were measured. Performance with VOQ was compared with a network with the same number of SLs and VLs, used in the traditional sense as separate Virtual Networks (VNs). Figure 4 shows the improvement achieved with VOQ under varying amounts of hot spot traffic. Without VOQ, the network saturates early, due to congestion spreading. With VOQ, saturation occurs at a much higher throughput. Higher amounts of hot spot traffic result in

Figure 5. Average packet latency vs. accepted traffic; 8-switch network, hot spot traffic (4 hot spots receiving 20% of all traffic)

reduced throughput for both VOQ and VN, but the ratio of throughput improvement for VOQ vs. VN actually increases (150% improvement for 10% hot spot traffic, 900% improvement for 70% hot spot traffic). Figure 5 analyzes the performance of hot flows and cold flows under hot spot traffic. There is not much improvement on the hot flow itself for VOQ compared with VN. However, the throughput for the cold flows is drastically improved with VOQ, since HOL blocking is reduced or eliminated. The results shown all involve 8 SLs, 8 VLs, and 8-switch networks. With only 4 SLs and 4 VLs, performance improvements are not as large, but are still significant. This is an important result, since full VOQ is often not possible; even with very limited resources, partial VOQ still successfully controls congestion. Also, in larger 16-, 32-, and 64-switch networks, VOQ still delivers consistent improvements. It is interesting to note that up*/down* routing was used in this simulation, the same routing algorithm that was unable to handle hot spot traffic with adaptive routing in Section 5.

7. Injection Rate Throttling/Explicit Congestion Notification

The fourth way to control congestion is to decrease the amount of traffic injected into the congested area. This allows the congested nodes more time to process and transmit packets, and resolves the network congestion. To implement injection rate throttling, congestion must be detected, the offending traffic flows must be notified, and they must reduce their traffic. There must also be a provision to recover from this reduced injection rate and return to the normal rate once congestion has been resolved, in order to utilize the full network bandwidth. [1] proposes a strategy for implementing this congestion control method in IBA using an end-to-end Explicit Congestion Notification (ECN) scheme.
ECN functions as follows, as shown in Figure 6.

1) A switch detects congestion when a Virtual Lane's (VL) input buffer becomes full.

2) The switch sets the ECN bit in the headers of the packets in the congested VL. Once it is set, the ECN bit cannot be modified.

3) The packets are propagated forward. When each one reaches its destination, the end node recognizes the ECN bit, and sends another ECN

Figure 6. Explicit Congestion Notification (ECN) mechanism in IBA

packet back to the source node. (For reliable data streams, where ACK packets are already used, the ECN bit is set in the ACK packet. For unreliable data streams that do not use ACK packets, a small Congestion Notification (CN) packet is created and sent to the source.)

4) When a source node receives an ECN packet, it decreases its injection rate. (For reliable data streams, the source node can decrease the injection rate as well as reduce its window limit. The window limit is the number of packets that are allowed to be sent before an ACK is received.)

5) If a source node continues to receive additional ECN packets, it continues to decrease the injection rate. On the other hand, if a certain amount of time passes without receiving additional ECN packets, it can be assumed that the congestion has abated, and the source will gradually increase the injection rate until it is returned to its default level.

Let us examine some of the specific properties of this mechanism. ECN relies on end-to-end notification, which has a relatively slow reaction time. It would be faster for a switch to send notification upstream without first sending the notification all the way downstream to the destination node, or faster still for a switch to throttle its own packets' traffic rate. The relatively slow reaction time is somewhat mitigated by the fact that end-to-end latency in InfiniBand networks is very small. The benefit of this method is that it is simple to implement within the parameters of the existing IBA hardware specification, and it allows the injection rate to be throttled for the specific traffic flow that is causing the congestion. End-to-end notification is needed because the switches have only limited information about the traffic they are routing.
By allowing the end nodes to control the notification, the specific traffic flow that is causing the congestion can be identified, rather than throttling all traffic flows originating from the source node. The algorithm for increasing and decreasing the injection rate is critical. The rate reduction must be fast enough to respond quickly to congestion and limit spreading, but not so fast that it induces starvation. The rate increase must be fast enough to recover any underutilized bandwidth when congestion has been resolved, but must not be so fast that it prematurely contributes additional traffic before congestion has been resolved. [1] provides guidelines for a source response function to achieve this balance. When an ECN packet is received, the injection rate is decreased. The injection rate is controlled by an Inter-Packet Delay (IPD) value that determines how long to wait between sending packets. (For reliable data streams, the window limit can also be adjusted.) The source response function only needs to decrease the injection rate for ECN packets that resulted from data packets sent since the most recent adjustment to the injection rate, to avoid reacting to congestion notification that reflects old settings of the injection rate. For injection rate increases, the source maintains a Remaining Packets until Increase (RPI) counter. For every packet that is sent without receiving an ECN (or, in the case of reliable data streams, for every unmarked ACK that is received), the RPI counter is reduced by one. Once the RPI counter reaches zero, the injection rate and/or window limit can be increased. Additionally, the number of ECN packets it takes to reduce the injection rate by a given

amount must be less than or equal to the number of unmarked packets it takes to increase the injection rate by the same amount. Traffic flows with lower injection rates must recover more quickly than traffic flows with higher injection rates. Traffic flows operating at the minimum injection rate must increase the injection rate after a single unmarked packet, rather than waiting for the RPI counter to reach zero. These requirements ensure stable and efficient operation under a variety of corner-case conditions. [1] also has a provision for backward compatibility: switches and end nodes that do not support ECN are fully compatible with those that do. If a non-compliant switch becomes congested, the congestion will likely spread to a compliant neighboring switch, which can then initiate the ECN protocol. If a switch marks a packet destined for an end node known to be non-compliant, the switch should also send a CN packet to the subnet manager. If a destination node receives a marked data packet from a source node known to be non-compliant, it should send a CN packet to the subnet manager. When the subnet manager receives CN packets, it notifies the offending source node and adjusts its injection rate. ECN is a robust method for congestion control. It is completely dynamic; it can react to any traffic pattern. Because of its closed-loop, end-to-end nature, it can handle persistent hot spots that last indefinitely.

8. ECN Implementation

[1] was published in 2002 by HP Laboratories, and was adopted as an optional section of the InfiniBand specification in 2004 (version 1.2) as the Congestion Control Annex (CCA) [3]. The actual implementation differs in some regards from the original proposal discussed in Section 7. In some areas, the CCA simplifies ECN. The source response function has been simplified: the window limit is not used as a means of controlling traffic; only the inter-packet injection rate is considered. Also, there is no RPI counter governing rate increase.
Instead, rate increase is controlled by a Congestion Control Table Index (CCTI) Timer. The CCTI Timer is cyclical and its duration can be set anywhere from 1.024 µs to 67 ms. Each time it expires, the injection rate is increased (and the timer is reset). The advantage of this implementation is that the injection rate is guaranteed to return to its default level, even if there is a break in the traffic and no packets are sent. It would seem that determining a balanced CCTI Timer value would be difficult, since the rate increase is a function of time rather than a function of packets, whereas the rate decrease is still a function of packets, as originally proposed. Actually, the CCTI Timer can be set in a manner similar to the RPI counter of the original proposal by setting CCTI Timer duration = RPI counter size * IPD time. Another simplification in the CCA is that no subnet manager enhancement is provided for backward compatibility. Non-compliant switches and end nodes simply ignore ECN packets, with no intervention from the subnet manager; a mix of congestion control aware and unaware devices is supported "with, at most, some loss of Congestion control" [3]. The enhancement may have been abandoned because it was not well-specified in the original proposal; the basic functionality was explained, but details of how to implement high-level congestion control in the subnet manager were not defined. More importantly, the enhancement did not seem scalable. In the extreme case, in a large network with all CCA-aware switches and no CCA-aware end nodes, the subnet manager would need to handle injection rate control for every source node in the network. In other areas, the CCA provides more flexibility than the original proposal. Flexibility is to be expected for an industry-wide specification with a multitude of applications.
For example, a threshold value can be specified for congestion detection, so that buffers are considered congested when they pass a certain limit, rather than only when they are completely full. This allows networks to detect congestion earlier. Because the CCA is flexible, it requires some tuning to determine parameter values (such as

Figure 7. Throughput vs. time WITHOUT congestion control
Figure 8. Throughput vs. time WITH congestion control
(Figures 7-8: 3-stage network, hot spot traffic, 32 sources sending to the hot spot)

buffer threshold, maximum IPD, IPD increment amount, and CCTI Timer duration) that result in effective congestion control. [2] claims to be the first paper not only to develop guidelines for setting these parameters, but to prove that parameter values even exist that can utilize the CCA to control congestion in a realistic network. [2] simulated a variety of InfiniBand networks with regular, folded butterfly topologies, under uniform and hot spot traffic. The simulation results demonstrate that ECN does an excellent job of controlling congestion. Figures 7 and 8 show a different-colored plot for the throughput seen at each of 32 end nodes during a finite period of hot spot traffic. The single red plot at the top of Figure 7 shows that the single hot spot node remained saturated during the congestion period, while throughput for the rest of the nodes was drastically reduced without congestion control, due to congestion spreading. With congestion control enabled, throughput on the cold flows remains high, as shown in Figure 8. These simulations also demonstrate the importance of fine-tuning the CCA parameters. If they are not set correctly, the network throughput can oscillate or become unstable. If the IPD increment amount is set too low (Figure 9), the source nodes react too slowly to the congestion. If the IPD increment amount is set too high (Figure 10), the source nodes react too quickly, making drastic, oscillating changes to the injection rates. If the CCTI Timer expires too quickly (Figure 11), the source nodes are unable to reduce the injection rate, and there is congestion spreading and throughput collapse, just as if there were no congestion control in place.
If the CCTI Timer expires too slowly (Figure 12), the source nodes react quickly but recover slowly, resulting in unnecessary loss of throughput on the hot flow. Thanks to extensive experimentation, [2] is able to offer general guidelines for setting the CCA parameters. Given N = the number of network ports and HSD = the maximum number of sources that could send traffic to a single hot spot, acceptable values are as follows:

Buffer threshold = 90% of switch input buffer size
Maximum IPD = 2/3 * HSD µs
IPD table index increment amount = min(N/6, HSD/2)
CCTI Timer duration = 10 µs

These guidelines are essential for anyone hoping to make use of IBA's built-in ECN mechanism. However, these results are based on regular, butterfly topologies. Further experimentation is required to determine guidelines for torus networks. It is unknown

Figure 9. Throughput vs. time, IPD increment amount set too low
Figure 10. Throughput vs. time, IPD increment amount set too high
Figure 11. Throughput vs. time, CCTI Timer set too fast
Figure 12. Throughput vs. time, CCTI Timer set too slow
(Figures 9-12: hot spot traffic, 32 sources sending to the hot spot)

whether general guidelines can even be established for irregular networks, which are very common among InfiniBand installations.

9. Conclusion

ECN is the best method for congestion control in InfiniBand, for many reasons. It is completely dynamic, requiring little knowledge about the network or the traffic patterns that are expected. It targets both cold flows and hot flows. It is a standard part of the IBA specification; it does not require tricky algorithms to repurpose existing resources for uses they were not originally designed for. However, the tunable parameters may require some tweaking to achieve the best performance. VOQ is the next-best alternative. It does an effective job of controlling congestion spreading by unblocking cold flows during hot spot traffic, and it performs well with just a few dedicated SLs and VLs. However, the traversal and mapping algorithms for determining appropriate SLs and VLs to enable VOQ in InfiniBand switches are complicated, and if there are any changes to the network topology, all of the tables must be recalculated. Adaptive routing is another useful method for avoiding congestion in the general case.

However, it requires specially configured routing tables to be implemented in IBA, and it has not been proven to be very effective in resolving hot spot traffic. An ideal solution would be to combine all three techniques; they are all compatible. Adaptive routing would enable network traffic to avoid congestion under low-intensity conditions. With hot spot traffic, congestion spots would still develop, and VOQ would allow most flows to bypass HOL blocking, maintaining high overall throughput. If the hot spots persisted, ECN would throttle the hot flow traffic to keep the congestion from spreading upstream. This does not imply that the congestion problem for InfiniBand networks is completely solved; there are areas that can be improved. A better adaptive routing algorithm for irregular networks could be developed; as we saw in Section 5, up*/down* routing does not adapt well to hot spot traffic. Another improvement would be to add more SLs and VLs to the IBA specification. This would allow VOQ to approach full VOQ, rather than partial VOQ, and would also create more resources for other uses, such as QoS. For ECN, the backward compatibility functionality that was outlined in [1] but dropped in the CCA could be developed. Also, a more sophisticated injection throttling algorithm could be developed; for example, [10] proposes a power source response function, rather than the standard multiplicative function, to enable faster reaction times and better utilization of available bandwidth. There are already some effective tools available for controlling congestion in IBA, but there are still opportunities for further research.

10. References

[1] Y. Turner, J. R. Santos, and G. Janakiraman, "An Approach for Congestion Control in InfiniBand," Internet Systems and Storage Laboratory, HP Laboratories Palo Alto, HPL (R.1), May.

[2] G. Pfister et al., "Solving Hot Spot Contention Using InfiniBand Architecture Congestion Control,"
[2] G. Pfister et al., "Solving Hot Spot Contention Using InfiniBand Architecture Congestion Control," in High Performance Interconnects for Distributed Computing.
[3] InfiniBand Trade Association, InfiniBand Architecture Specification, Volume 1, Release 1.2 [online document], October 2004.
[4] J. C. Martínez, J. Flich, A. Robles, P. López, and J. Duato, "Supporting Adaptive Routing in InfiniBand Networks," in 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing, February.
[5] M. E. Gómez, J. Flich, A. Robles, P. López, and J. Duato, "VOQSW: A Methodology to Reduce HOL Blocking in InfiniBand Networks," International Parallel and Distributed Processing Symposium.
[6] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann.
[7] J. C. Sancho, A. Robles, and J. Duato, "A Flexible Routing Scheme for Networks of Workstations," High Performance Computing: Third International Symposium.
[8] W. Qiao and L. M. Ni, "Adaptive Routing in Irregular Networks Using Cut-Through Switches," in Proceedings of the 1996 International Conference on Parallel Processing, Volume 1.
[9] F. Silla and J. Duato, "Improving the Efficiency of Adaptive Routing in Networks with Irregular Topology," in Fourth International Conference on High Performance Computing.
[10] S. Yan, G. Min, and I. Awan, "An Enhanced Congestion Control Mechanism in InfiniBand Networks for High Performance Computing Systems," in 20th International Conference on Advanced Information Networking and Applications, Volume 1, 2006.


LID Assignment In InfiniBand Networks LID Assignment In InfiniBand Networks Wickus Nienaber, Xin Yuan, Member, IEEE and Zhenhai Duan, Member, IEEE Abstract To realize a path in an InfiniBand network, an address, known as Local IDentifier (LID)

More information

Deadlock. Reading. Ensuring Packet Delivery. Overview: The Problem

Deadlock. Reading. Ensuring Packet Delivery. Overview: The Problem Reading W. Dally, C. Seitz, Deadlock-Free Message Routing on Multiprocessor Interconnection Networks,, IEEE TC, May 1987 Deadlock F. Silla, and J. Duato, Improving the Efficiency of Adaptive Routing in

More information

Congestion Avoidance Overview

Congestion Avoidance Overview Congestion avoidance techniques monitor network traffic loads in an effort to anticipate and avoid congestion at common network bottlenecks. Congestion avoidance is achieved through packet dropping. Among

More information

Congestion Collapse in the 1980s

Congestion Collapse in the 1980s Congestion Collapse Congestion Collapse in the 1980s Early TCP used fixed size window (e.g., 8 packets) Initially fine for reliability But something happened as the ARPANET grew Links stayed busy but transfer

More information

ETSF05/ETSF10 Internet Protocols. Performance & QoS Congestion Control

ETSF05/ETSF10 Internet Protocols. Performance & QoS Congestion Control ETSF05/ETSF10 Internet Protocols Performance & QoS Congestion Control Quality of Service (QoS) Maintaining a functioning network Meeting applications demands User s demands = QoE (Quality of Experience)

More information

Advanced Computer Networks

Advanced Computer Networks Advanced Computer Networks QoS in IP networks Prof. Andrzej Duda duda@imag.fr Contents QoS principles Traffic shaping leaky bucket token bucket Scheduling FIFO Fair queueing RED IntServ DiffServ http://duda.imag.fr

More information

Performance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing

Performance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing Performance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing J. Flich, M. P. Malumbres, P. López and J. Duato Dpto. Informática de Sistemas y Computadores Universidad Politécnica

More information

Supporting Service Differentiation for Real-Time and Best-Effort Traffic in Stateless Wireless Ad-Hoc Networks (SWAN)

Supporting Service Differentiation for Real-Time and Best-Effort Traffic in Stateless Wireless Ad-Hoc Networks (SWAN) Supporting Service Differentiation for Real-Time and Best-Effort Traffic in Stateless Wireless Ad-Hoc Networks (SWAN) G. S. Ahn, A. T. Campbell, A. Veres, and L. H. Sun IEEE Trans. On Mobile Computing

More information

CS4700/CS5700 Fundamentals of Computer Networks

CS4700/CS5700 Fundamentals of Computer Networks CS4700/CS5700 Fundamentals of Computer Networks Lecture 16: Congestion control II Slides used with permissions from Edward W. Knightly, T. S. Eugene Ng, Ion Stoica, Hui Zhang Alan Mislove amislove at ccs.neu.edu

More information

Abstract. Paper organization

Abstract. Paper organization Allocation Approaches for Virtual Channel Flow Control Neeraj Parik, Ozen Deniz, Paul Kim, Zheng Li Department of Electrical Engineering Stanford University, CA Abstract s are one of the major resources

More information

ETSF05/ETSF10 Internet Protocols. Performance & QoS Congestion Control

ETSF05/ETSF10 Internet Protocols. Performance & QoS Congestion Control ETSF05/ETSF10 Internet Protocols Performance & QoS Congestion Control Quality of Service (QoS) Maintaining a functioning network Meeting applications demands User s demands = QoE (Quality of Experience)

More information

Congestion Avoidance

Congestion Avoidance Congestion Avoidance Richard T. B. Ma School of Computing National University of Singapore CS 5229: Advanced Compute Networks References K. K. Ramakrishnan, Raj Jain, A Binary Feedback Scheme for Congestion

More information