
The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering

QUALITY OF SERVICE PROVISIONING IN CLUSTERS

A Thesis in Computer Science and Engineering
by Ki Hwan Yum

© 2002 Ki Hwan Yum

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2002

We approve the thesis of Ki Hwan Yum.

Date of Signature

Chita R. Das
Professor of Computer Science and Engineering
Thesis Adviser, Chair of Committee

Mary Jane Irwin
Distinguished Professor of Computer Science and Engineering

George Kesidis
Associate Professor of Electrical Engineering and Computer Science and Engineering

Vijaykrishnan Narayanan
Assistant Professor of Computer Engineering

Natarajan Gautam
Assistant Professor of Industrial and Manufacturing Engineering

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

Abstract

Cluster systems are becoming increasingly attractive for designing scalable servers with switched network architectures that offer much higher bandwidth than broadcast-based networks. Design of high performance cluster networks with Quality of Service (QoS) guarantees therefore becomes important to support a variety of multimedia applications, many of which have real-time constraints. Most commercial routers, which are based on the wormhole-switching paradigm, can deliver high performance but lack QoS provisioning. Therefore, it would be advantageous if we could leverage the large amount of effort that has gone into the design and development of these wormhole routers and adapt them to support integrated traffic with minimal design changes.

The overall goal of this thesis is to design and analyze cluster networks that can provide high and predictable performance. Design and evaluation of QoS-capable cluster networks is the focus of this thesis. In particular, we investigate various issues for efficient handling of best-effort and real-time traffic in clusters based on the wormhole switching paradigm.

We study five research issues in this thesis. First, a non-preemptive pipelined wormhole router, called MediaWorm, is proposed and investigated based on a basic wormhole router architecture. Second, to overcome the inflexibility of the MediaWorm router for dynamic workloads, a preemptive pipelined router architecture is proposed. We propose a flit-level input buffer preemption mechanism and a flit acceleration mechanism for preempting lower priority messages in favor of higher priority messages.

The third part of the thesis deals with the design of a QoS-capable network interface card (NIC) based on the Virtual Interface Architecture (VIA) paradigm. The QoS-capable routers and QoS-capable NICs are integrated to examine end-to-end QoS guarantees in a cluster system. Next, practical admission and congestion control mechanisms in a cluster environment are considered to aid in end-to-end QoS assurance. Finally, we extend our work to develop a simulation testbed conforming to the InfiniBand™ Architecture (IBA) specification and investigate QoS design issues for system area networks (SANs) within the IBA framework.

Table of Contents

List of Tables ..... viii
List of Figures ..... ix
Acknowledgments ..... xii
Chapter 1. Introduction ..... 1
Chapter 2. Related Work ..... 8
Chapter 3. Design of Wormhole Routers for QoS Support
    A Non-Preemptive Router Design: MediaWorm Router
        Architectural Design Issues
        A Rate-based Scheduling Algorithm for QoS Support
        Interconnection Topologies: Fat Networks
        Pipelined Circuit Switching
    A Preemptive Router Design
        Preemption in the Input Buffer
        A Flit Acceleration Mechanism
    Experimental Platform
        Simulation Testbed
        Workload
    Performance Results
        Comparison of MediaWorm and Traditional Routers
        Comparison of CBR and VBR Traffic Results
        Results with Mixed Traffic
        Impact of VCs and Crossbar Capabilities
        Effect of Message Size on Jitter
        Comparison of MediaWorm and PCS Routers
        Results with MPEG-2 Video Traces
        Fat-Mesh Results
        Comparisons of the Three Router Models
        A (2 × 2) Mesh Network Results
    Concluding Remarks
Chapter 4. A QoS Capable Network Interface Card Design
    Virtual Interface Architecture
    A QoS Capable NIC Design
    Performance Results
    Concluding Remarks
Chapter 5. Integrated Admission and Congestion Control in Clusters
    Basic Architecture
        Host Channel Adapter (HCA) Architecture
        VL Arbitration
    Admission and Congestion Control
        Admission Control
        Congestion Control Algorithm
    Experimental Platform
    Performance Results
        Comparisons of Congestion Control Algorithms
        Results with Admission and Congestion Control
    Concluding Remarks
Chapter 6. QoS Provisioning in InfiniBand™ Architecture (IBA)
    System Architecture
        InfiniBand™ Architecture (IBA)
        Switch Architecture
        Host Channel Adapter (HCA) Architecture
    Performance Enhancement Techniques
        Deterministic Routing Algorithm
        Packet Dropping in a Switch
    Experimental Platform
    Performance Results
    Concluding Remarks
Chapter 7. Conclusions and Future Work ..... 111
References ..... 115

List of Tables

3.1 VirtualClock and Fair Queueing Algorithms
3.2 Comparison of Fine-Grained VirtualClock with Fine-Grained Fair Queueing algorithms when the ratio of real-time to best-effort traffic is 80:20. Inter-frame time is the averaged time difference of frames measured at the destinations, and SD is the standard deviation of the inter-frame time
MPEG-2 Video Sequence Statistics
Simulation Parameters
Average Latency for Best-effort Traffic (8 × 8 switch, 16 VCs, 400 Mbps links)
Number of attempted, established and dropped connections for reaching a certain input loading in a PCS router. The values presented are for an (8 × 8) router with 24 VCs, 100 Mbps links
IBA Simulation Testbed Parameters

List of Figures

3.1 A Basic Pipelined Wormhole Router
3.2 Functional units along a router pipe for a 2-port router with 2 VCs per PC. Additional functional units such as the routing decision block and the arbitration unit are not shown. With a multiplexed crossbar as shown in the figure, contention amongst multiple pipes can occur in the crossbar input multiplexer (a) for the crossbar input port, within the crossbar (b) for the crossbar output ports, and in the VC multiplexer (c) for the output PC
3.3 MediaWorm Router Architecture
3.4 A 4-switch fat-mesh interconnect. Each switch (S0–S3) is an (8 × 8) switch. Each fat link comprises 2 physical links
3.5 Input Buffer Preemption in the Router
3.6 Flit Acceleration for Message m1
3.7 Flit Acceleration for Message m2
3.8 Comparison of MR and TR ((8 × 8) switch, 16 VCs, 400 Mbps links, x:y = 80:20)
3.9 Comparison of CBR and Synthetic VBR traffic in the MediaWorm router ((8 × 8) switch, 16 VCs, 400 Mbps links, all real-time traffic)
3.10 Mixed Traffic (synthetic VBR + best-effort traffic) ((8 × 8) switch, 16 VCs, 400 Mbps links)
3.11 Impact of VCs and Crossbar Capabilities ((8 × 8) switch, 400 Mbps links, x:y = 100:0)
Effect of message size on jitter ((8 × 8) switch, 400 Mbps link bandwidth, 16 VCs, all synthetic VBR traffic)
MR and PCS comparison ((8 × 8) switch, 100 Mbps link bandwidth, 24 VCs)
TR vs. MR with MPEG-2 Video Traffic ((8 × 8) switch, 16 VCs, 1.6 Gbps, x:y = 80:20)
Mixed Traffic (MPEG-2 Video Trace + best-effort traffic, (8 × 8) switch, 16 VCs, 1.6 Gbps)
Performance of a (2 × 2) fat mesh ((8 × 8) switches, 400 Mbps link bandwidth, 16 VCs)
Deadline missing probability and deadline missing time in a single router under dynamic load variation. The input load is specified in the graphs
Components of message latency of control traffic and best-effort traffic in a single router under dynamic load variation
Effect of block-level multiplexing in a single preemptive router under dynamic load variation
Comparison of preemption+acceleration and only preemption in a single router under dynamic load variation. (Some results for the best-effort traffic at high load are not included due to saturation.)
Deadline missing probability and average latency of best-effort traffic in a (2 × 2) mesh network under dynamic load variation
4.1 Virtual Interface Architecture Paradigm
A VIA-style NIC with QoS support
Validation of the NIC simulator with a ping-pong application
Co-evaluation of a single router and NICs. The deadline missing probability is shown for dynamically changing workload
A Proposed InfiniBand™ Host Channel Adapter with QoS Support
Connection Setup Procedure
Message Latency and Throughput in a 4 × 4 Mesh Network with 100% Best-effort Traffic
Performance Results of a Single Router Cluster with MPEG-2 Video Traffic (uniform distribution)
Performance Results of a Single Router Cluster with MPEG-2 Video Traffic (hot-spot distribution)
Average Message Latency in a 5 × 5 Mesh Network with On/Off Real-time Traffic
Message Latency and Throughput in a Single Router with On/Off Real-Time Traffic
A 5-stage Pipelined IBA Switch Model
Comparison of Various Routing Algorithms in a 15-Node Irregular Network
Comparison of Various Routing Algorithms in a 30-Node Irregular Network ..... 110

Acknowledgments

This thesis and my positive experience at Penn State owe a great deal to my advisor, Professor Chita Ranjan Das. Dr. Das was instrumental in providing the motivation and background that started off this work. His continuous support and guidance, his encouragement during hard times and, when needed, the healthy doses of constructive criticism that he provided have all been very valuable throughout my stay at Penn State.

My thesis committee members have helped review my thesis from its early proposal stages to its current form and have provided valuable inputs and suggestions. Professor Mary Jane Irwin, Professor George Kesidis, Professor Vijaykrishnan Narayanan, and Professor Natarajan Gautam have all taken the time and effort to accommodate my thesis review into their tight schedules. For this, I express my sincere gratitude.

During the course of my work, I have often looked to both internal and external sources of help for ideas and critical reviews. I have been fortunate to have received this from many colleagues, collaborators, academicians, and practitioners. I wish to thank all of them here. Eun Jung Kim, Professor Jose Duato, Dr. Mazin Yousif, Srinivas Hanabe, Vithal Shirodkar, and Giridhar Viswanathan, along with my advisor, have all collaborated with me on various research projects.

My greatest inspirational sources, supporters, and fans have been my parents and family. I dedicate this work to my parents Soo Chul Yum and Jung Ja Park, my wife Eun Jung Kim, and my son Sang Joon Yum. Thank you for your unconditional love and support.

Chapter 1
Introduction

Cluster systems are becoming increasingly attractive for designing scalable servers with switched network architectures that offer much higher bandwidth than broadcast-based networks. Quality of Service (QoS) provisioning in such clusters is becoming a critical issue with the widespread use of these systems in diverse commercial applications. The traditional best-effort service model that has been used for scientific computing is not adequate to support many cluster applications with varying consumer expectations. As an example, many web servers and database servers make efficient use of clustering technology from cost, scalability, and availability standpoints. However, the tremendous surge in dynamic web content, multimedia objects, e-commerce, and other web-enabled applications requires QoS guarantees in different connotations. The guaranteed communication delay and bandwidth requirements of the applications mandate that the cluster interconnect be able to handle these traffic demands. These demands, in turn, are passed on to the building blocks of the interconnects, the switching fabrics or routers. Hence, it has become crucial to revisit the design of router architectures to provide high and predictable performance.

Typically, two classes of traffic are generated with mixed or integrated workloads: best-effort traffic and real-time traffic.

While best-effort traffic usually does not have any stringent performance requirements (and is hence known as available bit rate (ABR) traffic), real-time traffic is further classified into constant bit rate (CBR) and variable bit rate (VBR) workloads. A cluster network should therefore support ABR, CBR, and VBR effectively.

Two switch or router design paradigms have been used to build clusters [18]. One is based on cut-through switching mechanisms (wormhole [15] and virtual cut-through (VCT) [34]), originally proposed for multiprocessor switches, and the other is based on packet switching. Current multiprocessor routers, primarily based on the cut-through paradigm, are suitable for handling ABR traffic. However, they may not be able to support stringent QoS requirements efficiently without modifying the router architecture. On the other hand, packet switching mechanisms like ATM can provide QoS guarantees, but they are not suitable for best-effort traffic, primarily due to high message latency compared to cut-through switching [17, 20]. Therefore, none of the existing network architectures are optimized to handle both best-effort and real-time traffic in clusters.

In view of this, a few researchers have explored the possibility of providing QoS support in router architectures [7, 17, 20, 38, 57]. Most of these designs have used a hybrid approach with two different types of switching mechanisms within the same router: one for best-effort and the other for real-time traffic. They have refrained from using wormhole switching because of the potential unbounded delay for real-time traffic. On the other hand, in the commercial world, wormhole switching appears to have become a de facto standard for clusters/multiprocessors. Therefore, it would be advantageous if we could leverage the large amount of effort that has gone into the design and development of these wormhole routers and adapt them to support all traffic classes with minimal design changes.

Some recent modifications to wormhole routers have been considered for handling traffic priority [12, 26, 42, 65]. However, to our knowledge, there have been no previous studies investigating the viability of supporting multimedia traffic with wormhole switching.

QoS support only in the router/interconnect is not adequate to assure application-level performance guarantees. In order to provide end-to-end QoS guarantees in clusters, QoS provisioning in the network interface card (NIC) is also important. It is known that the NIC plays a crucial role in reducing the communication overhead [47]. The role of the NIC may become even more important to satisfy the QoS requirements of different traffic classes. Several user-level communication mechanisms have been proposed recently, where an application can directly communicate with an intelligent NIC with minimal kernel support [2]. Among them, the Virtual Interface Architecture (VIA) [19, 73] framework is becoming a standard for designing user-level communication on NICs. However, it is not clear how QoS provisioning should be provided in the context of VIA. In addition, a co-evaluation of the cluster router/interconnect with a VIA-style NIC is essential to understand the interplay of different designs on the overall performance of the communication architecture. To our knowledge, none of the prior research has considered the above research issues in the design and evaluation of QoS capable cluster interconnects.

Finally, admission and congestion control mechanisms are integral parts of any QoS design for systems that support integrated traffic. While an admission control algorithm helps in delivering the assured performance, a congestion control algorithm regulates traffic injection to avoid network saturation. However, the integration of admission and congestion control in clusters has not been examined up to now.

Admission control algorithms help to meet the Service Level Agreements (SLAs) of real-time applications. However, admission control alone may not be effective enough to guarantee the SLAs of real-time and best-effort applications because they may exhibit unpredictable behavior, resulting in short- or medium-term network traffic overload. Such traffic overload considerably degrades overall network throughput. Therefore, a congestion management algorithm is typically used to monitor the network load and intervene when the traffic load reaches a certain threshold indicating possible network congestion. Since a congestion management scheme also brings its own set of constraints on the injection of traffic flows into the network, both admission control and congestion management are collectively needed to guarantee various QoS constraints. This is especially true in clusters running a diverse set of applications.

Recently, the InfiniBand™ Architecture (IBA) has been proposed as a new communication standard to design SANs for scalable, high performance clusters. IBA is expected to revolutionize the future communication paradigm by solving the bandwidth, scalability, reliability, and standardization issues under one unifying design. The IBA Trade Association (IBTA), consisting of more than 220 industry leaders, has released the first IBA specification [29] and is currently augmenting it with enhanced features such as Congestion Management, Quality of Service (QoS), and Router Management. QoS is becoming an essential part of the IBA framework [54] because of the sophistication of services that will be supported by clusters connected through SANs. IBA could use either a packet-switched or a virtual cut-through switched technology to connect processors and I/O devices. The specification supports any topology to facilitate ease of expansion and to build large networks consisting of smaller subnets.

It outlines only the functionalities, without any constraints on the actual design. Therefore, it is conceivable to have multiple design alternatives for the same set of high-level requirements. This makes the design space very complex because of the multitude of options possible at different levels of the design. An IBA testbed is, therefore, essential to investigate various design options for satisfying the performance and QoS requirements. However, no such simulation platform is available now, and as we understand, the IBTA is planning to develop such a platform with help from academia.

This research is also aimed at investigating the following design issues for providing improved and predictable performance in IBA. First, it is not clear what a good routing algorithm for IBA is, considering the fact that the interconnect could be an irregular topology. Second, the IBA specification supports multipathing to facilitate Automatic Path Migration (APM) between a source and destination pair to provide fault tolerance. However, the actual path setup in the routing/forwarding table is left open to the designers. Moreover, we believe that the multipathing mechanism can be used not only for fault tolerance, but also for congestion avoidance to improve performance. Therefore, it is essential to understand the design and performance implications of multipathing. Finally, packet dropping is allowed under the IBA framework to limit the lifetime of a packet in a network. Packet dropping can also be used for deadlock avoidance. Thus, instead of using a complex deadlock-free algorithm, one can use a simple routing scheme with packet dropping to provide competitive performance. This concept needs careful investigation.

It appears that QoS provisioning in clusters is an important but open area of research. The main motivation of this research is to design and investigate various issues for QoS provisioning in clusters.

The research includes the development of QoS capable routers, QoS capable NICs, and admission and congestion control algorithms for wormhole-switched and IBA-style SANs. An overview of the research is given below. The research proposes to investigate the following issues.

• A non-preemptive router model (MediaWorm router): We propose to design a wormhole router, called MediaWorm, to support both real-time and best-effort traffic. In this model, the virtual channels (VCs) are divided according to traffic types at configuration time. A rate-based scheduling algorithm, called VirtualClock [82], is used to provide proportional bandwidth allocation.

• A preemptive router model: Since the MediaWorm router statically divides the input and output VCs among the traffic classes, the configuration cannot be changed during execution. Therefore, if the workload consisting of real-time and best-effort traffic changes dynamically during execution, it may suffer from a shortage of resources. The preemptive router model can be a solution to this problem, where several classes of traffic can share a VC, with the provision that a higher priority message can preempt a lower priority message. The design of the preemptive model is examined in detail.

• QoS capable NIC design: We propose a network interface card (NIC) design based on the Virtual Interface Architecture (VIA) to support QoS for real-time traffic. The QoS capable NIC and the router designs will be integrated to evaluate the entire communication substrate for an end-to-end performance analysis.

• Admission and congestion control in clusters: Next, in order to provide end-to-end QoS guarantees for applications, we propose to develop a simple admission control mechanism and an elegant congestion control mechanism called credit-based congestion control. These algorithms are developed using the MediaWorm router and the QoS-capable NIC developed in this research.

• QoS provisioning in InfiniBand™ Architecture (IBA): Finally, a simulation testbed for IBA that includes a packet-switched router, adaptive routing, and Weighted Round Robin (WRR) scheduling will be developed for the design and evaluation of the IBA framework. Architectural modifications will be investigated for QoS provisioning.

The rest of the thesis is organized as follows. Chapter 2 summarizes the related work. Chapter 3 discusses the designs of the non-preemptive router, called MediaWorm, and the preemptive router. In Chapter 4, the QoS capable NIC design is investigated. Integration of admission and congestion control into cluster networks is the topic of Chapter 5. In Chapter 6, the IBA simulator is discussed, followed by the conclusions in Chapter 7.

Chapter 2
Related Work

With the building block of a multiprocessor interconnect being its router or switch fabric, a considerable amount of research effort has gone into the design of efficient routers. Routers from university projects like the Reliable Router [13] and the Chaos router [41], and commercial routers such as the SGI SPIDER [23], Cray T3D/E [59, 60], Tandem ServerNet-II [24], IBM SP2 switch [68], and Myrinet [5] use wormhole switching, while the HAL Mercury [75] and Sun S-Connect [48] use virtual cut-through (VCT). Most of them support VCs, and at least the Cray T3E, ServerNet-II, and S-Connect have adaptive routing capability. Metro [9] and Ariadne [1] employ the pipelined circuit switching (PCS) technique; the latter is fully adaptive and tolerates link and switch faults. A hybrid switch including both wormhole and VCT was designed in [61]. All these routers are primarily designed to minimize average message latency and improve network throughput. The SGI SPIDER, Sun S-Connect, and Mercury support message priority. But none of these routers can guarantee QoS as required for real-time applications like VOD services. ServerNet is the only router that provides a link arbitration policy (called ALU-biasing) for implementing bandwidth and delay control, but it still does not provide any capabilities to support multimedia traffic.

Recently, a few researchers have explored the possibility of providing QoS support in multiprocessor/cluster interconnects.

The need for such services, existing methods to support QoS specifically in WAN/long-haul networks, and their limitations are summarized in [8, 38]. Kim and Chien [39] propose a scheduling discipline, called rotating and combined queue (RCQ), to handle integrated traffic in a packet-switched network. The Switcherland router [20], designed for multimedia applications on a network of workstations, uses a packet-switched mechanism similar to ATM, while avoiding some of the overheads associated with the WAN features of ATM. The router architecture proposed in [57] uses a hybrid approach, wherein wormhole switching is used for best-effort traffic and packet switching is used for time-constrained traffic. A multimedia router architecture (MMR), proposed in [7, 17], also adheres to a hybrid approach by using pipelined circuit switching (PCS) for multimedia traffic and virtual cut-through (VCT) for best-effort traffic. The authors have designed a (4 × 4) router to support both PCS and VCT schemes, and have used MPEG video traces in their evaluations. While a connection-oriented mechanism such as PCS is suitable for multimedia traffic, it needs one VC per connection. For a link bandwidth of 1.24 Gbps, and with each multimedia stream requiring 4 Mbps, the design would require 256 VCs to fully utilize a physical channel. It is not clear whether it is practical to have such a large number of VCs per physical channel and what the cost of the corresponding multiplexer and demultiplexer implementations would be. In addition, the architecture of the router is fairly complex since it has to have facilities for both PCS and VCT transmission. Nevertheless, this is perhaps the most detailed study where router performance has been analyzed with multimedia video streams, best-effort, and control traffic. A preemptive PCS network to support real-time traffic is also proposed in [3].

To our knowledge, there are only a handful of research efforts that have examined the possibility of using wormhole-switched networks for real-time traffic [12, 26, 35, 42, 65]. In many of these studies [35, 42, 65], the focus is on providing some mechanisms within the router to implement priority (for real-time traffic) and preemption (when the resources are allocated to a less critical message). However, these mechanisms are not sufficient (and may not even be necessary) for providing soft guarantees for multimedia traffic. Three different techniques for providing QoS in wormhole-switched routers are explored in [26] using a simulated multistage network. These include using a separate subnet for real-time traffic, supporting a synchronous virtual network on the underlying asynchronous network, and employing VCs. The first approach may not be cost-effective. The second solution of using a synchronous network (either inherently synchronous or simulated on top of an asynchronous network, as is done in [12] on Myrinet) is not a scalable option. The third option of using VCs has not been investigated in depth in [26], where it has been cursorily examined in the context of indirect networks. The software-oriented synchronization mechanism in the Myrinet switch proposed in [12] also lacks scalability.

Message preemption in wormhole routers has been addressed in [40, 65]. In [40], lower priority messages that block higher priority messages are discarded to allow faster delivery of the higher priority messages. This approach has the advantage that it does not require extra resources to store routing information of the preempted messages. But preempted messages are lost in this scheme, and thus it may not be a viable option for many applications. With additional hardware and flow control, it is possible to recover the low priority messages. Song et al. [65], on the other hand, preempt a lower priority message in favor of a higher priority message using additional buffers.

In their scheme, the router has (s − 1) extra input buffers, where s is the number of priority levels it supports. By providing these additional input buffers, the router can always establish a free path for higher priority messages. This scheme requires a history stack for storing the header information of the preempted messages in ascending order of their priorities for each output channel. Unlike our pipelined router model, the authors use a lumped router design. Hence, many architectural details that are required to support flit-level preemption are not addressed in their work. Provisioning for preemption in different stages of the pipeline is much more complex than in a single-stage (lumped) router model. But none of the above studies [40, 65] have examined the design details in the context of a pipelined router architecture.

It is still not clear what the best switching mechanism is that can support all traffic classes. Should we resort to hybrid routers that differentially service the traffic classes (and pay a high cost), as many of the above studies have done? Or can we use a single switching mechanism (wormhole switching in particular, since it has been proven to work well for best-effort traffic and we can leverage the immense body of knowledge/infrastructure available for this mechanism) with little or no modifications? Instead of discarding the wormhole switching mechanism as an option for multiple traffic classes in an ad hoc manner, as many of the above studies have done, this thesis explores how a large number of multimedia connections can be supported in the presence of best-effort traffic.

Admission Control: An admission control algorithm determines whether a new real-time traffic flow can be admitted to the network without jeopardizing the performance guarantees given to the already established flows. Such an algorithm is essential, irrespective of the underlying communication architecture, to regulate the traffic flow. Admission control in packet-switched networks has been a rich area of research. There are two broad classes of admission control algorithms: deterministic and statistical admission control. For real-time services that need a hard or absolute bound on the delay of every packet, deterministic admission is used [21]. For such deterministic services, an admission control algorithm calculates the worst-case behavior of the existing flows in addition to the incoming one before deciding if the new flow should be admitted. This model underutilizes network resources, especially with bursty traffic. Many new applications such as media streams do not need hard performance guarantees and can tolerate a small violation in performance bounds. A statistical admission control scheme can be used for such applications. In this approach, an effective bandwidth that is larger than the average rate but less than the peak rate is commonly used. The bandwidth can be computed using a statistical model [58] or a fluid flow approximation [33]. In addition, a third class of algorithms, called measurement-based algorithms, controls the admissible region based on aggregate traffic measurements [30, 56]. For admission control in clusters, the MMR design uses the average and peak rates of requests [17]. However, this router uses PCS for real-time traffic and needs one virtual channel (VC) per connection (flow). The Switcherland router [20], based on the ATM protocol, uses a statistical admission algorithm. A flit reservation flow control scheme that uses control flits to reserve bandwidth and buffers prior to the transfer of data flits has been proposed recently [53]. To our knowledge, there is no prior work on admission control in wormhole-switched networks.
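As a concrete illustration of the statistical approach, the sketch below admits flows against a single link using an effective bandwidth between the declared average and peak rates. The flow descriptor, the 0.5 weighting factor, and the link capacity are assumptions of this sketch, not parameters taken from any of the cited schemes.

```python
from dataclasses import dataclass

@dataclass
class FlowRequest:
    avg_rate_mbps: float   # long-term average rate declared by the source
    peak_rate_mbps: float  # peak rate declared by the source

class LinkAdmissionController:
    """Toy statistical admission test for a single link (illustrative only)."""

    def __init__(self, capacity_mbps: float, weight: float = 0.5):
        # weight in [0, 1] interpolates between the average (0) and peak (1) rate
        self.capacity_mbps = capacity_mbps
        self.weight = weight
        self.reserved_mbps = 0.0

    def effective_bandwidth(self, flow: FlowRequest) -> float:
        # The effective bandwidth lies between the average and the peak rate.
        return flow.avg_rate_mbps + self.weight * (
            flow.peak_rate_mbps - flow.avg_rate_mbps)

    def admit(self, flow: FlowRequest) -> bool:
        need = self.effective_bandwidth(flow)
        if self.reserved_mbps + need <= self.capacity_mbps:
            self.reserved_mbps += need   # reserve and accept the flow
            return True
        return False                     # reject: would overbook the link

# Example: a 400 Mbps link offered 4 Mbps (avg) / 8 Mbps (peak) video flows.
ac = LinkAdmissionController(capacity_mbps=400.0)
admitted = sum(ac.admit(FlowRequest(4.0, 8.0)) for _ in range(100))
print(admitted)  # 66 flows fit at an effective bandwidth of 6 Mbps each
```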

Congestion Control: Congestion control is required to regulate traffic injection into a network to avoid network saturation, which may lead to a performance penalty. In networks with QoS guarantees, congestion control mechanisms first attempt to regulate best-effort and misbehaving real-time traffic, and only if required, traffic from other service classes. In wormhole-switched networks, prior work on congestion control tends to limit the message injection rate in each node when a specified network saturation point is reached [4, 64, 69]. Local or global information could be used to determine network saturation. For example, Lopez et al. [4] used the busy/free status of VCs to assess network congestion. Smai and Thorelli [64] counted on the global network state to detect network congestion. To achieve a global view of the network, each node communicates its traffic status to other nodes, which may lead to excessive communication overhead. Thottethodi et al. [69] suggested a self-tuned approach that determines appropriate threshold values to estimate network congestion.

Previous congestion control algorithms for wormhole-switched networks do not provide end-to-end congestion control. They only consider the network/router status, not the NI, which is closer to the applications. Moreover, instead of penalizing the flow that caused congestion, a uniform reduction rate is typically applied to all the flows that pass through the congested point. Ideally, congestion control should be applied selectively per flow/application, as is done in Internet TCP flow control. The proposed algorithm has this selective control ability.
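The injection-rate limiting that these schemes share can be sketched as follows. Here the congestion estimate is simply the fraction of busy VCs observed at the local router, and the threshold and back-off/recovery steps are illustrative placeholders rather than values taken from [4, 64, 69].

```python
class InjectionThrottle:
    """Illustrative node-level injection limiter driven by a congestion estimate."""

    def __init__(self, threshold: float = 0.7, min_rate: float = 0.1):
        self.threshold = threshold  # busy-VC fraction that signals congestion
        self.min_rate = min_rate    # never throttle below this injection rate
        self.rate = 1.0             # fraction of the offered load actually injected

    def update(self, busy_vcs: int, total_vcs: int) -> float:
        """Adjust the permitted injection rate from the local VC occupancy."""
        occupancy = busy_vcs / total_vcs
        if occupancy > self.threshold:
            # Congestion suspected: back off multiplicatively.
            self.rate = max(self.min_rate, self.rate * 0.5)
        else:
            # Network looks healthy: recover additively.
            self.rate = min(1.0, self.rate + 0.05)
        return self.rate

throttle = InjectionThrottle()
print(throttle.update(busy_vcs=14, total_vcs=16))  # 0.5: congested, back off
print(throttle.update(busy_vcs=6, total_vcs=16))   # 0.55: recovering
```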

Internet Congestion Control: QoS support in the Internet architecture involves both admission control and end-to-end TCP congestion control. Admission control is included in support of services such as Differentiated Services (DiffServ) [RFC 2475] and Integrated Services (IntServ) [RFC 1633]. TCP congestion control is based on manipulating the congestion window size relative to the number of dropped packets and timeouts. Two major drawbacks of TCP congestion control are that congestion is detected only after noticing a packet drop, and that all flows subsequently reduce their injection rate after receiving the congestion notification. The latter problem is the well-known global synchronization problem. Various solutions have been proposed to mitigate this issue by detecting incipient congestion, including the Random Early Detection (RED) [22] algorithm and many of its variations [10, 43, 49]. A RED gateway continuously computes the average buffer size, randomly marks packets when a certain threshold is reached, and drops packets when the buffer gets full.
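The RED behaviour just described reduces to a few lines of control logic; the thresholds, maximum marking probability, and EWMA weight below are typical illustrative values, not ones mandated by [22].

```python
import random

class REDQueue:
    """Minimal Random Early Detection sketch (decision logic only, no real queue)."""

    def __init__(self, capacity, min_th, max_th, max_p=0.1, weight=0.002):
        self.capacity = capacity  # physical buffer size in packets
        self.min_th = min_th      # below this average size, never mark
        self.max_th = max_th      # above this average size, always mark
        self.max_p = max_p        # marking probability at max_th
        self.weight = weight      # EWMA weight for the average queue size
        self.avg = 0.0

    def on_arrival(self, queue_len: int) -> str:
        """Return 'enqueue', 'mark', or 'drop' for an arriving packet."""
        # Exponentially weighted moving average of the instantaneous queue size.
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        if queue_len >= self.capacity:
            return "drop"                      # buffer full: tail drop
        if self.avg < self.min_th:
            return "enqueue"                   # no incipient congestion
        if self.avg >= self.max_th:
            return "mark"                      # persistent congestion
        # Marking probability grows linearly between the two thresholds.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return "mark" if random.random() < p else "enqueue"
```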

Congestion control mechanisms that have been adopted for the Internet might not be suitable for clustered architectures, since clusters are typically hosted in small physical areas and make wide use of reliable communication. Thus, dropping packets is not a suitable option for managing congestion, particularly in a wormhole-switched network. This is also true for IBA [29]. Although IBA allows packet dropping in its various transport services, the details are still unclear. Further, the current version (1.0) of the IBA specification [29] provides minimal support for QoS. A 4-bit field is reserved for the Service Level (SL) specification in the packet header, but the specification does not define how to map the SLs to QoS classes. According to the InfiniBand Trade Association (IBTA), detailed QoS support will be available in the next release of the IBA specification. Thus, integrated admission and congestion control work in clusters/SANs is in its infancy.

In the next chapter, we start with the design of wormhole routers for supporting integrated traffic.

Chapter 3
Design of Wormhole Routers for QoS Support

In this chapter, we present two different designs for supporting integrated traffic in wormhole routers: a non-preemptive architecture and a preemptive architecture.

3.1 A Non-Preemptive Router Design: MediaWorm Router

The main motivation of this research is to investigate the feasibility of supporting mixed traffic in wormhole routers with minimal modifications to the existing router architecture. We are specifically interested in transferring multimedia video streams in addition to the usual best-effort traffic. This requires providing some mechanism within the router that recognizes the bandwidth requirements of VBR and CBR traffic and accommodates these requests. One can borrow concepts from real-time/Internet research to provide hard or soft guarantees. Instead of conservatively reserving resources within the router to achieve these goals with hard guarantees, we are interested in more optimistic solutions that provide soft guarantees to media streams.

In this research, we propose a new wormhole router architecture, called MediaWorm [78, 80], using a conventional pipelined wormhole router design for meeting the bandwidth requirements. Two modifications are proposed to a standard wormhole router. First, the VCs are partitioned into two classes: one for transferring best-effort traffic and the other for real-time traffic.

Second, in order to satisfy the bandwidth requirements of different applications, the round-robin (RR) or First-In-First-Out (FIFO) scheduler used in a traditional router is replaced by a rate-based scheduling mechanism, called VirtualClock [82].

Architectural Design Issues

In wormhole-switched networks, messages are segmented into flow-control units called flits. As a message enters a router, its header flit is used to determine the permitted output port that would route the message to its destination. The message then flows through the router crossbar to the appropriate output port. If resources (such as output buffer space or output ports) are busy, the message blocks until resources become available. Flits of a message flow through the network in a pipelined manner. Performance of wormhole routers can be enhanced through the use of virtual channels (VCs) [14]. VCs are also used for supporting deadlock freedom and providing adaptive routing capabilities. Wormhole routers can be pipelined so that, although a flit experiences a multi-cycle latency to get from its input port to an output port, the router cycle time can be kept very small (typically a few nanoseconds), depending on the slowest stage of the pipeline.

We use a pipelined router model, called PROUD [70, 71], to design the MediaWorm router. The pipelined model with five stages, as depicted in Figure 3.1, represents the recent trend in router designs [52]. Stage 1 of the pipeline represents the functional units which synchronize the incoming flits and demultiplex a flit so that it can go to the appropriate VC buffer to be subsequently decoded.

Figure 3.1. A Basic Pipelined Wormhole Router (Stage 1: sync, demux, buffer, decode; Stage 2: routing decision; Stage 3: arbitration; Stage 4: bandwidth reservation, crossbar mux, crossbar route; Stage 5: buffering, VC mux, sync. Tail and middle flits take the bypass path around Stages 2 and 3.)

If the flit is a header flit, the routing decision and arbitration for the correct crossbar output are performed in the next two stages (Stage 2 and Stage 3). On the other hand, the middle flits and the tail flit of a message bypass Stages 2 and 3 and move directly to Stage 4. Flits get routed to the correct crossbar output in Stage 4. The bandwidth of the crossbar may be (optionally) multiplexed amongst multiple VCs; this is discussed in detail later in this section. Finally, the last stage of the router performs buffering for flits flowing out of the crossbar, multiplexes the physical channel bandwidth amongst multiple VCs, and performs handshaking and synchronization with the input ports of other routers or the network interface for the subsequent transfer of flits.

A pipelined router can thus be modeled as multiple parallel PROUD pipes. In an n-port router, if each PC has m VCs, the router can be modeled as (n × m) parallel pipes. Resource contention amongst these pipes could occur for the crossbar output ports (which is managed by the arbitration unit) as well as for the physical channel bandwidth of the output link (which is managed by the virtual channel multiplexer).
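A minimal sketch of the five-stage pipe makes the bypass behaviour explicit; the stage labels follow Figure 3.1, while the flit representation is only a convenience for this illustration.

```python
from dataclasses import dataclass

STAGES = [
    "sync/demux/buffer/decode",   # Stage 1
    "routing decision",           # Stage 2 (header flits only)
    "arbitration",                # Stage 3 (header flits only)
    "crossbar traversal",         # Stage 4
    "vc mux / output sync",       # Stage 5
]

@dataclass
class Flit:
    kind: str  # "header", "middle", or "tail"

def stages_visited(flit: Flit) -> list[str]:
    """Header flits use all five stages; middle/tail flits bypass Stages 2-3."""
    if flit.kind == "header":
        return STAGES
    return [STAGES[0], STAGES[3], STAGES[4]]

# A header flit spends 5 router cycles in the pipe, a data flit only 3.
print(len(stages_visited(Flit("header"))), len(stages_visited(Flit("middle"))))
```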

We consider two different crossbar design options: a full crossbar and a multiplexed crossbar [14]. A full crossbar has a number of input and output ports equal to the total number of VCs supported, (n × m) for an n-port router with m VCs per PC. On the other hand, a multiplexed crossbar has a number of input/output ports equal to the total number of PCs (n). A full crossbar may improve the router performance at a significantly higher implementation cost; a multiplexed crossbar is cheaper to implement but requires more complex scheduling. Support for a larger number of VCs may mandate the use of a multiplexed crossbar from a practical viewpoint. For a multiplexed crossbar implementation, a multiplexer has to be used at the crossbar input ports and a demultiplexer at the crossbar output ports. Introduction of the additional multiplexer introduces a new contention point in the router. Figure 3.2 shows the various functional units along a router pipe when a router implements a multiplexed crossbar.

Figure 3.2. Functional units along a router pipe for a 2-port router with 2 VCs per PC. Additional functional units such as the routing decision block and the arbitration unit are not shown. With a multiplexed crossbar, as shown in the figure, contention amongst multiple pipes can occur in the crossbar input multiplexer (a) for the crossbar input port, within the crossbar (b) for the crossbar output ports, and in the VC multiplexer (c) for the output PC.

In order to allocate bandwidth to different types of traffic, we plan to use a rate-based scheduling algorithm at one of the contention points shown in Figure 3.2. The selection of a rate-based algorithm and its implementation are described next.

A Rate-based Scheduling Algorithm for QoS Support

There are two main categories of bandwidth scheduling algorithms: flow-based and frame-based. Flow-based algorithms like VirtualClock [82], Fair Queueing [16], General Processor Sharing (GPS) [51], Self-Clocked Fair Queueing (SCFQ) [27], and Frame-based Fair Queueing (FFQ) [66] use time stamps to make scheduling decisions, while frame-based scheduling algorithms like Round Robin (RR), Weighted RR (WRR) [32], Deficit RR (DRR) [62], and Hierarchical RR (HRR) [31] poll queues sequentially during each round with different priorities. The frame-based algorithms usually assign a known priority to each queue, but how to assign a priority to each queue with VBR traffic is not obvious. While a precomputed priority per queue helps reduce computation overhead, the flow-based algorithms must timestamp arriving packets and find the minimum time stamp every cycle. However, since in our router there can be multiple flows in each queue and we want to assign priorities to flows, not to queues, we focus on flow-based algorithms in this work.

For this study, we consider two different work-conserving, rate-based schedulers: Fair Queueing and VirtualClock. The effectiveness of the two schemes has been analyzed by several researchers for QoS assurance in packet-switched networks [67]. In both of these algorithms, there is a state variable associated with each channel i to monitor and enforce the rate for that channel.

In VirtualClock, the variable is called the auxiliary VirtualClock (auxVC); in Fair Queueing, it is called the Finish Number (F). The computation of auxVC and F is shown in Table 3.1. In VirtualClock, AT is the arrival time or wall-clock time. In Fair Queueing, R is the number of rounds that has been completed by a hypothetical bit-by-bit round robin server, n is the weight factor, and P is the message length (in bits). Vtick in VirtualClock and P/n in Fair Queueing specify the inter-arrival time of messages; therefore, a smaller value implies higher bandwidth. For best-effort traffic, Vtick is assigned the largest possible value. With Vtick and P/n specified, there is no difference between VirtualClock and Fair Queueing except that Fair Queueing uses the round robin number (R) instead of the actual arrival time (AT) required for VirtualClock. The computational complexity of R is O(N), where N is the total number of connections. Fair Queueing algorithms with lower computational complexity can be found in [27, 66]. We can use the system clock for AT in the VirtualClock algorithm, and hence it needs no extra computation. It has been shown that both schemes have similar performance [67], except that the VirtualClock algorithm cannot handle bursty traffic effectively without any input regulation. Traffic burstiness can be handled by regulating the traffic injection.

VirtualClock:
    auxVC_i <- max(AT, auxVC_i)
    auxVC_i <- auxVC_i + Vtick_i
    timestamp the packets with auxVC_i

Fair Queueing:
    F_i <- max(R, F_i)
    F_i <- F_i + P_i / n_i
    timestamp the packets with F_i

Table 3.1. VirtualClock and Fair Queueing Algorithms
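Table 3.1 translates directly into per-channel state updates. In the sketch below the caller supplies the wall-clock time AT (for VirtualClock) or the round number R (for Fair Queueing); the class names and units are ours.

```python
class VirtualClockChannel:
    """Per-channel state for the VirtualClock update of Table 3.1."""

    def __init__(self, vtick: float):
        self.vtick = vtick   # desired inter-arrival time; smaller = more bandwidth
        self.aux_vc = 0.0

    def stamp(self, arrival_time: float) -> float:
        self.aux_vc = max(arrival_time, self.aux_vc)  # auxVC_i <- max(AT, auxVC_i)
        self.aux_vc += self.vtick                     # auxVC_i <- auxVC_i + Vtick_i
        return self.aux_vc                            # timestamp the packet with auxVC_i


class FairQueueingChannel:
    """Per-channel state for the Fair Queueing update of Table 3.1."""

    def __init__(self, weight: float):
        self.weight = weight  # n_i: weight factor of channel i
        self.finish = 0.0

    def stamp(self, round_number: float, length_bits: float) -> float:
        self.finish = max(round_number, self.finish)  # F_i <- max(R, F_i)
        self.finish += length_bits / self.weight      # F_i <- F_i + P_i / n_i
        return self.finish                            # timestamp the packet with F_i
```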

The above algorithms were developed for connection-oriented networks, where one channel is dedicated to each connection (as in PCS), and when a connection is set up, a fixed Vtick (or P/n) value is assigned for the entire duration of the connection. This results in two problems. The first is that, when dealing with VBR connections, one representative Vtick (or P/n) value may cause underutilization of the resources or incur higher message delay. The other problem is that, since one channel services one connection, a large number of VCs is required to handle multimedia streams. Consequently, it may lead to a complex router design with more hardware circuitry.

In this study, we are interested in a connectionless paradigm without any explicit connection setup, since this provides more efficient use of the network resources. To overcome the above two problems, we modified the connection-oriented algorithms as follows: each message requests its required bandwidth at each router on its way to the destination, and the router implements the VirtualClock (or Fair Queueing) algorithm to allocate the requested bandwidth to its flits. So in our router, each message works as if it were a connection, and each flit works as if it were a message of the originally proposed algorithm.

In the original algorithms, the fixed Vtick (or P/n) can be calculated from the average bandwidth requirement or the peak bandwidth requirement of the connection. Vtick (or P/n) in this study represents the inter-generation time between flits, and is given as

    Vtick (or P/n) = message inter-arrival time / message size in flits.
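In the fine-grained variant, the Vtick carried in a message header is therefore derived from that message alone rather than from a connection-wide reservation. A minimal sketch, with times in microseconds purely for illustration:

```python
def message_vtick(inter_arrival_time_us: float, message_size_flits: int) -> float:
    """Vtick (or P/n) = message inter-arrival time / message size in flits."""
    return inter_arrival_time_us / message_size_flits

# A 33 ms frame gap spread over a 512-flit message asks for one flit every ~64.5 us.
print(round(message_vtick(33_000.0, 512), 1))
```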

Thus, the Vticks (or P/n) of two messages in the same connection can be different if they belong to different frames of different sizes. A message makes its request by carrying its Vtick (or P/n) in the header. When the tail flit leaves the router, its Vtick (or P/n) information in the router is discarded. We name the modified algorithms Fine-Grained VirtualClock (FGVC) and Fine-Grained Fair Queueing (FGFQ), respectively, since bandwidth reservation is done at the message-level granularity.

In a router implementation with a multiplexed crossbar, contention for link bandwidth can occur at one of three places: the crossbar input multiplexer for the crossbar input port, within the crossbar for the crossbar output port, and at the virtual channel multiplexer for the output physical channel. These are marked as (a), (b), and (c), respectively, in Figure 3.2. All of these places are potential candidates where rate-based bandwidth allocation can be performed. We rule out contention points (b) and (c) for the following reasons. In case (b), crossbar output port arbitration is performed at message-level granularity, whereas we are interested in flit-level bandwidth allocation. Case (c), corresponding to the VC multiplexer, is not a strong candidate either. This is due to the fact that at most one of the VCs of an output PC can receive a flit from the multiplexed crossbar per router cycle; when only one of the VCs has a flit in any given cycle, the scheduling algorithm essentially behaves as a FIFO scheduler. Hence, we have chosen to implement the rate-based scheduler at the crossbar input multiplexer (a), which means that, in any given cycle, if multiple flits from different VCs are competing for the same output port of the crossbar, the one with the smallest auxVC_i will be chosen as the winner.
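At contention point (a), the per-cycle decision therefore reduces to picking the competing VC with the smallest auxVC timestamp. A sketch of that arbitration step, with an ad hoc tuple representation for the candidates:

```python
def select_winner(candidates):
    """candidates: list of (vc_id, aux_vc, flit) tuples whose head flits compete
    for the same crossbar input port this cycle.
    Returns the candidate whose auxVC timestamp is smallest (FGVC arbitration)."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])

# Three VCs compete; the one with the smallest auxVC (VC 2) wins this cycle.
print(select_winner([(0, 120.0, "flit-a"), (2, 96.5, "flit-b"), (5, 240.0, "flit-c")]))
```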

In a router that implements a full crossbar, there is no crossbar input multiplexer (nor a demultiplexer at the crossbar output). Thus, the only contention points are for the crossbar output ports (at the time of arbitration) and in the VC multiplexer. In such a router, the rate-based algorithm is implemented at the VC multiplexer (case (c)).

In order to select between the FGVC and FGFQ schemes for the rest of the design, we conducted a performance analysis. We simulated both schemes in a router and injected media traffic and best-effort traffic. We measured the inter-frame delivery time and the standard deviation of the delivery time for the media streams. The results with different input loads to the router are quite similar in both cases, as depicted in Table 3.2. However, since the implementation of FGFQ is more complex because of the need to maintain the round robin number (R), we use FGVC in the rest of our design. In order to avoid traffic burstiness, we regulate the traffic injection as described in the workload generation section. Figure 3.3 shows the final architecture of the MediaWorm router with a multiplexed crossbar and the FGVC scheduling algorithm.

Table 3.2. Comparison of Fine-Grained VirtualClock with Fine-Grained Fair Queueing algorithms when the ratio of real-time to best-effort traffic is 80:20. Inter-frame time is the averaged time difference of frames measured at the destinations, and SD is the standard deviation of the inter-frame time. (Columns: input load; inter-frame time (msec)/SD under FGVC; inter-frame time (msec)/SD under FGFQ. At 60% load the measured inter-frame times are 33.12 msec and 32.74 msec, respectively.)

Figure 3.3. MediaWorm Router Architecture (an n × n multiplexed crossbar switch core with C VCs per physical channel, an FGVC scheduler at each crossbar input, and routing decision, arbitration, and crossbar control units across pipeline Stages 1–5; middle/tail flits bypass the routing decision and arbitration stages)

Interconnection Topologies: Fat Networks

Cluster interconnects are typically built with high-degree switches. Myrinet [5] has 8- and 16-port routers, while ServerNet-II [24] routers have 12 ports. These ports may be used to connect to other switches as well as to endpoints. The endpoints may be compute nodes, such as clients and servers, or I/O devices. The difference between such cluster networks and typical multiprocessor interconnects is that, while multiple endpoints per switch may be common in the former, the latter typically has only one endpoint per switch. Depending on the expected traffic pattern, it is likely that multiple endpoints place a higher inter-switch bandwidth requirement on cluster interconnects.

For this reason, "fat" topologies have been proposed for clusters. Examples of fat topologies include the fat-tree and the fat-mesh [18]. Other cluster interconnects, such as the tetrahedral topologies proposed by Horst [28], can also use "fat" links. Routers such as the ServerNet-II [24] include hardware support for using multiple physical links connecting a pair of switches indistinguishably, through the notion of "fat pipes".

Figure 3.4. A 4-switch fat-mesh interconnect. Each switch (S0–S3) is an (8 × 8) switch. Each fat link comprises 2 physical links.

Although most of the studies reported here detail the performance of a single switch, we also experimentally analyze the performance of a fat mesh. The fat mesh used here is a (2 × 2) topology with 8-port crossbar switches. (We have limited our study to a smaller network due to exceedingly high simulation times; one can design a larger router and a larger network using our model.) Two physical links are used to interconnect each pair of switches in the 4-node mesh. Figure 3.4 illustrates the studied interconnect. We use deterministic routing, and a message can use either of the two links to traverse to the next node, based on the current load.
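With two physical links per fat link and deterministic routing fixing the next switch, the per-message link choice "based on the current load" can be as simple as the following sketch; using the number of queued flits at each output as the load metric is an assumption of this illustration.

```python
def pick_physical_link(queued_flits_link0: int, queued_flits_link1: int) -> int:
    """Deterministic routing already chose the next switch; this only picks which
    of the two physical links of the fat link to use, preferring the lighter one."""
    return 0 if queued_flits_link0 <= queued_flits_link1 else 1

print(pick_physical_link(12, 7))  # -> 1, the less loaded of the two links
```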

Pipelined Circuit Switching

Pipelined Circuit Switching (PCS) [25] is a variant of wormhole switching in which a message is similarly segmented into flits. However, unlike wormhole routing, in which the middle and tail flits immediately follow the header as it progresses towards its destination, in PCS the flits of a message wait until the header (or probe) reserves the complete path up to the destination. Once such a path/connection has been established, an acknowledgment is sent from the destination to the source. The rest of the flits then move along this path in a pipelined manner (similar to wormhole switching). During path establishment, if the header cannot progress towards the destination, it can backtrack and try alternative paths if adaptive routing is used. If no path can be established, or if adaptive routing is not permitted, a negative acknowledgment is sent back and the attempted connection is dropped. In this research, we do not assume any adaptive routing capability. PCS as originally proposed in [25] assumed non-minimal and adaptive routing capabilities with backtracking and re-routing, which leads to low connection dropping rates. Due to the requirement of complete path setup before the transmission of flits, PCS may incur a high path setup cost compared to wormhole switching. However, it can potentially provide better bandwidth reservation, which is advantageous for real-time traffic. Our intention is to evaluate these trade-offs by comparing PCS with wormhole switching.

3.2 A Preemptive Router Design

In the previous design, we statically divide the input and output VCs among the traffic classes. Traffic of class c can only use the VCs assigned to it. The VC assignment is done at system configuration time and cannot be changed during execution. Therefore, the non-preemptive model (MediaWorm) is not flexible. A solution to this problem is to develop a preemptive model, where several classes of traffic with different priorities can share the same VC, with the provision that a higher priority message can preempt a lower priority message. The preemptive model can dynamically allocate any VC to any traffic class, and hence it is more suitable for handling fluctuating workloads.

In this section, we propose a new wormhole router architecture with flit-level preemption capability. Based on the MediaWorm architecture in Figure 3.3, we propose two mechanisms that enable higher priority messages to preempt lower priority messages in a pipelined wormhole router: input buffer preemption and flit acceleration.

Preemption in the Input Buffer

The additional hardware required for preemption in any input buffer (VC) is an extra buffer of size (s − 1), where s is the total number of priority levels, and a history stack of the same size. The extra input buffer is used only for diverting higher priority messages when the regular VC is occupied by a lower priority message.

Figure 3.5. Input Buffer Preemption in the Router (a higher priority message m3, diverted into the extra buffer, preempts a lower priority message m1 holding the regular input buffer; the routing information of m1 is saved in the history stack)

If the input buffer is occupied by a higher priority message, a lower priority message is not allowed to use the extra buffer, and it is blocked behind the higher priority message. On the other hand, if the input buffer is used by a lower priority message, a higher priority message is sent to the extra buffer so that it can subsequently preempt the lower priority message in Stage 1 of the router. Similar to [65], the routing information of a lower priority message is stored in a history stack for forwarding it later.

In Stage 1, when the extra buffer holds a header flit from a higher priority message, the input buffer preemption process begins. The router first checks whether the tail flit of the lower priority message has passed through the Stage 1 decoder. If not, a dummy tail flit is created for the preempted message. A dummy tail flit does not carry any payload, but behaves as a regular tail flit to release all the resources held by the lower priority message; otherwise, those resources would remain locked and could not be used by any other message.

For example, in Figure 3.5, when the higher priority message m3 interrupts the delivery of the lower priority message m1, the dummy tail of m1 is generated. Then the routing information of m1 is stored in the history stack, to be used later for making a dummy header for the retransmission of m1. Thus, no dummy header is required in case no dummy tail was sent. During preemption, the remaining flits of m1 and any other lower priority messages are blocked in the input buffer. Next, all the flits of m3 in the extra buffer are sent through the router. After that, once the extra buffer is empty, transmission of the remaining flits of m1 resumes from the regular input buffer.
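A control-level sketch of the preemption logic just described follows; the buffers and history stack are modelled as plain lists, and the method names and structure are illustrative rather than the actual hardware design.

```python
class PreemptiveInputVC:
    """Illustrative Stage 1 control logic for flit-level input buffer preemption."""

    def __init__(self):
        self.regular_buffer = []   # flits of the message currently using this VC
        self.extra_buffer = []     # diverted flits of a higher priority message
        self.history_stack = []    # routing info of preempted messages

    def accept_flit(self, flit, current_priority, incoming_priority):
        """Divert higher priority traffic to the extra buffer; block lower priority."""
        if incoming_priority > current_priority:
            self.extra_buffer.append(flit)     # will preempt in Stage 1
        else:
            self.regular_buffer.append(flit)   # waits behind the current message

    def start_preemption(self, preempted_route, tail_already_decoded):
        """Begin preemption once a header flit sits in the extra buffer."""
        if not tail_already_decoded:
            # Without a tail, downstream resources would stay locked, so send a
            # payload-free dummy tail to release them ...
            self.emit_dummy_tail()
            # ... and remember the route so a dummy header can resume the
            # preempted message after the higher priority flits have drained.
            self.history_stack.append(preempted_route)
        # Flits in the extra buffer are forwarded next; the remaining flits of the
        # preempted message stay blocked in the regular buffer until it empties.

    def emit_dummy_tail(self):
        # Placeholder for sending a dummy tail flit through the pipe.
        pass
```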

3.2.2 A Flit Acceleration Mechanism

Fig. 3.6. Flit Acceleration for Message m1.

When the input buffer preemption starts, there could be remaining flits of m1 between the flit decoder buffer and the input port of the crossbar,¹ as shown in Figure 3.6. In addition, when the header flit of m3 tries to reserve the output VC, the VC may already be occupied by another lower priority message, like m2 in Figure 3.7. In both of these cases, the flits of the lower priority messages (m1, m2) will slow down the processing of m3 until they are pushed to the output buffer. Therefore, we use a flit acceleration mechanism that expedites the delivery of flits of such lower priority messages (like m1 and m2) by assigning a specific low virtual clock value to them. This value guarantees that these messages will be selected first at the next cycle of the scheduler unless there are other preempted messages at other VCs (in which case they are selected in a round-robin fashion). For this purpose, a flag called Accelerate is associated with each input VC. The Accelerate flag remains set until the tail flit of the preempted message (m1) or expedited message (m2) passes through the crossbar. In Figure 3.6, the transmission of the lower priority message m1 is accelerated by setting the flag of the input channel m1 is using, to speed up the processing of the higher priority message m3 from the extra input buffer to the input port of the crossbar. In Figure 3.7, the header flit of m3 arrives at Stage 3, and the destination output VC is used by another lower priority message m2. Again, by setting the flag of the input channel m2 is using, the transmission of m2 is accelerated, and m3 can reserve the destination.

¹ At most there could be 3 such flits: a header flit and a middle flit of m1 at two different stages, and a tail flit of another message at the crossbar input.
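To make the role of the Accelerate flag concrete, the following fragment shows one way a flit-level VC scheduler could fold acceleration into its selection rule: accelerated VCs are served first, round-robin among themselves, and all other ready VCs are ordered by their virtual clock values. It reuses the hypothetical per-VC state from the previous sketch and is not the FGVC scheduler implementation itself.

#include <vector>
#include <limits>
#include <cstddef>

// Minimal per-VC state assumed for this sketch (hypothetical names).
struct VCState {
    bool   has_flit      = false;   // a flit is waiting at the crossbar input
    bool   accelerate    = false;   // set while a preempted/expedited message drains
    double virtual_clock = 0.0;     // per-VC virtual clock maintained by the scheduler
};

// Pick the input VC whose flit crosses the crossbar this cycle.
// Accelerated VCs win first, in round-robin order starting after 'last_pick';
// otherwise the VC with the smallest virtual clock is chosen.
int pick_vc(const std::vector<VCState>& vcs, std::size_t last_pick) {
    const std::size_t n = vcs.size();

    // 1. Round-robin among accelerated VCs that have a flit ready.
    for (std::size_t i = 1; i <= n; ++i) {
        std::size_t v = (last_pick + i) % n;
        if (vcs[v].has_flit && vcs[v].accelerate) return static_cast<int>(v);
    }

    // 2. Otherwise, rate-based selection: the smallest virtual clock goes first.
    int best = -1;
    double best_clock = std::numeric_limits<double>::infinity();
    for (std::size_t v = 0; v < n; ++v) {
        if (vcs[v].has_flit && vcs[v].virtual_clock < best_clock) {
            best_clock = vcs[v].virtual_clock;
            best = static_cast<int>(v);
        }
    }
    return best;   // -1 means no flit is ready
}

Assigning accelerated flits an artificially low virtual clock value, as described in the text, is equivalent to this two-step rule as long as the reserved value is smaller than any value a regular stream can hold.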

The other option for handling blocking at other stages is to use the preemption mechanism there as well. The acceleration mechanism, however, is much simpler and easier to control than providing a separate preemption path at such stages.

Fig. 3.7. Flit Acceleration for Message m2.

3.3 Experimental Platform

3.3.1 Simulation Testbed

The above architectural concepts have been extensively evaluated through simulation. We have developed the MediaWorm router (MR), the preemptive router (P), a traditional router with FIFO (TR), and a PCS router (PCS) using CSIM.

The simulation models are quite flexible in the sense that one can specify the number of physical channels (PCs), the number of VCs per PC, the link bandwidth, the CBR/VBR rates and the variation of the VBR rate, the flit size, the message size (number of flits), and the ratio of real-time traffic (VBR and CBR) to best-effort traffic. In addition, using these routers, one can configure any network topology. We have developed detailed flit-level simulators in which each stage of the router pipeline is modeled, together with several simultaneous streams established from each node in the system. Typically, we gather simulation results over a few million messages. As a result, these simulations are extremely resource intensive, both in terms of simulation time and memory requirements. Two factors that determine simulation resources are the crossbar size and the physical channel bandwidth. Consequently, even though current technologies permit large crossbar sizes and over 1.28 Gbps link bandwidths, many of our simulations use smaller values for these parameters, without loss of generality, to keep them tractable. We have also conducted some experiments varying these parameters, and the overall trends/results still apply. The input and output buffers in the router are each one message long. We have also tested with larger buffers; the results show little improvement.

The output parameters analyzed here are the mean frame delivery interval (d̄) for CBR/VBR messages, the standard deviation of frame delivery intervals (σ_d) for CBR/VBR messages, the deadline missing probability of delivered MPEG-2 frames, the average deadline missing time of deadline-missing frames, and the average latency for best-effort traffic. The delivery interval is measured as the difference between the delivery times of two successive frames at a destination.

A d̄ of 33 msec indicates a frame rate of 30 frames/sec at MPEG rates. Coupled with a σ_d of 0, this implies jitter-free delivery. A higher d̄ and/or σ_d implies jitter in transmission. The deadline missing probability is the ratio of the number of frames that missed their deadlines to the total number of delivered frames. The deadline for each frame is determined by adding 33.3 msec to the previous deadline, since the frame rate is 30 frames/sec for MPEG-2 video streams. However, if a previous frame missed its deadline, the new deadline is set by adding 33.3 msec to the arrival time of the previous frame. Whenever a frame misses its deadline, we measure the deadline missing time and then calculate the average deadline missing time.
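These delivery-interval and deadline statistics can be collected on line as frames arrive. The following is a small sketch of such a collector, written against the rules stated above (33.3 msec frame period, deadlines reset from the arrival time of a late frame); the class and member names are illustrative, and the seeding of the very first deadline from the first delivery is our assumption, not a rule taken from the thesis simulator.

#include <cmath>
#include <cstdio>

// Per-stream collector for delivery-interval and deadline statistics,
// following the measurement rules described in the text.
class FrameStats {
    static constexpr double kPeriodMs = 33.3;   // 30 frames/sec MPEG-2 period
    double prev_delivery_ = -1.0;               // delivery time of the previous frame (ms)
    double deadline_      = -1.0;               // deadline of the next expected frame (ms)
    long   frames_ = 0, missed_ = 0;
    double sum_ = 0.0, sum_sq_ = 0.0;           // for mean/std of delivery intervals
    double miss_time_sum_ = 0.0;                // accumulated lateness of missed frames

public:
    void on_frame_delivered(double t_ms) {
        if (prev_delivery_ >= 0.0) {            // a delivery interval needs two frames
            double interval = t_ms - prev_delivery_;
            sum_ += interval;
            sum_sq_ += interval * interval;
        }
        if (deadline_ < 0.0) {
            deadline_ = t_ms + kPeriodMs;       // first frame seeds the deadline chain (assumption)
        } else if (t_ms <= deadline_) {
            deadline_ += kPeriodMs;             // met: the next deadline is 33.3 ms later
        } else {
            ++missed_;
            miss_time_sum_ += t_ms - deadline_; // deadline missing time
            deadline_ = t_ms + kPeriodMs;       // reset from the arrival of the late frame
        }
        prev_delivery_ = t_ms;
        ++frames_;
    }

    void report() const {
        long intervals = frames_ > 1 ? frames_ - 1 : 0;
        double mean = intervals ? sum_ / intervals : 0.0;
        double var  = intervals ? sum_sq_ / intervals - mean * mean : 0.0;
        std::printf("mean interval %.2f ms, std %.2f ms, miss prob %.3f, avg miss time %.2f ms\n",
                    mean, std::sqrt(var > 0 ? var : 0.0),
                    frames_ ? double(missed_) / frames_ : 0.0,
                    missed_ ? miss_time_sum_ / missed_ : 0.0);
    }
};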

3.3.2 Workload

Two kinds of VBR traffic are simulated in the experiments. The first is synthetic video streams with an average bandwidth of 4 Mbps, and the other is realistic MPEG-2 video streams. The synthetic traffic consists of streams of messages from video frames, whose size is selected from a normal distribution with a mean of 16,666 bytes and a standard deviation of 3333 bytes. (This corresponds to 4 Mbps MPEG-2 streams.) The realistic traffic is generated from the MPEG-2 traces [7] shown in Table 3.3, where there are 7 video traces with different bandwidth requirements. Each stream generates 30 frames/sec, and each frame (I, P, or B frame) is fragmented into 20/40-flit messages (except possibly the last message of a frame), with each message carrying the bandwidth requirements (Vtick information for the FGVC algorithm) and the routing information in its header flit. As a result, the network treats each message of a stream independently of the others.

Table 3.3. MPEG-2 Video Sequence Statistics (for each of the 7 video sequences: average bandwidth requirement (kb/s) and average I, P, and B frame sizes (kbits)).

The injection rate for the messages of a stream is determined by the message size and the number of messages constituting a frame. Once the injection rate is determined, an input regulator injects the messages of a frame evenly, with an interval of (33 msec / number of messages). For instance, with 200 messages in a frame, the interval between successive message injections is 165 microseconds. Such an input regulator provides two advantages. First, in addition to avoiding traffic burstiness, the input regulator allows messages from different streams to be intermixed in the queues. Without this ability, the streams would be queued only at frame-level granularity, thereby increasing the delay of certain streams. Second, the input regulator also helps the transmission of best-effort traffic in between video frame messages.
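The synthetic VBR source and input regulator are straightforward to express in code. The sketch below draws a frame size from the stated normal distribution (mean 16,666 bytes, standard deviation 3333 bytes), fragments the frame into fixed-size messages, and spaces the injections evenly across the 33 msec frame period. The function name and the 80-byte example message (roughly a 20-flit message with 32-bit flits) are illustrative choices, not parameters taken verbatim from the simulator.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// One injected message: its stream-relative injection time and size in bytes.
struct Injection {
    double time_ms;
    int    bytes;
};

// Generate the injections for a single synthetic VBR frame starting at 'frame_start_ms'.
// Frame size ~ N(16666, 3333) bytes; the frame is split into fixed-size messages and
// the input regulator spreads them evenly over the 33 ms frame period.
std::vector<Injection> regulate_frame(std::mt19937& rng, double frame_start_ms,
                                      int message_bytes) {
    std::normal_distribution<double> frame_size(16666.0, 3333.0);
    int bytes = std::max(message_bytes, static_cast<int>(frame_size(rng)));
    int num_messages = (bytes + message_bytes - 1) / message_bytes;   // last message may be short

    const double period_ms = 33.0;
    double gap_ms = period_ms / num_messages;     // e.g., 200 messages -> 165 microseconds

    std::vector<Injection> out;
    for (int i = 0; i < num_messages; ++i) {
        int sz = std::min(message_bytes, bytes - i * message_bytes);
        out.push_back({frame_start_ms + i * gap_ms, sz});
    }
    return out;
}

int main() {
    std::mt19937 rng(42);
    // Assume 80-byte messages (e.g., a 20-flit message with 32-bit flits).
    auto frame = regulate_frame(rng, 0.0, 80);
    std::printf("%zu messages, inter-injection gap %.1f microseconds\n",
                frame.size(), (33.0 / frame.size()) * 1000.0);
}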

In the case of PCS, each stream is transmitted over a distinct connection (as it is connection-oriented). The first flit of the stream establishes the circuit between the source and destination endpoints, simultaneously informing the intermediate switches of its bandwidth requirement (the required Vtick for the entire stream). The frames of the stream are logically grouped into flits, with each group injected into the established circuit at a specified rate (similar to how messages are generated in the wormhole switching case). In PCS, each connection (and hence each stream) also needs a distinct VC. Therefore, the number of VCs supported by the hardware has to be greater than or equal to the maximum number of concurrent streams in the workload. In the MediaWorm, each message carries routing and bandwidth information. As a result, it is possible to support multiple connections on a single VC. This makes sense only when the bandwidth available to a VC is at least as large as the sum of the bandwidths of the streams assigned to that VC. This is, however, not a problem because each message carries its Vtick requirement.

It should be noted that stream establishment never actually fails in wormhole switching. In PCS, on the other hand, a connection establishment probe may not succeed. This is termed dropping of a connection. It is assumed that connections may be dropped only at stream set-up. Once the input VC for a connection is determined, the destination is picked randomly using a uniform distribution over all nodes, and the destination VC is also drawn randomly from a uniform distribution over the VCs available for VBR traffic.

The generation of the CBR traffic is identical to that of the synthetic VBR traffic, with the exception that the frame size is kept constant (at 16,666 bytes). The best-effort traffic is generated with a given injection rate, λ, that is allocated to this class of traffic (explained in the next subsection), and follows a Poisson distribution.

The message length is kept constant at 20/40 flits, matching the message length of the real-time traffic, and its destination is picked from a uniform distribution over the nodes in the system. The input and output VCs for a message are picked from a uniform distribution over the VCs available to this traffic class.

An important parameter that is varied in our experiments is the input load, expressed as a fraction of the physical link bandwidth. For a specified load, we consider different mixes (x : y, where x/(x + y) is the fraction of the load for the VBR/CBR component and y/(x + y) is the fraction of the load for the best-effort component) to generate mixed traffic. We divide the VCs into two disjoint groups: x/(x + y) of the VCs are reserved for the VBR/CBR traffic, and the remaining VCs are allocated to the best-effort traffic. As mentioned earlier, the number of simultaneous VBR/CBR streams possible from/to a node is limited by the number of VCs in the case of PCS. In the MediaWorm, it is limited by the number of VCs and the bandwidth allocated to a VC. For instance, if a physical channel can support 400 Mbps and the total number of VCs is 16, then we can support at most 6 connections per VC with synthetic VBR streams. If x = y = 1, then the number of VCs dedicated to VBR/CBR traffic is 8, and there can be at most 6 × 8 = 48 outstanding/incoming streams at each node in the system.
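The stream-capacity arithmetic in the preceding paragraph can be captured in a few lines. The helper below reproduces the 400 Mbps / 16 VC / 4 Mbps example (6 streams per VC, 48 streams per node for an even x:y split); the function name and the assumption that the channel bandwidth is divided evenly across VCs are ours, introduced only to make the calculation explicit.

#include <cstdio>

// Maximum simultaneous VBR/CBR streams a node can source in the MediaWorm setup,
// assuming the physical-channel bandwidth is shared evenly among all VCs and a
// fraction x/(x+y) of the VCs is reserved for real-time traffic.
int max_streams_per_node(double pc_bandwidth_mbps, int total_vcs,
                         double stream_mbps, int x, int y) {
    double per_vc_bandwidth = pc_bandwidth_mbps / total_vcs;               // e.g., 400/16 = 25 Mbps
    int streams_per_vc = static_cast<int>(per_vc_bandwidth / stream_mbps); // floor: 25/4 -> 6
    int realtime_vcs = total_vcs * x / (x + y);                            // e.g., 16 * 1/2 = 8
    return streams_per_vc * realtime_vcs;                                  // 6 * 8 = 48
}

int main() {
    std::printf("%d streams per node\n", max_streams_per_node(400.0, 16, 4.0, 1, 1));
}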

3.4 Performance Results

In this section, we first analyze the performance results for an 8-port MediaWorm router with varying parameters, as well as those of a (2 × 2) fat mesh. Then we compare the MediaWorm and preemptive router designs. The router parameters used in this performance study are given in Table 3.4.

Table 3.4. Simulation Parameters
  Switch Size: 8 × 8
  Flit Size: 32 / 128 bits
  Message Size: 20 / 40 flits
  Flit Buffers: 20 / 40 flits
  PC Bandwidth: 400 Mbps / 1.6 Gbps
  VCs/PC: variable (wormhole), 24 (PCS)
  Streams/VC: variable (wormhole), 1 (PCS)

3.4.1 Comparison of MediaWorm and Traditional Routers

We begin by examining how a traditional router (TR) and the MediaWorm router (MR) perform with multimedia/mixed traffic. Note that the main difference between the two routers is the scheduling algorithm: the TR uses a FIFO scheduler, whereas the proposed MR uses the FGVC algorithm. Figure 3.8 shows the mean delivery interval (d̄) and its standard deviation (σ_d) for the two routers with a mixture of synthetic VBR and best-effort traffic (80:20). We can see that for the TR, both d̄ and σ_d start growing beyond a load of 0.8, showing that there would be significant jitter in the delivery of VBR traffic beyond this point. In contrast, the MR can provide jitter-free delivery even up to a link load of 0.96 (the load of the real-time component is around 0.75). This clearly shows the need for a rate-based scheduling algorithm to effectively administer the available bandwidth for media streams.
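The rate-based scheduling that separates the MR from the TR follows the VirtualClock idea: each stream advances a per-stream clock by its Vtick (a per-stream increment derived from its reserved rate) on every injection, and the scheduler always serves the smallest clock value. The fragment below is a generic VirtualClock tagger along those lines, not the FGVC implementation from the thesis; the names, parameter values, and tie-breaking are illustrative.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Generic VirtualClock tagging and service order (one level of the idea behind FGVC).
struct Stream {
    double vtick;          // time per unit of traffic at the reserved rate
    double virtual_clock;  // advances by vtick on every tagged unit
};

struct Tagged {
    double stamp;          // virtual clock value assigned to this flit/message
    int    stream_id;
    bool operator>(const Tagged& o) const { return stamp > o.stamp; }
};

// Tag a newly arrived unit of stream s at real time 'now' and return its stamp.
double tag(Stream& s, double now) {
    s.virtual_clock = std::max(s.virtual_clock, now) + s.vtick;
    return s.virtual_clock;
}

int main() {
    std::vector<Stream> streams = {{0.25, 0.0}, {1.0, 0.0}};   // video-like vs. best-effort-like
    std::priority_queue<Tagged, std::vector<Tagged>, std::greater<Tagged>> ready;

    for (double now = 0.0; now < 3.0; now += 1.0)       // both streams inject at t = 0, 1, 2
        for (int id = 0; id < 2; ++id)
            ready.push({tag(streams[id], now), id});

    while (!ready.empty()) {                            // drain in increasing stamp order
        std::printf("serve stream %d (stamp %.2f)\n", ready.top().stream_id, ready.top().stamp);
        ready.pop();
    }
}

With a small Vtick, the video-like stream's stamps stay ahead of the best-effort stream's, so its units are consistently served first; this is the behavior the MR relies on for jitter-free delivery of media streams.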

3.4.2 Comparison of CBR and VBR Traffic Results

Figure 3.9 depicts the d̄ and σ_d results with only CBR and only synthetic VBR traffic (there is no best-effort traffic). It can be gleaned that both exhibit nearly identical performance, with the CBR traffic experiencing jitter-free performance up to a slightly higher load. Although both CBR and VBR streams have the same mean bandwidth requirement, CBR streams, by their nature, are intuitively expected to experience better jitter tolerance. Since VBR streams present the more challenging workload, we focus on VBR streams in the rest of the studies in this thesis.

3.4.3 Results with Mixed Traffic

Next, we vary the ratio of real-time (only synthetic VBR) and best-effort traffic for different input loads, and study the effect on jitter for VBR and on average latency for the best-effort traffic. Figure 3.10 shows the variation of d̄ and σ_d for these workloads. It can be observed that up to an input load of 0.80, there is no jitter for VBR traffic regardless of the mix between these two traffic classes. Beyond a load of 0.80, the jitter becomes significant only when the real-time traffic becomes the dominant component. The effect of VBR traffic on the average latency of best-effort traffic (in microseconds) is given in Table 3.5. For a given mix, the latency degrades with an increase in the load. The presence of real-time traffic also increases the latency of the best-effort traffic at a given load. This is a consequence of the higher priority given by the FGVC algorithm to the real-time traffic.

Table 3.5. Average Latency for Best-effort Traffic (8 × 8 switch, 16 VCs, 400 Mbps links; entries in microseconds for different input loads and traffic mixes x:y, with "Sat." denoting saturation).

3.4.4 Impact of VCs and Crossbar Capabilities

It should be noted that our workload generates multiple connections on each available VC. An important design consideration is to determine whether one should support more VCs with fewer connections per VC, or vice versa. Intuitively, it may appear that a larger number of VCs would improve performance. The performance results in Figure 3.11 confirm this intuition, where the 16 VC case gives jitter-free performance up to a higher load compared to the 4 and 8 VC cases. However, supporting a large number of VCs may require a large amount of resources in the router. A lower number of VCs, on the other hand, allows the use of a full crossbar (instead of a multiplexed one). This is examined for the 4 VC case (i.e., a 32 × 32 crossbar), which shows better performance than 8 VCs with the multiplexed crossbar, and competitive performance compared to the 16 VC results.

3.4.5 Effect of Message Size on Jitter

Our next experiment examines the impact of message size on synthetic VBR traffic. We vary the message size for two different input loads (0.64 and 0.8) that are representative of the behavior observed earlier, and examine the changes in d̄ and σ_d.

The results in Figure 3.12 show that, except for very small message sizes, there is little impact on QoS for real-time traffic. For very small sizes, the effect of the header flit overhead becomes noticeable. For instance, 1 header flit in a message of 20 flits consumes 5% of the stream bandwidth. These results show that we do not really need large messages for media traffic. In fact, smaller sizes may help the latency of best-effort traffic.

3.4.6 Comparison of MediaWorm and PCS Routers

PCS is expected to provide good performance for VBR traffic. This is because it is a connection-oriented switching paradigm and can therefore reserve bandwidth at the time of connection establishment. However, it requires a VC per stream, thereby mandating a large number of VCs per PC at high link bandwidths. In this experiment, we compare the performance of the MediaWorm router to that of the PCS router (PCS). Note that this is the only experiment that we perform for a link bandwidth of 100 Mbps (24-25 VBR streams can be supported per link, each with a 4 Mbps bandwidth requirement). This is primarily because of the simulation complexity of supporting the large number of VCs (up to 100 VCs) that would be required for a 400 Mbps link bandwidth in the PCS router. As can be expected, the MediaWorm router can support jitter-free performance only up to a load of about 0.7, compared to over 0.8 in the case of PCS. This is, however, not a fair comparison because all streams started on a wormhole router are accepted, whereas the PCS router drops many connections that contend for busy resources.

Table 3.6. Number of attempted, established and dropped connections for reaching a given input load in a PCS router (columns: input load, number of connection attempts, number established, number dropped). The values presented are for an (8 × 8) router with 24 VCs and 100 Mbps links.

For the same operating load, this in effect unfairly improves the crossbar utilization for accepted connections in the PCS router compared to that in the MediaWorm router. While the PCS router provides superior performance, this comes at the cost of high resource requirements (a large number of VCs) as well as a very high number of dropped connections. The numbers of accepted and dropped connections for various input loads in the PCS router are shown in Table 3.6. These results show that for most realistic operating conditions (an input load of 0.7 is reasonably high), the MediaWorm router can deliver performance as good (jitter-free) as a PCS router for real-time traffic, while not turning down connection establishment requests as the PCS router does. (The connection drop rate can be minimized by using several alternatives, as proposed in [25].) Moreover, by increasing the number of VCs in the MediaWorm router to match the PCS implementation, its performance could approach that of the PCS router at higher loads.

3.4.7 Results with MPEG-2 Video Traces

Here we examine the performance of a traditional router and the MediaWorm router with the realistic MPEG-2 video traffic shown in Table 3.3. Figure 3.14 shows the mean delivery interval (d̄) and its standard deviation (σ_d) for each router model. Some of the data points of the TR were dropped due to saturation. The results with realistic VBR are almost identical to those with synthetic VBR in Figure 3.8, although d̄ of the TR in Figure 3.14 is slightly better at 90% load. Next, we vary the ratio of real-time (MPEG-2 video) and best-effort traffic for different input loads, and study the effect on jitter for VBR. Figure 3.15 shows the variation of d̄ and σ_d for these workloads. Again, we observe results similar to those in Figure 3.10. Although the video traces in Table 3.3 have much wider bandwidth variation, the overall results with synthetic and actual traces are very similar.

3.4.8 Fat-Mesh Results

Up to this point, we have focused on the performance of a single router with CBR/VBR and best-effort traffic. In this subsection, we examine the performance implications of using such routers in a fat-mesh interconnect. In general, it can be expected that an interconnect with multiple routers will have lower performance than a single router, due to the additional points of resource contention in the network. We limit this study to a modest 4-node network (shown in Figure 3.4) due to limited simulation resources. Figures 3.16 (a) and (b) show the change in mean delivery interval and the corresponding standard deviation for synthetic VBR traffic. This is studied with both increasing load and increasing proportion of VBR traffic.

The results indicate that VBR performance remains good for smaller proportions of VBR traffic (40% and 60%), even for a total input load of 0.9 of the PC bandwidth capacity. Only at a load of 0.9 with 80% of the traffic being VBR does VBR performance degrade. This good performance for VBR comes at the expense of best-effort traffic, as shown in Figure 3.16 (c). As expected, for any given load, the average latency of best-effort traffic increases with an increasing proportion of VBR traffic.

It is also illustrative to compare the performance of a (2 × 2) fat mesh to that of a single switch. As expected, the maximum input load (for a given proportion of VBR to best-effort traffic) that provides jitter-free performance for VBR traffic is lower in the fat mesh than in the case of a single switch. This can be inferred by comparing Figures 3.10 (a) and (b) with Figures 3.16 (a) and (b). For example, with a load of 0.9 and a traffic mix of 80:20, we can observe that a single switch is able to provide jitter-free performance, while the fat mesh cannot. Admission control criteria thus have to consider (for an expected traffic pattern) the maximum load and proportion of VBR to best-effort traffic that will provide statistically acceptable QoS to VBR traffic as well as acceptable latency for best-effort traffic. This load would then determine the number of VBR streams that may be accepted for service.

3.4.9 Comparisons of the Three Router Models

In this subsection, we examine the performance of a traditional router, a non-preemptive router, and a preemptive router under dynamic workloads. Figure 3.17 shows the deadline missing probability and the average deadline missing time in a single router for each model.

Some of the data points of the traditional router were dropped due to saturation. It is seen that the preemptive router can service real-time traffic with an almost constant deadline missing probability, while for the non-preemptive router (MediaWorm), the number of frames missing their deadlines increases as the ratio of real-time traffic increases. The deadline missing time in Figure 3.17 (b) is the lowest for the preemptive router. The traditional router, without a rate-based scheduler, experiences saturation even under light load and is the worst performer. Since the preemptive router can assign VCs dynamically according to the real-time traffic load, it provides the best performance among the three architectures.

Another important performance parameter is the end-to-end latency, which we can measure for each traffic type. Figure 3.18 (a) shows the control traffic latency in each router. Here, queueing time represents the time spent outside the router before a message is injected into the router. In the traditional router, control traffic is treated like any other type of traffic, and hence its latency is much higher than in the other two routers. The preemptive router provides the best performance, with almost zero queueing time, followed by the non-preemptive router. Figure 3.18 (b) compares the best-effort traffic latency in the three routers. The non-preemptive (MediaWorm) and preemptive routers provide better service for real-time traffic at the expense of best-effort traffic. Therefore, as expected, the traditional router provides the best performance for best-effort traffic.

Next, we examined the impact of block-level multiplexing in a single preemptive router. Figure 3.19 shows the results for block sizes of 1, 5, and 10 flits, respectively.

As the block size increases, the performance degrades significantly. Thus, flit-level multiplexing seems to be the ideal choice for QoS assurance. However, in an actual implementation, we may have to use block-level multiplexing to amortize the scheduling overhead.

In order to estimate the contribution of the acceleration scheme explained in Section 3.2 to the preemptive router, we tested the router with and without the acceleration scheme. Figure 3.20 demonstrates the role of the acceleration scheme in the preemptive router. The results indicate that by accelerating the flits of the lower priority messages, the performance of both real-time and best-effort traffic improves considerably.

3.4.10 A (2 × 2) Mesh Network Results

In this section, we examine the performance implications of using the preemptive and non-preemptive routers in a (2 × 2) mesh network. We use a mixture of real-time and best-effort traffic. Figure 3.21 (a) shows the deadline missing probability for real-time traffic, and Figure 3.21 (b) depicts the average network latency for best-effort traffic. As in the single router results, the preemptive model again exhibits better performance than the non-preemptive model. The deadline missing probability increases with an increase in the real-time load. Also, as expected, the average network latency of the best-effort traffic in Figure 3.21 (b) gradually increases with the real-time traffic.

3.5 Concluding Remarks

Widespread use of cluster systems in diverse application environments is placing varied communication demands on their interconnects. Commercial routers for these environments currently support wormhole switching. Although wormhole routers can provide small latencies and good performance for best-effort traffic, they are unable to provide QoS guarantees for soft real-time applications such as streaming media. Our study is motivated by the need to simultaneously handle multiple such traffic types, which are becoming important and prevalent in clustered environments. We also feel that it is imperative to leverage existing, mature, commodity technology, i.e., wormhole switching, to provide a cost-effective solution rather than using the relatively new or hybrid switching alternatives proposed by other researchers. We have proposed a new router architecture, called MediaWorm, with only one major modification compared to "vanilla" wormhole routers: it incorporates a rate-proportional resource scheduler, called FGVC, instead of the common rate-agnostic schedulers such as FIFO or round-robin. We have studied the capabilities of the MediaWorm in supporting real-time and best-effort traffic. The main conclusions of our study are the following:

• We confirm that the FGVC scheduler can provide considerably improved performance for traffic that requires soft real-time guarantees (VBR/CBR).

• The MediaWorm router design shows that there is no adverse effect on the performance of VBR traffic in the presence of best-effort traffic. However, as the share of VBR traffic increases for a given load, this adversely affects the latency of best-effort traffic. A wormhole router can provide jitter-free delivery to VBR/CBR traffic up to a load of 70-80% of the physical channel bandwidth.

• Although the performance of a PCS router is slightly better than that of the MediaWorm, PCS routers are more complex than wormhole routers and may drop a large number of connections.

• We find that the performance of a small fat-mesh network is comparable to that of a single switch. Although it is difficult to extrapolate performance to much larger clusters directly from our present results, we expect that clusters designed with an appropriate bandwidth balance among the various links, through the use of fat topologies and MediaWorm-like switches, should be able to provide good performance for both real-time and best-effort traffic.

• A preemptive router model seems more appropriate for handling dynamic workloads. However, preemption in a pipelined model is more complex than in a non-preemptive model (MediaWorm), since a lower priority message can block a higher priority message at more than one stage of the pipeline. Instead of providing preemption at several stages, preemption in the input buffers followed by an acceleration mechanism at the other stages seems a viable design.

In summary, our study suggests that by augmenting a conventional wormhole router with a rate-based resource scheduling technique, one can provide a viable, cost-effective switch for cluster interconnects that supports both real-time and best-effort traffic mixes. The MediaWorm router supports this claim. It is also possible to design more sophisticated routers by incorporating a message preemption scheme.

In the next chapter, we discuss the design of a network interface card (NIC) that can be used in conjunction with the proposed router models to provide end-to-end QoS guarantees.

Fig. 3.8. Comparison of MR and TR ((8 × 8) switch, 16 VCs, 400 Mbps links, x : y = 80:20): mean delivery interval (msec) and standard deviation of the delivery interval versus input link load.

Fig. 3.9. Comparison of CBR and synthetic VBR traffic in the MediaWorm router ((8 × 8) switch, 16 VCs, 400 Mbps links, all real-time traffic): mean delivery interval (msec) and standard deviation of the delivery interval versus input link load.

Fig. 3.10. Mixed traffic (synthetic VBR + best-effort traffic) ((8 × 8) switch, 16 VCs, 400 Mbps links): mean delivery interval and its standard deviation (msec) versus the proportion of real-time to best-effort traffic (x:y), for several input loads.

Fig. 3.11. Impact of VCs and crossbar capabilities ((8 × 8) switch, 400 Mbps links, x : y = 100:0): mean delivery interval and its standard deviation versus input link load for 16, 8, and 4 virtual channels with a multiplexed crossbar and 4 virtual channels with a full crossbar.

Fig. 3.12. Effect of message size on jitter ((8 × 8) switch, 400 Mbps link bandwidth, 16 VCs, all synthetic VBR traffic): mean delivery interval and its standard deviation (msec) versus message size (flits), for input loads of 0.64 and 0.80.

Fig. 3.13. MR and PCS comparison ((8 × 8) switch, 100 Mbps link bandwidth, 24 VCs): mean delivery time and its standard deviation (msec) versus input link load for the wormhole and PCS routers.

Fig. 3.14. TR vs. MR with MPEG-2 video traffic ((8 × 8) switch, 16 VCs, 1.6 Gbps, x : y = 80:20): (a) mean delivery interval and (b) standard deviation of the inter-delivery time versus input link load.

Fig. 3.15. Mixed traffic (MPEG-2 video traces + best-effort traffic, (8 × 8) switch, 16 VCs, 1.6 Gbps): (a) mean delivery interval and (b) its standard deviation versus the proportion of real-time to best-effort traffic (x:y), for several input loads.

Fig. 3.16. Performance of a (2 × 2) fat mesh ((8 × 8) switches, 400 Mbps link bandwidth): (a) mean delivery interval, (b) standard deviation of the delivery interval, and (c) average latency of best-effort traffic (microseconds) versus the proportion of real-time to best-effort traffic (x:y), for input loads of 0.7, 0.8, and 0.9.

Fig. 3.17. Deadline missing probability and deadline missing time in a single router under dynamic load variation: (a) deadline missing probability and (b) deadline missing time (microseconds) versus the proportion of real-time to best-effort traffic (x:y) for the TR, MR, and P routers. The input load (0.80 or 0.85) is specified in the graphs.

Fig. 3.18. Components of message latency (queueing time and network latency, in microseconds) of (a) control traffic and (b) best-effort traffic in a single router under dynamic load variation, for the TR, MR, and P routers at traffic mixes of 20:80, 30:70, and 50:50. The input load is 0.80.

Fig. 3.19. Effect of block-level multiplexing in a single preemptive router under dynamic load variation: deadline missing probability and best-effort traffic latency (microseconds) versus the proportion of real-time to best-effort traffic (x:y), for block sizes of 1, 5, and 10 flits at loads of 0.80 and 0.85.

Fig. 3.20. Comparison of preemption+acceleration and preemption only in a single router under dynamic load variation: deadline missing probability and best-effort traffic latency (microseconds) versus the proportion of real-time to best-effort traffic (x:y), at loads of 0.80 and 0.85. (Some results for the best-effort traffic at high load are not included due to saturation.)

Fig. 3.21. Deadline missing probability and average latency of best-effort traffic in a (2 × 2) mesh network under dynamic load variation: (a) deadline missing probability and (b) best-effort traffic latency (microseconds) versus the proportion of real-time to best-effort traffic (x:y), for the NP and P routers at loads of 0.70 and 0.80.

Chapter 4

A QoS Capable Network Interface Card Design

In this chapter, we present a network interface card (NIC) design for end-to-end QoS support in clusters. Our study is based on the Virtual Interface Architecture (VIA) framework, which has become a standard for designing user-level communication on NICs. First, we show how QoS provisioning can be provided in the context of VIA. Then, we evaluate a complete cluster interconnect consisting of QoS-capable routers and NICs.

4.1 Virtual Interface Architecture

The network interface (NI) plays a crucial role in overall communication performance, since it is responsible for initiating and responding to communications, for handling data movement, and for providing application isolation. Since improving the performance of the router/interconnect alone will shift the communication bottleneck to the NI, the design of faster NIs has recently become a major research thrust. Consequently, a few user-level messaging layers such as Active Messages [74], U-Net [72] and FM [50] have been proposed to minimize operating system involvement in communication. As a consequence of this concerted effort, a generic communication layer, called the Virtual Interface Architecture (VIA), was introduced as a standard communication paradigm for System Area Networks (SANs) or clusters [6, 11].

The design focus of the VIA is to provide an efficient communication protocol between a user process and the network interface (NI). VIA is a connection-oriented paradigm built around Virtual Interfaces (VIs). A VI is the mechanism by which applications talk to the NIC hardware, and it establishes a connection between two processes. A VI consists of two queues: a send queue and a receive queue. Applications post requests in the form of descriptors in one of the queues. To send a message, an application posts a descriptor in the send queue and informs the NI of the pending request by ringing a send doorbell, which is a memory-mapped region on the NI. On receiving the doorbell, the NI transfers the descriptor and the data from user memory to the NI buffers using two DMAs. The NI transfers the message to the wire using another DMA, and updates the status field of the send descriptor or that of a completion queue. The actions on the receive side are very similar to those of a send: the application creates an empty buffer, posts a descriptor in the receive queue, and rings the receive doorbell on the NIC. When a message arrives for a VI, the NI transfers the message to the buffer allocated by the application and updates the status field of the receive descriptor. The message is subsequently consumed by the receiving process. Figure 4.1 shows this procedure.

Based on this framework, a few implementations of VIA have been developed recently (and some are under development) to achieve low-latency user-level communication. However, the original VIA framework does not have any QoS design specification. Here, we propose an extension of the VIA design to support different priority classes in the NIC.
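The send path just described can be sketched in host software as follows. This is a schematic outline of the descriptor/doorbell handshake, not code from the VIA specification or from any particular VIA implementation; the structure layouts, the doorbell register, and the busy-wait completion check are simplified illustrations.

#include <atomic>
#include <cstdint>

// Simplified send descriptor: where the payload lives and how large it is.
// Real VIA descriptors carry more control and address segments than shown here.
struct SendDescriptor {
    uint64_t              buffer_addr = 0;  // user buffer (registered memory in real VIA)
    uint32_t              length      = 0;
    std::atomic<uint32_t> status;            // written by the NI when the send completes
};

// One VI endpoint: a send queue in user memory plus a memory-mapped doorbell.
struct VirtualInterface {
    SendDescriptor     send_queue[64];
    uint32_t           send_tail = 0;
    volatile uint32_t* send_doorbell = nullptr;  // memory-mapped register on the NIC
};

// Post a message on the VI: write a descriptor, then ring the doorbell so the
// NI knows there is a pending request and can start its DMA transfers.
void post_send(VirtualInterface& vi, const void* payload, uint32_t len) {
    SendDescriptor& d = vi.send_queue[vi.send_tail % 64];
    d.buffer_addr = reinterpret_cast<uint64_t>(payload);
    d.length      = len;
    d.status.store(0, std::memory_order_release);    // 0 = pending

    *vi.send_doorbell = vi.send_tail;                // ring the send doorbell
    ++vi.send_tail;
}

// Wait until the NI marks the descriptor complete (a completion queue could be
// polled instead, as the text notes).
void wait_send(const SendDescriptor& d) {
    while (d.status.load(std::memory_order_acquire) == 0) { /* spin */ }
}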

Fig. 4.1. Virtual Interface Architecture Paradigm (user applications with per-VI send (S) and receive (R) queues, doorbells, and DMA transfers to a VI-capable NIC).

4.2 A QoS Capable NIC Design

We propose three design modifications to the original VIA framework, as described below: a prioritized doorbell structure to support different traffic classes, virtual channel aware buffer management in the NIC, and a hardware-supported VirtualClock scheduler to transfer flits to the router.

Figure 4.2 shows the different stages in the flow of data from an application to the NIC. Each application, such as a video source or a best-effort process, has a VI with the corresponding send and receive queues. The send and receive queues reside in user memory. To support integrated traffic, we implemented prioritized doorbells in the NIC, where there is a doorbell queue (send/receive) for each class. The NIC firmware picks up doorbells based on their priority, serving each class's queue in FCFS order, and programs the host DMA engine to transfer the descriptor followed by the message. To avoid head-of-queue blocking, we use a preemptive solution: if the NIC buffer (virtual channel) corresponding to a doorbell is blocked, the scheduler picks the next doorbell in the queue. Messages of the same class do not get reordered in this scheme.
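One plausible reading of this doorbell-selection policy is rendered below: one FIFO doorbell queue per priority class, highest class served first, and a class whose target virtual channel is currently blocked is skipped for the moment rather than reordered internally. The types and the is_vc_blocked callback are illustrative, not part of the VIA specification or the NIC firmware described here.

#include <deque>
#include <functional>
#include <optional>
#include <vector>

struct Doorbell {
    int vi_id;   // which VI posted the request
    int vc;      // NIC virtual channel the message will be staged into
};

// One FIFO doorbell queue per priority class; class 0 is the highest priority.
using DoorbellQueues = std::vector<std::deque<Doorbell>>;

// Select the next doorbell to serve: highest priority class first, FCFS within a
// class. If the head of a class is blocked, that class is skipped this round, which
// keeps messages of the same class in order while letting other classes proceed.
std::optional<Doorbell> next_doorbell(DoorbellQueues& queues,
                                      const std::function<bool(int)>& is_vc_blocked) {
    for (auto& q : queues) {                          // priority order
        if (q.empty()) continue;
        if (is_vc_blocked(q.front().vc)) continue;    // skip a blocked class for now
        Doorbell d = q.front();
        q.pop_front();
        return d;
    }
    return std::nullopt;                              // nothing serviceable this round
}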

Fig. 4.2. A VIA-style NIC with QoS support (traffic sources with VIs in user memory, doorbell queues for priorities 1 through s, a prioritized FIFO, per-VC buffers VC 1 through VC C, and a VirtualClock scheduler feeding the physical channel).

Next, to make the NIC design compatible with the QoS-aware router of the previous section, we implemented an equal number (C) of VCs in the NIC buffer to enable virtual channel flow control in the NIC. Note that this is a logical separation of the NIC local memory. As messages are transferred into the NIC by the host DMA, they are broken into flits by the NIC processor. The NIC buffer behaves as a set of FCFS queues for the different VCs. In the original VIA implementation, the send DMA engine of the NI (for example, in the Myrinet network card) is used to transfer a complete message into the network at the rate of one flit per cycle. On the other hand, the router model discussed in this


More information

Introduction to ATM Traffic Management on the Cisco 7200 Series Routers

Introduction to ATM Traffic Management on the Cisco 7200 Series Routers CHAPTER 1 Introduction to ATM Traffic Management on the Cisco 7200 Series Routers In the latest generation of IP networks, with the growing implementation of Voice over IP (VoIP) and multimedia applications,

More information

Chapter 4 Network Layer

Chapter 4 Network Layer Chapter 4 Network Layer Computer Networking: A Top Down Approach Featuring the Internet, 3 rd edition. Jim Kurose, Keith Ross Addison-Wesley, July 2004. Network Layer 4-1 Chapter 4: Network Layer Chapter

More information

Chapter III. congestion situation in Highspeed Networks

Chapter III. congestion situation in Highspeed Networks Chapter III Proposed model for improving the congestion situation in Highspeed Networks TCP has been the most used transport protocol for the Internet for over two decades. The scale of the Internet and

More information

Abstract. Paper organization

Abstract. Paper organization Allocation Approaches for Virtual Channel Flow Control Neeraj Parik, Ozen Deniz, Paul Kim, Zheng Li Department of Electrical Engineering Stanford University, CA Abstract s are one of the major resources

More information

PROVIDING SERVICE DIFFERENTIATION IN OBS NETWORKS THROUGH PROBABILISTIC PREEMPTION. YANG LIHONG (B.ENG(Hons.), NTU)

PROVIDING SERVICE DIFFERENTIATION IN OBS NETWORKS THROUGH PROBABILISTIC PREEMPTION. YANG LIHONG (B.ENG(Hons.), NTU) PROVIDING SERVICE DIFFERENTIATION IN OBS NETWORKS THROUGH PROBABILISTIC PREEMPTION YANG LIHONG (B.ENG(Hons.), NTU) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING DEPARTMENT OF ELECTRICAL &

More information

Advanced Computer Networks. Flow Control

Advanced Computer Networks. Flow Control Advanced Computer Networks 263 3501 00 Flow Control Patrick Stuedi Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Last week TCP in Datacenters Avoid incast problem - Reduce

More information

A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ

A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ E. Baydal, P. López and J. Duato Depto. Informática de Sistemas y Computadores Universidad Politécnica de Valencia, Camino

More information

Networking Quality of service

Networking Quality of service System i Networking Quality of service Version 6 Release 1 System i Networking Quality of service Version 6 Release 1 Note Before using this information and the product it supports, read the information

More information

Scheduling Algorithms for Input-Queued Cell Switches. Nicholas William McKeown

Scheduling Algorithms for Input-Queued Cell Switches. Nicholas William McKeown Scheduling Algorithms for Input-Queued Cell Switches by Nicholas William McKeown B.Eng (University of Leeds) 1986 M.S. (University of California at Berkeley) 1992 A thesis submitted in partial satisfaction

More information

University of Castilla-La Mancha

University of Castilla-La Mancha University of Castilla-La Mancha A publication of the Department of Computer Science Traffic Scheduling Solutions with QoS Support for an Input-Buffered MultiMedia Router by Blanca Caminero, Carmen Carrión,

More information

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control Chapter 12 Congestion in Data Networks Effect of Congestion Control Ideal Performance Practical Performance Congestion Control Mechanisms Backpressure Choke Packet Implicit Congestion Signaling Explicit

More information

G Robert Grimm New York University

G Robert Grimm New York University G22.3250-001 Receiver Livelock Robert Grimm New York University Altogether Now: The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation

More information

3. Quality of Service

3. Quality of Service 3. Quality of Service Usage Applications Learning & Teaching Design User Interfaces Services Content Process ing Security... Documents Synchronization Group Communi cations Systems Databases Programming

More information

Delay Constrained ARQ Mechanism for MPEG Media Transport Protocol Based Video Streaming over Internet

Delay Constrained ARQ Mechanism for MPEG Media Transport Protocol Based Video Streaming over Internet Delay Constrained ARQ Mechanism for MPEG Media Transport Protocol Based Video Streaming over Internet Hong-rae Lee, Tae-jun Jung, Kwang-deok Seo Division of Computer and Telecommunications Engineering

More information

Real-Time Mixed-Criticality Wormhole Networks

Real-Time Mixed-Criticality Wormhole Networks eal-time Mixed-Criticality Wormhole Networks Leandro Soares Indrusiak eal-time Systems Group Department of Computer Science University of York United Kingdom eal-time Systems Group 1 Outline Wormhole Networks

More information

Congestion Control and Resource Allocation

Congestion Control and Resource Allocation Congestion Control and Resource Allocation Lecture material taken from Computer Networks A Systems Approach, Third Edition,Peterson and Davie, Morgan Kaufmann, 2007. Advanced Computer Networks Congestion

More information

Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS

Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS Chapter 5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS 5.1 Introduction For successful

More information

Managing Performance Variance of Applications Using Storage I/O Control

Managing Performance Variance of Applications Using Storage I/O Control Performance Study Managing Performance Variance of Applications Using Storage I/O Control VMware vsphere 4.1 Application performance can be impacted when servers contend for I/O resources in a shared storage

More information

Lecture 16: Network Layer Overview, Internet Protocol

Lecture 16: Network Layer Overview, Internet Protocol Lecture 16: Network Layer Overview, Internet Protocol COMP 332, Spring 2018 Victoria Manfredi Acknowledgements: materials adapted from Computer Networking: A Top Down Approach 7 th edition: 1996-2016,

More information

Routing Algorithms. Review

Routing Algorithms. Review Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent

More information

General comments on candidates' performance

General comments on candidates' performance BCS THE CHARTERED INSTITUTE FOR IT BCS Higher Education Qualifications BCS Level 5 Diploma in IT April 2018 Sitting EXAMINERS' REPORT Computer Networks General comments on candidates' performance For the

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

CSE 461 Quality of Service. David Wetherall

CSE 461 Quality of Service. David Wetherall CSE 461 Quality of Service David Wetherall djw@cs.washington.edu QOS Focus: How to provide better than best effort Fair queueing Application Application needs Transport Traffic shaping Guarantees IntServ

More information

Chapter 4 Network Layer: The Data Plane

Chapter 4 Network Layer: The Data Plane Chapter 4 Network Layer: The Data Plane A note on the use of these Powerpoint slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you see

More information

Optimistic Parallel Simulation of TCP/IP over ATM networks

Optimistic Parallel Simulation of TCP/IP over ATM networks Optimistic Parallel Simulation of TCP/IP over ATM networks M.S. Oral Examination November 1, 2000 Ming Chong mchang@ittc.ukans.edu 1 Introduction parallel simulation ProTEuS Agenda Georgia Tech. Time Warp

More information

Defining QoS for Multiple Policy Levels

Defining QoS for Multiple Policy Levels CHAPTER 13 In releases prior to Cisco IOS Release 12.0(22)S, you can specify QoS behavior at only one level. For example, to shape two outbound queues of an interface, you must configure each queue separately,

More information

Journal of Electronics and Communication Engineering & Technology (JECET)

Journal of Electronics and Communication Engineering & Technology (JECET) Journal of Electronics and Communication Engineering & Technology (JECET) JECET I A E M E Journal of Electronics and Communication Engineering & Technology (JECET)ISSN ISSN 2347-4181 (Print) ISSN 2347-419X

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) The Internet was originally designed for best-effort service without guarantee of predictable performance. Best-effort service is often sufficient for a traffic that is not sensitive

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs

MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs Jose Duato 1, Sudhakar Yalamanchili 2, M. Blanca Caminero 3, Damon Love 2, Francisco J. Quiles 3 Abstract This paper presents

More information

Network Model for Delay-Sensitive Traffic

Network Model for Delay-Sensitive Traffic Traffic Scheduling Network Model for Delay-Sensitive Traffic Source Switch Switch Destination Flow Shaper Policer (optional) Scheduler + optional shaper Policer (optional) Scheduler + optional shaper cfla.

More information

Chapter 4. Computer Networking: A Top Down Approach 5 th edition. Jim Kurose, Keith Ross Addison-Wesley, sl April 2009.

Chapter 4. Computer Networking: A Top Down Approach 5 th edition. Jim Kurose, Keith Ross Addison-Wesley, sl April 2009. Chapter 4 Network Layer A note on the use of these ppt slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you can add, modify, and delete

More information

EE482, Spring 1999 Research Paper Report. Deadlock Recovery Schemes

EE482, Spring 1999 Research Paper Report. Deadlock Recovery Schemes EE482, Spring 1999 Research Paper Report Deadlock Recovery Schemes Jinyung Namkoong Mohammed Haque Nuwan Jayasena Manman Ren May 18, 1999 Introduction The selected papers address the problems of deadlock,

More information

Lecture 7. Network Layer. Network Layer 1-1

Lecture 7. Network Layer. Network Layer 1-1 Lecture 7 Network Layer Network Layer 1-1 Agenda Introduction to the Network Layer Network layer functions Service models Network layer connection and connectionless services Introduction to data routing

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Internet Architecture and Protocol

Internet Architecture and Protocol Internet Architecture and Protocol Set# 04 Wide Area Networks Delivered By: Engr Tahir Niazi Wide Area Network Basics Cover large geographical area Network of Networks WANs used to be characterized with

More information

Deadlock-free XY-YX router for on-chip interconnection network

Deadlock-free XY-YX router for on-chip interconnection network LETTER IEICE Electronics Express, Vol.10, No.20, 1 5 Deadlock-free XY-YX router for on-chip interconnection network Yeong Seob Jeong and Seung Eun Lee a) Dept of Electronic Engineering Seoul National Univ

More information

Computation of Multiple Node Disjoint Paths

Computation of Multiple Node Disjoint Paths Chapter 5 Computation of Multiple Node Disjoint Paths 5.1 Introduction In recent years, on demand routing protocols have attained more attention in mobile Ad Hoc networks as compared to other routing schemes

More information

Quality of Service in the Internet

Quality of Service in the Internet Quality of Service in the Internet Problem today: IP is packet switched, therefore no guarantees on a transmission is given (throughput, transmission delay, ): the Internet transmits data Best Effort But:

More information