
The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering

QUALITY OF SERVICE PROVISIONING IN CLUSTERS

A Thesis in Computer Science and Engineering
by Ki Hwan Yum

© 2002 Ki Hwan Yum

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2002

We approve the thesis of Ki Hwan Yum.

Date of Signature

Chita R. Das
Professor of Computer Science and Engineering
Thesis Adviser, Chair of Committee

Mary Jane Irwin
Distinguished Professor of Computer Science and Engineering

George Kesidis
Associate Professor of Electrical Engineering and Computer Science and Engineering

Vijaykrishnan Narayanan
Assistant Professor of Computer Engineering

Natarajan Gautam
Assistant Professor of Industrial and Manufacturing Engineering

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

Abstract

Cluster systems are becoming increasingly attractive for designing scalable servers with switched network architectures that offer much higher bandwidth than broadcast-based networks. Design of high performance cluster networks with Quality of Service (QoS) guarantees therefore becomes important to support a variety of multimedia applications, many of which have real-time constraints. Most commercial routers, which are based on the wormhole-switching paradigm, can deliver high performance but lack QoS provisioning. Therefore, it would be advantageous if we could leverage the large amount of effort that has gone into the design and development of these wormhole routers and adapt them to support integrated traffic with minimal design changes.

The overall goal of this thesis is to design and analyze cluster networks that can provide high and predictable performance. Design and evaluation of QoS-capable cluster networks is the focus of this thesis. In particular, we investigate various issues for efficient handling of best-effort and real-time traffic in clusters based on the wormhole switching paradigm.

We study five research issues in this thesis. First, a non-preemptive pipelined wormhole router, called MediaWorm, is proposed and investigated based on a basic wormhole router architecture. Second, to overcome the inflexibility of the MediaWorm router for dynamic workloads, a preemptive pipelined router architecture is proposed. We propose a flit-level input buffer preemption mechanism and a flit acceleration mechanism for preempting lower priority messages in favor of higher priority messages.

The third part of the thesis deals with the design of a QoS-capable network interface card (NIC) based on the Virtual Interface Architecture (VIA) paradigm. The QoS-capable routers and QoS-capable NICs are integrated to examine end-to-end QoS guarantees in a cluster system. Next, practical admission and congestion control mechanisms in a cluster environment are considered to aid in end-to-end QoS assurance. Finally, we extend our work to develop a simulation testbed conforming to the InfiniBand™ Architecture (IBA) specification and investigate QoS design issues for system area networks (SANs) within the IBA framework.

Table of Contents

List of Tables ..... viii
List of Figures ..... ix
Acknowledgments ..... xii
Chapter 1. Introduction ..... 1
Chapter 2. Related Work ..... 8
Chapter 3. Design of Wormhole Routers for QoS Support
    A Non-Preemptive Router Design: MediaWorm Router
        Architectural Design Issues
        A Rate-based Scheduling Algorithm for QoS Support
        Interconnection Topologies: Fat Networks
        Pipelined Circuit Switching
    A Preemptive Router Design
        Preemption in the Input Buffer
        A Flit Acceleration Mechanism
    Experimental Platform
        Simulation Testbed
        Workload
    Performance Results
        Comparison of MediaWorm and Traditional Routers
        Comparison of CBR and VBR Traffic Results
        Results with Mixed Traffic
        Impact of VCs and Crossbar Capabilities
        Effect of Message Size on Jitter
        Comparison of MediaWorm and PCS Routers
        Results with MPEG-2 Video Traces
        Fat-Mesh Results
        Comparisons of the Three Router Models
        A (2 × 2) Mesh Network Results
    Concluding Remarks
Chapter 4. A QoS Capable Network Interface Card Design
    Virtual Interface Architecture
    A QoS Capable NIC Design
    Performance Results
    Concluding Remarks
Chapter 5. Integrated Admission and Congestion Control in Clusters
    Basic Architecture
        Host Channel Adapter (HCA) Architecture
        VL Arbitration
    Admission and Congestion Control
        Admission Control
        Congestion Control Algorithm
    Experimental Platform
    Performance Results
        Comparisons of Congestion Control Algorithms
        Results with Admission and Congestion Control
    Concluding Remarks
Chapter 6. QoS Provisioning in InfiniBand™ Architecture (IBA)
    System Architecture
        InfiniBand™ Architecture (IBA)
        Switch Architecture
        Host Channel Adapter (HCA) Architecture
    Performance Enhancement Techniques
        Deterministic Routing Algorithm
        Packet Dropping in a Switch
    Experimental Platform
    Performance Results
    Concluding Remarks
Chapter 7. Conclusions and Future Work ..... 111
References ..... 115

List of Tables

3.1 VirtualClock and Fair Queueing Algorithms
3.2 Comparison of Fine-Grained VirtualClock with Fine-Grained Fair Queueing algorithms when the ratio of real-time to best-effort traffic is 80:20. Inter-frame time is the averaged time difference of frames measured at the destinations, and SD is the standard deviation of the inter-frame time
MPEG-2 Video Sequence Statistics
Simulation Parameters
Average Latency for Best-effort Traffic (8 × 8 switch, 16 VCs, 400 Mbps links)
Number of attempted, established and dropped connections for reaching a certain input loading in a PCS router. The values presented are for an (8 × 8) router with 24 VCs, 100 Mbps links
IBA Simulation Testbed Parameters

List of Figures

3.1 A Basic Pipelined Wormhole Router
3.2 Functional units along a router pipe for a 2-port router with 2 VCs per PC. Additional functional units such as the routing decision block and the arbitration unit are not shown. With a multiplexed crossbar as shown in the figure, contention amongst multiple pipes can occur in the crossbar input multiplexer (a) for the crossbar input port, within the crossbar (b) for the crossbar output ports, and in the VC multiplexer (c) for the output PC
3.3 MediaWorm Router Architecture
3.4 A 4-switch fat-mesh interconnect. Each switch (S0–S3) is an (8 × 8) switch. Each fat link comprises 2 physical links
3.5 Input Buffer Preemption in the Router
3.6 Flit Acceleration for Message m1
3.7 Flit Acceleration for Message m2
3.8 Comparison of MR and TR ((8 × 8) switch, 16 VCs, 400 Mbps links, x:y = 80:20)
3.9 Comparison of CBR and Synthetic VBR traffic in the MediaWorm router ((8 × 8) switch, 16 VCs, 400 Mbps links, all real-time traffic)
3.10 Mixed Traffic (synthetic VBR + best-effort traffic) ((8 × 8) switch, 16 VCs, 400 Mbps links)
3.11 Impact of VCs and Crossbar Capabilities ((8 × 8) switch, 400 Mbps links, x:y = 100:0)
Effect of message size on jitter ((8 × 8) switch, 400 Mbps link bandwidth, 16 VCs, all synthetic VBR traffic)
MR and PCS comparison ((8 × 8) switch, 100 Mbps link bandwidth, 24 VCs)
TR vs. MR with MPEG-2 Video Traffic ((8 × 8) switch, 16 VCs, 1.6 Gbps, x:y = 80:20)
Mixed Traffic (MPEG-2 Video Trace + best-effort traffic, (8 × 8) switch, 16 VCs, 1.6 Gbps)
Performance of a (2 × 2) fat mesh ((8 × 8) switches, 400 Mbps link bandwidth, 16 VCs)
Deadline missing probability and deadline missing time in a single router under dynamic load variation. The input load is specified in the graphs
Components of message latency of control traffic and best-effort traffic in a single router under dynamic load variation
Effect of block-level multiplexing in a single preemptive router under dynamic load variation
Comparison of preemption+acceleration and only preemption in a single router under dynamic load variation. (Some results for the best-effort traffic at high load are not included due to saturation.)
Deadline missing probability and average latency of best-effort traffic in a (2 × 2) mesh network under dynamic load variation
4.1 Virtual Interface Architecture Paradigm
A VIA-style NIC with QoS support
Validation of the NIC simulator with a ping-pong application
Co-evaluation of a single router and NICs. The deadline missing probability is shown for dynamically changing workload
A Proposed InfiniBand™ Host Channel Adapter with QoS Support
Connection Setup Procedure
Message Latency and Throughput in a 4 × 4 Mesh Network with 100% Best-effort Traffic
Performance Results of a Single Router Cluster with MPEG-2 Video Traffic (uniform distribution)
Performance Results of a Single Router Cluster with MPEG-2 Video Traffic (hot-spot distribution)
Average Message Latency in a 5 × 5 Mesh Network with On/Off Real-time Traffic
Message Latency and Throughput in a Single Router with On/Off Real-Time Traffic
A 5-stage Pipelined IBA Switch Model
Comparison of Various Routing Algorithms in a 15-Node Irregular Network
Comparison of Various Routing Algorithms in a 30-Node Irregular Network ..... 110

Acknowledgments

This thesis and my positive experience at Penn State owe a great deal to my advisor, Professor Chita Ranjan Das. Dr. Das was instrumental in providing the motivation and background that started off this work. His continuous support and guidance, his encouragement during hard times and, when needed, the healthy doses of constructive criticism that he provided have all been very valuable throughout my stay at Penn State.

My thesis committee members have helped review my thesis from its early proposal stages to its current form and have provided valuable inputs and suggestions. Professor Mary Jane Irwin, Professor George Kesidis, Professor Vijaykrishnan Narayanan, and Professor Natarajan Gautam have all taken the time and effort to accommodate my thesis review into their tight schedules. For this, I express my sincere gratitude.

During the course of my work, I have often looked to both internal and external sources of help for ideas and critical reviews. I have been fortunate to have received this from many colleagues, collaborators, academicians, and practitioners. I wish to thank all of them here. Eun Jung Kim, Professor Jose Duato, Dr. Mazin Yousif, Srinivas Hanabe, Vithal Shirodkar, and Giridhar Viswanathan, along with my advisor, have all collaborated with me on various research projects.

My greatest inspirational sources, supporters, and fans have been my parents and family. I dedicate this work to my parents Soo Chul Yum and Jung Ja Park, my wife Eun Jung Kim, and my son Sang Joon Yum. Thank you for your unconditional love and support.

Chapter 1
Introduction

Cluster systems are becoming increasingly attractive for designing scalable servers with switched network architectures that offer much higher bandwidth than broadcast-based networks. Quality of Service (QoS) provisioning in such clusters is becoming a critical issue with the widespread use of these systems in diverse commercial applications. The traditional best-effort service model that has been used for scientific computing is not adequate to support many cluster applications with varying consumer expectations. As an example, many web servers and database servers make efficient use of clustering technology from cost, scalability, and availability standpoints. However, the tremendous surge in dynamic web content, multimedia objects, e-commerce, and other web-enabled applications requires QoS guarantees in different connotations. The guaranteed communication delay and bandwidth requirements of the applications mandate that the cluster interconnect be able to handle these traffic demands. These demands, in turn, are passed on to the building blocks of the interconnects, the switching fabrics or routers. Hence, it has become crucial to revisit the design of router architectures to provide high and predictable performance.

Typically, two classes of traffic are generated with mixed or integrated workloads: best-effort traffic and real-time traffic.

While best-effort traffic usually does not have any stringent performance requirements (and is hence known as available bit rate (ABR) traffic), real-time traffic is further classified into constant bit rate (CBR) and variable bit rate (VBR) workloads. A cluster network should therefore support ABR, CBR, and VBR effectively.

Two switch or router design paradigms have been used to build clusters [18]. One is based on cut-through switching mechanisms (wormhole [15] and virtual cut-through (VCT) [34]), originally proposed for multiprocessor switches, and the other is based on packet switching. Current multiprocessor routers, primarily based on the cut-through paradigm, are suitable for handling ABR traffic. However, they may not be able to support stringent QoS requirements efficiently without modifying the router architecture. On the other hand, packet switching mechanisms like ATM can provide QoS guarantees, but they are not suitable for best-effort traffic, primarily due to high message latency compared to cut-through switching [17, 20]. Therefore, none of the existing network architectures are optimized to handle both best-effort and real-time traffic in clusters.

In view of this, a few researchers have explored the possibility of providing QoS support in router architectures [7, 17, 20, 38, 57]. Most of these designs have used a hybrid approach with two different types of switching mechanisms within the same router: one for best-effort and the other for real-time traffic. They have refrained from using wormhole switching because of the potential unbounded delay for real-time traffic. On the other hand, in the commercial world, wormhole switching appears to have become a de facto standard for clusters/multiprocessors. Therefore, it would be advantageous if we could leverage the large amount of effort that has gone into the design and development of these wormhole routers and adapt them to support all traffic classes with minimal design changes.

Some recent modifications to wormhole routers have been considered for handling traffic priority [12, 26, 42, 65]. However, to our knowledge, there have been no previous studies investigating the viability of supporting multimedia traffic with wormhole switching.

QoS support only in the router/interconnect is not adequate to assure application-level performance guarantees. In order to provide end-to-end QoS guarantees in clusters, QoS provisioning in the network interface card (NIC) is also important. It is known that the NIC plays a crucial role in reducing the communication overhead [47]. The role of the NIC may become even more important to satisfy the QoS requirements of different traffic classes. Several user-level communication mechanisms have been proposed recently, where an application can directly communicate with an intelligent NIC with minimal kernel support [2]. Among them, the Virtual Interface Architecture (VIA) [19, 73] framework is becoming a standard for designing user-level communication on NICs. However, it is not clear how QoS provisioning should be provided in the context of VIA. In addition, a co-evaluation of the cluster router/interconnect with a VIA-style NIC is essential to understand the interplay of different designs on the overall performance of the communication architecture. To our knowledge, none of the prior research has considered the above research issues in the design and evaluation of QoS capable cluster interconnects.

Finally, admission and congestion control mechanisms are integral parts of any QoS design for systems that support integrated traffic. While an admission control algorithm helps in delivering the assured performance, a congestion control algorithm regulates traffic injection to avoid network saturation. However, the integration of admission and congestion control in clusters has not been examined up to now.

Admission control algorithms help to meet the Service Level Agreements (SLAs) of real-time applications. However, admission control alone may not be effective enough to guarantee the SLAs of real-time and best-effort applications because they may exhibit unpredictable behavior, resulting in short- or medium-term network traffic overload. Such traffic overload considerably degrades overall network throughput. Therefore, a congestion management algorithm is typically used to monitor the network load and intervene when the traffic load reaches a certain threshold indicating possible network congestion. Since a congestion management scheme also brings its own set of constraints on the injection of traffic flows into the network, both admission control and congestion management are collectively needed to guarantee various QoS constraints. This is especially true in clusters running a diverse set of applications.

Recently, the InfiniBand™ Architecture (IBA) has been proposed as a new communication standard to design SANs for scalable, high performance clusters. IBA is expected to revolutionize the future communication paradigm by solving the bandwidth, scalability, reliability, and standardization issues under one unifying design. The IBA Trade Association (IBTA), consisting of more than 220 industry leaders, has released the first IBA specification [29] and is currently augmenting it with enhanced features such as Congestion Management, Quality of Service (QoS), and Router Management. QoS is becoming an essential part of the IBA framework [54] because of the sophistication of services that will be supported by clusters connected through SANs. IBA could use either a packet-switched or a virtual cut-through switched technology to connect processors and I/O devices. The specification supports any topology to facilitate ease of expansion and to build large networks consisting of smaller subnets.

It outlines only the functionalities, without any constraints on the actual design. Therefore, it is conceivable to have multiple design alternatives for the same set of high-level requirements. This makes the design space very complex because of the multitude of options possible at different levels of the design. An IBA testbed is, therefore, essential to investigate various design options for satisfying the performance and QoS requirements. However, no such simulation platform is available now, and as we understand, the IBTA is planning to develop such a platform with help from academia.

This research is also aimed at investigating the following design issues for providing improved and predictable performance in IBA. First, it is not clear what a good routing algorithm for IBA is, considering the fact that the interconnect could be an irregular topology. Second, the IBA specification supports multipathing to facilitate Automatic Path Migration (APM) between a source and destination pair to provide fault tolerance. However, the actual path setup in the routing/forwarding table is left open to the designers. Moreover, we believe that the multipathing mechanism can be used not only for fault tolerance, but also for congestion avoidance to improve performance. Therefore, it is essential to understand the design and performance implications of multipathing. Finally, packet dropping is allowed under the IBA framework to limit the lifetime of a packet in a network. Packet dropping can also be used for deadlock avoidance. Thus, instead of using a complex deadlock-free algorithm, one can use a simple routing scheme with packet dropping to provide competitive performance. This concept needs careful investigation.

It appears that QoS provisioning in clusters is an important but open area of research. The main motivation of this research is to design and investigate various issues for QoS provisioning in clusters.

The research includes the development of QoS capable routers, QoS capable NICs, and admission and congestion control algorithms for wormhole-switched and IBA-style SANs. An overview of the research is given below. The research proposes to investigate the following issues.

• A non-preemptive router model (MediaWorm router): We propose to design a wormhole router, called MediaWorm, to support both real-time and best-effort traffic. In this model, the virtual channels (VCs) are divided according to traffic types at configuration time. A rate-based scheduling algorithm, called VirtualClock [82], is used to provide proportional bandwidth allocation.

• A preemptive router model: Since the MediaWorm router statically divides the input and output VCs among the traffic classes, the configuration cannot be changed during execution. Therefore, if the workload consisting of real-time and best-effort traffic changes dynamically during execution, it may suffer from a shortage of resources. The preemptive router model can be a solution to this problem, where several classes of traffic can share a VC, with the provision that a higher priority message can preempt a lower priority message. The design of the preemptive model is examined in detail.

• QoS capable NIC design: We propose a network interface card (NIC) design based on the Virtual Interface Architecture (VIA) to support QoS for real-time traffic. The QoS capable NIC and the router designs will be integrated to evaluate the entire communication substrate for an end-to-end performance analysis.

• Admission and congestion control in clusters: Next, in order to provide end-to-end QoS guarantees for applications, we propose to develop a simple admission control mechanism and an elegant congestion control mechanism called credit-based congestion control. These algorithms are developed using the MediaWorm router and the QoS-capable NIC developed in this research.

• QoS provisioning in InfiniBand™ Architecture (IBA): Finally, a simulation testbed for IBA that includes a packet-switched router, adaptive routing, and Weighted Round Robin (WRR) scheduling will be developed for the design and evaluation of the IBA framework. Architectural modifications will be investigated for QoS provisioning.

The rest of the thesis is organized as follows. Chapter 2 summarizes the related work. Chapter 3 discusses the designs of the non-preemptive router, called MediaWorm, and the preemptive router. In Chapter 4, the QoS capable NIC design is investigated. Integration of admission and congestion control into cluster networks is the topic of Chapter 5. In Chapter 6, the IBA simulator is discussed, followed by the conclusions in Chapter 7.

Chapter 2
Related Work

With the building block of a multiprocessor interconnect being its router or switch fabric, a considerable amount of research effort has gone into the design of efficient routers. Routers from university projects like the Reliable Router [13] and the Chaos router [41], and commercial routers such as the SGI SPIDER [23], Cray T3D/E [59, 60], Tandem ServerNet-II [24], IBM SP2 switch [68], and Myrinet [5] use wormhole switching, while the HAL Mercury [75] and Sun S-Connect [48] use virtual cut-through (VCT). Most of them support VCs, and at least the Cray T3E, ServerNet-II, and S-Connect have adaptive routing capability. Metro [9] and Ariadne [1] employ the pipelined circuit switching (PCS) technique; the latter is fully adaptive and tolerates link and switch faults. A hybrid switch including both wormhole and VCT was designed in [61]. All these routers are primarily designed to minimize average message latency and improve network throughput. The SGI SPIDER, Sun S-Connect, and Mercury support message priority. But none of these routers can guarantee QoS as required for real-time applications like VOD services. ServerNet is the only router that provides a link arbitration policy (called ALU-biasing) for implementing bandwidth and delay control, but it still does not provide any capabilities to support multimedia traffic.

Recently, a few researchers have explored the possibility of providing QoS support in multiprocessor/cluster interconnects.

The need for such services, existing methods to support QoS specifically in WAN/long-haul networks, and their limitations are summarized in [8, 38]. Kim and Chien [39] propose a scheduling discipline, called rotating and combined queue (RCQ), to handle integrated traffic in a packet-switched network. The Switcherland router [20], designed for multimedia applications on a network of workstations, uses a packet-switched mechanism similar to ATM, while avoiding some of the overheads associated with the WAN features of ATM. The router architecture proposed in [57] uses a hybrid approach, wherein wormhole switching is used for best-effort traffic and packet switching is used for time-constrained traffic. A multimedia router architecture (MMR), proposed in [7, 17], also adheres to a hybrid approach by using pipelined circuit switching (PCS) for multimedia traffic and virtual cut-through (VCT) for best-effort traffic. The authors have designed a (4 × 4) router to support both PCS and VCT schemes, and have used MPEG video traces in their evaluations. While a connection-oriented mechanism such as PCS is suitable for multimedia traffic, it needs one VC per connection. For a link bandwidth of 1.24 Gbps, and with each multimedia stream requiring 4 Mbps, the design would require 256 VCs to fully utilize a physical channel. It is not clear whether it is practical to have such a large number of VCs per physical channel and what the cost of the corresponding multiplexer and demultiplexer implementations would be. In addition, the architecture of the router is fairly complex since it has to have facilities for both PCS and VCT transmission. Nevertheless, this is perhaps the most detailed study where router performance has been analyzed with multimedia video streams, best-effort, and control traffic. A preemptive PCS network to support real-time traffic is also proposed in [3].

To our knowledge, there are only a handful of research efforts that have examined the possibility of using wormhole-switched networks for real-time traffic [12, 26, 35, 42, 65]. In many of these studies [35, 42, 65], the focus is on providing some mechanisms within the router to implement priority (for real-time traffic) and preemption (when the resources are allocated to a less critical message). However, these mechanisms are not sufficient (and may not even be necessary) for providing soft guarantees for multimedia traffic. Three different techniques for providing QoS in wormhole-switched routers are explored in [26] using a simulated multistage network. These include using a separate subnet for real-time traffic, supporting a synchronous virtual network on the underlying asynchronous network, and employing VCs. The first approach may not be cost-effective. The second solution of using a synchronous network (either inherently synchronous or simulated on top of an asynchronous network, as is done in [12] on Myrinet) is not a scalable option. The third option of using VCs has not been investigated in depth in [26], where it has been cursorily examined in the context of indirect networks. The software-oriented synchronization mechanism in the Myrinet switch proposed in [12] also lacks scalability.

Message preemption in wormhole routers has been addressed in [40, 65]. In [40], lower priority messages that block higher priority messages are discarded to allow faster delivery of the higher priority messages. This approach has the advantage that it does not require extra resources to store routing information of the preempted messages. But preempted messages are lost in this scheme, and thus it may not be a viable option for many applications. With additional hardware and flow control, it is possible to recover the low priority messages. Song et al. [65], on the other hand, preempt a lower priority message in favor of a higher priority message using additional buffers.

In their scheme, the router has (s − 1) extra input buffers, where s is the number of priority levels it supports. By providing these additional input buffers, the router can always establish a free path for higher priority messages. This scheme requires a history stack for storing the header information of the preempted messages in ascending order of their priorities for each output channel. Unlike our pipelined router model, the authors use a lumped router design. Hence, many architectural details that are required to support flit-level preemption are not addressed in their work. Provisioning for preemption in different stages of the pipeline is much more complex than in a single-stage (lumped) router model. But none of the above studies [40, 65] have examined the design details in the context of a pipelined router architecture.

It is still not clear what the best switching mechanism is that can support all traffic classes. Should we resort to hybrid routers that differentially service the traffic classes (and pay a high cost), as many of the above studies have done? Or can we use a single switching mechanism (wormhole switching in particular, since it has been proven to work well for best-effort traffic and we can leverage the immense body of knowledge/infrastructure available for this mechanism) with little or no modifications? Instead of discarding the wormhole switching mechanism as an option for multiple traffic classes in an ad hoc manner, as many of the above studies have done, this thesis explores how a large number of multimedia connections can be supported in the presence of best-effort traffic.

Admission Control: An admission control algorithm determines whether a new real-time traffic flow can be admitted to the network without jeopardizing the performance guarantees given to the already established flows. Such an algorithm is essential, irrespective of the underlying communication architecture, to regulate the traffic flow. Admission control in packet-switched networks has been a rich area of research. There are two broad classes of admission control algorithms: deterministic and statistical admission control. For real-time services that need a hard or absolute bound on the delay of every packet, deterministic admission is used [21]. For such deterministic services, an admission control algorithm calculates the worst-case behavior of the existing flows in addition to the incoming one before deciding if the new flow should be admitted. This model underutilizes network resources, especially with bursty traffic. Many new applications such as media streams do not need hard performance guarantees and can tolerate a small violation in performance bounds. A statistical admission control scheme can be used for such applications. In this approach, an effective bandwidth that is larger than the average rate but less than the peak rate is commonly used. The bandwidth can be computed using a statistical model [58] or a fluid flow approximation [33]. In addition, a third class of algorithms, called measurement-based algorithms, controls the admissible region based on aggregate traffic measurements [30, 56]. For admission control in clusters, the MMR design uses the average and peak rates of requests [17]. However, this router uses PCS for real-time traffic and needs one virtual channel (VC) per connection (flow). The Switcherland router [20], based on the ATM protocol, uses a statistical admission algorithm. A flit reservation flow control scheme that uses control flits to reserve bandwidth and buffers prior to the transfer of data flits has been proposed recently [53]. To our knowledge, there is no prior work on admission control in wormhole-switched networks.
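As a concrete illustration of the statistical approach, the sketch below admits flows against a single link using an effective bandwidth between the declared average and peak rates. The flow descriptor, the 0.5 weighting factor, and the link capacity are assumptions of this sketch, not parameters taken from any of the cited schemes.

```python
from dataclasses import dataclass

@dataclass
class FlowRequest:
    avg_rate_mbps: float   # long-term average rate declared by the source
    peak_rate_mbps: float  # peak rate declared by the source

class LinkAdmissionController:
    """Toy statistical admission test for a single link (illustrative only)."""

    def __init__(self, capacity_mbps: float, weight: float = 0.5):
        # weight in [0, 1] interpolates between the average (0) and peak (1) rate
        self.capacity_mbps = capacity_mbps
        self.weight = weight
        self.reserved_mbps = 0.0

    def effective_bandwidth(self, flow: FlowRequest) -> float:
        # The effective bandwidth lies between the average and the peak rate.
        return flow.avg_rate_mbps + self.weight * (
            flow.peak_rate_mbps - flow.avg_rate_mbps)

    def admit(self, flow: FlowRequest) -> bool:
        need = self.effective_bandwidth(flow)
        if self.reserved_mbps + need <= self.capacity_mbps:
            self.reserved_mbps += need   # reserve and accept the flow
            return True
        return False                     # reject: would overbook the link

# Example: a 400 Mbps link offered 4 Mbps (avg) / 8 Mbps (peak) video flows.
ac = LinkAdmissionController(capacity_mbps=400.0)
admitted = sum(ac.admit(FlowRequest(4.0, 8.0)) for _ in range(100))
print(admitted)  # 66 flows fit at an effective bandwidth of 6 Mbps each
```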

Congestion Control: Congestion control is required to regulate traffic injection into a network to avoid network saturation, which may lead to a performance penalty. In networks with QoS guarantees, congestion control mechanisms first attempt to regulate best-effort and misbehaving real-time traffic, and only if required, traffic from other service classes. In wormhole-switched networks, prior work on congestion control tends to limit the message injection rate in each node when a specified network saturation point is reached [4, 64, 69]. Local or global information could be used to determine network saturation. For example, Lopez et al. [4] used the busy/free status of VCs to assess network congestion. Smai and Thorelli [64] counted on the global network state to detect network congestion. To achieve a global view of the network, each node communicates its traffic status to other nodes, which may lead to excessive communication overhead. Thottethodi et al. [69] suggested a self-tuned approach that determines appropriate threshold values to estimate network congestion.

Previous congestion control algorithms for wormhole-switched networks do not provide end-to-end congestion control. They only consider the network/router status, not the NI, which is closer to the applications. Moreover, instead of penalizing the flow that caused congestion, a uniform reduction rate is typically applied to all the flows that pass through the congested point. Ideally, congestion control should be applied selectively per flow/application, as is done in Internet TCP flow control. The proposed algorithm has this selective control ability.
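The injection-rate limiting that these schemes share can be sketched as follows. Here the congestion estimate is simply the fraction of busy VCs observed at the local router, and the threshold and back-off/recovery steps are illustrative placeholders rather than values taken from [4, 64, 69].

```python
class InjectionThrottle:
    """Illustrative node-level injection limiter driven by a congestion estimate."""

    def __init__(self, threshold: float = 0.7, min_rate: float = 0.1):
        self.threshold = threshold  # busy-VC fraction that signals congestion
        self.min_rate = min_rate    # never throttle below this injection rate
        self.rate = 1.0             # fraction of the offered load actually injected

    def update(self, busy_vcs: int, total_vcs: int) -> float:
        """Adjust the permitted injection rate from the local VC occupancy."""
        occupancy = busy_vcs / total_vcs
        if occupancy > self.threshold:
            # Congestion suspected: back off multiplicatively.
            self.rate = max(self.min_rate, self.rate * 0.5)
        else:
            # Network looks healthy: recover additively.
            self.rate = min(1.0, self.rate + 0.05)
        return self.rate

throttle = InjectionThrottle()
print(throttle.update(busy_vcs=14, total_vcs=16))  # 0.5: congested, back off
print(throttle.update(busy_vcs=6, total_vcs=16))   # 0.55: recovering
```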

Internet Congestion Control: QoS support in the Internet architecture involves both admission control and end-to-end TCP congestion control. Admission control is included in support of services such as Differentiated Services (DiffServ) [RFC 2475] and Integrated Services (IntServ) [RFC 1633]. TCP congestion control is based on manipulating the congestion window size relative to the number of dropped packets and timeouts. Two major drawbacks of TCP congestion control are that congestion is detected only after noticing a packet drop, and that all flows subsequently reduce their injection rate after receiving the congestion notification. The latter problem is the well-known global synchronization problem. Various solutions have been proposed to mitigate this issue by detecting incipient congestion, including the Random Early Detection (RED) [22] algorithm and many of its variations [10, 43, 49]. A RED gateway continuously computes the average buffer size, randomly marks packets when a certain threshold is reached, and drops packets when the buffer gets full.
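The RED behaviour just described reduces to a few lines of control logic; the thresholds, maximum marking probability, and EWMA weight below are typical illustrative values, not ones mandated by [22].

```python
import random

class REDQueue:
    """Minimal Random Early Detection sketch (decision logic only, no real queue)."""

    def __init__(self, capacity, min_th, max_th, max_p=0.1, weight=0.002):
        self.capacity = capacity  # physical buffer size in packets
        self.min_th = min_th      # below this average size, never mark
        self.max_th = max_th      # above this average size, always mark
        self.max_p = max_p        # marking probability at max_th
        self.weight = weight      # EWMA weight for the average queue size
        self.avg = 0.0

    def on_arrival(self, queue_len: int) -> str:
        """Return 'enqueue', 'mark', or 'drop' for an arriving packet."""
        # Exponentially weighted moving average of the instantaneous queue size.
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        if queue_len >= self.capacity:
            return "drop"                      # buffer full: tail drop
        if self.avg < self.min_th:
            return "enqueue"                   # no incipient congestion
        if self.avg >= self.max_th:
            return "mark"                      # persistent congestion
        # Marking probability grows linearly between the two thresholds.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return "mark" if random.random() < p else "enqueue"
```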

Congestion control mechanisms that have been adopted for the Internet might not be suitable for clustered architectures, since clusters are typically hosted in small physical areas and make wide use of reliable communication. Thus, dropping packets is not a suitable option for managing congestion, particularly in a wormhole-switched network. This is also true for IBA [29]. Although IBA allows packet dropping in its various transport services, the details are still unclear. Further, the current version (1.0) of the IBA specification [29] provides minimal support for QoS. A 4-bit field is reserved for the Service Level (SL) specification in the packet header, but the specification does not define how to map the SLs to QoS classes. According to the InfiniBand Trade Association (IBTA), detailed QoS support will be available in the next release of the IBA specification. Thus, integrated admission and congestion control work in clusters/SANs is in its infancy.

In the next chapter, we start with the design of wormhole routers for supporting integrated traffic.

Chapter 3
Design of Wormhole Routers for QoS Support

In this chapter, we present two different designs for supporting integrated traffic in wormhole routers: a non-preemptive architecture and a preemptive architecture.

3.1 A Non-Preemptive Router Design: MediaWorm Router

The main motivation of this research is to investigate the feasibility of supporting mixed traffic in wormhole routers with minimal modifications to the existing router architecture. We are specifically interested in transferring multimedia video streams in addition to the usual best-effort traffic. This requires providing some mechanism within the router that recognizes the bandwidth requirements of VBR and CBR traffic and accommodates these requests. One can borrow concepts from real-time/Internet research to provide hard or soft guarantees. Instead of conservatively reserving resources within the router to achieve these goals with hard guarantees, we are interested in more optimistic solutions that provide soft guarantees to media streams.

In this research, we propose a new wormhole router architecture, called MediaWorm [78, 80], using a conventional pipelined wormhole router design for meeting the bandwidth requirements. Two modifications are proposed to a standard wormhole router. First, the VCs are partitioned into two classes: one for transferring best-effort traffic and the other for real-time traffic.

Second, in order to satisfy the bandwidth requirements of different applications, the round-robin (RR) or First-In-First-Out (FIFO) scheduler used in a traditional router is replaced by a rate-based scheduling mechanism, called VirtualClock [82].

Architectural Design Issues

In wormhole-switched networks, messages are segmented into flow-control units called flits. As a message enters a router, its header flit is used to determine the permitted output port that would route the message to its destination. The message then flows through the router crossbar to the appropriate output port. If resources (such as output buffer space or output ports) are busy, the message blocks until resources become available. Flits of a message flow through the network in a pipelined manner. Performance of wormhole routers can be enhanced through the use of virtual channels (VCs) [14]. VCs are also used for supporting deadlock freedom and providing adaptive routing capabilities. Wormhole routers can be pipelined so that, although a flit experiences a multi-cycle latency to get from its input port to an output port, the router cycle time can be kept very small (typically a few nanoseconds), depending on the slowest stage of the pipeline.

We use a pipelined router model, called PROUD [70, 71], to design the MediaWorm router. The pipelined model with five stages, as depicted in Figure 3.1, represents the recent trend in router designs [52]. Stage 1 of the pipeline represents the functional units which synchronize the incoming flits and demultiplex a flit so that it can go to the appropriate VC buffer to be subsequently decoded.

Figure 3.1. A Basic Pipelined Wormhole Router (Stage 1: sync, demux, buffer, decode; Stage 2: routing decision; Stage 3: arbitration; Stage 4: bandwidth reservation, crossbar mux, crossbar route; Stage 5: buffering, VC mux, sync. Tail and middle flits take the bypass path around Stages 2 and 3.)

If the flit is a header flit, the routing decision and arbitration for the correct crossbar output are performed in the next two stages (Stage 2 and Stage 3). On the other hand, the middle flits and the tail flit of a message bypass Stages 2 and 3 and move directly to Stage 4. Flits get routed to the correct crossbar output in Stage 4. The bandwidth of the crossbar may be (optionally) multiplexed amongst multiple VCs; this is discussed in detail later in this section. Finally, the last stage of the router performs buffering for flits flowing out of the crossbar, multiplexes the physical channel bandwidth amongst multiple VCs, and performs handshaking and synchronization with the input ports of other routers or the network interface for the subsequent transfer of flits.

A pipelined router can thus be modeled as multiple parallel PROUD pipes. In an n-port router, if each PC has m VCs, the router can be modeled as (n × m) parallel pipes. Resource contention amongst these pipes could occur for the crossbar output ports (which is managed by the arbitration unit) as well as for the physical channel bandwidth of the output link (which is managed by the virtual channel multiplexer).
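A minimal sketch of the five-stage pipe makes the bypass behaviour explicit; the stage labels follow Figure 3.1, while the flit representation is only a convenience for this illustration.

```python
from dataclasses import dataclass

STAGES = [
    "sync/demux/buffer/decode",   # Stage 1
    "routing decision",           # Stage 2 (header flits only)
    "arbitration",                # Stage 3 (header flits only)
    "crossbar traversal",         # Stage 4
    "vc mux / output sync",       # Stage 5
]

@dataclass
class Flit:
    kind: str  # "header", "middle", or "tail"

def stages_visited(flit: Flit) -> list[str]:
    """Header flits use all five stages; middle/tail flits bypass Stages 2-3."""
    if flit.kind == "header":
        return STAGES
    return [STAGES[0], STAGES[3], STAGES[4]]

# A header flit spends 5 router cycles in the pipe, a data flit only 3.
print(len(stages_visited(Flit("header"))), len(stages_visited(Flit("middle"))))
```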

We consider two different crossbar design options: a full crossbar and a multiplexed crossbar [14]. A full crossbar has a number of input and output ports equal to the total number of VCs supported, (n × m) for an n-port router with m VCs per PC. On the other hand, a multiplexed crossbar has a number of input/output ports equal to the total number of PCs (n). A full crossbar may improve the router performance at a significantly higher implementation cost; a multiplexed crossbar is cheaper to implement but requires more complex scheduling. Support for a larger number of VCs may mandate the use of a multiplexed crossbar from a practical viewpoint. For a multiplexed crossbar implementation, a multiplexer has to be used at the crossbar input ports and a demultiplexer at the crossbar output ports. Introduction of the additional multiplexer introduces a new contention point in the router. Figure 3.2 shows the various functional units along a router pipe when a router implements a multiplexed crossbar.

Figure 3.2. Functional units along a router pipe for a 2-port router with 2 VCs per PC. Additional functional units such as the routing decision block and the arbitration unit are not shown. With a multiplexed crossbar, as shown in the figure, contention amongst multiple pipes can occur in the crossbar input multiplexer (a) for the crossbar input port, within the crossbar (b) for the crossbar output ports, and in the VC multiplexer (c) for the output PC.

In order to allocate bandwidth to different types of traffic, we plan to use a rate-based scheduling algorithm at one of the contention points shown in Figure 3.2. The selection of a rate-based algorithm and its implementation are described next.

A Rate-based Scheduling Algorithm for QoS Support

There are two main categories of bandwidth scheduling algorithms: flow-based and frame-based. Flow-based algorithms like VirtualClock [82], Fair Queueing [16], General Processor Sharing (GPS) [51], Self-Clocked Fair Queueing (SCFQ) [27], and Frame-based Fair Queueing (FFQ) [66] use time stamps to make scheduling decisions, while frame-based scheduling algorithms like Round Robin (RR), Weighted RR (WRR) [32], Deficit RR (DRR) [62], and Hierarchical RR (HRR) [31] poll queues sequentially during each round with different priorities. The frame-based algorithms usually assign a known priority to each queue, but how to assign a priority to each queue with VBR traffic is not obvious. While a precomputed priority per queue helps reduce computation overhead, the flow-based algorithms must timestamp arriving packets and find the minimum time stamp every cycle. However, since in our router there can be multiple flows in each queue and we want to assign priorities to flows, not to queues, we focus on flow-based algorithms in this work.

For this study, we consider two different work-conserving, rate-based schedulers: Fair Queueing and VirtualClock. The effectiveness of the two schemes has been analyzed by several researchers for QoS assurance in packet-switched networks [67]. In both of these algorithms, there is a state variable associated with each channel i to monitor and enforce the rate for that channel.

In VirtualClock, the variable is called the auxiliary VirtualClock (auxVC); in Fair Queueing, it is called the Finish Number (F). The computation of auxVC and F is shown in Table 3.1. In VirtualClock, AT is the arrival time or wall-clock time. In Fair Queueing, R is the number of rounds that has been completed by a hypothetical bit-by-bit round robin server, n is the weight factor, and P is the message length (in bits). Vtick in VirtualClock and P/n in Fair Queueing specify the inter-arrival time of messages; therefore, a smaller value implies higher bandwidth. For best-effort traffic, Vtick is assigned the largest possible value. With Vtick and P/n specified, there is no difference between VirtualClock and Fair Queueing except that Fair Queueing uses the round robin number (R) instead of the actual arrival time (AT) required for VirtualClock. The computational complexity of R is O(N), where N is the total number of connections. Fair Queueing algorithms with lower computational complexity can be found in [27, 66]. We can use the system clock for AT in the VirtualClock algorithm, and hence it needs no extra computation. It has been shown that both schemes have similar performance [67], except that the VirtualClock algorithm cannot handle bursty traffic effectively without any input regulation. Traffic burstiness can be handled by regulating the traffic injection.

VirtualClock:
    auxVC_i <- max(AT, auxVC_i)
    auxVC_i <- auxVC_i + Vtick_i
    timestamp the packets with auxVC_i

Fair Queueing:
    F_i <- max(R, F_i)
    F_i <- F_i + P_i / n_i
    timestamp the packets with F_i

Table 3.1. VirtualClock and Fair Queueing Algorithms
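Table 3.1 translates directly into per-channel state updates. In the sketch below the caller supplies the wall-clock time AT (for VirtualClock) or the round number R (for Fair Queueing); the class names and units are ours.

```python
class VirtualClockChannel:
    """Per-channel state for the VirtualClock update of Table 3.1."""

    def __init__(self, vtick: float):
        self.vtick = vtick   # desired inter-arrival time; smaller = more bandwidth
        self.aux_vc = 0.0

    def stamp(self, arrival_time: float) -> float:
        self.aux_vc = max(arrival_time, self.aux_vc)  # auxVC_i <- max(AT, auxVC_i)
        self.aux_vc += self.vtick                     # auxVC_i <- auxVC_i + Vtick_i
        return self.aux_vc                            # timestamp the packet with auxVC_i


class FairQueueingChannel:
    """Per-channel state for the Fair Queueing update of Table 3.1."""

    def __init__(self, weight: float):
        self.weight = weight  # n_i: weight factor of channel i
        self.finish = 0.0

    def stamp(self, round_number: float, length_bits: float) -> float:
        self.finish = max(round_number, self.finish)  # F_i <- max(R, F_i)
        self.finish += length_bits / self.weight      # F_i <- F_i + P_i / n_i
        return self.finish                            # timestamp the packet with F_i
```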

The above algorithms were developed for connection-oriented networks, where one channel is dedicated to each connection (as in PCS), and when a connection is set up, a fixed Vtick (or P/n) value is assigned for the entire duration of the connection. This results in two problems. The first is that, when dealing with VBR connections, one representative Vtick (or P/n) value may cause underutilization of the resources or incur higher message delay. The other problem is that, since one channel services one connection, a large number of VCs is required to handle multimedia streams. Consequently, it may lead to a complex router design with more hardware circuitry.

In this study, we are interested in a connectionless paradigm without any explicit connection setup, since this provides more efficient use of the network resources. To overcome the above two problems, we modified the connection-oriented algorithms as follows: each message requests its required bandwidth at each router on its way to the destination, and the router implements the VirtualClock (or Fair Queueing) algorithm to allocate the requested bandwidth to its flits. So in our router, each message works as if it were a connection, and each flit works as if it were a message of the originally proposed algorithm.

In the original algorithms, the fixed Vtick (or P/n) can be calculated from the average bandwidth requirement or the peak bandwidth requirement of the connection. Vtick (or P/n) in this study represents the inter-generation time between flits, and is given as

    Vtick (or P/n) = message inter-arrival time / message size in flits.
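In the fine-grained variant, the Vtick carried in a message header is therefore derived from that message alone rather than from a connection-wide reservation. A minimal sketch, with times in microseconds purely for illustration:

```python
def message_vtick(inter_arrival_time_us: float, message_size_flits: int) -> float:
    """Vtick (or P/n) = message inter-arrival time / message size in flits."""
    return inter_arrival_time_us / message_size_flits

# A 33 ms frame gap spread over a 512-flit message asks for one flit every ~64.5 us.
print(round(message_vtick(33_000.0, 512), 1))
```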

Thus, the Vticks (or P/n) of two messages in the same connection can be different if they belong to different frames of different sizes. A message makes its request by carrying its Vtick (or P/n) in the header. When the tail flit leaves the router, its Vtick (or P/n) information in the router is discarded. We name the modified algorithms Fine-Grained VirtualClock (FGVC) and Fine-Grained Fair Queueing (FGFQ), respectively, since bandwidth reservation is done at the message-level granularity.

In a router implementation with a multiplexed crossbar, contention for link bandwidth can occur at one of three places: the crossbar input multiplexer for the crossbar input port, within the crossbar for the crossbar output port, and at the virtual channel multiplexer for the output physical channel. These are marked as (a), (b), and (c), respectively, in Figure 3.2. All of these places are potential candidates where rate-based bandwidth allocation can be performed. We rule out contention points (b) and (c) for the following reasons. In case (b), crossbar output port arbitration is performed at message-level granularity, whereas we are interested in flit-level bandwidth allocation. Case (c), corresponding to the VC multiplexer, is not a strong candidate either. This is due to the fact that at most one of the VCs of an output PC can receive a flit from the multiplexed crossbar per router cycle; when only one of the VCs has a flit in any given cycle, the scheduling algorithm essentially behaves as a FIFO scheduler. Hence, we have chosen to implement the rate-based scheduler at the crossbar input multiplexer (a), which means that, in any given cycle, if multiple flits from different VCs are competing for the same output port of the crossbar, the one with the smallest auxVC_i will be chosen as the winner.
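At contention point (a), the per-cycle decision therefore reduces to picking the competing VC with the smallest auxVC timestamp. A sketch of that arbitration step, with an ad hoc tuple representation for the candidates:

```python
def select_winner(candidates):
    """candidates: list of (vc_id, aux_vc, flit) tuples whose head flits compete
    for the same crossbar input port this cycle.
    Returns the candidate whose auxVC timestamp is smallest (FGVC arbitration)."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])

# Three VCs compete; the one with the smallest auxVC (VC 2) wins this cycle.
print(select_winner([(0, 120.0, "flit-a"), (2, 96.5, "flit-b"), (5, 240.0, "flit-c")]))
```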

In a router that implements a full crossbar, there is no crossbar input multiplexer (nor a demultiplexer at the crossbar output). Thus, the only contention points are for the crossbar output ports (at the time of arbitration) and in the VC multiplexer. In such a router, the rate-based algorithm is implemented at the VC multiplexer (case (c)).

In order to select between the FGVC and FGFQ schemes for the rest of the design, we conducted a performance analysis. We simulated both schemes in a router and injected media traffic and best-effort traffic. We measured the inter-frame delivery time and the standard deviation of the delivery time for the media streams. The results with different input loads to the router are quite similar in both cases, as depicted in Table 3.2. However, since the implementation of FGFQ is more complex because of the need to maintain the round robin number (R), we use FGVC in the rest of our design. In order to avoid traffic burstiness, we regulate the traffic injection as described in the workload generation section. Figure 3.3 shows the final architecture of the MediaWorm router with a multiplexed crossbar and the FGVC scheduling algorithm.

Table 3.2. Comparison of Fine-Grained VirtualClock with Fine-Grained Fair Queueing algorithms when the ratio of real-time to best-effort traffic is 80:20. Inter-frame time is the averaged time difference of frames measured at the destinations, and SD is the standard deviation of the inter-frame time. (Columns: input load; inter-frame time (msec)/SD under FGVC; inter-frame time (msec)/SD under FGFQ. At 60% load the measured inter-frame times are 33.12 msec and 32.74 msec, respectively.)

Figure 3.3. MediaWorm Router Architecture (an n × n multiplexed crossbar switch core with C VCs per physical channel, an FGVC scheduler at each crossbar input, and routing decision, arbitration, and crossbar control units across pipeline Stages 1–5; middle/tail flits bypass the routing decision and arbitration stages)

Interconnection Topologies: Fat Networks

Cluster interconnects are typically built with high-degree switches. Myrinet [5] has 8- and 16-port routers, while ServerNet-II [24] routers have 12 ports. These ports may be used to connect to other switches as well as to endpoints. The endpoints may be compute nodes, such as clients and servers, or I/O devices. The difference between such cluster networks and typical multiprocessor interconnects is that, while multiple endpoints per switch may be common in the former, the latter typically has only one endpoint per switch. Depending on the expected traffic pattern, it is likely that multiple endpoints place a higher inter-switch bandwidth requirement on cluster interconnects.

For this reason, "fat" topologies have been proposed for clusters. Examples of fat topologies include the fat-tree and the fat-mesh [18]. Other cluster interconnects, such as the tetrahedral topologies proposed by Horst [28], can also use "fat" links. Routers such as the ServerNet-II [24] include hardware support for using multiple physical links connecting a pair of switches indistinguishably, through the notion of "fat pipes".

Figure 3.4. A 4-switch fat-mesh interconnect. Each switch (S0–S3) is an (8 × 8) switch. Each fat link comprises 2 physical links.

Although most of the studies reported here detail the performance of a single switch, we also experimentally analyze the performance of a fat mesh. The fat mesh used here is a (2 × 2) topology with 8-port crossbar switches. (We have limited our study to a smaller network due to exceedingly high simulation times; one can design a larger router and a larger network using our model.) Two physical links are used to interconnect each pair of switches in the 4-node mesh. Figure 3.4 illustrates the studied interconnect. We use deterministic routing, and a message can use either of the two links to traverse to the next node, based on the current load.
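With two physical links per fat link and deterministic routing fixing the next switch, the per-message link choice "based on the current load" can be as simple as the following sketch; using the number of queued flits at each output as the load metric is an assumption of this illustration.

```python
def pick_physical_link(queued_flits_link0: int, queued_flits_link1: int) -> int:
    """Deterministic routing already chose the next switch; this only picks which
    of the two physical links of the fat link to use, preferring the lighter one."""
    return 0 if queued_flits_link0 <= queued_flits_link1 else 1

print(pick_physical_link(12, 7))  # -> 1, the less loaded of the two links
```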

Pipelined Circuit Switching

Pipelined Circuit Switching (PCS) [25] is a variant of wormhole switching in which a message is similarly segmented into flits. However, unlike wormhole routing, in which the middle and tail flits immediately follow the header as it progresses towards its destination, in PCS the flits of a message wait until the header (or probe) reserves the complete path up to the destination. Once such a path/connection has been established, an acknowledgment is sent from the destination to the source. The rest of the flits then move along this path in a pipelined manner (similar to wormhole switching). During path establishment, if the header cannot progress towards the destination, it can backtrack and try alternative paths if adaptive routing is used. If no path can be established, or if adaptive routing is not permitted, a negative acknowledgment is sent back and the attempted connection is dropped. In this research, we do not assume any adaptive routing capability. PCS as originally proposed in [25] assumed non-minimal and adaptive routing capabilities with backtracking and re-routing, which leads to low connection dropping rates. Due to the requirement of complete path setup before the transmission of flits, PCS may incur a high path setup cost compared to wormhole switching. However, it can potentially provide better bandwidth reservation, which is advantageous for real-time traffic. Our intention is to evaluate these trade-offs by comparing PCS with wormhole switching.

3.2 A Preemptive Router Design

In the previous design, we statically divide the input and output VCs among the traffic classes. Traffic of class c can only use the VCs assigned to it. The VC assignment is done at system configuration time and cannot be changed during execution. Therefore, the non-preemptive model (MediaWorm) is not flexible. A solution to this problem is to develop a preemptive model, where several classes of traffic with different priorities can share the same VC, with the provision that a higher priority message can preempt a lower priority message. The preemptive model can dynamically allocate any VC to any traffic class, and hence it is more suitable for handling fluctuating workloads.

In this section, we propose a new wormhole router architecture with flit-level preemption capability. Based on the MediaWorm architecture in Figure 3.3, we propose two mechanisms that enable higher priority messages to preempt lower priority messages in a pipelined wormhole router: input buffer preemption and flit acceleration.

Preemption in the Input Buffer

The additional hardware required for preemption in any input buffer (VC) is an extra buffer of size (s − 1), where s is the total number of priority levels, and a history stack of the same size. The extra input buffer is used only for diverting higher priority messages when the regular VC is occupied by a lower priority message.

Figure 3.5. Input Buffer Preemption in the Router (a higher priority message m3, diverted into the extra buffer, preempts a lower priority message m1 holding the regular input buffer; the routing information of m1 is saved in the history stack)

If the input buffer is occupied by a higher priority message, a lower priority message is not allowed to use the extra buffer, and it is blocked behind the higher priority message. On the other hand, if the input buffer is used by a lower priority message, a higher priority message is sent to the extra buffer so that it can subsequently preempt the lower priority message in Stage 1 of the router. Similar to [65], the routing information of a lower priority message is stored in a history stack for forwarding it later.

In Stage 1, when the extra buffer holds a header flit from a higher priority message, the input buffer preemption process begins. The router first checks whether the tail flit of the lower priority message has passed through the Stage 1 decoder. If not, a dummy tail flit is created for the preempted message. A dummy tail flit does not carry any payload, but behaves as a regular tail flit to release all the resources held by the lower priority message; otherwise, those resources would remain locked and could not be used by any other message.

For example, in Figure 3.5, when the higher priority message m3 interrupts the delivery of the lower priority message m1, the dummy tail of m1 is generated. Then the routing information of m1 is stored in the history stack, to be used later for making a dummy header for the retransmission of m1. Thus, no dummy header is required in case no dummy tail was sent. During preemption, the remaining flits of m1 and any other lower priority messages are blocked in the input buffer. Next, all the flits of m3 in the extra buffer are sent through the router. After that, once the extra buffer is empty, transmission of the remaining flits of m1 resumes from the regular input buffer.
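A control-level sketch of the preemption logic just described follows; the buffers and history stack are modelled as plain lists, and the method names and structure are illustrative rather than the actual hardware design.

```python
class PreemptiveInputVC:
    """Illustrative Stage 1 control logic for flit-level input buffer preemption."""

    def __init__(self):
        self.regular_buffer = []   # flits of the message currently using this VC
        self.extra_buffer = []     # diverted flits of a higher priority message
        self.history_stack = []    # routing info of preempted messages

    def accept_flit(self, flit, current_priority, incoming_priority):
        """Divert higher priority traffic to the extra buffer; block lower priority."""
        if incoming_priority > current_priority:
            self.extra_buffer.append(flit)     # will preempt in Stage 1
        else:
            self.regular_buffer.append(flit)   # waits behind the current message

    def start_preemption(self, preempted_route, tail_already_decoded):
        """Begin preemption once a header flit sits in the extra buffer."""
        if not tail_already_decoded:
            # Without a tail, downstream resources would stay locked, so send a
            # payload-free dummy tail to release them ...
            self.emit_dummy_tail()
            # ... and remember the route so a dummy header can resume the
            # preempted message after the higher priority flits have drained.
            self.history_stack.append(preempted_route)
        # Flits in the extra buffer are forwarded next; the remaining flits of the
        # preempted message stay blocked in the regular buffer until it empties.

    def emit_dummy_tail(self):
        # Placeholder for sending a dummy tail flit through the pipe.
        pass
```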

3.2.2 A Flit Acceleration Mechanism

Fig. 3.6. Flit Acceleration for Message m1.

When the input buffer preemption starts, there could be remaining flits of m1 between the flit decoder buffer and the input port of the crossbar,¹ as shown in Figure 3.6. In addition, when the header flit of m3 tries to reserve the output VC, the VC may already be occupied by another lower priority message, like m2 in Figure 3.7. In both of these cases, the flits of the lower priority messages (m1, m2) will slow down the processing of m3 until they are pushed to the output buffer. Therefore, we use a flit acceleration mechanism that expedites the delivery of flits of such lower priority messages (like m1 and m2) by assigning a specific low virtual clock value to them. This value guarantees that these messages will be selected first at the next cycle of the scheduler unless there are other preempted messages at other VCs (in which case they are selected in a round-robin fashion). For this purpose, a flag called Accelerate is associated with each input VC. The Accelerate flag remains set until the tail flit of the preempted message (m1) or expedited message (m2) passes through the crossbar. In Figure 3.6, the transmission of the lower priority message m1 is accelerated by setting the flag of the input channel m1 is using, to speed up the processing of the higher priority message m3 from the extra input buffer to the input port of the crossbar. In Figure 3.7, the header flit of m3 arrives at Stage 3, and the destination output VC is used by another lower priority message m2. Again, by setting the flag of the input channel m2 is using, the transmission of m2 is accelerated, and m3 can reserve the destination.

¹ At most there could be 3 such flits: a header flit and a middle flit of m1 at two different stages, and a tail flit of another message at the crossbar input.
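To make the role of the Accelerate flag concrete, the following fragment shows one way a flit-level VC scheduler could fold acceleration into its selection rule: accelerated VCs are served first, round-robin among themselves, and all other ready VCs are ordered by their virtual clock values. It reuses the hypothetical per-VC state from the previous sketch and is not the FGVC scheduler implementation itself.

#include <vector>
#include <limits>
#include <cstddef>

// Minimal per-VC state assumed for this sketch (hypothetical names).
struct VCState {
    bool   has_flit      = false;   // a flit is waiting at the crossbar input
    bool   accelerate    = false;   // set while a preempted/expedited message drains
    double virtual_clock = 0.0;     // per-VC virtual clock maintained by the scheduler
};

// Pick the input VC whose flit crosses the crossbar this cycle.
// Accelerated VCs win first, in round-robin order starting after 'last_pick';
// otherwise the VC with the smallest virtual clock is chosen.
int pick_vc(const std::vector<VCState>& vcs, std::size_t last_pick) {
    const std::size_t n = vcs.size();

    // 1. Round-robin among accelerated VCs that have a flit ready.
    for (std::size_t i = 1; i <= n; ++i) {
        std::size_t v = (last_pick + i) % n;
        if (vcs[v].has_flit && vcs[v].accelerate) return static_cast<int>(v);
    }

    // 2. Otherwise, rate-based selection: the smallest virtual clock goes first.
    int best = -1;
    double best_clock = std::numeric_limits<double>::infinity();
    for (std::size_t v = 0; v < n; ++v) {
        if (vcs[v].has_flit && vcs[v].virtual_clock < best_clock) {
            best_clock = vcs[v].virtual_clock;
            best = static_cast<int>(v);
        }
    }
    return best;   // -1 means no flit is ready
}

Assigning accelerated flits an artificially low virtual clock value, as described in the text, is equivalent to this two-step rule as long as the reserved value is smaller than any value a regular stream can hold.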

The other option for handling blocking at other stages is to use the preemption mechanism there as well. The acceleration mechanism, however, is much simpler and easier to control than providing a separate preemption path at such stages.

Fig. 3.7. Flit Acceleration for Message m2.

3.3 Experimental Platform

3.3.1 Simulation Testbed

The above architectural concepts have been extensively evaluated through simulation. We have developed the MediaWorm router (MR), the preemptive router (P), a traditional router with FIFO (TR), and a PCS router (PCS) using CSIM.

The simulation models are quite flexible in the sense that one can specify the number of physical channels (PCs), the number of VCs per PC, the link bandwidth, the CBR/VBR rates and the variation of the VBR rate, the flit size, the message size (number of flits), and the ratio of real-time traffic (VBR and CBR) to best-effort traffic. In addition, using these routers, one can configure any network topology. We have developed detailed flit-level simulators in which each stage of the router pipeline is modeled, together with several simultaneous streams established from each node in the system. Typically, we gather simulation results over a few million messages. As a result, these simulations are extremely resource intensive, both in terms of simulation time and memory requirements. Two factors that determine simulation resources are the crossbar size and the physical channel bandwidth. Consequently, even though current technologies permit large crossbar sizes and over 1.28 Gbps link bandwidths, many of our simulations use smaller values for these parameters, without loss of generality, to keep them tractable. We have also conducted some experiments varying these parameters, and the overall trends/results still apply. The input and output buffers in the router are each one message long. We have also tested with larger buffers; the results show little improvement.

The output parameters analyzed here are the mean frame delivery interval (d̄) for CBR/VBR messages, the standard deviation of frame delivery intervals (σ_d) for CBR/VBR messages, the deadline missing probability of delivered MPEG-2 frames, the average deadline missing time of deadline-missing frames, and the average latency for best-effort traffic. The delivery interval is measured as the difference between the delivery times of two successive frames at a destination.

A d̄ of 33 msec indicates a frame rate of 30 frames/sec at MPEG rates. Coupled with a σ_d of 0, this implies jitter-free delivery. A higher d̄ and/or σ_d implies jitter in transmission. The deadline missing probability is the ratio of the number of frames that missed their deadlines to the total number of delivered frames. The deadline for each frame is determined by adding 33.3 msec to the previous deadline, since the frame rate is 30 frames/sec for MPEG-2 video streams. However, if a previous frame missed its deadline, the new deadline is set by adding 33.3 msec to the arrival time of the previous frame. Whenever a frame misses its deadline, we measure the deadline missing time and then calculate the average deadline missing time.
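These delivery-interval and deadline statistics can be collected on line as frames arrive. The following is a small sketch of such a collector, written against the rules stated above (33.3 msec frame period, deadlines reset from the arrival time of a late frame); the class and member names are illustrative, and the seeding of the very first deadline from the first delivery is our assumption, not a rule taken from the thesis simulator.

#include <cmath>
#include <cstdio>

// Per-stream collector for delivery-interval and deadline statistics,
// following the measurement rules described in the text.
class FrameStats {
    static constexpr double kPeriodMs = 33.3;   // 30 frames/sec MPEG-2 period
    double prev_delivery_ = -1.0;               // delivery time of the previous frame (ms)
    double deadline_      = -1.0;               // deadline of the next expected frame (ms)
    long   frames_ = 0, missed_ = 0;
    double sum_ = 0.0, sum_sq_ = 0.0;           // for mean/std of delivery intervals
    double miss_time_sum_ = 0.0;                // accumulated lateness of missed frames

public:
    void on_frame_delivered(double t_ms) {
        if (prev_delivery_ >= 0.0) {            // a delivery interval needs two frames
            double interval = t_ms - prev_delivery_;
            sum_ += interval;
            sum_sq_ += interval * interval;
        }
        if (deadline_ < 0.0) {
            deadline_ = t_ms + kPeriodMs;       // first frame seeds the deadline chain (assumption)
        } else if (t_ms <= deadline_) {
            deadline_ += kPeriodMs;             // met: the next deadline is 33.3 ms later
        } else {
            ++missed_;
            miss_time_sum_ += t_ms - deadline_; // deadline missing time
            deadline_ = t_ms + kPeriodMs;       // reset from the arrival of the late frame
        }
        prev_delivery_ = t_ms;
        ++frames_;
    }

    void report() const {
        long intervals = frames_ > 1 ? frames_ - 1 : 0;
        double mean = intervals ? sum_ / intervals : 0.0;
        double var  = intervals ? sum_sq_ / intervals - mean * mean : 0.0;
        std::printf("mean interval %.2f ms, std %.2f ms, miss prob %.3f, avg miss time %.2f ms\n",
                    mean, std::sqrt(var > 0 ? var : 0.0),
                    frames_ ? double(missed_) / frames_ : 0.0,
                    missed_ ? miss_time_sum_ / missed_ : 0.0);
    }
};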

3.3.2 Workload

Two kinds of VBR traffic are simulated in the experiments. The first is synthetic video streams with an average bandwidth of 4 Mbps, and the other is realistic MPEG-2 video streams. The synthetic traffic consists of streams of messages from video frames, whose size is selected from a normal distribution with a mean of 16,666 bytes and a standard deviation of 3333 bytes. (This corresponds to 4 Mbps MPEG-2 streams.) The realistic traffic is generated from the MPEG-2 traces [7] shown in Table 3.3, where there are 7 video traces with different bandwidth requirements. Each stream generates 30 frames/sec, and each frame (I, P, or B frame) is fragmented into 20/40-flit messages (except possibly the last message of a frame), with each message carrying the bandwidth requirements (Vtick information for the FGVC algorithm) and the routing information in its header flit. As a result, the network treats each message of a stream independently of the others.

Table 3.3. MPEG-2 Video Sequence Statistics (for each of the 7 video sequences: average bandwidth requirement (kb/s) and average I, P, and B frame sizes (kbits)).

The injection rate for the messages of a stream is determined by the message size and the number of messages constituting a frame. Once the injection rate is determined, an input regulator injects the messages of a frame evenly, with an interval of (33 msec / number of messages). For instance, with 200 messages in a frame, the interval between successive message injections is 165 microseconds. Such an input regulator provides two advantages. First, in addition to avoiding traffic burstiness, the input regulator allows messages from different streams to be intermixed in the queues. Without this ability, the streams would be queued only at frame-level granularity, thereby increasing the delay of certain streams. Second, the input regulator also helps the transmission of best-effort traffic in between video frame messages.
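The synthetic VBR source and input regulator are straightforward to express in code. The sketch below draws a frame size from the stated normal distribution (mean 16,666 bytes, standard deviation 3333 bytes), fragments the frame into fixed-size messages, and spaces the injections evenly across the 33 msec frame period. The function name and the 80-byte example message (roughly a 20-flit message with 32-bit flits) are illustrative choices, not parameters taken verbatim from the simulator.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// One injected message: its stream-relative injection time and size in bytes.
struct Injection {
    double time_ms;
    int    bytes;
};

// Generate the injections for a single synthetic VBR frame starting at 'frame_start_ms'.
// Frame size ~ N(16666, 3333) bytes; the frame is split into fixed-size messages and
// the input regulator spreads them evenly over the 33 ms frame period.
std::vector<Injection> regulate_frame(std::mt19937& rng, double frame_start_ms,
                                      int message_bytes) {
    std::normal_distribution<double> frame_size(16666.0, 3333.0);
    int bytes = std::max(message_bytes, static_cast<int>(frame_size(rng)));
    int num_messages = (bytes + message_bytes - 1) / message_bytes;   // last message may be short

    const double period_ms = 33.0;
    double gap_ms = period_ms / num_messages;     // e.g., 200 messages -> 165 microseconds

    std::vector<Injection> out;
    for (int i = 0; i < num_messages; ++i) {
        int sz = std::min(message_bytes, bytes - i * message_bytes);
        out.push_back({frame_start_ms + i * gap_ms, sz});
    }
    return out;
}

int main() {
    std::mt19937 rng(42);
    // Assume 80-byte messages (e.g., a 20-flit message with 32-bit flits).
    auto frame = regulate_frame(rng, 0.0, 80);
    std::printf("%zu messages, inter-injection gap %.1f microseconds\n",
                frame.size(), (33.0 / frame.size()) * 1000.0);
}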

In the case of PCS, each stream is transmitted over a distinct connection (as it is connection-oriented). The first flit of the stream establishes the circuit between the source and destination endpoints, simultaneously informing the intermediate switches of its bandwidth requirement (the required Vtick for the entire stream). The frames of the stream are logically grouped into flits, with each group injected into the established circuit at a specified rate (similar to how messages are generated in the wormhole switching case). In PCS, each connection (and hence each stream) also needs a distinct VC. Therefore, the number of VCs supported by the hardware has to be greater than or equal to the maximum number of concurrent streams in the workload. In the MediaWorm, each message carries routing and bandwidth information. As a result, it is possible to support multiple connections on a single VC. This makes sense only when the bandwidth available to a VC is at least as large as the sum of the bandwidths of the streams assigned to that VC. This is, however, not a problem because each message carries its Vtick requirement.

It should be noted that stream establishment never actually fails in wormhole switching. In PCS, on the other hand, a connection establishment probe may not succeed. This is termed dropping of a connection. It is assumed that connections may be dropped only at stream set-up. Once the input VC for a connection is determined, the destination is picked randomly using a uniform distribution over all nodes, and the destination VC is also drawn randomly from a uniform distribution over the VCs available for VBR traffic.

The generation of the CBR traffic is identical to that of the synthetic VBR traffic, with the exception that the frame size is kept constant (at 16,666 bytes). The best-effort traffic is generated with a given injection rate, λ, that is allocated to this class of traffic (explained in the next subsection), and follows a Poisson distribution.

The message length is kept constant at 20/40 flits, matching the message length of the real-time traffic, and its destination is picked from a uniform distribution over the nodes in the system. The input and output VCs for a message are picked from a uniform distribution over the VCs available to this traffic class.

An important parameter that is varied in our experiments is the input load, expressed as a fraction of the physical link bandwidth. For a specified load, we consider different mixes (x : y, where x/(x + y) is the fraction of the load for the VBR/CBR component and y/(x + y) is the fraction of the load for the best-effort component) to generate mixed traffic. We divide the VCs into two disjoint groups: x/(x + y) of the VCs are reserved for the VBR/CBR traffic, and the remaining VCs are allocated to the best-effort traffic. As mentioned earlier, the number of simultaneous VBR/CBR streams possible from/to a node is limited by the number of VCs in the case of PCS. In the MediaWorm, it is limited by the number of VCs and the bandwidth allocated to a VC. For instance, if a physical channel can support 400 Mbps and the total number of VCs is 16, then we can support at most 6 connections per VC with synthetic VBR streams. If x = y = 1, then the number of VCs dedicated to VBR/CBR traffic is 8, and there can be at most 6 × 8 = 48 outstanding/incoming streams at each node in the system.
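The stream-capacity arithmetic in the preceding paragraph can be captured in a few lines. The helper below reproduces the 400 Mbps / 16 VC / 4 Mbps example (6 streams per VC, 48 streams per node for an even x:y split); the function name and the assumption that the channel bandwidth is divided evenly across VCs are ours, introduced only to make the calculation explicit.

#include <cstdio>

// Maximum simultaneous VBR/CBR streams a node can source in the MediaWorm setup,
// assuming the physical-channel bandwidth is shared evenly among all VCs and a
// fraction x/(x+y) of the VCs is reserved for real-time traffic.
int max_streams_per_node(double pc_bandwidth_mbps, int total_vcs,
                         double stream_mbps, int x, int y) {
    double per_vc_bandwidth = pc_bandwidth_mbps / total_vcs;               // e.g., 400/16 = 25 Mbps
    int streams_per_vc = static_cast<int>(per_vc_bandwidth / stream_mbps); // floor: 25/4 -> 6
    int realtime_vcs = total_vcs * x / (x + y);                            // e.g., 16 * 1/2 = 8
    return streams_per_vc * realtime_vcs;                                  // 6 * 8 = 48
}

int main() {
    std::printf("%d streams per node\n", max_streams_per_node(400.0, 16, 4.0, 1, 1));
}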

3.4 Performance Results

In this section, we first analyze the performance results for an 8-port MediaWorm router with varying parameters, as well as those of a (2 × 2) fat mesh. Then we compare the MediaWorm and preemptive router designs. The router parameters used in this performance study are given in Table 3.4.

Table 3.4. Simulation Parameters
  Switch Size: 8 × 8
  Flit Size: 32 / 128 bits
  Message Size: 20 / 40 flits
  Flit Buffers: 20 / 40 flits
  PC Bandwidth: 400 Mbps / 1.6 Gbps
  VCs/PC: variable (wormhole), 24 (PCS)
  Streams/VC: variable (wormhole), 1 (PCS)

3.4.1 Comparison of MediaWorm and Traditional Routers

We begin by examining how a traditional router (TR) and the MediaWorm router (MR) perform with multimedia/mixed traffic. Note that the main difference between the two routers is the scheduling algorithm: the TR uses a FIFO scheduler, whereas the proposed MR uses the FGVC algorithm. Figure 3.8 shows the mean delivery interval (d̄) and its standard deviation (σ_d) for the two routers with a mixture of synthetic VBR and best-effort traffic (80:20). We can see that for the TR, both d̄ and σ_d start growing beyond a load of 0.8, showing that there would be significant jitter in the delivery of VBR traffic beyond this point. In contrast, the MR can provide jitter-free delivery even up to a link load of 0.96 (the load of the real-time component is around 0.75). This clearly shows the need for a rate-based scheduling algorithm to effectively administer the available bandwidth for media streams.
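The rate-based scheduling that separates the MR from the TR follows the VirtualClock idea: each stream advances a per-stream clock by its Vtick (a per-stream increment derived from its reserved rate) on every injection, and the scheduler always serves the smallest clock value. The fragment below is a generic VirtualClock tagger along those lines, not the FGVC implementation from the thesis; the names, parameter values, and tie-breaking are illustrative.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Generic VirtualClock tagging and service order (one level of the idea behind FGVC).
struct Stream {
    double vtick;          // time per unit of traffic at the reserved rate
    double virtual_clock;  // advances by vtick on every tagged unit
};

struct Tagged {
    double stamp;          // virtual clock value assigned to this flit/message
    int    stream_id;
    bool operator>(const Tagged& o) const { return stamp > o.stamp; }
};

// Tag a newly arrived unit of stream s at real time 'now' and return its stamp.
double tag(Stream& s, double now) {
    s.virtual_clock = std::max(s.virtual_clock, now) + s.vtick;
    return s.virtual_clock;
}

int main() {
    std::vector<Stream> streams = {{0.25, 0.0}, {1.0, 0.0}};   // video-like vs. best-effort-like
    std::priority_queue<Tagged, std::vector<Tagged>, std::greater<Tagged>> ready;

    for (double now = 0.0; now < 3.0; now += 1.0)       // both streams inject at t = 0, 1, 2
        for (int id = 0; id < 2; ++id)
            ready.push({tag(streams[id], now), id});

    while (!ready.empty()) {                            // drain in increasing stamp order
        std::printf("serve stream %d (stamp %.2f)\n", ready.top().stream_id, ready.top().stamp);
        ready.pop();
    }
}

With a small Vtick, the video-like stream's stamps stay ahead of the best-effort stream's, so its units are consistently served first; this is the behavior the MR relies on for jitter-free delivery of media streams.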

3.4.2 Comparison of CBR and VBR Traffic Results

Figure 3.9 depicts the d̄ and σ_d results with only CBR and only synthetic VBR traffic (there is no best-effort traffic). It can be gleaned that both exhibit nearly identical performance, with the CBR traffic experiencing jitter-free performance up to a slightly higher load. Although both CBR and VBR streams have the same mean bandwidth requirement, CBR streams, by their nature, are intuitively expected to experience better jitter tolerance. Since VBR streams present the more challenging workload, we focus on VBR streams in the rest of the studies in this thesis.

3.4.3 Results with Mixed Traffic

Next, we vary the ratio of real-time (only synthetic VBR) and best-effort traffic for different input loads, and study the effect on jitter for VBR and on average latency for the best-effort traffic. Figure 3.10 shows the variation of d̄ and σ_d for these workloads. It can be observed that up to an input load of 0.80, there is no jitter for VBR traffic regardless of the mix between these two traffic classes. Beyond a load of 0.80, the jitter becomes significant only when the real-time traffic becomes the dominant component. The effect of VBR traffic on the average latency of best-effort traffic (in microseconds) is given in Table 3.5. For a given mix, the latency degrades with an increase in the load. The presence of real-time traffic also increases the latency of the best-effort traffic at a given load. This is a consequence of the higher priority given by the FGVC algorithm to the real-time traffic.

Table 3.5. Average Latency for Best-effort Traffic (8 × 8 switch, 16 VCs, 400 Mbps links; entries in microseconds for different input loads and traffic mixes x:y, with "Sat." denoting saturation).

3.4.4 Impact of VCs and Crossbar Capabilities

It should be noted that our workload generates multiple connections on each available VC. An important design consideration is to determine whether one should support more VCs with fewer connections per VC, or vice versa. Intuitively, it may appear that a larger number of VCs would improve performance. The performance results in Figure 3.11 confirm this intuition, where the 16 VC case gives jitter-free performance up to a higher load compared to the 4 and 8 VC cases. However, supporting a large number of VCs may require a large amount of resources in the router. A lower number of VCs, on the other hand, allows the use of a full crossbar (instead of a multiplexed one). This is examined for the 4 VC case (i.e., a 32 × 32 crossbar), which shows better performance than 8 VCs with the multiplexed crossbar, and competitive performance compared to the 16 VC results.

3.4.5 Effect of Message Size on Jitter

Our next experiment examines the impact of message size on synthetic VBR traffic. We vary the message size for two different input loads (0.64 and 0.8) that are representative of the behavior observed earlier, and examine the changes in d̄ and σ_d.

The results in Figure 3.12 show that, except for very small message sizes, there is little impact on QoS for real-time traffic. For very small sizes, the effect of the header flit overhead becomes noticeable. For instance, 1 header flit in a message of 20 flits consumes 5% of the stream bandwidth. These results show that we do not really need large messages for media traffic. In fact, smaller sizes may help the latency of best-effort traffic.

3.4.6 Comparison of MediaWorm and PCS Routers

PCS is expected to provide good performance for VBR traffic. This is because it is a connection-oriented switching paradigm and can therefore reserve bandwidth at the time of connection establishment. However, it requires a VC per stream, thereby mandating a large number of VCs per PC at high link bandwidths. In this experiment, we compare the performance of the MediaWorm router to that of the PCS router (PCS). Note that this is the only experiment that we perform for a link bandwidth of 100 Mbps (24-25 VBR streams can be supported per link, each with a 4 Mbps bandwidth requirement). This is primarily because of the simulation complexity of supporting the large number of VCs (up to 100 VCs) that would be required for a 400 Mbps link bandwidth in the PCS router. As can be expected, the MediaWorm router can support jitter-free performance only up to a load of about 0.7, compared to over 0.8 in the case of PCS. This is, however, not a fair comparison because all streams started on a wormhole router are accepted, whereas the PCS router drops many connections that contend for busy resources.

Table 3.6. Number of attempted, established and dropped connections for reaching a given input load in a PCS router (columns: input load, number of connection attempts, number established, number dropped). The values presented are for an (8 × 8) router with 24 VCs and 100 Mbps links.

For the same operating load, this in effect unfairly improves the crossbar utilization for accepted connections in the PCS router compared to that in the MediaWorm router. While the PCS router provides superior performance, this comes at the cost of high resource requirements (a large number of VCs) as well as a very high number of dropped connections. The numbers of accepted and dropped connections for various input loads in the PCS router are shown in Table 3.6. These results show that for most realistic operating conditions (an input load of 0.7 is reasonably high), the MediaWorm router can deliver performance as good (jitter-free) as a PCS router for real-time traffic, while not turning down connection establishment requests as the PCS router does. (The connection drop rate can be minimized by using several alternatives, as proposed in [25].) Moreover, by increasing the number of VCs in the MediaWorm router to match the PCS implementation, its performance could approach that of the PCS router at higher loads.

3.4.7 Results with MPEG-2 Video Traces

Here we examine the performance of a traditional router and the MediaWorm router with the realistic MPEG-2 video traffic shown in Table 3.3. Figure 3.14 shows the mean delivery interval (d̄) and its standard deviation (σ_d) for each router model. Some of the data points of the TR were dropped due to saturation. The results with realistic VBR are almost identical to those with synthetic VBR in Figure 3.8, although d̄ of the TR in Figure 3.14 is slightly better at 90% load. Next, we vary the ratio of real-time (MPEG-2 video) and best-effort traffic for different input loads, and study the effect on jitter for VBR. Figure 3.15 shows the variation of d̄ and σ_d for these workloads. Again, we observe results similar to those in Figure 3.10. Although the video traces in Table 3.3 have much wider bandwidth variation, the overall results with synthetic and actual traces are very similar.

3.4.8 Fat-Mesh Results

Up to this point, we have focused on the performance of a single router with CBR/VBR and best-effort traffic. In this subsection, we examine the performance implications of using such routers in a fat-mesh interconnect. In general, it can be expected that an interconnect with multiple routers will have lower performance than a single router, due to the additional points of resource contention in the network. We limit this study to a modest 4-node network (shown in Figure 3.4) due to limited simulation resources. Figures 3.16 (a) and (b) show the change in mean delivery interval and the corresponding standard deviation for synthetic VBR traffic. This is studied with both increasing load and increasing proportion of VBR traffic.

The results indicate that VBR performance remains good for smaller proportions of VBR traffic (40% and 60%), even for a total input load of 0.9 of the PC bandwidth capacity. Only at a load of 0.9 with 80% of the traffic being VBR does VBR performance degrade. This good performance for VBR comes at the expense of best-effort traffic, as shown in Figure 3.16 (c). As expected, for any given load, the average latency of best-effort traffic increases with an increasing proportion of VBR traffic.

It is also illustrative to compare the performance of a (2 × 2) fat mesh to that of a single switch. As expected, the maximum input load (for a given proportion of VBR to best-effort traffic) that provides jitter-free performance for VBR traffic is lower in the fat mesh than in the case of a single switch. This can be inferred by comparing Figures 3.10 (a) and (b) with Figures 3.16 (a) and (b). For example, with a load of 0.9 and a traffic mix of 80:20, we can observe that a single switch is able to provide jitter-free performance, while the fat mesh cannot. Admission control criteria thus have to consider (for an expected traffic pattern) the maximum load and proportion of VBR to best-effort traffic that will provide statistically acceptable QoS to VBR traffic as well as acceptable latency for best-effort traffic. This load would then determine the number of VBR streams that may be accepted for service.

3.4.9 Comparisons of the Three Router Models

In this subsection, we examine the performance of a traditional router, a non-preemptive router, and a preemptive router under dynamic workloads. Figure 3.17 shows the deadline missing probability and the average deadline missing time in a single router for each model.

Some of the data points of the traditional router were dropped due to saturation. It is seen that the preemptive router can service real-time traffic with an almost constant deadline missing probability, while for the non-preemptive router (MediaWorm), the number of frames missing their deadlines increases as the ratio of real-time traffic increases. The deadline missing time in Figure 3.17 (b) is the lowest for the preemptive router. The traditional router, without a rate-based scheduler, experiences saturation even under light load and is the worst performer. Since the preemptive router can assign VCs dynamically according to the real-time traffic load, it provides the best performance among the three architectures.

Another important performance parameter is the end-to-end latency, which we can measure for each traffic type. Figure 3.18 (a) shows the control traffic latency in each router. Here, queueing time represents the time spent outside the router before a message is injected into the router. In the traditional router, control traffic is treated like any other type of traffic, and hence its latency is much higher than in the other two routers. The preemptive router provides the best performance, with almost zero queueing time, followed by the non-preemptive router. Figure 3.18 (b) compares the best-effort traffic latency in the three routers. The non-preemptive (MediaWorm) and preemptive routers provide better service for real-time traffic at the expense of best-effort traffic. Therefore, as expected, the traditional router provides the best performance for best-effort traffic.

Next, we examined the impact of block-level multiplexing in a single preemptive router. Figure 3.19 shows the results for block sizes of 1, 5, and 10 flits, respectively.

As the block size increases, the performance degrades significantly. Thus, flit-level multiplexing seems to be the ideal choice for QoS assurance. However, in an actual implementation, we may have to use block-level multiplexing to amortize the scheduling overhead.

In order to estimate the contribution of the acceleration scheme explained in Section 3.2 to the preemptive router, we tested the router with and without the acceleration scheme. Figure 3.20 demonstrates the role of the acceleration scheme in the preemptive router. The results indicate that by accelerating the flits of the lower priority messages, the performance of both real-time and best-effort traffic improves considerably.

3.4.10 A (2 × 2) Mesh Network Results

In this section, we examine the performance implications of using the preemptive and non-preemptive routers in a (2 × 2) mesh network. We use a mixture of real-time and best-effort traffic. Figure 3.21 (a) shows the deadline missing probability for real-time traffic, and Figure 3.21 (b) depicts the average network latency for best-effort traffic. As in the single router results, the preemptive model again exhibits better performance than the non-preemptive model. The deadline missing probability increases with an increase in the real-time load. Also, as expected, the average network latency of the best-effort traffic in Figure 3.21 (b) gradually increases with the real-time traffic.

3.5 Concluding Remarks

Widespread use of cluster systems in diverse application environments is placing varied communication demands on their interconnects. Commercial routers for these environments currently support wormhole switching. Although wormhole routers can provide small latencies and good performance for best-effort traffic, they are unable to provide QoS guarantees for soft real-time applications such as streaming media. Our study is motivated by the need to simultaneously handle multiple such traffic types, which are becoming important and prevalent in clustered environments. We also feel that it is imperative to leverage existing, mature, commodity technology, i.e., wormhole switching, to provide a cost-effective solution rather than using the relatively new or hybrid switching alternatives proposed by other researchers. We have proposed a new router architecture, called MediaWorm, with only one major modification compared to "vanilla" wormhole routers: it incorporates a rate-proportional resource scheduler, called FGVC, instead of the common rate-agnostic schedulers such as FIFO or round-robin. We have studied the capabilities of the MediaWorm in supporting real-time and best-effort traffic. The main conclusions of our study are the following:

• We confirm that the FGVC scheduler can provide considerably improved performance for traffic that requires soft real-time guarantees (VBR/CBR).

• The MediaWorm router design shows that there is no adverse effect on the performance of VBR traffic in the presence of best-effort traffic. However, as the share of VBR traffic increases for a given load, this adversely affects the latency of best-effort traffic. A wormhole router can provide jitter-free delivery to VBR/CBR traffic up to a load of 70-80% of the physical channel bandwidth.

• Although the performance of a PCS router is slightly better than that of the MediaWorm, PCS routers are more complex than wormhole routers and may drop a large number of connections.

• We find that the performance of a small fat-mesh network is comparable to that of a single switch. Although it is difficult to extrapolate performance to much larger clusters directly from our present results, we expect that clusters designed with an appropriate bandwidth balance among the various links, through the use of fat topologies and MediaWorm-like switches, should be able to provide good performance for both real-time and best-effort traffic.

• A preemptive router model seems more appropriate for handling dynamic workloads. However, preemption in a pipelined model is more complex than in a non-preemptive model (MediaWorm), since a lower priority message can block a higher priority message at more than one stage of the pipeline. Instead of providing preemption at several stages, preemption in the input buffers followed by an acceleration mechanism at the other stages seems a viable design.

In summary, our study suggests that by augmenting a conventional wormhole router with a rate-based resource scheduling technique, one can provide a viable, cost-effective switch for cluster interconnects that supports both real-time and best-effort traffic mixes. The MediaWorm router supports this claim. It is also possible to design more sophisticated routers by incorporating a message preemption scheme.

In the next chapter, we discuss the design of a network interface card (NIC) that can be used in conjunction with the proposed router models to provide end-to-end QoS guarantees.

Fig. 3.8. Comparison of MR and TR ((8 × 8) switch, 16 VCs, 400 Mbps links, x : y = 80:20): mean delivery interval (msec) and standard deviation of the delivery interval versus input link load.

Fig. 3.9. Comparison of CBR and synthetic VBR traffic in the MediaWorm router ((8 × 8) switch, 16 VCs, 400 Mbps links, all real-time traffic): mean delivery interval (msec) and standard deviation of the delivery interval versus input link load.

Fig. 3.10. Mixed traffic (synthetic VBR + best-effort traffic) ((8 × 8) switch, 16 VCs, 400 Mbps links): mean delivery interval and its standard deviation (msec) versus the proportion of real-time to best-effort traffic (x:y), for several input loads.

Fig. 3.11. Impact of VCs and crossbar capabilities ((8 × 8) switch, 400 Mbps links, x : y = 100:0): mean delivery interval and its standard deviation versus input link load for 16, 8, and 4 virtual channels with a multiplexed crossbar and 4 virtual channels with a full crossbar.

Fig. 3.12. Effect of message size on jitter ((8 × 8) switch, 400 Mbps link bandwidth, 16 VCs, all synthetic VBR traffic): mean delivery interval and its standard deviation (msec) versus message size (flits), for input loads of 0.64 and 0.80.

Fig. 3.13. MR and PCS comparison ((8 × 8) switch, 100 Mbps link bandwidth, 24 VCs): mean delivery time and its standard deviation (msec) versus input link load for the wormhole and PCS routers.

Fig. 3.14. TR vs. MR with MPEG-2 video traffic ((8 × 8) switch, 16 VCs, 1.6 Gbps, x : y = 80:20): (a) mean delivery interval and (b) standard deviation of the inter-delivery time versus input link load.

Fig. 3.15. Mixed traffic (MPEG-2 video traces + best-effort traffic, (8 × 8) switch, 16 VCs, 1.6 Gbps): (a) mean delivery interval and (b) its standard deviation versus the proportion of real-time to best-effort traffic (x:y), for several input loads.

Fig. 3.16. Performance of a (2 × 2) fat mesh ((8 × 8) switches, 400 Mbps link bandwidth): (a) mean delivery interval, (b) standard deviation of the delivery interval, and (c) average latency of best-effort traffic (microseconds) versus the proportion of real-time to best-effort traffic (x:y), for input loads of 0.7, 0.8, and 0.9.

Fig. 3.17. Deadline missing probability and deadline missing time in a single router under dynamic load variation: (a) deadline missing probability and (b) deadline missing time (microseconds) versus the proportion of real-time to best-effort traffic (x:y) for the TR, MR, and P routers. The input load (0.80 or 0.85) is specified in the graphs.

Fig. 3.18. Components of message latency (queueing time and network latency, in microseconds) of (a) control traffic and (b) best-effort traffic in a single router under dynamic load variation, for the TR, MR, and P routers at traffic mixes of 20:80, 30:70, and 50:50. The input load is 0.80.

Fig. 3.19. Effect of block-level multiplexing in a single preemptive router under dynamic load variation: deadline missing probability and best-effort traffic latency (microseconds) versus the proportion of real-time to best-effort traffic (x:y), for block sizes of 1, 5, and 10 flits at loads of 0.80 and 0.85.

Fig. 3.20. Comparison of preemption+acceleration and preemption only in a single router under dynamic load variation: deadline missing probability and best-effort traffic latency (microseconds) versus the proportion of real-time to best-effort traffic (x:y), at loads of 0.80 and 0.85. (Some results for the best-effort traffic at high load are not included due to saturation.)

Fig. 3.21. Deadline missing probability and average latency of best-effort traffic in a (2 × 2) mesh network under dynamic load variation: (a) deadline missing probability and (b) best-effort traffic latency (microseconds) versus the proportion of real-time to best-effort traffic (x:y), for the NP and P routers at loads of 0.70 and 0.80.

Chapter 4

A QoS Capable Network Interface Card Design

In this chapter, we present a network interface card (NIC) design for end-to-end QoS support in clusters. Our study is based on the Virtual Interface Architecture (VIA) framework, which has become a standard for designing user-level communication on NICs. First, we show how QoS provisioning can be provided in the context of VIA. Then, we evaluate a complete cluster interconnect consisting of QoS-capable routers and NICs.

4.1 Virtual Interface Architecture

The network interface (NI) plays a crucial role in overall communication performance, since it is responsible for initiating and responding to communications, for handling data movement, and for providing application isolation. Since improving the performance of the router/interconnect alone will shift the communication bottleneck to the NI, the design of faster NIs has recently become a major research thrust. Consequently, a few user-level messaging layers such as Active Messages [74], U-Net [72] and FM [50] have been proposed to minimize operating system involvement in communication. As a consequence of this concerted effort, a generic communication layer, called the Virtual Interface Architecture (VIA), was introduced as a standard communication paradigm for System Area Networks (SANs) or clusters [6, 11].

The design focus of the VIA is to provide an efficient communication protocol between a user process and the network interface (NI). VIA is a connection-oriented paradigm built around Virtual Interfaces (VIs). A VI is the mechanism by which applications talk to the NIC hardware, and it establishes a connection between two processes. A VI consists of two queues: a send queue and a receive queue. Applications post requests in the form of descriptors in one of the queues. To send a message, an application posts a descriptor in the send queue and informs the NI of the pending request by ringing a send doorbell, which is a memory-mapped region on the NI. On receiving the doorbell, the NI transfers the descriptor and the data from user memory to the NI buffers using two DMAs. The NI transfers the message to the wire using another DMA, and updates the status field of the send descriptor or that of a completion queue. The actions on the receive side are very similar to those of a send: the application creates an empty buffer, posts a descriptor in the receive queue, and rings the receive doorbell on the NIC. When a message arrives for a VI, the NI transfers the message to the buffer allocated by the application and updates the status field of the receive descriptor. The message is subsequently consumed by the receiving process. Figure 4.1 shows this procedure.

Based on this framework, a few implementations of VIA have been developed recently (and some are under development) to achieve low-latency user-level communication. However, the original VIA framework does not have any QoS design specification. Here, we propose an extension of the VIA design to support different priority classes in the NIC.
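The send path just described can be sketched in host software as follows. This is a schematic outline of the descriptor/doorbell handshake, not code from the VIA specification or from any particular VIA implementation; the structure layouts, the doorbell register, and the busy-wait completion check are simplified illustrations.

#include <atomic>
#include <cstdint>

// Simplified send descriptor: where the payload lives and how large it is.
// Real VIA descriptors carry more control and address segments than shown here.
struct SendDescriptor {
    uint64_t              buffer_addr = 0;  // user buffer (registered memory in real VIA)
    uint32_t              length      = 0;
    std::atomic<uint32_t> status;            // written by the NI when the send completes
};

// One VI endpoint: a send queue in user memory plus a memory-mapped doorbell.
struct VirtualInterface {
    SendDescriptor     send_queue[64];
    uint32_t           send_tail = 0;
    volatile uint32_t* send_doorbell = nullptr;  // memory-mapped register on the NIC
};

// Post a message on the VI: write a descriptor, then ring the doorbell so the
// NI knows there is a pending request and can start its DMA transfers.
void post_send(VirtualInterface& vi, const void* payload, uint32_t len) {
    SendDescriptor& d = vi.send_queue[vi.send_tail % 64];
    d.buffer_addr = reinterpret_cast<uint64_t>(payload);
    d.length      = len;
    d.status.store(0, std::memory_order_release);    // 0 = pending

    *vi.send_doorbell = vi.send_tail;                // ring the send doorbell
    ++vi.send_tail;
}

// Wait until the NI marks the descriptor complete (a completion queue could be
// polled instead, as the text notes).
void wait_send(const SendDescriptor& d) {
    while (d.status.load(std::memory_order_acquire) == 0) { /* spin */ }
}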

Fig. 4.1. Virtual Interface Architecture Paradigm (user applications with per-VI send (S) and receive (R) queues, doorbells, and DMA transfers to a VI-capable NIC).

4.2 A QoS Capable NIC Design

We propose three design modifications to the original VIA framework, as described below: a prioritized doorbell structure to support different traffic classes, virtual channel aware buffer management in the NIC, and a hardware-supported VirtualClock scheduler to transfer flits to the router.

Figure 4.2 shows the different stages in the flow of data from an application to the NIC. Each application, such as a video source or a best-effort process, has a VI with the corresponding send and receive queues. The send and receive queues reside in user memory. To support integrated traffic, we implemented prioritized doorbells in the NIC, where there is a doorbell queue (send/receive) for each class. The NIC firmware picks up doorbells based on their priority, serving each class's queue in FCFS order, and programs the host DMA engine to transfer the descriptor followed by the message. To avoid head-of-queue blocking, we use a preemptive solution: if the NIC buffer (virtual channel) corresponding to a doorbell is blocked, the scheduler picks the next doorbell in the queue. Messages of the same class do not get reordered in this scheme.
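One plausible reading of this doorbell-selection policy is rendered below: one FIFO doorbell queue per priority class, highest class served first, and a class whose target virtual channel is currently blocked is skipped for the moment rather than reordered internally. The types and the is_vc_blocked callback are illustrative, not part of the VIA specification or the NIC firmware described here.

#include <deque>
#include <functional>
#include <optional>
#include <vector>

struct Doorbell {
    int vi_id;   // which VI posted the request
    int vc;      // NIC virtual channel the message will be staged into
};

// One FIFO doorbell queue per priority class; class 0 is the highest priority.
using DoorbellQueues = std::vector<std::deque<Doorbell>>;

// Select the next doorbell to serve: highest priority class first, FCFS within a
// class. If the head of a class is blocked, that class is skipped this round, which
// keeps messages of the same class in order while letting other classes proceed.
std::optional<Doorbell> next_doorbell(DoorbellQueues& queues,
                                      const std::function<bool(int)>& is_vc_blocked) {
    for (auto& q : queues) {                          // priority order
        if (q.empty()) continue;
        if (is_vc_blocked(q.front().vc)) continue;    // skip a blocked class for now
        Doorbell d = q.front();
        q.pop_front();
        return d;
    }
    return std::nullopt;                              // nothing serviceable this round
}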

Fig. 4.2. A VIA-style NIC with QoS support (traffic sources with VIs in user memory, doorbell queues for priorities 1 through s, a prioritized FIFO, per-VC buffers VC 1 through VC C, and a VirtualClock scheduler feeding the physical channel).

Next, to make the NIC design compatible with the QoS-aware router of the previous section, we implemented an equal number (C) of VCs in the NIC buffer to enable virtual channel flow control in the NIC. Note that this is a logical separation of the NIC local memory. As messages are transferred into the NIC by the host DMA, they are broken into flits by the NIC processor. The NIC buffer behaves as a set of FCFS queues for the different VCs. In the original VIA implementation, the send DMA engine of the NI (for example, in the Myrinet network card) is used to transfer a complete message into the network at the rate of one flit per cycle. On the other hand, the router model discussed in this


More information

Introduction to ATM Traffic Management on the Cisco 7200 Series Routers

Introduction to ATM Traffic Management on the Cisco 7200 Series Routers CHAPTER 1 Introduction to ATM Traffic Management on the Cisco 7200 Series Routers In the latest generation of IP networks, with the growing implementation of Voice over IP (VoIP) and multimedia applications,

More information

Chapter 4 Network Layer

Chapter 4 Network Layer Chapter 4 Network Layer Computer Networking: A Top Down Approach Featuring the Internet, 3 rd edition. Jim Kurose, Keith Ross Addison-Wesley, July 2004. Network Layer 4-1 Chapter 4: Network Layer Chapter

More information

Chapter III. congestion situation in Highspeed Networks

Chapter III. congestion situation in Highspeed Networks Chapter III Proposed model for improving the congestion situation in Highspeed Networks TCP has been the most used transport protocol for the Internet for over two decades. The scale of the Internet and

More information

Abstract. Paper organization

Abstract. Paper organization Allocation Approaches for Virtual Channel Flow Control Neeraj Parik, Ozen Deniz, Paul Kim, Zheng Li Department of Electrical Engineering Stanford University, CA Abstract s are one of the major resources

More information

PROVIDING SERVICE DIFFERENTIATION IN OBS NETWORKS THROUGH PROBABILISTIC PREEMPTION. YANG LIHONG (B.ENG(Hons.), NTU)

PROVIDING SERVICE DIFFERENTIATION IN OBS NETWORKS THROUGH PROBABILISTIC PREEMPTION. YANG LIHONG (B.ENG(Hons.), NTU) PROVIDING SERVICE DIFFERENTIATION IN OBS NETWORKS THROUGH PROBABILISTIC PREEMPTION YANG LIHONG (B.ENG(Hons.), NTU) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING DEPARTMENT OF ELECTRICAL &

More information

Advanced Computer Networks. Flow Control

Advanced Computer Networks. Flow Control Advanced Computer Networks 263 3501 00 Flow Control Patrick Stuedi Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Last week TCP in Datacenters Avoid incast problem - Reduce

More information

A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ

A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ E. Baydal, P. López and J. Duato Depto. Informática de Sistemas y Computadores Universidad Politécnica de Valencia, Camino

More information

Networking Quality of service

Networking Quality of service System i Networking Quality of service Version 6 Release 1 System i Networking Quality of service Version 6 Release 1 Note Before using this information and the product it supports, read the information

More information

Scheduling Algorithms for Input-Queued Cell Switches. Nicholas William McKeown

Scheduling Algorithms for Input-Queued Cell Switches. Nicholas William McKeown Scheduling Algorithms for Input-Queued Cell Switches by Nicholas William McKeown B.Eng (University of Leeds) 1986 M.S. (University of California at Berkeley) 1992 A thesis submitted in partial satisfaction

More information

University of Castilla-La Mancha

University of Castilla-La Mancha University of Castilla-La Mancha A publication of the Department of Computer Science Traffic Scheduling Solutions with QoS Support for an Input-Buffered MultiMedia Router by Blanca Caminero, Carmen Carrión,

More information

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control Chapter 12 Congestion in Data Networks Effect of Congestion Control Ideal Performance Practical Performance Congestion Control Mechanisms Backpressure Choke Packet Implicit Congestion Signaling Explicit

More information

G Robert Grimm New York University

G Robert Grimm New York University G22.3250-001 Receiver Livelock Robert Grimm New York University Altogether Now: The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation

More information

3. Quality of Service

3. Quality of Service 3. Quality of Service Usage Applications Learning & Teaching Design User Interfaces Services Content Process ing Security... Documents Synchronization Group Communi cations Systems Databases Programming

More information

Delay Constrained ARQ Mechanism for MPEG Media Transport Protocol Based Video Streaming over Internet

Delay Constrained ARQ Mechanism for MPEG Media Transport Protocol Based Video Streaming over Internet Delay Constrained ARQ Mechanism for MPEG Media Transport Protocol Based Video Streaming over Internet Hong-rae Lee, Tae-jun Jung, Kwang-deok Seo Division of Computer and Telecommunications Engineering

More information

Real-Time Mixed-Criticality Wormhole Networks

Real-Time Mixed-Criticality Wormhole Networks eal-time Mixed-Criticality Wormhole Networks Leandro Soares Indrusiak eal-time Systems Group Department of Computer Science University of York United Kingdom eal-time Systems Group 1 Outline Wormhole Networks

More information

Congestion Control and Resource Allocation

Congestion Control and Resource Allocation Congestion Control and Resource Allocation Lecture material taken from Computer Networks A Systems Approach, Third Edition,Peterson and Davie, Morgan Kaufmann, 2007. Advanced Computer Networks Congestion

More information

Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS

Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS Chapter 5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS 5.1 Introduction For successful

More information

Managing Performance Variance of Applications Using Storage I/O Control

Managing Performance Variance of Applications Using Storage I/O Control Performance Study Managing Performance Variance of Applications Using Storage I/O Control VMware vsphere 4.1 Application performance can be impacted when servers contend for I/O resources in a shared storage

More information

Lecture 16: Network Layer Overview, Internet Protocol

Lecture 16: Network Layer Overview, Internet Protocol Lecture 16: Network Layer Overview, Internet Protocol COMP 332, Spring 2018 Victoria Manfredi Acknowledgements: materials adapted from Computer Networking: A Top Down Approach 7 th edition: 1996-2016,

More information

Routing Algorithms. Review

Routing Algorithms. Review Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent

More information

General comments on candidates' performance

General comments on candidates' performance BCS THE CHARTERED INSTITUTE FOR IT BCS Higher Education Qualifications BCS Level 5 Diploma in IT April 2018 Sitting EXAMINERS' REPORT Computer Networks General comments on candidates' performance For the

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

CSE 461 Quality of Service. David Wetherall

CSE 461 Quality of Service. David Wetherall CSE 461 Quality of Service David Wetherall djw@cs.washington.edu QOS Focus: How to provide better than best effort Fair queueing Application Application needs Transport Traffic shaping Guarantees IntServ

More information

Chapter 4 Network Layer: The Data Plane

Chapter 4 Network Layer: The Data Plane Chapter 4 Network Layer: The Data Plane A note on the use of these Powerpoint slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you see

More information

Optimistic Parallel Simulation of TCP/IP over ATM networks

Optimistic Parallel Simulation of TCP/IP over ATM networks Optimistic Parallel Simulation of TCP/IP over ATM networks M.S. Oral Examination November 1, 2000 Ming Chong mchang@ittc.ukans.edu 1 Introduction parallel simulation ProTEuS Agenda Georgia Tech. Time Warp

More information

Defining QoS for Multiple Policy Levels

Defining QoS for Multiple Policy Levels CHAPTER 13 In releases prior to Cisco IOS Release 12.0(22)S, you can specify QoS behavior at only one level. For example, to shape two outbound queues of an interface, you must configure each queue separately,

More information

Journal of Electronics and Communication Engineering & Technology (JECET)

Journal of Electronics and Communication Engineering & Technology (JECET) Journal of Electronics and Communication Engineering & Technology (JECET) JECET I A E M E Journal of Electronics and Communication Engineering & Technology (JECET)ISSN ISSN 2347-4181 (Print) ISSN 2347-419X

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) The Internet was originally designed for best-effort service without guarantee of predictable performance. Best-effort service is often sufficient for a traffic that is not sensitive

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs

MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs Jose Duato 1, Sudhakar Yalamanchili 2, M. Blanca Caminero 3, Damon Love 2, Francisco J. Quiles 3 Abstract This paper presents

More information

Network Model for Delay-Sensitive Traffic

Network Model for Delay-Sensitive Traffic Traffic Scheduling Network Model for Delay-Sensitive Traffic Source Switch Switch Destination Flow Shaper Policer (optional) Scheduler + optional shaper Policer (optional) Scheduler + optional shaper cfla.

More information

Chapter 4. Computer Networking: A Top Down Approach 5 th edition. Jim Kurose, Keith Ross Addison-Wesley, sl April 2009.

Chapter 4. Computer Networking: A Top Down Approach 5 th edition. Jim Kurose, Keith Ross Addison-Wesley, sl April 2009. Chapter 4 Network Layer A note on the use of these ppt slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you can add, modify, and delete

More information

EE482, Spring 1999 Research Paper Report. Deadlock Recovery Schemes

EE482, Spring 1999 Research Paper Report. Deadlock Recovery Schemes EE482, Spring 1999 Research Paper Report Deadlock Recovery Schemes Jinyung Namkoong Mohammed Haque Nuwan Jayasena Manman Ren May 18, 1999 Introduction The selected papers address the problems of deadlock,

More information

Lecture 7. Network Layer. Network Layer 1-1

Lecture 7. Network Layer. Network Layer 1-1 Lecture 7 Network Layer Network Layer 1-1 Agenda Introduction to the Network Layer Network layer functions Service models Network layer connection and connectionless services Introduction to data routing

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Internet Architecture and Protocol

Internet Architecture and Protocol Internet Architecture and Protocol Set# 04 Wide Area Networks Delivered By: Engr Tahir Niazi Wide Area Network Basics Cover large geographical area Network of Networks WANs used to be characterized with

More information

Deadlock-free XY-YX router for on-chip interconnection network

Deadlock-free XY-YX router for on-chip interconnection network LETTER IEICE Electronics Express, Vol.10, No.20, 1 5 Deadlock-free XY-YX router for on-chip interconnection network Yeong Seob Jeong and Seung Eun Lee a) Dept of Electronic Engineering Seoul National Univ

More information

Computation of Multiple Node Disjoint Paths

Computation of Multiple Node Disjoint Paths Chapter 5 Computation of Multiple Node Disjoint Paths 5.1 Introduction In recent years, on demand routing protocols have attained more attention in mobile Ad Hoc networks as compared to other routing schemes

More information

Quality of Service in the Internet

Quality of Service in the Internet Quality of Service in the Internet Problem today: IP is packet switched, therefore no guarantees on a transmission is given (throughput, transmission delay, ): the Internet transmits data Best Effort But:

More information