MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs

Jose Duato (1), Sudhakar Yalamanchili (2), M. Blanca Caminero (3), Damon Love (2), Francisco J. Quiles (3)

(1) J. Duato is with the Dept. of Information Systems and Computer Architecture, Universidad Politecnica de Valencia, P.O.B. 2212, Valencia, SPAIN, {jduato@gap.upv.es}
(2) D. Love and S. Yalamanchili are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, {dlove, sudha}@ece.gatech.edu
(3) B. Caminero and F. J. Quiles are with the Dept. of Computer Science, Escuela Politecnica Superior de Albacete, SPAIN, {blanca, paco}@info-ab.uclm.es

Abstract

This paper presents the architecture of a router designed to efficiently support traffic generated by multimedia applications. The router is targeted for use in clusters and LANs rather than in WANs, the latter being served by communication substrates such as ATM. The distinguishing features of the proposed router architecture are the use of small fixed-size buffers, a large number of virtual channels, link-level virtual channel flow control, support for dynamic modification of connection bandwidth and priorities, and coordinated scheduling of connections across all output channels. The paper begins with a discussion of the design choices and architectural trade-offs made in the current MultiMedia Router (MMR) project. The performance evaluation section presents some preliminary results of the coordinated scheduling of constant bit rate (CBR) traffic streams.

1. Introduction

In the past few years we have seen an explosive growth in network-based multimedia applications. Example applications include web servers, video-on-demand servers, telemedicine, immersive environments, interactive simulations, and collaborative design environments. The data are often distributed and these applications individually require substantial bandwidth to meet real-time interactive constraints.
The network bandwidth must also be shared by other applications that may not have as demanding constraints. The physical constraints of cluster/LAN interconnects, as well as the applications that utilize them, produce different trade-offs in router design from those made in wide area networks (WANs). As a result we arrive at different and more effective architectural solutions for cluster/LAN routers. The key issue is the ability to provide quality of service (QoS) guarantees at multiprocessor cut-through latencies. This makes it difficult to use existing substrates such as Gigabit Ethernet and ATM [22], in which message traffic encounters relatively large latencies as compared to networks such as Myrinet [5] or Tandem ServerNet [16,18]. Traditional router technology developed for high-speed multiprocessor networks is optimized for low latency and for best-effort traffic. However, these networks are not designed to permit concurrent guarantees for communication performance for multiple applications.

The primary objective of the Multimedia Router (MMR) project is the design and implementation of a single-chip router optimized for multimedia applications. The goal is to provide architectural support to enable a range of quality of service (QoS) guarantees at latencies comparable to state-of-the-art multiprocessor cut-through routers. To achieve this goal we must provide solutions to many difficult hardware resource management and scheduling problems while constraining required resources to permit effective single-chip implementations. This paper presents some specific trade-offs made in the architecture of the MMR single-chip multimedia router and the results of some preliminary simulation experiments.

2. Application Requirements

The main distinguishing features of multimedia communication environments are:

- Very long data streams
- Wide range of bandwidth requirements
- Large number of connections
- Jitter sensitive
- Latency tolerant
- Short control messages

Multimedia traffic may also coexist with best-effort traffic generated by other applications. The MMR should handle this hybrid traffic efficiently, satisfying the QoS requirements of multimedia traffic, minimizing the average latency of best-effort traffic, and maximizing link utilization.

Constant bit rate (CBR) connections are relatively easy to handle. Bandwidth requirements can be met by allocating the requested bandwidth while establishing the connection. Variable bit rate (VBR) connections are more difficult to handle. When establishing a VBR connection, often only approximate knowledge of the average and maximum bandwidth requirements is available. For example, it is possible to compute the average and maximum bandwidth requirements for a stored compressed movie. However, it is not possible to obtain exact values for real-time transmission of compressed video.

Thus, the designer has to carefully select several qualitative and quantitative network design parameters. Qualitative parameters include network topology, switching technique, routing algorithm, admission control strategy, bandwidth allocation algorithm, link/switch scheduling algorithm, buffer organization, buffer management, and crossbar organization. Quantitative parameters include network size, link bandwidth, router degree (number of ports), router clock frequency, buffer size, and number of virtual channels.

[Figure 1. The Architecture of the MultiMedia Router (MMR). Blocks shown: physical input channels, phit buffers, VCM+LS units (VCM - Virtual Channel Memory, LS - Link Scheduler), crossbar switch, switch scheduler, routing and arbitration unit, phit buffers, physical output channels.]

3. Router Architecture

Figure 1 illustrates the architecture of the MMR. The following subsections describe some basic trade-offs that were made in the individual components.

3.1 Switching Technique

Multimedia traffic often requires a bounded jitter on long data streams.
Switching techniques that reserve resources on the fly, like wormhole switching, are not well suited for such communication because the time a message is blocked waiting for a busy resource is not bounded. Connection oriented schemes with scheduling support within the routers are better suited to meeting jitter requirements over long data streams. On the other hand, control messages and best-effort traffic will benefit from low-latency switching techniques like wormhole or virtual cut-through (VCT) switching. The overhead of connection oriented schemes is excessive for these messages. A good trade-off can be achieved by using a hybrid switching technique [12, 29]. Long data streams can be transmitted by using circuit switching or a variant thereof by first establishing a connection from source to destination and then forwarding the data. Control messages and best-effort traffic can be transmitted by using wormhole or VCT switching. However, we would like both types of messages to use the same pool of link and buffer resources at a node without partitioning of resources among switching classes (e.g., as in [23]). Among the connection-oriented schemes, we have selected pipelined circuit switching (PCS) because it is simple, can be combined with wormhole or VCT [3], uses flow control to prevent data losses, requires small buffer storage associated with each virtual channel, and the overhead produced by control information is relatively low compared to LAN connection oriented schemes such as ATM. PCS for long data streams is combined with VCT for control and best-effort messages. In both cases the data is organized as a sequence of flow control digits or flits [9]. While the use of large flits amortizes flow control and scheduling delays, it also increases latency and buffer storage requirements. Latency can be reduced by pipelining flit transmission at a finer granularity. 
Pipelining can be done at the phit level, as in the Cray T3D router [25], where a phit is the amount of information that can be transmitted in parallel through a communication link. As serial links are frequent in LAN environments, we assume that pipelining is performed at the word level, where word size is equal to the width of the router internal data paths.

3.2 Buffer Organization

When a connection is established, a virtual channel is reserved at each link in the path from source to destination. In order to support a large number of connections concurrently, buffers at each link must be organized as a large set of virtual channels. Virtual channels have been traditionally organized as a set of queues linked by a multiplexor. As indicated in [8], router delays can increase substantially when a large number of virtual channels are multiplexed onto physical links. This is due in part to the multiplexor and virtual channel controller delays. Moreover, fully de-multiplexed crossbars [9] (i.e., one virtual channel per crossbar port) become prohibitively expensive in silicon area as the number of virtual channels increases. Thus, we pursue a different buffer organization to support a large number of virtual channels.

The major buffer organizations that must be considered are central buffers, output buffers, and input buffers. We have considered these organizations and our current analysis [13] argues for input buffers modified as follows to remove head-of-line blocking. The MMR will use virtual channels organized as a set of interleaved RAM modules. Each flit is low-order interleaved across memory modules. Flits belonging to the same virtual channel are stored in adjacent sets of memory locations. The number of memory modules and flit size must be selected to balance memory access time, link speed, and crossbar switching delay, while masking flow control and scheduling delays. The read address required to retrieve flits is supplied by the link scheduler. The write address is obtained from the virtual channel flow control circuitry indicating the virtual channel identifier for the next flit transmission.

[Figure 2. Organization of Virtual Channel Memory. Blocks shown: control word decoder, demux, four buffers, four RAM modules, address generator, mux, and an input from the link scheduler.]
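The interleaved layout described above can be illustrated with a small address-mapping sketch. It is only one plausible reading of "low-order interleaved": the module count, flit size in words, channel depth, and the function name `word_location` are all assumptions made for illustration and do not come from the MMR design.

```python
# Hypothetical parameters: none of these values come from the paper.
NUM_MODULES = 4        # interleaved RAM modules
WORDS_PER_FLIT = 8     # flit size in words (a multiple of NUM_MODULES here)
FLITS_PER_VC = 2       # virtual channel depth, in flits

ROWS_PER_FLIT = WORDS_PER_FLIT // NUM_MODULES


def word_location(vc, slot, word):
    """Map (virtual channel, flit slot, word index) to (module, row)."""
    if not (0 <= slot < FLITS_PER_VC and 0 <= word < WORDS_PER_FLIT):
        raise ValueError("slot or word index out of range")
    # Low-order interleaving: consecutive words go to consecutive modules.
    module = word % NUM_MODULES
    # Flits of one virtual channel occupy adjacent rows in every module.
    flit_base = (vc * FLITS_PER_VC + slot) * ROWS_PER_FLIT
    row = flit_base + word // NUM_MODULES
    return module, row
```

With this mapping, `NUM_MODULES` consecutive words of a flit can be accessed in parallel, one per module, which is the property that lets memory access time be balanced against link speed.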
As shown in Figure 1, small phit buffers are used for link buffers and are deep enough to store all the phits that arrive during a decoding period (i.e., during the computation of the memory address to store those phits). Phit buffers also allow low-latency routing of short messages using VCT, provided that there is no contention (i.e., the requested output link is free). Similarly, phit buffers allow fast processing of probes and acknowledgments when establishing a connection. There is also some state information stored with each virtual channel that is used for scheduling. The collective resources are referred to as virtual channel memory (VCM). By designing pipelined memory buffer systems we can match increasing external link speeds to decreasing intra-router delays. The variable parameters that can be adjusted include flit sizes, number of memory banks, and the virtual channel depth.

3.3 Switch Organization

In most routers the internal switch is implemented as a crossbar, which may be multiplexed, partially multiplexed, or fully de-multiplexed [9]. Although some routers use a fully de-multiplexed crossbar [1], this organization becomes prohibitive when the number of virtual channels is large. Even for a relatively small number of virtual channels, some commercial routers use a multiplexed crossbar [6]. The MMR uses a multiplexed crossbar where the internal switch is a crossbar with as many ports as communication links. It reduces silicon area by factors of V and V^2, respectively, with respect to a partially multiplexed and a fully de-multiplexed crossbar, where V is the number of virtual channels per link.

The main drawback of a multiplexed crossbar is that arbitration is needed every time an input link switches from one virtual channel to another. Arbitration is required at the input side to select a virtual channel from each input link.
Arbitration is also required at the output side because several virtual channels may request the same output link although this can be hidden by overlapping with the transmission of a previous flit if flits are large enough. Switch reconfiguration overhead can also be similarly hidden. An interesting feature of multiplexed crossbars is that buffers are not required at the output side. As switch output ports are directly connected to output links, flits are directly transmitted through the switch and the corresponding output link. However, a few phit buffers can be used to pipeline information through the switch and the link. Finally, serialization is required if internal data paths are wider than physical links.

3.4 Packet and Flit Transmission

To fully exploit switch and link bandwidth while simplifying router design, the MMR synchronously assigns switch ports and output links to the requesting virtual channels. Flit transmission is organized as a sequence of flit cycles. During each flit cycle, all the input links with ready flits start by transmitting a control word containing the identifier of the virtual channel to which the next flit belongs. Then they synchronously transmit one flit through the switch and the corresponding output links. Concurrently, arbitration is performed for the assignment of switch ports and output links for the next flit cycle. Once the current flit transmission has finished, the switch is reconfigured. This operation requires one clock cycle. During reconfiguration, no data are transmitted. The MMR uses this cycle to perform pending transmissions of routing probes, backtracking probes, and acknowledgments for connection establishment [1]. Once the switch has been reconfigured, the next flit cycle starts. Although transmission is synchronous inside each router, it should be noted that different routers work asynchronously.

Synchronous flit transmission is efficient for data streams but not for control and best-effort messages. These messages are transmitted asynchronously through the switch using VCT switching. Given the large flit sizes ( bits), the following discussion assumes that packet size is equal to flit size. Note that the unit of flow control and buffer management in VCT is a packet. Therefore, flow control units have the same size in PCS and VCT, thus simplifying router design. Packet headers are routed as soon as they reach a router and are forwarded to the routing and arbitration unit.
If the requested switch input port and output link are free (i.e., they are not transmitting any flit during the current flit cycle) and there are free virtual channels in the requested output link, control packets are immediately forwarded because they have a higher priority than data streams. This transmission is not synchronized at flit cycle boundaries. As the transmission of control packets may take longer than the rest of the current flit cycle, the corresponding switch port and output link will be considered busy during link arbitration for the next flit cycle. If the requested switch port and/or output link are busy but there are free virtual channels at the next router, a virtual channel is reserved and the packet is stored in the corresponding buffer at the current router. It will be synchronously scheduled together with flits from data streams. If there are no free virtual channels at the next router, the packet is blocked and stored in the corresponding buffer at the current router.

Best-effort packets are also routed as soon as the header reaches the routing and arbitration unit. However, best-effort packets have lower priority than data streams. If the requested output link has free virtual channels at the next router, a virtual channel is reserved. Otherwise, the packet is blocked. In both cases, the packet is stored in the corresponding buffer at the current router. Best-effort packets are synchronously scheduled together with flits from data streams. When a control or a best-effort packet is completely transmitted, the corresponding virtual channel is released.

3.5 Routing and Arbitration Unit

The routing and arbitration unit executes the routing algorithm. The routing algorithm determines the path followed by the probes when establishing a connection, and the path taken by the best-effort packets.
For best-effort packets, the MMR uses a fully adaptive routing algorithm that has been proposed for wormhole networks with irregular topology [26,27] and is valid for VCT switching. Exhaustive profitable backtracking (EPB) [17] will be used when establishing connections. This algorithm performs an exhaustive search of the minimal paths in the network until a valid path is found or the probe backtracks to the source node. In order to avoid searching the same links twice, a history store associated with each input virtual channel records all the output links that have already been searched [17]. An implementation of such a protocol has been described in [1].

The routing and arbitration unit keeps the channel mappings between input and output virtual channels for established connections [17]. Virtual channels are specified by indicating the physical link and the virtual channel on that link. Direct and reverse channel mappings are stored. Direct mappings are required to forward data flits. Reverse mappings are used by backtracking headers and returned acknowledgments. Mappings are also used to propagate status information.

4. Bandwidth Allocation and Link/Switch Scheduling

The MMR supports QoS for virtual connections realized with virtual channels. Support for QoS guarantees within the MMR takes the form of solutions to three basic problems: bandwidth allocation, link scheduling, and switch scheduling. The bandwidth allocation scheme allocates bandwidth to each connection when it is established while link

and switch scheduling strategies must operate in a tightly coupled manner to make the most effective use of the network bandwidth. The major challenge in a single-chip MMR is that these strategies must have compact and fast implementations.

4.1 Link Operation

Link bandwidth and switch port bandwidth are split into flit cycles: the time taken for a flit to be transmitted through the router and across the physical link. Flit cycles are grouped into rounds, also referred to as frames. The number of flit cycles in a round is an integer multiple K (K > 1) of the number of virtual channels per link. Bandwidth for a connection is allocated as an integer number of flit cycles/round. Thus, a greater value of K provides a higher flexibility for bandwidth allocation. However, it may increase jitter on a connection since rounds take longer to complete. Therefore, the selected value for K is a trade-off between flexibility and jitter.

The data structures used for supporting fast scheduling decisions are a set of status bit vectors, where each bit in a vector is associated with a single virtual channel. Bit vectors provide information about different conditions for all the virtual channels in the router. Examples of status bit vectors include: flits_available, input_buffer_full, CBR_service_requested, CBR_bandwidth_serviced, VBR_bandwidth_serviced, etc. A bit in one of these bit vectors is updated every time the status of a virtual channel changes. For example, if a given virtual channel has no flits available to be transmitted and a flit belonging to that virtual channel arrives, the corresponding flits_available status bit is updated in the corresponding bit vector. In general, status bit vectors are associated with either input or output virtual channels, depending on the implementation of the link scheduling algorithm. The basic idea here is to trade space (silicon) for time (scheduling decisions).
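A software analogue of the status bit vectors is a plain integer with one bit per virtual channel. The following is a minimal sketch under that assumption; the helper names `set_bit`, `clear_bit`, and `channels` are illustrative only, and in the router itself these would be hardware registers combined by parallel logic rather than Python integers.

```python
def set_bit(vector, vc):
    """Return the vector with the bit for virtual channel vc set."""
    return vector | (1 << vc)


def clear_bit(vector, vc):
    """Return the vector with the bit for virtual channel vc cleared."""
    return vector & ~(1 << vc)


def channels(vector):
    """List the virtual channel indices whose bit is set."""
    result = []
    vc = 0
    while vector:
        if vector & 1:
            result.append(vc)
        vector >>= 1
        vc += 1
    return result


# Example status for one input link with 8 virtual channels.
flits_available = 0
credits_available = 0
for vc in (1, 4, 6):
    flits_available = set_bit(flits_available, vc)
for vc in (0, 4, 6, 7):
    credits_available = set_bit(credits_available, vc)

# One AND selects every channel that has both a flit and a credit.
ready = flits_available & credits_available
print(channels(ready))  # prints [4, 6]
```

The cost of the selection does not depend on which channels are active: a single bitwise operation covers all virtual channels on the link at once.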
Using status bit vectors, each physical input link can quickly determine the set of input virtual channels at that link which satisfy some conditions with simple, highly parallel bit operations. For example, we can quickly determine the virtual channels with flits_available and credits_available by performing the logical AND of the corresponding bit vectors. Similarly, the sets of channels satisfying other more complex conditions can also be quickly obtained.

4.2 Bandwidth Allocation

When a connection is being established in the MMR, the source node generates a routing probe that tries to establish a connection by setting up a path from source to destination, reserving link bandwidth and buffer space along that path. If resource reservation is successful, the connection is established and the request is granted. If resources cannot be reserved along the whole path, the connection fails and all the resources reserved during the construction of the path are released. Using a backtracking search, alternative paths through the network can be pursued.

For CBR connections each probe carries information about the requested bandwidth measured in flit cycles/round. Each output link requires an associated register that keeps track of the total number of flit cycles/round that have been allocated. This register is incremented by the requested number of flit cycles when link bandwidth is allocated and decremented when a connection is removed. A CBR connection can only be allocated if the total number of flit cycles that have been allocated (including the current request) does not exceed the number of flit cycles in a round. Note that it is possible to reserve some bandwidth/round for best-effort traffic in order to prevent starvation of best-effort packets.

For VBR connections, the problem is more difficult to solve since the amount of data varies over time.
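The CBR admission rule can be sketched as a single counter per output link. This is a minimal model of the register-based check described above, not the MMR hardware; the class name `OutputLink`, its parameters, and the optional `best_effort_reserve` argument are assumptions introduced for illustration.

```python
class OutputLink:
    """Tracks CBR bandwidth on one output link, in flit cycles per round."""

    def __init__(self, virtual_channels, k, best_effort_reserve=0):
        # A round holds K times as many flit cycles as virtual channels.
        self.round_length = k * virtual_channels
        # Optional cycles held back for best-effort traffic (assumed knob).
        self.best_effort_reserve = best_effort_reserve
        # Models the per-link register of allocated flit cycles/round.
        self.allocated = 0

    def admit_cbr(self, requested):
        """Reserve `requested` flit cycles/round if they still fit."""
        limit = self.round_length - self.best_effort_reserve
        if self.allocated + requested > limit:
            return False
        self.allocated += requested
        return True

    def release_cbr(self, requested):
        """Return the cycles of a removed connection to the link."""
        self.allocated -= requested


link = OutputLink(virtual_channels=256, k=2)  # round of 512 flit cycles
print(link.admit_cbr(500))  # True
print(link.admit_cbr(13))   # False: 513 cycles would exceed the round
print(link.admit_cbr(12))   # True: the round is now fully allocated
```

A probe that receives `False` at some link would backtrack and try an alternative path, releasing what it reserved on the abandoned links.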
In order to deal with the varying requirements of different connections, a probe establishing a VBR connection will carry the permanent and peak bandwidth for that connection. These values may be estimates depending on the available knowledge of the connection behavior. To support bandwidth allocation for VBR connections, each output link requires an additional register to that used for CBR connections. This second register stores the total peak bandwidth requested by all of the connections using that link. These two registers are incremented by the permanent and peak bandwidth, respectively, when a connection is established and decremented by those values when a connection is removed. A VBR connection will only be accepted if i) the value of the first register plus the permanent bandwidth of the current connection does not exceed the number of flit cycles in a round, and ii) the value of the second register does not exceed the product of the number of flit cycles in a round and a concurrency factor. The concurrency factor is stored in a separate register and is set during power on. Note that the proposed bandwidth allocation mechanism does not guarantee that the connection will be assigned the requested peak bandwidth. Providing such a guarantee could waste a large fraction of

link bandwidth, especially if peak bandwidths are worst-case estimates. The probability that the peak bandwidth will be available at any point in time depends on the concurrency factor. The concurrency factor is a trade-off between the ability to make QoS guarantees, the number of connections that can be concurrently serviced, and link utilization.

During data transmission, a policing protocol operates by limiting the injection of new flits into the network in such a way that each connection does not use higher link bandwidth than that allocated to it when the connection was established. The injection of flits for best-effort packets is automatically limited since it only uses bandwidth that is available after satisfying the requirements of connections that are guaranteed some minimal QoS. The MMR uses flow control to prevent flits from being discarded. Additionally, flit buffers are relatively small, producing a fast propagation of flow control information. Eventually, flow control may propagate backward up to the network interface at the source node, limiting the injection of new flits. Policing within the MMR would only be required if there was no policing at the interface.

4.3 Link Scheduling

As described in the preceding discussion, the basic mechanism to support QoS in the MMR consists of allocating bandwidth to each connection when it is established and guaranteeing that the allocated bandwidth will be available during data transmission via link/switch scheduling. For CBR connections the link scheduler requires state to store the bandwidth allocated to each virtual channel. The link scheduling algorithm operates on a round basis and ensures that no virtual channel consumes more bandwidth than allocated. For VBR connections, link scheduling becomes more complex. Each virtual channel requires state information storing the permanent and peak bandwidth for that connection, respectively.
Additionally, each virtual channel stores the priority associated with the data being transmitted. That priority can be dynamically modified by sending control words from the network interface, for example, based on application specific information. The link scheduling algorithm first assigns all the flit cycles in a round for CBR connections. Then, it assigns the permanent bandwidth to every VBR connection. Note that the bandwidth allocation mechanism guarantees that all the VBR connections can be assigned the permanent bandwidth. Now the link scheduling algorithm considers priorities. Starting from the highest-priority connection, bandwidth (flit cycles in a round) is allocated to each connection ensuring that no virtual channel consumes more than its peak bandwidth. Thus, at each intermediate router the link scheduler allocates flit cycles/round giving priority to CBR connections and permanent bandwidth on VBR connections followed by VBR connections in priority order. Conflicts at output ports will cause flits on some connections to be delayed. The MMR link scheduling approach recognizes that low-priority VBR connections may not be able to deliver all flits on time. However the presence of flow control enables delay information to propagate back to the source interface where appropriate action can be taken. For example, an interface may detect that a connection transmitting a low priority compressed video frame makes little progress at some point in time. The network interface may decide to abort the transmission of that frame. By doing so, less bandwidth is wasted in the transmission of a frame that will not meet the deadline. Also, note that the excess bandwidth (difference between peak and permanent bandwidths) requested by VBR connections is serviced in turn, completely servicing the excess bandwidth of one connection before moving to the next one. 
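The allocation order just described (CBR first, then VBR permanent bandwidth, then VBR excess bandwidth in priority order) can be sketched as a per-round budget computation. This is an illustrative model only: the function name `allocate_round`, the dictionary-based inputs, and the convention that a larger number means a higher priority are assumptions, and the sketch ignores flit availability and output conflicts.

```python
def allocate_round(round_length, cbr, vbr):
    """Split one round of flit cycles among connections.

    cbr: {vc: cycles}
    vbr: {vc: (permanent, peak, priority)}, larger priority is served first
    Returns (grant, leftover), where grant maps vc to flit cycles.
    """
    grant = {}
    remaining = round_length

    # Step 1: CBR connections receive exactly their allocation.
    for vc, cycles in cbr.items():
        grant[vc] = cycles
        remaining -= cycles

    # Step 2: every VBR connection receives its permanent bandwidth.
    for vc, (permanent, _peak, _priority) in vbr.items():
        grant[vc] = permanent
        remaining -= permanent

    if remaining < 0:
        raise ValueError("admission control should have rejected this set")

    # Step 3: excess bandwidth, one connection at a time, highest priority
    # first, never exceeding a connection's peak bandwidth.
    for vc in sorted(vbr, key=lambda v: vbr[v][2], reverse=True):
        permanent, peak, _priority = vbr[vc]
        extra = min(peak - permanent, remaining)
        grant[vc] += extra
        remaining -= extra

    return grant, remaining


grant, leftover = allocate_round(
    20, cbr={0: 4, 1: 6}, vbr={2: (2, 8, 1), 3: (3, 6, 5)}
)
print(grant, leftover)  # {0: 4, 1: 6, 2: 4, 3: 6} 0
```

In the example, the higher-priority VBR connection (channel 3) reaches its peak of 6 cycles, while channel 2 receives only part of its excess because the round is exhausted.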
The idea here is that it is preferable to service the excess bandwidth of most VBR connections completely at the risk of not servicing some of them at all. Certainly other service disciplines are possible.

Finally, we note that using control words along a connection we can dynamically vary the bandwidth requirements of a connection. This may be initiated by the source interface of a connection in response to external (CPU initiated) events or in response to actual performance that is experienced on a connection. The response may involve a change in data rate, selective dropping of data packets, or injection limitation. By using command encodings similar to those used in Myrinet [5] we can encapsulate dynamic bandwidth management in the flow control mechanism. The complex bandwidth control functions can be implemented in the network interfaces or source CPUs where there is a great deal of flexibility and application specific information. In return, the routers remain compact and fast.

4.4 Switch Scheduling

Switch scheduling refers to the process of determining which input ports are connected to which output ports in a flit cycle for the transmission of a single flit per port. Switch scheduling must be tightly coupled with link scheduling. Ideally virtual channels on input links must be

selected in such a way that there are no conflicts in the use of switch output ports (note that each switch output port is connected to a physical link). In each round, the scheduling algorithm is invoked every flit cycle to service connections and best-effort traffic.

Switch scheduling schemes can be classified as input-driven or output-driven. Input-driven schemes first consider the set of virtual channels in each input link. For each set, the link scheduling algorithm determines the virtual channel(s) that should transmit a flit during the next flit cycle. The direct channel mapping store indicates the requested output link for each selected virtual channel. As requests from different input links may contend for the same output link, some arbitration is required at each output port. This is the scheme used in the Intel Cavallino router [6]. On the other hand, output-driven schemes consider the set of input virtual channels requesting a given output link. For each set, the link scheduling algorithm determines the virtual channel that should transmit a flit during the next flit cycle. As contention may occur for the use of switch input ports, some arbitration is required for the assignment of switch input ports.

For fully de-multiplexed switches, output-driven schemes provide superior performance [15]. However, for a large number of virtual channels, a fully de-multiplexed crossbar is infeasible. For multiplexed crossbars the choice between input-driven and output-driven scheduling is not clear. In the former the state information associated with competing channels is located in the same router. This potentially enables intelligent decisions in arbitrating between competing requests from input ports. In output-driven schemes the state information associated with competing channels is naturally in distinct routers.
Separating the state information from the buffer location would appear to result in a more complex flow of control information, with the resulting overhead. Therefore, the MMR uses input-driven switch scheduling.

In the MMR we consider switch scheduling algorithms that attempt to schedule all ports concurrently and synchronously set the switch. These scheduling decisions can be classified according to the mechanism for arbitrating among multiple requests for an output port. Arbitration can be performed by using static priorities, dynamic priorities, or random selection. The MMR utilizes a dynamic priority biasing scheme motivated by the priority biasing scheme proposed in [2,7]. The priorities of the flits at the head of an input virtual channel are updated periodically, as often as every flit cycle. The unique aspect of this scheme is that the rate at which these priorities grow is a function of the QoS metric used for the corresponding connection. The result is a more equitable distribution of bandwidth across connections in a manner that is dependent upon the type of service guarantees rather than simply the time spent by the packet in the network. Dynamic re-computation of priorities can be a time-consuming task and the challenge is to develop solutions that can be implemented efficiently in area and time.

Finally, switch scheduling algorithms can be classified according to the number of candidates offered by the link scheduling algorithm from each group of virtual channels on an input link. For example, instead of selecting a single virtual channel from each input link, the router can select a set of candidates. This set is simply obtained as the result of some operations with bit vectors (for instance, the set of input virtual channels at that link with flits_available, credits_available for flit transmission, CBR_service_requested and not CBR_Completely_Serviced). Using bit vectors, identification of the set of channels can be quite fast.
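To make the role of candidate sets concrete, the following sketch models one flit cycle of input-driven scheduling with a simple greedy arbiter. It is not the MMR arbitration logic: the function name `schedule_flit_cycle`, the tuple layout of a candidate, and the tie-breaking rule are assumptions chosen to keep the example short.

```python
def schedule_flit_cycle(candidates):
    """Greedy input-driven switch scheduling for one flit cycle.

    candidates: {input_port: [(priority, vc, output_port), ...]}
    Returns {output_port: (input_port, vc)} with at most one flit per
    input port and per output port.
    """
    requests = []
    for inp, cands in candidates.items():
        for priority, vc, out in cands:
            requests.append((priority, inp, vc, out))

    # Highest priority first; ties go to the lower input port, then lower vc.
    requests.sort(key=lambda r: (-r[0], r[1], r[2]))

    assignment = {}
    busy_inputs = set()
    for _priority, inp, vc, out in requests:
        if inp in busy_inputs or out in assignment:
            continue  # this input or output is already used in this cycle
        assignment[out] = (inp, vc)
        busy_inputs.add(inp)
    return assignment


# One candidate per input: both inputs want output 2, so input 1 stays idle.
single = {0: [(5, 10, 2)], 1: [(4, 20, 2)]}
print(schedule_flit_cycle(single))  # {2: (0, 10)}

# Two candidates per input: input 1 falls back to its second choice.
double = {0: [(5, 10, 2), (3, 11, 1)], 1: [(4, 20, 2), (1, 21, 3)]}
print(schedule_flit_cycle(double))  # {2: (0, 10), 3: (1, 21)}
```

The second call uses both input ports in the same flit cycle, at the price of sorting and examining twice as many requests.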
Having more candidates available per switch port increases the probability of fully utilizing the switch bandwidth in a flit cycle. However, the process of arbitrating among multiple requests is now more complex and time-consuming. The challenge is in deciding how many candidates should be considered at an input port to maximize switch bandwidth with minimal impact on switch cycle time.

Concurrently with the transmission of a flit, the scheduling algorithm computes the set of virtual channels that will transmit a flit during the next flit cycle. The switch is then reconfigured according to the computed input-output port assignment and the next flit cycle starts. This process is repeated until the round is completed.

In summary, the MMR scheduler attempts to maximize the probability of assigning virtual channels to every output link during each flit cycle by using sets of candidates (4-8) at each input port and fast priority biasing schemes. Our inclination is to favor simpler, faster schedulers over more complex schedulers that might make better decisions but lengthen the switch cycle time.

5. Simulation Studies

This section provides some preliminary results for link and switch scheduling. The goal here was to determine whether the partitioning of scheduling functionality in the MMR (as shown in Figure 1), the use of large flits, and support for a large number of virtual channels could produce good jitter and delay characteristics in a cut-through router architecture. Encouraging results would point us in the direction for further refinement.
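As a rough illustration of the per-flit-cycle decision being studied, the following Python sketch performs input-driven arbitration over candidate sets. The greedy highest-priority-first matching, the data layout and the tie-breaking rule are assumptions made for this sketch, not the MMR arbiter itself.

```python
def schedule_flit_cycle(candidates):
    """Greedy input-driven arbitration for one flit cycle.

    candidates maps input port -> list of (priority, vc, output port),
    i.e. the candidates offered by that port's link scheduler.
    Returns a dict input port -> (vc, output port) in which no input
    port and no output port is used twice.
    """
    requests = []
    for in_port, offers in candidates.items():
        for priority, vc, out_port in offers:
            requests.append((priority, in_port, vc, out_port))
    # Highest priority first; ties broken by lower input port, then lower vc.
    requests.sort(key=lambda r: (-r[0], r[1], r[2]))

    assignment = {}
    used_outputs = set()
    for priority, in_port, vc, out_port in requests:
        if in_port in assignment or out_port in used_outputs:
            continue  # port conflict: this request waits for a later cycle
        assignment[in_port] = (vc, out_port)
        used_outputs.add(out_port)
    return assignment


# One candidate per input: both inputs want output 2, so input 1 stays idle.
one = {0: [(5.0, 3, 2)], 1: [(4.0, 7, 2)]}
# Two candidates at input 1: its second choice can use the free output 0.
two = {0: [(5.0, 3, 2)], 1: [(4.0, 7, 2), (1.0, 9, 0)]}
print(schedule_flit_cycle(one))
print(schedule_flit_cycle(two))
```

With a single candidate per input only one flit crosses the switch in this example, while the second candidate at input 1 lets both inputs transmit, which is the utilization effect attributed to larger candidate sets.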

Simulation experiments were conducted using a C++ discrete event simulator that models a single router. The following experiments represent an 8x8 router with 256 virtual channels per input port, 1.24 Gbps physical links and 128-bit flits. The behavior for slower link speeds, such as 622 Mbps and 155 Mbps, was qualitatively the same and therefore these latter results are excluded. The simulations were run until steady state was reached and statistics gathered over approximately 1, router cycles. Connections were randomly selected from the set (64 Kbps, 128 Kbps, 1.54Mbps, 2Mbps, 5Mbps, 1Mbps, 2Mbps, 55Mbps, 12Mbps) and assigned to random input and output ports on the router. The offered load is computed as the percentage of switch bandwidth demanded by all connections through the router.

These preliminary experiments were conducted on CBR connections and did not include best-effort and control messages. Admission control can guarantee that connections are established only if bandwidth is available on a link and the inter-arrival time on a connection is constant. This study was limited to CBR connections to enable easier interpretation of the interaction between link and switch scheduling and to gain some insight into potential bottlenecks. Delay is computed as the difference between the time a flit is ready to be transmitted through the switch and the time it actually leaves the switch. The jitter on a connection is defined as the difference in the delays of successive flits on a connection. The flit cycle time is determined by the physical link speed and the flit size.

5.1 Scheduling Algorithms

We have studied a simple link scheduling algorithm that uses a biased priority based on the ratio of the delay experienced by a flit at the switch to the inter-arrival time on the connection. This priority is recomputed for the head flit of every connection each flit cycle. High-speed connections clearly have their priorities grow at a faster rate.
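The biased priority described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that a CBR connection delivers one flit every (link rate / connection rate) flit cycles; the text only states that the priority is the ratio of delay to inter-arrival time.

```python
LINK_BPS = 1.24e9  # physical link rate used in the experiments


def inter_arrival_cycles(connection_bps, link_bps=LINK_BPS):
    """Flit cycles between successive flits of a CBR connection (assumed model)."""
    return link_bps / connection_bps


def biased_priority(delay_cycles, connection_bps):
    """Delay of the head flit divided by the connection's inter-arrival time."""
    return delay_cycles / inter_arrival_cycles(connection_bps)


# Two head flits that have both waited 10 flit cycles at the switch.
slow = biased_priority(10, 64e3)  # 64 Kbps connection
fast = biased_priority(10, 2e6)   # 2 Mbps connection
print(fast > slow)  # the faster connection has accumulated more priority
```

For equal waiting time, the priority scales with the connection rate, so a flit on a faster connection overtakes one on a slower connection sooner.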
For comparison purposes we include results from an algorithm that represents the scheduling in the Autonet switch [2,24]. This algorithm differs in how the candidates are selected at input links and in how conflicts for output ports are arbitrated. To capture ideal performance we also implemented a perfect switch to provide a lower bound on the delay and jitter, and an upper bound on switch utilization. When multiple inputs of a perfect switch request the same output port, the flits from these input ports are transmitted to that output port in one flit cycle. Effectively the switch internal bandwidth is N times the link bandwidth for an NxN switch. There are no port conflicts in a perfect switch and therefore no switch scheduling overheads.

5.2 Simulation Results

[Figure 3. Jitter vs. Offered Load: Fixed and Biased Priorities. Vertical axis: jitter (router cycles); 1.24 Gb link; series: 1 and 2 candidates, biased and fixed priorities.]

In general, we have observed that using a larger number of candidates is effective in increasing switch utilization, and that switch utilization is not significantly affected by the priority scheme. This is because, while priorities affect which flits are transmitted in a flit cycle, they do not directly affect how many inputs can be concurrently transmitting in the same cycle. In contrast, the jitter and delay characteristics of individual connections are very sensitive to the priority scheme. The jitter characteristics for fixed and biased priority schemes are illustrated in Figure 3. The vertical axis is expressed in router cycles (equivalently, flit cycles), where a cycle is the time to transmit a flit across the router or link. We chose to represent the jitter in terms of flit cycles since flits emerge from the network at flit cycle boundaries and jitter occurs as an integer number of flit cycles. With 1.24 Gbps links and 128-bit flits a flit cycle is approximately 103 ns. We see that priority biasing consistently performs better, especially with a higher number of candidates.
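The flit cycle time follows directly from the flit size and the link speed. A small Python check (the helper name is hypothetical) reproduces the value for the simulated configuration and the 64-128 ns range quoted in the concluding remarks for 1-2 Gbps links:

```python
def flit_cycle_ns(flit_bits, link_bps):
    """Time to transmit one flit on the link, in nanoseconds."""
    return flit_bits / link_bps * 1e9


print(round(flit_cycle_ns(128, 1.24e9), 1))  # about 103.2 ns for the simulated links
print(round(flit_cycle_ns(128, 2e9), 1))     # 64.0 ns at 2 Gbps
print(round(flit_cycle_ns(128, 1e9), 1))     # 128.0 ns at 1 Gbps
```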
These jitter values are averaged over a large range of connection speeds. Actual jitter values for high-speed connections will be even lower and those for low-speed connections will be relatively higher. While we may not be too concerned with relatively higher jitter values on a 64 Kbps connection, we expect that jitter values on a 1 Mbps connection will be of more concern. Thus, overall the proposed scheme appears very encouraging for use in multimedia routers.

[Plot: Jitter vs. Offered Load. Vertical axis: jitter (router cycles); 1.24 Gb link; series: 4 and 8 candidates, biased and fixed priorities.]
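Using the definitions given at the start of this section, per-connection delay and jitter can be computed from flit timestamps as in the sketch below. The function names and the sample timestamps are made up for illustration, and taking the absolute value of the delay difference is an assumption; the text only speaks of "the difference in the delays of successive flits".

```python
def delays(ready_times, departure_times):
    """Delay of each flit: departure from the switch minus ready time."""
    return [d - r for r, d in zip(ready_times, departure_times)]


def jitters(delay_list):
    """Absolute difference between the delays of successive flits."""
    return [abs(b - a) for a, b in zip(delay_list, delay_list[1:])]


def mean(values):
    """Average of a list, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical timestamps, in flit cycles, for four flits of one connection.
ready = [0, 10, 20, 30]
departed = [1, 12, 21, 33]
d = delays(ready, departed)
j = jitters(d)
print(d, j, mean(j))
```

Averaging such per-connection values over connections of very different speeds gives aggregate figures of the kind discussed above, which is why individual fast and slow connections can deviate from the plotted mean.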

[Figure 4. Delay vs. Offered Load: Fixed and Biased Priorities. Vertical axis: delay (microseconds); 1.24 Gb link; panels: 1 and 2 candidates, 4 and 8 candidates; biased and fixed priorities.]

Figure 4 illustrates the behavior of delay expressed in microseconds as a function of offered load (note the plots for 1 and 2 candidates are clipped to avoid scaling problems). It is apparent that prior to saturation the delay characteristics using a biased priority are consistently better than those of the fixed priority scheme. The differences are particularly pronounced in the region just prior to saturation. For example, with two candidates and at 7% load, the biased scheme produces an average delay of .82 microseconds while with fixed priority we have ~5 microseconds. With 8 candidates, delays for biased priorities are consistently in the range of .4-.6 microseconds while the fixed priorities realize delays on the order of 1-2 microseconds. Saturation does not appear to occur before 95% load.

Figure 5 compares the delay and jitter characteristics of four algorithms. The plots for biased and fixed priority use 8 candidates. The results are quite favorable with the use of 8 candidates, closely tracking the performance of the perfect switch. While the Autonet algorithm realizes very good jitter characteristics at high loads (>8%), the biased priority scheme maintains extremely low jitter values ranging from .168 router cycles at 8% load to .51 router cycles at 95% load.

[Figure 5. Delay and Jitter vs. Offered Load: Fixed and Biased Priorities, Autonet, Perfect Switch. Vertical axes: delay (microseconds) and jitter (router cycles); 1.24 Gb link; series: biased, fixed, DEC, perfect.]

6. Concluding Remarks

The primary (longer term) objective of the Multimedia Router (MMR) project is the design and implementation of a single-chip router optimized for multimedia applications. This paper focused on delineating the initial trade-offs that were made and on the scheduling framework being employed. Targeting 1-2 Gbps links and 128-bit flit sizes, the crossbar must be capable of computing switch settings once every 64 ns-128 ns. At the heart of the proposed algorithm is a dynamic priority update or priority biasing scheme. Other novel features/goals of the MMR include use of flow control for dynamic bandwidth management, fixed-size buffers for VBR traffic, cache-like memory design for the virtual channel memory, and the partitioning of switch and link scheduling functionality.

Preliminary performance evaluation results indicate that the use of biased priorities is consistently better below switch saturation. These numbers are predicated on the availability of a router that can schedule a switch in a single cycle. In the example here this would imply a cycle time of 103 ns, which is matched to the rate at which 128-bit quantities arrive on a 1.24 Gbps link. The MMR architecture was formulated to permit concurrency and pipelining. The link schedulers can all operate in parallel and be pipelined with the switch scheduler. While we have been concerned with the study of the functional behavior of the switch scheduler, we now turn our attention to supporting VBR traffic and best-effort traffic, and to the hardware implementation needed to meet these timing constraints.

7. References

[1] J. D. Allen, et al., Ariadne - an adaptive router for fault-tolerant multicomputers, Proceedings of the 21st International Symposium on Computer Architecture, April 1994.

[2] T. E. Anderson, et al., High speed switch scheduling for local area networks, Technical Report SRC research report 99, DEC. Also in ACM Transactions on Computer Systems, November.
[3] B. V. Dao, J. Duato and S. Yalamanchili, Configurable flow control mechanisms for fault-tolerant routing, Proceedings of the 22nd International Symposium on Computer Architecture, June.
[4] S. Balakrishnan and F. Ozguner, A priority-based flow control mechanism to support real-time traffic in pipelined direct networks, Proceedings of the 1996 International Conference on Parallel Processing, vol. I, August.
[5] N. J. Boden, et al., Myrinet - A gigabit per second local area network, IEEE Micro, pp. 29-36, February.
[6] J. Carbonaro and F. Verhoorn, Cavallino: The teraflops router and NIC, Proceedings of Hot Interconnects Symposium IV, August.
[7] A. Chien and J. H. Kim, Approaches to Quality of Service in High Performance Networks, Lecture Notes in Computer Science: Proceedings of the Workshop on Parallel Computer Routing and Communication, Springer Verlag (pubs.), pp. 1-2, June.
[8] A. A. Chien, A cost and speed model for k-ary n-cube wormhole routers, Proceedings of Hot Interconnects 93, August.
[9] W. J. Dally, Virtual-channel flow control, IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 2, March.
[10] W. J. Dally, et al., The Reliable Router: A reliable and high-performance communication substrate for parallel computers, Lecture Notes in Computer Science: Proceedings of the Workshop on Parallel Computer Routing and Communication, Springer Verlag (pubs.), May.
[11] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 12, December.
[12] J. Duato, P. Lopez, F. Silla and S. Yalamanchili, A high performance router architecture for interconnection networks, Proceedings of the 1996 International Conference on Parallel Processing, vol. 1, August.
[13] J. Duato, S. Yalamanchili, B. Caminero, D. Love, F. J. Quiles, MMR: Architecture and Trade-offs in a High Performance Multimedia Router, Technical Report, Computer Architecture and Systems Laboratory, Georgia Institute of Technology, available from
[14] D. Ferrari and D. Verma, A scheme for real-time channel establishment in wide-area networks, IEEE Journal on Selected Areas in Communications, April 1990.
[15] M. L. Fulgham and L. Snyder, A comparison of input and output driven routers, Proceedings of Euro-Par 96, vol. 1, August.
[16] D. Garcia and W. Watson, ServerNet II, Lecture Notes in Computer Science: Proceedings of the Workshop on Parallel Computer Routing and Communication, Springer Verlag (pubs.), June.
[17] P. T. Gaughan and S. Yalamanchili, A family of fault-tolerant routing protocols for direct multiprocessor networks, IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 5, May.
[18] R. Horst, TNet: A Reliable System Area Network, IEEE Micro, February.
[19] M. G. H. Katevenis, et al., ATLAS I: A single-chip ATM switch for NOWs, Proceedings of the Workshop on Communications and Architectural Support for Network-based Parallel Computing, February.
[20] J. H. Kim, Bandwidth and latency guarantees in low-cost, high-performance networks, Ph.D. Dissertation, University of Illinois at Urbana-Champaign.
[21] A. Mekkittikul and N. McKeown, A practical scheduling algorithm to achieve 100% throughput in input-queued switches, Proceedings INFOCOM, April.
[22] M. Prycker, Asynchronous transfer mode: solution for broadband ISDN, Ellis Horwood Limited, Chichester, West Sussex, PO19 1EB, England.
[23] J. Rexford, J. Hall and K. G. Shin, A Router Architecture for Real-Time Point-to-Point Networks, Proceedings of the International Symposium on Computer Architecture, May.
[24] M. D. Schroeder, et al., Autonet: A high-speed, self-configuring local area network using point-to-point links, Technical Report SRC research report 59, DEC, April 1990.
[25] S. L. Scott and G. Thorson, Optimized routing in the Cray T3D, Lecture Notes in Computer Science: Proceedings of the Workshop on Parallel Computer Routing and Communication, Springer Verlag (pubs.), May.
[26] F. Silla, et al., Efficient adaptive routing in networks of workstations with irregular topology, Lecture Notes in Computer Science: Proceedings of the Workshop on Communications and Architectural Support for Network-based Parallel Computing, February.
[27] F. Silla and J. Duato, Improving the efficiency of adaptive routing in networks with irregular topology, Proceedings of the 1997 Conference on High Performance Computing, December.
[28] C. B. Stunkel, et al., The SP2 high-performance switch, IBM Systems Journal, vol. 34, no. 2, February.
[29] Y. L. Chen and J.-C. Liu, A hybrid interconnection network for integrated communication services, Proceedings of the 11th International Parallel Processing Symposium, April 1997.

