Suez: A Cluster-Based Scalable Real-Time Packet Router

Tzi-cker Chiueh, Prashant Pradhan

Computer Science Division, EECS, University of California at Berkeley, Berkeley, CA
Computer Science Department, State University of New York at Stony Brook, Stony Brook, NY
chiueh@cs.berkeley.edu

Abstract

Suez is a high-performance real-time packet router that supports fast best-effort packet routing and scalable QoS-guaranteed packet scheduling, and is built on a hardware platform consisting of a cluster of commodity PCs connected by a gigabit/sec system area network. The major goal of the Suez project is to demonstrate that the PC cluster architecture can be as cost-effective a platform for high-performance network packet routing as it is for parallel computing. Suez features a cache-conscious routing-table search algorithm that exploits CPU caching hardware for fast lookup by treating IP addresses directly as virtual addresses. To scale the number of supportable real-time connections with the link speed, Suez implements a fixed-granularity fluid fair queuing (FGFFQ) algorithm that completely eliminates the per-packet sorting overhead associated with conventional weighted fair queuing algorithms. With a general-purpose CPU as the main processing engine, Suez is inherently a programmable network device that can be dynamically extended with functionality important to end-user applications or to the efficiency of the entire network. This paper describes in detail the architectural features of Suez, and reports performance measurements of the initial Suez prototype, which is built on four Pentium-II 400MHz machines and Myrinet, running the Linux operating system.

Keywords: Routing Table Lookup, Packet Scheduling, Efficient Data Movement, Cluster Computing, Extensible Systems, IP Router

Technical Areas: Clustered Architecture, Internet Computing, Distributed Operating Systems

1 Introduction

High-performance packet routers that support guaranteed Quality of Service (QoS) are a crucial building block for the next-generation Internet. Given the exponentially growing Internet traffic, the increasing number of distributed multimedia applications, and the constant demand for new protocols to accommodate emerging network applications, next-generation Internet routers should satisfy the following requirements:

- Non-real-time network packets should be delivered at wire speed.
- QoS reservations made by real-time connections should be honored regardless of the traffic load.
- A flexible service extension mechanism should be provided to extend the core router services in an efficient and safe way.

Although a large amount of research effort from industry has been invested in building high-performance real-time packet routers, all of these projects, when it comes to actual implementations, are based on proprietary, custom-designed hardware. The goal of the Suez project is to demonstrate that, just as PC clusters are a cost-effective platform for parallel computing, it is technically feasible to build high-performance real-time packet routers with optimized software running on commodity PC hardware.

A packet router typically consists of four components: link controllers, buffer memory, a control processor, and the switching backplane. A Suez router comprises multiple Pentium-II PCs, connected to a scalable Myrinet switch through full-duplex 1.2 Gbits/sec links. One of the PCs is a dedicated control processor that handles all administrative functions, such as real-time connection establishment through RSVP, routing-table updates, and network monitoring and management. Each of the remaining PCs, in addition to being connected to the Myrinet switch, also connects to the outside world, and thus plays the role of an input/output link controller: it buffers incoming packets, performs routing-table lookup or real-time packet classification, forwards packets to the corresponding output ports, and schedules packets so that their respective QoS requirements are met. The main memory on the PCs buffers input traffic before forwarding decisions are made, typically in the form of per-connection output queues. Suez's routing backplane is the Myrinet switch, which boasts a 10 Gbits/sec switching bus.

When a packet arrives at one of the input ports of a network router, it is first buffered, then forwarded to the corresponding output port across the router backplane, buffered at the output queue, and eventually scheduled onto the output link for transmission. Accordingly, Suez features two novel technologies to streamline this process, both designed to exploit to the fullest extent the raw hardware capabilities of the underlying general-purpose CPU and I/O subsystem. To speed up routing-table lookup, which maps to a longest-prefix-match problem, Suez takes advantage of the cache memory hierarchy in modern PC hardware for aggressive host address caching, and recasts the routing-table search algorithm into a series of direct table-indexing steps, again exploiting the processor cache to trade memory for performance. To improve the scalability of real-time packet scheduling with increasing link speeds, Suez employs a new scheduling algorithm called Fixed Granularity Fluid Fair Queuing (FGFFQ), which relies on a special link-layer multiplexing/demultiplexing protocol to emulate fluid fair queuing at a fixed granularity. FGFFQ eliminates the

overhead of packetized weighted fair queuing (WFQ), in particular virtual/physical time mapping and priority sorting. In addition, FGFFQ provides a unified framework for both work-conserving and non-work-conserving disciplines, and supports decoupling of bandwidth and delay guarantees.

As new distributed applications emerge, more and more intelligence needs to be embedded into the network. Instead of hard-coding these advanced functionalities directly into network devices, a more general solution is to provide an open platform on which new protocols can be easily developed and quickly deployed. Suez's underlying operating system, in addition to being optimized for a high-performance real-time packet router, also supports a flexible programming interface that allows extension of the core protocols in a safe and efficient manner. Specifically, Suez exposes the protocol structure of the network subsystem so that user-level extensions can be attached to the existing protocol graph. A protection mechanism based on segmentation hardware ensures that user-level extensions are sandboxed within an area separate from the kernel, without resorting to separate processes or in-line checks on every store instruction. Finally, Suez employs a unified data-buffer management mechanism for the different modules within the kernel and for user extensions, so that no memory-to-memory copying of the packet payload is ever required. Due to space constraints, the special OS support of Suez is not covered in this paper.

The rest of this paper is organized as follows. In Section 2, we review previous and current efforts in building gigabit routers, and compare their approaches with ours. In Sections 3 and 4, we describe the CPU-cache-conscious routing-table lookup and caching algorithm, and the FGFFQ packet scheduling algorithm, respectively. Sections 5 and 6 detail the current Suez prototype implementation and the results of a performance study. Section 7 concludes this paper with the current implementation status of the Suez router and the on-going plan for further improvements.

2 Related Work

BBN's MGR project [1] produced a high-performance router that achieves a 32 million packets per second IP forwarding rate using a 50 Gbits/sec backplane over 16 ports. Although MGR's forwarding engine is based on a general-purpose processor, the DEC Alpha 21164, its backplane and QoS processors are still based on proprietary designs. Washington University's IP over ATM project [4] takes an approach similar to IP switching [3], setting up an ATM connection for long-lived IP connections. That work is similar to Suez in that link-layer transmissions are based on fixed-sized data units, which greatly simplifies real-time packet scheduling. The scalable IP router project at Michigan State University [2] is based on the same hardware/software platform as Suez. However, that project's focus is on parallelizing route computation to address route-instability problems, rather than fast routing-table lookup or QoS provisioning. Similar to IP switching, Suez employs off-the-shelf high-performance interconnects for packet forwarding, except that the switching platform is a system area network rather than an ATM switch. In essence, the Suez project pushes the notion of a software-based router to the extreme by completely avoiding custom hardware, and aims to demonstrate that the proposed approach can gain the best of both worlds: excellent cost-effectiveness in performance and flexible extensibility.

In the fast routing-table lookup area, Waldvogel et al.
[7] developed a lookup scheme using multiple hash tables, each based on a distinct prefix length. The worst-case lookup time is shown to be log2(number of address bits). This work also introduced a Mutating Binary Search technique to

further reduce the required number of hash-table lookups. Degermark et al. [8] developed a compact routing-table representation to ensure that the entire representation fits within a typical L2 cache. They estimated that each lookup can be completed within 100 instructions, using 8 memory references. Gupta et al. [9] proposed a similar routing-table lookup scheme, which assumes two instead of three levels, and focused more on efficient route updates than on lookup. Compared to previous schemes, Suez's routing-table organization is much simpler and the lookup operation is thus more efficient. None of the previous works exploits the CPU cache available in modern processors, as Suez does. In addition, we believe Suez's routing-table lookup scheme is the first that integrates routing-table caching and searching in a single algorithm.

In the real-time packet scheduling area, WFQ was discussed in [10], and an implementation and simulation study was reported in [11], which discussed the performance cost of the priority sorting required in WFQ but mainly considered insertion costs in the underlying data structures. Virtual Clock [13] assigns virtual service times independent of the system load by attempting to emulate a static TDM system; hence the priority calculation in Virtual Clock is simple. However, as shown in [12], Virtual Clock can punish packets for over-using the link during underload, even if they do not affect the guarantees of other active connections later. The work in [14] attempts to do away with the virtual-time computation overhead by using the finish time of the packet currently in service to approximate the current virtual time, at the expense of looser delay bounds. [15] uses a similar idea, taking the start tag of the packet currently in service to approximate the current virtual time. This does away with the large delay bounds of SCFQ; however, the sorting overhead remains. None of these works, except perhaps [11], focused on how to scale packet scheduling techniques to terabit/sec link speeds or a million connections per output link, which is one of the major research focuses of the Suez project.

3 Cache-Conscious Routing-Table Lookup Algorithm

Because Suez is built on general-purpose CPUs, its routing-table lookup algorithm [26] heavily leverages the cache memory hierarchy. The algorithm is based on two data structures: a destination host address cache (HAC), and a destination network address routing table (NART). Both are designed to use the CPU cache efficiently. The algorithm first looks up the HAC to check whether the given IP address is cached there because it has been seen recently. If so, the lookup succeeds and the corresponding output port is used to route the associated packet. If not, the algorithm consults the NART to complete the lookup.

3.1 Host Address Caching

To explain how the HAC lookup algorithm works, we make the following assumptions. Assume the L1 data cache is direct-mapped and physically addressed, with 16 KBytes total capacity and 32-byte blocks. Also assume that the virtual memory page size is 4 KBytes, and that each HAC entry is 4 bytes wide, containing a 23-bit tag, one unused bit, and an 8-bit output port identifier. Each cache block therefore contains at most 8 HAC entries. Finally, assume one quarter of the L1 cache, i.e., 128 out of 512 cache sets, is reserved for the HAC.
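As an illustration of this entry layout, the following minimal C sketch encodes one 4-byte HAC entry under the assumptions above. The field packing (tag in the upper 23 bits, output port in the lowest 8, one unused bit between) follows the lookup description below, where the tag is compared against bits 9-31 of a cache word; the helper names are ours, not Suez's.

    #include <stdint.h>

    /* One 4-byte HAC entry: bits 31..9 hold the 23-bit tag, bit 8 is
     * unused, bits 7..0 hold the output port identifier. */
    typedef uint32_t hac_entry_t;

    static inline hac_entry_t hac_make(uint32_t tag23, uint8_t port)
    {
        return ((tag23 & 0x7FFFFFu) << 9) | port;
    }

    static inline uint32_t hac_tag(hac_entry_t e)  { return e >> 9; }
    static inline uint8_t  hac_port(hac_entry_t e) { return (uint8_t)(e & 0xFFu); }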

Figure 1: The data flow of HAC lookup. Here we assume that the L1 cache is direct-mapped with 512 32-byte cache sets, among which 128 cache sets are reserved for the HAC. Each HAC entry is 4 bytes wide. The search through the HAC entries within a cache set is performed by software.

Given a destination IP address DA, the DA[5,11] portion (we use the notation X[n,m] to denote the bit subsequence of X ranging from the n-th bit to the m-th bit) is extracted, with 0's padded at both ends, to form a 32-bit virtual address VA, which is then used to access the cache. VA always falls within the 0-th virtual page, and will eventually access the 0-th physical page and thus the first 4 KBytes of the L1 cache, assuming that the 0-th virtual page is mapped to the 0-th physical page. The VA thus formed fetches the first word, CW, of the cache set specified by DA[5,11]. Meanwhile, a 23-bit tag, DAT, is constructed by concatenating DA[12,31] and DA[2,4]. The first word of each HAC cache set is a duplicated HAC entry that corresponds to the HAC entry last accessed in the previous visit to the cache set. If DAT matches CW[9,31], the lookup is complete and CW[0,7] is the output port identifier. Otherwise, the software search continues with the second word of the cache set, and so on, until a hit or the end of the set.

The way VA is formed from DA means that only one virtual page is allocated to hold the entire HAC. As a result, only one page table entry and thus one TLB entry is required. This setup guarantees that the performance overhead in HAC lookup due to TLB misses is zero. After the lookup is completed, a new HAC entry for the destination IP address is constructed, and a word from the corresponding cache set is allocated for it if such an HAC entry does not exist yet. Finally, this HAC entry is copied to the first word of the cache set if necessary. When a cache word is chosen for replacement, the associated HAC entry is simply lost. Because the HAC contents are generated dynamically from looking up the NART, it is safe to overwrite HAC entries without keeping a copy of them. Figure 1 illustrates the flow of the HAC lookup process.
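A minimal sketch of this lookup path, assuming the geometry above (128 reserved sets, 8 entries per set) and a hypothetical hac_base pointer at which the single HAC virtual page is mapped; the duplicated-first-word optimization is omitted, and the 0xFF miss sentinel is our assumption, so this illustrates the indexing scheme rather than reproducing the Suez source:

    #include <stdint.h>

    #define HAC_SETS       128    /* cache sets reserved for the HAC  */
    #define WORDS_PER_SET    8    /* 32-byte set / 4-byte HAC entries */
    #define HAC_MISS      0xFF    /* sentinel: fall back to the NART  */

    /* hac_base: hypothetical pointer to the one virtual page holding
     * the HAC, assumed mapped into the reserved cache sets. */
    extern uint32_t *hac_base;

    uint8_t hac_lookup(uint32_t da)
    {
        uint32_t set = (da >> 5) & 0x7F;          /* DA[5,11] picks the set   */
        uint32_t tag = ((da >> 12) << 3)          /* DA[12,31] ...            */
                     | ((da >> 2) & 0x7);         /* ... concatenated DA[2,4] */
        uint32_t *cw = hac_base + set * WORDS_PER_SET;

        for (int i = 0; i < WORDS_PER_SET; i++)   /* software set search      */
            if ((cw[i] >> 9) == tag)              /* tag match on CW[9,31]    */
                return (uint8_t)(cw[i] & 0xFF);   /* port from CW[0,7]        */

        return HAC_MISS;                          /* consult the NART         */
    }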

The HAC algorithm gives a 5-cycle best-case latency for an IP address lookup, and takes 3 cycles for each additional HAC entry accessed in the cache set, so in the worst case it takes 27 cycles to detect a HAC miss. By pipelining the lookups of consecutive IP packets, i.e., overlapping the 5-cycle work of successive lookups, the best-case throughput can be improved to 3 cycles per lookup. Because the HAC is guaranteed to be in the L1 cache, the 3-cycle-per-lookup estimate is exact, as long as the HAC lookup itself is a hit.

3.2 Network Address Routing Table

If the HAC access results in a miss, a full-scale network address routing table lookup is required. The guiding principle of Suez's NART design is to trade table size for lookup performance. This principle is rooted in the observation that the L2 cache of modern microprocessors is comparatively large for storing routing tables, and is expected to keep growing over time. For example, one MByte of L2 cache is the norm in mid-range PCs today. On the other hand, the routing table of an Internet backbone router has about 100,000 entries. Assuming each entry takes 8 bytes, the entire routing table takes only 800 KBytes, i.e., it fits comfortably into a PC's L2 cache. Of course, such a comparison is over-simplified, because the 800-KByte estimate assumes a sequential search procedure and thus does not include any index data structure to speed up the lookup process.

NART lookup is difficult to speed up because, given an IP address, it is not possible to determine a priori its corresponding network address, which is the most significant N bits of the IP address, where N is unknown beforehand. With the increasing threat of IP address depletion, a technique called Classless InterDomain Routing (CIDR) is currently in use that allows more efficient allocation of contiguous address chunks in the IP address space. In contrast to the original Class A/B/C scheme, in which N can take only one of three possible values (8, 16, and 24), CIDR allows N to range from 1 to 29. This generality complicates routing-table lookup because it significantly increases the number of possible network addresses.

NART Construction

The network addresses in an IP routing table are classified into three types according to their length: smaller than or equal to 16 bits (called Class AB), between 17 and 24 bits (called Class BC), and between 25 and 30 bits (called Class CD). To represent a given routing table, Suez's NART includes three levels of tables: one Level-1 table, one Level-3 table, and a variable number of Level-2 tables. Entries in these tables are denoted in this paper as Level1-Table[.], Level2-Table[.], and Level3-Table[.], respectively. The Level-1 table has 64K entries, each of which is 2 bytes wide and contains either an 8-bit output port identifier or an offset pointer to a Level-2 table; these two cases are distinguished by the entry's most significant bit. Each Level-2 table has 256 entries, each of which is 1 byte wide and contains either a special indicator (written ⊥ below) or an 8-bit output port identifier. The Level-3 table has a variable number of entries, each of which contains a 4-byte network mask field, a 4-byte network address, and an 8-bit output port identifier; the number of entries depends on the given IP routing table and is typically small.

To convert an IP routing table to the NART representation, Class AB addresses are processed first, then Class BC addresses, followed by Class CD addresses. Each Class AB network address NA of length L in the routing table takes 2^(16-L) entries of the Level-1 table, starting from Level1-Table[NA[0,L-1] * 2^(16-L)], with each of these entries filled with NA's output port identifier, as specified in the IP routing table.

Figure 2: The data flow of the NART lookup. It starts with the Level-1 table and, if necessary, looks up the Level-2 table pointed to by Level1-Table[DA[16,31]]. If the chosen Level-2 table entry still cannot resolve DA, a sequential search through the Level-3 table is needed. Note that the NART has only one Level-1 table and one Level-3 table, but many Level-2 tables.

After all Class AB addresses are processed, the unfilled Level-1 table entries are filled with the default output port identifier.

For each Class BC address NA of length L in the IP routing table, if Level1-Table[NA[L-16,L-1]] contains an 8-bit output port identifier, a Level-2 table is created: Level1-Table[NA[L-16,L-1]] is put aside as OPI and changed to an offset from which the base of the newly created Level-2 table can be computed, and every entry in the new Level-2 table is initialized to OPI. If Level1-Table[NA[L-16,L-1]] already contains an offset to a Level-2 table, no table is created. In both cases, 2^(24-L) entries of the Level-2 table, starting from Level2-Table[NA[0,L-17] * 2^(24-L)], are filled with NA's output port identifier, as specified in the IP routing table.

For each Class CD address NA of length L, if Level1-Table[NA[L-16,L-1]] contains an 8-bit output port identifier, a Level-2 table is created: Level1-Table[NA[L-16,L-1]] is put aside as OPI1 and changed to an offset from which the base of the newly created Level-2 table can be computed, and every entry in the new Level-2 table is initialized to OPI1. If Level1-Table[NA[L-16,L-1]] already contains an offset to a Level-2 table, this Level-2 table is used in the next step. If Level2-Table[NA[L-24,L-17]] is not ⊥, Level2-Table[NA[L-24,L-17]] is put aside as OPI2 and changed to ⊥, and a new Level-3 table entry with a 24-bit network mask, a network address of NA[L-24,L-1], and an output port identifier of OPI2 is created. Regardless of the content of Level2-Table[NA[L-24,L-17]], a Level-3 table entry with NA's network mask, network address, and output port identifier is added to the global Level-3 table.
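To make the three-level organization concrete, here is a hedged C sketch of the table declarations and the Class AB fill step. The entry encodings (the top bit as the port/offset discriminator, a zero byte standing in for the ⊥ indicator, a fixed pool of Level-2 tables) are illustrative assumptions, not the exact Suez encoding:

    #include <stdint.h>

    #define PORT_ENTRY(p)  ((uint16_t)(p))              /* MSB 0: output port     */
    #define L2_PTR(off)    ((uint16_t)(0x8000 | (off))) /* MSB 1: Level-2 offset  */
    #define BOT            0x00                         /* stand-in for ⊥         */

    static uint16_t level1[1 << 16];     /* one Level-1 table, 64K x 2 bytes      */
    static uint8_t  level2[256][256];    /* Level-2 tables (fixed pool here only
                                          * for illustration; Suez allocates a
                                          * variable number on demand)            */

    /* Fill step for a Class AB prefix: a prefix `na` of length L (<= 16)
     * occupies 2^(16-L) consecutive Level-1 entries starting at
     * Level1-Table[NA[0,L-1] * 2^(16-L)], each holding the prefix's port. */
    void fill_class_ab(uint32_t na, int len, uint8_t port)
    {
        uint32_t span  = 1u << (16 - len);
        uint32_t start = na * span;
        for (uint32_t i = 0; i < span; i++)
            level1[start + i] = PORT_ENTRY(port);
    }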

NART Lookup

Given a destination IP address DA, the most significant 16 bits, DA[16,31], are used to index into the Level-1 table. The exact address used to index the Level-1 table is Base1 + 2 * DA[16,31], where Base1 is the base address of the Level-1 table. If DA has a Class AB network address, one lookup into the Level-1 table is sufficient to complete the routing-table search. If DA has a Class BC address, Level1-Table[DA[16,31]] contains an offset from which the base of the Level-2 table is computed: the corresponding Level-2 table's base is Base2 + 256 * Level1-Table[DA[16,31]], where Base2 is the base of an address region specifically reserved for all Level-2 tables. If Level2-Table[DA[8,15]] is not ⊥, the lookup is complete. Otherwise the network address of DA is a Class CD address, and a sequential search through the Level-3 table is required. We chose to use a single global Level-3 table, rather than multiple Level-3 tables in the style of the Level-2 tables, because it reduces the storage-space requirement and because it allows Level-2 table entries to specify up to 255 output port identifiers. The search through the Level-3 table is based on DA[0,31] and is done sequentially. Because the Level-3 table entries are sorted by decreasing network-address length, the linear search can stop at the first match. As will be shown in Section 5, the performance overhead of this sequential search is negligible. Figure 2 illustrates the data flow of the NART lookup process.

If DA's network address is a Class AB address, the lookup takes 7 cycles to complete. If it is a Class BC address, the lookup takes 12 cycles. If it is a Class CD address, the lookup takes 20+4K cycles, where K is the number of Level-3 table entries the algorithm has to examine before the first match. Unlike the HAC lookup, these cycle estimates are not precise, because there is no guarantee that the memory accesses in an NART lookup always result in cache hits.

4 Fixed-Granularity Fluid Fair Queuing

The most well-known real-time packet scheduling technique is Weighted Fair Queuing (WFQ), which allocates the output-link bandwidth among competing network connections in proportion to their bandwidth reservations. Each connection that goes through an output link is assumed to have a per-connection queue associated with that output. For each incoming packet, WFQ computes the virtual finish time at which it should be sent out on the corresponding output link, and puts it into the corresponding queue. Whenever an output link is free, WFQ selects for transmission the packet with the earliest virtual finish time among the per-connection queues at the output. Specifically, the virtual finish time of the i-th packet of a real-time connection whose bandwidth reservation is W bits per virtual time unit is calculated by the following formula:

    VirtualFinishTime(i) = MAX(VirtualArrivalTime(i), VirtualFinishTime(i-1)) + PacketSize(i)/W    (1)

where VirtualArrivalTime(i) and PacketSize(i) are the virtual arrival time and the size of the i-th packet, respectively.
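As a small illustration of formula (1) (a hedged sketch; the names are ours, not from any Suez or WFQ code), the per-packet update is just a max and an add; the cost WFQ pays is in ordering the resulting timestamps, not in computing them:

    /* Per-connection WFQ state: last assigned virtual finish time and
     * the reservation W in bits per virtual time unit. */
    struct wfq_conn {
        double last_finish;   /* VirtualFinishTime(i-1) */
        double w;             /* reservation W          */
    };

    /* Formula (1): stamp the i-th packet of a connection. */
    double wfq_stamp(struct wfq_conn *c, double virt_arrival, double pkt_bits)
    {
        double start = virt_arrival > c->last_finish ? virt_arrival
                                                     : c->last_finish;
        c->last_finish = start + pkt_bits / c->w;
        return c->last_finish;   /* packets are then sorted by this value */
    }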

The scalability of WFQ is limited by two operations that must be performed for each packet. First, mapping a packet's physical arrival time to its virtual arrival time requires constant tracking of the set of active network connections, i.e., those whose queues are non-empty. Second, ordering the head packets of an output link's per-connection queues incurs an O(log N) sorting overhead, where N is the number of active network connections. To route one million packets per second, the total delay of both operations has to be smaller than 1 microsecond.

Suez takes a different scheduling approach, called the Fixed-Granularity Fluid Fair Queuing (FGFFQ) packet scheduling algorithm, which completely eliminates the need for virtual-to-physical time mapping and packet sorting, and thus can be shown to be more scalable than WFQ. Moreover, FGFFQ is more suitable for hardware implementation than WFQ. The key observation behind FGFFQ is that the overheads of current WFQ implementations result from the assumption that a network-layer packet has to be transferred from one router to another as an indivisible unit. FGFFQ does away with this assumption by introducing a multiplexing/demultiplexing mechanism above the link-layer protocol. As a result, FGFFQ can physically transfer partial network-layer packets from router to router, and rely on the multiplexing/demultiplexing mechanism to logically forward entire network-layer packets. This flexibility allows FGFFQ to emulate the fluid fair queuing model directly by sending a data quantum from each active connection in round-robin fashion, with the quantum sizes determined according to the connections' bandwidth reservations.

4.1 The Scheduling Model

In addition to addressing the implementation overhead problem of WFQ, FGFFQ is designed to support two additional features not available in WFQ. First, FGFFQ supports both work-conserving and non-work-conserving scheduling disciplines. It is known that, at the expense of link under-utilization in some cases, a non-work-conserving scheduling discipline reduces the burstiness of packet sequences, and thus the router buffer requirement. Second, FGFFQ allows real-time applications to decouple their bandwidth and delay requirements. That is, applications can minimize the end-to-end packet latency without reserving a much larger bandwidth than needed, as they must in WFQ.

The performance guarantee requirement of a real-time connection is specified as a triple (H, W, C), which means the packet scheduler transmits H bits from this connection in each of W scheduling periods within a cycle of C periods, where C should be larger than or equal to W. Assuming each scheduling period is D seconds long, the bandwidth requirement of this connection is therefore HW/(CD) bits/sec, and the link, at speed B bits/sec, can transmit at most D*B bits within one scheduling period. Assume that the i-th active connection of an output link has a reservation of (H_i, W_i, C_i), and two counter variables, WCounter_i and CCounter_i, initialized to W_i and C_i, respectively. In each scheduling period, the packet scheduler visits each active connection on the output link, and transmits H_i bits if WCounter_i is greater than 0. After WCounter_i reaches 0, the packet scheduler skips the i-th connection until CCounter_i reaches zero.
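In implementation terms, the per-connection scheduler state amounts to the reservation triple plus the two counters; a minimal C sketch, with field names of our choosing:

    /* Per-connection FGFFQ state: the (H, W, C) reservation plus the
     * two counters the scheduler decrements, as described above. */
    struct fgffq_conn {
        unsigned h;          /* quantum size: bits sent per eligible period */
        unsigned w, c;       /* W periods of service per cycle of C periods */
        unsigned wcounter;   /* initialized to W                            */
        unsigned ccounter;   /* initialized to C                            */
    };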
The FGFFQ scheduling algorithm is as follows.

    while (1) {
        /* In each scheduling period the packet scheduler visits
           each active real-time connection */
        for (i = 1; i <= N; i++) {
            if (WCounter_i > 0) {
                if (Queue_i is not empty) {
                    Transmit H_i bits;
                    WCounter_i--;
                }
            }
            CCounter_i--;
            if (CCounter_i <= 0) {
                if (Queue_i is not empty) {
                    WCounter_i = W_i;
                    CCounter_i = C_i;
                }
            }
        }
        Transmit non-real-time data or loop idly
            until the current scheduling period ends;
    }

where N is again the number of active real-time network connections, and Queue_i is the per-connection queue for the i-th connection. There is a separate best-effort connection queue for all non-real-time packets. Typically, packets from the i-th connection are of size H_i * W_i bits. In the above algorithm, a real-time connection can send out at most H_i * W_i bits within a cycle of C_i scheduling periods; unused bandwidth in one cycle cannot be carried over to the following cycles. After servicing all the active real-time network connections in each scheduling period, FGFFQ's packet scheduler spends the rest of the period either transmitting non-real-time (best-effort) traffic, or simply looping idly if there is no non-real-time data to send. Because the packet scheduler chooses to service non-real-time data or sit idle even when there is real-time data awaiting transmission, FGFFQ is non-work-conserving. FGFFQ is non-work-conserving also in the sense that when WCounter_i is zero, the packet scheduler does not consider the i-th connection for transmission, even if Queue_i is not empty, until CCounter_i reaches zero.

If a real-time connection is not concerned with packet latency, then C is usually chosen to be equal to W. If packet latency needs to be minimized, W is set smaller than C. In this case, the connection distinguishes between its instantaneous bandwidth requirement, which is set for reduced latency, and its average bandwidth requirement, which corresponds to the long-term bandwidth need. Given a performance specification (H, W, C), the instantaneous bandwidth is H/D, whereas the average bandwidth is HW/(CD): packets are physically transmitted at the rate H/D, although the logical long-term bandwidth consumption is HW/(CD).

A major advantage of the bandwidth/delay requirement decoupling supported in FGFFQ is that, unlike in WFQ, low-delay, low-bandwidth real-time connections can be serviced at high instantaneous rates without over-reserving the link bandwidth.
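As a worked illustration with hypothetical numbers: take D = 1 msec and a reservation (H, W, C) = (1000 bits, 5, 10). The connection may send 1000 bits in each of 5 periods per 10-period cycle, so its instantaneous bandwidth is H/D = 1 Mbit/sec while its average bandwidth is HW/(CD) = 0.5 Mbit/sec; a latency-sensitive connection thus gets short bursts at twice its long-term rate without permanently reserving 1 Mbit/sec.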

Figure 3: There are three real-time connections (1, 2, 3) and one non-real-time connection (4). At the time the packet scheduler visits Connection 3's queue, that queue is still empty, so the link-layer packet contains data transmission units, or quanta, from Connections 1, 2, and 4. For real-time connections, the transmission units are forwarded to the connections' corresponding output queues by consulting the Connection-ID/Output-Queue map. For non-real-time traffic, however, whole network-layer packets need to be assembled at the input queue before being forwarded to the output queues, which are determined by looking up the non-real-time routing table.

The difference between a connection's instantaneous and average bandwidth requirements can be used by another real-time connection; in that case, however, the delay bound of the latter connection may suffer. Section 4.3 presents the delay bounds of real-time connections under FGFFQ.

4.2 Transparent Multiplexing/Demultiplexing

Traditionally, network-layer packets are transferred from one network node to another as atomic units. Because of this assumption, existing real-time packet schedulers are forced to schedule data transmission on the output links packet by packet. FGFFQ instead uses a scheduling granularity smaller than individual packets, so that each active connection receives its reserved share of the link bandwidth within each scheduling period. To support FGFFQ, a multiplexing/demultiplexing (MUDEM) layer is transparently added above the link-layer protocol. This subsection discusses the design of this MUDEM scheme in detail.

Every time the packet scheduler visits a non-empty connection queue, say that of the i-th connection, a quantum or data transmission unit of H_i bits is sent to the MUDEM layer, which merges multiple quanta into one link-layer packet and sends it to the next-hop router. The MUDEM layer at the other end receives the composite packet and directs each quantum in the packet to its respective per-connection output queue. Figure 3 shows the data structures used at the input and output sides of the MUDEM layer. In the most general case, an output link of a router may be connected to multiple next-hop routers; therefore, the data structures shown in Figure 3 are set up for each pair of directly connected routers, whether connected through a point-to-point link or through a shared medium.

The MUDEM layers at the two ends of a link are synchronized, on every link-layer packet, as to which connections' data transmission units the packet contains. At the beginning of a link-layer packet is a 4-byte MUDEM header that includes the number of connections that have put a quantum in the packet and the connection ID of the first such connection. These connection IDs are unique on a link, are established at connection setup time, and are agreed upon by both sides of the link. The number of participating connections in a packet is needed for the receiver to determine when to stop decoding the packet. There are two reasons for including the starting connection ID. First, when link-layer packets are corrupted, the errors are restricted to those connections that transmitted quanta in those packets, and do not propagate to other connections. Second, the starting connection ID makes it possible for the first quantum in a link-layer packet to come from the next connection that has any data to send, i.e., inactive connections can be skipped.

The sender side of the MUDEM layer also inserts a header for each connection onward from the first connection, regardless of whether a connection is empty or not. The first bit of this header byte indicates whether the corresponding connection is empty. If it is not, the remaining bits represent the length of the quantum, which follows the header. The length of the per-connection header is fixed and known at the other end of the link. The quantum length is specified explicitly because each connection may use a different H_i, and because the tail transmission unit of a network-layer packet may be shorter than its corresponding H_i. If a connection queue is empty, one header byte is inserted into the bit stream without any payload. The MUDEM layer composes link-layer packets by including as many connections as possible, until the total size of headers and payloads reaches the MTU (Maximum Transmission Unit).

At the receiver end, the MUDEM layer first decodes the starting connection ID and, for each non-empty real-time connection, enqueues its transmission unit into the corresponding per-connection queue at the associated output link. For real-time connections, the mapping from connection IDs to per-connection queues is constructed at connection establishment time. For non-real-time traffic, the header of the network-layer packet is used to look up the IP routing table. However, data transmission units of non-real-time packets cannot be immediately forwarded to the corresponding per-connection output queue, because multiple best-effort packets that are to be routed through the same output link may arrive at the same time; cut-throughs of best-effort packets could therefore lead to packet interleaving in the buffer. Instead, data transmission units of non-real-time traffic are buffered at the input queue to form a complete network-layer packet before being pushed to the per-connection output queue. Forming network-layer packets is relatively straightforward, since the network-layer header contains the length information. In summary, real-time packets can be forwarded inside a router in a cut-through fashion because each real-time connection has a dedicated queue in each router, whereas non-real-time packets must be serviced in a store-and-forward fashion because multiple non-real-time connections may share the same output queue.
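The wire format described above might look as follows in C. The exact field widths of the 4-byte MUDEM header are not spelled out in the paper, so the split shown here (a 16-bit count and a 16-bit starting connection ID), and the use of the low 7 header-byte bits for the quantum length, are assumptions for illustration:

    #include <stdint.h>

    /* 4-byte MUDEM header at the start of each link-layer packet.
     * Field widths are assumed: the paper only says the header carries
     * the number of participating connections and the first connection ID. */
    struct mudem_header {
        uint16_t num_conns;   /* connections with a quantum in this packet */
        uint16_t start_conn;  /* link-unique ID of the first connection    */
    };

    /* One-byte per-connection header preceding each quantum.
     * Bit 7: connection queue empty (no payload follows).
     * Bits 6..0: quantum length (unit assumed here). */
    #define CONN_EMPTY      0x80
    #define CONN_LEN(hdr)   ((hdr) & 0x7F)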
4.3 Performance Bounds

Given a real-time connection's performance requirement specification (H, W, C), the per-hop packet latency bound provided by FGFFQ is (1 + W) * D, which is independent of the maximum packet size of other connections and of the number of connections sharing an output link. The corresponding bandwidth guarantee is HW/(CD). The choice of D cannot be arbitrarily small, for implementation efficiency, especially when the number of connections sharing a link is large.
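Continuing the hypothetical numbers used earlier (H = 1000 bits, W = 5, C = 10, D = 1 msec), the per-hop latency bound works out to (1 + 5) * 1 msec = 6 msec, and the bandwidth guarantee to 1000 * 5 / (10 * 0.001) = 0.5 Mbit/sec.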

The latency bound breaks down into a queuing component, D, and a transmission component, W * D. The worst-case queuing delay occurs when packets from all active connections arrive simultaneously; the packet delayed the longest then experiences a queuing delay of D.

The per-hop delay bound of WFQ for a real-time connection with a bandwidth reservation of H/D is Pmax/(H/D) + Pmax/B, where Pmax is the maximum packet size and B is again the link bit rate. Assume that B can maximally support Nmax such connections, so that H = B*D/Nmax. Then the delay bound of WFQ is Pmax*(Nmax+1)/B. On the other hand, the delay bound of FGFFQ is (1+W)*D = (1+W)*H*Nmax/B. So the delay-bound ratio between FGFFQ and WFQ is Nmax*H*(1+W) / (Pmax*(Nmax+1)), which is equal to (Nmax + Nmax/W) / (Nmax+1) when Pmax = H*W. As long as the quantum size H remains much smaller than Pmax, the delay-bound ratio remains close to 1. Assume that Pmax = 1500 bytes, Nmax = 1000, B = 10^9 bits/sec, and D = 1 msec. Then the ratio is 1.082, i.e., an 8.2% increase in the delay bound compared to WFQ. One desirable property of FGFFQ is that its delay bound remains unchanged as the maximum number of network connections sharing an output link increases with the link speed. That is, assuming H is fixed, then as Nmax increases in proportion to the increase in B, D and thus the delay bound stay the same, because H*Nmax = B*D.

5 Prototype Implementation

5.1 Overview

The current Suez prototype consists of four Pentium-II 400MHz processors interconnected through a Myrinet system area network. A 1.28 Gbps full-duplex link connects each Pentium node to the Myrinet switch. Each host interface consists of an on-board LANai 4.3 processor, 512 KBytes of memory, and a DMA engine. The on-board processor executes a control program loaded into the board memory during system start-up. Both the bridging and the external interfaces on a host are Myrinet interfaces running tailored control programs to perform such operations as peer-to-peer DMAs, interface-to-memory DMAs, and shared-memory communication with the host processor. The interfaces reside on a 32-bit 33MHz PCI bus. Myrinet interface cards allow data transfer from the wire to the interface card to proceed concurrently with a DMA transaction from the card memory to the host memory.

The software of the Suez prototype is based on Linux, and includes a routing-table lookup engine, a real-time output link scheduler, and a data buffering and movement module. To reserve a portion of the L1 cache for the HAC, the page-table-entry creation routines in the Linux kernel have been modified. Essentially, unless the allocation is for the HAC, any page table entry created for an allocated physical page whose frame number is a multiple of 4 is created with the page-cache-disable bit set. This is done during dynamic memory allocation as well as during the static initial setup of the first 4 MBytes of the kernel's address space. Code pages in the kernel page tables are not cache-disabled, since they reside in a separate instruction cache and hence cannot conflict with the HAC, which resides in the data cache. Also, the pages allocated to the NART search data structure are always mapped to physical pages whose frame numbers are not multiples of 4, so that they are not cache-disabled. Given an incoming IP packet, its destination address is extracted and used for the lookup according to the algorithm described earlier.
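A hedged sketch of that page-table rule (function and macro names are ours; the real change lives in Linux's page-table-entry creation paths): because the 16-KByte direct-mapped L1 cache spans four 4-KByte pages, every physical page whose frame number is congruent to 0 mod 4 maps onto the same quarter of the cache as the HAC page, so caching is disabled for all of them except the HAC page itself.

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_CACHE_DISABLE  (1u << 4)   /* x86 PCD bit in a page-table entry */

    /* Decide whether a newly mapped physical page must be cache-disabled.
     * With a 16-KB direct-mapped L1 and 4-KB pages, pages with
     * pfn % 4 == 0 fall into the cache quarter reserved for the HAC. */
    bool must_cache_disable(uint32_t pfn, bool is_hac_page)
    {
        return ((pfn & 0x3) == 0) && !is_hac_page;
    }

    uint32_t make_pte(uint32_t pfn, uint32_t flags, bool is_hac_page)
    {
        if (must_cache_disable(pfn, is_hac_page))
            flags |= PTE_CACHE_DISABLE;
        return (pfn << 12) | flags;       /* 4-KB pages: frame in bits 31..12 */
    }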

Each output interface has a set of per-connection real-time queues and a best-effort queue. The FGFFQ algorithm is used to schedule and transmit the packets buffered in these queues. Under FGFFQ, the data received on an input interface is a set of quanta with an associated bitmap. Real-time quanta are demultiplexed to the appropriate output queues based on a real-time connection-ID table, which is set up during connection establishment. This demultiplexing does not involve data copying: it is accomplished simply by appending to the output queue a descriptor that contains a pointer to the quantum and its length. Because a single arriving packet consists of multiple quanta, multiple enqueue operations may be required.

5.2 Data Buffering and Movement

A key implementation challenge for the Suez prototype is to develop an efficient data buffering and movement scheme that reduces the latency a packet experiences when it travels from the input node to the output node of the Suez cluster, and maximizes the aggregate packet throughput of the entire router. When an interface card receives a packet, it first performs routing-table lookup for non-real-time quanta or packet classification for real-time quanta. Because the memory capacity on the interface card is limited, the card can hold only a real-time connection-ID table for packet classification; it is the CPU that performs routing-table lookup for non-real-time packets, using the HAC and NART algorithms. To eliminate unnecessary interrupts, the interface card places a lookup request in a per-interface shared memory region. The CPU actively polls this shared memory region, services the lookup requests with high priority, and places the lookup results back in the corresponding per-interface shared memory region. An incoming non-real-time packet can be transferred out of the input interface only after the lookup result becomes available. In the case that the CPU cannot perform lookups fast enough, the interface card is forced to DMA the delayed packets to a per-interface buffer queue. This additional DMA overhead is the cost of eliminating interrupts.

Once the output queue for a given packet/quantum is determined, there are two cases to consider. If the output interface of a packet/quantum resides on a different Suez node than its input interface, the packet/quantum is first transferred across the Myrinet switch to the output node, through a peer-to-peer DMA mechanism over PCI. Peer-to-peer DMA is possible because the memory on the interface cards is memory-mapped and can thus serve as the source and destination addresses of DMA transactions. Peer-to-peer DMA transactions are blocking, since the interface cards may not have enough buffer space to maintain queues for flow control. The interface card on a Suez node that is connected to the Myrinet switch (called the bridging interface) provides a well-known set of memory locations, one for each interface on the same node, to hold requests for peer-to-peer DMA transactions. The host sends these memory locations to the interfaces at startup, using the shared memory regions between the host and the interfaces. The i-th interface card initiates an inter-node data transfer by writing a small request message, through peer-to-peer DMA, to the request area allocated to the i-th interface on the bridging interface, which cycles through the set of data-transfer requests in its memory and services each of them by pulling data from the requesting interface, again through peer-to-peer DMA.
Then, the bridging interface attempts to batch all data items destined for the same output node into one transfer unit, and ships them through the switching fabric to the appropriate nodes in the cluster. The results of routing-table lookup and packet classification are sent along with the packets/quanta, to avoid redundant lookups or classification at the output nodes.

Figure 4: The life of a data packet through the cluster-based router when its input interface and output interface reside on separate nodes. (Steps in the figure: 1. classifier invocation for non-real-time traffic, or map-table lookup for real-time traffic; 2. peer DMA request; 3. peer DMA data; 4. peer DMA ack; 5. DMA data; 6. DMA descriptor; 7. schedule.) The packet is transferred to the output cluster node through a combination of a peer-to-peer DMA transaction on the input node, a transport over the backplane switch, and an interface-to-host-memory DMA transaction on the output node. The output link scheduler eventually puts the data on the wire through a gather DMA transaction.

Once a packet reaches its output node, the subsequent processing is the same as when the input and output interfaces reside on the same node. The interface through which a packet arrives at its output node DMAs the packet to its corresponding output queue in host memory. Each interface on a Suez node is pre-allocated a buffer pool and a descriptor pool residing in host memory. A descriptor, to a first approximation, consists of a memory-location pointer and a length field. To put a packet into its output queue, the receiving interface card allocates a buffer to hold the packet and creates one descriptor for each quantum in the packet. The interface card then logically inserts each packet/quantum into its corresponding output queue by appending the descriptor to that queue, which is physically implemented as a list of descriptors. The advantage of this approach is that a data packet never gets copied from one memory location to another. The disadvantage is that the DMA required to put a data packet on the output link is less efficient, because the quanta making up a packet do not necessarily reside in a contiguous memory region.

The CPU on a Suez node executes the FGFFQ scheduler, which scans across the output queues on all of the node's output interfaces. Every time a packet is put on the output link, the scheduler returns the associated buffer and descriptor to the pools of the interface through which the packet arrived at the node. Logically, therefore, there could be one consumer and multiple producers for each per-interface buffer pool and descriptor pool. However, because the FGFFQ scheduler is non-preemptive, at any point in time there is only one consumer and one producer for each per-interface buffer/descriptor pool. By building a "guard" entry into each per-interface buffer/descriptor pool, one can guarantee that accesses to these pools are free of race conditions without using locks. Figure 4 illustrates the processing of a data packet within the Suez router. The input interface performs packet classification for real-time packets, while leaving routing-table lookup for non-real-time packets to the CPU.
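As an illustration of the descriptor queue and the single-producer/single-consumer "guard" idea described above, here is a hedged C sketch; the names and the ring-buffer realization are ours, since the paper only specifies a descriptor as roughly a pointer plus a length, and a guard entry that keeps producer and consumer from colliding without locks:

    #include <stdint.h>

    /* A descriptor, to a first approximation: where a quantum lives
     * and how long it is. */
    struct desc {
        void    *addr;   /* pointer into the per-interface buffer pool */
        uint32_t len;    /* quantum length in bytes                    */
    };

    #define POOL_SIZE 256   /* illustrative; one slot is the "guard" */

    /* Single-producer/single-consumer descriptor ring.  Keeping one
     * slot permanently unused (the guard) lets the producer and the
     * consumer test fullness/emptiness with head/tail alone, so no
     * lock is needed as long as there is exactly one of each. */
    struct desc_ring {
        struct desc slots[POOL_SIZE];
        volatile uint32_t head;   /* next slot the producer fills  */
        volatile uint32_t tail;   /* next slot the consumer drains */
    };

    static int ring_put(struct desc_ring *r, struct desc d)
    {
        uint32_t next = (r->head + 1) % POOL_SIZE;
        if (next == r->tail)      /* guard slot would be consumed: full */
            return -1;
        r->slots[r->head] = d;
        r->head = next;
        return 0;
    }

    static int ring_get(struct desc_ring *r, struct desc *out)
    {
        if (r->tail == r->head)   /* empty */
            return -1;
        *out = r->slots[r->tail];
        r->tail = (r->tail + 1) % POOL_SIZE;
        return 0;
    }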

Once a data packet reaches its output cluster node, it is DMAed into host memory and waits for the output link scheduler to ship it out at the appropriate time. Buffering data packets in the output node's host memory is essential, because the arrival order and departure order of data packets can be very different under real-time packet scheduling. Without packet scheduling, peer-to-peer DMA could be used on the output node as well.

6 Performance Evaluation

6.1 Routing Table Lookup

We have performed a simulation study as well as a measurement study of the initial Suez prototype, based on an IP trace collected from the main router of Brookhaven National Laboratory from 9AM on 3/3/98 to 5PM on 3/6/98. The performance metric used is millions of lookups per second (MLPS). We built a detailed simulator that takes into account the HAC and NART lookup algorithms, as well as the effects of the L1 cache, L2 cache, and main memory; the results of this simulation study were reported in [26]. In this paper, we focus only on the lookup performance measured by running the same packet trace through the Suez prototype, which runs on a Pentium-II 400MHz machine with a 16-KByte 2-way set-associative L1 cache and a 512-KByte 4-way set-associative L2 cache. The HAC hit ratios and lookup performance results for different HAC associativities are shown in Table 1. The timing measurements were made using the cycle counter of the Pentium architecture.

The best-case measured performance is comparable to state-of-the-art high-end routers. We believe better performance is possible on a RISC architecture, for the following reasons. The Pentium instruction set architecture provides a limited set of registers and thus prohibits aggressive exploitation of instruction-level parallelism; it also has a very high branch misprediction penalty. As a result, the best-case scenario, in which the HAC hit occurs at the very first entry of the cache set, takes 23 cycles rather than the 5 cycles we estimated originally.

Because the first HAC entry in each cache set is replicated for performance optimization, 7-way and 3-way logical set associativities actually correspond to 8-way and 4-way physical set associativities. For 1-way and 2-way set associativities, however, this optimization is not applied. Although larger degrees of associativity increase the HAC hit ratio, the best performance occurs when the degree of associativity is 1. The main reason is that set associativity is implemented in software in this algorithm, which entails two performance overheads. First, with a highly set-associative cache, the access time for HAC hits near the end of a cache set is not necessarily shorter than the NART lookup time; that is, the time to match an IP address at the 8th HAC entry of an 8-way HAC cache set may be longer than looking it up in the NART directly. Second, large set associativities increase the HAC miss-detection time, adding a higher fixed overhead to every NART lookup. In the BNL trace, because the HAC hit ratio is already high (95.22%) when the HAC is direct-mapped, the additional gain in HAC hit ratio is not sufficient to offset the additional overhead of software-implemented set associativity. The reason the overall performance of the 3-way case is slightly better than that of the 2-way case, despite a lower HAC hit ratio and a higher HAC miss-detection time, is that the average hit-access time for the 3-way case is shorter, owing to the duplication optimization.
When the HAC is completely turned off, the measured performance is 8.93 MLPS, which shows


Topic 4a Router Operation and Scheduling. Ch4: Network Layer: The Data Plane. Computer Networking: A Top Down Approach Topic 4a Router Operation and Scheduling Ch4: Network Layer: The Data Plane Computer Networking: A Top Down Approach 7 th edition Jim Kurose, Keith Ross Pearson/Addison Wesley April 2016 4-1 Chapter 4:

More information

Chapter 4 Network Layer: The Data Plane

Chapter 4 Network Layer: The Data Plane Chapter 4 Network Layer: The Data Plane A note on the use of these Powerpoint slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you see

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

NETWORK LAYER DATA PLANE

NETWORK LAYER DATA PLANE NETWORK LAYER DATA PLANE 1 GOALS Understand principles behind network layer services, focusing on the data plane: Network layer service models Forwarding versus routing How a router works Generalized forwarding

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Router Design: Table Lookups and Packet Scheduling EECS 122: Lecture 13

Router Design: Table Lookups and Packet Scheduling EECS 122: Lecture 13 Router Design: Table Lookups and Packet Scheduling EECS 122: Lecture 13 Department of Electrical Engineering and Computer Sciences University of California Berkeley Review: Switch Architectures Input Queued

More information

SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988

SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988 SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS BY IAN RAMSAY PHILP B.S., University of North Carolina at Chapel Hill, 1988 M.S., University of Florida, 1990 THESIS Submitted in partial fulllment

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

Cisco Series Internet Router Architecture: Packet Switching

Cisco Series Internet Router Architecture: Packet Switching Cisco 12000 Series Internet Router Architecture: Packet Switching Document ID: 47320 Contents Introduction Prerequisites Requirements Components Used Conventions Background Information Packet Switching:

More information

Chapter 4: network layer. Network service model. Two key network-layer functions. Network layer. Input port functions. Router architecture overview

Chapter 4: network layer. Network service model. Two key network-layer functions. Network layer. Input port functions. Router architecture overview Chapter 4: chapter goals: understand principles behind services service models forwarding versus routing how a router works generalized forwarding instantiation, implementation in the Internet 4- Network

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

1 Connectionless Routing

1 Connectionless Routing UCSD DEPARTMENT OF COMPUTER SCIENCE CS123a Computer Networking, IP Addressing and Neighbor Routing In these we quickly give an overview of IP addressing and Neighbor Routing. Routing consists of: IP addressing

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Introduction. Introduction. Router Architectures. Introduction. Recent advances in routing architecture including

Introduction. Introduction. Router Architectures. Introduction. Recent advances in routing architecture including Router Architectures By the end of this lecture, you should be able to. Explain the different generations of router architectures Describe the route lookup process Explain the operation of PATRICIA algorithm

More information

DiffServ Architecture: Impact of scheduling on QoS

DiffServ Architecture: Impact of scheduling on QoS DiffServ Architecture: Impact of scheduling on QoS Abstract: Scheduling is one of the most important components in providing a differentiated service at the routers. Due to the varying traffic characteristics

More information

Process- Concept &Process Scheduling OPERATING SYSTEMS

Process- Concept &Process Scheduling OPERATING SYSTEMS OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple

More information

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University

More information

CS 326: Operating Systems. CPU Scheduling. Lecture 6

CS 326: Operating Systems. CPU Scheduling. Lecture 6 CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating

More information

Rate-Controlled Static-Priority. Hui Zhang. Domenico Ferrari. hzhang, Computer Science Division

Rate-Controlled Static-Priority. Hui Zhang. Domenico Ferrari. hzhang, Computer Science Division Rate-Controlled Static-Priority Queueing Hui Zhang Domenico Ferrari hzhang, ferrari@tenet.berkeley.edu Computer Science Division University of California at Berkeley Berkeley, CA 94720 TR-92-003 February

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

CS 856 Latency in Communication Systems

CS 856 Latency in Communication Systems CS 856 Latency in Communication Systems Winter 2010 Latency Challenges CS 856, Winter 2010, Latency Challenges 1 Overview Sources of Latency low-level mechanisms services Application Requirements Latency

More information

Addresses in the source program are generally symbolic. A compiler will typically bind these symbolic addresses to re-locatable addresses.

Addresses in the source program are generally symbolic. A compiler will typically bind these symbolic addresses to re-locatable addresses. 1 Memory Management Address Binding The normal procedures is to select one of the processes in the input queue and to load that process into memory. As the process executed, it accesses instructions and

More information

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including Introduction Router Architectures Recent advances in routing architecture including specialized hardware switching fabrics efficient and faster lookup algorithms have created routers that are capable of

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Modified by Rana Forsati for CSE 410 Outline Principle of locality Paging - Effect of page

More information

NoC Test-Chip Project: Working Document

NoC Test-Chip Project: Working Document NoC Test-Chip Project: Working Document Michele Petracca, Omar Ahmad, Young Jin Yoon, Frank Zovko, Luca Carloni and Kenneth Shepard I. INTRODUCTION This document describes the low-power high-performance

More information

Main Memory. Electrical and Computer Engineering Stephen Kim ECE/IUPUI RTOS & APPS 1

Main Memory. Electrical and Computer Engineering Stephen Kim ECE/IUPUI RTOS & APPS 1 Main Memory Electrical and Computer Engineering Stephen Kim (dskim@iupui.edu) ECE/IUPUI RTOS & APPS 1 Main Memory Background Swapping Contiguous allocation Paging Segmentation Segmentation with paging

More information

DISTRIBUTED EMBEDDED ARCHITECTURES

DISTRIBUTED EMBEDDED ARCHITECTURES DISTRIBUTED EMBEDDED ARCHITECTURES A distributed embedded system can be organized in many different ways, but its basic units are the Processing Elements (PE) and the network as illustrated in Figure.

More information

Operating Systems. Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring Paul Krzyzanowski. Rutgers University.

Operating Systems. Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring Paul Krzyzanowski. Rutgers University. Operating Systems Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring 2014 Paul Krzyzanowski Rutgers University Spring 2015 March 27, 2015 2015 Paul Krzyzanowski 1 Exam 2 2012 Question 2a One of

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

CSC 401 Data and Computer Communications Networks

CSC 401 Data and Computer Communications Networks CSC 401 Data and Computer Communications Networks Network Layer Overview, Router Design, IP Sec 4.1. 4.2 and 4.3 Prof. Lina Battestilli Fall 2017 Chapter 4: Network Layer, Data Plane chapter goals: understand

More information

Numerical Evaluation of Hierarchical QoS Routing. Sungjoon Ahn, Gayathri Chittiappa, A. Udaya Shankar. Computer Science Department and UMIACS

Numerical Evaluation of Hierarchical QoS Routing. Sungjoon Ahn, Gayathri Chittiappa, A. Udaya Shankar. Computer Science Department and UMIACS Numerical Evaluation of Hierarchical QoS Routing Sungjoon Ahn, Gayathri Chittiappa, A. Udaya Shankar Computer Science Department and UMIACS University of Maryland, College Park CS-TR-395 April 3, 1998

More information

Communication Networks

Communication Networks Communication Networks Chapter 3 Multiplexing Frequency Division Multiplexing (FDM) Useful bandwidth of medium exceeds required bandwidth of channel Each signal is modulated to a different carrier frequency

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Configuring QoS. Finding Feature Information. Prerequisites for QoS

Configuring QoS. Finding Feature Information. Prerequisites for QoS Finding Feature Information, page 1 Prerequisites for QoS, page 1 Restrictions for QoS, page 3 Information About QoS, page 4 How to Configure QoS, page 28 Monitoring Standard QoS, page 80 Configuration

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

Lecture (04 & 05) Packet switching & Frame Relay techniques Dr. Ahmed ElShafee

Lecture (04 & 05) Packet switching & Frame Relay techniques Dr. Ahmed ElShafee Agenda Lecture (04 & 05) Packet switching & Frame Relay techniques Dr. Ahmed ElShafee Packet switching technique Packet switching protocol layers (X.25) Frame Relay ١ Dr. Ahmed ElShafee, ACU Fall 2011,

More information

Lecture (04 & 05) Packet switching & Frame Relay techniques

Lecture (04 & 05) Packet switching & Frame Relay techniques Lecture (04 & 05) Packet switching & Frame Relay techniques Dr. Ahmed ElShafee ١ Dr. Ahmed ElShafee, ACU Fall 2011, Networks I Agenda Packet switching technique Packet switching protocol layers (X.25)

More information

Accelerating Dynamic Binary Translation with GPUs

Accelerating Dynamic Binary Translation with GPUs Accelerating Dynamic Binary Translation with GPUs Chung Hwan Kim, Srikanth Manikarnike, Vaibhav Sharma, Eric Eide, Robert Ricci School of Computing, University of Utah {chunghwn,smanikar,vaibhavs,eeide,ricci}@utah.edu

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Configuring QoS. Understanding QoS CHAPTER

Configuring QoS. Understanding QoS CHAPTER 29 CHAPTER This chapter describes how to configure quality of service (QoS) by using automatic QoS (auto-qos) commands or by using standard QoS commands on the Catalyst 3750 switch. With QoS, you can provide

More information

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors University of Crete School of Sciences & Engineering Computer Science Department Master Thesis by Michael Papamichael Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

More information

Operating System Support

Operating System Support Operating System Support Objectives and Functions Convenience Making the computer easier to use Efficiency Allowing better use of computer resources Layers and Views of a Computer System Operating System

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Review. Preview. Three Level Scheduler. Scheduler. Process behavior. Effective CPU Scheduler is essential. Process Scheduling

Review. Preview. Three Level Scheduler. Scheduler. Process behavior. Effective CPU Scheduler is essential. Process Scheduling Review Preview Mutual Exclusion Solutions with Busy Waiting Test and Set Lock Priority Inversion problem with busy waiting Mutual Exclusion with Sleep and Wakeup The Producer-Consumer Problem Race Condition

More information

1995 Paper 10 Question 7

1995 Paper 10 Question 7 995 Paper 0 Question 7 Why are multiple buffers often used between producing and consuming processes? Describe the operation of a semaphore. What is the difference between a counting semaphore and a binary

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Network Layer Enhancements

Network Layer Enhancements Network Layer Enhancements EECS 122: Lecture 14 Department of Electrical Engineering and Computer Sciences University of California Berkeley Today We have studied the network layer mechanisms that enable

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

Medium Access Protocols

Medium Access Protocols Medium Access Protocols Summary of MAC protocols What do you do with a shared media? Channel Partitioning, by time, frequency or code Time Division,Code Division, Frequency Division Random partitioning

More information

Wireless Networks (CSC-7602) Lecture 8 (15 Oct. 2007)

Wireless Networks (CSC-7602) Lecture 8 (15 Oct. 2007) Wireless Networks (CSC-7602) Lecture 8 (15 Oct. 2007) Seung-Jong Park (Jay) http://www.csc.lsu.edu/~sjpark 1 Today Wireline Fair Schedulling Why? Ideal algorithm Practical algorithms Wireless Fair Scheduling

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS 1.Define Computer Architecture Computer Architecture Is Defined As The Functional Operation Of The Individual H/W Unit In A Computer System And The

More information

Using Time Division Multiplexing to support Real-time Networking on Ethernet

Using Time Division Multiplexing to support Real-time Networking on Ethernet Using Time Division Multiplexing to support Real-time Networking on Ethernet Hariprasad Sampathkumar 25 th January 2005 Master s Thesis Defense Committee Dr. Douglas Niehaus, Chair Dr. Jeremiah James,

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Topics for Today. Network Layer. Readings. Introduction Addressing Address Resolution. Sections 5.1,

Topics for Today. Network Layer. Readings. Introduction Addressing Address Resolution. Sections 5.1, Topics for Today Network Layer Introduction Addressing Address Resolution Readings Sections 5.1, 5.6.1-5.6.2 1 Network Layer: Introduction A network-wide concern! Transport layer Between two end hosts

More information

Credit-Based Fair Queueing (CBFQ) K. T. Chan, B. Bensaou and D.H.K. Tsang. Department of Electrical & Electronic Engineering

Credit-Based Fair Queueing (CBFQ) K. T. Chan, B. Bensaou and D.H.K. Tsang. Department of Electrical & Electronic Engineering Credit-Based Fair Queueing (CBFQ) K. T. Chan, B. Bensaou and D.H.K. Tsang Department of Electrical & Electronic Engineering Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong

More information

Top-Level View of Computer Organization

Top-Level View of Computer Organization Top-Level View of Computer Organization Bởi: Hoang Lan Nguyen Computer Component Contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

TDDD82 Secure Mobile Systems Lecture 6: Quality of Service

TDDD82 Secure Mobile Systems Lecture 6: Quality of Service TDDD82 Secure Mobile Systems Lecture 6: Quality of Service Mikael Asplund Real-time Systems Laboratory Department of Computer and Information Science Linköping University Based on slides by Simin Nadjm-Tehrani

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Routers. Session 12 INST 346 Technologies, Infrastructure and Architecture

Routers. Session 12 INST 346 Technologies, Infrastructure and Architecture Routers Session 12 INST 346 Technologies, Infrastructure and Architecture Goals for Today Finish up TCP Flow control, timeout selection, close connection Network layer overview Structure of a router Getahead:

More information

PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS

PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS THE UNIVERSITY OF NAIROBI DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING FINAL YEAR PROJECT. PROJECT NO. 60 PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS OMARI JAPHETH N. F17/2157/2004 SUPERVISOR:

More information

Objectives and Functions Convenience. William Stallings Computer Organization and Architecture 7 th Edition. Efficiency

Objectives and Functions Convenience. William Stallings Computer Organization and Architecture 7 th Edition. Efficiency William Stallings Computer Organization and Architecture 7 th Edition Chapter 8 Operating System Support Objectives and Functions Convenience Making the computer easier to use Efficiency Allowing better

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) The Internet was originally designed for best-effort service without guarantee of predictable performance. Best-effort service is often sufficient for a traffic that is not sensitive

More information

UNIT- 2 Physical Layer and Overview of PL Switching

UNIT- 2 Physical Layer and Overview of PL Switching UNIT- 2 Physical Layer and Overview of PL Switching 2.1 MULTIPLEXING Multiplexing is the set of techniques that allows the simultaneous transmission of multiple signals across a single data link. Figure

More information

Configuring QoS CHAPTER

Configuring QoS CHAPTER CHAPTER 34 This chapter describes how to use different methods to configure quality of service (QoS) on the Catalyst 3750 Metro switch. With QoS, you can provide preferential treatment to certain types

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (8 th Week) (Advanced) Operating Systems 8. Virtual Memory 8. Outline Hardware and Control Structures Operating

More information