Advanced Core Router Architectures A Hardware Perspective


Muzammil Iqbal
Department of Electrical and Computer Engineering, University of Delaware, DE 976.

Abstract

The environment of high-performance IP networks and growing consumer demand for Internet bandwidth have placed an unprecedented load on today's Internet Service Providers (ISPs). Gateways, or routers as they have come to be known, play an essential role in modern Internet infrastructure. Their design has seen tremendous advancement during the last decades in order to sustain the high bandwidth requirements at the core. In this paper we identify important trends in design and some design issues facing the fourth generation of IP routers. This area encompasses a vast range of topics and it is not possible to cover them all; we therefore restrict our survey to the architectures of high-end backbone routers from a hardware viewpoint.

1 Introduction

The popularity of the Internet has caused Internet traffic to grow drastically every year. The predicted growth is that Internet traffic triples every three to four months, and a daily volume of thousands of terabits is expected by 2003, almost ten times the traffic volume traversing the Internet today [Pluris 7]. This will incapacitate the presently deployed core, whose throughput is in the gigabit regime. Another factor is the advancement of optical fiber communication technology, which, with the advent of DWDM, has made it possible to carry several channels over a single fiber, each with a capacity of several gigabits. 16-channel OC-192 (10 Gbps per channel) systems are not uncommon these days, and OC-768 speeds will be available soon. In such a scenario, an obvious bottleneck is the routing/switching device, which is not compatible with the available ultra-fast communication technology. The reason is inherent in the functionality of these devices, which causes them to run at a rate slower than the line rate.
Also, with the advent of QoS applications running complex algorithms, switching speeds have been adversely affected. A great deal of research interest therefore lies in building high-capacity, high-speed routers that can scale to the predicted growth trends as well as provide advanced QoS services such as the Integrated Services or Differentiated Services models. Our paper focuses on design trends for backbone routers; scheduling and route-lookup algorithms, although alluded to, are not discussed in detail. The rest of the paper is organized as follows. In Section 2, we briefly discuss initial design trends that resulted in several generations of router architectures. In Sections 3 and 4, the single-bus shared-memory architecture and the crossbar switch architecture are discussed in greater detail. Section 5 describes various techniques employed for switch arbitration and control, and discusses the hardware implementation of a fast crossbar scheduler. Section 6 deals with endeavors to build VLSI-based Internet routers with throughput in the terabit regime. In Section 7, we discuss the emerging idea of using terabit-capacity optical backplanes to switch data at very high speed. Section 8 presents the conclusion.

2 Essential Elements of a Routing System

A router is a layer 3 device in the OSI layer model. It has two fundamental functions: 1) routing and 2) forwarding of IP packets. The routing process gathers information about the network topology and creates a routing table. The packet forwarding process copies a packet from an input interface to the proper output interface based on the information contained in the forwarding table. Any routing system requires four essential elements to implement the routing and forwarding process: routing software, packet processing, a switch fabric and line cards (Fig 2.1). For any system designed to operate at the core of the Internet, all four elements must be equally efficient.
Fig 2.1: Block diagram of a generic router (routing engine: main processor, routing table, routing software; forwarding engine: packet processing, forwarding table, switch fabric)

The main processor runs the routing software, which performs the routing functions and maintains information about the Internet topology through the routing protocol. The switch fabric performs the forwarding operation using the forwarding processor, which is usually an ASIC optimized to perform a specific task at high speed. At this point it is appropriate to mention that in today's backbone routers, the usual practice is to maintain another table, called the forwarding table, at the forwarding engine (FE). The FE reads the destination address from the packet header, performs route lookup, finds the best match and forwards the packet to the determined output interface. The forwarding table is essentially derived from the routing table but is updated less often. The routing table is maintained by the routing software, which performs relatively slow processing while updating the table; once the table has been updated, its image is transferred to the forwarding engine. This image can be a subset of the complete routing table or may be modified to fit into a smaller version. The switch fabric is essentially the core of the forwarding engine. In essence, the router comprises the routing engine, implemented in software, which uses the main CPU of the router to carry out the more complex operations such as running the routing protocol, traffic engineering features, QoS guarantees, modifying the packet header before departure and other software-based features; and the forwarding engine, which carries out simpler tasks at a very high rate.

2.1 Legacy Architectures

Fig 2.2: First-generation architecture (line cards with MAC and DMA attached to a shared bus; central route processor, CPU and memory)

Over the years, many different architectures have been used for routers. Particular architectures have been selected based on a number of factors, including cost, number of ports, required performance and currently available technology.
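Across all of these generations, the core job of the forwarding engine is the longest-prefix-match lookup against the forwarding table. It can be sketched as follows; this is an illustrative sketch only — the prefixes, interface names and the linear scan are our assumptions, and real forwarding engines use specialized structures (tries, CAMs) in hardware:

```python
import ipaddress

# Hypothetical forwarding table (FIB): (prefix, next-hop interface) entries,
# derived offline from the routing table and pushed to the forwarding engine.
FIB = [
    (ipaddress.ip_network("0.0.0.0/0"), "if0"),    # default route
    (ipaddress.ip_network("10.0.0.0/8"), "if1"),
    (ipaddress.ip_network("10.1.0.0/16"), "if2"),
]

def lookup(dst: str) -> str:
    """Longest-prefix match: among all matching prefixes, the longest wins."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, hop) for net, hop in FIB if addr in net]
    return max(matches)[1]   # most-specific route wins
```

A destination such as 10.1.2.3 matches all three entries here, but the /16 is the most specific, so its interface is returned.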
The detailed implementation of individual commercial routers has generally remained proprietary, but in broad terms all routers have evolved in similar ways and have followed a common trend of development. The first trend has been the implementation of more and more data-path functions in hardware. In recent years, improvements in the integration of CMOS technology have made it possible to implement a larger number of functions in ASIC components, especially those that were traditionally implemented in software.

Fig 2.3: Second-generation architecture (line cards with DMA, route-cache memory and MAC; central route processor, CPU and memory; cache-update and packet paths over the shared bus)

The first generation of routers was simpler from an architectural viewpoint in that it used a centralized processor, a centralized buffer and a centralized bus connected to the line cards. Incoming packets had to traverse the same bus in order to be scheduled on an output interface. The interface cards were dumb I/O devices with no packet-processing capabilities. This design clearly shows its deficiencies in that the bus can only be used by one line card at a time. Moreover, a packet has to traverse the bus twice after arrival on an ingress port. First, it is written into the memory while the route lookup and scheduling decision are taken by the route processor; second, once such a decision has been reached, it is de-queued from the memory and traverses the bus again to reach the appropriate output interface(s). This is illustrated in Fig 2.2. Another deficiency was that a general-purpose CPU was the workhorse of the device. All functions pertaining to the routing and forwarding processes had to be performed by the same processor, levying a tremendous load and contributing to the bottleneck of the system.
The second-generation design, shown in Fig 2.3, added special-purpose ASIC processors and some memory to the interface cards, which can perform the packet-header lookup to retrieve destination information and also buffer the packet until the bus is available for it to traverse. The satellite processors in the line cards each kept only a modest cache of recently used routes, which enables a line card to perform route lookup itself and to depend on the central processor only for bus arbitration. This cache is periodically updated; if a route is not found in the route cache, the main processor looks it up. This technique relieves the processing load on the central CPU, but bus arbitration is still a bottleneck. The second-generation architectures were short-lived because of their inability to support the higher throughput

requirements at the core. The major deficiencies pointed out by the newly emerging crossbar-architecture regime were the following. First, congestion: the bandwidth is shared among all the ports, leading to contention and additional forwarding delays. Under severe congestion, when the arrival rate of packets exceeded the capacity of the bus, buffers would overflow and data would be lost. Second, high-speed shared buses are difficult to design: the electrical loading caused by multiple ports on a shared bus, the number of connectors that a signal encounters, and reflections from the ends of unterminated lines all limit the transfer capacity of a bus. Another bus-based multiple-processor router architecture, described in [Asthana 8], uses multiple forwarding engines in parallel to achieve high packet-processing rates, as shown in Fig 2.4. As a packet is received, the IP header is stripped by the control circuitry, augmented with an identifying tag, and sent to a forwarding engine for validation and routing. While the forwarding engine is performing the routing function, the remainder of the packet is deposited in an input buffer in parallel. The forwarding engine determines which outgoing link the packet should be transmitted on, and sends updated header fields to the appropriate destination interface module along with the tag information. The packet is then moved from the buffer in the source interface module to a buffer in the destination interface module and eventually transmitted on an outgoing link.

Fig 2.4: Bus-based architecture with multiple parallel forwarding engines (forwarding engines and network interfaces attached to separate control, forwarding and data buses, with a resource-control unit)

3 The Shared Memory Architecture

The design described above forms a premise for the shared-memory architecture, which employs it for building higher-end backbone routers.
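The tag-and-dispatch scheme described above can be sketched in a few lines. Everything here is a simplified stand-in — the round-robin dispatch, the routing rule and all names are our illustrative assumptions, not the actual design from [Asthana 8]:

```python
from itertools import cycle

# Toy model: headers go to one of several forwarding engines while the
# payload is parked in a buffer under the same tag until the routing
# decision comes back.
FE_POOL = cycle([0, 1, 2])   # three forwarding engines, picked round-robin
payload_buffer = {}          # tag -> payload parked at the input module

def route_header(header):
    # Stand-in for the forwarding engine's real lookup logic.
    return header["dst"] % 4          # pretend there are 4 outgoing links

def receive(tag, header, payload):
    payload_buffer[tag] = payload     # payload deposited in the input buffer
    engine = next(FE_POOL)            # header handed to a forwarding engine
    out_link = route_header(header)   # engine picks the outgoing link
    return engine, out_link

def transmit(tag, out_link):
    # Move the parked payload toward the chosen output interface module.
    return out_link, payload_buffer.pop(tag)
```

The key property the sketch illustrates is that header processing and payload buffering proceed in parallel, joined only by the tag.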
It belongs to a contemporary regime that competes with the crossbar switch-fabric regime, and both have high-end devices in the market targeting core applications. This design belongs to the third generation of router architectures. It employs an ultra-high-speed, high-capacity bus with a logically centralized memory that is physically distributed over multiple forwarding engines working in parallel. In this section we discuss the architecture of a shared-memory-based router from Juniper Networks. Invariably, all higher-end routing systems of today use specialized hardware, which tends to become more and more complex as technology advances. Router vendors use these special-purpose ASICs for varied functions such as packet processing, route lookup, packet classification, and maintaining per-flow state for QoS and traffic engineering. The complexity of such hardware grows exponentially as functionality increases. The processing time for basic forwarding operations is the benchmark for overall device efficiency. Currently the CMOS process has reached 0.18 microns, and even 0.13 microns for state-of-the-art microprocessors and RISC processors. With this level of gate density it is now possible to integrate several million transistors on a single ASIC. These high-density chips lend themselves well to more and more complex network operations, unprecedented so far. The point of this discussion is to emphasize that the performance of routing devices has seen tremendous enhancement owing to the rapid growth of VLSI technology. We have basically two applications for large-scale electronic integration. The first is in the switch fabric, where the connection-switching latency, also referred to as the cell time, is an important factor in determining the overall data throughput of the device.
The second, as said above, is the specialized ASICs, which have evolved into massively sophisticated devices; they offer very high-speed functionality for operations thus far implemented in software in the main routing engine. An appropriate reference would be QoS and traffic-engineering functions, whose scope of implementation was limited by their complexity and slow execution in software. The Juniper M-series routers have been offered with 40 to 160 Gbps backplanes. Such high throughput requires high-speed transceiver circuitry to retrieve the data from the line card, very efficient buffering, fast route lookup and an ultra-high-speed centralized bus to switch the packet from one interface to another. There are also more subtle issues involved in a shared-memory system, such as memory management and control signaling, whose efficient implementation and coordination can form the premise for a high-speed device. The design can be generalized as an emulation of an output-queued, fixed-length cell switch. As we discuss in later sections as well, it is becoming a trend in the gigabit and terabit regimes to adopt fixed-length cell switching due to the ease of implementing the communication channels. This idea has been adopted from ATM switches and utilized in a packet-switching device instead of a circuit-switched one. The centralized memory architecture has the following functions:

1) All input packets are fragmented into fixed-length 64-byte cells, and a new header (notification) containing control information is generated. This is performed at the ingress port by an ASIC.

2) The cells are forwarded to a buffer-management ASIC for storage in the memory.

Fig 3.1: Packet forwarding engine of the M-series Juniper router (input and output interfaces, I/O manager, distributed buffer manager, Internet Processor with route lookup, shared memory, 100 Mbps PCI bus)

3) The notification information for each packet is forwarded to the Internet Processor ASIC, which performs the route lookup and takes the scheduling decision according to the state of the centralized bus.

4) The Internet Processor ASIC also performs bus arbitration and schedules the packet on an output interface. Once this decision is taken, the relevant buffer-management ASIC is notified so that the cells can be de-queued from the memory.

5) However, instead of the packet itself, the output notification generated by the Internet Processor ASIC is queued at the output port. When it reaches the head of line in the queue, the cells pertaining to that notification are de-queued from the distributed memory locations and assembled at the output port before departure.

It should be emphasized that internal control-communication overhead is heavily reduced by exchanging notifications, which amount to 53 bytes of data. The actual packet resides in the memory until its output notification has reached the head of line at the output port. Fig 3.1 shows the logical process of packet forwarding. The actual hardware implementation is more distributed, as shown in Fig 3.2. The packet-forwarding engine constitutes the main switching fabric of the device. The shared memory is distributed over several Flexible PIC Concentrators, but logically it acts as a single entity. The architecture is designed in a hierarchical manner.
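The five steps above can be sketched as a toy model. The cell size matches the text, but the addressing scheme, queue layout and field names are our own illustrative assumptions, not Juniper's implementation:

```python
CELL = 64                 # fixed cell size from step 1
shared_memory = {}        # cell address -> 64-byte cell (the shared pool)
output_queues = {0: []}   # per-port FIFO of notifications, not packets

def enqueue(packet, out_port, base_addr):
    # Steps 1-3: segment into cells, store them, queue a small notification.
    addrs = []
    for off in range(0, len(packet), CELL):
        cell = packet[off:off + CELL].ljust(CELL, b"\0")   # pad the last cell
        shared_memory[base_addr + len(addrs)] = cell
        addrs.append(base_addr + len(addrs))
    output_queues[out_port].append({"addrs": addrs, "length": len(packet)})

def dequeue(out_port):
    # Steps 4-5: when a notification reaches the head of line, fetch the
    # cells from the shared pool and reassemble the original packet.
    note = output_queues[out_port].pop(0)
    cells = b"".join(shared_memory.pop(a) for a in note["addrs"])
    return cells[:note["length"]]
```

Note that only the small notification record sits in the output queue; the cells stay in the shared pool until departure, which is the point of the design.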
The line cards, or Physical Interface Cards (PICs) as Juniper calls them, terminate in Flexible PIC Concentrators (FPCs). Each FPC includes a 128-MB DRAM memory bank, which is part of a large pool constituted by all eight FPCs. The I/O manager responsible for memory access and packet fragmentation is also housed on the FPC. The FPCs are connected to a pool of packet-forwarding modules containing the distributed buffer-manager ASIC and the Internet Processor ASIC. Their functionality is described in greater detail below.

Flexible PIC Concentrators (FPCs)

Four Physical Interface Cards (PICs): These are the line cards. They receive and transmit data packets from the network and perform framing, speed signaling and encapsulation.

Two Packet Director ASICs: Distribute incoming packets from the PICs among the I/O managers. This allows for uniform distribution of the traffic load among several devices to achieve some level of parallelism.

Four I/O Manager ASICs: Parse layer 2 and 3 data and perform encapsulation and segmentation of data packets into 64-byte cells. They read and write data to memory according to the addresses assigned by the buffer manager. The I/O manager is also responsible for packet reassembly at the time of departure. It queues the packet notifications generated by the buffer manager and waits until a particular notification has reached the head of line of its queue; at that time the cells are de-queued from the memory and reassembled for departure.

Shared Memory: This comprises eight identical 128-MB SDRAMs. It is essentially a two-way, highly non-blocking memory that offers single-stage buffering capability. The memory is physically distributed over the FPCs; logically each bank is a member of the overall memory pool of the system. Packets are stored in the memory independent of their arrival on a particular FPC.
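How cells might be spread across the eight distributed banks so that storage is independent of the arrival FPC can be sketched with a simple round-robin placement. The policy shown is purely an assumption for illustration; the actual placement algorithm is not described in the source:

```python
N_BANKS = 8   # one bank per FPC, as in the text (policy is illustrative)

def stripe(cells, start_bank=0):
    """Place a packet's cells across the distributed banks round-robin so
    that no single FPC's bank becomes a hot spot (assumed policy)."""
    return [((start_bank + i) % N_BANKS, cell) for i, cell in enumerate(cells)]
```

Striping of this kind is what lets eight physically separate banks behave as one logical pool with aggregate bandwidth.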
Switching and Forwarding Module (SFM)

Fig 3.2: Architecture of the M-series forwarding engine (FPCs connected to the SFM over high-speed 8 x OC-48 links, 20 Gbps half duplex)

Internet Processor II ASIC: The centralized processor for route lookup and forwarding decisions. It performs longest-match lookups at a rate of tens of millions of routes per second, which is

sufficient for wire-speed lookup rates. It maintains forwarding tables for the scheduling decision. The forwarding table is an image derived from the main routing table maintained by the routing engine. While the routing table is being updated, the forwarding tables would normally be blocked to avoid inconsistent and spurious scheduling, but this also contributes to system blockage, undesirable in high-end routers where a delay can cause huge backlogs. To prevent blocking of the forwarding table during periods of route instability, the Internet Processor ASIC maintains two forwarding tables. While the forwarding/routing table is being updated, scheduling is performed on the basis of the secondary table until the primary one has been updated to replace it. This mechanism is called atomic update. The ASIC is highly programmable via the JUNOS operating system to cater for QoS and traffic-engineering requirements.

Distributed Buffer Manager ASIC: Manages the 8 x 128-MB memory banks on the FPCs as parts of the system's shared memory pool. It generates notifications on packet arrival, communicates with the Internet Processor ASIC and, upon a scheduling decision, de-queues the packet from the memory.

Midplane: The shared-memory interconnect, of size 8 x OC-48. Its capacity differs among models, ranging from 40 to 160 Gbps full duplex. This is the main switching bus of the system and forms the communication link between the FPCs and the switching and forwarding modules.

The shared-memory architecture is prone to several inefficiencies. We describe some of them. The first issue is that segmenting packets into cells becomes wasteful when the packet size exceeds a multiple of the cell size by a small amount. In a 64-byte-cell system, a packet of length 65 bytes consumes two cells, 128 bytes. This becomes a problem when packets of varied lengths arrive at the interfaces, which is usually the case.
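The waste from fixed-length segmentation is easy to quantify: a packet needs ceil(length / 64) cells, and everything beyond the payload in its last cell is padding. A small sketch of the arithmetic:

```python
import math

CELL = 64  # cell size used in the example above

def cells_needed(pkt_len: int) -> int:
    return math.ceil(pkt_len / CELL)

def padding_fraction(pkt_len: int) -> float:
    """Fraction of switched bytes that are padding rather than payload."""
    carried = cells_needed(pkt_len) * CELL
    return (carried - pkt_len) / carried
```

For the 65-byte packet above, 63 of the 128 switched bytes (about 49%) are padding, while a 64-byte packet wastes nothing.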
Another problem associated with all fixed-length cell devices is the overhead generated by the notification information for all the cells pertaining to different packets. This information is vital for correct reassembly of the packets at the output interface. For an Internet-backbone-scale router as discussed above, this overhead can be very large due to the sheer volume of traffic serviced. The generation of notifications, processing of control information, segmentation and reassembly, storage and de-queuing all contribute to the overall packet-service latency of the system. Note that sustaining service levels for IP multicast traffic is particularly problematic in any shared-memory architecture. The reason is that each multicast packet is written to the shared memory just once, but needs to be read multiple times by different line cards, all from the same address. This situation increases the chances of other packets getting blocked. In the next section we describe the crossbar-fabric regime of backbone Internet routers. These also belong to the third generation of router architectures.

4 The Crossbar Switch Fabric Architectures

The essential architecture remains unchanged from the second-generation designs, with the major change being the replacement of the shared bus fabric by a crossbar switch fabric. Crossbars have been widely used in switching applications such as telephony, parallel-computer communication and data communications. They are extensively used for building higher-end backbone routing and switching devices due to their ease of implementation and, largely, the alleviation of the bus-contention issues inherent in single-bus architectures. The architecture comprises line cards, a main routing engine, several forwarding engines associated with the line cards and a switch fabric. This is depicted in Fig 4.1. In an N x N crossbar, with N input and N output ports, an input can be connected to any of the outputs simultaneously with other connections.
This is the non-blocking property of the crossbar architecture; it alleviates the bus-contention issues of single-bus systems.

Fig 4.1: Crossbar-based fabric (route processor and forwarding engines attached to the switch fabric)

The forwarding process can be roughly described by the following stages. 1) As a packet arrives on an input line card, the forwarding engine performs basic error checking to ascertain an error-free packet and header. It computes the hash offset into the forwarding table and loads the route. 2) The forwarding engine checks whether the cached route matches the destination of the datagram in the route cache; if not, the forwarding engine carries out an extended lookup of the forwarding table associated with it. This forwarding table is actually extracted from the routing table maintained by the main routing processor, a process similar to the shared-bus

architecture. On finding a match, the engine checks the IP time-to-live (TTL) field, computes the updated TTL and IP checksum, and determines whether the datagram is for the router itself. 3) The updated TTL and checksum are put into the IP header. The necessary routing information is extracted from the forwarding-table entry, and the updated IP header is written out along with link-layer information from the forwarding table. 4) As this process completes, the packet is queued in the input buffer at the line card. The crossbar uses a sophisticated buffering technique to prevent head-of-line blocking; this is discussed later. The forwarding engine, having performed route lookup and header modification, determines the output interface to which the packet is to be routed. This is notified to the centralized scheduler, which controls crossbar arbitration by maintaining all states of the crossbar. 5) Once a scheduling decision is reached, the forwarding engine gains access through the port-specific line card to the crossbar. The scheduler closes the necessary switch points on the crossbar and the packet traverses it to the output line card. The above is a very brief and rough description of the forwarding process; going into further details is not intended. With this viewpoint, in the next subsection we explain the implementation of a 16 x 16 crossbar switch fabric with simple implementation tools.

4.1 Hardware Architecture of a 16 x 16 Crossbar

There can be many ways to efficiently implement a crossbar in VLSI hardware, but in this section we present an overview of one implementation using very simple hardware macros. This design is implemented in a 0.35-micron CMOS process and requires some 600 MOS transistors. The overall area of the chip is 1.2 mm x 0.6 mm and the operating voltage is 5 Volts.

Fig 4.2: Block diagram of a 16 x 16 crossbar (controlled by select lines S0-S3)
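Returning to step 3 of the forwarding process: the TTL and checksum update need not recompute the checksum over the whole header. Because the TTL occupies the high byte of one 16-bit header word, decrementing it by one lets the one's-complement checksum be patched incrementally, in the style of RFC 1624 (HC' = ~(~HC + ~m + m')). A sketch, with 16-bit header words as Python ints and the header layout assumed:

```python
def decrement_ttl(ttl_word, checksum):
    """Decrement the TTL (high byte of its 16-bit header word) and patch
    the IP header checksum incrementally per RFC 1624."""
    new_word = ttl_word - 0x0100                 # TTL sits in the high byte
    c = (~checksum & 0xFFFF) + (~ttl_word & 0xFFFF) + new_word
    c = (c & 0xFFFF) + (c >> 16)                 # fold the carries back in
    c = (c & 0xFFFF) + (c >> 16)
    return new_word, ~c & 0xFFFF
```

The patched checksum matches a full recomputation over the modified header, which is why fast-path hardware can avoid touching the other header words at all.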
Our basic aim is to elaborate working principles and hardware implementation issues rather than give a complete design.

Fig 4.2: Multiplexer macro

This fabric has a throughput in the Gbps range, full duplex. It is implemented using cascaded 4 x 1 multiplexers with

switching controlled through a 64-bit shift register. Fig 4.2 shows the block-diagram representation of the device. The design is optimized for digital data-communications applications and high-speed digital switching equipment. The fabric offers Gbps-range throughput with nanosecond-scale switch latency. The crossbar connection pattern is stored in 16 on-chip registers. A serial interface is used to configure the crossbar connection pattern in 64 clock cycles. The building blocks are the shift register, the cascaded multiplexer array and fanout buffers. The multiplexer module is built from five 4 x 1 multiplexers cascaded in two stages. The 16 input pins are connected to each multiplexer macro in the manner shown in Fig 4.2. The shift register forms an important part of the control circuit: it stores the configuration pattern of the crossbar and manipulates it according to the directives of the scheduler. Fig 4.4 shows the logic schematic of the shift register, built out of D-Q flip-flops cascaded together; only the output bits needed to drive the select lines are used here.

Fig 4.4: Logic schematic of the shift register

Fig 4.3: (a) Symbolic representation of the 4 x 1 MUX; (b) logic function

The multiplexer macro comprises a set of five 4 x 1 multiplexers cascaded in two stages. All 16 outputs from the fanout buffers form the 16 inputs distributed over the first-stage muxes. The selection of the MUX module is done on the basis of select lines S2 and S3, while the output of the nth stage, Out(n), is selected by select lines S0 and S1. These select lines originate from the shift register, where the configuration pattern of the crossbar is stored. Fig 4.3(a) and (b) illustrate the block diagram of a 4 x 1 MUX along with its logic function. Input lines are selected from select inputs (1) and (2); the logic table for several configurations is shown in Fig 4.3(b). As an example, when S2 (1) and S3 (2) are 0 and 1, the selection maps to two states, as shown in the table.
This configuration of the select lines selects input 5 in Fig 4.3(a), which can have either a 0 or a 1 as its input; this input is latched to output 7. Therefore, by varying the two select bits, we can select an input line. For the output MUX this selection is governed by the S0 and S1 bits from the shift register, and one input from the input muxes is selected to appear on the Out(n) port. We have briefly discussed the implementation details of a crossbar with a view to giving the reader exposure to implementation issues in router architectures. The design above is not state-of-the-art and offers only a modest Gbps-range throughput. There are several products in the market offering aggregate throughputs of the order of several hundred Gbps; these have been achieved using similar technologies but at higher gate-density levels. Also, the MUX implementation is not an industry standard, as it has a larger switch latency, unacceptable in high-end products. Some architectures use tri-state buffers as the switching points of the crossbar owing to their compactness and faster switching speeds. There can be several other approaches as well, but due to limitations of space we do not discuss them. Also, the design details above correspond only to the fabric part of the overall crossbar architecture; several other components play a vital role in determining efficiency, foremost being the centralized scheduler and the scheduling algorithm. In the next subsection we discuss the inherent inefficiencies of the crossbar design and how they have been rectified using more sophisticated algorithms like iSLIP.

4.2 The Scheduling Algorithm

The crossbar is inherently non-blocking, but it suffers from other problems such as Head-of-Line (HOL) blocking [Mckeown 5]. Even under benign traffic patterns, HOL blocking limits the throughput to just under 60% (58.6%) of the aggregate

bandwidth. This can be alleviated using Virtual Output Queuing (VOQ), in which each input interface maintains a separate FIFO queue for the packets destined to each output interface. At the beginning of each time slot, a centralized scheduling algorithm examines the contents of all input queues and finds a conflict-free match between inputs and outputs. The iSLIP algorithm proposed in [Mckeown 6] is designed to meet the goals of high throughput and a starvation-free arbitration process. During each time slot, multiple iterations are performed to select a crossbar configuration, matching inputs to outputs. iSLIP uses rotating-priority ("round-robin") arbitration to schedule each active input and output in turn. A hardware implementation of this algorithm shows that it can operate at very high speed, making scheduling decisions in tens of nanoseconds; this implementation is discussed in Section 5. iSLIP attempts to converge quickly on a conflict-free match over multiple iterations, where each iteration consists of three steps.

1) Request: Each input sends a request to every output for which it has a queued cell.

2) Grant: If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule starting from the highest-priority element. The output notifies each input whether or not its request was granted.

3) Accept: If an input receives any grants, it accepts the one that appears next in a fixed round-robin schedule.

Fig 4.5: Schematic of the arbitration scheduler and its implementation

By considering only unmatched inputs and outputs, each iteration matches inputs and outputs that were not matched during earlier iterations. Fig 4.5 highlights the iteration steps and a way to implement them using priority encoders. The iSLIP algorithm offers 100% throughput for uniform, uncorrelated arrivals. No connection is starved, and service is guaranteed in at most N iterations for an N x N crossbar system. It is simple to implement, as discussed in the next section.
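A single request-grant-accept iteration can be sketched directly from the three steps above. This is an unoptimized software model — real iSLIP schedulers implement the round-robin choices with priority encoders, and the pointer-update rule here covers only the first iteration:

```python
def islip_iteration(requests, grant_ptr, accept_ptr, N):
    """One request-grant-accept iteration over an N x N crossbar.
    requests[i] is the set of outputs input i has queued cells for;
    grant_ptr / accept_ptr are per-output / per-input round-robin pointers."""
    # Request + Grant: each output picks the requesting input that appears
    # next at or after its pointer in round-robin order.
    grants = {}
    for out in range(N):
        asking = [i for i in range(N) if out in requests[i]]
        if asking:
            grants[out] = min(asking, key=lambda i: (i - grant_ptr[out]) % N)
    # Accept: each granted input takes the output next at or after its pointer.
    match = {}
    for inp in set(grants.values()):
        offered = [o for o, i in grants.items() if i == inp]
        out = min(offered, key=lambda o: (o - accept_ptr[inp]) % N)
        match[inp] = out
        # Pointers move one past the matched element (first iteration only).
        grant_ptr[out] = (inp + 1) % N
        accept_ptr[inp] = (out + 1) % N
    return match
```

Because the pointers advance one position past whoever was just served, a recently served input falls to the lowest priority at that output, which is what prevents starvation.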
In Section 5 we discuss various arbitration techniques for the crossbar architecture, along with a description of an implementation of the iSLIP scheduler.

5 Arbitration and Control Techniques

Bus-arbitration algorithms are implemented largely in hardware ASICs because of their fast computational ability for dedicated tasks. A controller/arbiter therefore constitutes a vital component of overall crossbar switch architectures and dictates the performance of a routing device. Two major approaches have been either to centralize the arbitration task on a single chip, usually a centralized scheduler, or to distribute it over multiple chips to attain high speeds by exploiting parallelism. In large non-blocking crossbars, the arbiter may become a bottleneck in terms of performance and reliability. A major research issue has therefore been to reduce the connection-setup latencies resulting from this bottleneck. The arbitration time and subsequent connection-setup operation can severely affect overall efficiency when heavy traffic loads cause all the input ports to request connections with output ports. This is especially true for high-end backbone routers with forwarding speeds reaching tens of millions of packets per second (Mpps), which require very fast arbitration and scheduling chips to keep up with the aggregate throughput of the system [Newman 8]. In what follows we give an overview of two popular control schemes and their variants, as well as the implementation of the iSLIP scheduler [Gupta], to highlight some hardware aspects of the arbitration operation and its tradeoffs.

5.1 Centralized Versus Distributed Arbitration

Conventional crossbar designs have N input ports placed conceptually perpendicular to N output ports, thus constituting an N x N matrix. The centralized bus arbiter receives a request from a port, say P0, for a connection to port Pk.
The arbitration unit will ascertain a free path through the fabric to port Pk and will subsequently close the required switch points for a successful connection. A more specific single-sided crossbar [Ghosh] has N port-lines with M buses perpendicularly bisecting each of them. For full-duplex communication, each port-line is assumed to comprise two wires, and the same is true for the buses. All the switching points of a crossbar can be distributed over several chips to ease design complexity or, with the current level of CMOS VLSI technology reaching 0.25 microns, can be

placed on a single chip [Mckeown 7] as well. In any case, the arbitration decision is based on the availability of an open path to the output port via an internal or external switching action. An arbiter might receive the following requests from a port:

1) a connect request, to set up a communication path to another port;

2) a disconnect request, to terminate an existing connection between two ports.

As described above, the connection procedure requires the arbiter to ascertain an unused bus column connecting the input and output port-lines for each connection request. With large crossbars reaching sizes of 72 x 72 port-lines, the computational overhead of a centralized controller may become a bottleneck, reducing the overall efficiency of the switch. A simple remedy is to use multiple arbitration units to reduce the connection-setup latency. Several schemes have been proposed in the literature, e.g. [Ghosh] [Zerrouk 3], describing methods to efficiently employ a distributed set of controllers, thereby optimizing efficiency and reliability. Distributing the request load uniformly among several controllers reduces the arbitration load on any single controller. The centralized scheme presents a single point of failure, which is undesirable; the distributed scheme increases the reliability of the switch by eliminating that single point of failure. In the following subsections we describe two such methodologies.

5.2 The Symmetric Triangular Scheme

Arbitration and control distribution can be achieved by dividing all source and destination port pairs into K subsets, assuming that there are K controllers C1 ... CK. With N ports, the total number of distinct port pairs is N(N-1)/2. This is best illustrated by representing the port pairs in a triangular pattern. Fig. illustrates this distribution for N = 6 ports and K controllers. An even distribution of port pairs is achieved if they are equally divided among the controllers.
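One way to realize this even division can be sketched as follows. The triangle of N(N-1)/2 pairs is cut horizontally into K regions of roughly equal area, one per controller; the exact region boundaries of the cited scheme may differ, so this only illustrates the load-balancing idea.

```python
# Behavioral sketch of the symmetric triangular distribution: row b of
# the triangle holds the pairs (a, b) with a < b, and rows are assigned
# to controllers in proportion to the triangle area they cover.

def controller_for(a, b, n, k):
    hi = max(a, b)
    total = n * (n - 1) // 2          # all distinct port pairs
    rows_before = hi * (hi - 1) // 2  # pairs lying in rows above row hi
    # map the row's cumulative share of the triangle to a controller
    return min(k - 1, rows_before * k // total)

n, k = 16, 4                          # illustrative sizes
loads = [0] * k
for b in range(n):
    for a in range(b):
        loads[controller_for(a, b, n, k)] += 1
print(loads)  # [36, 30, 25, 29] -- close to N(N-1)/(2K) = 30 pairs each
```

Because the regions are contiguous, each connect/disconnect request (a, b) maps to exactly one controller with a few integer operations.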
It can be achieved by geometrically dividing the port-pair triangle horizontally into K regions of approximately equal area. This scheme, however, is still prone to blocking if the total number of buses is less than or equal to N/2 + 1. Also, each region pertaining to a controller Ci represents a central point of failure for that region: in case of a failure, measures must be taken to redistribute the port-pair regions among the surviving controllers, which can disrupt service for a period of time. A more detailed description of this scheme is omitted here due to space constraints.

5.3 The Chessboard Scheme

This scheme, first proposed in [Ghosh], is a variant of the triangular scheme in that it enhances redundancy and fault tolerance by assigning two controllers to any service request (a, b). This enhancement comes with an increase in the complexity of the system. From an implementation viewpoint this concern may not be sizeable, given current improvements in the density of large-scale integration devices; moreover, it is weighed against the hazard of service disruption, which can be detrimental to performance in high-end routers and switches. Keeping the same triangular port-pair distribution as above, two controllers, referred to as primary and secondary, are assigned to each request; in other words, two redundant controllers look after each distinct region of port pairs. A request (a, b) is handled by the primary controller and, in case of its failure, is forwarded to the redundant secondary controller. This scheme is further optimized by the idea of dynamically distributing the service requests among the K redundant controllers, which offers the flexibility to adapt to individual controller loads. For example, a request is assigned to a pair of servicing controllers, and either of them, depending on individual load conditions, may service the request.
5.4 Arbiter for the iSLIP Scheduling Algorithm

In this section we present brief implementation details of a classical crossbar scheduler and its distributed arbiters, with iSLIP as the underlying algorithm. As described, the iSLIP algorithm is an iterative combination of three distinct operations, the request-grant-accept cycle, implemented on separate arbiters in a pipelined manner. Fig. 5 in the previous section illustrates the basic iSLIP scheduling algorithm and the specific task of each arbiter therein: a request from any of the N ports is received by a grant arbiter and is accepted by an accept arbiter on the line card. Pipelining reduces the number of clock cycles required: I iterations consume only 2I+2 cycles, instead of 3I for a non-pipelined scheme. The centralized scheduler runs at a clock speed of 175 MHz, and each time slot consists of 9 clock cycles; thus the three iterations usually involved in scheduling a packet from an input to an output port require 8 clock cycles. The scheduler is designed to configure the 32 x 32 crossbar once every 51 ns. The iSLIP algorithm employs round-robin grant and accept arbiters to ensure 100% throughput and approximately fair scheduling across all input queues, avoiding starvation. The round-robin arbiters are efficiently implemented as programmable priority encoders (PPEs). A PPE differs from a simple priority encoder in that an external input dictates which input has the highest priority. In what follows, we take a detailed look at the design of a high-speed round-robin arbiter as a PPE. Fig. 5.1 shows the circuit diagram of a round-robin arbiter. It has some state (called the round-robin

pointer, P_enc, of width log N bits) that points to the current highest-priority input. In every arbitration cycle it uses this pointer P_enc to choose one among the N incoming requests, through a PPE. The programmable priority encoder takes N 1-bit-wide requests and a log N-wide P_enc as input; it then chooses the first non-zero request value beyond Request[P_enc], resulting in an N-bit grant. Clearly, the core function of contention resolution is carried out by this combinational logic. This rotating priority is the essence of the iSLIP algorithm's guarantee that a virtual queue making a request at any port will not starve beyond N rounds. The pointer update mechanism is generally simple and can be performed in parallel. The path from request to grant determines the speed of the arbitration decision, and therefore the design of a fast PPE based on combinational logic is desirable. There are several ways to implement a fast PPE, but only the RIPPLE and CLA (carry-lookahead) designs are discussed here.

Fig. 5.1 Block diagram of a round-robin arbiter

5.5 RIPPLE & Carry-lookahead PPE

A simple exhaustive (EXH) PPE is shown in Fig. 5.2; we use this design as the premise for a more optimal version, i.e. the ripple and carry-lookahead method. The EXH PPE uses N duplicate copies of a simple priority encoder (PE), one per possible value of the round-robin pointer, such that Request is rotated i positions to the right before being presented to the i-th encoder; a programmable priority input then selects among the N PEs via a MUX with P_enc as the select signal.

Fig. 5.2 Block diagram of an exhaustive PPE

The RIPPLE design is based on the principle that the first non-zero request beyond the current pointer value can be found in a series of at most N sequential operations. First we look at input P_enc: if Request(P_enc) is 1, then Gnt(i) will be 1 for i = P_enc and 0 otherwise.
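This first-nonzero-beyond-the-pointer search can be modeled behaviorally as follows; the sketch captures the function the ripple chain computes, not its gate-level structure.

```python
# Behavioral model of the PPE's contention resolution: starting at the
# highest-priority position P_enc, scan the requests cyclically and
# grant the first 1 found.

def ppe(request, p_enc):
    """request: list of 0/1 bits; p_enc: index of highest priority.
    Returns (grant vector, AnyGnt flag)."""
    n = len(request)
    grant = [0] * n
    for k in range(n):            # at most N ripple steps
        i = (p_enc + k) % n
        if request[i]:
            grant[i] = 1          # first non-zero request at/after P_enc
            return grant, 1
    return grant, 0               # no request was 1: AnyGnt = 0

print(ppe([0, 1, 0, 1], p_enc=2))  # ([0, 0, 0, 1], 1)
print(ppe([0, 0, 0, 0], p_enc=2))  # ([0, 0, 0, 0], 0)
```

In hardware the same search is unrolled into N combinational stages rather than a sequential loop, which is what the mini_rpl chain below implements.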
If, on the other hand, Request(P_enc) is 0, we look at the input numbered (P_enc+1) mod N, and continue the process until either we find a non-zero request or we are back at the input number we started from (P_enc). In the latter case none of the input requests were 1, so Gnt(i) will be 0 for all i, and AnyGnt will be 0. A sub-block (mini_rpl), as shown in Fig. 5.3, can carry out each of these steps. P_dec is obtained by decoding the input signal P_enc. The 1-bit signal Imed[i] indicates whether the search process has already found a 1 (in which case Request[i] is to be ignored, Gnt[i] is 0, and Imed[i+1] is 1). Imed[i] is itself ignored if P_dec[i] is 1, marking the top-priority request and the starting point of the search.

Fig. 5.3 mini_rpl: a sub-block of the RIPPLE design

Once we have the sub-block mini_rpl, the RIPPLE PPE is simply a chain of N such sub-blocks connected back-to-back cyclically through Imed (Figure 5.4). This design is area-efficient, using a small number of gates in each of the N stages, but rippling through N stages increases latency, which is undesirable in high-speed arbiters. Introducing CLA (carry-lookahead) techniques eliminates the need for the combinational ripple loop: for example, a CLA of m bits would reduce the rippling delay to about 1/6th of the original. A detailed discussion of the CLA ripple arbiter is, however, beyond the scope of this study.

5.6 Discussion

There are several factors that should be kept in mind when considering a design for the arbitration process: first, the underlying algorithm that actually dictates the arbitration; and second, a feasible VLSI implementation consistent with the efficiency requirements of the

routing/switching device. Several algorithms have been proposed in the literature with a focus on optimizing overall throughput, but most of them lack the capability of fast implementation; the tradeoffs involved in the implementation phase can outweigh the predilection for achieving 100% throughput. As discussed in the sections above, an arbitration scheme that guarantees 100% reliability and throughput must be examined carefully to find a design that meets the speed requirements while keeping throughput in a reasonable range. The symmetric triangular scheme requires fewer buses and ensures reliability, but using redundant controllers in this manner wastes resources, since the secondary controllers would be idle most of the time; the chessboard scheme, a variant of the triangular scheme, therefore focuses on optimizing controller utilization. This comes at the expense of an increase in the number of buses, which in this case is a reasonable tradeoff. From the implementation viewpoint, a major factor is that VLSI technology is evolving rapidly (gate density doubles roughly every 18 months), implying that a restriction under today's technology may disappear in the near future. The current CMOS process is of the order of 0.18 microns, whereas the iSLIP arbitration module described in 5.5 was built on the premise of a 0.32-micron CMOS process. The advent of high-speed optical interconnects employed in very high throughput routers and switches puts a larger demand on contemporary VLSI processes, which constitute the drivers for the smart-pixel arrays in such systems. These interconnects are discussed in detail later.

Fig. 5.4 Ripple design

6 Terabit Routing Scenarios

Exponential growth of the Internet and the demand for more bandwidth have pushed beyond the capacity of today's routing architectures, prompting the adoption of highly scalable, multi-terabit routing solutions. The model employed so far has been to cluster multiple high-capacity routers at the core to increase its overall throughput, which is achieved by interconnecting routers via OC-12 or OC-48 line cards. This scenario has the following shortcomings:

1) The transit throughput dictates that the overall capacity cannot be greater than the interconnect speed. This can be overcome by increasing the line-card interconnections between the routers, but that adds capital cost and operational complexity.

2) There is a significant decrease in port density, as the usable ports left for incoming traffic are slashed to a fraction of the total, the rest being consumed by the interconnects.

3) Due to advances in DWDM, a single fiber can carry anywhere from 10 to 80 channels of 2.5 or 10 Gbps transmission paths; terminating these links at IP backbone routers, however, requires a large number of OC-48 or OC-192 line cards, which leads to an increase in the number of routers at the core.

Fig. 6 Router clusters form the core: some OC-48 links handle transit traffic among the routers, leaving only the remaining OC-48 ports usable

Fig. 6 illustrates such a scenario. The point of this discussion is that clustering routers via line cards is sub-optimal in terms of available bandwidth and port density. There are several ways to tackle the ever-increasing bandwidth demands. Two of these are: 1) to build a single device with a throughput of the order of several terabits/sec; or 2) to group several relatively smaller routers into an integrated system so that the combined throughput of the system is compatible with bandwidth requirements. As discussed, however, the latter suffers from the aforementioned performance deficiencies. In what follows we present ways to overcome these with the use of a totally integrated optical backplane.
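The port-density penalty of clustering can be illustrated with simple arithmetic; the cluster size and port count below are hypothetical, chosen only to make the effect concrete.

```python
# Illustrative port-density arithmetic for a clustered core: in a full
# mesh of R routers, each router spends R-1 of its P line-card ports on
# transit links to its peers rather than on customer-facing traffic.
R, P = 4, 16                     # assumed: 4 routers, 16 OC-48 ports each
transit_ports = R - 1            # ports consumed by inter-router links
usable = P - transit_ports
print(usable, R * usable)        # 13 usable ports per router, 52 overall
```

Out of 64 ports purchased, only 52 carry revenue traffic here, and the fraction worsens as the cluster grows, which is the sub-optimality the text describes.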
But first, we begin with an account of the techniques presented or used so far for building high-capacity VLSI routing/switching devices.

6.2 VLSI Terabit Switch Routers

The multi-gigabit VLSI routers presented earlier provide a basis for the construction of higher-capacity devices. The premise of research in this area has been to capitalize on the rapid advances in CMOS VLSI technology, with a CMOS crossbar switching fabric forming the basic constituent of each of these endeavors. In this section we present an overview of the design methodologies and contrasting features of two such architectures. We begin with the Tiny Tera project, as reported in [Mckeown 6].

6.3 The Tiny Tera: A Packet Switch Core

The Tiny Tera is an excellent example of a compact, high-capacity switch that can be employed in applications as diverse as ATM switches and Internet routers. The switch employs a similar kind of architecture to the GSR presented earlier. The fabric comprises a 32 x 32 crossbar that switches fixed-size packets across 32 I/O ports, each operating at the OC-192 (10 Gbps) line rate, for an aggregate bandwidth of 320 Gbps. The switch employs an input-queued architecture with three logical elements: the ports, a central crossbar fabric, and a centralized packet scheduler. Input queuing has the advantage that the buffer memories are not required to run faster than the line rate, but it inherently suffers from head-of-line (HOL) blocking. This is rectified using the Virtual Output Queuing (VOQ) scheme, in which each input maintains a separate queue for each output. The switch also uses the iSLIP scheduling algorithm, which achieves approximately 100% throughput while making scheduling decisions in less than 51 ns in a 0.32-micron CMOS process.

Fig. 6.2 Logical switch architecture with I/O ports

Fig. 6.2 shows the typical port configuration of the switch. Packets are fragmented on arrival into fixed-size 16-bit segments, which are stored in the input buffers at each port and await access to the crossbar. The centralized scheduler makes routing decisions based on the current configuration of the crossbar and the contents of all input queues; each decision is passed on to the port concerned and, in turn, the relevant crossbar slice is informed. The scheduler and crossbar slices form the centralized hub of the switch. The scheduler is connected to the I/O ports and to the slice stack through high-speed serial links operating at several Gbps. The slice stack comprises several slices, each of which is a printed circuit board containing a 1-bit 32 x 32 crossbar chip. The reason for having multiple crossbar slices is to exploit parallelism, which provides the speed-up needed to reach high throughput; it also allows for scalability, as the number of slices can be increased to meet varied requirements. In the discussion that follows we intentionally skip the description of the crossbar VLSI architecture and the iSLIP scheduling algorithm, as these were covered in the previous section. The distinguishing feature of this design is the I/O port architecture, which is optimized to fragment packets and store fixed-size segments. Another area we stress is the serial communication between the scheduler, the I/O ports, and the slice stack, which lends itself to flexible pipelining. The logical steps of the data flow through the port architecture can be summarized as follows:

1) The line cards support OC-192 links. As packets arrive on the ports they are fragmented into 16-bit segments and distributed over several serial communication links.

2) A centralized port processor receives the segments and assigns them memory addresses in the on-chip SRAM. It also generates a header, which records a particular segment's association with a certain packet and the source/destination pointers for a routing decision.

3) The port processor also communicates the necessary information about all arriving packets and queue status to the centralized scheduler. This communication takes place over a high-capacity serial link.
4) When a scheduling decision is reached, it is conveyed to the port processor, which determines the memory location from which the packet is to be dequeued; the packet is subsequently put on the crossbar links, a group of serial interconnects to the crossbar corresponding to the input serial links.

5) As with all fixed-size packet switching systems, the Tiny Tera also requires some mechanism for output queuing to allow for packet-reassembly latencies. Since the ports are full duplex, the same on-chip SRAM has to be used for both input and output queuing; this requirement prompts efficient usage of the memory through dynamic partitioning.

Fig. 6.3 illustrates the functionality of each port. The fact that this switch is designed to switch fixed-size packets

leads to the generation of additional overhead in terms of storage, additional headers, and the necessity of output queuing. However, such a design lends itself to efficient usage of the 16-bit-wide crossbar grid; it is therefore a tradeoff between larger overhead and efficient crossbar utilization.

Fig. 6.3 Block architecture of each port card

The on-chip communication presents a significant challenge to this architecture, as the segmentation and reassembly of packets, centralized scheduling, and high throughput requirements dictate a high degree of control communication between the ASICs. This communication must also be sufficiently fast to maintain the control and update information of the switching fabric and the status of the queuing buffers. Since the usual communication is between a low-fanout chip, such as a port circuit, and a high-fanout chip, such as the crossbar, it is optimal to locate the phase-adjustment and clock-synchronization circuitry not on the higher-fanout chip but on the lower-fanout one; this reduces complexity and lowers power consumption. Fig. 6.4 shows the link between one of the data-slice chips on a port (the smart end) and a crossbar slice chip (the dumb end). In a multi-board system such as this, performance is often limited by the distribution of high-frequency clocks and the resultant clock skew and phase noise, so a careful clock distribution scheme that minimizes skew and jitter is desired. A PLL circuit is used to achieve clock stability, which is vital for tracking feedback-loop drift over the long feedback path, and steady, precise phase alignment is accomplished with a digitally controlled clock phase interpolator.

6.4 Terabit CMOS Transceiver Router

In this section we detail another highly integrated, fixed-length-cell-based terabit switch fabric [Wang ].
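A rough buffer-bandwidth budget motivates the speed-up discussion that follows: with combined input/output queuing and a fabric speed-up S, an input buffer is written at the line rate R and read by the fabric at S*R, so the memory must sustain roughly (1+S)*R. The numbers below are illustrative assumptions, not figures from [Wang].

```python
# Per-port memory bandwidth needed under combined input/output queuing
# (illustrative): writes arrive at the line rate R while the fabric
# drains the buffer at S*R, so the buffer sees about (1+S)*R in total.
R_gbps = 10          # assumed OC-192 line rate, rounded
S = 2                # speed-up factor of 2, as quoted in the text
buffer_bw = (1 + S) * R_gbps
print(buffer_bw)     # 30 -- Gbps of raw memory bandwidth per port
```

Commodity DRAM cannot sustain such random-access rates, which is why the fabric resorts to the Clos-like memory arrangement described next.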
Fundamental to the fabric is the integration of high-speed CMOS transceivers to provide a low-power, low pin-count, and low chip-count system with a raw capacity of 2 Tbps. The fabric can support user bandwidth of up to 60 Gbps with a speed-up factor of 2. With IQ (input queuing), OQ (output queuing), and CIOQ (combined input-output queuing), it is necessary to buffer and retrieve data at a speed equal to or faster than the line rate. Speed-up is required when the DRAM random-access speed cannot sustain high-capacity line rates such as OC-48 or OC-192, which is normally the case with commercially available memories. This can be rectified by using a multi-stage Clos-like memory architecture [Ayer 7], which allows each memory device to run slower than the line rate; it was proven in [Ayer 7] that such an architecture can guarantee the required speed-up. The first challenge in the design of the on-chip communication infrastructure is that the crossbar and scheduler chips need to terminate 32 or more communication links, each originating from a different board, and these links must achieve high data rates in a noisy digital environment. This is achieved using a variant of the traditional serial link, with the phase-adjustment circuitry placed in the transmitter instead of, as previously practiced, in the receiver; phase adjustment and synchronization play important roles in the serial communication between the two devices, as described above.

Fig. 6.4 Serial link architecture

Fig. 6.5 Clos-like memory architecture for speed-up

The switch in question employs a combined input and output queuing system and also provides per-class, per-port QoS. The distinguishing feature of this fabric is the low-power, low pin-count, and highly reliable CMOS


More information

Lecture 16: Network Layer Overview, Internet Protocol

Lecture 16: Network Layer Overview, Internet Protocol Lecture 16: Network Layer Overview, Internet Protocol COMP 332, Spring 2018 Victoria Manfredi Acknowledgements: materials adapted from Computer Networking: A Top Down Approach 7 th edition: 1996-2016,

More information

Optical networking technology

Optical networking technology 1 Optical networking technology Technological advances in semiconductor products have essentially been the primary driver for the growth of networking that led to improvements and simplification in the

More information

CH : 15 LOCAL AREA NETWORK OVERVIEW

CH : 15 LOCAL AREA NETWORK OVERVIEW CH : 15 LOCAL AREA NETWORK OVERVIEW P. 447 LAN (Local Area Network) A LAN consists of a shared transmission medium and a set of hardware and software for interfacing devices to the medium and regulating

More information

Chapter 4 Network Layer: The Data Plane

Chapter 4 Network Layer: The Data Plane Chapter 4 Network Layer: The Data Plane A note on the use of these Powerpoint slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you see

More information

CSE 3214: Computer Network Protocols and Applications Network Layer

CSE 3214: Computer Network Protocols and Applications Network Layer CSE 314: Computer Network Protocols and Applications Network Layer Dr. Peter Lian, Professor Department of Computer Science and Engineering York University Email: peterlian@cse.yorku.ca Office: 101C Lassonde

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Quality of Service in the Internet

Quality of Service in the Internet Quality of Service in the Internet Problem today: IP is packet switched, therefore no guarantees on a transmission is given (throughput, transmission delay, ): the Internet transmits data Best Effort But:

More information

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES Greg Hankins APRICOT 2012 2012 Brocade Communications Systems, Inc. 2012/02/28 Lookup Capacity and Forwarding

More information

Quality of Service in the Internet

Quality of Service in the Internet Quality of Service in the Internet Problem today: IP is packet switched, therefore no guarantees on a transmission is given (throughput, transmission delay, ): the Internet transmits data Best Effort But:

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Chapter 4 Network Layer

Chapter 4 Network Layer Chapter 4 Network Layer Computer Networking: A Top Down Approach Featuring the Internet, 3 rd edition. Jim Kurose, Keith Ross Addison-Wesley, July 2004. Network Layer 4-1 Chapter 4: Network Layer Chapter

More information

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including Introduction Router Architectures Recent advances in routing architecture including specialized hardware switching fabrics efficient and faster lookup algorithms have created routers that are capable of

More information

Chapter 7 Hardware Overview

Chapter 7 Hardware Overview Chapter 7 Hardware Overview This chapter provides a hardware overview of the HP 9308M, HP 930M, and HP 6308M-SX routing switches and the HP 6208M-SX switch. For information about specific hardware standards

More information

Backup Exec 9.0 for Windows Servers. SAN Shared Storage Option

Backup Exec 9.0 for Windows Servers. SAN Shared Storage Option WHITE PAPER Optimized Performance for SAN Environments Backup Exec 9.0 for Windows Servers SAN Shared Storage Option 1 TABLE OF CONTENTS Executive Summary...3 Product Highlights...3 Approaches to Backup...4

More information

Network layer (addendum) Slides adapted from material by Nick McKeown and Kevin Lai

Network layer (addendum) Slides adapted from material by Nick McKeown and Kevin Lai Network layer (addendum) Slides adapted from material by Nick McKeown and Kevin Lai Routers.. A router consists - A set of input interfaces at which packets arrive - A set of output interfaces from which

More information

InfiniBand SDR, DDR, and QDR Technology Guide

InfiniBand SDR, DDR, and QDR Technology Guide White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses

More information

Local Area Network Overview

Local Area Network Overview Local Area Network Overview Chapter 15 CS420/520 Axel Krings Page 1 LAN Applications (1) Personal computer LANs Low cost Limited data rate Back end networks Interconnecting large systems (mainframes and

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Quality of Service in the Internet. QoS Parameters. Keeping the QoS. Leaky Bucket Algorithm

Quality of Service in the Internet. QoS Parameters. Keeping the QoS. Leaky Bucket Algorithm Quality of Service in the Internet Problem today: IP is packet switched, therefore no guarantees on a transmission is given (throughput, transmission delay, ): the Internet transmits data Best Effort But:

More information

Introduction to Cisco ASR 9000 Series Network Virtualization Technology

Introduction to Cisco ASR 9000 Series Network Virtualization Technology White Paper Introduction to Cisco ASR 9000 Series Network Virtualization Technology What You Will Learn Service providers worldwide face high customer expectations along with growing demand for network

More information

Tag Switching. Background. Tag-Switching Architecture. Forwarding Component CHAPTER

Tag Switching. Background. Tag-Switching Architecture. Forwarding Component CHAPTER CHAPTER 23 Tag Switching Background Rapid changes in the type (and quantity) of traffic handled by the Internet and the explosion in the number of Internet users is putting an unprecedented strain on the

More information

EE 122: Router Design

EE 122: Router Design Routers EE 22: Router Design Kevin Lai September 25, 2002.. A router consists - A set of input interfaces at which packets arrive - A set of output interfaces from which packets depart - Some form of interconnect

More information

Introduction to Routers and LAN Switches

Introduction to Routers and LAN Switches Introduction to Routers and LAN Switches Session 3048_05_2001_c1 2001, Cisco Systems, Inc. All rights reserved. 3 Prerequisites OSI Model Networking Fundamentals 3048_05_2001_c1 2001, Cisco Systems, Inc.

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) The Internet was originally designed for best-effort service without guarantee of predictable performance. Best-effort service is often sufficient for a traffic that is not sensitive

More information

On Scheduling Unicast and Multicast Traffic in High Speed Routers

On Scheduling Unicast and Multicast Traffic in High Speed Routers On Scheduling Unicast and Multicast Traffic in High Speed Routers Kwan-Wu Chin School of Electrical, Computer and Telecommunications Engineering University of Wollongong kwanwu@uow.edu.au Abstract Researchers

More information

IP Video Network Gateway Solutions

IP Video Network Gateway Solutions IP Video Network Gateway Solutions INTRODUCTION The broadcast systems of today exist in two separate and largely disconnected worlds: a network-based world where audio/video information is stored and passed

More information

Module 1. Introduction. Version 2, CSE IIT, Kharagpur

Module 1. Introduction. Version 2, CSE IIT, Kharagpur Module 1 Introduction Version 2, CSE IIT, Kharagpur Introduction In this module we shall highlight some of the basic aspects of computer networks in two lessons. In lesson 1.1 we shall start with the historical

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

CSC 4900 Computer Networks: Network Layer

CSC 4900 Computer Networks: Network Layer CSC 4900 Computer Networks: Network Layer Professor Henry Carter Fall 2017 Villanova University Department of Computing Sciences Review What is AIMD? When do we use it? What is the steady state profile

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

A 400Gbps Multi-Core Network Processor

A 400Gbps Multi-Core Network Processor A 400Gbps Multi-Core Network Processor James Markevitch, Srinivasa Malladi Cisco Systems August 22, 2017 Legal THE INFORMATION HEREIN IS PROVIDED ON AN AS IS BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS,

More information

Routers Technologies & Evolution for High-Speed Networks

Routers Technologies & Evolution for High-Speed Networks Routers Technologies & Evolution for High-Speed Networks C. Pham Université de Pau et des Pays de l Adour http://www.univ-pau.fr/~cpham Congduc.Pham@univ-pau.fr Router Evolution slides from Nick McKeown,

More information

Topic & Scope. Content: The course gives

Topic & Scope. Content: The course gives Topic & Scope Content: The course gives an overview of network processor cards (architectures and use) an introduction of how to program Intel IXP network processors some ideas of how to use network processors

More information

Switching. An Engineering Approach to Computer Networking

Switching. An Engineering Approach to Computer Networking Switching An Engineering Approach to Computer Networking What is it all about? How do we move traffic from one part of the network to another? Connect end-systems to switches, and switches to each other

More information

Professor Yashar Ganjali Department of Computer Science University of Toronto.

Professor Yashar Ganjali Department of Computer Science University of Toronto. Professor Yashar Ganjali Department of Computer Science University of Toronto yganjali@cs.toronto.edu http://www.cs.toronto.edu/~yganjali Today Outline What this course is about Logistics Course structure,

More information

Computer Networks. Instructor: Niklas Carlsson

Computer Networks. Instructor: Niklas Carlsson Computer Networks Instructor: Niklas Carlsson Email: niklas.carlsson@liu.se Notes derived from Computer Networking: A Top Down Approach, by Jim Kurose and Keith Ross, Addison-Wesley. The slides are adapted

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [NETWORKING] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Why not spawn processes

More information

Chapter 4. Computer Networking: A Top Down Approach 5 th edition. Jim Kurose, Keith Ross Addison-Wesley, sl April 2009.

Chapter 4. Computer Networking: A Top Down Approach 5 th edition. Jim Kurose, Keith Ross Addison-Wesley, sl April 2009. Chapter 4 Network Layer A note on the use of these ppt slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you can add, modify, and delete

More information

Master Course Computer Networks IN2097

Master Course Computer Networks IN2097 Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Master Course Computer Networks IN2097 Prof. Dr.-Ing. Georg Carle Christian Grothoff, Ph.D. Chair for

More information

100 GBE AND BEYOND. Diagram courtesy of the CFP MSA Brocade Communications Systems, Inc. v /11/21

100 GBE AND BEYOND. Diagram courtesy of the CFP MSA Brocade Communications Systems, Inc. v /11/21 100 GBE AND BEYOND 2011 Brocade Communications Systems, Inc. Diagram courtesy of the CFP MSA. v1.4 2011/11/21 Current State of the Industry 10 Electrical Fundamental 1 st generation technology constraints

More information

Lecture 16: Router Design

Lecture 16: Router Design Lecture 16: Router Design CSE 123: Computer Networks Alex C. Snoeren Eample courtesy Mike Freedman Lecture 16 Overview End-to-end lookup and forwarding example Router internals Buffering Scheduling 2 Example:

More information

Managed IP Services from Dial Access to Gigabit Routers

Managed IP Services from Dial Access to Gigabit Routers Managed IP Services from Dial Access to Gigabit Routers Technical barriers and Future trends for IP Differentiated Services Grenville Armitage, PhD Member of Technical Staff High Speed Networks Research,

More information

Network Layer Introduction

Network Layer Introduction Network Layer Introduction Tom Kelliher, CS 325 Apr. 6, 2011 1 Administrivia Announcements Assignment Read 4.4. From Last Time Congestion Control. Outline 1. Introduction. 2. Virtual circuit and datagram

More information

Computer Networks LECTURE 10 ICMP, SNMP, Inside a Router, Link Layer Protocols. Assignments INTERNET CONTROL MESSAGE PROTOCOL

Computer Networks LECTURE 10 ICMP, SNMP, Inside a Router, Link Layer Protocols. Assignments INTERNET CONTROL MESSAGE PROTOCOL Computer Networks LECTURE 10 ICMP, SNMP, Inside a Router, Link Layer Protocols Sandhya Dwarkadas Department of Computer Science University of Rochester Assignments Lab 3: IP DUE Friday, October 7 th Assignment

More information

Adaptive Resync in vsan 6.7 First Published On: Last Updated On:

Adaptive Resync in vsan 6.7 First Published On: Last Updated On: First Published On: 04-26-2018 Last Updated On: 05-02-2018 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.vSAN's Approach to Data Placement and Management 1.3.Adaptive Resync 1.4.Results 1.5.Conclusion

More information

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Internetworking Multiple networks are a fact of life: Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Fault isolation,

More information

The Network Processor Revolution

The Network Processor Revolution The Network Processor Revolution Fast Pattern Matching and Routing at OC-48 David Kramer Senior Design/Architect Market Segments Optical Mux Optical Core DWDM Ring OC 192 to OC 768 Optical Mux Carrier

More information

Routing, Routers, Switching Fabrics

Routing, Routers, Switching Fabrics Routing, Routers, Switching Fabrics Outline Link state routing Link weights Router Design / Switching Fabrics CS 640 1 Link State Routing Summary One of the oldest algorithm for routing Finds SP by developing

More information

A General Purpose Queue Architecture for an ATM Switch

A General Purpose Queue Architecture for an ATM Switch Mitsubishi Electric Research Laboratories Cambridge Research Center Technical Report 94-7 September 3, 994 A General Purpose Queue Architecture for an ATM Switch Hugh C. Lauer Abhijit Ghosh Chia Shen Abstract

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Optical Packet Switching

Optical Packet Switching Optical Packet Switching DEISNet Gruppo Reti di Telecomunicazioni http://deisnet.deis.unibo.it WDM Optical Network Legacy Networks Edge Systems WDM Links λ 1 λ 2 λ 3 λ 4 Core Nodes 2 1 Wavelength Routing

More information

CCNA Exploration Network Fundamentals. Chapter 09 Ethernet

CCNA Exploration Network Fundamentals. Chapter 09 Ethernet CCNA Exploration Network Fundamentals Chapter 09 Ethernet Updated: 07/07/2008 1 9.0.1 Introduction 2 9.0.1 Introduction Internet Engineering Task Force (IETF) maintains the functional protocols and services

More information

ECE 551 System on Chip Design

ECE 551 System on Chip Design ECE 551 System on Chip Design Introducing Bus Communications Garrett S. Rose Fall 2018 Emerging Applications Requirements Data Flow vs. Processing µp µp Mem Bus DRAMC Core 2 Core N Main Bus µp Core 1 SoCs

More information

CSC 401 Data and Computer Communications Networks

CSC 401 Data and Computer Communications Networks CSC 401 Data and Computer Communications Networks Network Layer Overview, Router Design, IP Sec 4.1. 4.2 and 4.3 Prof. Lina Battestilli Fall 2017 Chapter 4: Network Layer, Data Plane chapter goals: understand

More information

Master Course Computer Networks IN2097

Master Course Computer Networks IN2097 Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Master

More information

Implementation of a leaky bucket module for simulations in NS-3

Implementation of a leaky bucket module for simulations in NS-3 Implementation of a leaky bucket module for simulations in NS-3 P. Baltzis 2, C. Bouras 1,2, K. Stamos 1,2,3, G. Zaoudis 1,2 1 Computer Technology Institute and Press Diophantus Patra, Greece 2 Computer

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

2. LAN Topologies Gilbert Ndjatou Page 1

2. LAN Topologies Gilbert Ndjatou Page 1 2. LAN Topologies Two basic categories of network topologies exist, physical topologies and logical topologies. The physical topology of a network is the cabling layout used to link devices. This refers

More information

DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM

DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM Alberto Perez, Technical Manager, Test & Integration John Hildin, Director of Network s John Roach, Vice President

More information

Traditional network management methods have typically

Traditional network management methods have typically Advanced Configuration for the Dell PowerConnect 5316M Blade Server Chassis Switch By Surendra Bhat Saurabh Mallik Enterprises can take advantage of advanced configuration options for the Dell PowerConnect

More information

Cisco IOS Switching Paths Overview

Cisco IOS Switching Paths Overview This chapter describes switching paths that can be configured on Cisco IOS devices. It contains the following sections: Basic Router Platform Architecture and Processes Basic Switching Paths Features That

More information

Chapter 4: network layer. Network service model. Two key network-layer functions. Network layer. Input port functions. Router architecture overview

Chapter 4: network layer. Network service model. Two key network-layer functions. Network layer. Input port functions. Router architecture overview Chapter 4: chapter goals: understand principles behind services service models forwarding versus routing how a router works generalized forwarding instantiation, implementation in the Internet 4- Network

More information