The GLIMPS Terabit Packet Switching Engine

February 2002
I. Elhanany, O. Beeri

Terabit Packet Switching Challenges

The ever-growing demand for additional bandwidth is reflected in the increasing capacity requirements placed on backbone switches and routers. As these systems grow in port count (tens and even hundreds of ports) and line rates (OC-48 and OC-192), the engineering effort pertaining to the design and management of their switch fabric cores becomes more challenging. The performance of a switch/router as a whole is greatly affected by that of its fabric. Therefore, legacy packet-switching architectures and algorithms, which were suitable for the Gigabit space, need to be replaced by more scalable, higher-performance technologies. One of the primary building blocks of any switch fabric is the scheduler, which is in charge of determining which input ports are to transmit to which output ports at any given time. TeraCross' patent-pending GLIMPS technology is a unique and innovative packet-scheduling scheme that offers the scalability and performance characteristics expected by tomorrow's high-end packet-switched networks.

Current Trends in Switch Fabric Architectures

Switches and routers are logically comprised of line cards and a switch fabric. On each line card, which is associated with a port, reside one or more network processing units that interface to the physical data links, decode incoming traffic, find the proper route for it, and apply traffic shaping, filtering and other policy functions. Hence the network processor is generally in charge of the data received and transmitted by the port. The switch fabric is where the actual transmission of data across ports takes place. In general, a fabric includes a set of crosspoint switches, a scheduler module and buffers. The crosspoints are configurable interconnecting elements that dynamically establish links between input and output ports. A central role is played by the scheduler, which acts as the brain of the switch fabric: it configures the crosspoint, allows buffered data to traverse it, and repeatedly reconfigures the crosspoint for successive transmissions. Two fabric architectures are deployed today in the mainstream of high-end switching systems: distributed shared-memory and crossbar-based.

SHARED MEMORY SWITCHES

Let N denote the number of ports in a system. Shared memory schemes are very simple: each input port can send packets to each output port via a shared memory space in accordance with a prescribed arbitration scheme. Contention is resolved by speeding up the read/write operations to and from the shared memory device by a factor of N. In effect, this is an output queueing scheme, since there is a dedicated logical path between any two ports. The downside of these schemes is the need for a large speedup (on the order of N), a disadvantage that prevents them from being implemented at Terabit capacities. Yet another technique applicable to shared memory switches is packet slicing over several shared memory chips. However, since such a scheme implies a higher chip count and, as such, higher power consumption, it is limited in scalability. Utilizing a group of small shared memory switches to form a larger switch, such as in a Clos network topology, typically results in a very high overall port count and large latencies. Such schemes also require complex management techniques, such as simultaneous configuration of the multiple shared-memory switches. Moreover, since power consumption is a direct product of the number of switching elements, shared memory solutions become less attractive for high-end systems.
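To make the order-N speedup concrete, the following back-of-the-envelope sketch (our illustration; the 64-port count and roughly 10 Gbps OC-192 line rate are assumptions, not figures from the text) computes the aggregate memory bandwidth an N-port shared-memory switch must sustain:

```python
# Illustrative arithmetic (not from the original paper): every cell time,
# all N inputs write into the shared memory and all N outputs read from
# it, so the memory core must sustain 2*N times the line rate - the
# "order of N" speedup described above.

N = 64                      # assumed port count
line_rate_gbps = 10.0       # OC-192, roughly 10 Gbps (assumed)

memory_bw_gbps = 2 * N * line_rate_gbps
print(f"Required shared-memory bandwidth: {memory_bw_gbps / 1000:.2f} Tbps")
# -> Required shared-memory bandwidth: 1.28 Tbps
```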

SELF-ROUTING CROSSBARS

A self-routing crossbar operates without global decision-making. Rather, it is based on multiple stages, each of which performs a basic switching task. Typically, a packet that enters it propagates from the ingress node to the egress node of the crossbar without any need to configure the crossbar device by external means. In most cases a self-routing crossbar is based on a net of basic modular crosspoints, each with a few ingress nodes and a few egress nodes (usually 2 of each). To allow full connectivity, the crossbar should have at least N log2 N basic crosspoints (N crosspoints in each stage and log2 N stages). When a packet enters the switch it receives a tag that defines the egress port for which it is destined. This tag may define the exact path through the net, or only the destination in the case of a wormhole routing algorithm.

The self-routing crossbar has some evident advantages over other schemes, such as modularity, which makes it easily scalable, and the fact that it is self-sustaining in the sense that no scheduler or other central management unit is required. However, a self-routing device has some obvious disadvantages and problems, resulting mainly from the fact that real data traffic is usually not contention-free. When the paths of two packets that enter the switch at the same time partially overlap (the "hot spot" phenomenon), only one of the packets may be forwarded, while the other has to be dropped or buffered. Without buffering, this results in throughput degradation, which grows more prominent as the number of stages increases. Adding a buffer at the entry of each crosspoint allows higher throughput, but is not cost-efficient and carries higher latency. Making the crossbar fully non-blocking (with N^2 crosspoints) does not solve the problem; rather, it shifts it to the egress: in the worst case an egress has to handle N packets simultaneously, which is impractical for high-end systems. To eliminate contention inside the crossbar, some additional mechanism is usually added, namely a sorting mechanism in front of the crossbar. Such a solution is obviously no longer the native, simple crossbar discussed so far; rather, it requires a central management and scheduling unit added to the self-routing crossbar. Using a simple sorting mechanism will, again, result in poor throughput, poor latency and very limited QoS support.
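As a rough illustration of tag-based self-routing and of the hot-spot contention described above, the following sketch (ours, using a generic bit-fixing model rather than any particular product's topology) steers each packet one tag bit per stage and flags two simultaneous packets that collide on an internal wire:

```python
# Abstract sketch (our illustration, assumed model): destination-tag
# self-routing through log2(N) stages of 2x2 crosspoints. At stage s the
# packet's position is "bit-fixed" toward its tag, one bit per stage,
# with no external configuration of the crossbar.

def self_route(ingress: int, tag: int, stages: int) -> list[int]:
    """Positions occupied after each stage; the final one equals the tag."""
    node, path = ingress, [ingress]
    for s in range(stages):
        shift = stages - 1 - s                 # fix one tag bit per stage
        node = (node & ~(1 << shift)) | (((tag >> shift) & 1) << shift)
        path.append(node)
    return path

STAGES = 3                                     # an 8-port (2^3) example
a = self_route(0, 6, STAGES)                   # packet A: port 0 -> port 6
b = self_route(2, 7, STAGES)                   # packet B: port 2 -> port 7

# Index k is the position after the k-th stage (index 0 is the ingress).
# Two packets occupying the same position after the same stage contend
# for the same internal wire; without buffering one must be dropped.
clash = [k for k, (x, y) in enumerate(zip(a, b)) if x == y]
print(a, b, "contend at position(s):", clash)
# -> [0, 4, 6, 6] [2, 6, 6, 7] contend at position(s): [2]
```

Note that in this example the two packets are destined for different egress ports, yet they still collide inside the fabric; this internal blocking is exactly why buffering or sorting stages end up being added.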

MULTISTAGE INTERCONNECT NETWORK

A multistage interconnect network switch (MINS) is a switch based on a net of smaller interconnected sub-switches. This infrastructure is highly modular and theoretically allows almost unlimited growth. Since each of the sub-switches is relatively small, it may incorporate relatively complex and sophisticated scheduling algorithms, and may therefore overcome scalability limitations. Nevertheless, multistage switching has some overheads and disadvantages, as discussed below. Each of the switch elements has only a partial view of the system state, limited to its local environment. This prevents decisions that are truly consistent with the overall system resources. Moreover, such local scheduling decisions may overlap with each other or, worse, even contradict a desired global optimization. Controlling QoS provisioning in such a case is complex and far from straightforward. The full path of packets through the switch is the result of multiple routing decisions and buffering, where each stage adds its own queuing-latency overhead. The aggregate of all these overheads is higher than in the case of single-stage switching.

Although multistage interconnect switching is modular, implementation for a high number of ports is not trivial, bearing in mind that up to 80% of the connectivity is allocated (and thus "wasted") on interconnections. The overhead of the multistage approach is therefore non-negligible. Moreover, since the I/O cost in space and power is of the same order as the logic cost, such structures are not cost-effective.

CROSSBAR SWITCHES

The crossbar is a configurable interconnecting element that dynamically establishes links between input and output ports. Nearly all crossbar devices to date are electronic; however, there has been significant progress in the optical domain toward all-optical, high-density crossbar modules, based on dynamic WDM, MEMS and other optical technologies. The clear advantages of optical crossbars are low power, scalability and space efficiency. Electronic crossbar technology today allows over 128 input ports to be linked to 128 output ports, each running at 3.125 Gbps, in a single chip package. This potential may only be realized if the crossbar is managed correctly. In other words, a central role is played by the scheduler, which acts as the brain of the switch fabric: it configures the crossbar, allows buffered data to traverse it, and repeatedly reconfigures the crossbar for successive transmissions. If the scheduler performs well, the entire fabric performs well. The task at hand, however, is far from simple. Traditionally, switch fabrics were designed for moderate traffic loads considerably lower than the practical router capacity. Such a design has the advantage of offering low average packet delay even when employing simple scheduling algorithms. As Internet traffic grows, it is necessary to increase router capacity in order to maintain the ratio between traffic load and router capacity required for acceptable performance. Since traffic volume grows faster than router capacity, simple algorithms are proving inefficient and infeasible for modern routers. Internet traffic is also accompanied by an increasing volume of data that is sensitive to packet latency and jitter. Incoming packets are typically classified into Quality of Service (QoS) classes as a means of providing differentiated services; for each QoS class, the router must not exceed the stated latency and jitter requirements. The task of the scheduler is thus a complex one: for each transmission period, determine which packets from which input ports are to be transmitted to their designated output ports. As high port count and QoS become predominant characteristics, the scheduler must complete its task within the time frame of a single cell or packet, which for OC-192 translates to approximately 50 ns - an extremely short period of time in which to complete such a compound task efficiently.
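For reference, the approximately 50 ns figure follows from simple arithmetic; in the sketch below, the 64-byte cell size is our assumption (a common fabric cell size), not a figure from the text:

```python
# Rough arithmetic behind the ~50 ns scheduling budget at OC-192: a
# fixed-size cell occupies the line for cell_bits / line_rate seconds,
# and the scheduler must produce a new matching within that window.

line_rate_bps = 9.953e9    # OC-192, roughly 10 Gbps
cell_bytes = 64            # assumed fabric cell size (not given in the text)

cell_time_ns = cell_bytes * 8 / line_rate_bps * 1e9
print(f"Scheduling window per cell: {cell_time_ns:.1f} ns")  # ~51.4 ns
```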

Traffic Models Key to Performance Evaluation

Traffic traversing real-life switches and routers is far from ideal in the statistical sense. In other words, as opposed to being well scattered and uniformly distributed across the output ports, traffic tends to be bursty and non-uniformly distributed. This is due to the diverse nature of the applications generating the traffic and to non-homogeneous network aggregation. Advanced models, such as self-similar fractal models, have recently been proposed to represent the traffic flowing through the Internet backbone. These attempts, originating from heuristic and analytical work, indicate that the traffic is complex in behavior and should be regarded as such. To this end, scheduling algorithms should be evaluated with respect to the traffic actually arriving at the systems. Too frequently, switching schemes are compared under ideal traffic rather than under more complex, real-life scenarios.

LEGACY SCHEDULING ALGORITHMS

Commonly deployed scheduling algorithms are based on an early discipline called Parallel Iterative Matching (PIM), developed by the DEC Systems Research Center for a 16-port, 1 Gbps switch. The essence of PIM, as well as of its popular derivatives, is to use randomness as a means of avoiding starvation and maximizing the match. Inputs and outputs that have not been previously matched contend during each time slot. An iteration is comprised of the following three basic steps:

Step 1 (Request): All unmatched inputs send requests to every output for which they have packets to send.
Step 2 (Grant): Each output selects at random one of its requesting inputs.
Step 3 (Accept): Each input selects at random a single output from among those that granted it.

The PIM technique has two interesting properties. First, by eliminating the matched pairs in each iteration, convergence to a maximal match is assured in O(log N) iterations. Second, it ensures that all requests are eventually granted. PIM does, however, introduce some significant drawbacks, the principal of which is large queueing latency in the presence of traffic loads exceeding 60%. Moreover, its inability to provide prioritized QoS and its requirement for O(N^2) connectivity render PIM implementation impractical for modern switching cores.
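The three steps translate almost directly into code. The following is a compact sketch of ours (the virtual-output-queue representation is assumed for illustration), not the DEC implementation:

```python
import random

def pim(queues: list[set[int]], iterations: int = 4) -> dict[int, int]:
    """One PIM matching round. queues[i] is the set of outputs for which
    input i currently holds packets (an assumed, simplified VOQ view)."""
    match: dict[int, int] = {}          # input -> output
    taken: set[int] = set()             # already-matched outputs
    for _ in range(iterations):
        # Step 1 (Request): unmatched inputs request all relevant outputs.
        requests: dict[int, list[int]] = {}
        for i, dests in enumerate(queues):
            if i in match:
                continue
            for o in dests - taken:
                requests.setdefault(o, []).append(i)
        if not requests:
            break                       # maximal match reached
        # Step 2 (Grant): each output grants one requesting input at random.
        grants: dict[int, list[int]] = {}
        for o, reqs in requests.items():
            grants.setdefault(random.choice(reqs), []).append(o)
        # Step 3 (Accept): each input accepts one granting output at random.
        for i, outs in grants.items():
            o = random.choice(outs)
            match[i] = o
            taken.add(o)
    return match

# 4x4 example: input 0 holds cells for outputs 1 and 2, and so on.
# Prints one possible maximal match; the randomness makes it vary per run.
print(pim([{1, 2}, {1}, {0, 3}, {3}]))
```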

The basic round-robin matching (RRM) algorithm was designed to overcome the disadvantages of PIM in terms of both fairness and complexity. Instead of arbitrating randomly in steps 2 and 3, selections are made according to a prescribed rotating-priority discipline, with two pointers updated after every grant and every accept, respectively. RRM offers only a minor improvement over PIM: under non-uniformly distributed traffic loads the overall performance remains poor, mainly due to a pointer synchronization phenomenon.

A popular descendant of PIM and RRM, developed by Nick McKeown of Stanford University, is the widely implemented scheduling algorithm iSLIP. iSLIP is an iterative algorithm consisting of the following three steps:

Step 1 (Request): All unmatched inputs send requests to every output for which they have packets to send.
Step 2 (Grant): Each output selects a requesting input according to a predefined priority sequence. A pointer indicates the current location of the highest-priority element; it is incremented (modulo N) to one beyond the granted input only if the grant is accepted in step 3.
Step 3 (Accept): In a similar fashion to step 2, each input selects one of the outputs that granted it, in accordance with a predefined priority order. Here too a unique pointer is maintained to indicate the position of the highest-priority element, and it is incremented (modulo N) to one location beyond the accepted output. As opposed to updating the output pointer after every grant, that update occurs only if the grant has been accepted by an input in step 3.

The iSLIP algorithm has been shown to exhibit a significant reduction in pointer synchronization and, accordingly, higher throughput with lower average packet delay. The algorithm does, however, carry some basic shortcomings, including degraded performance in the presence of non-uniform and bursty traffic flows, lack of inherent QoS support, and limited scalability with respect to high (>32) port densities. Almost all switch fabrics today are scheduled by some variant of the iSLIP algorithm. These solutions are limited to the sub-320 Gbps space and cannot scale to Terabit systems, much less support QoS and other advanced features. The major drawback of request-grant-accept based algorithms, such as PIM and iSLIP, lies in the independent treatment of queueing pointers, which tend to synchronize, leading to degradation of the overall performance.
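For comparison with the PIM sketch above, here is a simplified rendering of ours of the iSLIP pointer discipline; it is not McKeown's reference implementation, and the queue representation is again assumed:

```python
# Simplified iSLIP sketch (ours). For brevity the pointers advance on
# accepts in every iteration, whereas the published algorithm updates
# them only on first-iteration accepts.

def islip(queues: list[set[int]], n: int, grant_ptr: list[int],
          accept_ptr: list[int], iterations: int = 4) -> dict[int, int]:
    """queues[i] = outputs input i holds cells for; the pointer lists are
    mutated in place, like the pointer registers they stand in for."""
    match: dict[int, int] = {}                 # input -> output
    taken: set[int] = set()                    # matched outputs
    for _ in range(iterations):
        # Step 1 (Request): unmatched inputs request all queued outputs.
        requests = {o: [i for i in range(n)
                        if i not in match and o in queues[i]]
                    for o in range(n) if o not in taken}
        # Step 2 (Grant): each output grants the requesting input at or
        # after its grant pointer (rotating priority).
        grants: dict[int, list[int]] = {}
        for o, reqs in requests.items():
            if reqs:
                g = min(reqs, key=lambda i: (i - grant_ptr[o]) % n)
                grants.setdefault(g, []).append(o)
        # Step 3 (Accept): each input accepts the granting output at or
        # after its accept pointer; only then do both pointers advance.
        for i, outs in grants.items():
            o = min(outs, key=lambda x: (x - accept_ptr[i]) % n)
            match[i] = o
            taken.add(o)
            accept_ptr[i] = (o + 1) % n
            grant_ptr[o] = (i + 1) % n         # advanced only on accept
    return match

n = 4
print(islip([{1, 2}, {1}, {0, 3}, {3}], n, [0] * n, [0] * n))
```

Because a grant pointer moves only when its grant is accepted, outputs that granted the same input in one iteration naturally desynchronize in the next; this is the property the text credits for iSLIP's higher throughput relative to RRM.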

GLIMPS TECHNOLOGY: BREAKING THE SCALABILITY BARRIER

It would be intuitive to assume that the ideal scheduler consists of a centralized decision-making unit that receives requests from all ports, determines the optimal matches and issues them to the crossbar and line cards. The only problem is that, for the port counts and rates characterizing a Terabit router, such a centralized approach has been found impractical, mainly due to the amount of information that must be sent from the line cards and received at the centralized scheduler. For this reason, it has been assumed thus far that distributed, pointer-based schemes are the only pragmatic scheduling approach.

TeraCross' patent-pending GLIMPS technology shatters this notion by offering an innovative approach to implementing a centralized scheduling mechanism that scales to multi-Terabit capacities. This is achieved by a unique queueing architecture coupled with a matching packet-scheduling algorithm, which together offer the following unprecedented features:

- Scalability: the GLIMPS architecture obeys an O(N log N) complexity rule, which renders it practical to support hundreds of ports, hence multi-Terabit capacities.
- Low latency and jitter: owing to the centralized approach, matching decisions are made from a global optimization perspective, which translates to extremely low packet latencies and guaranteed bounded jitter.
- Traffic invariance: the GLIMPS algorithm regards highly bursty and non-uniform traffic just as it would ideal traffic, since the decision-making process remains the same. As a result, performance metrics are almost unaffected by the statistical nature of the traffic, thereby inherently supporting multimedia and other time-critical applications.
- QoS provisioning: the classes of service, as implemented in the ingress queueing, are projected into the GLIMPS algorithm in order to prioritize the various traffic flows. To this end, support for differentiated services is an integrated part of the decision-making process, placing QoS at the very foundations of the fabric.
- Fairness: centralized scheduling allows asymmetric ports to be weighted appropriately, so that the switching bandwidth allocation is shared in a balanced and fair fashion.
- No starvation: by maintaining an overall view of the queue status, the scheduler can verify that no queue remains unserved for prolonged periods of time.

These features position TeraCross' GLIMPS scheduling technology to lead the emergence of next-generation, multi-Terabit, high-performance switch fabrics.