THE UNIVERSITY OF NAIROBI DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING FINAL YEAR PROJECT. PROJECT NO. 60 PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS OMARI JAPHETH N. F17/2157/2004 SUPERVISOR: DR. G.S.O ODHIAMBO EXAMINER: DR. OUMA H. ABSOLOMS
PROJECT OBJECTIVES
To design and analyze a parallel lookup algorithm that can work on routers.
To model a suitable simulation of the algorithm and investigate whether it improves router throughput.

INTRODUCTION
Internet traffic continues to increase rapidly year by year. The explosive growth of the Internet in the number of users and the variety of services parallels the growth in transmission link capacity: advances in fibre-optic technology have created a huge supply of wide area network (WAN) bandwidth. New applications (e.g., the Web, video conferencing, remote imaging) have higher bandwidth needs than traditional applications, so further increases in users, hosts, domains, and traffic are expected. To provide adequate Internet performance, communication links in the Internet backbone are being upgraded to high-speed fibre-optic links. However, even Gigabit links are insufficient unless Internet routers can forward packets at Gigabit rates.
When an Internet router receives a packet P on an input link interface, it uses the destination address in P to look up a routing database. The result of the lookup gives the output link interface to which P is forwarded. There is some additional bookkeeping, such as updating packet headers, but the major tasks in packet forwarding are address lookup and switching packets between link interfaces. Address lookup is thus a major bottleneck in high-speed forwarding. High-speed IP routers and switches are therefore required to forward an exponentially increasing volume of traffic at high speeds; core routers currently operate at multi-gigabit or terabit speeds. Parallel processing, i.e. parallel IP lookups, can increase the lookup rate.
COMPONENTS OF A ROUTER A router has four components: input ports, output ports, a switching fabric, and a routing processor. An input port is the point of attachment for a physical link and is the point of entry for incoming packets. The switching fabric interconnects input ports with output ports. The routing processor participates in routing protocols and creates a forwarding table that is used in packet forwarding.
THE PACKET FORWARDING PROCESS
The IP packet processing steps are as follows:
1. IP Header Validation: As a packet enters an ingress port, the forwarding logic verifies all Layer 3 information (header length, packet length, protocol version, checksum, etc.).
2. Route Lookup and Header Processing: The router then performs an IP address lookup using the packet's destination address to determine the egress (or outbound) port, and performs all IP forwarding operations (TTL decrement, header checksum update, etc.).
3. Packet Classification: In addition to examining the Layer 3 information, the forwarding engine examines Layer 4 and higher-layer packet attributes relative to any QoS and access control policies.
4. With the Layer 3 and higher-layer attributes in hand, the forwarding engine performs one or more parallel functions: it associates the packet with the appropriate priority and egress port(s) (an internal routing tag gives the switch fabric the egress port information, the QoS priority queue the packet is to be stored in, and the drop priority for congestion control); redirects the packet to a different destination (ICMP redirect); or drops the packet according to a congestion control policy (e.g., Random Early Detection (RED)) or a security policy.
5. The forwarding engine notifies the system controller that a packet has arrived.
6. The system controller reserves a memory location for the arriving packet.
7. Once the packet has been passed to the shared memory, the system controller signals the appropriate egress port(s). For multicast traffic, multiple egress ports are signalled.
8. The egress port(s) extract the packet from the known shared-memory location using an appropriate scheduling algorithm, e.g. Weighted Fair Queuing (WFQ).
9. When the destination egress port(s) have retrieved the packet, they notify the system controller, and the memory location is made available for new traffic.
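The validation, lookup, and tagging steps above can be sketched in Python. This is a minimal illustration only; the field names and the toy classification policy are hypothetical, not taken from the project code.

```python
def forward_packet(packet, forwarding_table):
    """Sketch of steps 1-4: validate the header, look up the egress
    port, decrement the TTL, and attach an internal routing tag."""
    # Step 1: Layer 3 validation (drop malformed or expired packets)
    if packet["version"] != 4 or packet["ttl"] <= 1:
        return None
    # Step 2: route lookup on the destination address, then TTL decrement
    egress = forwarding_table.get(packet["dst"])
    if egress is None:
        return None                      # no route: drop the packet
    packet["ttl"] -= 1
    # Steps 3-4: a toy classification policy mapping TOS to a priority queue
    priority = 0 if packet.get("tos", 0) == 0 else 1
    return {"egress_port": egress, "priority": priority, "packet": packet}
```

The routing tag returned here plays the role of the internal tag described in step 4: it tells the switch fabric which egress port and priority queue to use.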
IP VERSION 4 PACKET FORMAT
Version: specifies the version of the IP protocol and determines the packet format. Version 6 is also defined; it is similar to version 4 but uses longer addresses.
Header Length (HLen): gives the number of 32-bit words in the header.
Type of Service (TOS): infrequently used, but allows for application-specific treatment of packets.
Identifier, Flags and Offset: used for fragmentation and reassembly of IP packets.
Time to Live (TTL): specifies the remaining number of hops before the packet should be discarded; prevents infinite looping of packets.
Protocol: used for demultiplexing at the destination.
Checksum: for detecting errors in the IP header.
Address fields: specify the source and destination. The hierarchical address structure supports a large-scale Internet.
Options: rarely used, but must be supported in complete IP protocol implementations.
Classless Inter-domain Routing (CIDR) extends the subnet idea: arbitrary address prefixes can represent a set of addresses that are treated as a group for packet-forwarding purposes.
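The fixed fields above can be decoded with Python's struct module. This is an illustrative sketch, not part of the project's Visual Basic implementation.

```python
import struct

def parse_ipv4_header(data: bytes) -> dict:
    """Decode the fixed 20-byte IPv4 header fields discussed above."""
    (version_ihl, tos, total_len, ident, flags_off,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": version_ihl >> 4,
        "hlen": version_ihl & 0x0F,       # header length, in 32-bit words
        "tos": tos,
        "total_length": total_len,
        "id": ident,
        "flags": flags_off >> 13,
        "frag_offset": flags_off & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,
        "checksum": checksum,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }
```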
PARALLEL IP ROUTING TABLE LOOKUP ALGORITHM PSEUDOCODE
Input: Destination IP Address (DIP)
Output: Next Hop IP Address (NIP)
Step 1: Input the destination IP address.
Step 2: Use the ID bits of each DIP to allocate it to a memory block, then push it into the first-in first-out buffer (FIFO) of the corresponding memory module.
Step 3: For every FIFO buffer of each memory unit Mi, in parallel do
  while (true) do
  begin
    if FIFO is empty then continue;
    else pop a DIP from the local FIFO;
  end;
Step 4: Lookup. For each DIP popped out of a FIFO do
  use the ID to associate the DIP with its corresponding NIP in that memory unit;
  pop the NIP from the memory module;
  push the NIP to the output buffer;
Step 5: Pop the NIP from the output buffer as the result.
Step 6: Stop.
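The steps above can be sketched in Python, with threads standing in for the eight parallel memory units and `queue.Queue` objects as the FIFOs. The ID-bit extraction follows the allocation criterion described later (bits 16, 24 and 32 of the address); the table layout is an assumption for illustration.

```python
import threading
from queue import Queue

NUM_UNITS = 8

def unit_id(dip: int) -> int:
    # ID bits: bits 16, 24 and 32 of the 32-bit address (counting the
    # MSB as bit 1), matching the allocation criterion in the text.
    return (((dip >> 16) & 1) << 2) | (((dip >> 8) & 1) << 1) | (dip & 1)

def parallel_lookup(dips, tables):
    """Steps 2-5 of the pseudocode: allocate each DIP to a per-unit FIFO,
    look it up concurrently in its memory unit, collect the NIPs."""
    fifos = [Queue() for _ in range(NUM_UNITS)]
    results = Queue()                     # shared output buffer
    for dip in dips:                      # Step 2: allocate by ID bits
        fifos[unit_id(dip)].put(dip)
    for fifo in fifos:
        fifo.put(None)                    # sentinel marking end of input

    def worker(i):                        # Steps 3-4: per-unit lookup loop
        while True:
            dip = fifos[i].get()
            if dip is None:
                break
            results.put((dip, tables[i].get(dip)))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(NUM_UNITS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {dip: nip for dip, nip in list(results.queue)}   # Step 5
```

The sentinel `None` terminates each worker loop once its FIFO drains, which substitutes for the unbounded `while (true)` of the pseudocode.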
ALGORITHM FLOW CHART
QUEUING ANALYSIS
Queuing theory is used to model the FIFO queues in the lookup subsystem. Assume that the arrival process of the incoming IP addresses is a Poisson process with average arrival rate λ, and that the lookup service times are exponentially distributed with service rate μ, i.e. mean service time T = 1/μ seconds. An M/M/8/K queue model is used to represent the system. In Kendall notation, the first letter M denotes the distribution of the inter-arrival time, the second M the distribution of the service time, 8 the number of servers, and K the maximum size of the waiting queue. M (Markov) denotes the exponential distribution; the letter stems from the fact that the exponential distribution is the only continuous distribution with the Markov property, i.e. it is memoryless. The average arrival rate of each of the eight constituent M/M/1/K queues reduces to λ/8, while the service rate remains μ. Classic queuing theory then gives the corresponding parameters as follows.
Average number of destination IP addresses queued in each memory module: L_q = λ² / (μ(μ − λ))
Average delay time in a queue: W_q = λ / (μ(μ − λ))
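The per-module metrics can be computed directly from the formulas above. In this sketch, `lam` is taken as the per-queue arrival rate (i.e. the aggregate rate divided by 8), consistent with the analysis.

```python
def queue_metrics(lam, mu):
    """M/M/1 metrics for one memory module's FIFO, using the formulas
    above; lam is the per-queue arrival rate, mu the service rate."""
    assert lam < mu, "queue is unstable unless rho = lam/mu < 1"
    rho = lam / mu                          # traffic intensity
    Lq = lam ** 2 / (mu * (mu - lam))       # mean number waiting in queue
    Wq = lam / (mu * (mu - lam))            # mean queuing delay (seconds)
    return rho, Lq, Wq
```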
According to the analysis presented above, the graph shows the running average queuing delay as a function of the traffic intensity ρ. It shows that the queuing delay does not grow sharply as long as ρ remains under 0.6.
LOOKUP PROCESS
The routing table is modelled by a Microsoft Access database containing a list of prefixes and their corresponding next-hop addresses. The parallel memory units are modelled by eight tables in the routing database, where each table is stored in a single memory unit. Lookup is effected by searching for an entry that matches a prefix obtained from the first octet of the IPv4 address input into the system as a Destination IP Address (DIP).

ID (bits 16, 24 & 32):  000  001  010  011  100  101  110  111
Memory Unit:              1    2    3    4    5    6    7    8

The table shows the criterion used to allocate the destination IP addresses to the different memory blocks: bits 16, 24, and 32 of the address are used as the first, second and third ID bits respectively.
The search operation is implemented in the Visual Basic programming language. The database is accessed using an ActiveX Data Object (ADO) tool referred to as adodc. ADO is a language-neutral object model that lets you manipulate data accessed by an underlying OLE DB provider. (An OLE DB provider is a data manager that interfaces directly with a database.) ADO's Recordset object contains records returned from a database plus the cursor for those records. This tool links the search code to the database. Every memory block is linked separately, so that the search operation in one memory block is independent of the other units. The search operation in the individual memory units is exclusive: at a given time instant, a packet that arrives at the system input is searched for only in its corresponding memory block. Thus for every packet arrival, only the one memory block (out of eight) that meets the criterion in table 3.1 is searched.
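The per-unit search can be sketched as a linear scan over one unit's table. This is illustrative Python standing in for the Visual Basic/ADO code; the field names `prefix` and `next_hop`, and the sample addresses, are hypothetical.

```python
def search_unit(table, dip):
    """Search one memory unit's table for an entry whose prefix matches
    the first octet of the DIP; a linear scan, O(N) in the worst case."""
    first_octet = dip.split(".")[0]
    for entry in table:
        if entry["prefix"] == first_octet:
            return entry["next_hop"]     # matching NIP found
    return None                          # no matching prefix in this unit
```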
RESULTS
ANALYSIS
The forwarding table is subdivided into eight tables stored in separate, parallel-accessible memory blocks. This parallelism minimizes memory contention, hence IP lookups are faster.
COMPLEXITY ANALYSIS
Given that the routing tables for units 1 to 8 have N1, N2, N3, N4, N5, N6, N7 and N8 entries respectively, the worst-case lookup complexity for the units is O(N1), O(N2), O(N3), O(N4), O(N5), O(N6), O(N7) and O(N8) respectively. In this design N1 = N2 = N3 = N4 = N5 = N6 = N7 = N8 = 60, so a lookup in one unit terminates in at most 60 comparisons.
The algorithm was employed in a model network simulated with Packet Tracer software. The parameters of router B are used in this analysis. It is a Cisco 3600 router with an Embedded Services Processor containing 10 packet processor elements (PPEs) at a clock rate of 1 GHz, and 32 MB of 50 ns DRAM. The lookup process is a memory read operation. With the 1 GHz processor (i.e. a 1 ns clock), the 50 ns DRAM performs one read in 50 clock cycles. If the full bandwidth of the gigabit link is utilized, the maximum rate of incoming packets is 1 Gbps. Each memory lookup unit is assigned a single packet processor element, so a single comparison takes 50 ns. In the worst case, the lookup operation takes (50 × N) ns, where N = 60, giving a lookup time of 3 µs. The throughput per memory unit is therefore 1/(50 ns × 60) = 333,333.3 packets/second, i.e. 333,333.3 × 32 ≈ 10.67 Mbps (since a packet is taken to be 32 bits long) in the worst case. The total throughput of the system is thus 10.67 × 8 ≈ 85.33 Mbps.
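The worst-case arithmetic above can be reproduced directly (figures taken from the text; the function and parameter names are illustrative):

```python
def worst_case_throughput(read_ns=50, entries=60, packet_bits=32, units=8):
    """Worst case: every lookup scans all 60 entries at 50 ns per read."""
    lookup_ns = read_ns * entries            # 3000 ns = 3 us per lookup
    pps = 1e9 / lookup_ns                    # ~333,333 packets/s per unit
    return pps * packet_bits * units         # total bits/s over 8 units
```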
MODEL NETWORK
A model network simulated with Packet Tracer software.
ANALYSIS (CONTD)
In the best case, the algorithm complexity reduces to O(1) for every memory unit: the algorithm can terminate after only one comparison. The best case occurs when the lookup procedure matches the first entry in the routing table. With the 50 ns DRAM and the 1 GHz clock (1 ns per cycle), the first read is done in 50 clock cycles, i.e. 50 ns. The throughput for one memory unit is then 1/50 ns = 20 Mpackets/second, giving 20 M × 8 × 32 = 5.12 Gbps for the whole system.
With parallel processing, the workload (IP lookup) is shared between eight different processors, as opposed to shared-memory systems, which have a single global memory accessed by all processors. When many processors make simultaneous requests to a single memory location or bank, memory access becomes a bottleneck and access times can increase greatly. Because of limitations on processor-to-memory bandwidth, performance suffers when too many processes attempt to access the same memory location simultaneously. For this reason, this design employs a parallel physical memory layout to ensure that the memory system can handle as many simultaneous requests as possible: different processors are assigned different memory units so that lookups can take place in separate units concurrently.
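For comparison with the worst case, the best-case figure works out as follows (same illustrative naming, figures from the text):

```python
def best_case_throughput(read_ns=50, packet_bits=32, units=8):
    """Best case: the lookup matches the first entry, i.e. one 50 ns read."""
    pps = 1e9 / read_ns                      # 20 Mpackets/s per unit
    return pps * packet_bits * units         # 5.12 Gbps for the whole system
```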
CONCLUSION
In this design, the major bottleneck in high-speed packet forwarding, the routing table lookup, has been analysed. A simple parallel algorithm has been proposed that speeds up the lookup process and hence improves router throughput. The algorithm has been analysed in relation to a network simulated with Packet Tracer software, and it has been shown that the router throughput is significantly increased. Faster lookups imply higher router throughput (i.e. the rate at which packets are transferred from the input interface to the appropriate output interface, measured in bits transferred per second). This is because routing table lookups take the largest share of the router's resources, i.e. memory access time and processing speed, in the packet forwarding process. Speeding up the lookup process therefore has a significant effect on router throughput.
RECOMMENDATIONS FOR FURTHER WORK
In future, this work can be extended by investigating other aspects of parallel processing. For instance, the criterion for splitting the routing table entries across separate memory modules should be implemented in the routing processor. The routing table should also be implemented as a trie-based structure, as opposed to the Access database used in this work.
THANK YOU.