Hardware Assisted Recursive Packet Classification Module for IPv6 Networks

Shivvasangari Subramani [shivva1@umbc.edu]
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County, USA

ABSTRACT

In advanced computer networks, implementing internet functions such as traffic policing, Quality of Service, firewall processing, and normal unicast and multicast forwarding requires classifying packets based on their multi-field headers, a task with high time complexity. It is estimated that approximately 14% of the processing load is spent on packet classification; therefore, a hardware-assisted packet classification scheme can not only reduce the workload placed on the network processor but also potentially speed up the entire classification process. We propose an architecture based on multiple SIMD processors to take advantage of the parallel processing paradigm offered by the decomposition class of algorithms. Our design is composed of seven identical SIMD processors that simultaneously handle the multiple data chunks from the various IP header fields. We used a single-threaded C program to simulate the classification process that occurs on a network processor, along with additional logging of individual instruction execution time and frequency. We analyzed our packet classifier code using various cache structures and sizes, and we also compared the performance of single-threaded and multithreaded implementations of the classifier.

1. INTRODUCTION

Packet classification is the process of comparing an incoming packet flow against a set of pre-established flow characteristics (known as filters or rules) to determine its identity. In the current generation of the Internet Protocol, IPv4, packet classification is primarily used for enhancing security, monitoring the network, and differentiating quality of service. With the emergence of the new internet protocol, IPv6, packet classification has expanded its role to include matching flow characteristics against various flow profiles (known as Robust Header Compression profiles, defined in RFC 3095 and RFC 3096; a profile serves as the basis for different compression schemes targeting various packet contents) to determine the best header compression method. At its core, packet classification is a multiple-field search problem: finding the best matching filter based on exact or wildcard patterns. Since filters can contain overlapping properties, a search often yields multiple matching filters for a single flow; filter priorities are therefore added to disambiguate non-exclusive search results.

In the context of Robust Header Compression used for IPv6 in a wireless network, packet classification, header encoding & decoding, and CRC are the three most computationally intensive components on a network processor [2]. Our design goal is motivated by the idea of forming three linked hardware modules that work in conjunction to reduce the computational overhead imposed on the network processor (see Figure 1). Although this paper focuses only on the packet classification module, our overall design model remains intact, and Figure 1 illustrates the placement of the packet classification module relative to the overall structure.

Our packet classification module design is derived from a class of algorithms categorized as decomposition methods. In comparison with the remaining classes of classification algorithms, including decision tree, exhaustive search, and tuple space, decomposition stands out as a natural candidate for a hardware implementation. The decomposition method breaks the multiple-field search down into independent searches on single fields, then combines the search results. This type of search is a natural fit for a modern parallel SIMD processing architecture. Upon further research, we narrowed our choice down to Recursive Flow Classification (RFC) due to its efficiency and speed.

RFC treats packet classification as a bit-string reduction problem, in which a bit string must be reduced from its original S-bit length to a new T-bit length such that T << S [1]. The S bits represent the total bits of all the header fields in a packet, and the T bits represent a set of matching filters found for those header fields. The unique T-bit pattern used to represent the set of matching filters for a packet is known as an equivalence class identifier (eqID) [1]. RFC carries out this bit reduction through multiple phases: each phase combines eqIDs returned from previous lookup phases, then re-applies the same reduction method to yield more concise eqID classifications with a smaller total bit length. The last phase of this successive merging and reduction yields a final eqID that specifies the flow action for the packet. The basic idea of RFC is illustrated in Figure 2 [1].

We divide the remainder of this paper into four major sections: Architecture Model, Architecture Simulation, Data Analysis, and Conclusion. In section two, Architecture Model, we dive into the details of our RFC hardware architecture, explaining its various components and their construction. In section three, Architecture Simulation, we describe the simulation methods used to experiment with our design and the parameters employed for the data inputs. In section four, Data Analysis, we present our simulation results and our analysis of the performance impact of the design. Finally, we conclude in section five with the key findings of our simulation results and discuss some of the limitations of our experiment and their implications.
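The recursive table-lookup structure of RFC is easy to express in software. The following is a minimal C sketch of the lookup flow, not the authors' implementation: chunk values index per-chunk phase-0 tables to produce eqIDs, and later phases concatenate eqIDs from earlier phases to index further tables until a single class ID remains. All table names and sizes here are illustrative assumptions; for simplicity every chunk is treated as 8 bits, and chunks 2-4 are assumed to yield at most 16 equivalence classes after phase 0.

#include <stdint.h>

/* Illustrative RFC lookup tables; in a real classifier these are
 * precomputed from the rule set. Contents and sizes are placeholders. */
static uint8_t phase0_tbl[5][256];        /* one table per 8-bit chunk   */
static uint8_t phase1_addr[256 * 256];    /* combines eqIDs of chunks 0,1 */
static uint8_t phase1_rest[16 * 16 * 16]; /* combines eqIDs of chunks 2-4
                                             (assumed to fit in 4 bits)  */
static uint8_t phase2_tbl[256 * 256];     /* final phase: yields class ID */

/* Map five 8-bit header chunks to a final class ID by successive
 * table lookups, mirroring the phase structure of Figure 2. */
uint8_t rfc_classify(const uint8_t chunk[5])
{
    uint8_t eq[5];
    for (int i = 0; i < 5; i++)               /* phase 0: per-chunk lookup */
        eq[i] = phase0_tbl[i][chunk[i]];

    /* phase 1: concatenate phase-0 eqIDs to index the next tables */
    uint8_t eq_addr = phase1_addr[(eq[0] << 8) | eq[1]];
    uint8_t eq_rest = phase1_rest[(eq[2] << 8) | (eq[3] << 4) | eq[4]];

    /* phase 2: combine the two remaining eqIDs into the class ID */
    return phase2_tbl[(eq_addr << 8) | eq_rest];
}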

2. ARCHITECTURE MODEL

The RFC packet classification architecture, depicted in Figure 3, can be summarized into four major parts: the Buffer Units, the Dispatching Unit (also called the Control Unit), the SIMD units, and the Cache & Memory Units. The Buffer Units, described in section 2.1, store the input and output data streams; these streams include the original packet header fields and the intermediate filter search results from the different RFC phases. The Buffer Units feed the data streams to the Dispatching Unit. The Dispatching Unit, described in section 2.2, acts as a load balancer for all the SIMD units; its major function is to issue the same instructions to all SIMD units, as well as to divide the header fields or eqIDs into multiple data chunks and distribute them. The SIMD units, described in section 2.3, are the centerpiece of our design; they perform the RFC bit-reduction process and forward their search results (eqIDs) to the Output Buffer Unit. The last group in our design is the Cache & Memory Units, described in section 2.4, which serve as storage for instructions and classification filters.

2.1 BUFFER UNITS

The RFC packet classification architecture contains two separate buffer units: the Input Data Buffer and the Output Data Buffer. The primary use of the Input Data Buffer is to capture the incoming packet header fields and, if necessary, queue them before passing the data down to the Dispatching Unit. The Input Buffer size is targeted at 512 KB, which can queue approximately 12,000 IPv6 packet headers (each IPv6 header contains 352 bits of header fields). The Input Buffer Unit is the equivalent of the original 2^S data source for the initial phase 0 shown in Figure 2. The Output Buffer uses the same hardware, but its capacity is halved, since each reduction phase produces data with a much shorter bit length. The purpose of the Output Data Buffer is to provide intermediate storage for the search results produced by the different RFC phases. Using the example from Figure 2, the Output Data Buffer stores the 2^64, 2^24, and 2^12 search results from phases 1, 2, and 3. Once the Dispatching Unit detects that the current incoming eqID was generated by a final RFC bit-reduction phase, it redirects the 2^T output to the output data link of our hardware module (see Figure 3).
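As a quick sanity check on the quoted buffer capacity, the short program below (our own arithmetic, not part of the paper's simulator) confirms that a 512 KB buffer holds roughly 12,000 headers of 352 bits each.

#include <stdio.h>

int main(void)
{
    const unsigned buf_bytes    = 512 * 1024; /* 512 KB input buffer     */
    const unsigned header_bits  = 352;        /* IPv6 header fields used */
    const unsigned header_bytes = header_bits / 8;          /* 44 bytes */

    /* 524288 / 44 = 11915 headers, i.e. approximately 12,000 */
    printf("headers queued: %u\n", buf_bytes / header_bytes);
    return 0;
}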
2.2 DISPATCHING UNIT

The Dispatching Unit is the central coordinator for distributing data and issuing instructions. It takes both the original packet header fields from the Input Data Buffer and the intermediate search results from the Output Buffer; these two data streams represent the entire data feed for the SIMD architecture. The Dispatching Unit divides the raw data streams into multiple chunks and sends them to the SIMD units for processing. It can detect the processing condition of each unit and stalls if necessary when all SIMD units reach their capacity.

Instruction fetch is done through a link to the Instruction Cache Unit; the Dispatching Unit loads and issues the same instruction to all the SIMD units in the same execution cycle. In our model, instructions are issued in order and must complete in order. The Dispatching Unit also ensures that the current packet header has been completely processed before it loads the next packet header from the Input Buffer. The final phase of a bit reduction is detected when only one eqID arrives from the Output Buffer Unit; once this condition occurs, the Dispatching Unit redirects the final output to the output data link of the RFC hardware module for the next stage of processing by the network processor.
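The control flow just described, feeding one header at a time and looping until a single eqID remains, can be summarized in C. This is an interface sketch under our own assumptions, not the authors' design: the extern functions and type names are hypothetical stand-ins for the hardware blocks named above.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interfaces for the hardware blocks described above. */
typedef struct { unsigned char data[44]; } header_t;   /* 352 bits */
typedef unsigned short eqid_t;

extern bool   input_buffer_pop(header_t *h);
extern size_t output_buffer_read(eqid_t *eq, size_t max);
extern void   simd_broadcast_and_run(const eqid_t *eq, size_t n);
extern void   split_header_into_chunks(const header_t *h, eqid_t *eq);
extern void   emit_final_class_id(eqid_t id);

void dispatch_loop(void)
{
    header_t h;
    eqid_t eq[8];

    while (input_buffer_pop(&h)) {            /* one header at a time   */
        split_header_into_chunks(&h, eq);     /* phase-0 inputs         */
        size_t n = 5;                         /* five phase-0 chunks    */

        /* Keep reducing until a single eqID (the class ID) remains. */
        while (n > 1) {
            simd_broadcast_and_run(eq, n);    /* same instruction stream
                                                 on all SIMD units      */
            n = output_buffer_read(eq, 8);    /* collect phase results  */
        }
        emit_final_class_id(eq[0]);           /* redirect to output link */
    }
}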

2.3 SIMD UNITS

The SIMD units are the main work engines of our design; they carry out the filter search operation for each RFC bit-reduction phase and send their results to the Output Buffer Unit. We use seven SIMD units to accommodate the seven header fields of an IPv6 packet header. Since each RFC phase reduces the output data length, there is less input to feed back into the SIMD units, so some of them may become idle after several phases. We considered allowing the next packet header in the queue to be fetched early; however, this complicates matters, because the Output Buffer Unit would need to be shared and the Dispatching Unit would also have to discern eqIDs belonging to different packet headers. An apparent solution is to attach an additional bit identifier to the data chunks, effectively labeling each packet header, but this would increase the data length and introduce the extra processing overhead of stripping the labels during the filter search; we therefore decided not to address this issue in this design.

In our design, all the SIMD units connect to the memory unit and the Output Buffer Unit via a common memory bus. The memory bus is 64 bits wide, operates at 200 MHz, and is capable of 2 transfers per cycle, which offers a total bandwidth of 3200 megabytes per second. Each SIMD processor is targeted as a 32-bit processor with a clock rate of 100 MHz, capable of processing 4 bytes per cycle. Assuming each packet header is exactly 352 bits (44 bytes), this configuration provides a processing capacity of about 63 million packet headers per second, with a total throughput of 2800 megabytes per second, which is well under the capacity limit of the memory bus. The remaining bus bandwidth can be used for loading the necessary filter data set from the main memory unit.
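The bandwidth figures above follow directly from the stated clock rates and widths; the short program below (our own check, using the paper's numbers) reproduces them.

#include <stdio.h>

int main(void)
{
    /* Memory bus: 64 bits wide, 200 MHz, 2 transfers per cycle */
    double bus_mb_s = (64.0 / 8.0) * 200e6 * 2 / 1e6;   /* 3200 MB/s */

    /* Seven 32-bit SIMD units at 100 MHz, 4 bytes per cycle each */
    double simd_mb_s = 7 * 4.0 * 100e6 / 1e6;           /* 2800 MB/s */

    /* Headers of 352 bits = 44 bytes each */
    double headers_s = simd_mb_s * 1e6 / 44.0;          /* ~63.6 M/s */

    printf("bus: %.0f MB/s, SIMD: %.0f MB/s, headers: %.1f M/s\n",
           bus_mb_s, simd_mb_s, headers_s / 1e6);
    return 0;
}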

2.4 CACHE AND MEMORY UNITS

The RFC hardware module is equipped with two cache units and a main memory unit that provide instruction buffering and data fetching. The two cache units are an instruction cache (i-cache) and a victim cache (v-cache). Both units adopt a fully associative structure aimed at reducing the miss rate. Since the instruction stream is shared across all the SIMD units, we believe a common shared instruction cache is the preferred option in this setup. The instruction cache is linked to a victim cache that adds a second level of instruction buffering; the victim cache stores entries purged from the instruction cache by conflict or capacity misses. The case for adding a victim cache comes from the fact that RFC recursively reuses the same routines across phases, so there is a high probability that a recently purged instruction will be used again in the next phase. The intended configuration sets the cache sizes to 8 KB for the i-cache and 4 KB for the v-cache.

Data transfer between the instruction cache unit and the Dispatching Unit is done via a memory bus similar to the one linking the SIMD units and the buffer units, but this bus link is enhanced to run at 400 MHz, double the previous bus speed. All data transfer from the cache units is facilitated through the Dispatching Unit; there are no direct interactions between the SIMD units and the caches. This is in sharp contrast with the main memory unit, which is directly connected to all the SIMD units via the shared memory bus. The memory unit is engineered this way because it stores the entire filter data set relevant to the search operations performed inside each SIMD unit. Since filter data sets tend to be large, approximately 4 megabytes after compression [1], a direct connection is likely to reduce overhead and improve transfer speed. In our initial design, we contemplated attaching an individual memory unit to each SIMD unit and splitting the large filter data set among the individual memory units; however, this approach would increase the complexity of the Dispatching Unit, which would take on the added responsibility of managing memory exchanges between SIMD units, and it would also likely increase the cost of the design. We therefore decided to use a shared memory model for this architecture.

3. ARCHITECTURE SIMULATION

3.1 RULES FRAMING

Since the rules are framed from the various fields of the header, a short description of the IPv6 header is given first [3].

3.1.1 IPv6 HEADER

Figure 4: IPv6 Header

The IPv6 header consists of 40 bytes, as follows:

1. Version - version 6 (4-bit IP version).
2. Traffic class - packet priority (8 bits). Priority values: 0-7 low priority; 8-15 high priority.
3. Flow label - QoS management (20 bits).
4. Payload length - payload length in bytes (16 bits).
5. Next header - specifies the next encapsulated protocol; the values are compatible with those specified for the IPv4 protocol field (8 bits).
6. Hop limit - equivalent to the time-to-live field of IPv4 (8 bits).
7. Source and destination addresses - 128 bits each.

To classify a packet based on its header, we need to determine which fields to consider and their possible values. Since we are dealing with IPv6 packets, the Version field is always the same, and the Hop limit and Payload length fields have no bearing on packet classification, so there is no need to consider them. The Traffic class, Flow label, Next header, and address fields, however, do influence the type of packet, as per their respective definitions. The values we considered for the various header fields when classifying packets are listed in Table 1. (A C sketch of this header layout follows the table.)

Table 1: Rule Table

Address Type | Destination Address | Flow Label | Protocol | Traffic Class | Rule No./Priority
Multicast    | FF00 **** **** **** | IntServ  | TCP | Low  | 29
Unicast      | 4000 **** **** **** | IntServ  | TCP | Low  | 30
Site Local   | FEC0 **** **** **** | IntServ  | TCP | Low  | 31
Link local   | FE80 **** **** **** | IntServ  | TCP | Low  | 32
Multicast    | FF00 **** **** **** | DiffServ | TCP | Low  | 25
Unicast      | 4000 **** **** **** | DiffServ | TCP | Low  | 26
Site Local   | FEC0 **** **** **** | DiffServ | TCP | Low  | 27
Link local   | FE80 **** **** **** | DiffServ | TCP | Low  | 28
Multicast    | FF00 **** **** **** | IntServ  | TCP | High | 13
Unicast      | 4000 **** **** **** | IntServ  | TCP | High | 14
Site Local   | FEC0 **** **** **** | IntServ  | TCP | High | 15
Link local   | FE80 **** **** **** | IntServ  | TCP | High | 16
Multicast    | FF00 **** **** **** | DiffServ | TCP | High | 9
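For concreteness, the 40-byte fixed header enumerated above can be written as a C struct. This is a standard textbook layout, not code from the paper; since version, traffic class, and flow label share the first 32-bit word, that word is kept whole (assumed converted to host byte order) and the fields are extracted with shifts and masks.

#include <stdint.h>

/* The 40-byte fixed IPv6 header (RFC 2460). */
struct ipv6_hdr {
    uint32_t vtc_flow;      /* 4-bit version | 8-bit traffic class
                               | 20-bit flow label                  */
    uint16_t payload_len;   /* payload length in bytes              */
    uint8_t  next_header;   /* next encapsulated protocol           */
    uint8_t  hop_limit;     /* equivalent of the IPv4 TTL           */
    uint8_t  src[16];       /* 128-bit source address               */
    uint8_t  dst[16];       /* 128-bit destination address          */
};

/* Field accessors for the shared first word (host byte order). */
static inline unsigned version(uint32_t v)       { return v >> 28; }
static inline unsigned traffic_class(uint32_t v) { return (v >> 20) & 0xFF; }
static inline unsigned flow_label(uint32_t v)    { return v & 0xFFFFF; }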

Unicast      | 4000 **** **** **** | DiffServ | TCP | High | 10
Site Local   | FEC0 **** **** **** | DiffServ | TCP | High | 11
Link local   | FE80 **** **** **** | DiffServ | TCP | High | 12
Multicast    | FF00 **** **** **** | IntServ  | UDP | Low  | 21
Unicast      | 4000 **** **** **** | IntServ  | UDP | Low  | 22
Site Local   | FEC0 **** **** **** | IntServ  | UDP | Low  | 23
Link local   | FE80 **** **** **** | IntServ  | UDP | Low  | 24
Multicast    | FF00 **** **** **** | DiffServ | UDP | Low  | 17
Unicast      | 4000 **** **** **** | DiffServ | UDP | Low  | 18
Site Local   | FEC0 **** **** **** | DiffServ | UDP | Low  | 19
Link local   | FE80 **** **** **** | DiffServ | UDP | Low  | 20
Multicast    | FF00 **** **** **** | IntServ  | UDP | High | 5
Unicast      | 4000 **** **** **** | IntServ  | UDP | High | 6
Site Local   | FEC0 **** **** **** | IntServ  | UDP | High | 7
Link local   | FE80 **** **** **** | IntServ  | UDP | High | 8
Multicast    | FF00 **** **** **** | DiffServ | UDP | High | 1
Unicast      | 4000 **** **** **** | DiffServ | UDP | High | 2
Site Local   | FEC0 **** **** **** | DiffServ | UDP | High | 3
Link local   | FE80 **** **** **** | DiffServ | UDP | High | 4

3.2 SYSTEM ARCHITECTURE

The task of packet classification is accomplished by mapping the S bits of the packet header to a T-bit CLASS ID. This mapping involves three phases, discussed below.

3.2.1 PHASE 0

The fields of the packet header that are relevant to packet classification are divided into chunks and supplied as input to phase 0. For example, the first two bytes of the destination address identify the type of the packet; that is, the packet can be classified into one of the four categories multicast, site-local, link-local, and unicast.

Hence the first byte of the destination address is supplied to chunk#1 and the second byte of the destination address to chunk#2. Similarly, chunk#3 is supplied with the bits of the Traffic class field, chunk#4 with the Flow label, and chunk#5 with the Next header. The mapping of the actual inputs to equivalence IDs (eqIDs) is done as shown in Figure 5.

3.2.2 PHASE 1

The outputs of the first two chunks of phase 0 are given as input to chunk#6 to determine the address type of the packet. The outputs of the other three chunks of phase 0 are given as input to chunk#7 of phase 1. The mapping of these inputs to the corresponding eqIDs is done as shown in Figure 5.

3.2.3 PHASE 2

This is the final stage of the classifier, where the eqIDs of chunks #6 and #7 are combined to produce the 5-bit CLASS ID that identifies the type of packet. The 40-byte packet header has thus been reduced to a CLASS ID of 5 bits. (A code sketch of the phase-0 chunking follows Figure 5.)

Figure 5: Implementation of the packet classifier
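The chunking just described maps directly onto the struct ipv6_hdr sketched in section 3.1.1. Below is a small C sketch, our own illustration rather than the paper's code, of how phase 0's five chunks would be extracted from an incoming header; for brevity the 20-bit flow label is truncated to its top 8 bits here.

#include <stdint.h>
#include <arpa/inet.h>   /* ntohl(), for the first 32-bit word */

/* struct ipv6_hdr as defined in the sketch of section 3.1.1. */

/* Extract the five phase-0 chunks named in the text:
 *   chunk#1, #2: first two bytes of the destination address
 *   chunk#3:     traffic class
 *   chunk#4:     flow label (truncated to 8 bits for this sketch)
 *   chunk#5:     next header                                       */
void phase0_chunks(const struct ipv6_hdr *h, uint8_t chunk[5])
{
    uint32_t w = ntohl(h->vtc_flow);
    chunk[0] = h->dst[0];               /* chunk#1                      */
    chunk[1] = h->dst[1];               /* chunk#2                      */
    chunk[2] = (w >> 20) & 0xFF;        /* chunk#3: traffic class       */
    chunk[3] = (w >> 12) & 0xFF;        /* chunk#4: top flow-label bits */
    chunk[4] = h->next_header;          /* chunk#5                      */
}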

3.3 CACHE SETUP

Performance analysis requires a variety of environments. To analyze the performance of the packet classifier, we created a cache program with three classes. The first class represents an i-cache, the second class a v-cache, and the third class an i-cache connected to a v-cache. The first and second classes together work in the conventional way: if an instruction is found in the i-cache it is a hit; otherwise it is counted as a miss and goes to the victim cache. In the third class, where the i-cache is connected to the v-cache, if the instruction is not found in the i-cache then the v-cache is checked; if it exists there, it still counts as a hit, and only if it is in neither cache does it count as a miss. Whenever the cache is full, the oldest entry is moved into the v-cache. We analyzed our packet classifier code for various cache sizes. The v-cache is always sized at half the i-cache, and it supports the same replacement strategies as the i-cache (FIFO and random). (A sketch of the combined i-cache/v-cache class follows.)
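The third cache class described above, an i-cache backed by a v-cache with FIFO replacement, can be sketched as follows. This is a minimal reconstruction under our own assumptions, not the authors' simulator; fixed-size tag arrays with a FIFO cursor stand in for the fully associative caches.

#include <stdbool.h>

#define ICACHE_ENTRIES 256
#define VCACHE_ENTRIES 128   /* always half the i-cache size */

typedef struct {
    unsigned tag[ICACHE_ENTRIES];
    bool     valid[ICACHE_ENTRIES];
    int      next;                 /* FIFO replacement cursor */
    int      size;
} fifo_cache;

static fifo_cache icache = { .size = ICACHE_ENTRIES };
static fifo_cache vcache = { .size = VCACHE_ENTRIES };

static bool lookup(fifo_cache *c, unsigned tag)
{
    for (int i = 0; i < c->size; i++)     /* fully associative search */
        if (c->valid[i] && c->tag[i] == tag)
            return true;
    return false;
}

/* Insert a tag; if an older entry is evicted, return it via *victim. */
static bool insert(fifo_cache *c, unsigned tag, unsigned *victim)
{
    bool evicted = c->valid[c->next];
    if (evicted) *victim = c->tag[c->next];
    c->tag[c->next] = tag;
    c->valid[c->next] = true;
    c->next = (c->next + 1) % c->size;
    return evicted;
}

/* One instruction reference: a hit in either cache counts as a hit;
 * on a miss the tag is loaded into the i-cache, and any i-cache
 * eviction falls into the v-cache. */
bool access_with_vcache(unsigned tag)
{
    if (lookup(&icache, tag) || lookup(&vcache, tag))
        return true;                       /* hit  */
    unsigned victim, dummy;
    if (insert(&icache, tag, &victim))
        insert(&vcache, victim, &dummy);
    return false;                          /* miss */
}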
4. DATA ANALYSIS

From the implementation of the packet classifier module as a C program simulation, we were able to observe results that helped us evaluate the performance of our proposed architecture. A single-threaded program performing this kind of packet classification was run on the SimpleScalar simulator, and the total simulation time was found to be 112 seconds. A similar C program that only accepts the packet header as input, separates it into different chunks of data according to the specifications, and produces the phase-0 output took a total simulation time of 83 seconds. These two steps, segregating the incoming packet into chunks of data and generating the eqIDs (the output of phase 0), constitute the most expensive part of the execution. In a multithreaded implementation, phase 0 takes the maximum amount of time, since it involves comparing the input header with the required parameters and generating the intermediate eqIDs. Because the total execution time of a multithreaded program equals the execution time of its longest phase, we can state that the execution time of a multithreaded implementation of the same program will be approximately 83 seconds. However, the overhead of switching between threads reduces overall system performance; so although the execution time is lower in this case, some of the gain is given back to thread-switching overhead.

Our single-threaded and multithreaded timing information gave us a clue as to how an RFC hardware scheme would perform in a SIMD environment. We observe that executing the searches in parallel across SIMD hardware is clearly beneficial for RFC-based packet classification. In each phase, a single-threaded program can process only one chunk of data at a time, so its total time is the aggregate of the time spent processing all the chunks. This is drastically different from SIMD execution, where for each phase the worst case is the longest execution time on any one SIMD unit, with the other units waiting for the last executing unit to finish. So for each phase, the total execution time is only the longest running time on one particular unit.

The amount of improvement depends on how the data chunks are split among the SIMD units; similar to the effect of pipelined stages, we anticipate the time to be only about 1/N of the original non-SIMD implementation, assuming we have N header fields. (A worked example of this serial-versus-SIMD timing model follows.)
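The serial-versus-SIMD timing argument can be made concrete with a small calculation: serial time is the sum of the per-chunk times, while SIMD time per phase is their maximum. The per-chunk times below are illustrative assumptions, not measurements.

#include <stdio.h>

int main(void)
{
    /* Hypothetical per-chunk processing times (microseconds) for one
     * phase across N = 7 header-field chunks. */
    double t[7] = { 9.0, 8.0, 7.5, 8.5, 9.5, 8.0, 7.0 };

    double serial = 0.0, simd = 0.0;
    for (int i = 0; i < 7; i++) {
        serial += t[i];                  /* one chunk after another   */
        if (t[i] > simd) simd = t[i];    /* all chunks in parallel:
                                            bounded by the slowest    */
    }
    printf("serial %.1f us, SIMD %.1f us, speedup %.2fx\n",
           serial, simd, serial / simd); /* approaches N when the
                                            per-chunk times balance   */
    return 0;
}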

An analysis of the miss rates for the various cache sizes and types is tabulated below:

I-cache without v-cache (random)
No. of entries | Miss rate
64 entries     | 61%
128 entries    | 24%
256 entries    | 12%
512 entries    | 12%

I-cache with v-cache (random)
No. of entries | Miss rate
64 entries     | 33%
128 entries    | 12%
256 entries    | 12%
512 entries    | 12%

I-cache without v-cache (FIFO)
No. of entries | Miss rate
64 entries     | 41%
128 entries    | 37%
256 entries    | 12%
512 entries    | 12%

I-cache with v-cache (FIFO)
No. of entries | Miss rate
64 entries     | 38%
128 entries    | 12%
256 entries    | 12%
512 entries    | 12%

Figure 7: Impact on miss rate of increasing the cache size and varying the replacement policy

From the tabulated data we find that the miss rate decreases as the cache size increases, and the v-cache plays less and less of a role in bringing down the miss rate as the cache size approaches 4 KB and 8 KB, which is what we targeted for the i-cache. Our results indicate that the v-cache probably plays a much smaller role than we originally anticipated, and in all likelihood it can safely be removed from the overall design without any impact on performance if we choose a cache size of around 4 KB to 8 KB. Furthermore, based on our observations, we can speculate that the RFC algorithm uses a medium-sized set of instructions, so an i-cache of 4 KB or 8 KB is sufficient to accommodate the cache needs, and the addition of the v-cache brings no further benefit. However, if a 4 KB or 8 KB cache would significantly increase the overall cost of our design and we had to work with a small amount of cache hardware, then the addition of the v-cache can definitely be justified and would certainly improve the cache miss rates.

One interesting note from the experiment is that the miss rates in phase 0 (the initial stage) are much higher than those of phases 1 and 2. This illustrates that when the cache starts out empty, phase 0 suffers more misses, but as we proceed to the subsequent phases the cache contents fill up and the miss rate improves significantly.

5. CONCLUSIONS

Although this project helped us estimate the performance improvement from the various simulation results, the estimate remains speculative, since we did not simulate a fully functioning version of the SIMD model. We did not take into consideration the cost involved in constructing a seven-unit SIMD module; in practice this cost might be too high. We also did not experiment with a large filter table, which would demonstrate the benefit of parallel processing more clearly. A pipelined implementation of this architectural design could potentially yield a further significant increase in performance. In sum, from the above data analysis, using a SIMD approach for this type of packet classification in IPv6 networks promises clear benefits compared with the other approaches, and we are hopeful that our proposed model will work well in a real-world implementation scenario.

References

[1] Pankaj Gupta and Nick McKeown, "Algorithms for Packet Classification", IEEE Network Special Issue, March/April 2001, vol. 15, no. 2, pp. 24-32.
[2] David E. Taylor, Andreas Herkersdorf, Andreas Doring, and Gero Dittmann, "Robust Header Compression (ROHC) in Next-Generation Network Processors", IEEE/ACM Transactions on Networking, August 2005, vol. 13, no. 4, pp. 755-768.
[3] RFC 2460, "Internet Protocol, Version 6 (IPv6) Specification".