Design of a Multi-Dimensional Packet Classifier for Network Processors

Design of a Multi-Dimensional Packet Classifier for Network Processors Stefano Giordano, Gregorio Procissi, Federico Rossi, Fabio Vitucci Dept. of Information Engineering, University of Pisa, ITALY E-mail: s.giordano@iet.unipi.it, g.procissi@iet.unipi.it, f.rossi@netserv.iet.unipi.it, fabio.vitucci@iet.unipi.it Abstract Nowadays packet classification is a fundamental task for network devices such as edge routers, firewalls and intrusion detection systems. Determining which flow packets belong to is important for many applications, and it is necessary, for example, to provide differentiated services, to detect anomalous traffic and to sort attack patterns. Therefore packet classification is becoming more and more complex, with more flexibility and higher performance requirements. Network Processors (NPs) are emerging as very promising platforms due to their capability to combine the flexibility of general-purpose processors with high performance of hardware-based solutions. In this paper we illustrate the design of a multidimensional packet classifier realized on the Radisys ENP-2611 board equipped with Intel IXP2400 Network Processor. The first goal of this study is the selection of the most suitable classification algorithm to be integrated into the embedded system. Our investigation is then directed to adjustments and refinements of the selected algorithm (namely the multidimensional multibit trie algorithm) to capitalize the peculiar functional properties and capabilities of our network processor. I. INTRODUCTION Interest in modern Internet applications is constantly growing among service providers and potential customers. A significant number of both existing and emerging applications need a high level of quality of service and impose strict requirements on network performance. Hence, Internet routers that, to date, provide best-effort service only, are now called to provide service differentiation to several classes of applications. Thus, new mechanisms, such as admission control, resource reservation, per-flow queueing, and fair scheduling are necessary: all of such mechanisms need to distinguish packets belonging to different flows. Moreover, the growing concern on network security is determining a huge interest towards the development of techniques devoted to detect anomalous traffic and to sort attack patterns. Therefore, even firewalls and intrusion detection systems need to determine which flow packets belong to. The process of categorizing packets into flows is called packet classification. Formally, given a set of rules or policies defining the packet attributes, classification is the process of identifying the rule or policy to which a packet conforms. As both the number and the complexity of the applications requiring classification are increasing, such a task is becoming more and more critical. Indeed, packet classification requires to perform, at real-time, a considerable number of operations (such as to analyze packet header fields according to different rule specifications) while maintaining high speeds with almost no impact in traffic dynamics. In this scenario, in order to achieve flexibility and high performance, the most promising solution is represented by the adoption of Network Processors (NPs). NPs are emerging platforms that offer very high packet processing capabilities (e.g. for gigabit networks) and combine the programmability of general-purpose processors with the high performance typical of hardware-based solutions. Usually, a NP card includes a set of programmable processors that run in parallel to facilitate fast-packet processing, and provides a set of instructions optimized for packet processing, as well as a set of hierarchically distributed memory devices. Also, it allows multitasking and multithreaded programming. The research presented in the paper is based on the fully programmable Intel IXP2400 Network Processor, the most used NP platform in Academic context, provided to Universities through the Intel IXA University Program. The main target of the paper is to propose a rigorous methodology to the design of a network component, the packet classifier, which takes into account both the functional specifications of the component itself and the specific features of the candidate hardware for its actual implementation. The rest of the paper is organized as follows. Section II presents the background notions on classification algorithms and a brief description of the Intel IXP2400 architecture. Section III illustrates a performance comparison among several classification algorithms and describes the multibit multidimensional trie as the best candidate to be implemented into the NP. Section IV introduces our modifications and improvements to the original classification algorithm in order to adapt it to the specific hardware available to us, the Radisys ENP-2611 board equipped with Intel IXP2400 Network Processor. Finally, Section V presents summary, conclusions and future works.

II. BACKGROUND In this section, we introduce the background basis of the research activity illustrated in this paper. We first present a brief overview of classification algorithms and then a high level description of the Intel IXP2400 architecture. A. Packet Classification Algorithms: State-of-the-Art Given a set of rules defining the packets attributes, packet classification is defined as the process of finding the best match among rules and packet fields. Rule matching schemes may be an exact match or a prefix/range match on multiple fields. Applications use policies based on layer2-layer4 header fields, called selectors of packets. If the algorithm uses more than one selector, it is called multidimensional. Since multidimensional classification is a complex problem, many researchers have explored and proposed a wide variety of algorithms. In our research, we investigated all these solutions in order to find the most suitable for our hardware platform. In the following, we present the main theoretical principles of these algorithms (for details see references [2-11]), divided into four big categories [1]. The first category consists of the algorithms that use basic data structures. The simplest data structure is a linked-list of rules stored with decreasing priority: a packet is compared to each rule sequentially until a rule is found to match all fields; clearly, such a linear search exhibits poor scaling properties. An advanced data structure is a d-dimensional hierarchical trie, recursively constructed as follows: we first build a 1- dimensional trie according to the first dimension of rules. Then, for each prefix in the first trie, we recursively construct a (d-1)-dimensional hierarchical trie according to rules that specify exactly that prefix in the first dimension. Other advanced data structures proposed for packet classification are Set-Pruning trie [2] and Grid of Tries [3]. The second category is composed of Geometric Algorithms. Each field of a rule can be specified either through a prefix/length pair or in terms of an operator/number. From a geometric point of view, both these specifications could be interpreted as intervals on a line. Thus, a rule with d fields represents a d-dimensional hyperrectangle and a classifier is a set of such hyper-rectangles with the associated priorities. In this context, a packet header (d-tuple) represents a point P in the d-dimensional space and the packet classification problem consists of finding the highest priority hyper-rectangle that encloses P (see Fig. 1). This category includes the Cross-Producting algorithm [3], a solution for 2-D classifiers proposed by Lakshman et al. [4], the Area-Based Quadtree (AQT) structure [5] and the Fat Inverted Segment Tree (FIS-tree) [6]. The third category is that of Heuristic Algorithms. Such algorithms are based on the evidence that, due to the considerable redundancy of classification rules, the size of data structures of real classifiers is generally much smaller than that theoretically calculated in worst-cases. Rule F1 F2 R1 00 00 R2 0 01 R3 1 0 R4 00 0 R5 0 1 R6 1 Figure 1. Packet classification in a geometric representation. Recursive Flow Classification (RFC) [7], Hierarchical Intelligent Cuttings (HiCuts) [8], Tuple Space Search [9] are classification algorithms that, in different manners, perform a heuristic pre-partitioning of the problem space first, and then rearrange the decision data structure in order to take advantage of the above mentioned redundancy. This way, they reduce considerably the time for packet classification while saving memory. The last group is composed of Hardware-Based Algorithms, such as Bitmap Intersection [10] and Ternary Content-Addressable Memories (T-CAMs) (see Fig. 2). These algorithms usually split the multidimensional classification problem into several one-dimensional searches to be performed by means of specific hardware-based solutions (for example, the use of T-CAMs allows achieving 100 million queries per second [11]). As it will be shown in the following, classification algorithms of the first category prove to be well tailored for implementation into the network processor architecture. In particular, we select the very complex multidimensional multibit trie data structure because of its low search time and small number of memory accesses. The reason of this choice is tightly related to the specific characteristics of the Intel IXP2400 hardware architecture, which will be accurately described in the next section. Figure 2. An example of Ternary-CAM.

B. Intel IXP2400 Architecture The Intel IXP2400 network processor [12] is a member of the Intel's second-generation network processor family. It is designed to perform a wide range of functionalities, including multi-service switches, routers, broadband access devices and wireless infrastructure systems. The IXP2400 is an evolution of the first-generation Intel IXP1200 and it is a fully programmable network processor that implements a high-performance parallel processing architecture on a single chip suitable for processing complex algorithms, detailed packet inspection, traffic management and forwarding at the wire speed. For research purposes, the Intel IXP2400 network processor is granted to Universities through the Intel IXA University Program [13], along with the required development environment, which includes the Intel IXA Portability Framework, an easy-to-use modular programming framework that provides software investment protection through code portability and reuse. The Intel IXP2400 architecture (see Fig. 3) combines a high-performance Intel XScale core with eight 32-bit independent microengines v2 (connected in two clusters of four) that cumulatively provide more than 5.4 gigaoperations per second. The Intel XScale core is a generalpurpose 32-bits RISC processor (ARM Version 5 Architecture compliant) used to initialize and manage the NP, to handle exceptions, and to perform slow path processing and other control plane tasks. The XScale processor and the whole set of microengines run at 600 MHz. Microengines provide the processing power to perform tasks that would require very expensive high-speed ASICs (Application- Specific Integrated Circuits). Each microengine offers a 4K instruction store of 40-bit instructions and a local memory of 2560 bytes, while it can access all the shared resources (SRAM, DRAM, etc.) and private connections between adjacent microengines. Also, microengines support softwarecontrolled multithreaded operations by providing eight hardware-assisted threads of execution (i.e. zero-overhead context switch [14]). Other noteworthy features of the IXP2400 architecture include a hash unit (that expedites address table lookup by performing polynomial hash on several values simultaneously), a scratchpad memory (used to exchange data back and forth between microengines), a Control Status Register (CSR) Access Proxy (that provides an interface to the chip-wide CSRs), a Peripheral Computer Interface (PCI) Controller (used to connect either external host CPUs or other PCI-compatible peripherals), a Media Switch Fabric (MSF) Interface (to connect external media devices) and DRAM and SRAM Controllers (to provide access to external memories). For more details about Intel IXP2400 NP we refer to [12] [13] [14]. Since the Intel IXP2400 NP is hosted by an external board, in order to accurately design a packet classifier, the knowledge of the sizes of the available external memories is also required. We adopt the Radisys ENP-2611 [15] board, equipped with 8 MB SRAM and 256 MB DRAM. External Media Devices MSF Interface PCI Controller Optional host CPU, PCI bus devices Scratchpad Hash Unit SHaC SRAM CSR Acc. Proxy SRAM Controller 1 SRAM Controller 0 0:0 0:1 DRAM Controller 0 Figure 3. IXP2400 Hardware Architecture. III. MULTIDINSIONAL MULTIBIT TRIE In order to choose an algorithm that exploits the capabilities of our Radisys ENP-2611 board, we compared all the classification algorithms described in previous section. We analyzed classic performance metrics as well as specific parameters to investigate the compatibility with NPs. The classic performance parameters analyzed are: 1) Search Speed: the increase of link speeds requires faster and faster classification; 2) Memory requirements: small storage requirements determine high memory access speed and, in turn, low power consumption (both features are very relevant for NPs); 3) Scalability in classifier table size: the size of a classification table depends on the applications; the capability of handling a large number of rules makes a classification algorithm appealing for many applications; 4) Scalability in the number of header fields: the number of header fields to be processed increases with the number of services provided; 5) Update time: the data structure needs to be updated upon insertion and removal of rules in the classifier; some applications require strictly short updating time; 6) Flexibility in rule specification: the ability of an algorithm to handle a wide range of rule specifications, such as prefix/length, operator/number, and wildcards, makes it reusable for many applications. Performance metrics of classification algorithms belonging to the four categories (as described in section II-A) have been evaluated and compared. For the sake of conciseness, Table I [1] shows the results for two specific metrics only, namely the worst case complexity of search time and storage requirements. The worst case is computed considering a classifier table filled with totally different rules, without overlapping in the values of the fields. In the table, we denote the number of entries in the classifier with N, the maximum prefix length (in bits) of a field with W, the number of fields with D, and the total number of data structure levels with L. DRAM Cluster 0 Cluster 1 Intel XScale Core

TABLE I COMPARISON OF CLASSIFICATION ALGORITHMS Algorithm Worst-Case Search Time Worst-Case Storage Linear search O(N) O(N) Hierarchical Tries O(W D ) O(NDW) Set-Pruning Trie O(W D ) O(N D ) Grid of Tries O(W D-1 ) O(NDW) SA DA 16 bit 8 bit 8 bit 16 bit Cross-Producting O(DW) O(N D ) FIS-Tree O((L+1)W) O(LN 1+1/L ) AQT O(NW) O(W) HiCuts O(D) O(N D ) RFC O(D) O(N D ) Bitmap Intersection O(DW+N/W) O(DN 2 ) Ternary CAMs O(1) O(N) Figure 4. Beginning of a multidimensional multibit trie. 8 bit TCP/UDP ports are divided in two strides of 6-10 (because most of the port numbers in use are usually below 1024). Finally, a single stride of 8 bits is used for the protocol field. A packet may match more than one rule, thus each rule in the database is associated with a field that identifies its priority. In addiction to the above listed metrics, for a correct selection of the packet classification algorithm, all the possible bottlenecks of NPs have to be taken into account. In particular, the number of memory accesses, the amount of memory consumption, and the size of the instruction store have to be considered. In other words, the key challenge is to design a packet classification algorithm that requires both low memory space and low access overhead: this would provide a proper scaling with respect to high bandwidth networks and large databases of classification rules. The analysis of results (especially the ones associated with search speed and number of accesses not shown in the Table I) led to identify a classification algorithm based on a multidimensional multibit trie structure as the most appropriate for IXP2400. Such an algorithm has, in the worstcase, a research speed of the order of O(W/K), a fix number of memory accesses (L) and a storage complexity of the order of O(2 (K-1) N W/K), where K is the average length of strides. In our implementation, the classifier processes the five main header fields, as typically adopted in the literature [16]: the IP Destination Address, IP Source Address, IP Destination Port, IP Source Port, and Layer 4 Protocol Type. Hence, we have a 5-dimensional classifier, which uses a 5-stage hierarchical search trie. In each stage, the classification algorithm processes a single packet header field; the analysis of each field is divided into several steps, performed by using a specific number of bits (named stride), instead of 1 bit at a time, to decrease the number of memory accesses (see Fig. 4). The choice of the strides lengths is based on a theoretical method proposed by Srinivasan [17] and on the analysis of the actual distributions of rules in real classifiers [16]. Thus, according to the classic IP classful addressing structure (in [16], it has been shown that the most of classification rules ignore subnetting) 32 bits long IP addresses are divided into three strides of 16-8-8 bits respectively, while the 16 bits long IV. THE MODIFIED CLASSIFICATION ALGORITHM In order to improve the packet classification performance, this section presents the results of a number of modifications that have been applied to the original multidimensional multibit trie algorithm. The main issue of classical multidimensional multibit trie is memory consumption. Indeed, to handle rules with nonspecified parts (e.g. 121.132..) and to have a single memory access in each stride of trie, 2 s nodes for level are necessary (where s = stride length) though only a few of them correspond to actual existing rules and the remaining nodes are virtual. This requires large memory storage: for example, a classifier of 50 rules (the average number of rules in real classifiers [7]), needs a data structure of 9 MB, that, in our scenario, would force to put the classification table in DRAM, the memory with the highest access latency. To decrease memory requirements, when there is a rule with non-specified parts (use of non-contiguous bit-masks is very frequent in real classifiers [7]), we perform a level crossing, which consists of jumping to the next level without need for nodes explosion. This operation, regardless to the specific sizes of available memories, drastically reduces memory consumption as required by NPs; this permits a feasible integration of different complex applications into the device. Table II shows the comparison, in terms of storage requirements, between the original classification algorithm and our modified version. We developed an apposite simulator in C language to compute the number of memory accesses and the memory size of the classifier. In the simulation runs, we used classification tables very similar to the ones used in some actual firewalls and edge routers; such tables contain both fully specified rules and partially specified rules (that is, several fields are not specified).

TABLE II STORAGE REQUIRENTS FO R DIFFERENT VERSIONS OF ALGORITHM Number of Original Modified rules algorithm algorithm Classifier Table 50 9 MB 440 KB 100 12 MB 730 KB 200 17 MB 1.2 MB 500 29 MB 1.8 MB 1000 55 MB 2.3 MB 10000 258 MB 8 MB Obviously, the introduction of level crossing increases the algorithm complexity, because of possible backtrackings: indeed, the use of direct crossing can create, during the matching search operation, cases of wrong paths and consequent needs for traversals (see Fig. 5 for an example with one field only). This, in turn, makes the number of processing steps variable and raises the number of instructions necessary to implement packet processing (in spite of this, the instruction store of IXP2400 microengines is large enough to easily handle the modified algorithm). This phenomenon is largely compensated by the possibility of locating the classifier table into the SRAM, a memory device with lower access delay than that of the DRAM, with the global effect of a reduction of the packet processing time. Another improvement to the original algorithm is derived from the analysis of many real rules databases, and deals with the process of handling ranges of TCP/UDP port fields. Indeed, for these fields, it is common to find value specifications such as greater than 1024, between 100 and 128, and so on. Therefore, to cope with this issue, we introduce suitable objects in the structure of the trie nodes to identify lower and upper bounds of the range. Thus, when a packet header is processed to be classified, its port fields are no longer compared to a single value but rather to a range of them. R1 131.114 4 R2 R3 131.114 131.114 3 4 R1 4 131. 114 3 4 R2 Packet header P: 131.114.4.4 R3 Figure 5. A simple example of trie backtracking. V. CONCLUSIONS AND FUTURE WORKS In order to provide QoS and network security, packet classification arises as a fundamental task for network devices. At the same time, real-time applications are especially sensitive to the latency introduced by packet processing; these facts lead to the adoption of high performance and flexible network components. In this context, network processors are emerging as very promising platforms to perform the classification process, due to their capability to combine the programmability and flexibility of general-purpose processors with the high performance of hardware-based solutions. They are generally equipped with a small amount of memory and with limited access bandwidth. Hence, a key challenge is to design packet classification algorithms optimized for the specific network processor hardware platform. In this paper we have illustrated the design of a multidimensional packet classifier for the Radisys ENP- 2611 board equipped with Intel IXP2400 Network Processor. The first phase of our research has been focused on the selection of a classification algorithm that would achieve an optimal trade-off between its performance and its possible integration in the specific hardware platform available to us. We have analyzed and compared the performance of an exhaustive number of multidimensional classifiers proposed in the literature and we identified the multidimensional multibit trie as the one that best matches our requirements. The multidimensional multibit trie algorithm is characterized by the low number of memory accesses and the high speed of search, but, in its original version, it requires large memory storage. Hence, in order to reduce memory consumption, we have applied some modifications to the original version of the algorithm, by introducing level crossing and the management of port range intervals. Finally we have validated these modifications by means of simulations: the results show a clear reduction on the amount of memory storage as well as the improvement of the processing time. We are perfectly aware that the overall performance of the classifier will also depend on its specific implementation on the hardware (code optimization for a very complex system which includes several microengines, hierarchy of memories, multitasking and multithreaded programming, etc.). At this moment, we are currently integrating the classification module into the IXP2400 by differentiating the functions of the XScale processor (that must receive the rules list and create the data structure in SRAM) and the functions of microengines (that must perform matching search in the data structure). The next steps will consist of integrating first the classification module on the NP platform into a real DiffServ edge router and, then, of performing a series of measurements to validate its actual performance. ACKNOWLE DGNTS This work was pa rtially supported by the Euro-NGI Network of Excellence funded by the European Commission.

REFERENCES [1] Pankaj Gupta and Nick McKeown, Algorithms for packet classification, IEEE/ACM Trans. Networking, vol. 15, pp. 24-32, March-April 2001. [2] P.Tsuchiya, A search algorithm for table entries with non-contiguous wildcarding, Bellcore, unpublished report. [3] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvagel, Fast and scalable layer four switching, Proceedings of ACM Sigcomm 98, pp. 191-202, August 1998. [4] T. V. Lakshman and D. Stiliadis, High-speed policybased packet forwarding using efficient multidimensional range matching, Proceedings of ACM Sigcomm 98, pp. 203-214, August 1998. [5] M. M. Buddhikot, S. Suri, and M. Waldvogel, Space decomposition techniques for fast layer-4 switching, Proceedings of Conference on Protocols for High Speed Networks, pp. 25-41, August 1999. [6] A. Feldman and S. Muthukrishnan, Tradeoffs for packet classification, Proceedings of Infocom, vol. 3, pp. 1193-1202, March 2000. [7] Pankaj Gupta and Nick McKeown, Packet classification on multiple fields, Proceedings of ACM Sigcomm 99, pp.147-160, August 1999. [8] Pankaj Gupta and Nick McKeown, Packet classification using hierarchical intelligent cuttings, Proceedings of Hot Interconnects VII, Stanford, CA, August 1999. [9] V. Srinivasan, S. Suri, and G. Varghese, Packet classification using tuple search space, Proceedings of ACM Sigcomm, pp. 135-146, September 1999. [10] T. V. Lakshman and D. Stiliadis, High-speed policybased packet forwarding using efficient multidimensional range matching, Proceedings of ACM Sigcomm, pp.191-202, September 1998. [11] Ultra9M, Datasheet from SiberCore Technologies (online), Available: http://www.sibercore.com. [12] Intel Corporation, Intel IXP2400 Network Processor, http://www.intel.com/design/network/products/npfamily/ ixp2400.htm. [13] Intel Corporation, Intel IXA University Program, http://www.ixa edu.com. [14] Erik J. Johnson and Aaron R. Kunze, IXP2400/2800 Programming", Intel Press, 2003. [15] RadiSys Corporation, "ENP-2611 Data Sheet", http://www.radisys.com. [16] Michael E. Kounavis, Alok Kumar, Harrick Vin, Raj Yavatkar, and Andrew T. Campbell, Directions in Packet Classification for Networks Processors, Proceedings of 2nd Workshop on Network Processors, February 2003. [17] V. Snirivasan and G. Varghese, Faster IP lookups using controlled prefix expansion, Proceedings of ACM Sigmatics, June 1998.