
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 8, NO. 6, DECEMBER 2006 1239

Scalable Packet Classification for Enabling Internet Differentiated Services

Pi-Chung Wang, Member, IEEE, Chia-Tai Chan, Chun-Liang Lee, and Hung-Yi Chang

Abstract: Nowadays, IP networks are rapidly evolving toward a QoS-enabled infrastructure. The need for packet classification is increasing in accordance with emerging differentiated services. While the new differentiated services could significantly increase the number of rules, it has been demonstrated that performing packet classification on a potentially large number of rules is difficult and has poor worst-case performance. In this work, we present an enhanced tuple pruning search algorithm called Tuple Pruning Plus (TPP) for packet classification, which outperforms the existing schemes in scalability. Our main idea is to simplify the lookup procedure and to avoid unnecessary tuple probing by maintaining the least-cost property of rules through precomputation and the proposed Information Marker. With extra rules added for the Information Marker, only one tuple access is required in each packet classification. In our experiments, 70 MB of DRAM is used to achieve 50 million packets per second (MPPS) for a 1 M-rule set, showing a performance improvement by a factor of 50. We also present a heuristic to further reduce the required storage to about 20 MB. These results demonstrate the effectiveness of the TPP scheme in achieving high-speed packet classification.

Index Terms: Best matching prefix, multicast, multidimensional range lookup, packet classification.

I. INTRODUCTION

With the ever-increasing dependency on the Internet, there has been a rapid evolution in Internet applications. The Internet now provides a broad range of multimedia services, such as IP telephony, video conferencing, collaborative research, and distance-based virtual reality/visualization.
To meet the real-time requirements of these services, Quality-of-Service (QoS) management for networked multimedia applications over IP is a significant and demanding challenge. During the past several years, numerous mechanisms have been proposed for providing QoS networks. The ultimate goal of these QoS mechanisms is to provide differentiated services to the applications at the edges of the network. These mechanisms usually rely on two procedures [1], as shown in Fig. 1. First, traffic arriving at edge routers is separated into distinct forwarding classes, e.g., indicated by the differentiated services codepoint (DSCP field) in the DiffServ model [2], via the process of packet classification. Packets from each flow are then directed to a corresponding queue. Then, the queue-scheduling algorithm determines the rate at which packets from each queue are forwarded, so that the resources are allotted to each queue and to the corresponding flows.

Fig. 1. Enabling procedures for QoS mechanism.

Manuscript received July 29, 2005; revised January 29, 2006. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC 94-2213-E-214-027. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Anna Hac.
P.-C. Wang is with the Department of Computer Science, National Chung Hsing University, Taichung, Taiwan, R.O.C. (e-mail: pcwang@cs.nchu.edu.tw).
C.-T. Chan is with the Institute of Biomedical Engineering, National Yang-Ming University, Taipei 112, Taiwan, R.O.C. (e-mail: ctchan@ym.edu.tw).
C.-L. Lee is with the Department of Computer Science and Information Engineering, Chang-Gung University, Taoyuan 333, Taiwan, R.O.C. (e-mail: leecl@csie.nctu.edu.tw).
H.-Y. Chang is with the Department of Information Management, National Kaohsiung First University of Science and Technology, Kaohsiung 811, Taiwan, R.O.C. (e-mail: leorean@ccms.nkfust.edu.tw).
Digital Object Identifier 10.1109/TMM.2006.884610
Currently, end-to-end service guarantees for specific aggregated flows are achieved through RSVP, MPLS, or similar reservation protocols by routing these flows along specific traffic-engineered paths. The directed routing is based on the source and destination addresses of packets [3]-[5]. The 2-D packet classification determines the next hop and the allocated resource for each packet based on its source and destination addresses. Hence, packet classification is a key component of the QoS mechanisms: it determines to which forwarding class a packet belongs. For example, the current discussions about differentiated services within the IETF assume that the edge routers of a core network are capable of classifying the packets of different users [6]. Furthermore, 2-D packet classification is also useful for multicast forwarding, which requires lookups based on both the source address and the multicast group [7], [8].

Packet classification entails searching a table of rules, each of which binds a packet to a flow or set of flows, and returning the forwarding class for the least-cost rule that matches the packet. A rule consists of a set of fields, which in turn correspond to a set of fields in the packet header. The most common fields include the IP source address (SA, 32 bits), the destination address (DA, 32 bits), the protocol type (8 bits), the port numbers (16 bits) of the source/destination applications, and the protocol flags in the packet header. Each field can be a variable-length prefix bit string, a range, an explicit value, or a wildcard. A $d$-dimensional rule $R$ is thus defined as $R = (F_1, F_2, \ldots, F_d)$. A packet is said to match a particular rule $R$ if, for all $i$, the $i$th field of the header satisfies $F_i$. Each rule has an associated action, which is usually assigned a cost to define its priority among the actions of matched rules. The matched rule with the least-cost action will be enacted to process the arriving packets. In sum, the complexity of packet classification lies in the search for the least-cost matching rule.

1520-9210/$20.00 © 2006 IEEE

While packet classification has been extensively employed in the Internet for security, the properties of rules for security and for service differentiation are quite different. Currently, the largest rule sets in firewalls contain a few thousand rules [9]-[11]; however, dynamic resource reservation protocols could cause rule sets to swell into the tens of thousands. For example, let us consider a backbone router with 100 K prefixes. If each destination prefix is coupled with even a few source prefixes (e.g., for resource reservation between content providers and customers), it is not hard to imagine the need for several hundred thousand rules. Thus the problem of finding the best matching rule for more than 100 K rules at Gigabit speeds is an important challenge [12].

In the last few years, the problem of packet classification has been studied extensively. Most of the resulting algorithms are designed for applications on firewalls. Since the rule sets used for security and firewalls are fairly small [9], the performance of these algorithms cannot be guaranteed with reasonable storage as the number of rules increases. Thus, they might not be suitable for differentiated services. To successfully deploy the Internet differentiated services, we are interested in solutions that can scale to several hundred thousand rules. In addition, the algorithms should be able to achieve fast updating, since the rules specified for differentiated services might change frequently. Furthermore, we are concerned only with the worst-case performance of the algorithms, since header processing delay should be avoided in order to provide service assurances.
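The least-cost matching semantics defined in this section can be illustrated with a minimal Python sketch of a linear scan over the rule table. It assumes 2-D rules whose fields are prefixes written as bit strings, with the empty string standing for a wildcard; the names `Rule`, `matches`, and `classify_linear`, as well as the sample rules and forwarding classes, are illustrative assumptions and do not appear in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    src: str     # source prefix as a bit string ("" is the wildcard)
    dst: str     # destination prefix as a bit string
    cost: int    # lower cost means higher priority
    action: str  # forwarding class returned on a match (hypothetical labels)


def matches(rule: Rule, sa: str, da: str) -> bool:
    """A packet matches a rule if every header field satisfies the rule field."""
    return sa.startswith(rule.src) and da.startswith(rule.dst)


def classify_linear(rules: Sequence[Rule], sa: str, da: str) -> Optional[Rule]:
    """Scan all rules and keep the matching rule with the least cost."""
    best = None
    for rule in rules:
        if matches(rule, sa, da) and (best is None or rule.cost < best.cost):
            best = rule
    return best


# Hypothetical rule set: three overlapping rules with different costs.
rules = [
    Rule("10", "110", cost=2, action="gold"),
    Rule("1", "1", cost=5, action="silver"),
    Rule("", "", cost=9, action="best-effort"),
]

result = classify_linear(rules, "101000", "110010")
print(result.action if result else None)
```

All three sample rules match the header (101000, 110010); the scan returns the one with the lowest cost. Every rule is visited on every lookup, which is the scaling behaviour that the schemes discussed in this paper try to avoid.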
Our Contributions: In previous work, tuple pruning search was proposed to achieve fast and scalable 2-D (SA, DA) packet classification [10]. However, its worst-case performance is not acceptable for extremely large rule sets. For example, at most 51 hash accesses are required for a packet classification within a 1M-rule set in our experiment. In this work, we propose an enhanced tuple pruning search algorithm called Tuple Pruning Plus (TPP) to achieve much higher forwarding throughput. By maintaining the least-cost property of rules through precomputation and the introduction of the Information Marker, an improved tuple pruning mechanism is proposed to reduce the number of probed tuples. There is a trade-off between the number of hash accesses and the required storage. With extra rules added for the Information Marker, the number of hash accesses for each packet classification can remain constant. In a 1M-entry set, the lookup can be achieved in one hash access with sevenfold entries or in four hash accesses with twofold entries. As compared with the existing tuple pruning search scheme, the experimental results demonstrate that the lookup speed is increased by a factor of 50. An incremental update procedure is also provided.

The rest of the paper is organized as follows. First, the related algorithms are introduced in Section II. Section III presents the proposed algorithm. A further refinement to the proposed TPP scheme is shown in Section IV. The experimental setup and results are presented in Section V. Finally, a summary is given in Section VI.

II. RELATED WORKS

Several algorithms for classifying packets have recently appeared in the literature [9]-[11], [13]-[19]. They can be grouped into the following classes: linear search/caching, hardware-based solutions, grid of tries, decision-based, cross-producting-based, and hash-based solutions. The following briefly describes the important properties of these algorithms.
Assume that $N$ is the number of rules, $d$ is the number of classified fields, and $W$ is the length of the IP address.

Linear Search/Caching: The simplest method for packet classification involves a linear search of all the rules. The spatial and temporal complexity is $O(N)$. Caching is a technique frequently used at either the hardware or the software level to improve the performance of linear search. However, the performance of caching depends critically on each flow's having a large number of packets. Also, if the number of simultaneous flows exceeds the cache size, the performance is severely degraded.

Hardware-Based Solutions: A high degree of parallelism can be implemented in hardware to provide a speed-up advantage. In particular, ternary content addressable memories (TCAMs) can be used effectively to look up rules. However, TCAMs with a particular word width cannot be used when flexibility of the rule specification is required. Manufacturing TCAMs with sufficiently wide words to accommodate all bits in a rule is difficult. TCAMs also suffer from power-consumption and scalability problems [20]. Lakshman et al. presented another scheme that depends on a very wide memory bus [14]. The algorithm reads from memory the bit vectors corresponding to the BMPs in each field and determines their intersection to find the set of matching rules. Both the memory requirement and the time complexity of this scheme grow with the number of rules, and the time complexity also depends on the memory bus width. Recently, Baboescu et al. addressed the speed issue and described an improved version that merges consecutive bits, although the required storage was not improved [12]. In sum, the hardware-oriented schemes rely on heavy parallelism and involve considerable hardware cost; the flexibility and scalability of hardware solutions remain very limited.

Grid of Tries: Specifically for the case of two-field rules, Srinivasan et al. [13] presented a trie-based algorithm.
The algorithm has a memory requirement of $O(NW)$ and requires $O(W)$ memory accesses per rule lookup. In addition, an enhanced version is presented in [19].

FIS Trees: In [15], Feldman and Muthukrishnan proposed the Fat Inverted Segment trees (FIS trees) for 2-D classification. An FIS tree is a modification of a segment tree that adopts the data structure of a multiway search tree and child-to-parent pointers. By adjusting the number of levels in the FIS tree, the required storage can be traded off against lookup time.

Decision-Based Solutions: The decision-based algorithms include the works presented by Gupta et al. [18] and Woo [17]. Both schemes use a decision tree to divide the rules into multiple groups. Each group is listed in a leaf node of the decision tree, and linear search is used to traverse the group. The number of rules in each group is limited by a predefined value. The decision at each node can be a field [18] or a bit of any field [17]. A suitable selection of decisions minimizes the required storage and search time. HyperCuts, presented by Singh et al. [11], further extends the 1-D cut into a multidimensional one.

Cross-Producting-Based Solutions: A general mechanism, called cross-producting, involves BMP lookups on individual fields and the use of a precomputed table to combine the results of the individual prefix lookups [13]. However, this scheme suffers from an $O(N^d)$ memory blowup for $d$-field rules. Gupta et al. presented an algorithm that can be considered a generalization of cross-producting [9]. In this algorithm, after the BMP lookups are accomplished, a recursive flow classification algorithm performs cross-producting hierarchically. Thus $d$ BMP lookups and $d-1$ additional memory accesses are required per rule lookup. The algorithm is expected to improve the average throughput significantly; nevertheless, it requires $O(N^d)$ space in the worst case. Also, in the case of two-field rules, this scheme is identical to the cross-producting scheme.

Hash-Based Solution: This solution is motivated by the observation that, although rule sets include several different prefixes or ranges, the distinct prefix lengths tend to be few [10]. For example, backbone routers have around 200K destination address prefixes, but only 32 distinct prefix lengths exist. Hence, all the prefixes can be divided into 32 groups, one for each length. Since all prefixes in a group have the same length, the prefix bit string can be used as a hash key, leading to a simple IP lookup scheme that requires $O(W)$ hash lookups, independent of the number of prefixes. The algorithm of Waldvogel [21] performs a binary search over the length groups and has a worst-case time complexity of $O(\log W)$. The tuple space idea generalizes the foregoing approach [21] to 2-D rules [10]. A tuple is a set of rules with specific prefix lengths, and the resulting set of tuples is called a tuple space.
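The tuple space organization can be sketched in Python as one hash table per pair of prefix lengths, with the concatenated prefix bits as the key. This is a minimal sketch under stated assumptions: prefixes and addresses are bit strings, Python dictionaries stand in for the hash tables, the function names `build_tuple_space` and `tuple_space_search` are hypothetical, and the sample rules are invented for illustration. The search shown is the plain linear probe of every tuple, not one of the optimized tuple searches.

```python
from collections import defaultdict


def build_tuple_space(rules):
    """Group rules by (source prefix length, destination prefix length).

    rules: iterable of (src_prefix, dst_prefix, cost, action) bit-string rules.
    Returns a mapping {(src_len, dst_len): {hash_key: (cost, action)}}.
    """
    space = defaultdict(dict)
    for src, dst, cost, action in rules:
        table = space[(len(src), len(dst))]
        key = src + dst  # unambiguous because both lengths are fixed per tuple
        if key not in table or cost < table[key][0]:
            table[key] = (cost, action)
    return space


def tuple_space_search(space, sa, da):
    """Probe every tuple once and track the least-cost matching entry."""
    best = None
    for (src_len, dst_len), table in space.items():
        entry = table.get(sa[:src_len] + da[:dst_len])
        if entry is not None and (best is None or entry[0] < best[0]):
            best = entry
    return best


# Hypothetical rules: the first two share the tuple (2, 3).
rules = [
    ("10", "110", 2, "gold"),
    ("11", "101", 4, "bronze"),
    ("1", "1", 5, "silver"),
]
space = build_tuple_space(rules)
print(len(space), tuple_space_search(space, "101000", "110010"))
```

The number of probes equals the number of non-empty tuples rather than the number of rules, which is why even this unoptimized search can improve on a linear scan of the rule table.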
For example, two 2-D rules whose source prefixes are two bits long and whose destination prefixes are three bits long both belong to the tuple in the second row and third column of the tuple space. When this tuple is searched, a hash key is constructed by concatenating two bits of the source field with three bits of the destination field. The matched rule can be found by probing each tuple in turn while tracking the least-cost rule. Even a linear search of the tuple space represents a considerable improvement over a linear search of the rules, since the number of tuples is typically much smaller than the number of rules.

The rectangle search, a tuple-based algorithm, was proposed to improve the performance of the tuple lookup [10]. The lower bound has been demonstrated to be $2W-1$ probes given a $W \times W$ rectangular tuple space, where $W$ is the number of distinct prefix lengths. The primary aim is to eliminate a set of tuples during each probe, as depicted in Fig. 2. Tuples above the probed tuple are eliminated if the probe returns Match. Otherwise, tuples to the right of the probed tuple are discarded. Markers and a precomputation mechanism are required to reach this goal. Assuming that the number of rules is $N$, a rectangle search requires $O(NW)$ memory space. In our experiments, the number of generated markers is about twelve times the number of original rules.

In [22], Wang et al. presented an algorithm to improve the required storage and the lookup speed of the rectangle search. Based on the observation that the performance of the rectangle search depends on the number of tuples, this scheme adopts a dynamic programming scheme to calculate the optimal set of tuples and reorganizes the rules by using rule expansion. However, the cost of precomputation increases exponentially as the classifier expands, which makes the scheme unsuitable for large rule sets.

Fig. 2. Rectangle search algorithm.
Fig. 3. Tuple pruning search algorithm.
Fig. 4. Sample firewall database with two rules.

Another heuristic, tuple pruning search, performs lookups on individual fields to eliminate tuples that cannot match the query.
For each dimension, the information about the referred tuples is collected in a pruning table. Accordingly, the lookup procedure starts by searching the pruning tables. The set of referred tuples for each prefix is recorded, and the tuples in the intersection of the sets obtained for the two dimensions are probed. Since no extra entry is required besides the pruning tables, the scheme features a low update cost. We use an example to explain the lookup procedure. The two rules of the example are located in two different tuples. For an incoming packet with addresses (101000, 110010), the source address matches two prefixes, each of which refers to one of these two tuples. The destination address likewise matches two prefixes, which also refer to the same two tuples. Hence both tuples lie in the intersection and will be probed, as shown in Fig. 3. The authors claimed that intersected rules are very rare in industrial firewall databases, and hence that the scheme might perform well in a practical environment [10]. However, its worst-case performance is identical to that of the linear search.

Fig. 5. Tuple construction algorithm.

III. ENHANCED TUPLE PRUNING

Since the action of the least-cost matching rule is used to process the arriving packet, the tuple pruning search can be improved dramatically by maintaining the least-cost property through precomputation. An enhanced tuple pruning scheme for 2-D packet classification is described as follows. To begin with, two prefix tables are constructed by collecting the referred prefixes for both dimensions. Via precomputation, we can calculate the best matching prefixes (BMPs) for both SA and DA and concatenate them as a so-called best matching rule (BMR). Since only one BMR exists for each (SA, DA) pair, the number of probed tuples is reduced to one for any incoming packet. To facilitate the explanation of our idea, let us assume that there are two 2-D rules in the rule set. For the incoming packet with header (101000, 110010), the BMR indicates the single tuple that will be probed, as shown in Fig. 4.

However, two obstacles might impede the correctness of BMR probing. The first is that the cost of the fetched BMR might not be the lowest one. For example, the cost of one rule might be lower than that of the other while the latter is the BMR for the incoming packets. Second, the BMR may not exist in the specific tuple, as in the example where the BMR for header (100000, 110010) does not exist in the original set. To deal with both of these critical situations, an informational marker is introduced to maintain the associated information. For the example in Fig. 3, the missing BMR has to be generated in the tuple to be probed, as shown in Fig. 4. Since one tuple dominates the other, the generated entry takes the action of the rule in the dominating tuple.
We name this extra rule an information marker (i-marker hereafter). The associated i-markers for the other rule are also inserted into the corresponding tuples. Where an i-marker to be inserted is identical to an existing entry, the costs of the two actions are compared; if the action of the i-marker has the lower cost, it replaces the action of the existing entry. The i-marker serves to improve the search procedure, as does the marker used in [10]. The major difference is that i-markers are generated in correspondence to the existence of longer prefixes, while the markers of [10] are generated according to the position of the rule. Furthermore, the i-marker can result in an efficient data structure and search algorithm. In the following, the generation of the i-markers as well as the searchable data structure are presented.

Fig. 6. Number of different subprefix lengths for each route prefix.

Tuple Construction: The construction procedure of the searchable tuples consists of two parts. First, each rule is inserted into the associated tuple according to the lengths of its prefixes. In the meantime, the prefix trees for both dimensions are constructed to record the referred prefixes in the rule set and to indicate the associated tuples. The data structure of the prefix tree can be a binary tree or a multibit tree as proposed in [23]. The pruning tables are generated from the prefix trees. The prefix tree can also keep track of the relationship between each prefix and its subprefixes. To be more specific, given two prefix strings, assume that the string matching function, P-match, returns the common string between them; if one prefix is a substring of the other, P-match returns the shorter one. Next, the i-markers are generated and inserted into the associated tuples. Before an i-marker is inserted into a tuple, the occurrence of a duplicate rule is checked. If there is no duplicate entry, the i-marker is inserted. Otherwise, the costs of the rule actions are compared and the lower-cost action is recorded in the entry. Note that rules with a wildcard field can be treated as 1-D prefixes. They are not inserted into the tuples but into the prefix tree.

Fig. 7. Rule insertion algorithm.

Assume each rule consists of two fields, a source address prefix and a destination address prefix. The sets of longer prefixes for the source and destination prefixes are listed separately. Each combination in the cross-product of these two sets is used to examine the related tuple and to check whether a rule with identical prefixes exists. If so, the costs of both actions are compared and the action with the lower cost is kept in the existing rule, to guarantee that once the rule is probed, the least-cost action will be taken. Otherwise, an i-marker with the action of the rule is put into the tuple for possible probing in the tuple pruning search. The tuple construction algorithm, which operates on the lengths of these prefixes, is given in Fig. 5. In the worst case, $W^2$ tuples are probed for each rule insertion, where $W$ is the length of the IP address. The total number of generated i-markers is expressed in (1) in terms of the number of existing rules whose two prefixes satisfy the corresponding conditions.

One of the major concerns about this approach is the number of additional i-markers, which depends on the number of different lengths of referred prefixes in each dimension. Evidently, the number of i-markers depends on the existence of rules with longer prefixes. Therefore, the number of i-markers that a single rule can produce is bounded by the number of prefix-length combinations. Nevertheless, observations from real-world routing tables and rule sets indicate that the number of different subprefix lengths is small. First, we use the routing tables downloaded from [24], [25] to show the number of different subprefix lengths for each route prefix, without counting the default route. For most route prefixes, there are usually fewer than three subprefixes in the routing table, and six in the worst case, as shown in Fig. 6.
Thus at most 48 extra rules will be generated for each inserted rule. However, the worst-case situation should occur relatively rarely, since only 5% of route prefixes have more than three subprefixes in the NLANR routing table and more than two in the remaining routing tables. In [12], the authors also reported that this phenomenon holds for industrial classifiers. In our experiments, we further demonstrate that the extra cost is reasonable with respect to the improvement in performance.

Search: The classification procedure consists of two pruning-table lookups and one hash lookup in the tuple. First, the BMP lookups are performed in the pruning tables for both dimensions. The result fetched here, however, is only the length of the BMP. The tuple indexed by the two resulting lengths is then probed for the best matched rule. The tuple space lookup performance depends mainly on the lookup performance of the pruning tables. The fast 1-D lookup algorithms proposed in previous schemes can be applied to provide good performance, as shown in [23], [26].

Update: The tuple updates can be divided into three categories: change of rule action, insertion of a rule, and deletion of a rule. We only explain how to perform rule insertion and deletion, since a change of rule action can be treated as re-inserting the rule with the updated action. To deal with rule insertion, the prefix tree for each dimension and the i-markers within the related tuples must be created and maintained. Consider the inserted rule, with its source and destination prefixes, together with the sets of shorter prefixes of these two prefixes. The rule insertion algorithm is given in Fig. 7. In the worst case, the number of modified tuples is bounded by the size of the $W \times W$ tuple space. The insertion of i-markers does not affect the construction of the prefix tree, since the tree is based on the original rules. Furthermore, the tuples covered by the least-cost rule are not probed for the insertion, since they are not affected by the inserted rule, as shown in Fig. 8. A rule is inserted into the two-rule set of Fig. 4.
After the SA and DA of the rule are inserted into the prefix trees, the set of probed tuples is derived. The i-markers are then put into the derived tuples in row-major order.
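The tuple construction and search procedures of this section can be sketched in Python as follows. This is a simplified sketch under stated assumptions rather than the algorithms of Figs. 5 and 7: prefixes and addresses are bit strings, Python sets stand in for the prefix trees and pruning tables, dictionaries stand in for the hashed tuples, wildcard fields are simply empty strings, and the early termination used during insertion is not modeled. The names `build_tpp`, `bmp_length`, and `classify_tpp` and the sample rules are hypothetical.

```python
def build_tpp(rules):
    """Insert every rule plus its i-markers, keeping the least-cost action.

    rules: iterable of (src_prefix, dst_prefix, cost, action) bit-string rules.
    For a rule (s, d), an entry is written for every referred source prefix
    that extends s combined with every referred destination prefix that
    extends d; the combination (s, d) itself is the original rule.
    """
    rules = list(rules)
    src_prefixes = {rule[0] for rule in rules}
    dst_prefixes = {rule[1] for rule in rules}
    tuples = {}  # (src_len, dst_len) -> {(src, dst): (cost, action)}
    for src, dst, cost, action in rules:
        longer_src = [p for p in src_prefixes if p.startswith(src)]
        longer_dst = [p for p in dst_prefixes if p.startswith(dst)]
        for s in longer_src:
            for d in longer_dst:
                table = tuples.setdefault((len(s), len(d)), {})
                old = table.get((s, d))
                if old is None or cost < old[0]:
                    table[(s, d)] = (cost, action)
    return src_prefixes, dst_prefixes, tuples


def bmp_length(prefixes, addr):
    """Stand-in for a pruning-table lookup: length of the best matching prefix."""
    for length in range(len(addr), -1, -1):
        if addr[:length] in prefixes:
            return length
    return None


def classify_tpp(index, sa, da):
    """Two BMP-length lookups followed by a single tuple probe."""
    src_prefixes, dst_prefixes, tuples = index
    src_len = bmp_length(src_prefixes, sa)
    dst_len = bmp_length(dst_prefixes, da)
    if src_len is None or dst_len is None:
        return None
    return tuples.get((src_len, dst_len), {}).get((sa[:src_len], da[:dst_len]))


# Hypothetical rule set; the lowest-cost rule does not have the longest
# destination prefix, so a plain BMR probe would return the wrong action.
rules = [
    ("10", "110", 2, "gold"),
    ("1", "1", 5, "silver"),
    ("1010", "11", 1, "platinum"),
]
index = build_tpp(rules)
total_entries = sum(len(table) for table in index[2].values())
print(total_entries, classify_tpp(index, "101000", "110010"))
```

In this sample, the three rules expand into nine stored entries, and each lookup touches exactly one tuple. The entry reached for header (101000, 110010) carries the action of the least-cost matching rule even though that rule itself lives in a different tuple, which is the role the text assigns to the i-marker.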

1244 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 8, NO. 6, DECEMBER 2006

Fig. 8. Example of rule insertion.
Fig. 9. Example of rule deletion.
Fig. 10. Implement with parallel hardware.
Fig. 11. Flooding avoidance of i-markers.

While traversing, a collision with is encountered. After comparing their cost, if the cost of is lower, its action will replace the least-cost action field of and the traversal continues through the remaining tuples (, and ). Otherwise, the entry in will remain unchanged and the remaining three tuples covered by will not be probed in this insertion. As a result, at most tuples would be probed.

The procedure of rule deletion is similar to that of rule insertion. For each deleted rule, the related i-markers should be dropped. The i-markers in,, and will be removed for the deletion of, as shown in Fig. 9. Furthermore, the tuples covered by the deleted rule will also be examined for the correctness of the least-cost action. The nearest rule which covers should be checked for possible revision. This is because if the examined tuples hold i-markers or rules with cost higher than, the action of should replace those entries to ensure that the lowest-cost action will be taken. Otherwise, their actions remain unchanged. The number of deleted i-markers is and the number of examined tuples is in the worst case. To ease updating, each entry in the tuple should have two action fields: one is the least-cost action related to the rule and the other is its original action.

Implementation: The tuple pruning search can be implemented in software or hardware. With a software implementation, the total lookup time is plus one hash access time. The lookup performance can be further improved through hardware implementation. By exploiting hardware parallelism, the total lookup time of the pruning tables is reduced to max(lookup(sa), lookup(da)), as shown in Fig. 10.
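The software lookup path (two pruning-table lookups, then exactly one hash probe) can be sketched as follows. This is a simplified illustration under stated assumptions: the pruning tables are modeled as plain sets of bit-string prefixes instead of the fast 1-D structures of [23], [26], the tuple space is a dictionary keyed by the pair of prefix lengths, and the example entries are hypothetical data in which the least-cost action has already been precomputed.

```python
def bmp_length(address, prefixes):
    """Length of the best (longest) matching prefix, or None if none matches."""
    lengths = [len(p) for p in prefixes if address.startswith(p)]
    return max(lengths) if lengths else None


def classify(sa, da, sa_prefixes, da_prefixes, tuples):
    """Two pruning lookups followed by a single hash probe."""
    i = bmp_length(sa, sa_prefixes)   # pruning lookup on the source address
    j = bmp_length(da, da_prefixes)   # pruning lookup on the destination address
    if i is None or j is None:
        return None
    # The only tuple probed is the one indexed by the two BMP lengths.
    entry = tuples.get((i, j), {}).get((sa[:i], da[:j]))
    return entry["best_action"] if entry else None


# Hypothetical tables: two rules plus the i-markers they generate.
sa_prefixes = {"1", "10"}
da_prefixes = {"1", "11"}
tuples = {
    (1, 1): {("1", "1"): {"best_action": "A"}},    # rule
    (1, 2): {("1", "11"): {"best_action": "A"}},   # i-marker
    (2, 1): {("10", "1"): {"best_action": "A"}},   # i-marker
    (2, 2): {("10", "11"): {"best_action": "A"}},  # rule, cheaper cover wins
}

action = classify("100000", "110010", sa_prefixes, da_prefixes, tuples)
```

The two `bmp_length` calls do not depend on each other, which is what allows a hardware implementation to run them in parallel and bound the pruning time by the slower of the two.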
We can also perform pruning and hashing simultaneously by adopting a pipelined design, so that one packet classification completes within max(pruning(sa), pruning(da), one hash access to the tuple). Assuming that the worst-case pruning time is 20 ns with 10-ns SRAM [26] and that one hash access (without collision) takes 20 ns (one 20-ns DRAM access), the proposed scheme can achieve 50 MPPS.

IV. FLOODING AVOIDANCE OF i-MARKERS

The basic scheme described in Section III takes just two lookups in the pruning tables and one hash access to the tuple. Although the experimental results show that the number of generated i-markers is acceptable with respect to the number of rules, the flooding avoidance mechanism can further reduce the number of i-markers as well as the required storage. The basic idea is to divide the tuple space into multiple subspaces. Since the number of subprefixes in each subspace is reduced, the number of generated i-markers decreases as well. For each subspace, at most one tuple will be probed. Consequently, the number of probed tuples is not greater than the number of subspaces. Though the additional memory accesses degrade performance, the degradation can be alleviated through a parallel design with a multibank memory architecture. By putting two subspaces into two separate memory banks, the lookups can be performed concurrently. We use the example in Fig. 8 to illustrate the operation. Assume the boundary line is at the bit of DA, the tuple space is divided into two subspaces, as shown in Fig. 11. There are two rules in the left subspace and one for the other. Thus only two

WANG et al.: SCALABLE PACKET CLASSIFICATION 1245

i-markers are generated in the left subspace and none in the right one. To look up the best matched rule, we use the address pair (100000, 110010) as an example. The tuple will be probed in the left subspace and in the right one. Their actions are compared to decide which action will be taken. The action of i-markers in the tuple is equal to that in since they are duplicates of. Thus, can be removed without influence, and so can. Consequently, the action of will be equal to that of since the activated action of must have the least cost between and. is withdrawn in the same way. Overall, the i-marker reduction does not affect the correctness of the classification result.

TABLE I. COMPARE TO THE TUPLE PRUNING SEARCH BASED ON CLASSIFIERS WITH 80% LOCALITY
Fig. 12. Modification to the pruning tables.

The original pruning procedure has to be modified to support lookup in the subspaces. To simplify the illustration, we use the example in Fig. 11 and adopt the binary tree as the pruning table. The prefix trees for the SA and DA prefixes are shown in Fig. 12. Each node in the SA prefix tree has two fields (1 bit for each subspace) that indicate whether the SA prefix occurs in the left/right subspace. For example, prefixes and occur in the left subspace, thus the left fields of the corresponding nodes are set to 1. Also, the right field of the node mapped to 1010 is set to 1, while the DA prefix tree stays unchanged. While performing tuple pruning, the lookup procedure in the SA prefix tree has to record the last occurrence of 1 for both the left and right fields. For the DA prefix tree, the revised procedure has to record the length of the longest matching prefix that is shorter than the boundary line, i.e., the cut-bit (the bit of DA).
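The modified pruning step can be sketched in Python for the two-subspace case. Several details are assumptions made for illustration only: the cut-bit value, the convention that a rule belongs to the left subspace when its DA prefix is shorter than the cut-bit and to the right subspace otherwise, the use of a dictionary as a stand-in for the binary prefix tree, and the three-rule classifier itself. The address pair is the one used in the text.

```python
def build_pruning_tables(rules, cut):
    """rules: iterable of (sa_prefix, da_prefix) bit strings.

    Returns the SA table, mapping each SA prefix to its two 1-bit fields
    [in_left_subspace, in_right_subspace], and the set of DA prefixes.
    """
    sa_flags = {}
    da_prefixes = set()
    for sa, da in rules:
        side = 0 if len(da) < cut else 1   # assumed boundary convention
        sa_flags.setdefault(sa, [0, 0])[side] = 1
        da_prefixes.add(da)
    return sa_flags, da_prefixes


def probed_tuples(sa, da, sa_flags, da_prefixes, cut):
    """Return the (len_sa, len_da) tuple to probe in the left and right
    subspace; a subspace without a candidate yields None."""
    last_sa = [None, None]
    for n in range(len(sa) + 1):           # walk down the SA prefix tree
        flags = sa_flags.get(sa[:n])
        if flags:
            for side in (0, 1):
                if flags[side]:
                    last_sa[side] = n      # last occurrence of 1 per field
    last_da = [None, None]
    for n in range(len(da) + 1):           # walk down the DA prefix tree
        if da[:n] in da_prefixes:
            last_da[0 if n < cut else 1] = n
    result = []
    for side in (0, 1):
        if last_sa[side] is None or last_da[side] is None:
            result.append(None)
        else:
            result.append((last_sa[side], last_da[side]))
    return result


# Hypothetical classifier: two rules fall left of the cut-bit, one falls right.
CUT = 3
rules = [("1", "1"), ("10", "11"), ("1000", "1100")]
sa_flags, da_prefixes = build_pruning_tables(rules, CUT)
left, right = probed_tuples("100000", "110010", sa_flags, da_prefixes, CUT)
```

With these assumptions one tuple per subspace is selected, so the two hash probes can be issued to separate memory banks and their actions compared afterwards.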
For example, to classify the packet with address pair (SA:100000, DA:110010), the last 1 appears at traversed node for the left field and traversed node for the right one in the SA prefix tree while searching for 100000. To search the DA address 110010, and bit will be recorded. Thus the probed tuples will be and, respectively.

Dynamic programming can be used to calculate the optimum scope of each subspace. Assume the tuple space is divided into subspaces. The number of the boundary lines in the row is expressed as and the one in the column is expressed as which satisfies, as shown in (2) at the bottom of the page. For each combination, there are sets of the cut-bit positions. The number of the generated i-markers in each cut-bit combination is listed in (3), shown at the bottom of the page. The definition of is modified as which is the set of the longer prefixes conforming to the length restriction of the subspace.

TABLE II. COMPARE TO THE TUPLE PRUNING SEARCH BASED ON RANDOM CLASSIFIERS

V. PERFORMANCE EVALUATION

To evaluate the performance of the proposed TPP scheme, we use synthetic rule sets with 5 K to 1 M entries. Since our algorithm is designed for supporting QoS of multimedia applications, we generate the rule sets from the routing table in NLANR [24]. There are 102 309 prefixes in the sample

routing table. We use two different sampling schemes to generate the (SA, DA) rules: the first chooses the prefixes uniformly [10], and the other concentrates 80% of the rules in 20% of the address space to show locality [27]. Note that rules with wildcards are not considered in the simulation because they are inserted into the pruning table and do not affect the tuples. The rule length distribution of the 100 K-rule set with 80% locality is shown in the right part of Fig. 13, which is similar to the distribution of the uniformly chosen rules. Most rules correspond to the tuples near (24,24). The color shades represent the density of rules: the darker the shade, the greater the number of rules.

Fig. 13. Rule lengths distribution of original database (left: random, right: 80% locality).
Fig. 14. Rule lengths distribution of database with i-markers (left: random, right: 80% locality).

A. Comparison With Tuple Pruning Search

We first examine the rule sets with 80% locality. The major performance metric is the number of i-markers. From Table I, we can see that the numbers of entries are about three to six times those of the original tables. However, for a larger set (more than 10 K entries), the ratio of increased entries is smaller than for a smaller set (1 K). This is because with more entries in the table, the probability that a generated i-marker collides with an existing rule is also higher, thereby reducing the ratio of increased entries. The result for the randomly generated set is shown in Table II. The number of occupied tuples is slightly reduced due to the address locality. The number of entries is increased for the large sets (more than 50 K rules) since the widespread rules might reduce the occurrence of i-marker collisions. For the 1 M-rule set, it requires about 70 MB of memory, whose cost is lower than US$50.
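The two sampling schemes used to generate the rule sets in this section can be approximated by the sketch below. It is only one possible reading of "80% of the rules in 20% of the address space": the first fifth of the sorted prefix list stands in for 20% of the address space, and the remaining rules are drawn from the other prefixes. The function name, its parameters, the fixed seed, and the tiny prefix list are hypothetical; the actual experiments draw from the NLANR routing table.

```python
import random


def sample_rules(prefixes, n, locality=None, seed=0):
    """Draw n (SA, DA) rules from a list of bit-string prefixes.

    locality=None picks both fields uniformly; locality=0.8 places that
    fraction of the rules inside a "hot" region of the prefix list.
    """
    rng = random.Random(seed)
    ordered = sorted(prefixes)
    if locality is None:
        return [(rng.choice(ordered), rng.choice(ordered)) for _ in range(n)]
    split = max(1, len(ordered) // 5)      # assumed stand-in for 20% of the space
    hot, cold = ordered[:split], ordered[split:]
    n_hot = round(n * locality)
    rules = [(rng.choice(hot), rng.choice(hot)) for _ in range(n_hot)]
    rules += [(rng.choice(cold), rng.choice(cold)) for _ in range(n - n_hot)]
    rng.shuffle(rules)
    return rules


# Ten hypothetical 4-bit prefixes; the first two form the hot region.
toy_prefixes = [format(i, "04b") for i in range(10)]
uniform_rules = sample_rules(toy_prefixes, 100)
local_rules = sample_rules(toy_prefixes, 100, locality=0.8)
```

Duplicate (SA, DA) pairs are not filtered here; a real generator would likely discard them before building the tuple space.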
For both sets, the TPP scheme offers an apparent improvement in speed since only one tuple is probed. With the native tuple pruning search, the number of probed tuples increases to 51 in the worst case, i.e., at least 51 memory accesses. The TPP scheme is therefore much more suitable for speed-critical environments, even with sevenfold storage. As described above, the TPP scheme can achieve 50 MPPS when coupled with a fast 1-D lookup algorithm. The rule length distribution resulting from Fig. 13 is shown in Fig. 14. The number of required tuples is increased, and most of the blocks are darker than in Fig. 13. Furthermore, the number of colored blocks is increased because the i-markers might be inserted into tuples that do not originally contain any rule.

B. Performance With Flooding Avoidance

Next, we apply the heuristic described in Section IV and show the results in Table III. By dividing the tuple space into four subspaces and executing dynamic programming, the cut-bits are set to bit. By using the flooding avoidance scheme, the number of i-markers is reduced significantly. Note that after i-marker reduction, the entry counts for both sets are quite close. The number of entries is reduced to about twofold, but this comes with three more hash accesses. The parallel design can increase the throughput by searching the four subspaces simultaneously. We also found that the number of tuples is much less than

the basic scheme, which demonstrates the effect of i-marker reduction. For the 1 M-rule set, the required storage is reduced to about 20 MB.

TABLE III. NUMBER OF ENTRIES WITH i-markers REDUCTION
TABLE IV. COMPARE TO THE RECTANGLE SEARCH SCHEME
TABLE V. COMPARE TO THE TUPLE REDUCTION SCHEME

C. Comparison With the Rectangle Search Schemes

In this section, the TPP scheme is compared to the rectangle search schemes presented in [10], [22]. Since the existing rectangle search schemes require extensive precomputation, we only compare their performance based on classifiers with at most 100 K rules. First, we present the comparison between the TPP scheme and the rectangle search. As shown in Table IV, the TPP scheme might result in more tuples than the rectangle search, but its required storage and lookup speed are greatly superior. This is because the i-markers are inserted into the tuples based on the referred prefixes, whereas the markers used in the rectangle search are inserted into every left-hand tuple. Hence the i-markers result in an efficient data structure and search algorithm, as demonstrated in Table IV. Next, we compare the TPP scheme with the tuple reduction scheme [22]. As listed in Table V, the tuple reduction scheme outperforms the rectangle search in both speed and storage. Nevertheless, the TPP scheme provides better performance than the tuple reduction scheme due to the efficiency of the i-markers.

D. Comparison With Other Existing Schemes

It is difficult to compare the practical performance of the existing schemes because there are no publicly available benchmark tools for packet classification yet. Currently, most schemes use randomly generated rule sets to examine performance, while few use real-world firewall databases. In addition, most schemes are designed for 5-D packet classification, except for the tuple-based schemes, Grid of Tries, and the proposed scheme.
Therefore, it is necessary for us to compare their performance based on theoretical complexity. The comparisons of theoretical time, space, and update complexity are shown in Table VI, where is the number of rules, is the length of the IP address, is the number of fields, is the memory bus width and the and are the number of divisions for SA and DA. Compared to the 2-D schemes, the TPP scheme features faster lookup performance, but its worst-case storage requirement shows possible memory explosion, so it requires the heuristic presented in Section IV to avoid the worst-case situation. By dividing the tuple space into more subspaces, the storage complexity will approach to. Also, the TPP scheme provides good scalability and incremental update.

VI. CONCLUSION

Aiming to support QoS for multimedia applications, we propose a remarkable enhancement to the 2-D tuple pruning search. Our main idea is to simplify the lookup procedure and avoid unnecessary tuple probing. By maintaining the least-cost property of rules through precomputation and the introduced Information Marker, we can reduce the number of probed tuples from the worst-case to O(1). Incremental update is also supported in the new approach. The TPP scheme can achieve a constant throughput of 50 MPPS with a parallel hardware design. In our experiments, 70 MB of storage is required for a 1 M-rule set, which is a reasonable tradeoff. To lessen the memory demand, we introduce a heuristic to reduce the required storage. By dividing the tuple space into multiple subspaces, the number of generated i-markers is decreased. For example, the required storage for the 1 M-rule set is reduced to about 20 MB with four subspaces. Compared with the tuple-based algorithms, the predeterminable performance can provide sustainable throughput and ease the implementation. Therefore, the proposed Tuple Pruning Plus scheme is a feasible solution for the emerging Internet differentiated services.

TABLE VI. SEARCH, MEMORY USAGE AND UPDATE COMPLEXITY

REFERENCES

[1] J. R. Gallardo, D. Makrakis, and M. Angulo, "Dynamic resource management considering the real behavior of aggregate traffic," IEEE Trans. Multimedia, vol. 3, no. 2, pp. 177–185, Jun. 2001.
[2] S. Blake et al., An Architecture for Differentiated Services, RFC 2475, 1998.
[3] T. Li and Y. Rekhter, Provider Architecture for Differentiated Services and Traffic Engineering (PASTE), RFC 2430, Oct. 1998.
[4] J. Boyle, "RSVP extensions for CIDR aggregated data flows," Internet Draft, Dec. 1997 [Online]. Available: draft-ietf-rsvp-cidr-ext-01.txt
[5] E. Rosen, A. Viswanathan, and R. Callon, Multiprotocol Label Switching Architecture, RFC 3031, Jan. 2001.
[6] V. P. Kumar, T. V. Lakshman, and D. Stiliadis, "Beyond best effort: router architectures for the differentiated services of tomorrow's Internet," IEEE Commun. Mag., vol. 36, no. 5, pp. 152–164, May 1998.
[7] D. Estrin, D. Farinacci, A. Helmy, D. Thaler, S. Deering, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei, Protocol Independent Multicast-Sparse Mode: Protocol Specification, RFC 2117, Jun. 1997.
[8] D. Waitzman, C. Partridge, and S. Deering, Distance Vector Multicast Routing Protocol, RFC 1075, Jun. 1993.
[9] P. Gupta and N. McKeown, "Packet classification on multiple fields," in ACM SIGCOMM, Sep. 1999, pp. 147–160.
[10] V. Srinivasan, G. Varghese, and S. Suri, "Packet classification using tuple space search," in ACM SIGCOMM, Sep. 1999, pp. 135–146.
[11] S. Singh, F. Baboescu, G. Varghese, and J. Wang, "Packet classification using multidimensional cutting," in ACM SIGCOMM '03, Aug. 2003, pp. 213–224.
[12] F. Baboescu and G. Varghese, "Scalable packet classification," IEEE/ACM Trans. Netw., vol. 13, no. 1, pp. 2–14, 2005.
[13] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, "Fast scalable level four switching," in ACM SIGCOMM, Sep. 1998, pp. 191–202.
[14] T. V. Lakshman and D. Stiliadis, "High speed policy-based packet forwarding using efficient multi-dimensional range matching," in ACM SIGCOMM, Sep. 1998, pp. 203–214.
[15] A. Feldmann and S. Muthukrishnan, "Tradeoffs for packet classification," in IEEE INFOCOM, Mar. 2000, pp. 1193–1202.
[16] M. Buddhikot, S. Suri, and M. Waldvogel, "Space decomposition techniques for fast layer-4 switching," in IFIP Sixth Int. Workshop on High Speed Networks, 2000, pp. 25–41.
[17] T. Woo, "A modular approach to packet classification: algorithms and results," in IEEE INFOCOM, Mar. 2000, pp. 1213–1222.
[18] P. Gupta and N. McKeown, "Packet classification using hierarchical intelligent cuttings," IEEE Micro, vol. 20, no. 1, pp. 34–41, 2000.
[19] F. Baboescu, S. Singh, and G. Varghese, "Packet classification for core routers: is there an alternative to CAMs?," in IEEE INFOCOM, Mar. 2003, pp. 53–63.
[20] P. Gupta and N. McKeown, "Algorithms for packet classification," IEEE Network Mag., vol. 15, no. 2, pp. 24–32, 2001.
[21] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable high speed IP routing lookups," in ACM SIGCOMM, Sep. 1997, pp. 25–36.
[22] P. C. Wang, C. T. Chan, S. C. Hu, C. L. Lee, and W. C. Tseng, "High-speed packet classification for differentiated services in next-generation networks," IEEE Trans. Multimedia, vol. 6, no. 6, pp. 925–935, 2004.
[23] V. Srinivasan and G. Varghese, "Fast IP lookups using controlled prefix expansion," ACM Trans. Comput., vol. 17, pp. 1–40, Feb. 1999.
[24] NLANR Project, National Laboratory for Applied Network Research [Online]. Available: http://www.nlanr.net
[25] Merit Networks Inc., IPMA Project [Online]. Available: http://www.merit.edu/ipma/routing_table/
[26] P. C. Wang, C. T. Chan, and Y. C. Chen, "A fast IP routing lookup scheme," IEEE Commun. Lett., vol. 5, no. 3, pp. 125–127, Mar. 2001.
[27] Y. D. Lin, H. Y. Wei, and K. J. Wu, "Ordered lookup with bypass matching for scalable per-flow classification in layer 4 routers," Comput. Commun., vol. 24, no. 7–8, pp. 667–676, 2001.

Pi-Chung Wang (M'02) received the M.S. and Ph.D. degrees in computer science and information engineering from the National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1997 and 2001, respectively. From 2002 to 2006, he was with Telecommunication Laboratories of Chunghwa Telecom, Taipei, Taiwan, working on network planning in broadband access networks and PSTN migration. During these four years, he also worked on IP lookup and classification algorithms. Since February 2006, he has been an assistant professor of Computer Science at National Chung Hsing University. His research interests include IP lookup and classification algorithms, scheduling algorithms, congestion control, network processors, and algorithms and applications related to computational geometry. He is currently working on high speed string matching for network intrusion detection.

Chia-Tai Chan received the Ph.D. degree in computer science and information engineering from National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1998. From 1999 to 2005, he was with Telecommunication Laboratories Chunghwa Telecom Co., Ltd., as a Project Researcher. In August 2005, he joined the faculty of the Institute of Biomedical Engineering, National Yang-Ming University, Taipei, Taiwan, as an Associate Professor. His research interests include the design, analysis and traffic engineering of broadband multiservice networks.

Hung-Yi Chang was born in Taiwan, R.O.C., in 1970. He received the M.S. and Ph.D. degrees in computer science and information engineering from National Chiao-Tung University, Hsinchu, Taiwan, in 1994 and 1999, respectively. He is now an assistant professor in the Department of Information Management at National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan. His research interests include network management and interconnection networks.

Chun-Liang Lee received the M.S. and Ph.D. degrees in computer science and information engineering from National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1997 and 2001, respectively. From 2002 to 2006, he was with the Telecommunication Laboratories, Chunghwa Telecom Co., Ltd. Since February 2006, he has been an assistant professor of Computer Science and Information Engineering at Chang-Gung University, Taoyuan, Taiwan. His research interests include design and analysis of network protocols, quality of service in the Internet, and packet classification algorithms.