ERFC: An Enhanced Recursive Flow Classification Algorithm

Gong XY, Wang WD, Cheng SD. ERFC: An enhanced recursive flow classification algorithm. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(5): 958-969 Sept. 2010. DOI 10.1007/s11390-010-1076-5

Xiang-Yang Gong, Wen-Dong Wang, Senior Member, CCF, and Shi-Duan Cheng

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

E-mail: {xygong, wdwang, chsd}@bupt.edu.cn

Received March 14, 2009; revised June 22, 2010.

Abstract  Packet classification on multiple fields is a fundamental mechanism in network equipment, and various classification solutions have been proposed. Because of inherent difficulties, many of these solutions scale poorly in either time or space as rule sets grow. Recursive Flow Classification (RFC) is an algorithm with a very high classifying speed; however, its preprocessing complexity and memory requirement are rather high. In this paper, we propose an enhanced RFC (ERFC) algorithm, in which a hash-based aggregated bit vector scheme is exploited to speed up the preprocessing procedure. A compressed and cacheable data structure is also introduced to decrease the total memory requirement and improve searching performance. Evaluation results show that ERFC provides a great improvement over RFC in both space requirement and preprocessing time. The search time complexity of ERFC is equivalent to that of RFC in the worst case, and its average classifying speed is improved by about 100%.

Keywords  packet classification, ERFC (enhanced recursive flow classification), preprocessing and storage optimization

1 Introduction

In today's network equipment, packet classification is a fundamental mechanism for implementing Quality of Service (QoS) guarantees and security services, such as IntServ, DiffServ, access control and firewalls.
Generally, classification is the first step of packet processing in network elements such as routers. Typically, a router classifies input packets into equivalence classes (flows) based on control fields in packet headers, and then provides differentiated processing for different packet classes. Each equivalence class is defined by a filter (or a rule), which is used to determine whether a packet belongs to the corresponding class and how to process packets in this class. A classifier consists of a set of rules, which defines all equivalence classes supported in the system. Several metrics must be considered when designing a packet classification scheme [1]: 1) space complexity, the bound on the memory space required to store and maintain the data structures used by the algorithm; 2) time complexity, the bound on the time required to classify a packet; 3) setup/update complexity, the time required to build or update the data structures for packet classification. The general k-dimension packet classification problem ($k \ge 2$) is inherently difficult, for it is hard to simultaneously achieve relatively low time and space complexity in the worst case [1]. The problem has received much research attention. Among recently proposed packet classification schemes, recursive flow classification (RFC) [2] provides the highest classifying speed, and it is quite easy to implement in parallel and pipelined hardware. However, its space complexity is very high and its preprocessing procedure is time-consuming, which makes RFC unsuitable for applications that require frequent incremental rule updates. In this paper, an enhanced RFC (ERFC) algorithm is proposed to improve the performance of RFC by reducing preprocessing complexity and storage requirement so as to make it suitable for large classifiers. In the setup stage of ERFC, we exploit a hash-based aggregated bit vector scheme to speed up pre-computing.
A new cacheable data structure is also introduced to reduce the total memory requirement of ERFC and improve its classifying speed. Evaluation results show that ERFC provides a great improvement over RFC in both storage and preprocessing performance. In the worst case, the classifying time complexity of ERFC is equivalent to that of RFC, and its average classifying speed is improved by about 100%.

Previous research related to the topic of this paper is briefly introduced in Section 2. In Section 3, we discuss the problem of packet classification and analyze the RFC algorithm. Section 4 describes the proposed ERFC algorithm in detail. The experimental evaluation results are presented in Section 5. Finally, Section 6 states our conclusion.

Regular Paper. Supported by the National Basic Research 973 Program of China under Grant No. 2009CB320504 and the National Hi-Tech Research and Development 863 Program of China under Grant Nos. 2008AA01A324 and 2009AA01Z210. © 2010 Springer Science + Business Media, LLC & Science Press, China

2 Related Work

In recent research on multidimensional packet classification, heuristics are used in a number of proposed algorithms. Heuristic schemes usually exploit structural characteristics and redundancy of real classifiers to optimize algorithm performance. For k-dimension packet classification, RFC provides a minimum classifying time complexity of $O(k)$, i.e., if k is constant, it requires a constant number of memory accesses to classify a packet. However, its storage requirement and preprocessing complexity are very high. In the Lucent Bit Vector (BV) search scheme [3], the storage requirement is $O(kN^2)$ and the query time is $O(Wk + N/w)$ in the worst case, where N is the number of rules, W is the length of an IP address, and w is the memory bus width. The Aggregated Bit Vector (ABV) scheme [4] adds new techniques to the BV scheme and reports an order of magnitude improvement in performance over the BV scheme. The tuple space search (TSS) [5] has a small memory requirement of $O(N)$, but its worst-case search speed depends on characteristics of the classifier, and it supports only prefixes rather than arbitrary ranges. Extended Grid of Tries with Path Compression (EGT-PC) [6] is a modified grid-of-tries [7] scheme. EGT-PC employs path compression to reduce the search time and memory requirement. Its memory requirement is $O(N)$, and its search time complexity is about $O(W^2)$. Another grid-of-tries based classification scheme is proposed in [8], which exploits a non-collision hash algorithm to improve lookup performance.
Hierarchical Intelligent Cuttings (HiCuts) [9] supports lookup in $O(W + T/c)$ time with a storage requirement of $O(N^k)$, where T is the maximum bucket size and c is the cache line size. Similar to HiCuts, HyperCuts [10] is a decision tree based algorithm, but its performance is improved. It is reported that HyperCuts uses 2 to 10 times less memory than HiCuts, and its worst-case search time is 50% to 500% better than that of HiCuts.

3 Problem Statement and the RFC Algorithm

3.1 k-dimension Packet Classification

In k-dimension packet classification, a packet p is abstracted to a k-dimension vector $(p_1, p_2, \ldots, p_k)$. Each $p_i$ ($i \in [1, k]$) is a $w_i$-bit nonnegative integer, which is a control field in the packet header. Let $R_i = [0, 2^{w_i} - 1]$ and $R = R_1 \times R_2 \times \cdots \times R_k$; then p is a point in the direct product space R. A filter F is a k-tuple $(\Gamma_1, \Gamma_2, \ldots, \Gamma_k)$. $\Gamma_i$ is usually specified as a prefix match, range match or exact match expression. $\Gamma_i$ defines a subset of the range $R_i$, and F defines a subspace of R, i.e., $F = \Gamma_1 \times \Gamma_2 \times \cdots \times \Gamma_k$. For a given packet p, if $p_i \in \Gamma_i$, $p_i$ is said to match $\Gamma_i$; if $p \in F$, i.e., $\forall i \in [1, k]$, $p_i \in \Gamma_i$, p is said to match F. A classifier $\hat{C}$ is a finite set of N filters: $\hat{C} = \{F_0, F_1, \ldots, F_{N-1}\}$. Each filter $F_i$ is assigned an identifier $id(F_i)$ and a priority $pri(F_i)$. It is reasonable to assume in this paper that $id(F_i) = i$, and that $pri(F_i)$ is higher than $pri(F_j)$ if $i < j$. When classifying a packet p, a search is performed in $\hat{C}$ for filters that satisfy $p \in F$. If such an F exists, p is classified into the class corresponding to F. F is said to conflict with another rule G if $F \cap G \ne \emptyset$ [11-12]. Conflicts in classifier $\hat{C}$ affect the time and/or storage consumption of the packet classification algorithm. Let $M(\hat{C}, p)$ be the subset of filters that packet p matches, i.e., $M(\hat{C}, p) = \{F : p \in F, F \in \hat{C}\}$. Let $bmf(\hat{C}, p)$ denote the best matched filter (bmf) in $\hat{C}$ of packet p. In this paper, we define $bmf(\hat{C}, p)$ as the filter with the highest priority in $M(\hat{C}, p)$.
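The definitions of matching, $M(\hat{C}, p)$ and $bmf(\hat{C}, p)$ above can be made concrete with a brute-force reference classifier. The following is only a minimal sketch for illustration, not part of RFC or ERFC: it assumes every $\Gamma_i$ is given as an inclusive integer range, and the names `matches`, `matching_set` and `best_matching_filter` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[int, int]      # assumption: each Gamma_i is an inclusive range [lo, hi]
Filter = Sequence[Range]     # one range per dimension, i.e. a k-tuple


def matches(packet: Sequence[int], flt: Filter) -> bool:
    """p matches F when every field p_i lies inside Gamma_i."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(packet, flt))


def matching_set(classifier: Sequence[Filter], packet: Sequence[int]) -> List[int]:
    """Identifiers of all filters matched by the packet, i.e. M(C, p)."""
    return [i for i, flt in enumerate(classifier) if matches(packet, flt)]


def best_matching_filter(classifier: Sequence[Filter],
                         packet: Sequence[int]) -> Optional[int]:
    """id(F_i) = i and a lower index means a higher priority,
    so the bmf is the first matching filter (None if nothing matches)."""
    matched = matching_set(classifier, packet)
    return matched[0] if matched else None


# Two-dimensional toy classifier: F_0 is specific, F_1 is a catch-all rule.
toy_classifier = [((0, 10), (5, 5)), ((0, 255), (0, 255))]
```

This linear scan costs $O(kN)$ per packet; it is useful mainly as a correctness oracle against which faster schemes can be compared.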
The goal of packet classification is to find the bmf of a given packet. The bit vector [3] is another representation of $M(\hat{C}, p)$. The bit vector for $M(\hat{C}, p)$, $bv(\hat{C}, p)$, is an N-bit string such that

$$\text{the } i\text{-th bit of } bv(\hat{C}, p) = \begin{cases} 1, & \text{if } F_i \in M(\hat{C}, p), \\ 0, & \text{otherwise,} \end{cases} \qquad 0 \le i < N.$$

A classifier is unable to distinguish between p and q if $bv(\hat{C}, p) = bv(\hat{C}, q)$. In this case, p and q are classified into the same equivalent class [2] (eq-class). An equivalent class is defined as follows: assuming R is partitioned into a series of subsets $S_1, S_2, \ldots, S_e$, each $S_i$ is an eq-class of $\hat{C}$ if: a) $S_i \cap S_j = \emptyset$ ($i \ne j$) and $S_1 \cup S_2 \cup \cdots \cup S_e = R$; b) $\forall p \in S_i$, $\forall q \in S_j$, $bv(\hat{C}, p) = bv(\hat{C}, q) \Leftrightarrow S_i = S_j$. Let $E(\hat{C}) = \{S_1, S_2, \ldots, S_e\}$; $E(\hat{C})$ is the eq-class set (ECS) of $\hat{C}$. According to the above definition, $\forall S_i \in E(\hat{C})$, the bit vectors (and also the bmfs) of all packets in $S_i$ are identical. We define the class bitmap (cbm) and the class bmf of $S_i$ respectively as $cbm(\hat{C}, S_i) = bv(\hat{C}, p)$ and $bmf(\hat{C}, S_i) = bmf(\hat{C}, p)$, in which p is an arbitrary packet in $S_i$. Obviously, each $S_i$ in $E(\hat{C})$ corresponds to a unique cbm.

Let $\theta = \{i, j, \ldots, l\}$ ($n = |\theta|$, $n \le k$) be a nonempty subset of $\{1, 2, \ldots, k\}$, and let $R_\theta$ denote the direct product space $R_i \times R_j \times \cdots \times R_l$. For a packet $p = (p_1, p_2, \ldots, p_k)$, its projection in space $R_\theta$ is denoted by $p_\theta$, where $p_\theta = (p_i, p_j, \ldots, p_l)$ is an n-dimension vector. Similarly, the projection of rule F in $R_\theta$ is an n-tuple, $F_\theta = (\Gamma_i, \Gamma_j, \ldots, \Gamma_l)$. For instance, if $\theta = \{1, 3, 4\}$, the projection of p in space $R_1 \times R_3 \times R_4$ is $p_{\{1,3,4\}} = (p_1, p_3, p_4)$, and the projection of F in space $R_1 \times R_3 \times R_4$ is $F_{\{1,3,4\}} = (\Gamma_1, \Gamma_3, \Gamma_4)$. The projection of a rule set $\hat{C}$ in space $R_\theta$ is defined as $\hat{C}_\theta = \{F_\theta : F \in \hat{C}\}$, which is a set of N projected rules. We assume that identifiers of rules in $\hat{C}$ are preserved after projecting, i.e., $\forall F \in \hat{C}$, $id(F_\theta) = id(F)$. Since $\hat{C}_\theta$ is also a classifier, its eq-class set $E(\hat{C}_\theta)$ can be calculated, and the cbms of eq-classes in $E(\hat{C}_\theta)$ and $E(\hat{C})$ are compatible.

3.2 RFC Algorithm

In RFC, most of the complex operations are moved to the preprocessing stage to create efficient data structures for high-speed classifying. For a k-dimension classifier (k is constant), RFC requires a constant number of memory accesses to classify a packet, i.e., its classification time complexity equals $O(k)$. However, its storage and preprocessing complexities are rather high. Hence, RFC is considered not scalable. In the rest of this section, RFC is analyzed in detail.

Fig.1. Data structure of RFC: eq-class table (ET) and index table (IT).

Two types of tables are employed in the data structures of RFC: the eq-class table (ET) and its corresponding index table (IT). Each IT and ET is associated with a certain projected rule set $\hat{C}_\theta$. Let $ET_\theta$ and $IT_\theta$ respectively denote the ET and IT associated with $\hat{C}_\theta$. $ET_\theta$ is a list that stores $E(\hat{C}_\theta)$, i.e., the eq-class set of $\hat{C}_\theta$. Each eq-class in $E(\hat{C}_\theta)$ is stored in an entry of $ET_\theta$ and is assigned an eqid which equals that entry's index. Because each eq-class has a unique cbm, only cbms need to be recorded in $ET_\theta$. $IT_\theta$ is a linear list of eqids that supports quick eq-class lookup in packet classification. As illustrated in Fig.1, the eqid and cbm of the eq-class corresponding to a given input can be directly obtained from $IT_\theta$ and $ET_\theta$. Fig.2 demonstrates the data flow (a reduction tree) of a typical IPv4 5-tuple classifier (src-ip, des-ip, prot, src-port, des-port), whose fields respectively denote the source IP address, destination IP address, transport layer protocol, source port and destination port in the packet header. RFC requires 32-bit fields (src-ip and des-ip) to be split into two 16-bit chunks [2]; thus 5-tuple rules are converted into 7-tuple rules.

Fig.2. Reduction tree of RFC for a typical IPv4 5-tuple classifier. Each IT is associated with a corresponding ET.

Preprocessing of RFC calculates $E(\hat{C})$, i.e., the eq-class set of $\hat{C}$. The procedure consists of several phases, as shown in Fig.3. In phase 0, for each field i ($1 \le i \le k$), $E(\hat{C}_{\{i\}})$, the eq-class set of the projected rule set $\hat{C}_{\{i\}}$, is calculated, and its corresponding $ET_{\{i\}}$ and $IT_{\{i\}}$ are created using the procedure in Fig.3(a). The results of previous phases are then used in the calculations of succeeding phases. In phase j ($j > 0$), the procedure in Fig.3(b) is used to calculate $E(\hat{C}_\theta)$ and its associated $ET_\theta$ and $IT_\theta$ based on tables $ET_{\theta_1}, ET_{\theta_2}, \ldots, ET_{\theta_n}$, where $\theta = \theta_1 \cup \theta_2 \cup \cdots \cup \theta_n$. Tables $ET_{\theta_1}, ET_{\theta_2}, \ldots, ET_{\theta_n}$ are outputs of previous phases, which are respectively associated with $E(\hat{C}_{\theta_1}), E(\hat{C}_{\theta_2}), \ldots, E(\hat{C}_{\theta_n})$. Taking Fig.2 as an example, $ET_{\{1,2\}}$ (as well as $IT_{\{1,2\}}$) is created based on $ET_{\{1\}}$ and $ET_{\{2\}}$ in phase 1; and in phase 2, $ET_{\{1,2,3,4\}}$ (and $IT_{\{1,2,3,4\}}$) is created based on $ET_{\{1,2\}}$ and $ET_{\{3,4\}}$. In Fig.3(b), the variable ptr is the $IT_\theta$ table index which points to the eqid value corresponding to the input $e_1, e_2, \ldots, e_n$:

$$ptr = e_1 \cdot |ET_{\theta_2}| \cdot |ET_{\theta_3}| \cdots |ET_{\theta_n}| + e_2 \cdot |ET_{\theta_3}| \cdot |ET_{\theta_4}| \cdots |ET_{\theta_n}| + \cdots + e_{n-1} \cdot |ET_{\theta_n}| + e_n,$$

in which $e_1 \in ET_{\theta_1}, e_2 \in ET_{\theta_2}, \ldots, e_n \in ET_{\theta_n}$ are the output eqid results calculated in previous phases.
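The phase-j merge and the ptr index formula above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the RFC implementation: class bitmaps are held as Python integers, a dictionary stands in for the linear search of the new ET, and the names `merge_phase` and `index_of` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def merge_phase(ets: Sequence[List[int]]) -> Tuple[List[int], List[int]]:
    """Combine the eq-class tables of earlier phases into a new (ET, IT) pair.

    Each input table is a list of class bitmaps (Python ints used as bit vectors).
    The IT is filled in the nested-loop order of the phase-j procedure.
    """
    et: List[int] = []            # new eq-class table: eqid -> class bitmap
    it: List[int] = []            # new index table: ptr -> eqid
    seen: Dict[int, int] = {}     # assumption: dict lookup replaces the linear ET search
    for eqids in product(*(range(len(table)) for table in ets)):
        bitmap = ets[0][eqids[0]]
        for table, e in zip(ets[1:], eqids[1:]):
            bitmap &= table[e]    # intersection of the source class bitmaps
        eqid = seen.get(bitmap)
        if eqid is None:          # bitmap not seen before: open a new eq-class
            eqid = len(et)
            et.append(bitmap)
            seen[bitmap] = eqid
        it.append(eqid)           # position in `it` is exactly ptr
    return et, it


def index_of(eqids: Sequence[int], sizes: Sequence[int]) -> int:
    """ptr for the input (e_1, ..., e_n), evaluated in Horner form."""
    ptr = 0
    for e, size in zip(eqids, sizes):
        ptr = ptr * size + e
    return ptr


et_a = [0b11, 0b01]               # toy ET with 2 eq-classes over 2 rules
et_b = [0b10, 0b01, 0b11]         # toy ET with 3 eq-classes over the same rules
et_ab, it_ab = merge_phase([et_a, et_b])
```

At classification time, the eqids read from the two source tables are combined with `index_of` and used to read one entry of `it_ab`, which is why each IT costs a single memory access per packet.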
When $IT_\Omega$ ($\Omega = \{1, 2, \ldots, k\}$) is created in the last phase, the data structures of RFC are completely constructed. Since ET tables are not involved in packet classifying, they are released after preprocessing.

(a)
1 for each e ∈ R_i do {
2   bit_vector := bv(Ĉ_{i}, e);
3   eqid := ET_{i}.search(bit_vector);
4   if (eqid = null) { // bit_vector not found in ET_{i}
5     eqid := ET_{i}.newEqClass();
6     ET_{i}[eqid] := bit_vector;
7     IT_{i}[e] := eqid;
8   } else {
9     IT_{i}[e] := eqid; } }

(b)
1 ptr := 0
2 for each e_1 ∈ ET_θ1 do
3   for each e_2 ∈ ET_θ2 do
      ...
4     for each e_n ∈ ET_θn do {
5       bit_vector := ET_θ1[e_1] & ET_θ2[e_2] & ... & ET_θn[e_n];
6       eqid := ET_θ.search(bit_vector);
7       if (eqid = null) { // bit_vector not found in ET_θ
8         eqid := ET_θ.newEqClass();
9         ET_θ[eqid] := bit_vector;
10        IT_θ[ptr] := eqid;
11      } else {
12        IT_θ[ptr] := eqid; }
13      ptr++; }

Fig.3. (a) Pseudocode for preprocessing phase 0 ($1 \le i \le k$). (b) Pseudocode for preprocessing phase j ($j > 0$).

When classifying packets, each IT on the reduction tree is accessed only once per packet; hence the number of memory accesses required by RFC is constant.

4 Proposed ERFC Algorithm

4.1 High-Speed Preprocessing

In the RFC algorithm, the major factors that affect preprocessing performance are: 1) the operations required to calculate the intersection of bit vectors in step 5 of Fig.3(b); 2) the operations required to search for a bit vector in an ET table and to compare two bit vectors, which are involved in step 3 of Fig.3(a) and step 6 of Fig.3(b). Let $w_p$ denote the memory width of the processor. In the Lucent bit vector scheme, about $N/w_p$ ($N = |\hat{C}|$) processor operations are required to calculate the intersection of two N-bit bit vectors. Similarly, it requires about $N/w_p$ processor operations to compare two bit vectors. Let Q denote the length of $ET_\theta$; it requires $O(Q)$ bit-vector comparisons to search for a bit vector in the linear list $ET_\theta$. Therefore, the time complexity of searching for a bit vector in $ET_\theta$ equals $O(Q \cdot N/w_p)$. It should be noted that Q increases during the calculation, and $Q_{max} = |E(\hat{C}_\theta)|$.

In order to improve the performance of comparing bit vectors and computing the intersection (bitwise AND) of bit vectors, a new hash-based aggregated bit vector structure, which combines the advantages of ABV [4] and hash schemes, is proposed for the preprocessing of ERFC. In this scheme, ABV is exploited to speed up the intersection calculation of bit vectors, and a hash-based algorithm is used to improve the performance of bit vector comparing and searching.

In the proposed scheme, the principle of ABV is as follows. Let $X = x_0 x_1 x_2 \ldots x_{N-1}$ denote the original bit vector (N bits). Let A denote the aggregate size. Bit vector X is divided into t chunks ($t = \lceil N/A \rceil$): $X[0] = x_0 x_1 \ldots x_{A-1}$, $X[1] = x_A x_{A+1} \ldots x_{2A-1}$, ..., $X[t-1] = x_{(t-1)A} x_{(t-1)A+1} \ldots x_{tA-1}$. The size of each chunk is A bits (the last chunk $X[t-1]$ may need some padding bits of 0 to ensure that its size equals A). Let $Y = y_0 y_1 y_2 \ldots y_{t-1}$ denote the t-bit aggregated bit vector of X. The i-th bit of Y is the aggregation of chunk $X[i]$: $y_i$ is set to the bitwise OR of all bits in $X[i]$, i.e.,

$$y_i = \operatorname*{OR}_{j = i \cdot A}^{(i+1) \cdot A - 1} x_j.$$

The same aggregating process is repeated on bit vector Y, until finally the bit vector is aggregated into a single word (the size of which is no more than A bits) and a multiple-level tree is constructed. The tree root represents the top-level aggregated bits. Fig.4 demonstrates the structure of a 2-level aggregated bit vector (abv) with $A = 4$. The maximum rule set size supported by the 2-level abv is $N_{max} = A^3$.

Fig.4. 2-level aggregated bit vector (abv) with A = 4, which can support a rule set of maximum N = 64.

An aggregated bit equal to zero implies that all its corresponding descendant chunks in the tree equal 0. For example, in Fig.4, $z_3 = 0$ indicates that $Y[3]$ and $X[12]$ to $X[15]$ are all zero. These zero chunks need not be checked when calculating the intersection of bit vectors. That is, branches with zero chunks (such as the dashed-line parts in Fig.4) can be pruned from the abv tree structure. Hence, the ABV scheme can speed up the calculation of bit vector intersections (as well as comparisons). Fig.5 presents the procedure for computing the intersection of 2-level aggregated bit vectors.

// let tabv denote the type definition of the ABV structure
tabv Intersection(tabv abv_1, tabv abv_2, ..., tabv abv_n) {
  tabv result; // variable to store the result abv
  // calculate aggregated bits in level 1 of the abv tree
  result.Z := abv_1.Z & abv_2.Z & ... & abv_n.Z;
  for i := 0 to ⌈N/A^2⌉ − 1 step 1 do {
    if (the i-th bit z_i of result.Z equals 0) continue;
    // otherwise, calculate aggregated bit chunk Y[i] of level 2
    result.Y[i] := abv_1.Y[i] & abv_2.Y[i] & ... & abv_n.Y[i];
    for j := 0 to A − 1 step 1 do {
      if (the j-th bit of result.Y[i] equals 0) continue;
      // otherwise, calculate chunk X[i·A + j] of original bit vector
      result.X[i·A + j] := abv_1.X[i·A + j] & abv_2.X[i·A + j] & ... & abv_n.X[i·A + j];
      if (result.X[i·A + j] = 0) set j-th bit of result.Y[i] to 0;
    }
    if (result.Y[i] = 0) set i-th bit z_i of result.Z to 0;
  }
  return result;
}

Fig.5. Pseudocode for computing the intersection of n 2-level aggregated bit vectors (abv_1, abv_2, ..., abv_n). The size of the original bit vector is N bits, and the aggregate size equals A bits.

Let $n_{abv}$ be the number of processor operations required to calculate the intersection (or comparison) of 2 abvs. In this paper, we choose $A = w_p$. For the 2-level ABV scheme, $n_{abv} = \lceil N/w_p \rceil + \lceil N/w_p^2 \rceil + 1$ in the worst case, which is slightly higher than in the BV scheme. In addition, ABV also introduces some extra costs in calculating aggregated bits. However, when N becomes larger, the average performance of the ABV scheme improves, and $n_{abv}$ is expected to be less than $N/w_p$. The evaluation results in [4] indicate that the ABV scheme outperforms the BV scheme by an order of magnitude on both industrial firewall and synthetic rule sets. In the proposed ERFC, the evaluation results show that, using the hash-based ABV scheme, the average preprocessing time is reduced by about 25% and 50% (when N = 10000 and 30000 respectively) in comparison with the hash scheme without ABV.

In addition to ABV, a hash scheme is also introduced into the ET table structure to speed up the bit vector comparing and searching operations in the preprocessing of ERFC. The new hash-based ABV data structure for the ET table is shown in Fig.6.

Fig.6. ET table structure in the proposed ERFC.

In this paper we assume $A = w_p = 32$. Each bit vector (i.e., the cbm of an eq-class) in the original ET is replaced with an aggregated bit vector (abv). In addition, two short fixed-length attributes, r and d, are attached to each abv. The abv and its attributes r and d compose a new data structure called the extended bit vector (ebv). The data structure of the ET table is also extended in ERFC: each entry of ET stores the ebv of its corresponding eq-class. The values of r and d are defined by $r = h_1(abv)$ and $d = h_2(abv)$, where abv is the bitmap value in the ebv, and $h_1()$ and $h_2()$ are two independent hash functions. When creating or calculating an abv, r and d are calculated simultaneously. Additional code for calculating r and d of an ebv is required in step 2 of Fig.3(a) and in the procedure in Fig.5. In order to speed up ebv searching in the ET table, a hash table $HT_\theta$ is constructed for each $ET_\theta$. In this paper, we adopt a simple linked-list hash table scheme. Assume that the length of $HT_\theta$ is L and that each slot of $HT_\theta$ contains a pointer to a linked list of eqids. When inserting a new ebv into $ET_\theta$, its eqid is inserted into the linked list in $HT_\theta[i]$, the i-th slot of $HT_\theta$, where $i = ebv.r \bmod L$. This operation takes only $O(1)$ time.
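The ebv entry and the hash-table insertion described above can be sketched as follows. This is a rough illustration under several assumptions, not the authors' implementation: the class bitmap is kept as a plain Python integer instead of an abv tree, `h1` and `h2` are hypothetical stand-ins for the two hash functions, the number of slots L is fixed, and the lookup side is included only to show what the stored r and d are used for.

```python
from dataclasses import dataclass
from typing import List, Optional

MASK32 = 0xFFFFFFFF  # assumes w_p = 32


def h1(bits: int) -> int:
    """Hypothetical h1: XOR-fold the bitmap 32 bits at a time."""
    r = 0
    while bits:
        r ^= bits & MASK32
        bits >>= 32
    return r


def h2(bits: int) -> int:
    """Hypothetical h2: a multiplicative fold, chosen only to differ from h1."""
    d = 0
    while bits:
        d = ((d ^ (bits & MASK32)) * 0x9E3779B1) & MASK32
        bits >>= 32
    return d


@dataclass(frozen=True)
class Ebv:
    bits: int  # stands in for the abv
    r: int     # selects the hash-table slot
    d: int     # digest compared before the full bitmap


def make_ebv(bits: int) -> Ebv:
    return Ebv(bits, h1(bits), h2(bits))


class EqTable:
    """ET with an attached chained hash table HT (fixed L in this sketch)."""

    def __init__(self, num_slots: int = 8) -> None:
        self.entries: List[Ebv] = []                                   # eqid -> ebv
        self.slots: List[List[int]] = [[] for _ in range(num_slots)]   # HT chains of eqids

    def insert(self, ebv: Ebv) -> int:
        eqid = len(self.entries)
        self.entries.append(ebv)
        self.slots[ebv.r % len(self.slots)].append(eqid)  # slot index i = r mod L
        return eqid

    def search(self, ebv: Ebv) -> Optional[int]:
        for eqid in self.slots[ebv.r % len(self.slots)]:
            other = self.entries[eqid]
            if other.d == ebv.d and other.bits == ebv.bits:  # digest first, bitmap second
                return eqid
        return None

    def lookup_or_insert(self, bits: int) -> int:
        ebv = make_ebv(bits)
        found = self.search(ebv)
        return self.insert(ebv) if found is None else found
```

Because r is stored with every entry, a table like this could be rebuilt with more slots without rehashing the bitmaps; that resizing step is omitted here.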
When searching for a given ebv in $ET_\theta$, first $i = ebv.r \bmod L$ is calculated; then all eqids in the linked list $HT_\theta[i]$ are sequentially checked. Each corresponding ebv in the linked list is compared with the given ebv. With an appropriate hash function $h_1()$, only $O(Q/L)$ ebv comparisons are required when searching for an ebv. Since Q is monotonically increasing during preprocessing, we adjust L dynamically to avoid performance decline. Choosing a constant $\lambda$, if an increment of Q leads to $Q/L > \lambda$, L is doubled (i.e., L := 2L) and $HT_\theta$ is reconstructed. Since $Q/L$ is always less than $\lambda$, the number of ebv comparisons required in searching is almost a constant, $O(\lambda)$. While calculating $ET_\theta$, about $\log_2(2\lambda Q_{max}/L_0)$ reconstructions are required, where $L_0$ is the initial value of L. Because the value of r of each ebv is stored in $ET_\theta$, $HT_\theta$ can be reconstructed quickly.

The attribute d in an ebv is a digest of the abv, and we assume its word size is $w_p$. When comparing $ebv_1$ and $ebv_2$, their digests are compared first: 1) if $ebv_1.d \ne ebv_2.d$, we can assert that $ebv_1 \ne ebv_2$, and no additional comparisons are needed; 2) if $ebv_1.d = ebv_2.d$, their abv fields need to be compared. Let $n_c$ be the number of processor operations required to compare 2 ebvs. We now analyze the complexity of ebv comparison. In the case of $ebv_1 = ebv_2$, both the d and abv fields must be compared. Hence, $n_c$ equals $1 + n_{abv}$, where $n_{abv}$ is the number of operations needed to compare two abvs. In the case of $ebv_1 \ne ebv_2$, if $ebv_1.d = ebv_2.d$, then $n_c = 1 + n_{abv}$; otherwise, if $ebv_1.d \ne ebv_2.d$, only 1 processor operation is required to compare the digests, i.e., $n_c = 1$. Assuming $h_2()$ is nearly a perfect hash function, the probability of a collision ($ebv_1 \ne ebv_2$ but $ebv_1.d = ebv_2.d$) is about $2^{-w_p}$. The expectation of $n_c$ in the case of $ebv_1 \ne ebv_2$ equals:

$$(1 + n_{abv}) \cdot 2^{-w_p} + 1 \cdot (1 - 2^{-w_p}) = n_{abv} \cdot 2^{-w_p} + 1.$$

Since $n_{abv}$ is expected to be less than $N/w_p$, when $w_p = 32$ and $N \le 10^6$, $n_c$ is expected to be very close to 1. Considering that each eq-class in $ET_\theta$ has a unique cbm, there exists at most 1 ebv that equals the given ebv. So it is expected that, when searching in $ET_\theta$, the cases of $ebv_1 \ne ebv_2$ outnumber the cases of $ebv_1 = ebv_2$. Hence, a large number of bit vector comparisons are replaced with comparisons of digests, which improves the preprocessing performance enormously.

Such an improvement is achieved at the expense of a storage increment: the ET table is extended to hold ebvs, and an extra HT is needed for each ET. When N is large, the storage increment of the ebv is not very significant. According to our proposed hash table mechanism, the maximal size of $HT_\theta$ is about $2\lambda Q_{max}$, and the elements in $HT_\theta$ are eqids (indexes of ebvs), so the memory size of $HT_\theta$ is less than that of its corresponding $ET_\theta$. Moreover, since ET and HT are released after preprocessing, no additional storage is required in the classifying stage. Evaluation results show that, choosing simple XOR-and-SHIFT functions as $h_1()$ and $h_2()$, the proposed scheme provides at least an order of magnitude improvement in preprocessing performance over the original RFC.

4.2 Storage Optimization

The storage requirement of RFC is the total memory of all IT tables created in preprocessing. The length of each IT is determined by the sizes of the ETs created in previous phases: when merging $ET_\mu$ and $ET_\rho$ to create $IT_\theta$ and $ET_\theta$ ($\theta = \mu \cup \rho$), the length of the resulting $IT_\theta$ equals $|E(\hat{C}_\mu)| \times |E(\hat{C}_\rho)|$. Each IT is a linear list of eqids. Theoretically, it requires $w_e = \lceil \log_2 Q_\theta \rceil$ bits to represent an eqid in $IT_\theta$, where $Q_\theta$ is the length of table $ET_\theta$. Let $B_\theta$ denote the length of $IT_\theta$ (generally, $B_\theta \gg Q_\theta$); $IT_\theta$ requires $B_\theta \cdot w_e$ bits. For convenience of calculation, $w_e = 16$ or 32 is chosen in usual implementations.

We have investigated a large number of rule sets and discovered that the distinct eqid values in an IT are generally sparsely distributed. Let b be a constant which is much less than $B_\theta$. An $IT_\theta$ can be divided into $\lceil B_\theta/b \rceil$ fixed-length blocks, each containing b eqids (padding is probably required for the last block). Let $\delta$ denote the number of distinct (unique) eqid values in a block. Fig.7 illustrates the distributions of $\delta$ in the IT tables of 72 different rule sets (N = 100 to 30000), which indicate that $\delta$ is very likely to be much less than b, and that $\delta$ equals 1 in more than 40% of blocks.

Fig.7. Distributions of $\delta$ in IT tables of 72 test rule sets (N = 100 to 30000). $P(x)$ is the probability of $\delta \le x$.

The statistical characteristics of $\delta$ enable us to propose a heuristic design to reduce the size of IT. Let D denote a block in IT (its length equals b), and let D contain $\delta$ distinct eqid values $e_0, e_1, \ldots, e_{\delta-1}$. Let $d_i$ ($0 \le i < b$) denote the i-th item in D. Fig.8 illustrates an approach

to compress the original block D: 1) create a linear list V of $\delta$ items, and let $v_j = e_j$, where $v_j$ is the j-th item in V ($0 \le j < \delta$); 2) create another linear list U of b items, where the value of each item $u_i$ ($0 \le i < b$) in U is determined by $d_i$: if $d_i$ equals a certain eqid value $v_j$ ($0 \le j < \delta$), then $u_i$ is set to j, i.e., $u_i$ equals the index in V of the value $d_i$. V requires $\delta \cdot w_e$ bits of storage to record $\delta$ distinct eqid values, and U requires $\lceil \log_2 \delta \rceil$ bits for each index into V. Therefore, the total storage of U and V is $b \lceil \log_2 \delta \rceil + \delta \cdot w_e$ bits.

Fig.8. Compressing memory size of block D.

Let $g(\delta, b) = (b \lceil \log_2 \delta \rceil + \delta \cdot w_e)/(b \cdot w_e)$, which is the storage compression ratio of the U/V structure to the original D. For $b \le 256$ and $w_e = 16$, $g(\delta, b)$ is less than 1 if $\delta < b/2$ and is always less than 0.625 if $\delta < b/4$. Considering the distributions of $\delta$, the storage decreases if D is replaced with the U/V structure.

A new data structure is proposed to replace the original IT table, as shown in Fig.9. Assume that the length of IT is B and the block size is b. The original IT is partitioned into $\lceil B/b \rceil$ blocks ($D_1, D_2, \ldots$). A constant $\delta_0$ is defined (in this paper, $\delta_0 = b/4$).

Fig.9. Proposed data structure for IT tables, in which $w_h$ is the overhead introduced by fields in JT[i].

A jump table (JT) is introduced, which contains $\lceil B/b \rceil$ entries, and each JT[i] corresponds to a block $D_i$. JT introduces a small overhead for each $D_i$. JT[i] consists of 3 fields: a) JT[i].δ, an integer field that stores the $\delta$ value of $D_i$; b) JT[i].ptrV, a pointer which records the address of the V structure if $D_i$ is compressed; c) JT[i].ptrU, a pointer which records the address of the U structure if $D_i$ is compressed. Since items in IT are calculated in sequence, the data structures for the blocks $D_i$ are also constructed in sequence. The approach is:

1) Allocate a block of memory for block $D_i$, and calculate all eqid values in $D_i$.
2) Store the value of $\delta$ (the number of unique eqid values in $D_i$) into the JT[i].δ field.
3) If $\delta = 1$, which means all eqids in $D_i$ equal the same value e, store e into the memory location for JT[i].ptrU and JT[i].ptrV, and release the memory for block $D_i$.
4) If $1 < \delta \le \delta_0$, $D_i$ can be compressed. Construct the U/V structures using the method in Fig.8. The addresses of U and V are stored respectively into JT[i].ptrU and JT[i].ptrV, and the memory for block $D_i$ is released.
5) If $\delta > \delta_0$, no compression is performed; simply set JT[i].δ to 0 and store the address of the original block $D_i$ directly into the JT[i].ptrU field.

The storage size (in bits) required by block $D_i$ is:

$$S(D_i) = \begin{cases} w_h, & \text{if } \delta = 1, \\ b \lceil \log_2 \delta \rceil + \delta \cdot w_e + w_h, & \text{if } 1 < \delta \le \delta_0, \\ b \cdot w_e + w_h, & \text{if } \delta > \delta_0. \end{cases}$$

The storage size for $D_i$ is always reduced except when $\delta > \delta_0$ (in which case it is increased by $w_h$ bits). According to the distribution of $\delta$ (see Fig.7), the probability of $\delta \le \delta_0$ is high, so the total storage size is expected to be considerably reduced. If b = 128, the expectation of $S(D_i)$ is 270 bits (assuming $w_e$ = 16 bits, $w_h$ = 64 bits and $\delta_0$ = 32), which is a decrease of about 86% in comparison with the original size of block $D_i$ ($b \cdot w_e$ = 2048 bits). In addition, observations reveal that the total size of the jump table JT and the V structures is relatively small. Therefore, JT and the V structures can be loaded into high-speed cache memory to speed up searching operations.

4.3 Search Operation

When classifying a packet, each IT table is searched to find the eqid corresponding to a given input. In the original RFC, searching in each IT table requires only 1 memory access. Because caches are introduced in the proposed scheme, the number of memory accesses is expected to be reduced. For a given input a, the following steps are required to perform a search in an IT table:

1) Let $i = \lfloor a/b \rfloor$, which is the index (in the jump table) of the block that input a belongs to.
2) If JT[i].δ = 1, directly return the eqid value stored in the pointer fields of entry JT[i]. In this case, only 1 cache access is required.
3) Otherwise, calculate $j = a \bmod b$, where j is the position of input a in the block $D_i$.
4) If JT[i].δ = 0, which means JT[i].ptrU points to an uncompressed block, return the j-th item of the block as the resulting eqid. In this case, 1 cache access is required for accessing JT and 1 memory access for accessing the uncompressed block structure of $D_i$.
5) If JT[i].δ > 1, a compressed U/V structure is associated with this block. First, the j-th item is read from U, and then this value is used as an index into V to obtain the final eqid. In this case, only 1 memory access is required, for accessing U. In addition, 2 cache accesses are required: one for accessing JT and one for accessing V.

We assume that, in the classifier implementation, the cache works at the same speed as the processor, i.e., the time for cache accesses can be ignored in comparison with memory accesses. Let $n_{ma}$ denote the number of memory accesses required to perform a search in an IT. In ERFC, the upper bound (in the worst case) of $n_{ma}$ is a constant: $n_{ma} = 1$. Since there is a considerable probability of $\delta = 1$, the expectation of $n_{ma}$ is less than 1. For instance, in Fig.7, when b = 128, the probability of $\delta = 1$ is 47.1%, so the expectation of $n_{ma}$ is about 0.529.

When classifying a packet, each IT table in the reduction tree is searched once. Assume the reduction tree in Fig.2 is used. In the worst case, 13 memory accesses are required by ERFC to classify a packet, which is the same as for RFC. The classifying complexity of ERFC is equivalent to that of RFC, i.e., $O(k)$.
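The block compression and the lookup steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' data layout: Python lists and tuples stand in for the memory blocks and pointers, a jump-table entry is a tuple (δ, ptrU, ptrV), the last block is left short instead of padded, and the names `build_jump_table` and `lookup` are hypothetical.

```python
from typing import Any, List, Sequence, Tuple

JumpEntry = Tuple[int, Any, Any]  # (delta, ptrU, ptrV); plain objects stand in for pointers


def build_jump_table(it: Sequence[int], b: int, delta0: int) -> List[JumpEntry]:
    """Split an index table into blocks of b eqids and compress each block."""
    jt: List[JumpEntry] = []
    for start in range(0, len(it), b):
        block = list(it[start:start + b])
        v = list(dict.fromkeys(block))       # distinct eqids in order of first appearance
        delta = len(v)
        if delta == 1:
            jt.append((1, v[0], None))       # eqid stored directly in the entry
        elif delta <= delta0:
            position = {eqid: j for j, eqid in enumerate(v)}
            u = [position[eqid] for eqid in block]   # small indexes into V
            jt.append((delta, u, v))
        else:
            jt.append((0, block, None))      # delta field 0 marks an uncompressed block
    return jt


def lookup(jt: Sequence[JumpEntry], b: int, a: int) -> int:
    """Return the eqid for input a, following the search steps in the text."""
    delta, ptr_u, ptr_v = jt[a // b]
    if delta == 1:
        return ptr_u                         # answered from the jump table alone
    j = a % b
    if delta == 0:
        return ptr_u[j]                      # one read of the uncompressed block
    return ptr_v[ptr_u[j]]                   # one read of U, then V


toy_it = [7, 7, 7, 7, 3, 9, 3, 9, 1, 2, 3, 4]
toy_jt = build_jump_table(toy_it, b=4, delta0=2)
```

In this toy table the first block collapses to a single stored eqid, the second becomes a U/V pair, and the third stays uncompressed, so all three lookup paths are exercised.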
Moreover, since caching is introduced in ERFC, its average packet classification performance is expected to be greatly improved in comparison with RFC.

5 Performance Evaluations

To evaluate the performance of the proposed ERFC, typical IPv4 5-tuple rule sets are used. ERFC and 3 other classification algorithms, RFC, HiCuts and HyperCuts, are implemented in C/C++. The implementations of RFC, HiCuts and HyperCuts are based on source code from the Web page Evaluation of Packet Classification Algorithms (http://www.arl.wustl.edu/~hs1/pclasseval.html). The evaluation metrics are storage requirements, preprocessing time and classifying speed. In the implementation of ERFC, the block size b is determined by the size B of the IT table:

$$b = \begin{cases} 64, & \text{if } B < 64\,\text{K}, \\ 128, & \text{if } 64\,\text{K} \le B < 4\,\text{M}, \\ 256, & \text{if } B > 4\,\text{M}. \end{cases}$$

For the implementations of HiCuts and HyperCuts, binth = 8 and spfac = 2.0 are used in the evaluation.

5.1 Test Rule Set

A heuristic algorithm may exhibit very different performance on different rule sets (databases), because its time and storage complexity is generally affected by characteristics of the rule set, such as: 1) the number of rules and fields, i.e., N and k; 2) statistical characteristics of IP addresses, transport layer protocols and port numbers; 3) correlation between different fields; 4) conflicts in the rule set. To perform a thorough evaluation, the proposed algorithm is tested on rule sets from diverse application scenarios. Rule databases from real-life applications are undoubtedly good test vectors, but they are hard to obtain for confidentiality reasons. In addition, real-life databases are usually not large enough to test the scalability of an algorithm. Therefore, it is necessary to employ synthetic rule sets. Synthetic databases should have characteristics similar to those of real-life databases.
In many recent related studies [2,4,10-15], characteristics of diverse real-life rule sets have been analyzed, and approaches to generating test rule sets have been proposed. D. Taylor and J. Turner proposed a methodology for developing metrics and characterizations of rule set structure that aid in generating synthetic rule sets [15]. In that work, 12 real-life rule sets from ISPs and network equipment vendors are analyzed and categorized into 3 types: 1) Access Control List (ACL), which is used in backbone, edge or enterprise equipment for purposes of access control, VPN, NAT, etc.; 2) Firewall, which specifies security rules in firewalls; 3) IP Chain, which is used in software-based systems for security, VPN, etc.

In this paper, we adopt this methodology and the ClassBench [15] tool to generate test rule sets that emulate different application scenarios. The ClassBench tool provides 12 seeds for generating test rule sets: 5 seeds for ACL (named acl1 to acl5), 5 seeds for firewall (fw1 to fw5) and 2 seeds for IP Chain (ipc1, ipc2). These 12 seeds are used to generate 12 groups of test rule sets. Each group contains 6 synthetic rule sets of
different sizes (100 to 30 000), all of which are generated from the same seed. The characteristics of the rule set groups are summarized in Table 1.

Table 1. Characteristics of 12 Test Rule Set Groups

| Group No. | Seed | Application Type | Size (N) |
|---|---|---|---|
| 1-5 | acl1 to acl5 | ACL | 100, 500, 1000, 5000, 10 000, 30 000 |
| 6-10 | fw1 to fw5 | Firewall | 100, 500, 1000, 5000, 10 000, 30 000 |
| 11-12 | ipc1, ipc2 | IP Chain | 100, 500, 1000, 5000, 10 000, 30 000 |

5.2 Evaluation of Storage Requirements

Let $MS_{req}$ denote the memory space required by a classification algorithm. In this evaluation, $MS_{req}$ of ERFC and the other 3 algorithms is measured on 72 rule sets, and the results are presented in Fig.10. In general, for all 4 algorithms, $MS_{req}$ increases with N, and the application type also influences $MS_{req}$. Fig.10 shows that ERFC provides the best space performance among the 4 algorithms. On ACL and Firewall rule sets, $MS_{req}$ of ERFC is always less than 2 MB, even when N grows to 30 000. On the rule set of N = 30 000 in the ipc1 group, ERFC requires 14 MB of memory, whereas RFC and HiCuts require 165 MB and 151 MB respectively. Compared with RFC, ERFC reduces $MS_{req}$ by 55% to 98% in all cases. Moreover, on 89% of the 72 test rule sets, $MS_{req}$ decreases by at least 80%; and on 70% of the rule sets, it decreases by more than 90%. When N is small (N ≤ 1000), $MS_{req}$ of ERFC is very close to, or in many cases less than, that of HiCuts and HyperCuts. When N grows above 5000, HiCuts and HyperCuts clearly require much more memory than ERFC.

Fig.10. Storage requirements of 4 algorithms on 12 groups of test rule sets.

ERFC introduces cache memory to accelerate the search process. The cache memory sizes required by ERFC on different rule sets are shown in Fig.11, which indicates that the cache size increases with N. On all rule sets except those in the ipc1 group, the cache size is below 1 MB when N varies between 100 and 30 000. For the rule set of N = 30 000 in the ipc1 group, the maximum cache size is about 4 MB.

Fig.11. Cache size required by ERFC on 12 groups of test rule sets.

5.3 Evaluation of Classifying Speed

We evaluated the classifying speed of the 4 algorithms using a method similar to that in [6-7, 14]. The number of memory accesses needed to classify a packet is the major metric of this evaluation. Let $MA_{worst}$ denote
the number of memory accesses per classification in the worst case, and let $MA_{average}$ denote the number of memory accesses per classification in the average case. Fig.12 illustrates the $MA_{worst}$ values of ERFC, RFC, HiCuts and HyperCuts on the 12 groups of rule sets, and Fig.13 shows the $MA_{average}$ values of the 4 algorithms. The results show that, for RFC and ERFC, the influence of N and of the application type is very small. In contrast, HiCuts and HyperCuts tend to require many more memory accesses than ERFC and RFC to classify a packet as N grows. In the worst case, ERFC requires a constant 13 memory accesses to classify a packet, which is the same as RFC. In the average case, $MA_{average}$ of ERFC decreases to between 6.0 and 7.4. In comparison with RFC, ERFC reduces $MA_{average}$ by about 43% to 54%, which means that the classifying speed of ERFC is improved by about 75% to 116%. These figures are consistent with a baseline of 13 accesses per packet for RFC: 13/7.4 ≈ 1.76 and 13/6.0 ≈ 2.17.

Fig.12. Number of memory accesses per classification (in the worst case) of 4 algorithms on 12 groups of test rule sets. One memory access is one 32-bit word.

Fig.13. Number of memory accesses per classification (in the average case) of 4 algorithms on 12 groups of rule sets. One memory access is one 32-bit word.

5.4 Evaluation of Preprocessing Time

In this evaluation, the preprocessing time of the 4 algorithms is measured on relatively large rule sets (N ≥ 1000). The evaluation is performed in user mode (not kernel mode) of Windows XP, and the platform is
a workstation with a 3.2 GHz Intel Xeon processor, 512 KB cache and 2 GB memory. The results are presented in Fig.14, in which $T_p$ denotes the preprocessing time required by a classification algorithm. On ACL and Firewall rule sets, the maximum $T_p$ of ERFC is about 1 second when N ≤ 10 000; and even when N = 30 000, $T_p$ of ERFC is less than 5.6 seconds. On the rule set in the ipc1 group, the maximum $T_p$ of ERFC is 90 seconds when N increases to 30 000. Fig.14 indicates that the preprocessing time of ERFC is reduced enormously in comparison with RFC: it decreases by at least 90% when N > 1000. On the ipc1 rule set of N = 30 000, RFC consumes approximately 4.88 hours for preprocessing, whereas ERFC requires only 1.5 minutes. In addition, the results in Fig.14 also reveal that, on most rule sets, ERFC yields a $T_p$ value that equals, or is very close to, the minimum $T_p$ value among the 4 algorithms.

Fig.14. Preprocessing time of 4 algorithms on 12 groups of test rule sets.

6 Conclusions

In this paper, we propose ERFC, an enhanced high-speed packet classification algorithm based on RFC. In the proposed ERFC, a hash-based ABV preprocessing scheme is exploited, which greatly reduces the preprocessing complexity. A compressed data structure is introduced, which reduces the total memory requirement of the algorithm. The average classifying speed is also improved by introducing an appropriate amount of cache memory. The evaluation results show that ERFC provides at least an order of magnitude improvement over RFC in preprocessing time when N > 1000, and reduces the storage requirement by about 55% to 98%. The worst-case time complexity of ERFC for packet classification is equivalent to that of RFC, but its average performance is greatly improved. Future work should focus on the study of the ERFC algorithm in IPv6 networks.

References

[1] Gupta P, McKeown N. Algorithms for packet classification.
IEEE Network, Special Issue, Mar. 2001, 15(2): 24-32.
[2] Gupta P, McKeown N. Packet classification on multiple fields. ACM SIGCOMM Computer Communication Review, Oct. 1999, 29(4): 147-160.
[3] Lakshman T V, Stiliadis D. High-speed policy-based packet forwarding using efficient multi-dimensional range matching. ACM SIGCOMM Computer Communication Review, Oct. 1998, 28(4): 203-214.
[4] Baboescu F, Varghese G. Scalable packet classification. IEEE/ACM Transactions on Networking, Feb. 2005, 13(1): 2-14.
[5] Srinivasan V, Suri S, Varghese G. Packet classification using tuple space search. ACM SIGCOMM Computer Communication Review, Oct. 1999, 29(4): 135-146.
[6] Baboescu F, Singh S, Varghese G. Packet classification for core routers: Is there an alternative to CAMs? In Proc. IEEE INFOCOM 2003, San Francisco, USA, Mar. 30-Apr. 3, 2003, Vol.1, pp.53-63.
[7] Srinivasan V, Suri S, Varghese G, Waldvogel M. Fast and scalable layer four switching. ACM SIGCOMM Computer Communication Review, Oct. 1998, 28(4): 191-202.
[8] Xu K, Wu J, Yu Z, Xu M. A non-collision hash Trie-Tree based fast IP classification algorithm. Journal of Computer Science and Technology, 2002, 17(2): 219-226.
[9] Gupta P, McKeown N. Packet classification using hierarchical intelligent cuttings. IEEE Micro, Jan. 2000, 20(1): 34-41.
[10] Singh S, Baboescu F, Varghese G, Wang J. Packet classification using multidimensional cutting. In Proc. the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (ACM SIGCOMM), Karlsruhe, Germany, Aug. 2003, pp.213-224.
[11] Hari A, Suri S, Parulkar G. Detecting and resolving packet filter conflicts. In Proc. IEEE INFOCOM 2000, Tel-Aviv, Israel, Mar. 26-30, 2000, Vol.3, pp.1203-1212.
[12] Baboescu F, Varghese G. Fast and scalable conflict detection for packet classifiers. Computer Networks, Aug. 2003, 42(6): 717-735.
[13] Feldmann A, Muthukrishnan S. Tradeoffs for packet classification. In Proc. IEEE INFOCOM 2000, Tel-Aviv, Israel, Mar. 26-30, 2000, Vol.3, pp.1193-1202.
[14] Woo T Y C. A modular approach to packet classification: Algorithms and results. In Proc. IEEE INFOCOM 2000, Tel-Aviv, Israel, Mar. 26-30, 2000, Vol.3, pp.1213-1222.
[15] Taylor D E, Turner J S. ClassBench: A packet classification benchmark. IEEE/ACM Transactions on Networking, Jun. 2007, 15(3): 499-511.

Xiang-Yang Gong received his Master's degree in computer science in 1995 from Xi'an Jiaotong University. In 1995, he joined Beijing University of Posts and Telecommunications (BUPT). He is an associate professor of the State Key Lab of Networking and Switching Technology at BUPT. His research interests are IP QoS, network security, advanced networking and switching technologies and novel network architectures.

Wen-Dong Wang received his Master's degree in computer science in 1991 from BUPT. He is a professor of the State Key Lab of Networking and Switching Technology at BUPT. His research interests are QoS control and management, novel network architectures, and next generation Internet. He is a senior member of CCF.

Shi-Duan Cheng is a professor of the State Key Lab of Networking and Switching Technology at BUPT. From 1984 to 1987 and again in 1994, she was a visiting scholar at Alcatel Bell, Belgium. From 1992 to 1999 she was the head of the Switching and Networking Expert Group in the 863 programs of China. Her research interests cover traffic engineering, network performance and QoS of broadband networks and the Internet. Currently she is concentrating on the architecture of the next generation Internet.