A Packet Header Compression Algorithm based on TCP Flow Clustering and Huffman Encoding

Anais do XXVI Congresso da SBC, SEMISH: XXXIII Seminário Integrado de Software e Hardware, July 14 to 20, 2006, Campo Grande, MS

Raimir Holanda
Universidade de Fortaleza (UNIFOR), Mestrado em Informatica Aplicada, Fortaleza, CE, Brazil
raimir@unifor.br

Abstract. Packet traces of Internet traffic are important tools for the performance evaluation and design of many network elements. For instance, packet traces can be used to evaluate prototypes of network systems such as routers, firewalls, and Internet servers. However, as link rates increase, the storage required for packet header traces of meaningful duration becomes too large. This paper focuses on the problem of compressing these potentially huge packet header traces. Previous compression methods were developed to save transmission bandwidth on channels such as wireless and slow point-to-point links. Here, we propose a combined packet header compression method based on TCP flow clustering for small flows and Huffman encoding for large flows. With our proposed method, storage requirements for .tsh packet header traces are reduced to 16% of their original size. Other known methods have compression ratios bounded at 50% and 32%.

1. Introduction

A critical requirement for the performance evaluation and design of network elements is the availability of realistic traffic traces. A popular scheme to obtain real traces over extended periods of time is to collect them from routers [CAIDA]. There are, however, several reasons that in many cases make it difficult to have access to them. Firstly, Internet providers are usually reluctant to make public the real traces captured in their networks.
Moreover, when these traffic traces are made public [NLANR], they are delivered after some transformations, such as sanitization [Pang and Paxson 2003], which modify some basic semantic properties (such as the IP address structure). Secondly, other problems arise from the increasing speed of Internet routers. Hardware for collecting traces at high speed (e.g., at link rates of 2.5 Gbps, 10 Gbps or even 40 Gbps) is usually expensive. Moreover, as link rates increase, the storage required for packet traces of meaningful duration becomes too large.

A first approach to cope with the huge storage needs is to use a standard compression method. Content compression can be as simple as removing all extra space characters, inserting a single repeat character to indicate a string of repeated characters, and substituting smaller bit strings for frequently occurring characters. The compression is performed by algorithms which determine how to compress and decompress. Some of the most popular compression algorithms are Huffman coding [Knuth 1985], LZ77 [LZ77], and Deflate [Deflate]. Those specifications define lossless compressed data formats. From our measurements, using these methods on files containing packet headers, we can expect a compression ratio of around 50%.
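As a rough illustration of how such a general-purpose compression ratio can be measured, the following sketch applies Deflate through Python's standard zlib module. The helper names and the synthetic 44-byte records are hypothetical and are not taken from the paper; synthetic data of this kind is far more regular than a real header trace, so the printed ratio should not be compared with the 50% figure reported above.

```python
import struct
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    if not data:
        raise ValueError("empty input")
    return len(zlib.compress(data, level)) / len(data)


def synthetic_headers(n: int) -> bytes:
    """Build n fake 44-byte records (illustrative only, not real TSH data)."""
    records = []
    for i in range(n):
        # 4-byte counter, 4-byte pseudo-timestamp, then 36 constant bytes.
        records.append(struct.pack(">II", i, 1_000_000 + 3 * i) + b"\x45" * 36)
    return b"".join(records)


if __name__ == "__main__":
    trace = synthetic_headers(10_000)
    print(f"Deflate ratio on synthetic data: {compression_ratio(trace):.2%}")
```

To measure a real trace, the bytes of the header file would be passed to compression_ratio instead of the synthetic records.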

The previous methods do not take into account the specific properties of the data to be compressed. Compression techniques designed specifically for packet headers have been developed to save transmission bandwidth on channels such as wireless and slow point-to-point links. The original scheme proposed for TCP/IP header compression over low-speed serial links is Van Jacobson's header compression algorithm [Jacobson 1990]. The method is based on the fact that, in TCP connections, the content of many TCP/IP header fields of consecutive packets of a flow can usually be predicted. As we will show, the achievable compression ratio using this method is around 30%, reducing the file size, for instance, from 100 MB to 30 MB.

Since then, specifications for the compression of a number of other protocols have been written. Degermark proposed additional compression algorithms for UDP/IP and TCP/IPv6 [Degermark, Engan, Nordgren and Pink 1996]. Detailed specifications for compressing these protocols, as well as others such as RTP, were described in a number of subsequent RFCs [RFC2507], [RFC2508], and [RFC2509]. Each of these descriptions specifies a solution for a given protocol. For multimedia services in wireless environments, ROHC (Robust Header Compression) was introduced. ROHC was standardized in [ROHC] and will be an integral part of the 3GPP-UMTS specification [3GPP]. Also for wireless environments, another scheme that makes use of the similarity between consecutive flows from or to a given mobile terminal is described in [Westphal 2003].

In this paper we propose a packet header compression method focused not on reducing transmission bandwidth or latency in wireless or slow point-to-point links, but on saving the storage space of these potentially huge packet traces.
The method presented here is free of some limitations of the previously mentioned methods; for instance, we know all the packets of a flow before compressing them. The compression ratio that we achieve is around 16%, reducing the file size, for instance, from 100 MB to 16 MB. To reach this performance, the method uses two classes of algorithms: the first fits small flows well and the second fits large flows. Analyses using both classes have shown that the best combined performance is reached when small flows are taken to range from 1 to 12 packets per flow. In these analyses, we have assumed the most common case of storing TSH (Time Sequence Header) packet header files [NLANR].

The method presented here is lossless in the sense that, for some fields, the decompression algorithm regenerates exactly the original value, while for others, those whose initial values are random (for instance, the initial TCP sequence number), the values are shifted, as if we were capturing the trace at another execution time. In most cases, these changes do not affect the analyses performed on the decompressed file.

2. Cluster Analysis

In this work we propose a compression scheme based on multivariate statistics. Multivariate statistics are appropriate for any data set where multiple measurements are taken with possible correlations between the measurements. Multivariate techniques in general account for the correlation structure of the variables being analyzed, often yielding a more complete picture of the results than if the variables had been analyzed separately [Johnson 1998].

Cluster analysis is a multivariate technique used for finding groups in observed data [Kaufman and Rousseeuw 1990]. We chose cluster analysis as a means of forming normal groups of TCP/IP flows. The term cluster analysis encompasses a number of different algorithms and methods for grouping objects of a similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures. In other words, cluster analysis is an exploratory data analysis tool which aims at partitioning the components into groups so that the members of a group are as similar as possible and different groups are as dissimilar as possible [Jain 1991]. Statistically, this implies that the intragroup variance should be as small as possible and the intergroup variance should be as large as possible.

Each cluster thus describes, in terms of the data collected, the class to which its members belong; and this description may be abstracted through use from the particular to the general class or type. Cluster analysis is thus a tool of discovery. It may reveal associations and structure in data which, though not previously evident, are nevertheless sensible and useful once found. The results of cluster analysis may contribute to the definition of a formal classification scheme.

A number of clustering techniques have been described in the literature. These techniques fall into two classes: hierarchical and nonhierarchical. In nonhierarchical approaches, one starts with an arbitrary set of k clusters, and the members of the clusters are moved until the intragroup variance is minimal. There are two kinds of hierarchical approaches: agglomerative and divisive. In the agglomerative hierarchical approach, given n components, one starts with n clusters (each cluster having one component). Then neighboring clusters are merged successively until the desired number of clusters is obtained.
In the divisive hierarchical approach, on the other hand, one starts with one cluster (of n components) and then divides the cluster successively into two, three, and so on, until the desired number of clusters is obtained.

3. Data collection

Network traffic measurements provide a means to understand a local-area or wide-area network. Using specialized network measurement hardware or software, a network researcher can collect detailed information about the transmission of packets on the network, including their time structure and contents. With detailed packet-level measurements, and some knowledge of the Internet protocol stack, it is possible to obtain significant information about the structure of an Internet application or the behavior of an Internet user.

The off-line analyses carried out in this work were based on a publicly available archive of traces collected and maintained by the National Laboratory for Applied Network Research (NLANR) [NLANR]. We downloaded traces collected from the following sites: Colorado State University (COS), Front Range GigaPOP (FRG), University of Buffalo (BUF), and Columbia University (BWY). In all cases, the traces were stored using the TSH packet header format. For .tsh files the header size is 44 bytes: 8 bytes of timestamp and interface identifier, 20 bytes of IP, and 16 bytes of TCP. No IP or TCP options are included. The packet payload is not stored either.
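The 44-byte record layout described above can be sketched with Python's struct module. The text only states the 8 + 20 + 16 byte split; the exact split of the first 8 bytes (4 bytes of seconds, 1 byte of interface number, 3 bytes of microseconds), the big-endian byte order, and the names TshRecord and parse_tsh_record are assumptions made for this illustration.

```python
import struct
from typing import NamedTuple

TSH_RECORD_SIZE = 44  # 8 bytes timestamp/interface + 20 bytes IP + 16 bytes TCP


class TshRecord(NamedTuple):
    seconds: int
    interface: int
    microseconds: int
    total_length: int
    ttl: int
    protocol: int
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    seq: int
    ack: int
    control_bits: int
    window: int


def parse_tsh_record(buf: bytes) -> TshRecord:
    """Decode one 44-byte record (field layout of the first 8 bytes is assumed)."""
    if len(buf) != TSH_RECORD_SIZE:
        raise ValueError(f"expected {TSH_RECORD_SIZE} bytes, got {len(buf)}")
    # Assumed: 4 bytes of seconds, then 1 byte interface + 3 bytes microseconds.
    seconds, iface_usec = struct.unpack_from(">II", buf, 0)
    # Standard 20-byte IPv4 header without options.
    (_ver_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack_from(">BBHHHBBH4s4s", buf, 8)
    # First 16 bytes of the TCP header (up to and including the window).
    src_port, dst_port, seq, ack, off_flags, window = struct.unpack_from(">HHIIHH", buf, 28)
    return TshRecord(
        seconds=seconds,
        interface=iface_usec >> 24,
        microseconds=iface_usec & 0xFFFFFF,
        total_length=total_length,
        ttl=ttl,
        protocol=protocol,
        src_addr=".".join(str(b) for b in src),
        dst_addr=".".join(str(b) for b in dst),
        src_port=src_port,
        dst_port=dst_port,
        seq=seq,
        ack=ack,
        control_bits=off_flags & 0x3F,  # the six classic TCP control bits
        window=window,
    )
```

A trace file would then be read in 44-byte chunks, with each chunk passed to parse_tsh_record.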

4. Flow Mapping

In this work, we use a flow characterization approach that incorporates semantic characteristics of flows [Holanda 2004]. By semantic characterization we mean the joint analysis of traffic characteristics, including the inter-packet time and some of the most important fields of the TCP/IP headers (source and destination address, port numbers, packet length and TCP flags).

Let us define a packet flow as a sequence of packets in which each packet has the same value for the 5-tuple of source and destination IP addresses, protocol number, and source and destination port numbers, and such that the time between two consecutive packets does not exceed a threshold. In our case, we have adopted a threshold of 5 seconds.

To better represent the header fields and to analyze their behavior, we have developed a header field mapping (see Figure 1). In this mapping, for some header fields, the values are simply copied from the packets; for others, the mapped value represents the increment or decrement between consecutive packets of a flow; and finally, for fields whose distribution of values is highly skewed, we can replace the original value by a transformation or function of the values. Below, we describe this mapping.

Figure 1. Flow mapping (fields $P_3^5(j)$ and $P_4^5(j)$ of a 5-packet flow mapped to $F_3^5(j)$ and $F_4^5(j)$, for $j = 1, 2, 3$)

Let $P_i^m$ be the packet header of the $i$-th packet of a flow consisting of $m$ packets, and let $P_i^m(j)$ be a selected header field of $P_i^m$. For each field $P_i^m(j)$, a function $\chi_j$ performs a mapping into an integer value $F_i^m(j)$:

$$F_i^m(j) = \chi_j(P_i^m(j)) \tag{1}$$

For each packet, let

$$F_i^m = (F_i^m(1), F_i^m(2), \ldots) \tag{2}$$

denote a vector of integers, where we include the selected fields. For the complete flow we can define:

$$P^m = (P_1^m, P_2^m, \ldots, P_m^m) \tag{3}$$

and

$$F^m = (F_1^m, F_2^m, \ldots, F_m^m). \tag{4}$$

Note that the vector $F^m$ can be viewed as a numerical representation of the $m$ packet headers, as we substitute some selected packet header fields by integers. Examples of header fields that are simply copied are: Version, IHL, Type of Service, Flags, Fragment Offset, Data Offset, and Control Bits. However, for some fields, such as the timestamp, we map their values in terms of the increment or decrement between consecutive packets. On the other hand, as we have said earlier, if the distribution of a parameter is highly skewed, one should consider the possibility of replacing the parameter by a transformation or function of the parameter. For instance, we can replace the TCP flags with more appropriate values.

5. Flow Clustering

Using the flow mapping described in the last section, we can potentially find a large variety of flows in a high-speed link. However, from our studies, we have seen that the flows are not very different from each other. To study the variety among flows, we have used an approach based on cluster analysis, a classical technique used for workload characterization [Jain 1991].

Generally, a measured trace consists of a large number of flows. For analysis purposes, it is useful to classify these flows into a small number of classes or clusters such that the components within a cluster are very similar to each other. Later, one member from each cluster may be selected to represent the class.

Our clustering analysis basically consists of mapping each component into an $n$-dimensional space and identifying components that are close to each other. Here $n$ is the number of parameters. The closeness between two components is measured by defining a distance measure. The Euclidean distance is the most commonly used distance metric and is defined as:

$$d = \left\{ \sum_{k=1}^{n} (x_{ik} - x_{jk})^2 \right\}^{0.5} \tag{5}$$

Starting from a real trace, we break it down into flows with the same number of packets.
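The flow definition of Section 4 (same 5-tuple, at most 5 seconds between consecutive packets) and the subsequent grouping of flows by packet count can be sketched as follows. This is a minimal illustration, assuming packets arrive sorted by timestamp; the Packet type and the function names are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

FLOW_TIMEOUT = 5.0  # seconds, the threshold adopted in the text


class Packet(NamedTuple):
    timestamp: float
    src_addr: str
    dst_addr: str
    protocol: int
    src_port: int
    dst_port: int


FlowKey = Tuple[str, str, int, int, int]


def split_into_flows(packets: Iterable[Packet],
                     timeout: float = FLOW_TIMEOUT) -> List[List[Packet]]:
    """Split a time-ordered packet sequence into flows."""
    open_flows: Dict[FlowKey, List[Packet]] = {}
    finished: List[List[Packet]] = []
    for pkt in packets:
        key = (pkt.src_addr, pkt.dst_addr, pkt.protocol, pkt.src_port, pkt.dst_port)
        current = open_flows.get(key)
        # A gap larger than the timeout closes the flow; a new one starts.
        if current is not None and pkt.timestamp - current[-1].timestamp > timeout:
            finished.append(current)
            current = None
        if current is None:
            current = []
            open_flows[key] = current
        current.append(pkt)
    finished.extend(open_flows.values())
    return finished


def group_by_length(flows: Iterable[List[Packet]]) -> Dict[int, List[List[Packet]]]:
    """Group flows by their number of packets m."""
    by_m: Dict[int, List[List[Packet]]] = defaultdict(list)
    for flow in flows:
        by_m[len(flow)].append(flow)
    return dict(by_m)
```

Each group returned by group_by_length corresponds to one class of m-packet flows, which can then be clustered separately.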
Using the mapping described in the last section, from a set of vectors, we calculate the Euclidean distance between the $F^m$ vectors and store the results in a distance matrix of flows. Initially, each vector $F^m$ represents a cluster. A distance of 0 means that two vectors are identical. Next, we search for the smallest element of the distance matrix. Let $d_{rs}$, the distance between clusters $r$ and $s$, be the smallest. We merge clusters $r$ and $s$ and also merge any other cluster pairs that have the same distance.

We have used the Minimum Spanning Tree hierarchical clustering technique [Jain 1991], which starts with $n$ clusters of one component each and successively joins the nearest clusters until a specified distance between the clusters is reached. For each $m$ (number of packets per flow), we apply the clustering method separately. Finally, templates of flows are generated from the clusters.

We started our analysis using the following header fields: interface, version, IHL, type of service, flags, fragment offset, time to live, data offset and control flags. Moreover, we use the definition of a flow as a sequence of packets in which each packet has the same value for the 5-tuple of source and destination addresses, source and destination ports, and protocol, and such that the time between two consecutive packets does not exceed a threshold of 5 seconds. We converted each flow into an $F^m$ vector and calculated the distance between the flows. For different values of $m$ (flows with $m$ packets), we observed that the number of clusters reaches a limit, and newly read flows always fit into one of the previous clusters.

To improve the accuracy of our outcomes, we extended our analysis to different networks. In Figure 2, we show, for $m = 4$ packets, the number of clusters when the flows inside each cluster have inter-distance greater than zero. The outcomes are from four ATM OC-3 traces downloaded from the NLANR web site. The plotted curves are from Colorado State University (COS), Front Range GigaPOP (FRG), University of Buffalo (BUF), and Columbia University (BWY). Furthermore, in Figure 2, we see the behavior of the Joined Trace (upper curve). This trace was obtained by joining the four downloaded NLANR traces.

From Figure 2, we draw two important conclusions: (i) a small number of clusters is enough to represent a packet trace; (ii) the number of clusters in the joined trace is less than the sum of the clusters of the other four traces. This implies that the types of flows are basically the same in all traces. Similar behavior was obtained for different values of $m$.

Figure 2. Number of clusters for ATM OC-3 traces and joined trace, NLANR traces (Y axis: number of clusters; X axis: flows for m = 4 packets; curves: Joined, COS, BUF, BWY, FRG)

We have seen that the relation between the number of clusters and the number of flows for each class of $m$-packet flows is highly favorable for small flows. However, as we increase the value of $m$, this relation deteriorates, which led us to adopt another approach for large flows.
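The clustering step of this section can be illustrated with a short sketch. It assumes that joining nearest clusters until a distance limit is reached is equivalent to taking the connected components of the graph that links every pair of vectors whose Euclidean distance does not exceed that limit (the usual single-linkage view of minimum spanning tree clustering). The function names and the max_distance parameter are hypothetical; this is not the author's implementation.

```python
import math
from typing import Dict, List, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors, as in equation (5)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_flows(vectors: Sequence[Sequence[float]],
                  max_distance: float = 0.0) -> List[List[int]]:
    """Return clusters as lists of indices into `vectors`.

    Two flows share a cluster when a chain of pairwise distances,
    each no larger than max_distance, connects them.
    """
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if euclidean(vectors[i], vectors[j]) <= max_distance:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

With max_distance set to 0, only identical $F^m$ vectors are merged, so each cluster corresponds to one exact template. The pairwise loop is quadratic in the number of flows, which is acceptable for a sketch but would need a hash-based shortcut on large traces.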
The problem with large flows occurs because we have a small number of large flows with exactly $m$ packets. Hence, for large flows, we changed the clustering analysis by executing a joint analysis of flows with different numbers of packets. In this case, $m$ is a new parameter to be taken into consideration. Often, $m$ is not the predominant parameter that defines the cluster.

From the previous analysis, we have concluded that behind the great number of flows in a high-speed link there is not much variety among them, and they can clearly be grouped into a few clusters. Moreover, the TCP/IP flows are not equally distributed among the clusters; a few clusters are highly predominant. In our view, these conclusions follow from the small number of tools in use on the Internet, which are concentrated in a few programs. For instance, almost all operating systems are Windows or Linux based, TCP is the dominant protocol, there are few TCP versions, and Web and P2P are the most common applications. Moreover, since many people use the same search engines (Google, Scholar, etc.), users tend to show similar behavior when using the Internet.

6. Entropy of TCP/IP Header Fields

As we have seen in the last section, behind the great number of flows in a high-speed link there is not much variety among them, and they can clearly be grouped into a set of clusters. The evidence that Internet flows can be grouped into a small set of clusters led us to create templates of flows and apply this method to compress a packet trace file. Before starting with the compression itself, we wanted to determine the limit of lossless compression for a TCP/IP header trace. To do that, we have used the definition of entropy.

Entropy is a measure of the amount of information required on average to describe a random variable. For data compression, the entropy H of a random variable is a lower bound on the average length of the shortest description of the random variable and is used to establish the fundamental limit for the compression of information.
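As a small worked illustration of how an entropy-based bound can be estimated from data, the sketch below computes the empirical Shannon entropy (in bits) of a column of field values and divides the sum over fields by the header size. The function names are hypothetical, and the default of 352 bits corresponds to the 44-byte TSH header; summing per-field entropies implicitly treats the fields as independent, which is an assumption of this sketch.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Mapping, Sequence


def empirical_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no values given")
    # sum of p * log2(1/p) over the observed values
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def packet_level_bound(columns: Mapping[str, Sequence[Hashable]],
                       header_bits: int = 352) -> float:
    """Sum of per-field entropies divided by the header size in bits."""
    return sum(empirical_entropy(col) for col in columns.values()) / header_bits


if __name__ == "__main__":
    sample = {
        "version": [4, 4, 4, 4],        # constant field: 0 bits
        "control_bits": [2, 16, 16, 1]  # skewed field: 1.5 bits
    }
    print(f"bound = {packet_level_bound(sample):.4f}")
```

A constant field contributes nothing to the bound, while a field whose values are all distinct contributes the most, which is why the choice of fields and the level of aggregation (packet or flow) matter.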
Data compression can be achieved by assigning short descriptions to the most frequent outcomes of the data source and necessarily longer descriptions to the less frequent outcomes. Thus the entropy is the data compression limit as well as the number of bits needed in random number representation. Codes achieving H turn out to be optimal.

The entropy of a random variable $X$ with a probability mass function $p(x)$ is defined by

$$H(X) = -\sum_{x} p(x) \log_2 p(x) \tag{6}$$

Note that entropy is a functional of the distribution of $X$. It does not depend on the actual values taken by the random variable $X$, but only on the probabilities. Using logarithms to base 2, the entropy is measured in bits. It is the number of bits required on average to describe the random variable.

6.1. Packet Level Entropy

For a sequence of packets in an Internet trace, we can assume that the sequence of header field values constitutes a stationary ergodic process. Moreover, given the high aggregation level of high-speed Internet links, we can also assume that the header field values of consecutive packets are independent. In this case, the joint entropy is:

$$H(S_i, S_{i+1}) = H(S_i) + H(S_{i+1} \mid S_i) \tag{7}$$

$$H(S_i, S_{i+1}) = H(S_i) + H(S_{i+1}) \tag{8}$$

$$H(S_i, S_{i+1}) = 2\,H(S) \tag{9}$$

where $H(S_i)$ is the header field entropy of packet $i$. For a sequence of $n$ packets:

$$H(S_i, S_{i+1}, \ldots, S_n) = H(S_i) + H(S_{i+1} \mid S_i) + \cdots + H(S_n \mid S_{n-1}, \ldots, S_1) \tag{10}$$

$$H(S_i, S_{i+1}, \ldots, S_n) = n\,H(S) \tag{11}$$

Using a header field approach, we have calculated the entropy of each of the TSH header fields. For our analysis, we used 1,000,000 packets. The sum of their corresponding code sizes gives the average length that establishes the limit for the compression of packet headers. We have used the following header fields: Timestamp, Interface, Version, Internet Header Length (IHL), Type of Service (TOS), Total Length, Identification, Flags, Fragment Offset, Time to Live, Protocol, Header Checksum, Source Address, Destination Address, Source Port, Destination Port, Sequence Number, Acknowledgment Number, Data Offset, Control Bits, and Window.

Using a packet compression approach, the compression ratio is limited to 40%, which means that a packet header file of 100 MB can be reduced to 40 MB at best. Clearly, this compression bound is not satisfactory, and other approaches must be evaluated to reach higher compression ratios.

6.2. Flow Level Entropy

For a sequence of packets in an Internet trace, we assumed that the packets are independent. However, this is not true for a sequence of packets in the same flow, where some header fields, such as the IP addresses and TCP port numbers, show a strong dependence. For a flow approach we therefore have three scenarios.
First scenario: From the chain rule for entropy we have that if $X_1, X_2, \ldots, X_n$ are drawn according to $p(x_1, x_2, \ldots, x_n)$, then:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \tag{12}$$

In the first scenario, the entropy of the field $F$ for a set of flows with $n$ packets is:

$$H(F_1) \approx 0,\; H(F_2) \approx 0,\; \ldots,\; H(F_n) \approx 0 \tag{13}$$

consequently:

$$H(F_1, F_2, \ldots, F_n) \approx 0 \tag{14}$$

This scenario embraces the following fields: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Protocol.

Second scenario: In the second scenario, we have:

$$H(F_1) > 0,\; H(F_2) > 0,\; \ldots,\; H(F_n) > 0 \tag{15}$$

and that:

$$H(F_i, F_{i+1}) = H(F_i) + \underbrace{H(F_{i+1} \mid F_i)}_{\approx 0} \tag{16}$$

$$H(F_i, F_{i+1}) \approx H(F) \tag{17}$$

where $H(F_i)$ is the entropy of the header field $F$ in packet $i$. For a flow with $n$ packets:

$$H(F_i, F_{i+1}, \ldots, F_n) = H(F_i) + \underbrace{H(F_{i+1} \mid F_i)}_{\approx 0} + \cdots + \underbrace{H(F_n \mid F_{n-1}, \ldots, F_1)}_{\approx 0} \tag{18}$$

$$H(F_i, F_{i+1}, \ldots, F_n) \approx H(F) \tag{19}$$

This scenario embraces the following fields: Source Address, Destination Address, Source Port, Destination Port, Time to Live, Data Offset, Control Bits.

Third scenario: In the last scenario, the sequence of the header fields in a flow has the following behavior:

$$H(F_1) > 0,\; H(F_2) > 0,\; \ldots,\; H(F_n) > 0 \tag{20}$$

and the joint entropy is:

$$H(F_i, F_{i+1}) = H(F_i) + H(F_{i+1} \mid F_i) \tag{21}$$

$$H(F_i, F_{i+1}) = H(F_i) + H(F_{i+1}) \tag{22}$$

$$H(F_i, F_{i+1}) = 2\,H(F) \tag{23}$$

For a flow with $n$ packets:

$$H(F_i, F_{i+1}, \ldots, F_n) = H(F_i) + H(F_{i+1} \mid F_i) + \cdots + H(F_n \mid F_{n-1}, \ldots, F_1) \tag{24}$$

$$H(F_i, F_{i+1}, \ldots, F_n) = n\,H(F) \tag{25}$$

This third scenario embraces the following fields: Timestamp, Total Length, Identification, Sequence Number, Acknowledgment Number, Window.

From the three scenarios shown, we defined $A_m$ as:

$$A_m = \frac{\overbrace{\sum H(F)}^{\text{1st scenario}} + \overbrace{\sum H(F)}^{\text{2nd scenario}} + m\,\overbrace{\sum H(F)}^{\text{3rd scenario}}}{m \cdot 352} \tag{26}$$

where the summations for the three scenarios are shown in Tables 1, 2 and 3, and 352 is the packet header size in bits (44 bytes). In Table 3 we consider the entropy of the Sequence Number and Acknowledgment Number header fields to be zero, because they can be deduced from the Total Length header field.

Table 1. 1st scenario entropy
Columns: Header Field, Entropy H(x)
Rows: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Protocol, Total

Table 2. 2nd scenario entropy
Columns: Header Field, Entropy H(x)
Rows: Time to Live, Source Address, Destination Address, Source Port, Destination Port, Data Offset, Control Bits, Total

Table 3. 3rd scenario entropy
Columns: Header Field, Entropy H(x)
Rows: Timestamp, Total Length, Identification, Sequence Number, Acknowledgment Number, Window, Total

Hence, the final expression for $A_m$, equation (27), is obtained by substituting the totals of Tables 1, 2 and 3 into (26),

and the maximum compression ratio is:

$$C = \sum_{m} A_m\, f_m \tag{28}$$

where $f_m$ is the flow probability distribution for $m$-packet flows. Applying the equations shown previously, we deduced that the compression bound for a TCP/IP header trace is around 13%.

7. Lossless Compression Method

The main reason why header compression can be done at all is that there is significant redundancy between header fields, both within consecutive packets belonging to the same flow and, in particular, between flows. The big gain of our proposed method comes from the observation that, for a set of selected header fields, the flows traveling over an Internet link are very similar. By using a set of pre-computed templates of flows together with Huffman encoding, the header size can be significantly reduced. Hence, we have developed a new header compression scheme for packet header files that drastically reduces storage requirements. This section provides the details of how the method works, focusing on the fact that the decompressed header is functionally identical to the original header.

First of all, we analyzed the behavior of the TCP/IP header fields for packets belonging to the same flow. Let $F(i)$ be a header field of the $i$-th packet of a flow. We also define $\Delta F(i) = F(i) - F(1)$, where the minus operator represents an arithmetic operation. For the first packet of a flow, $F(1)$ can be classified as $F(1)$-random, $F(1)$-predictable, or $F(1)$-not predictable:

$F(1)$-random fields: Fields whose initial values could or should be chosen at random: Identification, Sequence Number, and Acknowledgment Number. The Identification field is primarily used for uniquely identifying fragments of an original IP datagram, and many operating systems assign a sequential number to each packet. Hence, assigning a random number to the first fragment does not constitute a problem.
Equally, the Sequence Number and Acknowledgment Number fields are not affected if we assign random values to the first packet of each flow.

$F(1)$-predictable fields: Fields whose value is usually known or at least predictable: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Protocol, Data Offset, Reserved, and Control Bits. The fields placed in this group preserve a high level of similarity among different flows.

$F(1)$-not predictable fields: Fields whose value cannot be predicted and has a specific meaning: Timestamp, TTL, Header Checksum, Total Length, Source Address, Source Port, Destination Address, Destination Port, and Window. This group embraces the fields whose values for the first packet of each flow are very hard to guess. For instance, we cannot know in advance when each flow will start. Hence, it is impossible to guess, for each flow, the value of the Timestamp field of the first packet. The TTL field is modified in Internet header processing; hence, depending on the number of hops previously visited, its value varies broadly among different flows. The Total Length carried by each packet, as well as the Window field, also shows a large variation. Finally, the source and destination addresses represent endpoints that are impossible to know in advance.

Moreover, according to the behavior of $\Delta F(i)$, the fields of the $i$-th packet of a flow were classified as $\Delta F(i) = 0$, $\Delta F(i)$-predictable, and $\Delta F(i)$-not predictable.

$\Delta F(i) = 0$ fields: Header fields whose values are likely to stay constant over the life of a connection: Version, Type of Service, Protocol, Source Address, Destination Address, Source Port, and Destination Port.

$\Delta F(i)$-predictable fields: Fields whose $\Delta F(i)$ values are predictable, can be calculated from the information stored in another field, or follow sequential increments: Interface, IHL, Identification, Flags, Fragment Offset, Time to Live, Sequence Number, Acknowledgment Number, Data Offset, Reserved, and Control Bits.

$\Delta F(i)$-not predictable fields: Fields that are likely to change over the life of the conversation and, furthermore, cannot be calculated: Timestamp, Total Length, Header Checksum, and Window.

Our proposed compression method is set in the context of saving the storage space of potentially huge packet traces. The advantage of our proposal is that the trace file, and consequently the flows, are known in advance. Thus, we exploit the properties of TCP/IP flows to predict some header fields and use Huffman encoding to obtain an optimal compression ratio. A phase prior to the compression itself was to determine the clusters of TCP/IP flows and build a Huffman tree for them. The clusters were obtained from traces downloaded from NLANR.
The following steps consist of traversing the whole trace file, building the header field frequency tables, checking for the presence of new clusters, and finally compressing the file.

We have applied different approaches to small flows and large flows. In both cases, the compression is carried out at flow level, but for small flows we apply the clustering techniques described in the previous sections. For a large flow, once it is complete, the compressor inspects each packet in order to determine the corresponding Huffman code of each header field.

7.1. Compression Ratio

The expected message length $L$ can be calculated in order to measure the efficiency of the algorithm. Let $N$ be the number of messages, $P(i)$ the probability of the $i$-th value and $L(i)$ the length of a message (the number of bits assigned to it). Then:

$$L = \sum_{i=1}^{N} P(i)\,L(i). \tag{29}$$
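The expected length of equation (29) can be illustrated by building a static Huffman code over the observed values of one header field. This is a generic textbook construction written for this illustration, not the author's compressor; the function names are hypothetical, and a single-symbol input is given a 1-bit code by convention.

```python
import heapq
from collections import Counter
from typing import Dict, Hashable, Iterable


def huffman_code_lengths(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Return the Huffman code length (in bits) of each distinct value."""
    counts = Counter(values)
    if not counts:
        raise ValueError("no values given")
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def expected_length(values: Iterable[Hashable]) -> float:
    """Expected code length L = sum over symbols of P(i) * L(i)."""
    values = list(values)
    counts = Counter(values)
    lengths = huffman_code_lengths(values)
    return sum(counts[s] / len(values) * lengths[s] for s in counts)


if __name__ == "__main__":
    field_values = list("aaaabbcd")
    print(huffman_code_lengths(field_values))
    print(expected_length(field_values))
```

For a distribution whose probabilities are all powers of 1/2, such as the one in the usage example, the expected length equals the entropy; in general it is at least the entropy.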

Moreover, the expected length $L$ of any instantaneous code for a random variable $X$ is greater than or equal to the entropy $H(X)$, i.e.,

$$L \geq H(X) \tag{30}$$

It is important to note that the method also needs some auxiliary data structures, for instance the Flow Clustering dataset. However, we do not take them into account because their size stays almost constant. Using the compression algorithm described here, the trace compression ratio is around 16%, which means that our method compresses a header trace of 100 MB to 16 MB.

We studied the efficiency of the proposed compression method by comparing it against GZIP [Gailly and Adler - GZIP] and the Van Jacobson method. GZIP, as well as the ZIP and ZLIB [Gailly and Adler - ZLIB] applications, uses the deflation algorithm. The measurements were taken from a TSH (Time Sequence Header) header trace file [NLANR]. The compressed file size obtained using the GZIP application is around 50% of the original TSH file size.

For the Van Jacobson method, the header size of a compressed datagram ranges from 3 to 16 bytes. However, we must slightly modify the original method, because the number of active flows is much larger in a high-speed Internet link than in a low-speed serial link (the scenario for which the Van Jacobson method was originally proposed). Hence, we must increase the number of bytes needed to store the flow identifier (we have increased it from 1 byte to 3 bytes). Moreover, we assume that a timestamp (3 bytes) is added to each header. As a result, we assume that the minimal encoded header becomes 8 bytes in the best case and 21 bytes in the worst case.
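The per-header sizes above can be turned into a rough per-flow estimate. The sketch below assumes, for illustration only, that the first packet of every flow is stored with its full 44-byte header and that each later packet costs a fixed number of bytes (8 in the best case, 21 in the worst case). The function names and the flow_counts input (number of flows observed for each flow length m) are hypothetical.

```python
from typing import Mapping

FULL_HEADER_BYTES = 44   # uncompressed TSH header
BEST_CASE_BYTES = 8      # minimal encoded header, best case
WORST_CASE_BYTES = 21    # minimal encoded header, worst case


def vj_flow_ratio(m: int, compressed_bytes: int = BEST_CASE_BYTES) -> float:
    """Compressed/original size for one m-packet flow (assumed cost model)."""
    if m < 1:
        raise ValueError("a flow has at least one packet")
    compressed = FULL_HEADER_BYTES + compressed_bytes * (m - 1)
    return compressed / (FULL_HEADER_BYTES * m)


def vj_trace_ratio(flow_counts: Mapping[int, int],
                   compressed_bytes: int = BEST_CASE_BYTES) -> float:
    """Overall ratio for a trace given how many flows have each length m."""
    original = sum(n * FULL_HEADER_BYTES * m for m, n in flow_counts.items())
    if original == 0:
        raise ValueError("empty trace")
    compressed = sum(
        n * (FULL_HEADER_BYTES + compressed_bytes * (m - 1))
        for m, n in flow_counts.items()
    )
    return compressed / original


if __name__ == "__main__":
    for m in (1, 2, 4, 12, 100):
        print(m, f"{vj_flow_ratio(m):.3f}")
```

Under this cost model a single-packet flow is not compressed at all, and the ratio approaches 8/44 (about 18%) for very long flows, so the overall figure depends strongly on the flow length distribution of the trace.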
Taking into account only the best case and considering the changes explained above, the compression ratio for $m$-packet flows using the Van Jacobson method is bounded by:

$$f_{VJ}(m) = \frac{44 + 8\,(m - 1)}{44\,m}, \tag{31}$$

thus obtaining a compression ratio given by:

$$C^{VJ}_{\mathrm{Ratio}} = \sum_{m} P_m\, f_{VJ}(m) \tag{32}$$

Using this approach, we conclude that the compression ratio of the Van Jacobson method reaches 32% in the best case.

The performance of the three compression methods under analysis (our proposed method, the GZIP method and the Van Jacobson method) is depicted in Figure 3. For an uncompressed .tsh header file (X axis), we show the corresponding storage needs (Y axis).

8. Conclusions

In this paper, we have introduced a novel lossless packet header compression method based on TCP flow clustering and Huffman encoding. The combined algorithm applies the Flow Clustering technique to small flows and Huffman encoding to large flows. This approach has significantly improved the compression ratio.

Figure 3. Compression techniques comparison (X axis: uncompressed file size in MBytes; Y axis: compressed file size in MBytes; curves: GZIP method, VJ method, proposed method).

We have seen that the Flow Clustering technique fits small flows well, where many flows are grouped into a few templates. In these circumstances, the number of templates remains constant or shows small variations. The technique is based on semantic similarities among flows and on TCP/IP functionalities. However, we have not seen the same behavior for large flows, where the number of templates tends to increase whenever new large flows arise. These characteristics led us to adopt another approach for large flows. This approach is based on Huffman encoding and exploits the similarity between packets during the life of a connection.

The analysis performed with both algorithms showed that flows of 1 to 12 packets are best treated as small flows, while flows with more than 12 packets are best treated as large flows.

With our proposed method, storage size requirements for .tsh packet header traces are reduced to 16% of their original size. The proposed compression is more efficient than the other methods compared here and is simple to implement. The other known methods have compression ratios bounded at 50% (GZIP) and 32% (Van Jacobson method), which underlines the effectiveness of our method. As future work, we intend to extend the compression method presented here to other header data formats.

References

DEFLATE. Compressed data format specification. Available at ftp://ds.internic.net/rfc/rfc1951.txt.
Gailly, J. L. and Adler, M. GZIP documentation and sources. Available at ftp://prep.ai.mit.edu/pub/gnu/.
Gailly, J. L. and Adler, M. ZLIB documentation and sources. Available at ftp://ftp.uu.net/pub/archiving/zip/doc/.

Holanda, R., Garcia, J., and Almeida, V. (2004). Flow Clustering: a New Approach to Semantic Traffic Characterization. In 12th Conference on Measuring, Modelling, and Evaluation of Computer and Communication Systems, Dresden, Germany.
Jacobson, V. (1990). Compressing TCP/IP headers. In RFC.
Jain, R. (1991). The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., New York.
Johnson, D. E. (1998). Applied Multivariate Methods for Data Analysis. Brooks/Cole Publishing Co.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley and Sons, Inc.
Knuth, D. E. (1985). Dynamic Huffman coding. Journal of Algorithms, 6:
Ziv, J. and Lempel, A. A universal algorithm for sequential data compression. In IEEE Transactions on Information Theory, Vol. 23, no. 3, pp.
NLANR. Measurement and Network Analysis. In
CAIDA. The Cooperative Association for Internet Data Analysis. In
Pang, R. and Paxson, V. (2003). A High-Level Programming Environment for Packet Trace Anonymization and Transformation. In Proceedings of ACM SIGCOMM Conference.
Degermark, M., Engan, M., Nordgren, B., and Pink, S. (1996). Low-loss TCP/IP Header Compression for Wireless Networks. In Proc. MOBICOM, Rye, NY.
Degermark, M., Nordgren, B., and Pink, S. (1999). IP Header Compression. In Internet Engineering Task Force, RFC.
Casner, S. and Jacobson, V. (1999). Compressing IP/UDP/RTP Headers for Low-Speed Serial Links. In Internet Engineering Task Force, RFC.
Engan, M., Casner, S., and Bormann, C. (1999). IP Compression over PPP. In Internet Engineering Task Force, RFC.
Bormann, C. et al. (2001). RObust Header Compression (ROHC): Framework and four profiles: RTP, UDP, ESP, and uncompressed. In Request for Comments.
3rd Generation Partnership Project (2002). Radio Access Bearer Support Enhancements. In 3GPP, Tech. Rep.
Westphal, C. (2003). Improvements on IP Header Compression. In GLOBECOM IEEE Global Telecommunications Conference, vol. 22, no. 1, pp.
