A SIMPLE DATA COMPRESSION ALGORITHM FOR ANOMALY DETECTION IN WIRELESS SENSOR NETWORKS

Volume 117 No. 19 2017, 403-410. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu

1 Uthayakumar J, 2 Vengattaraman T, 3 Dr. J. Amudhavel
1 Research Scholar, Department of Computer Science, Pondicherry University, Puducherry, India
2 Assistant Professor, Department of Computer Science, Pondicherry University, Puducherry, India
3 Associate Professor, Department of CSE, KL University, Andhra Pradesh, India
1* uthayresearchscholar@gmail.com, 2 vengattaramant@gmail.com, 3 info.amudhavel@gmail.com

Abstract: A Wireless Sensor Network (WSN) consists of numerous sensor nodes and is deeply embedded in the real world for environmental monitoring. As the sensor nodes are battery powered, energy efficiency is an important design issue in WSN. Since data transmission consumes more energy than sensing and processing, much research has been carried out to reduce the amount of data transmitted. Data compression (DC) techniques are commonly used for this purpose. Anomaly detection is another challenging task in WSN and is needed to ensure data integrity. To this end, the sensor nodes append a label to each sensed value to distinguish normal values from abnormal ones: 0 for actual (normal) data and 1 for anomalous data. In this paper, we employ the Lempel Ziv Markov-chain Algorithm (LZMA) to compress the labeled data in WSN. LZMA is a lossless data compression algorithm which is well suited for real time applications. The sensor node compresses the labeled data with LZMA and transmits it to the Base Station (BS) via single hop or multi hop communication. Extensive experiments were performed using a real world labeled WSN dataset. To assess the effectiveness of LZMA, it is compared with 5 well-known compression algorithms, namely Deflate, Lempel Ziv Welch (LZW), Burrows Wheeler Transform (BWT), Huffman coding and Arithmetic coding (AC). LZMA achieves significantly better compression than these methods, with an average compression ratio of 0.104 at a bit rate of 0.839 bits per character.

Keywords: Anomaly detection; Data compression; Lempel Ziv Welch; Multi hop communication; Wireless Sensor Networks

1. Introduction

Recent advances in wireless networks and Micro-Electro-Mechanical Systems (MEMS) have led to the development of low cost, compact and smart sensor nodes. Sensor nodes are randomly deployed in the sensing field to measure physical parameters such as temperature, humidity, pressure and vibration [1]. WSN is widely used in tracking and data gathering applications including surveillance (indoor and outdoor), healthcare, disaster management and habitat monitoring [2]. A sensor node is built from four components: a transducer, a microcontroller, a battery and a transceiver. Sensor nodes are constrained in energy, bandwidth, memory and processing capability. As they are battery powered and usually deployed in harsh environments, it is not easy to recharge or replace their batteries [3]. The lifetime of a WSN can be extended in two ways: increasing the battery capacity or using the available energy more effectively. Increasing the battery capacity is not possible in all situations, so effective utilization of the available energy is an important design issue. Several researchers have observed that far more energy is spent on data transmission than on sensing and processing [4]. Reducing the amount of data transmitted is therefore an effective way to achieve energy efficiency.
Data transmission is the most energy consuming task, and because sensed data exhibit strong temporal correlation, much of what is transmitted is redundant. DC is a useful approach to eliminate this redundancy [5]. A DC technique represents the data in a compact form without compromising data quality beyond an acceptable extent. It is used to compress text, image, audio, video, etc. [6]. The compact form is obtained by recognizing and exploiting the patterns that exist in the data. Based on the quality of the reconstructed data, DC is divided into two types: lossless compression and lossy compression [7]. Lossy compression accepts a loss of quality in the reconstructed data. It achieves better compression and is useful where such a loss is acceptable, for example in images, audio and video. In other situations, loss of information is unacceptable and the reconstructed data must be an exact replica of the original data; this is lossless compression [8]. The basic idea of compressing data involves two steps: eliminating redundant data and eliminating irrelevant data. The redundancy present in real world data is what makes data compression possible. Removing data that cannot be perceived by the human eye is termed irrelevancy reduction. Reducing the amount of data allows more information to be stored in the same storage space and significantly reduces the transmission time. This property is highly useful in WSN for compressing sensed data [9].

Another important challenge in WSN is to maintain the integrity of the data sensed by the sensors. This requirement leads to the research problem known as anomaly detection [10]. It plays a major role in intrusion detection and fault diagnosis, since any misbehavior or anomaly must be detected for the reliable and secure functioning of the network. Anomaly detection is useful in WSN to identify abnormal variations in the sensing field [11]. It is the process of raising an alert when a significant change occurs. For instance, consider a WSN that monitors environmental conditions such as temperature and humidity for forest fire detection. When a sensor malfunctions or a fire breaks out, the sensed values deviate drastically from the normal values. These abnormal conditions are identified and reported to the BS for further investigation.

Anomaly detection can be integrated into WSN in two ways: centralized approaches and distributed approaches. In centralized approaches, the sensor node senses the environment and transmits the sensed data to the BS, and only the BS determines whether a value is normal or anomalous. This traditional approach forces the sensor node to send all raw measurements, including erroneous ones, to the BS. Transmitting such a large number of raw measurements wastes energy.
In distributed approaches, the sensor nodes sense the field and identify the anomalies themselves using an anomaly detection algorithm [12]. The sensor node appends a label to each sensed value, and this label differentiates normal data from anomalous data. In this paper, we employ the LZMA lossless compression algorithm to compress labeled data in WSN.

1.1 Contribution of this paper

The contribution of the paper is summarized as follows: (i) A lossless LZMA compression algorithm is used to compress labeled WSN data. (ii) Two labeled WSN datasets (temperature and humidity) from both single hop and multi hop communication are used. (iii) The LZMA results are compared with 5 well-known compression algorithms, namely Huffman coding, AC, LZW, BWT and Deflate, in terms of Compression Ratio (CR), Compression Factor (CF) and Bits per character (BPC).

1.2 Organization of this paper

The rest of the paper is organized as follows: Section 2 explains the different types of classical DC techniques and anomaly detection techniques in WSN. Section 3 presents the LZMA compression algorithm for labeled data in WSN. Section 4 describes the metrics and datasets used for performance evaluation. Section 5 discusses the results in single hop as well as multi hop scenarios. Section 6 concludes with the highlighted contributions and future work.

2. Related Work

Energy efficiency is the major design issue in WSN. Clustering and routing are the most widely used energy efficient techniques [13], and numerous clustering and routing techniques can be found in the literature [14], [15]. Data compression is an alternative way to achieve energy efficiency. DC techniques for WSN are surveyed in [16]. The popular coding methods are Huffman coding, Arithmetic coding, Lempel Ziv coding, Burrows-Wheeler transform, Run Length Encoding (RLE), and scalar and vector quantization. Huffman coding [17] is one of the most popular coding techniques and effectively compresses data in almost all file formats.
It is a type of optimal prefix code which is widely employed in lossless data compression. It is based on two observations: (1) In an optimum code, symbols that occur more frequently are mapped to shorter code words than symbols that occur less frequently. (2) In an optimum code, the two least frequent symbols have code words of the same length. The basic idea is to assign variable length codes to input characters depending on their frequency of occurrence. The output is a variable length code table for coding the source symbols. The code is uniquely decodable, and the method consists of two steps: constructing the Huffman tree from the input sequence and traversing the tree to assign codes to characters. Huffman coding is still popular because of its simple implementation, fast compression and lack of patent coverage. It is commonly used in text compression.

AC [18] is another important coding technique for generating variable length codes. It is superior to Huffman coding in various aspects and is highly useful where the source has a small alphabet with skewed probabilities. When a string is encoded using arithmetic coding, frequently occurring symbols are coded with fewer bits than rarely occurring symbols. It converts the input data into a single fractional number in the range between 0 and 1. The range from 0 to 1 is divided into segments whose lengths are proportional to the probabilities of the symbols, and each input symbol narrows the current range to the segment that corresponds to it. Arithmetic coding is harder to implement than the other methods. Two common variants are Adaptive Arithmetic Coding and Binary Arithmetic Coding. A benefit of arithmetic coding over Huffman coding is the ability to separate the modeling and coding stages of the compression approach. It is used in image, audio and video compression.

Dictionary based coding approaches are useful where the data to be compressed contains repeated patterns. They maintain a dictionary of frequently occurring patterns. When a pattern in the input sequence is present in the dictionary, it is coded as an index into the dictionary. When the pattern is not available in the dictionary, it is coded with a less efficient approach. The Lempel-Ziv algorithm (LZ) is a dictionary-based coding algorithm commonly used in lossless file compression. It is widely used because of its adaptability to various file formats. It looks for frequently occurring patterns and replaces them by a single symbol. It maintains a dictionary of these patterns, and the length of the dictionary is set to a particular value. This method is more effective for larger files and less effective for smaller files, for which the dictionary can be larger than the original file. The two main versions of LZ were published by Ziv and Lempel in two separate papers in 1977 and 1978, and they are named LZ77 [19] and LZ78 [20]. These algorithms differ significantly in how they search for and find matches.
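The dictionary-based idea described above can be illustrated with a minimal LZ78-style encoder and decoder. This is only a sketch under simplifying assumptions and not the implementation evaluated in this paper: the function names `lz78_encode` and `lz78_decode` are hypothetical, the dictionary is unbounded, and the output is a list of (index, symbol) pairs rather than a packed bit stream.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next symbol) pairs.

    Index 0 stands for the empty prefix; phrases are numbered from 1
    in the order in which they are added to the dictionary.
    """
    dictionary = {}   # phrase -> index
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            # Keep extending the match while it is still a known phrase.
            phrase = candidate
        else:
            # Emit the longest known prefix plus the symbol that broke the match.
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it with an empty symbol.
        output.append((dictionary[phrase], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary construction."""
    phrases = [""]    # phrases[0] is the empty prefix
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        pieces.append(phrase)
        phrases.append(phrase)
    return "".join(pieces)


pairs = lz78_encode("ABABABA")
print(pairs)
print(lz78_decode(pairs))
```

The decoder never receives the dictionary itself; it rebuilds it from the pairs in the same order as the encoder, which is why repeated patterns become cheaper the longer the input is.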
The LZ77 algorithm uses a sliding window and searches for matches within a predetermined distance back from the present position. Gzip and ZIP use LZ77 as part of the Deflate format. The LZ78 algorithm follows a more conservative approach of appending strings to an explicit dictionary; the V.42bis modem standard uses a method from this family. LZW is an enhanced version of LZ78 developed by Terry Welch in 1984 [21]. The encoder constructs an adaptive dictionary to represent variable-length strings with no prior knowledge of the input. The decoder dynamically constructs the same dictionary as the encoder from the received codes. In text data, some symbol sequences occur very frequently; the encoder stores such a sequence and maps it to one code. Typically, an LZW code is 12 bits long (4096 codes). The first 256 entries (0-255) represent the individual characters. The remaining 3840 codes (256-4095) are defined by the encoder to represent variable-length strings. UNIX compress, GIF images and other file formats use LZW coding.

BWT [22] is also known as block sorting compression. It rearranges a character string into runs of identical characters and is followed by two techniques to compress the data: the move-to-front transform and RLE. It makes data easier to compress where the string contains runs of repeated characters. The most important feature of BWT is that it is fully reversible and does not require any extra bits. BWT is a "free" method to improve the efficiency of text compression algorithms, at the cost of some additional computation. It is used in bzip2.

A simpler form of lossless data compression is RLE [23]. It divides the sequence of symbols into runs and non-runs. A run is represented by two parts, the data value and its count, instead of the original run. It is effective for data with high redundancy.

3. The LZMA Algorithm on Anomaly Labeled Data

LZMA is a modified version of the Lempel-Ziv algorithm designed to achieve a higher CR [24]. It is a lossless data compression algorithm based on the principle of dictionary based encoding. LZMA uses a complex data structure to encode one bit at a time. It uses a variable length dictionary (maximum size of 4 GB) and is mainly used to encode an unknown data stream. It is capable of compressing data generated at a rate of 10-20 Mbps in a real-time environment. Although it uses a larger dictionary, its decompression speed is still comparable to that of other compression algorithms.

The LZ77 algorithm encodes a byte sequence as a reference to earlier content, given by an address and a sequence length, instead of the original data. When no identical byte sequence is available in the earlier content, the address and sequence length are set to '0' and the new symbol itself is encoded. LZ77 thus employs a dynamic dictionary, realized as a sliding window, to compress unknown data. LZMA extends the LZ77 algorithm by adding a Delta Filter and a Range Encoder.

The Delta Filter transforms the input data stream so that the sliding window can compress it more effectively. It stores or transmits data as differences between successive values instead of the complete values. The first byte is output unchanged, and each subsequent byte is stored as the difference between the current byte and its previous byte. For continuously varying real time data, delta encoding makes the sliding dictionary more efficient [18, 19]. For example, consider the sample input sequence 2, 3, 4, 6, 7, 9, 8, 7, 5, 3, 4. Delta encoding produces the output sequence 2, 1, 1, 2, 1, 2, -1, -1, -2, -2, 1. The number of distinct symbols is thus reduced from 8 in the input sequence to 4 in the output sequence.

Static and adaptive dictionaries are the commonly used dictionary types. A static dictionary uses fixed entries and constants chosen for the application of the text. An adaptive dictionary takes its entries from the text and is generated at run time. A search buffer is employed as the dictionary, and the buffer size is chosen based on the implementation parameters. Patterns in the text are assumed to recur within the range of the search buffer. The offset and length are individually encoded, and a bit-mask is also separately encoded. Using an appropriate data structure for the buffer decreases the search time for the longest match. Sliding dictionary encoding is more costly than decoding because the encoder has to identify the longest match.

The Range Encoder encodes all the symbols of the message into a single number to attain a better CR. It deals efficiently with probabilities which are not exact powers of two. The steps involved in range encoding are listed below.
(i) Start with a large enough range of integers and a probability estimate for each symbol.
(ii) Divide the current range into sub-ranges whose sizes are proportional to the probabilities of the symbols they represent.
(iii) Encode every symbol of the message by narrowing the current range down to the sub-range which corresponds to the next symbol to be encoded.
The decoder must use the same probability estimates as the encoder, which can either be sent in advance or derived from the already transferred data [20]. The overall operation is shown in Fig. 1.
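The delta step can be checked with a few lines of code. The following is a minimal sketch that reproduces the worked example, taking as input the eleven readings whose differences give the output sequence quoted above. The function names `delta_encode` and `delta_decode` are hypothetical, and plain integers are used for readability; a byte-oriented filter would instead work modulo 256.

```python
def delta_encode(values):
    """Keep the first value; replace each later value by its difference
    from the preceding value."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


readings = [2, 3, 4, 6, 7, 9, 8, 7, 5, 3, 4]
deltas = delta_encode(readings)

print(deltas)
print("distinct input symbols: ", len(set(readings)))
print("distinct output symbols:", len(set(deltas)))
```

Fewer distinct symbols with more repetition is what helps the later stages: the sliding window finds longer matches and the range encoder sees a more skewed symbol distribution.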
LZMA is used to compress rapidly generated real time data. Initially, the sensor nodes sense the physical environment. Each sensed value is tested for anomalies and a label is appended to it. The label '1' is appended when the sensed value deviates from the actual data, i.e. an abnormal value is found. Likewise, the label '0' is appended when the sensed value shows no deviation from the actual data, i.e. a normal value is found. After labeling, the sensor node runs the LZMA algorithm, which compresses the labeled data irrespective of the label value using the dictionary, the sliding window and the range encoder. The compressed data is then transmitted to the BS, which receives it and performs decompression. As LZMA is a lossless compression technique, the reconstructed data is an exact replica of the original data with no loss of information.

Fig. 1. Workflow of LZMA on anomaly labeled data in WSN

4. Performance Evaluation

To assess the effectiveness of the LZMA algorithm on labeled data, its lossless compression performance is compared with 5 well-known compression algorithms, namely Huffman coding, AC, LZW, BWT coding and the Deflate algorithm.

4.1 Metrics

This section discusses the metrics used to analyze compression performance: CR, CF, and BPC.

Compression Ratio (CR): CR is defined as the ratio of the number of bits in the compressed data to the number of bits in the uncompressed data and is given in Eq. (1). A CR value of 0.62 indicates that the data occupies 62% of its original size after compression. A CR value greater than 1 indicates negative compression, i.e. the compressed data is larger than the original data.

$$\mathrm{CR} = \frac{\text{No. of bits in compressed data}}{\text{No. of bits in original data}} \tag{1}$$

Compression Factor (CF): CF is the inverse of the compression ratio and is given in Eq. (2). A higher value of CF indicates more effective compression, and a value of CF below 1 indicates expansion.

$$\mathrm{CF} = \frac{\text{No. of bits in original data}}{\text{No. of bits in compressed data}} \tag{2}$$

Bits per character (BPC): BPC is the number of bits required, on average, to represent one character of the input data after compression. It is defined as the ratio of the number of bits in the output sequence to the total number of characters in the input sequence and is given in Eq. (3).

$$\mathrm{BPC} = \frac{\text{No. of bits in compressed data}}{\text{No. of characters in original data}} \tag{3}$$

4.2 Dataset Description

For experimentation, two publicly available labeled WSN datasets are used. The labeled WSN dataset consists of temperature, humidity and label values gathered from single-hop and multi-hop scenarios using TelosB motes [25]. The data was collected for 6 hours at a time interval of 5 seconds. In the labeled data, the value 0 indicates an actual (normal) value and 1 indicates an abnormal value. The data was collected in both indoor and outdoor environments. The description of the labeled WSN dataset is tabulated in Table 1.

Table 1. Dataset description

5. Results And Discussion

To highlight the characteristics of LZMA based labeled data compression, it is compared with 5 state-of-the-art approaches. A direct comparison is made with the results of the existing methods on the same 2 datasets. Table 2 summarizes the experimental results of the compression algorithms in terms of the three compression metrics CR, CF and BPC. As evident from Table 2, the overall compression performance of the LZMA algorithm is significantly better than that of the other algorithms on both datasets.
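The three metrics of Section 4.1 follow directly from byte counts. The sketch below computes them with Python's standard-library codecs, under several assumptions: the helper `compression_metrics` is hypothetical, one character is taken to be one byte, the records are synthetic stand-ins for labeled sensor data rather than the dataset of [25], and `lzma`, `zlib` and `bz2` are only convenient representatives of LZMA, Deflate and BWT-based compression, not necessarily the implementations used in the experiments.

```python
import bz2
import lzma
import zlib


def compression_metrics(original: bytes, compressed: bytes) -> dict:
    """Return CR, CF and BPC, taking one character as one byte."""
    original_bits = 8 * len(original)
    compressed_bits = 8 * len(compressed)
    return {
        "CR": compressed_bits / original_bits,     # compressed / original
        "CF": original_bits / compressed_bits,     # original / compressed
        "BPC": compressed_bits / len(original),    # bits per input character
    }


# Synthetic "temperature,humidity,label" records (illustrative values only).
records = "\n".join(
    f"{27.0 + (i % 5) * 0.1:.1f},{45 + i % 3},{int(i % 50 == 0)}"
    for i in range(2000)
)
data = records.encode("ascii")

codecs = [
    ("LZMA", lzma.compress),
    ("Deflate", zlib.compress),
    ("BWT (bzip2)", bz2.compress),
]
for name, compress in codecs:
    m = compression_metrics(data, compress(data))
    print(f"{name:12s} CR={m['CR']:.4f} CF={m['CF']:.2f} BPC={m['BPC']:.3f}")
```

With one byte per character the definitions imply BPC = 8 × CR, which agrees with the reported average CR of 0.104 at 0.839 BPC. The numbers printed for the synthetic records say nothing about the real dataset; they only show how the metrics are obtained.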
It is observed that the LZMA algorithm achieves almost equal compression in both single hop and multi-hop scenarios. It is also noted that Huffman coding produces poorer results than the other algorithms. The compression performance on 8 different dataset files reveals that the compression algorithms perform very differently depending on the nature of the dataset. Among the existing methods, Deflate and BWT achieve almost similar compression performance. Likewise, Huffman and Arithmetic coding produce approximately equal compression performance. This is because the efficiency of an Arithmetic code is always better than or at least identical to that of a Huffman code. Like Huffman coding, Arithmetic coding estimates the probability of occurrence of each symbol and optimizes the length of the code accordingly. It achieves an optimum which corresponds closely to the theoretical limit given by information theory; only a minor degradation results from inaccuracies caused by the correction operations in the interval division. Huffman coding, on the other hand, incurs rounding errors because its code lengths are limited to whole numbers of bits, and its deviation from the theoretical value is larger than the inaccuracy of arithmetic coding. Though LZW achieves better compression than Huffman and arithmetic coding, it does not perform better than Deflate and BWT.
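The effect of whole-bit code lengths can be made concrete for a two-symbol source such as a stream of 0/1 labels. The sketch below compares the entropy, which is the theoretical limit that arithmetic coding approaches, with the average code length of a Huffman code. The probabilities 0.98 and 0.02 are assumed for illustration only and are not taken from the dataset, and the function names are hypothetical.

```python
import heapq
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def huffman_average_length(probabilities):
    """Average Huffman code length in bits per symbol.

    Uses the fact that the average length equals the sum of the weights
    of all internal nodes created while merging the two smallest weights.
    """
    if len(probabilities) == 1:
        return 1.0  # a lone symbol still needs one bit per occurrence
    heap = list(probabilities)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


labels = [0.98, 0.02]   # assumed share of normal and anomalous labels
print("entropy        :", round(entropy_bits(labels), 4), "bits/symbol")
print("Huffman average:", huffman_average_length(labels), "bits/symbol")
```

With only two symbols, a Huffman code cannot spend less than one bit per symbol however skewed the probabilities are, whereas the entropy of such a source is a small fraction of a bit. This applies to symbol-by-symbol coding of the labels alone; coding whole characters or blocks narrows the gap.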

Table 2. Comparison results of LZMA with existing methods

LZW works well where the level of redundancy is high. When the dictionary size is increased, however, the number of bits required for indexing also increases. This limitation makes the LZW results lag behind Deflate, BWT, and LZMA. Overall, LZMA compresses more effectively than the existing methods. Generally, dictionary based coding approaches are useful where the data to be compressed contains repeated patterns. As LZW is a dictionary based method, it produces good results for the labeled WSN dataset because temperature and humidity values are likely to recur. LZMA, which combines an LZ77-style dictionary with range encoding, produces significantly higher compression than LZW. Interestingly, LZMA requires a minimum bit rate of 0.749 BPC for single hop indoor data and a maximum bit rate of 0.922 BPC for single hop outdoor 2 data. It is also noted that Huffman coding and Arithmetic coding perform poorly, with average bit rates of 3.84 BPC and 3.659 BPC respectively. LZMA achieves an average CR of 0.104 at a bit rate of 0.839 BPC.

6. Conclusion

This paper employs the Lempel Ziv Markov chain Algorithm (LZMA), a lossless compression technique, to compress labeled WSN data. The sensor node uses the LZMA algorithm to compress the labeled data and transmits it to the BS via single-hop or multi-hop communication. The method enhances the network lifetime by reducing the amount of data transmitted, while anomalous data can still be easily identified from the labels. The performance of LZMA is compared with state-of-the-art approaches such as Arithmetic coding, Huffman coding, BWT, LZW and the Deflate algorithm. In this comparison, LZMA achieves significantly better compression, with an average CR of 0.104 at a bit rate of 0.839 BPC.
In future, this work can be extended to compress real time data from several other applications.

References

[1] D. Estrin, J. Heidemann, S. Kumar, and M. Rey, Next Century Challenges: Scalable Coordination in Sensor Networks, in Proceedings of the 5th annual ACM, 1999, pp. 263-270.
[2] K. Sohraby, D. Minoli, and T. Znati, Wireless Sensor Networks. 2007.
[3] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, Wireless sensor networks: a survey, Comput. Networks, vol. 38, no. 4, pp. 393-422, 2002.
[4] C. S. Raghavendra, K. M. Sivalingam, and T. Znati, Wireless Sensor Networks, 1st ed. Springer US, 2004.
[5] C. A. Smith, A Survey of Various Data Compression Techniques, Int. J. of Recent Technol. Eng., vol. 2, no. 1, pp. 1-20, 2010.
[6] D. Salomon, Data Compression: The Complete Reference, 4th ed. Springer, 2007.
[7] K. Sayood, Introduction to Data Compression. 2006.

[8] S. W. Drost and N. Bourbakis, A Hybrid system for real-time lossless image compression, Microprocess. Microsyst., vol. 25, no. 1, pp. 19-31, 2001.
[9] N. Kimura and S. Latifi, A Survey on Data Compression in Wireless Sensor Networks, Proc. Int. Conf. Inf. Technol. Coding Comput., pp. 16-21, 2005.
[10] J. W. Branch, B. K. Szymanski, C. Giannella, R. Wolff, and H. Kargupta, In-network outlier detection in wireless sensor networks, in Proc. of ICDCS, 2006.
[11] S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, Elliptical anomalies in wireless sensor networks, ACM Trans. Sens. Networks, vol. 6, no. 1, 2009.
[12] M. Moshtaghi, S. Rajasegarar, C. Leckie, and S. Karunasekera, Anomaly detection by clustering ellipsoids in wireless sensor networks, in Proc. of the ISSNIP, 2009.
[13] W. R. Heinzelman, A. Chandrakasan, and H. Balakrishnan, Energy-efficient communication protocol for wireless microsensor networks, Proc. 33rd Annu. Hawaii Int. Conf. Syst. Sci., pp. 3005-3014, 2000.
[14] Sariga and P. Sujatha, A survey on unequal clustering protocols in Wireless Sensor Networks, J. King Saud Univ. - Comput. Inf. Sci., 2017.
[15] M. M. Afsar and N. M. H. Tayarani, Clustering in sensor networks: a literature survey, J. Netw. Comput. Appl., vol. 46, pp. 198-226, 2014.
[16] T. Srisooksai, K. Keamarungsi, P. Lamsrichan, and K. Araki, Practical data compression in wireless sensor networks: A survey, J. Netw. Comput. Appl., vol. 35, no. 1, pp. 37-59, 2012.
[17] D. A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, Proc. IRE, vol. 40, no. 9, pp. 1098-1102, 1952.
[18] I. H. Witten, R. M. Neal, and J. G. Cleary, Arithmetic coding for data compression, Commun. ACM, vol. 30, no. 6, pp. 520-540, 1987.
[19] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337-343, 1977.
[20] J. Ziv and A. Lempel, Compression of Individual Sequences via Variable-Rate Coding, IEEE Trans. Inf. Theory, vol. 24, no. 5, pp. 530-536, 1978.
[21] T. A. Welch, A technique for high-performance Data Compression, IEEE Computer, vol. 17, no. 6, pp. 8-19, 1984.
[22] M. Burrows and D. Wheeler, A block-sorting lossless data compression algorithm, Digital SRC Research Report, no. 124, 1994.
[23] J. Capon, A probabilistic model for run-length coding of pictures, IRE Trans. Inf. Theory, vol. 100, pp. 157-163, 1959.
[24] Z. Tu and S. Zhang, A Novel Implementation of JPEG 2000 Lossless Coding Based on LZMA, in Proceedings of the Sixth IEEE International Conference on Computer and Information Technology, 2006.
[25] S. Suthaharan, M. Alzahrani, S. Rajasegarar, C. Leckie, and M. Palaniswami, Labelled Data Collection for Anomaly Detection in Wireless Sensor Networks, pp. 269-274, 2010.
