A Comprehensive Review of Data Compression Techniques

Volume-6, Issue-2, March-April 2016
International Journal of Engineering and Management Research

Palwinder Singh 1, Amarbir Singh 2
1,2 Department of Computer Science, GNDU, Amritsar, India

ABSTRACT

Communication and storage of data have become a major challenge because of the sheer volume of data involved. The data we store or transmit is generally redundant, in direct or indirect form. Data compression is the process of reducing the size of data being stored or transmitted by controlling this redundancy. Transmitting or storing large amounts of data increases hardware and transmission costs, so the selection of a good data compression algorithm is very important. Data compression encodes information using fewer bits than the original representation. This paper presents different data compression techniques, and conclusions are drawn on the basis of these techniques.

Keywords: Data Compression, Redundancy, Lossless Compression, Lossy Compression.

I. INTRODUCTION

Compression is the conversion of data into another format that requires fewer bits, usually so that the data can be stored and transmitted more easily and efficiently. In short, data compression is the art and science of representing information in a compact form. A text, audio, or video file is transformed into another format in such a way that the original file can be recovered with or without loss of information. This is desirable for data storage and transmission applications. Smaller files are desirable for data communication because the smaller the file, the faster it can be transferred, with less power and bandwidth. Based on how the data is recovered, there are two types of compression techniques: lossless compression and lossy compression [1]. For textual data, lossless compression is the most suitable.

II. TYPES OF DATA COMPRESSION

Data compression is possible because most real-world data is highly redundant. Data compression is basically defined as a technique that reduces the size of data by applying methods that are either lossy or lossless. A compression program is used to convert data from an easy-to-use format into one optimized for compactness. Two basic classes of data compression are applied in different areas [2]. One is lossy data compression, which is widely used to compress image data files for communication or archival purposes. The other is lossless data compression, which is commonly used to transmit or archive text or binary files whose information must be kept intact at all times.

Figure 2: Classification of data compression (data compression divides into lossy and lossless compression)

A. LOSSY DATA COMPRESSION

A lossy data compression system is one where the data reconstructed after decompression may not be exactly the same as the original data, but is "significantly close" enough to be valuable for a particular purpose. For example, random noise has very high information content, but when it is present in an image or a sound file we would typically be perfectly happy to drop it. Certain losses in images or sound may also be completely imperceptible to a human viewer. For this reason, lossy compression algorithms for images can often achieve a factor of two better compression than lossless algorithms, with an imperceptible loss in quality. However, when quality does start degrading in a noticeable way, it is important to make sure it degrades in a way that is least objectionable to the viewer (e.g., dropping random pixels is probably more objectionable than dropping some colour information). For these reasons, the way a lossy compression technique is used depends heavily on the medium being compressed: lossy compression for sound, for example, is very different from lossy compression for images. Some lossy data compression techniques are given below.

1. Transform Coding

This technique mainly compresses natural data such as images or audio files, and may produce a lower-quality version of the original image. The transform itself is a linear process in which no information is lost: the number of coefficients produced equals the number of pixels transformed. Many transforms have been tried for picture coding, for example the Fourier, Walsh, Hadamard, lapped orthogonal, and discrete cosine (DCT) transforms, and more recently wavelet and curvelet transforms [3]. Transform coding uses a linear mathematical transform to map the pixel values of the original image into a set of frequency-domain coefficients, called transform coefficients, which are then quantized and encoded. The key idea of transform-based coding is that, for most natural images, the resulting coefficients have small magnitudes and can be quantized (or discarded altogether) without causing significant distortion in the decoded image. For compression purposes, the more capable a transform is of packing information into few coefficients, the better the transform; for this reason the Discrete Cosine Transform (DCT) has become the most widely used transform coding technique. The basic steps of DCT transform encoding are:
1. Input the image.
2. Break the image into 8x8 blocks of pixels.
3. Apply the DCT to each block, reading pixels from left to right, top to bottom.
4. Compress each block using a quantization table; the array of compressed blocks occupies less memory.
5. When desired, reconstruct the image through decompression, a process that uses the Inverse Discrete Cosine Transform (IDCT).
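To make these steps concrete, here is a minimal Python sketch (NumPy and the particular quantization table are illustrative assumptions, not choices prescribed by the paper): it builds the orthonormal 8x8 DCT-II basis, quantizes the coefficients of one block, and reconstructs the block with the inverse transform.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis C, so the 2-D transform is C @ block @ C.T."""
    C = np.zeros((n, n))
    for k in range(n):
        alpha = np.sqrt(1.0 / n) if k == 0 else np.sqrt(2.0 / n)
        for i in range(n):
            C[k, i] = alpha * np.cos((2 * i + 1) * k * np.pi / (2 * n))
    return C

# Example quantization table (the standard JPEG luminance table).
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

C = dct_matrix()

def encode_block(block):
    coeffs = C @ (block - 128.0) @ C.T       # pixels -> frequency coefficients
    return np.round(coeffs / Q).astype(int)  # quantization: the only lossy step

def decode_block(qcoeffs):
    return C.T @ (qcoeffs * Q) @ C + 128.0   # dequantize, then inverse DCT

block = np.random.randint(0, 256, (8, 8)).astype(float)  # one 8x8 pixel block
rec = decode_block(encode_block(block))      # close to, but not equal to, block
```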
2. Vector Quantization

Vector Quantization (VQ) is an efficient technique for image compression. A VQ compression system contains two components: the VQ encoder and the VQ decoder. The encoder finds the closest-matching codeword for each image block in the codebook, and the index of that codeword is transmitted to the decoder. In the decoding phase, the VQ decoder replaces the index values with the respective codewords from the codebook and produces the quantized image, called the reconstructed image [4]. A vector is usually defined as a block of pixel values, and the basic idea behind the VQ technique is to develop a dictionary of fixed-size vectors, called code vectors. Vector quantization is also known as "block quantization" or "pattern matching quantization" and is commonly used in lossy compression methods. It works by encoding values from a multidimensional vector space into a finite set of values; since an index into this smaller set requires less storage than the vector itself, the data is compressed. Due to the density-matching property of vector quantization, the compressed data contains errors that are inversely proportional to density. The transformation is usually carried out by projection or by codebook lookup. The basic working of vector quantization is as follows:
1. Input the image.
2. Find the closest-matching code vector for each image block in the directory or codebook.
3. Replace each block by the transmitted index of its code vector for further processing.
This replacement of blocks by short indices is what reduces the storage space of the image.
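The following sketch shows the encoder/decoder pair described above, with a toy k-means trainer for the codebook; the block size, codebook size, and random training data are hypothetical choices made for illustration only.

```python
import numpy as np

def train_codebook(vectors, k=16, iters=10, seed=0):
    """Tiny k-means: learn k code vectors from training blocks."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # assign each training vector to its nearest code vector
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        for j in range(k):   # move each code vector to its cluster mean
            members = vectors[nearest == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def vq_encode(vectors, codebook):
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)       # transmit only these small indices

def vq_decode(indices, codebook):
    return codebook[indices]      # reconstructed (quantized) blocks

blocks = np.random.rand(500, 16)  # 4x4 image blocks flattened to 16-D vectors
cb = train_codebook(blocks)
idx = vq_encode(blocks, cb)       # one small index replaces 16 pixel values
rec = vq_decode(idx, cb)
```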

3. Block Truncation

Block truncation coding (BTC), introduced by Delp and Mitchell, is a simple and fast lossy image compression technique that works on digitized grayscale images. The key idea of BTC is to perform moment-preserving (MP) quantization on blocks of pixels, so that the quality of the image remains acceptable while the demand for storage space decreases. The technique can be improved by dividing the encoding into three separate tasks: performing the quantization of a block, coding the quantization data, and coding the bit plane. In this technique the image is divided into non-overlapping blocks of pixels [5]. A threshold and reconstruction values are then determined for each block; the threshold is usually the mean of the pixel values in the block. The bitmap of the block is derived by replacing every pixel whose value is greater than or equal to the threshold by 1, and every pixel below the threshold by 0. Then, for each group of 1s and 0s in the bitmap, the reconstruction value is determined as the average of the values of the corresponding pixels in the original block.
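A short sketch of the per-block procedure just described (mean threshold, one reconstruction level per bitmap group); the 4x4 block size is a hypothetical choice for illustration.

```python
import numpy as np

def btc_encode(block):
    """One block becomes a bit plane plus two reconstruction levels."""
    threshold = block.mean()
    bitmap = block >= threshold    # 1 where pixel >= mean, else 0
    high = block[bitmap].mean()    # reconstruction value for the 1s
    low = block[~bitmap].mean() if (~bitmap).any() else high  # for the 0s
    return bitmap, low, high

def btc_decode(bitmap, low, high):
    return np.where(bitmap, high, low)  # every pixel becomes one of two levels

block = np.random.randint(0, 256, (4, 4)).astype(float)
bitmap, low, high = btc_encode(block)
rec = btc_decode(bitmap, low, high)
# Storage: 16 bitmap bits plus two levels, instead of 16 full pixel values.
```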
B. LOSSLESS DATA COMPRESSION

Lossless data compression is a technique that allows the exact original data to be reconstructed from the compressed data. This is in contrast to lossy data compression, in which the original data cannot be restored exactly. The popular ZIP file format used for compressing data files is an application of the lossless approach. Lossless compression is used when it is important that the original data and the decompressed data be identical, regardless of the compression ratio achieved. Lossless text compression algorithms usually exploit statistical redundancy to represent the sender's data more concisely, without any error or loss of the information contained in the input [6]. Since most real-world data has statistical redundancy, lossless data compression is possible. For instance, in English text the letter 'a' is much more common than the letter 'z', and the probability that the letter 't' will be followed by the letter 'z' is very small; this type of redundancy can be removed by lossless compression. Lossless compression methods may be categorized according to the type of data they are designed to compress; compression algorithms are basically used for text, images, and sound. Most lossless compression programs use two different kinds of algorithms: one that generates a statistical model of the input data, and one that maps the input data to bit strings using this model, in such a way that frequently encountered data produces shorter output than improbable (less frequent) data. Some lossless data compression techniques are given below.

1. Run Length Coding

Run-Length Encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and a count, rather than as the original run. This is most useful on data that contains many such runs: for example, simple graphic images such as icons and line drawings. Consider a screen containing plain black text on a solid white background: there will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text.

Example of Run-Length Encoding: take a hypothetical single scan line as input, with B representing a black pixel and W representing white:

WWWWWWWWWWWWWWBWWWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB

Applying a simple run-length code to this scan line gives:

14WB14W3B24WB

Interpret this as fourteen W's, one B, fourteen W's, three B's, twenty-four W's, one B. The run-length code represents the original 57 characters in only 13. Of course, the actual format used for the storage of images is generally binary rather than ASCII characters like this, but the principle remains the same [7]. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as Deflate often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters.
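A minimal encoder/decoder sketch in Python for the scheme above. One deliberate deviation from the paper's output: the sketch always writes an explicit count (14W1B... rather than 14WB...), which keeps decoding unambiguous.

```python
def rle_encode(s):
    """Collapse each run of identical characters into count + character."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the current run
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

def rle_decode(code):
    """Expand each count/character pair back into a run."""
    out, count = [], ""
    for ch in code:
        if ch.isdigit():
            count += ch                 # accumulate multi-digit counts
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

line = "W" * 14 + "B" + "W" * 14 + "B" * 3 + "W" * 24 + "B"
code = rle_encode(line)
print(code)                             # 14W1B14W3B24W1B
assert rle_decode(code) == line
```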

2. Huffman Coding

Huffman coding is a data compression method that builds a binary tree from the bottom up, merging the least frequent symbols first, to generate an optimal code. In Huffman coding the characters in a data file are converted to binary codes, where the most common characters in the file receive the shortest binary codes and the least common characters receive the longest [8]. A Huffman code can be determined by successively constructing a binary tree whose leaves represent the characters to be encoded. Every node contains the relative probability of occurrence of the characters belonging to the subtree beneath it, and the edges are labeled with the bits 0 and 1. The algorithm to generate a Huffman code is:
1. Parse the input and count the occurrences of each symbol.
2. Determine the probability of occurrence of each symbol from its count.
3. Sort the symbols according to their probability of occurrence, with the most probable first.
4. Generate a leaf node for each symbol and add it to a queue.
5. Take the two least frequent nodes and logically group them under a parent whose frequency is their combined frequency; this builds up the binary tree structure.
6. Repeat step 5 until only one parent remains for all nodes, which is known as the root.
7. Label the edge from each parent to its left child with the digit 0 and the edge to its right child with 1.
Tracing down the tree from the root yields the Huffman codes, in which the shortest codes are assigned to the characters with the greatest frequency.
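A compact sketch of the algorithm above, using Python's heapq module as the priority queue of nodes. Tie-breaking by insertion order is an implementation choice, so the exact bit patterns may differ between implementations while the code remains optimal.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table: frequent symbols get shorter codes."""
    freq = Counter(text)
    # Heap entries are (frequency, tiebreaker, tree); a leaf is a symbol,
    # an internal node is a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node
            walk(node[0], prefix + "0")     # left edge labeled 0
            walk(node[1], prefix + "1")     # right edge labeled 1
        else:
            codes[node] = prefix or "0"     # single-symbol edge case
    walk(heap[0][2], "")
    return codes

text = "this is an example of huffman coding"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
# Common symbols such as ' ' get short codes; rare letters get longer ones.
```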
3. LZW Coding

LZW (Lempel-Ziv-Welch) is a purely dictionary-based coding method. LZW encoding is further divided into static and dynamic variants: in static dictionary coding, the dictionary is fixed during the encoding and decoding processes, while in dynamic dictionary coding the dictionary is updated as needed. LZW compression replaces strings of characters with single codes. It does not perform any analysis of the incoming text; instead, it simply adds every new string of characters it sees to a table of strings. The codes that the LZW algorithm outputs can be of arbitrary length, but each must have more bits in it than a single character [9]. LZW compression works best for files containing lots of repetitive data. LZW maintains a dictionary in which every string entry and its code are stored. The basic steps of LZW coding are:
1. Input the data stream.
2. Initialize the dictionary to contain an entry for each single character.
3. Read the longest string of input characters that already has an entry in the dictionary.
4. Write out the code for that string.
5. If there is a next input character, add the matched string plus that character to the dictionary as a new entry with the next free code, then continue from step 3 at that character.
6. When the end of the stream is reached, write out the code for the final matched string and exit.
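A minimal sketch of the LZW encoder and decoder described above, assuming the dictionary is initialized with the 256 single-byte characters; the decoder also handles the classic corner case in which a received code refers to the dictionary entry that is still being defined.

```python
def lzw_encode(data):
    """LZW: replace strings with dictionary codes, growing the dictionary."""
    dictionary = {chr(i): i for i in range(256)}  # single-character entries
    next_code, current, out = 256, "", []
    for ch in data:
        if current + ch in dictionary:
            current += ch                        # extend the current match
        else:
            out.append(dictionary[current])      # emit code for the match
            dictionary[current + ch] = next_code # add the new string
            next_code += 1
            current = ch
    if current:
        out.append(dictionary[current])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # Corner case: the code may name the entry being defined right now.
        entry = dictionary[code] if code in dictionary else prev + prev[0]
        out.append(entry)
        dictionary[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)

codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT")
assert lzw_decode(codes) == "TOBEORNOTTOBEORTOBEORNOT"
```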
4. Arithmetic Coding

Arithmetic coding is an optimal entropy coding technique: it provides the best compression ratio and usually achieves better results than Huffman coding, though it is quite complicated compared to the other coding techniques. When a string is arithmetic-encoded, the characters with the highest probability of occurrence are stored with fewer bits, and the characters that occur less frequently are stored with more bits, resulting in fewer bits used overall. Arithmetic coding converts the stream of input symbols into a single floating-point number as output [10]. Unlike Huffman coding, arithmetic coding does not code each symbol separately; each symbol is instead coded by considering all prior data. Thus a data stream encoded in this fashion must always be read from the beginning, and consequently random access is not possible. An algorithm to generate an arithmetic code is:
1. Calculate the number of unique symbols in the input. This number represents the base b of the arithmetic code (e.g., base 2 is binary).
2. Assign values from 0 to b-1 to each unique symbol in the order they appear.
3. Using the values from step 2, replace the symbols in the input with their codes.
4. Convert the result from step 3 from base b to a sufficiently long fixed-point binary number to preserve precision.
5. Record the length of the input string somewhere in the result, as it is needed for decoding.
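The paper presents arithmetic coding in its base-conversion formulation; the sketch below instead uses the more common interval-narrowing formulation, with exact fractions from Python's fractions module to sidestep floating-point precision. It illustrates the principle (one number encodes the whole message, and decoding must start from the beginning) rather than the paper's exact algorithm.

```python
from fractions import Fraction
from collections import Counter

def build_intervals(text):
    """Give each symbol a subinterval of [0, 1) sized by its probability."""
    total = len(text)
    intervals, low = {}, Fraction(0)
    for sym, count in sorted(Counter(text).items()):
        p = Fraction(count, total)
        intervals[sym] = (low, low + p)
        low += p
    return intervals

def arithmetic_encode(text, intervals):
    """Narrow [low, high) by each symbol's subinterval; any point in the
    final interval identifies the whole message."""
    low, high = Fraction(0), Fraction(1)
    for sym in text:
        s_low, s_high = intervals[sym]
        span = high - low
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2          # one number encodes the message

def arithmetic_decode(value, intervals, length):
    out = []
    for _ in range(length):          # the message length must be known
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:
                out.append(sym)
                value = (value - s_low) / (s_high - s_low)  # rescale
                break
    return "".join(out)

msg = "ABRACADABRA"
iv = build_intervals(msg)
code = arithmetic_encode(msg, iv)
assert arithmetic_decode(code, iv, len(msg)) == msg
```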
5. Area Coding

Area coding is an enhanced form of run-length encoding (RLE) that reflects the two-dimensional character of images. The algorithm tries to find rectangular regions with the same characteristics; these regions are coded in a descriptive form as an element with two points and a certain structure. This type of coding can be highly effective, but it has the drawback of being a nonlinear method that cannot be implemented in hardware, so its performance in terms of compression time is not competitive. In this technique, special codewords are used to identify large areas of contiguous 1s or 0s. The whole image is divided into blocks of m x n pixels, which are classified as blocks with only white pixels, blocks with only black pixels, or blocks with mixed intensity. The most frequently occurring category is assigned the 1-bit codeword 0, and the remaining two categories are assigned the 2-bit codewords 10 and 11. The code assigned to the mixed-intensity category is used as a prefix, followed by the mn-bit pattern of the block. Compression is achieved because the mn bits normally used to represent each constant area are replaced by a 1-bit or 2-bit codeword. For white text documents there is a slightly simpler approach, called white block skipping, in which solid white areas are coded as 0 and all other blocks, including solid black blocks, are coded as 1 followed by the bit pattern of the block. This approach takes advantage of the anticipated structural patterns of the image to be compressed: since few solid black areas are expected, they are grouped with the mixed regions, allowing a 1-bit codeword to be used for the highly probable white blocks [11].

III. CONCLUSION

In this paper we have presented various techniques for compressing text data in a lossless manner, describing each technique along with its algorithm and disadvantages. It is shown that no single algorithm provides results good enough to serve every practical application. Hence, in the future there is a need to develop lossless and lossy data compression algorithms that compress text data more effectively and can be used in the various practical applications where compression of text data is required.

REFERENCES

[1] M.B. Bhammar and K. Mehta, "A survey of various image compression techniques," IJDI-ERET - International Journal of Darshan Institute on Engineering Research & Emerging Technology, Vol. 1, No. 1, 2012.
[2] R.S. Brar and B. Singh, "A survey on different compression techniques and bit reduction algorithm for compression of text data," International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), Volume 3, Issue 3, March 2013.
[3] S. Porwal, Y. Chaudhary, J. Joshi and M. Jain, "Data Compression Methodologies for Lossless Data and Comparison between Algorithms," International Journal of Engineering Science and Innovative Technology (IJESIT), Volume 2, Issue 2, March 2013.
[4] S. Shanmugasundaram and R. Lourdusamy, "A Comparative Study of Text Compression Algorithms," International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011.
[5] Satish Kannale, Kavita Khare, Deepak Andore and Mallikarjun Mugli, "FPGA implementation of selective Huffman coding for high speed data compression and decompression," World Journal of Science and Technology, 1(8): 89-93, 2011, ISSN: 2231-2587.
[6] Suzanne Rigler, William Bishop and Andrew Kennings, "FPGA-Based Lossless Data Compression using Huffman and LZ77 Algorithms," IEEE, 2007.
[7] S. Kapoor and A. Chopra, "A Review of Lempel Ziv Compression Techniques," IJCST, Vol. 4, Issue 2, April-June 2013.
[8] Francesco Marcelloni and Massimo Vecchio, "A Simple Algorithm for Data Compression in Wireless Sensor Networks," IEEE Communications Letters, vol. 12, no. 6, June 2008.
[9] A. Singh and Y. Bhatnagar, "Enhancement of data compression using Incremental Encoding," International Journal of Scientific & Engineering Research, Volume 3, Issue 5, May 2012.
[10] P.B. Khobragade and S.S. Thakare, "Image compression techniques - A Review," International Journal of Computer Science and Information Technologies, Vol. 5(1), 2014.
[11] N. Kashyap and S.N. Singh, "Review of image compression and comparison of its algorithms," International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 2, Issue 12, December 2013.