Comparison between Variable Bit Representation Techniques for Text Data Compression

Volume 119 No. 10 2018, 631-641
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

1 Swathy Nair, 2 Nesna Hakkim and 3 V.R. Rajalakshmi

1 Department of Computer Science and IT, Amrita School of Arts & Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. swathynairoct18@gmail.com
2 Department of Computer Science and IT, Amrita School of Arts & Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. h.nesna@yahoo.com
3 Department of Computer Science and IT, Amrita School of Arts & Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. rajiprithviraj@gmail.com

Abstract

In this era, where computer applications are becoming essential to daily life, compression of data is vital. A range of file compression algorithms has been introduced, and choosing the best among them is an important decision. In this paper, we discuss lossless compression of text files using two techniques of Variable Bit Representation: word-based compression and character-based compression. The word-based technique replaces words with bit values, whereas the character-based technique replaces characters. In these methods, the repeated words/characters are represented as binary values: the word/character with the highest repetition count is represented by the lowest binary value (0), the word/character with the next-highest frequency (repetition count) is assigned the next value (1), and so on. These words/characters and their equivalent bit values are saved in an index table. The two techniques are examined and then compared with each other to assess their effectiveness in compressing text data. This article infers which technique performs better for text data compression.

Key Words: Text data compression, Lossless compression algorithms, Bit Representation method, Word-based compression, Character-based compression.

1. Introduction

Efficient transfer and storage of data files are important aspects that increase the performance of computer applications. Compression is one of the techniques that help to meet these requirements. Compression is the act of representing a file in a more compact form than its original format, leading to a reduction in file size. Compression can be either lossless or lossy. Compressing a file produces a compressed file from which the original file can be reconstructed. In lossless compression this reconstruction can be done without any loss of data, whereas in lossy compression there will be some loss of data after reconstruction. Examples of lossless compression techniques are Huffman Coding, Variable Bit Representation, and Arithmetic Encoding. This paper compares the two techniques (algorithms) of the Bit Representation method to find which is better for text file compression.

Figure 1: Lossless/Lossy Compression Techniques

2. Related Works

Nimisha E, Shyama P, Rajalakshmi V R, "A New Approach to Increase the Storage Efficiency of Databases Using BIT Representation" [1]. Here, the authors propose a lossless compression technique that uses binary values to represent every individual attribute in a database. The count of distinct attribute values was found, and the number of bits needed to represent each of them was calculated (using the general rule that n bits can represent 2^n combinations). An index table was created with the unique attribute values and their corresponding bit values. The original table was updated with the corresponding bit values, but the frequencies of the attributes were not taken into consideration: an equal number of bits was used to represent every distinct attribute value.

Anoop R, Subhadra G. Varma, Rajalakshmi V R, "An Effective Approach to Improve Storage Efficiency Using Variable Bit Representation" [2]. Here, the authors proposed a lossless compression technique that uses binary values to represent every distinct attribute in a database. The concept is similar to the work of Nimisha E et al. [1], which uses a uniform-length bit representation, but it is further improved by truncating the redundant leading zeroes in the binary values used during compression.
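As a rough illustration of the difference between the two approaches above, the following Python sketch computes the fixed width needed for a given number of distinct values and contrasts fixed-width codes with codes whose leading zeroes are dropped. The function names (bits_needed, fixed_width_codes, truncated_codes) are hypothetical and are not taken from [1] or [2]; this is only a sketch of the general rule, not either paper's implementation.

```python
def bits_needed(distinct_count):
    """Smallest width w (at least 1) such that 2**w >= distinct_count."""
    if distinct_count < 1:
        raise ValueError("distinct_count must be positive")
    # (n - 1).bit_length() is the number of bits of the largest code, n - 1.
    return max(1, (distinct_count - 1).bit_length())


def fixed_width_codes(distinct_count):
    """Uniform-length codes: every value is padded to the same width."""
    width = bits_needed(distinct_count)
    return [format(i, "0{}b".format(width)) for i in range(distinct_count)]


def truncated_codes(distinct_count):
    """Variable-length codes: the same values without leading zeroes."""
    return [format(i, "b") for i in range(distinct_count)]


# Five distinct values need 3 bits each in the uniform scheme,
# but only 1 to 3 bits once leading zeroes are truncated.
uniform = fixed_width_codes(5)   # ['000', '001', '010', '011', '100']
variable = truncated_codes(5)    # ['0', '1', '10', '11', '100']
```

With the truncated form, values that occur early in the ordering are shorter, which is why the later sections assign them to the most frequent items.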

Shrusti Porwal, Yashi Chaudhary, Jitendra Joshi, Manish Jain, "Data Compression Methodologies for Lossless Data and Comparison between Algorithms" [3]. Here, the authors compared two lossless compression techniques, Huffman Coding and Arithmetic Encoding. Several performance measures were calculated to analyze which technique is better: compression ratio, compression speed, decompression speed, memory space needed, support for random access, and compressed pattern matching. The authors conclude that Arithmetic Encoding has a better compression ratio than Huffman Coding.

S.R. Kodituwakku, U.S. Amarasinghe, "Comparison of Lossless Data Compression Algorithms for Text Data" [4]. Here, the authors compared different lossless compression algorithms for text data. The algorithms were tested on different types of files, but the main interest was in different test patterns. By comparing the compression times, decompression times and saving percentages of the selected algorithms, it was concluded that the Shannon Fano algorithm was the best among them.

3. Proposed Work

The methods compared in this paper are Word-based Bit Representation and Character-based Bit Representation.

Word-based Bit Representation

In the word-based bit representation technique, each repetitive word in a text file is assigned a unique bit value. A new file is then created in which the repetitive words are replaced with their corresponding bit values. This file is the compressed form of the original file.

Compression Method

Step I: Find the frequency of the words in the text file.
Step II: Find the words with a count equal to or greater than 2 and sort them in decreasing order of their frequencies.
Step III: Create an index table and insert the repetitive words and their corresponding frequencies.
EXAMPLE: INSERT INTO IndexTable (Words, Frequency) VALUES ('"+word+"',"+count+")
Step IV: Calculate the number of bits needed to represent each of these words.
Step V: Generate bit values for all the words in the index table.
Step VI: Update the table created in Step III with the bit value of each word. The words are sorted by frequency (decreasing order) and bit values are assigned in that order, i.e. the word with the highest frequency gets the lowest bit value, thus reducing the size.
EXAMPLE: UPDATE IndexTable SET BitValue='"+ bit_value +"' WHERE Frequency = "+ frequency +" AND Words = '"+ word +"'"

Step VII: Create a new text file with the same content as the original file.
Step VIII: Update this text file by replacing the repetitive words (of the original file) with their corresponding bit values.

Decompression Method

Step I: From the index table, retrieve the words and their equivalent bit values.
EXAMPLE: SELECT Words, BitValue FROM IndexTable
Step II: In the compressed file, replace the bit values with their equivalent words.

Character-based Bit Representation

In character-based bit representation, each character (a-z, A-Z, special characters etc.) in a text file is assigned a unique bit value. A new file is then created in which all the characters are replaced with their corresponding bit values. This file is the compressed form of the original file.

Compression Method

Step I: Find the count of every (distinct) character in the text file.
Step II: Sort the characters in decreasing order of their frequencies.
Step III: Create an index table with the characters and their corresponding frequencies.
EXAMPLE: INSERT INTO CharacterIndexTable (Characters, Frequency) VALUES ('"+character+"',"+count+")
Step IV: Calculate the number of bits needed to represent each of these characters.
Step V: Generate bit values for all the characters in the index table.
Step VI: Update the table created in Step III with the bit value of each character. The characters are sorted by frequency (decreasing order) and bit values are assigned in that order, i.e. the character with the highest frequency gets the lowest bit value.
EXAMPLE: UPDATE CharacterIndexTable SET BitValue='"+ bit_value +"' WHERE Frequency = "+ frequency +" AND Characters = '"+ character +"'"
Step VII: Create a new text file with the same content as the original file.
Step VIII: Update this text file by replacing the characters with their corresponding bit values.

Decompression Method

Step I: From the index table, retrieve the characters and their equivalent bit values.
EXAMPLE: SELECT Characters, BitValue FROM CharacterIndexTable
Step II: In the compressed file, replace the bit values with their equivalent characters.

4. Experiments and Results

Word-based Bit Representation

This paper uses the text file File1.txt (Fig. 2), which contains repetitive words, for experimentation.

Figure 2: Text File File1.txt

First, the frequency of each word in the given text file is calculated. The words with a count greater than or equal to 2 are selected, as they account for a significant amount of storage space. For example, in the above text file, the word "the" appears 6 times. Similarly, there are 13 words that satisfy the specified condition. These words are selected and stored in a table along with their frequencies (Table 1). As per the general rule, with n bits there can be 2^n unique value combinations. Here we have 13 unique words and we need a maximum of 4 bits to represent all the values. The actual bit values for these combinations are represented as: 0, 1, 10, 11, 100, ..., 1111.

Next, the words in the index table are sorted in decreasing order of their frequencies and the generated bit values are stored. The word with the maximum frequency is given the lowest bit value. This reduces the storage space because the word that repeats most often is assigned the smallest bit value. Different words may have the same frequency; in this case, the assignment is done sequentially. The index table created for File1.txt is presented below.

Table 1: Database - IndexTable

IndexTable
Words        Frequency  BitValue
the          6          0
Lorem        5          10
Ipsum        5          1
of           4          11
and          3          100
with         2          1101
typesetting  2          1100
type         2          1011
text         2          1010
is           2          1001
has          2          1000
dummy        2          111
a            2          110
It           2          101

In the above table, the word with the maximum frequency is "the", and it is given the smallest value 0, which is one bit. In the same way, bit values are assigned to the remaining words in the index table according to their frequency.

After the index table is created, a new text file is generated with the same content as the original file, but the words from the index table are replaced with their equivalent bit values (Fig. 3).

Figure 3: Compressed Text File

Here, the size of the original text file is 4752 bits, whereas the compressed file is 2870 bits.

Character-based Bit Representation

First, the frequency of each character is obtained from the given text file File1.txt (Fig. 2). For example, in this text file, the character "t" appears 44 times. Likewise, there are 34 characters (alphanumeric and special characters) in the file. These characters are stored in an index table along with their frequencies (Table 2). As per the general rule, with n bits there can be 2^n unique value combinations. Here we have 34 unique characters and we need a maximum of 6 bits to represent all the values. The actual bit values for these combinations are represented as: 0, 1, 10, 11, 100, ..., 111111.

Now, the characters in the index table are sorted in decreasing order of their frequencies and the generated bit values are stored. The character with the highest frequency is assigned the lowest bit value. Where different characters have the same frequency, the bit values are assigned sequentially. The index table generated for the text file File1.txt is presented below.

Table 2: Database - CharacterIndexTable

CharacterIndexTable
Characters  Frequency  BitValue
e           60         0
t           44         1
s           41         10
n           38         11
i           33         100
a           29         101
o           26         110
r           25         111
m           20         1000
p           19         1001
u           18         1010
l           17         1011
d           16         1100
h           15         1101

In the above table, the character with the highest frequency, "e", is given the smallest value 0, which is one bit. The character with the next-highest frequency, "t", is given the next bit value 1, "s" is given 10, and so on.
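The assignment rule illustrated by Tables 1 and 2 (count, sort by decreasing frequency, give the most frequent item the smallest binary value) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, an in-memory dictionary stands in for the database index table, words are split on whitespace, ties are broken by order of first appearance, and the compressed output is kept as a list of separate codes because the paper does not state how adjacent bit values are delimited in the compressed file.

```python
from collections import Counter


def build_index(tokens, min_count=1):
    """Map each token seen at least min_count times to a binary string.

    Tokens are ranked by decreasing frequency: rank 0 gets "0",
    rank 1 gets "1", rank 2 gets "10", and so on.
    """
    counts = Counter(tokens)
    # most_common() orders by decreasing count; tokens with equal counts
    # keep first-seen order (the tie-breaking assumption of this sketch).
    ranked = [tok for tok, n in counts.most_common() if n >= min_count]
    return {tok: format(rank, "b") for rank, tok in enumerate(ranked)}


def compress(tokens, index):
    """Replace indexed tokens with their bit values; keep the rest as is."""
    return [index.get(tok, tok) for tok in tokens]


def decompress(codes, index):
    """Reverse lookup. Assumes no unindexed token looks like a bit value."""
    reverse = {code: tok for tok, code in index.items()}
    return [reverse.get(code, code) for code in codes]


# Word-based: only words that occur at least twice are indexed.
words = "the cat and the dog and the bird".split()
word_index = build_index(words, min_count=2)   # {'the': '0', 'and': '1'}
word_codes = compress(words, word_index)

# Character-based: every character is indexed.
chars = list("banana")
char_index = build_index(chars)                # {'a': '0', 'n': '1', 'b': '10'}
char_codes = compress(chars, char_index)       # ['10', '0', '1', '0', '1', '0']
```

Keeping the codes as separate list items sidesteps the question of how a decoder would tell "10" apart from "1" followed by "0"; a real file format would need a fixed width, a separator, or stored lengths, none of which is specified in the text.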

After the index table is created, a new text file is generated with the same content as the original file, but all the characters are replaced with their equivalent bit values (Fig. 4).

Figure 4: Compressed Text File

Here, the size of the original text file is 4752 bits, whereas the compressed file is 1520 bits.

Figure 5: Bar Chart Representing the Sizes of the Original File and the Compressed Files

As these are lossless compression techniques, the contents of the file are compressed without any loss of data while the storage space is decreased. From the bar chart (Fig. 5), it can be inferred that character-based compression is the better technique for reducing the size of the file. Thus, the character-based compression technique improves storage efficiency when compared to the word-based compression algorithm.

5. Conclusion

In this paper, we have compared the two techniques of variable bit representation. The major difference between the two is that word-based compression does not generate bit values for all the words in the file; it generates them only for words with a frequency of 2 or more, whereas character-based compression generates bit values for all the characters. As a result, every word in the file is encoded in bit values, and this is the reason that character-based compression performs better than the word-based compression technique. Character-based compression reduces the file size significantly more than the word-based compression technique.

References

[1] Nimisha E., Shyama P., Rajalakshmi V.R., A New Approach to Increase the Storage Efficiency of Databases Using BIT Representation, International Journal of Control Theory and Applications 10(24) (2017), 367-375.
[2] Anoop R., Varma S.G., Rajalakshmi V.R., An Effective Approach To Improve Storage Efficiency Using Variable Bit Representation.
[3] Porwal S., Chaudhary Y., Joshi J., Jain M., Data compression methodologies for lossless data and comparison between algorithms, International Journal of Engineering Science and Innovative Technology 2 (2013), 142-147.
[4] Kodituwakku S.R., Amarasinghe U.S., Comparison of lossless data compression algorithms for text data, Indian Journal of Computer Science and Engineering 1(4) (2010), 416-425.
[5] Pu I.M., Fundamental Data Compression, Elsevier, Britain (2006).
[6] Blelloch E., Introduction to Data Compression, Computer Science Department, Carnegie Mellon University (2002).
[7] Kesheng W., Otoo J., Arie S., Optimizing bitmap indices with efficient compression, ACM Trans. Database Systems 31 (2006), 1-38.
[8] Kaufman K., Shmuel T., Semi-lossless text compression, Intl. J. Foundations of Computer Sci. 16 (2005), 1167-1178.
[9] Campos A.S.E., Basic arithmetic coding by Arturo Campos, Website (2009). http://www.arturocampos.com/ac_arithmetic.html
[10] Vo Ngoc, Alistair M., Improved word aligned binary compression for text indexing, IEEE Trans. Knowledge & Data Engineering 18 (2006), 857-861.
[11] Cormak V., Horspool S., Data compression using dynamic Markov modeling, Comput. J. 30 (1987), 541-550.
[12] Capocelli M., Giancarlo R., Taneja J., Bounds on the redundancy of Huffman codes, IEEE Trans. Inf. Theory 32 (1986), 854-857.
[13] Gawthrop J., Liuping W., Data compression for estimation of the physical parameters of stable and unstable linear systems, Automatica 41 (2005), 1313-1321.
[14] Paul G. Howard, Jeffrey Scott Vitter, Practical Implementations of Arithmetic Coding, a shortened version appears in the proceedings of the International Conference on Advances in Communication and Control (1991).
[15] Amandeep Singh Sidhu, Er. Meenakshi Garg, Text Data Compression Algorithm using Hybrid Approach, International Journal of Computer Science and Mobile Computing 3(12) (2014), 01-10.
[16] Anmol Jyot Maan, Analysis and Comparison of Algorithms for Lossless Data Compression, International Journal of Information and Computation Technology 3(3) (2013), 139-146.
