A New Algorithm based on Variable BIT Representation Technique for Text Data Compression

Similar documents
Comparison between Variable Bit Representation Techniques for Text Data Compression

An Effective Approach to Improve Storage Efficiency Using Variable bit Representation

Data Compression Scheme of Dynamic Huffman Code for Different Languages

FPGA based Data Compression using Dictionary based LZW Algorithm

Keywords Data compression, Lossless data compression technique, Huffman Coding, Arithmetic coding etc.

A Comparative Study of Lossless Compression Algorithm on Text Data

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio?

A Research Paper on Lossless Data Compression Techniques

Text Data Compression and Decompression Using Modified Deflate Algorithm

A Comparative Study of Entropy Encoding Techniques for Lossless Text Data Compression

An Advanced Text Encryption & Compression System Based on ASCII Values & Arithmetic Encoding to Improve Data Security

DEFLATE COMPRESSION ALGORITHM

Comparative Study of Dictionary based Compression Algorithms on Text Data

Information Technology Department, PCCOE-Pimpri Chinchwad, College of Engineering, Pune, Maharashtra, India 2

Fundamentals of Multimedia. Lecture 5 Lossless Data Compression Variable Length Coding

A Comprehensive Review of Data Compression Techniques

Lossless Compression Algorithms

FILE CONVERSION AFTERMATH: ANALYSIS OF AUDIO FILE STRUCTURE FORMAT

Analysis of Parallelization Effects on Textual Data Compression

IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 10, 2015 ISSN (online):

David Rappaport School of Computing Queen s University CANADA. Copyright, 1996 Dale Carnegie & Associates, Inc.

HARDWARE IMPLEMENTATION OF LOSSLESS LZMA DATA COMPRESSION ALGORITHM

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

A COMPRESSION TECHNIQUES IN DIGITAL IMAGE PROCESSING - REVIEW

Encoding. A thesis submitted to the Graduate School of University of Cincinnati in

ENTROPY ENCODERS: HUFFMAN CODING AND ARITHMETIC CODING 1

EE67I Multimedia Communication Systems Lecture 4

Multimedia Networking ECE 599

A New Compression Method Strictly for English Textual Data

Analysis of Huffman and Run-length encoding Compression Algorithms on Different Image Files

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson

AN ANALYTICAL STUDY OF LOSSY COMPRESSION TECHINIQUES ON CONTINUOUS TONE GRAPHICAL IMAGES

Cost Minimization by QR Code Compression

So, what is data compression, and why do we need it?

Data Compression Fundamentals

Image Compression for Mobile Devices using Prediction and Direct Coding Approach

Comparison of Text Data Compression Using Run Length Encoding, Arithmetic Encoding, Punctured Elias Code and Goldbach Code

Comparative Analysis of Sequitur Algorithm with Adaptive Huffman Coding Algorithm on Text File Compression with Exponential Method

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

An Efficient Compression Technique Using Arithmetic Coding

Data Encryption on FPGA using Huffman Coding

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

ENHANCED DCT COMPRESSION TECHNIQUE USING VECTOR QUANTIZATION AND BAT ALGORITHM Er.Samiksha 1, Er. Anurag sharma 2

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

A Novel Image Compression Technique using Simple Arithmetic Addition

Dynamic with Dictionary Technique for Arabic Text Compression

A Novel Approach for Reduction of Huffman Cost Table in Image Compression

Analysis of Basic Data Reordering Techniques

Volume 2, Issue 9, September 2014 ISSN

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

Comparison of Existing Dictionary Based Data Compression Methods for English and Gujarati Text # MCA,SRIMCA

DCT SVD Based Hybrid Transform Coding for Image Compression

LOSSLESS VERSUS LOSSY COMPRESSION: THE TRADE OFFS

Optimized Compression and Decompression Software

Image coding and compression

Viterbi Algorithm for error detection and correction

CHAPTER 6. 6 Huffman Coding Based Image Compression Using Complex Wavelet Transform. 6.3 Wavelet Transform based compression technique 106

IMAGE COMPRESSION USING HYBRID QUANTIZATION METHOD IN JPEG

STUDY OF VARIOUS DATA COMPRESSION TOOLS

ROI Based Image Compression in Baseline JPEG

Image Processing and Watermark

IMAGE COMPRESSION TECHNIQUES

IMAGE COMPRESSION USING HYBRID TRANSFORM TECHNIQUE

Message Communication A New Approach

FRACTAL IMAGE COMPRESSION OF GRAYSCALE AND RGB IMAGES USING DCT WITH QUADTREE DECOMPOSITION AND HUFFMAN CODING. Moheb R. Girgis and Mohammed M.

Data Compression. Media Signal Processing, Presentation 2. Presented By: Jahanzeb Farooq Michael Osadebey

Lossless Audio Coding based on Burrows Wheeler Transform and Run Length Encoding Algorithm

Journal of Computer Engineering and Technology (IJCET), ISSN (Print), International Journal of Computer Engineering

Enhanced Hybrid Compound Image Compression Algorithm Combining Block and Layer-based Segmentation

A Comparative Study Of Text Compression Algorithms

Multimedia Data and Its Encoding

Ch. 2: Compression Basics Multimedia Systems

Enhancing the Compression Ratio of the HCDC Text Compression Algorithm

A Compression Method for PML Document based on Internet of Things

Optimization of Bit Rate in Medical Image Compression

Textual Data Compression Speedup by Parallelization

Department of electronics and telecommunication, J.D.I.E.T.Yavatmal, India 2

Source Coding Basics and Speech Coding. Yao Wang Polytechnic University, Brooklyn, NY11201

Illustration of Random Forest and Naïve Bayes Algorithms on Indian Liver Patient Data Set

Quality Measurements of Lossy Image Steganography Based on H-AMBTC Technique Using Hadamard Transform Domain

TEXT COMPRESSION ALGORITHMS - A COMPARATIVE STUDY

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM

2-D SIGNAL PROCESSING FOR IMAGE COMPRESSION S. Venkatesan, Vibhuti Narain Rai

IMAGE COMPRESSION TECHNIQUES

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

Information Theory and Communication

Highly Secure Invertible Data Embedding Scheme Using Histogram Shifting Method

Fingerprint Image Compression

Shri Vishnu Engineering College for Women, India

Lossless Image Compression with Lossy Image Using Adaptive Prediction and Arithmetic Coding

Part 1 of 4. MARCH

Chapter 7 Lossless Compression Algorithms

Computational Complexities of the External Sorting Algorithms with No Additional Disk Space

Implementation and Analysis of Efficient Lossless Image Compression Algorithm

Engineering Mathematics II Lecture 16 Compression

An Implementation on Pattern Creation and Fixed Length Huffman s Compression Techniques for Medical Images

Data Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression

THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS

Lossless Text Compression Technique Based on Static Dictionary for Unicode Tamil Document

Transcription:

Volume 119 No. 10 2018, 657-667 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A New Algorithm based on Variable BIT Representation Technique for Text Data Compression 1 Nesna Hakkim, 2 Swathy Nair and 3 V.R. Rajalakshmi 1 Department of Computer Science and IT, Amrita School of Arts & Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. h.nesna@yahoo.com 2 Department of Computer Science and IT, Amrita School of Arts & Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. swathynairoct18@gmail.com 3 Department of Computer Science and IT, Amrita School of Arts & Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. rajiprithviraj@gmail.com Abstract Data compression is a widespread requisite for the majority of the computer applications these days, and numerous algorithms have been suggested for this purpose. It is important to choose the best among these proposed algorithms. In this paper, we discuss a new lossless compression technique for text file compression. This algorithm combines two techniques of Variable Bit Representation Word-based compression and Character-based compression and is called Hybrid Compression technique. Here, the distinct repetitive values of a word and character are represented as binary values where the word/character with the highest frequency are assigned the lowest binary(0) value and the subsequent word/ character is assigned the next value(1) and so on. These words along with their assigned bit values are stored in an index table. This new hybrid algorithm is compared with the word-based and character-based techniques to assess its ability in compression of data files. This article deduces which technique works efficiently for data compression. Key Words:Bit representation method, hybrid bit representation algorithm, lossless compression algorithms, text data compression. 657

1. Introduction Representing a file in compact or compressed form for easier transfer and storage is a common requirement in this digital age. Different compression algorithms can be applied on a file to reduce its size. In spite of the large capacity storage devices available these days, efficient data storage is a major issue. Hence it is important to find a competent way to transfer and store data files in order to increase performance and reduce memory size. Compression can be either lossless compression or lossy compression. In Lossless compression, the original file can be fully recovered from the compressed file without any data loss. It is used in text files, executable programs, source code and so on. In Lossy compression, the original message is reconstructed with some loss of data. This is usually used in multimedia data such as image, audio and video files. Various lossless compression techniques have been proposed and used. Some of the main techniques used are the Huffman Coding, Run Length Encoding, Variable Bit Representation, Arithmetic Encoding and Dictionary Based Encoding. This paper proposes a new algorithm and compares it with the two techniques of Bit Representation method to find the best amongst them for compressing text data. 2. Related Works Figure 1: Compression Technique Nimisha E, Shyama P, Rajalakshmi V R, A New Approach to Increase the Storage Efficiency of Databases Using BIT Representation [1]. Here, the author proposed a lossless compression technique which uses binary values to represent every distinct attribute in a database. The count of distinct attributes (n) were found and the number of bits needed to represent each of these attributes were calculated (using the general rule, with n bits- combinations can be represented ) and an index table was created with the unique attribute value and their corresponding bit values. The original table was updated with the corresponding bit values but the frequencies of attributes were not taken into consideration. Here equal numbers of bits were used to represent every distinct attribute values. 658

Anoop R, Subhadra.G.Varma, Rajalakshmi V R, "An Effective Approach to Improve Storage Efficiency Using Variable Bit Representation" [2].Here, the author proposed a lossless compression technique which uses binary values to represent every distinct attribute in a database. This concept is similar to Nimisha E et al [1] research work which uses a uniform-length bit representation but in this paper it is further improved by truncating the redundant leading zeroes in the binary values used during compression Shrusti Porwal, Yashi Chaudhary, Jitendra Joshi, Manish Jain, Data Compression Methodologies for Lossless Data and Comparison between Algorithms [3].Here, the author compared two of the lossless compression techniques Huffman Coding and Arithmetic Encoding. Various performance measures were performed, to analyze which technique is better. The calculated performance measures were: compression ratio, compression speed, decompression speed, memory space needed, permits random access and compressed pattern matching. Finally, the author concludes that arithmetic encoding has the best compression ratio when compared to Huffman Coding technique. S.R. Kodituwakku, U.S.Amarasinghe, "Comparison of Lossless Data Compression Algorithms for Text Data [4]. Here, the author performed comparison between different lossless compression algorithms for text data. They were tested on different type of files, but the main interest was on different test patterns. It was concluded that the Shannon Fano algorithm was the best among the algorithms by comparing the compression times, decompression times and saving percentages between the selected algorithms. 3. Proposed Work A new algorithm, Hybrid Bit Representation which combines Word-based Bit Representation and Character-based Bit Representation is suggested and is compared with its originators to find the best among them. Hybrid Bit Representation In hybrid bit representation, repetitive words and the frequency of characters from the remaining words in a text file are found and these are replaced with a unique bit value. Compression Method Step I: Find the count of distinct words in a text file and store it in a word table Step II: Find the words with count equal to or greater than 2 Step III: Create an index table with the word, its frequency EXAMPLE: INSERT INTO IndexTable (Values,Frequency) VALUES('"+value+"',"+count+") Step IV: Find the frequency of characters from the remaining words (i.e. from the words with frequency 1) Step V: Insert these characters and its corresponding frequency into the index table created in Step III 659

EXAMPLE: INSERT INTO IndexTable (Values,Frequency) VALUES ('"+value+"',"+count+") Step VI: Sort the values (words and characters) in the IndexTable in descending order of their frequencies Step VII: Generate bit values to represent each of the word/character in IndexTable Step VIII: Update the index table with the corresponding bit value of each word/character. Words/character are assigned the bit values based on the decreasing order of their frequencies i.e. the word/character with the highest frequency gets the least bit value. EXAMPLE: UPDATE IndexTable SET BitValue='"+ bitvalue +"' WHERE Frequency = "+ frequency +" AND Values = '"+ value +"'" Step IX: For the words in the text file that do not have bit values of their own (in the Index table), find its corresponding bit values i.e. find the bit value of each character in the word and combine the values Step X:Store the bit values of all the words in a separate table WordFile Step XI: Retrieve the words and its corresponding bit value from WordFile EXAMPLE: SELECT Words, BitValue from WordFile Update the text file by replacing the words with its corresponding bit values Thus, all the values in the text file will be replaced with bit values. Decompression Method Step I: From the WordFile table, fetch the words and its equivalent bit value EXAMPLE: SELECT Words, BitValue FROM WordFile Step II: In the compressed file, replace each of the bit value with its corresponding word 4. Experiments and Results In this paper, a sample text file File1.txt (Figure 2) is used for experimentation, which has repetative values. Figure 2: Text File File1.txt Hybrid Bit Representation Initially, the count of each word is taken from the text file. The words with length greater than 1 and frequency greater than or equal to 2 are selected. For example, in the given file, Lorem appears 5 times. Likewise, there are 13 words which satisfy the given condition. These selected words are stored in an index table IndexTable along with their frequencies. Then the counts of each character from the remaining words are taken. For example, e appears 37 times. Likewise, there are 31 characters (alphanumeric and special characters) in the 660

file. These characters and their frequencies are stored in the IndexTable. According to the general rule, with n bits there can be unique combination. Here we have 44 unique words and characters, so we need a maximum of 6 bits to represent all the values. The actual bit values for these combinations are represented as: 0, 1, 10, 11, 100,, 101011. Now, the values in the IndexTable are sorted in decreasing order of their frequencies and then we store the generated bit values. The value with highest frequency is assigned the lowest bit value. This will reduce the storage space as the value that repeats the most is given the smallest bit value. In case, if the frequencies of different values are same, the assignment is just done in a sequential manner. The index table created for the text file File1.txt is given below Table 1: Database-IndexTable WordChar Frequency BitValue e 41 0 n 33 1 s 30 10 i 27 11 a 24 100 t 22 101 r 20 110 l 17 1000 o 17 111 u 11 1001 d 11 1010 p 10 1011 c 10 1100 g 9 1101 y 7 1111 k 7 1110 the 6 10001 m 6 10000 Ipsum 5 10010 b 5 10101 Lorem 5 10100 v 5 10011 h 5 10110 w 4 10111 of 4 11000 and 3 11001 It 2 11010 dummy 2 11011 has 2 11100 is 2 11101 text 2 11110 typesetting 2 100000 f 2 100001 with 2 100010 type 2 11111 W 1 100101 5 1 100011 6 1 100100 9 1 101001 A 1 101010 L 1 101000 M 1 100111 P 1 100110 661

In the above table(table 1)the value (here it is a character) with highest frequency - e, is given the smallest value 0, which is one bit. Similarly bit values are assigned to the rest of the words according to their frequency. The bit values for the words that are not present in the IndexTable are found. For that, the bit value of each of the character in these words are taken from the table and combined. For example, the word with is present in the table with the bit value 11110, so there is no need to find its bit value again. But the word What is not present in the table. Here, bit value of W is 101011, h is 11001, a is 11100 and t is 101, so the bit value of What is 1010111100111100101. Similarly the bit values of all words are found. A table (Table 2)is created to store all the words in the text file along with their corresponding bit values. Table 2: WordFile Table WordFile Words Bit What 10010110110100101 is 11101 Lorem 10100 Ipsum 10010 simply 101110000101110001111 dummy 11011 text 11110 of 11000 the 10001 printing 10111101111011111101 and 11001 typesetting 100000 industry 11110101001101011101111 has 11100 been 10101001 industrys 1111010100110101110111110 standard 10101100110101001101010 ever 0100110110 Now, in a separate file that has the contents of the original file, the words are replaced with their corresponding bit values. The compressed file is given below (Figure 3) Figure 3: Compressed Text File Here, the original text file has a size of 3888 bits whereas the compressed file has size of 1042 bits. 662

Word-based Bit Representation Initially, the count of each word is taken from the text file. The words with frequency greater than or equal to 2 are selected. These words along with the frequency are stored in an index table. The count of the total words in the table is found and that many bit values are generated. Now in the index table we sort the words in decreasing order of their frequencies and then store the generated bit values. The word with highest frequency is assigned the lowest bit value. The index table (Table 3) created for the text file File1.txt is given below Table 3: Database IndexTable IndexTable Words Frequency BitValue the 6 0 Lorem 5 10 Ipsum 5 1 of 4 11 and 3 100 with 2 1101 typesetting 2 1100 type 2 1011 text 2 1010 is 2 1001 has 2 1000 dummy 2 111 a 2 110 It 2 101 After creating the index table, a separate text file(figure 4) is created with the contents of the original file. But here, the repetitive words will be replaced with their corresponding bit values. Figure 4: Compressed Text File Here, the original text file has a size of 3888 bits whereas the compressed file has size of 2870 bits. Character-based Bit Representation Initially, the count of each character is taken from the text file. For example, in the given file, e appears 60 times. These characters are stored in an index table along with their frequencies. The count of all the characters inthe table is found and that many bit values are generated. Now, in the index table we sort the characters in decreasing order of their frequencies and then store the generated 663

bit values. The character with highest frequency is assigned the lowest bit value. The index table (Table 4) created for the text file File1.txt is given below. Table 4: Database CharacterIndexTable CharacterIndexTable Characters Frequency BitValue e 60 0 t 44 1 s 41 10 n 38 11 i 33 100 a 29 101 o 26 110 r 25 111 m 20 1000 p 19 1001 u 18 1010 l 17 1011 d 16 1100 h 15 1101 y 13 1110 g 11 1111 c 10 10000 k 7 10001 I 7 10010 w 6 10101 f 6 10011 L 6 10100 v 5 10111 b 5 10110 x 2 11000 W 1 11001 6 1 11110 9 1 11101 A 1 11100 P 1 11010 5 1 11111 M 1 11011 After creating the index table, a separate text file (Figure 5) is created with the contents of the original file. But here, the repetitive characters will be replaced with their corresponding bit values. Figure 5: Compressed Text File 664

Here, the original text file has a size of 3888 bits whereas the compressed file has size of 1520 bits. Comparing the hybrid compression technique with its originators have resulted in the below bar chart. Figure 6: Bar Chart representing the Sizes of the Original File and the Compressed Files From the bar chart(figure 6), we can infer that Hybrid compression is the best technique for reducing the size of the file. Here, the contents of the file are compressed without any loss and storage space is decreased. Thus, the hybrid compression algorithm helps to increase the storage efficiency than word-based or character-based compression algorithm. 5. Conclusion In this paper, we have introduced a hybrid algorithm based on word-based and character-based compression technique. Then we compared this new algorithm with its originators. We found out that the new hybrid compression method is better than the word-based or character-based compression technique as the file size is reduced by a considerable difference. References [1] Nimisha E., Shyama P., Rajalakshmi V.R., A New Approach to Increase the Storage Efficiency of Databases Using BIT Representation, International Journal of Control Theory and Applications 10(24) (2017), 367-375. [2] Anoop R., Varma S.G., Rajalakshmi V.R., An Effective Approach To Improve Storage Efficiency Using Variable Bit Representation. [3] Porwal S., Chaudhary Y., Joshi J., Jain M., Data compression methodologies for lossless data and comparison between algorithms, International Journal of Engineering Science and Innovative Technology 2 (2013), 142-147. [4] Kodituwakku S.R., Amarasinghe U.S., Comparison of lossless data compression algorithms for text data. Indian journal of computer science and engineering 1(4) (2010), 416-425. 665

[5] Pu I.M., Fundamental Data Compression, Elsevier, Britain (2006). [6] Blelloch E., Introduction to Data Compression, Computer Science Department, Carnegie Mellon University (2002). [7] Kesheng W., Otoo J., Arie S., Optimizing bitmap indices with efficient compression, ACM Trans. Database Systems 31 (2006), 1-38. [8] Kaufman K., Shmuel T., Semi-lossless text compression, Intl. J. Foundations of Computer Sci. 16 (2005), 1167-1178. [9] Campos A.S.E, Basic arithmetic coding by Arturo Campos Website (2009). http://www.arturocampos.com/ac_arithmetic.html [10] Vo Ngoc, Alistair M., Improved word aligned binary compression for text indexing, IEEE Trans. Knowledge & Data Engineering 18 (2006), 857-861. [11] Cormak V., Horspool S., Data compression using dynamic Markov modeling, Comput. J. 30 (1987), 541 550. [12] Capocelli M., Giancarlo R., Taneja J., Bounds on the redundancy of Huffman codes, IEEE Trans. Inf. Theory 32 (1986), 854 857. [13] Gawthrop J., Liuping W., Data compression for estimation of the physical parameters of stable and unstable linear systems, Automatica 41 (2005), 1313-1321. [14] Paul G. Howard, Jerey Scott Vitter, Practical Implementations of Arithmetic Coding, A shortened version appears in the proceedings of the International Conference on Advances in Communication and Control (1991). [15] Amandeep Singh Sidhu, Er. Meenakshi Garg, Text Data Compression Algorithm using Hybrid Approach, International Journal of Computer Science and Mobile Computing 3(12) (2014), 01 10. [16] Anmol Jyot Maan, Analysis and Comparison of Algorithms for Lossless Data Compression, International Journal of Information and Computation Technology 3(3) (2013), 139-146. 666

667

668