
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 1, Number 2, Sept-Oct (2010), pp. 38-46, IAEME, http://www.iaeme.com/ijcet.html

EFFICIENT TEXT COMPRESSION USING SPECIAL CHARACTER REPLACEMENT AND SPACE REMOVAL

Debashis Chakraborty
Department of Computer Science & Engineering, St. Thomas College of Engineering & Technology, Kolkata-23, West Bengal
E-Mail: sunnydeba@gmail.com

Sutirtha Ghosh
Department of Information Technology, St. Thomas College of Engineering & Technology, Kolkata-23, West Bengal
E-Mail: sutirtha84@yahoo.co.in

Joydeep Mukherjee
Department of Information Technology, St. Thomas College of Engineering & Technology, Kolkata-23, West Bengal

ABSTRACT

In this paper we propose a new text compression/decompression algorithm based on a special character replacement technique. After the initial compression obtained by replacing words with special characters, the spaces between the words in the intermediary compressed file are removed in specific situations to produce the final compressed text file. Experimental results show that the proposed algorithm is simple to implement, fast in encoding time and high in compression ratio, and gives better compression than existing algorithms and tools such as LZW, WINZIP 10.0 and WINRAR 3.93.

Keywords: Lossless compression; Lossy compression; Non-printable ASCII value; Special character; Index; Symbols.

INTRODUCTION

As evident from the name itself, data compression is concerned with the compression of a given set of data [5,6,8]. The primary reason for doing so is to reduce the storage space required to save the data, or the bandwidth required to transmit it. Although storage technology has developed significantly over the past decade, the same cannot be said for transmission capacity. As a result, the concept of compressing data has become very important.

Data compression, or source coding, is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through specific encoding schemes. It follows that the receiver must be aware of the encoding scheme in order to decode the data to its original form. The compression schemes that are designed are basically trade-offs among the degree of data compression, the amount of distortion introduced, and the resources (software and hardware) required to compress and decompress the data [5,9].

Data compression schemes may broadly be classified into (1) lossless compression and (2) lossy compression. Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. It is important to note that with lossy compression the original data cannot be reconstructed exactly from the compressed data, because of rounding off or removal of some parts of the data judged redundant. These types of compression are widely used in image compression [10,11,12,13].

The theoretical background of compression is provided by information theory and by rate-distortion theory. There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as a justification for using data compression as a benchmark for "general intelligence".

We hereby focus on the compression of text. Various algorithms have been proposed for text compression [1,2,3,4,7].

We propose an efficient text compression algorithm intended to yield better compression than existing algorithms such as Lempel-Ziv-Welch and existing software such as WinZip 10.0 and WinRAR 3.93, while ensuring that the compression process is lossless. The proposed algorithm is based on a systematic special character replacement technique.

The rest of this paper is organized as follows. Section 1 presents the concept of special character replacement. Section 2 describes the creation and maintenance of the dynamic dictionary. Section 3 describes the removal of spaces between special symbols in the intermediary compressed file. Section 4 gives the proposed algorithm, Section 5 presents the experimental results, and Section 6 concludes the paper.

1. SPECIAL CHARACTER REPLACEMENT

In the proposed algorithm we replace every word in a text with an ASCII character. The extended ASCII set contains two hundred and fifty-four (254) characters. Some of these represent NULL, space, linefeed or the English alphabet; neglecting them, one hundred and eighty-four (184) ASCII characters are used in this proposed algorithm.

One-letter and two-letter English words in the text file are not replaced with an ASCII character. A non-printable ASCII character replaces each word having more than two letters. For example, the word "of" remains the same, whereas the word "name" is replaced by the non-printable ASCII character assigned to index 1. Whenever a new word is found, an index (integer) is maintained and the corresponding special ASCII character replaces the word in the compressed text file. When the word is repeated later in the file, it is replaced by the same ASCII value assigned to it previously. The one hundred and eighty-four symbols are used for the first one hundred and eighty-four distinct words; once the number of words exceeds this value, ASCII characters are combined to generate new symbols for the new words in the text file.

When a space is encountered between two words, it is replaced with the integer 0. To mark the end of a statement the symbol 9 is used, so that the termination of a sentence can be identified during decompression of the text file from the compressed file.

For example, suppose there is a line of text: "My name is Debashis Chakraborty." Assuming this is the first sentence in the file, and following the proposed algorithm, the words "My" and "is" are kept unchanged in the compressed file. "Name", being the first word to be compressed, is assigned the index 1 and replaced with the ASCII character corresponding to 1. A similar process is repeated for the other words whose length is greater than two. The spaces between words are replaced with 0 and the "." with 9. Therefore the corresponding compressed sentence for the above example is:

My0$0is#0&9

where $, # and & are non-printable ASCII characters for the integers 1 (name), 2 (Debashis) and 3 (Chakraborty) respectively, each occupying one byte of memory. The original line of text occupies 32 bytes, whereas the compressed line occupies 12 bytes. Thus the proposed method enables comprehensive compression of text, resulting in better transmission bandwidth management and lower storage requirements.

2. CREATION OF DYNAMIC DICTIONARY

Text compression algorithms should always be lossless, preventing any loss of information: the text file regenerated from the compressed file must be identical to the original file. All such text compression algorithms maintain a dictionary containing the words that appear in the text file, and the original file is regenerated from the compressed file with the help of this dictionary. The dictionary maintained can be either static or dynamic; in the proposed algorithm a dynamic dictionary is used.

A table containing the fields Index, Symbol and Word forms the dictionary. Initially the table is empty. When a word to be compressed is encountered in the text file, we check whether it already exists in the table. Every time a new word is found, an integer value is assigned to it and its special symbol is tabulated, using a single non-printable ASCII character or a combination of such characters. The assigned integer is stored under the Index field, the special symbol under the Symbol field, and the corresponding word under the Word field.
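To make the table concrete, the following C sketch shows one possible in-memory form of the dictionary and of the symbol assignment described above. It is a minimal illustration rather than the authors' implementation: the reserved byte range (0x80 upward) and the use of 128 single-byte codes in place of the paper's 184 reserved extended-ASCII codes are simplifying assumptions made purely for this sketch.

    #include <stdio.h>
    #include <string.h>

    /* One row of the dictionary table from Section 2. The Symbol field holds
       either a single reserved byte or a two-byte combination; the concrete
       byte values (0x80 upward) are an illustrative stand-in for the paper's
       184 reserved extended-ASCII codes. */
    struct entry {
        int  index;        /* Index field  */
        char symbol[3];    /* Symbol field (one or two bytes, NUL-terminated) */
        char word[64];     /* Word field   */
    };

    #define MAX_WORDS          4096
    #define NUM_SINGLE_SYMBOLS 128   /* the paper reserves 184 codes */

    static struct entry dict[MAX_WORDS];
    static int dict_size = 0;        /* the table starts empty */

    /* Derive the special symbol for an index: one byte for the first
       NUM_SINGLE_SYMBOLS words, a two-byte combination afterwards. */
    static void make_symbol(int index, char symbol[3])
    {
        if (index < NUM_SINGLE_SYMBOLS) {
            symbol[0] = (char)(0x80 + index);
            symbol[1] = '\0';
        } else {
            int k = index - NUM_SINGLE_SYMBOLS;
            symbol[0] = (char)(0x80 + k / NUM_SINGLE_SYMBOLS);
            symbol[1] = (char)(0x80 + k % NUM_SINGLE_SYMBOLS);
            symbol[2] = '\0';
        }
    }

    /* Look a word up; if it is new, append a row with the next index and
       its symbol. Repeated words reuse the stored symbol. */
    const char *lookup_or_add(const char *word)
    {
        int i;
        for (i = 0; i < dict_size; i++)
            if (strcmp(dict[i].word, word) == 0)
                return dict[i].symbol;

        dict[dict_size].index = dict_size;
        make_symbol(dict_size, dict[dict_size].symbol);
        strncpy(dict[dict_size].word, word, sizeof(dict[dict_size].word) - 1);
        dict[dict_size].word[sizeof(dict[dict_size].word) - 1] = '\0';
        return dict[dict_size++].symbol;
    }

    int main(void)
    {
        const char *first = lookup_or_add("name");
        const char *again = lookup_or_add("name");   /* repeated word */
        lookup_or_add("Debashis");
        printf("entries: %d, repeat reuses symbol: %s\n",
               dict_size, (first == again) ? "yes" : "no");
        return 0;
    }

Because a repeated word returns the symbol stored at its first occurrence, the same replacement is used throughout the file, as Section 1 requires.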

Every time a new word is found in the text, the dictionary is updated using the same procedure; when a word is repeated, the symbol already assigned to it is reused. During decompression, each special symbol in the compressed file is looked up to obtain its corresponding integer index and word. Finally the symbols are replaced with their corresponding words to regenerate the original file.

3. REMOVAL OF SPACES FROM THE INTERMEDIARY COMPRESSED FILE

Every text file contains spaces that separate the words from one another. Here we propose a method to remove these spaces, without losing any information, to obtain better compression. Using special symbols for the words of the original file reduces the size of the file and yields an intermediary compressed file. The spaces are not removed from this intermediary file; instead, each space between words is represented by a 0 marking its location. Every word of the original file is replaced by either one special symbol or a combination of two special symbols. In the intermediary compressed file, when a 0 follows a single special symbol the contents are not modified. When a word is replaced by a combination of two symbols, or the word is a one- or two-letter word (left unreplaced in the intermediary file), the 0 after it is removed, i.e. the space between the present word and the next word is removed.

For example, suppose there is a line of text: "My name is Debashis Chakraborty." Assuming the special symbol for "name" is $, for "Debashis" is ## and for "Chakraborty" is @, then after the final compression the output for the above sentence is:

My$0is##@9
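The only information needed to decide whether a 0 marker survives into the final file is the kind of token that precedes it. A minimal C sketch of that rule follows; the token names are illustrative and not taken from the paper.

    /* Token kinds produced by the first (replacement) pass. */
    enum token_kind {
        TOK_SINGLE_SYMBOL,    /* word replaced by one special symbol        */
        TOK_COMBINED_SYMBOL,  /* word replaced by a two-symbol combination  */
        TOK_LITERAL_WORD      /* one- or two-letter word kept unchanged     */
    };

    /* A '0' space marker is retained in the final compressed file only when
       the token before it is a single special symbol; after combined symbols
       and literal words it is dropped. */
    int keep_space_marker(enum token_kind preceding)
    {
        return preceding == TOK_SINGLE_SYMBOL;
    }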

4. PROPOSED ALGORITHM

The proposed algorithm takes a text file as input and can compress text files to sizes comparable to those produced by the Lempel-Ziv-Welch algorithm, WinZip 10.0 and WinRAR 3.93. The proposed algorithm is:

Algorithm for Compression

Step 1: Read the contents of the text file, one word at a time.

Step 2: Create a dictionary containing the fields Index, Symbol and Word. The dictionary is initially empty.

Step 3: Length calculation. Calculate the length of the word read from the text file. Write the original word into the intermediary file if the length of the word is less than or equal to two. If the read character is a "." it is replaced with 9 in the compressed file, and if the read character is a space between two words it is replaced with 0. A word of length greater than two is replaced with a special symbol (a non-printable ASCII character or a combination of ASCII characters).

Step 4: Special character replacement. Check whether the word exists in the dictionary. For a new word, assign an integer which acts as its index and a special symbol for the corresponding word. If the index of the word is less than one hundred and eighty-four, assign the index's respective single-character ASCII symbol as its special symbol. If a word has an index greater than one hundred and eighty-three, combine ASCII characters to form the new symbol. Update the dictionary by inserting the new word along with its index value and assigned symbol for future reference. For a repetition of an existing word, substitute the pre-assigned symbol for the word as obtained from the dictionary.

Step 5: Continue the above process of compression and updating of the dynamic dictionary until the end of the original file is reached.

Step 6: Removal of spaces from the intermediary file. Read the contents of the intermediary file, one symbol at a time. Check whether each word of the original file was replaced by one special character or by a combination of two. If a 0 follows a symbol consisting of one special character, retain the zero. If a word is represented by a combination of special characters, or appears as the word itself (a one- or two-letter word), remove the 0 (representing the space between words) to obtain the final compressed file.

Step 7: Continue the above process until the end of the intermediary file is reached.
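For concreteness, the following self-contained C sketch walks a single line of text through these steps. It is an illustrative reading of the algorithm rather than the authors' TURBO C implementation: the symbol bytes (0x80 upward, 128 single-byte codes instead of the 184 reserved codes of Section 1), the fixed buffer sizes, and the folding of the Step 6 space-removal pass into the main loop are all simplifying assumptions.

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    #define NUM_SINGLE_SYMBOLS 128   /* illustrative; the paper reserves 184 codes */
    #define MAX_WORDS          1024

    static char dict_word[MAX_WORDS][64];
    static int  dict_size = 0;

    /* Step 4: find the word in the dictionary or add it; return its index. */
    static int lookup_or_add(const char *w)
    {
        int i;
        for (i = 0; i < dict_size; i++)
            if (strcmp(dict_word[i], w) == 0) return i;
        strncpy(dict_word[dict_size], w, 63);
        dict_word[dict_size][63] = '\0';
        return dict_size++;
    }

    /* Emit the special symbol for an index; return 1 when it is a single
       symbol, so the caller keeps the following '0' marker (Step 6). */
    static int emit_symbol(FILE *out, int index)
    {
        if (index < NUM_SINGLE_SYMBOLS) {
            fputc(0x80 + index, out);
            return 1;
        }
        index -= NUM_SINGLE_SYMBOLS;                 /* combine two symbol bytes */
        fputc(0x80 + index / NUM_SINGLE_SYMBOLS, out);
        fputc(0x80 + index % NUM_SINGLE_SYMBOLS, out);
        return 0;
    }

    /* Steps 1-7 applied to one string; the Step 6 space-removal pass is
       folded into the main loop instead of a second pass over an
       intermediary file. */
    void compress_text(const char *text, FILE *out)
    {
        char word[64];
        int len = 0, keep_marker = 0;
        const char *p = text;

        for (;;) {
            char c = *p++;
            if (isalpha((unsigned char)c)) {          /* Step 1: collect a word */
                if (len < 63) word[len++] = c;
                continue;
            }
            if (len > 0) {                            /* Step 3: length rule */
                word[len] = '\0';
                if (len <= 2) {
                    fputs(word, out);                 /* short words pass through */
                    keep_marker = 0;                  /* their trailing space is dropped */
                } else {
                    keep_marker = emit_symbol(out, lookup_or_add(word));
                }
                len = 0;
            }
            if (c == '\0') break;
            if (c == ' ')      { if (keep_marker) fputc('0', out); }
            else if (c == '.') fputc('9', out);       /* end of sentence */
            else               fputc(c, out);         /* other characters copied through */
        }
    }

    int main(void)
    {
        compress_text("My name is Debashis Chakraborty.", stdout);
        fputc('\n', stdout);
        return 0;
    }

Run on the Section 1 example, the sketch emits the bytes for "My", a one-byte symbol for "name", the marker 0, "is", a one-byte symbol for "Debashis", another 0, a one-byte symbol for "Chakraborty", and 9, i.e. the intermediary form with the unneeded 0 markers already dropped.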

Algorithm for Decompression

Step 1: Read a symbol from the compressed file.

Step 2: If the read symbol is 0, replace it with a space or tab. If the symbol is 9, replace it with a "." to indicate the end of a sentence.

Step 3: Decoding of special characters. If the symbol read from the compressed file is an English alphabet character, write it unchanged into the decompressed file, and write a space or tab after the word. For a special symbol, find its match in the dictionary and write the corresponding word into the decompressed file. If the special symbol is a combination of two special characters, write a space or tab after the corresponding word in the decompressed file.

Step 4: Continue the above process until the end of the compressed file is reached.
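A matching decoder sketch is shown below. It assumes the Index/Symbol/Word dictionary built during compression is available to the decoder (the paper looks the dictionary up for every symbol but does not specify how it is shared between the two ends), and it reuses the illustrative byte convention of the earlier sketch, in which codes 0x80 and above are symbol bytes. Distinguishing a two-byte combination from a single symbol by attempting the two-byte lookup first is one possible reading of Step 3, not a detail fixed by the paper.

    #include <stdio.h>
    #include <string.h>

    struct entry { char symbol[3]; char word[64]; };   /* Symbol -> Word rows */

    /* Return the dictionary word for a symbol, or NULL if it is unknown. */
    static const char *find_word(const struct entry *dict, int n, const char *sym)
    {
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(dict[i].symbol, sym) == 0) return dict[i].word;
        return NULL;
    }

    void decompress_text(const unsigned char *in, const struct entry *dict,
                         int dict_size, FILE *out)
    {
        char sym[3];
        const char *w;

        while (*in) {
            unsigned char c = *in++;
            if (c == '0') { fputc(' ', out); continue; }   /* Step 2: space marker    */
            if (c == '9') { fputc('.', out); continue; }   /* Step 2: end of sentence */
            if (c < 0x80) {                                 /* Step 3: literal letter  */
                fputc(c, out);
                /* a literal word's trailing space was dropped in Step 6, so one is
                   re-inserted only when the next byte starts a symbol */
                if (*in >= 0x80) fputc(' ', out);
                continue;
            }
            /* Step 3: special symbol - try a two-byte combination, then one byte. */
            sym[0] = (char)c; sym[1] = (char)*in; sym[2] = '\0';
            w = (*in >= 0x80) ? find_word(dict, dict_size, sym) : NULL;
            if (w != NULL) {
                in++;
                fputs(w, out);
                if (*in && *in != '9') fputc(' ', out);     /* re-insert dropped space */
            } else {
                sym[1] = '\0';
                w = find_word(dict, dict_size, sym);
                fputs(w != NULL ? w : "?", out);            /* single symbol; a '0' follows */
            }
        }
    }

    int main(void)
    {
        /* Dictionary and compressed bytes as produced by the compression sketch
           for "My name is Debashis Chakraborty."                               */
        struct entry dict[] = {
            { "\x80", "name" }, { "\x81", "Debashis" }, { "\x82", "Chakraborty" }
        };
        const unsigned char text[] = "My" "\x80" "0" "is" "\x81" "0" "\x82" "9";
        decompress_text(text, dict, 3, stdout);
        fputc('\n', stdout);
        return 0;
    }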

5. EXPERIMENTAL RESULTS

The algorithm developed has been simulated using TURBO C. The input text files considered are .txt, .rtf, .cpp and .c files, all of different sizes. The compression results obtained are tabulated in Table 1; the percentage under each result denotes the size reduction relative to the original file. The compression ratio is better than that of the Lempel-Ziv-Welch algorithm, WinZip 10.0 and WinRAR 3.93 for the majority of the text files. All the text files reconstructed from the compressed files are of the same size as the original files; therefore the proposed algorithm performs lossless compression.

Table 1: Compression of text files for different algorithms

Original File | Original Size | Compression by LZW | Compression by WINRAR 3.93 | Compression by WINZIP 10.0 | Compression by Proposed Algorithm
sgjm1.txt     | 5046 bytes    | 3292 bytes (35%)   | 2056 bytes (59%)           | 2468 bytes (51%)           | 1537 bytes (69%)
sgjm2.txt     | 7061 bytes    | 4357 bytes (38%)   | 2565 bytes (63%)           | 2547 bytes (64%)           | 2129 bytes (70%)
sgjm3.rtf     | 2891 bytes    | 1842 bytes (36%)   | 1303 bytes (55%)           | 1269 bytes (56%)           | 896 bytes (69%)
sgjm4.txt     | 431 bytes     | 388 bytes (9%)     | 260 bytes (39%)            | 231 bytes (46%)            | 158 bytes (63%)
sgjm5.rtf     | 2037 bytes    | 1330 bytes (35%)   | 859 bytes (58%)            | 828 bytes (59%)            | 635 bytes (63%)
sgjm6.txt     | 3369 bytes    | 2196 bytes (35%)   | 1545 bytes (54%)           | 1504 bytes (55%)           | 1110 bytes (67%)
sgjm7.txt     | 10549 bytes   | 5457 bytes (47%)   | 3933 bytes (63%)           | 3923 bytes (63%)           | 3492 bytes (66%)
sgjm8.txt     | 7584 bytes    | 4216 bytes (44%)   | 3067 bytes (59%)           | 3048 bytes (60%)           | 2389 bytes (68%)
sgjm9.rtf     | 5529 bytes    | 3249 bytes (41%)   | 2351 bytes (57%)           | 2324 bytes (58%)           | 1793 bytes (67%)
sgjm10.rtf    | 4152 bytes    | 2658 bytes (36%)   | 1869 bytes (55%)           | 1831 bytes (56%)           | 1428 bytes (66%)
sgjm11.cpp    | 458 bytes     | 421 bytes (8%)     | 259 bytes (43%)            | 239 bytes (48%)            | 134 bytes (70%)

6. CONCLUSIONS

In this paper a new text compression algorithm for compressing different types of text files has been introduced. The main advantage of this compression scheme is that it gives better compression than existing algorithms for different text file sizes. The scheme is comparable to the Lempel-Ziv-Welch algorithm, WinZip 10.0 and WinRAR 3.93 in terms of compression ratio.

REFERENCES

[1] J. Ziv and A. Lempel, "Compression of individual sequences via variable length coding", IEEE Transactions on Information Theory, Vol. 24, pp. 530-536, 1978.

[2] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression", IEEE Transactions on Information Theory, Vol. 23, pp. 337-343, May 1977.

[3] Gonzalo Navarro and Mathieu Raffinot, "A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text", Proc. CPM '99, LNCS 1645, pp. 14-36.

[4] S. Bhattacharjee, J. Bhattacharya, U. Raghavendra, D. Saha and P. Pal Chaudhuri, "A VLSI architecture for cellular automata based parallel data compression", IEEE, Bangalore, India, Jan 03-06, 2006.

[5] Khalid Sayood, An Introduction to Data Compression, Academic Press, 1996.

[6] David Salomon, Data Compression: The Complete Reference, Springer, 2000.

[7] M. Atallah and Y. Genin, "Pattern matching text compression: Algorithmic and empirical results", International Conference on Data Compression, Vol. II, pp. 349-352, Lausanne, 1996.

[8] Mark Nelson and Jean-Loup Gailly, The Data Compression Book, Second Edition, M&T Books.

[9] Timothy C. Bell, Text Compression, Prentice Hall, 1990.

[10] Ranjan Parekh, Principles of Multimedia, Tata McGraw-Hill, 2006.

[11] Amiya Halder, Sourav Dey, Soumyodeep Mukherjee and Ayan Banerjee, "An Efficient Image Compression Algorithm Based on Block Optimization and Byte Compression", ICISA-2010, Chennai, Tamilnadu, India, pp. 14-18, Feb 6, 2010.

[12] Ayan Banerjee and Amiya Halder, "An Efficient Image Compression Algorithm Based on Block Optimization, Byte Compression and Run-Length Encoding along Y-axis", IEEE ICCSIT 2010, Chengdu, China, IEEE Computer Society Press, July 9-11, 2010.

[13] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing.

[14] Debashis Chakraborty, Sutirtha Ghosh and Joydeep Mukherjee, "An Efficient Data Compression Algorithm Using Differential Feature Extraction", NCETCS, August 26-28, 2010.