Lossless Data Compression for Security Purposes Using Huffman Encoding

A thesis submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Master of Science in the College of Engineering and Applied Science

by Viveka Aggarwal
Bachelor of Technology, SRM University, India, May 2013

Committee Chair: Prof. George Purdy, Ph.D.

ABSTRACT

The aim of this thesis is to efficiently compress data for security purposes. Here, we analyze and extend the working of Huffman encoding. Our results ultimately provide us with a list of binary numbers that can be encrypted with AES. The encoding works by reducing the redundancy of the input, enabling us to change the key less often. We use static and dynamic Huffman encoding, both lossless compression techniques, and we have modified Huffman encoding so that it uses the frequencies of consecutive letter pairs and triplets. The usual way to compress English text is with Huffman encoding on the individual letters. The character frequencies used in this encoding scheme can be obtained in two ways: dynamically, that is, calculated on the fly once the text is obtained as input, or statically, using a pre-created character frequency table. With the static encoding method and a commonly available frequency table, the effort and cost of encoding and sending the table can be avoided.

The main idea behind this work is dual and triple character encoding. For the purpose of this thesis, only letters are encoded. We compare the results of the Huffman encoding methods by encoding the input data both dynamically and statically. Through these results we analyze the best method for compressing different types of data (with frequently repeated and non-repeated words) for security purposes.


TABLE OF CONTENTS

1. Introduction
   1.1 Goal of research
   1.2 Motivation
2. Input
3. Compression techniques
   3.1 Static Huffman Encoding
   3.2 Extended Static Huffman Encoding
       3.2.1 Pairs
       3.2.2 Triplets
   3.3 Dynamic Huffman Encoding
   3.4 Extended Dynamic Huffman Encoding
       3.4.1 Pairs
       3.4.2 Triplets
4. Results
5. Conclusion and Future work
6. References

1. INTRODUCTION

The usual way to compress text is using Huffman encoding [6]. This method for constructing minimum-redundancy codes was introduced by D. A. Huffman in 1952 [1][2]. The scheme encodes each letter based on its frequency. We extended Huffman encoding to encode pairs and triplets, and also experimented with other encoding techniques such as LZW [3] and arithmetic encoding [13]. The modified Huffman encoding gave better results. The reason is that the modified technique maintains a separate tree for each letter or letter pair, reducing the code length for each letter pair or triplet, even though it encodes as many pairs or triplets as there are letters in the input. The time complexity of the modified Huffman encoding remains almost the same for the first GB of text.

1.1 GOAL OF RESEARCH

The aim of this thesis is to efficiently compress data to obtain a list of binary numbers that can be encrypted with the Advanced Encryption Standard (AES) [7], such that the key can be changed less often. The basis of our thesis is that the English language has a lot of redundancy, or repeated words; for instance, the and be are among the most frequently used English words. We use four lossless encoding techniques to compress data: static Huffman encoding, modified static Huffman encoding, dynamic Huffman encoding and modified dynamic Huffman encoding. The modified Huffman encoding works on encoding groups (pairs and triplets) of characters, instead of encoding every character based on its frequency as in the traditional Huffman encoding scheme. The compression ratio for each technique is calculated for the input data and compared based on the amount of redundancy the data might have.

1.2 MOTIVATION

Data compression, also called source coding, is the process of encoding information in fewer bits than an uncoded representation, using various encoding schemes. Files are usually compressed to save storage space and bandwidth. But our motive behind compression is that the compressed version starts to resemble a random file, and such lists of binary numbers can be encrypted with AES. The compression scheme aims at reducing the redundancy [3] of the input, enabling us to change the key less often. The usual way to compress English text is using Huffman encoding on the individual characters, that is, encoding each character based on its frequency of appearance. But for security purposes, and to make use of the redundancy present in the English language, we modify Huffman encoding so that it uses the frequencies of consecutive letter pairs, triplets, or larger tuples.

2. INPUT

For the purpose of this thesis, we only consider input with upper case characters in the range A to Z. The code designed for this purpose accepts any form of input and changes it into this format. For the static Huffman encoding algorithm, single letter frequencies [10] and letter pair frequencies [12] are obtained from standard data available on the web. Data for triple letter frequencies is not available on the web; hence, for the extended static Huffman encoding for triplets, we created the frequency chart from the e-book The Adventures of Tom Sawyer by Mark Twain.

3. THE COMPRESSION TECHNIQUES

Data compression is the process of encoding information in fewer bits than an uncompressed representation, using various encoding schemes. It is the practice of reducing data quantity without compromising quality. It reduces the number of bits required to store and transfer data, making the storage of large data sets feasible. Compression can be either lossy or lossless. Lossy compression reduces a file by permanently eliminating certain information, especially redundant information; on uncompressing the file, only a part of the original information is retrieved. With lossless compression, by contrast, every single bit of data that was originally in the file remains after the file is uncompressed. In this thesis, we cover four lossless compression schemes: static Huffman encoding, extended static Huffman encoding (for letter pairs and triplets), dynamic Huffman encoding, and extended dynamic Huffman encoding (for letter pairs and triplets).

Huffman Encoding

Huffman encoding [11] is an optimal prefix encoding method that is commonly used for lossless text compression. Prefix encoding refers to a code in which no codeword is a prefix of another codeword. The advantage of this technique is that we obtain a uniquely decodable code, such that each word is identifiable without the need for a special marker between words. Huffman's method generates an optimal code, in the sense that no other prefix code compresses better. It maps each character to a unique code according to its frequency: the more common characters are mapped to fewer bits than the less common ones.

3.1 STATIC HUFFMAN ENCODING

For the purpose of this thesis, we use the standard character frequency table [10] for the English language. This allows the receiver to decode the data without needing any extra information from the sender. The complete compression algorithm is based on three steps: getting the character frequencies from the table, generating the prefix code, and finally encoding each character with its respective code.

A: 8.12   B: 1.49   C: 2.71   D: 4.32   E: 12.02  F: 2.30
G: 2.03   H: 5.92   I: 7.31   J: 0.10   K: 0.69   L: 3.98
M: 2.61   N: 6.95   O: 7.68   P: 1.82   Q: 0.11   R: 6.02
S: 6.28   T: 9.10   U: 2.88   V: 1.11   W: 2.09   X: 0.17
Y: 2.11   Z: 0.07

Fig 1 Character frequency table (percent)

After obtaining the character frequencies, which are input to the program as a single array, the model creates a Huffman tree. The Huffman tree is a binary tree built bottom-up. The pseudo code for building the Huffman tree is as follows:

1. For each letter ranging from A-Z:
   a. If its frequency is not zero, create a new node t, such that weight(t) is the character frequency and label(t) is the character.
2. List all the created nodes in increasing order of their weight. We used a priority queue for this operation.
3. While the list has more than one element:
   a. Extract the two nodes with the smallest weights from the list.
   b. Create a new node n with weight equal to the sum of the weights of the two extracted nodes.
   c. Add the lower-frequency node as the left child of n and the other node as the right child of n.
   d. Insert this new node into the list.

4. Return the root of this tree.

Encoding:
1. We take the first letter from the input text. For example, if the given text is CAT, the program takes C as input first. The following steps are repeated till the program encounters the end of the input string.
2. We start at the root of the previously created Huffman tree and check the left and the right branches:
   a. If the current node's left child is the leaf for the input character, add 0 to the output string; if its right child is, add 1. Move to the next input character from the input string.
   b. Else, descend into the branch whose subtree contains the character, adding 0 for the left branch and 1 for the right branch.
   c. Repeat steps a and b till we reach the required leaf node.
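As a concrete illustration of the tree-building and code-derivation steps, here is a minimal Python sketch. The function and variable names are ours, not from the thesis code, and tie-breaking between equal weights (and whether rounded or exact frequencies are used) can produce codes and bit counts that differ from the worked example below, though any tree built this way is optimal for the given weights.

    import heapq
    from itertools import count

    def build_tree(freqs):
        """Build a Huffman tree from {symbol: weight}: repeatedly merge the
        two lightest nodes, putting the lighter one on the left, as in the
        pseudo code above. Leaves are strings; internal nodes are pairs."""
        tie = count()  # unique tie-breaker so heapq never compares subtrees
        heap = [(w, next(tie), s) for s, w in freqs.items() if w > 0]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))
        return heap[0][2]

    def make_codes(node, prefix=""):
        """Derive the prefix code from the tree: 0 for a left edge,
        1 for a right edge."""
        if isinstance(node, str):          # leaf: a single letter
            return {node: prefix or "1"}   # lone-leaf tree falls back to "1"
        left, right = node
        table = make_codes(left, prefix + "0")
        table.update(make_codes(right, prefix + "1"))
        return table

    # Frequencies of the letters appearing in THECATHASTHEBAT (from Fig 1)
    freqs = {"A": 8.12, "B": 1.49, "C": 2.71, "E": 12.02,
             "H": 5.92, "S": 6.28, "T": 9.1}
    tree = build_tree(freqs)
    codes = make_codes(tree)
    encoded = "".join(codes[ch] for ch in "THECATHASTHEBAT")
    print(codes, len(encoded))

Note that build_tree returns a bare leaf when only one symbol has nonzero weight, and make_codes then assigns it the 1-bit code "1"; this matches a convention the thesis adopts later, in section 3.4.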

Example working of the Huffman encoding process:

INPUT (x): THECATHASTHEBAT
Length of x: 15 * 8 = 120 bits

Frequency table:

A     B     C     E      H     S     T
8.12  1.49  2.71  12.02  5.92  6.28  9.1

[The intermediate stages of the construction, in which the two lowest-weight nodes are repeatedly merged into a single subtree, are shown in the original figure and omitted here.]

Fig 2 Huffman tree for static single letter encoding

Prefix table:

A    B       C       E  H      S     T
110  111100  111101  0  11111  1110  01

Encoded string: 01111110111101110011111111011100111111011110011001
Length of encoded string: 50 bits
Space saving: 58.33%

Decoding:
1. We start at the root of the previously created Huffman tree and start reading the binary input (the previously compressed data), for example: 1010010100.
2. If the character read is 0 we move towards the left branch, and towards the right branch if the character encountered is 1.
3. Repeat step 2 till we reach a leaf node.
4. We replace the whole string of binary characters used in the path with the letter of the leaf node.

5. We move to the next binary character of the input string. The whole process is repeated till we reach the end of the input string.
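The decoding loop above maps directly onto a few lines of Python; this sketch reuses the nested-tuple tree representation from the previous sketch.

    def decode(bits, root):
        """Walk the Huffman tree bit by bit: 0 goes left, 1 goes right.
        On reaching a leaf, emit its letter and restart from the root."""
        out, node = [], root
        for b in bits:
            node = node[0] if b == "0" else node[1]
            if isinstance(node, str):   # leaf reached
                out.append(node)
                node = root
        return "".join(out)

    # decode(encoded, tree) recovers "THECATHASTHEBAT" for the tree
    # built in the previous sketch.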

3.2 EXTENDED STATIC HUFFMAN ENCODING

3.2.1 LETTER PAIRS

Extended Huffman compression for letter pairs deals with compressing data while taking into consideration the frequencies of pairs of characters. The advantage of this technique is that it effectively exploits the redundancy in the English language; for example, the words to, be and of are among the most frequently used [9]. With the single letter Huffman encoding scheme, we encoded each letter according to the probability of its occurrence; with the double letter Huffman encoding scheme, we consider the frequency of every possible pair of letters. Using these frequencies, we create 26 different Huffman trees. The pseudo code for generating these trees is very similar to that of the single letter Huffman encoding scheme.

Input: A 26*27 matrix, with matrix[0][0] representing the frequency of A-A and matrix[25][25] representing the frequency of Z-Z. The last element in each row is -1, placed as an error check.
Output: 26 Huffman trees

1. For each letter ranging from A-Z, create a Huffman tree:
   a. We start with 26 leaf nodes, each depicting the frequency of its character with respect to the previously chosen character. For example, if we are creating the tree for letter B, then the node with label A depicts the frequency of letter A in the English language given that letter B precedes it.
   b. Place all these nodes in a priority queue, in increasing order of their frequency values.
   c. While the queue has more than one element:
      i. Extract the two nodes with the smallest weights from the queue.
      ii. Create a new node n with weight equal to the sum of the weights of the two extracted nodes.
      iii. Add the lower-frequency node as the left child of n and the other node as the right child of n.
      iv. Insert this new node into the queue.
   d. Return the root of this tree.

Fig 3 Huffman tree for letter A
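A sketch of this tree construction, assuming pair_freq is the 26*27 matrix described above and reusing build_tree and make_codes from the single-letter sketch (the names build_pair_tables and ALPHABET are ours):

    ALPHABET = [chr(ord("A") + i) for i in range(26)]

    def build_pair_tables(pair_freq):
        """One prefix table per preceding letter: tables['C']['A'] is the
        code emitted for A when it follows C."""
        tables = {}
        for i, first in enumerate(ALPHABET):
            freqs = {second: pair_freq[i][j]
                     for j, second in enumerate(ALPHABET)  # j stops at 25, so the
                     if pair_freq[i][j] > 0}               # -1 sentinel is never read
            if freqs:                                      # skip letters with no successors
                tables[first] = make_codes(build_tree(freqs))
        return tables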

Encoding:
1. The first letter of the input string is encoded using the single letter Huffman encoding scheme.
2. We then take each letter of the input string as a reference for the Huffman tree to look into, and search for the following character's code in that tree. For example, if our input string is CAT, we encode C using the simple single letter Huffman encoding, and then consider the pair CA: we search Huffman[2], the Huffman tree for letter C, for the encoding of A.
3. This process is repeated until the entire string is encoded.
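In code, the encoding loop is a walk over adjacent pairs (a sketch; single_codes is the single-letter table from section 3.1 and pair_tables is the result of build_pair_tables above):

    def encode_pairs(text, single_codes, pair_tables):
        """First letter uses the single-letter code; every subsequent
        letter is coded from its predecessor's tree."""
        out = [single_codes[text[0]]]
        for prev, cur in zip(text, text[1:]):
            out.append(pair_tables[prev][cur])
        return "".join(out)

    # encode_pairs("CAT", ...) emits code(C), then code(A given C),
    # then code(T given A).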

Example working of the Extended Static Huffman (Pairs) encoding process:

INPUT (x): THECATHASTHEBAT
Length of x: 15 * 8 = 120 bits

Frequency table:

      A    B    C    E    H    S    T
A     1   20   33    0    5   95  104
B    11    1    0   47    0    2    1
C    31    0    9   38   38    1   15
E   110   23   45   48   33  115   83
H   114    2    2  302    6    5   32
S    67   11   17   74   50   43  109
T    59   10   11   75  330   34   56

Based on the same strategy as for the single letter Huffman encoding, we create seven trees. The Huffman tree for letter A is illustrated below.

[Tree construction over the pairs AA-AT; intermediate merge stages omitted here.]

Fig 4 Huffman tree for letter A

We encode the first letter, T, using its single letter code from the previous example: 01.

Prefix table:

      A      B       C       E      H       S       T
A   10000    1001    101     Null   10001   11      0
B   01       0010    Null    1      Null    000     0011
C   111      Null    11001   0      10      11000   1101
E   10       111110  11110   1110   111111  0       110
H   01       000100  000101  1      0000    00011   001
S   110      111100  111101  10     1110    11111   0
T   010      011000  011001  00     1       01101   0111

Encoded string: 011111110111010111011111110010
Length of encoded string: 30 bits
Space saving: 75.00%

Decoding:
1. We start at the root of the previously created Huffman tree and start reading the binary input (the previously compressed data), for example: 1010010100. We decode the first character with the help of the previously stated decoding scheme of the single letter Huffman encoding method.
2. We then refer to the tree of the character obtained in the previous step. This character is added to the output string.
3. We search the remaining part of the input string for the next substring that leads to a valid character at a leaf node.
4. We replace the whole string of binary characters used in the path with the letter of the leaf node.
5. Repeat from step two to step four until we reach the end of the input string.
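The decoder is the same per-letter tree walk, switching trees after each decoded letter (a sketch in the representation of the earlier sketches; the lone-leaf case, where a tree with a single leaf is coded as "1", follows the convention introduced in section 3.4):

    def decode_pairs(bits, single_tree, pair_trees):
        """First letter from the single-letter tree; each later letter from
        its predecessor's tree. The triplet decoder of section 3.2.2 works
        the same way, with trees keyed on the last two letters."""
        def walk(tree, pos):
            if isinstance(tree, str):      # lone-leaf tree: one bit, fixed letter
                return tree, pos + 1
            node = tree
            while not isinstance(node, str):
                node = node[0] if bits[pos] == "0" else node[1]
                pos += 1
            return node, pos

        out, pos = [], 0
        letter, pos = walk(single_tree, pos)
        out.append(letter)
        while pos < len(bits):
            letter, pos = walk(pair_trees[out[-1]], pos)
            out.append(letter)
        return "".join(out)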

3.2.2 TRIPLETS

Having extended Huffman encoding to character pairs, we noticed that more than 25 percent of the top 100 most frequent words in the English language are three letters long [9]; the most frequent word in the language is the, and the fifth is and. Hence, we extended Huffman encoding to triplets. Huffman encoding for triplets works on generating 26*26 different trees, considering the frequency of every possible three-character combination in the English language. In triple letter Huffman encoding, each tree represents a pair of letters (AA-ZZ) and each node of a tree represents the frequency of the tree's pair followed by the node's character. For example, the node E in the Huffman tree for the pair TH represents the frequency of THE in the English language. The frequencies have been pre-obtained and input to the program in the form of a 26*27*27 matrix. The frequencies were obtained by forming

triplets and noting their frequencies in the English book The Adventures of Tom Sawyer by Mark Twain. The algorithm for creating the Huffman trees for the extended Huffman encoding for triplets is as follows:

1. For each letter pair ranging from AA-ZZ, create a Huffman tree, for a total of 676 trees:
   a. We start with 26 leaf nodes, each depicting the frequency of its character with respect to the previously chosen character pair. For example, if we are creating the tree for the pair AB, then the node with label A depicts the frequency of letter A in the English language given that the letters AB precede it.
   b. Place all these nodes in a priority queue, in increasing order of their frequency values.
   c. While the queue has more than one element:
      i. Extract the two nodes with the smallest weights from the queue.
      ii. Create a new node n with weight equal to the sum of the weights of the two extracted nodes.
      iii. Add the lower-frequency node as the left child of n and the other node as the right child of n.
      iv. Insert this new node into the queue.
   d. Return the root of this tree.
2. We encode the first character using the character-prefix map obtained from the single static Huffman encoding; the second character is then encoded using the character-prefix map obtained from the static Huffman encoding for pairs, based on the first character.
3. We then find all the triplets in the input sentence and replace each one with its respective prefix string. For example, in the input sentence HELLO the triplets are HEL, ELL and LLO. The final string of all these prefixes is returned as the encoded output of the input string.
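The encoding loop, sketched in Python (we assume triple_tables[('T','H')]['E'] holds the code for E after the pair TH; single_codes and pair_tables are as in the earlier sketches):

    def encode_triplets(text, single_codes, pair_tables, triple_tables):
        """First letter: single-letter code; second letter: pair code;
        every later letter: code from the tree of the two preceding letters."""
        out = [single_codes[text[0]], pair_tables[text[0]][text[1]]]
        for i in range(2, len(text)):
            out.append(triple_tables[(text[i-2], text[i-1])][text[i]])
        return "".join(out)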

Example working of the Extended Static Huffman (Triplets) encoding process:

INPUT (x): THECATHASTHEBAT
Length of x: 15 * 8 = 120 bits

Frequency table:

        A     B     C     E     H     S     T
AS    281    76    58   119   207   283   546
AT    157    63   167   372   346   307   411
BA      0    10   144     0     0    13    25
CA      0     0     1     2     0    36    72
EB     90     0     0   227     0     2     0
EC    219     0     1    53    94     4   150
HA      0    33    25     0     8    95  1395
HE    462   383   406   152   350   806   365
TH   1467    27    22  6189   116    70   138
ST    429    47    17   308   611    82   138

[Tree construction over the triplets ASA-AST; intermediate merge stages omitted here.]

Fig 5 Huffman tree for pair AS

Prefix table:

        A       B       C       E       H       S       T
AS    110     111111  111110  11110   1110    10      0
AT    111111  111110  11110   10      110     1110    0
BA    Null    000     1       Null    Null    001     01
CA    Null    Null    000     001     Null    01      1
EB    01      Null    Null    1       Null    00      Null
EC    0       Null    111000  11101   110     111001  10
HA    Null    001     0011    Null    0010    01      1
HE    10      1110    110     111110  111111  0       11110
TH    01      001101  001100  1       0010    00111   000
ST    10      110101  110100  111     0       11011   1100

Encoding T with its code from the static Huffman encoding tree for single characters: 01

Encoding TH with its code from the static Huffman encoding tree for character pairs: 1

Encoded string: 011111001110010100111100101
Length of encoded string: 27 bits
Space saving: 77.5%

Decoding:
1. For decoding the input binary string, we start at the root of the previously created Huffman tree and start reading the binary input (the previously compressed data), for example: 1010010100. We decode the first character with the help of the previously stated decoding scheme of the single letter Huffman encoding method, and store the result as the first character of the output string.
2. Using the character obtained in the previous step, we find the next character using the static Huffman decoding scheme for character pairs, and store the result as the second character of the output string.

3. Now, using these two characters, we go to the tree for the pair first + second. In this tree we find the next character, moving left for zero and right for one. After this step, we have the first three characters of our decoded string.
4. Use the last two characters of the output string as the reference for the tree to be considered when decoding the next character. Add the obtained character to the output string.
5. Repeat step four till we reach the end of the binary input string. Return the decoded string.

3.3 DYNAMIC HUFFMAN ENCODING

Dynamic Huffman encoding works exactly like static Huffman encoding, except that it does not use a pre-obtained frequency table. The dynamic Huffman encoding algorithm creates the frequency table on the fly from the input string. It has its own set of advantages and disadvantages when compared with the static Huffman encoding algorithm. The output of Huffman's algorithm is a variable length code table, which is then used to encode the input string. The algorithm derives this table from the letter frequencies it computes from the input string. As in other entropy encoding methods, the more frequent characters are represented using fewer bits than the less frequent ones. The Huffman algorithm can be implemented in linear time if the character frequencies are sorted; for this purpose, our program places the obtained frequencies in a priority queue, where they are available in ascending order.

The dynamic Huffman algorithm is a five step process:
1. Create a frequency table based on the input string.
2. Using the frequency table, create a tree for all the characters involved in the given input.
3. Using the tree, create a map relating each character to its prefix.
4. Using the map, encode the input string into a binary string.
5. While communicating this text with another person or group, the map too has to be communicated; we plan on doing so using AES encryption.

Creating the Huffman tree:
1. Read through the entire string to create a frequency table, where each index represents the ASCII value of a character and the array value represents its frequency.
2. This frequency table is then used to create a Huffman tree; each non-zero index of our frequency table is used to create a leaf node, which contains the character value and its frequency.

3. These leaves are then put into a priority queue, which on each poll gives us the current lowest-frequency leaf.
4. We start with the two lowest-frequency leaves and build a Huffman tree, with its root being a null-valued node whose frequency is the sum of the two input nodes.
5. We continue doing this till our queue becomes empty.

Encode:
1. We start at the root of the previously created Huffman tree and check the left and the right branches:
   a. If the current node's left child is the leaf for the input character, add 0 to the output string; if its right child is, add 1. Move to the next input character from the input string.
   b. Else, descend into the branch whose subtree contains the character, adding 0 for the left branch and 1 for the right branch.
   c. Repeat steps a and b till we reach the required leaf node.

   d. Each time we reach a required character, we add it to our map along with its prefix.

Example working of the Huffman encoding process:

INPUT (x): THECATHASTHEBAT
Length of x: 15 * 8 = 120 bits

Creating the frequency table based on the input string:

A  B  C  E  H  S  T
3  1  1  2  3  1  4

[The intermediate stages of the construction, in which the two lowest-frequency nodes are repeatedly merged into a single subtree, are shown in the original figure and omitted here.]

Fig 6 Huffman tree for dynamic single letter encoding

Prefix table:

A    B       C      E     H   S       T
110  111110  11110  1110  10  111111  0

Encoded string: 01011101111011001011011111101011101111101100
Length of encoded string: 44 bits
Space saving: 63.33%

Decoding:
1. We are already given the tree for the process of decoding. The tree is initially encrypted, and should be decrypted using AES.
2. We then start at the root of the Huffman tree and start reading the binary input (the previously compressed data), for example: 1010010100.
3. If the character read is 0 we move towards the left branch, and towards the right branch if the character encountered is 1.

4. Repeat step 3 till we reach a leaf node.
5. We replace the whole string of binary characters used in the path with the letter of the leaf node.
6. We then move our pointer to the end of the binary substring consumed in the previous step.
7. The whole process is repeated till we reach the end of the input string.
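Since the dynamic variant differs from section 3.1 only in where the frequencies come from, the whole pipeline is short in Python (a sketch reusing build_tree, make_codes and decode from the earlier sketches; as described above, the tree itself must accompany the message, encrypted with AES):

    from collections import Counter

    def dynamic_encode(text):
        """Count frequencies from the input itself, then build and apply
        the Huffman code. Returns the bits plus the tree the receiver needs."""
        tree = build_tree(dict(Counter(text)))
        codes = make_codes(tree)
        return "".join(codes[ch] for ch in text), tree

    # bits, tree = dynamic_encode("THECATHASTHEBAT")
    # decode(bits, tree) == "THECATHASTHEBAT"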

3.4 EXTENDED DYNAMIC HUFFMAN ENCODING

3.4.1 LETTER PAIRS

Extended dynamic Huffman encoding for character pairs works on creating 26 Huffman trees. That is, each letter has its own Huffman tree, where each leaf of a tree represents the frequency of the pair formed by the tree's letter followed by the leaf's letter. If a particular pair is the only leaf in a tree, we take its binary prefix to be 1: if E is the only leaf in the Huffman tree for Q, then the binary prefix for QE will be 1. This enables us to decode the binary string efficiently and to represent every character pair in the encoded string. When there is a single leaf, no proper Huffman tree can actually be built, hence the modification.

Encoding:
1. Create a 26*26 matrix and traverse the input string. For each character pair XY, go to the row for X and increment the entry for Y.
2. Using the above 26 by 26 matrix, create a Huffman tree for each row of the matrix using its column values.
3. The Huffman trees for each row can be created using the same method as described in the dynamic Huffman encoding method.
4. The 26 Huffman trees obtained above depict frequencies for the pairs contained in our input string, ranging from AA to ZZ. Using these Huffman trees, create an array of maps.
5. The index into the array is the character value of the first character of an input pair, and the second character of the input pair is searched for in the map obtained. The resulting prefix is the encoding of the input character pair.
6. Return the maps array.

7. Encode the first character with its prefix from the dynamic Huffman encoding for single characters.
8. The rest of the pairs are replaced with their values from the map.

Example working of the Huffman encoding process:

INPUT (x): THECATHASTHEBAT
Length of x: 15 * 8 = 120 bits

Creating the frequency table based on the input string:

     A  B  C  E  H  S  T
A    0  0  0  0  0  1  2
B    1  0  0  0  0  0  0
C    1  0  0  0  0  0  0
E    0  1  1  0  0  0  0
H    1  0  0  2  0  0  0
S    0  0  0  0  0  0  1
T    0  0  0  0  3  0  0

Fig 7 Huffman tree for letter A (leaves AS: 1, AT: 2)

Fig 8 Huffman tree for letter E (leaves EB: 1, EC: 1)

Prefix table:

     A     B     C     E     H     S     T
A    Null  Null  Null  Null  Null  0     1
B    1     Null  Null  Null  Null  Null  Null
C    1     Null  Null  Null  Null  Null  Null
E    Null  0     1     Null  Null  Null  Null
H    0     Null  Null  1     Null  Null  Null
S    Null  Null  Null  Null  Null  Null  1
T    Null  Null  Null  Null  1     Null  Null

Encoding T with its code from the dynamic Huffman encoding tree for single characters: 0

Encoded string: 01111110011111
Length of encoded string: 14 bits
Space saving: 88.33%

Decoding:
1. For the process of decoding, we are provided with 26 Huffman trees, each representing a different letter in the range A-Z.
2. We decode the first prefix using the Huffman tree obtained from the dynamic Huffman encoding algorithm for single letters.
3. We keep a pointer at the beginning of the input binary string, and another pointer at the root of the Huffman tree from step 2.
4. We keep moving the pointer forward in the input string until we find a match: for each binary character from the input string, we move left on a zero and right on a one. Once we reach a leaf node, we return the leaf value, a character in the range A-Z, and output this character.
5. Using the character obtained in the previous step, we move to that character's Huffman tree, and follow steps three and four till we reach the end of the input string.
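Steps 1 and 2 of the encoder amount to a pair-frequency count followed by the per-row tree construction of the earlier sketches (count_pairs is our name; ALPHABET is as defined in section 3.2.1's sketch). Note that make_codes already assigns the 1-bit code "1" to a lone-leaf tree, matching the single-leaf convention described above.

    def count_pairs(text):
        """26x26 counts: freq[prev][cur] is incremented once per adjacent pair."""
        freq = {a: {b: 0 for b in ALPHABET} for a in ALPHABET}
        for prev, cur in zip(text, text[1:]):
            freq[prev][cur] += 1
        return freq

    # pair_tables = {a: make_codes(build_tree(row))
    #                for a, row in count_pairs(text).items()
    #                if any(row.values())}    # skip letters that never occur
    # Encoding and decoding then proceed as in section 3.2.1's sketches.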

3.4.2 TRIPLETS

Extended dynamic Huffman encoding for groups of three letters works on creating 676 Huffman trees. Each tree corresponds to a character pair in the range AA-ZZ, and each leaf of a tree reflects the frequency of a character A-Z following that pair. Hence, each character pair has its own Huffman tree. The output of this code is a 26*26 matrix representing the 676 frequency tables, which contain the binary string for each character triplet. If any of the Huffman trees has a single node, then the binary code for that triplet is taken as 1 to avoid ambiguity.

Encoding:
1. For encoding the input string, we first encode the first character through the dynamic Huffman encoding and the first two characters using the dynamic Huffman encoding for character pairs.

2. We now build a 26*26*26 matrix and store in it the frequency of every three-character group in the string.
3. Using this matrix, create a Huffman tree for each character pair, as illustrated in the dynamic Huffman encoding for single characters.
4. For each of the character pair trees, create a matrix of prefix tables, which maps each three-character group to its respective binary code.
5. Character pairs having just one node to form a triplet are encoded as 1. For example, if for the character pair AB the only triplet group that exists is ABC, then ABC is encoded as 1.
6. Using the prefix tables obtained in the previous step, encode each three letter group with its respective binary code.
7. Return the binary encoded string.
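A sketch of steps 2-4, using a sparse dictionary rather than a full 26*26*26 array, and reusing build_tree, make_codes and encode_triplets from the earlier sketches (count_triples is our name):

    from collections import Counter

    def count_triples(text):
        """freq[(a, b)][c] counts occurrences of c right after the pair ab."""
        freq = {}
        for i in range(2, len(text)):
            pair = (text[i-2], text[i-1])
            freq.setdefault(pair, Counter())[text[i]] += 1
        return freq

    # triple_tables = {pair: make_codes(build_tree(dict(counts)))
    #                  for pair, counts in count_triples(text).items()}
    # bits = encode_triplets(text, single_codes, pair_tables, triple_tables)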

Example working of the Huffman encoding process:

INPUT (x): THECATHASTHEBAT
Length of x: 15 * 8 = 120 bits

Creating the frequency matrix based on the input string:

      A  B  C  E  H  S  T
AS    0  0  0  0  0  0  1
AT    0  0  0  0  1  0  0
BA    0  0  0  0  0  0  1
CA    0  0  0  0  0  0  1
EB    1  0  0  0  0  0  0
EC    1  0  0  0  0  0  0
HA    0  0  0  0  0  1  0
HE    0  1  1  0  0  0  0
TH    1  0  0  2  0  0  0
ST    0  0  0  0  1  0  0

Fig 9 Huffman tree for pair HE (leaves HEB: 1, HEC: 1)

Fig 10 Huffman tree for pair TH (leaves THA: 1, THE: 2)

Prefix table:

      A     B     C     E     H     S     T
AS    Null  Null  Null  Null  Null  Null  1
AT    Null  Null  Null  Null  1     Null  Null
BA    Null  Null  Null  Null  Null  Null  1
CA    Null  Null  Null  Null  Null  Null  1
EB    1     Null  Null  Null  Null  Null  Null
EC    1     Null  Null  Null  Null  Null  Null
HA    Null  Null  Null  Null  Null  1     Null
HE    Null  0     1     Null  Null  Null  Null
TH    0     Null  Null  1     Null  Null  Null
ST    Null  Null  Null  Null  1     Null  Null

Encoding T with its code from the dynamic Huffman encoding tree for single characters: 0

Encoding TH with its code from the dynamic Huffman encoding tree for character pairs: 1

Encoded string: 011111101111011
Length of encoded string: 15 bits
Space saving: 87.5%

Decoding:
1. In the beginning we are provided with the 676 Huffman trees (indexed 0-675), each representing a character pair in the range AA-ZZ, where each leaf of a tree represents the frequency of its character following the tree's character pair.
2. We decode the first character using dynamic Huffman encoding and the first two characters using the dynamic Huffman encoding for character pairs. Output these values and add the characters to the output string.

3. Now, using the two letters obtained in the previous step, access the tree with that value. Define a pointer p at the beginning of the input string.
4. Move left on encountering 0 and right on encountering 1, incrementing the pointer. Continue doing this till we reach a leaf node, and output the value of the leaf node.
5. Using the last two characters of the output string, access the tree with the same value.
6. Repeat steps four and five till p reaches the end of the input string.

4. RESULTS

Input: The quick brown fox jumps over the lazy dog.
Edited: THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG

Space saving:
           STATIC   DYNAMIC
SINGLE     37.14%   42.86%
PAIRS      45.71%   84.28%
TRIPLETS   61.07%   86.78%

Input: How I need a drink, alcoholic of course, after the heavy chapters involving quantum mechanics.
Edited: HOWINEEDADRINKALCOHOLICOFCOURSEAFTERTHEHEAVYCHAPTERSINVOLVINGQUANTUMMECHANICS

Space saving:
           STATIC   DYNAMIC
SINGLE     45.61%   47.56%
PAIRS      51.94%   77.60%
TRIPLETS   53.90%   85.88%

Input: resume.txt

Space saving:
           STATIC   DYNAMIC
SINGLE     46.50%   47.19%
PAIRS      51.77%   55.56%
TRIPLETS   54.47%   72.70%

5. CONCLUSION AND FUTURE WORK

The results obtained were surprisingly good: in most cases, for the given input, we noted a 20-30% increase in compression for the modified dynamic technique and a 10-15% increase for the modified static Huffman encoding over the pre-existing techniques. This was a lot better than expected. We conclude that extended dynamic Huffman encoding works better when the input varies a good deal.

Future work involves extending the dynamic and static modified Huffman encoding algorithms to quadruplets, and further until we cannot compress the data any more. The input too could be widened to include lowercase letters and numbers. Currently, the compression concentrates only on text files [8]; this could be extended to include images and videos.

6. REFERENCES

[1] C. E. Shannon, "Communication theory of secrecy systems," Bell System Technical Journal, vol. 28, pp. 656-715, 1949.
[2] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, 1952.
[3] R. M. Capocelli, R. Giancarlo and I. J. Taneja, "Bounds on the redundancy of Huffman codes," IEEE Transactions on Information Theory, vol. IT-32, pp. 854-857, 1986.
[4] M. Nelson and J.-L. Gailly, The Data Compression Book, 2nd edition, MIS Press, 1995.
[5] H. C. Chen, Y. L. Wang and Y. F. Lan, "A memory efficient and fast Huffman decoding algorithm," Information Processing Letters, vol. 69, no. 3, pp. 119-122, February 1999.
[6] Wikipedia, "Huffman encoding."
[7] D. R. Stinson, Cryptography: Theory and Practice.
[8] T. Harris, "How File Compression Works," HowStuffWorks, 2006.
[9] https://en.wikipedia.org/wiki/most_common_words_in_english
[10] http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
[11] M. Sharma, "Compression using Huffman coding," International Journal of Computer Science and Network Security, vol. 10, no. 5, May 2010.
[12] http://homepages.math.uic.edu/~leon/mcs425-s08/handouts/char_freq2.pdf
[13] A. Jafari, M. Rezvan and A. Shahbahrami, "A comparison between arithmetic and Huffman coding algorithms," The 6th Iranian Machine Vision and Image Processing Conference, pp. 248-254, October 2010.