Huffman Coding (EE 575: Source Coding Project)
Project Report
Submitted By: Raza Umar
ID: g200905090
Algorithm Description

The algorithm developed for Huffman encoding takes a string of data symbols to be encoded, along with a vector containing the respective symbol probabilities, as input. It calls two recursive functions to generate the Huffman dictionary and reports the average length of the codeword dictionary as output. The main idea of the algorithm is to use MATLAB cell structures to build the Huffman tree while keeping track of child and parent nodes. Once the tree has been built, the codeword corresponding to each input data symbol (which acts as a leaf node in the Huffman tree) can be found by simply traversing the tree from the root until that leaf node is reached.

The basic structure contains cells corresponding to the input data symbol, its probability, and its original position in the string of symbols passed to the algorithm. Two additional cells in the structure keep information about the child nodes and the codeword of the current node. One instance of this structure is made for each data symbol, so M (the number of input data symbols) instances are filled with the known information and sorted in ascending order of probability. This results in M leaf nodes corresponding to the M data symbols, arranged in ascending order of probability.

The Huffman tree is generated by passing this structure (with M nodes) to a recursive function gen_h_tree. This function combines the top two nodes (the nodes with the least probability) into one parent node. The parent node records the two combining nodes as its child nodes, and its probability is the sum of the probabilities of the child nodes. The two child nodes are then removed from the Huffman tree and, depending on its probability, the parent node is inserted into the Huffman tree such that all (M-1) remaining nodes stay in ascending order of probability. Note that replacing two child nodes with one parent node reduces the number of nodes by 1. The function is called recursively until the Huffman tree consists of only one final node with probability 1.

The Huffman dictionary is then generated by traversing this tree recursively down to the leaf nodes. Essentially, the Huffman dictionary is another structure containing cells corresponding to the input data symbol, its probability, its codeword, the length of its codeword, and its original position in the input string of data symbols. Since the Huffman tree is a binary tree, each parent node contains information about its two child nodes. The child node with the lower probability is assigned bit 1, while the child node with the higher probability is assigned bit 0. These bits are concatenated, node by node, into a vector which ultimately becomes the codeword of a node that has no children, i.e. a leaf node. Each time a leaf node is encountered, the weighted length of its codeword is accumulated in a variable avglen, initialized to 0. This variable holds the average length of the codeword dictionary once all leaf nodes have been assigned their codewords.
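The report does not reproduce the MATLAB listing, so the following is a minimal sketch of the leaf-node setup described above. The field names (symbol, prob, org_order, child, code) and the example inputs are assumptions chosen to match the description, not the author's code; the recursive functions gen_h_tree and gen_h_dict are sketched after the flowchart section.

    % Sketch: build one node per input symbol and sort by ascending probability.
    symbols = 'abcd';                 % example string of input data symbols
    prob    = [0.4 0.3 0.2 0.1];      % respective symbol probabilities

    M = numel(symbols);
    h_tree = struct('symbol', cell(1, M), 'prob', [], 'org_order', [], ...
                    'child', [], 'code', []);
    for k = 1:M
        h_tree(k).symbol    = symbols(k);
        h_tree(k).prob      = prob(k);
        h_tree(k).org_order = k;       % original position in the input string
        h_tree(k).child     = {};      % leaf nodes have no children
        h_tree(k).code      = [];      % codeword bits are assigned later
    end

    % sort the M leaf nodes in ascending order of probability
    [~, idx] = sort([h_tree.prob]);
    h_tree   = h_tree(idx);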
The codeword dictionary is then arranged according to the desired output format, e.g. either in the same order as the input data symbols (original order) or in ascending/descending order of code length. The algorithm then outputs each data symbol along with its respective codeword from the codeword dictionary.
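As a quick sanity check of the expected output (a standard textbook example, not a result taken from the report): for the same example probabilities used in the sketch above, 0.4, 0.3, 0.2 and 0.1, the Huffman procedure yields code lengths 1, 2, 3 and 3, so the reported average length should be 1.9 bits/symbol, slightly above the source entropy of roughly 1.846 bits/symbol.

    % hand-checked expectation for a 4-symbol source with code lengths 1, 2, 3, 3
    prob = [0.4 0.3 0.2 0.1];
    len  = [1 2 3 3];
    avglen  = sum(prob .* len)              % 1.9 bits/symbol
    entropy = -sum(prob .* log2(prob))      % about 1.846 bits/symbol (lower bound)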
Algorithm Flowchart

The flowchart from the report is reproduced here as a step outline (tree construction):

Start
1. Read the inputs: the string of input data symbols and the vector of respective probabilities.
2. Fill in the structure h_tree (h_tree.symbol, h_tree.prob, h_tree.org_order) corresponding to each input data symbol, and sort the M nodes in ascending order of probability.
3. Generate h_tree: if the structure has only one node, move on to dictionary generation; otherwise:
   a. Combine the top two nodes to form a parent node; the combining nodes become the two children of the parent node, and the probability of the parent node is the sum of the probabilities of the child nodes.
   b. Insert new_node: set index = 1; while new_node.prob > h_tree(index).prob, do index = index + 1; place new_node before h_tree(index) in the struct h_tree.
   c. Repeat step 3.
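A minimal MATLAB sketch of this tree-building recursion follows. The function and field names mirror those used in the flowchart, but the body is an assumption based on the description, not the author's listing.

    % gen_h_tree.m -- sketch of the recursive tree construction (assumed names)
    function h_tree = gen_h_tree(h_tree)
        % stop when only the final root node (probability 1) remains
        if numel(h_tree) == 1
            return;
        end

        % combine the two least probable nodes into one parent node
        % (fields are assigned in the same order as in the leaf-node setup,
        % so the struct arrays can be concatenated)
        new_node.symbol    = [];
        new_node.prob      = h_tree(1).prob + h_tree(2).prob;
        new_node.org_order = [];
        new_node.child     = {h_tree(1), h_tree(2)};
        new_node.code      = [];

        % remove the two children and insert the parent so that the remaining
        % M-1 nodes stay in ascending order of probability
        h_tree(1:2) = [];
        index = 1;
        while index <= numel(h_tree) && new_node.prob > h_tree(index).prob
            index = index + 1;
        end
        h_tree = [h_tree(1:index-1), new_node, h_tree(index:end)];

        % recurse until a single root node is left
        h_tree = gen_h_tree(h_tree);
    end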
Dictionary generation and output (continuation of the flowchart):

4. Generate h_dict, starting with avglen = 0 and the completed h_tree:
   a. If the current node is not a leaf node, then for i = 1:2 set h_tree.child{i}.code = [h_tree.code 2-i] and call Generate h_dict with h_tree.child{i} as input.
   b. If the current node is a leaf node, copy it to h_dict and set avglen = avglen + h_tree.prob * length(h_tree.code).
5. Display the output:
   a. Sort the h_dict instances according to the original order of the input data symbols.
   b. Output the symbols and their respective codewords.
   c. Output the avglen of the codeword dictionary.
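Similarly, a sketch of the dictionary-generation recursion under the same assumed names (the author's actual code may differ): the lower-probability child is assigned bit 1, the higher-probability child bit 0, and avglen is accumulated at each leaf.

    % gen_h_dict.m -- sketch of the recursive dictionary generation (assumed names)
    % Typical call on the root returned by gen_h_tree:
    %   [h_dict, avglen] = gen_h_dict(root, [], 0);
    function [h_dict, avglen] = gen_h_dict(node, h_dict, avglen)
        if isempty(node.child)
            % leaf node: the accumulated bits form the finished codeword
            entry.symbol    = node.symbol;
            entry.prob      = node.prob;
            entry.code      = node.code;
            entry.len       = length(node.code);     % codeword length
            entry.org_order = node.org_order;
            if isempty(h_dict)
                h_dict = entry;                      % first dictionary entry
            else
                h_dict = [h_dict, entry];            % append to the dictionary
            end
            avglen = avglen + node.prob * length(node.code);
            return;
        end

        % internal node: visit both children, appending one bit to the codeword
        for i = 1:2
            child      = node.child{i};
            child.code = [node.code, 2 - i];   % child 1 (lower prob) gets 1, child 2 gets 0
            [h_dict, avglen] = gen_h_dict(child, h_dict, avglen);
        end
    end

Restoring the original symbol order before display, as described in the text, could then be done with something like [~, idx] = sort([h_dict.org_order]); h_dict = h_dict(idx);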