Information Science 2
Path Lengths and Huffman's Algorithm
Week 06
College of Information Science and Engineering, Ritsumeikan University
Agenda
• Review of Weeks 03-05
• Tree traversals and notations for arithmetic expressions
• Review of some of the most basic algorithms
• Binary trees and path lengths
• Huffman coding
• Quiz 2
Recall concepts from Weeks 03-05
• Arrays and linked lists
• Graphs and trees
• Binary and linear search
• Search on a binary tree
• Sorting algorithms
  - Bubble sort, selection sort, and insertion sort
Class objectives
• Discuss some of the most basic algorithms, and learn how binary trees are used to store data and expressions
• After this lecture and study, you must be able to:
  - Show infix, prefix, and postfix notations as code and as binary tree traversals
  - Understand the basic algorithm types
  - Understand and apply Huffman's algorithm to encode data
Tree traversals
• A tree traversal is the process of visiting each node in a tree data structure exactly once, in a systematic way
• We will explore three basic traversals:
  - Inorder
  - Preorder
  - Postorder
• E.g.: compilers and interpreters use these traversals in algorithms that convert computer programs into executable code
Traversal applications
• Inorder corresponds to normal infix notation for arithmetic expressions, as used in many programming languages, e.g.: 4+3 or 4 ADD 3
• Preorder corresponds to prefix notation for arithmetic expressions, as used in assembly and in languages where operators are functions, e.g.: + 4 3 or ADD(4, 3)
• Postorder corresponds to postfix notation, where the evaluation order is left-to-right, as used by interpreters and some types of calculator, e.g.: 4 3 + or 4 3 ADD
Inorder traversal
• Problem: Print a tree (where internal nodes represent the operators and external nodes the operands) in normal infix notation
• To solve it, apply the inorder rules:
  1. Traverse the left subtree
  2. Process the root
  3. Traverse the right subtree
• Pseudocode:
    inorder(node) =
      if node ≠ null then
        inorder(node.left)
        print node.value
        inorder(node.right)
• Result (parentheses added to show the grouping): (((a-b)*c)+(d+(e/(f+g))))
[Figure: expression tree for ((a-b)*c)+(d+(e/(f+g))), with operators at the internal nodes and operands at the external nodes]
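The inorder rules can be sketched in Python roughly as follows. The `Node` class is a hypothetical helper that does not appear in the slides, and the function adds parentheses around every operator node (an assumption made here) so that its output matches the fully parenthesised result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical expression-tree node; external nodes have no children."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def inorder(node: Optional[Node]) -> str:
    """Return the infix form: left subtree, then root, then right subtree."""
    if node is None:
        return ""
    if node.left is None and node.right is None:
        return node.value  # operand (external node)
    # Operator (internal node): wrap the whole subexpression in parentheses.
    return "(" + inorder(node.left) + node.value + inorder(node.right) + ")"


# The expression tree from the slide.
tree = Node("+",
            Node("*", Node("-", Node("a"), Node("b")), Node("c")),
            Node("+", Node("d"),
                 Node("/", Node("e"), Node("+", Node("f"), Node("g")))))

print(inorder(tree))  # (((a-b)*c)+(d+(e/(f+g))))
```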
Preorder traversal
• Problem: Print an expression tree in prefix notation, treating operators as if they were function calls
• To solve it, apply the preorder rules:
  1. Process the root
  2. Traverse the left subtree
  3. Traverse the right subtree
• Pseudocode:
    preorder(node) =
      if node ≠ null then
        print node.value
        preorder(node.left)
        preorder(node.right)
• Result: +(*(-(a, b), c), +(d, /(e, +(f, g))))
[Figure: the same expression tree, with root +, left subtree (a-b)*c, and right subtree d+(e/(f+g))]
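A minimal Python sketch of the preorder rules, again assuming a hypothetical `Node` class that is not part of the slides. Operators are printed like function calls on their two subtrees, which is one possible way to produce the function-call form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical expression-tree node; external nodes have no children."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder(node: Optional[Node]) -> str:
    """Return the prefix form: root first, then left and right subtrees."""
    if node is None:
        return ""
    if node.left is None and node.right is None:
        return node.value  # operand: nothing to call
    # Operator: print it like a function applied to its two subtrees.
    return f"{node.value}({preorder(node.left)}, {preorder(node.right)})"


# The expression tree from the slide.
tree = Node("+",
            Node("*", Node("-", Node("a"), Node("b")), Node("c")),
            Node("+", Node("d"),
                 Node("/", Node("e"), Node("+", Node("f"), Node("g")))))

print(preorder(tree))  # +(*(-(a, b), c), +(d, /(e, +(f, g))))
```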
Postorder traversal
• Problem: Print an expression tree in postfix notation, where operands and operators appear in the exact order they are evaluated
• To solve it, apply the postorder rules:
  1. Traverse the left subtree
  2. Traverse the right subtree
  3. Process the root
• Pseudocode:
    postorder(node) =
      if node ≠ null then
        postorder(node.left)
        postorder(node.right)
        print node.value
• Result: a b - c * d e f g + / + +
[Figure: the same expression tree, with root +, left subtree (a-b)*c, and right subtree d+(e/(f+g))]
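The following Python sketch shows the postorder rules and, as an illustration of "the exact order they are evaluated", a small stack-based evaluator. The `Node` class, the `evaluate_postfix` helper, and the numeric values assigned to a..g are all assumptions made for this example.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical expression-tree node; external nodes have no children."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def postorder(node: Optional[Node]) -> List[str]:
    """Return the postfix tokens: left subtree, right subtree, then root."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]


OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate_postfix(tokens: List[str], values: Dict[str, float]) -> float:
    """Evaluate left to right: push operands, apply operators to the top two."""
    stack: List[float] = []
    for token in tokens:
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(values[token])
    return stack.pop()


tree = Node("+",
            Node("*", Node("-", Node("a"), Node("b")), Node("c")),
            Node("+", Node("d"),
                 Node("/", Node("e"), Node("+", Node("f"), Node("g")))))

tokens = postorder(tree)
print(" ".join(tokens))  # a b - c * d e f g + / + +

# Example values chosen only for illustration.
values = {"a": 5, "b": 3, "c": 4, "d": 1, "e": 6, "f": 1, "g": 2}
print(evaluate_postfix(tokens, values))  # 11.0
```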
Basic algorithms
• Previously, we have studied a few basic types of algorithms for data conversion, searching, and sorting
• Other basic types of algorithms include:
  - Error checking
  - Error correction
  - Compression
  - Encryption
  - Data encoding
Error checking
• All data communication, storage, and manipulation carries the possibility of errors
• An error in binary data usually means that some bits have been altered (changed)
• The simplest error-checking algorithms, parity algorithms, count the number of ones (or zeros) in, for example, a byte or word
• The parity data is sent or stored along with the data so that most errors can be detected before the data is used
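As a small illustration of a parity algorithm, the sketch below uses even parity over a single byte. The function names and the sample byte are assumptions made for this example. It also shows why only "most" errors are detected: two flipped bits cancel out.

```python
def even_parity_bit(byte: int) -> int:
    """Return 1 if the byte has an odd number of ones, otherwise 0."""
    return bin(byte & 0xFF).count("1") % 2


def passes_check(byte: int, parity_bit: int) -> bool:
    """True when the received byte still matches the stored parity bit."""
    return even_parity_bit(byte) == parity_bit


data = 0b10110010                  # four ones, so the parity bit is 0
parity = even_parity_bit(data)

one_bit_error = data ^ 0b00000100  # one bit flipped in transit
two_bit_error = data ^ 0b00000110  # two bits flipped in transit

print(passes_check(data, parity))           # True
print(passes_check(one_bit_error, parity))  # False: the error is detected
print(passes_check(two_bit_error, parity))  # True: the error goes unnoticed
```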
Error correction
• Algorithms can also be used to correct errors in data, sometimes without having to get the data again
• The simplest way is to send three copies of the same data and use the two that are the same
• Copying the data is called mirroring, but that may waste storage and communication capacity
• Various error correction algorithms are available to encode data in a format that can be checked and corrected efficiently
• Because errors are not uncommon, modern computing and communication would be almost impossible without error checking and correction
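The three-copies idea can be sketched as a majority vote. The version below votes bit by bit rather than on whole copies, which is a slight variation chosen for this example. The name `majority_vote` and the sample data are hypothetical.

```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Rebuild the data, keeping for each bit the value at least two copies share."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three copies must have the same length")
    # A bit is 1 in the result when it is 1 in at least two of the copies.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


original = b"data"
damaged = bytes([original[0] ^ 0b00100000]) + original[1:]  # one flipped bit

print(damaged)                                      # b'Data'
print(majority_vote(original, damaged, original))   # b'data'
```

The cost is visible in the call itself: three times the storage or bandwidth is spent to protect one copy.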
Data encryption
• Encryption encodes data in a format that is intentionally difficult for others to decode
• Encryption is generally a way of keeping data, and access to it, secret and secure
• After data is encrypted, it may be sent or stored, and then decrypted for use
• Access data, such as passwords, may be encrypted one-way, so that it can easily be confirmed but not easily decrypted
• Encryption and decryption have become increasingly important for almost all communication, storage, and access
Data encoding
• In addition to the problems just considered (i.e., error detection and correction, compression, and encryption), we have previously learned about other common data encoding tasks: number representation, ASCII, RGB codes, etc.
• In many situations, data encoding is done using simple tables or arithmetic
• There are also many algorithms for encoding data, for example with binary trees, that can be better in some way (faster, requiring less memory, etc.)
Data compression
• Compression means encoding data in a more compact (economical) format
• Compression allows more data to be stored, e.g., a high-definition movie, or thousands of photographs or songs, on a single disk
• Compression allows faster communication
• After data is compressed and stored, it must be expanded (or decompressed) before it can be used again
• Like all encoding, compression and expansion require both sides (transmitter and receiver) to have related algorithms
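To make the compress-then-expand round trip concrete, here is a short sketch using Python's standard `zlib` module. The choice of `zlib` and the repeated sample text are assumptions made for illustration. Its DEFLATE format combines LZ77 with Huffman coding.

```python
import zlib

# Repetitive sample data compresses well.
text = b"this is an example of a huffman tree " * 50

compressed = zlib.compress(text)        # transmitter side
restored = zlib.decompress(compressed)  # receiver side, matching algorithm

print(len(text), len(compressed))  # original size versus compressed size
print(restored == text)            # True: nothing was lost
```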
Binary trees: Complete and extended
• Recall the basic binary tree concept: a node may have 2, 1, or 0 nodes directly below it (its left and right child nodes)
• An extended binary tree has either zero (for external nodes) or two (for internal nodes) child nodes at each node
• A complete binary tree has two child nodes for each internal node at every level, with a possible exception for the last level of internal nodes
Complete binary tree:
• Check each node of the tree, using BFS (do not include the last level of internal nodes): does each checked node have two child nodes?
• If yes, it is complete
• The final level must be filled from the left
[Figure: example tree; labels: filled, filled, not filled, no vertices]
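One common way to code this check is sketched below. It uses the usual BFS formulation: once a missing child has been seen, no further child may appear. The `Node` class and the function name `is_complete` are assumptions and do not come from the slides.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary-tree node."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def is_complete(root: Optional[Node]) -> bool:
    """BFS check: after the first missing child, no later child may exist."""
    if root is None:
        return True
    queue = deque([root])
    seen_gap = False
    while queue:
        node = queue.popleft()
        for child in (node.left, node.right):
            if child is None:
                seen_gap = True
            elif seen_gap:
                return False  # a node appears to the right of a gap
            else:
                queue.append(child)
    return True


# Last level filled from the left: complete.
complete = Node(1, Node(2, Node(4), Node(5)), Node(3))
# Node 2 has no children but node 3 has one: not filled from the left.
not_complete = Node(1, Node(2), Node(3, Node(6)))

print(is_complete(complete))      # True
print(is_complete(not_complete))  # False
```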
Extended binary tree:
• Check each node of the tree, using DFS (or any other traversal): does every node have zero or two child nodes?
• Internal nodes (shown here as circles) have two child nodes
• External nodes (shown here as squares) have zero child nodes
• Suitable for encoding data
[Figure: example of an extended binary tree]
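The zero-or-two-children test is short enough to sketch as a recursive DFS. As before, `Node` and `is_extended` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary-tree node."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def is_extended(node: Optional[Node]) -> bool:
    """DFS check: every node must have either zero or two children."""
    if node is None:
        return True
    if node.left is None and node.right is None:
        return True   # external node
    if node.left is None or node.right is None:
        return False  # exactly one child breaks the rule
    return is_extended(node.left) and is_extended(node.right)


extended = Node("+", Node("a"), Node("*", Node("b"), Node("c")))
one_child = Node("+", Node("a"))

print(is_extended(extended))   # True
print(is_extended(one_child))  # False
```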
Encoding with binary trees
• A binary tree can encode a digital, binary code:
  - Each left child corresponds to a binary 0
  - Each right child corresponds to a binary 1
• In the example, the external nodes encode the octal digits; the encoded bits describe a path from the root down to a digit
[Figure: binary tree of depth 3 whose external nodes are the octal digits 0 to 7, reached by the codes 000, 001, 010, 011, 100, 101, 110, and 111 respectively]
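A rough sketch of this idea in Python, representing the tree as nested tuples `(left, right)` with the octal digits at the leaves. The tuple representation and the `encode`/`decode` names are assumptions made for this example.

```python
# Each tuple is an internal node (left, right); each int is an external node.
OCTAL_TREE = (((0, 1), (2, 3)), ((4, 5), (6, 7)))


def decode(bits, tree=OCTAL_TREE):
    """Follow the path from the root: 0 goes left, 1 goes right."""
    symbols = []
    node = tree
    for bit in bits:
        node = node[int(bit)]
        if not isinstance(node, tuple):  # reached an external node
            symbols.append(node)
            node = tree                  # start again at the root
    return symbols


def encode(symbol, tree=OCTAL_TREE, prefix=""):
    """Search the tree for the symbol and return the bits of its path."""
    if not isinstance(tree, tuple):
        return prefix if tree == symbol else None
    left = encode(symbol, tree[0], prefix + "0")
    if left is not None:
        return left
    return encode(symbol, tree[1], prefix + "1")


print(encode(5))         # 101
print(decode("101011"))  # [5, 3]
```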
Path length
• This binary tree is complete and extended
• In the octal digit code, each symbol is represented with the same number of bits: each symbol has the same path length (i.e., the number of edges between the symbol and the root)
[Figure: the octal-digit tree again; every digit 0 to 7 sits three edges below the root, with codes 000 to 111]
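Path lengths can be computed by counting edges on the way down. The sketch below again assumes a nested-tuple tree. The second, uneven tree is a made-up example added for contrast.

```python
def path_lengths(tree, depth=0):
    """Map each external node to the number of edges between it and the root."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    lengths = {}
    for child in tree:
        lengths.update(path_lengths(child, depth + 1))
    return lengths


octal_tree = (((0, 1), (2, 3)), ((4, 5), (6, 7)))
uneven_tree = ("a", ("b", "c"))  # hypothetical tree, not from the slides

print(path_lengths(octal_tree))   # every digit has path length 3
print(path_lengths(uneven_tree))  # {'a': 1, 'b': 2, 'c': 2}
```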
ASCII binary tree sample
[Figure: part of the ASCII code tree, with branches for control characters, symbols, numerals, punctuation, upper case, etc.; the branch shown in detail holds ` a b c ... z { | } ~]
• Example: q = 1110001 = 71h
• The encoding is, however, not very efficient:
  - The resulting tree is huge, even though many symbols may rarely be used
  - Every code has the same, long length, even when the corresponding symbol is used often
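The ASCII example can be checked directly in Python. The short loop at the end is an added illustration showing that frequent and rare characters alike cost seven bits.

```python
code = ord("q")
print(code, hex(code), format(code, "07b"))  # 113 0x71 1110001

# Fixed-length code: every character uses 7 bits, however often it occurs.
for ch in "e~":
    print(repr(ch), format(ord(ch), "07b"))
```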
Huffman Coding
• Huffman coding is an algorithm for building a compact tree (i.e., smaller than in the case of plain binary encoding)
• The resulting compact tree is extended but not necessarily complete: symbols that are used more frequently get fewer bits
• Huffman coding is used in many kinds of data compression, including those in image, video, and audio files
Huffman Coding algorithm: Overview
1. Get sample data (or the actual symbols) to be encoded
2. Count how many times each symbol is used in the sample data
3. Use these frequencies to build the tree from the bottom up, each frequency becoming a node
4. Start with the two least-used symbols: create a new node with these two as its child nodes, and give it the sum of their frequencies
5. Evaluate all subtrees and remaining symbols, including the new node, to find the two least-used again; repeat until a single tree remains
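The steps above can be sketched in Python with a priority queue (`heapq`). This is one possible implementation, not the only one: the name `huffman_codes`, the tuple-based tree, and the tie-breaking counter are choices made for this example. Different tie-breaking gives different, but equally compact, codes.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict


def huffman_codes(text: str) -> Dict[str, str]:
    """Build a Huffman tree bottom-up and return one bit string per symbol."""
    if not text:
        return {}
    frequencies = Counter(text)          # step 2: count each symbol
    tiebreak = count()                   # keeps heap entries comparable
    heap = [(freq, next(tiebreak), symbol)
            for symbol, freq in frequencies.items()]
    heapq.heapify(heap)

    # Steps 4 and 5: repeatedly merge the two least-used subtrees.
    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)
        freq2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), (left, right)))

    _, _, root = heap[0]
    codes: Dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # external node: a symbol
            codes[node] = prefix or "0"  # lone symbol still needs one bit

    walk(root, "")
    return codes


text = "this is an example of a huffman tree"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(total_bits)  # 135, compared with 36 * 7 = 252 bits in 7-bit ASCII
```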
Huffman Coding: Example
Sample text: "this is an example of a huffman tree"
Symbols used and their frequencies:
  a: 4    e: 4    f: 3    h: 2
  i: 2    l: 1    m: 2    n: 2
  o: 1    p: 1    r: 1    s: 2
  t: 2    u: 1    x: 1    space: 7
For the sake of simplicity, symbols that do not occur in the text are ignored (i.e., zero frequencies are not used)
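The frequency table can be reproduced with `collections.Counter`. The sorting and the "space" label below are presentation choices made for this sketch.

```python
from collections import Counter

text = "this is an example of a huffman tree"
frequencies = Counter(text)

# Most frequent first, ties in alphabetical order.
for symbol, n in sorted(frequencies.items(), key=lambda item: (-item[1], item[0])):
    label = "space" if symbol == " " else symbol
    print(f"{label}: {n}")

print(sum(frequencies.values()))  # 36 symbols in total
```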
Coding example (cont'd)
• Start with the least-used symbols
• The numbers in each node are the frequencies
• Add the frequencies of the subtrees for new nodes
• In each iteration, build from the lowest frequencies, including the symbols not yet on the tree
[Figure: the Huffman tree built from the frequency table of the sample text; node labels are frequencies, with root 36 and child nodes 16 and 20]
Coding example (cont'd)
• Now, shorter codes stand for more frequent symbols
• Also, no symbol's code begins with the complete code of another symbol, so the encoded bits can be decoded unambiguously
• Write the binary codes by frequency for all the symbols (the first two have been done for you):
  - space: 111
  - a: 000
[Figure: the finished Huffman tree for the sample text; node labels are frequencies, with root 36 and child nodes 16 and 20]
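The property that no code begins with another symbol's code is what makes decoding possible without separators. The sketch below checks it and decodes a bit string using a small, hypothetical code table. It is deliberately not the table for the sample text, so the exercise stays open.

```python
def is_prefix_free(codes):
    """True when no code is the beginning of another symbol's code."""
    words = list(codes.values())
    return not any(i != j and b.startswith(a)
                   for i, a in enumerate(words)
                   for j, b in enumerate(words))


def decode(bits, codes):
    """Read bits left to right, emitting a symbol as soon as a code matches."""
    by_code = {code: symbol for symbol, code in codes.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            symbols.append(by_code[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a complete code")
    return "".join(symbols)


good = {"a": "0", "b": "10", "c": "11"}  # hypothetical table
bad = {"a": "0", "b": "01", "c": "11"}   # "0" is the start of "01"

print(is_prefix_free(good))    # True
print(is_prefix_free(bad))     # False
print(decode("010110", good))  # abca
```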
Summary of this lecture
• After this class, you are expected to know the basic algorithm types
• Binary trees can be used to encode data
• Huffman's algorithm is an efficient way of making a compact tree
• You must be able to make a Huffman tree when given small samples of text or characters
• Tree traversal is used in algorithms for compiling and interpreting programs, and for computing
• You must be able to show examples of infix, prefix, and postfix notations as code and as binary trees
Homework
• Read these slides again
• Do the self-preparation assignments
• Learn the English terms that are new to you
Next class
• Overview and mid-semester evaluation for the first six weeks: from Week 01 to Week 06
Quiz 03