Binary Trees Case-studies


Carlos Moreno cmoreno @ uwaterloo.ca EIT-4103 https://ece.uwaterloo.ca/~cmoreno/ece250

Standard reminder to set phones to silent/vibrate mode, please!

Today's class: Binary Trees Case-studies. We'll look at some examples / case-studies where trees play an important role as a useful/powerful tool: binary expression trees and reverse-Polish notation, and Huffman trees (for optimal prefix-code data compression).

Binary expression trees: Basic idea: mathematical expressions involve exclusively binary operations (the exception being the unary minus, as in −x, but that can always be expressed as 0 − x, resorting to the binary minus). So, an expression involving a certain binary operation on two sub-expressions can be represented by a binary tree where the root node represents the operation, and the child nodes represent the operands (possibly sub-trees representing the sub-expressions).

Very simple example: The expression a + b would be represented as a tree with + at the root, and a and b as its left and right children.

If instead of b we had another expression, such as 2 × c, then we make it such that the right child is not a leaf node containing b, but a whole sub-tree representing the expression 2 × c: the root + keeps a as its left child, and its right child is a × node whose children are 2 and c.

Given the recursive nature of these expressions (an expression being a binary operation between two sub-expressions), we can extend the binary tree representation to expressions of arbitrary complexity. Internal nodes (always full nodes) store operations. Leaf nodes store literal values or variables. This is an ordered tree! (Subtraction and division are not commutative!)
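To make this concrete, here is a minimal sketch of such a node in C++ (my own illustration, not from the slides; the names ExprNode, value, left, and right are mine):

    #include <string>

    // One node of a binary expression tree: internal nodes hold an
    // operator, leaf nodes hold a literal value or a variable name.
    struct ExprNode {
        std::string value;            // e.g. "+", "a", "3"
        ExprNode*   left  = nullptr;  // null for leaf nodes
        ExprNode*   right = nullptr;  // null for leaf nodes
    };

    // The a + b example as a tree:
    ExprNode* tree = new ExprNode{"+", new ExprNode{"a"}, new ExprNode{"b"}};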

Example: 3 × (4 × a + b + c) + d / 5 + (6 − e)

BTW... How do we write (output) the contents of an expression tree? (i.e., print the represented expression given the tree)

Breadth-first? Depth-first? Pre-, post-, or in-order?

For the case of a very simple tree, it is clear that we want in-order traversal: we want operand1, operator, operand2 (in the simple a + b example: a, followed by +, followed by b).

Does it work for the bigger example as well?

Perhaps just as interesting: what is the output if we use post-order depth-first traversal?

The pattern is: both operands appear first, then the operation. Does this remind you of something? 3 4 a × b + c + × d 5 / + 6 e − +
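In code, the two traversals differ only in where the node's value is printed. A sketch over the hypothetical ExprNode type above: in-order reproduces the familiar infix form (parenthesized to keep it unambiguous), and post-order produces exactly the operands-then-operator pattern just shown:

    #include <iostream>

    // In-order: left subtree, node, right subtree -> infix notation.
    void print_infix(const ExprNode* n) {
        if (!n) return;
        if (n->left) std::cout << '(';   // internal nodes are always full
        print_infix(n->left);
        std::cout << n->value;
        print_infix(n->right);
        if (n->right) std::cout << ')';
    }

    // Post-order: both operands first, then the operator -> postfix (RPN).
    void print_postfix(const ExprNode* n) {
        if (!n) return;
        print_postfix(n->left);
        print_postfix(n->right);
        std::cout << n->value << ' ';
    }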

Hint: Here's how the C++ compiler thinks that the CPU can do a + b:

    movl  -24(%rbp), %eax
    movl  -20(%rbp), %edx
    addl  %edx, %eax

(Again, this is Intel x86 assembly in AT&T-style notation, a little different from what you're used to seeing in ECE-222, but the principle is the same.)

The principle being: the CPU language requires that the two operands be loaded into registers first; then the instruction to perform the operation is issued. So, it turns out that this notation in reverse could be useful after all! Compilers could use this tree to: do manipulations (possibly simplifications) on the expression; determine the sequence of assembly-level instructions.

BTW, this is known as reverse-Polish notation, co-created by Edsger Dijkstra (as a tool to optimize memory access by using a stack to perform operations).
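To see why a stack makes the reversed notation so convenient, here is a small stack-based RPN evaluator (an illustrative sketch handling numeric tokens only; eval_rpn is a name of my choosing):

    #include <iostream>
    #include <sstream>
    #include <stack>
    #include <string>

    // Evaluate a space-separated RPN expression: numbers are pushed,
    // and an operator pops two values and pushes the result.
    double eval_rpn(const std::string& expr) {
        std::stack<double> st;
        std::istringstream in(expr);
        std::string tok;
        while (in >> tok) {
            if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
                double rhs = st.top(); st.pop();  // operands are already on the stack
                double lhs = st.top(); st.pop();
                if      (tok == "+") st.push(lhs + rhs);
                else if (tok == "-") st.push(lhs - rhs);
                else if (tok == "*") st.push(lhs * rhs);
                else                 st.push(lhs / rhs);
            } else {
                st.push(std::stod(tok));          // a number: push it
            }
        }
        return st.top();
    }

    int main() {
        std::cout << eval_rpn("3 4 + 2 *") << '\n';  // (3 + 4) * 2 = 14
    }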

Next, we'll look at Huffman Trees for data compression...

Basic idea: Suppose we want to encode a sequence of characters that can only take one of four possible symbols. We need to encode these for transmission over a digital communications channel (or to store them in some digital storage medium). How do we proceed?

The most straightforward approach is to assign two-bit codes to each symbol (with two bits, we have four possible combinations, so we use one combination for each symbol). For example, if the symbols are A,B,C,D, we could simply say A = 00, B = 01, C = 10, D = 11 The transmitter and receiver agree on this encoding, and communication will be successful.

The sequence ABAACAAD would be encoded as: 0001000010000011 The receiver can decode the stream of bits because it knows that every two bits correspond to a character.

Of course, we'd like to optimize transmission speed, so we should minimize the number of bits used, right? However, it seems like we have no choice: since there are four possible symbols, we need two bits to represent each.

How about this twist: what if we knew that the symbol A occurs 80% of the time in the sequences being transmitted, B occurs 10% of the time, and C and D occur 5% of the time each? Could we come up with a different encoding that would reduce the number of bits to transmit? Ok, let me give you a hint: do all symbols have to encode to a fixed number of bits? Could we use a variable number of bits?

The trick is: if we're going to use different numbers of bits for different symbols, then we minimize the total number of bits by assigning fewer bits to the symbols that occur more frequently. We can easily see why this is the case: if $N_k$ is the number of times that symbol $k$ appears, and $S_k$ is the size (the number of bits) of its code, then the total number of bits is $N = N_1 S_1 + N_2 S_2 + N_3 S_3 + N_4 S_4$.

One problem is: if we have a variable number of bits, how do we know when a character starts and ends? What if one symbol is encoded as 00, another as 11, and another as 0011? How does the receiver know whether the latter is the symbol 00 followed by the symbol 11, or whether it corresponds to the single symbol 0011?

The solution is prefix codes an encoding scheme where no code can be a prefix of another code (for example, if some symbol is encoded as 01, then no other code would start with 01).

With this idea in mind, we see that a better encoding, if we know that A occurs 80% of the time, B 10% of the time, and C, D 5% of the time each, is: A = 0, B = 10, C = 110, D = 111
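As a quick sanity check (my own arithmetic, not on the slides), the expected code length under these frequencies is

    $\bar{S} = 0.80 \cdot 1 + 0.10 \cdot 2 + 0.05 \cdot 3 + 0.05 \cdot 3 = 1.3$ bits per symbol,

versus 2 bits per symbol for the fixed-length encoding: a saving of 35% on average.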

Let's try encoding the following: AAAABAAACAAAADAAABAA. It's 20 characters, so with our straightforward encoding, it would take 20 × 2 = 40 bits. With A = 0, B = 10, C = 110, D = 111, it encodes to: 00001000011000001110001000 (only 26 bits!)

Now, you may be asking (and granted, it is a very fair question!) «what on earth can this possibly have to do with reality?» Well, have you seen English text lately? Do all letters occur with the same frequency? Putting aside capitalization and punctuation, we have 26 letters; that would require 5 bits per letter. It turns out that English text can be encoded with approximately ONE bit per letter!!


This is a very simplified version of the main idea behind Information Theory, one of the most groundbreaking mathematical theories developed in recent centuries, and we all coexisted with its creator, Claude E. Shannon (1916–2001)! (BTW, we also coexisted with Edsger Dijkstra, another genius from the 20th century; he died in 2002.)

Incidentally, Claude Shannon, having created this Information Theory and given it a solid mathematical foundation, then applied it to data compression, and he got it wrong!! Not horribly wrong, though: his method, the Shannon–Fano compression scheme, is reasonably efficient... It's just not optimal. A few years later, David Huffman realized a key detail in the mechanism, and came up with an optimal compression scheme for prefix codes!

Huffman encoding is based on building the so-called Huffman encoding tree, or just Huffman tree. It is a tree that determines the optimal encoding for a set of symbols with given probabilities of occurrence. It also allows for efficient decoding of a stream of bits: when receiving a 0, we take the left child; with a 1, we take the right child; and when we reach a leaf node, we know that a character is complete (and the leaf node stores the corresponding symbol).

Let's find the optimal encoding for the set of eight symbols A,B,C,D,E,F,G,H, where they occur with probabilities 5%, 10%, 7%, 35%, 2%, 17%, 20%, 4%, respectively.

The idea is quite simple: each symbol is a node storing the symbol and its probability (initially, all nodes are disconnected). We pick the two nodes with the lowest probability, and create a new node as the parent of those two nodes; that node is assigned a probability given by the sum of its two children's probabilities. We simply repeat that until we have a tree (that is, until we have a single root node).
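Here is a minimal sketch of this construction in C++ (my own illustration; a priority queue plays the role of "pick the two lowest-probability nodes", and the names HNode, build_huffman, and print_codes are mine). Note that ties may be broken differently than on the slides, which can produce a different, but equally optimal, tree:

    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct HNode {
        double prob;
        char   symbol;                // meaningful only for leaves
        HNode* left  = nullptr;
        HNode* right = nullptr;
    };

    struct ByProb {                   // min-heap: lowest probability on top
        bool operator()(const HNode* a, const HNode* b) const {
            return a->prob > b->prob;
        }
    };

    HNode* build_huffman(const std::vector<std::pair<char, double>>& freqs) {
        std::priority_queue<HNode*, std::vector<HNode*>, ByProb> pq;
        for (const auto& [sym, p] : freqs)
            pq.push(new HNode{p, sym});
        while (pq.size() > 1) {                  // repeat until a single root remains
            HNode* a = pq.top(); pq.pop();       // the two lowest-probability nodes
            HNode* b = pq.top(); pq.pop();
            pq.push(new HNode{a->prob + b->prob, '\0', a, b});  // parent = sum
        }
        return pq.top();
    }

    // Read the codes off the tree: 0 = left branch, 1 = right branch.
    void print_codes(const HNode* n, const std::string& code = "") {
        if (!n->left && !n->right) {             // leaf: one complete symbol
            std::cout << n->symbol << " = " << code << '\n';
            return;
        }
        print_codes(n->left,  code + "0");
        print_codes(n->right, code + "1");
    }

Feeding build_huffman the eight symbols of the example below and calling print_codes on the result reproduces, up to tie-breaking, the codes we are about to derive by hand.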

Example: A, B, C, D, E, F, G, H, where they occur with probabilities 5%, 10%, 7%, 35%, 2%, 17%, 20%, 4%, respectively. Step by step:

Start: E(2), H(4), A(5), C(7), B(10), F(17), G(20), D(35), all disconnected.
Merge E(2) and H(4) into a parent with probability 6.
Merge A(5) and the 6 node into 11.
Merge C(7) and B(10) into 17.
Merge the 11 node and the 17 node into 28 (a tie with F's 17; either choice is fine).
Merge F(17) and G(20) into 37.
Merge the 28 node and D(35) into 63.
Merge the 63 node and the 37 node into the root, with probability 100.

Labeling every left branch 0 and every right branch 1, we read the codes off the finished tree:

D = 01, F = 10, G = 11
A = 0001, C = 0010, B = 0011
E = 00000, H = 00001

So, if we encode a sequence of 100 characters with the straightforward encoding, we'd take 300 bits (right? 8 possible symbols: use three bits and assign each symbol one of the 8 possible combinations). With this encoding, we'll have (assuming the average case) 35 characters being a D, 20 of them being a G, 17 being an F, etc.

The important detail being: 35 + 20 + 17 = 72 characters will take 2 bits each (since D, G, and F all take 2 bits), then 10 + 7 + 5 = 22 characters will take 4 bits (A, B, and C take 4 bits), and 6 characters will take 5 bits (both E and H take 5 bits), for a total of: 72 × 2 + 22 × 4 + 6 × 5 = 144 + 88 + 30 = 262 bits! (The compression factor is not that high, but then, the discrepancies in the probabilities were not that high either; large discrepancies are when good compression rates are possible.)

If the particular data does not fit the probabilities (it could be that, by simple randomness, we had 8% E's instead of the average 2%), then the encoding would not be optimal. Depending on the situation, what we could do is build a Huffman tree for the specific given data: instead of using the theoretical probabilities, we just count the occurrences of each symbol, and use those counts to build the tree; then we get the optimal encoding for this particular sequence of characters! Really neat, huh?

For decoding, we just use the tree. For example, let's decode the sequence 011001000100001. Walking the tree bit by bit: 01 = D, 10 = F, 01 = D, 0001 = A, 00001 = H. Answer: DFDAH
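A matching decoder sketch (again my own illustration): rebuild the tree by inserting each code from the table above as a root-to-leaf path, then walk it bit by bit, emitting a symbol and restarting at the root whenever a leaf is reached:

    #include <iostream>
    #include <map>
    #include <string>

    struct TNode {
        char   symbol = '\0';
        TNode* child[2] = {nullptr, nullptr};    // child[0] = left (0), child[1] = right (1)
    };

    int main() {
        // The code table read off the Huffman tree above.
        std::map<char, std::string> codes = {
            {'D', "01"},    {'F', "10"},    {'G', "11"},
            {'A', "0001"},  {'C', "0010"},  {'B', "0011"},
            {'E', "00000"}, {'H', "00001"}
        };
        TNode* root = new TNode;
        for (const auto& [sym, code] : codes) {  // insert each code as a path
            TNode* n = root;
            for (char bit : code) {
                int b = bit - '0';
                if (!n->child[b]) n->child[b] = new TNode;
                n = n->child[b];
            }
            n->symbol = sym;
        }
        TNode* n = root;
        for (char bit : std::string("011001000100001")) {
            n = n->child[bit - '0'];
            if (!n->child[0] && !n->child[1]) {  // leaf: a character is complete
                std::cout << n->symbol;
                n = root;                        // start over for the next one
            }
        }
        std::cout << '\n';                       // prints DFDAH
    }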

Summary: During today's session, we looked at two case-studies, hopefully showing the usefulness and power of the tree data structure. Both cases are binary trees: Expression trees, useful for compilers to translate human-readable infix expressions into CPU-level postfix expressions, and possibly to manipulate and optimize expressions. Huffman encoding trees, useful for telecommunications and data storage, as the tool provides an optimal data compression mechanism for prefix codes.