CS 337 Project 1: Minimum-Weight Binary Search Trees

September 6, 2006

1 Preliminaries

Let X denote a set of keys drawn from some totally ordered universe U, such as the set of all integers, or the set of all real numbers. Let n denote |X|, and assume that the elements of X are x_1 < x_2 < ... < x_n. Suppose we would like to process queries of the following form: given a key u in U, determine whether u belongs to X. A standard data structure for supporting such search queries is the familiar binary search tree (BST). For n > 1 there is more than one BST containing the set of keys X. For example, for n = 2, there are exactly two such BSTs: one with x_1 at the root and x_2 as the right child of the root, and one with x_2 at the root and x_1 as the left child of the root.

The n keys in X partition U into 2n + 1 sets, as follows: the set of keys less than x_1, which we denote U_0; the singleton set of keys {x_1}, which we denote U_1; the set of keys greater than x_1 and less than x_2, which we denote U_2; the singleton set of keys {x_2}, which we denote U_3; ...; the set of keys greater than x_n, which we denote U_{2n}.

Given a BST T containing X, we search for a key u in U by performing a sequence of 3-way comparisons, beginning at the root. (A 3-way comparison between two keys u and v determines whether u = v, u < v, or u > v.) The sequence of 3-way comparisons performed in such a search is completely determined by the index i of the unique set U_i to which u belongs. For any index i, let w(T, i) denote the number of 3-way comparisons used to search T for a key in U_i. Further assume that for any index i, the frequency with which we search for a key in U_i is known and is equal to f_i. Then we define the weight of BST T, denoted w(T), as follows:

    w(T) = sum_{0 <= i <= 2n} w(T, i) * f_i

In this assignment, we explore several efficient methods for computing a minimum-weight BST.
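As a concrete check of this definition, consider the n = 2 case described above. With x_1 at the root and x_2 as its right child, the comparison counts w(T, 0), ..., w(T, 4) are (1, 1, 2, 2, 2); the mirror tree gives (2, 2, 2, 1, 1). The following sketch evaluates w(T) for both trees under some arbitrary illustrative frequencies (the class and method names here are ours, not part of the assignment):

```java
// Evaluating w(T) = sum_i w(T, i) * f_i for the two BSTs on two keys.
public class WeightExample {
    // comparisons[i] plays the role of w(T, i); f[i] is the frequency f_i.
    static long weight(int[] comparisons, int[] f) {
        long w = 0;
        for (int i = 0; i < f.length; i++) w += (long) comparisons[i] * f[i];
        return w;
    }
    public static void main(String[] args) {
        int[] f = {3, 1, 4, 1, 5};             // illustrative frequencies f_0 .. f_4
        int[] rootX1 = {1, 1, 2, 2, 2};        // x_1 at the root, x_2 as right child
        int[] rootX2 = {2, 2, 2, 1, 1};        // x_2 at the root, x_1 as left child
        System.out.println(weight(rootX1, f)); // 24
        System.out.println(weight(rootX2, f)); // 22 -> the minimum-weight BST here
    }
}
```

For these frequencies the second tree wins, because it places the heavily searched regions U_3 and U_4 closer to the root.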
It is worth remarking that the particular values of the x_i's have no impact on the structure of the minimum-weight BST. Thus, we can represent an instance of the minimum-weight BST problem by an odd-length sequence of nonnegative frequency values f_0, ..., f_{2n}.

2 Dynamic Programming

There is some similarity between the minimum-weight BST problem and the optimal prefix code problem discussed in the lectures. Accordingly, one might hope to devise a greedy

algorithm, similar to Huffman's algorithm, for solving the minimum-weight BST problem. However, no such greedy algorithm is known. In this section, we instead consider somewhat more complex algorithms based on a technique called dynamic programming.

2.1 An O(n^3) Algorithm

Suppose we are given an instance I = f_0, ..., f_{2n} of the minimum-weight BST problem. For all integers i and j such that 0 <= i <= j <= n, let I(i, j) denote the instance of the minimum-weight BST problem given by the length-(2(j - i) + 1) sequence of frequency values f_{2i}, f_{2i+1}, f_{2i+2}, ..., f_{2j}. We define the size of an instance I(i, j) as j - i, since the solution to instance I(i, j) is a BST containing j - i nodes. Our plan will be to solve the given instance I = I(0, n) by successively solving all of the instances I(i, j), in nondecreasing order of size.

But how do we go about solving a general instance I(i, j)? If i = j, the instance is trivial to solve, so let us assume that i < j. The key observation underlying the dynamic programming approach is the following. Suppose that instance I(i, j) admits a minimum-weight BST T in which the left subtree T_0 of the root contains n_0 nodes and the right subtree T_1 of the root contains n_1 nodes. (Note that j - i = n_0 + n_1 + 1, since the size of instance I(i, j) dictates that T contains j - i nodes.) Then we claim that T_0 is a minimum-weight solution to instance I(i, i + n_0) and T_1 is a minimum-weight solution to instance I(j - n_1, j). We will argue the correctness of this claim in the lecture associated with this programming assignment, and also in the discussion sections. The basic idea is that if either T_0 or T_1 is not minimum-weight, then T could be improved by replacing T_0 and T_1 with minimum-weight solutions to I(i, i + n_0) and I(j - n_1, j), respectively; such an improvement contradicts our assumption that T is a minimum-weight solution to instance I(i, j).
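Read literally, the claim yields a recurrence. Writing C(i, j) for the weight of a minimum-weight solution to I(i, j), and F(i, j) for the frequency total f_{2i} + ... + f_{2j}, we have C(i, i) = 0 and, for i < j, C(i, j) = F(i, j) + min over i < m <= j of [C(i, m - 1) + C(m, j)], since every search in instance I(i, j) performs one 3-way comparison at the root x_m before continuing in a subtree. A minimal Java sketch of the resulting table computation follows; the names are ours, and this is only the recurrence, not a complete CubicBST.java (which must also handle the input format of the exercise below):

```java
// Sketch of the O(n^3) dynamic program implied by the claim above.
// C[i][j] = weight of a minimum-weight BST for instance I(i, j).
public class CubicSketch {
    static long minWeight(long[] f) {           // f has length 2n + 1
        int n = (f.length - 1) / 2;
        long[][] F = new long[n + 1][n + 1];    // F[i][j] = f_{2i} + ... + f_{2j}
        for (int i = 0; i <= n; i++) {
            F[i][i] = f[2 * i];
            for (int j = i + 1; j <= n; j++)
                F[i][j] = F[i][j - 1] + f[2 * j - 1] + f[2 * j];
        }
        long[][] C = new long[n + 1][n + 1];    // C[i][i] = 0 by default
        for (int size = 1; size <= n; size++)   // solve instances in order of size j - i
            for (int i = 0; i + size <= n; i++) {
                int j = i + size;
                long best = Long.MAX_VALUE;
                for (int m = i + 1; m <= j; m++) // try each key x_m as the root
                    best = Math.min(best, C[i][m - 1] + C[m][j]);
                C[i][j] = best + F[i][j];       // every search pays one comparison at the root
            }
        return C[0][n];
    }
    public static void main(String[] args) {
        System.out.println(minWeight(new long[]{3, 1, 4, 1, 5})); // 22
    }
}
```

The three nested loops (size, i, m) give the O(n^3) bound directly.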
Programming Exercise 1 (35 points): Use the foregoing claim to develop an algorithm for the minimum-weight BST problem with running time O(n^3). Implement your algorithm in Java. Your program should be contained in a single source file named CubicBST.java. The input to your algorithm is a sequence of minimum-weight BST instances, one per line. The line of input encoding a particular minimum-weight BST instance consists of a nonnegative integer n followed by 2n + 1 nonnegative integers representing the sequence of frequencies f_0, ..., f_{2n}. The input is terminated by a line consisting of a single negative integer. For each minimum-weight BST instance, your program should produce a single line of output, indicating the weight of a minimum-weight BST. Note: To simplify the coding task, we are not asking you to output a minimum-weight BST, just the weight of such a BST.

2.2 An O(n^2) Algorithm

For all integers i and j such that 0 <= i < j <= n, let T(i, j) denote the set of all minimum-weight solutions to instance I(i, j), and let l(i, j) denote the minimum, over all T in T(i, j), of the number of nodes in the left subtree of the root of T. We claim that the following lemma holds. (The proof of the lemma is somewhat challenging, and beyond the scope of this course.)

Lemma 1  For all integers i and j such that 0 <= i, j <= n, and j - i >= 2, we have

    l(i, j - 1) <= l(i, j) <= l(i + 1, j)
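Returning briefly to the programming exercises: the instance format used by Programming Exercises 1 and 2 can be handled by a small parsing skeleton along the following lines (a sketch under the format stated above; readInstances and the placeholder in main are our names, not a required interface):

```java
// Parsing skeleton for the CubicBST.java / QuadraticBST.java input format:
// each line is n followed by 2n + 1 frequencies; a negative n ends the input.
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class InputSkeleton {
    // Reads instances until a negative n; returns one frequency array per instance.
    static List<long[]> readInstances(Scanner in) {
        List<long[]> instances = new ArrayList<>();
        while (in.hasNextInt()) {
            int n = in.nextInt();
            if (n < 0) break;                  // a single negative integer terminates the input
            long[] f = new long[2 * n + 1];
            for (int i = 0; i < f.length; i++) f[i] = in.nextLong();
            instances.add(f);
        }
        return instances;
    }
    public static void main(String[] args) {
        for (long[] f : readInstances(new Scanner(System.in)))
            System.out.println(solve(f));      // one output line per instance
    }
    static long solve(long[] f) { return 0; }  // placeholder for your DP
}
```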

Paper & Pencil Exercise 1 (5 points): Make use of Lemma 1 to justify the correctness of an improved version of the O(n^3) algorithm developed in Section 2.1. The improved version should run in O(n^2) time.

Programming Exercise 2 (10 points): Implement your O(n^2) algorithm in Java, using a single source file named QuadraticBST.java. The input-output behavior of QuadraticBST.java should be the same as that of CubicBST.java.

3 A Greedy Algorithm for a Special Case

In this section we develop algorithms for the special case of the minimum-weight BST problem in which f_i = 0 for all odd indices i. In other words, we are now focusing on the special case in which all searches are unsuccessful. In this assignment we refer to this problem as the minimum-weight UBST problem, where the letter U is intended to remind us that all searches are unsuccessful.

The minimum-weight UBST problem turns out to be closely related to the optimal prefix code problem discussed in the lectures, which we have seen how to solve using Huffman's algorithm. Recall that the input to Huffman's algorithm is a sequence of symbols with associated frequencies, and the output is a prefix code. Assume that the symbols to be encoded are drawn from some totally ordered set. Then we define such a prefix code to be order-preserving if the ith largest symbol is mapped to the ith largest codeword, where codewords are ordered lexicographically (i.e., in "dictionary order"). It is not hard to see that the minimum-weight UBST problem is equivalent to the problem of determining a minimum-weight order-preserving prefix code for a set of n + 1 symbols with associated nonnegative frequencies f_0, f_2, f_4, ..., f_{2n}.

Paper & Pencil Exercise 2 (5 points): Let s_0 and s_1 be distinct sequences of symbols, where the sequences are drawn from some totally ordered universe. Assume that s_0 lexicographically precedes s_1.
Let t_0 (resp., t_1) denote the bit string obtained by encoding s_0 (resp., s_1) using an order-preserving prefix code. Prove that t_0 lexicographically precedes t_1.

3.1 Minimum-Weight Order-Preserving Prefix Codes

In this section we describe a Huffman-like greedy algorithm, due to Hu and Tucker, for computing a minimum-weight order-preserving prefix code. Recall that, in the case of Huffman's algorithm, the input is a bag of n nonnegative frequency values. In the order-preserving version of the problem, the frequency values are given in the form of a sequence, where the order of the sequence is determined by a given total order on the symbols. For example, if the symbols are A through Z, in alphabetic order, then the sequence begins with the frequency value for A, followed by the frequency value for B, et cetera.

The Hu-Tucker algorithm consists of two main phases. The first phase determines the length of the codeword to be assigned to each input symbol. This phase uses a greedy strategy that is similar to Huffman's algorithm, though somewhat more complicated. The second phase of the Hu-Tucker algorithm determines the specific codeword to be assigned to each symbol. It is not hard to convince yourself that the codewords are uniquely determined by the lengths computed in the first phase. Furthermore, it is relatively straightforward to compute the codewords from these lengths. In this assignment, you will not need

to implement the second phase of the Hu-Tucker algorithm, because you are only being asked to compute the weight of a minimum-weight order-preserving prefix code.

We now describe the first phase of the Hu-Tucker algorithm in greater detail. Like Huffman's algorithm, this phase proceeds by iteratively merging two of the frequencies into a single frequency (by adding them) until there is only one frequency left. Once all of the merging is complete, the length of the codeword associated with a particular symbol can be deduced by computing the number of merges associated with the frequency value of that symbol. In this assignment, you will not need to determine the codeword length for each symbol, since you are only being asked to compute the weight of a minimum-weight solution, a quantity which can be easily computed as a by-product of the merging process.

In what follows, we describe the merging rule in greater detail. A general iteration of the merging process starts with a sequence of n >= 2 ordered pairs (b_0, g_0), ..., (b_{n-1}, g_{n-1}), where the b_i's are boolean values and the g_i's are frequency values. In the first iteration, all of the b_i's are initialized to true, and the g_i's represent the input frequencies. We define an index-pair as an ordered pair of integers (i, j) such that 0 <= i < j < n. The value of such an index-pair is defined as the integer triple (g_i + g_j, i, j). An index-pair (i, j) is defined to be good if there is no integer k such that i < k < j and b_k is true; otherwise, it is bad. Let G denote the set of all good index-pairs. Note that G is guaranteed to be nonempty, since we have assumed that n >= 2. Let (u, v) denote the index-pair in G with the lexicographically least associated value. Then the Hu-Tucker algorithm removes the pair (b_v, g_v) from the current sequence, and replaces the pair (b_u, g_u) with the pair (false, g_u + g_v).
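The merging rule above can be transcribed into Java almost verbatim. The sketch below repeatedly finds the good index-pair with lexicographically least value, merges it, and accumulates each merged sum into the total weight (as with Huffman's algorithm, the weight of the code is the sum of all merge values). Note that this naive scan is deliberately simple; it does not meet the O(n^2) bound asked for in the exercise below, and the names are ours:

```java
// Naive transcription of the Hu-Tucker merging rule.
import java.util.ArrayList;

public class HuTuckerSketch {
    static long minCodeWeight(long[] freqs) {
        ArrayList<Boolean> b = new ArrayList<>();  // b_i: still an original (unmerged) frequency?
        ArrayList<Long> g = new ArrayList<>();     // g_i: current frequency values
        for (long f : freqs) { b.add(true); g.add(f); }
        long total = 0;
        while (g.size() > 1) {
            int bestU = -1, bestV = -1;
            long bestSum = Long.MAX_VALUE;
            // Enumerate good index-pairs (i, j) in lexicographic order of (i, j);
            // the strict "<" then selects the lexicographically least value (g_i + g_j, i, j).
            for (int i = 0; i < g.size() - 1; i++)
                for (int j = i + 1; j < g.size(); j++) {
                    long sum = g.get(i) + g.get(j);
                    if (sum < bestSum) { bestSum = sum; bestU = i; bestV = j; }
                    if (b.get(j)) break;  // any j' > j would straddle the true b_j, so (i, j') is bad
                }
            g.set(bestU, bestSum);        // replace (b_u, g_u) with (false, g_u + g_v)
            b.set(bestU, false);
            g.remove(bestV);              // remove (b_v, g_v)
            b.remove(bestV);
            total += bestSum;             // each merge contributes its sum to the weight
        }
        return total;
    }
    public static void main(String[] args) {
        System.out.println(minCodeWeight(new long[]{1, 2, 3})); // 9
    }
}
```

For frequencies (1, 2, 3), the order-preserving codes correspond to the two alphabetic tree shapes, with weights 9 and 11; the sketch returns 9.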
Programming Exercise 3 (35 points): Write a Java program based on the Hu-Tucker algorithm to compute the weight of a minimum-weight order-preserving prefix code. All of your source code should be provided in a single file named HuTucker.java. The input to your algorithm is a sequence of instances, one per line. The line of input encoding a particular instance consists of a positive integer n followed by n nonnegative integers representing the ordered sequence of symbol frequencies. The input is terminated by a line containing a single zero. For each instance, your program should produce a single line of output, indicating the weight of a minimum-weight order-preserving prefix code. For full credit, your algorithm should run in O(n^2) time.

3.2 An O(n log n) Algorithm

A mergeable priority queue data structure supports the usual priority queue operations, and also provides efficient support for merging two priority queues. There exist mergeable priority queues with the following performance guarantees: (1) the usual priority queue operations insert and delete-min run in O(log n) time, where n denotes the number of items in the priority queue; (2) given two priority queues Q_0 and Q_1 containing m and n items, respectively, one can combine Q_0 and Q_1 into a single priority queue Q containing all m + n items in O(log(m + n)) time.

Paper & Pencil Exercise 3 (10 points): Describe how to use a mergeable priority queue data structure to perform the same computation as the program HuTucker.java, but with O(n log n) running time per instance.
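One classical structure with exactly these guarantees is the leftist heap, in which merge is the primitive operation and insert and delete-min are implemented in terms of it. The sketch below is our illustration of such a structure (any mergeable priority queue with logarithmic merge would do, and this does not by itself answer the exercise above):

```java
// Minimal leftist heap: merge in O(log(m + n)) by walking the right spines,
// which have length O(log n) thanks to the leftist (rank) property.
public class LeftistHeap {
    static final class Node {
        long key; int rank; Node left, right;   // rank = null-path length
        Node(long k) { key = k; rank = 1; }
    }
    static Node merge(Node a, Node b) {
        if (a == null) return b;
        if (b == null) return a;
        if (a.key > b.key) { Node t = a; a = b; b = t; }  // keep the smaller root on top
        a.right = merge(a.right, b);
        int lr = (a.left == null) ? 0 : a.left.rank;
        int rr = (a.right == null) ? 0 : a.right.rank;
        if (lr < rr) { Node t = a.left; a.left = a.right; a.right = t; } // restore rank(left) >= rank(right)
        a.rank = Math.min(lr, rr) + 1;
        return a;
    }
    static Node insert(Node h, long key) { return merge(h, new Node(key)); }
    static long findMin(Node h) { return h.key; }          // assumes h != null
    static Node deleteMin(Node h) { return merge(h.left, h.right); }
    public static void main(String[] args) {
        Node h = null;
        for (long k : new long[]{5, 1, 3}) h = insert(h, k);
        while (h != null) { System.out.println(findMin(h)); h = deleteMin(h); } // 1, 3, 5
    }
}
```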

4 Turning in the Project

This project has three programming exercises and three paper & pencil exercises. The programming exercises are due at 8pm on Wednesday, September 27, and should be turned in electronically as described in the project protocol document available in the projects section of the course website. The file readme.txt should contain a separate section for each of the three programs.

It is highly recommended that you turn in all three programming exercises by the deadline. However, since you will be working on three separate programs, it is possible that you might complete some of them by the deadline, and still be working on some others. In that case we will only assess a lateness penalty on the programs that you turn in late. In the event that you opt to turn in one or more of the programs late, please avoid resubmitting your other programs, to minimize your lateness penalty.

The TA in charge of grading this project is Indrajit Roy. If you turn in all of your programs at once, you should turn them in using the command

    turnin -submit indrajit Project1 CubicBST.java QuadraticBST.java HuTucker.java readme.txt

The paper & pencil exercises are also due on Wednesday, September 27. It is highly recommended that you turn in your solutions (as a paper document) at the lecture on Wednesday, September 27. However, your paper solutions will be accepted without a lateness penalty up to 8pm on Wednesday, September 27. If you decide to turn in your solutions outside the class period, please drop them off in Greg Plaxton's office, TAY 3.132. (If no one is there when you come by, just slide your solutions under the door.) Unlike the programming portion of the assignment, please note that the 8pm deadline for turning in the paper & pencil exercises is a hard deadline. If by the deadline you haven't been able to complete all of the paper & pencil exercises, just turn in whatever you have been able to complete.
5 Questions and Project Clarifications

This project description is somewhat terse. We will be discussing the project in greater detail during the lecture period on Monday, September 11. This project will also be the primary topic of discussion at the upcoming discussion sections and office hours. Please take advantage of these opportunities to get your questions answered.

You may also use the project newsgroup utexas.class.cs337 to ask for clarifications on the project. The primary advantage of the newsgroup is that it allows you to pose a question that might be answered by another member of the class. For this project, the newsgroup will be loosely monitored by one of the TAs, Indrajit Roy. If you urgently require a timely response to a question that comes up outside of the regularly scheduled lectures, discussion sections, and office hours, then you should not simply post it to the newsgroup, because it might not get answered for a day or more. Instead, you should send an email to the course instructor or to one of the TAs. That said, to the extent possible, you should try to get your questions answered at one of the regularly scheduled meetings.

Finally, instructor clarifications regarding the project will be posted on the project web page. You are responsible for any modifications found there. Sample inputs and outputs will also be posted on the project web page.