Analyzing a Greedy Approximation of an MDL Summarization

Peter Fontana
fontanap@seas.upenn.edu
Faculty Advisor: Dr. Sudipto Guha
April 10, 2007

Abstract

Many OLAP (On-line Analytical Processing) applications have produced data cubes that summarize and aggregate the details of data queries. These data cubes are multi-dimensional matrices in which each cell that satisfies a specific property or trait is represented as a 1, notated as a 1-cell in this report. A cell that does not satisfy that property is represented as a 0, notated as a 0-cell in this report. In order to compress the amount of space required to represent such a matrix completely, others have used MDL (Minimum Description Length) Summarization, including MDL Summarization with Holes. While it is NP-Hard to compute the optimal MDL Summarization with Holes for a data matrix of 2 or more dimensions (proven by Bu et al. [1]), there exists a greedy algorithm to approximate the MDL Summarization with Holes, proven to give an answer within a factor of l_m log(M) of the optimal solution (proven by Guha and Tan [3]), where M is the size of the data matrix. See the Technical Approach section of this report for a definition of l_m. However, Guha and Tan [3] mention that this bound has not been proven tight. I studied this question for 2-dimensional matrices where the algorithm can only compress by covering rows and columns (here l_m = 2). Currently, I have a proof that the greedy algorithm is a 4-approximation algorithm in this special 2-dimensional case and a constant-factor (2(κ − 2))-approximation algorithm in the general case. Furthermore, I have written a program that uses the greedy approximation to MDL Summarize with Holes an arbitrary n-by-n 2-dimensional matrix of 1's and 0's.

Related Work

Currently, OLAP (On-line Analytical Processing) database applications exist and have very powerful data processing abilities, including the ability to present data at many varying levels of detail. OLAP applications aggregate the detail of the data by performing rollup operations, which take a current data sheet and produce a higher-level data sheet by grouping data cells together or classifying data cells at a higher level. Rollups are further described in [2]. Sathe and Sarawagi [6] have developed intelligent methods of performing rollup operations. With these intelligent rollup operations, data queries can be abstracted to the level of detail where every cell is classified by whether or not it satisfies a given property. All cells that satisfy the property will be represented by a 1 (1-cells) and all other cells will be represented by a 0 (0-cells).

For example, let a data matrix consist of rows representing companies, columns representing cities, and cells (i, j) containing the revenue that company i makes in city j. Performing a rollup could produce a new matrix of the same size where cell (i, j) is 1 in the new matrix if the revenue in cell (i, j) of the data matrix was $10,000 or more, and 0 otherwise. This new matrix is an abstraction that describes all the (company, city) pairs where company i made at least $10,000 in city j.
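To make the rollup in this example concrete, here is a minimal C sketch of the thresholding step. It is my own illustration (the matrix sizes, array names, and revenue figures are invented for the example), not code from the report or from [6].

#include <stdio.h>

#define COMPANIES 3
#define CITIES    4
#define THRESHOLD 10000.0  /* the $10,000 cutoff used in the example above */

int main(void) {
    /* revenue[i][j]: revenue that company i makes in city j (invented demo numbers) */
    double revenue[COMPANIES][CITIES] = {
        { 12000.0,  3000.0, 25000.0,  9999.0 },
        {   500.0, 18000.0,  7000.0, 10000.0 },
        { 30000.0,   250.0,  4000.0, 11000.0 },
    };
    int cell[COMPANIES][CITIES];  /* the rolled-up matrix of 1-cells and 0-cells */

    for (int i = 0; i < COMPANIES; i++) {
        for (int j = 0; j < CITIES; j++) {
            /* 1-cell if the (company, city) pair satisfies the property, 0-cell otherwise */
            cell[i][j] = (revenue[i][j] >= THRESHOLD) ? 1 : 0;
            printf("%d ", cell[i][j]);
        }
        printf("\n");
    }
    return 0;
}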

Matrices describing the aggregated data, such as the matrix described in the example above, can be summarized using an MDL (Minimum Description Length) Summarization. MDL Summarization is the process of summarizing the data matrix by describing rectangular regions of 1-cells. That is, instead of describing each cell of a multi-dimensional matrix, the MDL Summarization can describe the whole data matrix merely by describing the rectangular regions of the matrix that contain 1-cells. These rectangular regions are referred to as non-trivial rectangles (trivial rectangles are the individual cells and the entire matrix). (The notion of a rectangle is taken directly from [3].) Here, the problem defines the cost so that describing any rectangular region of all 1-cells costs 1.

Many variants of MDL Summarization have been considered. Two such summarization methods are the Generalized MDL Summarization (see [4]) and the MDL Summarization with Holes (as defined in [1]; this definition is also described in the paragraph below). An MDL Summarization results in a more compact representation of the data. These summarization methods produce more compact representations that result in space savings, which are especially useful for making shorter database queries [5].

This project focuses on the MDL Summarization with Holes (as defined in [1]; the definition is paraphrased here). MDL Summarization with Holes is a specialized, refined MDL Summarization in which rectangles can contain 0-cells as well as 1-cells. Since the rectangles are used to represent a region of 1-cells, the MDL Summarization with Holes also describes each 0-cell in each rectangle (these are the holes) [1]. Each 0-cell is described only once by the summarization, even if it lies in overlapping chosen rectangles [1]. In the regular MDL Summarization, if a rectangular region contained a 0-cell, the summarization could not choose that rectangle even if all the other cells were 1-cells. However, the MDL Summarization with Holes can take this rectangle and then pay a cost of 1 for each 0-cell (the holes in the rectangle). By allowing holes in the rectangles, the chosen rectangles can become much larger and encapsulate more of the 1-cells, which results in a more compact description of the data matrix [1].
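The cost model just described can be written out explicitly. The following display is my own paraphrase of the model from [1] and [3] (the symbol R for the set of chosen rectangles is mine); the second line anticipates the greedy ratio used in the Technical Approach section.

% Total description length of a summarization with holes,
% for a chosen set of rectangles R (my paraphrase of the model above):
\[
  \mathrm{cost} \;=\; |\mathcal{R}|
  \;+\; \#\{\text{0-cells covered by some } r \in \mathcal{R}\}
  \;+\; \#\{\text{1-cells covered by no } r \in \mathcal{R}\}.
\]
% A candidate rectangle containing k uncovered 1-cells and z uncovered 0-cells
% is worth taking, compared with paying 1 for each of those k 1-cells, exactly when
\[
  1 + z \;<\; k
  \qquad\Longleftrightarrow\qquad
  \frac{1+z}{k} \;<\; 1.
\]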

For data matrices of 2 or more dimensions, producing an optimal MDL Summarization with Holes has been proven to be NP-Hard by Bu et al. [1]. However, Bu et al. [1] have also proposed heuristics to produce useful MDL Summarizations with Holes. One of these heuristics, which I have investigated, is the greedy approach. Guha and Tan [3] have also examined this greedy approach for approximating the MDL Summarization with Holes. Guha and Tan [3] have proven that the greedy approach gives an approximation that is O(l_m log(M)), where M is the size of the data matrix. See the Technical Approach section of this paper for a definition of l_m.

Guha and Tan have also examined a recursive solution that expands the MDL Summarization with Holes [3]. Here, the summarization recursively selects rectangles of 1-cells, then subtracts out sub-rectangles of 0-cells, and then recurses further, alternately adding and subtracting rectangles of 1-cells and 0-cells (respectively) up to k regions (this is the k-recursive solution described in [3]). Afterwards, all 0-cells included in a rectangle of 1-cells and all 1-cells not included in a rectangle are described as individual cells. Here, Guha and Tan [3] use a Linear Programming approach instead of a greedy approach when k ≥ 2. The MDL Summarization with Holes is also called the 1-recursive Summarization in [3].

My project focuses on the 1-recursive greedy approach to the MDL Summarization with Holes (as described in [3]). This means that the algorithm chooses only 1 level of regions before describing individual cells. While this greedy approximation has been proven to be an O(l_m log(M))-approximation algorithm for the optimal MDL Summarization with Holes by Guha and Tan [3], this bound has not previously been proven tight [3]. Guha and Tan [3] describe this as an open problem by giving a proof that the greedy algorithm is an O(l_m log(M))-approximation algorithm but only giving a proof that the greedy algorithm is at best an Ω(l_m)-approximation algorithm [3]. No current paper has answered this question as of April 1, 2007. My project shows that this greedy algorithm is a constant-factor (2(κ − 2))-approximation algorithm (the factor is a constant relative to M, i.e., an O(l_m)-approximation) of the optimal MDL Summarization with Holes in the general case, and a 4-approximation algorithm in the special case. For definitions of l_m and κ, see the subsection Definitions in the Technical Approach section of this paper.

Technical Approach

I studied the greedy algorithm given for MDL Summarization with Holes in [3] and tightened the analysis of the optimality of the greedy approximation algorithm, specifically focusing on the 2-dimensional case where the only non-trivial rectangles are rows and columns. In this section, I define l_m and κ, outline the greedy algorithm, give an example of the problem in the specialized case, give an example that shows that the greedy algorithm is an Ω(2)-approximation algorithm in the specialized case (in the case 2 = l_m = κ − 2), following the example in [3], and then describe some challenges in solving this problem.

Definitions

Note: The following definitions are paraphrased from [3].

l_m: When the rectangles are specified, one can consider which rectangles completely contain other rectangles. Now, for each cell u in the matrix, consider all of the rectangles that cover u. Call this set S_u. The contains (subset) relation forms a poset over the rectangles in S_u, and more specifically a lattice, because the entire matrix is a rectangle that contains u and the matrix contains all other rectangles. Here, l_m is the size of the largest antichain of the poset of S_u over all cells u.

κ: κ is the largest number of rectangles that all contain a common cell of the matrix. (κ is defined as such in [3].) Here κ − 2 is the largest number of non-trivial rectangles that all contain (cover) a common cell; the two excluded trivial rectangles are the entire matrix and a single cell. In general, (κ − 2) ≥ l_m. In this specialized case, it happens that 2 = l_m = κ − 2.

M: M is the size of the data matrix.

Description of the Greedy Algorithm

Here is the greedy algorithm: after reading in the data matrix, the greedy algorithm looks at all of the uncovered 0-cells and 1-cells (i.e., all 0's and 1's that are not contained in a chosen row or column) and, for each unchosen row and column, computes the ratio (1 + #(uncovered 0's)) / #(uncovered 1's). If any ratio is less than 1, the greedy algorithm chooses the row or column with the smallest ratio and covers the elements in that row or column. The greedy algorithm repeats this process until no ratio is less than one or until all possible rectangular regions are chosen. The final cost of the greedy solution is the number of rows and columns chosen + the number of uncovered 1's + the number of covered 0's. To generalize this to an arbitrary set of non-trivial rectangles (instead of rows and columns), just have the algorithm calculate this ratio for each of those non-trivial rectangles instead of for each row and column.
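The following C sketch illustrates the greedy procedure just described for the row/column case. It is my own minimal illustration, not the author's program (that program is described in the next subsection); in particular, I read the greedy ratio as being computed over the uncovered cells of the candidate row or column, and all function and variable names here are mine.

#include <stdio.h>

#define N 8  /* demo size; the procedure is the same for any n-by-n matrix */

/* Greedy ratio of an unchosen row (kind == 0) or column (kind == 1), index idx:
 * (1 + #(uncovered 0's in that line)) / #(uncovered 1's in that line). */
static double greedy_ratio(int m[N][N], int row_ch[N], int col_ch[N], int kind, int idx) {
    int zeros = 0, ones = 0;
    for (int t = 0; t < N; t++) {
        int i = (kind == 0) ? idx : t;
        int j = (kind == 0) ? t : idx;
        if (row_ch[i] || col_ch[j]) continue;   /* cell already covered */
        if (m[i][j] == 0) zeros++; else ones++;
    }
    if (ones == 0) return 1e9;                  /* no uncovered 1's: never beneficial */
    return (1.0 + zeros) / ones;
}

/* Repeatedly take the row or column with the smallest ratio, as long as some ratio < 1. */
static void run_greedy(int m[N][N], int row_ch[N], int col_ch[N]) {
    for (;;) {
        double best = 1.0;                      /* only ratios strictly below 1 are beneficial */
        int best_kind = -1, best_idx = -1;
        for (int kind = 0; kind < 2; kind++)
            for (int idx = 0; idx < N; idx++) {
                if ((kind == 0 && row_ch[idx]) || (kind == 1 && col_ch[idx])) continue;
                double r = greedy_ratio(m, row_ch, col_ch, kind, idx);
                if (r < best) { best = r; best_kind = kind; best_idx = idx; }
            }
        if (best_kind < 0) break;               /* no beneficial row or column left */
        if (best_kind == 0) row_ch[best_idx] = 1; else col_ch[best_idx] = 1;
        printf("Greedy took %s %d with ratio %.3f\n",
               best_kind == 0 ? "row" : "column", best_idx, best);
    }
}

/* Final cost: #(chosen rows and columns) + #(uncovered 1's) + #(covered 0's). */
static int greedy_cost(int m[N][N], int row_ch[N], int col_ch[N]) {
    int cost = 0;
    for (int i = 0; i < N; i++) cost += row_ch[i] + col_ch[i];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int covered = row_ch[i] || col_ch[j];
            if (m[i][j] == 1 && !covered) cost++;
            if (m[i][j] == 0 && covered)  cost++;
        }
    return cost;
}

int main(void) {
    /* Small invented demo instance: 1's along row 0 and column 0, with one hole at (0,3). */
    int m[N][N] = {{0}}, row_ch[N] = {0}, col_ch[N] = {0};
    for (int t = 0; t < N; t++) { m[0][t] = 1; m[t][0] = 1; }
    m[0][3] = 0;
    run_greedy(m, row_ch, col_ch);
    printf("Final greedy cost: %d\n", greedy_cost(m, row_ch, col_ch));
    return 0;
}

On the 8-by-8 example shown later in this report, no row or column ratio falls below 1, so a procedure of this kind takes nothing, which matches the program output reproduced there.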

Description of the Program Implementation

I have implemented the greedy approximation algorithm in C for the 2-dimensional case where the non-trivial rectangles are rows and columns. The program reads in a 2-dimensional matrix of 1's and 0's from a text file, where each 1-cell is represented by a 1. The program runs the greedy approximation algorithm on the matrix and outputs the result to the screen (or into another text file if redirected). The output includes an ordered list of the rows and columns taken by the algorithm, the resulting summarized matrix, and the cost of the space needed to store the data cube both before and after the compression.
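The report does not specify the exact text-file format, so the following is a hedged sketch of the input handling and of how the uncompressed description length (simply the number of 1-cells) can be computed. The whitespace-separated format, the fixed size N, and all names are my assumptions, not details of the author's program.

#include <stdio.h>

#define N 6  /* size of the first example below; the report's program handles arbitrary n-by-n matrices */

/* Hedged sketch: read an N-by-N matrix of 0's and 1's from a text file
 * (assumed whitespace-separated) and report the length needed to describe
 * the uncompressed matrix, i.e. the number of 1-cells. */
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s matrix.txt\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    int m[N][N], ones = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (fscanf(f, "%d", &m[i][j]) != 1) {
                fprintf(stderr, "unexpected end of input\n");
                fclose(f);
                return 1;
            }
            if (m[i][j] == 1) ones++;
        }
    fclose(f);
    printf("%d by %d 2-dimensional matrix.\n", N, N);
    printf("Length needed to describe uncompressed matrix: %d\n", ones);
    return 0;
}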

An Example of the Greedy Algorithm

Here I give an ordinary example that illustrates the greedy algorithm and the results I produced. This example matrix is 6 by 6. Here is the output from my program:

6 by 6 2-dimensional matrix. Here, the 1st row is row 0 and the 1st column is column 0.
Original input matrix:
[matrix output not reproduced]
Length needed to describe uncompressed matrix: 22
Greedy Algorithm took column 0 with a greedy ratio of
Greedy Algorithm took row 5 with a greedy ratio of 0.5
Greedy Algorithm took column 5 with a greedy ratio of 0.5
Note: a covered 0 is printed as an x and a covered 1 is printed as a +.
Final compressed matrix from Greedy Algorithm:
[matrix output not reproduced]
Length needed to describe compressed matrix: 13

An Illustrative Example of the Greedy Algorithm

Using the program I wrote, I have run the 2-dimensional example for my special case of chosen coverable regions (these regions and the individual cells together are defined as rectangles) that is the example described in [3] proving that the greedy algorithm is an Ω(l_m)-approximation algorithm. Here is the output from my program for the example when the matrix is 8 by 8:

8 by 8 2-dimensional matrix. Here, the 1st row is row 0 and the 1st column is column 0.
Original input matrix:
[matrix output not reproduced]
Length needed to describe uncompressed matrix: 32
(Greedy Algorithm took nothing)
Note: a covered 0 is printed as an x and a covered 1 is printed as a +.
Final compressed matrix from Greedy Algorithm:
[matrix output not reproduced]
Length needed to describe compressed matrix: 32

Here the cost of the greedy solution is 32. However, the optimal solution is to choose rows 2-5 and columns 2-5 (i.e., cover the entire cross), which is shown below in the notation of a solution of the program:

Note: a covered 0 is printed as an x and a covered 1 is printed as a +.

00++++00
00++++00
++xxxx++
++xxxx++
++xxxx++
++xxxx++
00++++00
00++++00

The cost of this optimal solution is 24. Here, the ratio Cost of Greedy Solution / Cost of Optimal Solution = 32/24 = 4/3 < l_m = 2. Since here l_m = 2 (the rows and columns are the only overlapping rectangles), this example confirms the tightness of the specialized 2-dimensional case as proven in [3].

Now, to generalize this for an n-by-n matrix in my special case, the cost of the greedy solution becomes n · (n/2) = n²/2 (there are 4 · (n/2 · n/4) = n²/2 1-cells in this matrix and the greedy solution covers none of the cells, paying 1 for each 1-cell). The cost of the optimal solution becomes n + n²/4 (the optimal solution takes n/2 rows and n/2 columns and pays for the (n/2) · (n/2) = n²/4 0's that are covered). As n gets very large, the influence of the n term in the optimal cost becomes small (its contribution to the ratio approaches 0 as n approaches infinity), so as n approaches infinity the ratio becomes:

lim_{n→∞} (n²/2) / (n²/4 + n) = lim_{n→∞} 2 / (1 + 4/n) = 2,

which is l_m and κ − 2. This example and the reasoning behind it are a direct application of the proof in [3] that the greedy algorithm is an Ω(l_m)-approximation algorithm (and an Ω(κ − 2)-approximation algorithm) in the special 2-dimensional case. Thus, if an O(l_m)-approximation bound is proven for the greedy approximation algorithm, it must be tight.

Challenges

One challenge in solving this problem was to truly understand the formal description of the problem first, since this was one of the first times I was learning about a problem by reading research papers. By thoroughly reading the relevant sections of the relevant papers, especially [3], I gained a better understanding of the formal definitions of the problem, which helped me solve it. Terms that took me time to understand were κ and l_m. By re-reading [3] and getting a better understanding of l_m and κ, I was able to better understand what I was doing, which helped me check my proof and proof techniques and develop a correct proof.

Another challenge was learning to formally write the proof. While I often had valid ideas and could understand what I was thinking, I was inexperienced at writing a proof in a way that was concise, thorough and understandable.

This resulted in me spending much of my time rewriting the proof so that it was clearer and more understandable before my advisor could check it for correctness. While it was a challenge to learn how to properly write a proof, my advisor Dr. Sudipto Guha guided me along the way and used it as an opportunity to teach me how to write a proof.

Conclusion

I have developed a proof that the greedy algorithm is a 4-approximation algorithm in this specific case and a (2(κ − 2))-approximation algorithm in the general case. Through this project, I learned how to write a proof in a formal and understandable style that could be read by other researchers. Throughout the year, my advisor, Dr. Sudipto Guha, was very helpful, guiding me through the proof-writing process so that as I was writing proofs I was not only fixing technical errors but also writing the proof in a clearer, more concise and more precise way. This process of learning how to write a proof in a research paper has been tremendously helpful for me.

Something that enriched the process and made it easier for me to understand the problem was learning some fundamental concepts of Linear Programming, such as the Simplex Method, Duality and solving Network-Flow Problems, while I was solving this problem. This helped because [3] contains many algorithms for this problem and for similar problems that involve Linear Programming. By better understanding Linear Programming, I could better understand these algorithms, which in turn gave me a better understanding of the problem I was working on, which helped me solve it.

Throughout this report I frequently use the notation and the results in [3]. I do this because this project provides a proof that tightens a result in [3]. I have included my proof that this greedy algorithm is a 4-approximation algorithm in the special case and a (2(κ − 2))-approximation algorithm in the general case in the Proof of the MDL Greedy Approximation Bound section of this report.

References

[1] Bu, Shaofeng, Laks V.S. Lakshmanan and Raymond T. Ng. MDL Summarization with Holes. Proceedings of the 31st VLDB Conference, 2005.

[2] Chaudhuri, Surajit and Umeshwar Dayal. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, Volume 26, Issue 1, pages 65-74, 1997.

[3] Guha, Sudipto and Jinsong Tan. Recursive MDL Summarization and Approximation Algorithms.

[4] Lakshmanan, Laks V.S., Raymond T. Ng, Christine Xing Wang, Xiaodong Zhou and Theodore J. Johnson. The Generalized MDL Approach for Summarization. Proceedings of the 28th VLDB Conference, 2002.

[5] Pu, Ken Q. and Alberto O. Mendelzon. Concise Descriptions of Subsets of Structured Sets. ACM Transactions on Database Systems (TODS), Vol. 30, No. 1, March 2005.

[6] Sathe, Gayatri and Sunita Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. Proceedings of the 27th VLDB Conference, 2001.

[7] Vazirani, Vijay V. Approximation Algorithms. Berlin: Springer-Verlag. Corrected Second Printing.

Proof of the MDL Greedy Approximation Bound

This section proves the 4-approximation MDL Greedy Approximation Bound for the 2-dimensional case where the only rectangles are rows and columns, and an O(κ) = 2(κ − 2)-approximation bound in the general case, with an arbitrary number of dimensions and arbitrary non-trivial rectangles. In this proof, when the word rectangle is used, it refers to a non-trivial rectangle: any rectangle, other than the entire matrix, that contains 2 or more cells. Individual cells of the matrix and the matrix as a whole are trivial rectangles.

Definitions

The notations defined here will be used in the proofs.

M is the original n-dimensional matrix of data cells. A_G is the set of rectangles chosen by the greedy solution for the matrix M. A* is the set of rectangles chosen by the optimal solution.

Define the cost of the matrix M with respect to a solution A, denoted cost(A, M), to be the length (cost) of the description of M after applying A. By definition,

cost(A, M) = |A| + Σ_{u ∈ M, M[u]=1, u not covered by any rectangle in A} 1 + Σ_{u ∈ M, M[u]=0, u covered by at least one rectangle in A} 1.
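As a cross-check of this definition, here is a small C sketch that evaluates cost(A, M) term by term for the special case where the solution A consists of chosen rows and columns. This is my own illustration (the names cost_rows_cols, row_chosen and col_chosen are mine), not code from the report.

#include <stdio.h>

#define N 6  /* any n-by-n matrix works the same way */

/* cost(A, M) = |A| + #(1-cells not covered by any rectangle in A)
 *                  + #(0-cells covered by at least one rectangle in A),
 * specialized to solutions A made of whole rows and columns. */
static int cost_rows_cols(int m[N][N], int row_chosen[N], int col_chosen[N]) {
    int cost = 0;
    for (int i = 0; i < N; i++) cost += row_chosen[i];   /* |A|: chosen rows */
    for (int j = 0; j < N; j++) cost += col_chosen[j];   /* |A|: chosen columns */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int covered = row_chosen[i] || col_chosen[j];
            if (m[i][j] == 1 && !covered) cost++;        /* uncovered 1-cell */
            if (m[i][j] == 0 && covered)  cost++;        /* covered 0-cell (a hole) */
        }
    return cost;
}

int main(void) {
    /* Invented demo: 1's along row 2 with one missing; the solution chooses row 2 only. */
    int m[N][N] = {{0}}, rows[N] = {0}, cols[N] = {0};
    for (int j = 0; j < N; j++) m[2][j] = 1;
    m[2][4] = 0;
    rows[2] = 1;
    printf("cost(A, M) = %d\n", cost_rows_cols(m, rows, cols));  /* 1 rectangle + 1 hole = 2 */
    return 0;
}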

Define the cost of a region R of cells with respect to a solution A, denoted cost(A, R), to be the length (cost) of the description of R after applying A. Here this is used when R is a subregion of the matrix M. By definition,

cost(A, R) = |A| + Σ_{u ∈ R, R[u]=1, u not covered by any rectangle in A} 1 + Σ_{u ∈ R, R[u]=0, u covered by at least one rectangle in A} 1.

Define nont_cover(A, M) to be the region of M consisting of all the cells contained by any non-trivial rectangle chosen by A, i.e.,

nont_cover(A, M) = {u : (∃ r ∈ A)(u ∈ r)}.

Define the greedy estimate of a region R with respect to a solution A, denoted est(A, R), to be the cost the greedy solution estimates the region to cost if the algorithm chose every non-trivial rectangle that is not in A in addition to those already in A. Let Rt denote the set of all non-trivial rectangles that are not in A such that every cell each such rectangle contains is in the region R, i.e.,

Rt = {r : r ∉ A and (∀ u ∈ r) u ∈ R}.

So,

est(A, R) = cost(A, nont_cover(A, M)) + Σ_{r_i ∈ Rt} (1 + Σ_{u ∈ r_i, M[u]=0} 1).

Theorem and Proof

I prove the Theorem using Claims. I first state and prove the Claims, then I prove the Theorem.

Claim 1. cost(A_G, nont_cover(A*, M)) ≤ (κ − 2) · cost(A*, nont_cover(A*, M)).

Proof of Claim 1. The region nont_cover(A*, M) is the region containing only the cells that are contained by some non-trivial rectangle that A* chose. Look at each rectangle r that the Greedy Algorithm did not take but the optimal solution did. Now, consider what the greedy solution estimates the cost of r to be if it were to take it. That estimate of the cost of r is 1 + Σ_{u ∈ r, M[u]=0} 1. Summing all the estimates of these rectangles r (denote Rt as the collection of all these rectangles r), the total estimated additional cost is Σ_{r_i ∈ Rt} (1 + Σ_{u ∈ r_i, M[u]=0} 1). This cost is |Rt| + Σ_{u in some rectangle in Rt, M[u]=0} #{r ∈ Rt : r contains u}, which is at most |Rt| + (κ − 2) · Σ_{u in some rectangle in Rt, M[u]=0} 1, since each cell is covered by at most κ − 2 non-trivial rectangles.

Add this cost to the current cost of the rectangles chosen by A_G and this is est(A_G, nont_cover(A*, M)). Now, since the greedy algorithm did not take these rectangles, the cost of not taking the rectangles in Rt is less than the estimate of taking all of the rectangles in Rt. (If not, then some rectangle's estimate would be lower than the cost of not taking it, and then the greedy solution would have taken that rectangle.) Hence,

cost(A_G, nont_cover(A*, M)) ≤ est(A_G, nont_cover(A*, M)).

However, since the optimal solution took these rectangles, it must be less costly to take those rectangles than to not take them and pay for the cells individually. However, no matter which rectangles are taken or in what order they are taken, the optimal solution must pay at least a cost of |Rt| + Σ_{u in some rectangle in Rt, M[u]=0} 1. In the worst case, the entire cost of describing this region is the cost of describing Rt, so

est(A_G, nont_cover(A*, M)) ≤ (κ − 2) · cost(A*, nont_cover(A*, M)).

Therefore, cost(A_G, nont_cover(A*, M)) ≤ (κ − 2) · cost(A*, nont_cover(A*, M)).

Claim 2. cost(A_G, nont_cover(A_G, M)) ≤ (κ − 2) · cost((A_G ∩ A*), nont_cover(A_G, M)).

Proof of Claim 2. The region nont_cover(A_G, M) contains only (and all) the cells that are contained in a rectangle that A_G chose. Look at cost((A_G ∩ A*), nont_cover(A_G, M)), and now look at all the rectangles in A_G but not in A*. At each step, the greedy solution only takes beneficial rectangles, which are rectangles that, when chosen, will result in a reduced cost. Now there are two kinds of rectangles in A_G: rectangles in A_G ∩ A* and rectangles in A_G − A*.

Consider each rectangle r_g in A_G − A* and compare it to when it was taken relative to each rectangle r in A_G ∩ A*. Now, when r_g was taken in the greedy algorithm, it must have been beneficial to take. Now, examine the benefit of r_g after every rectangle in A_G ∩ A* has been taken. Since the order of rectangles has been changed, the benefit of each r_g can change. If r_g was taken after each r, there is no additional cost. However, if r_g was taken before r, its benefit will differ, so consider each cell u in the overlap of r_g and r. If u is a 0-cell, the ratio that the greedy algorithm sees is now even lower (more beneficial), and hence taking that rectangle results in an increased benefit to the solution formed by A_G ∩ A*.

If u is a 1-cell, then the greedy ratio for r_g can be at most one unit less beneficial to the solution, since the greedy ratio is beneficial except for the overlapped cell. This means that the greedy solution pays at most an additional cost of 1 for each rectangle r_g that covers u when the greedy solution takes r_g. Since there are at most (κ − 2) non-trivial rectangles that contain u, taking the rectangles that contain u results in at most an additional cost of (κ − 2) for u. In the worst case, the entire cost of describing this region is caused by the loss from these cells u, so the cost with the greedy rectangles is at most (κ − 2) times the cost of the solution (A_G ∩ A*) for this region.

Note: In the above paragraph, (κ − 2) and not κ is used because the individual cells and the matrix are excluded: the greedy algorithm will never take a trivial rectangle together with a non-trivial rectangle that contains or is contained by that trivial rectangle. This is because the greedy algorithm will initially take the matrix (if it is the most beneficial) or it will not take the matrix at all. As for individual cells, if A_G took a cell in addition to a non-trivial rectangle in (A_G ∩ A*) that contained the cell, the greedy solution would discard the individual cell.

Claim 3. Let R_1, R_2 be two regions, not necessarily disjoint, and let A be some solution. Then

cost(A, R_1 ∪ R_2) ≤ cost(A, R_1) + cost(A, R_2) ≤ 2 · cost(A, R_1 ∪ R_2).

Proof of Claim 3. Trivial.

Theorem 1. The MDL Greedy algorithm is a 2(κ − 2)-approximation algorithm in the general case.

Proof of Theorem 1. We will break the cost of the entire solution into the following regions, whose union is the matrix M:

1. nont_cover(A*, M) (denoted R*)
2. nont_cover(A_G, M) (denoted R_G)
3. M − (R* ∪ R_G)

By Claim 1, cost(A_G, R*) ≤ (κ − 2) · cost(A*, R*).

By Claim 2, cost(A_G, R_G) ≤ (κ − 2) · cost(A*, R_G).

Therefore, by Claim 3,

cost(A_G, R* ∪ R_G) ≤ 2(κ − 2) · cost(A*, R* ∪ R_G).

Now, cost(A_G, M − (R* ∪ R_G)) = cost(A*, M − (R* ∪ R_G)). This is because, by definition of this region, both A_G and A* pay for all the cells individually and do not have any non-trivial rectangles that cover any of these cells.

Here, since M − (R* ∪ R_G) and (R* ∪ R_G) are disjoint and no non-trivial rectangle in A* or A_G covers a single cell in the M − (R* ∪ R_G) region,

cost(A_G, M − (R* ∪ R_G)) + cost(A_G, R* ∪ R_G) = cost(A_G, M)

and

cost(A*, M − (R* ∪ R_G)) + cost(A*, R* ∪ R_G) = cost(A*, M).

Therefore,

cost(A_G, R* ∪ R_G) + cost(A_G, M − (R* ∪ R_G)) ≤ 2(κ − 2) · cost(A*, R* ∪ R_G) + cost(A*, M − (R* ∪ R_G))
cost(A_G, R* ∪ R_G) + cost(A_G, M − (R* ∪ R_G)) ≤ 2(κ − 2) · cost(A*, R* ∪ R_G) + 2(κ − 2) · cost(A*, M − (R* ∪ R_G))
(cost(A_G, R* ∪ R_G) + cost(A_G, M − (R* ∪ R_G))) ≤ 2(κ − 2) · (cost(A*, R* ∪ R_G) + cost(A*, M − (R* ∪ R_G))).

Hence, cost(A_G, M) ≤ 2(κ − 2) · cost(A*, M).

Corollary 1. In the special 2-dimensional case with only row and column non-trivial rectangles, the MDL Greedy algorithm is a 4-approximation algorithm.

Proof of Corollary 1. Immediate from Theorem 1, since (κ − 2) = 2 in this special case, because the row and column are the only overlapping non-trivial rectangles.
