An Effective Grammar-Based Compression Algorithm for Tree Structured Data
Kazunori Yamagata¹, Tomoyuki Uchida¹, Takayoshi Shoudai², and Yasuaki Nakamura¹

¹ Faculty of Information Sciences, Hiroshima City University, Hiroshima, Japan
{k yamagata@toc.cs,uchida@cs,nakamura@cs}.hiroshima-cu.ac.jp
² Department of Informatics, Kyushu University, Kasuga, Japan
shoudai@i.kyushu-u.ac.jp

Abstract. Many semistructured data such as HTML/XML files are represented by rooted trees $t$ such that all children of each internal vertex of $t$ are ordered and all edges of $t$ have labels. Such data is called tree structured data. Analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of input data without loss of information, we can speed up such a heavy process. In this paper, we consider the problem of effectively compressing an ordered rooted tree, which represents given tree structured data, without loss of information. Firstly, in order to define this problem in a grammar-based compression scheme, we present a variable replacement grammar (VRG for short) over ordered rooted trees. The grammar-based compression problem for an ordered rooted tree $T$ is defined as the problem of finding a VRG which generates only $T$ and whose size is minimum. For the grammar-based compression problem for an ordered rooted tree, we show that there is no polynomial time algorithm with approximation ratio less than $8593/8592$ unless P=NP. Secondly, based on this theoretical result, we present an effective compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report some experimental results.

1 Introduction

Background: Due to the rapid growth of information technologies, semistructured data such as HTML/XML files have been rapidly increasing in number, and each of them has become larger.
Semistructured data having tree structures are called tree structured data and are represented by rooted trees t such that all children of each internal vertex of t are ordered and all edges of t have labels. In general, analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of input data without loss of information including structural features, we can speed up such a heavy process. In this paper, we consider a problem
of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information. We must compress a given ordered rooted tree $T$ without losing the structural features which $T$ has. Hence, we cannot apply lossless compression algorithms for strings to tree structured data. The aim of this paper is to give a grammar-based compression scheme for an ordered rooted tree and to present an effective algorithm for compressing a given ordered rooted tree without loss of information in the constructed grammar-based compression scheme.

Data Model: As our data model for tree structured data, we use a variant of the Object Exchange Model (OEM, for short) presented by Abiteboul et al. [1] as follows. An object $o$ consists of an identifier, a link and a value, which are denoted by &o, link(&o) and val(&o), respectively. The identifier &o uniquely identifies the object $o$. The link link(&o) is a list (&o_1, &o_2, ..., &o_p) of the identifiers of all subobjects $o_i$ ($i = 1, 2, \ldots, p$), where $p > 0$. The value val(&o) is either a string such as a tag in HTML/XML files, or a text such as a text written in the field of PCDATA in HTML/XML files. Tree structured data is represented by an ordered rooted tree with edge labels as follows. Each vertex represents an object identifier &o. An edge (&o, &o_i) represents a reference &o_i in link(&o) and has the value val(&o_i). For any object identifier &o with link(&o) = (&o_1, &o_2, ..., &o_p), the children &o_1, &o_2, ..., &o_p of the vertex &o are ordered in this order. For example, in Fig. 1, the ordered tree $T$ represents the structure which Sample html has.

Main Results: In this paper, we consider the problem of effectively compressing an ordered rooted tree, which represents given tree structured data, without loss of information.
Firstly, in order to define such a data compression problem for an ordered rooted tree in the grammar-based compression scheme, we present a term tree consisting of tree structures and structured variables, and present a Variable Replacement Grammar (VRG for short) over ordered rooted trees which is based on Hyperedge Replacement Grammar (HRG for short, see [6]). A graph transformation of a VRG is defined as a mechanism of replacing a variable by an ordered rooted tree. In Fig. 1, as examples of a term tree and a graph transformation of a VRG, we give the term tree $t$ and the ordered rooted tree $g$ such that $T$ is obtained from the term tree $t$ and the tree $g$ by replacing all variables labeled with $x$ by $g$. The grammar-based compression problem for an ordered rooted tree $T$ is defined as the problem of finding a VRG which generates only $T$ and whose size is minimum. We can regard this grammar-based compression problem as an optimization problem for minimizing the size of a VRG which generates only $T$. Secondly, for the grammar-based compression problem for an ordered rooted tree, we show that there is no polynomial time algorithm with approximation ratio less than $8593/8592$ unless P=NP. This result shows that approximating the size of the minimum VRG to within a small constant factor is NP-hard. Next, based on this theoretical result, we present an effective grammar-based compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. This algorithm is based on a greedy approach
of replacing isomorphic subtrees $t$, which do not overlap in a given ordered rooted tree, by the same variable in increasing order of the size of $t$. Next, by improving the algorithm given by Asai et al. [3], we present an efficient algorithm for finding all candidate subtrees $s$ of a given ordered rooted tree $T$ such that $s$ can be replaced by a variable. This algorithm is a pre-processing step of our grammar-based compression algorithm. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report some experimental results comparing our algorithm with two other algorithms. One is based on a greedy approach which processes candidate subtrees that can be replaced by a variable in decreasing order of their size. The other is based on a Minimum Description Length (MDL for short) heuristic such as SUBDUE [5]. Experimental results show the effectiveness of our algorithm.

[Figure 1: panels Sample html (an HTML table of four rows; row $i$ has two cells whose font elements contain "Text i-A" and "Text i-B"), $T$, $t$ and $g$.]
Fig. 1. An HTML document Sample html, the ordered rooted tree $T$ which is a data model of Sample html, a term tree $t$ and an ordered rooted tree $g$. A variable is represented by a box with lines to its elements. The label of a box is the label of the variable. The number on the left side of a vertex denotes the ordering on its siblings.

Related Works: For a string, several grammar-based compression algorithms have been proposed [4, 8, 9, 12, 13]. Such algorithms are based on the idea of representing a string by
a context-free grammar (see [8, 12]). In particular, based on a grammar-based compression scheme, Charikar et al. [4] presented an $O(\log(n/g^*))$ approximation algorithm and Sakamoto [13] proposed a linear-time approximation algorithm which guarantees an $O(\log^2 n)$ approximation ratio, where $n$ is the length of an input string and $g^*$ is the size of the smallest grammar. On the other hand, for semistructured data, there has been little research on grammar-based compression. Hence, we need to define a new grammar-based compression scheme for an ordered rooted tree which is based on HRG (see [6]). For semistructured data which can be represented by a general graph, Cook [5] presented a practical data compression algorithm based on an MDL heuristic, which is not a grammar-based compression algorithm. For semistructured data with geometric information, we presented an effective compression algorithm in [7] by introducing the notions of a layout term graph in [15] and a substitution in logic programming. The compression scheme presented in [7] is regarded as a preliminary version of the grammar-based compression scheme presented in this paper. In the fields of data mining and knowledge discovery, there are increasing demands for effective methods for extracting information from large semistructured data. Several effective algorithms for finding frequent substructures among large tree structured data have been proposed [3, 16]. In [11], we presented an effective algorithm for extracting common structural features among ordered rooted trees. Moreover, in [10, 14], we discussed the learnability of tree patterns having tree structure, variables and ordered children from the viewpoint of machine learning.

Organization: This paper is organized as follows. In Section 2, we introduce an ordered rooted term tree and define an admissible VRG which leads us to compress an ordered tree without loss of information.
In Section 3, we define a problem of finding an admissible VRG whose size is minimum among admissible VRGs generating only an ordered rooted tree. Then, we present an effective greedy algorithm for solving this problem. In Section 4, in order to evaluate the performance of our algorithm, we report some experimental results of applying our algorithm to artificial large trees.

2 Preliminaries

2.1 Ordered Term Tree

Let $T = (V_T, E_T)$ be an ordered rooted tree with a vertex set $V_T$ and an edge set $E_T$. Let $l \geq 1$ be an integer. A list $h = (u_0, u_1, \ldots, u_l)$ of vertices in $V_T$ is called a variable (or a hyperedge) of $T$ if $u_1, \ldots, u_l$ is a sequence of consecutive children of $u_0$, i.e., $u_0$ is the parent of $u_1, \ldots, u_l$ and $u_{j+1}$ is the next sibling of $u_j$ for any $j$ with $1 \leq j < l$. Two variables $h = (u_0, u_1, \ldots, u_l)$ and $h' = (u'_0, u'_1, \ldots, u'_{l'})$ are said to be disjoint if $\{u_1, \ldots, u_l\} \cap \{u'_1, \ldots, u'_{l'}\} = \emptyset$.

Definition 1. Let $T = (V_T, E_T)$ be an ordered rooted tree and $H_T$ a set of pairwise disjoint variables of $T$. An ordered term tree obtained from $T$ and $H_T$ is a
triplet $t = (V_t, E_t, H_t)$ where $V_t = V_T$, $E_t = E_T - \bigcup_{h=(u_0,u_1,\ldots,u_l) \in H_T} \{\{u_0, u_i\} \in E_T \mid 1 \leq i \leq l\}$ and $H_t = H_T$. For two vertices $u, u' \in V_t$, we say that $u$ is the parent of $u'$ in $t$ if $u$ is the parent of $u'$ in $T$. Similarly we say that $u'$ is a child of $u$ in $t$ if $u'$ is a child of $u$ in $T$. In particular, for a vertex $u \in V_t$ with no child, we call $u$ a leaf of $t$. We define the order of the children of each vertex $u$ in $t$ as the order of the children of $u$ in $T$. We often omit the description of the ordered tree $T$ and the variable set $H_T$ because we can find them from the triplet $t = (V_t, E_t, H_t)$.

Example 1. The ordered term tree $t$ in Fig. 1 is obtained from the tree $T = (V_T, E_T)$ and the set $H_T$, where $V_T = \{v_0, v_1, \ldots, v_{17}\}$, $E_T = \{\{v_0, v_1\}, \{v_1, v_2\}, \{v_2, v_3\}, \{v_1, v_4\}, \{v_4, v_5\}, \{v_1, v_6\}, \{v_6, v_7\}, \{v_1, v_8\}, \{v_8, v_9\}, \{v_1, v_{10}\}, \{v_{10}, v_{11}\}, \{v_1, v_{12}\}, \{v_{12}, v_{13}\}, \{v_1, v_{14}\}, \{v_{14}, v_{15}\}, \{v_1, v_{16}\}, \{v_{16}, v_{17}\}\}$ and $H_T = \{(v_1, v_2, v_4), (v_1, v_6, v_8), (v_1, v_{10}, v_{12}), (v_1, v_{14}, v_{16})\}$.

For any ordered term tree $t$, a vertex $u$ of $t$, and two children $u'$ and $u''$ of $u$, we write $u' <^t_u u''$ if $u'$ is smaller than $u''$ in the order of the children of $u$. For a set or a list $D$, the number of elements in $D$ is denoted by $|D|$. We assume that every edge and variable of an ordered term tree is labeled with some words from specified languages. Let $\Lambda$ and $X$ be finite alphabets such that $\Lambda \cap X = \emptyset$. An element of $\Lambda$ is called an edge label. An element $x$ of $X$ is called a variable label and has a rank, denoted by $rank(x)$, that is a nonnegative integer. A variable $h$ has a label $x$ such that $rank(x) = |h|$. A term tree $t$ is called a term tree over $\langle \Lambda, X \rangle$ if all edges and all variables of $t$ are labeled by elements in $\Lambda$ and $X$, respectively. If $\Lambda$ and $X$ need not be specified, we often omit them.

Note. In this paper, we treat only ordered rooted term trees, and so we simply call an ordered rooted term tree a term tree.
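As a quick check of Definition 1 on Example 1, the following Python sketch derives the edge set of the term tree from the edge set of the tree and the variable set. The function name `term_tree_edges` and the encoding of vertex v_i as the integer i are illustrative assumptions, not notation from the paper.

```python
def term_tree_edges(edges, variables):
    """Return E_t: the tree edges minus every edge {u0, ui} covered by a variable.

    edges     -- set of frozenset({parent, child}) pairs (the set E_T)
    variables -- list of tuples (u0, u1, ..., ul) (the set H_T)
    """
    covered = {frozenset((h[0], ui)) for h in variables for ui in h[1:]}
    return edges - covered


# Example 1, with vertex v_i encoded as the integer i (an assumed encoding).
E_T = {frozenset(e) for e in [
    (0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7), (1, 8), (8, 9),
    (1, 10), (10, 11), (1, 12), (12, 13), (1, 14), (14, 15), (1, 16), (16, 17),
]}
H_T = [(1, 2, 4), (1, 6, 8), (1, 10, 12), (1, 14, 16)]

E_t = term_tree_edges(E_T, H_T)
print(sorted(tuple(sorted(e)) for e in E_t))
```

Under this reading, the eight edges between v1 and its children are absorbed by the four variables, so nine of the seventeen edges of the tree remain in the term tree.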
In particular, a term tree with no variable is called a ground term tree (or simply a tree) and is considered to be a tree with ordered children. For a term tree $t$ and its vertices $v_1$ and $v_i$, a path from $v_1$ to $v_i$ is a sequence $v_1, v_2, \ldots, v_i$ of distinct vertices of $t$ such that for any $j$ with $1 \leq j < i$, $v_j$ is the parent of $v_{j+1}$. Let $t = (V_t, E_t, H_t)$ be a term tree. For subsets $V_f \subseteq V_t$, $E_f \subseteq E_t$ and $H_f \subseteq H_t$, if $f = (V_f, E_f, H_f)$ is a term tree then $f$ is said to be a term subtree of $t$. For two term subtrees $f = (V_f, E_f, H_f)$ and $g = (V_g, E_g, H_g)$ of $t$, we say that $f$ and $g$ overlap in $t$ if $(E_f \cap E_g) \cup (H_f \cap H_g) \neq \emptyset$, $V_f \not\subseteq V_g$ and $V_g \not\subseteq V_f$.

Let $f$ and $g$ be term trees over $\langle \Lambda, X \rangle$ each of which has at least two vertices. Let $h = (v_0, v_1, \ldots, v_l)$ be a variable in $f$ and $\sigma = (u_0, u_1, \ldots, u_l)$ a list of $l+1$ distinct vertices in $g$ such that $u_0$ is the root of $g$ and $u_1, \ldots, u_l$ are leaves of $g$. The pair $[g, \sigma]$ of $g$ and $\sigma$ is called an $(l+1)$-hypertree over $\langle \Lambda, X \rangle$. If $l$, $\Lambda$ and $X$ need not be specified, we often omit them. The form $h \leftarrow [g, \sigma]$ is called a variable replacement for $h$. A new term tree $f' = f\{h \leftarrow [g, \sigma]\}$ is obtained by applying the variable replacement $h \leftarrow [g, \sigma]$ to $f$ in the following way. For the variable $h = (v_0, v_1, \ldots, v_l)$, we attach $g$ to $f$ by removing the variable $h$ from $H_f$ and by identifying the vertices $v_0, v_1, \ldots, v_l$ with the vertices $u_0, u_1, \ldots, u_l$ of $g$ in this order. We define a new ordering $<^{f'}_v$ on every vertex $v$ in $f'$ in the following natural way. Suppose that $v$ has more than one child and let
$v'$ and $v''$ be two children of $v$ in $f'$. We note that $v_i = u_i$ for any $0 \leq i \leq l$.
(1) If $v, v', v'' \in V_g$ and $v' <^g_v v''$, then $v' <^{f'}_v v''$.
(2) If $v, v', v'' \in V_f$ and $v' <^f_v v''$, then $v' <^{f'}_v v''$.
(3) If $v = v_0$ ($= u_0$), $v' \in V_f - \{v_1, \ldots, v_l\}$, $v'' \in V_g$, and $v' <^f_v v_1$, then $v' <^{f'}_v v''$.
(4) If $v = v_0$ ($= u_0$), $v' \in V_f - \{v_1, \ldots, v_l\}$, $v'' \in V_g$, and $v_l <^f_v v'$, then $v'' <^{f'}_v v'$.
In Fig. 2, we give an example of the new ordering on vertices in a term tree.

[Figure 2: panels $f$, $g$ and $f'$.]
Fig. 2. The new ordering on vertices in the term tree $f' = f\{h \leftarrow [g, (u_0, u_1, u_2, u_3)]\}$ where $h = (v_0, v_1, v_2, v_3)$.

2.2 Admissible Variable Replacement Grammar

Next, we formally define an admissible Variable Replacement Grammar, which generates only one tree, based on an HRG (see [6]). Let $\Lambda$ and $X$ be finite alphabets with $\Lambda \cap X = \emptyset$.

Definition 2. A Variable Replacement Grammar (VRG for short) $G = (S, R)$ over $\langle \Lambda, X \rangle$ is defined as follows:
(1) $S$ is a variable label in $X$ with $rank(S) = 0$ and is called the start variable label.
(2) $R$ is a finite set of productions of the form $x \to [g, \sigma]$, where $x$ is a variable label in $X$ with $rank(x) = l$ and $[g, \sigma]$ is an $l$-hypertree over $\langle \Lambda, X \rangle$.

Let $G = (S, R)$ be a VRG. For a variable label $x \in X$, an $l$-hypertree $[g, \sigma]$ and an integer $i \geq 1$, we define the relation $x \Rightarrow^i_G [g, \sigma]$ inductively as follows.
(1) We denote $x \Rightarrow^1_G [g, \sigma]$ if there is a production $x \to [g, \sigma]$ in $R$.
(2) For $i \geq 2$, we denote $x \Rightarrow^i_G [g, \sigma]$ if there are $j, m \geq 1$, an $l$-hypertree $[f, \sigma]$ and a variable $h$ of rank $k$ with label $y$ in $f$ such that $j + m = i$, $x \Rightarrow^j_G [f, \sigma]$, $y \Rightarrow^m_G [d, \sigma']$, and $g = f\{h \leftarrow [d, \sigma']\}$.
We write $x \Rightarrow^+_G [g, \sigma]$ if $x \Rightarrow^i_G [g, \sigma]$ for some $i \geq 1$. The graph language generated by a VRG $G = (S, R)$ is the set $L(G) = \{T \mid T \text{ is a tree and } S \Rightarrow^+_G [T, ()]\}$. Let $G = (S, R)$ be a VRG and $T$ a tree. Then, $G$ is said to be admissible if $L(G) = \{T\}$. For a given tree $T$, an admissible VRG $G$ generating only $T$ leads us to compress $T$ without loss of information, if the size of $G$ is less than the size of $T$.

Example 2. Let $G = (S, R)$ be the VRG where $R = \{S \to [t_1, ()], x \to [t_2, (u_1, u_2)], y \to [t_3, (v_1, v_2)]\}$, and $t_1$, $t_2$ and $t_3$ are the term trees in Fig. 3. Then, we can see that $G$ is admissible and $L(G) = \{T\}$, where $T$ is the tree in Fig. 3.

[Figure 3: panels $T$, $t_1$, $t_2$ and $t_3$.]
Fig. 3. A tree $T$ and term trees $t_1$, $t_2$, $t_3$.

3 Grammar-Based Compression for an Ordered Rooted Tree

In this section, we consider the problem of finding an admissible VRG which generates only a given tree and whose size is minimum. Firstly, we formally define this problem and show the hardness of solving it. Secondly, for a given tree $T$, we present an algorithm Find Freq Trees for finding all candidate ground term subtrees of $T$ which can be replaced by variables. Finally, by using
Find Freq Trees, we give an effective algorithm for finding an admissible VRG $G$ which generates only a given tree and whose size is as small as possible.

3.1 Hardness of Grammar-Based Compression Problem for an Ordered Rooted Tree

For a term tree $t = (V_t, E_t, H_t)$, we define the size of $t$ as $|t| = |V_t| + 2|E_t| + \sum_{h \in H_t} |h|$. For a VRG $G = (S, R)$, we define the size of $G$ as $|G| = \sum_{x \to [g,\sigma] \in R} (|g| + |\sigma|)$. For a tree $T$ and an admissible VRG $G$ such that $L(G) = \{T\}$, we define a compression ratio $\rho$ of $T$ w.r.t. $G$ as $\rho = \frac{|G|}{|T|} \times 100$.

Example 3. The size of the tree $T$ in Fig. 3 is $|T| = 64$. The sizes of the term trees $t_1$, $t_2$ and $t_3$ in Fig. 3 are $|t_1| = 10$ (of which $2 + 2$ is contributed by its variables), $|t_2| = 3 + (2 + 2) = 7$ and $|t_3| = 16$, respectively. Then, the size of the admissible VRG $G = (S, R)$ is $|G| = (10 + 0) + (7 + 2) + (16 + 2) = 37$, where $R = \{S \to [t_1, ()], x \to [t_2, (u_1, u_2)], y \to [t_3, (v_1, v_2)]\}$. Therefore, the compression ratio $\rho$ of $T$ w.r.t. $G$ is $\rho = \frac{37}{64} \times 100 \approx 57.8$.

A grammar-based compression problem for a tree is defined as the following problem Find Min AVRG.

Find Min AVRG
Instance: A tree $T$.
Problem: Find an admissible VRG $G$ such that $L(G) = \{T\}$ and for any admissible VRG $G'$ with $L(G') = \{T\}$, $|G| \leq |G'|$.

This problem is regarded as an optimization problem for minimizing the size of an admissible VRG which generates only a given tree. Then, we can prove the following theorem by a reduction from a restricted form of VERTEX COVER in a similar way as the proof of Theorem 3.1 in [9].

Theorem 1. There is no polynomial time algorithm for solving Find Min AVRG with approximation ratio less than $8593/8592$ unless P=NP.

This theorem shows the hardness of solving Find Min AVRG. That is, this result indicates that approximating the size of the minimum VRG to within a small constant factor is NP-hard.
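The arithmetic of Example 3 can be replayed with a small Python sketch. It assumes the size of a term tree is read as the number of vertices, plus twice the number of edges, plus the total length of its variables, and that $t_2$ has three vertices, no edges and two variables of length 2; the function names are illustrative only.

```python
def term_tree_size(num_vertices, num_edges, variable_lengths):
    """Size of a term tree: |V_t| + 2|E_t| + sum of |h| over its variables."""
    return num_vertices + 2 * num_edges + sum(variable_lengths)


def grammar_size(productions):
    """Size of a VRG given (|g|, |sigma|) for each production x -> [g, sigma]."""
    return sum(g_size + sigma_len for g_size, sigma_len in productions)


def compression_ratio(grammar, tree):
    """Compression ratio rho = |G| / |T| * 100."""
    return grammar / tree * 100


# Assumed reading of t_2 in Example 3: 3 vertices, 0 edges, two variables of length 2.
size_t2 = term_tree_size(3, 0, [2, 2])

# Productions S -> [t_1, ()], x -> [t_2, (u1, u2)], y -> [t_3, (v1, v2)].
size_G = grammar_size([(10, 0), (size_t2, 2), (16, 2)])
rho = compression_ratio(size_G, 64)
print(size_t2, size_G, round(rho, 1))
```

With these inputs the grammar size is 37 and the ratio is 37/64 × 100, so the grammar is a little under 58% of the size of the original tree.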
Based on this theoretical result, in the next subsections, we will present an effective compression algorithm for finding an admissible VRG which generates only a given tree and whose size is as small as possible.

3.2 Algorithm of Finding All Frequent Ground Term Subtrees

Let $T = (V, E, \emptyset)$ be a tree and $t = (V_t, E_t, \emptyset)$ a ground term subtree of $T$. From the definitions of a variable and a variable replacement, if there exists a path $p$
in $T$ from a vertex $v \in V_t$ to a vertex $u \in V - V_t$ such that $v$ is not the root or a leaf of $t$ and $p$ does not contain any leaf of $t$, or if for two children $w_1$ and $w_2$ of the root $r$ of $t$, there is a vertex $w \in V - V_t$ such that $w_1 <^T_r w$ and $w <^T_r w_2$, then we cannot replace $t$ by a variable even if $t$ is frequent in $T$. Under this constraint, by improving the algorithm given by Asai et al. [3] which finds all frequent ground term subtrees for a given tree $T$, we present an algorithm Find Freq Trees for finding all candidate ground term subtrees in $T$ which can be replaced by variables. A grammar-based compression algorithm for a tree, which is given later, uses Find Freq Trees as a pre-processing step.

Let $T$ be a tree and $v$ a vertex in $T$. The number of vertices in the path from the root of $T$ to $v$ is denoted by $depth_T(v)$. We assume that $next_T(v)$ returns the nearest right sibling, if any, of $v$ in $T$. We define $pa^0_T(v) = v$ and $pa^i_T(v)$ as the parent of $pa^{i-1}_T(v)$ for $i \geq 1$. A tree $T$ is said to be of normal form if $T$ satisfies the following conditions.
(1) The set of vertices of $T$ is $V_T = \{1, \ldots, k\}$.
(2) All elements in $V_T$ are numbered by preorder traversal [2] of $T$.
We can easily see that if $T$ is a tree with $k$ vertices and is of normal form, then the root of $T$ is 1 and the rightmost leaf of $T$ is $k$. For a tree $T$ of normal form having $k$ vertices, we denote the rightmost leaf of $T$ by $rml(T)$, that is, $rml(T) = k$, and denote the vertex $k - 1$ by $prevrml(T)$. The path from the root of $T$ to $rml(T)$ is called the rightmost branch. For an integer $k \geq 1$, a $k$-pattern is a tree $T$ of normal form whose number of vertices is $k$. For every $k \geq 1$, we denote the set of all $k$-patterns by $\mathcal{T}_k$ and the set of all patterns by $\mathcal{T} = \bigcup_{k \geq 1} \mathcal{T}_k$. Let $T = (V_T, E_T, \emptyset)$ and $U = (V_U, E_U, \emptyset)$ be trees.
Then, a matching function from $T$ to $U$ is any function $\pi : V_T \to V_U$ that satisfies the following conditions (1)-(4) for any vertex $v \in V_T$ which is not the root or a leaf of $T$ and any $v_1, v_2 \in V_T$.
(1) $\pi$ is a one-to-one mapping. That is, if $v_1 \neq v_2$ then $\pi(v_1) \neq \pi(v_2)$.
(2) $\pi$ preserves the parent-child relation. That is, $\{v_1, v_2\} \in E_T$ if and only if $\{\pi(v_1), \pi(v_2)\} \in E_U$. Moreover, $\{v_1, v_2\}$ in $E_T$ and $\{\pi(v_1), \pi(v_2)\}$ in $E_U$ have the same edge label.
(3) $\pi$ preserves the sibling relation. That is, $next_T(v_1) = v_2$ if and only if $next_U(\pi(v_1)) = \pi(v_2)$.
(4) All children of $\pi(v)$ in $U$ are included in the set $\{\pi(u) \mid u \in V_T\} \subseteq V_U$.
If $|V_T| = |V_U|$, a matching function from $T$ to $U$ can be regarded as an isomorphism between $T$ and $U$. Then, two trees $T$ and $U$ are said to be isomorphic if $|V_T| = |V_U|$ and there exists a matching function from $T$ to $U$. Next, a pseudo-matching function from $T$ to $U$ is any function $\pi' : V_T \to V_U$ that satisfies the above conditions (1)-(3) of the matching function $\pi$ and the following condition (4').
(4') For any internal vertex $v \in V_T$ which does not appear in the rightmost branch of $T$, all children of $\pi'(v)$ in $U$ are included in the set $\{\pi'(u) \mid u \in V_T\} \subseteq V_U$.
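Conditions (1)-(4) of a matching function can be checked mechanically. The following Python sketch is a direct, unoptimized test of a candidate mapping, under the assumption that a tree is stored as a dictionary from each vertex to the ordered list of (edge label, child) pairs; this encoding and the function names are illustrative and are not taken from the paper.

```python
def edge_labels(tree):
    """Map each (parent, child) pair of the tree to its edge label."""
    return {(p, c): lab for p, kids in tree.items() for lab, c in kids}


def next_sibling(tree):
    """Map each vertex to its nearest right sibling, when it has one."""
    nxt = {}
    for kids in tree.values():
        for (_, a), (_, b) in zip(kids, kids[1:]):
            nxt[a] = b
    return nxt


def is_matching(T, root_T, U, pi):
    """Check conditions (1)-(4) for pi, a dict from vertices of T to vertices of U."""
    vs = list(T)
    image = {pi[v] for v in vs}
    if len(image) != len(vs):  # (1) pi is one-to-one
        return False
    e_T, e_U = edge_labels(T), edge_labels(U)
    n_T, n_U = next_sibling(T), next_sibling(U)
    for a in vs:
        for b in vs:
            if a == b:
                continue
            # (2) same parent-child relation and same edge label (None if no edge)
            if e_T.get((a, b)) != e_U.get((pi[a], pi[b])):
                return False
            # (3) same next-sibling relation
            if (n_T.get(a) == b) != (n_U.get(pi[a]) == pi[b]):
                return False
    for v in vs:
        # (4) for internal non-root vertices, every child of pi(v) is in the image
        if v != root_T and T[v]:
            if any(c not in image for _, c in U[pi[v]]):
                return False
    return True


# A 3-pattern in normal form: 1 -a-> 2 -b-> 3.
T = {1: [("a", 2)], 2: [("b", 3)], 3: []}
# A host tree with two copies of the pattern below its root.
U = {1: [("a", 2), ("a", 3)], 2: [("b", 4)], 3: [("b", 5)], 4: [], 5: []}

print(is_matching(T, 1, U, {1: 1, 2: 2, 3: 4}))
print(is_matching(T, 1, U, {1: 1, 2: 3, 3: 5}))
```

In this small host tree both mappings satisfy all four conditions, so vertices 4 and 5 are the images of the rightmost leaf of the pattern. Giving vertex 2 of the host tree a second child would break condition (4) for the first mapping.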
Let $U$ be a tree. Given a $k$-pattern $T \in \mathcal{T}_k$ and a matching function $\pi$ from $T$ to $U$, we define the rightmost occurrence (the rml-occurrence for short) of $T$ w.r.t. $\pi$ and the rightmost occurrence list of $T$ w.r.t. $U$ to be $\pi(k)$ and $Roc_U(T) = \{\pi(k) \mid \pi \text{ is a matching function from } T \text{ to } U\}$, respectively. Similarly, given a $k$-pattern $T \in \mathcal{T}_k$ and a pseudo-matching function $\pi'$ from $T$ to $U$, we define the pseudo rightmost occurrence (the pseudo-rml-occurrence for short) of $T$ w.r.t. $\pi'$ and the candidate rightmost occurrence list of $T$ w.r.t. $U$ to be $\pi'(k)$ and $\widehat{Roc}_U(T) = \{\pi'(k) \mid \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$, respectively. Let $r \geq 2$ be an integer which is called an occurrence count. $T$ is said to be $r$-occurred for $U$ if $|Roc_U(T)| \geq r$ and $T$ is said to be $r$-pseudo-occurred for $U$ if $|\widehat{Roc}_U(T)| \geq r$. Then, we define the set of all $r$-occurred $k$-patterns in $\mathcal{T}_k$ for $U$ as $F_{U,k,r} = \{T \mid T \in \mathcal{T}_k, |Roc_U(T)| \geq r\}$, and $F_{U,r} = \bigcup_k F_{U,k,r} \subseteq \mathcal{T}$. We define the set of all $r$-pseudo-occurred $k$-patterns in $\mathcal{T}_k$ for $U$ as $\widehat{F}_{U,k,r} = \{T \mid T \in \mathcal{T}_k, |\widehat{Roc}_U(T)| \geq r\}$ and $\widehat{F}_{U,r} = \bigcup_k \widehat{F}_{U,k,r} \subseteq \mathcal{T}$. Let $Roc_{U,k,r} = \bigcup_{T \in F_{U,k,r}} \{\pi(k) \mid \pi \text{ is a matching function from } T \text{ to } U\}$ and let $\widehat{Roc}_{U,k,r} = \bigcup_{T \in \widehat{F}_{U,k,r}} \{\pi'(k) \mid \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$.

Let $U$ be a tree, $T$ a tree of normal form, and $Roc_U(T) = \{\pi(rml(T)) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. From the definitions of a matching function and a pseudo-matching function, for a vertex $v$ in $Roc_U(T)$, we can identify the unique matching function $\pi$ from $T$ to $U$ such that $\pi(rml(T)) = v$ and the unique pseudo-matching function $\pi'$ from $T$ to $U$ such that $\pi'(rml(T)) = v$. For a tree $T$ of normal form and a vertex $v$ of $U$, a ground term subtree $G = (V_G, E_G, \emptyset)$ of $U$ is said to be identified by $T$ and $v$ if there exists an isomorphism $\pi$ between $T$ and $G$ such that $\pi(rml(T)) = v$. Let $T \in \mathcal{T}_{k-1}$, let $p$ be any integer with $0 \leq p < depth_T(rml(T))$, and let $l \in \Lambda$ be any edge label.
Then, the $(p, l)$-expansion of $T$ is the tree $S$ obtained from $T$ by attaching a new vertex $k$ to the vertex $v$ such that the attached vertex $k$ is the rightmost child of $v$ and the edge between $k$ and $v$ has the label $l$, where $v = pa^p_T(rml(T))$, that is, $v$ is the $p$-th parent of the rightmost leaf of $T$.

In Fig. 4, given a tree $U$ and an occurrence count $r$ as inputs, we present an efficient algorithm Find Freq Trees which outputs the set $F_{U,r}$ of all $r$-occurred patterns for $U$ and the set of their rml-occurrences indexed by trees in $F_{U,r}$ w.r.t. $U$. In Fig. 5, we present a procedure Expand Trees used in Find Freq Trees. Given the set $R(T)$ calculated in line 4 of the procedure Expand Trees and the integer $p$ as inputs, for every edge label $l \in \Lambda$ and every $(p, l)$-expansion $S$ of $T$, the procedure Scanning Sibling in line 5 of the procedure Expand Trees returns the candidate rightmost occurrence list NewRoc (that is, the set of all pseudo-rml-occurrences) of $S$ w.r.t. the tree $U$ as follows. Initially, Scanning Sibling creates an empty set NewRoc. Next, for each $v \in R(T)$ and each $l \in \Lambda$, it adds the pair $(l, u)$ to NewRoc if there exists a $(p, l)$-expansion of $T$ in $U$ such that if $p = 0$ then $u$ is the leftmost child of $v$ in $U$, otherwise $u$ is the vertex $next_U(pa^{p-1}_U(v))$. Then, the following theorem holds.

Theorem 2. When a tree $U$ and an occurrence count $r \geq 2$ are given as inputs, the algorithm Find Freq Trees can correctly construct the set $F_{U,r}$ of all $r$-occurred patterns for $U$ and the set $\bigcup_{T \in F_{U,r}} Roc_U(T)$ of rml-occurrences indexed by trees in $F_{U,r}$ w.r.t. $U$ in $O(|V_U| + A^2 N + A |\widehat{F}_{U,r}| |\Lambda|)$ time, where $V_U$ is the set of vertices in $U$, $A$ is the maximum number of vertices of trees in the set $\widehat{F}_{U,r}$ of all $r$-pseudo-occurred patterns for $U$, and $N = \sum_{T \in \widehat{F}_{U,r}} |\widehat{Roc}_U(T)|$.

Algorithm Find Freq Trees
Input: A tree $U$ and an occurrence count $r \geq 2$
Output: The set $F_{U,r}$ of all $r$-occurred patterns for $U$ and their rml-occurrence lists $Roc = \bigcup_{T \in F_{U,r}} Roc_U(T)$
1. Compute $\widehat{F}_{U,1,r}$, $\widehat{Roc}_{U,1,r}$, $\widehat{F}_{U,2,r}$, and $\widehat{Roc}_{U,2,r}$ from $U$ in level-order traversal;
2. $k := 3$;
3. while $\widehat{F}_{U,k-1,r} \neq \emptyset$ do
4.   $F_{U,k-1,r}, Roc_{U,k-1,r}, \widehat{F}_{U,k,r}, \widehat{Roc}_{U,k,r}$ := Expand Trees($\widehat{F}_{U,k-1,r}$, $\widehat{Roc}_{U,k-1,r}$, $r$);
5.   $k := k + 1$;
6. end;
7. $F_{U,r} := F_{U,1,r} \cup \cdots \cup F_{U,k-2,r}$; /* $F_{U,1,r} = \widehat{F}_{U,1,r}$ */
8. $Roc := Roc_{U,1,r} \cup \cdots \cup Roc_{U,k-2,r}$; /* $Roc_{U,1,r} = \widehat{Roc}_{U,1,r}$ */
9. return $F_{U,r}$, $Roc$;
Fig. 4. Algorithm Find Freq Trees

3.3 Grammar-Based Compression Algorithm for an Ordered Rooted Tree

Let $U$ be a tree, $T$ a tree of normal form, and $Roc_U(T) = \{\pi(rml(T)) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. Then, a subset $R_T \subseteq Roc_U(T)$ is a valid subset of $Roc_U(T)$ if for any two vertices $u, v \in R_T$, $t_u$ and $t_v$ do not overlap in $U$, where $t_u$ is the ground term subtree identified by $T$ and $u$ and $t_v$ is the ground term subtree identified by $T$ and $v$. Moreover, a valid subset $R_T$ of $Roc_U(T)$ is maximal if for any subset $R$ of $Roc_U(T)$ such that $R_T \subsetneq R$, $R$ is not a valid subset of $Roc_U(T)$. We can compute a maximal valid subset $R_T$ of $Roc_U(T)$ by level-order traversal of $U$ as follows. Let $R_T = \{v_1\}$ and $Roc_U(T) = \{v_1, \ldots, v_n\}$ such that for $1 \leq i < j \leq n$, $v_i$ is found before $v_j$ by level-order traversal of $U$. For each $i = 2, \ldots, n$, we add $v_i$ to $R_T$ if there exists no vertex $u$ in $R_T$ such that $t_i$ and $t$ overlap in $U$, where $t$ is the ground term subtree of $U$ identified by $T$ and $u$ and $t_i$ is the ground term subtree of $U$ identified by $T$ and $v_i$.
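The level-order selection just described is a first-fit greedy pass. A minimal Python sketch follows, assuming that each rml-occurrence comes with the edge set of the ground term subtree it identifies and, as a simplification for copies of the same pattern, that two such subtrees overlap exactly when they share an edge; the function name and data layout are illustrative assumptions.

```python
def maximal_valid_subset(occurrences, subtree_edges, overlaps=None):
    """Greedily pick occurrences whose identified subtrees do not overlap.

    occurrences   -- rml-occurrences in the order they are found (level order)
    subtree_edges -- dict: occurrence -> frozenset of edges of its identified subtree
    overlaps      -- optional predicate on two occurrences; by default (an assumed
                     simplification) two subtrees overlap when they share an edge
    """
    if overlaps is None:
        def overlaps(a, b):
            return bool(subtree_edges[a] & subtree_edges[b])
    chosen = []
    for v in occurrences:
        if all(not overlaps(v, u) for u in chosen):
            chosen.append(v)
    return chosen


# A chain 1-2-3-4-5 and a two-edge path pattern, occurring with rightmost leaf 3, 4 or 5.
edges = {
    3: frozenset({(1, 2), (2, 3)}),
    4: frozenset({(2, 3), (3, 4)}),
    5: frozenset({(3, 4), (4, 5)}),
}
print(maximal_valid_subset([3, 4, 5], edges))
print(maximal_valid_subset([4, 3, 5], edges))
```

On this toy input the first call selects occurrences 3 and 5, while the second visiting order selects only occurrence 4. Both results are maximal valid subsets, which illustrates that maximality alone does not determine how many occurrences are replaced.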
We remark that the above maximal valid subset $R_T$ of $Roc_U(T)$ is not always best for compressing a given tree. In Fig. 6, we present a greedy algorithm Compress Tree which, given a tree $U$ and an occurrence count $r$ as inputs, finds an admissible VRG which generates only $U$ and is as small as possible. The algorithm Compress Tree is based on a greedy approach of replacing isomorphic term subtrees which do not overlap in a given tree by the same variable in order of increasing the size
Procedure Expand Trees
Input: A set $F_{old}$ of patterns, a set $Roc_{old}$ of pseudo-rml-occurrences indexed by trees in $F_{old}$ and an occurrence count $r \geq 2$.
Output: A set $FixF$ of $r$-occurred patterns for $U$, a set $FixRoc$ of their rml-occurrences indexed by trees in $FixF$, a set $F_{new}$ of the rightmost expansions of trees in $F_{old}$ and a set $Roc_{new}$ of pseudo-rml-occurrences indexed by trees in $F_{new}$ w.r.t. $U$.
1. $F_{new} := \emptyset$; $FixRoc := Roc_{old}$; $FixF := F_{old}$; $Roc_{new} := \emptyset$;
2. foreach tree $T \in F_{old}$ do
3.   foreach $0 \leq p < depth(rml(T))$ do
4.     $R(T) := \{\pi'(rml(T)) \mid \pi'(rml(T)) \in FixRoc$, $\pi'$ is a pseudo-matching function from $T$ to $U\}$;
5.     $NewRoc$ := Scanning Sibling($R(T)$, $p$);
6.     foreach $l \in \Lambda$ do
7.       compute the $(p, l)$-expansion $S$ of $T$;
8.       $NewRoc(l) := \{v \mid (l, v) \in NewRoc\}$;
9.       if $|NewRoc(l)| \geq r$ then
10.        $F_{new} := F_{new} \cup \{S\}$;
11.        $Roc_{new} := Roc_{new} \cup \{(S, v) \mid v \in NewRoc(l)\}$; /* end of if */
12.      if $p \neq 0$ and $p \neq depth(rml(T)) - 1$ then
13.        while $NewRoc(l) \neq \emptyset$ do
14.          choose a vertex $v$ in $NewRoc(l)$;
15.          $FixRoc := FixRoc - \{\pi'(prevrml(S)) \mid \pi'$ is a pseudo-matching function from $S$ to $U$ and $\pi'(rml(S)) = v\}$;
16.          $NewRoc(l) := NewRoc(l) - \{v\}$;
17.        end; /* end of if */
18.    end;
19.    $R(T) := \{\pi'(rml(T)) \mid \pi'(rml(T)) \in FixRoc$, $\pi'$ is a pseudo-matching function from $T$ to $U\}$;
20.    if $|R(T)| < r$ then
21.      $FixF := FixF - \{T\}$;
22.      $FixRoc := FixRoc - \{v \mid v \in R(T)\}$;
23.      break; /* end of if */
24.  end; /* end of foreach-loop */
25. end; /* end of foreach-loop */
26. return $FixF$, $FixRoc$, $F_{new}$, $Roc_{new}$;
Fig. 5. Procedure Expand Trees
of a replaced term subtree. In line 1 of Compress Tree, we find the set $F$ of all $r$-occurred patterns for $U$ and the set of their rml-occurrences indexed by trees in $F$ w.r.t. $U$ by using the algorithm Find Freq Trees. In the while-loop from line 4 to line 25, Compress Tree determines all ground term subtrees which are actually replaced by variables in the procedure Make Grammar of line 26. In line 14, we revise the set $Roc$ by removing all vertices $u$ in $\{\pi(rml(G)) \in Roc \mid \pi \text{ is a matching function from } G \text{ to } U\}$ from $Roc$ for each $G \in F_{org}$ such that the ground term subtree $g_u$ of $U$ identified by $G$ and $u$ satisfies the following condition: there exists a vertex $v$ in $vroc(T)$ such that $t_v$ and $g_u$ overlap in $U$, or there exists a vertex $v$ in $vroc(T) - \{w\}$ such that $g_u$ is a ground term subtree of $t_v$, where $w$ is the first rml-occurrence of $T$ in level-order traversal of $U$ and $t_v$ is the ground term subtree of $U$ identified by $T$ and $v$.

The procedure Make Grammar in line 26 constructs an admissible VRG $G$ by applying the following operations to $U$ in increasing order of the size of $T$ for $(T, VList(T)) \in tmprules$. Let $Q = (V_Q, E_Q, H_Q)$ be a copy of $U$. We initialize $R_Q := \emptyset$ and $H_Q := \emptyset$. For $(T, VList(T)) \in tmprules$, $H_Q := H_Q \cup \{h_\pi \mid \pi(rml(T)) \in Roc(T), (\pi, h_\pi) \in VList(T)\}$ and $R_Q := R_Q \cup \{x \to [t_T, \sigma]\}$, where $x$ is a new variable label, $t_T$ is the term subtree of $Q$ corresponding to the ground term subtree identified by $T$ and the first rml-occurrence in level-order traversal, and $\sigma$ is the first list of $VList(T)$. Then, for each element $(\pi, h_\pi) \in VList(T)$ such that $\pi(rml(T)) \in Roc(T)$, we revise the term tree $Q$ by deleting the term subtree of $Q$ corresponding to the ground term subtree identified by $T$ and $\pi(rml(T))$. Finally, the rule $S \to [Q, ()]$ is added to $R_Q$ and the procedure Make Grammar outputs the admissible VRG $G = (S, R_Q)$. Then, the following theorem holds.

Theorem 3.
When a tree $U$ and an occurrence count $r$ are given, the algorithm Compress Tree in Fig. 6 can correctly produce an admissible VRG $G = (S, R)$ over $\langle \Lambda, X \rangle$ with $L(G) = \{U\}$ in $O(|V_U| + A^2 N + A |\widehat{F}_{U,r}| |\Lambda| + BMC)$ time, where $V_U$ is the vertex set of $U$, $A$ is the maximum number of vertices of trees in the set $\widehat{F}_{U,r}$ of all $r$-pseudo-occurred patterns for $U$, $N = \sum_{T \in \widehat{F}_{U,r}} |\widehat{Roc}_U(T)|$, $B$ is the maximum number of vertices of trees in $F_{U,r}$, $M = \sum_{T \in F_{U,r}} |Roc_U(T)|$, and $C$ is the number of variable labels appearing in $G$.

Proof. (Sketch) We can prove the correctness of this theorem from the following facts (1) and (2).
(1) The VRG $G = (S, R)$ constructed by Compress Tree is deterministic. For any variable label $x$ appearing in $G$, $G$ has only one production $p$ in $R$ such that the variable label on the left-hand side of $p$ is $x$. Therefore, we can see that $|L(G)| = 0$ or $|L(G)| = 1$.
(2) $U$ is in $L(G)$, since any two term subtrees which are replaced by variables in Make Grammar do not overlap in $U$.
From (1) and (2), we can see that $G$ is an admissible VRG with $L(G) = \{U\}$. From Theorem 2, line 1 can be executed in $O(|V_U| + A^2 N + A |\widehat{F}_{U,r}| |\Lambda|)$ time. Moreover, lines 4 to 25 can be executed in $O(BMC)$ time. Then, we can show the time complexity of Compress Tree.
Algorithm Compress Tree
Input: A tree $U$ and an integer $r \geq 2$
Output: An admissible VRG $G = (S, R)$ such that $L(G) = \{U\}$ and a compression ratio $\rho$
1. $F, \mathit{Roc}$ := Find Freq Trees($U$);
2. remove all trees consisting of one vertex or two vertices from $F$;
3. $\mathit{tmprules} := \emptyset$, $F_{\mathit{org}} := F$ and for each $T \in F$, $\mathit{tmpsize}(T) := |T|$;
4. while $F \neq \emptyset$ do
5.     let $T$ be a smallest tree in $F$;
6.     $F := F - \{T\}$;
7.     $\mathit{Roc}(T) := \{\pi(\mathit{rml}(T)) \in \mathit{Roc} \mid \pi \text{ is a matching function from } T \text{ to } U\}$;
8.     compute a maximal valid subset $\mathit{vroc}(T)$ of $\mathit{Roc}(T)$;
9.     $m := |\mathit{vroc}(T)|$;
10.    fix on the integer $k > 0$ and $\mathit{VList}(T) := \bigcup_{v \in \mathit{vroc}(T)} \{(\pi, h_v) \mid \pi$ is a matching function from $T$ to $U$ such that $\pi(\mathit{rml}(T)) = v$, $h_v$ is a variable which consists of $k$ vertices of $U$ and by which the term subtree identified by $T$ and $v$ can be replaced$\}$;
11.    fix on hypertree $[T, \sigma]$ such that $|\sigma| = k$, by using $\mathit{VList}(T)$;
12.    $\mathit{Size} := (m - 1)\,\mathit{tmpsize}(T) - (2m + 1)k$;
13.    if $\mathit{Size} \geq 1$ then
14.        revise $\mathit{Roc}$ by removing all useless vertices in $\mathit{Roc}$, using $F_{\mathit{org}}$;
15.        $\mathit{tmprules} := \mathit{tmprules} \cup \{(T, \mathit{VList}(T))\}$;
16.        foreach $G \in F$ do
17.            $R(G) := \{\pi(\mathit{rml}(G)) \in \mathit{Roc} \mid \pi \text{ is a matching function from } G \text{ to } U\}$;
18.            if $|R(G)| \leq 1$ then $F := F - \{G\}$; $F_{\mathit{org}} := F_{\mathit{org}} - \{G\}$;
19.            else
20.                let $w$ be a vertex in $R(G)$;
21.                let $g_w$ be the identified ground term subtree of $U$ by $G$ and $w$;
22.                $n := |\{u \in \mathit{vroc}(T) \mid g_w$ has the identified ground term tree by $T$ and $u$ as a term subtree$\}|$;
23.                $\mathit{tmpsize}(G) := \mathit{tmpsize}(G) - n(\mathit{tmpsize}(T) - 2k)$ /* end of if */
24.    end; /* end of if */
25. end;
26. $G$ := Make Grammar($U$, $\mathit{tmprules}$, $\mathit{Roc}$);
27. return $G$, $\frac{|G|}{|T|} \times 100$;

Fig. 6. Algorithm Compress Tree
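The bookkeeping in lines 12, 13 and 23 of Fig. 6 can be sketched as plain arithmetic. The sketch assumes that the size estimate in line 12 is $(m - 1)\,\mathit{tmpsize}(T) - (2m + 1)k$, that line 13 tests $\mathit{Size} \geq 1$, and that line 23 subtracts $n(\mathit{tmpsize}(T) - 2k)$. The function names are hypothetical and do not come from the paper.

```python
def replacement_gain(m: int, pattern_size: int, k: int) -> int:
    """Estimated size reduction for a pattern T (line 12, as assumed above).

    m            -- number of valid, non-overlapping occurrences, |vroc(T)|
    pattern_size -- current tmpsize(T)
    k            -- number of vertices of each replacing variable
    """
    return (m - 1) * pattern_size - (2 * m + 1) * k


def worth_replacing(m: int, pattern_size: int, k: int) -> bool:
    """Line 13: keep the pattern only if the estimated gain is at least 1."""
    return replacement_gain(m, pattern_size, k) >= 1


def shrunk_size(g_size: int, n: int, pattern_size: int, k: int) -> int:
    """Line 23: a larger pattern G shrinks once n copies of T inside it are replaced."""
    return g_size - n * (pattern_size - 2 * k)


# Three occurrences of a 10-vertex pattern, variables with 2 vertices each.
print(replacement_gain(3, 10, 2))   # (3 - 1) * 10 - (2 * 3 + 1) * 2 = 6
print(worth_replacing(2, 4, 1))     # (2 - 1) * 4 - (2 * 2 + 1) * 1 = -1, so False
```

Under this reading, small patterns with few occurrences are rejected because the cost of the variables and the extra rule outweighs the vertices saved.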
4 Implementation and Experimental Results

In order to evaluate the grammar-based compression algorithm Compress Tree presented in the previous section, we have implemented Compress Tree and two other algorithms, Algorithm 1 and Algorithm 2.

Algorithm 1 is based on a greedy approach of replacing isomorphic term subtrees, which do not overlap in a given tree, by the same variable in decreasing order of the size of the replaced term subtree. That is, Algorithm 1 is obtained from Compress Tree by replacing line 5 of Compress Tree with the instruction "let T be a largest tree in F".

Algorithm 2 is based on an approach of repeatedly replacing, by a variable, the isomorphic and mutually non-overlapping term subtrees of a given tree that give the best compression ratio. Algorithm 2 is obtained from Compress Tree by adding the instruction "else break;" under line 24 and replacing line 5 with the following instruction, referred to as INSTRUCTION: let T be a best tree in F with respect to the compression ratio obtained by replacing the term subtrees that are isomorphic to T and do not overlap by a variable. That is, Algorithm 2 can be regarded as the algorithm SUBDUE in [5], which is based on a Minimum Description Length heuristic.

We have evaluated Compress Tree by comparing it with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio on large artificial trees. The machine used in the experiments is a PC with two 2.4GHz CPUs and 1.00GB of main memory. We implemented a data generator that randomly produces a large artificial tree satisfying the following conditions.
(1) The number of vertices is 20,000, 40,000, 60,000, 80,000 or 100,000.
(2) The degree of each vertex is less than 3.
(3) The number of edge labels is less than 2.
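A generator satisfying conditions (1) to (3) could look like the following minimal sketch. The paper does not describe how its generator samples trees, so the attachment strategy here is an assumption, as are the function name `random_ordered_tree`, the reading of "degree" as the number of children, and the single edge label `"a"`.

```python
import random


def random_ordered_tree(n, max_children=2, labels=("a",), seed=None):
    """Build a random ordered rooted tree with n >= 1 vertices, numbered 0..n-1.

    children[v] is the ordered list of (child, edge_label) pairs of vertex v.
    Each new vertex is attached to a uniformly chosen vertex that still has
    fewer than `max_children` children (an illustrative choice only).
    """
    rng = random.Random(seed)
    children = [[] for _ in range(n)]
    open_vertices = [0]                      # vertices that can still take a child
    for v in range(1, n):
        idx = rng.randrange(len(open_vertices))
        parent = open_vertices[idx]
        children[parent].append((v, rng.choice(labels)))
        if len(children[parent]) == max_children:
            # Parent is full: remove it in O(1) by swapping with the last entry.
            open_vertices[idx] = open_vertices[-1]
            open_vertices.pop()
        open_vertices.append(v)              # the new vertex can take children
    return children


tree = random_ordered_tree(20000, seed=0)
print(len(tree), max(len(c) for c in tree))
```

Fixing `seed` makes a dataset reproducible, which is convenient when averaging measurements over several trees of the same size.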
For $N \in \{20{,}000,\ 40{,}000,\ 60{,}000,\ 80{,}000,\ 100{,}000\}$, let $D(N)$ be the set of 10 trees with $N$ vertices produced by the data generator. We measured the execution times and the compression ratios of Compress Tree, Algorithm 1 and Algorithm 2 on the different datasets with the occurrence count fixed to 2.

Fig. 7 (a) shows the relationship between the number of vertices and the execution time. Each execution time excludes the time for reading the input data and is the average over the trees in a dataset. For example, Fig. 7 (a) indicates that the average execution time of Algorithm 1 for trees in $D(60{,}000)$ is about 300 seconds. According to Fig. 7 (a), Compress Tree is the fastest of the three algorithms. Fig. 7 (b) shows the relationship between the number of vertices and the compression ratio. Each compression ratio in Fig. 7 (b) is the average over the trees in a dataset. For example, Fig. 7 (b) shows that the average compression ratio of Compress Tree for trees in $D(60{,}000)$ is about 60%. From Fig. 7 (a) and (b), Compress Tree and Algorithm 2 perform far better than Algorithm 1.

Fig. 7 (c) shows the relationship between the number of vertices in the input data and the number of variables appearing in the admissible VRG output by each algorithm. Moreover, Fig. 7 (d) shows the relationship between the number of vertices in the input data and the average number of variable labels used in the admissible VRG output by each algorithm. From Fig. 7 (c) and (d), although the number of variables appearing in the admissible VRG produced by each algorithm is almost the same, Algorithm 1 produced an admissible VRG with far more variable labels in each dataset than the other two algorithms. Moreover, Fig. 7 (b), (c) and (d) show that Compress Tree and Algorithm 2 have almost the same performance. This indicates that the order in which trees are chosen at INSTRUCTION in Algorithm 2 almost coincides with the order in which trees are chosen at line 5 of Compress Tree. For these reasons, Compress Tree and Algorithm 2 are suitable for lossless compression of a large tree, whereas Algorithm 1 is not.

[Figure 7, four panels: (a) Execution Time vs Number of Vertices; (b) Compression Ratio vs Number of Vertices; (c) Number of Variables vs Number of Vertices; (d) Number of Variable Labels vs Number of Vertices]
Fig. 7. Experiment 1: comparison of Compress Tree with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio on the different datasets with the occurrence count fixed to 2.

We also tested the execution times and the compression ratios of the three algorithms on the dataset $D(80{,}000)$ while varying the occurrence count from 2 to 5. Fig. 8 shows the performance of the three algorithms for the different occurrence counts. The results in Fig. 8 are similar to those of the previous experiment. From these experimental results, we conclude that Compress Tree is suitable for lossless compression of a large tree and has an advantage in execution time.
[Figure 8, four panels: (a) Execution Time vs Occurrence Count; (b) Compression Ratio vs Occurrence Count; (c) Number of Variables vs Occurrence Count; (d) Number of Variable Labels vs Occurrence Count]
Fig. 8. Experiment 2: comparison of Compress Tree with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio on the dataset $D(80{,}000)$ with different occurrence counts.

5 Conclusions

We have considered the problem of effective compression of an ordered rooted tree without loss of information. We have presented an admissible VRG which generates only a given ordered rooted tree. Then, for an ordered rooted tree T, we have defined the grammar-based compression problem of finding an admissible VRG which generates only T and whose size is minimum. Moreover, we have shown the hardness of this problem by proving that there is no polynomial time algorithm with approximation ratio less than unless P=NP. Next, we have presented an effective algorithm for finding an admissible VRG G which generates only a given ordered rooted tree and which is as small as possible. In order to evaluate the performance of our algorithm, we have implemented it together with two other algorithms. We have then shown the effectiveness of our algorithm by comparing the three with respect to execution time and compression ratio on large artificial trees.

From the viewpoint of computational complexity, we will analyze the approximation ratio of our algorithm, that is, the maximum over all inputs of the ratio between the size of the generated admissible VRG and the size of the smallest possible admissible VRG. We will also construct efficient data mining tools for losslessly compressed data and apply them to real-world data. Finally, we will apply our grammar-based compression scheme to other graph structured data.
This work is partly supported by Grant-in-Aid for Young Scientists (B) No from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and Hiroshima City University Grant for Special Academic Research (General Studies) No.

References

1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann.
2. A.V. Aho, J.E. Hopcroft, and J.D. Ullman. Data Structures and Algorithms. Addison-Wesley.
3. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. Proc. 2nd SIAM Int. Conf. Data Mining (SDM-2002).
4. M. Charikar, E. Lehman, D. Liu, and R. Panigrahy. Approximating the smallest grammar: Kolmogorov complexity in natural models. Proc. 34th ACM STOC '02.
5. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15:32-41.
6. G. Rozenberg (Ed.). Handbook of Graph Grammars and Computing by Graph Transformation, volume 1. World Scientific Publishing.
7. Y. Itokawa, T. Uchida, T. Shoudai, T. Miyahara, and Y. Nakamura. Finding frequent subgraphs from graph structured data with geometric information and its application to lossless. Proc. PAKDD-2003, Springer-Verlag, LNAI 2637.
8. J. C. Kieffer and E.-h. Yang. Grammar based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46.
9. E. Lehman and A. Shelat. Approximation algorithms for grammar-based compression. Proc. SODA '02.
10. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions of tree patterns with internal structured variables from queries. Proc. AI-2002, Springer-Verlag, LNAI 2557.
11. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of frequent tag tree patterns in semistructured web documents. Proc. PAKDD-2002, Springer-Verlag, LNAI 2336.
12. C. Nevill-Manning and I. Witten. Compression and explanation using hierarchical grammars. Computer Journal, 40(2/3).
13. H. Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. DOI Technical Report 214, Department of Informatics, Kyushu University.
14. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time inductive inference of ordered tree patterns with internal structured variables from positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375.
15. T. Uchida, Y. Itokawa, T. Shoudai, T. Miyahara, and Y. Nakamura. A new framework for discovering knowledge from two-dimensional structured data using layout formal graph system. Proc. ALT-2000, Springer-Verlag, LNAI 1968.
16. K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Trans. Knowledge and Data Engineering, 12, 2000.
CS 6783 (Applied Algorithms) Lecture 5 Antonina Kolokolova January 19, 2012 1 Minimum Spanning Trees An undirected graph G is a pair (V, E); V is a set (of vertices or nodes); E is a set of (undirected)
More informationAlgorithms Dr. Haim Levkowitz
91.503 Algorithms Dr. Haim Levkowitz Fall 2007 Lecture 4 Tuesday, 25 Sep 2007 Design Patterns for Optimization Problems Greedy Algorithms 1 Greedy Algorithms 2 What is Greedy Algorithm? Similar to dynamic
More informationCSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li
CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Main Goal of Algorithm Design Design fast
More informationAn Improved Algorithm for Matching Large Graphs
An Improved Algorithm for Matching Large Graphs L. P. Cordella, P. Foggia, C. Sansone, M. Vento Dipartimento di Informatica e Sistemistica Università degli Studi di Napoli Federico II Via Claudio, 2 8025
More informationGraph Matching: Fast Candidate Elimination Using Machine Learning Techniques
Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques M. Lazarescu 1,2, H. Bunke 1, and S. Venkatesh 2 1 Computer Science Department, University of Bern, Switzerland 2 School of
More informationarxiv: v3 [cs.ds] 18 Apr 2011
A tight bound on the worst-case number of comparisons for Floyd s heap construction algorithm Ioannis K. Paparrizos School of Computer and Communication Sciences Ècole Polytechnique Fèdèrale de Lausanne
More information1 Format. 2 Topics Covered. 2.1 Minimal Spanning Trees. 2.2 Union Find. 2.3 Greedy. CS 124 Quiz 2 Review 3/25/18
CS 124 Quiz 2 Review 3/25/18 1 Format You will have 83 minutes to complete the exam. The exam may have true/false questions, multiple choice, example/counterexample problems, run-this-algorithm problems,
More informationSpanners of Complete k-partite Geometric Graphs
Spanners of Complete k-partite Geometric Graphs Prosenjit Bose Paz Carmi Mathieu Couture Anil Maheshwari Pat Morin Michiel Smid May 30, 008 Abstract We address the following problem: Given a complete k-partite
More informationGRAPH THEORETICAL ALGORITHMS FOR CONTROL FLOW GRAPH COMPARISON
Proceedings of the IASTED International Conference Software Engineering (SE 21) February 17-19, 21 Innsbruck, Austria GRAPH THEORETICAL ALGORITHMS FOR CONTROL FLOW GRAPH COMPARISON Sergej Alekseev Fachhochschule
More informationHeap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the
Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence
More informationV Advanced Data Structures
V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,
More informationBranch and Bound Algorithm for Vertex Bisection Minimization Problem
Branch and Bound Algorithm for Vertex Bisection Minimization Problem Pallavi Jain, Gur Saran and Kamal Srivastava Abstract Vertex Bisection Minimization problem (VBMP) consists of partitioning the vertex
More informationGraph Algorithms Using Depth First Search
Graph Algorithms Using Depth First Search Analysis of Algorithms Week 8, Lecture 1 Prepared by John Reif, Ph.D. Distinguished Professor of Computer Science Duke University Graph Algorithms Using Depth
More informationTheorem 2.9: nearest addition algorithm
There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used
More informationCHAPTER 3 LITERATURE REVIEW
20 CHAPTER 3 LITERATURE REVIEW This chapter presents query processing with XML documents, indexing techniques and current algorithms for generating labels. Here, each labeling algorithm and its limitations
More informationAlgorithm and Complexity of Disjointed Connected Dominating Set Problem on Trees
Algorithm and Complexity of Disjointed Connected Dominating Set Problem on Trees Wei Wang joint with Zishen Yang, Xianliang Liu School of Mathematics and Statistics, Xi an Jiaotong University Dec 20, 2016
More informationA CSP Search Algorithm with Reduced Branching Factor
A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il
More informationThe self-minor conjecture for infinite trees
The self-minor conjecture for infinite trees Julian Pott Abstract We prove Seymour s self-minor conjecture for infinite trees. 1. Introduction P. D. Seymour conjectured that every infinite graph is a proper
More informationKeywords: Data Mining, TAR, XML.
Volume 6, Issue 6, June 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com TAR: Algorithm
More informationAnalysis of Algorithms
Analysis of Algorithms Concept Exam Code: 16 All questions are weighted equally. Assume worst case behavior and sufficiently large input sizes unless otherwise specified. Strong induction Consider this
More information