An Effective Grammar-Based Compression Algorithm for Tree Structured Data


Kazunori Yamagata¹, Tomoyuki Uchida¹, Takayoshi Shoudai², and Yasuaki Nakamura¹

¹ Faculty of Information Sciences, Hiroshima City University, Hiroshima, Japan
{k yamagata@toc.cs,uchida@cs,nakamura@cs}.hiroshima-cu.ac.jp
² Department of Informatics, Kyushu University, Kasuga, Japan
shoudai@i.kyushu-u.ac.jp

Abstract. Much semistructured data, such as HTML/XML files, is represented by rooted trees $t$ such that all children of each internal vertex of $t$ are ordered and all edges of $t$ have labels. Such data is called tree structured data. Analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of the input data without loss of information, we can speed up this process. In this paper, we consider the problem of effectively compressing an ordered rooted tree, which represents given tree structured data, without loss of information. Firstly, in order to define this problem in a grammar-based compression scheme, we present a variable replacement grammar (VRG for short) over ordered rooted trees. The grammar-based compression problem for an ordered rooted tree $T$ is defined as the problem of finding a VRG which generates only $T$ and whose size is minimum. For this problem, we show that there is no polynomial time algorithm with approximation ratio less than a certain constant unless P=NP. Secondly, based on this theoretical result, we present an effective compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report some experimental results.

1 Introduction

Background: Due to the rapid growth of information technologies, the amount of semistructured data such as HTML/XML files has been increasing rapidly, and individual files have become larger.
Semistructured data having tree structures are called tree structured data and are represented by rooted trees $t$ such that all children of each internal vertex of $t$ are ordered and all edges of $t$ have labels. In general, analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of the input data without loss of information, including structural features, we can speed up this process. In this paper, we consider a problem

of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information. A given ordered rooted tree $T$ must be compressed without losing any of the structural features of $T$. Hence, we cannot directly apply lossless compression algorithms for strings to tree structured data. The aim of this paper is to give a grammar-based compression scheme for an ordered rooted tree and to present an effective algorithm which, within this scheme, compresses a given ordered rooted tree without loss of information.

Data Model: As our data model for tree structured data, we use a variant of the Object Exchange Model (OEM for short) presented by Abiteboul et al. [1], as follows. An object $o$ consists of an identifier, a link and a value, which are denoted by $\&o$, $link(\&o)$ and $val(\&o)$, respectively. The identifier $\&o$ uniquely identifies the object $o$. The link $link(\&o)$ is a list $(\&o_1, \&o_2, \ldots, \&o_p)$ of the identifiers of all subobjects $o_i$ $(i = 1, 2, \ldots, p)$, where $p > 0$. The value $val(\&o)$ is either a string such as a tag in HTML/XML files, or a text such as the content of a PCDATA field in HTML/XML files. Tree structured data is represented by an ordered rooted tree with edge labels as follows. Each vertex represents an object identifier $\&o$. An edge $(\&o, \&o_i)$ represents a reference $\&o_i$ in $link(\&o)$ and has the value $val(\&o_i)$ as its label. For any object identifier $\&o$ with $link(\&o) = (\&o_1, \&o_2, \ldots, \&o_p)$, the children $\&o_1, \&o_2, \ldots, \&o_p$ of the vertex $\&o$ are ordered in this order. For example, in Fig. 1, the ordered tree $T$ represents the structure of Sample html.

Main Results: In this paper, we consider a problem of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information.
Firstly, in order to define this data compression problem for an ordered rooted tree in a grammar-based compression scheme, we present a term tree, which consists of tree structures and structured variables, and a Variable Replacement Grammar (VRG for short) over ordered rooted trees, which is based on the Hyperedge Replacement Grammar (HRG for short, see [6]). A graph transformation of a VRG is defined as a mechanism that replaces a variable by an ordered rooted tree. In Fig. 1, as examples of a term tree and a graph transformation of a VRG, we give the term tree $t$ and the ordered rooted tree $g$ such that $T$ is obtained from $t$ and $g$ by replacing all variables labeled with $x$ by $g$. The grammar-based compression problem for an ordered rooted tree $T$ is defined as the problem of finding a VRG which generates only $T$ and whose size is minimum. We can regard this problem as an optimization problem that minimizes the size of a VRG generating only $T$. Secondly, for the grammar-based compression problem for an ordered rooted tree, we show that there is no polynomial time algorithm with approximation ratio less than a certain constant unless P=NP. This result shows that approximating the size of the minimum VRG to within a small constant factor is NP-hard. Next, based on this theoretical result, we present an effective grammar-based compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. This algorithm is based on a greedy approach

of replacing isomorphic subtrees $t$ that do not overlap in a given ordered rooted tree by the same variable, in order of increasing size of $t$. Next, by improving the algorithm given by Asai et al. [3], we present an efficient algorithm for finding all candidate subtrees $s$ of a given ordered rooted tree $T$ such that $s$ can be replaced by a variable. This algorithm is a pre-processing step of our grammar-based compression algorithm. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report experimental results comparing it with two other algorithms. One is based on a greedy approach that processes candidate subtrees in order of decreasing size. The other is based on a Minimum Description Length (MDL for short) heuristic such as SUBDUE [5]. The experimental results show the effectiveness of our algorithm.

Fig. 1. An HTML document Sample html, the ordered rooted tree $T$ which is a data model of Sample html, a term tree $t$ and an ordered rooted tree $g$. Sample html is a table of four rows (tr), each with two cells (td) whose texts Text $i$-A and Text $i$-B ($i = 1, \ldots, 4$) are wrapped in font elements. A variable is represented by a box with lines to its elements. The label of a box is the label of the variable. The number on the left side of a vertex denotes its position in the ordering on its siblings.

Related Works: For strings, several grammar-based compression algorithms have been proposed [4, 8, 9, 12, 13]. Such algorithms are based on the idea of representing a string by

a context-free grammar (see [8, 12]). In particular, based on a grammar-based compression scheme, Charikar et al. [4] presented an $O(\log(n/g^*))$ approximation algorithm and Sakamoto [13] proposed a linear-time approximation algorithm which guarantees an $O(\log^2 n)$ approximation ratio, where $n$ is the length of an input string and $g^*$ is the size of the smallest grammar. On the other hand, for semistructured data, little work has been done on grammar-based compression. Hence, we need to define a new grammar-based compression scheme for an ordered rooted tree, which we base on HRG (see [6]). For semistructured data which can be represented by a general graph, Cook [5] presented a practical data compression algorithm based on an MDL heuristic, which is not a grammar-based compression algorithm. For semistructured data with geometric information, we presented an effective compression algorithm in [7] by introducing the notions of a layout term graph in [15] and a substitution in logic programming. The compression scheme presented in [7] is regarded as a preliminary version of the grammar-based compression scheme presented in this paper. In the fields of data mining and knowledge discovery, there are increasing demands for effective methods for extracting information from large semistructured data. Several effective algorithms for finding frequent substructures among large tree structured data have been proposed [3, 16]. In [11], we presented an effective algorithm for extracting common structural features among ordered rooted trees. Moreover, in [10, 14], we discussed the learnability of tree patterns having tree structure, variables and ordered children from the viewpoint of machine learning.

Organization: This paper is organized as follows. In Section 2, we introduce an ordered rooted term tree and define an admissible VRG, which allows us to compress an ordered tree without loss of information.
In Section 3, we define the problem of finding an admissible VRG whose size is minimum among the admissible VRGs generating only a given ordered rooted tree. Then, we present an effective greedy algorithm for this problem. In Section 4, in order to evaluate the performance of our algorithm, we report experimental results of applying our algorithm to large artificial trees.

2 Preliminaries

2.1 Ordered Term Tree

Let $T = (V_T, E_T)$ be an ordered rooted tree with a vertex set $V_T$ and an edge set $E_T$. Let $l \geq 1$ be an integer. A list $h = (u_0, u_1, \ldots, u_l)$ of vertices in $V_T$ is called a variable (or a hyperedge) of $T$ if $u_1, \ldots, u_l$ is a sequence of consecutive children of $u_0$, i.e., $u_0$ is the parent of $u_1, \ldots, u_l$ and $u_{j+1}$ is the next sibling of $u_j$ for any $j$ with $1 \leq j < l$. Two variables $h = (u_0, u_1, \ldots, u_l)$ and $h' = (u'_0, u'_1, \ldots, u'_{l'})$ are said to be disjoint if $\{u_1, \ldots, u_l\} \cap \{u'_1, \ldots, u'_{l'}\} = \emptyset$.

Definition 1. Let $T = (V_T, E_T)$ be an ordered rooted tree and $H_T$ a set of pairwise disjoint variables of $T$. An ordered term tree obtained from $T$ and $H_T$ is a

triplet $t = (V_t, E_t, H_t)$ where $V_t = V_T$, $E_t = E_T \setminus \bigcup_{h = (u_0, u_1, \ldots, u_l) \in H_T} \{\{u_0, u_i\} \in E_T \mid 1 \leq i \leq l\}$ and $H_t = H_T$. For two vertices $u, u' \in V_t$, we say that $u$ is the parent of $u'$ in $t$ if $u$ is the parent of $u'$ in $T$. Similarly, we say that $u'$ is a child of $u$ in $t$ if $u'$ is a child of $u$ in $T$. In particular, for a vertex $u \in V_t$ with no child, we call $u$ a leaf of $t$. We define the order of the children of each vertex $u$ in $t$ as the order of the children of $u$ in $T$. We often omit the description of the ordered tree $T$ and the variable set $H_T$ because we can recover them from the triplet $t = (V_t, E_t, H_t)$.

Example 1. The ordered term tree $t$ in Fig. 1 is obtained from the tree $T = (V_T, E_T)$ and the set $H_T$, where $V_T = \{v_0, v_1, \ldots, v_{17}\}$, $E_T = \{\{v_0, v_1\}, \{v_1, v_2\}, \{v_2, v_3\}, \{v_1, v_4\}, \{v_4, v_5\}, \{v_1, v_6\}, \{v_6, v_7\}, \{v_1, v_8\}, \{v_8, v_9\}, \{v_1, v_{10}\}, \{v_{10}, v_{11}\}, \{v_1, v_{12}\}, \{v_{12}, v_{13}\}, \{v_1, v_{14}\}, \{v_{14}, v_{15}\}, \{v_1, v_{16}\}, \{v_{16}, v_{17}\}\}$ and $H_T = \{(v_1, v_2, v_4), (v_1, v_6, v_8), (v_1, v_{10}, v_{12}), (v_1, v_{14}, v_{16})\}$.

For any ordered term tree $t$, a vertex $u$ of $t$, and two children $u'$ and $u''$ of $u$, we write $u' <^t_u u''$ if $u'$ is smaller than $u''$ in the order of the children of $u$. For a set or a list $D$, the number of elements in $D$ is denoted by $|D|$. We assume that every edge and every variable of an ordered term tree is labeled with a word from a specified language. Let $\Lambda$ and $X$ be finite alphabets such that $\Lambda \cap X = \emptyset$. An element of $\Lambda$ is called an edge label. An element $x$ of $X$ is called a variable label and has a rank, denoted by $rank(x)$, which is a nonnegative integer. A variable $h$ has a label $x$ such that $rank(x) = |h|$. A term tree $t$ is called a term tree over $\langle \Lambda, X \rangle$ if all edges and all variables of $t$ are labeled by elements of $\Lambda$ and $X$, respectively. If $\Lambda$ and $X$ need not be specified, we often omit them.

Note. In this paper, we treat only ordered rooted term trees, and so we simply call an ordered rooted term tree a term tree.
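As an illustration of Definition 1, the following Python sketch builds the triplet of a term tree from the data of Example 1. It assumes that the edge set of the term tree is obtained by taking the edges between a variable's first vertex and its other vertices out of the edge set of T; the function name `make_term_tree` and the string vertex names are hypothetical, and the sketch does not check that the children in a variable are consecutive siblings, since that needs the child ordering.

```python
def make_term_tree(vertices, edges, variables):
    """Return (V_t, E_t, H_t) for a tree (V_T, E_T) and a variable set H_T."""
    # Variables must be pairwise disjoint on their child vertices u_1, ..., u_l.
    seen = set()
    for h in variables:
        children = set(h[1:])
        if seen & children:
            raise ValueError("variables are not pairwise disjoint")
        seen |= children
    # Assumption: edges {u_0, u_i} covered by a variable leave the edge set.
    covered = {frozenset((h[0], u)) for h in variables for u in h[1:]}
    if not covered <= edges:
        raise ValueError("a variable uses a vertex pair that is not an edge")
    return set(vertices), edges - covered, list(variables)


# Data of Example 1.
V_T = [f"v{i}" for i in range(18)]
E_T = {frozenset(e) for e in [
    ("v0", "v1"), ("v1", "v2"), ("v2", "v3"), ("v1", "v4"), ("v4", "v5"),
    ("v1", "v6"), ("v6", "v7"), ("v1", "v8"), ("v8", "v9"), ("v1", "v10"),
    ("v10", "v11"), ("v1", "v12"), ("v12", "v13"), ("v1", "v14"),
    ("v14", "v15"), ("v1", "v16"), ("v16", "v17"),
]}
H_T = [("v1", "v2", "v4"), ("v1", "v6", "v8"),
       ("v1", "v10", "v12"), ("v1", "v14", "v16")]

V_t, E_t, H_t = make_term_tree(V_T, E_T, H_T)
print(len(V_t), len(E_t), len(H_t))  # prints "18 9 4"
```

Under this reading, the eight edges from v1 to the vertices covered by the four variables disappear, and nine ordinary edges remain.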
A term tree with no variable is called a ground term tree (or simply a tree) and is regarded as a tree with ordered children. For a term tree $t$ and its vertices $v_1$ and $v_i$, a path from $v_1$ to $v_i$ is a sequence $v_1, v_2, \ldots, v_i$ of distinct vertices of $t$ such that for any $j$ with $1 \leq j < i$, $v_j$ is the parent of $v_{j+1}$. Let $t = (V_t, E_t, H_t)$ be a term tree. For subsets $V_f \subseteq V_t$, $E_f \subseteq E_t$ and $H_f \subseteq H_t$, if $f = (V_f, E_f, H_f)$ is a term tree then $f$ is said to be a term subtree of $t$. For two term subtrees $f = (V_f, E_f, H_f)$ and $g = (V_g, E_g, H_g)$ of $t$, we say that $f$ and $g$ overlap in $t$ if $(E_f \cap E_g) \cup (H_f \cap H_g) \neq \emptyset$, $V_f \not\subseteq V_g$ and $V_g \not\subseteq V_f$.

Let $f$ and $g$ be term trees over $\langle \Lambda, X \rangle$, each of which has at least two vertices. Let $h = (v_0, v_1, \ldots, v_l)$ be a variable in $f$ and $\sigma = (u_0, u_1, \ldots, u_l)$ a list of $l + 1$ distinct vertices in $g$ such that $u_0$ is the root of $g$ and $u_1, \ldots, u_l$ are leaves of $g$. The pair $[g, \sigma]$ of $g$ and $\sigma$ is called an $(l + 1)$-hypertree over $\langle \Lambda, X \rangle$. If $l$, $\Lambda$ and $X$ need not be specified, we often omit them. The form $h \leftarrow [g, \sigma]$ is called a variable replacement for $h$. A new term tree $f' = f\{h \leftarrow [g, \sigma]\}$ is obtained by applying the variable replacement $h \leftarrow [g, \sigma]$ to $f$ in the following way. For the variable $h = (v_0, v_1, \ldots, v_l)$, we attach $g$ to $f$ by removing the variable $h$ from $H_f$ and by identifying the vertices $v_0, v_1, \ldots, v_l$ with the vertices $u_0, u_1, \ldots, u_l$ of $g$ in this order. We define a new ordering $<^{f'}_v$ on every vertex $v$ in $f'$ in the following natural way. Suppose that $v$ has more than one child and let

$v'$ and $v''$ be two children of $v$ in $f'$. We note that $v_i = u_i$ for any $0 \leq i \leq l$.
(1) If $v, v', v'' \in V_g$ and $v' <^g_v v''$, then $v' <^{f'}_v v''$.
(2) If $v, v', v'' \in V_f$ and $v' <^f_v v''$, then $v' <^{f'}_v v''$.
(3) If $v = v_0\ (= u_0)$, $v' \in V_f \setminus \{v_1, \ldots, v_l\}$, $v'' \in V_g$, and $v' <^f_v v_1$, then $v' <^{f'}_v v''$.
(4) If $v = v_0\ (= u_0)$, $v' \in V_f \setminus \{v_1, \ldots, v_l\}$, $v'' \in V_g$, and $v_l <^f_v v'$, then $v'' <^{f'}_v v'$.
In Fig. 2, we give an example of the new ordering on vertices in a term tree.

Fig. 2. The new ordering on vertices in the term tree $f' = f\{h \leftarrow [g, (u_0, u_1, u_2, u_3)]\}$ where $h = (v_0, v_1, v_2, v_3)$.

2.2 Admissible Variable Replacement Grammar

Next, we formally define an admissible Variable Replacement Grammar, which generates only one tree, based on HRG (see [6]). Let $\Lambda$ and $X$ be finite alphabets with $\Lambda \cap X = \emptyset$.

Definition 2. A Variable Replacement Grammar (VRG for short) $G = (S, R)$ over $\langle \Lambda, X \rangle$ is defined as follows:
(1) $S$ is a variable label in $X$ with $rank(S) = 0$ and is called the start variable label.
(2) $R$ is a finite set of productions of the form $x \rightarrow [g, \sigma]$, where $x$ is a variable label in $X$ with $rank(x) = l$ and $[g, \sigma]$ is an $l$-hypertree over $\langle \Lambda, X \rangle$.

Let $G = (S, R)$ be a VRG. For a variable label $x \in X$, an $l$-hypertree $[g, \sigma]$ and an integer $i \geq 1$, we define the relation $x \Rightarrow^i_G [g, \sigma]$ inductively as follows.
(1) We write $x \Rightarrow^1_G [g, \sigma]$ if there is a production $x \rightarrow [g, \sigma]$ in $R$.
(2) For $i \geq 2$, we write $x \Rightarrow^i_G [g, \sigma]$ if there are $j, m \geq 1$, an $l$-hypertree $[f, \sigma]$ and a variable $h$ of rank $k$ with label $y$ in $f$ such that $j + m = i$, $x \Rightarrow^j_G [f, \sigma]$, $y \Rightarrow^m_G [d, \sigma']$, and $g = f\{h \leftarrow [d, \sigma']\}$.

We write $x \Rightarrow^+_G [g, \sigma]$ if $x \Rightarrow^i_G [g, \sigma]$ for some $i \geq 1$. The graph language generated by a VRG $G = (S, R)$ is the set $L(G) = \{T \mid T \text{ is a tree and } S \Rightarrow^+_G [T, ()]\}$. Let $G = (S, R)$ be a VRG and $T$ a tree. Then, $G$ is said to be admissible if $L(G) = \{T\}$. For a given tree $T$, an admissible VRG $G$ generating only $T$ allows us to compress $T$ without loss of information if the size of $G$ is less than the size of $T$.

Example 2. Let $G = (S, R)$ be the VRG where $R = \{S \rightarrow [t_1, ()],\ x \rightarrow [t_2, (u_1, u_2)],\ y \rightarrow [t_3, (v_1, v_2)]\}$, and $t_1$, $t_2$ and $t_3$ are the term trees in Fig. 3. Then, we can see that $G$ is admissible and $L(G) = \{T\}$, where $T$ is the tree in Fig. 3.

Fig. 3. A tree $T$ and term trees $t_1$, $t_2$, $t_3$.

3 Grammar-Based Compression for an Ordered Rooted Tree

In this section, we consider the problem of finding an admissible VRG which generates only a given tree and whose size is minimum. Firstly, we formally define this problem and show its hardness. Secondly, for a given tree $T$, we present an algorithm Find Freq Trees for finding all candidate ground term subtrees of $T$ which can be replaced by variables. Finally, by using

Find Freq Trees, we give an effective algorithm for finding an admissible VRG $G$ which generates only a given tree and whose size is as small as possible.

3.1 Hardness of Grammar-Based Compression Problem for an Ordered Rooted Tree

For a term tree $t = (V_t, E_t, H_t)$, we define the size of $t$ as $|t| = |V_t| + 2|E_t| + \sum_{h \in H_t} |h|$. For a VRG $G = (S, R)$, we define the size of $G$ as $|G| = \sum_{x \rightarrow [g, \sigma] \in R} (|g| + |\sigma|)$. For a tree $T$ and an admissible VRG $G$ such that $L(G) = \{T\}$, we define the compression ratio $\rho$ of $T$ w.r.t. $G$ as $\rho = \frac{|G|}{|T|} \times 100$.

Example 3. The size of the tree $T$ in Fig. 3 is $|T| = 22 + 2 \times 21 = 64$. The sizes of the term trees $t_1$, $t_2$ and $t_3$ in Fig. 3 are $|t_1| = 4 + 2 \times 1 + (2 + 2) = 10$, $|t_2| = 3 + (2 + 2) = 7$ and $|t_3| = 6 + 2 \times 5 = 16$, respectively. Then, the size of the admissible VRG $G = (S, R)$ is $|G| = (10 + 0) + (7 + 2) + (16 + 2) = 37$, where $R = \{S \rightarrow [t_1, ()],\ x \rightarrow [t_2, (u_1, u_2)],\ y \rightarrow [t_3, (v_1, v_2)]\}$. Therefore, the compression ratio $\rho$ of $T$ w.r.t. $G$ is $\rho = \frac{37}{64} \times 100 \approx 57.8$.

The grammar-based compression problem for a tree is defined as the following problem Find Min AVRG.

Find Min AVRG
Instance: A tree $T$.
Problem: Find an admissible VRG $G$ such that $L(G) = \{T\}$ and for any admissible VRG $G'$ with $L(G') = \{T\}$, $|G| \leq |G'|$.

This problem is regarded as an optimization problem that minimizes the size of an admissible VRG generating only a given tree. We can prove the following theorem by a reduction from a restricted form of VERTEX COVER, in a similar way to the proof of Theorem 3.1 in [9].

Theorem 1. There is no polynomial time algorithm for solving Find Min AVRG with approximation ratio less than 8593 unless P=NP.

This theorem shows the hardness of solving Find Min AVRG. That is, this result indicates that approximating the size of the minimum VRG to within a small constant factor is NP-hard.
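The size measures of Section 3.1 can be checked with a short Python sketch. It assumes the size of a term tree is its vertex count, plus twice its edge count, plus the lengths of its variables, and that the size of a grammar adds the size of each right-hand side to the length of its vertex list. Fig. 3 is not reproduced here, so the vertex counts, edge counts and vertex names below are hypothetical shapes chosen only to reproduce the sizes 10, 7, 16 and 64 quoted in Example 3.

```python
def term_tree_size(num_vertices, num_edges, variables):
    """|t| = |V_t| + 2|E_t| + sum of |h| over the variables h of t."""
    return num_vertices + 2 * num_edges + sum(len(h) for h in variables)


def grammar_size(productions):
    """|G| = sum over productions x -> [g, sigma] of (|g| + |sigma|)."""
    return sum(term_tree_size(*g) + len(sigma) for g, sigma in productions)


def compression_ratio(productions, tree_size):
    """rho = |G| / |T| * 100."""
    return 100 * grammar_size(productions) / tree_size


# Hypothetical shapes: (vertex count, edge count, list of variables).
t1 = (4, 1, [("a", "b"), ("b", "c")])      # size 10
t2 = (3, 0, [("u1", "m"), ("m", "u2")])    # size 7
t3 = (6, 5, [])                            # size 16
G = [(t1, ()), (t2, ("u1", "u2")), (t3, ("v1", "v2"))]
T_size = term_tree_size(22, 21, [])        # a ground tree with 22 vertices

print(grammar_size(G), T_size, compression_ratio(G, T_size))
# prints "37 64 57.8125"
```

A ratio below 100 means the grammar is smaller than the tree it generates.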
Based on this theoretical result, in the following sections we present an effective compression algorithm for finding an admissible VRG which generates only a given tree and whose size is as small as possible.

3.2 Algorithm of Finding All Frequent Ground Term Subtrees

Let $T = (V, E, \emptyset)$ be a tree and $t = (V_t, E_t, \emptyset)$ a ground term subtree of $T$. From the definitions of a variable and a variable replacement, if there exists a path $p$

in $T$ from a vertex $v \in V_t$ to a vertex $u \in V \setminus V_t$ such that $v$ is not the root or a leaf of $t$ and $p$ does not contain any leaf of $t$, or if for two children $w_1$ and $w_2$ of the root $r$ of $t$ there is a vertex $w \in V \setminus V_t$ such that $w_1 <^T_r w$ and $w <^T_r w_2$, then we cannot replace $t$ by a variable even if $t$ is frequent in $T$. Under this constraint, by improving the algorithm given by Asai et al. [3], which finds all frequent ground term subtrees of a given tree $T$, we present an algorithm Find Freq Trees for finding all candidate ground term subtrees in $T$ which can be replaced by variables. The grammar-based compression algorithm for a tree, which is given later, uses Find Freq Trees as a pre-processing step.

Let $T$ be a tree and $v$ a vertex in $T$. The number of vertices in the path from the root of $T$ to $v$ is denoted by $depth_T(v)$. We assume that $next_T(v)$ returns the nearest right sibling of $v$ in $T$, if any. We define $pa^0_T(v) = v$ and $pa^i_T(v)$ as the parent of $pa^{i-1}_T(v)$ for $i \geq 1$. A tree $T$ is said to be of normal form if $T$ satisfies the following conditions.
(1) The set of vertices of $T$ is $V_T = \{1, \ldots, k\}$.
(2) All elements in $V_T$ are numbered by preorder traversal [2] of $T$.
We can easily see that if $T$ is a tree with $k$ vertices and is of normal form, then the root of $T$ is 1 and the rightmost leaf of $T$ is $k$. For a tree $T$ of normal form having $k$ vertices, we denote the rightmost leaf of $T$ by $rml(T)$, that is, $rml(T) = k$, and denote the vertex $k - 1$ by $prevrml(T)$. The path from the root of $T$ to $rml(T)$ is called the rightmost branch. For an integer $k \geq 1$, a $k$-pattern is a tree $T$ of normal form whose number of vertices is $k$. For every $k \geq 1$, we denote the set of all $k$-patterns by $\mathcal{T}_k$ and the set of all patterns by $\mathcal{T} = \bigcup_k \mathcal{T}_k$. Let $T = (V_T, E_T, \emptyset)$ and $U = (V_U, E_U, \emptyset)$ be trees.
Then, a matching function from $T$ to $U$ is any function $\pi : V_T \rightarrow V_U$ that satisfies the following conditions (1)-(4) for any vertex $v \in V_T$ which is not the root or a leaf of $T$ and any $v_1, v_2 \in V_T$.
(1) $\pi$ is a one-to-one mapping. That is, if $v_1 \neq v_2$ then $\pi(v_1) \neq \pi(v_2)$.
(2) $\pi$ preserves the parent-child relation. That is, $\{v_1, v_2\} \in E_T$ if and only if $\{\pi(v_1), \pi(v_2)\} \in E_U$. Moreover, $\{v_1, v_2\}$ in $E_T$ and $\{\pi(v_1), \pi(v_2)\}$ in $E_U$ have the same edge label.
(3) $\pi$ preserves the sibling relation. That is, $next_T(v_1) = v_2$ if and only if $next_U(\pi(v_1)) = \pi(v_2)$.
(4) All children of $\pi(v)$ in $U$ are included in the set $\{\pi(u) \mid u \in V_T\} \subseteq V_U$.
If $|V_T| = |V_U|$, a matching function from $T$ to $U$ can be regarded as an isomorphism between $T$ and $U$. Two trees $T$ and $U$ are said to be isomorphic if $|V_T| = |V_U|$ and there exists a matching function from $T$ to $U$. Next, a pseudo-matching function from $T$ to $U$ is any function $\pi' : V_T \rightarrow V_U$ that satisfies the above conditions (1)-(3) of a matching function and the following condition (4').
(4') For any internal vertex $v \in V_T$ which does not appear in the rightmost branch of $T$, all children of $\pi'(v)$ in $U$ are included in the set $\{\pi'(u) \mid u \in V_T\} \subseteq V_U$.
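Conditions (1)-(4) can be tested mechanically. The Python sketch below is one possible checker, not part of the algorithm in this paper. It assumes a tree is stored as a dictionary from each parent to the ordered list of its (edge label, child) pairs, and it reads the parent-child condition on directed (parent, child) pairs; the function names are hypothetical.

```python
def vertex_set(tree):
    """All vertices of a tree given as {parent: [(edge_label, child), ...]}."""
    vs = set(tree)
    for kids in tree.values():
        vs.update(child for _, child in kids)
    return vs


def edge_labels(tree):
    """Map each (parent, child) pair to its edge label."""
    return {(p, c): lab for p, kids in tree.items() for lab, c in kids}


def next_sibling(tree):
    """Map each vertex to its nearest right sibling, when it has one."""
    nxt = {}
    for kids in tree.values():
        for (_, left), (_, right) in zip(kids, kids[1:]):
            nxt[left] = right
    return nxt


def is_matching(T, root_T, U, pi):
    V_T = vertex_set(T)
    # (1) pi is defined on all of V_T and is one-to-one.
    if set(pi) != V_T or len(set(pi.values())) != len(V_T):
        return False
    E_T, E_U = edge_labels(T), edge_labels(U)
    next_T, next_U = next_sibling(T), next_sibling(U)
    for v1 in V_T:
        for v2 in V_T:
            if v1 == v2:
                continue
            e_t, e_u = (v1, v2), (pi[v1], pi[v2])
            # (2) edges are preserved in both directions, with equal labels.
            if (e_t in E_T) != (e_u in E_U):
                return False
            if e_t in E_T and E_T[e_t] != E_U[e_u]:
                return False
            # (3) the next-sibling relation is preserved in both directions.
            if (next_T.get(v1) == v2) != (next_U.get(pi[v1]) == pi[v2]):
                return False
    # (4) images of internal non-root vertices have no children outside the image.
    image = set(pi.values())
    for v in V_T:
        if v == root_T or not T.get(v):
            continue
        if any(child not in image for _, child in U.get(pi[v], [])):
            return False
    return True


U = {1: [("a", 2), ("a", 3)], 2: [("b", 4)], 3: [("b", 5)]}
T = {1: [("b", 2)]}                       # a 2-pattern with one b-labeled edge
print(is_matching(T, 1, U, {1: 2, 2: 4}))  # True
print(is_matching(T, 1, U, {1: 1, 2: 2}))  # False: edge labels differ
```

In this small example the pattern matches at the two b-labeled edges of U, so its rightmost leaf is mapped to vertex 4 or vertex 5.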

Let $U$ be a tree. Given a $k$-pattern $T \in \mathcal{T}_k$ and a matching function $\pi$ from $T$ to $U$, we define the rightmost occurrence (the rml-occurrence for short) of $T$ w.r.t. $\pi$ to be $\pi(k)$, and the rightmost occurrence list of $T$ w.r.t. $U$ to be $Roc_U(T) = \{\pi(k) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. Similarly, given a $k$-pattern $T \in \mathcal{T}_k$ and a pseudo-matching function $\pi'$ from $T$ to $U$, we define the pseudo rightmost occurrence (the pseudo-rml-occurrence for short) of $T$ w.r.t. $\pi'$ to be $\pi'(k)$, and the candidate rightmost occurrence list of $T$ w.r.t. $U$ to be $\widetilde{Roc}_U(T) = \{\pi'(k) \mid \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$.

Let $r \geq 2$ be an integer, which is called an occurrence count. $T$ is said to be $r$-occurred for $U$ if $|Roc_U(T)| \geq r$, and $T$ is said to be $r$-pseudo-occurred for $U$ if $|\widetilde{Roc}_U(T)| \geq r$. Then, we define the set of all $r$-occurred $k$-patterns in $\mathcal{T}_k$ for $U$ as $F_{U,k,r} = \{T \mid T \in \mathcal{T}_k, |Roc_U(T)| \geq r\}$, and $F_{U,r} = \bigcup_k F_{U,k,r} \subseteq \mathcal{T}$. We define the set of all $r$-pseudo-occurred $k$-patterns in $\mathcal{T}_k$ for $U$ as $\widetilde{F}_{U,k,r} = \{T \mid T \in \mathcal{T}_k, |\widetilde{Roc}_U(T)| \geq r\}$, and $\widetilde{F}_{U,r} = \bigcup_k \widetilde{F}_{U,k,r} \subseteq \mathcal{T}$. Let $Roc_{U,k,r} = \bigcup_{T \in F_{U,k,r}} \{\pi(k) \mid \pi \text{ is a matching function from } T \text{ to } U\}$ and let $\widetilde{Roc}_{U,k,r} = \bigcup_{T \in \widetilde{F}_{U,k,r}} \{\pi'(k) \mid \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$.

Let $U$ be a tree, $T$ a tree of normal form, and $Roc_U(T) = \{\pi(rml(T)) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. From the definitions of a matching function and a pseudo-matching function, for a vertex $v$ in $Roc_U(T)$, we can identify the unique matching function $\pi$ from $T$ to $U$ such that $\pi(rml(T)) = v$ and the unique pseudo-matching function $\pi'$ from $T$ to $U$ such that $\pi'(rml(T)) = v$. For a tree $T$ of normal form and a vertex $v$ of $U$, a ground term subtree $G = (V_G, E_G, \emptyset)$ of $U$ is said to be identified by $T$ and $v$ if there exists an isomorphism $\pi$ between $T$ and $G$ such that $\pi(rml(T)) = v$. Let $T \in \mathcal{T}_{k-1}$, let $p$ be any integer with $0 \leq p < depth_T(rml(T))$, and let $l \in \Lambda$ be any edge label.
Then, the $(p, l)$-expansion of $T$ is the tree $S$ obtained from $T$ by attaching a new vertex $k$ to the vertex $v$ such that the attached vertex $k$ is the rightmost child of $v$ and the edge between $k$ and $v$ has the label $l$, where $v = pa^p_T(rml(T))$, that is, $v$ is the $p$-th parent of the rightmost leaf of $T$.

In Fig. 4, given a tree $U$ and an occurrence count $r$ as inputs, we present an efficient algorithm Find Freq Trees which outputs the set $F_{U,r}$ of all $r$-occurred patterns for $U$ and the set of their rml-occurrences indexed by trees in $F_{U,r}$ w.r.t. $U$. In Fig. 5, we present a procedure Expand Trees used in Find Freq Trees. Given the set $R(T)$ calculated in line 4 of the procedure Expand Trees and the integer $p$ as inputs, for every edge label $l \in \Lambda$ and every $(p, l)$-expansion $S$ of $T$, the procedure Scanning Sibling in line 5 of the procedure Expand Trees returns the candidate rightmost occurrence list NewRoc (that is, the set of all pseudo-rml-occurrences) of $S$ w.r.t. the tree $U$ as follows. Initially, Scanning Sibling creates an empty set NewRoc. Next, for each $v \in R(T)$ and each $l \in \Lambda$, it adds the pair $(l, u)$ to NewRoc if there exists a $(p, l)$-expansion of $T$ in $U$ such that if $p = 0$ then $u$ is the leftmost child of $v$ in $U$, and otherwise $u$ is the vertex $next_U(pa^{p-1}_U(v))$. Then, the following theorem holds.

Theorem 2. When a tree $U$ and an occurrence count $r \geq 2$ are given as inputs, the algorithm Find Freq Trees correctly constructs the set $F_{U,r}$ of all $r$-occurred patterns for $U$ and the set $\bigcup_{T \in F_{U,r}} Roc_U(T)$ of rml-occurrences indexed by trees in $F_{U,r}$ w.r.t. $U$ in $O(|V_U| + A^2 N + A |\widetilde{F}_{U,r}| |\Lambda|)$ time, where $V_U$ is the set of vertices in $U$, $A$ is the maximum number of vertices of trees in the set $\widetilde{F}_{U,r}$ of all $r$-pseudo-occurred patterns for $U$, and $N = \sum_{T \in \widetilde{F}_{U,r}} |\widetilde{Roc}_U(T)|$.

Algorithm Find Freq Trees
Input: A tree $U$ and an occurrence count $r \geq 2$
Output: The set $F_{U,r}$ of all $r$-occurred patterns for $U$ and their rml-occurrence lists $Roc = \bigcup_{T \in F_{U,r}} Roc_U(T)$
1. Compute $\widetilde{F}_{U,1,r}$, $\widetilde{Roc}_{U,1,r}$, $\widetilde{F}_{U,2,r}$, and $\widetilde{Roc}_{U,2,r}$ from $U$ in level-order traversal;
2. $k := 3$;
3. while $\widetilde{F}_{U,k-1,r} \neq \emptyset$ do
4.     $F_{U,k-1,r}, Roc_{U,k-1,r}, \widetilde{F}_{U,k,r}, \widetilde{Roc}_{U,k,r}$ := Expand Trees($\widetilde{F}_{U,k-1,r}$, $\widetilde{Roc}_{U,k-1,r}$, $r$);
5.     $k := k + 1$;
6. end;
7. $F_{U,r} := F_{U,1,r} \cup \cdots \cup F_{U,k-2,r}$; /* $F_{U,1,r} = \widetilde{F}_{U,1,r}$ */
8. $Roc := Roc_{U,1,r} \cup \cdots \cup Roc_{U,k-2,r}$; /* $Roc_{U,1,r} = \widetilde{Roc}_{U,1,r}$ */
9. return $F_{U,r}$, $Roc$;
Fig. 4. Algorithm Find Freq Trees

3.3 Grammar-Based Compression Algorithm for an Ordered Rooted Tree

Let $U$ be a tree, $T$ a tree of normal form, and $Roc_U(T) = \{\pi(rml(T)) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. Then, a subset $R_T \subseteq Roc_U(T)$ is a valid subset of $Roc_U(T)$ if for any two vertices $u, v \in R_T$, $t_u$ and $t_v$ do not overlap in $U$, where $t_u$ is the ground term subtree identified by $T$ and $u$, and $t_v$ is the ground term subtree identified by $T$ and $v$. Moreover, a valid subset $R_T$ of $Roc_U(T)$ is maximal if for any subset $R$ of $Roc_U(T)$ such that $R_T \subsetneq R$, $R$ is not a valid subset of $Roc_U(T)$. We can compute a maximal valid subset $R_T$ of $Roc_U(T)$ by level-order traversal of $U$ as follows. Let $R_T = \{v_1\}$ and $Roc_U(T) = \{v_1, \ldots, v_n\}$ such that for $1 \leq i < j \leq n$, $v_i$ is found before $v_j$ by level-order traversal of $U$. For each $i = 2, \ldots, n$, we add $v_i$ to $R_T$ if there exists no vertex $u$ in $R_T$ such that $t_i$ and $t$ overlap in $U$, where $t$ is the ground term subtree of $U$ identified by $T$ and $u$, and $t_i$ is the ground term subtree of $U$ identified by $T$ and $v_i$.
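The level-order scan for a maximal valid subset can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each occurrence is given directly as a pair (vertex set, edge set) instead of being indexed by its rml-occurrence, the input list is assumed to be already sorted by level-order discovery, and two occurrences are taken to overlap when they share an edge while neither vertex set contains the other. The helper names are hypothetical.

```python
def overlap(f, g):
    """Assumed reading: a shared edge, and neither vertex set contains the other."""
    (v_f, e_f), (v_g, e_g) = f, g
    return bool(e_f & e_g) and not v_f <= v_g and not v_g <= v_f


def maximal_valid_subset(occurrences):
    """Greedy scan; `occurrences` must already be in level-order of discovery."""
    chosen = []
    for occ in occurrences:
        if not any(overlap(occ, kept) for kept in chosen):
            chosen.append(occ)
    return chosen


def subtree(*edges):
    """Helper: build (vertex set, edge set) from (parent, child) pairs."""
    return (frozenset(v for e in edges for v in e), frozenset(edges))


# Path 1-2-3-4-5; the pattern is a path with two edges, so it occurs three times.
occs = [subtree((1, 2), (2, 3)), subtree((2, 3), (3, 4)), subtree((3, 4), (4, 5))]
kept = maximal_valid_subset(occs)
print([sorted(v) for v, _ in kept])  # prints "[[1, 2, 3], [3, 4, 5]]"
```

The middle occurrence shares the edge (2, 3) with the first one and is skipped, while the first and third occurrences share only the vertex 3 and are both kept.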
We remark that the above maximal valid subset $R_T$ of $Roc_U(T)$ is not always the best choice for compressing a given tree. In Fig. 6, we present a greedy algorithm Compress Tree which, given a tree $U$ and an occurrence count $r$ as inputs, finds an admissible VRG which generates only $U$ and is as small as possible. The algorithm Compress Tree is based on a greedy approach of replacing isomorphic term subtrees that do not overlap in a given tree by the same variable, in order of increasing the size

Procedure Expand Trees
Input: A set $F_{old}$ of patterns, a set $Roc_{old}$ of pseudo-rml-occurrences indexed by trees in $F_{old}$, and an occurrence count $r \geq 2$.
Output: A set $FixF$ of $r$-occurred patterns for $U$, a set $FixRoc$ of their rml-occurrences indexed by trees in $FixF$, a set $F_{new}$ of the rightmost expansions of trees in $F_{old}$, and a set $Roc_{new}$ of pseudo-rml-occurrences indexed by trees in $F_{new}$ w.r.t. $U$.
1. $F_{new} := \emptyset$; $FixRoc := Roc_{old}$; $FixF := F_{old}$; $Roc_{new} := \emptyset$;
2. foreach tree $T \in F_{old}$ do
3.     foreach $0 \leq p < depth(rml(T))$ do
4.         $R(T) := \{\pi'(rml(T)) \mid \pi'(rml(T)) \in FixRoc,\ \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$;
5.         $NewRoc$ := Scanning Sibling($R(T)$, $p$);
6.         foreach $l \in \Lambda$ do
7.             compute the $(p, l)$-expansion $S$ of $T$;
8.             $NewRoc(l) := \{v \mid (l, v) \in NewRoc\}$;
9.             if $|NewRoc(l)| \geq r$ then
10.                $F_{new} := F_{new} \cup \{S\}$;
11.                $Roc_{new} := Roc_{new} \cup \{(S, v) \mid v \in NewRoc(l)\}$; /* end of if */
12.            if $p \neq 0$ and $p \neq depth(rml(T)) - 1$ then
13.                while $NewRoc(l) \neq \emptyset$ do
14.                    choose a vertex $v$ in $NewRoc(l)$;
15.                    $FixRoc := FixRoc \setminus \{\pi'(prevrml(S)) \mid \pi' \text{ is a pseudo-matching function from } S \text{ to } U \text{ and } \pi'(rml(S)) = v\}$;
16.                    $NewRoc(l) := NewRoc(l) \setminus \{v\}$;
17.                end; /* end of if */
18.        end;
19.        $R(T) := \{\pi'(rml(T)) \mid \pi'(rml(T)) \in FixRoc,\ \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$;
20.        if $|R(T)| < r$ then
21.            $FixF := FixF \setminus \{T\}$;
22.            $FixRoc := FixRoc \setminus \{v \mid v \in R(T)\}$;
23.            break; /* end of if */
24.    end; /* end of foreach-loop */
25. end; /* end of foreach-loop */
26. return $FixF$, $FixRoc$, $F_{new}$, $Roc_{new}$;
Fig. 5. Procedure Expand Trees

of a replaced term subtree. In line 1 of Compress Tree, we find the set $F$ of all $r$-occurred patterns for $U$ and the set of their rml-occurrences indexed by trees in $F$ w.r.t. $U$ by using the algorithm Find Freq Trees. In the while-loop from line 4 to line 25, Compress Tree determines all ground term subtrees which are actually replaced by variables in the procedure Make Grammar of line 26. In line 14, we revise the set $Roc$ as follows. For each $G \in F_{org}$, we remove from $Roc$ all vertices $u$ in $\{\pi(rml(G)) \in Roc \mid \pi \text{ is a matching function from } G \text{ to } U\}$ such that the ground term subtree $g_u$ of $U$ identified by $G$ and $u$ satisfies the following condition: there exists a vertex $v$ in $vroc(T)$ such that $t_v$ and $g_u$ overlap in $U$, or there exists a vertex $v$ in $vroc(T) \setminus \{w\}$ such that $g_u$ is a ground term subtree of $t_v$, where $w$ is the first rml-occurrence of $T$ in level-order traversal of $U$ and $t_v$ is the ground term subtree of $U$ identified by $T$ and $v$.

The procedure Make Grammar in line 26 constructs an admissible VRG $G$ by applying the following operations to $U$ in increasing order of the size of $T$ for $(T, VList(T)) \in tmprules$. Let $Q = (V_Q, E_Q, H_Q)$ be a copy of $U$. We initialize $R_Q := \emptyset$ and $H_Q := \emptyset$. For each $(T, VList(T)) \in tmprules$, we set $H_Q := H_Q \cup \{h_\pi \mid \pi(rml(T)) \in Roc(T),\ (\pi, h_\pi) \in VList(T)\}$ and $R_Q := R_Q \cup \{x \rightarrow [t_T, \sigma]\}$, where $x$ is a new variable label, $t_T$ is the term subtree of $Q$ corresponding to the ground term subtree identified by $T$ and the first rml-occurrence in level-order traversal, and $\sigma$ is the first list of $VList(T)$. Then, for each element $(\pi, h_\pi) \in VList(T)$ such that $\pi(rml(T)) \in Roc(T)$, we revise the term tree $Q$ by deleting the term subtree of $Q$ corresponding to the ground term subtree identified by $T$ and $\pi(rml(T))$. Finally, the rule $S \rightarrow [Q, ()]$ is added to $R_Q$ and the procedure Make Grammar outputs the admissible VRG $G = (S, R_Q)$. Then, the following theorem holds.

Theorem 3.
When a tree $U$ and an occurrence count $r$ are given, the algorithm Compress Tree in Fig. 6 correctly produces an admissible VRG $G = (S, R)$ over $\langle \Lambda, X \rangle$ with $L(G) = \{U\}$ in $O(|V_U| + A^2 N + A |\widetilde{F}_{U,r}| |\Lambda| + BMC)$ time, where $V_U$ is the vertex set of $U$, $A$ is the maximum number of vertices of trees in the set $\widetilde{F}_{U,r}$ of all $r$-pseudo-occurred patterns for $U$, $N = \sum_{T \in \widetilde{F}_{U,r}} |\widetilde{Roc}_U(T)|$, $B$ is the maximum number of vertices of trees in $F_{U,r}$, $M = \sum_{T \in F_{U,r}} |Roc_U(T)|$, and $C$ is the number of variable labels appearing in $G$.

Proof. (Sketch) We can prove the correctness of this theorem from the following facts (1) and (2).
(1) The VRG $G = (S, R)$ constructed by Compress Tree is deterministic: for any variable label $x$ appearing in $G$, $G$ has only one production $p$ in $R$ such that the variable label on the left-hand side of $p$ is $x$. Therefore, we can see that $|L(G)| = 0$ or $|L(G)| = 1$.
(2) $U$ is in $L(G)$, since no two term subtrees which are replaced by variables in Make Grammar overlap in $U$.
From (1) and (2), we can see that $G$ is an admissible VRG with $L(G) = \{U\}$. From Theorem 2, line 1 can be executed in $O(|V_U| + A^2 N + A |\widetilde{F}_{U,r}| |\Lambda|)$ time. Moreover, lines 4 to 25 can be executed in $O(BMC)$ time. This gives the time complexity of Compress Tree.
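Fact (1) of the proof sketch can be illustrated with a small Python check. It is a sketch under simplifying assumptions and not part of Compress Tree: each production is reduced to its left-hand variable label and the list of variable labels occurring in its right-hand side, ranks and vertex lists are ignored, and the function name is hypothetical. The check succeeds when every label has exactly one production and no label depends on itself, in which case the derivation from the start label ends in a single tree.

```python
def derives_unique_tree(productions, start="S"):
    """productions: list of (lhs label, variable labels occurring in the rhs)."""
    rules = {}
    for lhs, rhs_labels in productions:
        if lhs in rules:
            return False      # two productions for one label: not deterministic
        rules[lhs] = rhs_labels
    state = {}                # label -> "open" while being expanded, then "done"

    def visit(label):
        if label not in rules:
            return False      # a variable that can never be replaced
        if state.get(label) == "done":
            return True
        if state.get(label) == "open":
            return False      # cyclic dependency: no finite derivation
        state[label] = "open"
        ok = all(visit(y) for y in rules[label])
        state[label] = "done"
        return ok

    return visit(start)


# Shape of the grammar in Example 2; the variables inside t1 and t2 are assumed.
example = [("S", ["x", "x"]), ("x", ["y", "y"]), ("y", [])]
print(derives_unique_tree(example))                           # True
print(derives_unique_tree([("S", ["x"]), ("x", ["x"])]))      # False
```

The second grammar is deterministic but cyclic, which corresponds to the case where the generated language is empty.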

Algorithm Compress Tree
Input: A tree $U$ and an integer $r \geq 2$
Output: An admissible VRG $G = (S, R)$ such that $L(G) = \{U\}$ and a compression ratio $\rho$
1. $F, Roc :=$ Find Freq Trees($U$);
2. remove all trees consisting of one vertex or two vertices from $F$;
3. $tmprules := \emptyset$, $F_{org} := F$ and for each $T \in F$, $tmpSize(T) := |T|$;
4. while $F \neq \emptyset$ do
5.   let $T$ be a smallest tree in $F$;
6.   $F := F - \{T\}$;
7.   $Roc(T) := \{\pi(rml(T)) \in Roc \mid \pi \text{ is a matching function from } T \text{ to } U\}$;
8.   compute a maximal valid subset $vroc(T)$ of $Roc(T)$;
9.   $m := |vroc(T)|$;
10.   fix on the integer $k > 0$ and let $VList(T)$ be the list of pairs $(\pi, h_v)$, one for each $v \in vroc(T)$, where $\pi$ is a matching function from $T$ to $U$ such that $\pi(rml(T)) = v$ and $h_v$ is a variable which consists of $k$ vertices of $U$ and by which the term subtree identified by $T$ and $v$ can be replaced;
11.   fix on hypertree $[T, \sigma]$ such that $|\sigma| = k$, by using $VList(T)$;
12.   $Size := (m - 1) \cdot tmpSize(T) - (2m + 1)k$;
13.   if $Size \geq 1$ then
14.     Revise $Roc$ by removing all useless vertices in $Roc$, using $F_{org}$;
15.     $tmprules := tmprules \cup \{(T, VList(T))\}$;
16.     foreach $G \in F$ do
17.       $R(G) := \{\pi(rml(G)) \in Roc \mid \pi \text{ is a matching function from } G \text{ to } U\}$;
18.       if $|R(G)| \leq 1$ then $F := F - \{G\}$; $F_{org} := F_{org} - \{G\}$;
19.       else
20.         let $w$ be a vertex in $R(G)$;
21.         let $g_w$ be the identified ground term subtree of $U$ by $G$ and $w$;
22.         $n := |\{u \in vroc(T) \mid g_w \text{ has the identified ground term tree by } T \text{ and } u \text{ as a term subtree}\}|$;
23.         $tmpSize(G) := tmpSize(G) - n(tmpSize(T) - 2k)$ /* end of if */
24.     end; /* end of if */
25. end;
26. $G :=$ Make Grammar($U$, $tmprules$, $Roc$);
27. return $G$, $(|G| / |U|) \times 100$;

Fig. 6. Algorithm Compress Tree
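The acceptance test in lines 12 and 13 of Fig. 6 is plain arithmetic, and a small sketch may make it easier to check by hand. The function below assumes that line 12 reads Size = (m - 1) * tmpSize(T) - (2m + 1) * k, which is a reconstruction of the flattened figure, and that a pattern is accepted when Size is at least 1. The names `replacement_gain` and `worth_replacing` are hypothetical.

```python
def replacement_gain(m: int, tmp_size: int, k: int) -> int:
    """Estimated size reduction from replacing m occurrences of a pattern.

    m        -- number of non-overlapping occurrences, |vroc(T)|
    tmp_size -- current size estimate of the pattern, tmpSize(T)
    k        -- number of vertices of each replacing variable
    Assumed formula (reconstructed): (m - 1) * tmp_size - (2 * m + 1) * k.
    """
    return (m - 1) * tmp_size - (2 * m + 1) * k


def worth_replacing(m: int, tmp_size: int, k: int) -> bool:
    """Mirror of the test in line 13: accept only if the gain is at least 1."""
    return replacement_gain(m, tmp_size, k) >= 1
```

For example, under this reading a pattern of size 10 with three occurrences and k = 1 gives a gain of 2 * 10 - 7 = 13 and is accepted, while a pattern of size 4 with two occurrences gives 4 - 5 = -1 and is rejected.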

4 Implementation and Experimental Results

In order to evaluate our grammar-based compression algorithm Compress Tree presented in the previous section, we have implemented Compress Tree and two other algorithms, Algorithm 1 and Algorithm 2. Algorithm 1 is based on a greedy approach that replaces isomorphic term subtrees which do not overlap in a given tree by the same variable, in decreasing order of the size of the replaced term subtree. That is, Algorithm 1 is obtained from Compress Tree by replacing line 5 of Compress Tree with the instruction "let T be a largest tree in F". Algorithm 2 is based on an approach that repeatedly replaces, by a variable, the non-overlapping isomorphic term subtrees of a given tree that give the best compression ratio. Algorithm 2 is obtained from Compress Tree by adding the instruction "else break;" under line 24 of Compress Tree and replacing line 5 of Compress Tree with the following instruction, called INSTRUCTION: "let T be a best tree among F with respect to the compression ratio obtained by replacing the term subtrees which are isomorphic to T and do not overlap by a variable". That is, Algorithm 2 can be regarded as the algorithm SUBDUE in [5], which is based on a Minimum Description Length heuristic. We have evaluated Compress Tree by comparing it with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio when applied to large artificial trees. The machine used in the experiments is a PC with two 2.4GHz CPUs and 1.00GB of main memory. We implemented a data generator that randomly produces a large artificial tree satisfying the following conditions.
(1) The number of vertices is 20,000, 40,000, 60,000, 80,000 or 100,000.
(2) The degree of each vertex is less than 3.
(3) The number of edge labels is less than 2.
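The three conditions above can be sketched as a small generator. This is not the authors' generator, whose details are not given here. It assumes that "degree" means the number of children of a vertex, so each vertex gets at most `max_children = 2` children, and that edge labels are drawn from `num_labels = 1` label. The function name `random_ordered_tree` and its parameters are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def random_ordered_tree(n: int, max_children: int = 2, num_labels: int = 1,
                        seed: Optional[int] = None
                        ) -> Dict[int, List[Tuple[int, int]]]:
    """Return a random ordered rooted tree with n vertices, rooted at 0.

    The result maps each vertex to its ordered list of (child, edge_label).
    Assumptions: 'degree' is read as number of children, and labels are
    integers in range(num_labels).
    """
    rng = random.Random(seed)
    children: Dict[int, List[Tuple[int, int]]] = {0: []}
    open_vertices = [0]  # vertices that can still receive a child
    for v in range(1, n):
        i = rng.randrange(len(open_vertices))
        parent = open_vertices[i]
        children[parent].append((v, rng.randrange(num_labels)))
        children[v] = []
        open_vertices.append(v)
        if len(children[parent]) == max_children:
            # Swap-remove the saturated parent; the last entry is v.
            open_vertices[i] = open_vertices[-1]
            open_vertices.pop()
    return children
```

A call such as `random_ordered_tree(20000, seed=0)` would give one tree of the smallest size used in the experiments; fixing the seed makes a dataset reproducible.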
For $N \in \{20{,}000, 40{,}000, 60{,}000, 80{,}000, 100{,}000\}$, let $D(N)$ be the set of 10 trees with $N$ vertices produced by the data generator. We measured the execution times and the compression ratios of Compress Tree, Algorithm 1 and Algorithm 2 on the different datasets with the occurrence count fixed at 2. Fig. 7 (a) shows the relationship between the number of vertices and the execution times. Each execution time excludes the time for reading the input data and is the average execution time over the trees in a dataset. For example, Fig. 7 (a) indicates that the average execution time of Algorithm 1 for trees in $D(60{,}000)$ is about 300 seconds. From Fig. 7 (a), our algorithm Compress Tree is the fastest of the three algorithms. Fig. 7 (b) shows the relationship between the number of vertices and the compression ratios. Each compression ratio in Fig. 7 (b) is the average compression ratio over the trees in a dataset. For example, from Fig. 7 (b), we can see that the average compression ratio of Compress Tree for trees in $D(60{,}000)$ is about 60%. From Fig. 7 (a) and (b), Compress Tree and Algorithm 2 perform far better than Algorithm 1. Fig. 7 (c) shows the relationship between the number

of vertices in the input data and the number of variables appearing in the admissible VRG output by each algorithm. Moreover, Fig. 7 (d) shows the relationship between the number of vertices in the input data and the average number of variable labels used in the admissible VRG output by each algorithm. From Fig. 7 (c) and (d), although the number of variables appearing in the admissible VRG produced by each algorithm is almost the same, Algorithm 1 produced an admissible VRG with far more variable labels on each dataset than the other two algorithms. Moreover, in Fig. 7 (b), (c) and (d), we can see that Compress Tree and Algorithm 2 have almost the same performance. This indicates that the order in which trees are chosen at INSTRUCTION in Algorithm 2 almost coincides with the order in which trees are chosen at line 5 of Compress Tree. For these reasons, we can see that Compress Tree and Algorithm 2 are suitable for lossless compression of a large tree, but Algorithm 1 is not.

Fig. 7. Experiment 1: comparison of Compress Tree with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio on the different datasets with the occurrence count fixed at 2. Panels: (a) Execution Time vs Number of Vertices; (b) Compression Ratio vs Number of Vertices; (c) Number of Variables vs Number of Vertices; (d) Number of Variable Labels vs Number of Vertices.

We also tested the execution times and the compression ratios of the three algorithms on the dataset $D(80{,}000)$ by varying the occurrence count from 2 to 5. Fig. 8 shows the performance of the three algorithms for the different occurrence counts. The results in Fig. 8 are similar to those of the previous experiment. From these experimental results, we can see that the algorithm Compress Tree is suitable for lossless compression of a large tree and has an advantage in execution time.

Fig. 8. Experiment 2: comparison of Compress Tree with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio on the dataset $D(80{,}000)$ with different occurrence counts. Panels: (a) Execution Time vs Occurrence Count; (b) Compression Ratio vs Occurrence Count; (c) Number of Variables vs Occurrence Count; (d) Number of Variable Labels vs Occurrence Count.

5 Conclusions

We have considered the problem of effective compression of an ordered rooted tree without loss of information. We have presented an admissible VRG which generates only a given ordered rooted tree. Then, for an ordered rooted tree T, we have defined the grammar-based compression problem of finding an admissible VRG which generates only T and whose size is minimum. Moreover, we have shown the hardness of solving this problem by proving that there is no polynomial time algorithm with approximation ratio less than unless P=NP. Next, we have presented an effective algorithm for finding an admissible VRG G which generates only a given ordered rooted tree and which is as small as possible. In order to evaluate the performance of our algorithm, we have implemented our algorithm and two other algorithms. Then, we have shown the effectiveness of our algorithm by comparing the three algorithms with respect to execution time and compression ratio when applied to large artificial trees. From the viewpoint of computational complexity, we will analyze the approximation ratio of our algorithm, that is, the maximum over all inputs of the ratio between the size of the generated admissible VRG and the size of the smallest possible admissible VRG. Moreover, we will construct efficient data mining tools for losslessly compressed data and apply them to real-world data. We will also apply our grammar-based compression scheme to other graph-structured data.

This work is partly supported by Grant-in-Aid for Young Scientists (B) No from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and Hiroshima City University Grant for Special Academic Research (General Studies) No

References

1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann.
2. A.V. Aho, J.E. Hopcroft, and J.D. Ullman. Data Structures and Algorithms. Addison-Wesley.
3. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. Proc. 2nd SIAM Int. Conf. Data Mining (SDM-2002).
4. M. Charikar, E. Lehman, D. Liu, and R. Panigrahy. Approximating the smallest grammar: Kolmogorov Complexity in natural models. Proc. 34th ACM STOC'02.
5. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15:32-41.
6. G. Rozenberg (Ed.). Handbook of Graph Grammars and Computing by Graph Transformation, volume 1. World Scientific Publishing.
7. Y. Itokawa, T. Uchida, T. Shoudai, T. Miyahara, and Y. Nakamura. Finding frequent subgraphs from graph structured data with geometric information and its application to lossless. Proc. PAKDD-2003, Springer-Verlag, LNAI 2637.
8. J. C. Kieffer and E-h. Yang. Grammar based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46.
9. E. Lehman and A. Shelat. Approximation algorithms for grammar-based compression. Proc. SODA'02.
10. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions of tree patterns with internal structured variables from queries. Proc. AI-2002, Springer-Verlag, LNAI 2557.
11. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of frequent tag tree patterns in semistructured web documents. Proc. PAKDD-2002, Springer-Verlag, LNAI 2336.
12. C. Nevill-Manning and I. Witten. Compression and explanation using hierarchical grammars. Computer Journal, 40(2/3).
13. H. Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. DOI Technical Report 214, Department of Informatics, Kyushu University.
14. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time inductive inference of ordered tree patterns with internal structured variables from positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375.
15. T. Uchida, Y. Itokawa, T. Shoudai, T. Miyahara, and Y. Nakamura. A new framework for discovering knowledge from two-dimensional structured data using layout formal graph system. Proc. ALT-2000, Springer-Verlag, LNAI 1968.
16. K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Trans. Knowledge and Data Engineering, 12, 2000.
