Learning n-ary tree-pattern queries for web information extraction


Benjamin Habegger
Grappa / INRIA Mostrare Project
Université Charles de Gaulle Lille 3
BP 60149, Villeneuve d'Ascq Cedex, FRANCE
habegger@grappa.univ-lille3.fr
July 28, 2006

Abstract

The problem of extracting information from the Web consists in building patterns that extract specific information from the documents of a given Web source. Up to now, most existing techniques use string-based representations of documents as well as string-based patterns. Using tree representations naturally overcomes the limitations of string-based approaches. While some tree-based approaches exist, they are either limited to learning unary queries or build n-ary queries by composing unary queries. In this paper we study tree patterns as an n-ary extraction language and propose an algorithm capable of learning such queries. The learning algorithm we propose computes the most information-conservative tree pattern that generalizes two input trees. Tree patterns have the double advantage of working explicitly with the tree structure of HTML/XML documents and of expressing n-ary queries. As our experiments will show, tree patterns can express many extraction tasks. They are also closely related to the now standard XPath language and are therefore easily understood by human experts.

1 Introduction

The main motivation for web information extraction is to provide structured access to the data made available on the web. Once extracted and restructured, the data can be used by any other data-based application such as mediators, online agents, etc. The need for information extraction techniques that automate pattern generation comes from the fact that the web is in constant change and that hand-crafting patterns is known to be a tedious task.
There has been much research on how to efficiently build programs (called wrappers) capable of automatically extracting data from web sites (e.g. [8, 12, 3, 14, 18, 9]). Most existing techniques [11, 13, 8] use a string representation for both the documents and the learned patterns. In the context of n-ary extraction, these methods have only proven

to be efficient in extracting data which has a tabular format and where the values to be extracted are relatively close to each other (i.e. few tokens are found between them). Much data on the web does not, however, follow such a tabular format: the values of some attributes may be shared among the extracted tuples, unwanted data may appear between the values to be extracted, etc. Using string-based representations makes it difficult to learn patterns in such cases. Figure 1 gives an example where some data (the city) is shared among the tuples of the relation to be extracted (see figure 2). Even though this is an artificial, simple case, most string-based methods cannot cope with such data.

The natural tree structure of HTML documents can be used in cases where string-based techniques fail. Figure 4 gives the tree representation of the document of figure 1. The tree pattern given in figure 6 correctly extracts the desired relation. To our knowledge, only a few works using machine learning for information extraction explicitly use a tree-based document representation [3, 12]. In both cases, a tree transducer is learned from a given set of example documents, and it only extracts single nodes. [5] use an interesting attribute/value encoding of tree nodes. The encoding, however, only uses the local context of the nodes to be extracted.

Tree patterns directly and naturally express n-ary queries. Furthermore, they are closely related to widely used semi-structured query languages such as XPath [23]. Tree patterns are also more easily readable by human users than many other formalisms such as tree transducers. For these reasons, it is interesting to develop techniques capable of learning tree patterns. There are multiple ways to define tree patterns, depending both on the relations they are allowed to express and on how they are to be matched by a tree.
These different settings have important consequences on the efficiency of both matching and learning, and on the expressivity of the learned patterns. In this paper we propose a weight-based algorithm capable of coping with different settings and of generating a tree pattern of maximal weight in each of them. We particularly study ordered and unordered injective tree patterns in which child and descendant relationships can be expressed. Our algorithm is capable of generating n-ary extraction patterns in such cases.

The contributions of this paper are the following:
- We introduce a flexible notion of weighted patterns based on the relational representation of trees and patterns.
- We propose a generalization algorithm which builds n-ary information extraction patterns and which can easily be adapted to different settings (different relations and embedding definitions).
- We study learning tree patterns with our algorithm in different settings and evaluate it on different datasets in these settings.

This paper is organized as follows. Section 2 presents the general problem of information extraction, followed by section 3 which presents related work in information extraction and in tree pattern mining. Section 4 presents the background necessary to define tree pattern generalization. The principles of our generalization are given in section 5, followed by the details of our algorithm in section 6. The results of

its evaluation are presented in section 7. Finally we conclude and present future work in section 8.

Figure 1: The People/Location document

2 Web Information Extraction

The objective of web information extraction is to extract, from a set of similarly formatted documents, a target relation defined by the user. A program capable of performing such an extraction is called a wrapper. The People/Location document of figure 1 is an example document containing a relation between people and the cities they live in. A wrapper for such a source (in this case consisting of only one document) would be given the HTML code of this document (figure 3) and would return the set of tuples of figure 2. This task is not trivial since the HTML code only describes how the data is to be presented in a browser and not how the data items are related to each other.

Definition 1 Given a set of documents $D$ in a similar format and a target relation $R$, a wrapper is a program $W$ such that $W(D) = R$.

Building a wrapper manually requires knowing either a programming or a query language and, most importantly, requires expertise to determine which regularities in a given document set can be exploited to build an efficient wrapper. In order to enable non-expert users to build wrappers, machine learning is an interesting solution.

City          Name
Nantes 7      Fabrice 9
Nantes 7      Nordine 11
Nantes 7      Alex 13
Grenoble 16   Gwen 18
Grenoble 16   Bertrand 20
Lille 23      Claire 25

Figure 2: The (city, name) relation to be extracted

In such a case, the user simply selects (usually by means of a graphical user interface) examples of the items he wishes to extract. In the case of n-ary extraction, he also specifies to which tuple and which attribute each value belongs. The set of selections associated with a given document is called a labeling. This labeling can be applied to the document it refers to in order to extract the selected tuples. When the document is represented as a tree, we consider that the labeled elements are nodes of the tree. In the following, for a given document $d$ and a given labeling $L$, we denote by $L(d)$ the set of node tuples the labeling produces. It should be noted that a labeling does not necessarily specify all the elements to be extracted.

Definition 2 (Wrapper learning problem) Given a set $D$ of labeled documents, build the most specific pattern $p$ such that $L(D) \subseteq Q_p(D)$, where $L(D)$ denotes the set of labeled tuples in $D$ and $Q_p(D)$ denotes the result of applying pattern $p$ to $D$.

Definition 2 is deliberately general. First, it only takes into account positive examples (i.e. the data to be extracted). This is a recurrent problem in information extraction since there is no consensual definition of what a negative example is. Second, specificity can be defined in different ways depending on the pattern language. A more precise definition of the type of patterns we are learning is given further on.

One of the major problems in learning wrappers is that we are faced with multiple contradictory requirements. (1) The number of interactions between the user and the system during the learning process must be reduced to a minimum (ideally fewer than 10 labelings). (2) The time taken to build an extraction pattern must be low. (3) The constructed pattern must be reasonably concise.
(4) The pattern language must be able to cope with multiple types of regularities. These different aspects should therefore be taken into account when evaluating a wrapper learning method.

3 Related Work

There has been much research on wrapper learning in the past years and many systems [14, 11, 8, 12, 7, 15, 5] have been proposed. Few systems allow for n-ary extraction in a direct manner. To our knowledge, only IERel [8] and WIEN [13] are capable of directly learning n-ary extraction patterns. However, both are string-based and rely on the strong hypothesis that the data to be extracted is tabular (i.e. the tuples to be

extracted do not overlap). WIEN is the first information extraction system developed; it is known to have very limited expressivity and requires many examples. IERel, on the other hand, requires only very few example instances to learn efficient patterns for tabular data.

    <html>
      <head>
        <link rel="stylesheet" type="text/css" href="style.css" />
      </head>
      <body>
        <h1>People list</h1>
        <table>
          <tr><th>Nantes</th></tr>
          <tr><td>Fabrice</td></tr>
          <tr><td>Nordine</td></tr>
          <tr><td>Alex</td></tr>
        </table>
        <table>
          <tr><th>Grenoble</th></tr>
          <tr><td>Gwen</td></tr>
          <tr><td>Bertrand</td></tr>
        </table>
        <table>
          <tr><th>Lille</th></tr>
          <tr><td>Claire</td></tr>
        </table>
      </body>
    </html>

Figure 3: Source of the People/Location document

Some systems learn n-ary queries by composing unary queries. For example, this technique is used in many systems such as Stalker [17]. It supposes that intermediate nodes or surrounding text either have to be tagged explicitly by the user or be discovered by the system. Stalker also requires knowing how the data is organized in the page. While Lixto [2] is a visual wrapper-building system which does not use machine learning, it performs n-ary extraction by composing monadic queries. As with Stalker, intermediate nodes need to be selected by the user.

The system PAF [5] transforms the n-ary wrapper learning problem into a classification problem. It uses an attribute/value representation of the nodes of the tree representation of documents. A classifier is built for each component of the tuples. The classifier extracting nodes for the $i$-th component has access to information related to the $(i-1)$-st components. However, in the case of the PAF system, only the local context of the nodes to be extracted is taken into account.

XPath [23] is a widely used query language for semi-structured data. In some simple cases, a wrapper can simply be expressed by a set of XPath expressions relative

to each other. To extract the (city, name) relation from the document of figure 1, whose source is given in figure 3, each group of instances can be determined by the XPath expression /html/body/table. Given such a table node, it is easy to see that the expression tr/th extracts the city attribute of the group of instances. Finally, each match of the XPath expression tr/td extracts the name attribute. The XPath query for which there exists a match (one match being one possible embedding) for each extracted tuple is the following query using branching:

    /html/body/table[tr/th][tr/td]

It should be noted that evaluation algorithms which are linear in query and tree size exist for the core parts of XPath [6, 21].

Our work is also closely related to tree mining. Tree mining consists in finding frequently occurring patterns in a set of trees. It is mostly used for finding frequent queries or for XML document classification. There is a large body of research on tree-pattern mining. For example, the TreeFinder system [22] finds frequent unordered trees based on the notion of tree subsumption, a definition of inclusion which preserves ancestorship. It does not allow for child edges or label abstraction. Another interesting tree mining approach is suggested by [1], who propose an incremental frequent ordered tree mining algorithm based on rightmost branch extension. Learning extraction patterns differs from tree mining mostly in that we are required to learn patterns containing extraction variables. We also consider learning multiple relations between nodes and abstracting some node labels, and we do not restrict ourselves to relations of a single type (usually either descendant or child relationships) as most tree mining techniques do. In [4] a tree query aggregation algorithm is proposed.
Given a set of tree patterns, their system builds a new pattern which is more general than the original patterns and which can be used as a replacement for them. They only consider an unordered and non-injective definition of pattern matching.

Many theoretical results on tree subsumption, tree pattern containment, and tree pattern evaluation should be taken into account when considering learning tree patterns. In [10], complexity results for different tree inclusion problems are reported. They showed that unordered injective tree inclusion (preserving labeling and ancestorship) is NP-complete while the ordered version is in PTIME. Recently, [16] showed that the (non-injective) containment problem of the XPath fragment allowing child and descendant relationships, label abstraction and branching together is coNP-complete. Faced with these results and the previously stated requirements for information extraction, it is necessary to adapt exact methods in order to be efficient. The weight-based approach we propose and the tree cutting heuristics we use overcome these limits while still learning efficient patterns.

4 Background

We model XML and HTML documents as unranked ordered trees whose nodes are labeled with symbols from an alphabet $\Sigma$. In this paper, we will consider both unordered

and ordered trees. When considering unordered trees, we simply ignore the ordering of sibling nodes.

    html 0
      head 1
        link 2 (rel="stylesheet" type="text/css" href="style.css")
      body 3
        h1 4: People list
        table 5
          tr 6
            th 7: Nantes
          tr 8
            td 9: Fabrice
          tr 10
            td 11: Nordine
          tr 12
            td 13: Alex
        table 14
          tr 15
            th 16: Grenoble
          tr 17
            td 18: Gwen
          tr 19
            td 20: Bertrand
        table 21
          tr 22
            th 23: Lille
          tr 24
            td 25: Claire

Figure 4: The tree representation of the People/Location document

    <people>
    {
      for $x in //table
      let $city := $x/tr/th
      return
        for $name in $x/tr/td
        return
          <entry>
            <name>{ $name/text() }</name>
            <city>{ $city/text() }</city>
          </entry>
    }
    </people>

Figure 5: XQuery extracting the relation from the People/Location document

XPath is a widely used standard and is therefore an interesting target language for information extraction. However, XPath is monadic (it cannot express n-ary extraction) and therefore cannot express the target extraction. Tree patterns are a simple extension of XPath in which a variable can be attached to any node of the tree representation of an XPath expression. This gives us the tree pattern of figure 6, which correctly extracts the desired couples. Formally, tree patterns can be defined as follows:

Definition 3 (Tree Pattern [16]) A tree pattern $p$ is an unranked tree over alphabet $\Sigma$ with a distinguished subset of edges called descendant edges, and a $k$-tuple of nodes called the result tuple, for some $k \ge 0$.
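The nested iteration of the XQuery in figure 5 can be mirrored in a few lines of Python. The following is only an illustrative sketch, not part of the proposed system: it assumes the source of figure 3 is well-formed XML, uses the standard-library `xml.etree.ElementTree` module, and the names `SOURCE`, `extract_city_name` and `RESULT` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Source of the People/Location document (figure 3), assumed well-formed.
SOURCE = """<html>
<head><link rel="stylesheet" type="text/css" href="style.css" /></head>
<body>
<h1>People list</h1>
<table>
<tr><th>Nantes</th></tr>
<tr><td>Fabrice</td></tr>
<tr><td>Nordine</td></tr>
<tr><td>Alex</td></tr>
</table>
<table>
<tr><th>Grenoble</th></tr>
<tr><td>Gwen</td></tr>
<tr><td>Bertrand</td></tr>
</table>
<table>
<tr><th>Lille</th></tr>
<tr><td>Claire</td></tr>
</table>
</body>
</html>"""


def extract_city_name(source):
    """Return the (city, name) tuples, one per (th, td) pair of each table."""
    root = ET.fromstring(source)          # root is the <html> element
    tuples = []
    for table in root.findall("body/table"):
        for th in table.findall("tr/th"):      # city shared by the group
            for td in table.findall("tr/td"):  # one name per tuple
                tuples.append((th.text, td.text))
    return tuples


RESULT = extract_city_name(SOURCE)
```

The three nested loops play the role of the three relative XPath expressions; a tree pattern expresses the same n-ary query as a single object.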

    html 0
      body 1
        table 2
          tr 3
            th 4 [city]
          tr 5
            td 6 [name]

Figure 6: Target pattern

Tree patterns are a superset of a fragment (noted $XP^{\{*,//,[]\}}$) of the widely used XPath query language. This fragment allows the label wildcard ($*$), descendant expressions (//) and branching expressions ([]). The main difference between tree patterns and expressions of $XP^{\{*,//,[]\}}$ is that tree patterns also allow variables to be attached to nodes, so that multiple values can be extracted simultaneously with a single expression. In the context of information extraction of n-ary data, this aspect is very useful.

In the following, for a tree or pattern $t$ we denote by NODES($t$) the set of nodes in $t$. For any given node $n$, LABEL($n$) denotes the label of $n$, PARENT($n$) denotes the parent of $n$, and CHILDREN($n$) denotes the subset of NODES($t$) made up of the child nodes of $n$ (in the case of a pattern, independently of the type of edge). Furthermore, we denote by $<$ the ordering obtained by walking through the nodes of $t$ in a depth-first manner. For any two nodes $n$ and $m$, $n \prec m$ denotes that $n$ is a strict ancestor of $m$. Also, when $t$ is a tree pattern, CEDGES($t$) and DEDGES($t$) respectively denote the set of child edges (c-edges) of $t$ and the set of descendant edges (d-edges) of $t$. Finally, for a pattern $p$, VARIABLES($p$) denotes the set of variables contained in $p$, and for a node $n$, VAR($n$) is the name of the attached variable, if any.

Determining when a tree matches a pattern can be defined through the notion of embedding. Informally, an embedding is a function which maps each node of the pattern to a node of the matching tree in such a way that all the properties described by the pattern are conserved. The strict minimum is that the relations between the nodes are conserved. The following definition formalizes this notion of embedding.

Definition 4 (Embedding) A function $e$ from a pattern $p$ to a tree $t$ is an unordered embedding iff it respects the following requirements:
(1) $e$ is root preserving (i.e. $e(\mathrm{ROOT}(p)) = \mathrm{ROOT}(t)$)

(2) for all $n \in \mathrm{NODES}(p)$: $\mathrm{LABEL}(n) = *$ or $\mathrm{LABEL}(n) = \mathrm{LABEL}(e(n))$
(3) for each $n, m \in \mathrm{NODES}(p)$:
    - if $(n, m)$ is a c-edge in $p$ then $(e(n), e(m))$ is an edge in $t$
    - if $(n, m)$ is a d-edge in $p$ then $e(m)$ is a proper descendant of $e(n)$ in $t$

Consider the pattern of figure 6 and the tree of figure 4. The function $e$ which maps nodes 0 to 6 of the pattern to nodes 0, 3, 5, 6, 7, 8, 9 of the tree, respectively, is an embedding.

Requiring the embedding to be injective ensures that distinct nodes of the pattern are mapped to distinct nodes of the tree and, furthermore, that for each node $n$ of the pattern, each outgoing edge implies the existence of a distinct subtree under $e(n)$. It is also natural to impose that the children of each node $n$ map into distinct subtrees of $e(n)$. This can be obtained by requiring that the mapping conserves the lowest common ancestors.

Definition 5 (Lowest common ancestor) The lowest common ancestor of two nodes $n$ and $m$ of a tree (or pattern) $t$ (noted $lca(n, m)$) is the unique node $z$ such that:
- $z$ is an ancestor of $n$ and $m$
- all other ancestors of both $n$ and $m$ are also ancestors of $z$

In figure 4, $lca(6, 15) = 3$.

Definition 6 (Injective embedding) An embedding $e$ from a pattern $p$ to a tree $t$ is injective iff it also satisfies the following requirements:
(4) $\forall n, m \in \mathrm{NODES}(p)$, $n \neq m \Rightarrow e(n) \neq e(m)$
(5) $\forall n, m \in \mathrm{NODES}(p)$, $e(lca(n, m)) = lca(e(n), e(m))$

In many cases, it may also be interesting to conserve the order of the children of a node.

Definition 7 (Ordered injective embedding) An embedding $e$ from a pattern $p$ to a tree $t$ is an ordered injective embedding iff it is an injective embedding which also satisfies the following requirement:
(6) for all $n, m \in \mathrm{NODES}(p)$, $n < m$ iff $e(n) < e(m)$

Definition 8 (Match) A tree $t$ is said to match a pattern $p$ iff there exists an embedding $e$ from the nodes of $p$ to the nodes of $t$. Of course, the notion of match depends on the type of embedding considered.
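Definition 5 can be checked on the tree of figure 4 with a short Python sketch. The parent table below transcribes the node numbering of figure 4; the function names are hypothetical, and the sketch assumes that a node counts as its own ancestor, so that the lowest common ancestor of a node and one of its descendants is the node itself.

```python
# Parent of each node of the figure 4 tree (the root, node 0, has no entry).
PARENT = {
    1: 0, 2: 1, 3: 0, 4: 3,
    5: 3, 6: 5, 7: 6, 8: 5, 9: 8, 10: 5, 11: 10, 12: 5, 13: 12,
    14: 3, 15: 14, 16: 15, 17: 14, 18: 17, 19: 14, 20: 19,
    21: 3, 22: 21, 23: 22, 24: 21, 25: 24,
}


def ancestors_or_self(node, parent):
    """Return the chain node, parent(node), ..., root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def lca(n, m, parent):
    """Lowest common ancestor of n and m (ancestor taken reflexively)."""
    seen = set(ancestors_or_self(n, parent))
    for z in ancestors_or_self(m, parent):   # walk up from m, lowest first
        if z in seen:
            return z
    raise ValueError("nodes do not belong to the same tree")
```

With this table, `lca(6, 15, PARENT)` returns 3, the body node, as stated after Definition 5.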
Each embedding from a pattern to a tree gives rise to the extraction of a tuple of nodes. The extracted nodes are the images of the variable nodes of the pattern. If $\mathrm{VAR}(n) = X$ then $e(n)$ is extracted as the value of $X$.
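The way embeddings give rise to extracted tuples can be illustrated by enumerating them on the running example. The sketch below is a naive enumeration written for illustration only, not the evaluation procedure of the paper: it implements the unordered, non-injective embeddings of Definition 4 on the tree of figure 4 and the pattern of figure 6, and all identifiers are hypothetical.

```python
from itertools import product

# Tree of figure 4: labels and ordered children of each node.
TREE_LABEL = {0: "html", 1: "head", 2: "link", 3: "body", 4: "h1",
              5: "table", 6: "tr", 7: "th", 8: "tr", 9: "td", 10: "tr",
              11: "td", 12: "tr", 13: "td", 14: "table", 15: "tr", 16: "th",
              17: "tr", 18: "td", 19: "tr", 20: "td", 21: "table", 22: "tr",
              23: "th", 24: "tr", 25: "td"}
TREE_CHILDREN = {0: [1, 3], 1: [2], 3: [4, 5, 14, 21], 5: [6, 8, 10, 12],
                 6: [7], 8: [9], 10: [11], 12: [13], 14: [15, 17, 19],
                 15: [16], 17: [18], 19: [20], 21: [22, 24], 22: [23],
                 24: [25]}

# Pattern of figure 6: "c" marks a child edge, "d" a descendant edge.
PAT_LABEL = {0: "html", 1: "body", 2: "table", 3: "tr", 4: "th",
             5: "tr", 6: "td"}
PAT_CHILDREN = {0: [(1, "c")], 1: [(2, "c")], 2: [(3, "c"), (5, "c")],
                3: [(4, "c")], 5: [(6, "c")]}
RESULT_NODES = (4, 6)  # pattern nodes carrying the variables (city, name)


def descendants(t):
    """All proper descendants of tree node t."""
    out = []
    for c in TREE_CHILDREN.get(t, []):
        out.append(c)
        out.extend(descendants(c))
    return out


def embeddings(p, t):
    """Yield every mapping of the pattern subtree rooted at p with p -> t."""
    if PAT_LABEL[p] != "*" and PAT_LABEL[p] != TREE_LABEL[t]:
        return                                  # condition (2) fails
    per_child = []
    for q, kind in PAT_CHILDREN.get(p, []):     # condition (3)
        targets = TREE_CHILDREN.get(t, []) if kind == "c" else descendants(t)
        options = [e for u in targets for e in embeddings(q, u)]
        if not options:
            return
        per_child.append(options)
    for combo in product(*per_child):
        mapping = {p: t}
        for sub in combo:
            mapping.update(sub)
        yield mapping


# Condition (1): the pattern root (0) is mapped to the tree root (0).
EXTRACTED = sorted({tuple(e[n] for n in RESULT_NODES)
                    for e in embeddings(0, 0)})
```

On this example `EXTRACTED` contains exactly the six node couples listed in figure 2, (7, 9) being the one produced by the embedding described above.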

Definition 9 (Extracted tuples) Given a tree pattern $p$ and a tree $t$, let $(v_1, \ldots, v_n)$ denote the variables of $p$. The extracted relation from $t$ given $p$ is the set:

$$R_p(t) = \{(x_1, \ldots, x_n) \mid \exists e,\ e \text{ is an embedding from } p \text{ to } t \text{ and } e(v_i) = x_i \text{ for all } i\}$$

For example, the previously defined embedding from the pattern of figure 6 to the tree of figure 4 extracts the tuple (7, 9). It should be noted that in the context of information extraction we are actually interested in the contents of the nodes. By considering the content of nodes 7 and 9 we effectively extract the valid (city, name) tuple (Nantes, Fabrice).

Embeddings determine whether a tree is matched by a pattern. The notion of homomorphism from one tree pattern to another can be defined by adapting condition (3) of definition 4 and adding the condition that variables must be matched.

Definition 10 (Homomorphism) A mapping $h$ from a pattern $p$ to another pattern $p'$ is a homomorphism iff it respects the following requirements:
(0) for all $n \in \mathrm{NODES}(p)$: $\mathrm{VAR}(h(n)) = \mathrm{VAR}(n)$ if $\mathrm{VAR}(n)$ is defined
(1) $h$ is root preserving (i.e. $h(\mathrm{ROOT}(p)) = \mathrm{ROOT}(p')$)
(2) for all $n \in \mathrm{NODES}(p)$: $\mathrm{LABEL}(n) = *$ or $\mathrm{LABEL}(n) = \mathrm{LABEL}(h(n))$
(3) for each $n, m \in \mathrm{NODES}(p)$:
    - if $(n, m)$ is a c-edge in $p$ then $(h(n), h(m))$ is a c-edge in $p'$
    - if $(n, m)$ is a d-edge in $p$ then $h(m)$ is a proper descendant of $h(n)$ in $p'$ (independently of the type of edges).

Let $T_\Sigma$ denote the set of trees which can be constructed over alphabet $\Sigma$. We denote by $L(p)$ the subset of trees of $T_\Sigma$ which match $p$, called the language of $p$. The partial order on the subsets of $T_\Sigma$ induces a partial order over the pattern space: a pattern $p$ is said to be more general than another pattern $p'$ (noted $p \succeq p'$) iff $L(p') \subseteq L(p)$.

Testing whether a pattern $p$ is more general than a pattern $p'$ is called the pattern containment problem. When no abstracted labels are allowed, the patterns only contain c-edges (resp. d-edges), and the embedding is injective, the problem is the same as the subtree (resp. tree) inclusion problem described in [10]. The containment problem for $XP^{\{*,//,[]\}}$ has been proven to be coNP-complete [16], while it remains in PTIME for $XP^{\{*,[]\}}$, $XP^{\{//,[]\}}$, and $XP^{\{*,//\}}$.

Proposition 1 Given two patterns $p$ and $p'$, if there exists a homomorphism $h$ from NODES($p'$) to NODES($p$) which respects the same constraints as those defined for matching, then $p' \succeq p$.

The proof of proposition 1 is straightforward. Take any tree $t$ matching $p$ with embedding $e$. Then $e \circ h$ is a valid embedding from $p'$ to $t$, which proves that $t$ also matches $p'$. While the existence of a homomorphism is a sufficient condition for pattern inclusion, it is not always a necessary condition: [16] have shown that for $XP^{\{*,//,[]\}}$ this is not the case. The generalization algorithm we propose in the next section guarantees the existence of a homomorphism from the generalization to the patterns to be generalized. Therefore, the answer to this question is mostly of theoretical interest from the point of view of this work.

Proposition 2 Let $p$ and $p'$ be two patterns such that there exists an injective homomorphism $h$ from $p$ to $p'$. Let $r$ and $r'$ denote respectively the root nodes of $p$ and $p'$. Then $|\mathrm{CHILDREN}(r)| \le |\mathrm{CHILDREN}(r')|$.

Proposition 2 is a direct consequence of restriction (5). Suppose $r$ has $k$ children and $r'$ has $k'$ children with $k > k'$. Now suppose without loss of generality that $h$ maps each of the first $k'$ children of $r$ (for any ordering) into a different subtree rooted at one of the $k'$ children of $r'$. Let $c_{k'+1}$ denote the next unmapped child of $r$. Any mapping of $c_{k'+1}$ will violate condition (5). Indeed, suppose w.l.o.g. that we map it into the subtree rooted at $c'_1$, the same subtree the first child $c_1$ has been mapped into. Then the lowest common ancestor of $c_1$ and $c_{k'+1}$ in $p$ is $r$, while the lowest common ancestor of $h(c_1)$ and $h(c_{k'+1})$ in $p'$ lies within the subtree rooted at $c'_1$ and is therefore not $h(r) = r'$.

Given a set of example labeled trees, we want to build a pattern which will effectively extract the labeled information. Any tree can be seen as a tree pattern which matches itself. Therefore we only need to consider generalizing tree patterns. In this context we wish to build a generalized pattern which matches at least the same set of trees as the initial patterns. We consider the problem of generalizing two tree patterns together. In order to only extract very similar data, we want to keep the patterns as specific as possible.
Therefore, given two tree patterns, we wish to build a new tree pattern which is more general than the initial patterns but kept as specific as possible.

Definition 11 (Least general generalization) The least general generalization (lgg) of two tree patterns $p_1$ and $p_2$ is a pattern $p$ such that:
- $p \succeq p_1$ and $p \succeq p_2$
- there is no other pattern $p'$ with $p \succeq p'$ which also respects the previous condition

It is well known in the inductive logic programming (ILP) community that the lgg of Horn clauses under θ-subsumption (1) is unique (up to logical equivalence) [19]. However, the size of the generalization is quadratic in the size of the initial clauses and the obtained clause might not be in reduced form. Reducing a clause requires checking θ-subsumption, which is known to be NP-complete. On tree patterns, efficient reduction algorithms have been proposed [20]. However, they do not allow label abstraction. When the homomorphism is required to be injective, the lgg is not unique but is always in reduced form (i.e. there are no redundant nodes). Given these results, there seems to be no best choice. Here we choose to learn injective patterns using a weight-based selection.

(1) A clause $C$ subsumes a clause $C'$ if there exists a substitution (homomorphism) $\theta$ such that $C\theta \subseteq C'$.

    (a) t_1                 (b) t_2                   (c) generalization of t_1 and t_2

    div 1                   div 1                     div (1,1)
      i 2                     b 2                       * (2,2)
        a 3 [X]                 a 3 [X]                   a (3,3) [X]
      sp 4                    bq 4                      sp (4,5)   (descendant edge)
        a 5 [Y]                 sp 5                      a (5,6) [Y]
        em 6                      a 6 [Y]                 em (6,7)
                                  em 7

Figure 7: Two example trees and their generalization

5 Maximal weight generalization

In this section, we introduce the notion of weighted pattern. It partly handles the non-uniqueness of the lgg in the case of injective embeddings. It also introduces some control over which type of information should be preferred when characterizing the data to be extracted.

A tree pattern can be transformed into a set of relational constraints which we will note $rel(p)$ in the following. Each node $n_i$ is transformed into a variable $N_i$, each c-edge $(n_i, n_j)$ is transformed into a constraint $child(N_i, N_j)$, each d-edge $(n_i, n_j)$ is transformed into a constraint $descendant(N_i, N_j)$, and for each $a \in \Sigma$, a constraint $label_a(N_i)$ is added for every node $n_i$ such that $\mathrm{LABEL}(n_i) = a$. The variables of a pattern can also be considered as extraction constraints. Therefore, for each node $n_i$ where $\mathrm{VAR}(n_i)$ is defined and equal to $X$ we add a constraint $extract_X(N_i)$. Other types of constraints (e.g. leaf, next-sibling, etc.) could have been considered. In the following we denote by $\Delta$ the set of relational symbols over which a pattern is described. In this paper we consider n-ary patterns over the set:

$$CDLX = \{child/2, descendant/2\} \cup \{label_a/1,\ a \in \Sigma\} \cup \{extract_{X_i}/1,\ i \in [1, n]\}$$

Definition 12 (Weighting function) A weighting is a function $W : \Delta \to \mathbb{R}$ which associates a real number to each symbol in $\Delta$.

Given such a weighting function, it is now possible to associate a weight to a pattern.
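The transformation of a pattern into relational constraints can be sketched as follows. This is an illustrative encoding only: constraints are represented as plain Python tuples, the pattern is the target pattern of figure 6, and the names `rel`, `LABELS`, `EDGES` and `VARIABLES` are hypothetical.

```python
# Target pattern of figure 6, node i standing for the variable N_i.
LABELS = {0: "html", 1: "body", 2: "table", 3: "tr", 4: "th", 5: "tr", 6: "td"}
EDGES = [(0, 1, "c"), (1, 2, "c"), (2, 3, "c"), (3, 4, "c"),
         (2, 5, "c"), (5, 6, "c")]          # "c" = c-edge, "d" = d-edge
VARIABLES = {4: "city", 6: "name"}


def rel(labels, edges, variables):
    """Return the set of relational constraints describing a pattern."""
    constraints = set()
    for i, j, kind in edges:
        symbol = "child" if kind == "c" else "descendant"
        constraints.add((symbol, i, j))
    for i, a in labels.items():
        if a != "*":                        # wildcard nodes carry no label
            constraints.add(("label_" + a, i))
    for i, x in variables.items():
        constraints.add(("extract_" + x, i))
    return constraints


REL = rel(LABELS, EDGES, VARIABLES)
```

For this pattern `REL` holds 15 constraints: 6 child constraints, 7 label constraints and 2 extraction constraints.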

Figure 8: Generalization matrix for $t_1$ and $t_2$ (nodes div, b, a, bq, sp, a, em of $t_2$ against nodes div, i, a, sp, a, em of $t_1$)

Definition 13 (Pattern weight) Let $p$ be a pattern and $W$ a weighting. The weight of $p$ is the sum of the weights of the relations appearing in $p$:

$$w(p) = \sum_{r(\vec{X}) \in rel(p)} W(r)$$

It is possible to decompose the weight of a pattern into the sum of a local weight to which is added the sum of the weights of its subtrees.

Definition 14 (Node weight) The weight of a node $n$ is the weight of the subtree rooted at this node. By abuse of notation, we will also denote by $w(n)$ the weight of a node $n$.

Proposition 3 When $\Delta = CDLX$ the following hold:
- The weight of a node $n$ is:

$$w(n) = w_{local}(n) + w_{extract}(n) + w(child)\,|\mathrm{CEDGES}(n)| + w(descendant)\,|\mathrm{DEDGES}(n)| + \sum_{c \in \mathrm{CHILDREN}(n)} w(c)$$

  where CEDGES($n$) and DEDGES($n$) are the c-edges and d-edges leaving $n$, and

$$w_{local}(n) = \begin{cases} w(label_a) & \text{if } \mathrm{LABEL}(n) = a \\ 0 & \text{otherwise} \end{cases} \qquad w_{extract}(n) = \begin{cases} w(extract_X) & \text{if } \mathrm{VAR}(n) = X \\ 0 & \text{otherwise} \end{cases}$$

- The weight of a pattern $p$ is the weight of its root.

Definition 15 (Maximal weight generalization) A generalization $g$ of two patterns $p$ and $p'$ is of maximal weight if there exists no other generalization $g'$ of the same patterns with a higher weight.
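The recursive decomposition of the weight can be sketched on the generalization of figure 7(c). The sketch assumes the weights the paper adopts in section 6 (child 2, descendant 1, label 2, extract 10) and reads a node weight as its local terms plus the weights of its outgoing edges and of its subtrees; the dictionary layout and the function name are hypothetical.

```python
# Weights stated in section 6 of the paper.
W = {"child": 2, "descendant": 1, "label": 2, "extract": 10}

# Generalization of figure 7(c):
# node -> (label or "*", variable or None, [(child node, edge kind)])
PATTERN = {
    (1, 1): ("div", None, [((2, 2), "c"), ((4, 5), "d")]),
    (2, 2): ("*", None, [((3, 3), "c")]),
    (3, 3): ("a", "X", []),
    (4, 5): ("sp", None, [((5, 6), "c"), ((6, 7), "c")]),
    (5, 6): ("a", "Y", []),
    (6, 7): ("em", None, []),
}


def weight(node):
    """Weight of the subtree rooted at node (Definition 14)."""
    label, variable, children = PATTERN[node]
    total = W["label"] if label != "*" else 0          # w_local
    total += W["extract"] if variable is not None else 0  # w_extract
    for child, kind in children:
        total += W["child"] if kind == "c" else W["descendant"]
        total += weight(child)                         # weight of the subtree
    return total


SP_SUBTREE = weight((4, 5))
SP_WITH_EDGE = SP_SUBTREE + W["descendant"]
```

Here `SP_SUBTREE` is the internal weight of the subtree rooted at (4, 5) and `SP_WITH_EDGE` adds the descendant edge linking it to (1, 1); these are the values 20 and 21 discussed in the worked example of this section.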

Proposition 4 For any patterns $g$, $p$, $p'$ such that there exist two homomorphisms $h$ and $h'$ from NODES($g$) to respectively NODES($p$) and NODES($p'$), there exists a unique subset $S$ of $\mathrm{NODES}(p) \times \mathrm{NODES}(p')$ and a one-to-one mapping $sel$ from $S$ to NODES($g$) such that $\forall (x, y) \in S$: $h(sel(x, y)) = x \wedge h'(sel(x, y)) = y$.

Proposition 4 says that any generalization $g$ of two patterns $p$ and $p'$ can be seen as a selection of couples from the Cartesian product of the nodes of $p$ and $p'$. Therefore, finding a maximal weight generalization of $p$ and $p'$ consists in finding the valid subset of node couples which has maximal weight.

Definition 16 (Underlying pattern) Given two patterns $p$ and $p'$ and a selection $S \subseteq \mathrm{NODES}(p) \times \mathrm{NODES}(p')$, the underlying pattern $g$ is a tree pattern such that:
- $sel$ is the unique one-to-one mapping from $S$ to NODES($g$)
- for all $(x, y) \in S$ we have $\mathrm{LABEL}(sel(x, y)) = a$ if $\mathrm{LABEL}(x) = \mathrm{LABEL}(y) = a$, and $\mathrm{LABEL}(sel(x, y)) = *$ otherwise
- for all $(x, y) \in S$ where $\mathrm{VAR}(x)$ and $\mathrm{VAR}(y)$ are defined and equal we have $\mathrm{VAR}(sel(x, y)) = \mathrm{VAR}(x)$
- the parent of a node $n = sel(x, y)$ in $g$ is the node $n' = sel(x', y')$ such that (1) $x' \prec x$, (2) $y' \prec y$ and (3) there exists no node $n'' = sel(x'', y'')$ such that $x' \prec x'' \prec x$ and $y' \prec y'' \prec y$
- for each node $n = sel(x, y)$ and its parent $n' = sel(x', y')$, $(n', n)$ is a c-edge iff $(x', x)$ is a c-edge in $p$ and $(y', y)$ is a c-edge in $p'$; otherwise $(n', n)$ is a d-edge

Definition 16 describes how to build a pattern from a node selection. Building a pattern in this way keeps it both specific and tree shaped. Indeed, nodes are not linked as descendants when they are already linked as children.

Definition 17 (Injective selection) Given two patterns $p$ and $p'$, a selection of nodes $S \subseteq \mathrm{NODES}(p) \times \mathrm{NODES}(p')$ is injective iff for all $(x, y) \in S$: $(x, y') \in S \Rightarrow y = y'$ and $(x', y) \in S \Rightarrow x = x'$.

Definition 18 (Order conservative selection) Given two patterns $p$ and $p'$, a selection of nodes $S \subseteq \mathrm{NODES}(p) \times \mathrm{NODES}(p')$ is order conservative iff for all $(x, y), (x', y') \in S$: $x < x' \Rightarrow y < y'$.
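Definitions 17 and 18 translate directly into two checks on a set of node couples. The following sketch is illustrative only: it represents a selection as a Python set of (x, y) pairs using the depth-first node numbers of figure 7, checks order conservation in the form "x < x' implies y < y'", and the function names are hypothetical.

```python
def is_injective(selection):
    """Definition 17: no node of either pattern is selected twice."""
    xs = [x for x, _ in selection]
    ys = [y for _, y in selection]
    return len(set(xs)) == len(xs) and len(set(ys)) == len(ys)


def is_order_conservative(selection):
    """Definition 18: x < x' implies y < y' for every two couples."""
    return all(y1 < y2
               for (x1, y1) in selection
               for (x2, y2) in selection
               if x1 < x2)


# Selection underlying the generalization of figure 7(c).
FIGURE_7 = {(1, 1), (2, 2), (3, 3), (4, 5), (5, 6), (6, 7)}
```

The selection of figure 7(c) passes both checks, whereas a selection containing both (3, 2) and (5, 2) is rejected by `is_injective` because node 2 of the second tree would be used twice.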
Consider figure 7, where $t_1$ and $t_2$ are two example fragments of HTML trees (bq and sp are short for blockquote and span, respectively) and $p$ is their generalization. In each of the two trees we want to extract the nodes labeled by X and Y, respectively. Both trees $t_1$ and $t_2$ have been numbered. In the case of the pattern, each couple of

numbers $(i, j)$ corresponds to the nodes $i$ and $j$, respectively in $t_1$ and $t_2$, to which the pattern node maps.

    html 0 [0]
      head 1 [1]
        link 2 [1] (href="style.css" rel="stylesheet" type="text/css")
      body 3 [2]
        h1 4 [1]
        table 5
          tr 6 [1]
            th 7 [1]
          tr 8 [2]
            td 9 [1]
          tr 10 [3]
            td 11 [1]
        table 12
          tr 13 [1]
            th 14 [1]
          tr 15 [2]
            td 16 [1]
        table 17
          tr 18 [1]
            th 19 [1] city
          tr 20
            td 21 [1] name

Figure 9: Learned pattern for People/Location

Now consider that we have not yet built the pattern. We wish to find the selection $S$ of nodes which builds the pattern with maximal weight. First, we know that the root of the pattern is required to map to the roots of both trees. Therefore (1, 1) belongs to $S$.

Proposition 5 Given two patterns $p$ and $p'$ and a maximal weight generalization $g$ (with mappings $h$ and $h'$), for any node $n$ in $g$ we have: $h(\mathrm{PARENT}(n)) = \mathrm{PARENT}(h(n))$ or $h'(\mathrm{PARENT}(n)) = \mathrm{PARENT}(h'(n))$.

The intuition behind the proof of proposition 5 is the following: if, for a given node $n = sel(x, y)$ and its parent $n' = sel(x', y')$, neither $x' = \mathrm{PARENT}(x)$ nor $y' = \mathrm{PARENT}(y)$ holds, then there exists a node that we have missed (and whose addition would add to the weight of the pattern). Indeed, there exists a couple $(x'', y'')$ such that $x' \prec x'' \prec x$ and $y' \prec y'' \prec y$ whose addition would have replaced $n$ as a child of $n'$ with at least the weight of $n$, since $n$ is one of its candidate children.

Now let us consider the possible children of node $sel(1, 1)$ in the pattern of figure 7. In order for the pattern to respect proposition 2, node (1, 1) can only have two children. According to proposition 5, for any $(i, j) \in S$ we only need to consider as candidate

16 children couples (k, l) where k is child or descendant of i and l is a child of j or where k is a child of i and l is a descendant of j. It is not necessary to consider combining descendants of i and j together since we are assured the generated pattern would have a smaller weight than the combination of their parents. For example, the when building p, the couple (2, 2) will do better than the combination (3, 3) since the weight of (3, 3) included in the weight of (2, 2). Since we are maximizing the global weight, we will keep as many children as possible and therefore for (1, 1) we will keep exactly two of the candidates (recall that we had at most two). However, all candidates are not compatible together. For example, (3, 2) and (5, 2) can not be taking together since they would both map to node 2 in t 2, which is forbidden by restriction (4). Also, (5, 2) and (6, 4) are not compatible since nodes 5 and 6 are both descendants of 4 in t 1. This would violate restriction (5) since in the pattern their first common ancestor would be (1, 1) which is mapped to 1 in t 1 while the first common ancestor of nodes 5 and 6 is 4 in t 1. To calculate the best candidate children for (1, 1) we first calculate the best weight for each candidate and select the best compatible subset of candidates. In figure 7, it can be easily seen that the best solution maps together the div/i/a branch of t 1 with the div/b/a branch of t 2. Also, we can see that when generalizing the other side, we are face to two concurrent choices as a child for (1, 1). Either we map 4 and 4 together or we map 4 with 5. When scoring both candidates, 4, 5 does better than 4, 4 since it allows to keep the span(a, em) subtree of both trees by losing node 4 in t 2 and replacing two child edges by a descendant edge. The best weight obtained by choosing (4, 5) as child of (1, 1) is therefore of 21 (3 labels, 2 child edges, 1 descendant edge, 1 variable). 
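The weights quoted in this example can be rechecked by hand. The short Python sketch below assumes the scoring function adopted in Section 6 (2 per child edge, 1 per descendant edge, 2 per conserved label, 10 per extraction variable) and simply plugs in the feature counts given in the text; the function name is illustrative.

```python
# Weights of the scoring function stated in Section 6 of the paper.
W_CHILD, W_DESC, W_LABEL, W_VAR = 2, 1, 2, 10

def pattern_weight(labels, child_edges, descendant_edges, variables):
    """Weighted count of the features kept by a (sub)pattern."""
    return (W_LABEL * labels + W_CHILD * child_edges
            + W_DESC * descendant_edges + W_VAR * variables)

# Candidate (4, 5), including the descendant edge coming from (1, 1).
with_edge_45 = pattern_weight(labels=3, child_edges=2, descendant_edges=1, variables=1)
# Same subtree without the incoming edge (its internal weight).
internal_45 = pattern_weight(labels=3, child_edges=2, descendant_edges=0, variables=1)
# Competing candidate (4, 4).
with_edge_44 = pattern_weight(labels=2, child_edges=1, descendant_edges=2, variables=1)

print(with_edge_45, internal_45, with_edge_44)  # 21 20 18
```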
Note that the internal weight of the subtree rooted at $(4, 5)$ is only 20, since it does not include the descendant edge from $(1, 1)$ to $(4, 5)$. The best result obtainable with combination $(4, 4)$ is to lose node 5 in $t_2$, thereby replacing two child edges by two descendant edges. We also lose a label since node 4 in $t_1$ and node 4 in $t_2$ have different labels. The weight for combination $(4, 4)$ is 18 (1 child edge, 2 labels, 2 descendant edges, 1 variable). Therefore, the best children for $(1, 1)$ are $(2, 2)$ and $(4, 5)$.

6 Pattern learning algorithm

We now give our generic algorithm, which recursively calculates the maximal weight pattern given both a weighting function and a local compatibility test corresponding to the target type of homomorphism (unordered injective, ordered injective). In the following (experiments included), we consider the scoring function with the following weights: $w(child) = 2$, $w(descendant) = 1$, $w(label) = 2$ and $w(extract) = 10$. The algorithm calculates the selection $S$ whose underlying pattern is of maximal weight. It recursively walks down the tree, calculates the score for the leaves, and on the way back up selects for each node its best candidate children.

Algorithm 1: Calculate the subpattern and its weight for a given candidate node
Input: Two patterns $p_1$ and $p_2$ and a candidate node $n_{i,j}$
Output: The best subpattern rooted at $n_{i,j}$ and its weight
  Let $I_1$ be the indexes of the children of $n_{1i}$ in $p_1$
  Let $I_2$ be the indexes of the children of $n_{2j}$ in $p_2$
  Let $C$ be the set of candidate children
  for all $(k, l) \in I_1 \times I_2$ do
    $C \leftarrow C \cup \{best\_child(p_1, p_2, n_{k,l})\}$
  end for
  Sort $C$ by decreasing weight (best weight first)
  $C \leftarrow optimize(C)$
  if $\mathrm{LABEL}(n_{1i}) = \mathrm{LABEL}(n_{2j})$ then
    $lbl \leftarrow \mathrm{LABEL}(n_{1i})$
    $scr \leftarrow w_l$
  else
    $lbl \leftarrow$ wildcard label
    $scr \leftarrow 0$
  end if
  $scr \leftarrow scr + \sum_{n_{k,l} \in C} weight(n_{k,l})$
  return tree of weight $scr$ with a root labeled $lbl$ having $C$ as children

Let $p$ and $p'$ be two patterns respectively rooted at nodes $r$ and $r'$. Let $N$ and $N'$ denote the number of nodes in respectively $p$ and $p'$. The pattern $g$ we are seeking to build is such that there exist two homomorphisms $h$ and $h'$ from the nodes of $g$ to the nodes of respectively $p$ and $p'$ and such that its weight is maximal. Each node in $g$ is required to match exactly one node in each of $p$ and $p'$. Therefore there are $N \times N'$ candidate nodes for the generalization $g$ we are building. Let $n_1, \ldots, n_N$ be the nodes of $p$ and $n'_1, \ldots, n'_{N'}$ be the nodes of $p'$. We suppose the indices are chosen to reflect the ordering obtained by walking through the tree depth first (ie. $<$). For short, let us note $n_{i,j} = sel(n_i, n'_j)$ (ie. the candidate node which would be mapped to $n_i$ in $p$ and $n'_j$ in $p'$). With restriction (1) (root conservation) we already know that $n_{1,1}$ will be a node of $g$. We now need to determine which other candidate nodes will be kept in the pattern.

As noted previously, the weight of a pattern only depends on its subpatterns. Therefore, the impact of the choice of a node on the global weight of the pattern only depends on the choices made for its descendants. Supposing a candidate node $n_{i,j}$ has been kept in $g$, finding the optimal subpattern rooted at $n_{i,j}$ consists in considering its candidate children, calculating the optimal subpatterns rooted at each candidate child, and choosing the best subset of children which respects the matching restriction. The set of candidate children for a node $n_{i,j}$ is the set of nodes $n_{k,l}$ such that $n_k$ and $n'_l$ are proper descendants of respectively $n_i$ and $n'_j$.
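Combined with proposition 5, the candidate set can be narrowed to couples in which at least one component is a direct child. The Python sketch below enumerates these couples; the trees are given as hypothetical child-list dictionaries indexed depth first, loosely shaped after the description of figure 7, and the helper names are assumptions made for illustration only.

```python
def descendants(children, node):
    """All proper descendants of `node` in a tree given as {node: [children]}."""
    found = []
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children.get(current, []))
    return found

def candidate_children(children1, children2, i, j):
    """Couples (k, l) below (i, j) where k or l is a direct child."""
    kids1, kids2 = children1.get(i, []), children2.get(j, [])
    desc1, desc2 = descendants(children1, i), descendants(children2, j)
    couples = {(k, l) for k in desc1 for l in kids2}   # l is a child of j
    couples |= {(k, l) for k in kids1 for l in desc2}  # k is a child of i
    return sorted(couples)

# Hypothetical trees: node -> list of children, nodes numbered depth first.
T1 = {1: [2, 4], 2: [3], 4: [5, 6]}
T2 = {1: [2, 4], 2: [3], 4: [5], 5: [6, 7]}

candidates = candidate_children(T1, T2, 1, 1)
print(len(candidates))
print((4, 5) in candidates, (3, 3) in candidates)
```

On these trees the couple (4, 5) is a candidate child of (1, 1) whereas (3, 3), which combines two non-child descendants, is not.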
Nodes outside this set would not respect restriction (2) (conserving the child/descendant relationships). Also, since we are maximizing $w(g)$, every child of $n_{i,j}$ will be mapped to a child of $n_i$ or to a child of $n'_j$. Indeed, by doing otherwise we would lose at least one child edge which could have been kept simply by choosing the combination of their parent nodes. We can deduce from proposition 2 that any node $n_{i,j}$ in $g$ has at most $\min(|\mathrm{CHILDREN}(n_i)|, |\mathrm{CHILDREN}(n'_j)|)$ children. We can also deduce from proposition 2 that each of the children of $n_{i,j}$ will be mapped to nodes in distinct subtrees of both $n_i$ and $n'_j$. Two child nodes verifying this property are said to be compatible. Therefore, for every node $n_k$ child of $n_i$ and every node $n'_l$ child of $n'_j$, there will be only one best candidate from the set of nodes which map to nodes of the subtrees rooted at $n_k$ and $n'_l$. This implies that there will be a maximum of $|\mathrm{CHILDREN}(n_i)| \times |\mathrm{CHILDREN}(n'_j)|$ candidate children for $n_{i,j}$. Also, the unique candidate for a given combination $(k, l)$ is a combination of $n_k$ or any of its descendants with $n'_l$ or any of its descendants.

Calculating the best set of children for node $n_{i,j}$ can be done in two steps: (1) for each child combination $(k, l)$, calculate the best child for $n_{i,j}$ for this combination, and (2) select the best set of compatible children from the set of candidates. The compatibility test can simply be done by checking that both components of the combinations from which the candidates come are different, which means that they match different subtrees in both patterns.

Algorithm 1 gives the procedure which calculates the best subpattern for a given node $n_{i,j}$ and its weight. It first calculates the set $C$ of best candidate children, one for each combination, by calling the function best_child given in algorithm 2. It then calls optimize with this set to calculate the subset of compatible children which gives the best weight. Let $k = \min(|\mathrm{CHILDREN}(n_i)|, |\mathrm{CHILDREN}(n'_j)|)$. optimize($C$) chooses the best subset of $C$ of size $k$ whose elements are all compatible. Compatibility depends on the type of selection we are considering: either injective (definition 17) or order conservative (definition 18). In our implementation, this compatibility function is given as a parameter.

Algorithm 2: Calculate the best child for a node given a combination
Input: Two patterns $p_1$ and $p_2$, a parent node $n_{i,j}$ and a node $n_{k,l}$
Output: The best child $n_{k',l'}$ for $n_{i,j}$ this combination can generate
  Let $s \leftarrow weight(p_1, p_2, n_{k,l})$
  if $(i, k)$ and $(j, l)$ are both child edges in their respective patterns then
    $s \leftarrow s + w_c$
  else
    $s \leftarrow s + w_d$
  end if
  Let $I_1$ be the indexes of the descendants of $n_{1k}$ in $p_1$
  Let $I_2$ be the indexes of the descendants of $n_{2l}$ in $p_2$
  Let $(k', l') \leftarrow (k, l)$
  Let $bs \leftarrow s$
  for all $k'' \in I_1$ do
    if $weight(k'', l) + w_d > bs$ then
      Let $(k', l') \leftarrow (k'', l)$
      Let $bs \leftarrow weight(k'', l) + w_d$
    end if
  end for
  for all $l'' \in I_2$ do
    if $weight(k, l'') + w_d > bs$ then
      Let $(k', l') \leftarrow (k, l'')$
      Let $bs \leftarrow weight(k, l'') + w_d$
    end if
  end for
  return $n_{k',l'}$

Algorithm 2 calculates the best child and its weight given a parent node $n_{i,j}$ and a combination $n_{k,l}$. The first candidate is the combination node itself.
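The optimize step called by Algorithm 1 is only described by its contract, so the following Python sketch is a brute-force stand-in under stated assumptions rather than the procedure used in our implementation: it enumerates subsets of at most $k$ candidates, keeps those whose members are pairwise compatible, and returns the heaviest one. Each candidate is a hypothetical dictionary recording the child combination it comes from (`combo`), the couple finally chosen (`node`) and its weight; the weights in the example are invented.

```python
from itertools import combinations

def injective_compatible(a, b):
    """Candidates come from different child combinations in both patterns."""
    return a["combo"][0] != b["combo"][0] and a["combo"][1] != b["combo"][1]

def ordered_compatible(a, b):
    """Injective compatibility that also preserves the sibling order."""
    (k1, l1), (k2, l2) = a["combo"], b["combo"]
    return injective_compatible(a, b) and ((k1 < k2) == (l1 < l2))

def optimize(candidates, k, compatible=injective_compatible):
    """Heaviest pairwise-compatible subset with at most k elements."""
    best, best_weight = [], 0
    for size in range(1, k + 1):
        for subset in combinations(candidates, size):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                weight = sum(c["weight"] for c in subset)
                if weight > best_weight:
                    best, best_weight = list(subset), weight
    return best, best_weight

# Hypothetical candidates for node (1, 1); weights are illustrative only.
cands = [
    {"combo": (2, 2), "node": (2, 2), "weight": 16},
    {"combo": (2, 4), "node": (2, 4), "weight": 5},
    {"combo": (4, 2), "node": (4, 2), "weight": 5},
    {"combo": (4, 4), "node": (4, 5), "weight": 21},
]
chosen, total = optimize(cands, k=2)
print([c["node"] for c in chosen], total)  # [(2, 2), (4, 5)] 37
```

Passing the compatibility test as a parameter mirrors the choice between injective and order conservative selections; the exhaustive enumeration is exponential in the number of candidates and is only meant to make the contract concrete.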
Algorithm 2 then tries to do better with nodes which would map to node $n_{2l}$ in $p_2$ and to a proper descendant of $n_{1k}$ in $p_1$, and then with nodes which would map to node $n_{1k}$ in $p_1$ and to a proper descendant of $n_{2l}$ in $p_2$. Since the weight of each candidate node only depends on the choice of its best children, we are only required to calculate it once. In our implementation, we use a matrix where each cell corresponds to a candidate node and stores the weight for this cell. Each line of the matrix corresponds to a node of the first pattern and each column corresponds to a node of the second pattern. By numbering the nodes in depth first order, we have the nice property that for each node with index $i$ there exists an index $j$ such that all nodes with index $k$, $i < k \le j$, are descendants of the node indexed $i$.

Figure 8 gives the generalization matrix for trees $t_1$ and $t_2$ of figure 7. The scoring function used is defined with $w_c = 2$, $w_d = 1$, $w_l = 2$ and $w_v = 10$. The nodes and labels of $t_1$ appear in the first two columns of the figure and those of $t_2$ in the first two lines. The cells of the matrix contain the maximal weights obtained by associating the node of $t_1$ in the same row with the node of $t_2$ in the same column. The arrows point to the outgoing edges of the pattern giving this best weight. For example, associating node 1 of $t_1$ with node 4 of $t_2$ gives a maximal weight of 4. This weight is obtained because the subpattern rooted at $(1, 4)$ contains two child edges ($2 \times 2 = 4$), has no conserved labels, and has no descendant edges. In the example, the thicker arrows (red if printed in color) show the output pattern obtained by the generalization algorithm. It is obtained as described in definition 16. This pattern is indeed the one which was anticipated in figure 7. The weight of 41 of the root node $(1, 1)$ of the pattern corresponds to the sum of the outgoing edges ( = 26) to which is added the label weight of 2 (node 1 in $t_1$ and node 1 in $t_2$ both have the same label div), plus a descendant weight of 1 for the edge to $(4, 5)$, plus a child weight of 2 for the edge to $(2, 2)$.

The actual implementation of our algorithms makes use of optimizations. During the construction of the set of best children for a node, it is possible to quickly calculate an upper bound on the weight the remaining child set may lead to. If this upper bound is lower than the weight of the best current solution, a complete evaluation is no longer required. These optimizations allow for a noticeable speed-up of the algorithm.

7 Evaluation

We have implemented our algorithm in OCaml and evaluated it on different datasets. We carried out the evaluation considering both unordered injective embeddings and ordered injective embeddings. In each evaluation, we performed a 5-fold cross validation, taking one set of documents as the example set and the remaining ones for testing.²

Table 1: Target relations to be extracted in each source

Source        Relation
bigbook       (name, address)
okra          (name, mail, score)
s20           (file, score, size, type)
LX-0          a 13-ary relation
pagesjaunes   (name, address, city)
amazon        (title, price)

[Table 2: Experimental results with unordered embedding. Columns: Source, Good, Wrong, Missed, Rec., Prec. Rows: bigbook, okra, s20, L0-0, L3-0, L8-0, L9-0, pagesjaunes.fr, amazon.com.]

² k-fold cross validation usually consists in leaving one fold out for testing and learning on the rest. In information extraction, few examples are often sufficient, and therefore the left-out set is used for learning and the rest for testing.
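The inverted cross-validation scheme described in the footnote can be sketched as follows. This is a minimal Python illustration, assuming documents are assigned to folds in round-robin fashion (the paper does not say how the folds were drawn) and using a hypothetical function name.

```python
def inverted_k_fold(documents, k=5):
    """Yield (learning_set, test_set) pairs: learn on one fold, test on the rest."""
    folds = [documents[i::k] for i in range(k)]  # round-robin assignment
    for i, learning_set in enumerate(folds):
        test_set = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield learning_set, test_set

pages = ["page%02d.html" % n for n in range(10)]  # hypothetical document names
splits = list(inverted_k_fold(pages, k=5))
for learning_set, test_set in splits:
    print(len(learning_set), "for learning,", len(test_set), "for testing")
```

With 10 documents and 5 folds, each run learns from 2 documents and is tested on the 8 others, which reflects the observation that few examples are often sufficient.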

[Figure 10: Unordered injective pattern learned using GTree for Amazon]

[Table 3: Experimental results with ordered embedding. Columns: Source, Good, Wrong, Missed, Rec., Prec. Rows: bigbook, okra, s20, L0-0, L3-0, L8-0, L9-0, pagesjaunes.fr, amazon.com.]

Table 4: Qualitative comparison of different extraction systems

System    ML    Doc. Rep.             Pat. Rep.             Ext. type
Stalker   yes   string                string automata       composed unary
IERel     yes   string                string-based pattern  n-ary
Lixto     some  ELog                  ELog                  composed unary
Squirrel  yes   Tree                  NSTT                  unary
PAF       yes   Tree-based att/value  Decision Tree         seed-based n-ary
GTree     yes   Tree                  Tree Pattern          n-ary

The first two datasets, Bigbook and Okra, come from the RISE information extraction repository. These two datasets are the most referenced sets in the information extraction community. They are, however, known to be easy. The L0-0, L3-0, L8-0, and L9-0 datasets are artificial datasets made available by Marty. They contain the same data in different representations. We only present results for the datasets for which tree patterns over the relations considered in this paper have sufficient expressivity (we plan in future work to add other relations). Finally, we evaluated our approach on two real world datasets: Amazon DVD listings and Pagesjaunes address entries. The unordered pattern obtained for Amazon is given in figure 10.

Table 1 gives the target relation to extract in each dataset. Table 2 presents the results obtained when learning patterns in an unordered injective setting and table 3 presents the results obtained in an ordered injective setting. In all cases we have a very high if not perfect recall. The results for Amazon and Pagesjaunes are particularly encouraging, since these datasets come from existing web sites. However, some cases show a bad precision, which means that some extractions were incorrect. Such results appear in data which have a linear format (ie. all the tuples follow each other under the same node). In such cases, a pattern defined over child and descendant relations alone is not sufficient.
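For readers who wish to recompute the Rec. and Prec. columns from the counts, the sketch below uses the standard definitions. It assumes that Good counts correctly extracted tuples, Wrong counts extracted tuples that are incorrect, and Missed counts expected tuples that were not extracted; this reading of the column names is an assumption, and the counts in the example are invented.

```python
def recall_precision(good, wrong, missed):
    """Standard recall and precision from extraction counts."""
    expected = good + missed    # tuples that should have been extracted
    extracted = good + wrong    # tuples the pattern actually returned
    recall = good / expected if expected else 0.0
    precision = good / extracted if extracted else 0.0
    return recall, precision

# Hypothetical counts: perfect recall, but many incorrect extractions.
recall, precision = recall_precision(good=90, wrong=30, missed=0)
print(round(recall, 2), round(precision, 2))  # 1.0 0.75
```

A source with a linear format would typically show this profile: nothing is missed, but overgeneral patterns also return tuples that should not be extracted.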
These low-precision cases show that working only with child and descendant relationships is not sufficient in some cases and that handling negative examples may be required. The algorithms can be adapted to include learning relations such as next-sibling. We are currently integrating these results in our implementation.

The tables show that results in the unordered and ordered cases are very similar. These results seem to suggest that simply respecting ordering does not provide much information on these sources. It should be noted that the ordering considered is similar to linking siblings together with a following-sibling relation. We believe that this ordering may not be strong enough to be informative and that a next-sibling relation might be an interesting replacement.

Comparing our work to other work done in information extraction is difficult for several reasons. The currently referenced datasets (RISE) are either too simple for web information extraction or require natural language information extraction techniques. Therefore there is no real basis for comparison. The availability of the soccer dataset is a first effort in the direction of comparing systems. We will also contribute to this effort by making our datasets (Amazon, Pagesjaunes and others we are currently tagging to automate testing) available for comparison. Table 4 gives a qualitative comparison of string-based systems known to be n-ary, systems using tree-based representations, and our system GTree. In order, the columns give the system name, whether machine learning is used or not, the document representation, the pattern representation and the type of extraction.

8 Conclusion

In this paper we presented a novel approach to n-ary wrapper generation for information extraction from the web. We proposed to use tree patterns as the extraction language, and gave an algorithm capable of generating such a pattern given a set of example labeled documents. We have presented both the theoretical and practical aspects of the method. This method has been implemented and tested on different data sets. The evaluation shows that the approach is useful in cases where string-based approaches have shown their limits.
Learning tree patterns has many advantages over existing methods: it explicitly takes into account the tree structure of Web documents, it is capable of building patterns that skip nodes, the extracted instances can have multiple attributes, it is not sensitive to node orderings, and the generated patterns are easily understandable by a human expert.

In future work we plan to extend the algorithm to include other types of relational constraints. Shortly, we will be adding next-sibling and following-sibling types of constraints. This should enable the method to handle linear formats (at least in some cases). Also, here we have only considered positive examples. We plan to introduce negative examples, which would allow us to limit overgeneralizations.
