Adapting Web Information Extraction Knowledge via Mining Site-Invariant and Site-Dependent Features


TAK-LAM WONG, City University of Hong Kong, and WAI LAM, The Chinese University of Hong Kong

We develop a novel framework that aims at automatically adapting previously learned information extraction knowledge from a source Web site to a new unseen target site in the same domain. Two kinds of features related to the text fragments of the Web documents are investigated. The first type of feature is called a site-invariant feature. These features likely remain unchanged in Web pages from different sites in the same domain. The second type of feature is called a site-dependent feature. These features differ across Web pages collected from different Web sites, while they are similar in Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. The derived site-invariant features are exploited to automatically seek a new set of training examples in the new unseen target site. Both the site-dependent features and the site-invariant features of these automatically discovered training examples are then considered in the learning of new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0% respectively, without any further manual intervention.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; I.2.6 [Artificial Intelligence]: Learning - Induction

The work described in this article is substantially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos: CUHK 4179/03E and CUHK 4193/04E) and the Direct Grant of the Faculty of Engineering, CUHK (Project Codes: and ). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

Authors' addresses: T. L. Wong, Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong; wongtl@cityu.edu.hk; W. Lam, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong; wlam@se.cuhk.edu.hk.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY USA, fax +1 (212) , or permissions@acm.org. © 2007 ACM /07/0200-ART6 $5.00. DOI / /

General Terms: Algorithms, Design

Additional Key Words and Phrases: Wrapper adaptation, Web mining, text mining, machine learning

ACM Reference Format: Wong, T.-L. and Lam, W. 2007. Adapting Web information extraction knowledge via mining site-invariant and site-dependent features. ACM Trans. Intern. Tech. 7, 1, Article 6 (February 2007), 40 pages. DOI = /

1. INTRODUCTION

The vast amount of online documents on the World Wide Web provides a good resource for users to search for information. One common practice is to make use of search engines. For example, a potential customer may browse different bookstore Web sites with the aid of a search engine, hoping to gather precise information such as the title, authors, and selling price of some books. A major problem of online search engines is that the unit of the search results is an entire Web document. Human effort is required to examine each of the returned entries to extract the precise information. Automatic information extraction systems can automate this task by effectively identifying the relevant text fragments within the document. The extracted data can also be utilized in many intelligent applications such as an online comparison-shopping agent [Doorenbos et al. 1997] or an automated travel assistant [Ambite et al. 2002]. Unlike free texts (e.g., newswire articles) and structured texts (e.g., texts in rigid format), online Web documents are semi-structured text documents with a variety of formats containing a mix of short, weakly grammatical text fragments, mark-up tags, and free texts. For example, Figure 1 depicts a portion of an example of a Web book catalog. 1 To automatically extract precise data from semi-structured documents, a commonly used technique is to make use of wrappers. A wrapper normally consists of a set of extraction rules that can identify the text fragments in the documents. In the past, human experts analyzed the documents and constructed the set of extraction rules manually.
This approach is costly, time-consuming, tedious, and error-prone. Wrapper induction aims at automatically constructing wrappers by learning a set of extraction rules from the manually annotated training examples. For instance, a user can specify some training examples containing title, authors, and price of the book records in the Web document, as shown in Figure 1, through a GUI. The wrapper induction system can automatically learn the wrapper from these training examples, and the learned wrapper is able to effectively extract data from the documents of the same Web site. Different techniques have been proposed, which demonstrate that wrapper induction achieves very good extraction performance [Califf and Mooney 2003; Ciravegna 2001; Downey et al. 2005; Freitag and McCallum 1999; Muslea et al. 2001; Soderland 1999]. One major limitation of existing wrapper induction techniques is that a learned wrapper from a particular Web site cannot be applied to a new unseen Web site even in the same domain. For instance, suppose we have learned 1 The URL of this Web site is

Fig. 1. A portion of a sample Web page of a book catalog.

the wrapper for the Web site shown in Figure 1. In this article, it is called a source Web site. Figure 2 is another book catalog collected from a Web site different from the one shown in Figure 1. 2 Although both the source site (Figure 1) and the new Web site (Figure 2) contain information about book records, the learned wrapper for the source site cannot be applied directly to this new Web site for extracting information because their layouts are typically quite different. To automatically extract data from the new site, we must construct another wrapper customized to this new site. Hence, separate human effort is necessary to collect training examples from the new site and invoke the wrapper induction process separately. In this article, we develop a novel framework that can fully automate the adaptation and eliminate this human effort. This problem is called wrapper adaptation: it aims at automatically adapting a previously learned wrapper from a source Web site to a new unseen site in the same domain. Under our model, if some attributes in a domain share a certain amount of similarity among different sites, the wrapper learned from the source Web site can be adapted to the new unseen sites without any human intervention. As a result, the manual effort for preparing training examples in the overall process is reduced. We previously developed an algorithm called WrapMA [Wong and Lam 2002] for solving the wrapper adaptation problem. WrapMA can adapt previously learned extraction knowledge from one Web site to another new unseen Web site in the same domain. The main drawback of WrapMA is that human effort is still required to scrutinize the intermediate data during the adaptation phase. In this article, we present a novel method called Information

2 The URL of the Web site is

Fig. 2. A portion of a sample Web page about a book catalog collected from a Web site different from that of Figure 1.

Extraction Knowledge Adaptation (IEKA) for solving the wrapper adaptation problem. IEKA is a fully automatic method without the need for manual effort. The idea of IEKA is to analyze the site-dependent features and the site-invariant features of the Web pages in order to automatically seek a new set of training examples from the new unseen site. A preliminary version was reported in Wong and Lam [2004b]. In this article, we substantially enhance the framework by modeling the dependence among various kinds of knowledge and site-specific features for the Web environment. An information-theoretic approach to analyzing the DOM (document object model) structure of Web pages is also incorporated for seeking training examples in the new site more effectively. The performance of our new approach, IEKA, is very promising, as demonstrated in the extensive experiments described in Section 9. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0% respectively, without any further manual intervention.

2. PROBLEM DEFINITION AND MOTIVATION

Consider the Web site shown in Figure 1. Figure 3 depicts an excerpt of the HTML texts of this page. Suppose we want to automatically extract information such as the title, authors, and prices of the books from this Web site. We can construct a wrapper for this Web site to achieve this task. To learn the wrapper, we first manually annotate some training examples, similar to the one depicted in Table I, via a GUI. We employ our wrapper learning system (HISER), which considers the text fragments of the data item as well as the

Fig. 3. An excerpt of the HTML texts for the Web page shown in Figure 1.

Table I. Sample of Manually Annotated Training Examples from the Web Page Shown in Figure 1
  Book Title: Programming Microsoft Visual Basic 6.0 with CDROM (Programming)
  Author: Francesco Balena
  Final Price:

surrounding text fragments [Lin and Lam 2000]. The learned wrapper is composed of a hierarchical record structure and extraction rules. For example, Figure 4 shows the learned hierarchical record structure and Table II shows one of the learned extraction rules for the book title. A hierarchical record structure is a tree-like structure representing the relationship of the items of interest. The root node represents a record that consists of one or more items. An internal node represents a certain fragment of its parent node. An internal node can be a repetition, which may consist of other subtrees or leaf nodes. A repetition node specifies that its child can appear repeatedly in the record. A leaf node represents an attribute item of interest. Each node of the hierarchical record structure is associated with a set of extraction rules. An extraction rule contains three components. The left and right pattern components contain the left and right delimiters of the item, and the target pattern component consists of the semantic meaning of the item. After obtaining the wrapper, we can apply it to the other pages from the same Web site to automatically extract items. The learned wrapper can effectively extract items from the Web pages of the same site. However, it cannot extract any item if we directly apply the learned wrapper to a new unseen site in the same domain, such as the one shown in Figure 2. Figure 5 depicts an excerpt of the HTML texts for this page. The failure of extraction is due to the difference between the layout formats of the two Web sites; the learned wrapper becomes inapplicable to the new site.
In order to automatically extract information from the new site, one could learn the wrapper for the new site by manually collecting another set of training examples. Instead, we propose and develop our IEKA framework to automatically tackle this problem.
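To make the three-component rule format concrete, the behavior of a delimiter-based extraction rule can be sketched as follows. This is an illustrative simplification, not the actual HISER implementation; the function name, the delimiter lists, and the sample HTML fragment are all hypothetical.

```python
import re

def extract_with_rule(html, left_delims, right_delim, target_check):
    """Simplified delimiter-based extraction rule: scan past each left
    delimiter in order, capture text up to the right delimiter, and keep
    it only if the target-pattern check on the content holds."""
    pos = 0
    for d in left_delims:                      # left pattern component
        i = html.find(d, pos)
        if i < 0:
            return None
        pos = i + len(d)
    j = html.find(right_delim, pos)            # right pattern component
    if j < 0:
        return None
    candidate = html[pos:j].strip()
    # target pattern component: content must satisfy a semantic check
    return candidate if target_check(candidate) else None

# Hypothetical fragment in the style of the excerpt in Figure 3.
page = '<TD><A href="b1.html"><B>Programming Microsoft Visual Basic 6.0</B></A><BR>'
title = extract_with_rule(
    page,
    left_delims=["<A", ">", "<B>"],
    right_delim="</B>",
    target_check=lambda s: bool(re.match(r"[A-Z]", s)),
)
# title is "Programming Microsoft Visual Basic 6.0"
```

Because the delimiters here are tied to one site's mark-up, the same rule returns nothing on a differently formatted page, which is precisely the failure mode that motivates wrapper adaptation.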

Fig. 4. The learned hierarchical record structure for the Web page shown in Figure 1 (root node with children: title, repetition(author), and price).

Fig. 5. An excerpt of the HTML texts for the Web page shown in Figure 2.

There are several characteristics of IEKA. The first is that IEKA utilizes the previously learned extraction knowledge contained in the wrapper of the source Web site. For example, the extraction rule depicted in Table II shows that the majority of the book titles contain alphabetic characters and words starting with a capital letter. Such knowledge provides useful evidence for information extraction in a new unseen site in the same domain. However, it is not directly applicable to the new site due to the difference between the contexts of the two Web sites. We refer to such knowledge as weak extraction knowledge. The second characteristic of IEKA is that it makes use of the items previously extracted or collected from the source site. These items can contribute to deriving training examples for the new unseen site. One major difference between this kind of training example and ordinary training examples is that the former only contains information about the item content, while the latter contains information about both the content and context of the Web pages. We call this property partially specified. Based on the weak extraction knowledge and the partially specified training examples, IEKA first derives the site-invariant features that remain largely unchanged across different sites. For example, one kind of site-invariant feature consists of patterns, such as capitalization, over the attribute values. Another kind of site-invariant feature is the orthographic information of the attributes. Next, a set of training example candidates is selected by analyzing the DOM structures of the Web documents of the new unseen site based on an information-theoretic approach.
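One way to realize such orthographic, site-invariant features is to map each token of an item value to a coarse character class. The sketch below is a minimal illustration under our own naming; the actual feature set used by IEKA is richer than this.

```python
import re

def orthographic_pattern(text):
    """Map each token of an item value to a coarse orthographic class,
    a simple form of site-invariant (content) feature."""
    classes = []
    for tok in text.split():
        if re.fullmatch(r"[A-Z][a-z]+", tok):
            classes.append("CAPITALIZED")
        elif re.fullmatch(r"[A-Z]+", tok):
            classes.append("ALL_CAPS")
        elif re.fullmatch(r"\d+(\.\d+)?", tok):
            classes.append("NUMBER")
        elif tok[:1].isupper():
            classes.append("STARTS_CAP")
        else:
            classes.append("OTHER")
    return classes

# The same pattern is produced for this attribute value on any site
# that displays it, which is what makes the feature site-invariant.
print(orthographic_pattern("Francesco Balena"))   # ['CAPITALIZED', 'CAPITALIZED']
```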
Machine learning methods are then employed to automatically discover some machine-labeled training examples from the set of candidates, based on the site-invariant features. Table III depicts samples of the automatically discovered machine-labeled training examples from the new unseen site shown in Figure 2. Both site-invariant and site-dependent features of the machine-labeled

Table II. A Sample of a Learned Extraction Rule for the Book Title for the Web Page Shown in Figure 1
  Left pattern component:
    Scan Until(<HTML TABLE>, SEMANTIC),
    Scan Until(<FONT FACE= Verdana, Helvetica, Arial size= 2 >, TOKEN),
    Scan Until(<A>, SEMANTIC),
    Scan Until(<B>, TOKEN).
  Target pattern component:
    Contain(<WORD>),
    Contain(<WORD START WITH CAPITAL>)
  Right pattern component:
    Scan Until(</B>, TOKEN),
    Scan Until(</A>, TOKEN),
    Scan Until(<BR>, TOKEN),
    Scan Until(<ANY>, SEMANTIC).

Table III. Samples of Machine-Labeled Training Examples Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 to the Web Site Shown in Figure 2 Using Our IEKA Framework
  Example 1:
    Book Title: C++ Weekend Crash Course, 2nd edition
    Title: Stephen Randy Davis
    Final Price:
  Example 2:
    Author: Steve Oualline
    Final Price:

training examples will then be considered in the learning of the new wrapper for the new target site. The newly discovered hierarchical record structure for the new site is the same as the one shown in Figure 4. Table IV shows the set of adapted extraction rules for the book title. The newly learned wrapper can be applied to extract items from the Web pages of this new site.

3. RELATED WORK

Research efforts on information extraction from various kinds of textual documents, ranging from free texts to structured documents, have been investigated [Chawathe et al. 1994; Srihari and Li 1999]. Among the different extraction approaches, a wrapper is a common technique for extracting information from semi-structured documents such as Web pages [Kushmerick and Thomas 2002]. In the past few years, many wrapper learning systems that aim at constructing wrappers by learning from a set of training examples have been proposed [Blei et al. 2002; Ciravegna 2001; Cohen et al. 2002; Freitag and McCallum 1999; Hogue and Karger 2005; Hsu and Dung 1998; Kushmerick 2000a; Lin and Lam 2000; Muslea et al.
2001; Soderland 1999]. These approaches can automatically learn wrappers from a set of training examples, and the learned wrappers can effectively extract items from the corresponding Web sites. However, they suffer from two common drawbacks. First, when the layout of a Web site changes, the learned wrapper typically becomes obsolete and useless. This is referred to as the wrapper maintenance problem. Second, the learned wrapper can only be applied to the Web site from which the training examples come. In order to learn a wrapper

Table IV. The Set of Extraction Rules for Extracting the Book Title from the Web Page Shown in Figure 2 Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 Using Our IEKA Framework
  Rule 1:
    Left pattern component:
      Scan Until(<br>, TOKEN),
      Scan Until(<b>, TOKEN),
      Scan Until(<font face= Verdana, Arial, Helvetica, sans-serif size= 2 color= # >, TOKEN),
      Scan Until(<A>, SEMANTIC).
    Target pattern component:
      Contain(<WORD>),
      Contain(<WORD START WITH CAPITAL>)
    Right pattern component:
      Scan Until(</a>, TOKEN),
      Scan Until(</font>, TOKEN),
      Scan Until(<img src= images/20 circle.gif width= 33 height= 33 alt= 20% off border= 0 align= right >, TOKEN),
      Scan Until(<br>, TOKEN).
  Rule 2:
    Left pattern component:
      Scan Until(<br>, TOKEN),
      Scan Until(<b>, TOKEN),
      Scan Until(<font face= Verdana, Arial, Helvetica, sans-serif size= 2 color= # >, TOKEN),
      Scan Until(<A>, SEMANTIC).
    Target pattern component:
      Contain(<WORD>),
      Contain(<WORD START WITH CAPITAL>)
    Right pattern component:
      Scan Until(</a>, TOKEN),
      Scan Until(</font>, TOKEN),
      Scan Until(<br>, TOKEN),
      Scan Until(</b>, TOKEN).

for a different Web site, separate manual effort is required to prepare a new set of training examples. Wrapper maintenance aims at relearning the wrapper if it is found to be no longer applicable. Several approaches have been developed to address the wrapper maintenance problem. RAPTURE [Kushmerick 2000b] was developed to verify the validity of a wrapper using a regression technique. A probabilistic model is built from the items extracted while the wrapper is known to operate correctly. After the system operates for a period of time, the items extracted are compared against the model. If the extracted items are found to be largely different, the wrapper is believed to be invalid and needs to be relearned. However, this can only partially solve the wrapper maintenance problem since it cannot learn a new wrapper automatically.
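This style of verification can be sketched as follows; the snippet is our own toy illustration of the general idea (modeling a numeric feature of known-good extractions and flagging large deviations), not RAPTURE's actual probabilistic model, and the function name and threshold are assumptions.

```python
from statistics import mean, stdev

def wrapper_seems_valid(reference_items, new_items, z_threshold=3.0):
    """Model one numeric feature (token count) of items extracted while
    the wrapper was known to work; report the wrapper as suspect when
    newly extracted items deviate far from that model."""
    ref_lengths = [len(item.split()) for item in reference_items]
    mu, sigma = mean(ref_lengths), stdev(ref_lengths)
    new_mu = mean(len(item.split()) for item in new_items)
    # guard against zero variance in the reference sample
    return abs(new_mu - mu) <= z_threshold * max(sigma, 1e-9)
```

If a site redesign makes the wrapper start capturing whole page regions instead of short titles, the token counts shift sharply and the check fails, signaling that relearning is needed.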
Lerman et al. [2003] developed the DataPro algorithm to address this problem. It learns patterns from the extracted items. For example, <ALPHA UPPER>, which represents a word containing only alphabetic characters followed by another word starting with a capital letter, is one of the patterns learned from business names such as Cajun Kitchen. When the layout of

the Web site is changed, the DataPro algorithm will automatically label a new set of training examples by matching the learned patterns in the new Web page. The patterns are mainly composed of display format information, such as the lower case and upper case of the items. However, it is doubtful that the items have the same display format in the old and new layouts of the Web site. Several approaches have been designed to reduce the human effort in preparing training examples. These approaches have an objective similar to wrapper adaptation. Bootstrapping algorithms [Ghani and Jones 2002; Riloff and Jones 1999] are well-known methods for reducing the number of training examples. They normally initiate the training process with a set of seed words and incorporate the unlabeled examples in the training phase. However, bootstrapping algorithms assume that those seed words must be present in the training data, which can lead to ineffective training. For example, the word Shakespeare may appear in the title, or as the author, of a book. 3 DIPRE [Brin 1998] attempts to find the occurrence of some concept pairs such as title/author in the documents to obtain training examples by finding text fragments exactly matching the user inputs. Once sufficient training examples are obtained, it learns extraction patterns from these training examples. DIPRE can reduce the effect of incorrect initiation in bootstrapping. However, it can only work on site-independent concept pairs such as title/author. It cannot extract site-dependent concept pairs such as title/price. The reason is that it assumes that the prices of a particular book are the same in different Web sites and that the prices from different sites are known in advance. Moreover, quite a number of concept pairs need to be prepared in advance in order to obtain sufficient training examples. Cotesting [Muslea et al.
2000] is a semi-automatic approach for reducing the number of examples in the training phase. The idea of cotesting is to learn different wrappers from a few labeled training examples. One may learn a wrapper by processing the Web page forward and learn another by processing the same Web page backward. These wrappers are then applied to the unlabeled examples. If the wrappers label the examples differently, users are asked to manually label those inconsistent examples. The newly labeled examples are then added to the training set and the process iterates until convergence. However, such an active learning approach can only partially reduce the human work. ROADRUNNER [Crescenzi et al. 2001], DeLa [Wang and Lochovsky 2003], and MDR [Liu et al. 2003] are approaches developed for completely eliminating the human effort in extracting items from Web sites. The idea of ROADRUNNER is to compare the similarities and differences of the Web pages. If two different strings occur in the same corresponding positions of two Web pages, they are believed to be items to be extracted. DeLa discovers repeated patterns of the HTML tags within a Web page and expresses these repeated patterns as regular expressions. The items are then extracted in a table format by parsing the Web page against the discovered regular expressions. MDR first discovers the data regions in the Web page by building the HTML tag tree and making use of

3 The word Shakespeare appears in the title of the book Shakespeare by Michael Wood and as the author of the book Romeo And Juliet by William Shakespeare.

string comparison techniques. The data records in each data region are extracted by applying heuristic knowledge of how people commonly present data objects in Web pages. These three approaches do not require any human involvement in training and extraction. However, they suffer from one common shortcoming. They do not consider the type of information extracted, and hence the items extracted by these systems require human effort to interpret their meaning. For example, if the extracted string is Shakespeare, it is not known whether this string refers to a book title or a book author. Wrapper adaptation aims at automatically adapting previously learned extraction knowledge to a new unseen site in the same domain. This can significantly reduce the human work in labeling training examples for learning wrappers. In principle, wrapper adaptation can solve the wrapper maintenance problem. It can also be applied to other intelligent tasks [Lam et al. 2003; Wong and Lam 2005]. Golgher and da Silva [2001] proposed to solve the wrapper adaptation problem by applying a bootstrapping technique and a query-like approach. This approach searches for exact matches of items in the new unseen Web page. However, their approach shares the same shortcomings as bootstrapping. In essence, it assumes that the seed words, which refer to the elements in the source repository in their framework, must appear in the new Web page. Cohen and Fan [1999] designed a method for learning page-independent heuristics for extracting items from Web pages. Their approach is able to extract items in different domains. However, a major disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules. KNOWITALL [Etzioni et al. 2005] is a domain-independent information extraction system.
Its idea is to make use of online search engines and to bootstrap from a set of domain-independent and generic patterns from the Web. It can extract the relation between instances and classes, and the relation between superclasses and subclasses. However, one limitation of KNOWITALL is that the proposed generic patterns cannot solve the multislot extraction problem, which aims at extracting records containing one or more attribute items. The machine-labeled training example discovery component of our proposed framework is related to the research area of object identification, or duplicate detection, which aims at identifying matching objects from different information sources. Tejada et al. [2001, 2002] developed a system called Active Atlas to solve the object identification problem. They designed a method for learning the weights of different string transformations. The identification of matching objects is then achieved by computing the similarity score between the attributes of the objects. MARLIN [Bilenko and Mooney 2003] is another object identification system, based on a generative model for computing the string distance with affine gaps; it applies an SVM to compute the vector-space similarity between strings. Cohen [1999] defined the similarity join of tables in a database containing free text data. The database may be constructed by extracting data from Web sites. The idea is to consider the importance of the terms contained in the attributes and compute the cosine similarity between the attributes of the tuples. However, one major difference between our machine-labeled training example discovery and these object identification methods is that

Fig. 6. Dependence model of text data for Web sites for a particular domain: content knowledge α (domain-dependent and site-invariant) contains item knowledge β; context knowledge γ is site-dependent; a Web page exhibits site-invariant features f_I and site-dependent features f_D.

machine-labeled training example discovery identifies the text fragments, which likely belong to the items of interest, within a Web page collected from the new unseen site, while object identification determines the similarity between records that are obtained or extracted in advance. Moreover, the goal of machine-labeled training example discovery is to identify the text fragments belonging to the items of interest, not to integrate information from different information sources. The techniques used in object identification are not applicable since it is common that the source Web site and the new unseen site do not contain shared records.

4. OVERVIEW OF IEKA

4.1 Dependence Model

Our proposed adaptation framework is called IEKA (Information Extraction Knowledge Adaptation). It is designed based on a dependence model of the text data contained in Web sites. Figure 6 shows the dependence model for a particular domain. Typically, there are different Web sites containing data records. Within a particular Web site, there is a set of Web pages containing data items. For example, in the book domain, there are many bookstore Web sites. Each of these Web sites contains a set of Web pages, and each page displays items such as title, authors, price, and so on. Sometimes, a Web page is obtained by supplying a keyword to the internal search engine provided by the Web site. Associated with each domain, there exists some content knowledge denoted by α. This content knowledge contains the general information about the data items of this domain.
For example, in the book domain, α refers to the knowledge that each book consists of items such as title, authors, and price. Within α, there is more specific knowledge, called item knowledge, associated with the items to be extracted. For instance, the item title is associated with particular item knowledge denoted by β, which refers to knowledge about the title: for example, a title normally consists of a few words, and some of the words may start with a capital letter. It is obvious that α and β are domain-dependent. For example, the knowledge for the book domain and for the consumer electronics appliance domain are different. α and β are also regarded as site-invariant since

such knowledge does not change across different Web sites. There is another kind of knowledge, called context knowledge, denoted by γ. Context knowledge refers to context information such as the layout format of the Web sites. Different Web sites have different contexts γ. For example, in the book domain, the book title is displayed after the token Title: in one Web site, whereas the book title is displayed at the beginning of a line in another. In a particular Web page, we differentiate two types of features. The first type is called the site-invariant feature, denoted by f_I. f_I is mainly related to the item content within the Web page and is dependent on α and β. For example, f_I can represent the text fragments regarding the title of a book. Due to the dependence on α and β, f_I remains largely unchanged in Web pages from different Web sites in the same domain. The other type of feature is called the site-dependent feature, denoted by f_D. For example, f_D can represent the text fragments regarding the layout format of the title of a book in a Web page. Specifically, the titles of the books shown in Figure 1 are bolded and underlined. f_D is dependent on the context knowledge γ associated with a particular Web site. f_D is also dependent on β because each item may have a different context. As the context knowledge γ of different Web sites is different, the resulting f_D are also different for Web pages collected from different sites. However, the f_D of Web pages originating from the same site are likely unchanged because they depend on the same γ. In wrapper induction, we attempt to learn the wrapper by manually annotating some training examples in the Web site. These training examples consist of the site-invariant features and the site-dependent features of the Web pages. Wrapper induction is a process of learning information extraction knowledge from the site-invariant and site-dependent features of the pages from the Web site.
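The two feature types of the dependence model can be made concrete with a small sketch. The field names and the featurizer below are our own hypothetical illustration, not the exact representation used inside IEKA.

```python
from dataclasses import dataclass

@dataclass
class FragmentFeatures:
    site_invariant: dict  # f_I: content features, tied to alpha and beta
    site_dependent: dict  # f_D: context features, tied to gamma

def featurize(fragment, left_context, right_context):
    """Split a candidate text fragment's evidence into content features,
    which carry over to other sites in the same domain, and context
    features (surrounding mark-up), which do not."""
    tokens = fragment.split()
    return FragmentFeatures(
        site_invariant={
            "num_tokens": len(tokens),
            "frac_capitalized": sum(t[:1].isupper() for t in tokens) / len(tokens),
        },
        site_dependent={"left_tag": left_context, "right_tag": right_context},
    )

# The same title on two differently formatted sites shares f_I but differs in f_D.
a = featurize("C++ Weekend Crash Course", "<B>", "</B>")
b = featurize("C++ Weekend Crash Course", "<font size= 2 >", "</font>")
```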
The learned wrapper can effectively extract information from the other pages of the same Web site because the site-invariant and site-dependent features of these Web pages depend on the same α and γ respectively. However, the wrapper learned from a source Web site cannot be directly applied to a new unseen Web site because the site-dependent features of the Web pages in the new unseen site depend on a different γ.

4.2 IEKA Framework Description

Our IEKA framework makes use of the site-invariant features as clues to solve the wrapper adaptation problem. IEKA first identifies the site-invariant features of the Web pages of the new unseen site. This is achieved by exploiting two pieces of information from the source Web site to derive the site-invariant features. The first piece of information is the extraction knowledge contained in the previously learned wrapper. The other piece of information is the items collected or extracted from the source Web site. To perform information extraction for a new Web site, the existing extraction knowledge contained in the previously learned wrapper is useful since the site-invariant features are likely still applicable. However, the site-dependent features cannot be used since they are different in the new site. As mentioned in Section 2, we call such knowledge weak extraction knowledge. The items previously extracted or collected from the source Web site embody rich information about the item

Fig. 7. The major stages of IEKA.

content. For example, these extracted items contain some characteristics and orthographic information about the item content. These items can be viewed as training examples for the new site. However, they are different from ordinary training examples because the former only contain information about the site-invariant features, while the latter contain information about both the site-invariant and the site-dependent features. As mentioned in Section 2, we call this property partially specified. By deriving the site-invariant features from the weak extraction knowledge and the partially specified training examples, IEKA employs machine-learning methods to automatically discover some training examples from the new Web site. These newly discovered training examples are called machine-labeled training examples. The next step is to analyze both the site-invariant and the site-dependent features of those machine-labeled training examples of the new site. IEKA then learns the new information extraction knowledge tailored to the new site using a wrapper learning component. Figure 7 depicts the major stages of our IEKA framework. IEKA consists of three stages employing machine-learning methods to tackle the adaptation problem. The first stage of IEKA is the potential training text fragment identification.
In this stage, we employ an information-theoretic approach to analyze the DOM structures of the Web pages of the unseen Web site. The informative nodes in the DOM structure can be effectively identified. Next, the weak extraction knowledge contained in the wrapper from the source site is utilized to identify appropriate text fragments in these informative nodes as the potential training text fragments for the new unseen site. This stage considers the site-dependent features of the Web pages as discussed above. Some auxiliary example pages are automatically fetched for the analysis of the site-dependent features. A modified K-nearest neighbours classification model is developed for effectively identifying the potential training text fragments. The second stage is the machine-labeled training example discovery. It aims at scoring the potential training text fragments. The good potential training text fragments will become the machine-labeled training examples for learning the new wrapper for the new site. This stage considers the site-invariant features of the partially specified training examples. An automatic text

Fig. 8. The hierarchical record structure for the book information shown in Figure 2.

fragment-classification model is developed to score the potential training text fragments. The classification model consists of two components. The first component is the content classification component. It considers several features to characterize the item content. The second component is the approximate matching component, which analyzes the orthographic information of the potential training text fragments. In the third stage, based on the automatically generated machine-labeled training examples, a new wrapper for the new Web site is learned using the wrapper learning component. The wrapper learning component in IEKA is derived from our previous work [Lin and Lam 2000], a brief summary of which is given in the following.

4.3 Wrapper Learning Component

A wrapper learning component discovers information extraction knowledge from training text fragments. We employ a wrapper learning algorithm called HISER, described in our previous work [Lin and Lam 2000]. In this article, we only present a brief summary of HISER. HISER is a two-stage learning algorithm. The first stage induces a hierarchical representation for the structure of the records. This hierarchical record structure is a tree-like structure that models the relationships between the items of the records. It can model records with missing items, multi-valued items, and items arranged in unrestricted order. For example, Figure 8 depicts a sample hierarchical record structure representing the records in the Web site shown in Figure 2. The record structure in this example contains a book title, a list of authors, and a price. The price consists of a list price and a final price. There is no restriction on the order of the nodes under the same parent. A record can also have any item missing.
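As an illustration, the hierarchical record structure of Figure 8 can be represented with a small tree data structure. The class and function names below are ours, not HISER's; this is only a sketch of the representation, not of the induction algorithm:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node in a hierarchical record structure.

    kind is 'item' for leaf items, 'group' for internal nodes whose
    children are unordered and possibly missing, and 'repetition' for
    the special node modelling multi-valued items.
    """
    name: str
    kind: str = "item"
    children: List["Node"] = field(default_factory=list)

# The record structure of Figure 8: a book title, a repeated author,
# and a price made up of a list price and a final price.
record = Node("root", "group", [
    Node("book_title"),
    Node("repetition", "repetition", [Node("author")]),
    Node("price", "group", [Node("list_price"), Node("final_price")]),
])

def leaf_items(node):
    """Collect the names of all leaf item nodes, left to right."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_items(child)]

print(leaf_items(record))  # ['book_title', 'author', 'list_price', 'final_price']
```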
The multiple-occurrence property of author is modeled by a special internal node called repetition. Each node in the hierarchical record structure is associated with a set of extraction rules. These extraction rules are automatically learned in the second stage of HISER. An extraction rule consists of three parts: the left pattern component, the right pattern component, and the target pattern component. Table V depicts one of the extraction rules for the final price for the Web document in Figure 2. Both the left and right pattern components make use

Table V. A Sample of an Extraction Rule for the Final Price for the Web Document Shown in Figure 2.

Left pattern component: Scan Until(Our, TOKEN), Scan Until(Price, TOKEN), Scan Until(:, TOKEN), Scan Until(<HTML_IMG_TAG>, SEMANTIC).
Target pattern component: Contain(<FLOAT>)
Right pattern component: Scan Until(, TOKEN), Scan Until(, TOKEN), Scan Until(</b>, TOKEN), Scan Until(<HTML_FONT_TAG>, SEMANTIC).

Fig. 9. Examples of semantic classes organized in a hierarchy.

of a token scanning instruction, Scan Until(), to identify the left and right delimiters of the item. The token scanning instruction instructs the wrapper to scan and consume tokens until a particular matching token is found. The argument of the instruction can be a token or a semantic class. The target pattern component makes use of an instruction, Contain(), to represent the semantic class of the item content. An extraction rule-learning algorithm is developed based on a covering-based learning algorithm. HISER first tokenizes the Web document into a sequence of tokens. A token can be a word, number, punctuation mark, date, HTML tag, specific ASCII character sequence such as &nbsp; which represents a space in HTML documents, or domain-specific content such as manufacturer names. Each token is associated with a set of semantic classes organized in a hierarchy. For example, Figure 9 depicts the semantic class hierarchy for the following text fragment from Figure 5 after tokenization.

Our Price: <img src="images/arrow.gif" width="10" height="8" hspace="5">39.99 </b></font>

HISER learns the extraction rules by performing lexical and semantic

generalization until effective extraction rules are discovered. The details of HISER can be found in our previous work [Lin and Lam 2000].

5. POTENTIAL TRAINING TEXT FRAGMENT IDENTIFICATION

In our IEKA framework, the first stage is the potential training text fragment identification component. This stage shares some resemblance with the research area of object identification, or duplicate detection, which aims at identifying matching objects from different information sources [Bilenko and Mooney 2003; Cohen 1999; Tejada et al. 2001, 2002]. However, it differs from object identification in three aspects. The first aspect is that IEKA identifies text fragments within a Web page collected from the new unseen site; in contrast, object identification determines the similarity between records that have been obtained or extracted in advance. The second aspect is that IEKA identifies the text fragments belonging to the items of interest in the new site, while the aim of object identification is to integrate data objects from different information sources. The third aspect is that the source Web site and the new unseen site may not contain any common object. For instance, the object identification task determines whether the records Art's Deli and Art's Delicatessen, collected from two different restaurant information sources, refer to the same restaurant [Tejada et al. 2002]. These two records are stored in a database in advance. In contrast, our approach identifies the text fragment Practical C++ Programming, 2nd Edition, which is a substring of the entire HTML text document, in the Web page shown in Figure 2. Moreover, the source site may not simultaneously display this book in its Web pages. Therefore, the techniques developed for object identification are not applicable.
In our IEKA framework, potential training text fragments refer to the text fragments, collected from the Web pages of the new unseen Web site, that likely belong to one of the items of interest. Notice that the potential training text fragments identified at this stage are not yet classified as any particular item of interest. In the next stage, some of the potential training text fragments will be classified as different items and used as the machine-labeled training examples for learning the new wrapper for the new site in the last stage of IEKA. The idea of this stage is to analyze the site-dependent features and the site-invariant features of the new site. The DOM structure representation of the Web pages is utilized to identify the useful text fragments in the new site. A modified K-nearest neighbours method is employed to select the potential training text fragments.

5.1 Auxiliary Example Pages

IEKA automatically generates some machine-labeled training examples in one of the Web pages of the new unseen Web site. We call the Web page from which the machine-labeled training examples are to be automatically collected the main example page M. Relative to a main example page, auxiliary example pages A(M) are Web pages from the same Web site that contain different categories of item contents. For example, in the book domain, M may contain items about

Fig. 10. A portion of a sample Web page about networking books.

programming books, while A(M) may contain items about networking books. Note that the main and auxiliary example pages are collected from the same site, and hence the site-dependent features f_D of these Web pages depend on the same context knowledge γ as described in Section 4.1. As the main example page and the auxiliary example pages contain different item contents, the text fragments regarding the item content differ across these Web pages, while the text fragments regarding the layout format are very similar. This observation gives a good indication for locating the potential training text fragments. Auxiliary example pages can easily be obtained automatically from different pages of a Web site. One typical method is to automatically supply different keywords or queries to the internal search engine provided by the Web site. For instance, consider the book catalog associated with the Web page shown in Figure 2. This Web page is generated by automatically supplying the keyword PROGRAMMING to the search engine provided by the Web site. If a different keyword such as NETWORKING is supplied to the search engine, a new Web page, as shown in Figure 10, is returned. Only a few keywords are needed for a domain and they can easily be chosen in advance. The Web page in Figure 10 can be regarded as an auxiliary example page relative to the Web page in Figure 2. Figures 5 and 11 show excerpts of the HTML text documents associated with the Web pages shown in Figures 2 and 10 respectively. The bolded text fragments are related to the item content, while the remaining text fragments are related to the layout format. The text fragments related to the item content are very different in different Web pages, whereas the text fragments related to the layout format are very similar.
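The keyword-query procedure for collecting a main example page and its auxiliary pages can be sketched as follows. The endpoint URL and the query parameter name here are hypothetical; a real site's search interface must be inspected first, and the fetching itself is left out:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually fetch the pages

SEARCH_URL = "http://books.example.com/search"   # hypothetical endpoint
KEYWORDS = ["PROGRAMMING", "NETWORKING", "DATABASE"]  # chosen in advance per domain

def query_urls(base, keywords, param="q"):
    """Build one search URL per keyword. The page returned for the first
    keyword serves as the main example page M; the others serve as the
    auxiliary example pages A(M)."""
    return [base + "?" + urlencode({param: kw}) for kw in keywords]

urls = query_urls(SEARCH_URL, KEYWORDS)
main_page_url, auxiliary_urls = urls[0], urls[1:]
print(main_page_url)  # http://books.example.com/search?q=PROGRAMMING
```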

Fig. 11. An excerpt of the HTML texts for the Web page shown in Figure 10.

Fig. 12. Part of the DOM structure representation for the Web page shown in Figure 2.

5.2 DOM Structure Analysis

A Web page can be represented by a DOM (Document Object Model)4 structure. A DOM structure is an ordered tree consisting of two types of nodes. The first type of node is called an element node, which is used to represent HTML tag information. These nodes are labeled with the element name, such as <table>, <a>, and so on. The other type of node is called a text node, which holds the text displayed in the browser and is labeled simply with the corresponding text. Figure 12 shows part of the DOM structure representation for the Web page shown in Figure 2. We develop an algorithm that can effectively locate the informative text nodes in the DOM structure. For each of the text nodes in the DOM structure, we define the path as the string created by concatenating the node labels from the first ancestor to the n-th ancestor, where n is a predefined value. For example, as shown in Figure 12, the paths for the text nodes labeled Published: and List Price: are both equal to <table> <tr> <td> <b>, and the path for the text node labeled Microsoft Excel 2003 Programming Inside Out is <td> <b> <font> <a> when n is set to 4. Note that each path may locate more than one text node in the DOM structure. We define the probability that

4 The details of the Document Object Model can be found at the W3C Web site.

Fig. 13. An outline of the DOM structure path-finding algorithm.

the term w_i occurs in the text nodes located by the path p as:

P(w_i, p) = N(w_i, p) / Σ_j N(w_j, p)

where N(w_i, p) is the number of occurrences of w_i in all the text nodes located by p. Next, we define the path entropy, E(p), as follows:

E(p) = −Σ_i P(w_i, p) log P(w_i, p).    (1)

Note that E(p) can be calculated from more than one DOM structure by treating all the DOM structures as a forest; each P(w_i, p) is then calculated by considering all the text nodes located by p in the forest. Figure 13 shows an outline of our path-finding algorithm. The objective of this algorithm is to identify the paths that can locate informative text nodes in the DOM structure. It first creates the DOM structures for the main example page M and all the auxiliary example pages A(M). Next, all the paths in the DOM structure dom_M for the main example page M are identified. For each of these paths, E(p) is calculated twice: once from dom_M alone, and once from the forest consisting of dom_M and the DOM structures of the auxiliary example pages. If the entropy calculated from the forest exceeds the entropy calculated from dom_M alone by a threshold δ, the path is included in the returned path set. The rationale of Step 8 in the algorithm is that entropy is a measure of the randomness of a distribution. Recall that the main and auxiliary example pages consist of different site-invariant features. If the underlying path locates text nodes consisting of site-invariant features, the term distribution under this path becomes more complex when more pages are considered. On the other hand, if the underlying path locates only text nodes corresponding to site-dependent features, the term distribution under this path will likely remain unchanged when more pages are considered, because the site-dependent features largely remain unchanged across the Web pages of a Web site. Hence, the returned path set will contain the paths that locate text nodes with complex term distributions, which highly likely consist of site-invariant features.
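A minimal sketch of the path definition of Section 5.2 together with the entropy test of Equation 1 and Step 8 follows. The helper names and the threshold value are illustrative, and a real implementation would parse HTML pages rather than the tiny well-formed fragments used here:

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def text_node_paths(root, n=4):
    """Map each text-node path (up to the n nearest ancestors, written
    outermost first as in the paper) to the texts of the nodes it locates."""
    paths = {}
    def walk(elem, ancestors):
        chain = [elem.tag] + ancestors          # nearest ancestor first
        if elem.text and elem.text.strip():
            key = " ".join(f"<{t}>" for t in reversed(chain[:n]))
            paths.setdefault(key, []).append(elem.text.strip())
        for child in elem:
            walk(child, chain)
    walk(root, [])
    return paths

def path_entropy(texts):
    """Equation 1: entropy of the term distribution under one path."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def informative_paths(main_page, aux_pages, n=4, delta=0.1):
    """Step 8: keep the paths of the main page whose entropy rises by more
    than delta when the auxiliary pages join the forest -- their text varies
    across pages, so they likely hold site-invariant (item-content) features."""
    main = text_node_paths(main_page, n)
    forest = {}
    for page in aux_pages:
        for p, texts in text_node_paths(page, n).items():
            forest.setdefault(p, []).extend(texts)
    return [p for p, texts in main.items()
            if path_entropy(texts + forest.get(p, [])) - path_entropy(texts) > delta]

# Tiny stand-ins for the main (programming) and auxiliary (networking) pages.
main_page = ET.fromstring("<td><b>List Price:</b><a>Practical C Programming</a></td>")
aux_page = ET.fromstring("<td><b>List Price:</b><a>Home Networking Basics</a></td>")
print(informative_paths(main_page, [aux_page], n=2))  # ['<td> <a>']
```

The layout path <td> <b> is rejected because its term distribution (List Price:) is identical across the two pages, while the item-content path <td> <a> gains new terms and is kept.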
For example, the path <td><b><font><a> is one of the paths returned by our path-finding algorithm for the Web page shown in Figure 2.

5.3 Modified K-Nearest Neighbour Classification Model

The text fragments within the text nodes located by the paths returned by the above algorithm become the useful text fragments. Although the paths found by our algorithm can effectively identify useful text nodes containing site-invariant features, these paths may also incorrectly locate some other text nodes, because each path may locate more than one text node at the same time. We develop a modified K-nearest neighbours classification model for filtering out these incorrect text fragments. Recall that the previously learned wrapper from the source Web site consists of the left pattern component, the right pattern component, and the target pattern component. This wrapper is not fully applicable in the new unseen target site due to the difference between the site-dependent features f_D of the source site and the target site. However, the target pattern component, which contains the semantic classes of the items and is regarded as the weak extraction knowledge of the source site, can be utilized for discovering useful text fragments in the new target site. Based on the weak extraction knowledge, we can obtain the set UTF(M) from the main example page M of the new target site. UTF(M) is the set of useful text fragments in M such that each text fragment contains the same set of semantic classes as the one contained in the target pattern component of the previously learned wrapper. From an auxiliary example page A(M), we can also obtain the set UTF(A(M)). As explained in Section 5.1, the text fragments regarding the item content in the main example page are less likely to appear in the auxiliary example pages, while the text fragments regarding the layout format will probably appear in both the main example page and the auxiliary example pages.
Note that our objective is to retain the text fragments corresponding to the site-invariant features in M. Hence, all the elements in UTF(A(M)) are treated as negative instances. Each instance in the modified K-nearest neighbours classification model is represented by a set t_i containing the unique words in the text fragment. Suppose we have two text fragments t_1 and t_2. We define the similarity between these two text fragments, sim(t_1, t_2), as follows(5):

sim(t_1, t_2) = |t_1 ∩ t_2| / max(|t_1|, |t_2|)    (2)

where t_1 ∩ t_2 denotes the intersection of the sets t_1 and t_2, and |t| denotes the number of elements in the set t. Some existing methods for object identification make use of the term frequency-inverse document frequency (TF-IDF) method to assign weights to

5 We have also tried different similarity measurements, such as cosine similarity. We found that the similarity measurement described in Equation 2 has slightly better performance.
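The similarity measure of Equation 2 can be sketched as below, together with a simplified stand-in for the filtering step: a max-similarity threshold against the negative instances rather than the full modified K-nearest neighbours vote, whose details follow in the text. The threshold value is illustrative:

```python
def sim(t1, t2):
    """Equation 2: word-set overlap normalised by the larger set."""
    s1, s2 = set(t1.split()), set(t2.split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / max(len(s1), len(s2))

def filter_candidates(utf_main, utf_aux, threshold=0.5):
    """Drop a candidate from UTF(M) if it is too similar to any negative
    instance in UTF(A(M)) -- such fragments are layout text that recurs
    across pages of the site, not item content."""
    return [t for t in utf_main
            if max((sim(t, neg) for neg in utf_aux), default=0.0) < threshold]

utf_main = ["Practical C++ Programming, 2nd Edition", "List Price :"]
utf_aux = ["Home Networking Basics", "List Price :"]
print(filter_candidates(utf_main, utf_aux))
# ['Practical C++ Programming, 2nd Edition']
```

The layout fragment List Price : recurs verbatim in the auxiliary page (similarity 1), so it is discarded, while the book title shares no words with any negative instance and survives.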


More information

Web Data Extraction Using Tree Structure Algorithms A Comparison

Web Data Extraction Using Tree Structure Algorithms A Comparison Web Data Extraction Using Tree Structure Algorithms A Comparison Seema Kolkur, K.Jayamalini Abstract Nowadays, Web pages provide a large amount of structured data, which is required by many advanced applications.

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

How to Exploit Abstract User Interfaces in MARIA

How to Exploit Abstract User Interfaces in MARIA How to Exploit Abstract User Interfaces in MARIA Fabio Paternò, Carmen Santoro, Lucio Davide Spano CNR-ISTI, HIIS Laboratory Via Moruzzi 1, 56124 Pisa, Italy {fabio.paterno, carmen.santoro, lucio.davide.spano}@isti.cnr.it

More information

Extraction of Semantic Text Portion Related to Anchor Link

Extraction of Semantic Text Portion Related to Anchor Link 1834 IEICE TRANS. INF. & SYST., VOL.E89 D, NO.6 JUNE 2006 PAPER Special Section on Human Communication II Extraction of Semantic Text Portion Related to Anchor Link Bui Quang HUNG a), Masanori OTSUBO,

More information

Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching

Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching Sigit Dewanto Computer Science Departement Gadjah Mada University Yogyakarta sigitdewanto@gmail.com

More information

Semantic Annotation using Horizontal and Vertical Contexts

Semantic Annotation using Horizontal and Vertical Contexts Semantic Annotation using Horizontal and Vertical Contexts Mingcai Hong, Jie Tang, and Juanzi Li Department of Computer Science & Technology, Tsinghua University, 100084. China. {hmc, tj, ljz}@keg.cs.tsinghua.edu.cn

More information

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.1, January 2013 1 A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations Hiroyuki

More information

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Oliver Cardwell, Ramakrishnan Mukundan Department of Computer Science and Software Engineering University of Canterbury

More information

INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR SPAMMING

INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR  SPAMMING INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR EMAIL SPAMMING TAK-LAM WONG, KAI-ON CHOW, FRANZ WONG Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue,

More information

Web Data Extraction. Craig Knoblock University of Southern California. This presentation is based on slides prepared by Ion Muslea and Kristina Lerman

Web Data Extraction. Craig Knoblock University of Southern California. This presentation is based on slides prepared by Ion Muslea and Kristina Lerman Web Data Extraction Craig Knoblock University of Southern California This presentation is based on slides prepared by Ion Muslea and Kristina Lerman Extracting Data from Semistructured Sources NAME Casablanca

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN NOTES ON OBJECT-ORIENTED MODELING AND DESIGN Stephen W. Clyde Brigham Young University Provo, UT 86402 Abstract: A review of the Object Modeling Technique (OMT) is presented. OMT is an object-oriented

More information

Motivating Ontology-Driven Information Extraction

Motivating Ontology-Driven Information Extraction Motivating Ontology-Driven Information Extraction Burcu Yildiz 1 and Silvia Miksch 1, 2 1 Institute for Software Engineering and Interactive Systems, Vienna University of Technology, Vienna, Austria {yildiz,silvia}@

More information

Web Scraping Framework based on Combining Tag and Value Similarity

Web Scraping Framework based on Combining Tag and Value Similarity www.ijcsi.org 118 Web Scraping Framework based on Combining Tag and Value Similarity Shridevi Swami 1, Pujashree Vidap 2 1 Department of Computer Engineering, Pune Institute of Computer Technology, University

More information

A Review on Identifying the Main Content From Web Pages

A Review on Identifying the Main Content From Web Pages A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources

A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources Abhilasha Bhagat, ME Computer Engineering, G.H.R.I.E.T., Savitribai Phule University, pune PUNE, India Vanita Raut

More information

DATA MODELS FOR SEMISTRUCTURED DATA

DATA MODELS FOR SEMISTRUCTURED DATA Chapter 2 DATA MODELS FOR SEMISTRUCTURED DATA Traditionally, real world semantics are captured in a data model, and mapped to the database schema. The real world semantics are modeled as constraints and

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

Voice activated spell-check

Voice activated spell-check Technical Disclosure Commons Defensive Publications Series November 15, 2017 Voice activated spell-check Pedro Gonnet Victor Carbune Follow this and additional works at: http://www.tdcommons.org/dpubs_series

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Data Extraction and Alignment in Web Databases

Data Extraction and Alignment in Web Databases Data Extraction and Alignment in Web Databases Mrs K.R.Karthika M.Phil Scholar Department of Computer Science Dr N.G.P arts and science college Coimbatore,India Mr K.Kumaravel Ph.D Scholar Department of

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

Annotated Suffix Trees for Text Clustering

Annotated Suffix Trees for Text Clustering Annotated Suffix Trees for Text Clustering Ekaterina Chernyak and Dmitry Ilvovsky National Research University Higher School of Economics Moscow, Russia echernyak,dilvovsky@hse.ru Abstract. In this paper

More information

Wrappers & Information Agents. Wrapper Learning. Wrapper Induction. Example of Extraction Task. In this part of the lecture A G E N T

Wrappers & Information Agents. Wrapper Learning. Wrapper Induction. Example of Extraction Task. In this part of the lecture A G E N T Wrappers & Information Agents Wrapper Learning Craig Knoblock University of Southern California GIVE ME: Thai food < $20 A -rated A G E N T Thai < $20 A rated This presentation is based on slides prepared

More information

A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP

A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP Rini John and Sharvari S. Govilkar Department of Computer Engineering of PIIT Mumbai University, New Panvel, India ABSTRACT Webpages

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Form Identifying. Figure 1 A typical HTML form

Form Identifying. Figure 1 A typical HTML form Table of Contents Form Identifying... 2 1. Introduction... 2 2. Related work... 2 3. Basic elements in an HTML from... 3 4. Logic structure of an HTML form... 4 5. Implementation of Form Identifying...

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Data and Information Integration: Information Extraction

Data and Information Integration: Information Extraction International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Data and Information Integration: Information Extraction Varnica Verma 1 1 (Department of Computer Science Engineering, Guru Nanak

More information

Retrieving Informative Content from Web Pages with Conditional Learning of Support Vector Machines and Semantic Analysis

Retrieving Informative Content from Web Pages with Conditional Learning of Support Vector Machines and Semantic Analysis Retrieving Informative Content from Web Pages with Conditional Learning of Support Vector Machines and Semantic Analysis Piotr Ladyżyński (1) and Przemys law Grzegorzewski (1,2) (1) Faculty of Mathematics

More information

Object Extraction. Output Tagging. A Generated Wrapper

Object Extraction. Output Tagging. A Generated Wrapper Wrapping Data into XML Wei Han, David Buttler, Calton Pu Georgia Institute of Technology College of Computing Atlanta, Georgia 30332-0280 USA fweihan, buttler, calton g@cc.gatech.edu Abstract The vast

More information

Lecture 10 September 19, 2007

Lecture 10 September 19, 2007 CS 6604: Data Mining Fall 2007 Lecture 10 September 19, 2007 Lecture: Naren Ramakrishnan Scribe: Seungwon Yang 1 Overview In the previous lecture we examined the decision tree classifier and choices for

More information

AAAI 2018 Tutorial Building Knowledge Graphs. Craig Knoblock University of Southern California

AAAI 2018 Tutorial Building Knowledge Graphs. Craig Knoblock University of Southern California AAAI 2018 Tutorial Building Knowledge Graphs Craig Knoblock University of Southern California Wrappers for Web Data Extraction Extracting Data from Semistructured Sources NAME Casablanca Restaurant STREET

More information

A Simple Syntax-Directed Translator

A Simple Syntax-Directed Translator Chapter 2 A Simple Syntax-Directed Translator 1-1 Introduction The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation for it, called

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

RSDC 09: Tag Recommendation Using Keywords and Association Rules

RSDC 09: Tag Recommendation Using Keywords and Association Rules RSDC 09: Tag Recommendation Using Keywords and Association Rules Jian Wang, Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem, PA 18015 USA

More information

ISSN (Online) ISSN (Print)

ISSN (Online) ISSN (Print) Accurate Alignment of Search Result Records from Web Data Base 1Soumya Snigdha Mohapatra, 2 M.Kalyan Ram 1,2 Dept. of CSE, Aditya Engineering College, Surampalem, East Godavari, AP, India Abstract: Most

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Iterative Learning of Relation Patterns for Market Analysis with UIMA

Iterative Learning of Relation Patterns for Market Analysis with UIMA UIMA Workshop, GLDV, Tübingen, 09.04.2007 Iterative Learning of Relation Patterns for Market Analysis with UIMA Sebastian Blohm, Jürgen Umbrich, Philipp Cimiano, York Sure Universität Karlsruhe (TH), Institut

More information

Automatic Generation of Wrapper for Data Extraction from the Web

Automatic Generation of Wrapper for Data Extraction from the Web Automatic Generation of Wrapper for Data Extraction from the Web 2 Suzhi Zhang 1, 2 and Zhengding Lu 1 1 College of Computer science and Technology, Huazhong University of Science and technology, Wuhan,

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

A Document-centered Approach to a Natural Language Music Search Engine

A Document-centered Approach to a Natural Language Music Search Engine A Document-centered Approach to a Natural Language Music Search Engine Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, and Klaus Seyerlehner Dept. of Computational Perception, Johannes Kepler

More information

Application of rough ensemble classifier to web services categorization and focused crawling

Application of rough ensemble classifier to web services categorization and focused crawling With the expected growth of the number of Web services available on the web, the need for mechanisms that enable the automatic categorization to organize this vast amount of data, becomes important. A

More information

Deep Web Content Mining

Deep Web Content Mining Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased

More information

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Katsuya Masuda *, Makoto Tanji **, and Hideki Mima *** Abstract This study proposes a framework to access to the

More information

A genetic algorithm based focused Web crawler for automatic webpage classification

A genetic algorithm based focused Web crawler for automatic webpage classification A genetic algorithm based focused Web crawler for automatic webpage classification Nancy Goyal, Rajesh Bhatia, Manish Kumar Computer Science and Engineering, PEC University of Technology, Chandigarh, India

More information

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH Sai Tejaswi Dasari #1 and G K Kishore Babu *2 # Student,Cse, CIET, Lam,Guntur, India * Assistant Professort,Cse, CIET, Lam,Guntur, India Abstract-

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Cross-lingual Information Management from the Web

Cross-lingual Information Management from the Web Cross-lingual Information Management from the Web Vangelis Karkaletsis, Constantine D. Spyropoulos Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications NCSR Demokritos

More information

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION Ms. Nikita P.Katariya 1, Prof. M. S. Chaudhari 2 1 Dept. of Computer Science & Engg, P.B.C.E., Nagpur, India, nikitakatariya@yahoo.com 2 Dept.

More information

Interactive Learning of HTML Wrappers Using Attribute Classification

Interactive Learning of HTML Wrappers Using Attribute Classification Interactive Learning of HTML Wrappers Using Attribute Classification Michal Ceresna DBAI, TU Wien, Vienna, Austria ceresna@dbai.tuwien.ac.at Abstract. Reviewing the current HTML wrapping systems, it is

More information

CHAPTER-23 MINING COMPLEX TYPES OF DATA

CHAPTER-23 MINING COMPLEX TYPES OF DATA CHAPTER-23 MINING COMPLEX TYPES OF DATA 23.1 Introduction 23.2 Multidimensional Analysis and Descriptive Mining of Complex Data Objects 23.3 Generalization of Structured Data 23.4 Aggregation and Approximation

More information

A Vision Recognition Based Method for Web Data Extraction

A Vision Recognition Based Method for Web Data Extraction , pp.193-198 http://dx.doi.org/10.14257/astl.2017.143.40 A Vision Recognition Based Method for Web Data Extraction Zehuan Cai, Jin Liu, Lamei Xu, Chunyong Yin, Jin Wang College of Information Engineering,

More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title On extracting link information of relationship instances from a web site. Author(s) Naing, Myo Myo.;

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information