A Probabilistic Approach for Adapting Information Extraction Wrappers and Discovering New Attributes


Tak-Lam Wong and Wai Lam
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong, Shatin, Hong Kong

Abstract

We develop a probabilistic framework for adapting information extraction wrappers with new attribute discovery. Wrapper adaptation aims at automatically adapting a previously learned wrapper from a source Web site to a new, unseen site for information extraction. One unique characteristic of our framework is that it can discover new or previously unseen attributes, as well as their headers, from the new site. It is based on a generative model for the generation of text fragments related to attribute items and formatting data in a Web page. To solve the wrapper adaptation problem, we consider two kinds of information from the source Web site. The first kind is the extraction knowledge contained in the previously learned wrapper from the source Web site. The second kind is the previously extracted or collected items. We employ a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper to the new unseen site. To solve the new attribute discovery problem, we develop a model which analyzes the text fragments surrounding the attributes in the new unseen site. A Bayesian learning method is developed to discover the new attributes and their headers. The EM technique is employed in both Bayesian learning models. We conducted extensive experiments on a number of real-world Web sites to demonstrate the effectiveness of our framework.

1. Introduction

The tremendous amount of Web documents available on the World Wide Web provides a good source for users to access various useful information electronically. Normally, users search for information with the assistance of Web search engines.
By entering key phrases into a search engine, numerous related Web sites or Web pages will be returned. To locate the exact and precise information, human effort is required to examine each of the Web sites or Web pages. This brings the need for information extraction systems, which aim at automatically extracting precise text fragments from the pages. Another application of information extraction from Web documents is to support automated agent systems which collect precise information or data as input for conducting certain intelligent tasks, such as price comparison shopping agents [6] and automated travel assistance agents [1]. A common information extraction technique, known as wrappers, can solve the automatic extraction problem. A wrapper normally consists of a set of extraction rules which can precisely identify the text fragments to be extracted from Web pages. In the past, these extraction rules were manually constructed by humans. This manual effort is tedious, error-prone, and requires a high level of expertise. Recently, several wrapper learning approaches have been proposed for automatically learning wrappers from training examples [2, 4, 11]. Wrapper learning systems significantly reduce the amount of human effort in constructing wrappers.

(The work described in this paper was substantially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos: CUHK 4187/01E and CUHK 4179/03E) and CUHK Strategic Grant (No: ).)

Figure 1. An example of a Web page about a book catalog.

Consider the Web page shown in Figure 1, collected from a Web site under the book domain (the URL of the Web site is ). To learn the wrapper for automatically extracting information from this Web

site, one can manually provide some training examples via a simple GUI by just highlighting the appropriate text fragments. The training examples contain the basic composition of a book record. For example, a user may highlight the text fragment "Game Programming Gems 2" as the title and the text fragment "Mark Deloura" as the corresponding author. The wrapper learning system automatically learns the wrapper based on the information embedded in the training examples. Although the learned wrapper can effectively extract information from the same Web site and achieve very good performance, it typically cannot be applied to other Web sites for information extraction, even if the Web sites are in the same domain. Wrapper adaptation aims at automatically adapting a previously learned wrapper from the source Web site to new unseen sites. It can significantly reduce the human effort in preparing training examples for learning wrappers for the new unseen sites. Figure 2 depicts a Web page collected from a Web site different from the one shown in Figure 1.² Wrapper adaptation can automatically adapt the wrapper previously learned from the source Web site shown in Figure 1 to the new unseen site shown in Figure 2. The adapted wrapper can then be applied to the Web pages in the new unseen site to automatically extract the data records. The attributes extracted by the learned wrapper depend on the training examples provided. For example, for the Web site shown in Figure 1, the learned wrapper can only extract the two attributes title and author. Other attributes of the book record, such as price and publication date, cannot be extracted by the learned wrapper because these attributes are not indicated in the training examples. To make the learned wrapper able to extract the publication date of the book records, the related attribute must be provided in the training examples. In wrapper adaptation, a similar problem is encountered.
For instance, if the previously learned wrapper only contains extraction rules for the title and author from the source Web site shown in Figure 1, the adapted wrapper can at best extract the title and author from the new unseen site shown in Figure 2. However, the new unseen site may contain new attributes which are not present in the previously learned wrapper. For example, the book records in Figure 2 contain the attribute ISBN, which does not exist in the previously learned wrapper as shown in Table ??. The ISBN of the book records in the unseen sites cannot be extracted. Both wrapper induction and wrapper adaptation pose a limitation on the attributes to be extracted: the wrapper learned or adapted can only extract the pre-specified attributes. The goal of new attribute discovery is to extract new attributes that are not specified in the currently learned wrapper and also to discover the header text fragments (if any) associated with these new attributes.

² The URL of the Web site is

Figure 2. An example of a Web page about a book catalog, collected from a Web site different from the one shown in Figure 1.

"Our Price" and "Your Price" are examples of header text fragments for the attribute price in the Web sites shown in Figure 1 and Figure 2 respectively. Different Web sites may use different headers for the same attribute, but the headers normally carry some semantic meaning. By discovering the header text fragments for the new attributes, we not only discover the new attribute items, but also understand some of the semantic meaning of the newly discovered and extracted items. New attribute discovery is particularly useful when combined with wrapper adaptation, as illustrated in the above example. Some attributes in the new unseen site may not be present in the source Web site. In this case, the adapted wrapper can only extract incomplete information from the new unseen site. New attribute discovery can be applied to extract more useful information embodied in the new site.
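To make the adaptation problem concrete, the following toy sketch (not the authors' system) hard-codes delimiter-based extraction rules in the spirit of WIEN-style wrappers, tied to one site's formatting; the same rules miss on a page from a differently formatted site. The page strings are invented for illustration:

```python
import re

# A toy delimiter-based wrapper: each attribute is located by the literal
# HTML context that surrounds it on the source site.
SOURCE_WRAPPER = {
    "title": re.compile(r"<b><u>(.*?)</u></b>"),
    "author": re.compile(r"by <i>(.*?)</i>"),
}

def extract(wrapper, page):
    """Apply each attribute rule to the page; None if the rule misses."""
    return {attr: (m.group(1) if (m := rule.search(page)) else None)
            for attr, rule in wrapper.items()}

# Hypothetical source-site record: title bold+underlined, author in italics.
source_page = "<b><u>Game Programming Gems 2</u></b> by <i>Mark Deloura</i>"
# Hypothetical unseen-site record: same data, different formatting.
unseen_page = "<b>Game Programming Gems 2</b> by Mark Deloura"

print(extract(SOURCE_WRAPPER, source_page))  # both attributes found
print(extract(SOURCE_WRAPPER, unseen_page))  # both rules miss
```

Because the rules encode site-specific formatting, they extract nothing from the unseen page even though the underlying data is identical; this is the gap wrapper adaptation closes.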
Several techniques such as bootstrapping [16] and active learning [9, 15] have been developed for reducing the human effort in preparing training examples. ROADRUNNER [5], DeLa [17] and MDR [13] are approaches developed to completely eliminate human effort in extracting items from Web sites. The idea of ROADRUNNER is to compare the similarities and differences of Web pages. If two different strings occur in the same corresponding positions of two Web pages, they are believed to be items to be extracted. DeLa discovers repeated patterns of HTML tags within a Web page and expresses these repeated patterns as regular expressions. The items are then extracted into a table format by parsing the Web page against the discovered regular patterns. MDR first discovers the data regions in the Web page by building the HTML tag tree and making use of string comparison techniques. The data records in each data region are then extracted by applying some heuristic knowledge of how people commonly present data objects in Web pages. These three approaches do not require any human involvement in training and extraction. However, they suffer from one common shortcoming: they cannot differentiate the type of information extracted

and hence the items extracted by these approaches require human effort to interpret their meaning. Wrapper adaptation aims at automatically adapting previously learned extraction knowledge to a new unseen site in the same domain. This can significantly reduce the human work in labeling training examples for learning wrappers for different sites. Golgher et al. [8] proposed to solve the problem by applying a bootstrapping technique and a query-like approach. This approach searches for exact matches of items in an unseen Web page. However, their approach assumes that the seed words, which refer to the elements in the source repository in their framework, must appear in the unseen Web page. Cohen and Fan designed a method for learning page-independent heuristics for extracting items from Web pages [3]. Their approach is able to extract items in different domains. However, one major disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules. KNOWITALL [7] is a domain-independent information extraction system. Its idea is to make use of online search engines and to bootstrap from a set of domain-independent, generic patterns from the Web. However, one limitation of KNOWITALL is that the proposed generic patterns can only be applied to the free-text portions of Web pages. It is not able to extract information from semistructured text documents, which contain a mix of HTML tags and data text fragments in the Web pages. We developed a preliminary wrapper adaptation method called WrapMA in our previous work [18]. One limitation of WrapMA is that it requires human effort to scrutinize the intermediate discovered data in the adaptation process. An improvement over our previous work, called IEKA, which attempts to tackle the wrapper adaptation problem in a fully automatic manner, has also been developed [20].
One common shortcoming of WrapMA and IEKA is that the attributes to be extracted by the adapted wrapper are fixed as specified from the source Web site. The adapted wrapper cannot extract new attributes which appear in the unseen target site. In this paper, we describe a novel probabilistic framework for solving wrapper adaptation with new attribute discovery. This new approach is able to automatically adapt a previously learned wrapper from a source Web site to a new unseen site. The adapted wrapper can also extract new attribute items together with the associated header text fragments. We have conducted extensive experiments which offer very encouraging results for our wrapper adaptation with new attribute discovery approach.

2. Overview

We develop a probabilistic framework for wrapper adaptation with new attribute discovery. This framework is based on a generative model for the generation of text fragments related to attribute items and formatting data in a Web page, as depicted in Figure 3.

Figure 3. The generative model for attribute generation: α (attribute generation parameter), β (formatting feature generation parameter), γ (header generation parameter), A (attribute class), C (content feature), W (item content), F (formatting feature), H (header), S (surrounding text fragment), with N attribute items per Web page and M pages per Web site. Shaded nodes represent observable variables and unshaded nodes represent unobservable variables. Circle nodes represent site-dependent variables and oval nodes represent site-invariant variables.

In each domain, there is an attribute generation parameter α which is domain dependent and site invariant. This parameter controls the attribute classes of the items contained in the Web pages. For each of the N attribute items contained in one of the M pages under the same Web site, the attribute class, A, is assumed to be generated from the distribution P(A|α).
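As a concrete (toy) illustration of this generative story, the sketch below draws an attribute class from P(A|α) and then generates its content and formatting. All parameter tables are invented for illustration and are not the paper's learned values; note that only the formatting table differs between the two hypothetical sites:

```python
import random

random.seed(0)

# Invented parameter tables illustrating the generative model of Figure 3.
alpha = {"title": 0.5, "author": 0.3, "price": 0.2}   # P(A|alpha), site invariant
item_content = {                                      # stand-in for P(W|A), site invariant
    "title": ["Game Programming Gems 2", "Oracle PL/SQL Programming"],
    "author": ["Mark Deloura", "Scott Urman"],
    "price": ["$49.95", "$23.96"],
}
beta = {                                              # stand-in for P(F|A,beta), site DEPENDENT
    "site1": {"title": "bold+underline", "author": "plain", "price": "plain"},
    "site2": {"title": "bold", "author": "plain", "price": "red"},
}

def generate_item(site):
    """Draw one attribute item: class A, then content W and formatting F."""
    a = random.choices(list(alpha), weights=alpha.values())[0]
    w = random.choice(item_content[a])   # unchanged across sites given A
    f = beta[site][a]                    # differs across sites
    return a, w, f

for site in ("site1", "site2"):
    print(site, generate_item(site))
```

The same attribute class yields similar content on both sites but different formatting, which is exactly why content features transfer across sites while formatting features do not.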
For example, the Web site shown in Figure 1 consists of several Web pages, and each of them contains a number of attribute items. The attribute classes of the items generated are title, author, price, published year, etc. Based on the attribute class generated, the item content, W, and the content feature, C, are then generated from the distributions P(W|A) and P(C|A) respectively. W represents the orthographic information of the attribute. For example, the item content W for the title of the first record in Figure 1 is the character sequence "Game Programming Gems 2". The content feature C represents characteristics of the attribute such as the number of words starting with a capital letter. W and C are conditionally independent of each other, and both are dependent on A, which in turn is dependent on α. Therefore, W and C are site invariant: given the same domain and the same attribute class, W and C remain largely unchanged across Web pages collected from different Web sites. Within a particular Web site, there is a formatting feature generation parameter denoted by β. The formatting feature F of an attribute represents its formatting and context information, and it is generated from the distribution P(F|A, β). As F depends on β, which is site dependent, F is also site dependent, and the format of the same attribute differs across Web sites. For example, the title of the book record in Figure 1 is bolded and underlined, while the title of the book record in Figure 2 is only bolded. Within a Web site, there is also a random variable called the header (denoted by H), generated from P(H|A, γ), where γ is a binomial distribution parameter. Similar to β, γ is site dependent and hence H is also site

dependent. The surrounding text fragment of the attribute, denoted by S, is then generated from P(S|H), and S is also site dependent. For example, the text fragment "Mark Deloura" in Figure 1 is surrounded by the text fragment "Author", which is the header for the attribute author, whereas the other surrounding text fragment "List Price" is not the header. The joint probability distribution can be expressed as:

P(W, C, F, S, H, A | α, β, γ) = P(W|A) P(C|A) P(F|A, β) P(S|H) P(H|A, γ) P(A|α)    (1)

Our wrapper adaptation with new attribute discovery is a two-stage probabilistic framework based on this generative model. At the first stage, the objective is to tackle the wrapper adaptation problem, which aims at automatically adapting a previously learned wrapper from the source Web site to a new unseen site. This stage first identifies the set of useful text fragments from the new unseen Web site by considering the Document Object Model (DOM)³ representation of Web pages and designing an information-theoretic method. The related information of a useful text fragment can be represented by (W, C, F) and is observable. Next, these useful text fragments are categorized into one of the attribute classes by employing an automatic text fragment classification model. We consider the probability P(A | W, C, F), which represents the probability that the useful text fragment represented by (W, C, F) belongs to the attribute class A. Two kinds of information from the source Web site provide useful clues. The first kind of information is the knowledge contained in the previously learned wrapper, which contains rich knowledge about the semantic content of the attributes. The second kind of information is the items previously extracted or collected from the source Web site, which embody the characteristics of the attributes. These items can be used for inferring training examples for the new unseen site because they contain information about C and W of the attributes, which are site invariant.
However, they are different from ordinary training examples because they do not contain any information about F of the unseen site. We call this property partially specified. Our approach uses the partially specified training examples to estimate the probability P(A | W, C, F) for each of the useful text fragments. We then apply a Bayesian learning technique and the expectation-maximization (EM) algorithm to construct the model for finding the attribute class of each useful text fragment. This corresponds to the generative model depicted in the upper part of Figure 3. As a result, the corresponding attribute of each useful text fragment can be determined. Certain useful text fragments belong to previously unseen, or new, attribute classes. One example of such an attribute class is the ISBN attribute shown in Figure 2.

³ The details of the Document Object Model can be found in

At the second stage of our framework, we attempt to discover new or previously unseen attributes for those useful text fragments which are not associated with any known attribute after the inference process in the first stage. This is achieved by another level of Bayesian learning, analyzing the relationship between the useful text fragments and their surrounding text fragments. It corresponds to the generative model depicted in the lower part of Figure 3. Recall that S consists of the surrounding text fragments of the attributes. For example, in the first record in Figure 2, the text fragment "Scott Urman" is surrounded by "Author :" and "ISBN :", and the text fragment containing the ISBN value is surrounded by "ISBN :" and "MSRP". From the inference of our wrapper adaptation approach in the first stage, we know that the text fragment "Scott Urman" belongs to the attribute author. Suppose we also know that "Author :" is the header for the attribute author. We can then infer the relationship between the headers and the attributes in the Web site from their relative position and other characteristics of the headers.
We can then discover that it is highly probable that "ISBN :" is the header for that text fragment. To model this idea, we consider the joint probability as depicted in Equation 1 and obtain the following conditional probability:

P(H | W, C, F, S, α, β, γ) = [Σ_A P(W, C, F | A, β) P(S|H) P(H|A, γ) P(A|α)] / [Σ_H Σ_A P(W, C, F | A, β) P(S|H) P(H|A, γ) P(A|α)]    (2)

Equation 2 essentially expresses how likely it is that the surrounding text fragment S is the header of the useful text fragment represented by the tuple (W, C, F). We can apply the maximum likelihood technique and the EM algorithm to estimate the parameters in Equation 2.

3. Probabilistic Model for Wrapper Adaptation

As mentioned above, the first stage of our framework conducts wrapper adaptation. It mainly consists of two major steps. The first step is to identify a set of useful text fragments. The second step is to categorize the useful text fragments into attribute classes. The aim of the first step is to identify a set of useful training text fragments from the Web pages in the new unseen Web site. We observe that in different pages within the same Web site, the text fragments regarding the attributes are different, while the text fragments regarding the formatting or context are similar. This observation provides some clues for identifying the useful text fragments. A Web page can be represented by a DOM structure. A DOM structure is an ordered tree consisting of two types of nodes. The first type of node is called the element node, which is used to represent HTML tag information. These nodes are labeled with the element name, such as <table>, <a>, etc. The other type of node is called the text node, which includes the text

displayed in the browser, and is labeled simply with the corresponding text. We develop an information-theoretic algorithm that can effectively locate the informative text nodes in the DOM structure. These text fragments become the useful text fragments for the new unseen site. The details of this useful text fragment identification step can be found in [19].

The second step of wrapper adaptation is to categorize these useful text fragments by employing an automatic text fragment classification algorithm. We first select the useful text fragments containing the same semantic classes as the ones contained in the target pattern components of the previously learned wrapper from the source Web site. The selected useful text fragments are then categorized into different attribute classes by using a classification model. The generative model is depicted in the upper part of Figure 3. The probability for generating a particular (W, C, F, A) given the parameters α and β is expressed as:

P(W, C, F, A | α, β) = P(W|A) P(C|A) P(F|A, β) P(A|α)    (3)

The probability for generating the set of all the attributes, Φ, in a Web page is as follows:

P(Φ | α, β) = ∏_{i=1..N} P(W_i, C_i, F_i, A_i | α, β)    (4)

Combining Equations 3 and 4, we can obtain the following log likelihood function L(α, β):

L(α, β) = Σ_{i=1..N} log Σ_{A_ij ∈ A} P(W_i | A_ij) P(C_i | A_ij) P(F_i | A_ij, β) P(A_ij | α)    (5)

where A_ij means that the i-th useful text fragment belongs to the j-th attribute class. As A_ij in the above equation is an unobservable variable, we can derive the following expected log likelihood function L'(α, β):

L'(α, β) = Σ_{i=1..N} Σ_{A_ij ∈ A} P(A_ij | α) log P(W_i | A_ij) P(C_i | A_ij) P(F_i | A_ij, β)    (6)

By Jensen's inequality and the concavity of the logarithmic function, it can be proved that L(α, β) is bounded below by L'(α, β) [14, 19]. The EM algorithm is employed to increase L'(α, β) iteratively until convergence. The E-step and M-step are as follows:

E-step: P(A | W, C, F, α_t, β_t) ∝ P(W|A) P(C|A) P(F|A, β_t) P(A|α_t)
M-step: (α_{t+1}, β_{t+1}) = argmax_{α, β} L'(α, β)

To initialize the EM algorithm, we first have to estimate P(A | W, C, F, α_0, β_0). Recall that W remains largely unchanged for the attributes in Web pages collected from the same domain. Therefore, we utilize the partially specified training examples collected from the source Web site and employ a two-level edit distance approach to compute an initial approximation of this probability. The details of the two-level edit distance algorithm can be found in our previous work [20]. We define D(W, l_j^i) as the distance between the useful text fragment with item content W and the i-th partially specified training example for the j-th attribute. We then approximate P(A_j | W, C, F, α_0, β_0) by:

P(A_j | W, C, F, α_0, β_0) ≈ (1/K) max_i {D'(W, l_j^i)}    (7)

where D'(W, l_j^i) = 1 − D(W, l_j^i) and K is a normalization factor. After obtaining the parameters, we can calculate the probability P(A | W, C, F, α, β) for each useful text fragment and determine its attribute class by the following formulae:

P(A | W, C, F, α, β) = P(W, C, F | A, β) P(A|α) / Σ_A P(W, C, F | A, β) P(A|α)    (8)

Â = argmax_{A_i ∈ A} P(W, C, F | A_i, β) P(A_i | α)    (9)

For each attribute class, those useful text fragments whose probability of belonging to this attribute is higher than a certain threshold are selected as training examples for learning the new wrapper for the new unseen Web site. Users could optionally scrutinize the discovered training examples to improve their quality; however, in our experiments we did not conduct any manual intervention and the adaptation was conducted in a fully automatic way. We employ the same wrapper learning method used in the source site [12].
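The initialization step of Equation 7 can be sketched as follows. Here difflib's similarity ratio serves as a simple stand-in for the paper's two-level edit distance (so D'(W, l) = 1 − D(W, l) is approximated by the ratio directly), and the example strings are invented:

```python
from difflib import SequenceMatcher

# Hypothetical partially specified training examples (items collected from
# the source site); they carry W but no formatting F of the unseen site.
partially_specified = {
    "title": ["Game Programming Gems 2", "Learning Perl"],
    "author": ["Mark Deloura", "Randal Schwartz"],
}

def similarity(w, example):
    # Stand-in for D'(W, l_j^i) = 1 - D(W, l_j^i): 1.0 means identical.
    return SequenceMatcher(None, w.lower(), example.lower()).ratio()

def initial_posterior(w):
    """P(A_j | W, alpha_0, beta_0) ~ (1/K) * max_i D'(W, l_j^i)."""
    scores = {a: max(similarity(w, l) for l in examples)
              for a, examples in partially_specified.items()}
    k = sum(scores.values()) or 1.0   # normalization factor K
    return {a: s / k for a, s in scores.items()}

post = initial_posterior("Game Programming Gems 2nd Edition")
print(post)   # highest mass on "title"
```

A useful text fragment close (in edit distance) to a known title thus starts EM with most of its probability mass on the title class, after which the iterations refine the estimate using C and F as well.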
4. New Attribute Discovery

As described before, the goal of the second stage of our framework is to discover previously unseen attributes for those useful text fragments which are not associated with any known attribute after the inference process in the first stage. We develop a Bayesian learning model to achieve this task. The model is learned from the useful text fragments categorized in the first stage. After the model is learned, it can be applied to discover previously unseen attributes. By Bayes' theorem, P(W, C, F | A, β) P(A|α) = P(A | W, C, F) P(W, C, F). From Equation 2, we have:

P(H | W, C, F, S, α, β, γ) = [Σ_A P(A | W, C, F) P(S|H) P(H|A, γ) P(W, C, F)] / [Σ_H Σ_A P(A | W, C, F) P(S|H) P(H|A, γ) P(W, C, F)]    (10)

Since H and A are both unobservable, we can then derive the following expected log likelihood:

L''(γ) = Σ_{i=1..N} Σ_H Σ_{A_ij ∈ A} P(A_ij | W, C, F) P(H | A_ij, γ) log P(W, C, F) P(S|H)    (11)

In Equation 11, the term P(A_ij | W, C, F) can be determined in the first stage of our framework. Therefore, the EM algorithm proceeds as follows:

E-step: P(H | W, C, F, S, γ_t) ∝ Σ_{A_ij ∈ A} P(A_ij | W, C, F) P(S|H) P(H | A_ij, γ_t)
M-step: γ_{t+1} = argmax_γ L''(γ)

After estimating the parameters, the attribute headers can be predicted by the following reasoning. Since the candidate useful text fragments belong to some new or unseen attribute classes, we assume the terms P(A | W, C, F) are equal for all A. We replace the term P(H | A, γ) by E_A[H | A, γ], as H is binomially distributed with H being either zero or one. Next, we observe that P(H | W, C, F, S, γ) ∝ E_A[H | A, γ] P(S|H). By representing S with a set of features f_k(S), such as the relative position of S to the candidate useful text fragment, the number of characters in S, etc., we can obtain P(S|H) = ∏_k P(f_k(S) | H) under the independence assumption. We can then derive the following formula:

P(H | W, C, F, S, γ) ∝ E_A[H | A, γ] ∏_k P(f_k(S) | H)    (12)

Equation 12 can be used for estimating the probability that the surrounding text fragment S is the header of the useful text fragment (W, C, F).

5. Experimental Results

We conducted extensive experiments on several real-world Web sites in the book domain to demonstrate the performance of our framework for wrapper adaptation with new attribute discovery. Table 1 depicts the Web sites used in our experiments. The first column shows the Web site labels. The second column shows the names of the Web sites and the corresponding Web site addresses. The third and fourth columns depict the number of pages and the number of records collected from each Web site for evaluation purposes, respectively.

5.1. Evaluation on Wrapper Adaptation

To evaluate the performance of our wrapper adaptation approach, we first provide five training examples in each Web site for learning a wrapper. The attribute classes of interest are title and author.
After obtaining the learned wrapper for each of the Web sites, we conducted two sets of experiments. The first set of experiments simply applies the learned wrapper from one particular Web site, without wrapper adaptation, to all the remaining sites for information extraction. For example, the wrapper learned from S1 is directly applied to S2-S10 to extract items. This experiment can be treated as a baseline for our adaptation approach. The second set of experiments adapts the learned wrapper from one particular Web site, with wrapper adaptation, to all the remaining sites.

Table 1. Web sites collected for experiments. #pp. and #rec. refer to the number of pages and number of records respectively.
S1  Half Price Computer Books (
S2  Discount-PCBooks.com (
S3  mmistore.com (
S4  Amazon.com (
S5  Jim's Computer Books ( /vstorecomputers/jimsbooks/)
S6  1Bookstreet.com (
S7  Barnes & Noble.com (
S8  bookpool.com (
S9  half.com (
S10 DigitalGuru Technical Bookshops (

The extraction performance is evaluated by two commonly used metrics, precision and recall. Precision is defined as the number of items the system correctly identified divided by the total number of items it extracts. Recall is defined as the number of items the system correctly identified divided by the total number of actual items. The attribute items for evaluation in the first set of experiments are title and author. The results of the first set of experiments reveal that none of the learned wrappers is able to extract records from the other Web sites. This is due to the fact that the formats of the Web pages from different sites are different: the extraction rules learned from a particular Web site cannot be applied to other sites to extract information. In addition, we also evaluated an existing system called WIEN [10] on the same adaptation task.⁴
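The two metrics defined above reduce to simple set overlaps; a minimal sketch, with invented book titles standing in for extracted items:

```python
def precision(extracted, actual):
    """Correctly identified items / total items extracted."""
    if not extracted:
        return 0.0
    return len(set(extracted) & set(actual)) / len(set(extracted))

def recall(extracted, actual):
    """Correctly identified items / total actual items."""
    if not actual:
        return 0.0
    return len(set(extracted) & set(actual)) / len(set(actual))

# Invented example: the wrapper finds 2 of the 3 true titles and nothing else.
actual = {"Game Programming Gems 2", "Learning Perl", "Oracle PL/SQL Programming"}
extracted = {"Game Programming Gems 2", "Learning Perl"}

print(precision(extracted, actual))  # 1.0 (everything extracted is correct)
print(recall(extracted, actual))     # 2/3 (one actual item was missed)
```

The empty-set guards return 0.0, matching the degenerate case of a wrapper that extracts nothing.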
The wrapper learned by WIEN for a particular Web site cannot extract items from other Web sites. Table 2 shows the results of the second set of experiments. The first column shows the Web sites (source sites) from which the wrappers are learned given training examples. The first row shows the Web sites (new unseen sites) to which the learned wrapper of a particular Web site is adapted. Each cell in Table 2 is divided into two sub-columns and two sub-rows. The two sub-rows represent the extraction performance on the attributes title and author respectively. The two sub-columns represent the precision (P) and recall (R) for extracting the items respectively. These results are obtained by adapting a learned wrapper from one Web site to the remaining sites using our wrapper adaptation approach. The results indicate that the extraction performance is very satisfactory. Table 3 summarizes the average

⁴ WIEN is available at:

extraction performance for the cases of without adaptation and with adaptation.

Table 2. Experimental results of adapting a learned wrapper from one Web site (rows S1-S10) to the remaining Web sites (columns S1-S10). P and R refer to precision and recall (in percentages) respectively; each cell reports P and R for the attributes title and author.

Table 3. Average extraction performance on title and author for the book domain, without and with adaptation, when training examples from one particular Web site are provided. P and R refer to precision and recall (in percentages) respectively.

The first column of Table 3 shows the Web sites where training examples are given. Each row summarizes the results obtained by using the learned wrapper of the Web site in the first column and applying it to all other sites for extraction. The results indicate that the wrapper learned from a particular Web site cannot be directly applied to other sites without adaptation for information extraction. After applying our wrapper adaptation approach, the wrapper learned from a particular Web site can be adapted to the other sites, and a very promising performance is achieved, especially compared with the performance obtained without adaptation.

5.2. Evaluation on New Attribute Discovery

In each of the Web sites shown in Table 1, there exist some attributes other than title and author. For example, S1 contains attributes such as book type, list price, our price, etc. We conducted experiments to evaluate our new attribute discovery approach for discovering those previously unseen attributes. Recall (NewR) and precision (NewP) are used for evaluating the performance, with the ground truth consisting of all new attributes in the Web sites. Table 4 shows the results of the experiments.
The first column depicts the Web sites where we attempt to discover the new attributes. The second column depicts the new or previously unseen attributes that are discovered. The third column depicts the new attributes that could not be discovered. The last two columns show the precision (NewP) and recall (NewR) over all new attributes in the Web sites respectively. For example, in S1, the new attributes publication date, list price, you save, and our price can be discovered by our approach. Among the new attributes, the precision and recall for identifying items belonging to new attributes are 100.0% and 95.0% respectively. The results show that our new attribute discovery approach achieves a very good performance, with average precision and recall reaching 99.8% and 74.3% respectively. We can also correctly identify the headers for the discovered attributes from Web pages in S1 to S4. In S5, since the attributes are not associated with any headers in the Web pages, no header information is available. For S6-S10, since no headers are associated with title or author, no evidence can be used for inferring the headers of new attributes.

6. Conclusions

We have developed a probabilistic framework for adapting information extraction wrappers with new attribute discovery. Our framework is based on a generative model for generating text fragments related to attribute items and formatting data. For wrapper adaptation, one feature of our framework is that we utilize the extraction knowledge contained in the previously learned wrapper from the source Web site. We also consider previously extracted or collected items. A set of training examples for learning the new wrapper for the unseen site can be identified using a Bayesian learning approach. For new attribute discovery, we analyze the relationship between the attributes and their surrounding

text fragments. A Bayesian learning model is developed to extract the new attributes and their headers from the unseen site. We employ the EM technique in the learning algorithms of both Bayesian models. Experiments on a number of real-world Web sites show that our framework achieves a very promising performance in wrapper adaptation with new attribute discovery.

Table 4. Experimental results for new attribute discovery. NewP and NewR refer to the precision and recall of all new attributes in the Web sites (in percentages) respectively. Ave. refers to the average performance.

Site  New attributes correctly discovered                              New attributes not discovered               NewP   NewR
S1    publication date, list price, you save, our price                -                                           100.0  95.0
S2    ISBN, MSRP, your price, you save                                 -                                           -      -
S3    publisher, publication date, ISBN, shipping status               list price, our price, edition              -      -
S4    list price, our price, buy used, book type                       buy collectible, shipping status            -      -
S5    book type, number of pages, publication date, price              -                                           -      -
S6    book type, regular price, you save, our price, shipping status   -                                           -      -
S7    shipping status, publication date                                publisher, our price, you save, book type   -      -
S8    publisher, publication date, ISBN, edition,                      -                                           -      -
      book type, our price, your save, inventory
S9    price, save                                                      book type, publication date                 -      -
S10   publisher, publication date, you save                            reading level, online price                 -      -
Ave.                                                                                                               99.8   74.3
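Both Bayesian learning models in the framework rely on the EM technique. The paper's actual generative model is not reproduced in this excerpt, so the following is only a minimal sketch of EM for a two-component Bernoulli mixture over binary features (for instance, indicator features of the text fragments surrounding candidate items); the function name, the feature model, and the data are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def em_bernoulli_mixture(data, n_iter=50, seed=0):
    """Fit a 2-component Bernoulli mixture by EM.

    data: equal-length binary feature vectors (lists of 0/1).
    Returns (mixing weights, per-component Bernoulli parameters,
    per-example posterior probability of component 0).
    """
    rng = random.Random(seed)
    d = len(data[0])
    pi = [0.5, 0.5]  # mixing weights
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(2)]

    def log_lik(x, k):
        # log P(x | component k) under independent Bernoulli features
        return sum(math.log(theta[k][j] if x[j] else 1.0 - theta[k][j])
                   for j in range(d))

    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per example
        resp = []
        for x in data:
            w = [math.log(pi[k]) + log_lik(x, k) for k in range(2)]
            m = max(w)
            p = [math.exp(v - m) for v in w]
            s = sum(p)
            resp.append([v / s for v in p])
        # M-step: re-estimate mixing weights and Bernoulli parameters
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-9)
            pi[k] = nk / len(data)
            for j in range(d):
                num = sum(r[k] * x[j] for r, x in zip(resp, data))
                theta[k][j] = min(max(num / nk, 1e-6), 1.0 - 1e-6)
    return pi, theta, [r[0] for r in resp]

# Two well-separated groups of binary feature vectors.
data = [
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1],
]
pi, theta, post = em_bernoulli_mixture(data)
```

In the paper's setting, analogous posterior probabilities could be used to score candidate training examples or candidate new-attribute items; this sketch shows only the E-step/M-step mechanics, not the framework's actual model.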


More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

Challenges and Interesting Research Directions in Associative Classification

Challenges and Interesting Research Directions in Associative Classification Challenges and Interesting Research Directions in Associative Classification Fadi Thabtah Department of Management Information Systems Philadelphia University Amman, Jordan Email: FFayez@philadelphia.edu.jo

More information

Mining Data Records in Web Pages

Mining Data Records in Web Pages Mining Data Records in Web Pages Bing Liu Department of Computer Science University of Illinois at Chicago 851 S. Morgan Street Chicago, IL 60607-7053 liub@cs.uic.edu Robert Grossman Dept. of Mathematics,

More information

Web Database Integration

Web Database Integration In Proceedings of the Ph.D Workshop in conjunction with VLDB 06 (VLDB-PhD2006), Seoul, Korea, September 11, 2006 Web Database Integration Wei Liu School of Information Renmin University of China Beijing,

More information

ABSTRACT 1. INTRODUCTION

ABSTRACT 1. INTRODUCTION ABSTRACT A Framework for Multi-Agent Multimedia Indexing Bernard Merialdo Multimedia Communications Department Institut Eurecom BP 193, 06904 Sophia-Antipolis, France merialdo@eurecom.fr March 31st, 1995

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (  1 Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine Manjuparkavi A 1, Arokiamuthu M 2 1 PG Scholar, Computer Science, Dr. Pauls Engineering College, Villupuram, India 2 Assistant

More information

Automatic Wrapper Adaptation by Tree Edit Distance Matching

Automatic Wrapper Adaptation by Tree Edit Distance Matching Automatic Wrapper Adaptation by Tree Edit Distance Matching E. Ferrara 1 R. Baumgartner 2 1 Department of Mathematics University of Messina, Italy 2 Lixto Software GmbH Vienna, Austria 2nd International

More information

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,

More information

Entity Extraction from the Web with WebKnox

Entity Extraction from the Web with WebKnox Entity Extraction from the Web with WebKnox David Urbansky, Marius Feldmann, James A. Thom and Alexander Schill Abstract This paper describes a system for entity extraction from the web. The system uses

More information

Corroborate and Learn Facts from the Web

Corroborate and Learn Facts from the Web Corroborate and Learn Facts from the Web Shubin Zhao Google Inc. 76 9th Avenue New York, NY 10011 shubin@google.com Jonathan Betz Google Inc. 76 9th Avenue New York, NY 10011 jtb@google.com ABSTRACT The

More information

KNOW At The Social Book Search Lab 2016 Suggestion Track

KNOW At The Social Book Search Lab 2016 Suggestion Track KNOW At The Social Book Search Lab 2016 Suggestion Track Hermann Ziak and Roman Kern Know-Center GmbH Inffeldgasse 13 8010 Graz, Austria hziak, rkern@know-center.at Abstract. Within this work represents

More information

Interactive Wrapper Generation with Minimal User Effort

Interactive Wrapper Generation with Minimal User Effort Interactive Wrapper Generation with Minimal User Effort Utku Irmak and Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201 uirmak@cis.poly.edu and suel@poly.edu Introduction Information

More information

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Dong Han and Kilian Stoffel Information Management Institute, University of Neuchâtel Pierre-à-Mazel 7, CH-2000 Neuchâtel,

More information

Generalized Inverse Reinforcement Learning

Generalized Inverse Reinforcement Learning Generalized Inverse Reinforcement Learning James MacGlashan Cogitai, Inc. james@cogitai.com Michael L. Littman mlittman@cs.brown.edu Nakul Gopalan ngopalan@cs.brown.edu Amy Greenwald amy@cs.brown.edu Abstract

More information

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group Data Cleansing LIU Jingyuan, Vislab WANG Yilei, Theoretical group What is Data Cleansing Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies

More information

A Flexible Learning System for Wrapping Tables and Lists

A Flexible Learning System for Wrapping Tables and Lists A Flexible Learning System for Wrapping Tables and Lists or How to Write a Really Complicated Learning Algorithm Without Driving Yourself Mad William W. Cohen Matthew Hurst Lee S. Jensen WhizBang Labs

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park Link Mining & Entity Resolution Lise Getoor University of Maryland, College Park Learning in Structured Domains Traditional machine learning and data mining approaches assume: A random sample of homogeneous

More information