Building web information extraction tasks


Benjamin Habegger
Laboratoire d'Informatique de Nantes Atlantique
2 rue de la Houssinière, BP, Nantes CEDEX 3, France
Benjamin.Habegger@lina.univ-nantes.fr

Mohamed Quafafou
Institut des Applications Avancées de l'Internet
École de l'Internet de Marseille
12 avenue du Général Leclerc, Marseille, France
Mohamed.Quafafou@iaai.fr

Abstract

Most recent research in the field of information extraction from the Web has concentrated on the task of extracting the underlying content of a set of similarly structured web pages. However, in order to build real-world web information extraction applications this is not sufficient. Indeed, building such applications requires fully automating access to web sources, which involves more than extracting the data from web pages: the necessary infrastructure must be set up to query a source, retrieve the result pages, extract the results from these pages and filter out the unwanted results. In this paper we show how such an infrastructure can be set up. We propose to build a web information extraction application by decomposing it into sub-tasks and describing it in an XML-based language named WetDL. Each sub-task consists in applying a web information extraction specific operator to its input, one of these operators being the application of an extractor. By connecting such operators together it is possible to define complex applications simply. This is shown in the paper by applying the approach to real-world information extraction tasks such as extracting DVD listings from Amazon.com, extracting addresses from the online telephone directory superpages.com, etc.

1. Introduction

Recently, the development of the web, web-based applications and web-based access to databases has triggered an interest in information extraction from the web. The objective of this field of research is to allow automated access to sources which were originally intended for human users. Typically, a web-based data source is accessed by filling in an HTML form and submitting it. The results are then usually returned as a set of HTML pages. Figure 1 describes a manually executed extraction process. Step (1) is to fill in a form with the user's query and submit it. Step (2) is to extract the results from the obtained result page. Step (3) consists in following the "next page" link; we then return to step (2). Automatically accessing such a source involves translating the expression of the user's information need into data fitting the form, and translating the result pages from their presentational format into a structured and machine-understandable format. Such a task can be done by building a special type of program called a wrapper. Up to now, research in the field of information extraction from the web has mostly concentrated on the extraction task of such a wrapper (i.e. transforming the set of result pages into a machine-understandable format). So much so that the term wrapper is often used to refer to the extraction procedure alone. The main approaches [8, 1, 2, 7, 11] to building the extraction part of a wrapper are presented in this article. Some other research has also considered the problem of transforming the initial query into the form-based languages of web-based sources (see for example [9]). However, we argue that this is not sufficient in order to build fully automated access to web sources.
Indeed, automatically accessing a web source does not only involve querying and extracting data. It is also necessary to send the query, retrieve the result pages and possibly follow specific links, filter the resulting items, etc. The manner in which this is to be done largely depends on the web-based application and the task creator's objective. Some sites present their results on a single page, others on a set of pages linked to each other by a "next" link, while others return a set of pages containing only links to pages each containing a single result. Therefore, building a wrapper also involves setting up the infrastructure needed to carry out the querying and extraction. Such a fully operational wrapper for a web source can be built in a three-step process: (1) the description of a query mapping, (2) the construction of a result extractor and (3) the construction of the necessary infrastructure. Steps (1) and (2) each lead to a task-specific operator which will be used in step (3). Describing the mapping of an incoming query into the form-based language can easily be done by hand. We therefore concentrate on the two other tasks. In this paper we present how to build an information extraction task by this three-step process. We also present example applications we have built using this process and two systems we have developed: IERel, which allows to build an extractor, and WebSource, which executes a task description written in WetDL.

This paper is organized as follows. Section 2 presents different methods to generate extractors as well as the one we actually use. In section 3 we propose a descriptive method allowing to build the infrastructure necessary to construct a fully operational wrapper. Section 4 shows diverse applications which can be built using this method. In section 5 we give a rapid overview of related work. Finally, we conclude and present future work in section 6.

2. Generating an extractor

One of the most difficult steps in building a wrapper is the construction of an extractor allowing to translate web pages into a machine-readable format. This is necessary since data contained on the Web, generally in the form of HTML pages, is intended to be viewed in a browser by human users. The languages in which the data are given are presentation languages which give no idea of the semantics of the data contained in these pages. Therefore this enormous amount of data is seemingly useless. Nevertheless, the presentational format of the data gives clues on the structure of the underlying data. This fact makes it reasonable to consider giving machine access to data by building transformation procedures, which we will call extractors¹.

¹ In the field of information extraction from the web such a procedure is called a wrapper. In the context of this paper, however, we will call this procedure an extractor, a wrapper only containing such a procedure.

Different solutions allowing to build extractors have been proposed in the literature. The main existing approaches include extractor induction based on labeled page examples [8, 7, 11], unsupervised structure discovery [1, 2], knowledge-based extractors [12, 3], and induction based on context generalization [4]. The data the user wishes to extract from a page is called its content. In most cases this content is relational data: it consists in a set of instances of a relation, each instance being a set of key-value pairs where the key is the name of an attribute of the relation.

In our experiments we used the system IERel [4] to generate extractors. IERel learns an extractor for a user-specified relation. This relation is expressed by giving the set of example instances the user wishes to extract; the method is therefore also example-based. The approach allows the user to specify the information he wishes to extract and the attribute order in which this data is to be obtained. To build the extractor, the proposed algorithm consists in searching the documents for occurrences of the example instances, extracting a description of their contexts and generalizing the obtained descriptions into patterns. These patterns can then be applied in order to obtain the other instances of the relation.
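To make context generalization concrete, here is a toy sketch (ours, not the actual IERel algorithm; the fixed 20-character context window and helper names are illustrative): it locates the example values in a training document, keeps the longest common left and right contexts, and compiles them into a pattern that also matches unseen instances.

    import re

    def common_prefix(strings):
        """Longest common prefix of a non-empty list of strings."""
        s1, s2 = min(strings), max(strings)
        for i, c in enumerate(s1):
            if c != s2[i]:
                return s1[:i]
        return s1

    def common_suffix(strings):
        return common_prefix([s[::-1] for s in strings])[::-1]

    def learn_pattern(documents, examples, window=20):
        """Find the example values, describe their left/right contexts and
        generalize the descriptions into a single extraction pattern."""
        lefts, rights = [], []
        for doc in documents:
            for value in examples:
                for m in re.finditer(re.escape(value), doc):
                    lefts.append(doc[max(0, m.start() - window):m.start()])
                    rights.append(doc[m.end():m.end() + window])
        return re.compile(re.escape(common_suffix(lefts)) + "(.+?)"
                          + re.escape(common_prefix(rights)))

    train = '<li>Actor: <b>Depardieu</b></li><li>Actor: <b>Deneuve</b></li>'
    pattern = learn_pattern([train], ["Depardieu", "Deneuve"])
    print(pattern.findall('<li>Actor: <b>Huppert</b></li>'))  # ['Huppert']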
Once the extractor for a source has been built, the necessary infrastructure needs to be constructed in order to give full access to that source. For example, in the case of Superpages we obtained an extractor which takes an HTML page generated by Superpages and extracts from it the relation address(name, street, city, state, zip). However, being able to extract this relation from any page generated by Superpages is not sufficient to have fully automated access to Superpages. Indeed, this also requires being able to post a query, fetch the result pages, apply the extractor, etc. How this can be done is shown in the following.

3. Constructing the infrastructure

In this section we present how to obtain the necessary infrastructure. This can be done by composing a set of web information extraction specific operators. These operators allow to execute a source-specific extraction pattern. Such a pattern usually consists in querying a web source by generating an HTTP query for the source, fetching the result of such a query by connecting to the server of the source and retrieving the resulting page, extracting the results from the page, following a "next page" link, fetching the page corresponding to the link, extracting from this second page of results, and so on. However, this general pattern does not apply as-is to every source. Many properties of the extraction process are source-specific: how to map the user's query into an HTTP query, how to fetch the next page link when it exists, etc. Also, the way in which results are presented might differ from one source to another. For example, some sources might present all the results on a single page, while others might only present a page of links, each leading to a page fully describing one result instance.

3.1. Defining information extraction operators

There is therefore a need to specify the infrastructure describing how to access a source. We propose to do this by defining a set of operators which can be parameterized with the source-specific information, allowing each of them to carry out a specific sub-task of the whole web information extraction task.
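Anticipating the list-based interface described below, a minimal Python sketch of this operator abstraction (ours, not the actual WebSource API):

    from typing import Any, Callable, List

    class Operator:
        """An operator maps one input object to a list of output objects;
        an empty list signals an error or a filtered-out item."""

        def __init__(self, name: str, apply: Callable[[Any], List[Any]]):
            self.name = name
            self.apply = apply

    # A filter returns [] or [item]; an extractor may return many items.
    fr_filter = Operator("fr-filter",
                         lambda q: [q] if q.get("country") == "France" else [])
    tokenizer = Operator("tokenize", lambda text: text.split())

    print(fr_filter.apply({"country": "France"}))  # [{'country': 'France'}]
    print(tokenizer.apply("a b c"))                # ['a', 'b', 'c']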

[Figure 1. A manually executed web information extraction task: (1) filling in the query, (2) extracting the results, (3) following the "next" link.]

Each operator takes an input object and returns a list of output objects. For example, an extraction operator takes as its input a web page from the source and returns the list of results extracted from the page. Having all operators return a list is useful in order to treat all types of operators uniformly at a higher level of abstraction, i.e. when considering their combination, as will be discussed later. The main operators are described hereafter. These operators make use of W3C standards: in particular the Document Object Model (DOM), which provides an abstract in-memory tree representation of a document, XPath, a path-based querying language, and XSLT, a powerful and expressive XML transformation language. Building on these standards makes the operators highly flexible. Furthermore, their implementation is easy in any programming language having libraries implementing the W3C standards.

HTTP query building. A first operator is the HTTP query building operator. An HTTP query is composed of three parts: a query method, a base URL and a set of key/value pairs forming the query. Applying an HTTP query building operator consists in building these three parts from the parameters the operator is given. This operator builds a list containing a unique item: the HTTP query. For example, in the operator allowing to query superpages.com, the base URI is that of the Superpages search form, the HTTP method is GET, and there are nine parameters, such as the parameter named WL which sets the family name of the person looked up.
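A sketch of such a query builder (the actual Superpages base URL is elided in this transcription, so a made-up one stands in; only the WL parameter is named in the text):

    from urllib.parse import urlencode

    def build_query(base_url: str, method: str, params: dict) -> list:
        """HTTP query building operator: assemble the three parts of an
        HTTP query and return the single-item list the contract requires."""
        assert method == "GET", "a POST body would be assembled separately"
        return [base_url + "?" + urlencode(params)]

    # 'WL' (family name) is from the paper; the URL and 'STATE' are made up.
    print(build_query("http://wp.example.com/search", "GET",
                      {"WL": "Smith", "STATE": "CA"}))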
Fetching. A fetching operator takes as input either a URL or an HTTP request and downloads the document referred to. Its output is the resulting HTTP response. This operator generates either a list containing a unique item, the HTTP response, or an empty list in case of an error. This operator is quite generic in that it takes no configuration parameters. However, multiple instances of such an operator might be necessary in different places of an extraction task description, as we will see further on.

Parsing. A parsing operator takes an XML or HTML document, parses it and returns a DOM object. This object model gives highly flexible access to the different parts of an XML/HTML document. This operator returns either a list containing a unique item, the DOM object, or an empty list in case of a parsing error. As with the fetch operator, this operator is quite generic and needs no particular parameters.

Filtering. A filter operator performs a selection on its input according to a predetermined predicate. Any input object verifying the predicate is returned; all other input is discarded. In our implementation this predicate is defined by a set of tests on the input. This operator returns either an empty list, if the input does not match the predicate, or a list containing the input item as its unique element. This operator can be used to refine the results returned by a source.

Extracting. An extraction operator returns subparts of its input. Which subparts to extract is determined by an expression which is applied to the input. For example, given the DOM representation of an HTML page and the //a/@href XPath expression, the resulting extraction operator returns the links contained in the input document. This operator can generate a list containing zero, one or more items: the returned list is composed of all the subparts of the input object matching the operator's expression.
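For instance, a minimal extraction operator built on lxml (our implementation choice; the paper names no particular library) runs the //a/@href example from the text:

    from lxml import html

    def extract(document: bytes, xpath: str) -> list:
        """Extraction operator: parse the page and return the list of
        subparts matching the XPath expression (zero, one or more items)."""
        return html.fromstring(document).xpath(xpath)

    page = b'<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'
    print(extract(page, "//a/@href"))  # ['/a', '/b']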

Transforming. A transformation operator changes the format of its input. When the input is an HTML/XML document (or its DOM representation) the transformation can be described by an XSL stylesheet. This operator returns a list containing the transformed item as its unique element. Combined with an extraction operator, this operator can be used to describe a manually built extractor, which in some cases may be more interesting than using an automatically built one, especially when extracting from complex documents.

External operations. The extraction system we have developed also allows to introduce external operators. It is with this type of operator that we are able to make use of previously constructed extractors. The parameters of this type of operator depend on its implementation. It is therefore easy to allow the use of different extractor construction methods in our infrastructure: it is only necessary to define a procedure taking an input page and returning a list of results.

Other basic operators may easily be added to this set. For example, our implementation also includes an operator adding incoming data to a database, one building a cache of the fetched documents, a Web Service querying operator, etc. These are not described here since they are not necessary in the context of building a web information extraction task.

3.2. Coordinating the different operators

In order to build a complete information extraction task it is necessary to coordinate the basic operators. This is simply done by telling each operator what to do with its results. For example, after having built a query, the next step is to fetch the query result. This can be done by setting up a query operator and a fetching operator and telling the query operator to send its results to the fetching operator. Whenever the query operator receives input and builds a new query, it sends the generated query to the fetching operator. A web information extraction task can therefore be described by a network of operators. Such a network is a graph G = <V, E> where V is the set of operators and E is a set of directed edges. A directed edge (σi, σj) denotes that the source node σi should send its results to the destination node σj. Given this network we can associate two sets to each operator: the set of its producers and the set of its consumers. Given an operator σi, its producers are P(σi) = {σj | (σj, σi) ∈ E} and its consumers are C(σi) = {σj | (σi, σj) ∈ E}.

Given the coordination network of a web information extraction task, different strategies can be implemented to execute the task. For example, we have implemented a lazy strategy where an operator only produces results on demand. Another strategy could be a saturation strategy, which consists in systematically producing results and sending them to the consumers. These two strategies can be implemented in a single process. However, we can easily imagine parallelizing the process by having each operator run as an independent process, with synchronization being done by sending and receiving input. This allows for high scalability.
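A compact sketch of such a network under the saturation strategy (the stub operators and names are ours; the lazy, demand-driven variant would instead pull items from an operator's producers on request):

    from collections import defaultdict

    # Operators as functions from one input object to a list of outputs.
    ops = {
        "q":     lambda query: ["http://example.com/search?q=" + query],
        "fetch": lambda url: ["<html>page for %s</html>" % url],  # stub fetcher
        "split": lambda page: page.split(),                       # stub extractor
    }
    edges = [("q", "fetch"), ("fetch", "split")]

    consumers = defaultdict(list)
    for src, dst in edges:
        consumers[src].append(dst)

    def saturate(name, item, sink):
        """Eagerly push every produced item to all consumers; the outputs
        of operators without consumers are collected in sink."""
        for out in ops[name](item):
            if consumers[name]:
                for c in consumers[name]:
                    saturate(c, out, sink)
            else:
                sink.append(out)

    results = []
    saturate("q", "dvd", results)
    print(results)  # tokens produced by the final 'split' operator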
3.3. Describing the infrastructure

To describe the operators, their parameters and the coordination network we propose to use an XML language called WetDL. Each type of operator is described by an XML element, and describing an operator consists in adding such an element to the description. Each operator element has two attributes: a mandatory name attribute, which must be unique and is used to reference the operator, and a forward-to attribute, which contains the list of names of the operator's consumers. This information is sufficient to compute the producers of an operator at execution time. The parameters of each operator are declared as sub-elements of the operator element. This language is further described in [5].
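As an illustration, a few lines of Python (ours; the element names follow figure 2 below) recover the producers from the name/forward-to declarations at execution time:

    import xml.etree.ElementTree as ET

    WETDL = """
    <source name="example">
      <query name="q" forward-to="fetcher"/>
      <fetch name="fetcher" forward-to="parser extractor"/>
      <xmlparser name="parser" forward-to="next"/>
      <extract name="next" forward-to="fetcher"/>
      <external name="extractor"/>
    </source>
    """

    root = ET.fromstring(WETDL)
    consumers = {e.get("name"): (e.get("forward-to") or "").split()
                 for e in root}
    producers = {name: [] for name in consumers}
    for src, dests in consumers.items():
        for dst in dests:
            producers[dst].append(src)

    print(producers["fetcher"])  # ['q', 'next']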

4. Example applications

In this section we illustrate the construction of a solution to an information extraction task by describing a real-world web information extraction task: extracting DVD listings from amazon.fr. In a first step, an extractor is constructed for the source. Then the necessary infrastructure is set up by defining the necessary operators and coordinating them. The obtained XML description of the extraction task is given. We also present another web information extraction task, to show the expressiveness of our approach while conserving simplicity and thus reliability: it consists in extracting information on the different countries from the online CIA World Fact Book. Both of these tasks have been described in WetDL and executed using our prototype WebSource.

4.1. Extracting DVD descriptions from Amazon

Another typical web information extraction task is to query Amazon for price information on the products they sell. We chose to set up an information extraction application allowing to query their DVD database. However, access to this database is a bit tricky: it is necessary to visit the index page of the Amazon.fr site in order to be able to reach the other pages. Indeed, when visiting this page the server generates a session-specific key which appears in every further URL. Without this key the data made available is inaccessible. Therefore we need to simulate browsing in order to be able to query the source. How to do this is described in the following.

Building an extractor for Amazon. As for Superpages, the first step is to build an extractor for Amazon. This extractor was built in the same way as for Superpages. We queried the source for DVDs in which the actor Depardieu appears, which generated 143 results on 10 pages. Using these result pages as examples, we then built an extractor using IERel (see [4]).

[Figure 3. Amazon.fr extraction task network]

     1 <?xml version="1.0" encoding="iso-8859-1"?>
     2
     3 <source name="superpages.com">
     4
     5   <options>
     6     <option name="titre" shortcut="t"/>
     7     <option name="acteur" shortcut="a"/>
     8     <option name="realisateur" shortcut="r"/>
     9     <option name="genre" shortcut="g"/>
    10     <option name="public" shortcut="p"/>
    11     <option name="format" shortcut="f"/>
    12   </options>

    14   <fetch name="init-amazon" type="xml"
    15          forward-to="dvd-link-finder">
    16     <data>...</data>
    17   </fetch>

    19   <extract name="dvd-link-finder"
    20            forward-to="follow-dvd-link">
    21     <path>//area[@alt="dvd"]/@href</path>
    22   </extract>

    24   <fetch name="follow-dvd-link" type="xml"
    25          forward-to="query-page-finder" />

    27   <extract name="query-page-finder"
    28            forward-to="fetch-query-page">
    29     <path>
    30       //a[contains(.,"recherche")]/@href
    31     </path>
    32   </extract>

    34   <fetch name="fetch-query-page" type="xml"
    35          forward-to="extract-query-uri" />

    37   <extract name="extract-query-uri"
    38            forward-to="q">
    39     <path>//table//form/@action</path>
    40   </extract>

    42   <query name="q" method="post"
    43          forward-to="fetcher">
    44     <parameters>
    45       <param name="qtype" default="at" />
    46       <param name="rank" default="+amzrank" />
    47       <param name="field-0" default="title" />
    48       <param name="query-0" default="">
    49         <set-attribute name="default"
    50                        value-of="titre" />
    51       </param>
    52       <param name="field-actor" default="">
    53         <set-attribute name="default"
    54                        value-of="acteur" />
    55       </param>
    56       <param name="field-director" default="">
    57         <set-attribute name="default"
    58                        value-of="realisateur" />
    59       </param>
    60       <param name="field-subject" default="">
    61         <set-attribute name="default"
    62                        value-of="genre" />
    63       </param>
    64       <param name="field-cnc-rating"
    65              default="">
    66         <set-attribute name="default"
    67                        value-of="public" />
    68       </param>
    69       <param name="index" default="dvd-fr">
    70         <set-attribute name="default"
    71                        value-of="format" />
    72       </param>
    73     </parameters>
    74   </query>

    76   <fetch name="fetcher"
    77          forward-to="parser extractor" />

    79   <xmlparser name="parser"
    80              forward-to="next"/>

    82   <extract name="next" forward-to="fetcher"
    83            method="xpath">
    84     <path>
    85       //a[img[contains(@src,
    86               "more-results.gif")]]/@href
    87     </path>
    88   </extract>

    90   <external name="extractor"
    91             module="amazon_dvd" />
    92 </source>

Figure 2. Description of the Amazon extraction task

Setting up the extraction task. For the Amazon task we basically need the same set of operators as for the Superpages extraction task, plus additional operators to initiate the task. This initiation is necessary to retrieve the URLs containing the session key. Figure 3 gives the extraction task network we need to set up and figure 2 gives the full description of the task. We will be querying Amazon.fr's DVD database. There are six query parameters: title (titre), actor (acteur), director (realisateur), genre (genre), rating (public) and format (format). These are declared in the options element (lines 5-12). For the initiation, we first need an operator to fetch the Amazon site index page, which will lead to the creation of a new session.
This is done by the operator init-amazon, described at lines 14-17. The data element allows to declare data which will be sent to the operator when the task execution starts. The URLs appearing in the fetched page will contain the generated session key. Next we need an extraction operator (dvd-link-finder) to extract the URL of the DVD page. Downloading this page requires another fetch operator (follow-dvd-link), while a further extraction operator (query-page-finder) extracts the URL of an advanced query page.
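The initiation chain amounts to a few alternating fetch and extract steps; a sketch with requests and lxml (the amazon.fr entry URL is elided in this transcription and the 2004-era page layout is gone, so example.com stands in; the XPath expressions are those of figure 2):

    import requests
    from lxml import html
    from urllib.parse import urljoin

    def follow(url, xpath):
        """Fetch a page and return, as an absolute URL, the first value
        matched by xpath -- one fetch + extract step of the task network."""
        page = html.fromstring(requests.get(url).content)
        return urljoin(url, page.xpath(xpath)[0])

    index = "http://www.example.com/"                          # init-amazon
    dvd = follow(index, '//area[@alt="dvd"]/@href')            # dvd-link-finder
    query = follow(dvd, '//a[contains(.,"recherche")]/@href')  # query-page-finder
    action = follow(query, '//table//form/@action')            # extract-query-uri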

[Figure 4. Combined white pages task network]

A third fetch operator (fetch-query-page) is needed to download this page. Finally, the base URL of the advanced query form needs to be extracted and sent to the query operator. This is done by adding another extraction operator (extract-query-uri). Once the query operator receives the base URI, it can generate an HTTP request corresponding to the user's information need. This request can then be sent to the fetch operator, starting an extraction cycle. This extraction cycle is similar to the one declared for Superpages. We have five operators: a query builder q, a fetch operator f, a parser p, a next-link extractor next and the previously built extractor extract. They are connected to each other as described in figures 3 and 2. This example shows that our approach is generic enough to describe tasks involving session information. With the basic set of operators given, we can describe any browsing pattern. Therefore, anything which can be accessed by a human user via a browser can also be accessed by setting up the correct task description.

4.2. Extracting from the CIA World Fact Book

The CIA World Fact Book gives and keeps up to date much information on countries all over the world, from geographical information to governmental information. An example extraction task making use of this source is to extract country information such as the size of the population or the geographical coordinates of each country. The extracted data, once reformatted, could be used to build a knowledge base and keep it up to date by reapplying the extraction task in the future.

[Figure 5. CIA Fact Book extraction network]

Figure 5 gives the network of operators necessary to access this source. A first operator fetch-list fetches the country index page. It sends it to an extract-list operator which extracts the URLs of the country description pages. Each of the extracted URLs is sent to the fetch-item operator, which fetches the country pages and sends them to the extract-item operator, which applies a stylesheet to the fetched country pages. The extraction from the country pages is thus done by an XSLT stylesheet, which can easily be adapted to extract and format the relevant data into a structured format. For a given country page it builds a country element containing the name of the country, its geographical coordinates, and its population. It was obtained manually by an analysis of a set of example pages.
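A sketch of applying such a stylesheet with lxml (the country-page element names and sample values are invented; the paper does not reproduce the actual Fact Book markup):

    from lxml import etree

    XSL = etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/page">
        <country>
          <name><xsl:value-of select="h1"/></name>
          <coordinates><xsl:value-of select="geo"/></coordinates>
          <population><xsl:value-of select="pop"/></population>
        </country>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(XSL)
    page = etree.XML(b"<page><h1>France</h1><geo>46 00 N, 2 00 E</geo>"
                     b"<pop>60424213</pop></page>")
    print(str(transform(page)))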

This example task shows that our description language makes it simple to describe batch extraction tasks. It also shows that it can accommodate different extraction methods: in the previous examples the extractors had all been generated in the same automatic manner, whereas here a manually written stylesheet is used.

4.3. Combining directories

Our approach also allows to simply define combinations of extraction tasks. For example, each country has its own white pages service and we might want to provide global access to multiple white pages services. This access can be obtained by building an extraction task for each country's web-based service, as we did for the US directory Superpages. Then each of these extraction tasks can be included as a sub-part of a more global extraction task.

We show the feasibility of this combination by building an extraction task which combines the Superpages task with a similar extraction task for the French white pages service Pagesjaunes. We built an extractor for Pagesjaunes using IERel in the same manner as for Superpages. In the case of Pagesjaunes too, the extraction task has a similar infrastructure, as shown in the bottom sub-graph of figure 4. Once a user query is received, it needs to be directed to the proper service. Since the services do not have the same parameters, we first build an XML query document which will be transformed into the correct format. Sending this query to the proper service is done by setting up two filters, us-filter and fr-filter. The us-filter operator only keeps queries for which the country part is the US and sends them to the sp-q transformation operator. Similarly, the fr-filter only keeps queries for which the country part is France and sends them to the pj-q transformation operator. The transformation operators translate the query into an HTTP request for the corresponding source. The rest of the task is executed as if the original source-specific task had been called.

5. Related Work

Up to now, most work in the field of information extraction from the Web has concentrated on building extractors for web sources. This work was presented in section 2 of this paper. The main recent references on this subject are [8, 11, 7] for extractor induction based on labeled page examples, [1, 2] for structure discovery, [12, 3] for knowledge-based extractors, and [4] for building extractors by context generalization. Some work has also been done on learning the semantics of a web form in order to ease its automatic querying, see for example [9]. In some respects our work is similar to the composition of Web Services. Indeed, our operators can easily be seen as web services and their coordination as a form of composition. Web service composition languages include the Business Process Execution Language for Web Services (BPEL4WS), based on IBM's Web Services Flow Language (WSFL) and Microsoft's XLANG. In [6] a formal semantics is proposed which models a composed web service's behavior using Petri nets. Finally, [10] proposes to semantically annotate a web service description in order to allow the automatic composition of web services. Our work, however, focuses on specific information extraction operators and their coordination, allowing to limit code overhead.

6. Conclusion

In this paper we presented how to build web information extraction applications. Such applications give access to data made available on the web. Constructing most common applications requires a three-step process involving (1) building a query mapping, (2) building an extractor for the source and (3) setting up the correct infrastructure. To set up this infrastructure we proposed to define a set of web information extraction specific operators and to coordinate them. We showed that this method is effective by demonstrating how we have built real-world applications using two systems, IERel and WebSource.

References
[1] C.-H. Chang, C.-N. Hsu, and S.-C. Lui. Automatic Information Extraction from Semi-Structured Web Pages by Pattern Discovery. Decision Support Systems Journal, 35(1), April.
[2] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In The VLDB Journal.
[3] X. Gao and L. Sterling. Semi-Structured Data Extraction from Heterogeneous Sources. In Second International Workshop on Innovative Internet Information Systems (IIIS 99), Copenhagen, Denmark, 1999.
[4] B. Habegger and M. Quafafou. Multi-pattern wrappers for relation extraction. In F. van Harmelen, editor, ECAI: Proceedings of the 15th European Conference on Artificial Intelligence, Amsterdam. IOS Press.
[5] B. Habegger and M. Quafafou. WetDL: A web information extraction language. In ADVIS: Proceedings of the Third International Conference on Advances in Information Systems, LNCS, Izmir, Turkey.
[6] R. Hamadi and B. Benatallah. A Petri net-based model for web service composition. In Proceedings of the Fourteenth Australasian Database Conference on Database Technologies, volume 17 of CRPIT, Adelaide, Australia. Australian Computer Society, Inc.
[7] C.-N. Hsu and M.-T. Dung. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Information Systems, 23(8).
[8] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence.
[9] N. Kushmerick. Learning to Invoke Web Forms. In R. Meersman, Z. Tari, and D. C. Schmidt, editors, CoopIS/DOA/ODBASE, Lecture Notes in Computer Science, Catania, Sicily, Italy. Springer Verlag.
[10] B. Medjahed, A. Bouguettaya, and A. K. Elmagarmid. Composing Web services on the Semantic Web. The VLDB Journal, 12(4).
[11] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 4(1-2), March.
[12] H. Seo, J. Yang, and J. Choi. Knowledge-based Wrapper Generation by Using XML. In IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, Seattle, Washington, August 2001.


More information

Informatics 1: Data & Analysis

Informatics 1: Data & Analysis Informatics 1: Data & Analysis Lecture 9: Trees and XML Ian Stark School of Informatics The University of Edinburgh Tuesday 11 February 2014 Semester 2 Week 5 http://www.inf.ed.ac.uk/teaching/courses/inf1/da

More information

Chapter 2 Overview of the Design Methodology

Chapter 2 Overview of the Design Methodology Chapter 2 Overview of the Design Methodology This chapter presents an overview of the design methodology which is developed in this thesis, by identifying global abstraction levels at which a distributed

More information

USING MUL TIVERSION WEB SERVERS FOR DATA-BASED SYNCHRONIZATION OF COOPERATIVE WORK

USING MUL TIVERSION WEB SERVERS FOR DATA-BASED SYNCHRONIZATION OF COOPERATIVE WORK USING MUL TIVERSION WEB SERVERS FOR DATA-BASED SYNCHRONIZATION OF COOPERATIVE WORK Jarogniew Rykowski Department of Information Technology The Poznan University of Economics Mansfolda 4 60-854 Poznan,

More information

Consumption and Composition of Web Services and non web services

Consumption and Composition of Web Services and non web services Consumption and Composition of Web Services and non web services Rohit Kishor Kapadne Computer Engineering Department, RMD Sinhgad School of Engineering, Warje Pune, Maharashtra, India Abstract Nowadays

More information

Standard Business Rules Language: why and how? ICAI 06

Standard Business Rules Language: why and how? ICAI 06 Standard Business Rules Language: why and how? ICAI 06 M. Diouf K. Musumbu S. Maabout LaBRI (UMR 5800 du CNRS), 351, cours de la Libération, F-33.405 TALENCE Cedex e-mail: {diouf, musumbu, maabout}@labri.fr

More information

Metamorphosis An Environment to Achieve Semantic Interoperability with Topic Maps

Metamorphosis An Environment to Achieve Semantic Interoperability with Topic Maps Metamorphosis An Environment to Achieve Semantic Interoperability with Topic Maps Giovani Rubert Librelotto 1 and José Carlos Ramalho 2 and Pedro Rangel Henriques 2 1 UNIFRA Centro Universitário Franciscano

More information

E-Agricultural Services and Business

E-Agricultural Services and Business E-Agricultural Services and Business A Conceptual Framework for Developing a Deep Web Service Nattapon Harnsamut, Naiyana Sahavechaphan nattapon.harnsamut@nectec.or.th, naiyana.sahavechaphan@nectec.or.th

More information

Many-to-Many One-to-One Limiting Values Summary

Many-to-Many One-to-One Limiting Values Summary page 1 Meet the expert: Andy Baron is a nationally recognized industry expert specializing in Visual Basic, Visual C#, ASP.NET, ADO.NET, SQL Server, and SQL Server Business Intelligence. He is an experienced

More information

SDMX self-learning package No. 5 Student book. Metadata Structure Definition

SDMX self-learning package No. 5 Student book. Metadata Structure Definition No. 5 Student book Metadata Structure Definition Produced by Eurostat, Directorate B: Statistical Methodologies and Tools Unit B-5: Statistical Information Technologies Last update of content December

More information

Context-Aware Adaptation for Mobile Devices

Context-Aware Adaptation for Mobile Devices Context-Aware Adaptation for Mobile Devices Tayeb Lemlouma and Nabil Layaïda WAM Project, INRIA, Zirst 655 Avenue de l Europe 38330, Montbonnot, Saint Martin, France {Tayeb.Lemlouma, Nabil.Layaida}@inrialpes.fr

More information

Mirroring - Configuration and Operation

Mirroring - Configuration and Operation Mirroring - Configuration and Operation Product version: 4.60 Document version: 1.0 Document creation date: 31-03-2006 Purpose This document contains a description of content mirroring and explains how

More information

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University Markup Languages SGML, HTML, XML, XHTML CS 431 February 13, 2006 Carl Lagoze Cornell University Problem Richness of text Elements: letters, numbers, symbols, case Structure: words, sentences, paragraphs,

More information

Databases and the World Wide Web

Databases and the World Wide Web Databases and the World Wide Web Paolo Atzeni D.I.A. - Università di Roma Tre http://www.dia.uniroma3.it/~atzeni thanks to the Araneus group: G. Mecca, P. Merialdo, A. Masci, V. Crescenzi, G. Sindoni,

More information

REST Web Services Objektumorientált szoftvertervezés Object-oriented software design

REST Web Services Objektumorientált szoftvertervezés Object-oriented software design REST Web Services Objektumorientált szoftvertervezés Object-oriented software design Dr. Balázs Simon BME, IIT Outline HTTP REST REST principles Criticism of REST CRUD operations with REST RPC operations

More information

A Multidimensional Approach for Modelling and Supporting Adaptive Hypermedia Systems

A Multidimensional Approach for Modelling and Supporting Adaptive Hypermedia Systems A Multidimensional Approach for Modelling and Supporting Adaptive Hypermedia Systems Mario Cannataro, Alfredo Cuzzocrea, Andrea Pugliese ISI-CNR, Via P. Bucci, 41/c 87036 Rende, Italy {cannataro, apugliese}@si.deis.unical.it,

More information

Enhancing Digital Library Documents by A Posteriori Cross Linking Using XSLT

Enhancing Digital Library Documents by A Posteriori Cross Linking Using XSLT Enhancing Digital Library Documents by A Posteriori Cross Linking Using XSLT Michael G. Bauer 1 and Günther Specht 2 1 Institut für Informatik, TU München Orleansstraße 34, D-81667 München, Germany bauermi@in.tum.de

More information

On the Social Rational Mirror s architecture : Semantics and pragmatics of educational interactions

On the Social Rational Mirror s architecture : Semantics and pragmatics of educational interactions On the Social Rational Mirror s architecture : Semantics and pragmatics of educational interactions Daniele Maraschi, Germana M. Da Nobrega, Stefano A. Cerri {maraschi,nobrega,cerri}@lirmm.fr LIRMM, Laboratoire

More information

Information management - Topic Maps visualization

Information management - Topic Maps visualization Information management - Topic Maps visualization Benedicte Le Grand Laboratoire d Informatique de Paris 6, Universite Pierre et Marie Curie, Paris, France Benedicte.Le-Grand@lip6.fr http://www-rp.lip6.fr/~blegrand

More information

VISO: A Shared, Formal Knowledge Base as a Foundation for Semi-automatic InfoVis Systems

VISO: A Shared, Formal Knowledge Base as a Foundation for Semi-automatic InfoVis Systems VISO: A Shared, Formal Knowledge Base as a Foundation for Semi-automatic InfoVis Systems Jan Polowinski Martin Voigt Technische Universität DresdenTechnische Universität Dresden 01062 Dresden, Germany

More information

LinDA: A Service Infrastructure for Linked Data Analysis and Provision of Data Statistics

LinDA: A Service Infrastructure for Linked Data Analysis and Provision of Data Statistics LinDA: A Service Infrastructure for Linked Data Analysis and Provision of Data Statistics Nicolas Beck, Stefan Scheglmann, and Thomas Gottron WeST Institute for Web Science and Technologies University

More information