SEMI-AUTOMATIC WRAPPER GENERATION AND ADAPTION Living with heterogeneity in a market environment


Michael Christoffel, Bethina Schmitt, Jürgen Schneider
Institute for Program Structures and Data Organization, Universität Karlsruhe, Karlsruhe, Germany
{christof,schmitt,schneider}@ira.uka.de

Keywords: Wrapper generation, Provider integration, Web interfaces, Open markets

Abstract: The success of the Internet as a medium for the supply and commerce of various kinds of goods and services leads to a fast growing number of autonomous and heterogeneous providers that offer and sell goods and services electronically. The new market structures have already entered all kinds of markets. Approaches for market infrastructures usually try to cope with the heterogeneity of the providers by means of special wrapper components, which translate between the native protocols of the providers and the protocol of the market infrastructure. Enforcing a special interface on the providers limits their independence. Moreover, requirements such as direct access to the internal business logic and databases of the providers, or fixed templates for internal data structures, are not suitable for establishing a truly open electronic market. A solution is to limit the access to the existing Web interface of the provider. This solution preserves the independence of the providers without burdening them with additional work. However, for efficiency reasons, it remains necessary to tailor a wrapper for each provider. What is more, each change in the provider or its Web representation forces the modification of the existing wrapper or even the development of a new wrapper. In this paper, we present an approach for a wrapper for complex Web interfaces, which can easily be adapted to any provider simply by adding a source description file. A tool allows the construction and modification of source descriptions without expert knowledge.
Common changes in the Web representation can be detected and handled automatically. The presented approach has been applied to the market of scientific literature.

1. INTRODUCTION

The success of the Internet does not only allow the connection of computers and business partners world-wide. It has also opened the way to completely new business models for electronic business and electronic commerce. Supply and commerce over the Internet have already entered all kinds of markets. The consequence is the development of new market structures, and often also the development of completely new electronic markets. Not only have traditional vendors and suppliers seized the electronic opportunity; a large number of new providers has also appeared, following innovative business ideas and offering a large range of goods and services. Expectations for the future see a continuous growth of the importance of electronic business and of the number of providers. These new electronic markets lack appropriate market infrastructures. However, approaches for market infrastructures have to face the problem of the heterogeneity and the autonomy of the providers. A common way is the installation of wrapper components that act as representatives of the providers in the market. Since these wrappers can be integrated into the market infrastructure, they can easily be addressed by customers or other market components. However, until now there is no common solution for the way a wrapper accesses the underlying provider. Trying to enforce standards like ODBC and JDBC for relational databases and Z39.50 for library catalogues, or the presence of a special interface only for the wrapper, leads to a restriction of the autonomy of the providers and often also to a limitation of the possible capacity. Moreover, such approaches hinder the development of truly open markets, where customers and providers are free to enter and leave at their own discretion.

Our approach focuses on the use of already existing interfaces of the provider. Since the presentation of information in the Internet is done by static or dynamic HTML Web pages, the use of this Web interface by a wrapper component to connect the providers to the market infrastructure promises a solution to the problem of heterogeneity and autonomy. A change to XML as the language for Web information representation, which can be expected in the future, will even enhance the capacity of the Web interface. Figure 1 shows a general market infrastructure with components representing customers and providers and additional components for market-internal services.

Figure 1: General market infrastructure (providers, provider-side representatives (wrappers), market-internal services, customer-side representatives, customers).

For efficiency reasons, it is necessary to tailor a wrapper for a provider. What is more, for every change in the provider's offers, conditions, prices, etc., and also for every change in the Web interface, the wrapper has to be re-written or at least adapted. Doing all this manually costs a great amount of time and money, and it is not practicable in a larger dynamic market. In this paper, we present an approach for a flexible wrapper component for Web interfaces, which can be adapted to the interface of a provider by loading a source description file. We also present a tool for the generation of source description files for a given provider. The paper is organized as follows. After an overview of related approaches in section 2, we introduce our wrapper component in section 3. The process of the generation of the source description is presented in section 4. In section 5, we discuss how minor changes in the Web representation can be detected without having to edit the source description.
Section 6 contains an overview of the implementation, and section 7 first experiences with the application of our approach in a special electronic market. We conclude the paper in section 8.

2. RELATED WORK

There are various approaches for wrapping Web data sources. However, approaches like JEDI (Huck, Frankenhauser, et al., 1998), Garlic (Roth and Schwarz, 1997), TSIMMIS (Hammer, Breunig, et al., 1997), or Disco (Tomasic, Raschid, et al., 1996) offer no capable solution for wrapper generation. Instead, manual work is necessary. UMICAS (Gruser, Raschid, et al., 1998) allows accessing semi-structured data represented in HTML interfaces. The extraction process is oriented towards a tree representation of the HTML document. Simple extractors apply extraction rules written in the Qualified path expression Extractor Language (QEL). Complex extractors use the services of the simple extractors and build a higher level of extraction. However, the extraction process is limited to single pages (and in this way not appropriate for most Web sites). There are graphical tools for wrapper generation, but their use is limited, because extraction rules have to be set up manually. W4F (World Wide Web Wrapper Factory) uses three different rule sets for the wrapper (Sahuguet and Azavant, 1999). Retrieval rules define the loading of Web pages from the Internet and their transformation into a tree representation. Extraction rules define the extraction of values in the tree using the HTML Extraction Language (HEL). For each extracted value, an expression in the Nested String List (NSL) format is created. Navigation over differently structured pages is not possible. Mapping rules describe the creation of Java objects for the extracted data. The creation of these different rule sets is supported by a collection of different tools. However, the final definition of the wrapper's behavior through the combination of the rule sets lies in the hands of the user.
XWRAP (Liu, Pu, et al., 2000) provides a graphical tool for the semi-automatic generation of

wrappers for Web interfaces, although the use of the wrapper is limited to patterns for the creation of XML documents out of selected Web pages. The wrapper generator allows the user to download Web pages from the Internet and automatically detects and removes errors in the HTML document. Region Extraction Wizards create extraction rules for values contained in the Web site, using regular expressions for paths in a tree representation of the HTML document. However, navigation over structurally different Web pages is not supported. Lixto (Baumgartner, Flesca, et al., 2001) describes a similar approach, using different tools. Lixto contains a visual and interactive wrapper generator, which uses the declarative extraction language Elog. The extractor can use these Elog files to create pattern instance bases for the different attributes contained in the HTML page. The XML generator transforms pattern instances into XML documents. Again, the construction of the pattern instances is restricted to one Web page.

3. WRAPPER

In this section, we discuss the properties and the design of our suggested flexible wrapper for Web interfaces (Pulkowski, 1999). The principal design of our wrapper can be seen in figure 2. Each wrapper contains at least the following basic modules: coordinator, validation, planner, and converter. The coordinator is the wrapper's interface to the market infrastructure. This means that the design and the functionality of the wrapper are independent of the environment in which it is to operate. In particular, the coordinator receives and interprets queries from other market components and sends back the results. The validation module checks whether a query is syntactically and semantically correct. If necessary, the query is corrected, or an error message is sent back to the querying party. A correct query will be handed over to the planner. The planner determines how to perform the query, which pages are needed, which results can be found on each page, and how to navigate between the pages. The central data structure used for the planning process is the navigation graph, which determines the structure of the Web interface. The planner creates a query plan, which is transmitted to the converter. According to the query plan and the navigation graph, the converter sends the query to the provider's Web interface, then extracts the necessary attribute sets out of the resulting HTML pages. In doing this, the converter navigates autonomously between different static and dynamic pages of the Web interface. The coordinator sends the result sets to the querying party. The following modules are needed only for commercial providers: authorization, cost monitor, and protocol module. The authorization module checks whether the querying party is allowed to perform queries to the provider, and which rights may be granted. The cost monitor pre-calculates the costs for the performance of the query. It determines whether the query may be performed or not. If necessary, e.g., if a cost limit given by the customer is too low, the customer will be informed about this situation. Additionally, the cost monitor controls the converter and can suspend the query process if a cost limit is reached. The protocol module logs all actions of the wrapper. This protocol can be used in case of a law-suit.

Figure 2: Wrapper design (market environment; wrapper with coordinator, validation, planner, navigation graph, authorization, cost monitor, protocol, and converter; Web interface; provider).
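To make the division of labor among the basic modules (coordinator, validation, planner, converter) concrete, the query flow can be sketched roughly as follows. This is an illustrative sketch in Python; all class and method names are our own invention, not those of the actual wrapper, which is implemented in Java:

```python
# Illustrative sketch of the query flow through the basic wrapper modules.
# All names are hypothetical; the real modules consult the source description
# and the navigation graph, which are omitted here.

class Validation:
    def check(self, query):
        # A real validator checks syntax and semantics; here we only
        # reject queries without search terms.
        if not query.get("terms"):
            raise ValueError("syntactically or semantically incorrect query")
        return query

class Planner:
    def plan(self, query):
        # A real planner uses the navigation graph to decide which pages
        # are needed and how to navigate between them.
        return {"steps": ["submit search form", "extract result list"],
                "query": query}

class Converter:
    def execute(self, plan):
        # A real converter sends HTTP requests to the provider's Web
        # interface and extracts attribute sets from the HTML pages.
        return [{"title": "stub result for %s" % t}
                for t in plan["query"]["terms"]]

class Coordinator:
    """The wrapper's interface to the market infrastructure."""
    def __init__(self):
        self.validation = Validation()
        self.planner = Planner()
        self.converter = Converter()

    def handle(self, query):
        query = self.validation.check(query)  # correct or reject the query
        plan = self.planner.plan(query)       # build a query plan
        return self.converter.execute(plan)   # run the plan against the Web interface

results = Coordinator().handle({"terms": ["wrapper generation"]})
```

The point of the design is that only the converter touches the provider's Web interface; everything above it works on the provider-independent query plan.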

In this paper, we abstract from metadata management. Note that the wrapper holds (and maintains) much more metadata than is needed for query and result translation. These additional metadata are needed by other market components and should be collected in a step additional to the creation of the source description. For more details, compare (Christoffel, Pulkowski, et al., 2000). The design of the wrapper is independent of the provider to which it is connected. Hence, all information needed to tailor the wrapper for a provider must be contained in the source description file. This file is an XML document containing information about the query format, costs for accessing pages, result types, the structure of the result pages, etc. Queries and result sets are also embedded in XML documents. Extraction rules needed for the extraction of values from HTML pages are formulated as Extended Hierarchical Path Expressions (EHPE), embedded in XML. Simplified, EHPE follow this grammar:

expression := (node)+ (op)+.
node := node-name [index].
node-name := tag | pcdata.
index := index, index | number | number - number | number - | - number | *.
op := att(identifier) | txt() | split(regex) | match(regex) | search(regex).

An extended hierarchical path expression consists of a sequence of one or more nodes, followed by a sequence of one or more operators. Each node consists of a node name and an index. The sequence of the nodes describes the sequence of nodes visited when following a path from the root of a tree representation of the HTML document to the considered node. The node name can be any legal HTML tag or the keyword pcdata, which stands for an arbitrary character sequence. Each tag that can be used to encompass parts of a document can also open a new level of hierarchy. Since each level of the HTML tree can contain several nodes with the same name, the index describes the number of the considered node in the order of appearance in the level (starting with 0).
The index can be any sequence or range of non-negative integers. The symbol * is used as a shortcut for all appearances of the node name in the level. The following operators can be applied to the nodes: att, txt, split, match, and search. att should only be applied to a node referencing a tag. It returns the value of the attribute given as argument (as a character sequence). txt is used for extracting text out of an HTML document. Applied to a pcdata node, it returns the character sequence the node stands for; applied to any other leaf node, it returns an empty list. Applied to an inner node, however, txt is recursively applied to all nodes of the subtree under the node (depth first), returning a nested list of all texts in the subtree in the order of appearance. The last three operators should be applied to a character string and take a regular expression as argument. split should be applied to a character string that contains substrings separated by a delimiter and returns a list of these substrings. match returns the given string if it matches the regular expression; otherwise it returns an empty string. search extracts a substring that fits the regular expression. Whenever a hierarchical path expression describes not a single node but a list of nodes, the operators are applied to every node contained in the list. In this case, the result of the EHPE is a list of the results for each node contained in the list. Examples of such hierarchical expressions are:

html[0]body[0]table[0]tr[*]th[1]table[0]tr[*]txt()
html[0]body[0]table[0]att(border)
html[0]body[0]table[0]tr[*]th[0]pcdata[0]txt() split( )

4. WRAPPER GENERATION

Creating the source description file for a provider manually is tricky and time-consuming, especially because the files have to be adapted for nearly every change in the provider or its Web interface. However, a fully automatic generation is not possible, since the semantics of the Web pages are not known in the general case.
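To illustrate how the extended hierarchical path expressions of section 3 select nodes, a minimal evaluator for a small EHPE subset (node steps with an integer or * index, and the txt() operator) might look as follows. The tree encoding and function names are our own simplification, not the original implementation:

```python
# Minimal evaluator for a subset of EHPE. A node is a pair (tag, children);
# a text node is ("pcdata", "some text"). This encoding is illustrative only.

tree = ("html", [
    ("body", [
        ("table", [
            ("tr", [("td", [("pcdata", "First title")])]),
            ("tr", [("td", [("pcda" "ta", "Second title")])]),
        ]),
    ]),
])

def select(nodes, name, index):
    """Select children with the given tag name; index is an int or '*'."""
    out = []
    for node in nodes:
        matches = [c for c in node[1] if c[0] == name]
        out.extend(matches if index == "*" else matches[index:index + 1])
    return out

def txt(node):
    """Recursively collect the text of a subtree (depth first)."""
    if node[0] == "pcdata":
        return [node[1]]
    return [t for child in node[1] for t in txt(child)]

# Evaluate html[0]body[0]table[0]tr[*]td[0]txt(); the root step html[0]
# corresponds to the tree itself, so only the remaining steps are applied.
path = [("body", 0), ("table", 0), ("tr", "*"), ("td", 0)]
nodes = [tree]
for name, index in path:
    nodes = select(nodes, name, index)
titles = [t for n in nodes for t in txt(n)]
```

Because the tr step uses the * wildcard, the expression yields the text of the first td cell of every table row, here the two titles.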
A human user, on the other hand, can often interpret the content of a Web page very easily. So we created a wrapper generator (which, despite its name, generates only the source description files) that is based on interaction with a user. The wrapper generator should be intuitive to use and ease the work of the human user as far as possible.

The metaphor that underlies the generation process is generation by example. The user has to perform a sample query on the provider and mark the important elements in the pages; the real generation work is done by the generator. The generation process consists of three steps. First, the user chooses an entry page for the search, either by directly giving the URL or by navigation. Second, the user selects one form on the entry page and starts the search by entering arguments. It is not really important which arguments the user enters, because the wrapper generator automatically analyzes the form, enabling the wrapper to send queries autonomously, using all available parameters. Third, the user marks relevant attributes contained on the result pages. He/she can select single attributes or use a multi-selection mode, where the wrapper generator automatically tries to find hierarchical path expressions for a larger group of nodes from two selections of the user. The user is free to navigate between Web pages, even if these are structurally completely different. The wrapper generator automatically finds all paths of navigation among the Web pages.

Figure 3: Wrapper generator design (user, user interface, planner, navigation graph, converter, generator, data module; Web interface; provider).

The design of the wrapper generator is presented in figure 3. The wrapper generator consists of the following modules: user interface, planner, converter, generator, and data module. Among these, planner and converter are already known from the wrapper design. The user interface is essential for the wrapper generator, because each user, who in most cases is not a computer scientist, should be able to work with it without long training. The user interface provides views of both the source and the source description. The generator module creates hierarchical path expressions for selected parts of the HTML tree. The data module is responsible for file operations.
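The multi-selection mode described above can be sketched as follows: given the paths to two example nodes selected by the user, every index in which the paths differ is generalized to the * wildcard, yielding one rule for the whole group. This is an illustrative sketch; the path representation is our own simplification of the EHPE format:

```python
# Sketch of deriving a generalized path expression from two user selections.
# A path is a list of (tag, index) steps from the root of the HTML tree.

def generalize(path_a, path_b):
    """Replace every index where the two example paths differ with '*'.
    Both paths must visit the same tag names; otherwise no common rule exists."""
    if [tag for tag, _ in path_a] != [tag for tag, _ in path_b]:
        return None  # structurally different selections cannot be generalized
    return [(tag, i if i == j else "*")
            for (tag, i), (_, j) in zip(path_a, path_b)]

# The user selects the title cell of the first and of the third result row:
first = [("html", 0), ("body", 0), ("table", 0), ("tr", 0), ("td", 1)]
third = [("html", 0), ("body", 0), ("table", 0), ("tr", 2), ("td", 1)]

rule = generalize(first, third)
# rule now selects the second td cell of every tr row.
```

Two example selections suffice because only the indices vary between rows of a structurally regular result list; the tag sequence stays the same.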
For efficiency reasons, the data module supports a cache manager for previously loaded Web pages. In addition to the navigation graph and the page structure, the source description file contains further information needed by the wrapper that has to be determined by the wrapper generator during the generation process, such as the need for a login or cost factors. The source description file also contains entries that are not needed by the wrapper but by the wrapper generator, to repeat the generation process at any time. The wrapper generator can restore the environment of the last use, and the user can instantly make the needed changes.

5. WRAPPER ADAPTION

In many cases, changes in the provider's offers and its Web representation force a modification of the source description file. The wrapper generator opens a comfortable way to make modifications without the need to repeat the entire generation process. As far as possible, however, the detection of smaller changes and the modification of the source description should be done without human help. An important class of changes are modifications of the Web site. The re-arrangement of the elements in a Web page, or simply the addition of some tags, can make the wrapper useless if, due to the changes, the extraction rules are now inconsistent, or, worse, if they are still consistent but extract the wrong data. The detection algorithm also works if changes affect several attributes. For each of the considered attributes, a test data set must be

available, together with the queries that were used to generate these data sets. For error detection, the queries are repeated and the results are compared with the test data. Major differences indicate the use of an extraction rule that is no longer valid. The idea behind the detection algorithm is that the test data can be found on the result pages, but maybe at a different position than expected. However, if the attribute can be found on a series of Web pages that are identical in their structure, then it can be found on each of these Web pages at the same position. So we search the result pages for each appearance of the test data in the document text. Then we count the number of matches for each node of the HTML tree. With a certain probability, the node with the highest number of matches is the node where the attribute can be extracted. The extraction rule for this node will be the new rule for the attribute. Of course, it is possible that a test string can also be found at some other node by chance. But since this match will not be found in the other Web pages, the number of matches for this node will be 1. If the results can be found in a list on one page only, the algorithm has to be modified. The result is then a set of nodes that can be described by one path expression (using the * wildcard).

6. IMPLEMENTATION

Figure 4 shows the user interface of the wrapper generator. The user interface shows three different views of the source: the tree view, the text view, and the browser view. The actions that can be performed by the user always depend on the context.

Figure 4: User interface of the wrapper generator.

Figure 5 shows the interface of the wrapper generator during the generation process. The wrapper generator can handle several HTML files at a time, and the user can switch among them freely. By selecting relevant attributes, extraction rules are created and added to the source description.

Figure 5: User interface of the wrapper generator during the generation process.

Both wrapper and wrapper generator have been implemented with Java 2, using the additional packages JavaRegex for the implementation of regular expressions, JTidy for the correction of the loaded HTML pages, and XML4J for the generation and administration of XML data structures.

7. APPLICATION

The wrapper presented in this paper has been used for accessing data sources in a special kind of information market, namely the market of scientific literature. This market shows the importance of a uniform access to the information providers. While the supply of scientific literature worldwide is very rich and still growing, searching for literature is a laborious, time-consuming task, especially if a list of providers has to be accessed sequentially. The trend towards the commercialization of the Internet can also make literature search very expensive. Within the UniCats project, a market infrastructure is being developed for this special market (Christoffel, Pulkowski, et al., 1998). The project is funded by the German Research Foundation (DFG) as a part of the strategic research offensive Distributed Distribution and Processing of Digital Documents (V3D2).

Figure 6 shows the architecture of the UniCats system. In addition to the wrappers, we have developed two other market components so far: user agents that act as representatives of the customers (Schmitt and Schmidt, 1999), and traders that provide a market-internal service of provider selection (Christoffel, 1999). The technical basis of the infrastructure is the UniCats environment, a framework for independent and communicative UniCats agents (Christoffel, Nimis, et al., 2000).

Figure 6: UniCats architecture (UniCats environment with providers, wrappers, market-internal services (mediators, traders, certification, ...), user agents, customers).

We proved the functionality of the wrapper generator in an experiment with eight human test persons who had never worked with the wrapper generator before. After an introduction of 15 minutes, the test persons had to perform three tasks. Each task comprised the development of a source description file for one information provider. The difficulty of the tasks increased, while the degree of detail in the instructions decreased. The candidates had no help except a 4-page description of the user interface; questions were not allowed. For each task, we recorded the time needed and the quality of the created source description. For the experiment, we chose three very different sources, which have different requirements, but all are in a sense typical for the selected market: the Indiana University Knowledge Base, Lehmann's Online Bookshop, and the catalogue of the university library of Karlsruhe. According to their own statements, the first two sources were not known to the candidates, while the third source was known to all. The first result was that nearly all constructed wrappers (24 in total) work. Only in two cases was a wrapper constructed with limited functionality. In one case, a wrapper was constructed with more functionality than demanded in the instructions.
Figure 7: Results of the experiment (time in minutes needed by each of the eight test persons for the three tasks).

Figure 7 shows a graphical visualization of the measured times needed to perform the three tasks. The average times needed are 5:36 minutes for task 1, 7:07 minutes for task 2, and 4:31 minutes for task 3. Unfortunately, we cannot give a comparable value for the time needed for a manual creation of the source descriptions for the three tasks, because we did not want to expect this work of a volunteer. According to an expert, a trained computer scientist needs about 2 hours for the creation of a source description for literature sources of a comparable degree of difficulty. Perhaps the most interesting result of the experiment is that all candidates found the handling of the wrapper generator intuitive and were able to work with the generator instantly. Figure 7 also shows that a training effect set in very early, since most candidates could increase their speed in task 3, the most difficult task of all.

8. CONCLUSIONS

In this paper, we presented an approach for a wrapper component for the connection of heterogeneous providers in an open market. The wrapper can be adapted to different providers by loading a source description file. The source description file can be created semi-automatically with the help of a wrapper generator.

For this approach, we have shown a range of advantages:
- The wrapper uses no connection to the provider other than its Web interface. It is possible to install a wrapper for a provider without knowledge about internal data structures and business logic.
- The wrapper can operate on complete Web sites containing many static and dynamic, structurally different Web pages.
- The wrapper can also be used for commercial Web sites.
- Using the wrapper generator corresponds to performing a query; it does not require any expert knowledge.
- An existing source description can easily be modified, and in this way the wrapper can follow any changes of the provider. Smaller changes can be handled automatically.
- The implementation of wrapper and wrapper generator is platform-independent.

The wrapper has been designed for use in a market infrastructure. Test applications in a real market scenario have started. In the future, the challenges will lie in the application of the approach to other domains and markets. Potential application fields include meta search engines and product catalogues for online distributors.

REFERENCES

Baumgartner, R., Flesca, S., Gottlob, G., 2001. Visual Web Information Extraction with Lixto. In Proceedings of the 27th International Conference on Very Large Data Bases, Rome.
Christoffel, M., Pulkowski, S., Schmitt, B., Lockemann, P., 1998. Electronic Commerce: The Roadmap for University Libraries and their Members to Survive in the Information Jungle. In ACM Sigmod Record 27 (4).
Christoffel, M., 1999. A Trader for Services in a Scientific Literature Market. In Proceedings of the 2nd International Workshop on Engineering Federated Information Systems. infix.
Christoffel, M., Pulkowski, S., Lockemann, P., 2000. Integration and Mediation of Information Sources in an Open Market Environment.
In Proceedings of the 4th International Conference on Business Information Systems, Springer.
Christoffel, M., Nimis, J., Pulkowski, S., Schmitt, B., Lockemann, P., 2000. An Infrastructure for an Electronic Market of Scientific Literature. In Proceedings of the 4th IEEE International Baltic Workshop on Databases and Information Systems, Kluwer Academic Publishers.
Gruser, J.-R., Raschid, L., Vidal, M., Bright, L., 1998. Wrapper Generation for Web Accessible Data Sources. In Proceedings of the 3rd International Conference on Cooperative Information Systems, New York City.
Hammer, J., Breunig, M., García-Molina, H., Nestorov, S., Vassalos, V., Yerneni, R., 1997. Template-based Wrappers in the TSIMMIS System. In Proceedings of the International Conference on Management of Data, Tucson.
Huck, G., Frankenhauser, P., Aberer, K., Neuhold, E., 1998. Jedi: Extracting and Synthesizing Information from the Web. In Proceedings of the 3rd International Conference on Cooperative Information Systems, New York City.
Kushmerick, N., 1999. Regression testing for wrapper maintenance. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), Orlando.
Liu, L., Pu, C., Lee, Y.-S., 2000. An XML-enabled Wrapper Construction System for Web Information Sources. In Proceedings of the International Conference on Data Engineering, San Diego, IEEE.
Pulkowski, S., 1999. Making Information Sources Available for a New Market in an Electronic Commerce Environment. In Proceedings of the International Conference on Management of Information and Communication Technology, Copenhagen.
Roth, M., Schwarz, P., 1997. Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Systems. In Proceedings of the 23rd International Conference on Very Large Data Bases, Athens.
Sahuguet, A., Azavant, F., 1999. Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F.
In Proceedings of the International Conference on Very Large Data Bases.
Schmitt, B., Schmidt, A., 1999. METALICA: An Enhanced Meta Search Engine for Literature Catalogs. In Proceedings of the 2nd Asian Digital Libraries Conference, Taipei.
Tomasic, A., Raschid, L., Valduriez, P., 1996. Scaling Heterogeneous Databases and the Design of Disco. In Proceedings of the 16th International Conference on Distributed Computing Systems, Hong Kong.


More information

AN APPROACH FOR EXTRACTING INFORMATION FROM NARRATIVE WEB INFORMATION SOURCES

AN APPROACH FOR EXTRACTING INFORMATION FROM NARRATIVE WEB INFORMATION SOURCES AN APPROACH FOR EXTRACTING INFORMATION FROM NARRATIVE WEB INFORMATION SOURCES A. Vashishta Computer Science & Software Engineering Department University of Wisconsin - Platteville Platteville, WI 53818,

More information

EXTRACTION INFORMATION ADAPTIVE WEB. The Amorphic system works to extract Web information for use in business intelligence applications.

EXTRACTION INFORMATION ADAPTIVE WEB. The Amorphic system works to extract Web information for use in business intelligence applications. By Dawn G. Gregg and Steven Walczak ADAPTIVE WEB INFORMATION EXTRACTION The Amorphic system works to extract Web information for use in business intelligence applications. Web mining has the potential

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

Web site Image database. Web site Video database. Web server. Meta-server Meta-search Agent. Meta-DB. Video query. Text query. Web client.

Web site Image database. Web site Video database. Web server. Meta-server Meta-search Agent. Meta-DB. Video query. Text query. Web client. (Published in WebNet 97: World Conference of the WWW, Internet and Intranet, Toronto, Canada, Octobor, 1997) WebView: A Multimedia Database Resource Integration and Search System over Web Deepak Murthy

More information

Teiid Designer User Guide 7.7.0

Teiid Designer User Guide 7.7.0 Teiid Designer User Guide 1 7.7.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata

More information

Web Data Extraction Using Tree Structure Algorithms A Comparison

Web Data Extraction Using Tree Structure Algorithms A Comparison Web Data Extraction Using Tree Structure Algorithms A Comparison Seema Kolkur, K.Jayamalini Abstract Nowadays, Web pages provide a large amount of structured data, which is required by many advanced applications.

More information

Data Querying, Extraction and Integration II: Applications. Recuperación de Información 2007 Lecture 5.

Data Querying, Extraction and Integration II: Applications. Recuperación de Información 2007 Lecture 5. Data Querying, Extraction and Integration II: Applications Recuperación de Información 2007 Lecture 5. Goal today: Provide examples for useful XML based applications Motivation: Integrating Legacy Databases,

More information

I. Khalil Ibrahim, V. Dignum, W. Winiwarter, E. Weippl, Logic Based Approach to Semantic Query Transformation for Knowledge Management Applications,

I. Khalil Ibrahim, V. Dignum, W. Winiwarter, E. Weippl, Logic Based Approach to Semantic Query Transformation for Knowledge Management Applications, I. Khalil Ibrahim, V. Dignum, W. Winiwarter, E. Weippl, Logic Based Approach to Semantic Query Transformation for Knowledge Management Applications, Proc. of the International Conference on Knowledge Management

More information

MetaNews: An Information Agent for Gathering News Articles On the Web

MetaNews: An Information Agent for Gathering News Articles On the Web MetaNews: An Information Agent for Gathering News Articles On the Web Dae-Ki Kang 1 and Joongmin Choi 2 1 Department of Computer Science Iowa State University Ames, IA 50011, USA dkkang@cs.iastate.edu

More information

I R UNDERGRADUATE REPORT. Information Extraction Tool. by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG

I R UNDERGRADUATE REPORT. Information Extraction Tool. by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG UNDERGRADUATE REPORT Information Extraction Tool by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG 2001-1 I R INSTITUTE FOR SYSTEMS RESEARCH ISR develops, applies and teaches advanced methodologies

More information

Automatic Reconstruction of the Underlying Interaction Design of Web Applications

Automatic Reconstruction of the Underlying Interaction Design of Web Applications Automatic Reconstruction of the Underlying Interaction Design of Web Applications L.Paganelli, F.Paternò C.N.R., Pisa Via G.Moruzzi 1 {laila.paganelli, fabio.paterno}@cnuce.cnr.it ABSTRACT In this paper

More information

XML in the bipharmaceutical

XML in the bipharmaceutical XML in the bipharmaceutical sector XML holds out the opportunity to integrate data across both the enterprise and the network of biopharmaceutical alliances - with little technological dislocation and

More information

Query Processing and Optimization on the Web

Query Processing and Optimization on the Web Query Processing and Optimization on the Web Mourad Ouzzani and Athman Bouguettaya Presented By Issam Al-Azzoni 2/22/05 CS 856 1 Outline Part 1 Introduction Web Data Integration Systems Query Optimization

More information

APPLICATION OF A METASYSTEM IN UNIVERSITY INFORMATION SYSTEM DEVELOPMENT

APPLICATION OF A METASYSTEM IN UNIVERSITY INFORMATION SYSTEM DEVELOPMENT APPLICATION OF A METASYSTEM IN UNIVERSITY INFORMATION SYSTEM DEVELOPMENT Petr Smolík, Tomáš Hruška Department of Computer Science and Engineering, Faculty of Computer Science and Engineering, Brno University

More information

Semantic Data Extraction for B2B Integration

Semantic Data Extraction for B2B Integration Silva, B., Cardoso, J., Semantic Data Extraction for B2B Integration, International Workshop on Dynamic Distributed Systems (IWDDS), In conjunction with the ICDCS 2006, The 26th International Conference

More information

Web Scraping Framework based on Combining Tag and Value Similarity

Web Scraping Framework based on Combining Tag and Value Similarity www.ijcsi.org 118 Web Scraping Framework based on Combining Tag and Value Similarity Shridevi Swami 1, Pujashree Vidap 2 1 Department of Computer Engineering, Pune Institute of Computer Technology, University

More information

An approach to the model-based fragmentation and relational storage of XML-documents

An approach to the model-based fragmentation and relational storage of XML-documents An approach to the model-based fragmentation and relational storage of XML-documents Christian Süß Fakultät für Mathematik und Informatik, Universität Passau, D-94030 Passau, Germany Abstract A flexible

More information

HERA: Automatically Generating Hypermedia Front- Ends for Ad Hoc Data from Heterogeneous and Legacy Information Systems

HERA: Automatically Generating Hypermedia Front- Ends for Ad Hoc Data from Heterogeneous and Legacy Information Systems HERA: Automatically Generating Hypermedia Front- Ends for Ad Hoc Data from Heterogeneous and Legacy Information Systems Geert-Jan Houben 1,2 1 Eindhoven University of Technology, Dept. of Mathematics and

More information

extensible Markup Language

extensible Markup Language extensible Markup Language XML is rapidly becoming a widespread method of creating, controlling and managing data on the Web. XML Orientation XML is a method for putting structured data in a text file.

More information

The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes

The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes J. Raposo, A. Pan, M. Álvarez, Justo Hidalgo, A. Viña Denodo Technologies {apan, jhidalgo,@denodo.com University

More information

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report Technical Report A B2B Search Engine Abstract In this report, we describe a business-to-business search engine that allows searching for potential customers with highly-specific queries. Currently over

More information

TagFS Tag Semantics for Hierarchical File Systems

TagFS Tag Semantics for Hierarchical File Systems TagFS Tag Semantics for Hierarchical File Systems Stephan Bloehdorn, Olaf Görlitz, Simon Schenk, Max Völkel Institute AIFB, University of Karlsruhe, Germany {bloehdorn}@aifb.uni-karlsruhe.de ISWeb, University

More information

A web application serving queries on renewable energy sources and energy management topics database, built on JSP technology

A web application serving queries on renewable energy sources and energy management topics database, built on JSP technology International Workshop on Energy Performance and Environmental 1 A web application serving queries on renewable energy sources and energy management topics database, built on JSP technology P.N. Christias

More information

Teiid Designer User Guide 7.8.0

Teiid Designer User Guide 7.8.0 Teiid Designer User Guide 1 7.8.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Metadata Overview... 2 1.2.1. What is Metadata... 2 1.2.2. Business and Technical Metadata... 4 1.2.3. Design-Time

More information

An Archiving System for Managing Evolution in the Data Web

An Archiving System for Managing Evolution in the Data Web An Archiving System for Managing Evolution in the Web Marios Meimaris *, George Papastefanatos and Christos Pateritsas * Institute for the Management of Information Systems, Research Center Athena, Greece

More information

Database Heterogeneity

Database Heterogeneity Database Heterogeneity Lecture 13 1 Outline Database Integration Wrappers Mediators Integration Conflicts 2 1 1. Database Integration Goal: providing a uniform access to multiple heterogeneous information

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

USING MUL TIVERSION WEB SERVERS FOR DATA-BASED SYNCHRONIZATION OF COOPERATIVE WORK

USING MUL TIVERSION WEB SERVERS FOR DATA-BASED SYNCHRONIZATION OF COOPERATIVE WORK USING MUL TIVERSION WEB SERVERS FOR DATA-BASED SYNCHRONIZATION OF COOPERATIVE WORK Jarogniew Rykowski Department of Information Technology The Poznan University of Economics Mansfolda 4 60-854 Poznan,

More information

Enhancing Wrapper Usability through Ontology Sharing and Large Scale Cooperation

Enhancing Wrapper Usability through Ontology Sharing and Large Scale Cooperation Enhancing Wrapper Usability through Ontology Enhancing Sharing Wrapper and Large Usability Scale Cooperation through Ontology Sharing and Large Scale Cooperation Christian Schindler, Pranjal Arya, Andreas

More information

Overview of the Integration Wizard Project for Querying and Managing Semistructured Data in Heterogeneous Sources

Overview of the Integration Wizard Project for Querying and Managing Semistructured Data in Heterogeneous Sources In Proceedings of the Fifth National Computer Science and Engineering Conference (NSEC 2001), Chiang Mai University, Chiang Mai, Thailand, November 2001. Overview of the Integration Wizard Project for

More information

Schema-Guided Wrapper Maintenance for Web-Data Extraction

Schema-Guided Wrapper Maintenance for Web-Data Extraction Schema-Guided Wrapper Maintenance for Web-Data Extraction Xiaofeng Meng, Dongdong Hu, Haiyan Wang School of Information, Renmin University of China, Beijing 100872, China xfmeng@mail.ruc.edu.cn Abstract

More information

Applying the Semantic Web Layers to Access Control

Applying the Semantic Web Layers to Access Control J. Lopez, A. Mana, J. maria troya, and M. Yague, Applying the Semantic Web Layers to Access Control, IEEE International Workshop on Web Semantics (WebS03), pp. 622-626, 2003. NICS Lab. Publications: https://www.nics.uma.es/publications

More information

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 14 Database Connectivity and Web Technologies

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 14 Database Connectivity and Web Technologies Database Systems: Design, Implementation, and Management Tenth Edition Chapter 14 Database Connectivity and Web Technologies Database Connectivity Mechanisms by which application programs connect and communicate

More information

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model Indian Journal of Science and Technology, Vol 8(20), DOI:10.17485/ijst/2015/v8i20/79311, August 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 A Study of Future Internet Applications based on

More information

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany

More information

A tutorial report for SENG Agent Based Software Engineering. Course Instructor: Dr. Behrouz H. Far. XML Tutorial.

A tutorial report for SENG Agent Based Software Engineering. Course Instructor: Dr. Behrouz H. Far. XML Tutorial. A tutorial report for SENG 609.22 Agent Based Software Engineering Course Instructor: Dr. Behrouz H. Far XML Tutorial Yanan Zhang Department of Electrical and Computer Engineering University of Calgary

More information

Semi-Automated Extraction of Targeted Data from Web Pages

Semi-Automated Extraction of Targeted Data from Web Pages Semi-Automated Extraction of Targeted Data from Web Pages Fabrice Estiévenart CETIC Gosselies, Belgium fe@cetic.be Jean-Roch Meurisse Jean-Luc Hainaut Computer Science Institute University of Namur Namur,

More information

USING SCHEMA MATCHING IN DATA TRANSFORMATIONFOR WAREHOUSING WEB DATA Abdelmgeid A. Ali, Tarek A. Abdelrahman, Waleed M. Mohamed

USING SCHEMA MATCHING IN DATA TRANSFORMATIONFOR WAREHOUSING WEB DATA Abdelmgeid A. Ali, Tarek A. Abdelrahman, Waleed M. Mohamed 230 USING SCHEMA MATCHING IN DATA TRANSFORMATIONFOR WAREHOUSING WEB DATA Abdelmgeid A. Ali, Tarek A. Abdelrahman, Waleed M. Mohamed Abstract: Data warehousing is one of the more powerful tools available

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Interrogation System Architecture of Heterogeneous Data for Decision Making

Interrogation System Architecture of Heterogeneous Data for Decision Making Interrogation System Architecture of Heterogeneous Data for Decision Making Cécile Nicolle, Youssef Amghar, Jean-Marie Pinon Laboratoire d'ingénierie des Systèmes d'information INSA de Lyon Abstract Decision

More information

Interactive Learning of HTML Wrappers Using Attribute Classification

Interactive Learning of HTML Wrappers Using Attribute Classification Interactive Learning of HTML Wrappers Using Attribute Classification Michal Ceresna DBAI, TU Wien, Vienna, Austria ceresna@dbai.tuwien.ac.at Abstract. Reviewing the current HTML wrapping systems, it is

More information

An UML-XML-RDB Model Mapping Solution for Facilitating Information Standardization and Sharing in Construction Industry

An UML-XML-RDB Model Mapping Solution for Facilitating Information Standardization and Sharing in Construction Industry An UML-XML-RDB Model Mapping Solution for Facilitating Information Standardization and Sharing in Construction Industry I-Chen Wu 1 and Shang-Hsien Hsieh 2 Department of Civil Engineering, National Taiwan

More information

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

For many years, the creation and dissemination

For many years, the creation and dissemination Standards in Industry John R. Smith IBM The MPEG Open Access Application Format Florian Schreiner, Klaus Diepold, and Mohamed Abo El-Fotouh Technische Universität München Taehyun Kim Sungkyunkwan University

More information

Reverse method for labeling the information from semi-structured web pages

Reverse method for labeling the information from semi-structured web pages Reverse method for labeling the information from semi-structured web pages Z. Akbar and L.T. Handoko Group for Theoretical and Computational Physics, Research Center for Physics, Indonesian Institute of

More information

A Tagging Approach to Ontology Mapping

A Tagging Approach to Ontology Mapping A Tagging Approach to Ontology Mapping Colm Conroy 1, Declan O'Sullivan 1, Dave Lewis 1 1 Knowledge and Data Engineering Group, Trinity College Dublin {coconroy,declan.osullivan,dave.lewis}@cs.tcd.ie Abstract.

More information

Intelligent Brokering of Environmental Information with the BUSTER System

Intelligent Brokering of Environmental Information with the BUSTER System 1 Intelligent Brokering of Environmental Information with the BUSTER System H. Neumann, G. Schuster, H. Stuckenschmidt, U. Visser, T. Vögele and H. Wache 1 Abstract In this paper we discuss the general

More information

Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto

Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto Robert Baumgartner 1, Sergio Flesca 2, and Georg Gottlob 1 1 DBAI, TU Wien, Vienna, Austria {baumgart,gottlob}@dbai.tuwien.ac.at

More information

Semistructured Data Store Mapping with XML and Its Reconstruction

Semistructured Data Store Mapping with XML and Its Reconstruction Semistructured Data Store Mapping with XML and Its Reconstruction Enhong CHEN 1 Gongqing WU 1 Gabriela Lindemann 2 Mirjam Minor 2 1 Department of Computer Science University of Science and Technology of

More information

Extracting Semistructured Information from the Web

Extracting Semistructured Information from the Web Extracting Semistructured Information from the Web J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo Department of Computer Science Stanford University Stanford, CA 94305-9040 {hector,joachim,cho,aranha,crespo@cs.stanford.edu

More information

MythoLogic: problems and their solutions in the evolution of a project

MythoLogic: problems and their solutions in the evolution of a project 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. MythoLogic: problems and their solutions in the evolution of a project István Székelya, Róbert Kincsesb a Department

More information

Exploring and Exploiting the Biological Maze. Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix

Exploring and Exploiting the Biological Maze. Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix Exploring and Exploiting the Biological Maze Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix Motivation An abundance of biological data sources contain data about scientific entities, such as

More information

A Two-Phase Rule Generation and Optimization Approach for Wrapper Generation

A Two-Phase Rule Generation and Optimization Approach for Wrapper Generation A Two-Phase Rule Generation and Optimization Approach for Wrapper Generation Yanan Hao Yanchun Zhang School of Computer Science and Mathematics Victoria University Melbourne, VIC, Australia haoyn@csm.vu.edu.au

More information

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval Alda Lopes Gançarski Pierre et Marie Curie University, Laboratoire d Informatique de Paris 6,

More information

AFRI AND CERA: A FLEXIBLE STORAGE AND RETRIEVAL SYSTEM FOR SPATIAL DATA

AFRI AND CERA: A FLEXIBLE STORAGE AND RETRIEVAL SYSTEM FOR SPATIAL DATA Frank Toussaint, Markus Wrobel AFRI AND CERA: A FLEXIBLE STORAGE AND RETRIEVAL SYSTEM FOR SPATIAL DATA 1. Introduction The exploration of the earth has lead to a worldwide exponential increase of geo-referenced

More information

BEAWebLogic. Portal. Overview

BEAWebLogic. Portal. Overview BEAWebLogic Portal Overview Version 10.2 Revised: February 2008 Contents About the BEA WebLogic Portal Documentation Introduction to WebLogic Portal Portal Concepts.........................................................2-2

More information

For our sample application we have realized a wrapper WWWSEARCH which is able to retrieve HTML-pages from a web server and extract pieces of informati

For our sample application we have realized a wrapper WWWSEARCH which is able to retrieve HTML-pages from a web server and extract pieces of informati Meta Web Search with KOMET Jacques Calmet and Peter Kullmann Institut fur Algorithmen und Kognitive Systeme (IAKS) Fakultat fur Informatik, Universitat Karlsruhe Am Fasanengarten 5, D-76131 Karlsruhe,

More information

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems Chen Li Information and Computer Science University of California, Irvine, CA 92697 chenli@ics.uci.edu Abstract In data-integration

More information

Wrapper Generation for Web Accessible Data Sources

Wrapper Generation for Web Accessible Data Sources Wrapper Generation for Web Accessible Data Sources Jean-Robert Gruser, Louiqa Raschid, María Esther Vidal, Laura Bright University of Maryland College Park, MD 20742 fgruser,louiqa,mvidal,brightg@umiacs.umd.edu

More information

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP IJIRST International Journal for Innovative Research in Science & Technology Volume 2 Issue 02 July 2015 ISSN (online): 2349-6010 A Hybrid Unsupervised Web Data Extraction using Trinity and NLP Anju R

More information

A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources

A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources Abhilasha Bhagat, ME Computer Engineering, G.H.R.I.E.T., Savitribai Phule University, pune PUNE, India Vanita Raut

More information

RepCom: A Customisable Report Generator Component System using XML-driven, Component-based Development Approach

RepCom: A Customisable Report Generator Component System using XML-driven, Component-based Development Approach RepCom: A Customisable Generator Component System using XML-driven, Component-based Development Approach LEONG CHEE HOONG, DR LEE SAI PECK Faculty of Computer Science & Information Technology University

More information

AN INFORMATION SYSTEM FOR RESEARCH DATA IN MATERIAL SCIENCE

AN INFORMATION SYSTEM FOR RESEARCH DATA IN MATERIAL SCIENCE 10.06.2013 Open Access Workshop DESY AN INFORMATION SYSTEM FOR RESEARCH DATA IN MATERIAL SCIENCE THORSTEN WUEST Page 1 Agenda 1. Introduction 2. Challenges and project goals 3. Use case and data model

More information

Integration of Product Ontologies for B2B Marketplaces: A Preview

Integration of Product Ontologies for B2B Marketplaces: A Preview Integration of Product Ontologies for B2B Marketplaces: A Preview Borys Omelayenko * B2B electronic marketplaces bring together many online suppliers and buyers. Each individual participant potentially

More information

An Approach to Resolve Data Model Heterogeneities in Multiple Data Sources

An Approach to Resolve Data Model Heterogeneities in Multiple Data Sources Edith Cowan University Research Online ECU Publications Pre. 2011 2006 An Approach to Resolve Data Model Heterogeneities in Multiple Data Sources Chaiyaporn Chirathamjaree Edith Cowan University 10.1109/TENCON.2006.343819

More information

Ontology Extraction from Tables on the Web

Ontology Extraction from Tables on the Web Ontology Extraction from Tables on the Web Masahiro Tanaka and Toru Ishida Department of Social Informatics, Kyoto University. Kyoto 606-8501, JAPAN mtanaka@kuis.kyoto-u.ac.jp, ishida@i.kyoto-u.ac.jp Abstract

More information

Aspects of an XML-Based Phraseology Database Application

Aspects of an XML-Based Phraseology Database Application Aspects of an XML-Based Phraseology Database Application Denis Helic 1 and Peter Ďurčo2 1 University of Technology Graz Insitute for Information Systems and Computer Media dhelic@iicm.edu 2 University

More information

Similarity-based web clip matching

Similarity-based web clip matching Control and Cybernetics vol. 40 (2011) No. 3 Similarity-based web clip matching by Małgorzata Baczkiewicz, Danuta Łuczak and Maciej Zakrzewicz Poznań University of Technology, Institute of Computing Science

More information

OSDBQ: Ontology Supported RDBMS Querying

OSDBQ: Ontology Supported RDBMS Querying OSDBQ: Ontology Supported RDBMS Querying Cihan Aksoy 1, Erdem Alparslan 1, Selçuk Bozdağ 2, İhsan Çulhacı 3, 1 The Scientific and Technological Research Council of Turkey, Gebze/Kocaeli, Turkey 2 Komtaş

More information

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW Ana Azevedo and M.F. Santos ABSTRACT In the last years there has been a huge growth and consolidation of the Data Mining field. Some efforts are being done

More information

Context-based Navigational Support in Hypermedia

Context-based Navigational Support in Hypermedia Context-based Navigational Support in Hypermedia Sebastian Stober and Andreas Nürnberger Institut für Wissens- und Sprachverarbeitung, Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg,

More information

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Abstractions in Multimedia Authoring: The MAVA Approach

Abstractions in Multimedia Authoring: The MAVA Approach Abstractions in Multimedia Authoring: The MAVA Approach Jürgen Hauser, Jing Tian Institute of Parallel and Distributed High-Performance Systems (IPVR) University of Stuttgart, Breitwiesenstr. 20-22, D

More information

FOAM Framework for Ontology Alignment and Mapping Results of the Ontology Alignment Evaluation Initiative

FOAM Framework for Ontology Alignment and Mapping Results of the Ontology Alignment Evaluation Initiative FOAM Framework for Ontology Alignment and Mapping Results of the Ontology Alignment Evaluation Initiative Marc Ehrig Institute AIFB University of Karlsruhe 76128 Karlsruhe, Germany ehrig@aifb.uni-karlsruhe.de

More information

Integrated Usage of Heterogeneous Databases for Novice Users
