Quixote: Building XML Repositories from Topic Specific Web Documents

Christina Yip Chung and Michael Gertz
Department of Computer Science, University of California, Davis, CA 95616, USA

Neel Sundaresan
NehaNet Corp, Paragon Drive, Suite E, San Jose, CA 95131, USA
neel@nehanet.com

1 Introduction

Despite major advancements in information retrieval techniques employed by today's Web search engines, building applications that allow users to efficiently manage, query, and utilize large collections of related Web documents from diverse, highly heterogeneous sources is still a hard problem. Even in the case where potentially related documents that pertain to the same topic can be gathered efficiently using, e.g., a focused Web crawler, the documents are still heterogeneous both in terms of structure and presentation, due to different authorship. More importantly, the documents are marked up in HTML for visual rendering purposes, which hampers query schemes more sophisticated than simple keyword-based searches.

In this paper, we outline the concepts and methods underlying Quixote, a system that allows users to rapidly build XML document repositories from large collections of topic specific HTML documents. Such documents are assumed to be gathered by a topic specific Web crawler. Examples of such topics include product descriptions, flight itineraries from airlines, bibliographies, company financial information, resumes etc. Based on a collection of documents, Quixote addresses the problem of document conversion and integration in the following way:

(1) The information buried in HTML documents is converted into XML documents. Based on user-defined, topic specific XML element names, the HTML documents are restructured as XML documents encoding information objects of the original HTML documents in respective logical XML element structures.
(2) For these XML documents, Quixote then determines a new type of schema, called majority schema, which concisely describes prevalent structures in the documents in the form of a DTD. A majority schema gives users not familiar with the documents a bird's eye view of the logical information content as a first step towards formulating queries against an XML document collection using XML query languages.

(3) Although the XML documents obtained in (1) all utilize the same XML element names, they can be structured differently, due to different authorship of the original HTML documents. The majority schema obtained in (2) describes prevalent structures among the XML documents and is eventually used to transform all XML documents such that they conform to the majority schema.

The XML documents obtained through the document conversion and transformation process together with the majority schema are used to build an XML document repository. Users and applications can formulate queries based on the majority schema using XML query languages. Related techniques, such as XSLT, can be used to obtain a uniform, application specific presentation of the documents. The majority schema can also be used for XML query optimization and the identification and specification of storage and index structures.

The methods realized in Quixote are closely related to work that has been done in the areas of source wrappers and schema discovery for XML documents. Several approaches have been proposed to extract information from HTML documents using wrappers. The first generation wrappers are hand-crafted wrappers, e.g., [HGMC+97, SA99, CDSS98], which require users to specify how to extract information. Such wrapper techniques take too much human effort and are very sensitive to changes in the format of documents. The second generation wrappers, e.g., [Ade98, AK97, DYKR00], learn extraction rules from examples given by users, but still require that the documents follow the same format or have only slight variations.
These approaches are also inapplicable to documents from a large number of diverse data sources, which typically employ different document formats. Unlike these wrapper approaches, Quixote can automatically extract information from heterogeneously structured HTML documents, requiring only a very minor amount of user input about the topic the documents are about.

Several approaches to infer a schema from a collection of XML documents have been proposed in the literature. These approaches differ in the level of detail at which (admissible) document structures are described. The approaches proposed in [NUWC97, PV00] infer exact schemas, which describe all structures in the input documents. Naturally, an exact schema can be very large for a large collection of heterogeneously structured documents. The approaches described in [GW99, WYW00, WL00, NAM98, GGR+00] infer approximate schemas, which make generalized statements about document structures. While these schemas are concise, they can be too general for heterogeneously structured documents. The assumption underlying our schema discovery approach is that there is often a common way to describe information related to a topic. The type of schema we propose is computed using an efficient data mining approach. By describing only prevalent structures among the input documents, a majority schema concisely describes the documents at the expense of losing only moderate coverage.

In Section 2, we outline the basic ideas and methods underlying the document conversion and schema discovery process realized in Quixote. Section 3 presents the evaluation of these methods. In the remainder of the paper, we use resumes as an example document collection to be integrated into an XML repository.

2 Approach

A user who is interested in building applications on top of an XML document repository initiates the document conversion and integration process. This process is based on a collection of topic specific Web documents that have been gathered by a focused Web crawler [IBM97]. For the document conversion process, the user specifies a set of concept names pertaining to the topic.
Such concept names will eventually be used as XML element names and are typically already present as input to the Web crawler. For example, for resume documents, concept names include degree, organization, date etc. Concept names together with examples of how to identify concepts as text components in HTML documents are input to the document converter. This component utilizes concept information and the tree representation of HTML documents (based on the Document Object Model) to restructure the HTML documents into XML documents. This process is detailed in Section 2.1.

Although the resulting XML documents share a common set of XML elements, the nesting structures of the XML elements can be heterogeneous. The schema discovery approach, which is detailed in Section 2.2, determines prevalent structures among the XML documents in the form of a majority schema, which can easily be translated into a DTD. This DTD can eventually be used to transform the XML documents such that they all conform to the DTD. Due to space limitations, the document transformation approach, which is based on ideas similar to those described in [Mur97], will not be discussed in this paper.

2.1 Document Conversion

The goal of the document conversion process is to extract information content from HTML documents and to embed the extracted information in appropriate XML element structures. The major difficulty in doing this is that HTML is a markup language to describe the visual representation of a document, not its logical structure. Information carrying objects are buried in the text of documents, and the logical layout of the information content is not reflected in HTML document trees. In addition, people may mark up the same information content in different ways in HTML. However, we notice that often the visual representation of an HTML document gives very strong clues about the underlying logical document structure.
Furthermore, in a document the representation of related information carrying objects (e.g., elements of a list) is quite regular. Based on these observations, which can be made for almost all topic specific Web documents, we use the "semantics" of HTML markup tags, the structure of HTML document trees and examples for associating XML elements (topic concepts) with HTML text to convert HTML documents into XML documents. The methods guiding the conversion process are concept identification and tree restructuring.

Concept Identification. This step identifies topic concepts (specified by the user) in text nodes of HTML documents. Each text node in an HTML document is split into tokens according to punctuation delimiters such as ';', ',', ':', '-'. Concepts associated with tokens are identified by classification or pattern matching. In classification, examples given by the user that associate concept instances with concepts are used to train a Bayes classifier. Classes are concepts while tokens are instances of the classes. The Bayes classifier then classifies a token in a document as the concept with the highest relative probability of being in that class. Tokens that cannot be classified above a certain probability threshold are assigned to an extra class. In pattern matching, the system matches a token with the topic specific keywords or patterns that have been specified by the user. Whenever there is a match, the corresponding concept is associated with the token. Eventually, an XML node labeled with the identified concept is created. Assume, for example, the text structure

    <li> B.Sci. (Computer Science), June 1999, University of California, GPA 3.9/4.0 </li>

Four tokens can be determined for this text node. Based on user specified concept keywords or examples, this text structure is reorganized as follows:

    <li>
      <degree val="B.Sci. (Computer Science)"/>
      <date val="June 1999"/>
      <institution val="University of California"/>
      <gpa val="GPA 3.9/4.0"/>
    </li>

As a technical convenience, the text value of a token is stored in an attribute named val. Values of tokens (plus certain optional context information) that cannot be associated with a concept are passed as a value of the attribute val to the parent node. Thus, no information is lost during the concept identification process.

Tree Restructuring. This step operates on the intermediate tree structure obtained by the concept identification process. The goal of tree restructuring is to reorganize the tree representation of a document such that the resulting tree reflects the logical layout of the information carrying objects in the document. In particular, all HTML nodes are replaced by XML element nodes. The basic idea underlying our tree restructuring method is that in a document tree high level objects are typically detailed by lower level objects (child nodes). This idea is reflected by applying the following three core rules to a document.
Rule 1: Some HTML tags, called group tags (e.g., h1, p, table), are used to group related information objects based on their visual rendering properties. Thus, all sibling nodes in an HTML tree that occur between two consecutive group tags having the same label are restructured as children of the first group tag. For example, given p is a group tag,

    <p/> <degree.../> <gpa.../> <p/> <degree.../> <thesis.../> <p/>

is restructured as

    <p> <degree.../> <gpa.../> </p> <p> <degree.../> <thesis.../> </p>

Rule 2: HTML tags need to be replaced by their XML children nodes. Some HTML tags, called list tags (e.g., dir, ul, ol), group related information objects at the same level of abstraction (i.e., at the same depth in the document tree). A node labeled with a list tag is replaced by its child nodes. All such nodes remain siblings. For example, given ul is a list tag, then

    <p> <ul> <degree.../> <organization.../> <gpa.../> </ul> </p>

is restructured as

    <p> <degree.../> <organization.../> <gpa.../> </p>

Rule 3: Since the format within a topic specific document is regular, repeating XML sibling nodes correspond to groups of information objects that are semantically related. An important observation is that the first XML child node of an HTML node typically represents the information content of the children of the HTML node, analogous to the topic sentence of a paragraph. Thus, if there are repeating patterns among XML child nodes of an HTML node, the XML child nodes are grouped accordingly. The first XML child node of the group replaces the HTML node. For example, the pattern degree-gpa repeats among the child nodes of p in the following document fragment:

    <h2> <p> <degree.../> <gpa.../> <degree.../> <gpa.../> </p> </h2>

This corresponds to two groups of information objects with the first XML node representing the content of each group. It is thus restructured as

    <h2> <degree...> <gpa.../> </degree> <degree...> <gpa.../> </degree> </h2>

The above rules are applied in consecutive order to the HTML document tree obtained through concept identification. Rule 1 is applied to each node once in a top-down manner while rules 2 and 3 are applied in a bottom-up fashion. Certain HTML markup tags, for example font markup tags, are currently not considered by the rules. Such tags are simply deleted during the application of the above rules. Thus, the rules ensure that the resulting documents only contain XML nodes and that the documents are well-formed.
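To make the restructuring rules more concrete, the following is a minimal Python sketch of Rules 2 and 3 on a toy tree, not the authors' implementation. The `Node` class, the tag sets and the function names are hypothetical, and the test for a "repeating pattern" (the label of the first XML child occurs at least twice among the children) is a simplifying assumption; Rule 1 and attribute handling are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed tag sets, taken from the examples in the text (not the full sets).
LIST_TAGS = {"ul", "ol", "dir"}
HTML_TAGS = {"h2", "p", "li"} | LIST_TAGS  # every other label counts as an XML concept


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_html(self) -> bool:
        return self.label in HTML_TAGS


def apply_rule2(node: Node) -> Node:
    """Rule 2 (bottom-up): replace every list-tag child by its own children."""
    new_children = []
    for child in node.children:
        apply_rule2(child)
        if child.label in LIST_TAGS:
            new_children.extend(child.children)  # children stay siblings
        else:
            new_children.append(child)
    node.children = new_children
    return node


def group_children(node: Node) -> Optional[List[Node]]:
    """Group repeating XML children under the first node of each group."""
    if not node.children or any(c.is_html() for c in node.children):
        return None
    head = node.children[0].label
    if sum(c.label == head for c in node.children) < 2:
        return None  # assumption: no repetition of the leading label, no grouping
    groups: List[Node] = []
    for child in node.children:
        if child.label == head:
            groups.append(child)               # starts a new group
        else:
            groups[-1].children.append(child)  # detail of the current group
    return groups


def apply_rule3(node: Node) -> Node:
    """Rule 3 (bottom-up): an HTML child with repeating groups is replaced by them."""
    new_children = []
    for child in node.children:
        apply_rule3(child)
        groups = group_children(child) if child.is_html() else None
        if groups is None:
            new_children.append(child)
        else:
            new_children.extend(groups)
    node.children = new_children
    return node


def show(node: Node) -> str:
    """Serialize a tree in the compact notation used in the examples."""
    if not node.children:
        return f"<{node.label}/>"
    inner = " ".join(show(c) for c in node.children)
    return f"<{node.label}> {inner} </{node.label}>"


rule2_tree = Node("p", [Node("ul", [Node("degree"), Node("organization"), Node("gpa")])])
rule3_tree = Node("h2", [Node("p", [Node("degree"), Node("gpa"),
                                    Node("degree"), Node("gpa")])])
print(show(apply_rule2(rule2_tree)))
print(show(apply_rule3(rule3_tree)))
```

The two toy trees mirror the Rule 2 and Rule 3 examples above; a fuller treatment would need a more robust pattern detector than the leading-label heuristic used here.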

2.2 Schema Discovery

After the HTML documents have been converted into XML documents, the next step Quixote applies is to infer a schema from the documents. Since the documents have been gathered from a large number of diverse Web sources, they can be highly heterogeneous in terms of their structure. As mentioned earlier, an exact schema for these documents would be too large whereas approximate schemas of lower relevance determined by known schema discovery approaches would be too general. The majority schema we propose for topic specific documents covers only prevalent structures in the documents, i.e., the typical way of structuring information carrying objects. This type of schema is a suitable choice for topic specific documents since a majority schema is relevant and concise, at the expense of losing only a very moderate amount of coverage.

Although existing approaches can be used to discover a majority schema (e.g., [WL00]), in Quixote we adopt an approach that greatly simplifies the discovery process. Initially, we deliberately ignore certain details of a schema, such as the content model and grouping information about XML elements in the schema. This transforms the problem into the frequent itemset generation problem, for which a well-known, efficient data mining algorithm exists [AS94]. After inferring an initial majority schema, details missing in the majority schema are filled in using heuristics and data gathered during the initial schema discovery.

In our approach, we use the XML document tree model where for a document $X$, each node from a node set $V_X$ can be uniquely identified, and with each node $v \in V_X$ a node label (XML element name) can be associated. Furthermore, with each node a set of attributes and a list of child nodes is associated. A node path $v = \langle v_1 \ldots v_n \rangle$, $v_i \in V_X$, in a document $X$ is a sequence of nodes starting from the root node of the document such that for each $v_i \in v$ we have $v_i \in children(v_{i-1})$, $1 < i \le n$.
With each node in a node path $v$ a label can be associated, leading to a so-called label path, which represents a sequence of XML element names. Naturally, different node paths can have the same label path. A collection $\mathcal{X}$ of XML documents can easily be mapped to a multiset $P$ of label paths where each document $X \in \mathcal{X}$ is mapped to a set of label paths (a set, in order not to be too biased towards certain highly repetitive structures in only a very few documents).

Assume a label path $p = p_1 \ldots p_n$. Its support, denoted $support(p)$, is the number of occurrences of $p$ in $P$. Since the support of a label path typically decreases with its length, we determine how prevalent a label path $p$ is by computing the ratio of $support(p)$ to $support(p_1 \ldots p_{n-1})$, called its support ratio. A frequent label path then is a label path with a support ratio not less than a user-defined parameter ratio_threshold. This parameter can be used to guide the discovery component, depending on how much coverage the resulting schema is required to have. By progressively mining label paths of increasing length, the mining algorithm discovers maximal frequent label paths, i.e., frequent label paths that are not subpaths of any other frequent label path.

The frequent label paths discovered can easily be mapped to an unordered XML tree. This tree is only an initial majority schema, since information about ordering, multiplicity and element content models is ignored. In deriving a DTD from this initial majority schema, certain data that have been gathered during the computation of the multiset of label paths $P$ are used to determine not only the content model for the elements in the initial schema, but also to recover multiplicity and ordering information. Due to space limitations, the details of this derivation process will not be discussed in this paper.
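The definitions of support, support ratio and maximal frequent label paths can be illustrated with a small Python sketch. It is a simplified level-wise pass rather than the algorithm of [AS94], and it rests on three assumptions not stated in the text: a document is a nested `(label, children)` tuple, a path of length one is compared against the number of documents, and a path is only considered if its parent path is frequent. All function names and the toy data are hypothetical.

```python
from collections import Counter


def label_paths(tree, prefix=()):
    """Yield the label path (from the root) of every node in a document tree."""
    label, children = tree
    path = prefix + (label,)
    yield path
    for child in children:
        yield from label_paths(child, path)


def support_counts(documents):
    """Build the multiset P: each document contributes a *set* of label paths."""
    counts = Counter()
    for doc in documents:
        counts.update(set(label_paths(doc)))
    return counts


def frequent_paths(documents, ratio_threshold):
    """Return label paths whose support ratio is at least ratio_threshold."""
    counts = support_counts(documents)
    frequent = set()
    for path in sorted(counts, key=len):       # shorter paths first
        parent = path[:-1]
        if parent and parent not in frequent:  # assumption: prune below infrequent paths
            continue
        # Assumption: a root path is compared against the number of documents.
        denominator = counts[parent] if parent else len(documents)
        if counts[path] / denominator >= ratio_threshold:
            frequent.add(path)
    return frequent


def maximal(paths):
    """Keep paths that are not a proper prefix of another path in the set."""
    return {p for p in paths
            if not any(q != p and q[:len(p)] == p for q in paths)}


# Three toy resume documents (hypothetical data).
docs = [
    ("resume", [("education", [("degree", [("gpa", [])]), ("degree", [])])]),
    ("resume", [("education", [("degree", [])]), ("hobby", [])]),
    ("resume", [("education", [("degree", [("gpa", [])])])]),
]

for threshold in (0.5, 0.7):
    print(threshold, sorted(maximal(frequent_paths(docs, threshold))))
```

In this toy collection the path resume/education/degree/gpa has support 2 against a parent support of 3, so it is kept at a threshold of 0.5 and dropped at 0.7, which shortens the maximal path accordingly.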
3 Evaluation

We finally describe one of several empirical studies we used to evaluate the effectiveness, scalability and feasibility of the methods realized in Quixote. A collection of resume documents has been gathered using a crawler realized in IBM's GrandCentralStation [IBM97]. We ran our experiments on a Pentium 266MHz processor with 196MB main memory and 512KB cache. 24 XML tags (topic concepts) have been specified by the user. The following settings have been used for the document conversion process: the punctuation delimiters ';', ':' and ',' for tokenization, and, for the restructuring rules, the group tags {h1, ..., h6, div, p, tr, dt, dd, li, title, u, strong, b, em, i} and the list tags {body, table, dl, ul, ol, dir, menu}.

Accuracy of Document Conversion. The accuracy of the document conversion has been evaluated by counting the number of wrong parent-child and sibling relationships in the XML document trees computed from input HTML documents. 50 resumes were manually inspected (Figure 1). The average number of errors per document is 3.9 and the average number of XML nodes per document is 53.7. The error percentage, averaged per document, is 9.2%.

Scalability of Document Conversion. We measured the scalability of the document conversion process against the size of the documents, the number of nodes in the HTML documents and the number of XML nodes. Datasets with an increasing number of resumes were chosen at random from the collection. Figure 2 shows that the running time bears a very strong linear relationship with the number of XML nodes. The running time also scales linearly with the number of nodes and the number of documents.

Fig 1. Document Conversion Accuracy. Axes: Number of Documents; Error (%) (Num. of Errors / Num. of keyword nodes). Avg Error % Per Document = 9.2%.

Fig 2. Document Conversion Scalability. Axes: Time (min); measure of input size (number of documents, nodes (in hundred), keyword nodes (in hundred)). Avg = 35 sec/document.

Fig 3. Schema Discovery Feasibility.

Feasibility of Schema Discovery Approach. We ran experiments to demonstrate that a majority schema concisely describes topic specific documents at the expense of losing only a moderate amount of coverage. The conciseness of a majority schema is based on its number of XML nodes. It is normalized to the range [0, 1] by the size of the majority schema at ratio_threshold = 0, which gives an upper bound on the sizes of majority schemas over different values for ratio_threshold. The coverage of a schema is the number of nodes in the documents conforming to the schema, normalized to [0, 1] by the total number of nodes. A node is said to conform to the majority schema if its label path from the root of the document is found in the majority schema. We selected at random three datasets of increasing sizes of 40, 80 and 120 resumes, respectively. Majority schemas for different values of ratio_threshold were discovered. The result is shown in Figure 3, which confirms the applicability of majority schemas in describing topic specific documents.
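The two measures defined above can be written down as a short Python sketch. The text does not give the exact formula for conciseness; the version below assumes conciseness = 1 - size(schema) / size(schema at ratio_threshold = 0), which yields 0 for the most precise schema. The tuple-based tree representation, the function names and the toy data are hypothetical, and a schema is represented simply as the set of label paths it contains.

```python
def node_label_paths(tree, prefix=()):
    """Yield the label path of every node (with repetitions) in a document tree."""
    label, children = tree
    path = prefix + (label,)
    yield path
    for child in children:
        yield from node_label_paths(child, path)


def coverage(documents, schema_paths):
    """Fraction of document nodes whose label path occurs in the schema."""
    total = 0
    conforming = 0
    for doc in documents:
        for path in node_label_paths(doc):
            total += 1
            if path in schema_paths:
                conforming += 1
    return conforming / total


def conciseness(schema_paths, full_schema_paths):
    """Assumed formula: 1 minus schema size relative to the threshold-0 schema."""
    return 1.0 - len(schema_paths) / len(full_schema_paths)


# Three toy resume documents (hypothetical data), 13 nodes in total.
docs = [
    ("resume", [("education", [("degree", [("gpa", [])]), ("degree", [])])]),
    ("resume", [("education", [("degree", [])]), ("hobby", [])]),
    ("resume", [("education", [("degree", [("gpa", [])])])]),
]

# Schema at ratio_threshold = 0: every label path that occurs in the documents.
full_schema = {path for doc in docs for path in node_label_paths(doc)}
# A smaller schema that drops the rare path resume/hobby.
smaller_schema = full_schema - {("resume", "hobby")}

print(coverage(docs, full_schema), conciseness(full_schema, full_schema))
print(coverage(docs, smaller_schema), conciseness(smaller_schema, full_schema))
```

Dropping the single rare path removes one of five schema nodes while only one of thirteen document nodes stops conforming, which is the kind of trade-off the experiment measures.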
Conciseness increases sharply at a low ratio_threshold, at a moderate cost of losing coverage. For example, compared to the most precise majority schema (ratio_threshold = 0), the majority schemas at ratio_threshold = 0.1 boost the conciseness from 0 to at 80% coverage. The majority schemas at ratio_threshold = 0.2 have a conciseness of at 70% coverage.

4 Conclusions and Future Work

In this paper, we have outlined the methods underlying a complete framework to integrate topic specific HTML documents into XML repositories. The novelty of our HTML to XML document conversion process is that it requires only minimal user input and is applicable to large collections of heterogeneously structured HTML documents. In particular, we exploit the typical usage of HTML markup tags in Web documents as clues to determine the logical document structure. We have also outlined a very efficient approach to discover a majority schema from XML documents. Such a novel type of schema is not only relevant and concise, but it also loses only a very moderate amount of document coverage. Both approaches are shown to be efficient and scalable in different experiments dealing with moderate size document collections.

We are currently investigating the role and usage of link structures that might exist in topic specific HTML documents. We expect that respective results will lead to approaches that allow the integration of HTML documents covering broader types of topics. We are also studying the extension of the proposed schema discovery approach to more sophisticated schema formalisms, such as XML Schema [W3C00].

References

[Ade98] B. Adelberg. NoDoSE: A tool for semi-automatically extracting semi-structured data from text documents. In Proc. SIGMOD Intl. Conference on Management of Data, 283-294, 1998.

[AK97] N. Ashish, C. Knoblock. Semi-automatic wrapper generation for Internet information sources. In Proc. 2nd Intl. Conference on Cooperative Information Systems (CoopIS'97), 160-169, 1997.

[AS94] R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, 487-499, 1994.

[CDSS98] S. Cluet, C. Delobel, J. Simeon, K. Smaga. Your mediators need data conversion! In Proc. SIGMOD International Conference on Management of Data, 177-188, 1998.

[DYKR00] H. Davulcu, G. Yang, M. Kifer, I. Ramakrishnan. Computational aspects of resilient data extraction from semistructured sources. In 19th ACM Symposium on Principles of Database Systems, 136-144, 2000.

[GGR+00] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, K. Shim. XTRACT: A system for extracting document type descriptors from XML documents. In Proc. ACM SIGMOD International Conference on Management of Data, 165-176, 2000.

[GW99] R. Goldman, J. Widom. Approximate dataguides. In Proceedings of the Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats, 1999.

[HGMC+97] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo. Extracting semistructured information from the Web. In Proc. of the Workshop on Management of Semi-Structured Data, 18-25, 1997.

[IBM97] IBM Almaden Research Center. IBM: All searches start at Grand Central. Network World Front Page, November 1997.

[Mur97] M. Murata. Transformation of documents and schemas by patterns and contextual conditions. In Principles of Document Processing, 3rd Int. Workshop, LNCS 1293, Springer, 153-169, 1997.

[NAM98] S. Nestorov, S. Abiteboul, R. Motwani. Extracting schema from semistructured data. In Proc. ACM SIGMOD Intl. Conference on Management of Data, 295-306, 1998.

[NUWC97] S. Nestorov, J. Ullman, J. Wiener, S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proceedings of the 13th International Conference on Data Engineering, IEEE Computer Society, 79-90, 1997.

[PV00] Y. Papakonstantinou, V. Vianu. DTD inference for views of XML data. In Proceedings of the 19th Symposium on Principles of Database Systems, 35-46, 2000.

[SA99] A. Sahuguet, F. Azavant. Looking at the Web through XML glasses. In Proc. 4th International Conference on Cooperative Information Systems (CoopIS'99), 148-159, 1999.

[W3C00] W3C Working Group. XML Schema Part 1: Structures. W3C Candidate Recommendation, October 2000.

[WL00] K. Wang, H. Liu. Discovering structural association of semistructured data. Transactions on Knowledge and Data Engineering, 12(3), 2000.

[WYW00] Q. Wang, J. Yu, K. Wong. Approximate graph schema extraction for semi-structured data. In 6th Intl. Conference on Extending Database Technology, LNCS 1777, Springer, 302-316, 2000.
