Learning Object Models from Semistructured Web Documents

Shiren Ye and Tat-Seng Chua

IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 3, March 2006


Abstract: This paper presents an automated approach to learning object models by means of useful object data extracted from data-intensive semistructured Web documents such as product descriptions. Modeling intensive data on the Web involves the following three phases: First, we identify the object region covering the descriptions of object data while irrelevant contents of the Web documents are excluded. Second, we partition the contents of the different object data appearing in the object region and construct object data as hierarchical XML outputs. Third, we induce the abstract object model from the analogous object data. This model matches the corresponding object data from a Web site more precisely and comprehensively than existing hand-crafted ontologies. The main contribution of this study is in developing a fully automated approach to extracting object data and object models from semistructured Web documents using kernel-based matching and View Syntax interpretation. Our system, OnModer, can automatically construct object data and induce object models from complicated Web documents, such as the technical descriptions of personal computers and digital cameras downloaded from manufacturers' and vendors' sites. A comparison with available hand-crafted ontologies and tests on an open corpus demonstrate that our framework is effective in extracting meaningful and comprehensive models.

Index Terms: Web mining, machine learning, intelligent Web services and Semantic Web, Web text analysis, knowledge acquisition, ontology design, computational geometry and object modeling, DOM.

(The authors are with the Department of Computer Science, National University of Singapore. E-mail: {yesr, chuats}@comp.nus.edu.sg. Manuscript received 22 Oct. 2004; revised 2 May 2005; accepted 1 Sept. 2005; published online 18 Jan. 2006.)

1 INTRODUCTION

Most documents on the Web are semistructured. Compared with unstructured and structured documents, the key features of semistructured documents are: 1) similar objects are grouped together but are allowed to differ in order, size, and type; 2) they possess unfixed structures, often with missing components; and 3) they are large, rapidly evolving, and often do not have a well-defined schema. Structured documents can be considered specific, high-quality cases of semistructured ones. Unlike most unstructured documents in plain text, these data-intensive semistructured documents are mostly generated by dynamic programs or templates. As more and more companies and organizations publish and manage their business and services on the Web, (semi)structured Web documents about product categories and listings, member collections, financial statements, travel information, etc., have become extremely prevalent. Given their sheer number, it is difficult and even impossible to explore manually the pages created by dynamic programs. There is, therefore, a strong need to track, extract, and organize these pages for effective access and management. An example of a typical semistructured document, a product description page, is shown in Fig. 1. There are multiple types of data components on the page, including the product descriptions (the key information), the advertisement bar, the search and filtering panel, and the navigator bar.
Most users and applications require only the description of the desired product and have no need for the other details. It is, therefore, necessary to extract the key information (called object data or object instances, abbreviated O_DATA) that usually lies within a contiguous region (called the object region, abbreviated O_REG) in the Web page. Typically, most semistructured documents contain only one O_REG surrounded by some irrelevant components, and each region lists one or more pieces of O_DATA. These O_DATA, which are about the same kind of entities from the same site, are usually analogous: they use compatible terminology and structure to describe the data. It is thus possible and necessary to learn an object model (abbreviated O_MDL) about them, since this model is often unavailable. O_MDL provides the basis for extracting and using O_DATA accurately and completely, and it serves as the main motivation of our research. Our study supports the following interesting and valuable application scenarios: 1) O_DATA is reverse-transformed from semistructured Web documents into a repository (database) whose schema is induced from O_DATA with minimal human effort [1]. 2) It provides a feasible channel for converting and constructing fundamental materials as the current Web shifts to the Semantic Web [6], [31]. 3) It facilitates the construction of answers where users expect comprehensive answers involving tabulated results from Web sites [14]. However, it is difficult to directly extract O_DATA and O_MDL from semistructured Web documents. The obstacles include:

1. Numerous commercial sites have formed a set of autonomously distributed business information islands with pages in different formats and styles. It is difficult to learn rules or patterns to handle heterogeneous product instances effectively.

Fig. 1. Layout of a typical semistructured Web document.

2. O_DATA often intertwine with irrelevant content in Web pages. Indeed, the proportion of relevant information is normally less than 10 percent in many product pages. The rest is irrelevant items such as HTML tags for display formats, the advertisement bar, product category, search and filtering panel, navigator bar, copyright statement, feedback form, etc.

3. Most Web pages depict more than one object instance (i.e., product) in an O_REG. Although each instance may appear coherent in the formatted view on the screen, its raw source is actually scattered over many noncontiguous sections. Thus, one issue is how to detect the right association of objects with their attribute names and values.

4. The order, number, and format of the attributes involved in depicting similar O_DATA may vary greatly across pages from different sites. For example, a page put up by a small marketing company may list only the main features of a product, while the corresponding page put up by the manufacturer may show all descriptions of the same product (some even with more than 100 items). It is thus difficult to construct a fixed template or ontology to match all the desired attributes. Moreover, many new and optional attributes may not be reflected in the training data, making the problem of extracting information from live Web pages even more difficult.

Because of the above problems, hand-coded rules or fixed wrapper systems cannot achieve good performance on complex Web pages. Ideally, we need an automated system to extract O_DATA and O_MDL. Since the pages from the same Web site are normally generated by the same dynamic program or carefully maintained template, they tend to share a similar structure. It is thus possible to uncover the common structure containing the relevant object information in O_REG by analyzing the pages in conjunction. Usually, the contents of O_DATA can be explored from three aspects: 1) syntactic, which describes the rule, grammar, order, and format of data representation; 2) structural, which describes the manner in which data are organized or interrelated; and 3) semantic, which gives meaning to data within a given context. In this paper, we aim to develop an unsupervised learning approach to explore the actual meaning (or semantics) of O_DATA from a collection of analogous Web pages. We seek to induce an O_MDL using analogous O_DATA extracted from the original pages. This O_MDL is an abstraction of O_DATA and provides the domain knowledge to guide users and software agents in interpreting and extracting O_DATA. In our initial investigations, we attempted supervised learning approaches to extract O_DATA and O_MDL, but these experiments were time-consuming and performed poorly. For example, for an object with 100 attributes, we needed to collect enough instances to cover all the involved attributes and label all of them. We also needed to train different classifiers for diverse objects. Most of the classifiers became invalid when they encountered pages from new sites using different vocabulary and representation styles. Moreover, if the source codes for multiple O_DATA were entwined, the classifiers possibly arranged the corresponding attributes into wrong groups. Hence, we decided to explore automated learning approaches to deal with documents from real applications.
As the organization of complex Web documents is rather complicated, it is not possible to learn an O_MDL from Web pages directly. We thus divide the problem into the three stages of: 1) identifying an O_REG covering the product descriptions in Web pages, 2) constructing O_DATA in O_REG, and 3) constructing the abstract O_MDL from the analogous O_DATA. The main contribution of this study is to automate the approach to extracting object data and object models from semistructured Web documents using kernel-based matching and View Syntax interpretation. Some preliminary results for O_REG identification and O_DATA construction have been reported in our previous work [33], [34]. This research extends our previous work in three aspects: 1) We improved the algorithms for constructing O_DATA from semistructured and structured pages by leveraging our View Syntax interpretation (the previous method handled structured pages only). 2) We further developed O_MDL induction to cover an entire object modeling workflow. 3) We carried out more experiments

and analysis to support our arguments and verify our algorithms.

The rest of this paper is organized as follows: Section 2 introduces related work on modeling data on the Web, and Section 3 presents the overall procedure for learning O_MDL through the segmentation and extraction of O_DATA in O_REG. Sections 4 and 5, respectively, present the algorithm for detecting O_REG and that for partitioning and extracting O_DATA. Section 6 describes the method for constructing O_MDL. We present and analyze the results of our experiments in Section 7. Finally, we conclude our study in Section 8.

2 RELATED WORK

Following the intense interest in the Semantic Web, research on extracting and integrating information from the Web has become more and more important over the past decade. This is because it helps to tackle the urgent business problem of collating, comparing, and analyzing business information from multiple sites. Several areas of work are related to our study, including ontology learning, information extraction (IE), data integration, and Web page cleaning. We briefly discuss some of the works below.

Ontology learning has become a hot research topic in recent years [19], [21]. It uses machine learning techniques, such as classification [13], clustering [29], inductive logic programming [9], association rules [20], concept induction [17], [23], and Naive Bayes [7], to construct ontologies and semantic annotations. Discovering complicated domain ontologies, which could provide a detailed description of the concepts for a restricted domain, is an important subtask of ontology learning. For example, Maedche and Staab [20] employed a generalized association rule extraction algorithm to detect hierarchical relations between concepts and determine the appropriate level of abstraction at which to define relations. Davulcu et al. [6] used automated techniques for bootstrapping and populating specialized domain ontologies (concept taxonomies) by organizing and mining directories and links in a set of relevant Web sites provided by the user. Webb et al. [32] demonstrated that integrating machine learning techniques with knowledge acquisition from experts can both improve the accuracy of the domain ontology developed and reduce development time. Cole and Eklund [5] applied formal concept analysis to mine the implicit structure of semistructured sources such as classified advertisements, where a hand-crafted parser is used to extract structured information. Volz et al. [31] described a metadata creation framework for Web pages from a database whose owners would collaboratively provide the semantic annotation. In general, most domain ontology acquisition approaches are still guided by human knowledge engineers, and automated learning techniques play a minor role in the knowledge acquisition process. The learned ontologies are mainly about (partial) concept taxonomies or concept relationships rather than a complete description framework for the published objects.

However, before an O_MDL or ontology can be built, we need to extract the appropriate O_DATA from Web pages. This is typically done using the IE tool called a wrapper [18], which is a program that extracts data from Web pages and stores them in a database. The wrapper can either be generated by a human being or learned from labeled data, both of which are labor-intensive and time-consuming.
Moreover, the generalization of the wrapper is limited as it uses HTML tags to represent rules. Many wrapper induction systems assume that a priori knowledge about page organization (such as the schema of data in a page) is available and that the sources are flat records rather than nested data. They also lack the ability to extend to diverse sites with different representation styles. These problems hinder the wider application of wrapper induction systems. To overcome these problems, some researchers have tried to automate the mining of knowledge or data from Web pages by using unsupervised learning approaches such as clustering and grammar induction, or heuristic rules [2], [3], [4], [14], [16]. For example, Embley et al. [8] presented a novel automated approach to extract and structure data from multiple-record documents posted on the Web, but the method requires users to predefine a detailed object-relationship model which encapsulates all object sets, relationship sets, participation constraints, generalization/specialization, as well as the extraction rules that exactly match the available pages. Etzioni et al. [10] employed KnowItAll to automate the process of extracting large collections of facts. It focuses only on local contents such as names of scientists or politicians rather than whole object descriptions and models. The other type of automated approach includes metadata generation tools, such as screen scraping and metadata harvesting. However, they rely on the fragile structure of HTML layouts and are inherently unstable. On balance, most of these techniques require well-formatted tables, with at least two pieces of O_DATA (records) appearing in a page. So, their function is largely similar to table detection and data extraction from tables.

Web page cleaning has been developed to identify redundant, useless, or irrelevant components (e.g., ads bar and category) in a Web page that users would not be interested in. Current research uses a priori knowledge or supervised learning to detect frequent templates [2], coherent content blocks [15], and site style trees [35]. Gupta et al. [12] used a predefined unimportant tag-set and link-list to filter irrelevant components (nodes) and parse them into DOM trees. Their approach, however, depends heavily on manual configuration and on the enumeration completeness of trivial nodes, and will not be able to handle the complicated pages found in most Web sites. Our study emphasizes the development of an automated approach to detect the important portion (O_REG) rather than to eliminate the unimportant components in complex Web pages. We seek to handle the more complicated cases of multiple object representations in the form of tables, (nested) listings, and text fragments surrounded by HTML tags, with an unrestricted number of object instances in a page. We seek to mine an O_MDL that captures the overall structure and constraints of a general object, such as the relations and value spaces among the properties of an object, rather than merely extracting any single attribute of the object.

3 OUR APPROACH: MODELING INTENSIVE DATA ON THE WEB

3.1 Ontology versus Object Model

Ontology is a branch of philosophy that attempts to model things as they exist in the world. It is particularly appropriate for modeling objects and their relationships and properties. The object model (conceptual model or data model) and axioms are the essence of an ontology. Semantic object models can provide an ontology that describes the knowledge of the domain that the users are interested in [26], [30]. The main difference between an ontology and O_MDL is that shareability, portability, and interoperability require the ontology to be formal, shareable, and reusable, whereas O_MDL is usually task-specific and implementation-oriented. In a way, O_MDL encodes the syntactic, structural, and semantic information about the objects. It is an important component as we transform from the traditional Web to the Semantic Web and Web Services, whose success and proliferation depend on how quickly and cheaply we can construct domain ontologies and models.

The variations of real data from heterogeneous Web sites are pervasive and unpredictable. Although ontologies and metadata created by human experts are available in many domains, most of them cannot correctly match the real data that the users need to process. This is because the ontologies may be too general in nature to capture all the details of a specific domain. So, even though an existing ontology may be coherent, it is difficult to use it to analyze semistructured Web pages and extract the corresponding concepts. Moreover, there are several additional reasons why we need automated techniques to construct O_MDL:

1. The contents of product or service pages are dynamic. The existing ontologies are often out-of-date and do not cover new descriptions. New attributes/elements frequently emerge in O_DATA. It would be costly to maintain and update O_MDL regularly and accurately.

2. In real applications, users need to spend much time deriving and understanding data schemas and business rules when they collect and clean data from the Web. An automatically derived O_MDL is the key and roadmap to understanding the corresponding object instances. If unsupervised learning algorithms can produce a proper O_MDL that matches well the data from a particular site, they could reduce the cost of acquiring and exploiting data from the Web.

3. A high-quality O_MDL that has been reviewed by domain experts can be used as the basis to guide object identification and extraction, and to correct errors and defaults among O_DATA. O_MDL can also provide the schema design for a repository of extracted O_DATA, and help in the understanding and application of O_DATA.

4. This automated approach brings additional benefits. The corresponding real-time and latest O_DATA detected from Web pages will enrich the learned O_MDL and make it more valuable. On the other hand, O_MDL can guide users on how to use these O_DATA and play a key role in transforming the current Web into the Semantic Web.

3.2 Outline of Our Approach

It is difficult to model the intensive data in Web pages directly. This is due to the complex organization of semistructured pages, where irrelevant components, object names, attribute names, and values about multiple objects intertwine.
We need to apply an indirect approach by first constructing clean and structured intermediate data, O_DATA, before learning O_MDL from these high-quality object instances. One key function involved in our technique is evaluating the similarity among the components in the pages. Similarity is the key to detecting the repeated structures that are used to depict analogous objects, both across similar pages and among repeated components within a page (if it contains multiple objects). The quality of the induced O_MDL depends largely on the accuracy of the extracted O_DATA. Obviously, string matching and bag-of-words techniques cannot be used to identify irrelevant contents and partition O_DATA. Instead, we seek inspiration from the following observations:

1. We observe that in a Web site, pages about objects in the same category, such as products, services, and member listings, always have similar representation structures and similar irrelevant contents. This is because, in most such cases, they are produced by the same dynamic programs or templates that are carefully maintained by the developers. Thus, we can make use of the stable representation structure of the pages within a Web site to identify the useful contents.

2. If we ignore the title of the pages, we can expect O_REG to appear in a contiguous area, both visually on the screen and in the raw source of Web pages. However, we note that although the description for each object instance is visually contiguous, the raw source of O_DATA might not be contiguous.

3. If we compare similar pages from the same site using HTML elements as the unit of measure, we find that most of the content differences within a page appear in O_REG, which is used to describe different O_DATA with different attribute values.

Thus, by choosing a suitable set of content features and appropriate similarity measures, we should be able to identify irrelevant components as having similar structure and/or content, and O_REG containing product information as different. Based on the above observations, we devise the following workflow to extract O_DATA and the corresponding O_MDL from complex semistructured pages:

1. We employ spiders to download the related pages from a specific site, or collect the related pages from a service provider. We denote this set of collected Web pages as P_SET.

2. We select a Web page p_i as the focus and acquire a set of pages P_SIM that are similar to p_i. We utilize the DOM structures (Document Object Model; see the W3C DOM Level 2 HTML specification) of the Web pages as the

feature representation for similarity calculation. Here, an HTML element's text and attributes are considered the content of that node. We use a tree-based kernel [11], [22] to evaluate the similarity between Web pages, as it can reflect similarities in both structure and content. For any page p_j in P_SET, we calculate the kernel K(p_i, p_j) as described in [11]. Because pages similar to p_i are created by the same dynamic program or template, they should have similar structures and share many common terms as a whole. Thus, it is not difficult to set up a threshold ε for similar page selection (a code sketch of this step appears at the end of this section). The set of pages similar to p_i is:

P_SIM = { p_j ∈ P_SET | K(p_i, p_j) > ε }.   (1)

3. From the DOM structure of the Web pages, we expect the subtrees corresponding to O_REG to have the largest difference, as they carry distinct descriptions of different O_DATA, and the subtrees corresponding to irrelevant contents to be similar across the pages. In particular, to identify O_REG, we compute a novelty value for each subtree as described in the next section. Roughly speaking, the subtree that has the maximal novelty belongs to O_REG.

4. Next, we detect O_DATA within O_REG. This is done by detecting the representation structure (see Fig. 6) within O_REG. If there is more than one object involved, we partition the data into different O_DATA and output them as different XML files, as described in Section 5.

5. Given the set of object instances, we induce O_MDL as follows: 1) we construct the initial models by removing the repeated elements in O_DATA, 2) we integrate these O_MDL, and 3) we identify the data types of elements, detect the enumeration members, and refine O_MDL.

6. Optionally, we ask domain experts and knowledge engineers to review and refine the learned O_MDL. The modified O_MDL can then be used to verify the extracted O_DATA, design the database for object storage, and improve the performance of object extraction. O_MDL can also be updated and improved by a learning procedure based on new high-quality objects found. We plan to employ a bootstrapping learning process [28] to minimize human labor.

The above steps are summarized in Fig. 2. All steps are fully automated except step f), which is optional and needs human effort to verify O_MDL. The following sections describe steps c) to e) in detail.

Fig. 2. The main workflow for object model learning.

Our workflow is different from the automated extraction approach reported in [1], where EXALG first discovers the unknown template that generates the pages and then uses the discovered template to extract the data from the input pages. We found that the performance of directly discovering the unknown template is limited when the actual object structures in the pages are dynamic and unstable: for example, objects use a nesting structure, the number of objects involved in each page varies, or the dynamic program deletes a description attribute when it is blank in order to minimize page length. These factors lead to a huge gap between the template and the Web pages. Our indirect and bootstrapping approach, pages to O_DATA to O_MDL to O_DATA, decreases the disparities between problem spaces and simplifies the task in each stage. Furthermore, this workflow enables the system to achieve high accuracy when it handles complicated pages from commercial sites.
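To make step 2 concrete, the following minimal Python sketch groups pages by thresholding a structural similarity. The paper uses the tree-based kernel of [11]; the bag-of-DOM-paths overlap below, the PathCollector helper, and the threshold value are our illustrative stand-ins, not the authors' implementation.

from html.parser import HTMLParser

class PathCollector(HTMLParser):
    # Collects root-to-node tag paths as a crude signature of DOM structure.
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return set(collector.paths)

def kernel(page_i, page_j):
    # Normalized path overlap; a simplified stand-in for K(p_i, p_j).
    a, b = dom_paths(page_i), dom_paths(page_j)
    return len(a & b) / max(1, min(len(a), len(b)))

EPSILON = 0.8  # assumed threshold; the paper only notes that one is easy to set

def similar_pages(p_i, p_set):
    # P_SIM = {p_j in P_SET | K(p_i, p_j) > epsilon}, per (1).
    return [p for p in p_set if p is not p_i and kernel(p_i, p) > EPSILON]

Pages generated by the same template share most of their tag paths, so their overlap score is near 1, while hand-coded or differently generated pages score much lower; this is why a fixed threshold suffices in practice.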
4 DETECTING OBJECT REGION AMONG SIMILAR WEB PAGES

4.1 Features of Object Region

We first employ a DOM parser or SGML parser (such as publicly available ones) to generate DOM parse trees from the Web pages, where each node in the tree is an HTML element. O_REG is then a subtree that contains the description of the desired O_DATA. The characteristic of O_REG is that it is typically the largest contiguous region in the Web page having distinct tokens that are used to depict distinctive objects. Thus, we define the concept of repeatability for nodes and subtrees in the parse trees in order to identify O_REG. Repeatability can be evaluated by the similarity between nodes or subtrees in terms of both content and context. Here, similar node content means that both nodes share many content tokens and attributes of HTML elements, and similar node context is simplified as having similar ancestors. From the concept of repeatability, we define the novelty of a node or subtree in order to identify O_REG.

4.2 Formalization

Definition 1. The content similarity of nodes n_1 and n_2 in parse trees T_1 and T_2 is defined as the weighted ratio of their shared tokens and attribute tags in HTML elements. Suppose that n_1 and n_2 have tokens t_1 and t_2, and the ratio of shared attributes is sa(a_1, a_2); then the similarity between n_1 and n_2 is defined as:

Sim(n_1, n_2) = s(t_1, t_2) + w_1 · sa(a_1, a_2).   (2)

Here, w_1 is the weight of the shared attributes and is set to 1 in our study. s(t_1, t_2) is the similarity of tokens, defined as:

s(t_1, t_2) = { 1 if t_1 = t_2 (identical in string matching); 0.5 if t_1 and t_2 belong to the same type; 0 otherwise }.   (3)

t_1 and t_2 belonging to the same type means that they are of the same semantic type, such as digit, currency, e-mail, or number with unit of measure.
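The following small Python sketch transcribes (2) and (3) directly. The toy type detector (the regular expressions) and the aggregation of token scores over a node's token set are our assumptions; the text fixes only w_1 = 1.

import re

TYPE_PATTERNS = {
    "number": re.compile(r"^\d+([.,]\d+)?$"),
    "currency": re.compile(r"^\$\d"),
    "measure": re.compile(r"^\d+(\.\d+)?\s*[A-Za-z]+$"),  # e.g., "800 MHz"
    "email": re.compile(r"^[^@\s]+@[^@\s]+$"),
}

def token_type(t):
    for name, pattern in TYPE_PATTERNS.items():
        if pattern.match(t):
            return name
    return "string"

def s(t1, t2):
    # Eq. (3): 1 for identical strings, 0.5 for same semantic type, else 0.
    if t1 == t2:
        return 1.0
    if token_type(t1) == token_type(t2) != "string":
        return 0.5
    return 0.0

def sim(tokens1, attrs1, tokens2, attrs2, w1=1.0):
    # Eq. (2): token similarity plus weighted shared-attribute ratio.
    token_score = sum(max((s(a, b) for b in tokens2), default=0.0)
                      for a in tokens1) / max(1, len(tokens1))
    sa = len(set(attrs1) & set(attrs2)) / max(1, len(set(attrs1) | set(attrs2)))
    return token_score + w1 * sa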

Definition 2. Suppose nodes n_1 and n_2 (not roots) are in parse trees T_1 and T_2, respectively. The repeatability of n_1 with respect to T_2 is:

R(n_1, T_2) = max over n ∈ nodes(T_2) of [ Sim(n_1, n) + w_2 · Sim(parent(n_1), parent(n)) ],   (4)

where parent(·) denotes the parent of a node and w_2 reflects the influence of the context of nodes n_1 and n_2; it is set to 0.5 in our system.

From Definition 2, two subtrees will have high repeatability if they are similar in both content and context (or structure). In order to avoid cases where some O_DATA share many similar descriptions, which would increase the repeatability of O_REG, we consider a window of N_WIN (N_WIN = 2 to 5) pages when evaluating the repeatability of the subtrees. We randomly select N_WIN pages from P_SIM and compute the average repeatability over their corresponding parse trees as the final evaluation measure. As the repeatability of n_1 does not depend on a particular tree T_2, it can be simplified as R(n_1). Hence, we can normalize R(n_1) by dividing it by R(n_1, T_1):

R(n_1) = avg over T_i ≠ T_1 in the window of N_WIN of [ R(n_1, T_i) / R(n_1, T_1) ].   (5)

Here, R(n_1) is between 0 and 1. From this, we can define novelty:

Definition 3. The novelty of node n_1 is:

N(n_1) = 1 − R(n_1).   (6)

Definition 4. If tree(n) is the subtree rooted at node n in parse tree T, the novelty of tree(n) is the weighted sum of the novelties of its nodes, namely:

N(tree(n)) = N(n) + w_3 · Σ N(tree(child(n))),   (7)

where child(n) ranges over the children of node n and, thus, tree(child(n)) is the subtree rooted at a child of n. N(tree(n)) is calculated recursively based on (6). w_3 (set to 1 here) gives the contribution of novelty from all the subtrees. Thus, N(tree(n)) is the sum of the novelties of all the nodes involved.
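Using the sim function from the sketch above, (4) through (6) can be transcribed as follows. The node representation (.tokens, .attrs, .parent attributes) and the zero-division guard are our assumptions.

def repeatability(n1, t2_nodes, w2=0.5):
    # Eq. (4): best match of n1 among the nodes of T_2, with the parental
    # context weighted by w2; nodes carry .tokens, .attrs, and .parent.
    best = 0.0
    for n in t2_nodes:
        score = sim(n1.tokens, n1.attrs, n.tokens, n.attrs)
        if n1.parent is not None and n.parent is not None:
            score += w2 * sim(n1.parent.tokens, n1.parent.attrs,
                              n.parent.tokens, n.parent.attrs)
        best = max(best, score)
    return best

def node_novelty(n1, own_tree_nodes, window_trees):
    # Eqs. (5)-(6): average windowed repeatability, normalized by the node's
    # repeatability within its own tree (window_trees excludes that tree).
    own = max(repeatability(n1, own_tree_nodes), 1e-9)  # avoid division by zero
    avg = sum(repeatability(n1, t) for t in window_trees) / max(1, len(window_trees))
    return 1.0 - avg / own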
Fig. 3. A sketch of novelty distribution in a parse tree.

Usually, the subtree ST covering O_REG should have the highest novelty value. Based on Definition 4, we expect the parent and higher-level nodes of ST to also have the largest novelty values among their siblings. To illustrate this observation more clearly, we instantiate it in Fig. 3, where the number below each node label (such as A, B, C, and D) gives the novelty of the node. The root of O_REG lies on the path with the maximal tree novelty (here, A, E, I, J) when we navigate from the root of the parse tree to the leaf. Now, the problem is which subtree along this path is the proper O_REG. We observe that a subtree (such as tree(A)) that has only one large-novelty branch is not the right O_REG, while a subtree (such as tree(E)) having many distributed novelty branches is the right O_REG when we navigate from the root. The novelty of the former comes mainly from one branch, while that of the latter comes from contributions from many branches. In other words, the former has only one large-novelty branch, whereas the latter has many average-novelty branches rather than a single large one. In order to formalize this measure, we use a threshold w_0 (set to 0.4 to 0.6, for example) to predict this inflexion on the navigation path (n_0, ..., n_i, ..., n_j). For any node n_x between the root n_0 of the parse tree and the root n_i of O_REG, N(tree(n_x)) > w_0 · N(tree(parent(n_x))), which means that n_x is the only branch of the corresponding subtree tree(parent(n_x)) having large novelty. But, for node n_i, there is no branch tree(child(n_i)) such that N(tree(child(n_i))) > w_0 · N(tree(n_i)), over all children of n_i. This leads to Definition 5.

Definition 5. O_REG for parse tree tree(n_0) is the subtree tree(n_i) on the path from n_0 to the leaf with maximal novelty such that, for each node n_x on the path between n_0 and n_i, N(tree(n_x)) > w_0 · N(tree(parent(n_x))), while for all children of n_i, N(tree(child(n_i))) < w_0 · N(tree(n_i)).

4.3 Algorithm

Based on the above discussion, we design an algorithm to detect O_REG. In the following algorithm, lines 2-6 calculate the novelty of all subtrees in a parse tree, based on the novelty of their corresponding nodes. Lines 7-10 detect the root of O_REG.

GetObjectRegion(Current_Parse_Tree T_0, Parse_Tree_Set {T}, window_size N_WIN) {
1.  Randomly select N_WIN trees T_1, ..., T_N_WIN from {T};
2.  For each (node n_i in nodes(T_0)) {
3.    Calculate R(n_i, T_k), k = 1, ..., N_WIN;
4.    N(n_i) = 1 − (1 / N_WIN) · Σ_{k=1,...,N_WIN} R(n_i, T_k) / R(n_i, T_0);
5.    Recursively calculate N(tree(n_i));
6.  }
7.  while (navigating from the root of T_0 to a particular leaf along the corresponding subtrees having the largest novelty) {
8.    if (N(tree(n_i)) < w_0 · N(tree(parent(n_i))))
9.      { return parent(n_i); } // it is O_REG
10. }
}

The computation cost of this algorithm is O(N_WIN · |T|^2), where N_WIN is the length of the window and |T| is the number of nodes in the parse tree.
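A runnable Python sketch of the same descent over a toy tree type is given below. The Node container and the assumption that per-node novelties N(n) have already been computed via (5) and (6) are ours; w_3 = 1 and the default w_0 = 0.5 follow the text.

class Node:
    def __init__(self, novelty=0.0, children=None):
        self.novelty = novelty          # N(n), assumed precomputed per (5)-(6)
        self.children = children or []

def tree_novelty(node, w3=1.0):
    # Eq. (7): N(tree(n)) = N(n) + w3 * sum of children's subtree novelties.
    return node.novelty + w3 * sum(tree_novelty(c, w3) for c in node.children)

def get_object_region(root, w0=0.5):
    # Descend along the child with the largest subtree novelty; stop at the
    # inflexion node whose novelty is spread over several branches (Def. 5).
    node = root
    while node.children:
        best = max(node.children, key=tree_novelty)
        if tree_novelty(best) < w0 * tree_novelty(node):
            return node  # no single dominant branch: node is the root of O_REG
        node = best
    return node

A root whose novelty flows through one chain of nodes is traversed down that chain, and the first node with no single dominant branch is returned, as in the A, E, I, J path of Fig. 3.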

Fig. 4. A tabular page showing two objects (partial).

5 CONSTRUCTING OBJECT DATA

Constructing O_DATA from O_REG is the second phase in our proposed approach. The object structure in semistructured pages is implicitly reflected in the layout and organization of O_REG. Users cannot find any explicit expression of the structure of the objects when they browse pages such as the one shown in Fig. 4. However, they can construct a two-dimensional and/or hierarchical imaginary visual picture of O_MDL in their mind and form the underlying interpretation shown in Fig. 5. If the imaginary picture were to be represented in plain text, it would read like: "The Asus M2N 1.3 GHz Centrino laptop has the following key features: its manufacturer SKU is M2NCOMBO, its power source is Lithium...". The relationships among the attribute names and attribute values would be expressed in natural language syntax. In semistructured pages, however, these relationships are represented in View Syntax, where the layout and relative coordinates of the description items in O_REG reveal these relations indirectly, reliably, and concisely. If the relative position of the related items across pages varies, such as by cell shifting or location exchange, a different picture and interpretation will be formed. At the same time, View Syntax cannot depend on specific HTML tags as layout cues (hand-crafted rules such as <table> and <pre> at level 1, <tr> at level 2, as well as <p>, <li>, and <hr> at level 3), such as those described in the case frames of [25], because they cannot handle variations in documents. Although Song et al. [27] also used a view-based method to segment repeated regions among Web pages, our approach emphasizes automated O_DATA construction rather than the manual heuristics they used. Here, we use a dynamic program to mimic the psychological procedure by which humans construct such objects in terms of View Syntax. This dynamic program performs two main tasks.

First, it assigns roles to the nodes in O_REG. Different nodes play different roles when they form the elements of the output framework. There are at least the following five roles in O_REG:

1. the owner node, which implies the name of an object,(2)
2. the category node, which contains a list of description attributes and might recursively repeat in a nested structure,
3. the element name node, which provides the title of the corresponding description attributes,
4. the element value node, which is the concrete content of the named attributes, and
5. the irrelevant node, which includes HTML structure such as <table> and <tr>, as well as content that is not about any object, and does not contribute to the output tree directly.

Many element value nodes may share an element name node if they have exactly the same name.

(2) The owner node might also appear in the node tagged by <title>, <h1>, or <h2> outside O_REG if there is only one object in the page. This is the only exception in which the owner node is located beyond O_REG.

Fig. 5. The imaginary picture formed when users browse the above page.

Fig. 6. Typical representations of extracted objects. (a) Horizontal tabular structure. (b) Vertical tabular structure. (c) (Nested) list. (d) Fragment structure. (e) Sentence.

Here, the useful nodes can be grouped into two kinds: structure nodes (category or element name nodes, with gray background in Fig. 5) and value nodes (owner or element value nodes, with white background in Fig. 5). Once we determine whether an involved node is a structure node or a value node, it is convenient to use heuristic rules to specify the concrete type. For instance, owner nodes usually appear at the beginning of each subregion, and the number of such elements, no matter what representation structure is used, matches the number of existing objects. Incidentally, since we do not employ natural language processing components to split any node into more detailed atoms, one important assumption is that each node plays only one role in our approach.

Second, it detects the relationships among these nodes, while ignoring the irrelevant nodes. The element value nodes become the leaves of the output tree, as shown in Fig. 5. In order to place all nonleaf nodes, such as the owner node, category node, and element name node, in the right position, we need to detect the relationships between the element value node and the others. The main relationship pairs include: (element value, owner), (element value, element name), (element value, category), and (category, category). These relationships may be complicated, depending on the relative positions of the involved nodes, which can be located by virtual coordinates in the two-dimensional visual space. Once the above issues have been resolved, we can obtain the following subregions, which reveal the content layout in O_REG: 1) By ownership: all nodes in a subregion are about the same object (surrounded by the dashed line in Fig. 4). 2) By category: there are two categories (Ratings and Key Features) in the example. 3) By element: for example, {lowest price, $649.99, ...} is a set of elements. It is then easy to construct O_DATA based on the existing tags in the related nodes. We will address the dynamic program shortly.

5.1 Object Data Representation Structure

The roles of nodes and their relationships depend fully on the main representation structure used in O_REG. There are five typical representation structures for objects, as shown in Fig. 6. Obviously, some pages may contain combinations of the above structures, such as a horizontal tabular structure (Fig. 6a) embedded in a listing structure (Fig. 6c), as in a person's CV. Fortunately, most O_REG use only one type of representation. This is especially so for O_REG in product Web pages automatically generated by dynamic programs at most commercial sites. As a set of objects may be represented with different structures in different pages, it would be difficult to develop general wrappers to extract information from them. We need to identify the representation structure used, and then assign the proper roles and detect the relationships among the nodes, in order to simplify the task of extracting the detailed description framework. The output of O_DATA is an XML file.
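An illustrative sketch of the XML output format, with element names in the spirit of Fig. 7a and concrete values invented for illustration, is:

<object name="Notebook M2N">
  <category name="CPU">
    <element name="Speed">1.3 GHz</element>
    <element name="Cache">512 KB</element>
  </category>
  <category name="Harddisk">
    <element name="Capacity">40 G</element>
  </category>
</object>

Nested categories map to nested XML elements, so the hierarchical structure of O_DATA is preserved directly in the output.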

5.2 Method for Object Data Construction

Even though we exclude irrelevant components when we detect O_REG, it is still not easy to assign the proper roles to the related nodes and detect the relationships among them. The representation structure of many Web pages is not obvious, since there are many spurious and nested HTML structures (such as tables) in O_REG. For example, we have come across more than 100 tables in one O_REG in our experiments. There are two main reasons for the large number of tables in Web pages. First, dynamic programs and custom controls automatically produce a lot of tables. Second, some programmers use tables to set up the positions of the corresponding components. So, we need to determine the main representation structure, which covers most nodes in O_REG and serves as the basis for setting up the virtual coordinates in the two-dimensional visualization space.

Besides the above structural information, we need to explore syntactic and semantic information to deal with the role and relationship problems in O_REG. For instance, we would not be able to differentiate vertical and horizontal tabular structures if we did not exploit the syntactic and semantic content in depth. Actually, not all nodes in O_REG possess high novelty, since they are expressed in a similar framework by the same dynamic programs or templates. Essentially, these use the same vocabulary to describe the framework, which has a stable structure and expression approach. For instance, one site will always use "CPU" while another will always use "processor" to refer to the same entity, but the two terms are never mixed in a collection of similar pages from the same site. Based on this observation, we may identify the object structure in O_REG using the following principle: generally, we may consider the nodes with low novelty to be structure nodes, and the nodes with high novelty to be value nodes. While there may be some exceptions to this principle (as in the case of uncommon element name nodes having high novelty), the frequency of these exceptions is relatively low, and we can correct the problem by examining the node roles of neighboring nodes (such as those in the same row or column) under View Syntax. At the same time, the representation structure of O_REG can be determined from the initial node role assignments, where the layout is always simplified into one of the cases described in the previous section. So, we can analyze the relation and organization among those visual nodes to determine the number of existing objects, and then identify the relationships among the related nodes.

5.3 Algorithm for Object Data Construction

If there is only one object involved in O_REG, the transformation from O_REG to an XML object output file is straightforward.
However, when multiple objects appear in O_REG with a typical organization style as shown in Fig. 6, we employ a divide-and-conquer strategy based on the above-mentioned methodology to detect the number of O_DATA in O_REG. The steps are as follows:

1. Tabular structure. We first use the method reported in [24] to regulate unified cells tagged by Colspan = n or Rowspan = n. We compare the coordinates of the cells in each column and each row. Since the same attributes of different objects are aligned either vertically or horizontally in the display, we can use the coordinate values to determine whether the structure is a horizontal or vertical tabular structure. If the average similarity among the nodes along the columns is greater than that along the rows, the structure is a vertical representation structure (a code sketch of this test follows the pseudocode below). For the vertical structure, we simply compare the first column with the other columns, and verify that the cells in the first column are proper attribute names or general attribute values. The steps for constructing O_DATA in the horizontal tabular structure are similar.

2. (Nested) listing structure. In order to find out whether it contains only one object or a set of neighboring objects, we need to investigate the repetition among the subtrees within O_REG. If we can find multiple similar subtrees that cover most of O_REG, they are used to generate different O_DATA. Otherwise, we consider the whole O_REG as one object and, hence, one XML output file is generated.

3. Fragment and sentence structure. In this type, one or multiple pairs of attribute name and value appear in an HTML cell. At present, we assume that each of these structures contains only one object. We simply take the primary result in this situation and parse the whole fragment as an entity into an XML output file.

4. Compound structure. This is a combination of the above cases and is usually hand-crafted. Fortunately, for such intricate types, there is usually only one object involved in most cases. Our strategy is simply to parse the entire parse tree into one XML output file.

The following is the pseudocode for constructing O_DATA from an O_REG. Here, θ_1 is the threshold used to detect the repeated representation frameworks of diverse O_DATA and to differentiate O_DATA in the listing structure. In our system, θ_1 is set to 0.35 empirically.

ConstructObjectData(Current_Data_Region_Tree T_0, threshold θ_1) {
1.  For each (nonblank node n_i in nodes(T_0)) {
2.    Determine the role of each node;
3.  }
4.  Detect the main representation structure of T_0;
5.  Refine the node roles in T_0;
6.  if (T_0 is tabular structure) {
7.    Regulate the table;
8.    Detect the tabular representation direction, vertical or horizontal;
9.    Combine each pair of corresponding cell and attribute name as an XML element;
10.   Parse the first column or row as the object name;
11. }
12. else if (T_0 is listing)
13. {
14.   For each (node n_i in the second-layer nodes of T_0) {
15.     Calculate the inner repeatability R(n_i, T_0);
16.     if (R(n_i, T_0) > θ_1) // multiple objects in this listing
17.       { Parse the subtrees rooted at n_i as XML outputs; }
18.     else // only one object in this listing
19.       { Parse T_0 as XML output; break; }
20.   }
21. }
22. else // other structure
23. { Parse T_0 as XML output; }
}
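The orientation test in step 1 can be sketched in a few lines of Python. The cell_sim token-overlap measure stands in for the node similarity Sim of Section 4, and the list-of-rows table shape (already regulated, i.e., Colspan/Rowspan expanded as in [24]) is our assumption.

def cell_sim(a, b):
    # Token-overlap stand-in for Sim between two cell strings.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def avg_pair_sim(cells):
    pairs = [(x, y) for i, x in enumerate(cells) for y in cells[i + 1:]]
    return sum(cell_sim(x, y) for x, y in pairs) / max(1, len(pairs))

def table_orientation(table):
    # table: list of rows, each a list of cell strings.
    cols = list(zip(*table))
    col_score = sum(avg_pair_sim(list(c)) for c in cols) / max(1, len(cols))
    row_score = sum(avg_pair_sim(list(r)) for r in table) / max(1, len(table))
    return "vertical" if col_score > row_score else "horizontal"

Whichever axis aligns the same attribute across different objects exhibits the higher average similarity, which is exactly what the comparison exploits.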

Fig. 7. Generalizing the initial data model from an object instance. (a) XML output for O_DATA "computer" (partial). (b) Underlying schema. (c) Logical structure for the O_DATA in (a). (d) The initial object model from (c).

6 INDUCING OBJECT MODEL

Trees and graphs are the two kinds of topology used for depicting an O_MDL. Since almost all objects in semistructured pages use a hierarchical or flat structure to render their descriptions, we select the tree as the representation structure. In order to extract the abstract O_MDL from a set of analogous O_DATA, we need to find the essential and common framework among them. This can be considered a kind of reverse engineering process that discovers an XML schema from the XML documents of O_DATA. There are many tools, such as Protégé (available at: protege.stanford.edu) and XMLSpy (available at: altova.com), that can be used to generate a DTD or schema for a single object represented in XML. Here, we address the problem of generalizing an O_MDL based on the characteristics of the XML outputs of O_DATA, especially those with nesting structure.

6.1 Identifying Repeated Elements in Object Data

An example of a complex XML O_DATA output is given in Fig. 7a, which shows a number of repeated elements (such as Mouse) and categories (such as Harddisk). These repeated categories and repeated elements in the same category share common patterns, respectively. On the contrary, elements having the same element name but falling into different categories, such as Cache in Harddisk and Cache in CPU (not shown in the figure), might not share the same pattern. This clearly indicates that a simple merging method using co-occurrence without context is not feasible. Fig. 7c is the hierarchical logical diagram of the O_DATA shown in Fig. 7a. By identifying repeated elements and categories, we can derive the abstract O_MDL shown in Fig. 7d. Here, two elements with the same name existing in the same category imply that they share the same pattern. We can simply scan the simple elements in each category to check for repetitions, and combine them into one element tagged with "+" (one or more occurrences) in O_MDL.
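A one-function Python sketch of this scan, over a category represented as a list of (element_name, value) pairs (our simplification of the XML subtree), is:

from collections import Counter

def collapse_repeats(category_elements):
    # Same-named sibling elements in one category collapse into a single
    # element tagged "+" (one or more occurrences); unique ones stay untagged.
    counts = Counter(name for name, _ in category_elements)
    return {name: ("+" if counts[name] > 1 else "")
            for name, _ in category_elements}

# e.g., [("Mouse", "optical"), ("Mouse", "wireless"), ("CPU", "P4")]
# yields {"Mouse": "+", "CPU": ""}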

More work needs to be done to process repeated categories, whose children may not completely overlap even though their parents have identical names (see Cache of Harddisk in Fig. 7a). We assume that the vocabulary at a site is consistent; thus, sibling categories sharing the same name should refer to the same kind of entities, even though they might have different subelements. Hence, we can combine their child elements after we combine the repeated categories. We assign the tag "*" (zero or more occurrences) to the nonmandatory elements (see Fig. 7d).

6.2 Integrating Initial Object Models

Although most of the pages used in our research were produced by dynamic programs or templates, the resulting O_DATA may not have exactly the same structure. This is because the authors may delete a row (element) when its corresponding optional element value is null or missing, and the number of repeated elements may differ. Thus, we need to integrate the initial O_MDL into a uniform one in order to eliminate the differences among the O_DATA and enable all of them to be stored in one repository. The following algorithm produces a uniform framework for O_DATA:

IntegrateObjectModel(Initial_Model_Set M = {M_1, ..., M_N}) {
1.  i = argmax_{i ∈ {1,...,N}} |M_i|; // M_i has the maximal number of nodes
2.  For each (node n in M_i) // update the tags of M_i's elements
3.  {
4.    if (any model in M − {M_i} does not have the corresponding n)
5.      { Label n "*"; }
6.    else if (the corresponding element of any model in M − {M_i} is tagged "+") // repeated element in M − {M_i}
7.      { Label n "+"; }
8.  }
9.  For each (node n in M − {M_i}) // add the elements that do not exist in M_i
10. {
11.   if (M_i does not have a corresponding n) {
12.     Add n to the corresponding position in M_i;
13.     Label n "*"; // n is optional
14.   }
15. }
16. return M_i;
}

Here, we assume that two nodes in two initial models belong to the identical element (or category) if: 1) they have the same name and 2) their parent nodes (categories) have the same name. In the first step of this algorithm, selecting the object instance with the maximum number of nodes helps to reduce the operations needed to add the optional elements.
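A runnable sketch of IntegrateObjectModel over flat, dict-shaped initial models is shown below; each model maps an element name to a tag in {"", "+", "*"}. The flat shape (nesting omitted) is our simplification of the tree-structured models in the text.

def integrate_object_models(models):
    base_src = max(models, key=len)             # model with the maximal nodes
    base = dict(base_src)
    others = [m for m in models if m is not base_src]
    for name in list(base):                     # update tags of base elements
        if any(name not in m for m in others):
            base[name] = "*"                    # optional: missing somewhere
        elif any(m.get(name) == "+" for m in others):
            base[name] = "+"                    # repeated in some model
    for m in others:                            # add elements missing in base
        for name in m:
            if name not in base:
                base[name] = "*"
    return base

# e.g., integrate_object_models([{"CPU": "", "Mouse": "+"}, {"CPU": ""}])
# yields {"CPU": "", "Mouse": "*"}: Mouse is absent from one instance, so the
# missing-element rule takes precedence and marks it optional.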
6.3 Refining Object Model

Once we use IntegrateObjectModel to merge a set of initial models into a common O_MDL, we need to extract more knowledge to refine it. One important issue is the identification of the data types of elements. This is crucial information for database design and object comparison. The other issue is obtaining the enumeration list of element values. For example, the element Shooting modes in the camera domain would enumerate the values Auto, Manual, Long Shutter, ..., rather than a random set of strings. Most ontology learning approaches focus more on acquiring the definitions (or meanings) of O_DATA. Here, we focus on extending the element model (possible values and their aliases) in the concept space, as this is of greater benefit to data understanding and reasoning.

First, we need to identify the data types of the elements. Our system adopts the standard set of data types described by the W3C.(3) There are 36 recommended data types for use with OWL, such as string, date, time, integer, long, etc. Here, we use regular expressions and a dictionary to detect the main data types: number (integer, long, float, decimal, etc.), currency, date, time, language, Boolean, and e-mail. We assign a hierarchy to the data types such that integer is more specialized than decimal, and decimal is in turn more specialized than string, etc. The most general and default data type for an element is string. If the element in all O_DATA contains only values of a more specialized data type, we set its data type to the specialized one.

(3) Web Ontology Language (OWL) and XML Schema Part 1: Structures, Second Edition, both published by the W3C.

Next, we need to determine the list of candidate values for the enumerated data types. Theoretically, we could use the frequency of all vocabularies for each element in O_DATA to enumerate the possible values. However, some elements may use arbitrary strings rather than a predefined set of enumerated values. To overcome this problem, we employ a pessimistic strategy that constrains the length of an enumeration string to only one token; namely, the element cannot be enumerated if the length of its value exceeds one token. At the same time, we consider the token distribution, where most of the correct enumeration members should be repeated in many instances. A larger collection of object instances would produce more reliable statistics for this purpose.

The third problem is to detect the element data format of the form number followed by UOM (unit of measure), such as 40 G and 800 MHz. Once we can correctly assign the element to this data format rather than just string, we can perform computations on these elements according to some business logic. We again use regular expressions to detect such a data format with two attributes: quantity and UOM. Since we do not use any NLP parser to analyze the element data entity, we cannot deal with complicated cases such as PIII 800 MHz.
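A regular-expression sketch of the quantity + UOM check is given below; the pattern and the unit list are illustrative assumptions, not the system's actual dictionary.

import re

UOM_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(GHz|MHz|GB|MB|KB|G|M|K|mm|cm|kg|g)\s*$",
    re.IGNORECASE)

def parse_measure(value):
    # Returns (quantity, uom) for values like "40 G" or "800 MHz", else None.
    # Compound values such as "PIII 800 MHz" deliberately fail to match,
    # mirroring the limitation noted in the text.
    m = UOM_RE.match(value)
    if m:
        return float(m.group(1)), m.group(2)
    return None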

7 EXPERIMENT AND DISCUSSION

In this project, we developed a Web to data to model to ontology intelligent Web information processing platform called OnModer (see Fig. 8). Using OnModer, we can conduct experiments to learn an O_MDL by detecting and constructing O_DATA from a set of analogous Web pages.

Fig. 8. Interface of the object model learning system, OnModer.

As this is a new area of research, it is difficult to find a publicly available and representative test corpus to test our technique and illustrate its advantages. The currently available corpora on faculty categories, product categories, and search snapshots are relatively well-structured with simple contents [18]. Moreover, they assume fixed answer templates. Our research aims to extract detailed object-level information (e.g., a personal profile) with a flexible template structure, rather than just category-level information (e.g., a faculty category) with a fixed template structure. Hence, we need to build our own test corpus comprising more abundant contents. We therefore collected semistructured pages regarding notebooks, computers, digital cameras, member lists, etc., from various vendors and product review sites. Some of the sites we used are listed in Table 1. Compared to existing IE test corpora, our test collection has the following characteristics: 1) It is (semi)structured rather than structured. 2) It is complicated rather than simple, and contains numerous irrelevant components. 3) Its O_REG usually uses a hierarchical O_MDL depicted in various object representation structures. So, in most cases, it is more difficult than the existing corpora and closer to the situations in real applications. From the sites listed in Table 1, we selected about 2,050 pages (with at least 10 pages from each site) that contain detailed descriptions of the target objects. About 98 percent of the pages present product information in the form of tabular or list structures, while the rest present products in the form of text fragments or sentences. Some detailed samples and results can be downloaded from our project site.

We use the kernel method to retrieve similar pages from the same site. There is high similarity between the pages produced by the same program, and they are very dissimilar to other pages that are hand-coded or derived from different programs. Because of the constraint of paper length, we focus here on evaluating only the three most critical steps in our framework: the quality of the extracted O_REG, the accuracy of O_DATA construction, and the performance of O_MDL construction.

7.1 Test on Object Region Detection

Since O_REG is a subtree of the parse tree of the corresponding page, we can evaluate the quality of the detected O_REG in two aspects: 1) whether it covers all nodes depicting O_DATA and 2) whether it contains irrelevant components. The intuitive approach is to compute the degree of overlap between the detected O_REG and the ideal O_REG using recall and precision. However, because the cost of missing a description of O_DATA is much higher than that of covering irrelevant nodes, recall is more important than precision. Fortunately, we found that hardly any O_DATA nodes were missed when w_0 was not less than 0.5. However, 1-6 percent of irrelevant nodes were included. This phenomenon is caused by the definition of O_REG. For example, if there are three nodes a, b, and c in the second layer of O_REG, tree(a) and tree(b) are about two O_DATA, while a small tree(c) contains irrelevant content (such as a link to the promoter's page). We have to include tree(c) in

TABLE 1. Some Sites Used for Downloading Product Pages

Fig. 9. Performance of data region detection with different w_0.

the proposed O_REG if we need O_REG to cover tree(a) and tree(b). This problem happens only in pages formatted with the listing structure, and such pages are likely to be manually constructed. They thus do not have a high-quality structure, since the nodes about O_DATA are not well encapsulated and are mixed with irrelevant nodes. However, these irrelevant nodes are likely to be ignored in the subsequent object partitioning step.

The performance of O_REG detection is not sensitive to the threshold w_0 in algorithm GetObjectRegion. This is because of the good agreement between the relative novelty values of subtrees inside O_REG versus those outside O_REG, as predicted by the algorithm. The subtrees between the root of the parse tree and the root of O_REG always have one branch of large novelty value, but the novelty value of O_REG comes from many subbranches (at least two). To evaluate the performance of GetObjectRegion, we tested 100 random pages with different values of w_0 and used precision and recall to measure the degree of node overlap between the ideal and detected O_REG. Namely:

Precision = (Nodes in ideal O_REG ∩ Nodes in detected O_REG) / Nodes in detected O_REG.

Recall = (Nodes in ideal O_REG ∩ Nodes in detected O_REG) / Nodes in ideal O_REG.

F_1 = 2 · Precision · Recall / (Precision + Recall).

For instance, the precision and recall will both be 100 percent if the ideal and detected O_REG overlap exactly. This measure can better reflect the quality of O_REG detection. The results of O_REG detection shown in Fig. 9 indicate that GetObjectRegion executes stably and efficiently, with a consistently high F_1 measure (see Fig. 9) when w_0 varies from 0.4 to 0.6.

Another issue that we may need to consider is whether one O_REG is sufficient. We impose this as we assume there is only one O_REG in a Web page. Occasionally, some pages might have more than one kind of O_DATA created using different frameworks and shown in disconnected regions (for example, manually created ones). We focus only on the one kind of O_DATA having the most description content. Incidentally, there might be fake analogous pages returned by the kernel method (see Section 3.2; pages not containing any object). For such fake pages, GetObjectRegion might also return some spurious fragments if it can detect such inflexions in the parse tree. However, the follow-on ConstructObjectData will likely extract no O_DATA from these spurious O_REG, since it cannot detect a repeated representation framework. Thus, it should not bring any additional negative results.

7.2 Test on Object Data Construction

In the step of constructing O_DATA, the system partitions O_REG into different blocks about distinctive O_DATA (if necessary) and then outputs the O_DATA in XML format. Table 2 gives the statistics of the test corpus in terms of the distribution of data representations and the number of unique object instances. The performance of partitioning objects for each unique O_DATA is tabulated in Table 3, which shows an average F_1 of 95.2 percent. Our error analysis reveals the following situations where errors can arise: 1) Entities in a raw table (vertical or horizontal tabular) with unknown attributes are used to construct spurious O_DATA. 2) The system mistakenly partitions one object into many objects when its grouped attributes are very similar to each other.
7.2 Test on Object Data Construction

In the step of constructing O_DATA, the system partitions O_REG into blocks describing distinct O_DATA (where necessary) and then outputs each O_DATA in XML format. Table 2 gives the statistics of the test corpus in terms of the distribution of data representations and the number of unique object instances. The performance of partitioning objects for each unique O_DATA is tabulated in Table 3, which shows an average F1 of 95.2 percent. Our error analysis reveals the following situations where errors arise: 1) entities in raw tables (vertical and horizontal tabular layouts) with unknown attributes are used to construct spurious O_DATA; 2) the system mistakenly partitions one object into many objects when its grouped attributes are very similar to one another; and 3) in nested list structures, a quirky similarity between subtrees can mislead segmentation. Once incorrect O_DATA segmentation happens, the relationships and ownerships of elements in the O_DATA collapse and improper description frameworks appear. In contrast, correct O_DATA segmentation facilitates the construction of the right framework for O_DATA representation, which helps ensure the correct assignment of node roles and relationships, even though a few spurious elements might occasionally remain.

Compared to the original source pages, the data object output is significantly smaller, indicating a great reduction in irrelevant elements. For example, one technical specification page for digital cameras in our test corpus reaches 556 KB, with more than 5,000 parse nodes and more than 3,000 table cells in some 300 nested tables; its output file is only 8 KB, with about 100 XML elements in 23 categories.
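To make the output step concrete, the sketch below serializes one partitioned O_DATA, whose elements have already been grouped into categories, into a small hierarchical XML file. The category and element names are invented for illustration and are not taken from the corpus:

import xml.etree.ElementTree as ET

def write_object_data(obj_id, categories, path):
    # Serialize one O_DATA, given as {category: {element: value}}, to XML.
    root = ET.Element("object", id=obj_id)
    for category, elements in categories.items():
        cat = ET.SubElement(root, "category", name=category)
        for name, value in elements.items():
            ET.SubElement(cat, "element", name=name).text = str(value)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# A hypothetical camera description reduced to a compact hierarchical file.
write_object_data(
    "camera-01",
    {"Lens": {"Focal-Length": "16 mm", "Zoom": "3x"},
     "Power": {"Battery-Type": "CRV3"}},
    "camera-01.xml")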

TABLE 2. Object Data Distribution on Our Corpus.
TABLE 3. Performance of Object Data Extraction.
TABLE 4. Object Data Distribution in the RISE Corpus.

In general, we find that we are able to extract all O_DATA from the well-structured pages, with no missing cases but with some incorrect extractions. This conclusion is similar to that reported in [16] for very simple cases. Our overall performance in constructing O_DATA is also close to that achievable in object mining and extraction as reported in [3], [4], although we handle the more complicated structures mentioned above.

In order to verify and evaluate our approach more comprehensively, we also use an open test corpus: all four groups of structured pages (Internet Address Finder, BigBook, OKRA, and Quote Server) provided by University College Dublin and ISI/USC, here called the RISE corpus. The details of this collection are shown in Table 4. The wrapper system reported in [18] cannot handle all of these pages, since some contain multiple intertwined objects. All pages in this collection use only one table in their O_REG, with no nested cases and no variation in the structure of these tables. Without knowledge of the available templates, OnModer is able to detect and output 7,760 objects with an average size of 1 KB, where each O_DATA contains an average of eight elements in a plain (nonhierarchical) structure. The extraction accuracy of OnModer on this corpus is 100 percent. The RISE corpus is much simpler than our test corpus, since it contains only simple objects in simple organization structures. We can therefore conclude that our approach obtains reliably excellent extraction performance on well-structured pages, and that it is effective and applicable in general situations.

7.3 Test on Object Model Construction

For this test, we focus on constructing O_MDL for personal computers and electrical devices. Some examples of O_MDL for notebooks and digital cameras are available online. Fig. 10 shows the O_MDL diagram for Kodak digital cameras derived by OnModer. For comparison purposes, we include in Fig. 11 the manually constructed ontology diagram of (digital) cameras available at .../era/sld001.htm.

Fig. 10. Object model for Kodak digital cameras.
Fig. 11. Ontology created by domain experts.
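The induction behind models such as the one in Fig. 10 can be pictured as generalizing a set of analogous object instances into one abstract model: elements observed in every instance become required, the rest become optional. The sketch below captures only this generalization idea over simplified flat inputs; it is not the actual View Syntax machinery, and the sample data are invented:

from collections import Counter

def induce_model(objects):
    # Generalize analogous O_DATA, each a {element: value} dict, into a model
    # marking each element as required (seen in all) or optional (seen in some).
    counts = Counter()
    for obj in objects:
        counts.update(obj.keys())
    return {name: "required" if seen == len(objects) else "optional"
            for name, seen in counts.items()}

cameras = [{"Resolution": "5 MP", "Zoom": "3x"},
           {"Resolution": "4 MP", "Weight": "165 g"}]
print(induce_model(cameras))  # Resolution required; Zoom and Weight optional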

Based on these constructed models, we can draw the following observations.

First, the elements for analogous objects on diverse sites vary greatly. There are four levels of difference among these variations: 1. the degree of detail, which is tied to the number of elements involved; 2. the organization of categories; 3. the vocabulary appearing in the element names; and 4. the granularity of the elements, meaning that one element in one model might be split into a group of elements in another model. For instance, there are nearly 100 elements defined in the O_MDL for Canon cameras and 46 elements in the O_MDL for Kodak cameras, whereas the ontology created by human experts has only about 30 elements (including attributes). The element CPU in the notebook model from PC World has an equivalent category Processor (Bus-Speed, Processor-Manufacturer, etc.) in Bizrate. These facts strongly indicate that users need particular models for objects coming from different sources, since no universal O_MDL can satisfy all cases. For analogous objects, we need multiple versions of O_MDL to match the existing data published on diverse sites. Even though a common model is the ultimate representation of objects, we also need to grasp the essence of the distinctive descriptions.

Second, there are large variations in model structure. Our constructed O_MDL uses either a hierarchical structure or a plain structure, but the element organization of diverse O_MDL for analogous objects can be quite different. For instance, one notebook model consists of about 60 elements in 11 categories, while another has about 20 elements in a plain structure. The two models organize their elements according to different principles and could present their content using different frameworks. The semantics among the categories and their elements are related, consistent, and reasonable, and it is difficult to decide which model is better. What is clear, however, is that there is a big topological gap between the real data and the so-called standard conceptualization, making the extraction of real data into a standard template a rather difficult, if not impossible, task. At the same time, the element names used in two O_MDL might differ even when they refer to the same descriptions; hence, it is not easy to compare elements across different models even when they describe analogous objects. How to integrate a common ontology from the available O_MDL will be one of our next research topics.

Third, nearly one quarter of the elements in the Kodak model are repeated, but we found no case of a repeated category. For this test, we did not correct the errors in O_DATA before performing model learning. As a result, there were three spurious elements in the Kodak model. We then deleted the definitions of these three spurious elements in this O_MDL and used the cleaned model to verify the 31 constructed objects, which corrected the objects by removing nearly 50 spurious elements. This shows that, with minimal human effort, O_DATA quality can be improved tremendously.

Most of the elements in our test corpus are of string type; some contain data of type decimal, integer, Boolean, and country. We also extracted the proper attributes (amount and UOM) for elements such as 16 mm. However, we could not process more complicated cases such as Approx. 165g and compound dimension values in mm, since there is no NLP parser component in our current system. Our future study will address this shortcoming.
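The amount/UOM extraction just mentioned can be approximated with a simple pattern, which also shows why a value such as "Approx. 165g" falls outside it. The regular expression below is our own illustration rather than the actual rule used by OnModer:

import re

# One bare number followed by an alphabetic unit, e.g., "16 mm".
SIMPLE_VALUE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")

def parse_amount_uom(text):
    # Return (amount, unit of measure) for simple values, or None otherwise.
    match = SIMPLE_VALUE.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)

assert parse_amount_uom("16 mm") == (16.0, "mm")
assert parse_amount_uom("Approx. 165g") is None  # needs deeper NLP handling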
8 CONCLUSION

Modeling intensive data on the Web is a key technology for realizing the ideals of the Semantic Web and for supporting business intelligence. This paper tackles the problem of learning domain models from complex Web pages, where the presentation structure varies and the useful data is surrounded by many irrelevant components. Our tests have demonstrated that our system is effective and efficient. The main contribution of this study is a fully automated approach, based on kernel-based matching and View Syntax interpretation, to extracting object data and the corresponding object models from semistructured Web documents.

In this automated approach, we first utilized a kernel-based novelty measurement to identify O_REG, the smallest subtree of the parse tree covering the descriptions of all the desired O_DATA, which is therefore free from the interference of irrelevant components. We then investigated an algorithm for partitioning and constructing O_DATA within O_REG using View Syntax interpretation; this algorithm can recompose noncontiguous descriptions of different O_DATA and produce independent, self-explanatory XML output files for the objects. Finally, we explored a method for inducing an abstract O_MDL from analogous O_DATA. The O_MDL is the generalization of the original data and is the key to understanding and using O_DATA.

To improve the performance of our system on complex semistructured documents, we need to further fine-tune the techniques for enhancing the detected O_MDL and for extracting distinct O_DATA from O_REG. We also need to investigate approaches to assembling multiple related O_MDL into an ontology's terminological concept hierarchy. Finally, we would like to employ NLP technologies to learn deep semantic information from Web documents and to build more comprehensive universal ontologies that enable users and machines to interpret the underlying semantics of semistructured Web pages.
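As a compact recap of this pipeline, the runnable toy below wires the three stages together. Every function and data structure is a simplified stand-in introduced for illustration; none of these names belongs to OnModer's actual implementation:

def get_object_region(parse_tree):
    # Stand-in for the kernel-based novelty step: choose the subtree that
    # covers all object descriptions. The whole toy "tree" is returned here.
    return parse_tree

def construct_object_data(region):
    # Stand-in for View Syntax partitioning: split the region into per-object
    # {element: value} dicts. The toy region is already such a list.
    return list(region)

def learn_object_model(pages):
    # End-to-end sketch: region detection -> data construction -> model induction.
    objects = []
    for tree in pages:
        objects.extend(construct_object_data(get_object_region(tree)))
    required = set.intersection(*(set(o) for o in objects)) if objects else set()
    return {name: ("required" if name in required else "optional")
            for obj in objects for name in obj}

# Two toy "pages", each already reduced to a list of element dictionaries.
pages = [[{"Resolution": "5 MP", "Zoom": "3x"}],
         [{"Resolution": "4 MP", "Weight": "165 g"}]]
print(learn_object_model(pages))  # Resolution required; Zoom, Weight optional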

REFERENCES

[1] A. Arasu and H. Garcia-Molina, "Extracting Structured Data from Web Pages," Proc. ACM SIGMOD.
[2] Z. Bar-Yossef and S. Rajagopalan, "Template Detection via Data Mining and Its Applications," Proc. World Wide Web Conf., 2002.
[3] D. Buttler, L. Liu, and C. Pu, "A Fully Automated Extraction System for the World Wide Web," Proc. IEEE Int'l Conf. Distributed Computing Systems (ICDCS), p. 361.
[4] D. Buttler, L. Liu, C. Pu, H. Paques, W. Han, and W. Tang, "OminiSearch: A Method for Searching Dynamic Content on the Web," Proc. ACM SIGMOD, p. 604.
[5] R. Cole and P. Eklund, "Browsing Semistructured Web Texts Using Formal Concept Analysis," Lecture Notes in Computer Science, vol. 2120, p. 319.
[6] H. Davulcu, S. Vadrevu, and S. Nagarajan, "OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites," Proc. Int'l Workshop Semantic Web and Databases (SWDB).
[7] A. Doan, P. Domingos, and A. Levy, "Learning Source Descriptions for Data Integration," Proc. WebDB.
[8] D.W. Embley et al., "Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents," Proc. Conf. Information and Knowledge Management.
[9] F. Esposito, S. Ferilli, N. Fanizzi, and G. Semeraro, "Learning from Parsed Sentences with INTHELEX," Proc. Learning Language in Logic Workshop.
[10] O. Etzioni and M. Cafarella, "Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison," Proc. Nat'l Conf. Artificial Intelligence.
[11] T. Gärtner, "A Survey of Kernels for Structured Data," ACM SIGKDD Explorations Newsletter, vol. 5, no. 1, July.
[12] S. Gupta et al., "Adapting Content to Mobile Devices: DOM-Based Content Extraction of HTML Documents," Proc. World Wide Web Conf.
[13] U. Hahn and K. Schnattinger, "Towards Text Knowledge Engineering," Proc. Nat'l Conf. Artificial Intelligence.
[14] K. Lerman, C. Knoblock, and S. Minton, "Automatic Data Extraction from Lists and Tables in Web Sources," Proc. IJCAI Workshop Adaptive Text Extraction and Mining.
[15] S.-H. Lin and J.-M. Ho, "Discovering Informative Content Blocks from Web Documents," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining.
[16] B. Liu, R. Grossman, and Y. Zhai, "Mining Data Records in Web Pages," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining.
[17] J.-U. Kietz and K. Morik, "A Polynomial Approach to the Constructive Induction of Structural Knowledge," Machine Learning, vol. 14, no. 2.
[18] N. Kushmerick, "Wrapper Induction: Efficiency and Expressiveness," Artificial Intelligence, vol. 118.
[19] A. Maedche, Ontology Learning for the Semantic Web. Kluwer Academic Publishers.
[20] A. Maedche and S. Staab, "Semiautomatic Engineering of Ontologies from Text," Proc. Int'l Conf. Software Eng. and Knowledge Eng.
[21] A. Maedche and S. Staab, "Ontology Learning for the Semantic Web," IEEE Intelligent Systems, vol. 16, no. 2.
[22] K.-R. Müller et al., "An Introduction to Kernel-Based Learning Algorithms," IEEE Trans. Neural Networks, vol. 12, no. 2.
[23] S. Schlobach, "Assertional Mining in Description Logics," Proc. Int'l Workshop Description Logics (DL-2000).
[24] K. Shimada, A. Fukumoto, and T. Endo, "Information Extraction from Personal Computer Specifications on the Web Using a User's Request," IEICE Trans. Information and Systems.
[25] S. Soderland, "Learning to Extract Text-Based Information from the World Wide Web," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining.
[26] R. Studer, R. Benjamins, and D. Fensel, "Knowledge Engineering: Principles and Methods," Data & Knowledge Eng., vol. 25, nos. 1-2.
[27] R. Song et al., "Learning Block Importance Models for Web Pages," Proc. World Wide Web Conf.
[28] M. Thelen and E. Riloff, "A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts," Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP).
[29] A. Todirascu, F. de Beuvron, D. Gâlea, and F. Rousselot, "Using Description Logics for Ontology Extraction," Proc. ECAI Workshop Ontology Learning.
[30] M. Uschold, "Knowledge Level Modelling: Concepts and Terminology," The Knowledge Eng. Rev., vol. 13, no. 1, pp. 5-29.
[31] R. Volz et al., "Unveiling the Hidden Bride: Deep Annotation for Mapping and Migrating Legacy Data to the Semantic Web," Web Semantics: Science, Services, and Agents on the World Wide Web.
[32] G. Webb, J. Wells, and Z. Zheng, "An Experimental Evaluation of Integrating Machine Learning with Knowledge Acquisition," Machine Learning, vol. 31, no. 1, pp. 5-23.
[33] S. Ye and T.-S. Chua, "Learning Object Model from Product Web Pages," Proc. SIGIR Workshop Semantic Web.
[34] S. Ye and T.-S. Chua, "Detecting and Partitioning of Data Objects in Complex Web Pages," Proc. Int'l Conf. Web Intelligence.
[35] L. Yi and B. Liu, "Eliminating Noisy Information in Web Pages for Data Mining," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining.

Shiren Ye received the PhD degree in computer software from the Institute of Computing, Academy of Sciences. He is a research fellow in the School of Computing, National University of Singapore. His research interests include Web and text mining, the Semantic Web, and natural language processing.

Tat-Seng Chua received the PhD degree from the University of Leeds, UK. He is a professor in the School of Computing, National University of Singapore, where he was the Acting and Founding Dean of the School of Computing. He spent three years as a research staff member at the Institute of Systems Science (now I2R) in the late 1980s. Dr. Chua's main research interest is in multimedia information processing, in particular the indexing, retrieval, and extraction of information in video and text. His current projects include news video retrieval, question answering, video QA, and information extraction on the Web. Dr. Chua is active in the international research community. He has organized and served as a program committee member of numerous international conferences in the areas of computer graphics and multimedia, including ACM Multimedia and ACM SIGIR. He is the cochair of ACM Multimedia 2005 and ACM SIGIR. He serves on the editorial boards of the IEEE Transactions on Multimedia, The Visual Computer, and Multimedia Tools and Applications.
