© 2010 by Ngoc Trung Bui. All rights reserved.

PROBABILISTIC VISUAL RELATIONAL DATA EXTRACTION

BY

NGOC TRUNG BUI

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2010

Urbana, Illinois

Adviser: Associate Professor Kevin Chen-Chuan Chang

Abstract

This paper studies the problem of wrapper generation and proposes the concept of visual-relational data extraction as the foundation for modeling wrappers. Towards large scale integration, we identify the key requirements of wrapper deployment, and observe the limitations of the state of the art, which inherently result from their low-level wrapper modeling. We thus propose the visual-relational modeling and develop the execution and learning mechanisms. Our experiments show significant improvements towards satisfying the accuracy and consistency requirements.

To my Father and Mother.

Acknowledgments

This project would not have been possible without the support of many people. Many thanks to my adviser, Prof. Kevin Chen-Chuan Chang, who read my numerous revisions and helped me make the approach clear. Thanks to the Vietnam Education Foundation for providing me with the financial means to complete this project. And finally, thanks to my parents and numerous friends who endured this long process with me, always offering support and love.

Table of Contents

List of Figures
List of Abbreviations
List of Symbols
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Motivation: Model Matters
  3.1 Visual Relational Wrapper Model
  3.2 Model Execution: Extracting Data
    3.2.1 Relational Schema Generative Model
    3.2.2 Configuration Tree: Parsing Efficiency
    3.2.3 Parsing
Chapter 4 Model Induction
  Extraction Estimation on Extracted Sample Dataset
Chapter 5 Experiments
  Accuracy and Robustness
  Consistency Over Time
  Features Effectiveness Evaluation
Chapter 6 Conclusion
References

List of Figures

1.1 hotjobs.yahoo.com: two different dates
3.1 Wrapper through its full life cycle
3.2 Example page fragment (amazon.com)
3.3 Model Execution
3.4 Configuration Tree Generation
3.5 Reduce Candidate Set by Distance-based Clustering
Types of false combination from clean clusters
2Y5D Dataset Characteristics
F-measure evaluation
Robustness with different webpage structures
Consistency Test: Sampled Dataset
Average of induced wrapper's life
Feature Coverage
Statistics on each Feature Similarity Level (log n)

List of Abbreviations

DI   Data Integration
DE   Data Extraction

List of Symbols

Ω   Visual Model
Υ   Data Record
e   Data Token

Chapter 1

Introduction

Wrapper generation is fundamental for enabling data extraction from structured data sources, a crucial step in information integration and search. This paper considers wrapper generation with a new paradigm of modeling data sources. While various approaches exist, they all uniformly resort to HTML features and tag patterns to model the regularity of sources. We observe that this choice of modeling is fundamental, and that it inherently limits existing approaches in matching several key requirements. As a different approach, we propose visual relational modeling, which aims to specify wrappers with high-level features and only minimal patterns.

While a well-recognized problem, with the prevalence of databases on the Web, wrapper generation is increasingly a barrier to realizing large scale information integration across the Internet. On this deep Web, numerous data sources provide structured information (e.g., amazon.com for books; cars.com for automobiles) accessible only via dynamic queries instead of static URL links. To explore the contents behind the surface of such databases, as a major hurdle, we must extract structured data from the query results, which we refer to as data pages. To illustrate, Figure 1.1 shows two data pages from Yahoo, at different times. Such data pages present a set of records, e.g., [jobtitle, company, location, date], which are dynamically retrieved from the underlying database. With the proliferation of databases on the Web, users' need to access such information has been pressing and, consequently, wrapper generation has become a key enabling technique. Current search services cannot meaningfully index such data, precisely due to the challenge of extracting data from HTML text pages. With effective wrapper construction, we will be able to enable large scale integration of specialized and structured information, e.g., building vertical search over various structured domains such as jobs (e.g., simplyhired.com crawls and extracts job data from thousands of company sources) and shopping (e.g., thefind.com indexes product information from numerous vendors).

In practical deployment towards building large scale vertical search, however, we realized that current wrapper approaches fall short in several critical aspects. To motivate, we systematically examine the full life cycle of a wrapper, towards scalable and cost-effective wrapper deployment (Chapter 3).

Figure 1.1: hotjobs.yahoo.com: two different dates.

While we identify three key requirements, namely accuracy, consistency, and intuitiveness, unfortunately, no existing approach satisfies all of them. While their induction approaches differ (Chapter 2), they are essentially identical in their wrapper modeling, which relies on low-level HTML features and tag-sequence patterns, resulting in wrappers that require rigid regularity, are fragile to changes, and are unintuitive to understand. As our key insight, we propose to elevate representation to visual perception and to minimize the patterns of wrappers to only relations between desired elements. Our proposal is guided by the dual principles of wrapper modeling: high-level features and minimal patterns. With visual-relational modeling as the core, we develop model execution for data extraction and model induction for wrapper generation, thus completing the overall framework.

We have performed extensive experimental evaluation, and the results demonstrate significant improvement over existing approaches. For a concrete and realistic study, we collected a large dataset, the 2Y5D Dataset, over two years (October 2004 to August 2006) across five domains (Auto, Book, Job, Movie, Music). We compare our visual relational framework to several representative existing approaches. For accuracy, our system returns a high F1-measure in the range of 85%-95%, outperforming the second-best approach by a margin of 20%-55%. For consistency over time, our system preserves wrapper correctness for far longer periods than existing approaches, in the range of 200%-700% longer. We have deployed the system in building large scale vertical search for the apartment domain, which requires building agents for thousands of rental data sources, and our experience in the industrial setting has been encouraging and consistent with the experimental evaluation.

In summary, our contributions in this paper include:

Concept: We propose the novel concept of visual-relational data extraction for wrapper modeling.

Framework: We propose effective execution and learning of the visual-relational model.

Evaluation: We extensively evaluated the accuracy and consistency of our approach on a real dataset spanning two years.

Chapter 2

Related Work

The work related to ours lies in wrapper induction research. We therefore compare with it along the following two aspects of this topic.

Wrapper Model: In terms of model language, most previous works [1, 2, 3, 4, 5, 6, 7] use low-level rules directly on the HTML source code to skip unnecessary information and reach the specific patterns of desired information. Baumgartner et al. [8] use a Prolog-like language called Elog, which contains visual-based predicates such as before and after. However, these predicates actually reflect the internal order of the HTML tag structure rather than the interface level. The main problem with these low-level languages is the inconsistency of the wrapper description in the context of rapidly updated and changing webpages. To the best of our knowledge, our work is the first to completely elevate the wrapper model to the visual abstraction level by using probabilistic visual relations.

In terms of output structure support, linear approaches [6, 9] support linear extraction without optional, repetitive, and nested structures. Hsu et al. introduce a non-linear finite-state transducer in SoftMealy to deal with missing and multi-valued attributes. Hierarchical-model-based approaches in STALKER [3], RoadRunner [1], and XWRAP [10] support all kinds of attribute variation, such as missing values, multiple values, and nested structures. The relational visual model presented in this paper provides all of these supports.

Induction technique: The earliest approaches focus on developing declarative languages to assist users in constructing wrappers. These languages are proposed as simpler alternatives to common functions written in general programming languages. Systems belonging to this approach include [11, 2]. Building rules in these languages is not intuitive and is extremely error-prone for users.

Supervised learning approaches learn data extraction rules and/or patterns. These rules and patterns are then used to identify the data elements that follow them and to assign labels. The usual accuracy of this approach is not very good, since data does not always follow the same rules and patterns. Many such induction systems have been introduced, including [5, 3, 6]. In our approach, we need only one training example, which requires no expertise from users. Even with minimal input from users, our technique achieves very high accuracy, because most of the inconsistency has been removed by elevating our system to the highest level of abstraction.

It is also worth noting that, in the context of this paper, we consider our approach semi-automatic, even though the implemented system is fully automatic given additional domain knowledge; by comparison, RoadRunner [1] needs at least two similar pages, and ViNTs [7] requires both multi-record pages and non-result pages from search engines.

Automatic learning approaches [1, 12, 13] rely on the regularity of the HTML structure as the basis for alignment and extraction. These methods, however, are not very robust, since they require very structured input pages to achieve good accuracy. Moreover, on many pages, different data records may have different tag structures because their formats differ. Generally, the output of these approaches needs to be intensively post-processed before use. In our approach, we require minimally labeled training records (i.e., the user just highlights what they want on one data record) to avoid post-processing and labeling. As noted above, the techniques in this paper are comparable with automatic learning techniques.

Visual Usage: From a different perspective, work related to ours also includes papers which use visual information in extracting or analyzing webpages [14, 7, 15, 16, 17]. Deng et al. [14] use visual alignment to identify the meaning of webpage regions such as banner, main content, menu, etc. Web form analysis research [16, 17] also partially or fully uses visual information in identifying form elements and associating them with corresponding labels. ViPER [15] utilizes the visual bounding box as the main measure in ranking data regions, which helps to eliminate low-informative data regions in the output. However, the extraction algorithm in ViPER (i.e., global sequence alignment) is applied entirely on the HTML source code. ViNTs [7] introduces the interesting idea of visual block regularity in extraction. However, this method is not applicable to extracting the detailed attributes of each data record where the attributes are written in a sequence (e.g., a book's attributes on Amazon.com), since the shapes of the data records are completely different. Moreover, the paper makes a very strong assumption of having both result and non-result pages from the search engine, which virtually gives the correct extracted regions. The technique in ViNTs is also brittle, since it depends too heavily on multiple heuristics to identify the first content line of a record.

Chapter 3

Motivation: Model Matters

Observation: Wrapper through Life Cycle. Let's start by observing the big picture for a wrapper in its full course of operation. In Figure 3.1, centering around a wrapper (the shaded box), there are several key stages of creation, execution, and maintenance.

Wrapper Creation & Repair. At the very first wrapper creation stage, a wrapper developer creates a wrapper for a source (e.g., amazon.com books). Essentially, such creation will build a wrapper model, which we denote Ω, for specifying the template structure of the source for data extraction. This stage has been the focus of most wrapper research: how to automate wrapper generation as much as possible? Many mostly automatic approaches have been developed, as Chapter 2 discussed. In particular, as a representative category, wrapper induction takes a few example pages from the source and automatically induces the underlying template as HTML tag tree patterns, which are then used as the model Ω for data extraction by recognizing the same tag patterns in future pages. No current solutions are fully automatic; they all require a certain amount of manual effort, typically for collecting one or multiple training pages, labeling these pages, or matching the induced template slots to our desired data attributes. As an example, the RoadRunner system [1] takes multiple pages in training and does not need labeling, but requires developers to check the output templates and select some slots as desired attributes (say, in the pattern <li><i>title:</i>... <href> #pcdata </href>... </li>, the #pcdata slot is for attribute title). This stage also handles wrapper repairing. When a wrapper breaks, such as due to source changes, the developer will fix the wrapper, either by regenerating it from scratch (requiring collecting new training example pages, labeling, etc.) or by inspecting and fixing the model directly.

Wrapper Execution. In regular production, at the wrapper execution stage, we use the wrapper to extract data records from input pages of the source. Essentially, the wrapper will execute its model Ω over each input page, i.e., match Ω (say, as tag tree patterns in RoadRunner) with the page and thus extract data in the desired slots. Thus, routinely, given data pages as input, the wrapper outputs extracted data by executing its model trained earlier.

Figure 3.1: Wrapper through its full life cycle.

The exact execution (or parsing) mechanism depends on how the model is expressed. For instance, in most induction approaches, where Ω uses tag tree path patterns, the wrapper finds the matching paths (and data elements) in the DOM tree of an input page. If Ω uses tag delimiters, then the wrapper would locate the matching tags and identify data values in between.

Wrapper Verification. Over time, a wrapper may break, i.e., it can no longer extract data satisfactorily from the source, since the source may change. When the source changes its page structure, the wrapper's model Ω no longer matches the source pages well. As such changes are expected, in the wrapper verification stage we must regularly check the health of the wrapper, e.g., by monitoring the quality of the output data. If the wrapper indeed breaks, it is sent back to the first stage for repairing.

Not all source changes will break a wrapper. The exact impact depends on the particular model of the wrapper. Since different wrapper approaches use different models and execution mechanisms, they differ in how their wrappers react to changes. For instance, as most induction approaches resort to HTML tag path patterns, for any small change (say, inserting an additional tag <b>...</b>), a path pattern may no longer match.

Implications: Wrapper Requirements. Throughout the life cycle of a wrapper, we can clearly identify several important requirements for its effective operation. As the basis, Figure 3.1 marks the performance parameters:

Labor L: In creation-&-repair, how much manual labor does it require?

Skill S: In creation-&-repair, what skill does it require?

Accuracy P_1: In execution, how accurate is the wrapper?

Consistency P_2: In verification, how consistently does the wrapper remain correct over time?

With these key parameters that characterize various aspects of a wrapper approach, we clearly identify the following requirements for a wrapper framework to be effective.

R1: Accuracy. To produce high quality data, we require high accuracy, i.e., we maximize P_1. To achieve accuracy, a good framework must be robust in handling various sources with varying degrees of template regularity to induce.

R2: Consistency. To reduce maintenance cost, we require high consistency, i.e., we maximize P_2. To achieve consistency, a good framework must be resistant to source evolutions with varying degrees of change significance. We stress that, with the rapid evolution of Web data, sources tend to change more and more frequently, and thus consistency is crucial.

R3: Intuitiveness. To reduce human cost, we require high intuitiveness of working with the framework, i.e., we reduce sophisticated work, or L and S. Where is the manual work? To begin with, as just explained, full automation is unlikely, and most approaches require certain manual work in preparing the input and matching the output of wrapper creation. Further, as no such automatic approach can guarantee 100% accuracy, a developer often needs to correct or tune a wrapper (including repairing broken wrappers). Thus, in addition to reducing the amount of work L, we also desire that the generated wrappers or their models be easy for users to understand.

Problems: Current Deficiencies. Having outlined the requirements, we find that, unfortunately, no current approach meets all of them. We discuss each requirement in turn. To be concrete, we use two example pages from hotjobs.yahoo.com, as Figure 1.1 shows, collected at two different dates (August 2005 and October 2004, respectively), excerpted from our 2Y5D Dataset (a set of pages over two years in 5 domains; Table 5.1).

First, for accuracy: Most current approaches require rigid regularity in HTML tag path sequences, with the fundamental assumption that all data records share similar tag paths. Such an assumption can often be violated with today's increasingly complex page styles and HTML coding, and thus compromises accuracy. Consider a simple example in Figure 1.1b, where the odd and even rows (in the tabular listing) are of different formats, which result from different underlying HTML tag values and tag structures. Thus, the DOM subtrees of even and odd tuples can be quite different. This type of page, therefore, causes difficulties for current approaches that use HTML tag patterns, essentially because the regularity at the HTML level is limited. (Our experiments in Chapter 5 validate this observation by comparing the robustness of different approaches for different structures.)

Second, for consistency: All current approaches rely on quite low-level and internal page features in their modeling, which are rather sensitive to even small changes in sources. The existing frameworks all resort to HTML-level characteristics, such as DOM structure, color, text pattern, length of data, text size, etc., as their features for modeling (the Ω). Those features are only seen in the HTML coding and are not visible to end users; thus they represent low-level and internal details that may change even when the desired elements are largely unaffected. Consequently, current approaches compromise consistency with their choice of model features. For example, observe the two pages in Figure 1.1, which capture the evolution of hotjobs.yahoo.com. While the visual characteristics are quite similar (e.g., the attributes are aligned in the same way visually), the underlying HTML features are radically different, and will break any wrapper that remembers such patterns. (Chapter 5 also validates this observation by comparing the consistency of different approaches over a two-year course.)

Third, for intuitiveness: With low-level HTML features and tag path structures as their model expression language, current wrappers require users who can speak HTML code. While everyone can browse Web pages, it requires relatively skilled programmers to manipulate HTML code. Thus, current approaches, again, compromise intuitiveness. For instance, for patterns generated by, say, RoadRunner, the developer needs to match the data slots to attributes, which requires reading HTML code (and regular expressions) like <li><i> Author: </i> (<b> #pcdata </b>)+.

Insight: Model Matters. As we just analyzed, it becomes evident that the deficiencies of the current state of the art are inherently due to the choice of modeling, i.e., how we describe extraction patterns. While many approaches have been proposed with different techniques, surprisingly, to date, they all uniformly assume HTML-level features and patterns as the modeling language. The low-level modeling has resulted in wrappers that rely on rigid patterns (thus reducing accuracy), are sensitive to internal and small changes (thus affecting consistency), and require HTML skills (thus barring intuitiveness). Our main thesis in this paper is, therefore, that the choice of modeling matters. We aim to address the current deficiencies by understanding the impact of modeling, and to propose an effective framework with novel modeling.

The Wrapper Modeling Principles. Reflecting on the limitations of current approaches, we believe that appropriate modeling must follow two principles:

High-level Features: As just explained, current modeling relies on low-level HTML features that are internal to a page (or invisible to users), and which are thus likely irregular and unstable. Our modeling should use high-level features that are visible to human users.

Minimal Patterns: Further, current modeling also relies on regularity patterns that involve tag sequences, either paths leading to the desired elements or delimiters around them. Such patterns tend to be compromised by changes even in just the surrounding context of elements (e.g., adding a link to each author, or inserting a Used Price). Our modeling should use minimal patterns that concentrate only on the elements of interest, and not their surrounding context.

Our Proposal: Visual Relational Modeling. Guided by the dual modeling principles, we develop a novel wrapper framework consisting of a new model and the associated learning and execution techniques. As the key foundation, we propose to construct wrappers with visual features and relational patterns. On one hand, from the Principle of High-level Features, we elevate the level of abstraction for our wrappers to the visual-level features of a page, exactly as what human users will see of the page as rendered by a browser, which is probably the highest level possible. On the other hand, from the Principle of Minimal Patterns, we restrict our patterns to only those relations between desired elements (and not surrounding tag sequences). Thus, to see explicitly what elements are desired, we require input of one example record. For instance, consider Figure 1.1, supposing we want to extract jobtitle, company, and date. Focusing on these elements, we may describe them as left(jobtitle, company) (jobtitle is at the left of company) and left(company, location). Note that these hold for both pages of different times.

System Setting: We conclude with a concrete definition of our system setting.

Input: One or more example data pages, where one record is labeled with the desired attributes.

Output: A wrapper for extracting similar data pages.

3.1 Visual Relational Wrapper Model

At the core of our system, we need a mechanism for specifying a wrapper. For a wrapper W to extract data from a page P, such a specification, or a model, should describe what elements on the page are of interest and where they are. The effectiveness of a wrapper essentially hinges on its model. As the driving mechanism of a wrapper, the model determines the performance of the wrapper and serves as the interface to users who train the wrapper.

Figure 3.2: Example page fragment (amazon.com).

Thus, our requirements (Section 3) for wrapper accuracy, robustness, and intuitiveness directly translate into the desired properties for the model. We believe that wrapper induction is not simply the problem of learning patterns and inducing a model; the choice of model does matter. As Section 3 explained, while various solutions exist, they all universally assume standard HTML as the representation for their modeling of Web pages. Because their wrapper models all amount to specifications of tag sequence patterns in HTML trees, while their induction approaches differ, they all suffer the limitations inherent in this choice of modeling.

As our main insight, to meet the requirements, our model clearly distinguishes itself from the traditional specification: We propose a visual relational constraint model for specifying a wrapper, which elevates the page representation to the visual level (instead of the hidden HTML code) and minimizes the constraints to only relational patterns (instead of sequence patterns) between the elements of interest.

Given an HTML data page, which contains a set of data records (usually results in response to a query), since a wrapper aims to extract those records, its model must describe how to locate such records on the page, i.e., for each record: What are the desired elements? Where are they on the page? As our running example, we consider the page fragment shown in Figure 3.2.

What: Schema. First, what elements are of interest? Essentially, as we are looking for a set of records, we are asking what constitutes such records, i.e., their schema. We assume a record is a flat set of attributes, each of which can be omitted or repeated. We found this structure simple yet sufficiently expressive for most data sources. As we focus on extracting the values of data elements, and not their potential hierarchical structure, we view records as flattened, which is natural in most cases. Even for the rare cases when data is nested (e.g., an airfare itinerary, where a record contains departure and return, each of which can be a record of several attributes, e.g., time and flight), our model can still target the desired elements and extract their values, although without the potential hierarchy (e.g., as time1, flight1, time2, flight2). Further, the flexible multiplicity of attribute occurrence, as we found, is frequently required, as data is not always uniform (e.g., a book record may not have a cover, or may have multiple authors).

Thus, as the first component of our model (the what component), we define the schema of a record as (E, T, Q), specifying a set of attributes E = {a_1, ..., a_n}, their types T = {t_1, ..., t_n}, and quantifiers Q = {q_1, ..., q_n}. That is, the schema specifies some n attributes, each with an attribute name (or attribute identifier) a_i, a type t_i, and a quantifier q_i. Equivalently, this component can be considered a set of attributes E = {e_i} (represented by attribute names), where each attribute e_i is a 2-tuple (type, quantifier).

Example 1 (Schema): For our example (Figure 3.2), suppose we are interested in, for each book, the cover image (cover), title, author, format (hardcover or paperback), and Buy New price. As types, we see that author and format are plain text, title is a link (or anchor text), cover is an image, and price is a number. As quantifiers, all the attributes appear exactly once, except author, which may appear multiple times. The schema model of the desired book records is thus E = { cover(image, 1), title(link, 1), author(text, +), format(text, 1), newprice(number, 1) }.

To describe types, the system supports a customizable set of types T, from which e_i:type is drawn, i.e., e_i:type ∈ T. Even though we keep the type set T open in our framework (for the purpose of customization and flexibility), the implemented type recognizer in our framework is error-free, since T is a generalization of the standard HTML tag set. The type set, however, can include any domain of values that are of interest to the application and that can be recognized from pages. To describe the multiplicity of an attribute, i.e., how many values may occur, the system supports the set of quantifiers Q. We adopt the standard regular expression quantifiers, Q = {1, ?, +, *}.
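To make the what component concrete, the following is a minimal Python sketch of schema attributes as (name, type, quantifier) tuples, instantiated with the book schema of Example 1. The class and identifiers are our own illustration, not part of the thesis's implementation.

```python
from dataclasses import dataclass

# Quantifiers follow the regular-expression convention in the text:
# "1" (exactly once), "?" (optional), "+" (one or more), "*" (zero or more).
QUANTIFIERS = {"1", "?", "+", "*"}

@dataclass(frozen=True)
class Attribute:
    """One schema attribute e_i = (name, type, quantifier)."""
    name: str
    type: str        # drawn from the customizable type set T, e.g. "image", "link", "text", "number"
    quantifier: str  # drawn from Q = {"1", "?", "+", "*"}

    def __post_init__(self):
        if self.quantifier not in QUANTIFIERS:
            raise ValueError(f"unknown quantifier: {self.quantifier}")

# The book schema from Example 1.
book_schema = [
    Attribute("cover", "image", "1"),
    Attribute("title", "link", "1"),
    Attribute("author", "text", "+"),
    Attribute("format", "text", "1"),
    Attribute("newprice", "number", "1"),
]
```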

Where: Visual Relations. Second, where are those elements of interest? While existing wrapper approaches all address elements by HTML tag path patterns, we take a fundamentally different view. For describing the where, as the second component of our model, we provide matching patterns in terms of constraints on the elements, where each constraint is gauged at the visual level (and not the HTML tags), and involves only the elements of interest (and not the irrelevant sequences in the surroundings). Each constraint is thus a binary visual relation between a pair of desired attributes. Note that, in principle, n-ary relations are possible; we choose to use only binary relations, for intuitiveness and simplicity. Our design of visual relations follows directly from, as Section 3 motivated, the principles of the highest level of presentation and the minimal extent of patterns. To be at the highest level, we gauge the visual perception of users and, to be minimal, we characterize only the desired attributes.

Consider Figure 3.2 with the schema in Example 1: how do we describe where these attributes are on the page? With visual relations, our patterns describe how the attributes relate, in terms of visual layout, to each other. For instance, cover is at the left of title, or left(cover, title); title is at the top of price, or top(title, price); and cover is at the left of price, or left(cover, price); etc. In determining whether a particular visual relationship holds, we use each element's visual positions as determined by browser rendering, i.e., as human users would see it. Specifically, for a given page, such visual elements are produced by rendering the page as in a browser and then tokenizing it into basic units, each associated with visual positions on the page. We characterize each element by its entire span, i.e., the tight bounding box that encloses the element: We view the page as a Cartesian coordinate system, with the top-left corner as the origin (0, 0). On the page, each element is a rectangle with a start point (x, y) as its top-left corner, from which each dimension extends a range, width and height respectively, as a rectangle area; thus its visual coordinate is (x, y, width, height). To determine a visual relation of two elements a_1 and a_2, we simply compare their coordinates, i.e., (a_1.x, a_1.y, a_1.width, a_1.height) versus (a_2.x, a_2.y, a_2.width, a_2.height).

To describe such visual relational constraints in our model, the system should support a set of predicates as the vocabulary. While these predicates may capture various relationships between elements, as Section 3 motivated, we want them to be intuitive and easy for users to understand, and thus we wish to keep these predicates simple yet sufficient in capturing the visual arrangement of records. What are the essential predicates to support? As the essence of visual layouts, we observe that all data pages share common presentation characteristics:

Two-dimensional topology: Elements are related to each other in both the x-dimension, left and right, and the y-dimension, top and bottom. As right and bottom are simply the converses of left and top, we support the predicates left(·) and top(·). E.g., as noticed earlier, in Figure 3.2, we have left(cover, title) and top(title, price).

Tabular alignment: Records are often laid out in some tabular alignment: horizontally aligned for the row orientation, and vertically aligned for the column orientation. Correspondingly, we support the predicates alignx(·) and aligny(·). E.g., in Figure 3.2, since the cover image is vertically aligned with title, their relation aligny(cover, title) holds true.

Overall, to capture these essential characteristics, we need to support only four predicates, V = {left, top, alignx, aligny}. While the choices are naturally motivated by the visual characteristics of record layout patterns, they prove to be very effective in our empirical study (Section 5). While expressive, being only a small number of simple relationships, these predicates are quite intuitive to understand and easy to determine, which indeed meets our requirements.

Definition 1 (Visual Relations): A visual relation between attributes a_1 and a_2 is a binary predicate r(a_1, a_2), where r ∈ V = {left, top, alignx, aligny}. Each predicate is determined as follows:

left(a_1, a_2): true if a_1.x + a_1.width ≤ a_2.x.

top(a_1, a_2): true if a_1.y + a_1.height ≤ a_2.y.

alignx(a_1, a_2): true if ¬left(a_1, a_2) ∧ ¬left(a_2, a_1).

aligny(a_1, a_2): true if ¬top(a_1, a_2) ∧ ¬top(a_2, a_1).

Since a relation is a predicate between attributes, it is either true or false in each record. However, it may not hold uniformly across all records. Some relations may hold for all records; e.g., in Figure 3.2, left(cover, title) does hold for all the records. In contrast, for records 1 and 2, observe that title is at the top of format ("hardcover"), which does not hold for record 3 (where title is on the same row as format "paperback"); thus, top(title, format) is inconsistent from record to record. Such inconsistency can result from either client-side rendering settings or server-side data characteristics. The client-side effect arises when data is longer than the width of its container (e.g., document, browser, etc.) and thus automatically wraps to a new line. This inconsistency, however, is rather easily removed by extending the canvas width in a buffer while rendering the page; the technique is cheap and trivial to implement. We call the state obtained by applying this technique the unbounded-canvas environment (it will be used in our framework).

Therefore, as visual relations may not be consistent across records, we need to capture their fuzziness in a probabilistic sense. For our toy example just mentioned, top(title, format) holds true 2/3 or 67% of the time, statistically, while left(cover, title) holds 3/3 or 100%. Each visual relation r in our model is thus associated with a probability p(r), written as r:p(r), which indicates how likely r is to hold true in a record, e.g., top(title, format):0.67 and left(cover, title):1.0.

Example 2 (Visual Relations): Continuing Example 1, for our example page, what are the visual relations? Examining every pair of attributes from E, we may identify several visual relations with non-zero probabilities, i.e., holding true in at least one record. For instance, between cover and title, checking each relation r in V, we find that left(cover, title) and aligny(cover, title) hold for all three records, thus both 100% (and top and alignx have zero probability). For the reversed pair, i.e., (title, cover), only aligny holds (with 100%). We can similarly check the remaining pairs, to obtain the set of visual relations R = {left(cover, title):1.0, aligny(cover, title):1.0, aligny(title, cover):1.0, top(title, price):1.0, top(title, format):0.67, ...}.
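The predicates of Definition 1 reduce to simple bounding-box arithmetic, and the probability p(r) of a relation is just its frequency over labeled records. The following Python sketch shows both; the Box class and relation_probability helper are hypothetical names of ours, and alignx/aligny follow the negated reading reconstructed in Definition 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Rendered bounding box of a page element, top-left origin (Section 3.1)."""
    x: float
    y: float
    width: float
    height: float

def left(a: Box, b: Box) -> bool:
    return a.x + a.width <= b.x      # a ends (in x) before b starts

def top(a: Box, b: Box) -> bool:
    return a.y + a.height <= b.y     # a ends (in y) before b starts

def alignx(a: Box, b: Box) -> bool:
    return not left(a, b) and not left(b, a)   # x-spans overlap: same column

def aligny(a: Box, b: Box) -> bool:
    return not top(a, b) and not top(b, a)     # y-spans overlap: same row

PREDICATES = {"left": left, "top": top, "alignx": alignx, "aligny": aligny}

def relation_probability(pred_name, attr1, attr2, records):
    """Estimate p(r) as the fraction of records in which r(attr1, attr2) holds.

    `records` is a list of dicts mapping attribute names to Box positions."""
    pred = PREDICATES[pred_name]
    hits = sum(1 for rec in records if pred(rec[attr1], rec[attr2]))
    return hits / len(records)
```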

Overall: Wrapper Model. With the schema E and visual relations R in place, in our system we define a model Ω = (E, R), which specifies what the attributes are and where they are, for a record in our target data page to extract. E.g., for our example (Figure 3.2), Ω consists of the schema in Example 1 and the visual constraints in Example 2.

Figure 3.3: Model Execution.

Definition 2 (Visual Relational Wrapper Model): The visual relational wrapper model for a data page is a 2-tuple Ω = (E, R), which specifies the schema and visual characteristics of the records on the page: E is the set of 2-tuple attributes e(type, quantifier), with type e:type and quantifier e:quantifier, and R is the set of visual relations between the attributes.

3.2 Model Execution: Extracting Data

In this section, we formulate the model execution architecture. Given a model Ω = (E, R) and a page P, we need to output a maximal set of non-overlapping tuples (i.e., data records) Υ = {Υ_i} from P generated by Ω. We denote the probability that a tuple Υ_i is generated by the visual model Ω as p(Υ_i | Ω). If p(Υ_i | Ω) is too small, it is unlikely that Υ_i is generated by Ω, and thus it is not a good candidate tuple to extract. Therefore, we use a generative threshold θ_0 as a lower bound on the generative probability to determine whether a candidate tuple Υ_i is considered to be generated by Ω. In other words, a candidate tuple Υ_i is a valid tuple if and only if p(Υ_i | Ω) ≥ θ_0. The higher p(Υ_i | Ω), the better the tuple Υ_i is; p(Υ_i | Ω), hence, also indicates the ranking score of a candidate tuple. Consequently, the output of our model extraction is a maximal non-overlapping set of valid tuples {Υ_i} with the highest ranking score (Equation 3.1):

$$\Upsilon = \operatorname*{argmax}_{\{\Upsilon_i \,\mid\, p(\Upsilon_i \mid \Omega) \geq \theta_0\}} \; \sum_{\Upsilon_i \in \{\Upsilon_i\}} p(\Upsilon_i \mid \Omega) \qquad (3.1)$$
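As a rough illustration of Equation 3.1, the sketch below keeps candidate tuples whose generative probability clears θ_0 and then greedily packs a non-overlapping set. The greedy pass is our simplification and does not guarantee the global argmax; the representation of candidates is assumed, not the thesis's.

```python
def select_records(candidates, theta0):
    """Select a non-overlapping set of valid tuples (cf. Equation 3.1).

    `candidates` is a list of (tuple, probability, span) triples, where `span`
    is the set of page elements the tuple occupies (used for the overlap test)."""
    valid = [c for c in candidates if c[1] >= theta0]    # keep only valid tuples
    valid.sort(key=lambda c: c[1], reverse=True)         # highest p(Y_i | Omega) first
    chosen, used = [], set()
    for tup, p, span in valid:
        if used.isdisjoint(span):                        # non-overlapping constraint
            chosen.append((tup, p))
            used |= span
    return chosen
```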

Note that the visual model Ω, by definition, holds the statistical measures of visual relations among the attributes of a data record. Each such measure, in fact, represents a generative distribution of one relation between two attributes. For example, for a simple pair of two 1-quantifier attributes e_i, e_j ∈ E (e_i:quantifier = 1 and e_j:quantifier = 1), the relation r(e_i, e_j):p_r has only two possible instantiations: r(e_i, e_j) = 1 or r(e_i, e_j) = 0 (i.e., r holds or does not hold), with probabilities p_r and (1 - p_r) respectively. The real distribution, however, can be much more complicated (Section 3.2.1), since we support all possible quantifiers. Each combination of |R| relation instantiations, in turn, denotes a specific alignment layout of target data records, which we call a relational schema configuration (or schema configuration for short). Since schema configurations capture all possible variations of the alignment layout of a data record, a record candidate essentially follows one specific configuration.

Our extraction framework is thus three-phased. First, considering the visual model Ω as a visual alignment generative model, we generate schema configurations and their generative probabilities (Section 3.2.1). Second, toward efficient parsing, we optimize the parsing order so as to identify invalid configurations as soon as possible; this information is stored inside a tree structure called the configuration tree T_guide (Section 3.2.2). Third, we parse page P following the guidance of T_guide, aiming for the top-ranked dataset which satisfies Equation 3.1 (Section 3.2.3).

3.2.1 Relational Schema Generative Model

As Section 3.1 discussed, our visual model Ω captures the relative alignment information between each pair of attributes (i.e., visual relations). As such, two data records should be considered the same w.r.t. generative behavior from Ω (i.e., have identical generative probability) as long as they share the same schema configuration. Implicitly holding statistical distributions of visual relations, our visual model is thus a generative model of schema configurations; the generative probability of a record implies the generative probability of its schema configuration. This section explains the internal components of schema configuration generation.

Model Reduction

A schema configuration is a combination of relation instantiations. Ideally, each relation r(e_i, e_j) of two attributes e_i, e_j should have only two instantiations: either hold or not-hold. Unfortunately, this is not always the case. A multi-instance attribute (e.g., author in Amazon's books) with a + or * quantifier can make its relation fuzzy, since the relation might hold for some instances but not for others. Such fuzziness is further deepened with optional attributes (i.e., * and ?). Having identified the source of relation instantiation fuzziness, we therefore want to reduce the quantifier set.

Firstly, we observe that (e^+) = (e^1)(e^*), and thus a +-attribute can be replaced by one 1-attribute and one *-attribute. This conversion is done by the quantifier decomposition operator Q_D (Definition 3). Secondly, we further observe that an optional attribute becomes non-optional if we include null in the data type. This transformation (denoted by Q_R) is formalized in Definition 4.

Definition 3 (Quantifier Decomposition): A quantifier decomposition operator (Q_D) is an operator which transforms a visual model Ω = (E = {e_1, ..., e_m}, R) containing some +-quantifier attribute e_k into a model Ω̃ = (Ẽ, R̃) without such an attribute, by replacing e_k(type, +) with two attributes e_k^1(type, 1) and e_k^*(type, *), so that

Ẽ = {e_1, ..., e_k^1, e_k^*, e_{k+1}, ...}
R̃ = R − R_k + Replace(R_k, e_k, e_k^1) + Replace(R_k, e_k, e_k^*)

where R_k is the relation set of e_k.

Definition 4 (Optional Removal): An optional removal operator (Q_R) is an operator which transforms any optional attribute e_k(type, quantifier) of a visual model Ω into a non-optional attribute e_k(type ∪ null, quantifier'), where e_k:quantifier' = 1 if e_k:quantifier = ?, and e_k:quantifier' = + if e_k:quantifier = *.

By applying the two operators Q_D and Q_R in that order, the induced model is guaranteed to have only two types of quantifier: 1 and +. This 2-step model transformation seems to pose an internal conflict (i.e., we first remove +-attributes and later transform back to +-attributes), but in fact it does not: after the 2-step transformation, every +-attribute is guaranteed to have a type that includes null. This plays a crucial role in identifying the hidden distribution of relation instantiations, which decides the generative behavior of Ω. From now on, we assume the visual model contains only the quantifiers 1 and +.

Relational Schema Configuration Generation

We can safely assume that every +-attribute contains at most N_max instances; N_max is called the instance bound. Empirically, in our system, which operates on the 2Y5D dataset, we choose N_max = 3. From model reduction, we know that every +-attribute of the (reduced) model Ω accepts null as a valid type. As a consequence, a +-attribute e_i is comparable with an N_max-tuple {e_i^1, ..., e_i^{N_max}}, where e_i^k can be a null instance.
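A sketch of the two reduction operators over (name, type, quantifier) triples might look as follows. For brevity it rewrites only the attribute set E, omitting the Replace(R_k, ...) rewriting of the relation set from Definition 3; all identifiers are ours.

```python
def quantifier_decomposition(attributes):
    """Q_D (Definition 3): replace each e(type, +) by e^1(type, 1) and e^*(type, *)."""
    out = []
    for name, typ, q in attributes:
        if q == "+":
            out += [(name + "^1", typ, "1"), (name + "^*", typ, "*")]
        else:
            out.append((name, typ, q))
    return out

def optional_removal(attributes):
    """Q_R (Definition 4): make optional attributes non-optional by adding null to the type."""
    out = []
    for name, typ, q in attributes:
        if q == "?":
            out.append((name, typ + "|null", "1"))
        elif q == "*":
            out.append((name, typ + "|null", "+"))
        else:
            out.append((name, typ, q))
    return out

# After Q_D then Q_R, only quantifiers "1" and "+" remain, and every
# +-attribute has a type that includes null.
reduced = optional_removal(quantifier_decomposition(
    [("title", "link", "1"), ("author", "text", "+")]))
# reduced == [("title", "link", "1"), ("author^1", "text", "1"), ("author^*", "text|null", "+")]
```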

Relation Instantiation: Implicit Distribution

With the probabilistic relation set R, we now define the underlying distribution of each relation r(e_i, e_j) ∈ R. As noted above, relation instantiation depends entirely on the quantifiers of the relevant attributes. As such, given p_r as the probability of relation r ∈ R, we have three scenarios for the set {e_i:quantifier, e_j:quantifier}, as follows.

First, {1, 1}: There are two possible instantiations, Inst_1 when r(e_i, e_j) holds and Inst_0 when r(e_i, e_j) does not hold, with probabilities

$$P_1(r = Inst_k \mid \Omega) = \begin{cases} p_r & \text{if } k = 1 \\ 1 - p_r & \text{if } k = 0 \end{cases} \qquad (3.2)$$

Second, {1, +}: Without loss of generality, we assume e_j is the +-attribute. Thus, relation r is actually a set of N_max primitive relations r(e_i, e_j^k) with k = 1, ..., N_max. Intuitively, r has (1 + N_max) instantiations {Inst_k}, where Inst_k indicates that exactly k primitive relations hold. There are $C^{N_{max}}_k = \frac{N_{max}!}{k!(N_{max}-k)!}$ different picks for such a k-set of holding relations from the N_max primitive relations, each with probability $p_r^k (1-p_r)^{N_{max}-k}$. Therefore, the probability of a relation instantiation Inst_k is:

$$P_2(r = Inst_k \mid \Omega) = C^{N_{max}}_k \, p_r^k \, (1-p_r)^{N_{max}-k} \qquad (3.3)$$

Third, {+, +}: Similarly, this relation is actually a set of (N_max)^2 primitive relations r(e_i^u, e_j^v) with u, v = 1, ..., N_max. Thus, r has (1 + (N_max)^2) instantiations {Inst_k}, where Inst_k indicates that exactly k primitive relations hold. The probability of a relation instantiation Inst_k is:

$$P_3(r = Inst_k \mid \Omega) = C^{(N_{max})^2}_k \, p_r^k \, (1-p_r)^{(N_{max})^2 - k} \qquad (3.4)$$
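Equations 3.2 through 3.4 are all binomial distributions over the number of holding primitive relations; only the number of trials differs. A compact sketch, with hypothetical helper names of ours:

```python
from math import comb

def instantiation_probability(k, p_r, q1, q2, n_max):
    """P(r = Inst_k | Omega) per Equations 3.2-3.4.

    q1, q2 are the quantifiers ("1" or "+") of the relation's two attributes;
    Inst_k means exactly k of the primitive relations hold."""
    if q1 == "1" and q2 == "1":
        n = 1                 # Eq. 3.2: a single primitive relation
    elif q1 == "1" or q2 == "1":
        n = n_max             # Eq. 3.3: N_max primitive relations
    else:
        n = n_max * n_max     # Eq. 3.4: (N_max)^2 primitive relations
    return comb(n, k) * p_r**k * (1 - p_r)**(n - k)

# With p_r = 0.6 and N_max = 2 (one 1-attribute, one +-attribute):
# Inst_2 = 0.36, Inst_1 = 0.48, Inst_0 = 0.16 -- the r_2 distribution of Example 3 below.
print([instantiation_probability(k, 0.6, "1", "+", 2) for k in (2, 1, 0)])
```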

Generation Behavior and Generative Probability

We now discuss how model Ω generates relational schema configurations. By definition, model Ω represents n_R = |R| distributions of visual relations. For each relation r ∈ R, Ω simply selects one instantiation Inst_r with probability P(Inst_r | Ω). The final result of n_R such selections over all r ∈ R is an n_R-set of relation instantiations, which we call a schema configuration. The probability that Ω generates a configuration is called the configuration generative probability, which we now formalize. Assuming all relations in R are mutually independent, each selection of a relation instantiation is also independent of the others. As such, the configuration generative probability P({Inst_r} | Ω) of a configuration in which relation r has instantiation Inst_r is the product of its instantiation probabilities P(r = Inst_r | Ω) (Equation 3.5). A configuration with generative probability not less than the generative threshold θ_0 is considered a valid configuration; ones with probability less than θ_0 are called invalid configurations.

$$P(\{Inst_r\} \mid \Omega) = \prod_{r \in R} P(r = Inst_r \mid \Omega) \qquad (3.5)$$

where

$$P(r(e_i, e_j) = Inst_r \mid \Omega) = \begin{cases} P_1(r = Inst_r \mid \Omega) & \text{if both } e_i, e_j \text{ are 1-attributes} \\ P_2(r = Inst_r \mid \Omega) & \text{if exactly one of } e_i, e_j \text{ is a 1-attribute} \\ P_3(r = Inst_r \mid \Omega) & \text{if neither } e_i \text{ nor } e_j \text{ is a 1-attribute} \end{cases}$$

3.2.2 Configuration Tree: Parsing Efficiency

Invalid configurations are unimportant in our extraction framework, since they represent data records which are unlikely to be generated from Ω. Generally, to identify whether a configuration C = {Inst_r^C} is invalid (Inst_r^C being the instantiation of r in C), we need to check its generative probability following Equation 3.5. Intuitively, if there exists a subset C_sub ⊆ C (called a partial configuration of C) such that ∏_{Inst_r^C ∈ C_sub} P(r = Inst_r^C | Ω) < θ_0, then C is definitely an invalid configuration, since P(C | Ω) ≤ P(C_sub | Ω); such a C_sub is called an invalid partial configuration. Consequently, an invalid configuration can be identified, without the need to identify all of its relation instantiations, as soon as we find an invalid partial configuration of it.

To capture the generative probability of such partial configurations, we consider the configuration generation process as a sequence of relation instantiation generations. The generation process, with respect to a specific generative sequence (r_1, r_2, ..., r_{n_R}), can be represented by an n_R-depth tree called the configuration tree. A node at level i represents a partial configuration (Inst_{r_1}, ..., Inst_{r_i}); each node at level i has exactly N^Inst_{r_{i+1}} children, where N^Inst_{r_{i+1}} is the number of instantiations of relation r_{i+1}. Each child at level (i + 1) is a partial configuration which extends its parent configuration with one specific instantiation of r_{i+1} (denoted by the edge from its parent). In general, level i of a configuration tree w.r.t. the order (r_1, ..., r_{n_R}) holds all possible partial configurations of the relation set r_1, ..., r_i. Leaf nodes are therefore full schema configurations (i.e., partial configurations over all relations), each with its configuration generative probability. The sequence (r_1, r_2, ..., r_{n_R}) is called the parsing order.

Example 3 (Configuration Tree): Assume a model Ω = (E, R) from Amazon.com has E = {title^1, author^+, UsedPrice^1}, where the superscript denotes the attribute's quantifier, and R = {r_1 = left(title, UsedPrice):0.7, r_2 = left(author, UsedPrice):0.6, r_3 = top(title, UsedPrice):1}. The generative threshold is θ_0 = 0.1, and the instance bound is N_max = 2 for books on Amazon. Notationally, we write r(I_k : p) to indicate that instantiation Inst_k (i.e., exactly k primitive relations hold) of relation r has probability p. As such, we have three distributions: r_1(I_1: 0.7, I_0: 0.3), r_2(I_2: 0.36, I_1: 0.48, I_0: 0.16), r_3(I_1: 1, I_0: 0). Figure 3.4(a) shows the configuration generation w.r.t. the relation order r_1, r_2, r_3. The tree is generated as follows: Starting from the root (level 0), we consider the first relation in the parsing order (i.e., r_1); this relation has two instantiations, I_1^{r_1}: 0.7 (hold) and I_0^{r_1}: 0.3 (not hold). As such, we have two branches from the root indicating these two instantiations of r_1, with probabilities 0.7 and 0.3 respectively.

Figure 3.4: Configuration Tree Generation. (a): order r_1, r_2, r_3; (b): order r_3, r_2, r_1.

The two child nodes at level 1 are therefore two partial configurations, {r_1 = I_1^{r_1}} and {r_1 = I_0^{r_1}}. Each of these two nodes generates three children at level 2, since relation r_2 has three different instantiations, etc.

Parsing Order: Toward Efficient Parsing

Observation of Figure 3.4(a) shows that even though more than half of the generated configurations are invalid (i.e., 7 out of 12), most of them (i.e., 5) can only be identified when the tree is fully generated. With a different parsing order, we observe a major difference in the configuration tree in Figure 3.4(b): all invalid configurations except one can be identified without the need to generate the full configuration. Since one configuration represents many record candidates, configuration tree pruning is a crucial step toward efficient parsing.

As the above observation motivates, we essentially need to identify the parsing order which leads to the best pruned configuration tree (i.e., the smallest number of nodes). This problem shares some similarity with the decision tree classification problem, where we need to identify first the attribute that maximizes classification capability. In our context, the best relation is the one that can lead to invalid configurations as soon as possible. As a result, comparable with several heuristics used in decision tree classification, we can apply a simple heuristic of picking the relation which contains the lowest instantiation probability p_min. For example, in Example 3, we favor r_3 first, since p_min(r_3) = 0, and r_1 last, since p_min(r_1) = 0.3.

In our implementation, however, we decided to take a brute-force approach to find the best parsing order, for the following reasons. Firstly, the parsing order is model-dependent only, and thus it can be computed offline once and used for every extracted page. Secondly, the number of parsing orders is quite small (e.g., 24 for a 4-relation model) and generating a tree is extremely fast (because all distributions of relation instantiations are known), so the brute-force approach is actually fast. Lastly, saving one branch of the pruned tree means a huge saving in the parsing phase, since many data record candidates match that instantiation branch. The algorithm is thus straightforward: for each parsing order, from the root node we expand the next-level nodes by instantiations of the first relation; a new node is then expanded again by instantiations of the next relation as long as its probability is ≥ θ_0. Finally, after the tree is generated, any leaf node that is either not at depth n_R or has generative probability less than θ_0 is removed along with its edges. The number of remaining nodes determines the size of the configuration tree with that parsing order; we output the smallest tree.
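A minimal sketch of the brute-force search just described: for each parsing order we grow the configuration tree, pruning any partial configuration whose probability falls below θ_0, and keep the order with the smallest tree. The node counting is simplified relative to the thesis's exact procedure, and the distributions shown are those of Example 3.

```python
from itertools import permutations

def tree_size(order, dists, theta0):
    """Node count of the pruned configuration tree for one parsing order.

    `dists` maps each relation to its instantiation distribution {k: probability};
    a branch is expanded only while its partial probability stays >= theta0."""
    frontier = [1.0]   # partial-configuration probabilities at the current level
    size = 1           # the root
    for rel in order:
        nxt = []
        for p in frontier:
            for p_inst in dists[rel].values():
                if p * p_inst >= theta0:
                    nxt.append(p * p_inst)
                    size += 1
        frontier = nxt
    return size

def best_parsing_order(dists, theta0):
    """Brute force over all n_R! parsing orders; return the smallest-tree order."""
    return min(permutations(dists), key=lambda o: tree_size(o, dists, theta0))

# Example 3: r1(I1:0.7, I0:0.3), r2(I2:0.36, I1:0.48, I0:0.16), r3(I1:1, I0:0).
dists = {"r1": {1: 0.7, 0: 0.3},
         "r2": {2: 0.36, 1: 0.48, 0: 0.16},
         "r3": {1: 1.0, 0: 0.0}}
print(best_parsing_order(dists, theta0=0.1))   # favors starting with r3
```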

3.2.3 Parsing

This section presents the parsing framework following a pruned configuration tree T_guide. We first generate attribute candidates from page P, then prune them using distance-based clustering. Candidates of different attributes are then combined together, w.r.t. the parsing order in the configuration tree, to form valid data records. Ranking is applied on non-overlapping sets of valid records to determine the best output dataset.

Attribute Candidate Generation

This section introduces the technique to generate and shorten the set of attribute candidates from a page P for a given model Ω = (E = {e_1, ..., e_n}, R). Basically, for each attribute e ∈ E, our type recognizer generates a list of data elements which match e:type. This list, however, can be large if e:type is too general. This fact motivates us to develop a method to shorten the number of candidates for each attribute.

Visual Regularity: Record regularity has been used by several extraction methods, such as tree-alignment or pattern-based approaches. These approaches, however, only try to utilize the regularity at the HTML source code level, which results in severe limitations on many types of web pages. The scenario of Yahoo HotJobs in Figure 1.1(b) illustrates this limitation. We therefore want to elevate the regularity abstraction to the visual layer to overcome this limitation. In Figure 1.1(b), even though the formats of even and odd data records are different, the vertical distance between the same attribute of two consecutive records is (approximately) constant.

Definition 5 (Vertical Distance): Let d_i = <x_i, y_i, w_i, h_i> and d_j = <x_j, y_j, w_j, h_j> be two data elements with their rendering positions: top-left (x, y), width w, and height h. The vertical distance between d_i and d_j is Γ(d_i, d_j) = |y_i − y_j|.

Definition 6 (Γ-cluster): An ordered list of data elements D = {d_1, ..., d_m} (m ≥ 3) forms a Γ-cluster if and only if every pair of consecutive elements (d_k, d_{k+1}) (k ∈ [1, m−1]) has the same vertical distance Γ(d_k, d_{k+1}) = Γ. Γ is called the step of the cluster.

Claim 1 (Visual Conservation): Let Υ_i, Υ_j, Υ_k be 3 consecutive n-tuples generated from visual model Ω = (E = {e_1, ..., e_n}, R), where Υ_t = {d_{t1}, d_{t2}, ..., d_{tn}} for t = i, j, k. Then the following properties hold for any p_1, p_2 ∈ [1, n] in the unbounded-canvas environment:

1. Internal conservation: Γ(d_{ip_1}, d_{ip_2}) = Γ(d_{jp_1}, d_{jp_2}) = Γ(d_{kp_1}, d_{kp_2})

2. External conservation: Γ(d_{ip_1}, d_{jp_1}) = Γ(d_{jp_1}, d_{kp_1}) = Γ(d_{ip_2}, d_{jp_2}) = Γ(d_{jp_2}, d_{kp_2})

Interestingly, from the external conservation characteristic (in the unbounded-canvas environment), we also have Γ(e_{ki}, e_{(k+1)i}) = Γ(e_{kj}, e_{(k+1)j}) for k ∈ [1, n] and i, j ∈ [1, m], which leads to Claim 2.

Claim 2 (Preserved Attribute Cluster): Assume a parsed page has n data records generated from visual model Ω = (E = {e_1, ..., e_m}, R) (i.e., n extracted m-tuples) Υ_k = {e_{k1}, e_{k2}, ..., e_{km}} (k = 1, ..., n). Then the following statement holds: if D_i = {e_{1i}, ..., e_{ni}} is a Γ-cluster of attribute e_i, then D_j = {e_{1j}, ..., e_{nj}} is also a Γ-cluster of attribute e_j (for any pair of attributes e_i, e_j ∈ E).

This claim leads to an algorithm to filter the candidate sets of the attributes in the visual model, because the claim implies that the candidate sets for all attributes in the visual model must be clusters with the same vertical step. This algorithm is just one part of the framework and, due to space limitations, we describe only its main idea, sketched in code after Example 4. We first build clusters over each attribute's candidate set. Second, we compare the steps of clusters of different attributes. An attribute cluster is kept if, for each other attribute, we can find at least one cluster with the same step.

Example 4 (Link Cluster): In the Amazon example in Figure 3.5, consider a data record with only two elements, title and price; the visual model Ω = (E, R) thus has E = {title, price} with E:type = {link, number}. Obviously, the initial candidates for title are all the links on the page. We obtain some Γ-clusters, such as {menu link}, {title}, {buy new}, {Used & new}: the first is a cluster with step d, while the others are clusters with step D. Clearly, there is no d-step cluster in the price candidate set (i.e., type number). This means only the D-step clusters are kept for both candidate sets, and the elements of {menu link} are no longer candidates for title.
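The candidate-pruning idea of Claim 2 can be sketched as follows: build Γ-clusters per attribute from the rendered y-coordinates (Definitions 5 and 6), then keep only clusters whose step is shared by every other attribute. The tolerance parameter and function names are our own additions to absorb rendering noise; elements are assumed to carry a y coordinate (e.g., the Box of the earlier sketch).

```python
def gamma_clusters(elements, min_size=3, tol=2.0):
    """Group elements (by rendered y-coordinate) into Γ-clusters (Definition 6).

    A Γ-cluster is a run of consecutive elements with an (approximately)
    constant vertical distance; `tol` absorbs rendering noise in pixels."""
    elements = sorted(elements, key=lambda e: e.y)
    clusters, run, step = [], elements[:1], None
    for prev, cur in zip(elements, elements[1:]):
        d = cur.y - prev.y
        if step is None or abs(d - step) <= tol:
            if step is None:
                step = d
            run.append(cur)
        else:
            if len(run) >= min_size:
                clusters.append((step, run))
            run, step = [prev, cur], d
    if len(run) >= min_size:
        clusters.append((step, run))
    return clusters

def filter_by_shared_step(clusters_by_attr, tol=2.0):
    """Claim 2: keep an attribute's cluster only if every other attribute
    also has a cluster with the same step."""
    kept = {}
    for attr, clusters in clusters_by_attr.items():
        kept[attr] = [
            (s, c) for (s, c) in clusters
            if all(any(abs(s - s2) <= tol for (s2, _) in other)
                   for a2, other in clusters_by_attr.items() if a2 != attr)
        ]
    return kept
```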

Valid Record Generation

A record candidate (an n_R-tuple with n_R = |R|) is simply any combination of attribute candidates with respect to the attribute quantifiers: Υ_i = (c_1, c_2, ..., c_m), where c_k is a set of candidates for attribute e_k; c_k is a 1-set if e_k is a 1-attribute, or an N_max-set if e_k is a +-attribute. The number of such record candidates is huge, but only a portion of them are valid records that belong to some valid configuration. Our configuration tree is a perfect structure for deciding how to parse a candidate (i.e., check its relation instantiations) efficiently, so that we can eliminate invalid candidates without checking all of their relations. Viewed differently, if we gradually expand record candidates following the structure of T_guide, we will finally reach all valid records while avoiding the invalid ones.

With that principle in mind, we generate a valid record tree T_valid with the same structure as T_guide; the only difference is the content of each node. Each node of T_valid keeps a set of partial record candidates which satisfy the configuration path to it (i.e., satisfy all of the relation instantiations along the path). We start from the root with an empty partial tuple set. From a node at level k (which contains several partial tuples t_k), for each branch r(e_i, e_j) = Inst_r out of this node, we generate the partial tuple set of the node at level (k+1) as follows. First, if the two attributes e_i and e_j are already covered in a tuple t_k, then this tuple is kept in the level-(k+1) node if it satisfies r(c_i, c_j) = Inst_r, and removed otherwise; in this case, the partial set obtained at the level-(k+1) node is a subset of the set in the level-k node. Second, if either attribute e_i or e_j (or both) is not covered in t_k, then we find a candidate for that attribute (c_i for e_i and/or c_j for e_j) from the attribute candidate set such that r(c_i, c_j) = Inst_r; the new partial candidate obtained by adding this attribute candidate into t_k is put into the set of the level-(k+1) node. We repeat this step from the root to all leaves. This generation process guarantees that we only generate tuples with valid configurations; invalid ones are pruned on-the-fly, since their configurations have already been pruned from T_guide. The tuples in the leaf nodes of T_valid are all the valid records we want to find.
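The following minimal Python sketch illustrates this level-by-level expansion under assumed representations: a GuideNode whose branches carry a relation (as a callable on two candidates), the instantiation value labeling the edge, and a child node. It shows the two cases above and is not the thesis implementation.

    from dataclasses import dataclass, field

    @dataclass
    class GuideNode:
        # Each branch is (rel, e_i, e_j, inst, child): a relation between two
        # attributes, the instantiation labeling this edge, and the child node.
        branches: list = field(default_factory=list)

    def check(rel, c_i, c_j, inst):
        """Assumed predicate: the relation, evaluated on the two candidates,
        yields the instantiation that labels this branch."""
        return rel(c_i, c_j) == inst

    def expand(node, partial_tuples, candidates):
        """Propagate partial tuples (dicts: attribute -> candidate) down
        T_guide; tuples surviving to the leaves are the valid records."""
        if not node.branches:
            return partial_tuples  # leaf: complete, valid records
        results = []
        for rel, e_i, e_j, inst, child in node.branches:
            next_tuples = []
            for t in partial_tuples:
                if e_i in t and e_j in t:
                    # Case 1: both attributes already bound -- keep the tuple
                    # only if it satisfies this relation instantiation.
                    if check(rel, t[e_i], t[e_j], inst):
                        next_tuples.append(t)
                else:
                    # Case 2: bind the uncovered attribute(s) with candidates
                    # that realize the instantiation.
                    for c_i in ([t[e_i]] if e_i in t else candidates[e_i]):
                        for c_j in ([t[e_j]] if e_j in t else candidates[e_j]):
                            if check(rel, c_i, c_j, inst):
                                next_tuples.append({**t, e_i: c_i, e_j: c_j})
            results.extend(expand(child, next_tuples, candidates))
        return results

    # Starting from the root with a single empty partial tuple:
    #   valid_records = expand(t_guide_root, [{}], attribute_candidates)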
