

Automatic Wrapper Generation
Kristina Lerman, University of Southern California
August 4, 2008

Manual Wrapper Generation
- Manual wrapper generation requires the user to:
  - Specify the schema of the information source
    - Single tuple
    - Nested type
    - List of tuples or nested types
  - Label data on several example HTML pages
    - Tedious, especially for lists
- The goal of automatic wrapper generation is to eliminate user intervention

Overview
- Methods for automatic wrapper creation and data extraction
  - Grammar induction approach
    - "Towards Automatic Data Extraction from Large Web Sites"
  - Website structure-based approach
    - "AutoFeed: An Unsupervised Learning System for Generating Webfeeds"
    - "Using the Structure of Web Sites for Automatic Segmentation of Tables"
  - DOM-based: Simile/Piggybank

Grammar Induction Approach
- Pages are automatically generated by scripts that encode the results of a database query into HTML
  - Script = grammar
- Given a set of pages generated by the same script:
  - Learn the grammar of the pages (wrapper induction step)
  - Use the grammar to parse the pages (data extraction step)

Limitations
- Learnable grammars
  - Union-Free Regular Expressions (RoadRunner)
    - Variety of schema structure: tuples (with optional attributes) and lists of (nested) tuples
    - Does not efficiently handle disjunctions: pages with alternate presentations of the same attribute
  - Context-free grammars (Hong & Clark paper)
    - Limited learning ability
- User needs to provide a set of pages of the same type

Website Structure-based Approach
- Websites attempt to simplify user navigation and interaction with data by organizing how data is presented across the site
  - Hierarchical organization
  - List of results
  - Detail pages
- Machine learning methods take advantage of this structure to extract data

Limitations
- Performance depends on the choice of the learning algorithm
  - Noise can affect performance (Lerman et al. paper)
  - Can combine different learning algorithms to improve performance (AutoFeed)

DOM-based Approach
- Use the Document Object Model tree of HTML pages to extract data
- Works well, but on fairly simple and well-structured pages

RoadRunner

Overview
- "Towards Automatic Data Extraction from Large Web Sites" by Crescenzi, Mecca, & Merialdo
- Automatically generates a wrapper from large structured web pages
- Supports nested structures and lists
- Efficient approach to large, complex pages with regular structure

Example Pages (figure)

Extracted Result
- Extracts the fields and hierarchical structure
- Depends on well-structured HTML
- Only extracts at the entire field level

Approach
- Given a set of example pages, generates a Union-Free Regular Expression (UFRE)
  - A regular expression without any disjunctions
    - List of tuples (possibly nested): (a, b, c)+
    - Optional attributes: (a)?
  - Strong assumption that usually holds
- Find the least upper bounds on the RE lattice to generate a wrapper in linear time
  - Reduces to finding the least upper bound on two UFREs

Matching/Mismatches
- Given a set of pages of the same type:
  - Take the first page to be the wrapper (UFRE)
  - Match each successive sample page against the wrapper
  - Mismatches result in generalizations of the regular expression
- Types of mismatches:
  - String mismatches
  - Tag mismatches

Example Matching (figure)

String Mismatches: Discovering Fields
- String mismatches are used to discover fields of the document
- The wrapper is generalized by replacing "John Smith" with #PCDATA:
    <HTML>Books of: <B>John Smith
    <HTML>Books of: <B>#PCDATA

Example Matching (figure)

Tag Mismatches: Discovering Optionals
- First check whether the mismatch is caused by an iterator (described next)
- If not, it could be an optional field in the wrapper or in the sample
- Cross search is used to determine possible optionals
- Image field determined to be optional: ( <img src= />)?
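The string-mismatch step can be sketched in a few lines of Python. This is a minimal illustration rather than RoadRunner's implementation: the helper names `tokenize` and `generalize_strings`, the closing tags, and the second page's author name are assumptions made for the example, and the sketch only handles pages whose tag sequences already agree (a tag mismatch raises an error instead of being resolved).

```python
import re


def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def generalize_strings(wrapper_tokens, sample_tokens):
    """Replace text tokens that differ between wrapper and sample with #PCDATA.

    Hypothetical helper: tag mismatches (optionals, iterators) are not handled.
    """
    if len(wrapper_tokens) != len(sample_tokens):
        raise ValueError("tag mismatch: token sequences differ in length")
    result = []
    for w, s in zip(wrapper_tokens, sample_tokens):
        w_is_tag, s_is_tag = w.startswith("<"), s.startswith("<")
        if w == s or (w == "#PCDATA" and not s_is_tag):
            result.append(w)              # tokens agree, or field already discovered
        elif not w_is_tag and not s_is_tag:
            result.append("#PCDATA")      # string mismatch: a new field
        else:
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return result


# The first page acts as the initial wrapper; the second page is an assumed sample.
wrapper = tokenize("<HTML>Books of: <B>John Smith</B></HTML>")
sample = tokenize("<HTML>Books of: <B>Paul Jones</B></HTML>")
print(generalize_strings(wrapper, sample))
```

Matching further sample pages against the returned token list keeps the discovered `#PCDATA` field and can only generalize the wrapper further.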

Example Matching (figure, with two string mismatches marked)

Tag Mismatches: Discovering Iterators
- Assume the mismatch is caused by repeated elements in a list
  - The end of the list corresponds to the last matching token: </LI>
  - The beginning of the list corresponds to one of the mismatched tokens: <LI> or </UL>
  - These create possible "squares"
- Match possible squares against earlier squares
- Generalize the wrapper by finding all contiguous repeated occurrences:
    ( <LI><I>Title:</I>#PCDATA</LI> )+

Example Matching (figure)

Internal Mismatches
- Internal mismatches are generated while trying to match a square against earlier squares on the same page
- Solving internal mismatches yields further refinements in the wrapper
  - List of book editions: <I>Special!</I>

Recursive Example (figure)

Discussion
- Assumptions:
  - Pages are well-structured
  - Want to extract at the level of entire fields
  - Structure can be modeled without disjunctions
- Search space for explaining mismatches is huge
  - Uses a number of heuristics to prune the space
    - Limited backtracking
    - Limit on number of choices to explore
    - Patterns cannot be delimited by optionals
  - Will result in pruning possible wrappers
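Once a union-free wrapper has been induced, the data extraction step amounts to matching it against a page. The sketch below is an assumption-laden illustration, not RoadRunner's code: it hand-translates a wrapper of the form `<HTML>Books of: <B>#PCDATA</B><UL>( <LI><I>Title:</I>#PCDATA</LI> )+</UL></HTML>` into Python regular expressions, with each `#PCDATA` becoming a capture group. The surrounding `<UL>` and closing tags, the book titles, and the function name `extract` are made up for the example.

```python
import re

# One list item; #PCDATA is rendered as "any run of non-tag characters".
ITEM = r"<LI><I>Title:</I>[^<]+</LI>"
ITEM_FIELD = re.compile(r"<LI><I>Title:</I>([^<]+)</LI>")

# The iterator ( ... )+ becomes a repeated non-capturing group.
WRAPPER = re.compile(
    r"<HTML>Books of: <B>([^<]+)</B><UL>((?:" + ITEM + r")+)</UL></HTML>"
)


def extract(page):
    """Return the author and the list of titles, or None if the page does not match."""
    match = WRAPPER.fullmatch(page)
    if match is None:
        return None
    author, list_body = match.groups()
    # A repeated group only keeps its last match, so the list body is re-scanned.
    return {"author": author, "titles": ITEM_FIELD.findall(list_body)}


# Hypothetical page generated by the same script as the wrapper's training pages.
PAGE = (
    "<HTML>Books of: <B>John Smith</B><UL>"
    "<LI><I>Title:</I>Web Mining</LI>"
    "<LI><I>Title:</I>Databases</LI>"
    "</UL></HTML>"
)
print(extract(PAGE))
```

The two-pass design (match the whole page, then scan the list body) reflects the fact that Python's `re` keeps only the last occurrence of a repeated capture group.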

AutoFeed: An Unsupervised Learning System for Generating Webfeeds
by Gazen and Minton

Relational Model of a Web Site
- Sites are well structured to improve user experience
- Generation: given relational data, scripts generate the web site, e.g., a weather site
- Extraction is the opposite task: given a web site, find the underlying relational data
- Example relational data, with page types Homepage, State, and CityWeather:

    Homepage
      0  AutoFeedWeather

    StateList
      0  0
      0  1

    States
      0  California    CA
      1  Pennsylvania  PA

    CityList
      0  0
      0  1
      0  2
      1  3
      1  4

    CityWeather
      0  Los Angeles    70
      1  San Francisco  65
      2  San Diego      75
      3  Pittsburgh     50
      4  Philadelphia   55

Overview of Approach
- Sites aim to be easy to understand and navigate
- Many types of structure
  - Graph structure of the site's links
  - URL naming scheme
  - Content of pages
  - HTML structure within page types
- Experts focus on individual structures and output discoveries as hints
- Experts are heterogeneous
  - Probabilistically combine experts (they don't have to be correct all the time)

Hints
- Hints describe local structural similarities within pages or within data
- Hints help find the relational structure of the site
  - Used to cluster pages and data

Overview
- Pipeline: Web Site -> Experts -> Page & Data Hints -> Cluster -> Page & Data Clusters -> Convert -> Relational Tables
- Example clusters: {California, Pennsylvania}, {Los Angeles, San Francisco, San Diego, Pittsburgh, Philadelphia}, {CA, PA}, {70, 65, 75, 50, 55}
- The clusters are converted into the Homepage, States, and CityWeather tables

Page-level Experts
- URL patterns give clues about site structure
  - Similar pages have similar URLs, e.g.:
      http://www.bookpool.com/sm/032349806
      http://www.bookpool.com/sm/038269
      http://www.bookpool.com/ss/l?pu=mn
- Page templates
  - Similar pages contain common sequences of substrings
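A URL-pattern expert of the kind described above can be approximated by abstracting each URL into a pattern and grouping pages that share it. This is only a sketch under assumptions: the abstraction rule (digit runs become `<num>`, query strings are reduced to their parameter names) and the names `url_pattern` and `group_by_pattern` are illustrative choices, not AutoFeed's actual expert.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_pattern(url):
    """Abstract a URL into (host, path with digit runs replaced, query parameter names)."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "<num>", parts.path)
    params = tuple(sorted(name for name, _ in parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.netloc, path, params)


def group_by_pattern(urls):
    """Group URLs by pattern; each group is a hint that its pages share a page type."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


# The three example URLs from the slide.
URLS = [
    "http://www.bookpool.com/sm/032349806",
    "http://www.bookpool.com/sm/038269",
    "http://www.bookpool.com/ss/l?pu=mn",
]

for pattern, members in group_by_pattern(URLS).items():
    print(pattern, members)
```

On the slide's URLs the two `/sm/` pages fall into one group and the `/ss/l` page into another, which is the kind of hint a page-level expert would emit.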

Data-level Experts
- Page layout gives clues about relational structure
  - Similar items are aligned vertically or horizontally, e.g.:
      Los Angeles  85
      Pittsburgh   65
  - Coincidental alignment results in bad hints
- List structure
  - List rows are represented as repeating HTML structures:
      <TD> <TR> <TD> <TD> <TR> <TD>

Page and Data Similarity
- Surface structure is often not helpful
  - e.g., a page with an empty list and one with a long list will be found not similar
- Instead, use local structural similarities
  - Experts output hints
- Structure is represented as hints
  - Clusters are evaluated probabilistically using hints
  - Probabilistic representation gives a flexible framework for combining possibly conflicting hints

Probabilistic Evaluation
- Find the clustering that maximizes the probability of observing the hints:
    $P(\text{clustering} \mid \text{hints}) = P(\text{hints} \mid \text{clustering}) \, P(\text{clustering}) / P(\text{hints})$
- Generative process for $P(\text{hints} \mid \text{clustering})$
  - Separate models for data hints and page hints
  - [Formula garbled in the transcript: per-token probabilities $p_i$, defined by cases for whether $m_i$ is a matched pair of tokens or an unmatched token, in terms of token and cluster counts]

Clustering Algorithm
- Leader-follower algorithm
  - Process items one at a time
  - If the distance to the nearest cluster < threshold, add the item to it
  - Else create a new cluster
- The "distance" between a page and a page-cluster is determined by the change in $P(\text{clustering} \mid \text{hints})$
  - Prevents one large cluster, because smaller probabilities are assigned to hints in larger clusters

Clustering
- Cluster pages and data
- Page-clusters are parents of data-clusters
- For example: State Pages, City Pages

Clusters to Tables
- Base tables are built from clusters that contain a single tuple per page
  - Each cluster becomes a column
  - Each row represents a page
- Lists are constructed from clusters not used for the base table

    States        Abbrs
    California    CA
    Pennsylvania  PA

    Cities         Temperatures
    Los Angeles    70
    San Francisco  65
    San Diego      75
    Pittsburgh     50
    Philadelphia   55
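The leader-follower procedure described above is easy to sketch. Note the substitution: AutoFeed measures "distance" by the change in the probability of the clustering given the hints, whereas this illustration swaps in a plain Jaccard distance between sets of hint labels so that it stays self-contained. The hint labels, page names, threshold, and function names are all assumptions.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def leader_follower(items, threshold):
    """Single-pass clustering: each item joins the nearest leader or starts a new cluster."""
    clusters = []  # each cluster is (leader_hints, member_names)
    for name, hints in items:
        nearest = min(
            clusters,
            key=lambda c: jaccard_distance(hints, c[0]),
            default=None,
        )
        if nearest is not None and jaccard_distance(hints, nearest[0]) < threshold:
            nearest[1].append(name)           # close enough: follow the leader
        else:
            clusters.append((hints, [name]))  # too far: become a new leader
    return [members for _, members in clusters]


# Hypothetical hint sets for four pages of a weather site.
PAGES = [
    ("California", {"url:/state/*", "tmpl:city-list"}),
    ("Los Angeles", {"url:/city/*", "tmpl:temperature"}),
    ("Pennsylvania", {"url:/state/*", "tmpl:city-list"}),
    ("Pittsburgh", {"url:/city/*", "tmpl:temperature", "tmpl:alert"}),
]
print(leader_follower(PAGES, threshold=0.5))
```

Because items are processed one at a time, the result can depend on input order; the first item of each cluster serves as its leader.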

Evaluation
- AutoFeed evaluated against manually built wrappers
  - But can only evaluate part of the output
- For each target column in the wrapper:
  - Find the matching AutoFeed column
  - Count relevant retrieved values
  - Calculate precision and recall
- Example:
    Target column from supervised wrapper: Los Angeles, San Francisco, San Diego, Pittsburgh, Philadelphia (temperatures 70, 65, 75, 50, 55)
    Column extracted by AutoFeed: Mostly Cloudy, San Francisco, San Diego, Pittsburgh (temperatures Hi, 65, 75, 50)
    Retrieved & Relevant = 3, Retrieved = 4, Relevant = 5
    Precision = 3/4, Recall = 3/5

Experiments and Results
- Domains
  - Retail: extract product name, manufacturer, price, etc., from online retail sites (Buy.com, CompUSA, etc.)
  - Journals: extract authors, titles, URLs from journals (DMTCS, JAIR, etc.)
  - Jobs: extract job id, position, location from job listings (50 Forbes 500 companies)
- Good results: data extracted correctly from many of the sites
- Some problems:
  - Extracting larger fields, e.g. "Price: $9.95"
  - Over-general and over-specific clusters
- Results (RR = retrieved and relevant, Ret. = retrieved, Rel. = relevant):

    Domain    Field     RR   Ret.  Rel.  Precision  Recall
    Journals  authors   73   97    22    98%        97%
    Journals  title     73   88    22    99%        97%
    Journals  URL       528  528   22    100%       44%
    Jobs      position  39   39    453   100%       96%
    Jobs      req. id   278  278   422   100%       90%
    Jobs      location  302  302   445   100%       90%

    Retail (columns spilled in the transcript):
      mfg, item no., model no.: 8 00 66 8 00 68 8 03 7; 100% 97% 96% 100% 97% 93%
      name, price: 97 77 00 88 03 03; 97% 85% 94% 75%

AutoFeed Conclusions
- Promising approach:
  - Multiple experts for multiple structures
  - Common language for collecting and combining evidence
- Principle applicable beyond web extraction
  - Interpreting complex environments
- Need to improve prototype system:
  - More experts
  - Better ways to combine evidence
    - Confidence scores on hints
    - Cannot-link hints

Using the Structure of Web Sites for Automatic Segmentation of Tables
by Lerman et al.

Exploiting Structure of Web Sites
- Most Web sites that allow users to access databases present results in dynamically generated pages
  - Web tables
- Web sites are very structured in terms of
  - Organization of the site
  - Layout of pages
  - Content of data
- Exploit this structure for automatic information extraction

Structure of Web Sites
- Entry page
- List pages
- Detail pages
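Looking back at the AutoFeed evaluation example (a target column of five cities against an extracted column of four values), the precision and recall figures can be checked with a short Python sketch. The function name `precision_recall` and the exact-string matching rule are assumptions; the slides do not say how values are compared.

```python
def precision_recall(target, extracted):
    """Compare an extracted column against a target column by exact string match."""
    relevant = set(target)
    hits = sum(1 for value in extracted if value in relevant)  # retrieved and relevant
    precision = hits / len(extracted)
    recall = hits / len(relevant)
    return precision, recall


# Columns from the evaluation example.
TARGET = ["Los Angeles", "San Francisco", "San Diego", "Pittsburgh", "Philadelphia"]
EXTRACTED = ["Mostly Cloudy", "San Francisco", "San Diego", "Pittsburgh"]

precision, recall = precision_recall(TARGET, EXTRACTED)
print(precision, recall)  # 3/4 and 3/5, as on the slide
```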

Structure of Pages and Data
- Pages are generated from a template
- Data in the same column is of the same type
  - E.g., each listing starts with NAME, followed by ADDRESS, CITY, STATE, etc.

Underlying Structure is not Always Clear
- Variability of real-world data may obscure the underlying structure
  - Missing columns: List Price and You save
  - Formatting
  - Content

Problem Overview
- Automatically, efficiently extract records from Web tables
  - Given a set of list and detail pages
  - Segment list data using information from detail pages
- Logic-based approach
  - Based on Constraint Satisfaction Problems (CSP)
  - Encode relations between data on list and detail pages as logical constraints and solve them
- Probabilistic inference approach
  - Learns a model from data
  - Record segmentation is an assignment that maximizes the likelihood of the data given the model

Identify Table and Extract Data
- Web sources generate list pages from a template and fill them with query results
- Deduce page template
  - Given two or more example pages, derive the template used to generate them
  - Table data and formatting tags are not part of the template
  - Heuristic: largest slot contains the table
- Extract table data
  - Extract contiguous sequences of tokens from the largest page slot

Record Segmentation Basics (1)
- List and detail pages present two views of the same record
  - Some overlapping fields
  - Each detail page is a distinct record
- Assumption: Web tables are laid out horizontally
  - Each record is in a separate row
  - The order in which extracts appear in the text stream of the list page is the same order in which they appear in the table

Record Segmentation Basics (2)
- For each extract E_i, record all detail pages on which it appears:

    E1:  John Smith       r1, r2
    E2:  22 Was terloo    r1
    E3:  New Hol 4345     r1
    E4:  (740) 335-5555   r1, r2
    E5:  John Smith       r1, r2
    E6:  22R Was erloo    r2
    E7:  Washingt 4360    r2
    E8:  (740) 335-5555   r1, r2
    E9:  George W. Smith  r3
    E10: Findlay, 45840   r3
    E11: (49) 423-22      r3
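The bookkeeping step above (for each extract, record the detail pages on which it appears) can be sketched as a substring lookup. The detail-page texts below are hypothetical stand-ins: names and the shared phone number follow the slide, while the address and city values, which are truncated in the transcript, are replaced by placeholders such as `addr-A`. The function name `observed_on` is also an assumption.

```python
def observed_on(extracts, detail_pages):
    """For each extract (in list-page order), collect the detail pages containing it."""
    return [
        (text, {page_id for page_id, page in detail_pages.items() if text in page})
        for text in extracts
    ]


# Hypothetical detail pages; addr-*/city-*/phone-* are placeholders, not real data.
DETAIL_PAGES = {
    "r1": "John Smith addr-A city-A (740) 335-5555",
    "r2": "John Smith addr-B city-B (740) 335-5555",
    "r3": "George W. Smith city-C phone-C",
}

# Extracts E1..E11 in the order they appear on the list page.
EXTRACTS = [
    "John Smith", "addr-A", "city-A", "(740) 335-5555",
    "John Smith", "addr-B", "city-B", "(740) 335-5555",
    "George W. Smith", "city-C", "phone-C",
]

for index, (text, pages) in enumerate(observed_on(EXTRACTS, DETAIL_PAGES), start=1):
    print(f"E{index}: {text:<16} {sorted(pages)}")
```

Extracts that occur on more than one detail page (the shared name and phone number) are exactly the ones whose record assignment is ambiguous.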

Record Segmentation Basics (3)
- [Figure: extracts E1 ... E11 linked to the detail pages r1, r2, r3 on which they appear]
- Observations of extracts on detail pages add valuable information for record segmentation
- The second record can be:
    E4 E5 E6 E7 E8
    E4 E5 E6 E7
    E5 E6 E7 E8
    E5 E6 E7
    E6 E7 E8
    E6 E7

CSP Approach to Record Segmentation
- In CSP, problems are stated as logical expressions over variables
- Pseudo-boolean (PB) representation
  - Variables are 0-1, constraints can be inequalities
  - Solution is an assignment that minimizes inequality constraints
- Encode the record segmentation problem in PB representation
  - Assignment variable $x_{ij}$
    - $x_{ij} = 1$ when $E_i$ is assigned to $r_j$
    - $x_{ij} = 0$ when $E_i$ is not part of $r_j$
- Information from detail pages imposes constraints
  - Structure constraints
  - Position constraints

Structure Constraints
- Uniqueness constraint
  - Every extract $E_i$ belongs to exactly one record $r_j$: $\sum_j x_{ij} = 1$
- Consecutiveness constraint
  - Only contiguous blocks of extracts can be assigned to the same record
  - $x_{ij} + x_{kj} \le 1$ when there is an $n$, $k < n < i$, such that $x_{nj} = 0$

Position Constraints
- [Figure: positions of the extracts E1 ... E11 on the detail pages]
- Position constraint
  - No two extracts assigned to the same record can appear in the same position on the detail page
  - If $pos_j(E_i) = pos_j(E_k)$, then $E_i$ and $E_k$ cannot both be assigned to the same record $j$
- Constraints are expressed mathematically and solved using integer optimization

Probabilistic Approach to Record Segmentation
- Record segmentation as probabilistic inference
  - No labeled training examples
- Factor the problem for efficient learning
- Bootstrap the learning algorithm with information from detail pages
- Structure: constrain the problem further with global parameters such as record length
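To make the structure and position constraints of the CSP approach concrete, the sketch below enumerates every assignment for the eleven-extract example and keeps those that satisfy all three constraints. This is a stand-in for the pseudo-boolean integer optimization named on the slides, chosen because the example is tiny. Two simplifications are assumed: record ids are plain integers, and two extracts with identical text are taken to occupy the same position on a detail page. The address and city strings are placeholders.

```python
from itertools import product

# (extract text, detail pages on which it was observed), in list-page order.
CANDIDATES = [
    ("John Smith", {1, 2}), ("addr-A", {1}), ("city-A", {1}), ("(740) 335-5555", {1, 2}),
    ("John Smith", {1, 2}), ("addr-B", {2}), ("city-B", {2}), ("(740) 335-5555", {1, 2}),
    ("George W. Smith", {3}), ("city-C", {3}), ("phone-C", {3}),
]


def contiguous(assignment):
    """Consecutiveness: a record id may not reappear after another record has started."""
    seen, previous = set(), None
    for record in assignment:
        if record != previous:
            if record in seen:
                return False
            seen.add(record)
            previous = record
    return True


def positions_distinct(texts, assignment):
    """Position (simplified): identical extract text may not be assigned twice to one record."""
    used = set()
    for text, record in zip(texts, assignment):
        if (text, record) in used:
            return False
        used.add((text, record))
    return True


def segment(candidates):
    """Return all assignments; uniqueness holds because each extract gets exactly one record."""
    texts = [text for text, _ in candidates]
    options = [sorted(pages) for _, pages in candidates]
    return [
        assignment
        for assignment in product(*options)
        if contiguous(assignment) and positions_distinct(texts, assignment)
    ]


print(segment(CANDIDATES))
```

With these constraints only one segmentation survives: E1 to E4 form the first record, E5 to E8 the second, and E9 to E11 the third. Exhaustive enumeration grows exponentially with the number of ambiguous extracts, which is why a real system would hand the constraints to a solver.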

Probabilistic Model for Record Extraction: Variables
- Observed variables
  - $T = \{T_1 \ldots T_n\}$: token types of extract $E_i$
  - $D = \{D_1 \ldots D_n\}$: detail pages on which $E_i$ was observed
- Unobserved variables
  - $R = \{R_1 \ldots R_n\}$: record id
  - $C = \{C_1 \ldots C_n\}$: column label
  - $S = \{S_1 \ldots S_n\}$: $S_i$ = true if $E_i$ is the start of a new record; false otherwise
- [Figure: graphical model over $T$, $C$, $S$, $R$, $D$ at positions $i-1$ and $i$]
- Dependencies: given by arrows, e.g., $P(C_i \mid C_{i-1})$
- Segmentation: find values for the $R$ and $C$ variables given $T$, $D$: $\arg\max P(R, C \mid T, D)$

Probabilistic Model for Record Extraction: Dependencies
- $P(T_i \mid C_i)$: token type of $E_i$ depends on the column
- $P(C_i \mid C_{i-1})$: column label of $E_i$ depends on the previous column label (e.g., NAME followed by ADDRESS, sometimes by STATE)
- $P(S_i \mid C_i)$: a new record starts with a given column (e.g., NAME)
- $P(R_i \mid R_{i-1}, D_i, S_i)$: record number of $E_i$ depends on the record number of the previous extract, whether it starts a new record, and the detail pages on which it was observed

Learning the Model
- Constrain the problem further
- Bootstrap
  - Detail pages provide initial guesses for parameters
    - $P(R_i = r_i)$
    - Evidence about where records start: $P(S_i = \text{true}) = 1$
    - Token types of columns: $P(T_j \mid C_i)$
- Structure
  - Table has π columns specified by the underlying database schema
  - However, not every record will have an attribute for every field, i.e., not every record has π fields
  - Number of fields in a record is estimated from data

Learning the Model (figure)
- Initial guess for record assignment $P(R_i = r_i)$: the figure shows values of ½ for extracts observed on two detail pages
- Initial guess for record start $P(S_i)$
- Initial guess for length of records $P(\pi_k)$ [table values garbled in the transcript]

Learning Algorithm
- Use EM to implement the inference algorithm
  1. Initial guess for $\pi_k$ for each record j
  2. For each potential record, update $P(C_i \mid T_i, C_{i-1})$
  3. Update $P(S_i \mid C_i)$
  4. Update $P(R_i \mid R_{i-1}, D_i, S_i)$
- Result is the most likely assignment of data to R and C = record segmentation

Validation
- Input data: list and detail pages from 12 sites in domains: book sellers, property tax, white pages, corrections
- Metrics
  - $P = Cor / (Cor + InCor + NonRecords)$
  - $R = Cor / (Cor + UnsegRecords)$
  - $F = 2PR / (P + R)$
- Results
  - CSP approach: P=0.85, R=0.84, F=0.84
  - Probabilistic approach: P=0.74, R=0.99, F=0.85
- Good performance for an automatic algorithm!
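The validation metrics are simple enough to verify directly. The sketch below implements the three formulas and checks that the reported F values follow from the reported P and R. The function names and the record counts in the second call are illustrative assumptions; only the formulas and the P, R, F figures come from the slides.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)


def segmentation_scores(cor, incor, non_records, unseg_records):
    """P = Cor/(Cor+InCor+NonRecords), R = Cor/(Cor+UnsegRecords), and F."""
    precision = cor / (cor + incor + non_records)
    recall = cor / (cor + unseg_records)
    return precision, recall, f_measure(precision, recall)


# Reported results: the F values are consistent with the reported P and R.
print(round(f_measure(0.85, 0.84), 2))  # CSP approach
print(round(f_measure(0.74, 0.99), 2))  # probabilistic approach

# Hypothetical counts, only to show the call: 8 correct, 1 incorrect,
# 1 non-record, 2 unsegmented records.
print(segmentation_scores(cor=8, incor=1, non_records=1, unseg_records=2))
```

The comparison shows the trade-off behind two nearly identical F scores: the CSP approach is balanced, while the probabilistic approach gives up precision for very high recall.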

Discussion of Results
- CSP approach is very reliable on clean data, but sensitive to errors in the data source
  - E.g., an attribute has one value on the list page and another on the detail page
- Probabilistic approach tolerates inconsistencies and is more expressive
- A combination of the two techniques may be more robust

Comparison with RoadRunner
- RoadRunner system (Crescenzi et al., 2001)
  - Automatically learns the page and table template by exploiting similarities in page layout (HTML tags)
  - Uses the template to automatically extract data
  - Does not allow for disjunctions
- Disjunctions are necessary to represent alternative layout instructions for the same field

Conclusion
- Domain-independent approach for automatically extracting and segmenting data from Web tables
  - Approach leverages additional information provided by Web site structure
- Logic-based approach
  - Information provided by detail pages is encoded as constraints and solved to obtain the record segmentation
- Probabilistic inference approach
  - Information provided by detail pages and table structure is represented as a probabilistic model
  - Use inference to learn the proper segmentation
- Validated approach on 12 Web sites from diverse information domains
  - Efficient, accurate performance: F=0.85 and F=0.84