Extracting Data From Web Sources. Giansalvatore Mecca

Extracting Data From Web Sources
Giansalvatore Mecca
Università della Basilicata
mecca@unibas.it
http://www.difa.unibas.it/users/gmecca/
EDBT Summer School 2002, August 26, 2002

Outline
- Subject of this talk
  - information extraction from the Web
- Scenario
  - the Web is probably the largest knowledge base ever developed
  - its information is conceived to be consumed ("browsed") by humans
  - the goal is to make this information available to computer applications ("agents")

Outline
- A general description of a software agent's tasks
  1. receive a goal
  2. locate relevant sources on the Web
  3. gather information from these sources
  4. process this content
  5. deliver results to the user
- In this talk we concentrate exclusively on task 3
- This task is carried out by wrappers

Outline
- Notion of wrapper

  HTML page --> Wrapper --> database table(s) (or XML docs)

  BookTitle             Author                               Editor
  The HTML Sourcebook   J. Graham                            ...
  Computer Networks     A. Tanenbaum                         ...
  Database Systems      R. Elmasri, S. Navathe               ...
  Data on the Web       S. Abiteboul, P. Buneman, D. Suciu   ...

Outline
- Part I: Information Extraction from HTML
  - wrapper inference techniques
  - limitations
- Part II: Grammar Inference
  - Gold's theorem and identification in the limit
  - algorithms from grammar inference
  - limitations
- Part III: Bridging the Gap
  - the RoadRunner Project

Outline
- Part I: Information Extraction from HTML
- What's in a wrapper
- Recent research on semi-automatic wrapper generation techniques
  - machine learning approaches
  - wrapper induction
- Limitations of these approaches
  - related to supervised learning
  - wrappers tend to be brittle
  - the maintenance problem

Outline
- Part II: Grammar Inference
- Wrappers are essentially grammar parsers
  - wrapper inference is strongly related to grammar inference
- A jump backward: 30 years of grammar inference
  - accomplishments of the grammar inference community
  - why grammar inference techniques are not applicable to information extraction

Outline
- Part III: Bridging the Gap
- The Schema Finding Problem
  - a variant of the traditional grammar inference problem for information extraction
- Markup Grammars
  - a class of grammars that can be practically inferred with unsupervised algorithms
- Limitations and opportunities
  - future research directions

Part I
Information Extraction from HTML

Information Extraction Task
- Information extraction task
  - source format: plain text with HTML markup (no semantics)
  - target format: database table or XML file (adding structure, i.e., semantics)
  - extraction step: parse the HTML and return data items in the target format
- Wrapper
  - piece of software for the extraction step
  - intuition: use extraction rules based on HTML markup

Intuition Behind Extraction

  Source page:
    <html><body>
    <h1>Italian Football Teams</h1>
    <ul>
    <li><b>Atalanta</b> - <i>Bergamo</i> <br>
    <li><b>Inter</b> - <i>Milano</i> <br>
    <li><b>Juventus </b> - <i>Torino</i> <br>
    <li><b>Milan</b> - <i>Milano</i> <br>
    </ul>
    </body></html>

  Wrapper procedure:
    Scan document for <ul>
    While there are more occurrences
      scan until <b>
      extract teamname between <b> and </b>
      scan until <i>
      extract town between <i> and </i>
      output [teamname, town]

  Output:
    teamname   town
    Atalanta   Bergamo
    Inter      Milano
    Juventus   Torino
    Milan      Milano

A Complex Extraction Task
- 20-30 KB of HTML
- more than 10 attributes, with nesting
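The wrapper procedure for the football-teams page can be sketched in a few lines of Python. This is only an illustration of the scan-and-extract idea, not code from the talk; the helper names `between` and `extract_teams` are made up for this example.

```python
PAGE = """<html><body>
<h1>Italian Football Teams</h1>
<ul>
<li><b>Atalanta</b> - <i>Bergamo</i> <br>
<li><b>Inter</b> - <i>Milano</i> <br>
<li><b>Juventus </b> - <i>Torino</i> <br>
<li><b>Milan</b> - <i>Milano</i> <br>
</ul>
</body></html>"""


def between(text, start, left, right):
    """Return (value, next_position) for the first left...right span found
    at or after `start`, or None when either delimiter is missing."""
    begin = text.find(left, start)
    if begin == -1:
        return None
    begin += len(left)
    end = text.find(right, begin)
    if end == -1:
        return None
    return text[begin:end].strip(), end + len(right)


def extract_teams(page):
    """Hand-written wrapper: scan for <ul>, then read <b>/<i> pairs."""
    records = []
    pos = page.find("<ul>")
    if pos == -1:
        return records
    while True:
        team = between(page, pos, "<b>", "</b>")
        if team is None:
            break
        town = between(page, team[1], "<i>", "</i>")
        if town is None:
            break
        records.append((team[0], town[0]))
        pos = town[1]
    return records


for teamname, town in extract_teams(PAGE):
    print(teamname, town)
```

Every tag name is hard-coded here, which is exactly why hand-written wrappers of this kind break as soon as the page layout changes.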

Why Is Information Extraction Relevant?
- Wrappers are widely used today
  - shopping agents (e.g., Jango, BargainFinder, ShopBot)
  - integration agents (e.g., molecular biology)
  - personal information brokers
- In these applications, wrappers are usually written by hand
  - e.g., Excite's shopping agent, Jango, relies on several hundred wrappers [Kushmerick, 2000]

Wrapper Development
- There are several tools that can be used to support the wrapper development process
  - both research and commercial tools
  - for a list see: www.wifo.uni-mannheim.de/~kuhlins/wrappertools/
- Using these tools it is usually possible to prototype a wrapper very quickly
  - example: Lixto [Baumgartner et al, VLDB 2001], a GUI-based wrapper generation toolkit
- Problem: wrappers tend to be brittle

Problem: Brittleness
- Web sites change quite frequently
- Any change in the HTML layout may disrupt the extraction rules
  - e.g., the mean time to failure of Jango's wrappers was approximately one month [Kushmerick, 2000]
- Maintaining a wrapper is very costly
  - this has motivated the study of (semi-)automatic wrapper generation techniques

A Note on the Semantic Web
- It is a common opinion that, for now, neither XML nor the Semantic Web initiative will solve the information extraction problem
- Reason I
  - the Web is a giant legacy system
  - many sites will never move to the new technology
- Reason II
  - many organizations will not be willing to expose their data in XML

Semi-Automatic Wrapper Generation
- Many proposals, from different inspirations
- A rough classification
  - grammar-based ("finite-state") machine learning approaches
  - relational machine learning approaches
  - ontology-based approaches [Embley et al, 98]
- We discuss only finite-state approaches
  - see the references for a complete list
  - we also don't cover information extraction from free text (a related but quite different problem)

Finite-State Approaches
- Typical learning system
  - a supervisor provides a set of training samples (i.e., labeled HTML pages)
  - a learning algorithm is used to learn the extraction rules
  - a wrapper is generated based on the extraction rules
  - the wrapper is used on the target pages
- This process is called wrapper induction

Wrapper Induction
- Seminal work by Kushmerick et al [Kushmerick et al, 1997] [Kushmerick, 2000]
- Six wrapper classes
  - increasing expressibility
  - decreasing efficiency in the learning
- The simplest class: LR (left-right wrappers)
  - targeted at pages containing multiple records
  - one extraction rule per field
  - each extraction rule is a pair of delimiting strings (left and right)

LR-wrappers: Example

  LR-wrapper = { [<b>, </b>]    delimiters for field teamname
                 [<i>, </i>] }  delimiters for field town

  Source page:
    <html><body>
    <h1>Italian Football Teams</h1>
    <ul>
    <li><b>Atalanta</b> - <i>Bergamo</i> <br>
    <li><b>Inter</b> - <i>Milano</i> <br>
    <li><b>Juventus </b> - <i>Torino</i> <br>
    <li><b>Milan</b> - <i>Milano</i> <br>
    </ul>
    </body></html>

  Wrapper procedure:
    While there are more occurrences
      scan until <b>
      extract teamname between <b> and </b>
      scan until <i>
      extract town between <i> and </i>
      output [teamname, town]

Wrapper Induction
- Note the connection with grammars
  - the extraction rules define a regular grammar with labeled nonterminals
  - wrapper: parser for the grammar + knowledge about the target structure (set of records)

  LR-wrapper = { [<b>, </b>] [<i>, </i>] }

  Σ* ( <b>teamname</b> Σ* <i>town</i> )* Σ*

  Σ*: any string
  output: records of the form (teamname, town)

Wrapper Induction
- In order to induce the wrapper, the first step is to label the training data
- A human inspects some sample pages and manually labels the fields

    <html><body>
    <h1>Italian Football Teams</h1>
    <ul>
    <li><b>teamname</b> - <i>town</i> <br>
    <li><b>teamname</b> - <i>town</i> <br>
    <li><b>teamname</b> - <i>town</i> <br>
    <li><b>teamname</b> - <i>town</i> <br>
    </ul>
    </body></html>
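The regular expression Σ* ( <b>teamname</b> Σ* <i>town</i> )* Σ* translates almost directly into a pattern for Python's `re` module. The sketch below is an illustration under that reading: the labeled nonterminals become named groups, and the scan performed by `finditer` plays the role of the surrounding Σ*. The function name `parse` is an assumption of this example.

```python
import re

PAGE = """<html><body>
<h1>Italian Football Teams</h1>
<ul>
<li><b>Atalanta</b> - <i>Bergamo</i> <br>
<li><b>Inter</b> - <i>Milano</i> <br>
<li><b>Juventus </b> - <i>Torino</i> <br>
<li><b>Milan</b> - <i>Milano</i> <br>
</ul>
</body></html>"""

# One record of the grammar: <b>teamname</b> Σ* <i>town</i>.
# Non-greedy ".*?" stands in for Σ*; DOTALL lets it cross line breaks.
RECORD = re.compile(
    r"<b>(?P<teamname>.*?)</b>.*?<i>(?P<town>.*?)</i>",
    re.DOTALL,
)


def parse(page):
    """Return the (teamname, town) records recognized in the page."""
    return [
        (m.group("teamname").strip(), m.group("town").strip())
        for m in RECORD.finditer(page)
    ]


print(parse(PAGE))
```

The grammar alone only recognizes the page; it is the group names and the list-of-tuples output that supply the "knowledge about the target structure".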

Wrapper Induction
- Then a learning algorithm is run on the sample pages to find the delimiters
- The algorithm is quite simple
  - it enumerates the potential values for each delimiter
  - it selects the first one for which the wrapper works correctly on the training data
  - the learning is efficient because delimiters can be learned independently
  - quadratic time

Wrapper Induction
- Obviously, not all pages can be wrapped using LR-wrappers
  - in [Kushmerick, 2000] it is reported that LR-wrappers were able to handle 53% of the sites in a survey conducted by the author
- Example: here the heading is also enclosed in <b>...</b>, so the delimiters of teamname match it as well

    <html><body>
    <b>Italian Football Teams</b>
    <ul>
    <li><b>Atalanta</b> - <i>Bergamo</i> <br>
    <li><b>Inter</b> - <i>Milano</i> <br>
    <li><b>Juventus </b> - <i>Torino</i> <br>
    <li><b>Milan</b> - <i>Milano</i> <br>
    </ul>
    </body></html>
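The enumerate-and-test idea behind LR learning can be sketched as follows. This is a simplified illustration, not Kushmerick's published algorithm: candidates for a left delimiter are the suffixes of the text preceding a labeled value, candidates for a right delimiter are the prefixes of the text following it, and each field is handled independently. The names `lr_extract`, `label` and `learn_lr`, and the representation of labels as character offsets, are assumptions of this example.

```python
def lr_extract(page, wrapper):
    """Run an LR wrapper given as a list of (left, right) delimiter pairs."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in wrapper:
            begin = page.find(left, pos)
            if begin == -1:
                return records
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return records
            record.append(page[begin:end])
            pos = end + len(right)
        records.append(tuple(record))


def label(page, rows):
    """Stand-in for manual labeling: locate each value, in page order."""
    pos, labels = 0, []
    for row in rows:
        record = []
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            record.append((start, pos))
        labels.append(record)
    return labels


def learn_lr(page, labels):
    """Return a list of (left, right) pairs, or None if no LR wrapper fits."""
    wrapper = []
    for k in range(len(labels[0])):
        gaps, values, tails = [], [], []
        prev_end = 0
        for record in labels:
            for j, (start, end) in enumerate(record):
                if j == k:
                    gaps.append(page[prev_end:start])  # text before the value
                    values.append(page[start:end])
                    tails.append(page[end:])           # text after the value
                prev_end = end
        # Shortest suffix that ends every gap and never occurs earlier in it.
        left = next((gaps[0][-n:] for n in range(1, len(gaps[0]) + 1)
                     if all(g.endswith(gaps[0][-n:])
                            and g.find(gaps[0][-n:]) == len(g) - n
                            for g in gaps)), None)
        # Shortest prefix that starts every tail and never occurs in a value.
        right = next((tails[0][:n] for n in range(1, len(tails[0]) + 1)
                      if all((v + t).find(tails[0][:n]) == len(v)
                             for v, t in zip(values, tails))), None)
        if left is None or right is None:
            return None
        wrapper.append((left, right))
    expected = [tuple(page[s:e] for s, e in record) for record in labels]
    return wrapper if lr_extract(page, wrapper) == expected else None


PAGE = ("<html><body>\n<h1>Italian Football Teams</h1>\n<ul>\n"
        "<li><b>Atalanta</b> - <i>Bergamo</i> <br>\n"
        "<li><b>Inter</b> - <i>Milano</i> <br>\n"
        "</ul>\n</body></html>")
ROWS = [("Atalanta", "Bergamo"), ("Inter", "Milano")]
print(learn_lr(PAGE, label(PAGE, ROWS)))
```

Note that the learned delimiters are whatever short strings happen to separate the labeled values in the training page; they need not be complete tags, which is one reason such wrappers describe the page only loosely.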

Wrapper Induction
- To handle these cases, Kushmerick introduces more complex classes of wrappers
- Example: HLRT (head-left-right-tail wrappers)
  - two additional delimiters, h and t
  - h is used to skip potentially confusing text in the head of the page
  - t is used to skip potentially confusing text in the tail of the page

HLRT-wrappers: Example

  Source page:
    <html><body>
    <b>Italian Football Teams</b>
    <ul>
    <li><b>Atalanta</b> - <i>Bergamo</i> <br>
    <li><b>Inter</b> - <i>Milano</i> <br>
    <li><b>Juventus </b> - <i>Torino</i> <br>
    <li><b>Milan</b> - <i>Milano</i> <br>
    </ul>
    </body></html>

  Grammar:
    Σ* <ul> Σ* ( <b>teamname</b> Σ* <i>town</i> )* Σ* </html> Σ*

  HLRT-wrapper = { [<ul>, </html>]   head and tail delimiters
                   [<b>, </b>]
                   [<i>, </i>] }

  Wrapper procedure:
    Scan document for <ul>
    While (there are more occurrences of <b> before </html>)
      scan until <b>
      extract teamname between <b> and </b>
      scan until <i>
      extract town between <i> and </i>
      output [teamname, town]
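The effect of the head and tail delimiters can be seen by running an LR wrapper and an HLRT wrapper on the same page. The sketch below is an illustration only: it implements HLRT by cutting the page between the head and the tail delimiter and running the LR scan on what is left, which is a simplification of the procedure on the slide. The function names are assumptions of this example.

```python
PAGE = """<html><body>
<b>Italian Football Teams</b>
<ul>
<li><b>Atalanta</b> - <i>Bergamo</i> <br>
<li><b>Inter</b> - <i>Milano</i> <br>
<li><b>Juventus </b> - <i>Torino</i> <br>
<li><b>Milan</b> - <i>Milano</i> <br>
</ul>
</body></html>"""

FIELDS = [("<b>", "</b>"), ("<i>", "</i>")]


def lr_extract(page, delimiters):
    """LR wrapper: one (left, right) delimiter pair per field."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            begin = page.find(left, pos)
            if begin == -1:
                return records
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return records
            record.append(page[begin:end].strip())
            pos = end + len(right)
        records.append(tuple(record))


def hlrt_extract(page, head, tail, delimiters):
    """HLRT wrapper: skip everything up to `head`, stop at `tail`."""
    start = page.find(head)
    if start == -1:
        return []
    start += len(head)
    stop = page.find(tail, start)
    if stop == -1:
        stop = len(page)
    return lr_extract(page[start:stop], delimiters)


print(lr_extract(PAGE, FIELDS))                       # heading mistaken for a team
print(hlrt_extract(PAGE, "<ul>", "</html>", FIELDS))  # four correct records
```

On this page the plain LR wrapper pairs the heading with the first town and silently drops Atalanta, while the HLRT wrapper starts scanning only after `<ul>`.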

Wrapper Induction
- Six classes of wrappers
  - LR (left-right)
  - HLRT (head-left-right-tail)
  - OCLR (open-close-left-right): open and close delimiters identify each record in the document
  - HOCLRT (head-open-close-left-right-tail)
  - NLR (nested LR): to handle nested tabular data
  - NHLRT (nested HLRT)

Wrapper Induction
- Tradeoff between expressibility and complexity of the learning
  - the first four classes can be learned in polynomial time with respect to the length of the training documents
  - in the last two classes the learning is infeasible (exponential in the size of the training documents)
- Overall, the six classes were able to handle 70% of the sites selected by the author in his survey

Extended Wrapper Induction Techniques
- Subsequent works have extended Kushmerick's work in several respects
  - missing attributes
  - multiple orderings
  - disjunctive delimiters
  - arbitrary nesting
- Systems
  - Stalker [Muslea et al, 1999]
  - SoftMealy [Hsu, Dung, 98]
  - NoDoSE [Adelberg, 1998]
- Can wrap virtually any form of HTML layout

Limitations of Supervised Techniques
- If the site changes, it is necessary to maintain the wrapper
- Maintenance Problem I: verification
  - it is necessary to detect when the wrapper stops working properly
  - delimiter-based wrappers do not always make it possible to detect changes in the site
- Maintenance Problem II: re-induction
  - in general, a new user intervention is necessary in order to learn the new wrapper

Wrapper Verification Example

  New version of the page:
    <html><body>
    <h3>Italian Football Teams</h3>
    <table>
    <tr><td>Atalanta</td> <td>Bergamo</td>
    <tr><td>Inter</td> <td>Milano</td>
    <tr><td>Juventus</td> <td>Torino</td>
    <tr><td>Milan</td> <td>Milano</td>
    </table>
    <b>Last modified</b>: <i>Aug. 2002</i>
    </body></html>

  LR-wrapper = { [<b>, </b>] [<i>, </i>] }

  Output of the old wrapper on the new page:
    teamname        town
    Last modified   Aug. 2002

- The wrapper does not fail; it silently returns wrong data
- The reason for this is that the grammar specified by the delimiters is not very strict, and allows a wide degree of variation

Wrapper Maintenance
- There has been some work on wrapper maintenance
  - [Kushmerick, WWWJ 2000]: regression testing to detect wrapper disruption
  - [Lerman, Minton, 2000]: data extracted when the wrapper was working properly can help to automatically relabel the new samples
- However, these results are quite preliminary
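The verification problem can be reproduced with a few lines of Python. The sketch below runs the old LR wrapper on the redesigned page and then applies a deliberately naive check that compares the new output with records extracted while the wrapper was known to work. The check is a made-up illustration (the function `overlap` and the way it would be thresholded are assumptions), not the regression test of [Kushmerick, WWWJ 2000] or the method of [Lerman, Minton, 2000].

```python
NEW_PAGE = """<html><body>
<h3>Italian Football Teams</h3>
<table>
<tr><td>Atalanta</td> <td>Bergamo</td>
<tr><td>Inter</td> <td>Milano</td>
<tr><td>Juventus</td> <td>Torino</td>
<tr><td>Milan</td> <td>Milano</td>
</table>
<b>Last modified</b>: <i>Aug. 2002</i>
</body></html>"""

# Records extracted while the wrapper was known to work.
KNOWN_GOOD = [
    ("Atalanta", "Bergamo"),
    ("Inter", "Milano"),
    ("Juventus", "Torino"),
    ("Milan", "Milano"),
]


def lr_extract(page, delimiters):
    """LR wrapper: one (left, right) delimiter pair per field."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            begin = page.find(left, pos)
            if begin == -1:
                return records
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return records
            record.append(page[begin:end].strip())
            pos = end + len(right)
        records.append(tuple(record))


def overlap(new_records, old_records):
    """Fraction of previously extracted records that are still extracted."""
    if not old_records:
        return 1.0
    return len(set(new_records) & set(old_records)) / len(old_records)


extracted = lr_extract(NEW_PAGE, [("<b>", "</b>"), ("<i>", "</i>")])
print(extracted)                       # the wrapper still "works"
print(overlap(extracted, KNOWN_GOOD))  # but none of the old data survives
```

The wrapper raises no error on the new layout, so only a check on the extracted data itself reveals that something went wrong.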

Wrapper Maintenance: Ideal Goals
- Ideally, we would like to achieve two goals
- First, the wrapper should give a tighter description of the grammatical structure of the page
  - this would simplify wrapper verification
- Second, the wrapper should be learnable in an unsupervised way (fully automatically, without the need for training examples)
  - this would simplify wrapper re-induction

Wrapper Maintenance
- In this way, wrapper induction becomes a typical grammar inference problem
  - given a set of samples, automatically infer the minimal grammar to which they belong
- Grammar inference is a well-established research field, with a rich literature
  - nice theoretical framework
  - unsupervised algorithms
- Why not grammar inference for information extraction?

Part II
Grammar Inference

Grammar Inference
- Grammar inference is a 30-year-old subject (early studies in the sixties)
  - focus on computational theory, and not on information extraction
- Preliminaries: let's fix a finite alphabet of symbols Σ
  - a language over Σ is any collection of strings over Σ
  - a class of languages over Σ is any collection of languages over Σ

Grammar Inference
- The problem
  - given a collection of sample strings, find the language they belong to
- The computational model: identification in the limit [Gold, 1967]
  - the inference system is presented with a larger and larger corpus of examples
  - it simultaneously makes a sequence of guesses about the language to infer
  - if the sequence eventually converges, the inference is correct

Identification in the Limit
[Figure: an inference system reads samples s0, s1, ..., sn, sn+1, ... of a language taken from a class of languages (L2, L3, L4, L8, L24, ...) and outputs one hypothesis after each sample: H0: L12, H1: L14, ..., Hn-1: L7, Hn: L8, Hn+1: L8, Hn+2: L8, ... From some point on every hypothesis is L8, i.e., L8 is identified in the limit.]

Gold's Theorem
- Any class of languages over Σ that contains all finite languages over Σ and at least one infinite language cannot be identified in the limit from positive samples only [Gold, 1967]
- This means, for example, that regular grammars cannot be identified in the limit from positive samples only

Gold's Theorem: Intuition
- Consider a set of samples of the form a, ab, abb, abbb, abbbb, ...
  - the learner will never be able to tell definitively whether the target language is a ∪ ab ∪ abb ∪ abbb ∪ ... (a finite union) or ab*
- Hint: the problem is related to disjunction
  - but regular expressions without ∪ are not identifiable either
- Research directions
  - sub-classes that are identifiable in the limit
  - different computational models (e.g., positive and negative examples, or additional information)

Restricted Classes of Regular Expressions
- Dana Angluin has characterized the classes of languages identifiable in the limit
- A class of languages ℒ is identifiable in the limit iff each language of the class admits a characteristic sample [Angluin, 1980]
- Characteristic sample
  - a finite set T ⊆ L is said to be a characteristic sample of L in ℒ if L is the smallest language from ℒ containing T

Restricted Classes of Regular Expressions
- Several subclasses of regular expressions have been proven to be identifiable in the limit
  - reversible languages [Angluin, 1982]
  - terminal-distinguishable languages [Radhakrishnan and Nagaraja, 1987]
  - testable languages [Garcia and Vidal, 1990]
- The definition of these classes is quite involved

f-distinguishable Languages
- Fernau has recently generalized many previously known classes of identifiable languages under the notion of f-distinguishable languages
- Class of f-distinguishable languages
  - class of regular languages identified by a distinguishing function [Fernau, 2000]
- Distinguishing function
  - f: Σ* → F (with F finite) such that f(w) = f(z) implies f(wu) = f(zu), for all w, z, u ∈ Σ*
- All these classes of languages are identifiable in the limit (polynomial algorithm)

k-reversible Languages
- 0-reversible regular language L
  - consider the automaton A associated with L
  - consider the reversed automaton A^R of A, obtained by exchanging initial and final states and reversing each transition edge
  - L is 0-reversible if both A and A^R are deterministic
- k-reversible regular language L
  - A is deterministic
  - A^R is deterministic with lookahead k
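The 0-reversibility test described above is easy to state in code. The sketch below is an illustration under a few assumptions: an automaton is given as a set of initial states, a set of final states and a set of (source, symbol, target) triples, and it is taken to be the canonical automaton of the language. The function names are made up for this example, and the lookahead needed for k > 0 is not covered.

```python
def is_deterministic(initial, transitions):
    """At most one initial state and at most one target per (state, symbol)."""
    if len(initial) > 1:
        return False
    seen = set()
    for source, symbol, _target in set(transitions):
        if (source, symbol) in seen:
            return False
        seen.add((source, symbol))
    return True


def reverse(initial, final, transitions):
    """Exchange initial and final states and flip every transition edge."""
    flipped = {(target, symbol, source) for source, symbol, target in transitions}
    return set(final), set(initial), flipped


def is_zero_reversible(initial, final, transitions):
    """True when both the automaton and its reverse are deterministic."""
    r_initial, _r_final, r_transitions = reverse(initial, final, transitions)
    return (is_deterministic(initial, transitions)
            and is_deterministic(r_initial, r_transitions))


# ab*: state 0 --a--> 1, state 1 --b--> 1, final state 1.
AB_STAR = ({0}, {1}, {(0, "a", 1), (1, "b", 1)})

# {a, ab}: two final states, so the reverse has two initial states.
A_OR_AB = ({0}, {1, 2}, {(0, "a", 1), (1, "b", 2)})

print(is_zero_reversible(*AB_STAR))  # True
print(is_zero_reversible(*A_OR_AB))  # False
```

The second example shows a quick necessary condition: a 0-reversible automaton cannot have more than one final state.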

k-reversible Languages
- Characteristic sample for a k-reversible language (intuition)
  - a set of strings that touch each state of A, and exercise each transition of A
  - the strings need to have minimum length
- Formally, given A = (Q, Σ, δ, q0, QF)

    χ = { u(q)v(q) | q ∈ Q } ∪ { u(q) a v(δ(q,a)) | q ∈ Q, a ∈ Σ }

  where u(q) and v(q) are words of minimal length such that δ*(q0, u(q)) = q and δ*(q, v(q)) ∈ QF

k-reversible Languages: Example
- Professors and their courses
  - name, email (optional), one or more courses

    <html><body>
    <h1>John Doe</h1>
    email: <b>doe@dot.edu</b>
    <p>Courses:<br>
    <i>Distributed Systems</i><br>
    </body></html>

    <html><body>
    <h1>John Doe</h1>
    <p>Courses:<br>
    <i>Databases</i><br>
    <i>Operating Systems</i><br>
    </body></html>

- A class of pages with homogeneous content and layout

k-reversible Languages: Example
- This is a 1-reversible language

    <html><body>
    <h1>#PCDATA</h1>
    (email: <b>#PCDATA</b>)?
    <p>Courses:<br>
    (<i>#PCDATA</i><br>)+
    </body></html>

- The sources of these three HTML pages form a characteristic sample for the grammar

k-reversible Languages
- For each fixed k, there is a polynomial-time algorithm that correctly infers a k-reversible language from positive samples
- The algorithm is completely unsupervised
  - automata-theoretic, state-merging techniques
- Please note:
  - the algorithm requires the knowledge of k
  - to identify the language, the input must contain a characteristic sample (otherwise, the grammar is not correctly inferred)
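The grammar for the professor pages can be written as a Python regular expression, which makes the "tighter description" concrete: a page either matches the whole structure or is rejected. This is only a sketch of the grammar as a parser, not the state-merging inference algorithm; `#PCDATA` is approximated by `[^<]+`, whitespace between elements is ignored, and the names `PROFESSOR`, `COURSE` and `parse_professor` are assumptions of this example.

```python
import re

PROFESSOR = re.compile(
    r"<html><body>\s*"
    r"<h1>(?P<name>[^<]+)</h1>\s*"
    r"(?:email:\s*<b>(?P<email>[^<]+)</b>\s*)?"       # ( email: <b>#PCDATA</b> )?
    r"<p>Courses:<br>\s*"
    r"(?P<courses>(?:<i>[^<]+</i><br>\s*)+)"          # ( <i>#PCDATA</i><br> )+
    r"</body></html>",
    re.IGNORECASE,
)
COURSE = re.compile(r"<i>([^<]+)</i>", re.IGNORECASE)


def parse_professor(page):
    """Return the professor record, or None if the page is not in the language."""
    match = PROFESSOR.fullmatch(page.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "email": match.group("email"),  # None when the optional part is absent
        "courses": COURSE.findall(match.group("courses")),
    }


PAGE_WITH_EMAIL = """<html><body>
<h1>John Doe</h1>
email: <b>doe@dot.edu</b>
<p>Courses:<br>
<i>Distributed Systems</i><br>
</body></html>"""

PAGE_WITHOUT_EMAIL = """<html><body>
<h1>John Doe</h1>
<p>Courses:<br>
<i>Databases</i><br>
<i>Operating Systems</i><br>
</body></html>"""

print(parse_professor(PAGE_WITH_EMAIL))
print(parse_professor(PAGE_WITHOUT_EMAIL))
```

Unlike a delimiter-based wrapper, this parser returns None for a page whose layout has changed, which is what makes verification straightforward.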

k-reversible Languages
- For example, the samples below cannot be used to identify the language
  - although they touch each state of A and exercise all transitions
  - the set does not contain a characteristic sample
  - the learning fails to infer the grammar

Why Not Grammar Inference for IE?
- Grammar inference techniques might in principle solve the wrapper maintenance problem
  - tight grammars
  - unsupervised algorithms
- However, virtually none of the recent approaches to information extraction from HTML is based on grammar inference techniques
- There are several reasons for this

Why Not Grammar Inference for IE?
- Reason I: assumptions on the input to the inference algorithm
  - availability of a characteristic sample
- The minimal-length requirement makes its availability highly unlikely
  - a simple probabilistic argument shows that in practice there is a very low probability of finding a characteristic sample in a set of randomly selected HTML pages

Why Not Grammar Inference for IE?
- Reason II: the grammar by itself is not a wrapper
- Recall that a wrapper is a grammar
  - plus labeled terminals (attribute names)
  - plus a target structure for the output
- In essence, the wrapper needs to know how to output the strings that have been parsed

Why Not Grammar Inference for IE?
- Ideally, grammar inference algorithms would be useful if we could find a class of grammars that
  - are natural with respect to HTML pages
  - have a more practical notion of characteristic sample
  - are such that the target structure can be identified easily by looking at the grammar itself

Part III
Bridging the Gap

The RoadRunner Approach
- RoadRunner [Crescenzi, Mecca, Merialdo 2001]
  - a research project on information extraction from Web sites
- The goal
  - developing fully automatic techniques for information extraction from Web sites
- Contributions
  - it bridges the gap between wrapper induction and traditional grammar inference techniques

The RoadRunner Approach
- The target
  - large Web sites with HTML pages generated by programs (scripts) based on the content of an underlying data store
  - the data store is usually a relational database
- The typical page-generation process
  - data are extracted from the relational tables and possibly joined and nested
  - the resulting dataset is exported in HTML format by attaching tags to values

Example: A Fictional Bookstore

  Database

    Authors
    Id    Name
    #a1   John Smith
    #a2   Paul Jones

    Books
    Id    Title              Author   Description
    #b1   Database Primer    #a1      This book ...
    #b2   Computer Systems   #a1      An undergraduate ...
    #b3   XML at Work        #a2      A comprehensive ...
    #b4   HTML and Scripts   #a2      An useful HTML ...
    #b5   JavaScripts        #a2      A must in Java Script ...

    Editions
    Id    Book   Details                      Year   Price
    #e1   #b1    First Edition, Paperback     1998   20$
    #e2   #b1    Second Edition, Hard Cover   2000   30$
    #e3   #b2    First Edition, Paperback     1995   40$
    #e4   #b3    First Edition, Paperback     1999   30$
    #e5   #b4    null                         1993   30$
    #e6   #b4    Second Edition, Hard Cover   1999   45$
    #e7   #b5    null                         2000   50$

  Database --> Source Dataset (authors with nested books and editions) --> HTML Pages

The Schema Finding Problem
- Page class
  - collection of pages generated by the same script from a common dataset
  - these pages typically share a common structure
- Schema finding problem
  - given a set of sample HTML pages belonging to the same class, automatically recover the source dataset
- Solution: wrapper
  - structure of the underlying dataset
  - extraction rules

The Schema Finding Problem

  Input HTML Sources

    <HTML><BODY><TABLE>
    <TR> <TD><FONT>books.com</FONT></TD> <TD><A>John Smith</A></TD> </TR>
    <TR> <TD><FONT>Database Primer</FONT></TD> <TD><B>First Edition, Paperback</B></TD> ...

    <HTML><BODY><TABLE>
    <TR> <TD><FONT>books.com</FONT></TD> <TD><A>Paul Jones</A></TD> </TR>
    <TR> <TD><FONT>XML at Work</FONT></TD> <TD><B>First Edition, Paperback</B></TD> ...

  Wrapper = Grammar

    <HTML><BODY><TABLE>
    <TR> <TD><FONT>books.com</FONT></TD> <TD><A>#PCDATA</A></TD> </TR>
    ( <TR>
      ( <TD><FONT>#PCDATA</FONT></TD> ( <TD><B>#PCDATA</B> )? </TD>
      ( <TR><TD><FONT>#PCDATA</FONT> <B>#PCDATA</B> </TD></TR> )+
    )+

  Target Schema

    SET ( TUPLE ( A : #PCDATA;
                  B : SET ( TUPLE ( C : #PCDATA;
                                    D : #PCDATA;
                                    E : SET ( TUPLE ( F : #PCDATA; G : #PCDATA; H : #PCDATA ) ) ) ) ) )

  Wrapper Output

    A            C                  D                     F                     G      H
    John Smith   Database Primer    This book ...         First Edition, ...    1998   20$
                                                          Second Edition, ...   2000   30$
                 Computer Systems   An undergraduate ...  First Edition, ...    1995   40$
    Paul Jones   XML at Work        A comprehensive ...   First Edition, ...    1999   30$
                 HTML and Scripts   An useful HTML ...    null                  1993   30$
                                                          Second Edition, ...   1999   45$
                 JavaScripts        A must in ...         null                  2000   50$

The Schema Finding Problem
- In essence
  - we are not making any assumptions on the target structure (which can be arbitrarily nested)
  - we are not relying on user inputs (the algorithm should be unsupervised)
- This is a grammar inference problem, but
  - we want the data source schema, not only the grammar
  - we want a more natural notion of characteristic sample

The RoadRunner Approach
- Nested types
  - nested relation types with optional attributes
- Source dataset
  - instance of a nested type
- Target class of grammars
  - markup grammars
  - intuition: languages obtained by serializing a dataset with HTML markup

Nested Relations with Optionals

  Schema:
    Tuple
      Name
      Optional
        Email
      Set
        Tuple
          Course

  Smith's instance:
    Tuple
      Frank Smith
      Optional
        null
      Set
        Tuple
          Operating System
        Tuple
          Databases

  Source Dataset
    Name          Email         Courses
    Frank Smith   null          Operating Systems, Databases
    John Doe      doe@dot.edu   Distributed Systems
    Paul Jones    null          Advanced Database Systems

Serializing Instances: Mark-Up Encodings

  Tagged Schema:
    Tuple        <HTML><BODY> ... </BODY></HTML>
      Name       <H1> ... </H1>
      Optional   <P> ... </P>
        Email    <B> ... </B>
      Set        <UL> ... </UL>
        Tuple    <LI> ... </LI>
          Course

  Smith's Instance Encoding:
    <HTML><BODY>
    <H1>Frank Smith</H1>
    <P></P>
    <UL>
    <LI>Operating System</LI>
    <LI>Databases</LI>
    </UL>
    </BODY></HTML>

  Mark-up Language:
    <HTML><BODY>
    <H1> #PCDATA </H1>
    <P> ( <B> #PCDATA </B> )? </P>
    <UL> ( <LI> #PCDATA </LI> )+ </UL>
    </BODY></HTML>

Prefix Markup-Encodings
- Mark-up grammar
  - any regular language generated by a markup encoding of a nested type
- Prefix markup grammar (intuition)
  - markup grammar such that the resulting language is LL(1) in both directions
- In essence
  - we have identified a subclass of union-free regular expressions
  - this class has a number of nice properties
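The markup encoding of an instance of a nested type can be sketched as a small recursive function. The representation below is an assumption made for this example (schema nodes as tuples of kind, opening tag, closing tag and children; instances as nested Python lists with None for a null optional), not the data model used by RoadRunner, and whitespace between tags is omitted.

```python
# Tagged schema for the professor pages: Tuple(Name, Optional(Email), Set(Tuple(Course))).
SCHEMA = (
    "tuple", "<HTML><BODY>", "</BODY></HTML>", [
        ("leaf", "<H1>", "</H1>"),
        ("optional", "<P>", "</P>", ("leaf", "<B>", "</B>")),
        ("set", "<UL>", "</UL>", ("tuple", "<LI>", "</LI>", [("leaf", "", "")])),
    ],
)

# Smith's instance: no email, two courses.
SMITH = ["Frank Smith", None, [["Operating System"], ["Databases"]]]


def encode(node, value):
    """Serialize `value` by wrapping it in the tags attached to `node`."""
    kind, open_tag, close_tag = node[0], node[1], node[2]
    if kind == "leaf":
        body = value
    elif kind == "tuple":
        body = "".join(encode(child, v) for child, v in zip(node[3], value))
    elif kind == "optional":
        body = "" if value is None else encode(node[3], value)
    elif kind == "set":
        body = "".join(encode(node[3], v) for v in value)
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return open_tag + body + close_tag


print(encode(SCHEMA, SMITH))
```

Running the encoder over every instance of the dataset yields strings of the mark-up language; schema finding is the inverse problem of recovering `SCHEMA` from such strings alone.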

Prefix Markup-Encodings
- Property I: there is a straightforward mapping from grammars to database types
  - concatenations in the grammar correspond to database tuples
  - each + in the grammar corresponds to a database set
  - each ? in the grammar corresponds to an optional
- In fact
  - given a mark-up language, it is possible to recover the underlying type in linear time

Prefix Markup-Encodings
- Property II: the class is identifiable in the limit
  - note that regular expressions with concatenation (.) and * only are not identifiable in the limit
- In fact
  - it is possible to prove that it is a subclass of (a slight generalization of) a class of f-distinguishable languages [Fernau, 2000]
  - therefore each language has a characteristic sample

Prefix Markup-Encodings
- Property III: the class has a very natural notion of characteristic sample
  - we can drop the minimality requirement
- More specifically
  - all we need is a rich set of instances
- Rich set of instances
  - intuitively: a set of instances that makes full use of the underlying type

Prefix Markup-Encodings
- Rich set of instances
  - basic richness: each leaf node has at least two distinct values
  - set richness: each set node has instances of distinct cardinalities
  - optional richness: each optional node has at least one null and one non-null instance
- It is possible to prove that each rich set is a characteristic sample for the corresponding prefix markup language

Rich Set of Instances

  Tuple
    Fred Perry
    Optional
      perry@dot.edu
    Set
      Tuple
        Distributed Systems
      Tuple
        Prog. Languages
      Tuple
        Networks

  Tuple
    Frank Smith
    Optional
      null
    Set
      Tuple
        Databases
      Tuple
        Operating Systems

Rich Set of Instances
- A simple probabilistic argument shows that in practice the probability of finding a rich set in a random sample is quite high
- Consider a type with k set types
  - call p the probability that 2 pages have all instances of a set type of the same cardinality
  - the probability of finding 2 distinct cardinalities in n random pages is 1 - p^(n-1)
  - the probability that the n samples are set rich is (1 - p^(n-1))^k
  - if k = 5, n = 10, p = 50%, then the probability is about 99%
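The figure quoted for k = 5, n = 10 and p = 50% can be checked directly from the formula (1 - p^(n-1))^k. The function name below is an assumption of this example.

```python
def prob_set_rich(k, n, p):
    """Probability that n random pages are set rich for a type with k set
    nodes, where p is the probability that two pages agree on cardinality."""
    return (1 - p ** (n - 1)) ** k


# 1 - 0.5**9 = 0.998046875, raised to the 5th power.
print(round(prob_set_rich(k=5, n=10, p=0.5), 4))  # 0.9903
```

Even with only five pages the same type gives (1 - 0.5^4)^5, roughly 72%, so a modest number of samples is usually enough.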

Contributions of RoadRunner
- We have developed an unsupervised learning algorithm for identifying prefix-markup languages in the limit
  - given a rich sample of HTML pages, the algorithm correctly identifies the language, i.e., the wrapper
  - the algorithm runs in polynomial time
- Note
  - the algorithm is significantly different from those for f-distinguishable languages, due to the different notion of characteristic sample

Prefix-Markup Languages
- This seems to represent a reasonable compromise between two aspects
- Expressibility of the grammar formalism
  - we have been able to wrap many real-world sites, such as amazon.com, cnn.com, and ebay.com
- Effectiveness of the wrapper induction task
  - it is possible to fully automate the step
  - there are reasonable assumptions on the samples to inspect

Open Problems
- The algorithm needs multiple pages
  - as any grammar inference algorithm
  - how to wrap single pages?
- The algorithm is not capable of inferring field names
  - these are anonymous in the target wrapper
  - some heuristics, but preliminary work
- Disjunctions are necessary in some cases
  - this makes the learning much more complex
  - a form of non-disruptive disjunction?

References

References: Wrapper Induction, Finite-State Techniques
- B. Adelberg. NoDoSE - a tool for semi-automatically extracting structured and semistructured data from text documents. In SIGMOD 98.
- C. Hsu and M. Dung. Generating Finite State Transducers for Semistructured Data Extraction from the Web. Information Systems 23, 1998.
- N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. In IJCAI 97.
- N. Kushmerick. Wrapper induction: Efficiency and Expressiveness. Artificial Intelligence 118, 2000.
- N. Kushmerick. Wrapper Verification. WWW Journal 3, 2000.
- K. Lerman, S. Minton. Learning the Common Structure of Data. Proc. of National Conference on Artificial Intelligence, 2000.
- I. Muslea, S. Minton, and C. A. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the Third Annual Conference on Autonomous Agents, pages 190-197, 1999.

References: Wrapper Induction, Other Approaches
- D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, Y. Ng, D. Quass, and R. D. Smith. A conceptual-modeling approach to extracting data from the web. In ER 98.
- N. Kushmerick, B. Thomas. Adaptive Information Extraction: Core Technologies for Information Agents. In Intelligent Information Agents R&D in Europe: An AgentLink Perspective, 2002. http://www.cs.ucd.ie/staff/nick/
- I. Muslea. Extraction Patterns for Information Extraction Tasks: A Survey. Proc. of AAAI Workshop on Machine Learning for Information Extraction, 1999.

References: Grammar Inference
- D. Angluin. Inductive Inference of Formal Languages from Positive Data. Information and Control 45:117-135, 1980.
- D. Angluin. Inference of Reversible Languages. Journal of the ACM 29(3):741-765, 1982.
- H. Fernau. On Learning Function Distinguishable Languages. Technical Report WSI-2000-13, Wilhelm-Schickard-Institut für Informatik, 2000.
- P. Garcia, E. Vidal. Inference of k-testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 12, 1990.
- E. M. Gold. Language Identification in the Limit. Information and Control 10(5):447-474, 1967.
- V. Radhakrishnan, G. Nagaraja. Inference of Regular Grammars via Skeletons. IEEE Trans. on Systems, Man and Cybernetics 17, 1987.

References: RoadRunner
- S. Grumbach, G. Mecca. In Search of the Lost Schema. In ICDT'99.
- V. Crescenzi, G. Mecca, P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In VLDB 2001: 109-118.
- V. Crescenzi, G. Mecca, P. Merialdo. Automatic Web Information Extraction in the RoadRunner System. Workshop DASWIS-2001, in conjunction with ER 2001.
- V. Crescenzi, G. Mecca, P. Merialdo. Wrapper Oriented Classification of Web Pages. SAC 2002, March 10-14, Madrid, Spain.