is that the question?

Similar documents
Databases and the World Wide Web

Semistructured Data and Mediation

Extracting Data From Web Sources. Giansalvatore Mecca

Semistructured Data and Mediation

WEB SITES NEED MODELS AND SCHEMES

Handling Irregularities in ROADRUNNER

XXII. Website Design. The Web

Web Data Extraction. Craig Knoblock University of Southern California. This presentation is based on slides prepared by Ion Muslea and Kristina Lerman

Information Discovery, Extraction and Integration for the Hidden Web

Redundancy-Driven Web Data Extraction and Integration

Manual Wrapper Generation. Automatic Wrapper Generation. Grammar Induction Approach. Overview. Limitations. Website Structure-based Approach

MIWeb: Mediator-based Integration of Web Sources

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

ISSN (Online) ISSN (Print)

EXTRACTION AND ALIGNMENT OF DATA FROM WEB PAGES

Part I: Data Mining Foundations

Template Extraction from Heterogeneous Web Pages

Universita degli Studi di Roma Tre. Dipartimento di Informatica e Automazione. Design and Maintenance of. Data-Intensive Web Sites

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

Searching the Deep Web

AAAI 2018 Tutorial Building Knowledge Graphs. Craig Knoblock University of Southern California

Keywords Data alignment, Data annotation, Web database, Search Result Record

A survey: Web mining via Tag and Value

DataRover: A Taxonomy Based Crawler for Automated Data Extraction from Data-Intensive Websites

WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE

Database Systems. Sven Helmer. Database Systems p. 1/567

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Searching the Deep Web

Heading-Based Sectional Hierarchy Identification for HTML Documents

Deep Web Crawling and Mining for Building Advanced Search Application

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper)

Introduction to XML. XML: basic elements

An Approach To Web Content Mining

Study on XML-based Heterogeneous Agriculture Database Sharing Platform

Introduction to Database Systems CSE 414

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources

Teiid Designer User Guide 7.5.0

Knowledge discovery from XML Database

Annotating Multiple Web Databases Using Svm

Implementing a Numerical Data Access Service

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

MetaNews: An Information Agent for Gathering News Articles On the Web

Copyright 2014 Blue Net Corporation. All rights reserved

CSE 344 MAY 7 TH EXAM REVIEW

Tag-based Social Interest Discovery

XML and information exchange. XML extensible Markup Language XML

Chapter 1: Introduction

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Getting Started with Web Services

A tutorial report for SENG Agent Based Software Engineering. Course Instructor: Dr. Behrouz H. Far. XML Tutorial.

Synergic Data Extraction and Crawling for Large Web Sites

Database Roma Tre

Getting Started with Web Services

HERA: Automatically Generating Hypermedia Front- Ends for Ad Hoc Data from Heterogeneous and Legacy Information Systems

RoadRunner for Heterogeneous Web Pages Using Extended MinHash

Information Retrieval

Enterprise Data Catalog for Microsoft Azure Tutorial

Annotation Component in KiWi

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report

Development of an Ontology-Based Portal for Digital Archive Services

Object Extraction. Output Tagging. A Generated Wrapper

ESRI stylesheet selects a subset of the entire body of the metadata and presents it as if it was in a tabbed dialog.

Basi di dati e World Wide Web

Features and Requirements for an XML View Definition Language: Lessons from XML Information Mediation

XML: Extensible Markup Language

Database System Concepts and Architecture

Introduction to Database Systems CSE 414

Speech 2 Part 2 Transcript: The role of DB2 in Web 2.0 and in the IOD World

Web Presentation Patterns (controller) SWEN-343 From Fowler, Patterns of Enterprise Application Architecture

Generalized Document Data Model for Integrating Autonomous Applications

Part V. Relational XQuery-Processing. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 297

Hypertext Markup Language, or HTML, is a markup

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

Typing Massive JSON Datasets

Motivating Ontology-Driven Information Extraction

Overview. Structured Data. The Structure of Data. Semi-Structured Data Introduction to XML Querying XML Documents. CMPUT 391: XML and Querying XML

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Modelling Structures in Data Mining Techniques

ABSTRACT The Araneus project aims at developing tools for data-management on the World Wide Web. Web-based information systems deal with data of heter

SDPL : XML Basics 2. SDPL : XML Basics 1. SDPL : XML Basics 4. SDPL : XML Basics 3. SDPL : XML Basics 5

Information Retrieval CSCI

Data Presentation and Markup Languages

Form Identifying. Figure 1 A typical HTML form

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

Automatic Wrapper Adaptation by Tree Edit Distance Matching

CSCI3030U Database Models

Ontology Based Prediction of Difficult Keyword Queries

A General Approach to Query the Web of Data

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Introduction to Data Management CSE 344

Chapter 2. Architecture of a Search Engine

ProFoUnd: Program-analysis based Form Understanding

Java Framework for Database-Centric Web Site Engineering

INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN EFFECTIVE KEYWORD SEARCH OF FUZZY TYPE IN XML

CS145 Introduction. About CS145 Relational Model, Schemas, SQL Semistructured Model, XML

Supporting Fuzzy Keyword Search in Databases

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

XML Processing & Web Services. Husni Husni.trunojoyo.ac.id

SEEK: Scalable Extraction of Enterprise Knowledge

Transcription:

To structure or not to structure, is that the question? Paolo Atzeni Based on work done with (or by!) G. Mecca, P.Merialdo, P. Papotti, and many others Lyon, 24 August 2009

Content A personal (and questionable ) perspective on the role structure (and its discovery) has in the Web, especially as a replacement for semantics Two observations Especially important if we want automatic processing Something similar happened for databases many years ago and a disclaimer: I feel honored to have been invited, but I am not sure why it happened Personal means that I will refer to projects in my group (with some contradiction, as I have been personally involved only in some of them ) 24/08/2009 Paolo Atzeni 2

Content Questions on the Web Structures in the Web, transformations: Araneus Information extraction, wrappers: Minerva, RoadRunner Crawling, extraction and integration: Flint 24/08/2009 Paolo Atzeni 3

Web and Databases (1) The Web is browsed and searched Usually one page or one item at the time (may be a ranked list) Databases are (mainly) queried We are often interested in tables as results 24/08/2009 Paolo Atzeni 4

What do we do with the Web? Everything! We often search for specific items, and search engines are ok In other cases we need more complex things: What could be the reason for which Frederic Joliot-Curie (a French physicist and Nobel laureate, son in law of P. and M. Curie) is popular in Bulgaria (for example, the International House of Scientists in Varna is named after him)? Find a list of qualified database researchers who have never been in the VLDB PC and deserve to belong to the next one I will not provide solutions to these problems, but give hints Semantics Structure 24/08/2009 Paolo Atzeni 5

Questions on the Web: classification criteria (MOSES European project, FP5) Format of the response Number (and relationship) of the involved sites (one or more; of the same organization or not) Structure of the source information Extraction process 24/08/2009 Paolo Atzeni 6

Format of the response Asimplepiece piece of data A web page (or fragment of it) A tuple of data A "concept" A set or list of one of the above Natural language (i.e., an explanation) 24/08/2009 Paolo Atzeni 7

Involved sites One or more Of the same organization or not Sometimes also sites and local databases 24/08/2009 Paolo Atzeni 8

Structure of the source information Information directly provided by the site(s) as a whole Information available in the sites and correlated but not aggregated Information available in the site in a unrelated way 24/08/2009 Paolo Atzeni 9

Extraction process Explicit service: this the case for example if the site offers an address book or a search facility for finding professors One-way navigation: here the user has to navigate, but without the need for going back and forth (no trial and error ) Navigation with backtracking 24/08/2009 Paolo Atzeni 10

Web and Databases (2) The answer to some questions requires structure (we need to "compile" tables) databases are structured and organized how much structure and organization is there in the Web? 24/08/2009 Paolo Atzeni 11

Workshop on Semistructured Data at Sigmod 1997 24/08/2009 Paolo Atzeni 12

Workshop on Semistructured Data at Sigmod 1997-2 24/08/2009 Paolo Atzeni 13

Workshop on Semistructured Data at Sigmod 1997-3 24/08/2009 Paolo Atzeni 14

Transformations bottom-up: accessing information from Web sources top-down: designing and maintaining Web sites global: integrating g existing sites and offering the information through new ones 24/08/2009 Paolo Atzeni 15

The Araneus Project At Università Roma Tre & Università della Basilicata 1996-2003 Goals: extend database-like techniques to Web data management Several prototype systems 24/08/2009 Paolo Atzeni 16

Example Applications The Integrated Web Museum a site integrating g data coming from the Uffizi, Louvre and Capodimonte Web sites The ACM Sigmod Record XML Version XML+XSL Version of an existing (and currently running) Web site 24/08/2009 Paolo Atzeni 17

The Integrated Web Museum A new site with data coming from the Web sites of various museums Uffizi, Louvre and Capodimonte 24/08/2009 Paolo Atzeni 18

The Integrated Web Museum DB 24/08/2009 Paolo Atzeni 19

Integration of Web Sites: The Integrated Web Museum Data are re-organized: Uffizi, paintings organized by rooms Louvre, Capodimonte, works organized by collections Integrated Museum, organized by author 24/08/2009 Paolo Atzeni 20

Web Site Re-Engineering: ACM Sigmod Record in XML The ACM Sigmod Record Site (acm.org/sigmod/record) an existing site in HTML The ACM Sigmod Record Site, XML Edition a new site in XML with XSL Stylesheets 24/08/2009 Paolo Atzeni 21

24/08/2009 Paolo Atzeni 22

24/08/2009 Paolo Atzeni 23

The Araneus Approach Identification of sites of interest Wrapping of sites to extract information Navigation of site ad extraction of data Integration of data Generation of new sites Site Generation Layer Integration Layer Mediator Layer Araneus Web Site Development System (Homer) SQL and Java Araneus Query System (Ulixes) Wrapper Layer Wrapper Layer Wrapper Layer Araneus Wrapper Toolkit (Minerva) 24/08/2009 Paolo Atzeni 24

Building Applications in Araneus Phase A: Reverse Engineering Existing Sites Phase B: Data Integration ti Phase C: Developing New Integrated Sites 24/08/2009 Paolo Atzeni 25

Building Applications in Araneus Phase A: Reverse Engineering i First Step: Deriving the logical structure of data in the site ADM Scheme Second Step: Wrapping pages in order to map physical HTML sources to database objects Third Step: Extracting Data from the Site by Queries and Navigation 24/08/2009 Paolo Atzeni 26

Information Extraction Task Information extraction ti task source format: plain text with HTML markup (no semantics) target format: database table or XML file (adding structure, i.e., semantics ) extraction step: parse the HTML and return data items in the target format Wrapper piece of software designed to perform the extraction step 24/08/2009 Paolo Atzeni 27

Notion of Wrapper BookTitle Author Editor The HTML Sourcebook J. Graham... Computer Networks A. Tannenbaum... Wrapper Database Systems Data on the Web R. Elmasri, S. Navathe S. Abiteboul, P.Buneman, D.Suciu... database table(s) HTML page (or XML docs) Intuition: use extraction rules based on HTML markup 24/08/2009 Paolo Atzeni 28

Intuition Behind Extraction teamname Atalanta Inter Juventus Milan town Bergamo Milano Torino Milano <html><body> Wrapper Procedure: <h1>italian Football Teams</h1> Scan document for <ul> <ul> While there are more occurrences <li><b>atalanta</b> - <i>bergamo</i> <br> scan until <b> <li><b>inter</b> - <i>milano</i> <br> extract teamname between <b> <li><b>juventus</b> - <i>torino</i> <br> and </b> <li><b>milan</b>-<i>milano</i> i Milano /i <br> scan until <i> </ul> extract town between <i> and </i> </body></html> output [teamname, town] 24/08/2009 Paolo Atzeni 29

The Araneus Wrapper Toolkit Adv vantag ges Procedural Languages Flexible in presence of irregularities and exceptions Allows document restructurings Grammars Concise and declarative Simple in the case of strong regularity Drawba acks Tedious even with document strongly structured Difficult to maintain Not flexible in presence of irregularities Allow for limited restructurings 24/08/2009 Paolo Atzeni 30

The Araneus Wrapper Toolkit: Minerva + Editor Minerva is a Wrapper generator It couples a declarative, grammar-based approach with the flexibility of procedural programming: grammar-based procedural exception handling 24/08/2009 Paolo Atzeni 31

Development of wrappers Effective tools: drag and drop, example based, flexible Learning and automation (or semiautomation) Maintenance support: Web sites change quite frequently Minor changes may disrupt the extraction rules maintaining i i a wrapper is very costly this has motivated the study of (semi)- automatic wrapper generation techniques 24/08/2009 Paolo Atzeni 32

Observation Can XML or the Semantic Web help much? Probably not the Web is a giant legacy system and many sites will never move to the new technology (we said this ten years ago and it still seems correct ) many organizations will not be willing to expose their data in XML 24/08/2009 Paolo Atzeni 33

A related topic: Grammar Inference The problem given a collection of sample strings, find the language they belong to The computational model: identification in the limit [Gold, 1967] a learner is presented successively with a larger and larger corpus of examples it simultaneously makes a sequence of guesses about the language to learn if the sequence eventually converges, the inference is correct 24/08/2009 Paolo Atzeni 34

The RoadRunner Approach RoadRunner [Crescenzi, Mecca, Merialdo 2001] a research project on information extraction from Web sites The goal developing fully automatic, unsupervised wrapper induction techniques Contributions wrapper induction based on grammar inference techniques (suitably adapted) 24/08/2009 Paolo Atzeni 35

The RoadRunner Approach The target t large Web sites with HTML pages generated by programs (scripts) based on the content of an underlying data store the data store is usually a relational database The process data are extracted from the relational tables and possibly joined and nested the resulting dataset is exported in HTML format by attaching tags to values 24/08/2009 Paolo Atzeni 36

The Schema Finding Problem A C Wrapper Output First Edition, 1998 20$ Database Primer Input HTML Sources John Smith <HTML><BODY><TABLE> Computer Systems <TR> <TD><FONT>books.com<FONT></TD> XML at Work <TD><A>John Smith</A></TD> </TR> <TR> <TD><FONT>Database D Paul Primer<FONT></TD> Jones HTML and Scripts <TD><B>First Edition, Paperback</B></TD>, JavaScripts Wrapper D This book B <HTML><BODY><TABLE> An undergraduate First Edition, 1995 40$ <TR> <TD><FONT>books.com<FONT></TD> A comprehensive First Edition, 1999 30$ <TD><A>Paul Jones</A></TD> </TR> null 1993 30$ <TR> An useful <TD><FONT>XML HTML Second Edition, at Work<FONT></TD> <TD><B>First Edition, Paperback</B></TD>, 1999 45$ A must in null 2000 50$ F E Second Edition, 2000 Target Schema <HTML><BODY><TABLE> <TR> SET ( <TD><FONT>books.com<FONT></TD> TUPLE (A : #PCDATA; <TD><A> #PCDATA</A></TD></TR> B : SET ( (<TR> TUPLE ( C : #PCDATA; ( <TD><FONT> #PCDATA <FONT></TD> D : #PCDATA; ( <TD><B> #PCDATA </B> )? </TD> E : SET ( ( <TR><TD><FONT> #PCDATA <FONT> TUPLE ( F: #PCDATA; <B> #PCDATA </B> </TD></TR> )+ G: #PCDATA; )+ H: #PCDATA))) 24/08/2009 Paolo Atzeni 37 G 30$ H

The Schema Finding Problem Page class collection of pages generated by the same script from a common dataset these pages typically share a common structure Schema finding problem given a set of sample HTML pages belonging to the same class, automatically recover the source dataset i.e., generate a wrapper capable of extracting the source dataset from the HTML code 38 24/08/2009 Paolo Atzeni

Contributions of RoadRunner An unsupervised learning algorithm for identifying i prefix-markup languages in the limit given a rich sample of HTML pages, the algorithm correctly identifies the language the algorithm runs in polynomial time from the language, when can infer the underlying schema and then extract the source dataset 39 24/08/2009 Paolo Atzeni

Data Extraction ti and Integration ti from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università itàdegli listudi Roma Tre

Flint A novel approach for the automatic extraction and integration of web data Flint aims at exploiting an unexplored publishing pattern that frequently occurs in data intensive i web sites: large amounts of data are usually offered by pages that encode one flat tuple for each page for many disparate real world entities (e.g. stock quotes, people, travel packages, movies, books, etc) there are hundreds of web sites that deliver their data following this publishing strategy These collections of pages can be though as HTML encodings of a relation (e.g. "stock quote" relation) 24/08/2009 Paolo Atzeni 41

Example 24/08/2009 Paolo Atzeni 42

Flint goals Last Min Max StockQuote Volume 52high Open 24/08/2009 Paolo Atzeni 43

Flint goals q 1 Q q 2 Last Min Max StockQuote Volume 52high Open q n 24/08/2009 Paolo Atzeni 44

Extraction and Integration Flint builds on the observation that although structured information is spread across a myriad of sources, the web scale implies a relevant redundancy, which is apparent at tthe intensional i level: l many sources share a core set of attributes at the extensional level: many instances occur in a number of sources The extraction and integration algorithm takes advantage of the coupling of the wrapper inference and the data integration tasks 24/08/2009 Paolo Atzeni 45

Flint main components a crawling algorithm that gathers collections of pages containing data of interest a wrapper inference and data integration algorithm that automatically ti extracts t and integrates t data from pages collected by the crawler a framework to evaluate (in probabilistic terms) the quality of the extracted data and the reliability/trustiness of the sources 24/08/2009 Paolo Atzeni 46

Crawling Given as input a small set of sample pages from distinct web sites The system automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples

The crawler, approach Given a bunch of sample pages crawl the web sites of the sample pages to gather other pages offering the same type of information extract a set of keywords that describe the underlying entity do launch web searches to find other sources with pages that contain instances of the target entity analyze the results to filter out irrelevant pages crawl the new sources to gather new pages while new pages are found

Data and Sources Reliability Redundancy among autonomous and heterogeneous sources implies inconsistencies and conflicts in the integrated data because sources can provide different values for the same property of a given object A concrete example: on April 21th 2009, the open trade for the Sun Microsystem Inc. stock quote published by the CNN Money, Google Finance, and Yahoo! Finance web sites, was 9.17, 9.15 and 9.15, respectively. 24/08/2009 Paolo Atzeni 49

System architecture Web Search Flint Data Extraction Data Integration The Web 24/08/2009 Paolo Atzeni 50

System architecture Web search Flint Extraction Integration Probability The Web 24/08/2009 Paolo Atzeni 51

Summary The exploitation of structure is essential for extracting information from the Web In the Web world, task automation is needed to get scalability Interaction of different activities (extraction, search, transformation, integration) is often important 24/08/2009 Paolo Atzeni 52