Structured Data on the Web

Structured Data on the Web
Alon Halevy, Google
Australasian Computer Science Week, January 2010

Structured Data & The Web

Andree Hudson, 4th of July

Discover: hard to find structured data via search engines
Publish: requires infrastructure, concerns about losing control
Extract: data is embedded in web pages, behind forms
Manage, Analyze, Combine: hard to query, visualize, combine data across organizations

Discover: web-form crawling, finding all HTML tables
Publish / Manage, Analyze, Combine: Fusion Tables (collaborating on data in the cloud, easy data publishing)
Extract: lists -> tables, extracting context

Two Over-Arching Questions
- How do we combine techniques for structured and unstructured data?
  - Can't think of structured data in isolation on the Web
- How to design a data management system for the Web/Cloud/Mobile world?
  - It's not just SQL and transactions anymore.

What is the Deep Web?
- Deep = not accessible through general-purpose search engines
- Major gap in the coverage of search engines
- Examples: used cars, store locations, radio stations, patents, recipes

Vertical Search: Data Integration

Tree search, Amish quilts, parking tickets in India, horses

Three Flavors of Deep Web
- Vertical search: a single domain
  - Can be done with data integration techniques (e.g., Fetch, Transformic, Morpheus, ...)
  - Goal: deeper experience than search (close a transaction, related items, reviews, ...)
- Search for anything
  - Goal: drive traffic to relevant sites
- Product search
  - In between the above two

Search for Anything: Surfacing
- Crawl & indexing time
  - Pre-compute interesting form submissions
  - Insert resulting pages into the Google index
- Query time: nothing!
  - Deep-web URLs in the Google index are like any other URL
- Advantages
  - Reuse existing search engine infrastructure as-is
  - Reduced load on target web sites: users click only on what they deem relevant

Surfacing Challenges [see Madhavan et al., VLDB 08]
1. Predicting the correct input combinations
   - Generating all possible URLs is wasteful and unnecessary
   - Cars.com has ~500K listings, but 250M possible queries
2. Predicting the appropriate values for text inputs
   - Valid input values are required for retrieving data
   - E.g., ingredients in recipes.com and zipcodes in borderstores.com
3. Don't do anything bad!
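The gap between listings and possible queries in challenge 1 comes from multiplying the number of choices per form input. The sketch below illustrates that arithmetic on a small form; the input names and values are invented for illustration and are not taken from Cars.com.

```python
from itertools import product
from math import prod

# Hypothetical form: input name -> selectable values (invented for illustration).
form_inputs = {
    "make": ["any", "ford", "honda", "toyota"],
    "body": ["any", "sedan", "suv"],
    "max_price": ["any", "5000", "10000", "20000", "40000"],
}

def count_submissions(inputs):
    """Number of distinct submissions if every input combination is tried."""
    return prod(len(values) for values in inputs.values())

def enumerate_submissions(inputs):
    """Yield each combination as a dict; only feasible for small forms."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield dict(zip(names, combo))

total = count_submissions(form_inputs)  # 4 * 3 * 5 combinations
```

Each additional input multiplies the total, which is why exhaustive enumeration quickly outgrows the number of records actually behind the form.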

Informative Query Templates
Result pages different: informative
  http://jobs.shrm.org/search?state=all&kw=&type=all
  http://jobs.shrm.org/search?state=al&kw=&type=all
  http://jobs.shrm.org/search?state=ak&kw=&type=all
  http://jobs.shrm.org/search?state=wv&kw=&type=all
Result pages similar: un-informative
  http://jobs.shrm.org/search?state=all&kw=&type=all
  http://jobs.shrm.org/search?state=all&kw=&type=any
  http://jobs.shrm.org/search?state=all&kw=&type=exact
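One way to make the informative/un-informative distinction concrete is to vary a single input, fetch the result pages, and measure how many are distinct. The following is a minimal sketch under stated assumptions: the content hash used as a page signature, the 0.5 threshold, and the `fake_fetch` stand-in are all illustrative choices, not details given in the talk.

```python
import hashlib
from urllib.parse import urlencode, urlsplit, parse_qs

def build_urls(base, fixed, varied_param, values):
    """One URL per value of the varied parameter; other parameters stay fixed."""
    return [f"{base}?{urlencode({**fixed, varied_param: v})}" for v in values]

def signature(page_text):
    """Crude content signature: hash of whitespace-normalized text (an assumption)."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def is_informative(urls, fetch, threshold=0.5):
    """Treat a template as informative if enough result pages are distinct.

    `fetch` is a caller-supplied function mapping a URL to page text;
    `threshold` is an illustrative value.
    """
    if not urls:
        return False
    distinct = {signature(fetch(u)) for u in urls}
    return len(distinct) / len(urls) >= threshold

def fake_fetch(url):
    """Stand-in for an HTTP request: results depend on `state` but not on `type`."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return f"jobs in {params['state'][0]}"

base = "http://jobs.shrm.org/search"
state_urls = build_urls(base, {"kw": "", "type": "all"}, "state", ["all", "al", "ak", "wv"])
type_urls = build_urls(base, {"state": "all", "kw": ""}, "type", ["all", "any", "exact"])

state_ok = is_informative(state_urls, fake_fetch)  # pages differ per state
type_ok = is_informative(type_urls, fake_fetch)    # pages identical across types
```

In a real crawler `fetch` would issue rate-limited requests, in keeping with the "don't do anything bad" constraint.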

Current Impact on Query Stream
- Crawled ~3M sites
- 50 languages, hundreds of domains
- 1000 queries per second get results from the deep web!
- 400K forms served per day, 800K per week
- Impact mostly on the long and heavy tail of queries
- Slash-dotted and valley-wagged

Deep Web: Retrospective
- Though much of the deep web is structured data, we don't treat it as such:
  - Techniques for surfacing are mostly semantics-free
  - Ranking is no different than for other pages
- Exceptions and opportunities:
  - Recognizing certain fields (e.g., zipcode)
  - Leveraging lists of domain instances, e.g., wines, chemical terminology, ...

WebTables: Exploring the Relational Web [Cafarella et al., VLDB 2008, WebDB 08]
- In a corpus of 14B raw tables, we estimate 154M are good relations
  - Single-table databases; schema = attribute labels + types
  - Largest corpus of databases & schemas we know of
- The WebTables system:
  - Recovers good relations from the crawl and enables search
  - Builds novel apps on the recovered data

The WebTables System
[Figure: pipeline with stages raw crawled pages, raw HTML tables, recovered relations, inverted index, relation search]
- What are good relations? What is the schema?
- What features are important for ranking? How to index the tables?
- Data is interesting, but there is much more in the structure itself!

Attribute Correlations DB
[Figure: the WebTables pipeline (raw crawled pages, raw HTML tables, recovered relations, inverted index, relation search), extended with an Attribute Correlation Statistics DB]
- 2.6M distinct schemas, 5.4M attributes

  Schema                          Count
  Job-title, company, date          104
  Make, model, year                 916
  Rbi, ab, h, r, bb, avg, slg        12
  Dob, player, height, weight         4
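A table of schema counts like the one above can be derived by a single pass over the recovered relations. The sketch below is a toy illustration, not the WebTables implementation: the function names are hypothetical, and lowercasing labels and ignoring attribute order are simplifying assumptions.

```python
from collections import Counter

def build_schema_counts(tables):
    """Count how often each distinct schema and each attribute occurs.

    `tables` is an iterable of attribute-label lists, one per recovered table.
    Labels are lowercased and order is ignored (simplifying assumptions).
    """
    schema_counts = Counter()
    attr_counts = Counter()
    for labels in tables:
        schema = frozenset(label.strip().lower() for label in labels)
        if not schema:
            continue
        schema_counts[schema] += 1
        attr_counts.update(schema)
    return schema_counts, attr_counts

def prob(attrs, schema_counts):
    """Fraction of tables whose schema contains every attribute in `attrs`."""
    total = sum(schema_counts.values())
    wanted = frozenset(attrs)
    hits = sum(n for schema, n in schema_counts.items() if wanted <= schema)
    return hits / total if total else 0.0

# Toy corpus, invented for illustration.
tables = [
    ["make", "model", "year"],
    ["Make", "Model", "Year"],
    ["make", "model", "year", "price"],
    ["name", "size"],
]
schema_counts, attr_counts = build_schema_counts(tables)
p_make_model = prob({"make", "model"}, schema_counts)  # 3 of 4 tables
```

Probabilities such as `p_make_model` are the kind of statistic the schema auto-complete and synonym discovery applications draw on.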

App #1: Schema Auto-complete
- Useful for traditional schema design for non-expert users
- Input I: attr_0
- Output schema S: attr_1, attr_2, attr_3, ...
- Algorithm:
    Start with S = I
    While p(S - I | I) > t:
      Find the attribute a that maximizes p(a, S | I)
      S = S ∪ {a}

Schema Auto-complete Examples

  Input        Suggested schema
  name         name, size, last-modified, type
  instructor   instructor, time, title, days, room, course
  elected      elected, party, district, incumbent, status, ...
  ab           ab, h, r, bb, so, rbi, avg, lob, hr, pos, batters
  sqft         sqft, price, baths, beds, year, type, lot-sqft, ...

App #2: Synonym Discovery
- Use schema statistics to automatically compute attribute synonyms
  - More complete than a thesaurus
- Given input context attribute set C:
  1. A = all attributes that appear with C
  2. P = all pairs (a, b) where a ∈ A, b ∈ A, a ≠ b
  3. Remove all (a, b) from P where p(a, b) > 0
  4. For each remaining pair (a, b), compute a synonym score
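The four steps above can be sketched over a toy corpus of schemas. The scoring formula for step 4 is not reproduced in this transcript, so the score used here is an assumption: it is higher when both attributes are frequent and when they co-occur with the other attributes in similar proportions. The corpus, the `eps` smoothing constant, and all function names are illustrative.

```python
from itertools import combinations

def prob(attrs, corpus):
    """Fraction of schemas in `corpus` containing every attribute in `attrs`."""
    if not corpus:
        return 0.0
    wanted = set(attrs)
    return sum(1 for schema in corpus if wanted <= schema) / len(corpus)

def cond_prob(z, given, corpus):
    """Estimate p(z | given) from schema co-occurrence."""
    denom = prob(given, corpus)
    return prob(set(given) | {z}, corpus) / denom if denom else 0.0

def synonym_candidates(context, corpus, eps=0.01):
    context = set(context)
    # Step 1: A = attributes that appear together with the context C.
    attrs = set().union(*(s for s in corpus if context <= s)) - context
    scored = []
    # Step 2: all unordered pairs (a, b) with a != b.
    for a, b in combinations(sorted(attrs), 2):
        # Step 3: true synonyms should never share a schema.
        if prob({a, b}, corpus) > 0:
            continue
        # Step 4 (assumed score): compare how a and b co-occur with the
        # remaining attributes; a and b themselves are excluded from the sum.
        diff = sum(
            (cond_prob(z, context | {a}, corpus) - cond_prob(z, context | {b}, corpus)) ** 2
            for z in attrs if z not in (a, b)
        )
        score = prob({a}, corpus) * prob({b}, corpus) / (eps + diff)
        scored.append((score, a, b))
    return sorted(scored, reverse=True)

# Toy corpus, invented for illustration.
corpus = [
    {"name", "email", "phone"},
    {"name", "e-mail", "phone"},
    {"name", "email", "telephone"},
    {"name", "e-mail", "telephone"},
    {"make", "model"},
]
ranked = synonym_candidates({"name"}, corpus)
pairs = {(a, b) for _, a, b in ranked}
```

On this corpus only the pairs `e-mail`/`email` and `phone`/`telephone` survive step 3, since every other pair co-occurs in at least one schema.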

Synonym Discovery Examples

  Context      Synonym pairs
  name         e-mail | email, phone | telephone, e-mail_address | email_address, date | last_modified
  instructor   course-title | title, day | days, course | course-#, course-name | course-title
  elected      candidate | name, presiding-officer | speaker
  ab           k | so, h | hits, avg | ba, name | player
  sqft         bath | baths, list | list-price, bed | beds, price | rent

WebTables: real-time retrospection
- Table data is very useful:
  - E.g., contains lists of entities
  - [Talukdar et al., EMNLP 2008]: creating isa relations from the Web
  - [Elmeleegy et al., VLDB 09]: segmenting lists
- Table search: still an open problem
  - Need to make it part of Web search
  - But also important to support search over large table collections

Octopus: Integration from the Web [Cafarella, Khoussainova, H., VLDB 2009; Elmeleegy, Madhavan, H., VLDB 2009]
Try to create a database of all VLDB program committee members

An Integration Workbench
Operators that combine search, extraction, cleaning and integration:
- Search: finds and clusters relevant data
- Context: retrieves implicit data (e.g., year)
- Split: transforms lists into tables [Elmeleegy et al., 2009]
  - Uses several hints, including WebTables
- Extend: adds new columns

Data Management for the Web Era
- Integrate seamlessly with the Web: search, maps, ...
- Easy to use: much broader user base
- Provide incentives for sharing data
- Facilitate collaboration
- Fusion Tables: our current attempt [Madhavan, Gonzalez, Langen, Shapley, Shen]

Attribution and Description

Visualization

Intensity Map

Collaboration

Merging Data Sets

Discuss Data

Conclusions
- Structured data presents a significant challenge for web search
  - Table search is still an open problem
  - Automatically combining data from multiple tables is an important challenge
- Fusion Tables: a new kind of data management system
  - An opportunity to build and study the ecosystem of structured data on the Web

References
- Fusion Tables: tables.googlelabs.com
- Deep-web crawling: [Madhavan et al., VLDB 08]
- WebTables: [Cafarella et al., VLDB 08]
- Octopus: [Cafarella et al., VLDB 09], [Elmeleegy et al., VLDB 09]