Data Integration for the Relational Web

Similar documents
Structured Data on the Web

: Semantic Web (2013 Fall)

ANNUAL REPORT Visit us at project.eu Supported by. Mission

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

HTML 5 and CSS 3, Illustrated Complete. Unit M: Integrating Social Media Tools

4) DAVE CLARKE. OASIS: Constructing knowledgebases around high resolution images using ontologies and Linked Data

Information Retrieval

Patent Image Retrieval

Milind Kulkarni Research Statement

Data on the Web Best Practices: Challenges and Benefits

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Enhancing applications with Cognitive APIs IBM Corporation

We focus on the backend semantic web database architecture and offer support and other services around that.

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Metadata Management System (MMS)

Open Data Integration. Renée J. Miller

Data driven transformation of the public sector Tallinn, Estonia Head of unit 22 September 2016 European Commission

20489: Developing Microsoft SharePoint Server 2013 Advanced Solutions

AN EFFICIENT PROCESSING OF WEBPAGE METADATA AND DOCUMENTS USING ANNOTATION Sabna N.S 1, Jayaleshmi S 2

The Tagging Tangle: Creating a librarian s guide to tagging. Gillian Hanlon, Information Officer Scottish Library & Information Council

Preservation Planning in the OAIS Model

Florida Coastal Everglades LTER Program

RESOURCE DISCOVERY PAST WORK AND FUTURE PLANS

V Conclusions. V.1 Related work

DITA for Enterprise Business Documents Sub-committee Proposal Background Why an Enterprise Business Documents Sub committee

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Semantic Website Clustering

Europeana Core Service Platform

Criterion 4 Exemplary 3 Very Good 2 Good 1 Substandard Comprehensive/ Web 2.0 tool cannot be used any content area and has

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

1. CONCEPTUAL MODEL 1.1 DOMAIN MODEL 1.2 UML DIAGRAM

/// INTEROPERABILITY BETWEEN METADATA STANDARDS: A REFERENCE IMPLEMENTATION FOR METADATA CATALOGUES

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

MARKETING VOL. 3

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Revisiting Blank Nodes in RDF to Avoid the Semantic Mismatch with SPARQL

IBE101: Introduction to Information Architecture. Hans Fredrik Nordhaug 2008

An Introduction to Search Engines and Web Navigation

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Global Data Plane. The Cloud is not enough: Saving IoT from the Cloud & Toward a Global Data Infrastructure PRESENTED BY MEGHNA BAIJAL

DEVELOPING MICROSOFT SHAREPOINT SERVER 2013 ADVANCED SOLUTIONS. Course: 20489A; Duration: 5 Days; Instructor-led

Searching the Deep Web

MySQL for Beginners Ed 3

Data Virtualization Implementation Methodology and Best Practices

Implementation of a reporting workflow to maintain data lineage for major water resource modelling projects

Working towards a Metadata Federation of CLARIN and DARIAH-DE

DIGIT.B4 Big Data PoC

7.3. In t r o d u c t i o n to m e t a d a t a

SIGIR Workshop Report. The SIGIR Heterogeneous and Distributed Information Retrieval Workshop

Assessment of the progress made in the implementation of and follow-up to the outcomes of the World Summit on the Information Society

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

The MEG Metadata Schemas Registry Schemas and Ontologies: building a Semantic Infrastructure for GRIDs and digital libraries Edinburgh, 16 May 2003

Author Submission Guidelines

Interoperability and transparency The European context

Effective Web Site: Global Standards and Best Practices

Developing Microsoft SharePoint Server 2013 Advanced Solutions

Verint Knowledge Management Solution Brief Overview of the Unique Capabilities and Benefits of Verint Knowledge Management

What is a graph database?

Data Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Deep Web Content Mining

Comprehensive Guide to Evaluating Event Stream Processing Engines

Introduction to Distributed Systems. INF5040/9040 Autumn 2018 Lecturer: Eli Gjørven (ifi/uio)

Building Mobile Applications. F. Ricci 2010/2011

SEARCH ENGINE MARKETING (SEM)

MCSE Productivity. A Success Guide to Prepare- Core Solutions of Microsoft SharePoint Server edusum.com

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value

Marketing & Back Office Management

Microsoft SharePoint Server

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

Principles of Dataspaces

Deliverable D Visualization and analytical tools

GRAPHING BAYOUSIDE CLASSROOM DATA

Automated Tagging for Online Q&A Forums

Searching the Deep Web

Semantics in RDF and SPARQL Some Considerations

Programming in C#

Addressing the Dark Side of Vision Research: Storage

Supporting a Locale of One: Global Content Delivery for the Individual

Organize. Collaborate. Discover. All About Mendeley

HDF5 Single Writer/Multiple Reader Feature Design and Semantics

Big Data Integration for Data Enthusiasts. Jayant Madhavan Structured Data Research Google Inc.

SharePoint Breakfast Session

The European Commission s science and knowledge service. Joint Research Centre

CLOUD-SCALE FILE SYSTEMS

Exploiting Semantics Where We Find Them

KNOW At The Social Book Search Lab 2016 Suggestion Track

The team has extensive expertise in microcontrollers, embedded systems design and wireless technologies like NFC, Bluetooth and Wi-Fi.

Data Warehousing 11g Essentials

Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces

Enhancing Wrapper Usability through Ontology Sharing and Large Scale Cooperation

All Adobe Digital Design Vocabulary Absolute Div Tag Allows you to place any page element exactly where you want it Absolute Link Includes the

Datameer for Data Preparation:

Big Linked Data ETL Benchmark on Cloud Commodity Hardware

Transcription:

Data Integration for the Relational Web Report for the course hy562 Michalis Katsarakis katsarakis@csd.uoc.gr 26 June 2012

Contents Contents 1. Contribution Summary (Περίληψη Συνεισφοράς) 2. Range of Result Application (Εύρος Εφαρμογής Αποτελεσμάτων) 3. Powerful Points (Iσχυρά Σημεία) 4. Weak Points (Αδύναμα Σημεία) 5. Open Topics for Future Research and Possible Extensions (Aνοιχτά θέματα για μελλοντική ερεύνα και πιθανές επεκτάσεις) References

1. Contribution Summary (Περίληψη Συνεισφοράς) The web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases. There is enormous potential in combining an re-purposing this data in creative ways. However, integrating data from the relational web raises several challenges that are not addressed by current data integration systems and mash-up tools. The paper [1], which was published in the conference VLDB2009, presents Octopus, a novel system that enables users to integrate relational data extracted from web pages. Except from the traditional relational database operators (e.g.: project, split, union), Octopus provides its users with three database integration operators, namely Search, Context and Extend. The Search operator performs 2 tasks: it searches the web for relations relevant to the query string that the user provided and then clusters relations with similar schemes together. The Context operator modifies a relation to contain additional columns, using data derived from its source Web page. Finally, the Extend operator enriches a relation with more columns by performing a join with additional relations found on the web. Each Octopus operator is implemented by two or three alternative algorithms, which are benchmarked regarding the quality of their results. Octopus has contributed in the research progress for Data Integration in several ways. It has proved that it is possible to integrate relational data extracted from dirty and incomplete web sources, achieving relatively good quality of results. It has successfully organized the procedure for creating an useful dataset in three operators. For each operator, Octopus has proposed several algorithms, covering a spectrum from naive to very elaborated implementations. For each algorithm, both its computational complexity and the quality of results are examined, giving the reader an overall impression. Finally, in section 4, the authors discuss the capability of Octopus to be implemented at scale. The difficulties stated regard the runtime complexity of the implemented algorithms and their impact on the potential to provide low latency to a mass audience. The authors also describe the techniques they applied in order to overcome these challenges.

2. Range of Result Application (Εύρος Εφαρμογής Αποτελεσμάτων) The Octopus system described in [1] deals with a hot research topic and further progress is needed to be done, in order for new end-user products to be launched. Despite that, the developed algorithms and the results of the performed experiments, as much as the innovative ideas, can be useful for a variety of already existing products (e.g.: search engines, search modules of various applications and information retrieval products). Also, they can be used to support further research on the fields of data integration, information retrieval and knowledge harvesting from the web.

3. Powerful Points (Iσχυρά Σημεία) The basic idea of the paper seems to be very interesting and highly novel. The developed algorithms and experiment results also contain some interesting parts. The writing style used i this paper is very understandable. Always are used the exact words, the appropriate vocabulary and the right terminology. The implementation of the algorithms is included in the paper in the form of figures, which are supported by detailed captions. In addition to that, all algorithms are explained using the appropriate amount of detail, allowing the reader to fully understand them, while avoiding to become verbose. The goal and the followed procedure of every experiment is defined clearly. The results are presented using clear plots or tables and are described well, explaining every interesting or strange result. The above merits of this paper successfully make a positive impression to the reader, encouraging it to read more.

4. Weak Points (Αδύναμα Σημεία) Section 4 discusses the capability of Octopus to be implemented at scale. Despite the mentioned complexity limitations of several algorithms and the used solutions, it would be interesting to include some performance tests and include measurements like the mean delay of each algorithm for given equipment versus the size of the web corpus or the cardinality of the query words. Such a performance analysis would let the reader immediately understand which algorithms could be immediately adopted, which are their scalability limitations and which algorithms need further improvements. Except from the above, there are not any other weak points detected. The overall rating of [1] should be high, mentioning this with confidence.

5. Open Topics for Future Research and Possible Extensions (Aνοιχτά θέματα για μελλοντική ερεύνα και πιθανές επεκτάσεις) As mentioned in the previous sections, [1] discusses an open topic of research and thus there are many possibilities for further work. Of course, there are open topics in the optimization of the already implemented algorithms and procedures, but these topics are mentioned in the cited paper. This section focuses in further extensions of the Octopus system. The first task of Octopus is to extract relational tables from web pages. In its current state, it can retrieve such relations from HTML tables and HTML lists. These two capabilities can provide us a large amount of relations. Despite that, there is room to implement more tools, capable of extracting data from more kinds of elements. Crawling data which is hidden in back-end databases and accessed through valid form submissions, socially created data sets or officestyle documents would provide us with new kinds of information, currently unavailable. Another interesting topic would be to keep anonymous statistics about Octopus users activity and request from them feedback about the quality of the data they finally managed to achieve. By recording such feedback, it would be possible to detect which data sources (or combinations of data sources) result high-quality relations. This knowledge could be used to promote qualitative sources in the ranking phase of the Search operator, or even to cache the qualitative results and be able to re-serve them quickly in case of a similar query. One more idea would be be to develop a cloud based service, that would facilitate the creation and publication of structured data on the Web, therefore complementing the Octopus system. Despite from extracting data from HTML tables, lists or other structures, it would be useful to recognize column names, table schemes and any kind of metadata that could accompany the extracted data. This would open the way to implement, in addition to the graphical user interface, a semantic web service, enabling the automatic use of Octopus services by other applications.

References [1] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data Integration for the Relational Web. In VLDB, 2009.