By Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy and Pedro Domingos. Presented by Yael Kazaz

Example: Merging Real-Estate Agencies
Two real-estate agencies, S and T, decide to merge
Schema T has one table: Listings
Schema S has two tables: Houses and Agents
Merging schema S into schema T

Example: Making Tuples Using SQL
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
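The four matches above can be tried out on toy data. The following is a minimal sketch, assuming an in-memory SQLite database; the sample rows, the underscore column names (fee_rate, agent_id) and the ", " separator inside the concatenation are illustrative assumptions, not part of the slides.

```python
import sqlite3

# Toy source schema S (HOUSES, AGENTS); all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HOUSES (location TEXT, price REAL, agent_id INTEGER);
    CREATE TABLE AGENTS (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
    INSERT INTO HOUSES VALUES ('Atlanta, GA', 360000, 32);
    INSERT INTO AGENTS VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03);
""")

def listings_rows(connection):
    """Build LISTINGS tuples for target schema T from the four matches."""
    query = """
        SELECT h.location                 AS area,           -- 1-1 match
               a.name                     AS agent_name,     -- 1-1 match
               a.city || ', ' || a.state  AS agent_address,  -- complex match (concat)
               h.price * (1 + a.fee_rate) AS list_price      -- complex match (arithmetic)
        FROM HOUSES h JOIN AGENTS a ON h.agent_id = a.id
    """
    return connection.execute(query).fetchall()

rows = listings_rows(conn)
```

SQLite uses the `||` operator for string concatenation, which stands in for concat(city, state) here.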

Motivation
Creating matches between data sources is important
Manually creating matches is hard
Past attempts deal only with one-to-one (1-1) matches
  For example: Address = Location
  For example: Room_Price = Room_Rate
Complex matches are not considered
  For example: Address = concat(city, state)
  For example: Room_Price = Room_Rate * (1 + Tax_Rate)

Introducing the iMAP System
Semi-automatically discovering:
  1-1 matches
  Complex matches
Semi-automatically constructing complex matches is very important, since complex matches make up as much as half of all matches!

Complex Matches
Creating complex matches is harder than creating 1-1 matches: the match space can be very large or even infinite!
The number of 1-1 matches is bounded
The number of complex matches is not: there is an unbounded number of functions for combining attributes in a schema

Example: 1-1 Matches
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
The 1-1 matches are area and agent-name: each maps a single source attribute to the target attribute

Example: Complex Matches
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
The complex matches are agent-address and list-price: each combines several source attributes

Overview

Match Generator (1) Input: Target schema (T) Source schema (S) Output: Match candidates

Match Generator (2)
What the Match Generator does:
The Match Generator takes two schemas as input: S and T
For each attribute t of T, it generates a set of match candidates: 1-1 and complex matches
The generation is guided by a set of search modules

Match Generator (3)
For example, for t = area in T the candidates are:
location in HOUSES
name in AGENTS
state in AGENTS
concat(city, state) in AGENTS
concat(name, city) in AGENTS

Match Generator (4) PROBLEM: Unbounded number of match candidates SOLUTION: Search the space of possible matches HOW: Use search modules, called searchers, each in charge of a specific type of attribute

Match Generator (5)
Implemented searchers in iMAP:
The searchers cover many complex match types: text, numeric, category, etc.
The searchers evaluate match candidates and exploit domain knowledge, such as domain constraints and overlap data

Match Generator (6) Applying search to candidate generation requires addressing three issues: (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition

Match Generator (7)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
The search space can be very large or even unbounded
We need to search such spaces efficiently
iMAP addresses this problem using beam search

Match Generator (8)
Beam Search Example (K = 3)
[Figure: search tree rooted at A with scored nodes B through H, expanded level by level]
Beam contents at successive steps: S = {A}, S = {B C D}, S = {B C E}, S = {C E H}

Match Generator (9)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
We use beam search to search the candidate matches
Beam search uses a scoring function to evaluate each match candidate
At each level of the search tree, it keeps only the k highest-scoring match candidates
The searcher can therefore conduct an efficient search in any type of search space
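The beam search described above can be sketched as follows. This is a minimal illustration, not iMAP's implementation: the candidate representation (tuples of attribute names standing for a concatenation), the `expand` function and the Jaccard-style `score` are all hypothetical stand-ins for a real searcher and its scoring function.

```python
def beam_search(start, expand, score, k, max_levels):
    """Keep only the k highest-scoring candidates at each level of the search tree."""
    beam = sorted(start, key=score, reverse=True)[:k]
    seen = list(beam)
    for _ in range(max_levels):
        # Expand every candidate in the beam; dict.fromkeys removes duplicates in order.
        children = list(dict.fromkeys(c for cand in beam for c in expand(cand)))
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:k]
        seen.extend(beam)
    # Return the k best candidates encountered at any level.
    return sorted(dict.fromkeys(seen), key=score, reverse=True)[:k]

# Hypothetical toy setup: candidates are concatenations of AGENTS attributes.
ATTRS = ["name", "city", "state"]
TARGET = {"city", "state"}  # assumed "ideal" attribute set, used only for scoring

def expand(cand):
    """Extend a concatenation with one attribute it does not use yet."""
    return [cand + (a,) for a in ATTRS if a not in cand]

def score(cand):
    """Toy score: Jaccard overlap between the candidate's attributes and TARGET."""
    used = set(cand)
    return len(used & TARGET) / len(used | TARGET)

best = beam_search([(a,) for a in ATTRS], expand, score, k=3, max_levels=2)
```

With this toy scoring, the top candidate is the concatenation of city and state; a real searcher would score candidates against data instead.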

Match Generator (10)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
To conduct beam search, we assign each match candidate a score that approximates the distance between it and the target attribute
For example: given the match candidate concat(city, state), we approximate the distance between it and the target attribute agent-address

Match Generator (11)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
iMAP uses several techniques to compute the candidate scores:
Machine learning
Statistics
Heuristics

Match Generator (12) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition The search space can be unbounded We need to decide when to stop the search We terminate when we start seeing diminishing returns from our search

Match Generator (13)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
In the i-th iteration we keep track of Max_i, the highest score of the candidate matches
If (Max_{i+1} - Max_i) < threshold, we stop the search and return the k highest-scoring match candidates
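The stopping rule can be sketched in a few lines. The function name and the sample scores below are illustrative assumptions; the only part taken from the slide is the test on the difference between consecutive maximum scores.

```python
def stopping_iteration(max_scores, threshold):
    """Return the first iteration i+1 at which Max_{i+1} - Max_i < threshold.

    max_scores[i] is the highest candidate score seen in iteration i.
    If the improvement never drops below the threshold, return the last iteration.
    """
    for i in range(len(max_scores) - 1):
        if max_scores[i + 1] - max_scores[i] < threshold:
            return i + 1
    return len(max_scores) - 1

# Invented per-iteration maxima: the gains shrink from 0.22 to 0.09 to 0.01.
stop_at = stopping_iteration([0.40, 0.62, 0.71, 0.72, 0.72], threshold=0.05)
```

Here the search stops at iteration 3, the first one whose improvement over the previous iteration falls below 0.05.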

Overview

Similarity Estimator (1) Input: Match candidates Output: Similarity matrix

Similarity Estimator (2)
What the Similarity Estimator does:
Computes for each candidate a score of similarity to attribute t of T
The output of this module is a matrix that stores the similarity score of the pairs <target attribute, match candidate>

                    Target attribute t1   Target attribute t2   Target attribute t3
Match candidate 1   0                     0.5                   0
Match candidate 2   0                     0.2                   0.7
Match candidate 3   0.8                   0.3                   0....

Similarity Estimator (3)

                name   concat(city, state)   price   price * (1 + fee-rate)
area            0.5    0.8                   0       0
list-price      0      0                     0.7     0.9
agent-address   0.5    0.8                   0       0
agent-name      0.9    0.6                   0       0

Similarity Estimator (4)
PROBLEM: Deciding which candidate is better than another
SOLUTION: For each target attribute t of T, the searchers suggest a set of match candidates
HOW: The scores assigned to each of the candidate matches are based on only a single type of information
For example: the text searcher considers only word frequencies

Similarity Estimator (5)
The Similarity Estimator evaluates these candidates and assigns each of them a final score
The Similarity Estimator exploits additional types of information to compute a more accurate score
The Similarity Estimator employs evaluator modules that each exploit a type of information, and then combines the suggested scores into one final score
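One simple way to combine evaluator scores is a weighted average. The slides do not say which combination rule iMAP uses, so the averaging below, the evaluator names and the sample scores are all assumptions made for illustration.

```python
def combine_scores(evaluator_scores, weights=None):
    """Combine per-evaluator scores for one candidate into a single final score.

    evaluator_scores maps an evaluator name to its suggested score.
    weights maps the same names to non-negative weights (default: equal weights).
    """
    if weights is None:
        weights = {name: 1.0 for name in evaluator_scores}
    total_weight = sum(weights[name] for name in evaluator_scores)
    weighted_sum = sum(weights[name] * s for name, s in evaluator_scores.items())
    return weighted_sum / total_weight

# Hypothetical evaluators scoring the candidate concat(city, state) for agent-address.
final = combine_scores({"name_based": 0.4, "content_based": 0.8})
weighted = combine_scores({"name_based": 0.4, "content_based": 0.8},
                          weights={"name_based": 1.0, "content_based": 3.0})
```

Giving the content-based evaluator three times the weight moves the final score from 0.6 to 0.7.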

Overview

Match Selector (1) Input: Similarity matrix Output: 1-1 matches Complex matches

Match Selector (2)
What the Match Selector does:
Examines the Similarity Matrix and outputs the best matches for the attributes of T, under certain constraints

                name   concat(city, state)   price   price * (1 + fee-rate)
area            0.5    0.8                   0       0
list-price      0      0                     0.7     0.9
agent-address   0.5    0.8                   0       0
agent-name      0.9    0.6                   0       0

Match Selector (3) PROBLEM: How to match the best candidate to the attribute SOLUTION: The best global match could be where each target attribute is assigned the match with the highest score BUT This match assignment may not be acceptable because it may violate domain constraints

Match Selector (4)
For example:
Domain constraint: name and city in AGENTS have no relation
Result: The tuple area = SELECT concat(name, city) FROM AGENTS is not selected
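Constraint-aware selection can be sketched as picking, for each target attribute, the highest-scoring candidate that violates no constraint. This greedy per-attribute rule is a simplification chosen for illustration; the scores for the area row are invented, and the function and parameter names are hypothetical.

```python
def select_matches(similarity, candidate_attrs, unrelated_pairs):
    """For each target attribute, pick the best candidate that respects the constraints.

    similarity: {target attribute: {candidate: score}}
    candidate_attrs: {candidate: set of source attributes used in its formula}
    unrelated_pairs: pairs of source attributes that may not share a formula
    """
    selected = {}
    for target, row in similarity.items():
        ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
        for candidate, _score in ranked:
            attrs = candidate_attrs[candidate]
            if any(a in attrs and b in attrs for a, b in unrelated_pairs):
                continue  # violates a domain constraint, try the next candidate
            selected[target] = candidate
            break
    return selected

similarity = {
    "area": {"concat(name, city)": 0.9, "location": 0.7},  # invented scores
    "agent-address": {"name": 0.5, "concat(city, state)": 0.8},
}
candidate_attrs = {
    "concat(name, city)": {"name", "city"},
    "location": {"location"},
    "name": {"name"},
    "concat(city, state)": {"city", "state"},
}
chosen = select_matches(similarity, candidate_attrs, unrelated_pairs=[("name", "city")])
```

Although concat(name, city) has the highest score for area in this toy matrix, it is skipped because name and city are declared unrelated.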

Overview

Exploiting Domain Knowledge (1)
Exploiting domain knowledge is beneficial for 1-1 matching
For complex matching it is even more crucial: it enables early detection of unlikely matches

Exploiting Domain Knowledge (2)
For example: iMAP learns that the number of real-estate agents in a specific area is bounded by 50
Given the match agent-name = concat(first-name, last-name), where first-name and last-name belong to the home owner, iMAP will realize that concat(first-name, last-name) results in hundreds of distinct names
Conclusion: concat(first-name, last-name) is unlikely to match agent-name
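The distinct-count check in this example can be sketched as below. The bound of 50 comes from the slide; the function name and the generated sample values are assumptions made for illustration.

```python
def exceeds_distinct_bound(values, max_distinct):
    """Return True if a candidate produces more distinct values than the bound allows."""
    return len(set(values)) > max_distinct

# Invented data: 300 listings, each with a different home owner,
# but only 20 different agents handling them.
owner_names = [f"Owner{i} Surname{i}" for i in range(300)]
agent_names = [f"Agent{i % 20}" for i in range(300)]

owner_candidate_unlikely = exceeds_distinct_bound(owner_names, max_distinct=50)
agent_candidate_unlikely = exceeds_distinct_bound(agent_names, max_distinct=50)
```

A candidate that fails this check can be discarded early, before more expensive evaluation.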

Exploiting Domain Knowledge (3)
iMAP exploits domain knowledge:
Domain constraints
Past complex matches
Overlap data
External data

Exploiting Domain Knowledge (4)
iMAP exploits domain knowledge:
Domain constraints
Past complex matches
Overlap data
External data

Exploiting Domain Knowledge (5)
Constraints are either:
Present in the schema
  For example: agent-name is text and amount is numeric only
Provided by an expert
  For example: The tax on a room cannot be less than 7%
Provided by the user
  For example: We do not sell houses that cost less than $200,000

Exploiting Domain Knowledge (6)
iMAP considers 3 kinds of constraints:
Two attributes are unrelated
  For example: name and beds are unrelated, meaning that they cannot appear in the same match formula
Constraint on a single attribute
  For example: the average value of numrooms does not exceed 10
Multiple schema attributes are unrelated
  For example: area and agent-name are unrelated

Exploiting Domain Knowledge (7)
iMAP exploits domain knowledge:
Domain constraints
Past complex matches
Overlap data
External data

Exploiting Domain Knowledge (8)
Sometimes, the same or similar schemas are mapped repeatedly
iMAP extracts the expression template of these past matches and uses it to guide the search process
For example: Given the past match price = pr * (1 + 0.6), iMAP will extract VARIABLE * (1 + CONSTANT) and ask the numeric searcher to look for matches that fit that template
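Template extraction from a past match can be sketched with a regular expression. This is an illustrative guess at the mechanics, not iMAP's actual procedure: identifiers become VARIABLE, numeric literals become CONSTANT, and the literal 1 is kept as part of the template shape purely so that the output matches the slide's example.

```python
import re

# Identifiers (hyphens allowed, as in fee-rate) or numeric literals.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*|\d+(?:\.\d+)?")

def extract_template(expression):
    """Replace attribute names with VARIABLE and numeric literals with CONSTANT."""
    def replace(match):
        token = match.group(0)
        if token[0].isalpha() or token[0] == "_":
            return "VARIABLE"
        # Assumption: the literal 1 is structural and stays in the template.
        return token if token == "1" else "CONSTANT"
    return TOKEN.sub(replace, expression)

template = extract_template("pr * (1 + 0.6)")
```

The same function maps price * (1 + 0.07) to the same template, so both past matches would point the numeric searcher at one expression shape.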

Exploiting Domain Knowledge (9)
iMAP exploits domain knowledge:
Domain constraints
Past complex matches
Overlap data
External data

Exploiting Domain Knowledge (10) There are many cases where the source (S) and target (T) share the same data For example: database S and T share a house listing ( Atlanta, GA, ) In such overlap cases the shared data provides valuable information for the mapping process

Exploiting Domain Knowledge (11) Searchers that exploit overlap data: Overlap Text Searcher Overlap Numeric Searcher Overlap Category & Schema Mismatch Searchers

Exploiting Domain Knowledge (12)
HOW THE SEARCHERS WORK:
Step 1: Use the original searchers for an initial mapping
Step 2: Use the overlap data to re-evaluate the mappings for improved matching accuracy
For example: when re-evaluated, the mapping agent-address = location receives score 0 because it is not correct for the shared house listing ( Atlanta, GA, ), and agent-address = concat(city, state) receives score 1
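Step 2 can be sketched as scoring each candidate by the fraction of shared listings for which it reproduces the target value. The scoring rule, the function names, the ", " separator and the sample listing are assumptions made for illustration; the slides only state that incorrect candidates score 0 and correct ones score 1 on the shared data.

```python
def overlap_score(candidate, shared_pairs, target_attr):
    """Fraction of shared listings where the candidate reproduces the target value.

    candidate: function computing a value from a source row
    shared_pairs: list of (source row, target row) describing the same listing
    """
    if not shared_pairs:
        return 0.0
    hits = sum(1 for source_row, target_row in shared_pairs
               if candidate(source_row) == target_row[target_attr])
    return hits / len(shared_pairs)

# One invented listing that appears in both S and T.
shared_pairs = [
    ({"location": "Midtown", "city": "Atlanta", "state": "GA"},
     {"agent-address": "Atlanta, GA"}),
]
candidates = {
    "location": lambda row: row["location"],
    "concat(city, state)": lambda row: row["city"] + ", " + row["state"],
}
rescored = {name: overlap_score(fn, shared_pairs, "agent-address")
            for name, fn in candidates.items()}
```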

Exploiting Domain Knowledge (13)
iMAP exploits domain knowledge:
Domain constraints
Past complex matches
Overlap data
External data

Exploiting Domain Knowledge (14) External data is used as additional constraints on the attributes of a schema Usually provided by experts Can be very useful in schema matching

Exploiting Domain Knowledge (15) HOW EXPLOITING EXTERNAL DATA WORKS: Step 1: Use external data to learn about the feature Step 2: Apply the learned information to evaluate matches for the target attribute For example: Target attribute: agent-name A feature that can be potentially useful in schema matching: number of distinct agent names

Overview

Generating Explanations (1)
Example: In matching real-estate schemas, for the attribute list-price, iMAP has produced the matches:
list-price = price
list-price = price * (1 + monthly-fee-rate)

Generating Explanations (2)
PROBLEM: The user is uncertain which of the two is the correct match
SOLUTION: iMAP must explain the ranking
For example: why did list-price = price get a higher rank than list-price = price * (1 + monthly-fee-rate)?

Generating Explanations (3)
iMAP's goal is to provide an environment where a human user can quickly generate a mapping between a pair of schemas
For a user to know which match to choose, iMAP must supply an explanation for each of the matches

Generating Explanations (4)
iMAP considers 3 questions:
Explain existing match: Why does the match exist?
  For example: Why does the match month-posted = monthly-fee-rate exist?
Explain absent match: Why doesn't the match exist?
Explain match ranking: Why is one match better than another?

Generating Explanations (5)
iMAP keeps track of the decision-making process in a dependency graph
Each node is one of the following:
Schema attribute
Assumption
Candidate match
Domain knowledge
An edge between two nodes means that one node leads to another
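A dependency graph of this kind can be sketched as a small class that records which nodes lead to which, and answers "why does this match exist?" by walking back along the edges. The class, its method names and the example nodes and edges are hypothetical; only the four node kinds and the meaning of an edge come from the slide.

```python
from collections import defaultdict, deque

class DependencyGraph:
    """Records which nodes led to which, so a match can be traced back to its causes."""

    def __init__(self):
        self.kind = {}                    # node -> one of the four node kinds
        self.parents = defaultdict(list)  # node -> nodes that lead to it

    def add_node(self, name, kind):
        self.kind[name] = kind

    def add_edge(self, source, destination):
        """Record that `source` leads to `destination`."""
        self.parents[destination].append(source)

    def explain(self, node):
        """Return every node that leads to `node`, nearest causes first."""
        order, seen = [], set()
        queue = deque(self.parents.get(node, []))
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            order.append(current)
            queue.extend(self.parents.get(current, []))
        return order

# Invented example: how a complex candidate for list-price might have arisen.
graph = DependencyGraph()
match = "list-price = price * (1 + monthly-fee-rate)"
graph.add_node("price", "schema attribute")
graph.add_node("monthly-fee-rate", "schema attribute")
graph.add_node("template VARIABLE * (1 + CONSTANT)", "domain knowledge")
graph.add_node(match, "candidate match")
for cause in ["price", "monthly-fee-rate", "template VARIABLE * (1 + CONSTANT)"]:
    graph.add_edge(cause, match)

reasons = graph.explain(match)
```

Walking the edges backwards from a candidate match yields the attributes, assumptions and domain knowledge that produced it, which is the raw material for an explanation.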

Generating Explanations (6)

Conclusions (1)
Matches are key to enabling a wide variety of data sharing and exchange scenarios
The majority of the research on schema matching has focused on 1-1 matches
iMAP offers a solution to the problem of finding complex matches
The key challenge with complex matches is that the space of possible match candidates may be unbounded, and evaluating each candidate is harder

Conclusions (2)
The architecture of iMAP is modular and extensible
New searchers and new evaluation modules can be added easily
Experimental results show that iMAP achieves 43-92% accuracy on several real-world domains, demonstrating the promise of the approach