By Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy and Pedro Domingos. Presented by Yael Kazaz


1 By Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy and Pedro Domingos Presented by Yael Kazaz

2 Example: Merging Real-Estate Agencies
Two real-estate agencies, S and T, decide to merge.
Schema T has one table: Listings.
Schema S has two tables: Houses and Agents.
Merging schema S into schema T.

3 Example: Making Tuples Using SQL
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
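The four mappings above can be tried out on toy data. What follows is a minimal sketch using Python's built-in sqlite3 module. The table contents, the underscore column names (SQLite does not accept hyphens in unquoted identifiers), the helper name build_listings, and the ", " separator in the concatenation are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# Toy source schema S (HOUSES, AGENTS); all rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO houses VALUES ('Atlanta, GA', 360000, 32);
INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03);
""")


def build_listings(connection):
    """Produce target tuples (area, list_price, agent_name, agent_address)."""
    query = """
        SELECT h.location                 AS area,
               h.price * (1 + a.fee_rate) AS list_price,
               a.name                     AS agent_name,
               a.city || ', ' || a.state  AS agent_address
        FROM houses h JOIN agents a ON h.agent_id = a.id
    """
    return connection.execute(query).fetchall()


for row in build_listings(conn):
    print(row)
```

The first and third columns are 1-1 matches, while list_price and agent_address each combine more than one source attribute.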

4 Motivation
Creating matches between data sources is important.
Manually creating matches is hard.
Past attempts deal only with one-to-one (1-1) matches:
  For example: Address = Location
  For example: Room_Price = Room_Rate
Complex matches are not considered:
  For example: Address = concat(city, state)
  For example: Room_Price = Room_Rate * (1 + Tax_Rate)

5 Introducing the iMAP System
iMAP semi-automatically discovers:
  1-1 matches
  Complex matches
Semi-automatically constructing complex matches is very important, since complex matches make up as many as half of all matches.

6 Complex Matches
Creating complex matches is harder than creating 1-1 matches: the match space can be very large or even infinite.
The number of 1-1 matches is bounded.
The number of complex matches is not: there is an unbounded number of functions for combining the attributes of a schema.

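7 Example: 1-1 Matches
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
The first two mappings (area, agent-name) are the 1-1 matches: each target attribute maps to a single source attribute.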

8 Example: Complex Matches
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
The last two mappings (agent-address, list-price) are the complex matches: each combines several source attributes.

10 Overview

11 Match Generator (1)
Input: target schema (T), source schema (S)
Output: match candidates

12 Match Generator (2)
What the Match Generator does:
The Match Generator takes two schemas, S and T, as input.
For each attribute t of T, it generates a set of match candidates: 1-1 and complex matches.
The generation is guided by a set of search modules.

13 Match Generator (3)
For example, for t = area in T the candidates are:
  location in HOUSES
  name in AGENTS
  state in AGENTS
  concat(city, state) in AGENTS
  concat(name, city) in AGENTS

14 Match Generator (4)
PROBLEM: The number of match candidates is unbounded.
SOLUTION: Search the space of possible matches.
HOW: Use search modules, called searchers, each in charge of a specific type of attribute.

15 Match Generator (5)
Implemented searchers in iMAP:
The searchers cover many complex match types: text, numeric, category, etc.
The searchers evaluate match candidates and exploit domain knowledge, such as domain constraints and overlap data.

16 Match Generator (6)
Applying search to candidate generation requires addressing three issues:
(1) Search strategy
(2) Evaluation of candidate matches
(3) Termination condition

17 Match Generator (7)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
The search space can be very large or even unbounded.
We need to search such spaces efficiently.
iMAP addresses this problem using beam search.

18 Match Generator (8)
Beam Search Example (K = 3)
[Figure: search tree rooted at A, expanded step by step, with node scores shown in parentheses.]
Beam contents after each step: S = {A}, S = {B, C, D}, S = {B, C, E}, S = {C, E, H}

19 Match Generator (9)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
We use beam search to search for candidate matches.
Beam search uses a scoring function to evaluate each match candidate.
At each level of the search tree, it keeps only the k highest-scoring matches.
This lets the searcher conduct an efficient search in any type of search space.
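The idea of keeping only the k highest-scoring candidates per level can be sketched as follows. This is a generic, level-wise beam search and not necessarily the exact procedure iMAP uses. The attribute list, the expand function (which grows a concatenation by one attribute), and the toy score function are all hypothetical stand-ins for a real searcher.

```python
import heapq

ATTRIBUTES = ["name", "city", "state"]   # hypothetical source attributes
TARGET_PARTS = {"city", "state"}         # toy stand-in for "what agent-address looks like"


def expand(candidate):
    """Grow a concatenation candidate by one attribute it does not use yet."""
    return [candidate + (attr,) for attr in ATTRIBUTES if attr not in candidate]


def score(candidate):
    """Toy score in [0, 1]: overlap between the candidate's attributes and the target's."""
    parts = set(candidate)
    return len(parts & TARGET_PARTS) / len(parts | TARGET_PARTS)


def beam_search(seeds, expand, score, k=3, max_levels=5):
    """Level-wise beam search that keeps the k highest-scoring candidates per level."""
    beam = heapq.nlargest(k, seeds, key=score)
    best = list(beam)
    for _ in range(max_levels):
        # Expand every candidate in the beam; dict.fromkeys removes duplicates in order.
        children = list(dict.fromkeys(c for cand in beam for c in expand(cand)))
        if not children:
            break
        beam = heapq.nlargest(k, children, key=score)
        # Remember the k best candidates seen at any level.
        best = heapq.nlargest(k, dict.fromkeys(best + beam), key=score)
    return best


seeds = [(attr,) for attr in ATTRIBUTES]
for cand in beam_search(seeds, expand, score, k=3):
    print("concat(" + ", ".join(cand) + ")", round(score(cand), 2))
```

With k fixed, the work per level is bounded no matter how large the full search space is, which is the property the slide relies on.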

20 Match Generator (10)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
To conduct beam search, we assign each match candidate a score that approximates the distance between it and the target attribute.
For example: given the match candidate concat(city, state), we approximate the distance between it and the target attribute agent-address.

21 Match Generator (11)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
iMAP uses several techniques to compute the candidate scores:
  Machine learning
  Statistics
  Heuristics

22 Match Generator (12)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
The search space can be unbounded.
We need to decide when to stop the search.
We terminate when we start seeing diminishing returns from our search.

23 Match Generator (13)
(1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
In the i-th iteration we keep track of Max_i, the highest score among the candidate matches.
If (Max_(i+1) - Max_i) < threshold, we stop the search and return the k highest-scoring match candidates.
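The stopping rule can be written down in a few lines. This is a minimal sketch; the function name stop_iteration and the sample scores are hypothetical, and only the test Max_(i+1) - Max_i < threshold comes from the slide.

```python
def stop_iteration(max_scores, threshold):
    """Return the first iteration i+1 where Max_(i+1) - Max_i < threshold, else None.

    max_scores[i] is the highest candidate score seen in iteration i.
    """
    for i in range(len(max_scores) - 1):
        if max_scores[i + 1] - max_scores[i] < threshold:
            return i + 1
    return None


# Hypothetical per-iteration maxima: the gain shrinks from 0.22 to 0.08 to 0.01.
print(stop_iteration([0.40, 0.62, 0.70, 0.71], threshold=0.05))
```

A larger threshold stops earlier and explores less of the space; a smaller one keeps searching for marginal gains.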

24 Overview

25 Similarity Estimator (1)
Input: match candidates
Output: similarity matrix

26 Similarity Estimator (2)
What the Similarity Estimator does:
Computes, for each candidate, a score of its similarity to attribute t of T.
The output of this module is a matrix that stores the similarity score of each pair <target attribute, match candidate>.
[Figure: similarity matrix indexed by target attributes (t1, t2, t3, ...) and match candidates.]

27 Similarity Estimator (3)
[Figure: similarity matrix. Match candidates: name, concat(city, state), price, price * (1 + fee-rate). Target attributes: area, list-price, agent-address, agent-name.]

28 Similarity Estimator (4)
PROBLEM: Deciding which candidate is better than another.
SOLUTION: For each target attribute t of T, the searchers suggest a set of match candidates.
HOW: The score assigned to each candidate match is based on only a single type of information.
  For example: the text searcher considers only word frequencies.

29 Similarity Estimator (5)
The Similarity Estimator evaluates these candidates and assigns each of them a final score.
It exploits additional types of information to compute a more accurate score.
It employs evaluator modules, each of which exploits one type of information, and then combines the suggested scores into one final score.
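The slide does not say how the evaluator scores are combined, so the sketch below assumes a simple weighted average. The evaluator names ("text", "name"), the weights, and the sample scores are hypothetical.

```python
def combine_scores(evaluator_scores, weights=None):
    """Combine per-evaluator scores into one final score (weighted average, an assumption)."""
    if weights is None:
        weights = {name: 1.0 for name in evaluator_scores}
    total_weight = sum(weights[name] for name in evaluator_scores)
    weighted_sum = sum(weights[name] * s for name, s in evaluator_scores.items())
    return weighted_sum / total_weight


def build_similarity_matrix(raw_scores):
    """Map {target: {candidate: {evaluator: score}}} to {target: {candidate: final score}}."""
    return {
        target: {cand: combine_scores(scores) for cand, scores in candidates.items()}
        for target, candidates in raw_scores.items()
    }


raw = {
    "agent-address": {
        "concat(city, state)": {"text": 0.9, "name": 0.7},
        "location": {"text": 0.6, "name": 0.2},
    }
}
print(build_similarity_matrix(raw))
```

The resulting nested dictionary plays the role of the similarity matrix: one final score per <target attribute, match candidate> pair.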

30 Overview

31 Match Selector (1)
Input: similarity matrix
Output: 1-1 matches, complex matches

32 Match Selector (2)
What the Match Selector does:
Examines the similarity matrix and outputs the best matches for the attributes of T, under certain constraints.
[Figure: similarity matrix. Match candidates: name, concat(city, state), price, price * (1 + fee-rate). Target attributes: area, list-price, agent-address, agent-name.]

33 Match Selector (3)
PROBLEM: How to choose the best candidate for each attribute.
SOLUTION: The best global assignment could be the one in which each target attribute is assigned its highest-scoring match.
BUT: This assignment may not be acceptable, because it may violate domain constraints.

34 The iMAP Architecture: Match Selector (4)
For example:
Domain constraint: name and city in AGENTS have no relation.
Result: the candidate area = SELECT concat(name, city) FROM AGENTS is not selected.
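Selection under an "unrelated attributes" constraint can be sketched as picking, for each target attribute, the highest-scoring candidate that does not combine a forbidden pair. This is a minimal illustration and not iMAP's actual selector. Candidates are represented by the tuple of source attributes they use, and the scores are made up.

```python
def select_matches(similarity, unrelated_pairs):
    """Pick the best allowed candidate per target attribute.

    similarity: {target: {candidate_attributes (tuple): score}}
    unrelated_pairs: set of frozensets naming attributes that must not be combined
    """
    def violates(attributes):
        used = set(attributes)
        return any(pair <= used for pair in unrelated_pairs)

    selected = {}
    for target, candidates in similarity.items():
        allowed = {cand: s for cand, s in candidates.items() if not violates(cand)}
        if allowed:
            selected[target] = max(allowed, key=allowed.get)
    return selected


similarity = {"area": {("name", "city"): 0.9, ("location",): 0.8}}
unrelated = {frozenset({"name", "city"})}
print(select_matches(similarity, unrelated))
```

Here concat(name, city) has the higher raw score but is rejected by the constraint, so location is selected for area.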

35 Overview

36 Exploiting Domain Knowledge (1)
Exploiting domain knowledge is beneficial for 1-1 matching.
For complex matching it is even more crucial, because it allows early detection of unlikely matches.

37 Exploiting Domain Knowledge (2)
For example: iMAP learns that the number of real-estate agents in a specific area is bounded by 50.
Given the match agent-name = concat(first-name, last-name), where first-name and last-name belong to the home owner, iMAP will realize that concat(first-name, last-name) results in hundreds of distinct names.
Conclusion: concat(first-name, last-name) is unlikely to match agent-name.
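The check described here amounts to counting distinct values. A minimal sketch follows, with made-up owner names and a hypothetical helper name; only the bound of 50 comes from the slide.

```python
def exceeds_distinct_bound(values, max_distinct):
    """True if a candidate produces more distinct values than the target is known to allow."""
    return len(set(values)) > max_distinct


# Hypothetical home-owner names: 200 distinct concat(first-name, last-name) values.
owner_names = [f"Owner{i} Smith" for i in range(200)]

if exceeds_distinct_bound(owner_names, max_distinct=50):
    print("concat(first-name, last-name) is unlikely to match agent-name")
```

Because the test needs only the candidate's values, it can run early in the search and prune the candidate before any further scoring.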

38 Exploiting Domain Knowledge (3)
iMAP exploits domain knowledge:
  Domain constraints
  Past complex matches
  Overlap data
  External data

39 Exploiting Domain Knowledge (4)
iMAP exploits domain knowledge:
  Domain constraints
  Past complex matches
  Overlap data
  External data

40 Exploiting Domain Knowledge (5)
Constraints are either:
Present in the schema
  For example: agent-name is text and amount is numeric only.
Provided by an expert
  For example: the tax on a room cannot be less than 7%.
Provided by the user
  For example: we do not sell houses that cost less than $200,000.

41 Exploiting Domain Knowledge (6)
iMAP considers 3 kinds of constraints:
Two attributes are unrelated
  For example: name and beds are unrelated, meaning that they cannot appear in the same match formula.
Constraint on a single attribute
  For example: the average value of numrooms does not exceed 10.
Multiple schema attributes are unrelated
  For example: area and agent-name are unrelated.

42 Exploiting Domain Knowledge (7)
iMAP exploits domain knowledge:
  Domain constraints
  Past complex matches
  Overlap data
  External data

43 Exploiting Domain Knowledge (8)
Sometimes the same or similar schemas are mapped repeatedly.
iMAP extracts the expression templates of these past matches and uses them to guide the search process.
For example: given the past match price = pr * ( ), iMAP will extract VARIABLE * (1 + CONSTANT) and ask the numeric searcher to look for matches for that template.
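One simple way to turn a past match into a template is textual substitution. The sketch below is an assumption about how this could be done, not iMAP's actual method: identifiers become VARIABLE and decimal literals become CONSTANT, while integer literals such as 1 are left alone. The example expression and its constant 0.05 are hypothetical.

```python
import re


def extract_template(expression):
    """Abstract a past match expression into a reusable template."""
    # Decimal literals (e.g. 0.05) become CONSTANT; plain integers are kept.
    template = re.sub(r"\b\d+\.\d+\b", "CONSTANT", expression)
    # Attribute names become VARIABLE; the placeholder just inserted is skipped.
    template = re.sub(r"\b(?!CONSTANT\b)[A-Za-z_][A-Za-z0-9_]*\b", "VARIABLE", template)
    return template


print(extract_template("pr * (1 + 0.05)"))  # hypothetical past match expression
```

A numeric searcher could then instantiate the template by trying each numeric source attribute for VARIABLE and fitting CONSTANT to the data.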

44 Exploiting Domain Knowledge (9)
iMAP exploits domain knowledge:
  Domain constraints
  Past complex matches
  Overlap data
  External data

45 Exploiting Domain Knowledge (10)
There are many cases where the source (S) and the target (T) share the same data.
For example: databases S and T share a house listing ( Atlanta, GA, ).
In such overlap cases the shared data provides valuable information for the mapping process.

46 Exploiting Domain Knowledge (11)
Searchers that exploit overlap data:
  Overlap Text Searcher
  Overlap Numeric Searcher
  Overlap Category & Schema Mismatch Searchers

47 Exploiting Domain Knowledge (12)
HOW THE SEARCHERS WORK:
Step 1: Use the original searchers for an initial mapping.
Step 2: Use the overlap data to re-evaluate the mappings for improved matching accuracy.
For example: when re-evaluated, the mapping agent-address = location receives score 0, because it is not correct for the shared house listing ( Atlanta, GA, ), while agent-address = concat(city, state) receives score 1.
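Step 2 can be sketched as scoring a candidate by how often it reproduces the target value on the shared tuples. This is a minimal illustration; the function name, the shared listing's values, and the ", " separator are hypothetical and do not come from the slides.

```python
def overlap_score(candidate, shared_rows):
    """Fraction of shared listings on which the candidate reproduces the target value.

    candidate: function from a source row (dict) to a value
    shared_rows: list of (source_row, target_value) pairs for listings present in S and T
    """
    if not shared_rows:
        return 0.0
    hits = sum(1 for source_row, target_value in shared_rows
               if candidate(source_row) == target_value)
    return hits / len(shared_rows)


# One shared listing with made-up values: source attributes and the target's agent-address.
shared = [({"location": "Atlanta, GA", "city": "Athens", "state": "GA"}, "Athens, GA")]

print(overlap_score(lambda r: r["location"], shared))                    # agent-address = location
print(overlap_score(lambda r: f'{r["city"]}, {r["state"]}', shared))     # agent-address = concat(city, state)
```

On this toy listing the first candidate scores 0 and the second scores 1, mirroring the re-evaluation described on the slide.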

48 Exploiting Domain Knowledge (13)
iMAP exploits domain knowledge:
  Domain constraints
  Past complex matches
  Overlap data
  External data

49 Exploiting Domain Knowledge (14)
External data is used as additional constraints on the attributes of a schema.
It is usually provided by experts.
It can be very useful in schema matching.

50 Exploiting Domain Knowledge (15)
HOW EXPLOITING EXTERNAL DATA WORKS:
Step 1: Use external data to learn about a feature of the target attribute.
Step 2: Apply the learned information to evaluate matches for the target attribute.
For example:
  Target attribute: agent-name
  A feature that is potentially useful in schema matching: the number of distinct agent names

51 Overview

52 Generating Explanations (1)
Example: in matching real-estate schemas, for the attribute list-price, iMAP has produced the matches:
  list-price = price
  list-price = price * (1 + monthly-fee-rate)

53 Generating Explanations (2)
PROBLEM: The user is uncertain which of the two is the correct match.
SOLUTION: iMAP must explain the ranking.
  For example: why did list-price = price get a higher rank than list-price = price * (1 + monthly-fee-rate)?

54 Generating Explanations (3)
iMAP's goal is to provide an environment in which a human user can quickly generate a mapping between a pair of schemas.
For the user to know which match to choose, iMAP must supply an explanation for each of the matches.

55 Generating Explanations (4)
iMAP considers 3 questions:
Explain an existing match: why does the match exist?
  For example: why does the match month-posted = month-fee-rate exist?
Explain an absent match: why doesn't the match exist?
Explain a match ranking: why is one match better than another?

56 Generating Explanations (5)
iMAP keeps track of the decision-making process in a dependency graph.
Each node is one of the following:
  Schema attribute
  Assumption
  Candidate match
  Domain knowledge
An edge between two nodes means that one node leads to the other.
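Answering "why does this match exist?" from such a graph amounts to walking the edges backwards from the match. The sketch below uses a plain edge list and a breadth-first traversal; the node labels and the explain helper are hypothetical, and the slides do not specify how iMAP stores or queries its graph.

```python
from collections import deque


def explain(edges, node):
    """Return every node that leads to `node`, nearest causes first."""
    parents = {}
    for source, destination in edges:
        parents.setdefault(destination, []).append(source)

    seen, causes, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                causes.append(parent)
                queue.append(parent)
    return causes


# Hypothetical dependency graph: (from, to) means "from leads to to".
match = "candidate match: month-posted = month-fee-rate"
edges = [
    ("schema attribute: month-posted", "assumption: month-posted is a date"),
    ("assumption: month-posted is a date", match),
    ("schema attribute: month-fee-rate", match),
]

for cause in explain(edges, match):
    print(cause)
```

The same traversal, started from a rejected candidate, would surface the domain knowledge or assumption that ruled it out.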

57 Generating Explanations (6)

59 Conclusions (1)
Matches are key to enabling a wide variety of data sharing and exchange scenarios.
The majority of research on schema matching has focused on 1-1 matches.
iMAP offers a solution to the problem of finding complex matches.
The key challenge with complex matches is that the space of possible match candidates may be unbounded, and evaluating each candidate is harder.

60 Conclusions (2)
The architecture of iMAP is modular and extensible.
New searchers and new evaluation modules can be added easily.
Experimental results show that iMAP achieves 43-92% accuracy on several real-world domains, demonstrating the promise of the approach.


iMAP: Discovering Complex Semantic Matches between Database Schemas
Robin Dhamankar, Yoonkyong Lee, AnHai Doan
Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA
{dhamanka,ylee11,anhai}@cs.uiuc.edu