By Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy and Pedro Domingos Presented by Yael Kazaz
Example: Merging Real-Estate Agencies Two real-estate agencies: S and T, decide to merge Schema T has one table: Listings Schema S has two tables: Houses and Agents Merging schema S into schema T
Example: Making Tuples Using SQL area = SELECT location FROM HOUSES; agent-name = SELECT name FROM AGENTS; agent-address = SELECT concat(city, state) FROM AGENTS; list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
Motivation Creating matches between data sources is important. Manually creating matches is hard. Past attempts deal only with one-to-one (1-1) matches. For example: Address = Location. For example: Room_Price = Room_Rate. Complex matches are not considered. For example: Address = concat(city, state). For example: Room_Price = Room_Rate * (1 + Tax_Rate)
Introducing the imap System Semi-automatically discovering: 1-1 matches and complex matches. Semi-automatically constructing complex matches is very important, since complex matches make up as much as half of all matches!
Complex Matches Creating complex matches is harder than creating 1-1 matches: the match space can be very large or even infinite! The number of 1-1 matches is bounded. The number of complex matches is not: there is an unbounded number of functions for combining the attributes of a schema
Example: 1-1 Matches Of the four mappings in the real-estate example, two are 1-1 matches: area = SELECT location FROM HOUSES; agent-name = SELECT name FROM AGENTS
Example: Complex Matches The other two mappings are complex matches: agent-address = SELECT concat(city, state) FROM AGENTS; list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
Overview
Match Generator (1) Input: Target schema (T) Source schema (S) Output: Match candidates
Match Generator (2) What the Match Generator does: The Match Generator takes as input two schemas: S and T. For each attribute t of T, it generates a set of match candidates: 1-1 and complex matches. The generation is guided by a set of search modules
Match Generator (3) For Example: for t = area in T the candidates are: location in HOUSES name in AGENTS state in AGENTS concat(city, state) in AGENTS concat(name, city) in AGENTS
Match Generator (4) PROBLEM: Unbounded number of match candidates SOLUTION: Search the space of possible matches HOW: Use search modules, called searchers, each in charge of a specific type of attribute
Match Generator (5) Implemented searchers in imap: The searchers cover many complex match types: text, numeric, category, etc. The searchers evaluate match candidates, and exploit domain knowledge, such as domain constraints and overlap data
Match Generator (6) Applying search to candidate generation requires addressing three issues: (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition
Match Generator (7) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition The search space can be very large or even unbounded. We need to search such spaces efficiently. imap addresses this problem using beam search
Match Generator (8) Beam Search Example (K = 3): [Figure: search tree rooted at A whose nodes B through H carry scores; the set S of retained nodes evolves level by level as S = {A}, S = {B C D}, S = {B C E}, S = {C E H}]
Match Generator (9) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition We use beam search to search for candidate matches. Beam search uses a scoring function to evaluate each match candidate. At each level of the search tree, it keeps only the k highest-scoring matches. This lets a searcher conduct an efficient search in any type of search space
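The level-by-level pruning described on this slide can be sketched as follows. This is a minimal illustration, not imap's actual searcher: the tuple encoding of candidates, the `expand` function, and the toy scores are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical encoding: a candidate is the ordered tuple of source columns it concatenates.
Candidate = Tuple[str, ...]

def beam_search(
    start: Iterable[Candidate],
    expand: Callable[[Candidate], Iterable[Candidate]],
    score: Callable[[Candidate], float],
    k: int,
    max_levels: int,
) -> List[Candidate]:
    """Keep only the k highest-scoring candidates at each level of the search."""
    def rank(c: Candidate):
        # Highest score first; ties broken by the candidate itself for determinism.
        return (-score(c), c)

    beam = sorted(start, key=rank)[:k]
    for _ in range(max_levels):
        pool = set(beam)                      # current candidates stay in the running
        for cand in beam:
            pool.update(expand(cand))         # add their one-step extensions
        new_beam = sorted(pool, key=rank)[:k]
        if new_beam == beam:                  # nothing better was found
            break
        beam = new_beam
    return beam

# Toy setup (assumed values, for illustration only).
COLUMNS = ("name", "city", "state")
TOY_SCORES = {
    ("city",): 0.4, ("state",): 0.3, ("name",): 0.1,
    ("city", "state"): 0.9, ("state", "city"): 0.5,
}

def expand(c: Candidate) -> List[Candidate]:
    return [c + (col,) for col in COLUMNS if col not in c]

def score(c: Candidate) -> float:
    return TOY_SCORES.get(c, 0.0)

best = beam_search([(c,) for c in COLUMNS], expand, score, k=2, max_levels=2)
print(best[0])
```

With these toy scores the beam ends up led by the candidate that corresponds to concat(city, state).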
Match Generator (10) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition To conduct beam search, we assign each match candidate a score that approximates the distance between it and the target attribute. For example: given the match candidate concat(city, state), we approximate the distance between it and the target attribute agent-address
Match Generator (11) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition imap uses several kinds of techniques to compute the candidate scores: machine learning, statistics, heuristics
Match Generator (12) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition The search space can be unbounded We need to decide when to stop the search We terminate when we start seeing diminishing returns from our search
Match Generator (13) (1) Search strategy (2) Evaluation of candidate matches (3) Termination condition In the i-th iteration we keep track of Max_i, the highest score among the candidate matches. If (Max_{i+1} - Max_i) < threshold, we stop the search and return the k highest-scoring match candidates
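The stopping rule can be illustrated with a small helper that scans the per-iteration maxima. The function name and the sample scores are hypothetical; only the rule itself (stop once the improvement in the best score falls below a threshold) comes from the slide.

```python
from typing import List, Sequence

def stopping_iteration(max_scores: Sequence[float], threshold: float) -> int:
    """Return the 0-based index of the iteration at which the search stops.

    The search stops at the first iteration i+1 whose best score improves on
    iteration i by less than `threshold`. If that never happens, the last
    available iteration is returned.
    """
    if not max_scores:
        raise ValueError("at least one iteration is required")
    for i in range(len(max_scores) - 1):
        if max_scores[i + 1] - max_scores[i] < threshold:
            return i + 1
    return len(max_scores) - 1

# Assumed best-score trace over five iterations.
trace: List[float] = [0.40, 0.70, 0.85, 0.86, 0.86]
stop_at = stopping_iteration(trace, threshold=0.05)
print(stop_at)
```

Here the improvements are roughly 0.30, 0.15 and 0.01, so the search stops at the fourth iteration (index 3).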
Overview
Similarity Estimator (1) Input: Match candidates Output: Similarity matrix
Similarity Estimator (2) What the Similarity Estimator does: Computes for each candidate a score of similarity to attribute t of T. The output of this module is a matrix that stores the similarity score of the pairs <target attribute, match candidate>:
                     Target attribute t1   Target attribute t2   Target attribute t3   ...
Match candidate 1    0                     0.5                   0                     ...
Match candidate 2    0                     0.2                   0.7                   ...
Match candidate 3    0.8                   0.3                   0                     ...
...
Similarity Estimator (3)
                 name   concat(city, state)   price   price * (1 + fee-rate)
area             0.5    0.8                   0       0
list-price       0      0                     0.7     0.9
agent-address    0.5    0.8                   0       0
agent-name       0.9    0.6                   0       0
Similarity Estimator (4) PROBLEM: Deciding which candidate is better than another SOLUTION: For each target attribute t of T, the searchers suggest a set of match candidates HOW: The scores assigned to the candidate matches are each based on only a single type of information. For example: the text searcher considers only word frequencies
Similarity Estimator (5) The Similarity Estimator evaluates these candidates and assigns each of them a final score. The Similarity Estimator exploits additional types of information to compute a more accurate score. The Similarity Estimator employs evaluator modules, each of which exploits one type of information, and then combines the suggested scores into one final score
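One simple way to combine the evaluator scores into a final score is to average them. The sketch below assumes plain averaging and uses made-up evaluator names and scores; the slides do not say which combination rule imap applies.

```python
from statistics import mean
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (target attribute, match candidate)

def combine(evaluator_scores: Dict[str, float]) -> float:
    """Combine per-evaluator scores into one final score (assumed rule: the mean)."""
    return mean(list(evaluator_scores.values()))

def similarity_matrix(
    scores: Dict[Pair, Dict[str, float]]
) -> Dict[Pair, float]:
    """Map each <target attribute, match candidate> pair to its final score."""
    return {pair: combine(per_evaluator) for pair, per_evaluator in scores.items()}

# Hypothetical evaluator outputs, for illustration only.
raw = {
    ("agent-address", "concat(city, state)"): {"evaluator_a": 0.5, "evaluator_b": 1.0},
    ("agent-address", "name"): {"evaluator_a": 0.5, "evaluator_b": 0.0},
}
matrix = similarity_matrix(raw)
print(matrix[("agent-address", "concat(city, state)")])
```

A weighted combination would fit the same interface by replacing the body of `combine`.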
Overview
Match Selector (1) Input: Similarity matrix Output: 1-1 matches Complex matches
Match Selector (2) What the Match Selector does: Examines the Similarity Matrix and outputs the best matches for the attributes of T, under certain constraints
                 name   concat(city, state)   price   price * (1 + fee-rate)
area             0.5    0.8                   0       0
list-price       0      0                     0.7     0.9
agent-address    0.5    0.8                   0       0
agent-name       0.9    0.6                   0       0
Match Selector (3) PROBLEM: How to match the best candidate to the attribute SOLUTION: The best global match could be where each target attribute is assigned the match with the highest score BUT This match assignment may not be acceptable because it may violate domain constraints
Match Selector (4) For example: Domain constraint: name and city in AGENTS have no relation. Result: The match area = SELECT concat(name, city) FROM AGENTS is not selected
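The effect of a domain constraint on selection can be sketched with a greedy, per-attribute selector: take the highest-scoring candidate that does not combine attributes declared unrelated. The candidate scores below are invented for the example, and imap's real selector may search for a globally best assignment rather than choosing attribute by attribute.

```python
from typing import FrozenSet, List, Optional, Tuple

# (expression text, source attributes it uses, similarity score)
Scored = Tuple[str, FrozenSet[str], float]

def violates(attrs: FrozenSet[str], unrelated: List[FrozenSet[str]]) -> bool:
    """True if the candidate uses two attributes that are declared unrelated."""
    return any(pair <= attrs for pair in unrelated)

def select_match(cands: List[Scored], unrelated: List[FrozenSet[str]]) -> Optional[str]:
    """Return the best-scoring candidate that satisfies every constraint."""
    for expr, attrs, _score in sorted(cands, key=lambda c: -c[2]):
        if not violates(attrs, unrelated):
            return expr
    return None

# Hypothetical candidates for the target attribute `area`.
area_candidates: List[Scored] = [
    ("concat(name, city)", frozenset({"name", "city"}), 0.85),
    ("location", frozenset({"location"}), 0.80),
]
constraints = [frozenset({"name", "city"})]  # name and city have no relation

chosen = select_match(area_candidates, constraints)
print(chosen)
```

Even though concat(name, city) has the higher assumed score, it is skipped because it violates the constraint, and location is selected instead.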
Overview
Exploiting Domain Knowledge (1) Exploiting domain knowledge is beneficial for 1-1 matching. For complex matching it is even more crucial, because it allows early detection of unlikely matches
Exploiting Domain Knowledge (2) For example: imap learns that the number of real estate agents in a specific area is bounded by 50. Given the match agent-name = concat(first-name, last-name), where first-name and last-name belong to the home owner, imap will realize that concat(first-name, last-name) results in hundreds of distinct names. Conclusion: concat(first-name, last-name) is unlikely to match agent-name
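The reasoning in this example amounts to comparing the number of distinct values a candidate produces against a learned bound. A minimal sketch, with a hypothetical helper name and synthetic owner data:

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]

def exceeds_distinct_bound(rows: List[Row], columns: Sequence[str], bound: int) -> bool:
    """True if concatenating `columns` yields more distinct values than `bound`."""
    values = {" ".join(row[c] for c in columns) for row in rows}
    return len(values) > bound

# Synthetic home-owner data: 200 distinct owners (assumed for illustration).
owners = [{"first-name": f"First{i}", "last-name": f"Last{i}"} for i in range(200)]

# The learned bound on distinct agent names from the slide.
unlikely = exceeds_distinct_bound(owners, ["first-name", "last-name"], bound=50)
print(unlikely)
```

When the check returns True, the candidate can be pruned early instead of being scored further.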
Exploiting Domain Knowledge (3) imap exploits domain knowledge: Domain constraints Past complex matches Overlap data External data
Exploiting Domain Knowledge (4) imap exploits domain knowledge: Domain constraints Past complex matches Overlap data External data
Exploiting Domain Knowledge (5) Constraints are either: Present in the schema. For example: agent-name is text and amount is numeric. Provided by an expert. For example: The tax on a room cannot be less than 7%. Provided by the user. For example: We do not sell houses that cost less than $200,000
Exploiting Domain Knowledge (6) imap considers 3 kinds of constraints: Two attributes are unrelated. For example: name and beds are unrelated, meaning that they cannot appear in the same match formula. Constraint on a single attribute. For example: the average value of numrooms does not exceed 10. Multiple schema attributes are unrelated. For example: area and agent-name are unrelated
Exploiting Domain Knowledge (7) imap exploits domain knowledge: Domain constraints Past complex matches Overlap data External data
Exploiting Domain Knowledge (8) Sometimes, the same or similar schemas are mapped repeatedly. imap extracts the expression templates of these past matches and uses them to guide the search process. For example: Given the past match price = pr * (1 + 0.6), imap will extract VARIABLE * (1 + CONSTANT) and ask the numeric searcher to look for matches for that template
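Template extraction of this kind can be approximated with a regular expression. The sketch below assumes that attribute names become VARIABLE and decimal literals become CONSTANT, while integer literals such as the 1 in the slide's template are kept; that rule is inferred from the single example and is not a description of imap's actual extractor. It handles arithmetic expressions only.

```python
import re

# An identifier (hyphens allowed between name parts) or a decimal literal.
TOKEN = re.compile(
    r"(?P<ident>[A-Za-z_]\w*(?:-[A-Za-z_]\w*)*)|(?P<decimal>\d+\.\d+)"
)

def extract_template(expression: str) -> str:
    """Replace attribute names with VARIABLE and decimal literals with CONSTANT."""
    def repl(match: re.Match) -> str:
        return "VARIABLE" if match.group("ident") else "CONSTANT"
    return TOKEN.sub(repl, expression)

template = extract_template("pr * (1 + 0.6)")
print(template)
```

Doing the substitution in a single pass avoids rewriting the inserted placeholder words a second time.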
Exploiting Domain Knowledge (9) imap exploits domain knowledge: Domain constraints Past complex matches Overlap data External data
Exploiting Domain Knowledge (10) There are many cases where the source (S) and target (T) share some of the same data. For example: databases S and T share a house listing ( Atlanta, GA, ). In such overlap cases the shared data provides valuable information for the mapping process
Exploiting Domain Knowledge (11) Searchers that exploit overlap data: Overlap Text Searcher Overlap Numeric Searcher Overlap Category & Schema Mismatch Searchers
Exploiting Domain Knowledge (12) HOW THE SEARCHERS WORK: Step 1: Use the original searchers for an initial mapping Step 2: Use the overlap data to re-evaluate the mappings for improved matching accuracy For example: when re-evaluated, mapping: agent-address = location, receives score 0 because it is not correct for the shared house listing: ( Atlanta, GA, ) and agent-address = concat(city, state) receives score 1
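Step 2 can be pictured as checking each candidate against the shared tuples: apply the candidate expression to the source row and see whether it reproduces the target value. In this sketch the row contents, the comma separator used by concat, and the flattening of the source tables into one row are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

def overlap_score(candidate: Callable[[Row], str], shared: List[Tuple[Row, str]]) -> float:
    """Fraction of shared tuples on which the candidate reproduces the target value."""
    if not shared:
        return 0.0
    hits = sum(1 for source_row, target_value in shared if candidate(source_row) == target_value)
    return hits / len(shared)

# One shared listing: source-side values and the target's agent-address value.
# The location value is invented for the example.
shared_listings = [
    ({"location": "Buckhead", "city": "Atlanta", "state": "GA"}, "Atlanta, GA"),
]

def by_location(row: Row) -> str:          # agent-address = location
    return row["location"]

def by_concat(row: Row) -> str:            # agent-address = concat(city, state)
    return f"{row['city']}, {row['state']}"

print(overlap_score(by_location, shared_listings))
print(overlap_score(by_concat, shared_listings))
```

On this single shared listing the location mapping scores 0 and the concat mapping scores 1, mirroring the slide's example.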
Exploiting Domain Knowledge (13) imap exploits domain knowledge: Domain constraints Past complex matches Overlap data External data
Exploiting Domain Knowledge (14) External data is used as additional constraints on the attributes of a schema Usually provided by experts Can be very useful in schema matching
Exploiting Domain Knowledge (15) HOW EXPLOITING EXTERNAL DATA WORKS: Step 1: Use external data to learn about the feature Step 2: Apply the learned information to evaluate matches for the target attribute For example: Target attribute: agent-name A feature that can be potentially useful in schema matching: number of distinct agent names
Overview
Generating Explanations (1) Example: In matching real-estate schemas, for attribute: list-price, imap has produced the matches: list-price = price list-price = price * (1 + monthly-fee-rate)
Generating Explanations (2) PROBLEM: The user is uncertain which of the two is the correct match SOLUTION: imap must explain the ranking: For example: why did list-price = price get a higher rank than list-price = price * (1 + monthly-fee-rate)?
Generating Explanations (3) imap's goal is to provide an environment where a human user can quickly generate a mapping between a pair of schemas. For a user to know which match to choose, imap must supply an explanation for each of the matches
Generating Explanations (4) imap considers 3 questions: Explain existing match: Why does the match exist? For example: Why does the match month-posted = monthly-fee-rate exist? Explain absent match: Why doesn't the match exist? Explain match ranking: Why is one match better than another?
Generating Explanations (5) imap keeps track of the decision-making process in a dependency graph. Each node is one of the following: Schema attribute, Assumption, Candidate match, Domain knowledge. An edge between two nodes means that one node leads to another
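A dependency graph of this kind can be represented by storing, for each node, the nodes that led to it; an explanation is then the set of ancestors of a candidate match. The class, the node labels, and the sample edges below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict, deque
from typing import DefaultDict, List

class DependencyGraph:
    """Records which nodes led to which; edges point from cause to effect."""

    def __init__(self) -> None:
        self._causes: DefaultDict[str, List[str]] = defaultdict(list)

    def add_edge(self, cause: str, effect: str) -> None:
        self._causes[effect].append(cause)

    def explain(self, node: str) -> List[str]:
        """Return every node that leads to `node`, nearest causes first."""
        seen, order, queue = {node}, [], deque([node])
        while queue:
            current = queue.popleft()
            for cause in self._causes.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    order.append(cause)
                    queue.append(cause)
        return order

graph = DependencyGraph()
match = "candidate: month-posted = monthly-fee-rate"
assumption = "assumption: monthly-fee-rate holds month values"
graph.add_edge("attribute: monthly-fee-rate", assumption)
graph.add_edge(assumption, match)
graph.add_edge("attribute: month-posted", match)

for step in graph.explain(match):
    print(step)
```

Walking the graph backwards from a match answers the "why does this match exist" question by listing the attributes and assumptions it depends on.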
Generating Explanations (6)
Conclusions (1) Matches are key for enabling a wide variety of data sharing and exchange scenarios The majority of the research on schema matching has focused on 1-1 matches imap offers a solution to the problem of finding complex matches The key challenge with complex matches is that the space of possible matching candidates is possibly unbounded, and evaluating each candidate is harder
Conclusions (2) The architecture of imap is modular and extensible New searchers and new evaluation modules can be added easily Experimental results show that imap achieves 43-92% accuracy on several real world domains, thus demonstrating the promise of the approach