Duplicates Detection in Database Integration


Juliana Booker

Outline
- Background
- HumMer: Automatic Data Fusion System
- Duplicate detection methods
  - An efficient method using a priority queue
  - Approach based on extended keys
  - Approach based on classification methods
- More resources

Background: Data Integration
What is data integration? It is the problem of taking multiple independently developed databases and resolving the differences between them so that they appear as one.

Entity Matching/Identification
What is entity matching/identification? Given a pair of records drawn from semantically corresponding tables in multiple heterogeneous databases, the objective is to determine whether the records represent the same real-world entity. It is closely related to duplicate detection.

Duplicate Detection
- Schema level: identify the duplicate attributes
- Instance level: identify the duplicate records

HumMer: Automatic Data Fusion System

Introduction
HumMer is a system that tries to fuse heterogeneous, duplicate, and conflicting data [1]. It incorporates:
- Schema matching
- Duplicate (record) detection
- Conflict resolution

Query Language
- Based on SQL
- For example, conflict resolution can use SQL functions (min, max, sum, ...) and others

Schematic Heterogeneity Resolving: Schema Matching
- Detect a few duplicates in two unaligned databases
  - Convert tuples to strings
  - Do string matching to find duplicates
- Derive attribute correspondences based on similar attribute values of the duplicates
  - Two duplicates are compared field-wise, resulting in a matrix containing similarity scores for each attribute combination
  - Apply a threshold to get the final attribute correspondences
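The field-wise comparison used for schema matching can be sketched as follows. This is a minimal illustration, not HumMer's implementation: `difflib.SequenceMatcher` stands in for the similarity measure, averaging the matrices over several duplicate pairs is an assumption, and the records, attribute names, and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; SequenceMatcher is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_matrix(rec_a: dict, rec_b: dict) -> dict:
    """Field-wise scores for every attribute combination of one duplicate pair."""
    return {(attr_a, attr_b): similarity(str(val_a), str(val_b))
            for attr_a, val_a in rec_a.items()
            for attr_b, val_b in rec_b.items()}

def attribute_correspondences(duplicates, threshold=0.8):
    """Average the matrices over all duplicate pairs (an assumption),
    then keep the attribute pairs whose mean score reaches the threshold."""
    totals = {}
    for rec_a, rec_b in duplicates:
        for key, score in similarity_matrix(rec_a, rec_b).items():
            totals[key] = totals.get(key, 0.0) + score
    count = len(duplicates)
    return sorted(key for key, total in totals.items() if total / count >= threshold)

# Hypothetical duplicates found in two unaligned tables.
duplicates = [
    ({"name": "Anna Berg", "town": "Uppsala"}, {"fullname": "Anna Berg", "city": "Uppsala"}),
    ({"name": "Lars Holm", "town": "Lund"}, {"fullname": "Lars Holm", "city": "Lund"}),
]
print(attribute_correspondences(duplicates))
```

With these two pairs, only the combinations whose values agree in both duplicates (name with fullname, town with city) pass the threshold.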

Schematic Heterogeneity Resolving: Data Transformation
For the relations to be fused:
- One schema is chosen to determine the attribute names in the fused table
- All tables receive an additional sourceid attribute
- The full outer union of all tables is computed

Duplicate Record Detection
- Specify the relevant attributes
- Compare tuples pairwise using a similarity measure
- Objects with similarity above a given threshold are considered duplicates
- The closure of the duplicates is computed, in which each is assigned one <sourceid>

Duplicate Detection Methods
- An efficient entity identification method using a priority queue [2]
- Approach based on extended keys [3]
- Approach based on classification methods [4]

An Efficient Entity Identification Method Using Priority Queue: Introduction
Addressed issue: entity identification with improved efficiency.
Standard method:
- Sort the table on an application-specific key
- Compare nearby records using a sliding window
- Cost: expensive pairwise record comparisons
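The standard sort-and-slide method listed above can be sketched like this. It is only an illustration: the window size, the 0.8 threshold, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity measure in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Sort on the key, then compare each record only with the
    (window - 1) records that precede it in sorted order."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[max(0, i - window + 1):i]:
            if similarity(key(other), key(rec)) >= threshold:
                pairs.append((other, rec))
    return pairs

# Hypothetical records; the record itself serves as the sort key.
records = ["john smith", "jon smith", "mary jones", "john smyth", "mary jonas"]
for pair in sorted_neighborhood(records, key=lambda r: r):
    print(pair)
```

The number of comparisons grows with the window size times the number of records, and every one of them is a full string comparison, which is the cost the priority-queue method tries to reduce.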


Main Component
How to improve efficiency? Each record is compared only with the clusters of duplicates in the priority queue.
Priority queue:
- It contains clusters of duplicates.
- Each cluster contains some records as representatives.
- Every cluster has a priority. The most recently updated cluster has the highest priority.
- The queue has a fixed size. The cluster with the lowest priority is replaced by the newly added cluster.

Main Component: Strategy of the Priority Queue
- When there is a new record, its membership in the clusters in the queue is checked. The record only needs to be compared with the representatives of each cluster.
- If it belongs to none, a new cluster is created for it and put in the queue. If the queue is full, the cluster with the lowest priority is moved out.
- If it belongs to a cluster, that cluster is given the highest priority. If the matching score of the record is below a certain threshold, the record is added as a representative.

Other Components
How to compare records?
- Convert each record to a long string
- Use the edit distance algorithm. Example: edit(error, eror) = 1, edit(great, grate) = 2
How to store and query the duplicate clusters? Use a union-find data structure:
- Union: combine or merge two sets into a single set.
- Find: determine which set a particular element is in. Also useful for determining whether two elements are in the same set.

Workflow
1. Sort the records according to the given attributes.
2. Scan through the database sequentially. For each record, compare it with the representatives of each cluster in the priority queue, then update the priority queue.
3. Scan through the database again, checking each record as before.
4. End.

Approach Based on Extended Key: Introduction
Addressed issue: how to determine the correspondence between records from multiple databases with different schemas?
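A single scan of the priority-queue workflow described earlier, together with edit distance and union-find, can be sketched as below. This is a simplified reading of the slides rather than the algorithm of [2] in full: the matching score is taken as the average similarity to a cluster's representatives, and the queue size, both thresholds, and the sample records are assumptions.

```python
class UnionFind:
    """Stores the duplicate clusters as disjoint sets of record indices."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(y)] = self.find(x)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    return 1.0 - edit_distance(a, b) / (max(len(a), len(b)) or 1)

def detect_duplicates(records, queue_size=4, match_at=0.7, representative_below=0.95):
    """One sequential scan over records that are already sorted."""
    sets = UnionFind(len(records))
    queue = []  # index 0 = highest priority; a cluster is its list of representative indices
    for i, rec in enumerate(records):
        scores = [sum(similarity(rec, records[r]) for r in reps) / len(reps)
                  for reps in queue]
        best = max(range(len(queue)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= match_at:
            reps = queue.pop(best)            # matched cluster will move to the front
            sets.union(reps[0], i)
            if scores[best] < representative_below:
                reps.append(i)                # imperfect match: keep as a representative
        else:
            reps = [i]                        # no cluster matched: start a new one
            if len(queue) == queue_size:
                queue.pop()                   # evict the lowest-priority cluster
        queue.insert(0, reps)
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(sets.find(i), []).append(rec)
    return list(groups.values())

print(edit_distance("error", "eror"), edit_distance("great", "grate"))
print(detect_duplicates(["john smith", "john smyth", "jon smith", "mary jonas", "mary jones"]))
```

Each record is compared only with the representatives held in the fixed-size queue, so the number of string comparisons per record is bounded by the queue size times the number of representatives per cluster.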


Main Idea
What is an extended key? A minimal set of attributes able to uniquely identify an instance.
A new way to resolve the heterogeneity: use extended key equivalence, which is defined by ILFD (instance-level functional dependency) tables.

Main Idea
How to get ILFD tables? Such semantic information can be supplied:
- By database administrators during schema integration
- Through some knowledge acquisition tools

Approach Based on Classification Methods: Introduction
Addressed issue: entity matching using machine learning methods.
Main idea: model the problem as a binary (match or non-match) classification problem. For each pair of records, the trained classifier tells whether they are duplicates or not.

Strategy
How to use one pair of records as input to the classifier? Convert it to a vector. The example on the slide yields a vector of length 16.
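The pair-to-vector idea can be sketched as follows. This is an illustration only: it uses three hypothetical attributes instead of the 16 components mentioned on the slide, `difflib.SequenceMatcher` as the per-attribute similarity, and a nearest-centroid rule as a stand-in classifier, which is not the constrained cascade generalization method of [4]. All records are made up.

```python
from difflib import SequenceMatcher

ATTRIBUTES = ["name", "city", "phone"]  # hypothetical shared attributes

def pair_to_vector(rec_a: dict, rec_b: dict) -> list:
    """One similarity score per attribute turns a record pair into a vector."""
    return [SequenceMatcher(None, rec_a[attr], rec_b[attr]).ratio()
            for attr in ATTRIBUTES]

def train_centroids(labelled_pairs):
    """labelled_pairs: (rec_a, rec_b, is_duplicate) triples, with at least
    one example of each label. Returns the mean vector per label."""
    sums = {True: [0.0] * len(ATTRIBUTES), False: [0.0] * len(ATTRIBUTES)}
    counts = {True: 0, False: 0}
    for rec_a, rec_b, label in labelled_pairs:
        vector = pair_to_vector(rec_a, rec_b)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def is_match(centroids, rec_a: dict, rec_b: dict) -> bool:
    """Binary decision: is the pair's vector closer to the match centroid?"""
    vector = pair_to_vector(rec_a, rec_b)
    def squared_distance(label):
        return sum((v - c) ** 2 for v, c in zip(vector, centroids[label]))
    return squared_distance(True) < squared_distance(False)

# Hypothetical records and labels.
a = {"name": "Anna Berg", "city": "Uppsala", "phone": "018-111"}
b = {"name": "Anna Bergh", "city": "Uppsala", "phone": "018-111"}
c = {"name": "Lars Holm", "city": "Lund", "phone": "046-222"}
d = {"name": "L. Holm", "city": "Lund", "phone": "046-222"}

centroids = train_centroids([(a, b, True), (a, c, False), (b, d, False)])
print(is_match(centroids, c, d), is_match(centroids, a, d))
```

Any binary classifier that accepts fixed-length numeric vectors could replace the centroid rule; the conversion step stays the same.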

Strategy
How to get training examples?
- One way is to get suggestions from a domain expert.
- Another way is to use a partial common key, if available.

More Resources
Reading resources:
- Schema matching: Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 2001.
- Duplicate record detection: A. K. Elmagarmid, P. G. Ipeirotis and V. S. Verykios. Duplicate record detection: A survey. TKDE 19(1): 1-16, 2007.

Paper list:
[1] A. Bilke, J. Bleiholder, C. Bohm, K. Draba, F. Naumann, and M. Weis. Automatic data fusion with HumMer. In Proc. of VLDB, Trondheim, Norway, 2005.
[2] A. E. Monge and C. P. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Proc. ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.
[3] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson. Entity Identification in Database Integration. Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993.
[4] Huimin Zhao, Sudha Ram. Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization. Data & Knowledge Engineering, Volume 66, 2008.

Question & Answer
Thank you!

Exercise 1. Use Extended Key to do Entity Identification [1]

Tables R and S are shown below.

Table R
Name              City       ZIP    PersonNr
Eva Aadde         INGARÖ
Eva Aalto         Norsborg
Eva Abrahamsson   INGARÖ

Table S
Name              HomeAddress        Telephone
Eva Aadde         Myskviksvägen
Eva Abrahamsson   Myrvägen
Eva Abrahamsson   Pilgatan
Eva Abrahamsson   Nyängsvägen 39A

Suppose the extended key is {name, city, homeaddress} and the following ILFDs hold:
(E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
(E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
(E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
(E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
Please construct the integrated table.

[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson. Entity Identification in Database Integration. Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993.

Exercise 2. Use Priority Queue to do Duplicate Detection [2]

Given the conditions below, use the priority queue algorithm to find the duplicate clusters.
1. Table R, which is already sorted according to the application-specific key: tuples T1, T2, T3, T4, T5, T6, T7
2. Similarities between tuples: a matrix with rows and columns T1 to T7
3. Method to compute the matching score: given one cluster, the matching score of a tuple is the average of the tuple's similarity with all of the cluster's representatives.
4. The condition to declare a new cluster: matching score >
5. The condition to declare a representative: 0.5 < matching score <
6. The size of the priority queue:

[2] A. E. Monge and C. P. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Proc. ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.
