Adaptive Temporal Entity Resolution on Dynamic Databases


1 Adaptive Temporal Entity Resolution on Dynamic Databases
Peter Christen (1) and Ross Gayler (2)
(1) Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia
(2) Veda, Melbourne VIC 3000, Australia
Contacts: peter.christen@anu.edu.au / ross.gayler@veda.com.au
This research was funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP
April 2013 p.1/18

2 Outline
- Short introduction to entity resolution
- An example application: Identity verification
- Problem formulation and contribution
- An example set of temporal records
- Modelling temporal changes of entities
- Adjusting similarities between records
- Calculating agreement and disagreement probabilities
- The adaptive temporal matching process
- Experimental evaluation
- Conclusions and future work

3 Short introduction to entity resolution
Entity resolution is the process of identifying and matching records that correspond to the same entity from one or several databases.
Several major challenges to entity resolution:
- Entity identifiers are commonly not available, so personal details often need to be used for matching
- Real-world data are dirty (typos, variations, etc.)
- Naive comparison of all record pairs scales quadratically with the sizes of the databases to be matched
- Lack of training data (true match status of record pairs) makes accurate and automatic classification difficult
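To make the quadratic-scaling point concrete, the following is a minimal sketch rather than anything taken from the slides. The function names, the toy records, and the choice of surname as a blocking key are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def naive_pair_count(n_a: int, n_b: int) -> int:
    """Comparisons needed when every record in A is compared with every record in B."""
    return n_a * n_b


def blocked_pairs(records, key):
    """Pair only records that share the same blocking key value (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = []
    for ids in blocks.values():
        # Within one block, still compare all pairs.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Toy records, loosely modelled on the style of personal data discussed here.
records = {
    "r1": {"surname": "Miller", "city": "Sydney"},
    "r3": {"surname": "Miller", "city": "Hobart"},
    "r4": {"surname": "Smith", "city": "Perth"},
    "r6": {"surname": "Smith", "city": "Perth"},
}

candidate_pairs = blocked_pairs(records, key=lambda rec: rec["surname"])
```

Matching two databases of 1,000 records each already needs 1,000,000 naive comparisons, whereas blocking on surname reduces the toy set above from six possible pairs to two.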

4 Example: Identity verification
- Many services require the verification of the personal details provided by customers (government services, credit cards, loans, etc.)
- Based on large databases of known entities (the personal details of individuals, such as their names, addresses, phone numbers, dates of birth, etc.)
- Requires real-time matching of query records with one or several large databases
- Accurate and fast matching is crucial for good service and to prevent identity fraud
- Personal details change over time (databases are dynamic)

5 Problem formulation and contribution
- We investigate how temporal information (such as people changing their names or addresses) can be incorporated into the entity resolution process
- We modify similarities between records according to temporal characteristics of the data
- Building on the earlier approach "Linking temporal records" (Li et al., VLDB Endowment, 2011)
Our contributions:
- An adaptive entity resolution approach for dynamic data that contain temporal information
- An efficient temporal adjustment method
- An evaluation on both synthetic and real data

6 Example set of temporal records

RecID  EntID  Givenname  Surname  Street address  City      Time-stamp
r1     e1     Gale       Miller   13 Main Rd      Sydney
r2     e2     Peter      O Brian  43/1 Miller St  Sydeny
r3     e1     Gail       Miller   11 Town Pl      Hobart
r4     e1     Gail       Smith    42 Ocean Dr     Perth
r5     e2     Pete       O Brien  43 Miller St    Sydney
r6     e1     Abigail    Smith    42 Ocean Dr     Perth
r7     e2     Peter      OBrian   12 Nice Tr      Brisbane
r8     e1     Gayle      Smith    11a Town Pl     Sydney

- An entity changes address values more often than surname values
- Small variations in values are possible (no actual changes)
- Several entities can have the same value in an attribute

7 Modelling temporal changes (1)
Basic assumptions and notation used:
- $R$, $r_i$: Database containing entity records $r_i$
- $a_j$: Attributes of $r_i$, denoted by $r_i.a_j$
- $r_i.e$, $r_i.t$: Entity identifier and time-stamp of $r_i$
- $q$, $q.a_j$, $q.t$: Query record with attributes $a_j$ and time-stamp ($q$ does not have a known entity identifier)
- $\Delta t$: Difference in time-stamps, $\Delta t = r_i.t - q.t$
- $s_{same}$, $s_{match}$: Global agreement and match thresholds
The aim is to match a query record, $q$, to its correct true entity in $R$ ($q.e = r_i.e$).
We calculate similarities $0 \le sim_j(r_i.a_j, q.a_j) \le 1$ (values are agreeing if $sim_j \ge s_{same}$, else disagreeing).

8 Modelling temporal changes (2)
To consider temporal aspects, we define:
- $S$ is the event that $q$ and $r_i$ actually refer to the same entity
- $A_j$ is the event that $q$ and $r_i$ have an agreeing value in attribute $a_j$
We consider two probabilities:
- $P(A_j, \Delta t \mid S)$: Probability that a query and a database record that actually refer to the same entity have an agreeing value in attribute $a_j$ over $\Delta t$ (no value change)
- $P(\neg A_j, \Delta t \mid \neg S)$: Probability that a query and a database record that actually refer to different entities have disagreeing (different) values in attribute $a_j$ over $\Delta t$

9 Adjusting similarities (1)
- Based on the previous two probabilities, we adjust the overall similarity between compared records
- Assume $q$ and $r_i$ have been compared using a set of attribute similarity functions $s_j = sim_j(r_i.a_j, q.a_j)$
- We assign relative weights, $w_j$, to the attribute similarities, $s_j$:

$$ sim(r_i, q) = \frac{\sum_j w_j(s_j, \Delta t) \cdot s_j}{\sum_j w_j(s_j, \Delta t)} $$

- These weights are calculated based on the likelihood of change in their attribute values
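As a minimal sketch of the weighted overall similarity on this slide, the function below computes the weighted average of the attribute similarities. The function name and the handling of an all-zero weight vector are assumptions, not part of the slides.

```python
def overall_similarity(sims, weights):
    """Weighted average of attribute similarities: sum(w_j * s_j) / sum(w_j)."""
    if len(sims) != len(weights):
        raise ValueError("sims and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: with no weight on any attribute, report zero similarity.
        return 0.0
    return sum(w * s for w, s in zip(weights, sims)) / total_weight


# Two attributes: the first agrees fully and carries three times the weight.
example = overall_similarity([1.0, 0.0], [3.0, 1.0])
```

With equal weights the result reduces to the plain mean of the attribute similarities, which corresponds to not using any temporal adjustment.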

10 Adjusting similarities (2)
We adjust similarities based on $s_j$ and $s_{same}$:
- If $s_j \ge s_{same}$ then $w_j(s_j, \Delta t) = s_j \cdot P(\neg A_j, \Delta t \mid \neg S)$
  The more likely it is that two different entities have the same value in attribute $a_j$ over time $\Delta t$, the less weight is assigned for this agreement
- If $s_j < s_{same}$ then $w_j(s_j, \Delta t) = s_j \cdot P(A_j, \Delta t \mid S)$
  The more likely it is that for an entity a value in attribute $a_j$ changes over time $\Delta t$, the less weight is assigned for this disagreement
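The two weight cases on this slide can be sketched as a single function. This is an illustrative reading of the rule as stated here, with hypothetical parameter names; the two probabilities are assumed to have been looked up already for the relevant attribute and time difference.

```python
def temporal_weight(s_j, s_same, p_agree_same, p_disagree_diff):
    """
    Weight for one attribute similarity s_j.

    p_agree_same    stands for P(A_j, dt | S): same entity, value unchanged over dt.
    p_disagree_diff stands for P(not A_j, dt | not S): different entities, different values.
    """
    if s_j >= s_same:
        # Agreement: down-weighted when different entities often share this value.
        return s_j * p_disagree_diff
    # Disagreement: down-weighted when entities often change this value over dt.
    return s_j * p_agree_same


agree_weight = temporal_weight(1.0, 0.8, p_agree_same=0.5, p_disagree_diff=0.25)
disagree_weight = temporal_weight(0.5, 0.8, p_agree_same=0.5, p_disagree_diff=0.25)
```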

11 Calculating probabilities
- In a dynamic and real-time setting, $P(A_j, \Delta t \mid S)$ and $P(\neg A_j, \Delta t \mid \neg S)$ need to be calculated and updated in an adaptive and efficient way
- $P(A_j, \Delta t \mid S)$ can be calculated from data if it is known which records correspond to the same entity (or based on match decisions made)
- $P(\neg A_j, \Delta t \mid \neg S)$ is calculated as $P(\neg A_j, \Delta t \mid \neg S) = 1 - P(A_j, \Delta t \mid \neg S)$, where $P(A_j, \Delta t \mid \neg S)$ is based on how frequently certain values appear in an attribute (the surname value "Smith" is more frequent than "Dijkstra")
- For details of these calculations, please see the paper
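The slides defer the exact calculations to the paper, so the following is only a rough sketch of how such probabilities could be maintained incrementally. It rests on two loudly stated assumptions that are not from the source: the same-entity agreement probability is estimated as a simple ratio of counts per time-difference bucket, and the chance that two different entities agree is approximated by the sum of squared relative value frequencies, ignoring the time difference. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class AgreementStats:
    """Running counts for one attribute (illustrative, not the authors' method)."""

    def __init__(self):
        self.value_counts = Counter()
        # dt bucket -> [number of agreeing pairs, number of pairs]
        self.same_entity = defaultdict(lambda: [0, 0])

    def add_value(self, value):
        """Record one more occurrence of an attribute value."""
        self.value_counts[value] += 1

    def add_same_entity_pair(self, dt_bucket, agrees):
        """Record a pair of records known (or decided) to be the same entity."""
        counts = self.same_entity[dt_bucket]
        counts[0] += int(agrees)
        counts[1] += 1

    def p_agree_given_same(self, dt_bucket, default=1.0):
        """Estimate of P(A_j, dt | S); falls back to `default` without data."""
        agreeing, total = self.same_entity.get(dt_bucket, (0, 0))
        return agreeing / total if total else default

    def p_disagree_given_diff(self):
        """Estimate of P(not A_j | not S) = 1 - P(A_j | not S), from value frequencies."""
        n = sum(self.value_counts.values())
        if n == 0:
            return 1.0
        # Assumption: chance that two randomly drawn records share a value.
        p_agree = sum((c / n) ** 2 for c in self.value_counts.values())
        return 1.0 - p_agree


stats = AgreementStats()
for surname in ["Smith", "Smith", "Smith", "Dijkstra"]:
    stats.add_value(surname)
for agrees in [True, True, False, True]:
    stats.add_same_entity_pair(dt_bucket=1, agrees=agrees)
```

Because both estimates are built from counters, adding a newly matched query record only touches a few entries, which is in the spirit of the adaptive and efficient updating the slide asks for.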

12 Adaptive temporal matching process
Assume an initial database $R$ of known entity records, and a stream of query records $q$.
For each $q$, the following process is conducted:
1. Get a set of candidate records $C$ from $R$ using an appropriate blocking/indexing technique
2. For each candidate record $c \in C$, calculate the overall adjusted similarity $sim(c, q)$
3. Get $c_{best}$ with the highest similarity $s_{best}$ of all $c$
4. If $s_{best} \ge s_{match}$, set $q.e = c_{best}.e$, else set $q.e$ to a new unique entity identifier value
5. Update $P(A_j, \Delta t \mid S)$ and $P(\neg A_j, \Delta t \mid \neg S)$
6. Add $q$ into database $R$
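The six steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the authors' prototype: records are plain dictionaries, the blocking function, similarity function, and probability update are passed in as callables, and the toy helpers below (first-letter blocking, fraction of exactly equal attributes, the `new<k>` identifier scheme) are hypothetical stand-ins.

```python
from itertools import count


def match_stream(database, queries, candidates, similarity, s_match, update=None):
    """Process query records one at a time, following the six listed steps."""
    new_ids = count(1)
    for q in queries:
        cands = candidates(database, q)                         # step 1
        scored = [(similarity(c, q), c) for c in cands]         # step 2
        best = max(scored, key=lambda p: p[0], default=None)    # step 3
        if best is not None and best[0] >= s_match:             # step 4
            q["e"] = best[1]["e"]
        else:
            q["e"] = f"new{next(new_ids)}"  # assumed scheme for fresh identifiers
        if update is not None:                                  # step 5
            update(database, q)
        database.append(q)                                      # step 6
    return database


ATTRS = ("given", "surname", "city")


def same_initial(database, q):
    """Toy blocking: candidates share the first letter of the surname."""
    return [r for r in database if r["surname"][:1] == q["surname"][:1]]


def exact_fraction(c, q):
    """Toy similarity: fraction of attributes with exactly equal values."""
    return sum(c[a] == q[a] for a in ATTRS) / len(ATTRS)


database = [
    {"given": "Gail", "surname": "Miller", "city": "Hobart", "e": "e1"},
    {"given": "Peter", "surname": "OBrian", "city": "Brisbane", "e": "e2"},
]
queries = [
    {"given": "Gail", "surname": "Miller", "city": "Perth"},
    {"given": "Ann", "surname": "Lee", "city": "Perth"},
]
result = match_stream(database, queries, same_initial, exact_fraction, s_match=0.6)
```

In this toy run the first query agrees with the Miller record on two of three attributes and is assigned its entity, while the second query has no candidates and receives a new identifier. In a fuller version, `similarity` would apply the temporal weights and `update` would refresh the probability estimates.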

13 Experiments and data sets
- We used a North Carolina voter database (personal details of 2.4 million voters, only 113,801 voters with duplicate records)
- We also generated synthetic data sets based on real personal data (details and results in paper)
- Prototype implemented in Python (code available from authors)
Three baseline approaches:
- Traditional entity resolution that does not consider temporal aspects
- An additional temporal attribute $a_{temp} = \Delta t / \max(\Delta t)$
- Non-adaptive temporal (no update of probabilities)
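The temporal-attribute baseline normalises each time difference by the largest one. A minimal sketch follows, assuming non-negative time differences given in any consistent unit; the function name and the handling of an all-zero input are assumptions.

```python
def temporal_attribute(dts):
    """Scale each time difference by the maximum, giving values in [0, 1]."""
    max_dt = max(dts, default=0)
    if max_dt == 0:
        # Assumption: with no time differences at all, the attribute is zero.
        return [0.0 for _ in dts]
    return [dt / max_dt for dt in dts]


scaled = temporal_attribute([0, 5, 10])
```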

14 Results on NC voter data
[Figure: NC Voter matching quality with $s_{match} = 0.7$. Y-axis: percentage of true matches correctly identified. X-axis: None, Temp Attr, Adapt, Non Adapt. Series: TM and TM top 10, each for $s_{same} = 0.8$ and $s_{same} = 0.9$.]

15 Temporal overhead for NC voter data
[Figure: Timing results for NC Voter data set. Y-axis: time in milliseconds. X-axis: None, Temp Attr, Adapt, Non Adapt. Series: Match time, Adjust time, Update time.]

16 Conclusions and future work
- We proposed an efficient approach for adaptive entity resolution on dynamic databases
- We consider temporal aspects to adjust agreement and disagreement weights
- Experiments showed that taking temporal aspects into account can improve matching quality
Future work includes:
- Take attribute dependencies into account
- Combine the proposed approach with probabilistic record linkage
- Incorporate constraints

17 Results on synthetic data
[Table: columns are Parameters, None, Adapt, Avrg rec per ent. Rows are Clean, 0.8 / ; Clean, 0.9 / ; Dirty, 0.8 / ; Dirty, 0.9 / . The numerical values are not preserved in this transcript.]
- Results reported are accuracy, calculated as the percentage of true matches correctly identified
- Parameter values are for $s_{same}$ / $s_{match}$

18 Results on NC voter data (2)
[Figure: NC Voter matching quality with $s_{match} = 0.8$. Y-axis: percentage of true matches correctly identified. X-axis: None, Temp Attr, Adapt, Non Adapt. Series: TM and TM top 10, each for $s_{same} = 0.8$ and $s_{same} = 0.9$.]

Adaptive Temporal Entity Resolution on Dynamic Databases

Adaptive Temporal Entity Resolution on Dynamic Databases Peter Christen 1 and Ross W. Gayler 2 1 Research School of Computer Science, The Australian National University, Canberra ACT 0200, Australia peter.christen@anu.edu.au