Entity Match Service: Step-by-Step Description


Introduction

In order to incorporate as much institutional data as possible into our central alumni and donor database (hereafter referred to as CADS), we've developed a comprehensive suite of automated entity match services. The CADS database contains millions of entity records. The Entity Match Service Suite identifies whether one of those records corresponds to the person represented by the input data, and if so, which record.

Unfortunately, looking for exact matches on attributes such as Name, Address, and Telephone will miss many true matches, potentially causing a number of duplicate records to be created in CADS. The reasons an exact match might fail are numerous: ambiguity in the data (Thomas Smith and Tom Smith may represent the same person); unformatted data (the same address may be written in multiple ways); missing data elements; out-of-date information; or unrecognized partial matches. In addition to the difficulties posed by attempting an exact match, the sheer volume of data in CADS requires that the number of candidate records be narrowed prior to matching.

To address these challenges, the Entity Match Web Service is divided into four general steps:

1. Receiving and accepting the data
2. Deciding which CADS entities to match against (Blocking)
3. Matching the data against the chosen CADS entities (Preliminary Match)
4. Using the Name data to refine and confirm the match results (Secondary Match)

Step-by-Step Description

In Step 1, the service receives any or all of the following input: First Name, Middle Name, Last Name, Address, Phone, and Email. If the input fails to meet the minimum requirements, or if the service is unavailable, an error is generated. The minimum requirements for the Entity Match Service Suite are:

During Step 2, the service determines which CADS entities will be considered as potential matches. Instead of trying to match all the provided input fields against 1.4 million CADS entities, the service uses CADS data to choose a small subset to pass on to Step 3. The three criteria used to make this choice are:

In Step 3, each piece of input data is matched against its counterpart for each CADS entity identified during Step 2. An exact match is attempted on all names, addresses, phones, and emails on the CADS record. Each time a match is found on an attribute, the CADS entity's match score is increased; each time a non-match is found, the entity's score is decreased. For a detailed look at how the match score is calculated, please see the explanation in the Appendix (Weights and Matching).
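The blocking criteria themselves are not enumerated here, but the general technique of Step 2 can be sketched as follows. Each CADS entity is indexed in advance under one or more cheap blocking keys; the candidate set for an input record is then the union of the blocks its own keys point to, so only a small subset ever reaches the pairwise comparison of Step 3. The keys below (last name, phone, email) are hypothetical illustrations, not the service's actual criteria:

```python
from collections import defaultdict

def build_blocks(entities, key_funcs):
    """Index every CADS entity under each blocking key it produces."""
    blocks = defaultdict(set)
    for ent in entities:
        for kf in key_funcs:
            key = kf(ent)
            if key:
                blocks[key].add(ent["id"])
    return blocks

def candidates(record, blocks, key_funcs):
    """Union of all entities sharing at least one blocking key with the input."""
    ids = set()
    for kf in key_funcs:
        key = kf(record)
        if key:
            ids |= blocks.get(key, set())
    return ids

# Hypothetical blocking keys -- the real three criteria are not given here.
KEY_FUNCS = [
    lambda r: ("ln", r.get("last_name", "").lower()) if r.get("last_name") else None,
    lambda r: ("ph", r.get("phone")) if r.get("phone") else None,
    lambda r: ("em", r.get("email", "").lower()) if r.get("email") else None,
]
```

A record sharing any one key with a CADS entity is pulled into the candidate pool; everything else is skipped without ever being scored.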

By the end of Step 3, all potential matches are assigned a match confidence value, determined from the aggregate score calculated during the match process. The values and their score thresholds are as follows:

  Low:    match score less than or equal to 0
  Medium: match score greater than 0 but less than or equal to 8.8
  High:   match score greater than 8.8

Based on the confidence values assigned to each CADS entity in the pool of potential matches, the following actions are taken:

  Highest Confidence Value in Result Set   Action Performed
  No results                               Pass input data to Entity Create Service; automatically create a new CADS entity
  Low                                      Pass input data to Entity Create Service; automatically create a new CADS entity
  Medium                                   Pass Medium result(s) to the Exception Interface
  High                                     Pass High result(s) to Step 4

Step 4 loops back and takes a final look at the name input, comparing it to the name data present on each matched CADS record. Because Step 2 looks at attributes other than name when choosing entities to pass to Step 3, it's possible for two members of the same household to both be assigned a High match confidence value. It's also possible that the name input might include a misspelling or use a nickname not recorded on a CADS entity's record. To compensate for these possibilities, the service runs all High confidence results through a decision tree of three name checks, described below.

The first name check compares the calculated Oracle SOUNDEX value of the input name with the known SOUNDEX value of all names found on the matched entity's record. This screens out members of the same household who have different first names.
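Oracle's SOUNDEX function implements the classic Soundex phonetic code. A simplified Python sketch of that algorithm (ignoring edge cases such as non-alphabetic characters) shows why it screens household members apart:

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes, zero-padded."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    if not name:
        return ""
    out = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch)
        if code is None:
            if ch not in "HW":     # vowels break runs of equal codes; H and W do not
                prev = ""
            continue
        if code != prev:           # collapse adjacent identical codes
            out.append(code)
        prev = code
    return (name[0] + "".join(out) + "000")[:4]
```

Names that sound alike collapse to the same code (Robert and Rupert both yield R163), while two spouses named, say, Robert and Karen yield different codes, so the wrong household member is screened out.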

The second name check employs the Jaro-Winkler similarity metric to calculate the string distance between the input First Name and all First Names found on the matched entity's record. Comparing the names character-by-character prevents names which sound different due to a misspelling from erroneously excluding an entity that is an otherwise correct match.

The third name check runs the input First Name against a custom-built synonym table and then, if any synonyms are found, compares those synonyms to all First Names found on the matched entity's record.

Any single High confidence match that passes any one of the name checks is considered to represent the same individual as the input data. The input data is passed to the Entity Update Service, which automatically applies any new or different input data to the applicable CADS entity record.

Any High confidence match that cannot pass the three name checks is assigned a new confidence value of High*. This conditional high value indicates that although the CADS entity was a numerically strong candidate, there is insufficient name evidence for the Entity Match Service to conclude that the CADS entity and the entity represented by the input data are truly the same individual. These records are passed to the Exception Interface, where an expert makes the final determination regarding the match.

Example of Match Process
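The Jaro-Winkler metric used by the second check can be sketched as below. This is the standard textbook formulation, not CADS-specific code: it returns 1.0 for identical strings and boosts the base Jaro score for strings that share a common prefix, which makes it forgiving of single-character misspellings and transpositions:

```python
def jaro(s: str, t: str) -> float:
    """Base Jaro similarity: matched characters within a sliding window,
    penalized for transpositions."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    t_used = [False] * len(t)
    matches = []
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_used[j] and t[j] == ch:
                t_used[j] = True
                matches.append(ch)
                break
    m = len(matches)
    if m == 0:
        return 0.0
    t_matches = [t[j] for j in range(len(t)) if t_used[j]]
    transpositions = sum(a != b for a, b in zip(matches, t_matches)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro score boosted by up to 4 characters of common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

A transposed misspelling still scores very high (MARTHA vs. MARHTA gives roughly 0.96), so a typo in the input name does not exclude an otherwise correct match.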

Appendix: Weights and Matching

During the match process, each attribute is assigned a positive or negative match score, also known as a weight. The value of each weight was precalculated during the development of the Entity Match Service Suite; the array of positive and negative weights is specially calibrated for the CADS dataset.

  Attribute      Weight (Positive Match)   Weight (Not Present in Input or CADS)   Weight (Negative Match)
  First Name     +1.1202                   0.0000                                  -4.7506
  Middle Name    +1.0071                   0.0000                                  -0.4485
  Last Name      +2.0433                   0.0000                                  -5.2233
  Address        +7.3102                   0.0000                                  -2.2687
  Phone          +6.7998                   0.0000                                  -1.1676
  Email          +7.3328                   0.0000                                  -2.3590

  Highest Possible Score:  25.6133
  Lowest Possible Score:  -16.2177

Results Skewed Toward Address, Phone, and Email

Because far fewer people share an Address, Phone, and/or Email than share a name, these three attributes have a high positive weight. In other words, two sets of entity information that share an Address, Phone, and/or Email have a statistically significant likelihood of representing the same person. This does not mean, however, that two sets of entity information that do not share an Address, Phone, and/or Email are strongly predisposed not to represent the same person. The power of the Address, Phone, and Email match is one of positive correlation.

  Match   First Name   Middle Name   Last Name   Address   Phone     Email     Score      Status
  1       +1.1202      +1.0071       +2.0433     +7.3102   +6.7998   +7.3328    25.6133   High
  2       -4.7506      +1.0071       +2.0433     +7.3102   +6.7998   +7.3328    19.7426   High
  3       -4.7506      -0.4485       +2.0433     +7.3102   +6.7998   +7.3328    18.2870   High
  4       -4.7506      -0.4485       -5.2233     +7.3102   +6.7998   +7.3328    11.0204   High
  5       -4.7506      -0.4485       -5.2233     -2.2687   +6.7998   +7.3328     1.4415   Medium
  6       -4.7506      -0.4485       -5.2233     -2.2687   -1.1676   +7.3328    -6.5259   Low
  7       -4.7506      -0.4485       -5.2233     -2.2687   -1.1676   -2.3590   -16.2177   Low

Results Skewed Toward Names

Unlike Address, Phone, and Email, the power of the First Name and Last Name match is one of negative correlation. Since many different people can share the same first and/or last name, a positive match on those attributes isn't very powerful. A negative match on those attributes is powerful because it is much more likely to indicate that the two sets of entity information being compared do not represent the same person. For example, it is much more likely that many different people are named John Smith than that one person is named both Robert Johnson and Martin Bridges.

  Match   First Name   Middle Name   Last Name   Address   Phone     Email     Score      Status
  8       +1.1202      +1.0071       +2.0433     +7.3102   +6.7998   +7.3328    25.6133   High
  9       +1.1202      +1.0071       +2.0433     +7.3102   +6.7998   -2.3590    15.9216   High
  10      +1.1202      +1.0071       +2.0433     +7.3102   -1.1676   -2.3590     7.9542   Medium
  11      +1.1202      +1.0071       +2.0433     -2.2687   -1.1676   -2.3590    -1.6247   Low
  12      +1.1202      +1.0071       -5.2233     -2.2687   -1.1676   -2.3590    -8.8913   Low
  13      +1.1202      -0.4485       -5.2233     -2.2687   -1.1676   -2.3590   -10.3469   Low
  14      -4.7506      -0.4485       -5.2233     -2.2687   -1.1676   -2.3590   -16.2177   Low

Taking a Closer Look at 50/50 Matches

Match 11 shows both categories at their weakest: negative matches for Address, Phone, and Email, and positive matches for First Name, Middle Name, and Last Name. With a total score of only -1.6247, the positive and negative values for the two categories nearly cancel each other out. Match 4, on the other hand, shows both categories at their most powerful: positive matches for Address, Phone, and Email, and negative matches for First Name, Middle Name, and Last Name. In this case, because Address, Phone, and Email are so heavily weighted, the result is still a High status match.

  Match   First Name   Middle Name   Last Name   Address   Phone     Email     Score      Status
  4       -4.7506      -0.4485       -5.2233     +7.3102   +6.7998   +7.3328    11.0204   High
  11      +1.1202      +1.0071       +2.0433     -2.2687   -1.1676   -2.3590    -1.6247   Low

Summary: Weights and Matching

In summary, when determining whether two sets of entity information represent the same person:

  - Differing names are more meaningful than matching names
  - Matching Address, Phone, and/or Email are more meaningful than differing Address, Phone, and/or Email
  - Matching Address, Phone, and/or Email are more meaningful than differing names
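The weight table and confidence thresholds described above can be combined into a small scoring sketch. The weights and thresholds are taken directly from this document; the attribute names and dictionary shape are illustrative:

```python
WEIGHTS = {
    # attribute: (positive-match weight, negative-match weight); absent data scores 0
    "first_name":  (1.1202, -4.7506),
    "middle_name": (1.0071, -0.4485),
    "last_name":   (2.0433, -5.2233),
    "address":     (7.3102, -2.2687),
    "phone":       (6.7998, -1.1676),
    "email":       (7.3328, -2.3590),
}

def match_score(outcomes):
    """outcomes maps attribute -> True (match), False (non-match), None (not present)."""
    total = 0.0
    for attr, outcome in outcomes.items():
        pos, neg = WEIGHTS[attr]
        if outcome is True:
            total += pos
        elif outcome is False:
            total += neg
    return total

def confidence(score):
    """Thresholds from Step 3: <= 0 is Low, <= 8.8 is Medium, otherwise High."""
    if score <= 0:
        return "Low"
    if score <= 8.8:
        return "Medium"
    return "High"

# Match 4 from the table: all names disagree, all contact data agrees -> still High.
match4 = {"first_name": False, "middle_name": False, "last_name": False,
          "address": True, "phone": True, "email": True}
```

Running Match 4 through these functions reproduces the 11.0204 score and High status from the first table, and the mirror-image Match 11 reproduces -1.6247 and Low.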