Introduction
Entity Match Service

In order to incorporate as much institutional data as possible into our central alumni and donor database (hereafter referred to as CADS), we've developed a comprehensive suite of automated entity match services. The CADS database contains millions of entity records. The Entity Match Service Suite identifies whether one of the millions of entity records in CADS corresponds with the person represented by the input data and, if so, which record.

Unfortunately, looking for exact matches on attributes such as Name, Address, Telephone, etc., will miss many true matches, potentially causing a number of duplicate records to be created in CADS. The reasons an exact match might fail are numerous: ambiguity in data (Thomas Smith and Tom Smith may represent the same person); unformatted data (the same address may be written in multiple ways); missing data elements; out-of-date information; or unrecognized partial matches.

In addition to the difficulties posed by attempting an exact match, the sheer volume of data in CADS requires that the number of candidate records be narrowed prior to matching. To address these challenges, the Entity Match Web Service is divided into four general steps:

1. Receiving and accepting the data
2. Deciding which CADS entities to match against (Blocking)
3. Matching the data against the chosen CADS entities (Preliminary Match)
4. Using the Name data to refine and confirm the match results (Secondary Match)

Step-by-Step Description

In Step 1, the service receives any or all of the following input: First Name, Middle Name, Last Name, Address, Phone, and Email. If the input fails to meet the minimum requirements, or if the service is unavailable, an error will be generated. The minimum requirements for the Entity Match Service Suite are:

During Step 2, the service determines which CADS entities will be looked at as potential matches.
Instead of trying to match all the provided input fields against 1.4 million CADS entities, the service uses CADS data to choose a small subset to pass on to Step 3. The three criteria used to make this choice are:

In Step 3, each piece of input data is matched against its counterpart for each CADS entity identified during Step 2. An exact match is attempted on all names, addresses, phones, and emails on the CADS record. Each time a match is found on an attribute, the CADS entity's match score is increased; each time a non-match is found, the entity's score is decreased. For a detailed look at how the match score is calculated, please see the Appendix (Weights and Matching).
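The Step 3 scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual code: the weights are copied from the Appendix table, while the dictionary layout, the field names, and the use of plain string equality as the "exact match" are hypothetical choices made for the example.

```python
from typing import Dict, Iterable, Optional

# (positive match weight, negative match weight), copied from the Appendix.
# When data is missing on either side, the attribute contributes 0.
WEIGHTS = {
    "first_name": (1.1202, -4.7506),
    "middle_name": (1.0071, -0.4485),
    "last_name": (2.0433, -5.2233),
    "address": (7.3102, -2.2687),
    "phone": (6.7998, -1.1676),
    "email": (7.3328, -2.3590),
}


def attribute_score(attr: str, input_value: Optional[str],
                    cads_values: Iterable[str]) -> float:
    """Score one attribute against every value of that type on a CADS record."""
    values = [v for v in cads_values if v]
    if not input_value or not values:
        return 0.0  # not present in input OR not present in CADS
    positive, negative = WEIGHTS[attr]
    # Assumption: "exact match" is modeled as plain string equality.
    return positive if input_value in values else negative


def match_score(input_data: Dict[str, str],
                cads_record: Dict[str, list]) -> float:
    """Sum the attribute scores for one candidate CADS entity."""
    return sum(
        attribute_score(attr, input_data.get(attr), cads_record.get(attr, []))
        for attr in WEIGHTS
    )


# Hypothetical input and candidate record (each record field holds a list).
input_data = {"first_name": "Tom", "last_name": "Smith",
              "email": "tsmith@example.edu"}
record = {"first_name": ["Thomas"], "last_name": ["Smith"],
          "email": ["tsmith@example.edu"], "phone": ["555-0100"]}
score = match_score(input_data, record)  # -4.7506 + 2.0433 + 7.3328
```

In this example the differing first name costs more than the matching last name earns, but the matching email still leaves the candidate with a positive score.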
By the end of Step 3, all potential matches are assigned a match confidence value. The values are determined based on the aggregate score calculated during the match process. The various values and their score thresholds are as follows:

Low: match score less than or equal to 0
Medium: match score greater than 0 but less than or equal to 8.8
High: match score greater than 8.8

Based on the confidence values assigned to each CADS entity in the pool of potential matches, the following actions are taken:

Highest Confidence Value in Result Set   Action Performed
No results                               Pass input data to Entity Create Service; automatically create new CADS entity
Low                                      Pass input data to Entity Create Service; automatically create new CADS entity
Medium                                   Pass Medium result(s) to Exception Interface
High                                     Pass High result(s) to Step 4

Step 4 loops back and takes a final look at the name input and compares it to the name data present on each matched CADS record. Because Step 2 looks at attributes other than name when choosing entities to pass to Step 3, it's possible for two members of the same household to both be assigned a High match confidence value. It's also possible that the name input might include a misspelling or use a nickname not recorded on a CADS entity's record. In order to compensate for these possibilities, the service runs all High confidence results through the decision tree depicted below:

The first name check compares the calculated Oracle SoundEx value of the input name with the known SoundEx value of all names found on the matched entity's record. This screens out members of the same household who have different first names.
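To make the first name check concrete, here is a minimal Python sketch of the classic American Soundex algorithm, which Oracle's SOUNDEX function is documented to follow. The function names are hypothetical, and the output has not been verified against Oracle itself, so treat it as an approximation of the check rather than the service's code.

```python
# Consonant classes of the classic Soundex algorithm.
_CODES = {}
for _digit, _letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            previous = digit
    return (code + "000")[:4]


def first_name_check(input_first: str, record_firsts) -> bool:
    """True if the input name sounds like any first name on the record."""
    code = soundex(input_first)
    return bool(code) and any(soundex(n) == code for n in record_firsts)
```

Under this sketch, "Jon" and "John" share the code J500 and pass the check, while "Tom" (T500) and "Thomas" (T520) do not, which is the kind of case the later name checks are meant to catch.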
The second name check employs the Jaro-Winkler similarity metric to calculate the string distance between the input First Name and all First Names found on the matched entity's record. Comparing the names character-by-character prevents names which sound different due to a misspelling from erroneously excluding an entity that is an otherwise correct match.

The third name check runs the input First Name against a custom-built synonym table then, if any synonyms are found, compares those synonyms to all First Names found on the matched entity's record.

Any single High confidence match that passes any one of the name checks is considered to represent the same individual as the input data. The input data is passed to the Entity Update Service, which will automatically apply any new or different input data to the applicable CADS entity record.

Any High confidence match that cannot pass the three name checks is assigned a new confidence value of High*. This conditional high value indicates that although the CADS entity was a numerically strong candidate, there is insufficient name evidence for the Entity Match Service to decide that the CADS entity and the entity represented by the input data are truly the same individual. These records are passed to the Exception Interface where an expert can make the final determination regarding the match.

[Figure: Example of Match Process]
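The second and third name checks, and the High versus High* outcome, can be sketched as follows. The Jaro-Winkler code follows the standard definition of the metric; the 0.9 acceptance threshold, the tiny synonym table, and all function names are assumptions made for illustration, since the document does not state the service's actual cutoff or the contents of its synonym table.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def second_name_check(input_first, record_firsts, threshold=0.9):
    # threshold is a hypothetical cutoff, not the service's documented value
    return any(jaro_winkler(input_first.lower(), n.lower()) >= threshold
               for n in record_firsts)


# Hypothetical stand-in for the custom-built synonym table.
SYNONYMS = {"tom": {"thomas", "tommy"}, "thomas": {"tom", "tommy"}}


def third_name_check(input_first, record_firsts, synonyms=SYNONYMS):
    known = synonyms.get(input_first.lower(), set())
    return any(n.lower() in known for n in record_firsts)


def refine_confidence(check_results) -> str:
    """High if any name check passed, otherwise the conditional High*."""
    return "High" if any(check_results) else "High*"
```

For example, "Katherine" against a record holding "Katharine" scores about 0.956 and passes the second check, and "Tom" against "Thomas" passes the third check through the synonym table.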
Appendix
Weights and Matching

During the match process, each attribute is assigned a positive or negative match score, also known as a weight. The value of each weight was precalculated during the development process of the Entity Match Service Suite; the array of positive and negative weights is specially calibrated for the CADS dataset.

Attribute     Weight (Positive Match)   Weight (Data Not Present in Input OR Not Present in CADS)   Weight (Negative Match)
First Name    +1.1202                   0.0000                                                      -4.7506
Middle Name   +1.0071                   0.0000                                                      -0.4485
Last Name     +2.0433                   0.0000                                                      -5.2233
Address       +7.3102                   0.0000                                                      -2.2687
Phone         +6.7998                   0.0000                                                      -1.1676
Email         +7.3328                   0.0000                                                      -2.3590

Highest Possible Score: 25.6133
Lowest Possible Score: -16.2177

Results Skewed Toward Address, Phone, and Email

Because far fewer people share an Address, Phone, and/or Email than share a name, these three attributes have a high positive weight. In other words, two sets of entity information that share an Address, Phone, and/or Email have a statistically significant likelihood of representing the same person. This does not mean, however, that two sets of entity information that do not share an Address, Phone, and/or Email are strongly predisposed not to represent the same person. The power of the Address, Phone, and Email match is one of positive correlation.

          First Name  Middle Name  Last Name  Address  Phone    Email    Score      Status
Match 1   +1.1202     +1.0071      +2.0433    +7.3102  +6.7998  +7.3328   25.6133   High
Match 2   -4.7506     +1.0071      +2.0433    +7.3102  +6.7998  +7.3328   19.7426   High
Match 3   -4.7506     -0.4485      +2.0433    +7.3102  +6.7998  +7.3328   18.2870   High
Match 4   -4.7506     -0.4485      -5.2233    +7.3102  +6.7998  +7.3328   11.0204   High
Match 5   -4.7506     -0.4485      -5.2233    -2.2687  +6.7998  +7.3328    1.4415   Medium
Match 6   -4.7506     -0.4485      -5.2233    -2.2687  -1.1676  +7.3328   -6.5259   Low
Match 7   -4.7506     -0.4485      -5.2233    -2.2687  -1.1676  -2.3590  -16.2177   Low

Results Skewed Toward Names

Unlike Address, Phone, and Email, the power of the First Name and Last Name match is one of negative correlation. Since many different people can share the same first and/or last name, a positive match for those attributes isn't very powerful. A negative match on those attributes is powerful because it is much more likely to indicate that the two sets of entity information being compared do not represent the same person. For example, it is much more likely that many different people are named John Smith than that one person is named both Robert Johnson and Martin Bridges.
          First Name  Middle Name  Last Name  Address  Phone    Email    Score      Status
Match 8   +1.1202     +1.0071      +2.0433    +7.3102  +6.7998  +7.3328   25.6133   High
Match 9   +1.1202     +1.0071      +2.0433    +7.3102  +6.7998  -2.3590   15.9216   High
Match 10  +1.1202     +1.0071      +2.0433    +7.3102  -1.1676  -2.3590    7.9542   Medium
Match 11  +1.1202     +1.0071      +2.0433    -2.2687  -1.1676  -2.3590   -1.6247   Low
Match 12  +1.1202     +1.0071      -5.2233    -2.2687  -1.1676  -2.3590   -8.8913   Low
Match 13  +1.1202     -0.4485      -5.2233    -2.2687  -1.1676  -2.3590  -10.3469   Low
Match 14  -4.7506     -0.4485      -5.2233    -2.2687  -1.1676  -2.3590  -16.2177   Low

Taking a Closer Look at 50/50 Matches

Match 11 shows both categories at their weakest: negative matches for Address, Phone, and Email; and positive matches for First Name, Middle Name, and Last Name. With a total score of only -1.6247, the positive and negative values for the two categories nearly cancel each other out.

Match 4, on the other hand, shows both categories at their most powerful: positive matches for Address, Phone, and Email; and negative matches for First Name, Middle Name, and Last Name. In this case, because Address, Phone, and Email are so heavily weighted, the result is still a High status match.

          First Name  Middle Name  Last Name  Address  Phone    Email    Score      Status
Match 4   -4.7506     -0.4485      -5.2233    +7.3102  +6.7998  +7.3328   11.0204   High
Match 11  +1.1202     +1.0071      +2.0433    -2.2687  -1.1676  -2.3590   -1.6247   Low

Summary
Weights and Matching

In summary, when determining if two sets of entity information represent the same person:

Differing names are more meaningful than matching names
Matching Address, Phone, and/or Email are more meaningful than differing Address, Phone, and/or Email
Matching Address, Phone, and/or Email are more meaningful than differing names
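The Status column in the tables above follows directly from the score thresholds given for Step 3, and the highest status in a result set determines the action taken. A minimal Python sketch of that mapping is shown below; the function names and action strings are hypothetical, and only the thresholds and the status-to-action pairing come from this document.

```python
from typing import Sequence

CREATE = "Pass input data to Entity Create Service"
ACTIONS = {
    "Low": CREATE,
    "Medium": "Pass Medium result(s) to Exception Interface",
    "High": "Pass High result(s) to Step 4",
}


def confidence(score: float) -> str:
    """Map an aggregate match score to its confidence value."""
    if score <= 0:
        return "Low"
    if score <= 8.8:
        return "Medium"
    return "High"


def action_for(scores: Sequence[float]) -> str:
    """Choose the action from the highest confidence value in the result set."""
    if not scores:
        return CREATE  # no results: a new CADS entity is created
    # Confidence rises with score, so the best score gives the highest value.
    return ACTIONS[confidence(max(scores))]


# Scores taken from Match 4, Match 10, and Match 11 in the tables above.
statuses = [confidence(s) for s in (11.0204, 7.9542, -1.6247)]
```

Here `statuses` evaluates to High, Medium, and Low, matching the Status column for those three rows.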