Applying Phonetic Hash Functions to Improve Record Linking in Student Enrollment Data


Int'l Conf. Information and Knowledge Engineering, IKE'15

(Research in progress)

Pei Wang 1, Daniel Pullen 2, John Talburt 2, and Ningning Wu 2
1 Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA

Abstract: Entity resolution and record linking processes are often required to process input records of poor data quality. However, the matching errors caused by poor-quality data can often be overcome by categorizing the quality problems and then applying a cyclic process that continuously refines the match rules. This paper extends a previous case study of this process for student enrollment data. It describes the unique data quality issues that were identified throughout the cyclic process and how different phonetic hashing functions were used to overcome them.

Keywords: Entity Resolution, Record Linkage, Phonetic Hash Code, Data Quality (DQ), Boolean Matching Rules

1. Introduction

Previous work in this area has been published utilizing similarity functions such as Levenshtein Edit Distance and Q-gram Tetrahedral Ratio [11]. This research takes a different approach by applying phonetic hash code functions to mitigate quality issues present in student enrollment data. This approach can help overcome variations introduced when people transcribe spoken names into text, as well as common typographical variations.

2. Background

Entity Resolution (ER) is the process of determining whether two references to real-world objects in an information system refer to the same object or to different objects [1]. References are made up of attributes, and the attribute values describe the real-world entity to which they refer. The ER processes discussed in this paper use Boolean match rules to make their decisions.
Boolean match rules do not produce a score or weight when comparing a pair of references, only a True/False decision. If two references satisfy a Boolean match rule, i.e. the rule is true, the references are linked together. After the application of transitive closure, all of the references that can be linked together form an entity identity structure (EIS) [9].

3. Boolean Match Rules and ER Outcomes

Boolean match rules classify each pair of references as a "link" pair or a "non-link" pair. The basic unit of a Boolean rule is a term. A term is a comparison between the values of one attribute in the pair of records. The term is TRUE if the degree of similarity required by the comparison is met. A rule is a series of terms connected by AND logic, i.e. every term must be true for the rule to be true. Finally, the ER process may use several Boolean rules connected by OR logic, i.e. the pair of references is linked if at least one of the rules is true [3].

In evaluating the outcome of an ER process, the results of the matches between all pairs of references can be placed into four mutually exclusive categories: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). TPs are correctly labeled link pairs. TNs are correctly labeled non-link pairs. Contrasting these correct results are two types of incorrect linking results. FPs are pairs of records that the ER process has identified as matches (link pairs) but that actually refer to two different real-world entities. FNs are pairs of records that the ER process has identified as non-matches (non-link pairs) but that actually refer to the same real-world entity [8]. The goal of an ER process is to produce the lowest possible number of FPs and FNs.

4. The OYSTER ER System

The ER processes in this paper were performed with OYSTER (Open system for Entity Resolution). OYSTER is an open source ER system developed by the Center for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock (UALR). OYSTER was specifically designed to support entity identity information management (EIIM) [9]. Although OYSTER can be run in several different configurations to support the various phases of the entity identity information life cycle, only the identity capture configuration was used for the results given in this paper [10].

ER Impact of Data Quality Issues

The data set used throughout this testing is a collection of student enrollment data spanning two academic years. The total records and clusters are listed in Table 1.

TABLE I. DATA SETS

                  Set A        Set B
Total Clusters    526,         ,934
Total Records     3,234,292    3,255,513

Only the student identity information was used. Any results discussed in this paper have been made anonymous to allow the sharing and description of the unique cases identified. In the data available, a few strong identifying attributes are of particular interest: first name, middle name, last name, date of birth, and student identifier. Some of the data quality (DQ) issues identified with these attributes and their rates are summarized below in Tables 2 and 3.

TABLE II. DATA QUALITY ISSUES IN DATA SET A

Data Quality Issue        Data Set A    %
Number in First Name
Number in Middle Name
Number in Last Name
Virgule in First Name
Asterisk in First Name
Total Problems
Total Records             3,234,292

TABLE III. DATA QUALITY ISSUES IN DATA SET B

Data Quality Issue        Data Set B    %
Number in First Name
Number in Middle Name
Number in Last Name
Virgule in First Name
Asterisk in First Name
Total Problems
Total Records             3,255,513

These tables point out some of the obvious and easily quantifiable data quality issues present in the two data sets. Several other data quality issues occur in these attributes. The student name fields have some particularly interesting and challenging problems, which occur frequently enough throughout the data set to increase the number of errors made by the ER process. In many records the first name field holds not only the student's first name but also a nickname. This creates values that look like "Joseph (Joey)" or "Joseph Joey". In other cases, many Hispanic students have a hyphenated name where one part comes from the father and the other comes from the mother. Upon data entry, the first of the two names is sometimes placed in the middle name field. This has a detrimental impact on matching using the middle and last name fields. In addition, the presence of numbers or special characters in any of the three name fields can cause problems.

Some problems affect multiple attributes. Some of these unique cases can be summarized briefly. In some cases one attribute is placed in the incorrect field. Cases involving the phone number, student identifier, and address fields have been identified where these values are actually in one of the student name fields. The data also shows a trend in naming twins. Parents often give twins very similar names, such as Terrell and Jerrell. Occasionally, this similarity extends to the middle names as well. With the date of birth and last name fields already identical, differentiating twins in the match rules is problematic. In some cases, combining this with erroneous or sequential student identifiers can create FP outcomes.

5. Methodology

How can managers of entity data overcome data quality problems when performing ER? Appropriate similarity and comparator functions can make a notable improvement.

IBM Alpha Code - IBM Alpha Code is a name encoding algorithm. Its coding rules produce a 14-digit phonetic key for a name [11]. Using these phonetic keys, names that are spelled differently but pronounced the same can be matched to each other. For example: value 1 = "Rodgers" and value 2 = "Rogers".

NYSIIS - The New York State Identification and Intelligence System (NYSIIS) code is a phonetic algorithm. Much like the previous algorithm, it allows names with different spellings to produce a match. For example: value 1 = "Carry" and value 2 = "Carrie".

Soundex - Soundex can be used to find values that have similar pronunciation but different spelling. This function can absorb some misspellings and even some transposed characters. For example, value 1 = "Damieva" and value 2 = "Dameiva". These two values produce the same Soundex hash value, creating a match.

Scan - To overcome special characters in names, the similarity function Scan can be used.
It is often applied in preprocessing, before ER is performed, and can filter out all special characters so that only letters or alphanumeric characters remain. For example, value 1 = "JAMES\\" and value 2 = "JAMES". Scan can also reorder strings, read them from right to left instead of left to right, and transform the case of alphabetic characters: it can force all characters to lower case or upper case, or keep the original case of the string. For example, "Eric" can be generated as "ERIC" after using Scan.

Sometimes these similarity functions will create FPs and FNs. For example, suppose two different rules are used to produce two different ER results from the same data set. The first rule requires an exact match on student first name, student last name, and date of birth. The second rule requires a Soundex match on student first name and an exact match on last name and date of birth. After performing a split comparison of the two results, as in previous research [7], the FPs and FNs created by the second rule can be identified and their rates can be calculated. The calculation for the approximate FP percentage rate is shown in equation (1). The results are shown in Table 3. Since these FPs were identified using split analysis, they are considered worst-case FP rates. Split analysis is a methodology used to analyze splits in the clusters between two different link identifiers. How this process works has been discussed in detail in recent research [7].

(1)

The FP rate captures only one side of how well the rules are performing. For this reason, the user should try to reduce both FP and FN rates as far as possible when creating and testing rules. These results focus on the FN rates in particular.

This research focuses on three similarity functions: Soundex, NYSIIS, and IBM Alpha Code.
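To illustrate the two rules described above, the following minimal Python sketch implements American Soundex and expresses each rule as a list of terms joined by AND. It is not the OYSTER implementation. The record layout, the field names (first, last, dob), the sample last name and date of birth, and all helper names are assumptions made for this example.

```python
# Minimal sketch: American Soundex plus two Boolean match rules.
# Field names and sample records are hypothetical, not taken from the data set.

_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODE = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or "" if no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODE.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        digit = _CODE.get(ch, "")  # vowels and Y have no code
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]


def exact(attr):
    """Term: the attribute values are identical."""
    return lambda a, b: a[attr] == b[attr]


def same_soundex(attr):
    """Term: the attribute values share a non-empty Soundex code."""
    def term(a, b):
        key = soundex(a[attr])
        return key != "" and key == soundex(b[attr])
    return term


# A rule is a list of terms joined by AND.
RULE_EXACT = [exact("first"), exact("last"), exact("dob")]
RULE_SOUNDEX = [same_soundex("first"), exact("last"), exact("dob")]


def rule_is_true(rule, a, b):
    return all(term(a, b) for term in rule)


def is_link(rules, a, b):
    """Rules are joined by OR: one true rule is enough to link the pair."""
    return any(rule_is_true(rule, a, b) for rule in rules)


rec_a = {"first": "Damieva", "last": "Smith", "dob": "2001-04-17"}
rec_b = {"first": "Dameiva", "last": "Smith", "dob": "2001-04-17"}

print(soundex("Damieva"), soundex("Dameiva"))    # D510 D510
print(rule_is_true(RULE_EXACT, rec_a, rec_b))    # False
print(rule_is_true(RULE_SOUNDEX, rec_a, rec_b))  # True
```

Under these assumptions the pair is linked only by the Soundex rule. The same looseness is what can introduce FPs: for example, "Robert" and "Rupert" also share the code R163, so two different students with those first names, the same last name, and the same date of birth would be linked by the second rule.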
These three similarity functions can also be used in indexing, which can speed up the process, especially for large data sets. After testing the three functions on the same student enrollment data, the percentages of TP and FP are shown in the table below (Table 4):

TABLE IV. THE PERCENTAGES OF TP AND FP

            TP       FP       Not Sure
Soundex     34.6%    62.5%    2.9%
NYSIIS      36.5%    56.9%    6.7%
IBMAlpha    27.4%    68.0%    4.6%

Fig. 1. Bar Graph of TP and FP Percentages (series: TP, FP, NotSure; groups: Soundex, Nysiis, IBMalpha)

Comparing the results of these three similarity functions against the benchmark, NYSIIS clearly performs best, with the highest percentage of TP (36.5%) and the lowest percentage of FP (56.9%).

6. Conclusions

Data quality problems often present a formidable obstacle to obtaining an accurate and effective ER result. The approaches described in this paper for overcoming data quality issues in student enrollment data during ER have been successfully implemented in OYSTER. The success of any ER process is often directly related to the time spent profiling the data and identifying these types of data quality problems. Effectively identifying and categorizing these problems directly affects the quality of the ER results at the end of such processes. The approaches above include similarity functions such as Soundex and IBM Alpha Code, which can overcome issues such as a nickname and a given name contained together in one field, transposed characters, and other typographical or spelling errors. Additionally, other similarity functions such as Scan can overcome issues such as special characters, numbers, and misspellings. While these approaches contribute greatly to improving the ER results, there is a limit to how much they can aid in reducing the FP rate. The hash code functions tested in this paper cannot overcome all of the issues listed earlier. For example, they cannot directly overcome variations produced by the inclusion of a nickname in some references but a given name in other references.
However, applying these hash code functions along with similarity functions such as q-gram tetrahedral ratio, Levenshtein edit distance, and nickname matching could further mitigate the issues encountered in this particular student enrollment data.

7. Acknowledgment

The research described in this paper has been supported in part through grants from the Arkansas Department of Education and Black Oak Analytics.

8. References

[1] Talburt, John R. Entity Resolution and Information Quality. San Francisco, CA: Morgan Kaufmann/Elsevier.
[2] Melody Penning and John Talburt. "Information Quality Assessment and Improvement of Student Information in the University Environment". Information and Knowledge Engineering.
[3] Yinle Zhou, John Talburt, Fumiko Kobayashi and Eric D. Nelson. "Implementing Boolean Matching Rules in an Entity Resolution System using XML Scripts". Information and Knowledge Engineering.
[4] Holland, G. and Talburt, J. (2010). "q-gram Tetrahedral Ratio (qTR) for approximate pattern matching". Conference on Applied Research in Information Technology, University of Central Arkansas, Conway, AR.
[5] Ivan Fellegi and Alan Sunter. "A Theory for Record Linkage". Journal of the American Statistical Association, Vol. 64, No. 328, 1969.
[6] Steven Whang and Hector Garcia-Molina. "Entity Resolution with Evolving Rules". Proceedings of the VLDB Endowment, Vol. 3, Issue 1-2, September 2010.
[7] Huzaifa Syed, Fan Lui, Daniel Pullen, Ningning Wu, John Talburt. "Developing and Refining Matching Rules for Entity Resolution". Information and Knowledge Engineering, 2012.
[8] Christen, Peter. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer.
[9] Zhou, Y. and Talburt, J. (2011). "Entity Identity Information Management (EIIM)". International Conference on Information Quality (ICIQ-11), Adelaide, Australia, November 18-20, 2011.
[10] Zhou, Y. and Talburt, J. (2011). "The Role of Asserted Resolution in Entity Identity Management". The 2011 International Conference on Information and Knowledge Engineering (IKE'11), Las Vegas, Nevada, July 18-20, 2011.
[11] Wang, Pei, Pullen, Daniel, Wu, Ningning, and Talburt, John. (2013). "Mitigating Data Quality Impairment on Entity Resolution Errors in Student Enrollment Data". Information and Knowledge Engineering Conference, 2013.
