Avoiding Doubles in Distributed Nominative Medical Databases: Optimization of the Needleman and Wunsch Algorithm


Loïc Le Mignot, Claude Mugnier, Mohamed Ben Saïd, Jean-Philippe Jais, Jean-Baptiste Richard, Christine Le Bihan-Benjamin, Pierre Taupin, Paul Landais

Université Paris-Descartes, Faculté de Médecine; Assistance Publique-Hôpitaux de Paris; Hôpital Necker, EA222, Service de Biostatistique et d'Informatique Médicale, Hôpital Necker, 149 rue de Sèvres, Paris, France.

Abstract

Difficulties in reconstructing patients' trajectories in public health information systems arise from errors in patient identification processes. A crucial issue is avoiding doubles in distributed web databases. We explored the Needleman and Wunsch (N&W) algorithm in order to optimize the properties of string matching. Five variants of the N&W algorithm were developed. The algorithms were implemented for a web Multi-Source Information System dedicated to tracking patients with End-Stage Renal Disease at both regional and national levels. A simulated study database of 73,210 records was created. An insertion or suppression of each character of the original string was simulated. With an acceptable distance set to 5 modifications, the rate of new names wrongly detected as double entries was 2%. The search was sensitive and specific, with an acceptable detection time. It detected doubles with up to 10% of modifications, which is above the estimated error rate. A variant of the N&W algorithm, designed as a cut-off heuristic, proved to be efficient for the search of double entries occurring in nominative distributed databases.

Keywords

Dynamic algorithm; Alignment; Edit distance; Pattern matching; End-Stage Renal Disease

1. Introduction

This work focused on character string comparison between the information entered by the user and the information stored in a simulated national patient database.
Our method is derived from the Needleman and Wunsch (N&W) algorithm developed in bio-computing [1], which searches for similarities in the amino acid sequences of two proteins and for the sequence of maximum match between two strings. An important implication of the N&W algorithm is the notion of distance between two strings. The goal is to make the distance as small as possible when one of the strings is likely to be an erroneous variant of the other under the error model in use. One error model is the edit distance, which allows deleting, inserting or substituting single characters [2]. We considered that all operations have the same cost, and we focused on finding the minimum number of insertions, deletions and substitutions needed to make both strings equal. A value of acceptable distance between two strings is assigned when testing the N&W algorithm. If the minimal computed distance is lower than or equal to the acceptable distance, the entered information is considered to have a match in the system.
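As an illustration of this decision rule, the following minimal sketch computes the unit-cost edit distance over the full matrix and compares it with an acceptable distance. The function names `edit_distance` and `is_probable_double` and the default threshold of 5 are our assumptions for the example; this is not the code of the system described here.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b, computed over the full matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        s[i][0] = i  # i deletions turn a[:i] into the empty string
    for j in range(cols):
        s[0][j] = j  # j insertions turn the empty string into b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            dist = 0 if a[i - 1] == b[j - 1] else 1
            s[i][j] = min(
                s[i - 1][j - 1] + dist,  # match or substitution
                s[i - 1][j] + 1,         # deletion
                s[i][j - 1] + 1,         # insertion
            )
    return s[-1][-1]


def is_probable_double(entered: str, stored: str, acceptable_distance: int = 5) -> bool:
    """Hypothetical decision rule: a match when the distance is within the threshold."""
    return edit_distance(entered, stored) <= acceptable_distance
```

For instance, `edit_distance("MUGNIER", "MUNIER")` counts one deletion and `edit_distance("MUGNIER", "MAGNIER")` counts one substitution, so both pairs would be reported as probable doubles for any acceptable distance of 1 or more.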

The objective of our work, based on the N&W algorithm, was to optimize a method that allows deciding whether, given two strings, one is produced from the other by a limited series of modifications. The experimental design was applied to a Multi-Source Information System including a dynamic web server for identifying doubles in a nominative national database of End-Stage Renal Disease (ESRD) patients.

2. Material and Methods

Patients and organizational support

A Multi-Source Information System (MSIS-REIN) was dedicated to collecting continuous and exhaustive records of all ESRD cases and their clinical follow-up in France [3]. It collates, in a standardized representation, a condensed patient record elaborated by health professionals. MSIS-REIN aimed to fulfil the following requirements: scalability, portability, reliability, accessibility, and cost effectiveness oriented toward open source software. The use of standard references and the respect of privacy, confidentiality and security of patient information were required as well. The architecture of MSIS-REIN has been described elsewhere [4]. Briefly, it is based on an n-tier architecture. Via a web browser, the client tier connects to a middle tier that is in relation with several databases: the identification database, the production database and the data warehouse. The middle tier supports client services through Web containers and business logic services through component containers. Business logic components in the middleware support transactions toward the databases. A Web server application interacts with the production database system. It consists of a collection of Web components: Java Server Pages (JSP), Servlets and other resources (graphics, scripting programs or plugins) organized in a directory structure. Web components interact within a Web container, Tomcat, which corresponds to a runtime environment providing a context and life cycle management.
MSIS-REIN was authorized by the Commission Informatique et Libertés.

Description of the algorithms

Both strings are arranged in a two-dimensional matrix and paired for comparison, with the same value for the concordance score as for the penalty of modifying, inserting or deleting one character. Five variants of the N&W algorithm were implemented and tested, with different optimization attempts. These algorithms were all functionally equivalent. A cell S(i,j) within the similarity matrix defined above can be viewed as a distance between the two substrings, using the relation:

\[ \mathrm{Distance}(i,j) = \mathrm{maximum}(i,j) - \mathrm{similarity}(i,j) \]

The corresponding matrix of distances is defined by:

\[ S(i,j) = \min \begin{cases} S(i-1,j-1) + \mathrm{dist}(i,j) \\ S(i-1,j) + P \\ S(i,j-1) + P \end{cases} \]

where dist(i,j) = 0 if the characters are equal, dist(i,j) = 1 if the characters are different, and P = 1 is the penalty score of inserting or deleting a character.

- Alg #1: Direct implementation of the Needleman-Wunsch algorithm as described above. The full matrix is computed for each record.

- Alg #2, #3, #4: Variations on the use of the acceptable distance as a loop breaker, exploiting the properties of the distance matrix (cut-off heuristic).
- Alg #5: Use of common prefixes in records to avoid redundant computations. As the patient table grows, more and more concatenated strings will be found to have common prefixes. With a sorted list, time can be saved by re-using part of the preceding matrix.

Implementation of the algorithm

Eliminating non-ASCII character errors: Non-alphanumeric characters (space, hyphen, apostrophe, etc.) are eliminated and non-ASCII characters (é, è, ç, ô, etc.) are transformed into capital ASCII characters. For example, Jean-françois La Pérouse, male, born on December 1st, 1954 becomes: JEANFRANCOISLAPEROUSE

Eliminating mistyping and orthographic errors: The score in case of deletion or substitution is represented in Figure 1.

The experiment was run on a standard PC with an Intel Pentium III processor and 396 megabytes of random access memory. The software development follows the same approach and environment as used for MSIS-REIN: a dynamic web application based on JSP/Java servlets, a web container (Tomcat, from the Apache Jakarta open source projects) and the MySQL open source database system. A simulated study database of 73,210 records was created. The characteristics of the concatenated data set are presented in Table 1.

M U G N I E R      M U G N I E R
M U N I E R        M A G N I E R

Figure 1 Deletion penalty (deletion of one character) on the left and substitution penalty (mismatch of one character) on the right (distance = 1)

Table 1 Characteristics of the tested data set.
Rows | Min length | Max length | Range | Average length

When a new patient name is entered, a Java program searches for an existing patient record in the database. It comprises two parts. The first is a function that directly searches for an exact match between the concatenated stored data and a calculated string derived from the entered information.
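The calculated string used for this exact-match search could be derived along the following lines. This is a sketch under our own assumptions: the names `normalize` and `has_exact_match` are hypothetical, accents are folded by Unicode decomposition (one possible technique, not necessarily the one used in the system), and only the name part is handled, since the way sex and birth date enter the concatenated string is not detailed here.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold accents, drop non-alphanumeric characters and convert to capitals."""
    # NFD splits an accented letter into its base letter plus a combining mark,
    # e.g. "é" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep ASCII letters and digits only: spaces, hyphens, apostrophes and the
    # combining marks are dropped. Letters without a decomposition (such as "ø")
    # are dropped as well, which a real system may want to handle explicitly.
    return "".join(ch for ch in decomposed.upper() if ch.isascii() and ch.isalnum())


def has_exact_match(entered: str, stored_keys: set) -> bool:
    """Hypothetical first search stage: exact lookup of the normalized entry."""
    return normalize(entered) in stored_keys
```

With this sketch, `normalize("Jean-françois La Pérouse")` yields `JEANFRANCOISLAPEROUSE`, the form given in the example above. Storing the normalized keys in a set keeps the exact-match stage independent of the size of the table.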
In case of a match, a dynamic web page is generated. The user is asked for confirmation to create the patient record. If no match is found, a second program function, implementing the dynamic programming algorithm of N&W, is run and searches the concatenated data for a patient record with a spelling close to the user-entered information. The program selects potential matches according to their maximum match score, in conformity with the implemented algorithm. If potential matches are found, a dynamic web page informs the user. They are displayed according to the user's profile and authorizations to access nominative data of patients, whether or not the user is in charge of them.
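The second search function, combined with the use of the acceptable distance as a loop breaker, can be sketched as follows. The details of Alg #2 to #4 are not given in this paper, so this is only one possible form of cut-off: the computation stops as soon as every cell of the current row exceeds the limit, because the final distance can never be smaller than the minimum of any row. The names `bounded_distance` and `close_records` are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def bounded_distance(a: str, b: str, limit: int) -> Optional[int]:
    """Unit-cost edit distance, or None as soon as it is known to exceed limit."""
    # A length difference larger than the limit already rules the pair out.
    if abs(len(a) - len(b)) > limit:
        return None
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]
        for j in range(1, len(b) + 1):
            dist = 0 if a[i - 1] == b[j - 1] else 1
            current.append(min(previous[j - 1] + dist,  # match or substitution
                               previous[j] + 1,         # deletion
                               current[j - 1] + 1))     # insertion
        # Cut-off: every alignment path crosses this row and costs never
        # decrease along a path, so the final distance is at least min(current).
        if min(current) > limit:
            return None
        previous = current
    return previous[-1] if previous[-1] <= limit else None


def close_records(entered: str, stored: Iterable[str], limit: int = 5) -> List[Tuple[int, str]]:
    """Return (distance, record) pairs within the limit, closest first."""
    hits = []
    for record in stored:
        distance = bounded_distance(entered, record, limit)
        if distance is not None:
            hits.append((distance, record))
    return sorted(hits)
```

Only two rows of the matrix are kept in memory, and dissimilar records are abandoned after a few rows, which is where most of the time is expected to be saved on a large table.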

3. Results

Accuracy

A modification was simulated by applying a probability of change, insertion and suppression to each character of the initial string. Given an "acceptable distance" set to 5, the false positive rate, i.e. new names detected as double entries, was 2% (Figure 3). Since the matching probability depends on the acceptable distance, a sizeable distance will make the algorithm unable to differentiate new entries from doubles. Moreover, the larger a data set, the higher the probability that it contains a close enough record. The false positive rate is thus expected to increase with the number of records.

Specificity

We checked whether double entries were properly detected. Given the acceptable distance set to 5, the detection rate of doubles is presented in Figure 4 according to the probability of simulated errors at data entry.

Time consumption

Direct access to the last record in the database took 472 milliseconds in case of perfect matching. The time needed to search for a record within the acceptable distance appears below:

Algorithm version | #1 | #2 | #3 | #4 | #5
Answering time (seconds) |

Quality of the chosen method

The comparison of a sufficient number of different strings provides the general distribution of the matching scores, which constitutes our "research space". Then, by comparing a string with its twisted version (simulation of a double entry), we get the average score of a "positive case". The quality of the method (in terms of discriminative power) is given by the quantity Q, estimated by the following ratio:

\[ Q = \frac{S_g - S_p}{D_g} \]

which compares the difference between the "matching" and "no-matching" scores to the standard deviation of the general score, where S_g is the average score for the overall distribution, S_p the average score for positive cases, and D_g the standard deviation of the overall distribution.
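The ratio Q can be estimated from two samples of scores, as in the following sketch. We assume here that the scores are distances, that the numerator is the difference Sg - Sp, and that Dg is the population standard deviation; the function name `quality` is hypothetical.

```python
from statistics import mean, pstdev


def quality(general_scores, positive_scores):
    """Estimate Q = (Sg - Sp) / Dg from two samples of matching scores."""
    sg = mean(general_scores)    # average score over unrelated pairs of strings
    sp = mean(positive_scores)   # average score over (string, twisted string) pairs
    dg = pstdev(general_scores)  # spread of the general distribution (assumed population form)
    return (sg - sp) / dg
```

For example, general distances of 8 and 12 (mean 10, standard deviation 2) and a positive-case distance of 2 give Q = 4: the positive cases lie four standard deviations away from the centre of the general distribution.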
We checked whether the method, originally conceived for random sequences, is still relevant for matching our concatenated strings (any two "real" strings being closer than any two random strings). As expected, the results presented in Figure 5 show that a better quality of match is observed with randomly generated strings than with real names. Indeed, the discriminative power appeared better when the similarity between two strings is low.
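The random strings and the "twisted" versions used in such experiments could be generated as sketched below. The alphabet, the equal split between change, insertion and suppression, and the names `random_string` and `perturb` are our assumptions; the simulation actually used may differ.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # assumed character set


def random_string(length: int, rng: random.Random) -> str:
    """Draw a string of independent, uniformly distributed characters."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def perturb(text: str, p_error: float, rng: random.Random) -> str:
    """Apply a change, insertion or suppression to each character with probability p_error."""
    out = []
    for ch in text:
        if rng.random() < p_error:
            kind = rng.choice(("change", "insert", "delete"))
            if kind == "change":
                # The replacement is drawn at random and may equal the original.
                out.append(rng.choice(ALPHABET))
            elif kind == "insert":
                out.append(ch)
                out.append(rng.choice(ALPHABET))
            # "delete": the character is simply not copied.
        else:
            out.append(ch)
    return "".join(out)
```

Passing a seeded generator such as `random.Random(1)` makes a simulation run reproducible, so that the same set of twisted strings can be compared across algorithm variants.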

Figure 3 - Rate of false positives (2%) as a function of the threshold of acceptable distance (distance = 5).

Figure 4 - Detection rate of doubles (~100%) given the probability of error at data entry (10% of errors in a correct string).

Figure 5 - Quality of the method according to the probability of error (threshold set to 5) for strings based on random characters and for real strings (corresponding to names).

A study is currently in progress to explore whether the sa

[6] is low and not adapted to nominative data for medical applications. It mainly consists in removing the vowels (plus H and W), merging the 18 consonants into 6 different numeric codes depending on their phonetic value, and then keeping 3 of the resulting digits added to the first letter of the word. In this way, the same code is used for many different words. For example, in a set of 30,000 names (all distinct from one another), a code is found to appear about 12 times on average. That is, for almost every new name entered, the algorithm would identify it as a double entry. It is therefore irrelevant to our problem, unless another phonetic coding is defined.

The algorithm we used belongs to a family of algorithms derived from the Levenshtein distance [7], such as the Jaro-Winkler [8][9] or Smith-Waterman [10] algorithms. The Smith-Waterman algorithm is dedicated to local subsequence matching: applied to character strings, it may therefore be more relevant for search queries than for identity recognition, where the whole signature string has to be matched. A recent work [11] presents an approach based on the Porter-Jaro-Winkler algorithm, using weighting on identity items, similar items and string normalization. This work is focused on extracting identity aggregates in an existing database, which is a different approach from our main objective.

The method we described proved to fulfil our goals. It provides a satisfactory answering time and specificity, so as to be easily accepted by the users. Moreover, its high sensitivity avoids double entries in our database, in accordance with our goal of detecting upstream a single new entry, quickly enough for the process to remain transparent to the end user.

6. References

[1] Needleman S, Wunsch C. A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Proteins. J Mol Biol 1970;48.
[2] Navarro G. A Guided Tour to Approximate String Matching. ACM Computing Surveys 2001;33.
[3] Landais P, Simonet A, Guillon D, Jacquelinet C, Ben Said M, Mugnier C, Simonet M. SIMS REIN: a multi-source information system for end-stage renal disease. C R Biol Apr;325(4).
[4] Ben Saïd M, Simonet A, Guillon D, Jacquelinet C, Gaspoz F, Dufour E, Mugnier C, Jais JP, Simonet M, Landais P. A Dynamic Web Application Within N-Tier Architecture: a Multi-Source Information System for End-Stage Renal Disease. In: Baud R, Fieschi M, Le Beux P, Ruch P, eds. Proceedings of MIE2003, Saint Malo: IOS Press, 2003.
[5] Knuth D. The Soundex algorithm. In: The Art of Computer Programming, vol 3: Sorting and Searching. Addison-Wesley, second printing 1975, p 725.
[6] Sideli RV, Friedman C. Validating Patient Names in an Integrated Clinical Information System. In: Clayton P, ed. Proceedings of the Fifteenth Annual Symposium on Computer Applications in Medical Care, Washington D.C.: McGraw-Hill, 1991.
[7] Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Sov Phys Dokl 1966;6.
[8] Porter EH, Winkler WE. Approximate String Comparison and its Effect on an Advanced Record Linkage System. U.S. Bureau of Census. Research Reports Statistics #
[9] Winkler WE. Approximate String Comparator Search Strategies for Very Large Administrative Lists. U.S. Bureau of Census. Research Reports Series Statistics #
[10] Smith TF, Waterman MS. J Mol Biol 1981;147.
[11] Paumier JP, Sauleau EA, Buemi A. Journées francophones d'informatique médicale, Lille, May.

Address for correspondence

Pr Paul Landais, Service de Biostatistique et d'Informatique Médicale, Hôpital Necker-Enfants Malades, 149, rue de Sèvres, Paris cedex 15, landais@necker.fr

A Multi-Source Information System via the Internet for End-Stage Renal Disease: Scalability and Data Quality

Mohamed Ben Saïd a, Loic Le Mignot a, Claude Mugnier a, Jean Baptiste Richard a, Christine