Avoiding Doubles in Distributed Nominative Medical Databases: Optimization of the Needleman and Wunsch Algorithm


Loïc Le Mignot, Claude Mugnier, Mohamed Ben Saïd, Jean-Philippe Jais, Jean-Baptiste Richard, Christine Le Bihan-Benjamin, Pierre Taupin, Paul Landais

Université Paris-Descartes, Faculté de Médecine; Assistance Publique-Hôpitaux de Paris; Hôpital Necker, EA222, Service de Biostatistique et d'Informatique Médicale, Hôpital Necker, 149 rue de Sèvres, Paris, France.

Abstract

Errors in patient identification make it difficult to reconstitute patients' trajectories in public health information systems, so avoiding doubles in distributed web databases is a crucial issue. We explored the Needleman and Wunsch (N&W) algorithm in order to optimize its string-matching properties. Five variants of the N&W algorithm were developed and implemented for a web-based Multi-Source Information System dedicated to tracking patients with End-Stage Renal Disease at both regional and national levels. A simulated study database of 73,210 records was created. An insertion or suppression of each character of the original string was simulated. The rate of double entries was 2% given an acceptable distance set to 5 modifications. The search was sensitive and specific, with an acceptable detection time. It detected up to 10% of modifications, which is above the estimated error rate. A variant of the N&W algorithm designed as a cut-off heuristic proved efficient for detecting double entries in nominative distributed databases.

Keywords

Dynamic algorithm; Alignment; Edit distance; Pattern matching; End-Stage Renal Disease

1. Introduction

This work focused on comparing character strings: the information entered by the user against the information stored in a simulated national patient database.
Our method is derived from the Needleman and Wunsch (N&W) algorithm, developed in bio-computing [1] to search for similarities in the amino acid sequences of two proteins and to find the sequence of maximum match between two strings. An important by-product of the N&W algorithm is a distance between two strings. The goal is to make this distance as small as possible when one of the strings is likely to be an erroneous variant of the other under the error model in use. One such error model is the edit distance, which allows deleting, inserting or substituting single characters [2]. We assigned the same cost to all operations and looked for the minimum number of insertions, deletions and substitutions needed to make both strings equal. An acceptable distance between two strings is set when testing the N&W algorithm. If the minimal computed distance is less than or equal to the acceptable distance, the information entered is considered to have a match in the system.
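The unit-cost edit distance and the acceptable-distance test described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Java implementation; the function names `edit_distance` and `is_probable_double` are hypothetical, and the default threshold of 5 is simply the value used in the study.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed with the full dynamic-programming matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    # s[i][j] is the distance between the first i characters of a
    # and the first j characters of b.
    s = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        s[i][0] = i  # i deletions turn a[:i] into the empty string
    for j in range(cols):
        s[0][j] = j  # j insertions turn the empty string into b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            s[i][j] = min(
                s[i - 1][j - 1] + mismatch,  # match or substitution
                s[i - 1][j] + 1,             # deletion
                s[i][j - 1] + 1,             # insertion
            )
    return s[-1][-1]


def is_probable_double(entered: str, stored: str, acceptable_distance: int = 5) -> bool:
    """Hypothetical helper: a stored string matches if it lies within the threshold."""
    return edit_distance(entered, stored) <= acceptable_distance
```

For instance, `edit_distance("MUGNIER", "MUNIER")` counts one deletion, so the two strings would be flagged as a probable double for any acceptable distance of 1 or more.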

The objective of our work, based on the N&W algorithm, was to optimize a method for deciding whether, given two strings, one is produced from the other by a limited series of modifications. The experimental design was applied to a Multi-Source Information System including a dynamic web server, in order to identify doubles in a nominative national database of End-Stage Renal Disease (ESRD) patients.

2. Material and Methods

Patients and organizational support

A Multi-Source Information System (MSIS-REIN) was dedicated to collecting continuous and exhaustive records of all ESRD cases and their clinical follow-up in France [3]. It collates, in a standardized representation, a condensed patient record prepared by health professionals. MSIS-REIN aimed to fulfil the following requirements: scalability, portability, reliability, accessibility, and cost-effectiveness oriented toward open source software. The use of standard references and respect for the privacy, confidentiality and security of patient information were required as well. The architecture of MSIS-REIN has been described elsewhere [4]. Briefly, it is based on an n-tier architecture. Via a web browser, the client tier connects to a middle tier that communicates with several databases: the identification database, the production database and the data warehouse. The middle tier supports client services through Web containers and business logic services through component containers. Business logic components in the middleware support transactions toward the databases. A Web server application interacts with the production database system; it consists of a collection of Web components, namely Java Server Pages (JSP), Servlets and other resources (graphics, scripting programs or plugins), organized in a directory structure. Web components interact within a Web container, Tomcat, which provides a runtime environment with a context and life cycle management.
MSIS-REIN was authorized by the Commission Informatique et Libertés.

Description of the algorithms

Both strings are arranged in a two-dimensional matrix and paired for comparison, with the same value for the concordance score as for the penalty of modifying, inserting or deleting one character. Five variants of the N&W algorithm were implemented and tested, with different optimization attempts. These algorithms were all functionally equivalent. A cell S(i,j) within the similarity matrix defined above can be viewed as a distance between the two substrings, using the relation:

$$\mathrm{Distance}(i,j) = \mathrm{maximum}(i,j) - \mathrm{similarity}(i,j)$$

The corresponding matrix of distances is defined by:

$$S(i,j) = \min \begin{cases} S(i-1,j-1) + \mathrm{dist}(i,j) \\ S(i-1,j) + P \\ S(i,j-1) + P \end{cases}$$

where dist(i,j) = 0 if the characters are equal, dist(i,j) = 1 if the characters are different, and P = 1 is the penalty score for inserting or deleting a character.

- Alg #1: Direct implementation of the Needleman-Wunsch algorithm as described above. The full matrix is computed for each record.

- Alg #2, #3, #4: Variations on the use of the acceptable distance as a loop breaker, exploiting the properties of the distance matrix (cut-off heuristic).
- Alg #5: Use of common prefixes in records to avoid redundant computations. As the patient table grows, more and more concatenated strings will share common prefixes. With a sorted list, time can be saved by re-using part of the preceding matrix.

Implementation of the algorithm

Eliminating non-ASCII character errors: Non-alphanumeric characters (space, hyphen, apostrophe, etc.) are eliminated and non-ASCII characters (é, è, ç, ô, etc.) are transformed into capital ASCII characters. For example, Jean-françois La Pérouse, male, born on December 1st, 1954, becomes: JEANFRANCOISLAPEROUSE

Eliminating mistyping and orthographic errors: The score in case of deletion or substitution is represented in Figure 1.

The experiment was run on an ordinary PC with an Intel Pentium III processor and 396 megabytes of random access memory. The software development followed the same approach and environment as MSIS-REIN: a dynamic web application based on JSP and Java servlets, the Tomcat web container from the Apache Jakarta open source project, and the MySQL open source database system. A simulated study database of 73,210 records was created. The characteristics of the concatenated data set are presented in Table 1.

MUGNIER / MUNIER          MUGNIER / MAGNIER

Figure 1 Deletion penalty (deletion of one character) on the left and substitution penalty (mismatch of one character) on the right (distance = 1)

Table 1 Characteristics of the tested data set.
Rows    Min length    Max length    Range    Average length

When a new patient name is entered, a Java program searches for an existing patient record in the database. It comprises two parts. The first is a function that searches directly for an exact match between the concatenated stored data and a string calculated from the entered information.
In case of an exact match, a dynamic web page is generated and the user is asked for confirmation before creating the patient record. If no match is found, a second program function implementing the dynamic programming algorithm of N&W is run and searches the concatenated data for a patient record with a spelling close to the user-entered information. The program selects potential matches according to their maximum match score, in conformity with the implemented algorithm. In case of potential matches, a dynamic web page informs the user. The matches are displayed according to the user's profile and authorization to access the nominative data of patients, whether or not the user is in charge of them.
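The normalization step, the cut-off idea and the two-stage lookup described in this section can be sketched together as follows. This is an illustrative Python sketch under stated assumptions, not the authors' Java code: the names `normalize`, `bounded_distance` and `find_candidates` are hypothetical, the stored keys are assumed to be already normalized, and the loop breaker shown is only one plausible reading of the cut-off heuristic (the paper does not detail how Alg #2 to #4 differ).

```python
import unicodedata


def normalize(text: str) -> str:
    """Strip accents, drop non-alphanumeric characters, and upper-case the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only if ch.isalnum()).upper()


def bounded_distance(a: str, b: str, limit: int):
    """Return the unit-cost edit distance if it is <= limit, otherwise None."""
    if abs(len(a) - len(b)) > limit:
        return None  # the length difference alone already exceeds the limit
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j - 1] + cost,  # match or substitution
                               previous[j] + 1,         # deletion
                               current[j - 1] + 1))     # insertion
        # Row minima never decrease, so once a whole row exceeds the limit
        # the final distance must exceed it too: stop early.
        if min(current) > limit:
            return None
        previous = current
    return previous[-1] if previous[-1] <= limit else None


def find_candidates(entered: str, stored_keys: set, acceptable_distance: int = 5):
    """Exact lookup first; fall back to the approximate search only if it fails."""
    key = normalize(entered)
    if key in stored_keys:
        return [(key, 0)]
    matches = []
    for stored in stored_keys:
        d = bounded_distance(key, stored, acceptable_distance)
        if d is not None:
            matches.append((stored, d))
    # Closest candidates first; ties are broken alphabetically for stable output.
    return sorted(matches, key=lambda m: (m[1], m[0]))
```

With these definitions, `normalize("Jean-françois La Pérouse")` yields `JEANFRANCOISLAPEROUSE`, the form given in the example above. Keeping only two rows of the matrix is a memory-saving choice of this sketch; re-using the preceding matrix across sorted records, as in Alg #5, would require retaining more state.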

3. Results

Accuracy

A modification was simulated by applying, to each character of the initial string, a probability of change, insertion or suppression. Given an "acceptable distance" set to 5, the false positive rate, i.e. the rate of new names detected as double entries, was 2% (Figure 3). Since the matching probability depends on the acceptable distance, a large acceptable distance makes the algorithm unable to differentiate new entries from doubles. Moreover, the larger the data set, the higher the probability that it contains a close enough record. The false positive rate is thus expected to increase with the number of records.

Specificity

We checked whether double entries were properly detected. With the acceptable distance set to 5, the detection rate of doubles is presented in Figure 4 according to the probability of simulated errors at data entry.

Time consumption

Direct access to the last record in the database took 472 milliseconds in case of perfect matching. Search times for a record within the acceptable distance appear below:

Algorithm version: #1 #2 #3 #4 #5
Answering time (seconds):

Quality of the chosen method

The comparison of a sufficient number of different strings provides the general distribution of the matching scores, which constitutes our "research space". Then, by comparing a string with its twisted version (simulation of a double entry), we get the average score of a "positive case". The quality of the method (in terms of discriminative power) is given by the quantity Q, estimated by the following ratio:

$$Q = \frac{|S_g - S_p|}{D_g}$$

which compares the difference between the "matching" and "no-matching" scores to the standard deviation of the general score, where S_g is the average score for the overall distribution, S_p the average score for positive cases, and D_g the standard deviation of the overall distribution.
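The error simulation and the quality ratio Q can be illustrated with a short Python sketch. Several points here are assumptions rather than details from the paper: the three operations are drawn with equal probability, replacement characters come from the capital ASCII alphabet, Q is taken as the absolute difference of the two means divided by the population standard deviation, and the names `simulate_errors` and `quality` are hypothetical.

```python
import random
import statistics
import string


def simulate_errors(text: str, p: float, rng: random.Random) -> str:
    """Apply a change, insertion or suppression to each character with probability p."""
    out = []
    for ch in text:
        if rng.random() < p:
            op = rng.choice(("change", "insert", "suppress"))
            if op == "change":
                # Assumption: the replacement is drawn uniformly from A-Z
                # (it may occasionally equal the original character).
                out.append(rng.choice(string.ascii_uppercase))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_uppercase))
            # "suppress": the character is simply dropped
        else:
            out.append(ch)
    return "".join(out)


def quality(general_scores, positive_scores) -> float:
    """Q = |S_g - S_p| / D_g, with D_g taken as a population standard deviation."""
    s_g = statistics.mean(general_scores)    # average score, overall distribution
    s_p = statistics.mean(positive_scores)   # average score, positive cases
    d_g = statistics.pstdev(general_scores)  # spread of the overall distribution
    return abs(s_g - s_p) / d_g
```

A seeded generator such as `random.Random(0)` keeps a simulation run reproducible. A larger Q means that the scores of simulated doubles sit further from the bulk of the general distribution, which is the sense of discriminative power used in the text.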
We checked whether the method, originally conceived for random sequences, is still relevant for matching our concatenated strings (any two "real" strings being closer than any two random strings). As expected, the results presented in Figure 5 show that a better quality of match is observed with randomly generated strings than with real names. Indeed, the discriminative power appeared better when the similarity between two strings was low.

Figure 3 Rate of false positives (2%) as a function of the threshold of acceptable distance (distance = 5).

Figure 4 Detection rate of doubles (~100%) given the probability of error at data entry (10% of errors in a correct string).

Figure 5 Quality of the method according to the probability of error (threshold set to 5) for strings based on random characters and for real strings (corresponding to names).

A study is currently in progress to explore whether the sa

[6] is low and not adapted to nominative data for medical applications. It mainly consists in removing the vowels (plus H and W), merging the 18 consonants into 6 different numeric codes depending on their phonetic value, and then keeping 3 of the resulting digits appended to the first letter of the word. In this way, the same code is used for many different words. For example, in a set of 30,000 distinct names, a given code is found to appear about 12 times on average. That is, almost every new name entered would be identified by the algorithm as a double entry. It is therefore irrelevant to our problem, unless another phonetic coding is defined.

The algorithm we used belongs to a family of algorithms derived from the Levenshtein distance [7], such as the Jaro-Winkler [8][9] or Smith-Waterman [10] algorithms. The Smith-Waterman algorithm is dedicated to matching local subsequences: applied to character strings, it may therefore be more relevant for search queries than for identity recognition, where the whole signature string has to be matched. A recent work [11] presents an approach based on the Porter-Jaro-Winkler algorithm, using weighting on identity items, similar items and string normalization. That work focuses on extracting identity aggregates from an existing database, which is a different approach from our main objective.

The method we described proved to fulfil our goals. It provides a satisfactory answering time and specificity, so that it can be easily accepted by the users. Moreover, its high sensitivity avoids double entries in our database, in accordance with our goal of detecting a single new entry upstream, quickly enough for the process to remain transparent to the end user.

6. References

[1] Needleman S, Wunsch C. A general Method Applicable to the Search for Similarities in the Amino Acid Sequences of Proteins. J. Mol. Biol. 1970:48.
[2] Navarro G. A Guided Tour to Approximate String Matching. ACM Computing Surveys 2001:33.
[3] Landais P, Simonet A, Guillon D, Jacquelinet C, Ben Said M, Mugnier C, Simonet M. SIMS REIN: a multi-source information system for end-stage renal disease. C R Biol Apr;325(4).
[4] Ben Saïd M, Simonet A, Guillon D, Jacquelinet C, Gaspoz F, Dufour E, Mugnier C, Jais JP, Simonet M, Landais P. A Dynamic Web Application Within N-Tier Architecture: a Multi-Source Information System for End-Stage Renal Disease. In: Baud R, Fieschi M, Le Beux P, Ruch P, eds. Proceedings of MIE2003, Saint Malo: IOS Press, 2003.
[5] Knuth D. The Soundex algorithm. The Art of Computer Programming, vol 3: Sorting and Searching. Addison-Wesley, second printing 1975, pp 725.
[6] Sideli RV, Friedman C. Validating Patient Names in an Integrated Clinical Information System. In: Clayton P, ed. Proceedings of the Fifteenth Annual Symposium on Computer Applications in Medical Care, Washington D.C.: McGraw-Hill, 1991.
[7] Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 1966;6.
[8] Porter EH, Winkler WE. Approximate String Comparison and its Effect on an Advanced Record Linkage System. U.S. Bureau of Census. Research Reports Statistics.
[9] Winkler WE. Approximate String Comparator Search Strategies for Very Large Administrative Lists. U.S. Bureau of Census. Research Reports Series Statistics.
[10] Smith TF, Waterman MS. J Mol Biol 1981;147.
[11] Paumier JP, Sauleau EA, Buemi A. Journées francophones d'informatique médicale, Lille, May.

Address for correspondence

Pr Paul Landais, Service de Biostatistique et d'Informatique Médicale, Hôpital Necker-Enfants Malades, 149, rue de Sèvres, Paris cedex 15, landais@necker.fr


DEFORMABLE MATCHING OF HAND SHAPES FOR USER VERIFICATION. Ani1 K. Jain and Nicolae Duta DEFORMABLE MATCHING OF HAND SHAPES FOR USER VERIFICATION Ani1 K. Jain and Nicolae Duta Department of Computer Science and Engineering Michigan State University, East Lansing, MI 48824-1026, USA E-mail:

More information
