TABLE OF CONTENTS

CHAPTER  TITLE  PAGE NO.

ABSTRACT  iv
LIST OF TABLES  xi
LIST OF FIGURES  xii
LIST OF ABBREVIATIONS & SYMBOLS  xiv

1. INTRODUCTION  1
2. LITERATURE SURVEY  14
3. MOTIVATIONS & OBJECTIVES OF THIS WORK  27
4. EXPERIMENTATION  29
  4.1 INTRODUCTION  29
  4.2 SEMI AUTOMATIC METHOD FOR STRING MATCHING  32
    4.2.1 METRICS FOR MEASURING SIMILARITY  32
      4.2.1.1 EDIT DISTANCE  33
      4.2.1.2 AFFINE GAP METHOD  35
      4.2.1.3 NEEDLEMAN WUNSCH DISTANCE OR SELLERS ALGORITHM  38
      4.2.1.4 SMITH WATERMAN DISTANCE  40
      4.2.1.5 THE JARO METRIC AND ITS VARIANTS  42
      4.2.1.6 JACCARD INDEX  44
      4.2.1.7 TANIMOTO COEFFICIENT (EXTENDED JACCARD COEFFICIENT)  45
      4.2.1.8 TF / IDF (TERM FREQUENCY / INVERSE DOCUMENT FREQUENCY)  45
      4.2.1.9 N-GRAMS APPROACH  49
      4.2.1.10 RABIN KARP METHOD  50
      4.2.1.11 KNUTH MORRIS PRATT METHOD  53
      4.2.1.12 BOYER MOORE APPROACH  58
    4.2.2 HYBRID STRING MATCHING PROCESS  60
  4.3 DATA MINING & KNOWLEDGE DISCOVERY TECHNIQUE FOR MULTIMEDIA DATA USING UNSUPERVISED CONFLATION METHOD  62
    4.3.1 DUPLICATE DETECTION USING UNSUPERVISED CONFLATION METHOD (UCM)  62
      4.3.1.1 PROBLEM DEFINITION  62
      4.3.1.2 SIMILARITY ESTIMATION  65
      4.3.1.3 UNSUPERVISED CONFLATION METHOD OVERVIEW  65
      4.3.1.4 STRING SIMILARITY FUNCTION BASED CLASSIFIER C1  66
      4.3.1.5 WEIGHTED COMPONENT SIMILARITY SUMMING (WCSS) CLASSIFIER C2  67
5. RESULTS & DISCUSSION  70
  5.1 SEMI AUTOMATIC METHOD FOR STRING MATCHING EXPERIMENTAL EVALUATION  70
  5.2 UNSUPERVISED CONFLATION METHOD EXPERIMENTAL EVALUATION  73
    5.2.1 DATA SETS  73
    5.2.2 EVALUATION METRICS  77
    5.2.3 EXPERIMENTAL RESULTS  77
6. CONCLUSION
  6.1 CONCLUSION  87
  6.2 SCOPE FOR FUTURE WORK  88

REFERENCES  90

APPENDICES
  APPENDIX I DEFINITIONS OF TERMS USED IN THIS THESIS  99

LIST OF PUBLICATIONS

LIST OF TABLES

TABLE  TITLE  PAGE

1.1  Elementary Examples of Matching Pairs of Records (Dependent on Context)  7
4.1  Computation of Levenshtein Distance  35
4.2  Computation of Needleman Wunsch Distance  40
4.3  Computation of Smith-Waterman Distance  42
4.4  IDF values  47
4.5  Computation of scores  48
5.1  Sample Duplicate Records from the Restaurant Database  71
5.2  Sample Duplicate Records from the Cora Database  71
5.3  Sample Duplicate Records from the Reasoning Database  72
5.4  F-measures from the Experiments  72
5.5  Structure of the table ebook  74
5.6  Structure of the table mp3  75
5.7  Structure of the table video  76

LIST OF FIGURES

FIGURE  TITLE  PAGE

1.1  The general process of matching two databases  9
1.2  Query results from www.bookadda.com  11
1.3  Query results from www.infibeam.com  12
4.1  Sample duplicate records from (a) A restaurant database (b) A scientific citation database  30
4.2  Modified alignment from Advanced Dynamic Programming example  37
4.3  Alignment from Figure 4.2 re-scored using affine gap penalties  37
4.4  Modified alignment. Equivalent under regular gap penalty system  38
4.5  The alignment from Figure 4.4 re-scored using affine gap penalties  38
4.6  Computation of Jaro Metric  43
4.7  Example for N-Grams approach  50
4.8  Example 1 for Rabin Karp approach  51
4.9  Example 2(a) for Rabin Karp approach  52
4.10  Example 2(b) for Rabin Karp approach  52
4.11  Example for KMP approach  53
4.12  Example for KMP approach Step 1  54
4.13  Example for KMP approach Step 2  54
4.14  Example for KMP approach Step 3  54
4.15  Example for KMP approach Step 4  55
4.16  Example for KMP approach Step 5  55
4.17  Example for KMP approach Step 6  55
4.18  Example for KMP approach Step 7  56
4.19  Example for KMP approach Step 8  56
4.20  Example for KMP approach Step 9  56
4.21  Example for KMP approach Step 10  57
4.22  Example for KMP approach Step 11  57
4.23  Example for KMP approach Step 12  57
4.24  Example for KMP approach Step 13  58
4.25  Duplicate Vector Identification Algorithm  64
4.26  Component Weight Assignment Algorithm  69
5.1  F-Measures from the Experiments  73
5.2  Sample records from the ebook table  74
5.3  Sample records from the mp3 table  75
5.4  Sample records from the video table  76
5.5  Domain Selection  78
5.6  Source Selection 1  78
5.7  Source Selection 2  79
5.8  After Loading  79
5.9  Calculation of Weights  80
5.10  Record Selection  80
5.11  Record Similarity Calculated Results  81
5.12  Record Similarity Matching all records  81
5.13  Three different similarity thresholds on e-book  82
5.14  Three different similarity thresholds on mp3  83
5.15  Two different similarity thresholds on video  83
5.16  Component weight setting based on similarity values of the fields in N  84
5.17  Effect of the threshold in matching process  85

LIST OF ABBREVIATIONS & SYMBOLS

AI : Artificial Intelligence
DNA : Deoxyribonucleic Acid
DBLP : Digital Bibliography & Library Project
EM : Expectation Maximization
Febrl : Freely Extensible Biomedical Record Linkage
HTML : Hyper Text Markup Language
ISBN : International Standard Book Number
M-C : Mapping-Convergence
MCMC : Markov Chain Monte Carlo
NLP : Natural Language Processing
OCR : Optical Character Recognition
PEBL : Positive Example Based Learning
PES : Post Enumeration Survey
PPRL : Privacy Preserving Record Linkage
RelDC : Relationships for domain independent Data Cleaning
RL : Record Linkage
RNA : Ribonucleic Acid
SQL : Structured Query Language
SVM : Support Vector Machine
TF-IDF : Term Frequency Inverse Document Frequency
UCM : Unsupervised Conflation Method
U.S.A : United States of America
WCSS : Weighted Component Similarity Summing

D : Distance between two strings
s : String 1
t : String 2
O : Edit Distance
c : Cost of the edit operation
x_i : i-th character of string x
y_j : j-th character of string y
M : Matrix
G : Gap cost
d : Distance function
P : Length of the longest common prefix
θ : Cosine similarity
T : Tanimoto coefficient
N : Non duplicate vector set
C1, C2 : Classifiers
S_a, S_b : Pair of strings
∅ : Null set
AS_th : Predefined threshold value
γ : Feature vector
P(γ|M) : Probability of observing the feature vector for a matched pair
P(γ|U) : Probability of observing the feature vector for a nonmatched pair
Tμ : Threshold based on desired error level for equivalent record pair
Tλ : Threshold based on desired error level for nonequivalent record pair