TABLE OF CONTENTS

CHAPTER   TITLE   PAGE

ABSTRACT   iv
LIST OF TABLES   xi
LIST OF FIGURES   xii
LIST OF ABBREVIATIONS & SYMBOLS   xiv

1. INTRODUCTION   1
2. LITERATURE SURVEY   14
3. MOTIVATIONS & OBJECTIVES OF THIS WORK   27
4. EXPERIMENTATION   29
   4.1 INTRODUCTION   29
   4.2 SEMI AUTOMATIC METHOD FOR STRING MATCHING   32
      4.2.1 METRICS FOR MEASURING SIMILARITY   32
         4.2.1.1 EDIT DISTANCE   33
         4.2.1.2 AFFINE GAP METHOD   35
         4.2.1.3 NEEDLEMAN WUNSCH DISTANCE OR SELLERS ALGORITHM   38
         4.2.1.4 SMITH WATERMAN DISTANCE   40
         4.2.1.5 THE JARO METRIC AND ITS VARIANTS   42
         4.2.1.6 JACCARD INDEX   44
         4.2.1.7 TANIMOTO COEFFICIENT (EXTENDED JACCARD COEFFICIENT)   45
         4.2.1.8 TF / IDF (TERM FREQUENCY / INVERSE DOCUMENT FREQUENCY)   45
         4.2.1.9 N-GRAMS APPROACH   49
         4.2.1.10 RABIN KARP METHOD   50
         4.2.1.11 KNUTH MORRIS PRATT METHOD   53
         4.2.1.12 BOYER MOORE APPROACH   58
      4.2.2 HYBRID STRING MATCHING PROCESS   60
   4.3 DATA MINING & KNOWLEDGE DISCOVERY TECHNIQUE FOR MULTIMEDIA DATA USING UNSUPERVISED CONFLATION METHOD   62
      4.3.1 DUPLICATE DETECTION USING UNSUPERVISED CONFLATION METHOD (UCM)   62
         4.3.1.1 PROBLEM DEFINITION   62
         4.3.1.2 SIMILARITY ESTIMATION   65
         4.3.1.3 UNSUPERVISED CONFLATION METHOD OVERVIEW   65
         4.3.1.4 STRING SIMILARITY FUNCTION BASED CLASSIFIER C1   66
         4.3.1.5 WEIGHTED COMPONENT SIMILARITY SUMMING (WCSS) CLASSIFIER C2   67
5. RESULTS & DISCUSSION   70
   5.1 SEMI AUTOMATIC METHOD FOR STRING MATCHING EXPERIMENTAL EVALUATION   70
   5.2 UNSUPERVISED CONFLATION METHOD EXPERIMENTAL EVALUATION   73
      5.2.1 DATA SETS   73
      5.2.2 EVALUATION METRICS   77
      5.2.3 EXPERIMENTAL RESULTS   77
6. CONCLUSION
   6.1 CONCLUSION   87
   6.2 SCOPE FOR FUTURE WORK   88

REFERENCES   90
APPENDICES
   APPENDIX I: DEFINITIONS OF TERMS USED IN THIS THESIS   99
LIST OF PUBLICATIONS

LIST OF TABLES

TABLE   TITLE   PAGE

1.1   Elementary Examples of Matching Pairs of Records (Dependent on Context)   7
4.1   Computation of Levenshtein Distance   35
4.2   Computation of Needleman Wunsch Distance   40
4.3   Computation of Smith-Waterman Distance   42
4.4   IDF values   47
4.5   Computation of scores   48
5.1   Sample Duplicate Records from the Restaurant Database   71
5.2   Sample Duplicate Records from the Cora Database   71
5.3   Sample Duplicate Records from the Reasoning Database   72
5.4   F-measures from the Experiments   72
5.5   Structure of the table ebook   74
5.6   Structure of the table mp3   75
5.7   Structure of the table video   76

LIST OF FIGURES

FIGURE   TITLE   PAGE

1.1   The general process of matching two databases   9
1.2   Query results from www.bookadda.com   11
1.3   Query results from www.infibeam.com   12
4.1   Sample duplicate records from (a) A restaurant database (b) A scientific citation database   30
4.2   Modified alignment from Advanced Dynamic Programming example   37
4.3   Alignment from Figure 4.2 re-scored using affine gap penalties   37
4.4   Modified alignment. Equivalent under regular gap penalty system   38
4.5   The alignment from Figure 4.4 re-scored using affine gap penalties   38
4.6   Computation of Jaro Metric   43
4.7   Example for N-Grams approach   50
4.8   Example 1 for Rabin Karp approach   51
4.9   Example 2(a) for Rabin Karp approach   52
4.10   Example 2(b) for Rabin Karp approach   52
4.11   Example for KMP approach   53
4.12   Example for KMP approach Step 1   54
4.13   Example for KMP approach Step 2   54
4.14   Example for KMP approach Step 3   54
4.15   Example for KMP approach Step 4   55
4.16   Example for KMP approach Step 5   55
4.17   Example for KMP approach Step 6   55
4.18   Example for KMP approach Step 7   56
4.19   Example for KMP approach Step 8   56
4.20   Example for KMP approach Step 9   56
4.21   Example for KMP approach Step 10   57
4.22   Example for KMP approach Step 11   57
4.23   Example for KMP approach Step 12   57
4.24   Example for KMP approach Step 13   58
4.25   Duplicate Vector Identification Algorithm   64
4.26   Component Weight Assignment Algorithm   69
5.1   F-Measures from the Experiments   73
5.2   Sample records from the ebook table   74
5.3   Sample records from the mp3 table   75
5.4   Sample records from the video table   76
5.5   Domain Selection   78
5.6   Source Selection 1   78
5.7   Source Selection 2   79
5.8   After Loading   79
5.9   Calculation of Weights   80
5.10   Record Selection   80
5.11   Record Similarity Calculated Results   81
5.12   Record Similarity Matching all records   81
5.13   Three different similarity thresholds on e-book   82
5.14   Three different similarity thresholds on mp3   83
5.15   Two different similarity thresholds on video   83
5.16   Component weight setting based on similarity values of the fields in N   84
5.17   Effect of the threshold in matching process   85

LIST OF ABBREVIATIONS & SYMBOLS

AI : Artificial Intelligence
DNA : Deoxyribonucleic Acid
DBLP : Digital Bibliography & Library Project
EM : Expectation Maximization
Febrl : Freely Extensible Biomedical Record Linkage
HTML : Hyper Text Markup Language
ISBN : International Standard Book Number
M-C : Mapping-Convergence
MCMC : Markov Chain Monte Carlo
NLP : Natural Language Processing
OCR : Optical Character Recognition
PEBL : Positive Example Based Learning
PES : Post Enumeration Survey
PPRL : Privacy Preserving Record Linkage
RelDC : Relationships for domain independent Data Cleaning
RL : Record Linkage
RNA : Ribonucleic Acid
SQL : Structured Query Language
SVM : Support Vector Machine
TF-IDF : Term Frequency Inverse Document Frequency
UCM : Unsupervised Conflation Method
U.S.A : United States of America
WCSS : Weighted Component Similarity Summing

D : Distance between two strings
s : String 1
t : String 2
O : Edit distance
c : Cost of the edit operation
x_i : i-th character of string x
y_j : j-th character of string y
M : Matrix
G : Gap cost
d : Distance function
P : Length of the longest common prefix
θ : Cosine similarity
T : Tanimoto coefficient
N : Non-duplicate vector set
C1, C2 : Classifiers
S_a, S_b : Pair of strings
∅ : Null set
AS_th : Predefined threshold value
γ : Feature vector
P(γ|M) : Probability of observing the feature vector for a matched pair
P(γ|U) : Probability of observing the feature vector for a nonmatched pair
T_μ : Threshold based on desired error level for equivalent record pair
T_λ : Threshold based on desired error level for nonequivalent record pair