TABLE OF CONTENTS
- Dennis Young
- 6 years ago
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS & SYMBOLS

1. INTRODUCTION

2. LITERATURE SURVEY

MOTIVATIONS & OBJECTIVES OF THIS WORK

EXPERIMENTATION
    INTRODUCTION
    SEMI AUTOMATIC METHOD FOR STRING MATCHING
    METRICS FOR MEASURING SIMILARITY
    EDIT DISTANCE
    AFFINE GAP METHOD
    NEEDLEMAN WUNSCH DISTANCE OR SELLERS ALGORITHM
    SMITH WATERMAN DISTANCE
    THE JARO METRIC AND ITS VARIANTS
    JACCARD INDEX
    TANIMOTO COEFFICIENT (EXTENDED JACCARD COEFFICIENT)
    TF / IDF (TERM FREQUENCY / INVERSE DOCUMENT FREQUENCY)
    N-GRAMS APPROACH
    RABIN KARP METHOD
    KNUTH MORRIS PRATT METHOD
    BOYER MOORE APPROACH
    HYBRID STRING MATCHING PROCESS
    DATA MINING & KNOWLEDGE DISCOVERY TECHNIQUE FOR MULTIMEDIA DATA USING UNSUPERVISED CONFLATION METHOD
    DUPLICATE DETECTION USING UNSUPERVISED CONFLATION METHOD (UCM)
    PROBLEM DEFINITION
    SIMILARITY ESTIMATION
    UNSUPERVISED CONFLATION METHOD OVERVIEW
    STRING SIMILARITY FUNCTION BASED CLASSIFIER C
    WEIGHTED COMPONENT SIMILARITY SUMMING (WCSS) CLASSIFIER C2

5. RESULTS & DISCUSSION
    SEMI AUTOMATIC METHOD FOR STRING MATCHING EXPERIMENTAL EVALUATION
    UNSUPERVISED CONFLATION METHOD EXPERIMENTAL EVALUATION
    DATA SETS
    EVALUATION METRICS
    EXPERIMENTAL RESULTS

6. CONCLUSION
    6.1 CONCLUSION
    SCOPE FOR FUTURE WORK

REFERENCES

APPENDICES
    APPENDIX I DEFINITIONS OF TERMS USED IN THIS THESIS

LIST OF PUBLICATIONS

LIST OF TABLES

1.1 Elementary Examples of Matching Pairs of Records (Dependent on Context)
Computation of Levenshtein Distance
Computation of Needleman Wunsch Distance
Computation of Smith-Waterman Distance
IDF values
Computation of scores
Sample Duplicate Records from the Restaurant Database
Sample Duplicate Records from the Cora Database
Sample Duplicate Records from the Reasoning Database
F-measures from the Experiments
Structure of the table ebook
Structure of the table mp
Structure of the table video

LIST OF FIGURES

1.1 The general process of matching two databases
Query results from
Query results from
Sample duplicate records from (a) A restaurant database (b) A scientific citation database
Modified alignment from Advanced Dynamic Programming example
Alignment from Figure 4.2 re-scored using affine gap penalties
Modified alignment. Equivalent under regular gap penalty system
The alignment from Figure 4.4 re-scored using affine gap penalties
Computation of Jaro Metric
Example for N-Grams approach
Example 1 for Rabin Karp approach
Example 2(a) for Rabin Karp approach
Example 2(b) for Rabin Karp approach
Example for KMP approach
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step 7
4.19 Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Example for KMP approach Step
Duplicate Vector Identification Algorithm
Component Weight Assignment Algorithm
F-Measures from the Experiments
Sample records from the ebook table
Sample records from the mp3 table
Sample records from the video table
Domain Selection
Source Selection
Source Selection After Loading
Calculation of Weights
Record Selection
Record Similarity Calculated Results
Record Similarity Matching all records
Three different similarity thresholds on e-book
Three different similarity thresholds on mp
Two different similarity thresholds on video
Component weight setting based on similarity values of the fields in N
Effect of the threshold in matching process

LIST OF ABBREVIATIONS & SYMBOLS

AI : Artificial Intelligence
DNA : Deoxyribonucleic Acid
DBLP : Digital Bibliography & Library Project
EM : Expectation Maximization
Febrl : Freely Extensible Biomedical Record Linkage
HTML : Hyper Text Markup Language
ISBN : International Standard Book Number
M-C : Mapping-Convergence
MCMC : Markov Chain Monte Carlo
NLP : Natural Language Processing
OCR : Optical Character Recognition
PEBL : Positive Example Based Learning
PES : Post Enumeration Survey
PPRL : Privacy Preserving Record Linkage
RelDC : Relationships for domain independent Data Cleaning
RL : Record Linkage
RNA : Ribonucleic Acid
SQL : Structured Query Language
SVM : Support Vector Machine
TF-IDF : Term Frequency Inverse Document Frequency
UCM : Unsupervised Conflation Method
U.S.A : United States of America
WCSS : Weighted Component Similarity Summing
D : Distance between two strings
s : String 1
t : String 2
O : Edit Distance
c : Cost of the edit operation
x_i : i-th character of string x
y_j : j-th character of string y
M : Matrix
G : Gap cost
d : Distance function
P : Length of the longest common prefix
θ : Cosine similarity
T : Tanimoto coefficient
N : Non duplicate vector set
C1, C2 : Classifiers
S_a, S_b : Pair of Strings
∅ : Null set
AS_th : Predefined Threshold value
γ : Feature Vector
P(γ|M) : Probabilities of observing feature vector for a matched pair
P(γ|U) : Probabilities of observing feature vector for a nonmatched pair
T_μ : Threshold based on desired error level for equivalent record pair
T_λ : Threshold based on desired error level for nonequivalent record pair