S4: International Competition on Scalable String Similarity Search & Join. Ulf Leser, Humboldt-Universität zu Berlin


S4
- Competition, not a workshop in the usual sense
- Idea: try to find the best existing software for solving (certain tasks in) similarity search & similarity join on (certain) large string sets
- Organizers: Sebastian Wandelt, Ulf Leser
  - Work with DNA: compression, alignment, read mapping, variation detection, motif search, ...
  - Live in two worlds: databases and bioinformatics
  - Also research in text mining & named entity normalization
  - Often participate in competitions in other areas
Ulf Leser: S4: Scalable String Similarity Search & Join, EDBT Workshop, 2013

String Similarity Search
- Given a (large) set S of strings and a query string q, find all s ∈ S that are sufficiently similar to q
- "Similar": with respect to a similarity function: unweighted edit distance
  - Could be weighted edit distance, local alignment score, Hamming distance, phonetic distance, n-gram Jaccard distance, ...
- "Sufficiently": with a distance of at most k
  - Could be the single most similar or the k most similar or ...
- "Large string set": DNA sequences and geographic names
  - Could be person names, Wikipedia entries, Oxford dictionary, ...
- Important difference: alphabet size, skew in character frequencies
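The problem statement above can be illustrated with a minimal Python sketch. It is a brute-force linear scan, not one of the indexed methods the competition evaluates; the function names `edit_distance` and `similarity_search` and the sample place names are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Unweighted (Levenshtein) edit distance, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def similarity_search(strings, query, k):
    """Return all s in strings with edit distance to query of at most k."""
    return [s for s in strings
            # Strings whose lengths differ by more than k cannot match.
            if abs(len(s) - len(query)) <= k and edit_distance(s, query) <= k]


# Hypothetical sample data, not taken from the competition data sets.
names = ["Berlin", "Bern", "Berlim", "Dublin"]
matches = similarity_search(names, "Berlin", 1)
```

With k = 1 the scan keeps "Berlin" and "Berlim"; raising k to 2 also admits "Bern". The scan costs one distance computation per candidate string, which is the cost an index is meant to avoid on large string sets.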

Applications
- Similarity search
  - Homologous DNA or protein sequences
  - Most probable entity referred to in a text
  - Most probable correct spelling during a keyword search
  - Most probable match in a dictionary
- Similarity join
  - Duplicate detection (self-join)
  - Data cleansing
  - Entity recognition
  - Bulk similarity search

Important in many domains
- Databases (hundred?)
  - Koudas, N., Marathe, A. and Srivastava, D. (2004). "Flexible String Matching Against Large Databases in Practice". VLDB 2004.
  - Papapetrou, P., Athitsos, V., Kollios, G. and Gunopulos, D. (2009). "Reference-based alignment in large sequence databases." PVLDB 2(1).
  - Rheinländer, A., Knobloch, M., Hochmuth, N. and Leser, U. (2010). "Prefix Tree Indexing for Similarity Search and Similarity Join on Genomic Data". SSDBM.
  - Zhang, Zhenjie, et al. "Bed-tree: an all-purpose index structure for string similarity search based on edit distance." SIGMOD 2010.
  - Sahinalp, S. Cenk, et al. "Distance based indexing for string proximity search." ICDE 2003.
- Bioinformatics (thousand?)
  - Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-10.
  - Kehr, B., Weese, D. and Reinert, K. (2011). "STELLAR: fast and exact local alignments." BMC Bioinformatics.
  - Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.
  - Li, H. and Durbin, R. (2010). "Fast and accurate long-read alignment with Burrows-Wheeler transform." Bioinformatics 26(5): 589-95.
  - Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O. and Weigel, D. (2009). "Simultaneous alignment of short reads against multiple genomes." Genome Biol 10(9): R98.
  - Yang, X., Wang, B., Li, C. and Xie, X. (2013). "Efficient direct search on compressed genomic data". Int. Conf. on Data Engineering, Brisbane, Australia.

Important in many domains
- Algorithms (hundred?)
  - Bartolini, Ilaria, Paolo Ciaccia, and Marco Patella. "String matching with metric trees using an approximate distance." String Processing and Information Retrieval. Springer Berlin/Heidelberg, 2002.
  - Bocek, Thomas, Ela Hunt, and Burkhard Stiller. Fast similarity search in large dictionaries. University, 2007.
  - Gawrychowski, Paweł. "Faster Algorithm for Computing the Edit Distance between SLP-Compressed Strings." String Processing and Information Retrieval. Springer Berlin/Heidelberg, 2012.
  - Arslan, Abdullah N., and Omer Egecioglu. "An efficient uniform-cost normalized edit distance algorithm." String processing and information retrieval symposium, 1999 and international workshop on groupware. IEEE, 1999.
- NLP (hundred?)
- ... probably some more domains.

Specialization
- Different data sets
- Subtle variations of the problem
- Some work in memory, some on disk
- Indexing mechanisms of various kinds
- Specialized on small, medium, large data sets
  - And "large" in 1995 is different from "large" in 2013
- Good for small, medium, large error thresholds

Consequences
- Very hard to compare results in papers
- Code often not available or programmed for a special case
- Exhaustive benchmarks proved very difficult
- Why not perform a competition?

Long History of Competitions
- Bioinformatics
  - CASP: Protein structure prediction
  - BioCreative: Critical Assessment of Information Extraction in Bio.
  - Automated Function Prediction SIG 2013
  - CAGI: Critical Assessment of Genome Interpretation
  - Sequence Mapping and Assembly Assessment Project
- Information Retrieval / Text Mining
  - TREC
  - INEX
  - SemEval
  - CLEF

Format and Impact
- Format
  - Organizers describe the problem and provide training data
  - Teams work with the training data to tune their systems
  - Systems are tested on unseen test data
- Often tremendous impact
  - People get personally challenged and give their best
  - Standardized problems make solutions comparable
  - Bogus methods / papers have little chance
  - Brought many sobering results

Difference
- Other competitions use hardware-independent measures: precision, recall, accuracy, standard error, ...
  - Thus, test data is usually provided to the teams instead of teams providing their systems
- We want to measure wall-clock time
  - Thus, we need to have all systems running on the same machine
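To make the contrast with hardware-independent measures concrete, the following Python sketch times a call by elapsed wall-clock time. It is only an illustration: the slides do not describe the competition's actual measurement procedure, and the helper name `wall_clock` and the best-of-`repeats` policy are assumptions.

```python
import time


def wall_clock(func, *args, repeats=3):
    """Run func(*args) `repeats` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


# Hypothetical workload: sorting stands in for a search or join run.
elapsed, ordered = wall_clock(sorted, ["GATTACA", "ACGT", "TTAGGC"])
```

The elapsed time depends on the CPU, memory, and operating system on which the call runs, which is why such numbers are comparable only when every system runs on the same machine.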

Open Issues
- Not clear if this would be accepted by the community
  - No tuning for the preferred number of threads, particular CPU, memory size, operating system, compiler optimizations, disk access, caching, ...
  - Results could vary considerably on other computer systems
  - Competition is only a small snapshot of a large class of problems and possible computational infrastructures
- Still, we thought it was worth trying

Setup
- Track 1: Similarity search
  - Edit distance threshold k = 0 to 16
  - Two real-life data sets
    - DNA sequence reads: rather uniform length (100), small alphabet, almost equal character frequencies
    - Geographical names: different lengths, shorter, large alphabet, skewed character (and n-gram) frequencies
  - Separate indexing (measured, not ranked) and search phase
  - Queries generated at random
- Track 2: Similarity join
  - Self-join with edit distance threshold k = 0 to 16
  - Same two domains
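The self-join of Track 2 can be sketched in Python as follows. This is a naive baseline under stated assumptions, not a competition entry: the names `within_distance` and `self_join`, the length-sorting filter, and the toy reads are all hypothetical choices made for illustration.

```python
def within_distance(a, b, k):
    """True if the unweighted edit distance between a and b is at most k."""
    if abs(len(a) - len(b)) > k:
        return False  # length difference is a lower bound on the distance
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        if min(current) > k:
            return False  # the final distance cannot drop below the row minimum
        previous = current
    return previous[-1] <= k


def self_join(strings, k):
    """Return sorted index pairs (i, j), i < j, whose strings are within distance k."""
    order = sorted(range(len(strings)), key=lambda i: len(strings[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if len(strings[j]) - len(strings[i]) > k:
                break  # sorted by length: all later strings are too long
            if within_distance(strings[i], strings[j], k):
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical toy reads; the real reads have length around 100.
reads = ["ACGT", "ACGA", "ACG", "TTTT", "ACGT"]
exact_duplicates = self_join(reads, 0)
near_duplicates = self_join(reads, 1)
```

The length filter prunes well on data with varied string lengths, such as the geographical names. On reads of nearly uniform length it removes almost nothing, so the join degenerates to comparing all pairs, which hints at why the two data sets stress different techniques.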

Program Today