Towards Declarative and Efficient Querying on Protein Structures

Size: px
Start display at page:

Download "Towards Declarative and Efficient Querying on Protein Structures"

Transcription

1 Towards Declarative and Efficient Querying on Protein Structures Jignesh M. Patel University of Michigan Biology Data Types Sequences: AGCGGTA. Structure: Interaction Maps: Micro-arrays: Gene A Gene B Expr Expr When there is something to do in the cell, it is a protein that does it.

2 Role of DBMS in Bioinformatics Large Data Sets Growing exponentially! Data Types Sequences/arrays/-D/text Complex Queries Ad-hoc tools today Integrated querying done using procedural methods Scalability/Parallelism Home-grown techniques often used today Need: Declarative and efficient querying # DNA Base Pairs (Billions) Base Pairs Sequences Source: GenBank # Sequences (Millions) /7/4 Jignesh M. Patel, 4 Protein Querying Protein functionality determined by Sequence composition Geometric structure Need to query across all structures Current methods Non-declarative tools, often not very efficient on large data sets Support querying on only one structure /7/4 Jignesh M. Patel, 4 4

3 Periscope Goal: Design, implement, and evaluate a database management system for declarative and efficient querying on all protein structures Periscope IAKDPQ Declarative and Integrated Query MRIVL Data Parse Optimize Evaluate Answer (quickly) /7/4 Jignesh M. Patel, 4 5 Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 6

4 Sequence Matching Find similar sequences Given a protein sequence, find homologous matches in the database Similarity based on local similarity A local-alignment alignment algorithm Operations: Replace, Delete, Insert Score using a substitution matrix Database: THE TRAIN DRIVER S CABIN Query: DRAIN /7/4 Jignesh M. Patel, 4 7 Sequence Matching Find similar sequences Given a protein sequence, find homologous matches in the database Similarity based on local similarity A local-alignment alignment algorithm Operations: Replace, Delete, Insert Score using a substitution matrix Query Target - A B A - B - C - C - D - D - Target: C A B I N R I R D R R Query: D R A I N Score: /7/4 Jignesh M. Patel, 4 8

5 D R A I N C A 4 Smith-Waterman B 4 I 6 5 S-W Matrix: G N 5 9 Query - A B C D Target A - B - C - Substitution Matrix D - G i,j = max G i-, j- + S(q i t j ) G i-, j + S(q i ) G i, j- + S( t j ) C A B I N R I R D R R D R A I N /7/4 Jignesh M. Patel, 4 9 Suffix Trees Compact Patricia trie Every suffix has a path (to leaf) Every subsequence is a prefix of a path Can find subsequences very fast E.g. GTACG Size: ~x Position Database G$ 8L 8L 4N L L L N 5L 5L L L N A TA G C CGCCTAG$ $ CCTAG$ CCTAG$ GCCTAG$ N 4L 4L CTAG$ CTAG$ 6L 6L TAG$ TAG$ A G T A C G C C T A G $ N 7L 7L $ 9L 9L CGCCTAG$ G$ G$ 5N L L L L /7/4 Jignesh M. Patel, 4

6 OASIS Search driven by the suffix tree Fill up the S-W S W columns by traversing down the suffix tree Best-first: expand node in the tree with highest expected score Expected Score = current score + best possible score for unconsumed portion of query Guarantees online behavior! Exploits redundancy in the database /7/4 Jignesh M. Patel, 4 OASIS Query: TACG Unit Edit Distance Matrix: Same Symbol Substitution =, else - Q:.TA... T: TACG Score: ++ = 4 Position Database G$ 8L 8L 4N L L L N 5L 5L L L N A TA G C CGCCTAG$ $ CCTAG$ CCTAG$ GCCTAG$ N 4L 4L CTAG$ CTAG$ 6L 6L TAG$ TAG$ A G T A C G C C T A G $ N 7L 7L Q:.A... T: TACG Score: ++ = $ 9L 9L CGCCTAG$ G$ G$ 5N L L L L

7 Experimental Setup: Comparison Data set: swissprot K proteins, # symbols: 4M, data size: 4MB Index Size: 5MB (~.5 bytes per symbol) Workload: queries from ProClass motif database Short queries lengths 6-56, avg len = 6 PAM scoring matrix BLAST parameters (short query settings, 5-55 residues) E=,, word size = OASIS: minscore = ln( Kmn) ln( E) λ Platform:.7GHz Xeon, Linux, 56MB buffer pool, K page size, clock replacement policy /7/4 Jignesh M. Patel, 4 Mean Time in secs (log scale) Experimental Results Performance Comparison S-W BLAST OASIS. 4 6 Query Length OASIS - orders of magnitude faster than S-W, S and usually also faster than BLAST Sensitivity Comparison OASIS retrieved ~ 6% more matches than BLAST Lots more in our VLDB paper

8 Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 5 Querying Protein Secondary Structure Secondary Structure describes protein folds Beta sheets (e); Alpha helices (h); Loops (l) Predicted structure; helps determine protein function Data Model Sequence of Segments h h h h l l e e e e e <h>, < l>, <4 e> Query Language Sequence of regular expression terms Example: Beta sheet of length 4-4 followed at some point by an helix of length -6 Query: <e 4 > <? Inf> > <h 6> /7/4 Jignesh M. Patel, 4 6

9 Schema and Query Evaluation id name A B len 5 6 sec-seq seq lleee hhhee Protein Table 4 e 4 Evaluation Methods CSP: complex scan SSS: scan segment table ISS: use segment index MISS (k): multiple ISS seg-id id type l e h len start-pos (Derived) Segment Table Scan each protein tuple and evaluate query using a finite state automata N-way merge join of k multiple segment index scans + FK join /7/4 Jignesh M. Patel, 4 7 Multiple Index Scan Method: MISS(k) Query: <P >< P > <P n > To satisfy other query predicates CSP Merge based on protein id and ordering constraints Merge (id) INLJ (id) Idx Scan (id) Sort (id, start) Sort (id, start) Sort (id, start) (id, start) Idx Scan (type, len) (id, start) Idx Scan (type, len) (id, start) Idx Scan (type, len) Protein Table /7/4 Segment Table Segment Table Segment Table k k most highly selective query predicates Jignesh M. Patel, 4 8

10 Query Optimizer Chooses best method for given query Optimization: Requires estimates of query predicate selectivities and result cardinality Cost model: CPU and IO Estimation: Two types of histograms Basic: segment predicate selectivity Complex: query selectivity /7/4 Jignesh M. Patel, 4 9 Basic Histogram Estimates selectivity of individual query predicates -D D array Dimensions : Length, Fold type Value: number of each <type,len> > segment Example Pred <e 4 4>, Est: : 5 Pred <e 4>, Est: : Len H E L /7/4 Jignesh M. Patel, 4

11 Complex Histogram Estimates result cardinality Four-dimensional structure Protein id (equi( equi-width buckets) Start position (equi( equi-width buckets) Length ( 5) Same as in the Type ( e,( h, l ) basic histogram Example: Position [x] [y] [z][ ] [ e ][ holds the number of <e< z z> segments whose starting position is in the range of the y bucket and whose protein id lies within the x bucket range /7/4 Jignesh M. Patel, 4 Complex Histogram Query: {<P> <P>} Compute selectivity of query result 4 Dimensions - Protein id -Start pos -Length -Type 5 5 Start Positions Protein Sequence P P Same Bucket Distinct Bucket pos pos+ ( n np ) P start positions laststartbucketnum pos= / IDbucketWidth startbucketwidth sp sp+ ( ) P P start sp= positions n n IDbucketWidth /7/4 Jignesh M. Patel, 4

12 Experimental Evaluation Techniques implemented in Periscope Configuration: Using the SHORE storage manager, 64MB buffer pool size, 6K page size Also implemented in a commercial ORDBMS Used type extensibility to create array-like data types for sequences & user-defined functions for FSM Data Set: PIR 5K protein tuples, 5MB Segment Table: M tuples, 55MB Platform:.7GHz Xeon, Linux, 4GB SCSI disk /7/4 Jignesh M. Patel, 4 Complex Histogram Accuracy Query: {<l 5 5> > <? X> <h 4 4>} Experiment: Vary Gap Predicate Result Cardinality 5,, 5,, 5, Estimate Actual Length of Gap Predicate (X) Histogram Details Data Set: PIR xx5x Size: 5.8MB Build time: secs Estimation time: ms Histogram accurate within 8% of the actual result /7/4 Jignesh M. Patel, 4 4

13 Query Performance Query: 9 Predicates P G P G P G P 4 G 4 P 5 P=predicate, G=Gap pred. Expt: : Vary Selectivity of P sp Alternatives: S (.%), L (7%) Choice of algorithm is critical CSP sensitive to position of L pred. Choice of k in MISS critical Index Probe + reduce # protein tuples fetched - probe cost, sorting and merging Influenced by query and predicate selectivity Time (secs) CSP MISS() MISS() MISS(5) SSSSL SSLSS LSSSS Query More details in our VLDB paper Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 6

14 PiQL? Details in our PiQA SSDBM paper An algebra in PNF for queries on primary & secondary structures Examples: Match the primary sequence AAANBPPPPSDF,, but ignore mismatch in the segment NBPPP if it is on a loop (P.p * AAA ) ((P.p * NBPPP )) U (P.s( * <L 5 5>)) (P.p( * PSDF ) BLAST: Match n-grams n + match extension + transitive closure Find all occurrences of TGCTGACTCAGCA within 4 bps upstream of a CA with TATA 5- bps upstream of the CA M * TGCTGACTCAGCA 957 M * TATA 6 M * CA 957 TGC..A 5 6 4,,,, TATA CA /7/4 Jignesh M. Patel, 4 7 /7/4 Jignesh M. Patel, 4 8

15 Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 9 Summary Bioinformatics applications (urgently) need declarative and efficient query processing tools Current procedural methods reminiscent of pre-relational relational days Database researchers have a lot to contribute! Periscope is our effort in this direction PiQA: : algebraic framework for querying on primary and secondary structures Primary Structure: OASIS Declarative querying on secondary structure: Periscope/PS Current Status OASIS and Periscope/PS currently (beta-)deployed at UM Under development: PiQA powered PiQL queries on Periscope /7/4 Jignesh M. Patel, 4

Similarity Searching of Secondary Structures in Protein Sequences

Similarity Searching of Secondary Structures in Protein Sequences Similarity Searching of Secondary Structures in Protein Sequences Laurie Hammel* Jignesh M. Patel University of Michigan 1301 Beal Avenue Ann Arbor, MI 48109-2122 {lhammel, jignesh}@eecs.umich.edu Abstract

More information

Chapter 11 Declarative and Efficient Querying on Protein Secondary Structures

Chapter 11 Declarative and Efficient Querying on Protein Secondary Structures Chapter 11 Declarative and Efficient Querying on Protein Secondary Structures Jignesh M. Patel, Donald P. Huddler, and Laurie Hammel Summary In spite of the many decades of progress in database research,

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

Towards Declarative Querying for Biological Sequences

Towards Declarative Querying for Biological Sequences Towards Declarative Querying for Biological Sequences Sandeep Tata 1 Jignesh M. Patel 1 James S. Friedman 2 Anand Swaroop 2,3 Departments of 1 Electrical Engineering and Computer Science, 2 Ophthalmology

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences

OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences Colin Meek Jignesh M. Patel Department of Electrical Engineering and Computer Science The University of Michigan,

More information

Declarative Querying for Biological Sequences

Declarative Querying for Biological Sequences Declarative Querying for Biological Sequences Sandeep Tata 1 Jignesh M. Patel 1 James S. Friedman 2 Anand Swaroop 2;3 Departments of 1 Electrical Engineering and Computer Science, 2 Ophthalmology and Visual

More information

Join Processing for Flash SSDs: Remembering Past Lessons

Join Processing for Flash SSDs: Remembering Past Lessons Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Sequence pairwise alignment Score statistics: E-value and p-value Heuristic algorithms: BLAST and FASTA Database search: gene finding and annotations

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

RELATIONAL OPERATORS #1

RELATIONAL OPERATORS #1 RELATIONAL OPERATORS #1 CS 564- Spring 2018 ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? Algorithms for relational operators: select project 2 ARCHITECTURE OF A DBMS query

More information

Review. Support for data retrieval at the physical level:

Review. Support for data retrieval at the physical level: Query Processing Review Support for data retrieval at the physical level: Indices: data structures to help with some query evaluation: SELECTION queries (ssn = 123) RANGE queries (100

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

BLAST MCDB 187. Friday, February 8, 13

BLAST MCDB 187. Friday, February 8, 13 BLAST MCDB 187 BLAST Basic Local Alignment Sequence Tool Uses shortcut to compute alignments of a sequence against a database very quickly Typically takes about a minute to align a sequence against a database

More information

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the

More information

BLAST - Basic Local Alignment Search Tool

BLAST - Basic Local Alignment Search Tool Lecture for ic Bioinformatics (DD2450) April 11, 2013 Searching 1. Input: Query Sequence 2. Database of sequences 3. Subject Sequence(s) 4. Output: High Segment Pairs (HSPs) Sequence Similarity Measures:

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1 Motivation for External Sort Often have a large (size greater than the available

More information

Cost Models. the query database statistics description of computational resources, e.g.

Cost Models. the query database statistics description of computational resources, e.g. Cost Models An optimizer estimates costs for plans so that it can choose the least expensive plan from a set of alternatives. Inputs to the cost model include: the query database statistics description

More information

Single Pass, BLAST-like, Approximate String Matching on FPGAs*

Single Pass, BLAST-like, Approximate String Matching on FPGAs* Single Pass, BLAST-like, Approximate String Matching on FPGAs* Martin Herbordt Josh Model Yongfeng Gu Bharat Sukhwani Tom VanCourt Computer Architecture and Automated Design Laboratory Department of Electrical

More information

Overview of Implementing Relational Operators and Query Evaluation

Overview of Implementing Relational Operators and Query Evaluation Overview of Implementing Relational Operators and Query Evaluation Chapter 12 Motivation: Evaluating Queries The same query can be evaluated in different ways. The evaluation strategy (plan) can make orders

More information

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure Bioinformatics Sequence alignment BLAST Significance Next time Protein Structure 1 Experimental origins of sequence data The Sanger dideoxynucleotide method F Each color is one lane of an electrophoresis

More information

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace. 5 Multiple Match Refinement and T-Coffee In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace. This exposition

More information

BLAST. NCBI BLAST Basic Local Alignment Search Tool

BLAST. NCBI BLAST Basic Local Alignment Search Tool BLAST NCBI BLAST Basic Local Alignment Search Tool http://www.ncbi.nlm.nih.gov/blast/ Global versus local alignments Global alignments: Attempt to align every residue in every sequence, Most useful when

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Estimating Result Size and Execution Times for Graph Queries

Estimating Result Size and Execution Times for Graph Queries Estimating Result Size and Execution Times for Graph Queries Silke Trißl 1 and Ulf Leser 1 Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany {trissl,leser}@informatik.hu-berlin.de

More information

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture February 6, 2008 BLAST: Basic local alignment search tool B L A S T! Jonathan Pevsner, Ph.D. Introduction to Bioinformatics pevsner@jhmi.edu 4.633.0 Copyright notice Many of the images in this powerpoint

More information

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M. Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1 Joins The focus here is on equijoins These are very common,

More information

Brief review from last class

Brief review from last class Sequence Alignment Brief review from last class DNA is has direction, we will use only one (5 -> 3 ) and generate the opposite strand as needed. DNA is a 3D object (see lecture 1) but we will model it

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

Mapping Reads to Reference Genome

Mapping Reads to Reference Genome Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

A Design of a Hybrid System for DNA Sequence Alignment

A Design of a Hybrid System for DNA Sequence Alignment IMECS 2008, 9-2 March, 2008, Hong Kong A Design of a Hybrid System for DNA Sequence Alignment Heba Khaled, Hossam M. Faheem, Tayseer Hasan, Saeed Ghoneimy Abstract This paper describes a parallel algorithm

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 17, March 24, 2015 Mohammad Hammoud Today Last Two Sessions: DBMS Internals- Part V External Sorting How to Start a Company in Five (maybe

More information

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag. Database Management Systems DBMS Architecture SQL INSTRUCTION OPTIMIZER MANAGEMENT OF ACCESS METHODS CONCURRENCY CONTROL BUFFER MANAGER RELIABILITY MANAGEMENT Index Files Data Files System Catalog DATABASE

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive

More information

Adaptive Parallel Compressed Event Matching

Adaptive Parallel Compressed Event Matching Adaptive Parallel Compressed Event Matching Mohammad Sadoghi 1,2 Hans-Arno Jacobsen 2 1 IBM T.J. Watson Research Center 2 Middleware Systems Research Group, University of Toronto April 2014 Mohammad Sadoghi

More information

Module 9: Selectivity Estimation

Module 9: Selectivity Estimation Module 9: Selectivity Estimation Module Outline 9.1 Query Cost and Selectivity Estimation 9.2 Database profiles 9.3 Sampling 9.4 Statistics maintained by commercial DBMS Web Forms Transaction Manager Lock

More information

Database Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building

Database Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building External Sorting and Query Optimization A.R. Hurson 323 CS Building External sorting When data to be sorted cannot fit into available main memory, external sorting algorithm must be applied. Naturally,

More information

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 10: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2015 Quiz I

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2015 Quiz I Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2015 Quiz I There are 12 questions and 13 pages in this quiz booklet. To receive

More information

Database Searching Using BLAST

Database Searching Using BLAST Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain

More information

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

External Sorting Implementing Relational Operators

External Sorting Implementing Relational Operators External Sorting Implementing Relational Operators 1 Readings [RG] Ch. 13 (sorting) 2 Where we are Working our way up from hardware Disks File abstraction that supports insert/delete/scan Indexing for

More information

Principles of Data Management. Lecture #9 (Query Processing Overview)

Principles of Data Management. Lecture #9 (Query Processing Overview) Principles of Data Management Lecture #9 (Query Processing Overview) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Notable News v Midterm

More information

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing

More information

A Coprocessor Architecture for Fast Protein Structure Prediction

A Coprocessor Architecture for Fast Protein Structure Prediction A Coprocessor Architecture for Fast Protein Structure Prediction M. Marolia, R. Khoja, T. Acharya, C. Chakrabarti Department of Electrical Engineering Arizona State University, Tempe, USA. Abstract Predicting

More information

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) PRESENTATION BY PRANAV GOEL Introduction On analytical workloads, Column

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VIII Lecture 16, March 19, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part VII Algorithms for Relational Operations (Cont d) Today s Session:

More information

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky Administriva Lab 2 Final version due next Wednesday CS 133: Databases Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky Problem sets PSet 5 due today No PSet out this week optional practice

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

Efficient Index-based Methods for Processing Large Biological Databases

Efficient Index-based Methods for Processing Large Biological Databases Efficient Index-based Methods for Processing Large Biological Databases by You Jung Kim A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Alignment of Long Sequences

Alignment of Long Sequences Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2009 Mark Craven craven@biostat.wisc.edu Pairwise Whole Genome Alignment: Task Definition Given a pair of genomes (or other large-scale

More information

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features

More information

Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX

Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX ABSTRACT Symmetric multiprocessor (SMP) computers can increase performance by reducing the time required to analyze large volumes

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

A BANDED SMITH-WATERMAN FPGA ACCELERATOR FOR MERCURY BLASTP

A BANDED SMITH-WATERMAN FPGA ACCELERATOR FOR MERCURY BLASTP A BANDED SITH-WATERAN FPGA ACCELERATOR FOR ERCURY BLASTP Brandon Harris*, Arpith C. Jacob*, Joseph. Lancaster*, Jeremy Buhler*, Roger D. Chamberlain* *Dept. of Computer Science and Engineering, Washington

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection

More information

Using MPI One-sided Communication to Accelerate Bioinformatics Applications

Using MPI One-sided Communication to Accelerate Bioinformatics Applications Using MPI One-sided Communication to Accelerate Bioinformatics Applications Hao Wang (hwang121@vt.edu) Department of Computer Science, Virginia Tech Next-Generation Sequencing (NGS) Data Analysis NGS Data

More information

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13! Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13! q Overview! q Optimization! q Measures of Query Cost! Query Evaluation! q Sorting! q Join Operation! q Other

More information

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. .. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. PAM and BLOSUM Matrices Prepared by: Jason Banich and Chris Hoover Background As DNA sequences change and evolve, certain amino acids are more

More information

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012 Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA) Biochemistry 324 Bioinformatics Multiple Sequence Alignment (MSA) Big- Οh notation Greek omicron symbol Ο The Big-Oh notation indicates the complexity of an algorithm in terms of execution speed and storage

More information

Query Optimization Overview

Query Optimization Overview Query Optimization Overview parsing, syntax checking semantic checking check existence of referenced relations and attributes disambiguation of overloaded operators check user authorization query rewrites

More information

CS222P Fall 2017, Final Exam

CS222P Fall 2017, Final Exam STUDENT NAME: STUDENT ID: CS222P Fall 2017, Final Exam Principles of Data Management Department of Computer Science, UC Irvine Prof. Chen Li (Max. Points: 100 + 15) Instructions: This exam has seven (7)

More information

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Implementing Relational Operators: Selection, Projection, Join Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Readings [RG] Sec. 14.1-14.4 Database Management Systems, R. Ramakrishnan and

More information

Constrained Sequence Alignment

Constrained Sequence Alignment Western Washington University Western CEDAR WWU Honors Program Senior Projects WWU Graduate and Undergraduate Scholarship 12-5-2017 Constrained Sequence Alignment Kyle Daling Western Washington University

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example Student Introduction to Database Systems CSE 414 Hash table example Index Student_ID on Student.ID Data File Student 10 Tom Hanks 10 20 20 Amy Hanks ID fname lname 10 Tom Hanks 20 Amy Hanks Lecture 26:

More information

Fundamentals of Database Systems

Fundamentals of Database Systems Fundamentals of Database Systems Assignment: 4 September 21, 2015 Instructions 1. This question paper contains 10 questions in 5 pages. Q1: Calculate branching factor in case for B- tree index structure,

More information

Query Processing and Advanced Queries. Query Optimization (4)

Query Processing and Advanced Queries. Query Optimization (4) Query Processing and Advanced Queries Query Optimization (4) Two-Pass Algorithms Based on Hashing R./S If both input relations R and S are too large to be stored in the buffer, hash all the tuples of both

More information

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

PS2 out today. Lab 2 out today. Lab 1 due today - how was it? 6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your

More information

8/19/13. Computational problems. Introduction to Algorithm

8/19/13. Computational problems. Introduction to Algorithm I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship

More information

Column Stores vs. Row Stores How Different Are They Really?

Column Stores vs. Row Stores How Different Are They Really? Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs Introduction to Database Systems CSE 414 Lecture 26: More Indexes and Operator Costs CSE 414 - Spring 2018 1 Student ID fname lname Hash table example 10 Tom Hanks Index Student_ID on Student.ID Data File

More information

Preliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification

Preliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification Preliminary Syllabus Sep 30 Oct 2 Oct 7 Oct 9 Oct 14 Oct 16 Oct 21 Oct 25 Oct 28 Nov 4 Nov 8 Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification OCTOBER BREAK

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation 12.2

More information

Database Management System

Database Management System Database Management System Lecture Join * Some materials adapted from R. Ramakrishnan, J. Gehrke and Shawn Bowers Today s Agenda Join Algorithm Database Management System Join Algorithms Database Management

More information

Heuristic methods for pairwise alignment:

Heuristic methods for pairwise alignment: Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information