Towards Declarative and Efficient Querying on Protein Structures
|
|
- Jade Anis Henderson
- 5 years ago
- Views:
Transcription
1 Towards Declarative and Efficient Querying on Protein Structures Jignesh M. Patel University of Michigan Biology Data Types Sequences: AGCGGTA. Structure: Interaction Maps: Micro-arrays: Gene A Gene B Expr Expr When there is something to do in the cell, it is a protein that does it.
2 Role of DBMS in Bioinformatics Large Data Sets Growing exponentially! Data Types Sequences/arrays/-D/text Complex Queries Ad-hoc tools today Integrated querying done using procedural methods Scalability/Parallelism Home-grown techniques often used today Need: Declarative and efficient querying # DNA Base Pairs (Billions) Base Pairs Sequences Source: GenBank # Sequences (Millions) /7/4 Jignesh M. Patel, 4 Protein Querying Protein functionality determined by Sequence composition Geometric structure Need to query across all structures Current methods Non-declarative tools, often not very efficient on large data sets Support querying on only one structure /7/4 Jignesh M. Patel, 4 4
3 Periscope Goal: Design, implement, and evaluate a database management system for declarative and efficient querying on all protein structures Periscope IAKDPQ Declarative and Integrated Query MRIVL Data Parse Optimize Evaluate Answer (quickly) /7/4 Jignesh M. Patel, 4 5 Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 6
4 Sequence Matching Find similar sequences Given a protein sequence, find homologous matches in the database Similarity based on local similarity A local-alignment alignment algorithm Operations: Replace, Delete, Insert Score using a substitution matrix Database: THE TRAIN DRIVER S CABIN Query: DRAIN /7/4 Jignesh M. Patel, 4 7 Sequence Matching Find similar sequences Given a protein sequence, find homologous matches in the database Similarity based on local similarity A local-alignment alignment algorithm Operations: Replace, Delete, Insert Score using a substitution matrix Query Target - A B A - B - C - C - D - D - Target: C A B I N R I R D R R Query: D R A I N Score: /7/4 Jignesh M. Patel, 4 8
5 D R A I N C A 4 Smith-Waterman B 4 I 6 5 S-W Matrix: G N 5 9 Query - A B C D Target A - B - C - Substitution Matrix D - G i,j = max G i-, j- + S(q i t j ) G i-, j + S(q i ) G i, j- + S( t j ) C A B I N R I R D R R D R A I N /7/4 Jignesh M. Patel, 4 9 Suffix Trees Compact Patricia trie Every suffix has a path (to leaf) Every subsequence is a prefix of a path Can find subsequences very fast E.g. GTACG Size: ~x Position Database G$ 8L 8L 4N L L L N 5L 5L L L N A TA G C CGCCTAG$ $ CCTAG$ CCTAG$ GCCTAG$ N 4L 4L CTAG$ CTAG$ 6L 6L TAG$ TAG$ A G T A C G C C T A G $ N 7L 7L $ 9L 9L CGCCTAG$ G$ G$ 5N L L L L /7/4 Jignesh M. Patel, 4
6 OASIS Search driven by the suffix tree Fill up the S-W S W columns by traversing down the suffix tree Best-first: expand node in the tree with highest expected score Expected Score = current score + best possible score for unconsumed portion of query Guarantees online behavior! Exploits redundancy in the database /7/4 Jignesh M. Patel, 4 OASIS Query: TACG Unit Edit Distance Matrix: Same Symbol Substitution =, else - Q:.TA... T: TACG Score: ++ = 4 Position Database G$ 8L 8L 4N L L L N 5L 5L L L N A TA G C CGCCTAG$ $ CCTAG$ CCTAG$ GCCTAG$ N 4L 4L CTAG$ CTAG$ 6L 6L TAG$ TAG$ A G T A C G C C T A G $ N 7L 7L Q:.A... T: TACG Score: ++ = $ 9L 9L CGCCTAG$ G$ G$ 5N L L L L
7 Experimental Setup: Comparison Data set: swissprot K proteins, # symbols: 4M, data size: 4MB Index Size: 5MB (~.5 bytes per symbol) Workload: queries from ProClass motif database Short queries lengths 6-56, avg len = 6 PAM scoring matrix BLAST parameters (short query settings, 5-55 residues) E=,, word size = OASIS: minscore = ln( Kmn) ln( E) λ Platform:.7GHz Xeon, Linux, 56MB buffer pool, K page size, clock replacement policy /7/4 Jignesh M. Patel, 4 Mean Time in secs (log scale) Experimental Results Performance Comparison S-W BLAST OASIS. 4 6 Query Length OASIS - orders of magnitude faster than S-W, S and usually also faster than BLAST Sensitivity Comparison OASIS retrieved ~ 6% more matches than BLAST Lots more in our VLDB paper
8 Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 5 Querying Protein Secondary Structure Secondary Structure describes protein folds Beta sheets (e); Alpha helices (h); Loops (l) Predicted structure; helps determine protein function Data Model Sequence of Segments h h h h l l e e e e e <h>, < l>, <4 e> Query Language Sequence of regular expression terms Example: Beta sheet of length 4-4 followed at some point by an helix of length -6 Query: <e 4 > <? Inf> > <h 6> /7/4 Jignesh M. Patel, 4 6
9 Schema and Query Evaluation id name A B len 5 6 sec-seq seq lleee hhhee Protein Table 4 e 4 Evaluation Methods CSP: complex scan SSS: scan segment table ISS: use segment index MISS (k): multiple ISS seg-id id type l e h len start-pos (Derived) Segment Table Scan each protein tuple and evaluate query using a finite state automata N-way merge join of k multiple segment index scans + FK join /7/4 Jignesh M. Patel, 4 7 Multiple Index Scan Method: MISS(k) Query: <P >< P > <P n > To satisfy other query predicates CSP Merge based on protein id and ordering constraints Merge (id) INLJ (id) Idx Scan (id) Sort (id, start) Sort (id, start) Sort (id, start) (id, start) Idx Scan (type, len) (id, start) Idx Scan (type, len) (id, start) Idx Scan (type, len) Protein Table /7/4 Segment Table Segment Table Segment Table k k most highly selective query predicates Jignesh M. Patel, 4 8
10 Query Optimizer Chooses best method for given query Optimization: Requires estimates of query predicate selectivities and result cardinality Cost model: CPU and IO Estimation: Two types of histograms Basic: segment predicate selectivity Complex: query selectivity /7/4 Jignesh M. Patel, 4 9 Basic Histogram Estimates selectivity of individual query predicates -D D array Dimensions : Length, Fold type Value: number of each <type,len> > segment Example Pred <e 4 4>, Est: : 5 Pred <e 4>, Est: : Len H E L /7/4 Jignesh M. Patel, 4
11 Complex Histogram Estimates result cardinality Four-dimensional structure Protein id (equi( equi-width buckets) Start position (equi( equi-width buckets) Length ( 5) Same as in the Type ( e,( h, l ) basic histogram Example: Position [x] [y] [z][ ] [ e ][ holds the number of <e< z z> segments whose starting position is in the range of the y bucket and whose protein id lies within the x bucket range /7/4 Jignesh M. Patel, 4 Complex Histogram Query: {<P> <P>} Compute selectivity of query result 4 Dimensions - Protein id -Start pos -Length -Type 5 5 Start Positions Protein Sequence P P Same Bucket Distinct Bucket pos pos+ ( n np ) P start positions laststartbucketnum pos= / IDbucketWidth startbucketwidth sp sp+ ( ) P P start sp= positions n n IDbucketWidth /7/4 Jignesh M. Patel, 4
12 Experimental Evaluation Techniques implemented in Periscope Configuration: Using the SHORE storage manager, 64MB buffer pool size, 6K page size Also implemented in a commercial ORDBMS Used type extensibility to create array-like data types for sequences & user-defined functions for FSM Data Set: PIR 5K protein tuples, 5MB Segment Table: M tuples, 55MB Platform:.7GHz Xeon, Linux, 4GB SCSI disk /7/4 Jignesh M. Patel, 4 Complex Histogram Accuracy Query: {<l 5 5> > <? X> <h 4 4>} Experiment: Vary Gap Predicate Result Cardinality 5,, 5,, 5, Estimate Actual Length of Gap Predicate (X) Histogram Details Data Set: PIR xx5x Size: 5.8MB Build time: secs Estimation time: ms Histogram accurate within 8% of the actual result /7/4 Jignesh M. Patel, 4 4
13 Query Performance Query: 9 Predicates P G P G P G P 4 G 4 P 5 P=predicate, G=Gap pred. Expt: : Vary Selectivity of P sp Alternatives: S (.%), L (7%) Choice of algorithm is critical CSP sensitive to position of L pred. Choice of k in MISS critical Index Probe + reduce # protein tuples fetched - probe cost, sorting and merging Influenced by query and predicate selectivity Time (secs) CSP MISS() MISS() MISS(5) SSSSL SSLSS LSSSS Query More details in our VLDB paper Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 6
14 PiQL? Details in our PiQA SSDBM paper An algebra in PNF for queries on primary & secondary structures Examples: Match the primary sequence AAANBPPPPSDF,, but ignore mismatch in the segment NBPPP if it is on a loop (P.p * AAA ) ((P.p * NBPPP )) U (P.s( * <L 5 5>)) (P.p( * PSDF ) BLAST: Match n-grams n + match extension + transitive closure Find all occurrences of TGCTGACTCAGCA within 4 bps upstream of a CA with TATA 5- bps upstream of the CA M * TGCTGACTCAGCA 957 M * TATA 6 M * CA 957 TGC..A 5 6 4,,,, TATA CA /7/4 Jignesh M. Patel, 4 7 /7/4 Jignesh M. Patel, 4 8
15 Roadmap Background and Introduction Primary Structure Sequence Matching Querying on Secondary Structure PiQA: : Integrated Query Algebra Summary /7/4 Jignesh M. Patel, 4 9 Summary Bioinformatics applications (urgently) need declarative and efficient query processing tools Current procedural methods reminiscent of pre-relational relational days Database researchers have a lot to contribute! Periscope is our effort in this direction PiQA: : algebraic framework for querying on primary and secondary structures Primary Structure: OASIS Declarative querying on secondary structure: Periscope/PS Current Status OASIS and Periscope/PS currently (beta-)deployed at UM Under development: PiQA powered PiQL queries on Periscope /7/4 Jignesh M. Patel, 4
Similarity Searching of Secondary Structures in Protein Sequences
Similarity Searching of Secondary Structures in Protein Sequences Laurie Hammel* Jignesh M. Patel University of Michigan 1301 Beal Avenue Ann Arbor, MI 48109-2122 {lhammel, jignesh}@eecs.umich.edu Abstract
More informationChapter 11 Declarative and Efficient Querying on Protein Secondary Structures
Chapter 11 Declarative and Efficient Querying on Protein Secondary Structures Jignesh M. Patel, Donald P. Huddler, and Laurie Hammel Summary In spite of the many decades of progress in database research,
More informationBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics
More informationTowards Declarative Querying for Biological Sequences
Towards Declarative Querying for Biological Sequences Sandeep Tata 1 Jignesh M. Patel 1 James S. Friedman 2 Anand Swaroop 2,3 Departments of 1 Electrical Engineering and Computer Science, 2 Ophthalmology
More informationBiology 644: Bioinformatics
Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find
More informationBioinformatics explained: Smith-Waterman
Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com
More informationOASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences
OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences Colin Meek Jignesh M. Patel Department of Electrical Engineering and Computer Science The University of Michigan,
More informationDeclarative Querying for Biological Sequences
Declarative Querying for Biological Sequences Sandeep Tata 1 Jignesh M. Patel 1 James S. Friedman 2 Anand Swaroop 2;3 Departments of 1 Electrical Engineering and Computer Science, 2 Ophthalmology and Visual
More informationJoin Processing for Flash SSDs: Remembering Past Lessons
Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of
More informationCISC 636 Computational Biology & Bioinformatics (Fall 2016)
CISC 636 Computational Biology & Bioinformatics (Fall 2016) Sequence pairwise alignment Score statistics: E-value and p-value Heuristic algorithms: BLAST and FASTA Database search: gene finding and annotations
More informationBLAST, Profile, and PSI-BLAST
BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources
More informationRELATIONAL OPERATORS #1
RELATIONAL OPERATORS #1 CS 564- Spring 2018 ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? Algorithms for relational operators: select project 2 ARCHITECTURE OF A DBMS query
More informationReview. Support for data retrieval at the physical level:
Query Processing Review Support for data retrieval at the physical level: Indices: data structures to help with some query evaluation: SELECTION queries (ssn = 123) RANGE queries (100
More informationAs of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be
48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and
More informationCOS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching
COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database
More informationBLAST MCDB 187. Friday, February 8, 13
BLAST MCDB 187 BLAST Basic Local Alignment Sequence Tool Uses shortcut to compute alignments of a sequence against a database very quickly Typically takes about a minute to align a sequence against a database
More informationCompares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.
Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the
More informationBLAST - Basic Local Alignment Search Tool
Lecture for ic Bioinformatics (DD2450) April 11, 2013 Searching 1. Input: Query Sequence 2. Database of sequences 3. Subject Sequence(s) 4. Output: High Segment Pairs (HSPs) Sequence Similarity Measures:
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationSpring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,
Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1 Motivation for External Sort Often have a large (size greater than the available
More informationCost Models. the query database statistics description of computational resources, e.g.
Cost Models An optimizer estimates costs for plans so that it can choose the least expensive plan from a set of alternatives. Inputs to the cost model include: the query database statistics description
More informationSingle Pass, BLAST-like, Approximate String Matching on FPGAs*
Single Pass, BLAST-like, Approximate String Matching on FPGAs* Martin Herbordt Josh Model Yongfeng Gu Bharat Sukhwani Tom VanCourt Computer Architecture and Automated Design Laboratory Department of Electrical
More informationOverview of Implementing Relational Operators and Query Evaluation
Overview of Implementing Relational Operators and Query Evaluation Chapter 12 Motivation: Evaluating Queries The same query can be evaluated in different ways. The evaluation strategy (plan) can make orders
More information24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:
24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationFASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.
FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence
More informationBioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure
Bioinformatics Sequence alignment BLAST Significance Next time Protein Structure 1 Experimental origins of sequence data The Sanger dideoxynucleotide method F Each color is one lane of an electrophoresis
More informationIn this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.
5 Multiple Match Refinement and T-Coffee In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace. This exposition
More informationBLAST. NCBI BLAST Basic Local Alignment Search Tool
BLAST NCBI BLAST Basic Local Alignment Search Tool http://www.ncbi.nlm.nih.gov/blast/ Global versus local alignments Global alignments: Attempt to align every residue in every sequence, Most useful when
More informationChapter 12: Query Processing
Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation
More informationEstimating Result Size and Execution Times for Graph Queries
Estimating Result Size and Execution Times for Graph Queries Silke Trißl 1 and Ulf Leser 1 Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany {trissl,leser}@informatik.hu-berlin.de
More informationB L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture
February 6, 2008 BLAST: Basic local alignment search tool B L A S T! Jonathan Pevsner, Ph.D. Introduction to Bioinformatics pevsner@jhmi.edu 4.633.0 Copyright notice Many of the images in this powerpoint
More informationSpring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.
Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1 Joins The focus here is on equijoins These are very common,
More informationBrief review from last class
Sequence Alignment Brief review from last class DNA is has direction, we will use only one (5 -> 3 ) and generate the opposite strand as needed. DNA is a 3D object (see lecture 1) but we will model it
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationPrinciples of Bioinformatics. BIO540/STA569/CSI660 Fall 2010
Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationDynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014
Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationA Design of a Hybrid System for DNA Sequence Alignment
IMECS 2008, 9-2 March, 2008, Hong Kong A Design of a Hybrid System for DNA Sequence Alignment Heba Khaled, Hossam M. Faheem, Tayseer Hasan, Saeed Ghoneimy Abstract This paper describes a parallel algorithm
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationDatabase System Concepts
Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationAn Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST
An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise
More informationQuery Processing & Optimization
Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VI Lecture 17, March 24, 2015 Mohammad Hammoud Today Last Two Sessions: DBMS Internals- Part V External Sorting How to Start a Company in Five (maybe
More informationQuery optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.
Database Management Systems DBMS Architecture SQL INSTRUCTION OPTIMIZER MANAGEMENT OF ACCESS METHODS CONCURRENCY CONTROL BUFFER MANAGER RELIABILITY MANAGEMENT Index Files Data Files System Catalog DATABASE
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II
Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive
More informationAdaptive Parallel Compressed Event Matching
Adaptive Parallel Compressed Event Matching Mohammad Sadoghi 1,2 Hans-Arno Jacobsen 2 1 IBM T.J. Watson Research Center 2 Middleware Systems Research Group, University of Toronto April 2014 Mohammad Sadoghi
More informationModule 9: Selectivity Estimation
Module 9: Selectivity Estimation Module Outline 9.1 Query Cost and Selectivity Estimation 9.2 Database profiles 9.3 Sampling 9.4 Statistics maintained by commercial DBMS Web Forms Transaction Manager Lock
More informationDatabase Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building
External Sorting and Query Optimization A.R. Hurson 323 CS Building External sorting When data to be sorted cannot fit into available main memory, external sorting algorithm must be applied. Naturally,
More informationData Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich
Data Modeling and Databases Ch 10: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2015 Quiz I
Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2015 Quiz I There are 12 questions and 13 pages in this quiz booklet. To receive
More informationDatabase Searching Using BLAST
Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain
More informationData Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich
Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application
More informationExternal Sorting Implementing Relational Operators
External Sorting Implementing Relational Operators 1 Readings [RG] Ch. 13 (sorting) 2 Where we are Working our way up from hardware Disks File abstraction that supports insert/delete/scan Indexing for
More informationPrinciples of Data Management. Lecture #9 (Query Processing Overview)
Principles of Data Management Lecture #9 (Query Processing Overview) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Notable News v Midterm
More informationExternal Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1
External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing
More informationA Coprocessor Architecture for Fast Protein Structure Prediction
A Coprocessor Architecture for Fast Protein Structure Prediction M. Marolia, R. Khoja, T. Acharya, C. Chakrabarti Department of Electrical Engineering Arizona State University, Tempe, USA. Abstract Predicting
More informationCOLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)
COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) PRESENTATION BY PRANAV GOEL Introduction On analytical workloads, Column
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VIII Lecture 16, March 19, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part VII Algorithms for Relational Operations (Cont d) Today s Session:
More informationAdministriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky
Administriva Lab 2 Final version due next Wednesday CS 133: Databases Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky Problem sets PSet 5 due today No PSet out this week optional practice
More informationData Mining Technologies for Bioinformatics Sequences
Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment
More informationEfficient Index-based Methods for Processing Large Biological Databases
Efficient Index-based Methods for Processing Large Biological Databases by You Jung Kim A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer
More informationCMSC424: Database Design. Instructor: Amol Deshpande
CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons
More informationAlignment of Long Sequences
Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2009 Mark Craven craven@biostat.wisc.edu Pairwise Whole Genome Alignment: Task Definition Given a pair of genomes (or other large-scale
More informationCISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment
CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features
More informationScalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX
Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX ABSTRACT Symmetric multiprocessor (SMP) computers can increase performance by reducing the time required to analyze large volumes
More informationSequence analysis Pairwise sequence alignment
UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global
More informationAnnouncement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17
Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa
More informationA BANDED SMITH-WATERMAN FPGA ACCELERATOR FOR MERCURY BLASTP
A BANDED SITH-WATERAN FPGA ACCELERATOR FOR ERCURY BLASTP Brandon Harris*, Arpith C. Jacob*, Joseph. Lancaster*, Jeremy Buhler*, Roger D. Chamberlain* *Dept. of Computer Science and Engineering, Washington
More informationEXTERNAL SORTING. Sorting
EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection
More informationUsing MPI One-sided Communication to Accelerate Bioinformatics Applications
Using MPI One-sided Communication to Accelerate Bioinformatics Applications Hao Wang (hwang121@vt.edu) Department of Computer Science, Virginia Tech Next-Generation Sequencing (NGS) Data Analysis NGS Data
More informationQuery Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!
Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13! q Overview! q Optimization! q Measures of Query Cost! Query Evaluation! q Sorting! q Join Operation! q Other
More information.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..
.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. PAM and BLOSUM Matrices Prepared by: Jason Banich and Chris Hoover Background As DNA sequences change and evolve, certain amino acids are more
More informationOutline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012
Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten
More informationIntroduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe
Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms
More informationBiochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)
Biochemistry 324 Bioinformatics Multiple Sequence Alignment (MSA) Big- Οh notation Greek omicron symbol Ο The Big-Oh notation indicates the complexity of an algorithm in terms of execution speed and storage
More informationQuery Optimization Overview
Query Optimization Overview parsing, syntax checking semantic checking check existence of referenced relations and attributes disambiguation of overloaded operators check user authorization query rewrites
More informationCS222P Fall 2017, Final Exam
STUDENT NAME: STUDENT ID: CS222P Fall 2017, Final Exam Principles of Data Management Department of Computer Science, UC Irvine Prof. Chen Li (Max. Points: 100 + 15) Instructions: This exam has seven (7)
More informationImplementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1
Implementing Relational Operators: Selection, Projection, Join Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Readings [RG] Sec. 14.1-14.4 Database Management Systems, R. Ramakrishnan and
More informationConstrained Sequence Alignment
Western Washington University Western CEDAR WWU Honors Program Senior Projects WWU Graduate and Undergraduate Scholarship 12-5-2017 Constrained Sequence Alignment Kyle Daling Western Washington University
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3
More informationHash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example
Student Introduction to Database Systems CSE 414 Hash table example Index Student_ID on Student.ID Data File Student 10 Tom Hanks 10 20 20 Amy Hanks ID fname lname 10 Tom Hanks 20 Amy Hanks Lecture 26:
More informationFundamentals of Database Systems
Fundamentals of Database Systems Assignment: 4 September 21, 2015 Instructions 1. This question paper contains 10 questions in 5 pages. Q1: Calculate branching factor in case for B- tree index structure,
More informationQuery Processing and Advanced Queries. Query Optimization (4)
Query Processing and Advanced Queries Query Optimization (4) Two-Pass Algorithms Based on Hashing R./S If both input relations R and S are too large to be stored in the buffer, hash all the tuples of both
More informationPS2 out today. Lab 2 out today. Lab 1 due today - how was it?
6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your
More information8/19/13. Computational problems. Introduction to Algorithm
I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship
More informationColumn Stores vs. Row Stores How Different Are They Really?
Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background
More informationSomething to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:
Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base
More informationLecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:
Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating
More informationIntroduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs
Introduction to Database Systems CSE 414 Lecture 26: More Indexes and Operator Costs CSE 414 - Spring 2018 1 Student ID fname lname Hash table example 10 Tom Hanks Index Student_ID on Student.ID Data File
More informationPreliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification
Preliminary Syllabus Sep 30 Oct 2 Oct 7 Oct 9 Oct 14 Oct 16 Oct 21 Oct 25 Oct 28 Nov 4 Nov 8 Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification OCTOBER BREAK
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation 12.2
More informationDatabase Management System
Database Management System Lecture Join * Some materials adapted from R. Ramakrishnan, J. Gehrke and Shawn Bowers Today s Agenda Join Algorithm Database Management System Join Algorithms Database Management
More informationHeuristic methods for pairwise alignment:
Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic
More informationSequence Alignment & Search
Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version
More information