LAB # 3 / Project # 1
|
|
- Sara Robinson
- 5 years ago
- Views:
Transcription
1 DEI Departamento de Engenharia Informática Algorithms for Discrete Structures 2011/2012 LAB # 3 / Project # 1 Matching Proteins This is a lab guide for Algorithms in Discrete Structures. These exercises are for the classes of October 13, 20 and 27 of The resulting software along with a report must be delivered by the 28 of October. The project will be discussed on the class of 3 of November. Students should not use existing code for the algorithms described in the project, either from software libraries or other electronic sources. It is important to read the full description of the project before starting to design and implement the solution. Students should deliver a working implementation of the project described, along a report including experimental setup s and time and space analysis, both theoretical and experimental, of the algorithms implemented. 1 Online Search In this project students will implement an efficient algorithm for matching small patterns against a large text. This problem is recurrent in computer science and specially in bioinformatic applications, therefore we will use a reference database of protein sequences. Download the Swiss-Prot database from the following URL: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/ knowledgebase/uniprot_sprot.fasta.gz Decompress the file and take a look inside. The file contains the sequence of several proteins, separated by comment lines. The comment lines have the following structure: >sp REFERENCE Description The lines start by > followed by the string sp, followed by the protein reference between characters and finally end with a brief description of the protein. This is a very elementary fasta format. The sequence of the respective protein appears in the next lines, until another comment line or the end of the file is reached. Notice that the lines are not longer than 60 characters, but the protein 1
2 sequence is obtained by removing the newline characters and concatenating the lines. In a first step the algorithm should read a file, like the one we have just described, and load it into memory. Some processing is required, to remove the newlines and to separate the different sequences, so that occurrences of the pattern do not spread across more than one protein. After preprocessing the database it is time to implement an efficient matching algorithm. Some of the characters in the database actually represent other characters, or combinations of other characters. We will ignore this fact and interpret the characters literally. Let us start by implementing the Shift-And algorithm. Recall that for a pattern P = MAFS the Shif-And algorithm simulates the following non-deterministic automata by using the bitwise & and << operations. M A F S The arrow indicates the initial state and the double state is a final state. It is safe to assume that we will not be searching for patterns longer than 30 characters. Therefore the automata needs only one processor word. After finding an occurrence it is reported by indicating the reference of the sequence and the position. Therefore a search for "DPLVSAE" in the Swiss-Prot database should return "Q6GZX3 89". Test your implementation by using patterns of different sizes and include experimental results in the report. Also include the amount of memory required by your program. In C the execution time can be measured with the ftime function. Memory can be measured with the massif tool of valgrind. 2 Indexed Search The results obtained by the previous algorithm should present a convincing case that direct search over the text is slow. Even if the search time is acceptable for a couple of searches it is not good enough, not efficient, for bioinformatic applications that can deal with several thousands of patterns in a search. For example, to assembly a new genome a search algorithm must compare millions of DNA fragments with an average size of 200 characters to a reference genome that can also be a sequence of millions of characters or longer. Use a suffix array to index the protein sequences, it is a generalized suffix array because it indexes more than one sequence. Recall that suffix arrays contain the indexes of the suffixes in lexicographical order. Consider to following suffix array for "ABRACADABRA#", where "#" is the terminator character. 11 # 10 A# 7 ABRA# 0 ABRACADABRA# 2
3 3 ACADABRA# 5 ADABRA# 8 BRA# 1 BRACADABRA# 4 CADABRA# 6 DABRA# 9 RA# 2 RACADABRA# On the left we show the suffix indexes and on the right the respective suffixes, note that the suffixes are not really stored in the structure, they are only presented for illustration purposes. A fairly efficient way to construct a suffix array is to use an MSD radix sort. This means that all the suffixes are first sorted by the first letter, then the procedure continues recursively inside the intervals of suffixes that share the first letter, i.e., that all start by the same letter. Unfortunately suffix arrays must be constructed in main memory, this construction algorithm in secondary memory is very slow. Hence let us first reduce the size of the database to 50MB with a command like head -c 50M UniProt.fasta > Small.fasta. Computing exact matching with suffix arrays consists in determining the interval in the array that contains that suffixes that start with the pattern P we are searching for. In the above example the corresponding interval for "AB" would be [2, 3], i.e. the interval that contains the suffixes "ABRA" and "ABRACADABRA". This interval is determined with two binary searches, one for each extreme of the interval. Also implement the simple accelerant, presented by Gusfield [1], in section This accelerant is an heuristic to avoid having to redundantly compare letters, between P and the extremes of the interval of the binary search. At each step store the size of the common prefix between P and each of the extremes, whenever we need to compare P against a suffix inside this interval we have the guarantee that the common prefix between P and such a suffix can not be smaller than the minimum of the previous values. Include in the report experimental results, including the time and memory necessary to construct suffix arrays, and comparing the time of computing exact matching against the online version. Determine the break even point, i.e. the minimum number of online searches beyond which it pays of to build suffix arrays. 3 Approximate Search In bioinformatics, as in other real world applications of pattern matching, the pattern string may be corrupted by errors, which makes the search harder. In this work we assume only substitution errors, i.e., we assume a letter may be substituted by another letter. By modifying the automata already coded in Section 1, it is possible to search with this type of errors. Consider the following automata that can be used to find patterns with at most one substitution. 3
4 M A F S M A F S Modify the previous implementation of Shift-And to simulate this automata. Even though the above implementation is efficient we have just seen that there is a large gap between the performance of online search and indexed searches. Unfortunately errors are harder to handle over indexes. Still the following observation provides a way to do just that. Assume that we want to find all the occurrences of P with at most two substitutions. If we divide P into 3 parts, a prefix, a substring and suffix, at least one of the pieces most occur without errors. Therefore we can search for those pieces exactly over the suffix array. Then we need to check the resulting positions for complete approximate occurrences of P. Fill in the details of this solution, implement it and present experimental results. 4 Validation To automatically validate the index we use the following conventions. The file containing the texts to index respects the fasta format described above. The name of this file is passed in the command line, i.e., in C it corresponds to argv[1]. Therefore the binary is executed with the following command:./project Small.fasta < in > out The file in contains the input commands that we will describe next. The output is stored in a file named out. The input and output must respect the specification bellow precisely. The output file will be validated against an expected result, stored in a file named check, with the following command: diff out check This command should produce no output, thus indicating that both files are identical. Each operation is issued in a separate line, it begins by a letter that identifies it and is followed by a sequence of argument or options. E followed by a string P, performs an exact online search, for P, on the database. Assume the string P contains no white characters. The output consists of a line for each occurrence found, with the format ref: pos, where ref is the reference of the sequence and pos is the position. The results should appear in increasing order of reference and, on ties in increasing other of position. The order of the references is that the same in which they appear in the database file. In case there are no occurrences this command produces no output. I builds the generalized suffix array of the database and prints the Database indexed. message. In case the suffix array was built, by a previous call, this call does nothing, but still prints the message. 4
5 B followed by a string P, performs an exact search over the suffix array. The output consists of a line for each occurrence found, with the format ref: pos, where ref is the reference of the sequence and pos is the position. The results should be ordered according to the lexicographical order of the suffixes, i.e., the suffix array order. In case two sequences share the same suffix, i.e., when there is a lexicographical tie, the order should be the increasing sequence order. In case the suffix array has not yet been constructed output the Please index database. message. A followed by a number k > 0 and a string P, performs an approximate online search, for P, on the database, with at most k substitutions. The output uses the same format as the command E. F followed by a number k > 0 and a string P, performs an approximate search using the suffix array, with at most k substitutions. The output uses the same format as the command E. Note that this is not the same as the command B. In case the suffix array has not yet been constructed output the Please index database. message. Consider the following example: Database >sp Q6GZX4 001R_FRG3G Putative transcription factor 001R MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD SFRKIYTDLGWKFTPL >sp Q6GZX3 002L_FRG3G Uncharacterized protein 002L MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC KCINRASDPVYQKVKTLHAYPDQCWYVPCAADVGELKMGTQRDTPTNCPTQVCQIVFNML DDGSVTMDDVKNTINCDFSKYVPPPPPPKPTPPTPPTPPTPPTPPTPPTPPTPRPVHNRK VMFFVAGAVLVAILISTVRW Input E MAFS E SAED B MAFS B SAED F 1 SAAD I I B MAF B SAED A 1 SAAD F 1 SAAD 5
6 Output Q6GZX4 0 Please index database. Please index database. Please index database. Database indexed. Database indexed. Q6GZX4 0 Q6GZX3 141 Q6GZX3 175 Q6GZX3 208 Q6GZX3 141 Q6GZX3 175 Q6GZX3 208 References [1] Gusfield, D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge Univ Press,
A Compressed Self-Index on Words
2nd Workshop on Compression, Text, and Algorithms 2007 Santiago de Chile. Nov 1st, 2007 A Compressed Self-Index on Words Databases Laboratory University of A Coruña (Spain) + Gonzalo Navarro Outline Introduction
More informationGiri Narasimhan. COT 6936: Topics in Algorithms. The String Matching Problem. Approximate String Matching
COT 6936: Topics in lgorithms Giri Narasimhan ECS 254 / EC 2443; Phone: x3748 giri@cs.fiu.edu http://www.cs.fiu.edu/~giri/teach/cot6936_s10.html https://online.cis.fiu.edu/portal/course/view.php?id=427
More informationAdvanced Algorithms: Project
Advanced Algorithms: Project (deadline: May 13th, 2016, 17:00) Alexandre Francisco and Luís Russo Last modified: February 26, 2016 This project considers two different problems described in part I and
More informationI519 Introduction to Bioinformatics. Indexing techniques. Yuzhen Ye School of Informatics & Computing, IUB
I519 Introduction to Bioinformatics Indexing techniques Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents We have seen indexing technique used in BLAST Applications that rely
More informationLectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures
4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut
More informationFastA & the chaining problem
FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,
More informationFastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:
FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem
More informationNew Implementation for the Multi-sequence All-Against-All Substring Matching Problem
New Implementation for the Multi-sequence All-Against-All Substring Matching Problem Oana Sandu Supervised by Ulrike Stege In collaboration with Chris Upton, Alex Thomo, and Marina Barsky University of
More informationString Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42
String Matching Pedro Ribeiro DCC/FCUP 2016/2017 Pedro Ribeiro (DCC/FCUP) String Matching 2016/2017 1 / 42 On this lecture The String Matching Problem Naive Algorithm Deterministic Finite Automata Knuth-Morris-Pratt
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3
More informationSpace Efficient Linear Time Construction of
Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics
More information17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.
17 dicembre 2003 Luca Bortolussi SUFFIX TREES From exact to approximate string matching. An introduction to string matching String matching is an important branch of algorithmica, and it has applications
More informationRecursively Defined Functions
Section 5.3 Recursively Defined Functions Definition: A recursive or inductive definition of a function consists of two steps. BASIS STEP: Specify the value of the function at zero. RECURSIVE STEP: Give
More information24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:
24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid
More informationInexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)
Inexact Matching, Alignment See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming) Outline Yet more applications of generalized suffix trees, when combined with a least common ancestor
More informationPAPER Constructing the Suffix Tree of a Tree with a Large Alphabet
IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationBLAST. Basic Local Alignment Search Tool. Used to quickly compare a protein or DNA sequence to a database.
BLAST Basic Local Alignment Search Tool Used to quickly compare a protein or DNA sequence to a database. There is no such thing as a free lunch BLAST is fast and highly sensitive compared to competitors.
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationApplied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017
Applied Databases Lecture 14 Indexed String Search, Suffix Trees Sebastian Maneth University of Edinburgh - March 9th, 2017 2 Recap: Morris-Pratt (1970) Given Pattern P, Text T, find all occurrences of
More informationData structures for string pattern matching: Suffix trees
Suffix trees Data structures for string pattern matching: Suffix trees Linear algorithms for exact string matching KMP Z-value algorithm What is suffix tree? A tree-like data structure for solving problems
More informationA Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms
A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms Charalampos S. Kouzinopoulos and Konstantinos G. Margaritis Parallel and Distributed Processing Laboratory Department
More informationCOS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching
COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database
More informationBioinformatics explained: Smith-Waterman
Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com
More informationSolution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.
Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,
More informationExact String Matching. The Knuth-Morris-Pratt Algorithm
Exact String Matching The Knuth-Morris-Pratt Algorithm Outline for Today The Exact Matching Problem A simple algorithm Motivation for better algorithms The Knuth-Morris-Pratt algorithm The Exact Matching
More informationGenomic Files. University of Massachusetts Medical School. October, 2015
.. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationAlgorithms and Data Structures
Algorithms and Data Structures Charles A. Wuethrich Bauhaus-University Weimar - CogVis/MMC May 11, 2017 Algorithms and Data Structures String searching algorithm 1/29 String searching algorithm Introduction
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:
More informationString Matching Algorithms
String Matching Algorithms Georgy Gimel farb (with basic contributions from M. J. Dinneen, Wikipedia, and web materials by Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.) COMPSCI 369 Computational
More informationModern Information Retrieval
Modern Information Retrieval Chapter 9 Indexing and Searching with Gonzalo Navarro Introduction Inverted Indexes Signature Files Suffix Trees and Suffix Arrays Sequential Searching Multi-dimensional Indexing
More information6.00 Introduction to Computer Science and Programming Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.00 Introduction to Computer Science and Programming Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationBUNDLED SUFFIX TREES
Motivation BUNDLED SUFFIX TREES Luca Bortolussi 1 Francesco Fabris 2 Alberto Policriti 1 1 Department of Mathematics and Computer Science University of Udine 2 Department of Mathematics and Computer Science
More informationFrom Smith-Waterman to BLAST
From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is
More informationData Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data
More informationData Mining Technologies for Bioinformatics Sequences
Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationComputational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh
Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh Overlap detection: Semi-Global Alignment An overlap of two sequences is considered an
More informationLecture 7 February 26, 2010
6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some
More information12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006
12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 3 Sequence comparison by compression This chapter is based on the following articles, which are all recommended reading: X. Chen,
More informationAlgorithms for Bioinformatics
These slides are based on previous years slides of Alexandru Tomescu, Leena Salmela and Veli Mäkinen 582670 Algorithms for Bioinformatics Lecture 1: Primer to algorithms and molecular biology 2.9.2014
More informationAlgorithms for Bioinformatics
582670 Algorithms for Bioinformatics Lecture 1: Primer to algorithms and molecular biology 4.9.2012 Course format Thu 12-14 Thu 10-12 Tue 12-14 Grading Exam 48 points Exercises 12 points 30% = 1 85% =
More informationSOLiD GFF File Format
SOLiD GFF File Format 1 Introduction The GFF file is a text based repository and contains data and analysis results; colorspace calls, quality values (QV) and variant annotations. The inputs to the GFF
More informationReconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences
SEQUENCE ALIGNMENT ALGORITHMS 1 Why compare sequences? Reconstructing long sequences from overlapping sequence fragment Searching databases for related sequences and subsequences Storing, retrieving and
More informationSuffix Tree and Array
Suffix Tree and rray 1 Things To Study So far we learned how to find approximate matches the alignments. nd they are difficult. Finding exact matches are much easier. Suffix tree and array are two data
More informationDistributed and Paged Suffix Trees for Large Genetic Databases p.1/18
istributed and Paged Suffix Trees for Large Genetic atabases Raphaël Clifford and Marek Sergot raphael@clifford.net, m.sergot@ic.ac.uk Imperial College London, UK istributed and Paged Suffix Trees for
More informationUSING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT
IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah
More informationDatabase Searching Using BLAST
Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain
More informationBioinformatics I, WS 09-10, D. Huson, February 10,
Bioinformatics I, WS 09-10, D. Huson, February 10, 2010 189 12 More on Suffix Trees This week we study the following material: WOTD-algorithm MUMs finding repeats using suffix trees 12.1 The WOTD Algorithm
More informationDEFLATE COMPRESSION ALGORITHM
DEFLATE COMPRESSION ALGORITHM Savan Oswal 1, Anjali Singh 2, Kirthi Kumari 3 B.E Student, Department of Information Technology, KJ'S Trinity College Of Engineering and Research, Pune, India 1,2.3 Abstract
More informationCSCI 1820 Notes. Scribes: tl40. February 26 - March 02, Estimating size of graphs used to build the assembly.
CSCI 1820 Notes Scribes: tl40 February 26 - March 02, 2018 Chapter 2. Genome Assembly Algorithms 2.1. Statistical Theory 2.2. Algorithmic Theory Idury-Waterman Algorithm Estimating size of graphs used
More informationSorting. Sorting in Arrays. SelectionSort. SelectionSort. Binary search works great, but how do we create a sorted array in the first place?
Sorting Binary search works great, but how do we create a sorted array in the first place? Sorting in Arrays Sorting algorithms: Selection sort: O(n 2 ) time Merge sort: O(nlog 2 (n)) time Quicksort: O(n
More informationAnnouncements COMP 141. Writing to a File. Reading From a File 10/18/2017. Reading/Writing from/to Files
Announcements COMP 141 Reading/Writing from/to Files Reminders Program 5 due Thurs., October 19 th by 11:55pm Solutions to selected problems from Friday s lab are in my Box.com directory (LoopLab.py) Programming
More informationLamé s Theorem. Strings. Recursively Defined Sets and Structures. Recursively Defined Sets and Structures
Lamé s Theorem Gabriel Lamé (1795-1870) Recursively Defined Sets and Structures Lamé s Theorem: Let a and b be positive integers with a b Then the number of divisions used by the Euclidian algorithm to
More informationLecture 5: Suffix Trees
Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common
More informationReading Assignment. Strings. K.N. King Chapter 13. K.N. King Sections 23.4, Supplementary reading. Harbison & Steele Chapter 12, 13, 14
Reading Assignment Strings char identifier [ size ] ; char * identifier ; K.N. King Chapter 13 K.N. King Sections 23.4, 23.5 Supplementary reading Harbison & Steele Chapter 12, 13, 14 Strings are ultimately
More informationAs of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be
48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and
More informationGenomic Files. University of Massachusetts Medical School. October, 2014
.. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationOptimization of Boyer-Moore-Horspool-Sunday Algorithm
Optimization of Boyer-Moore-Horspool-Sunday Algorithm Rionaldi Chandraseta - 13515077 Program Studi Teknik Informatika Sekolah Teknik Elektro dan Informatika, Institut Teknologi Bandung Bandung, Indonesia
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies November 17, 2012 1 Introduction Introduction 2 BLAST What is BLAST? The algorithm 3 Genome assembly De
More informationLinear Work Suffix Array Construction
Linear Work Suffix Array Construction Juha Karkkainen, Peter Sanders, Stefan Burkhardt Presented by Roshni Sahoo March 7, 2019 Presented by Roshni Sahoo Linear Work Suffix Array Construction March 7, 2019
More informationLiterature Databases
Literature Databases Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Overview 1. Databases 2. Publications in Science 3. PubMed and
More information1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998
7 Multiple Sequence Alignment The exposition was prepared by Clemens GrÃP pl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all
More informationExtreme Computing. Introduction to MapReduce. Cluster Outline Map Reduce
Extreme Computing Introduction to MapReduce 1 Cluster We have 12 servers: scutter01, scutter02,... scutter12 If working outside Informatics, first: ssh student.ssh.inf.ed.ac.uk Then log into a random server:
More informationGSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu
GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu Matt Huska Freie Universität Berlin Computational Methods for High-Throughput Omics
More informationtagerator: a program for mapping short sequence tags a manual
tagerator: a program for mapping short sequence tags a manual Stefan Kurtz Center for Bioinformatics, University of Hamburg August 6, 2012 1 Preliminary definitions By S let us denote the concatenation
More informationString quicksort solves this problem by processing the obtained information immediately after each symbol comparison.
Lcp-Comparisons General (non-string) comparison-based sorting algorithms are not optimal for sorting strings because of an imbalance between effort and result in a string comparison: it can take a lot
More informationJava s String Class. in simplest form, just quoted text. used as parameters to. "This is a string" "So is this" "hi"
1 Java s String Class in simplest form, just quoted text "This is a string" "So is this" "hi" used as parameters to Text constructor System.out.println 2 The Empty String smallest possible string made
More informationComputational Molecular Biology
Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive
More informationCS313 Exercise 4 Cover Page Fall 2017
CS313 Exercise 4 Cover Page Fall 2017 Due by the start of class on Thursday, October 12, 2017. Name(s): In the TIME column, please estimate the time you spent on the parts of this exercise. Please try
More informationINTRODUCTION TO BIOINFORMATICS
Molecular Biology-2019 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain
More informationCHAPTER 3 A TIME-DEPENDENT k-shortest PATH ALGORITHM FOR ATIS APPLICATIONS
CHAPTER 3 A TIME-DEPENDENT k-shortest PATH ALGORITHM FOR ATIS APPLICATIONS 3.1. Extension of a Static k-sp Algorithm to the Time-Dependent Case Kaufman and Smith [1993] showed that under the consistency
More informationExact String Matching Part II. Suffix Trees See Gusfield, Chapter 5
Exact String Matching Part II Suffix Trees See Gusfield, Chapter 5 Outline for Today What are suffix trees Application to exact matching Building a suffix tree in linear time, part I: Ukkonen s algorithm
More informationAnnouncements for this Lecture
Lecture 6 Objects Announcements for this Lecture Last Call Quiz: About the Course Take it by tomorrow Also remember survey Assignment 1 Assignment 1 is live Posted on web page Due Thur, Sep. 18 th Due
More informationLexical Analysis. COMP 524, Spring 2014 Bryan Ward
Lexical Analysis COMP 524, Spring 2014 Bryan Ward Based in part on slides and notes by J. Erickson, S. Krishnan, B. Brandenburg, S. Olivier, A. Block and others The Big Picture Character Stream Scanner
More informationComplexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np
Chapter 1: Introduction Introduction Purpose of the Theory of Computation: Develop formal mathematical models of computation that reflect real-world computers. Nowadays, the Theory of Computation can be
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More information(for more info see:
Genome assembly (for more info see: http://www.cbcb.umd.edu/research/assembly_primer.shtml) Introduction Sequencing technologies can only "read" short fragments from a genome. Reconstructing the entire
More informationSPEEDING UP INDEX CONSTRUCTION WITH GPU FOR DNA DATA SEQUENCES
SPEEDING UP INDEX CONSTRUCTION WITH GPU FOR DNA DATA SEQUENCES INTRODUCTION Rahmaddiansyah 1 and Nur aini Abdul Rashid 2 1 Universiti Sains Malaysia (USM), Malaysia, new_rahmad@yahoo.co.id 2 Universiti
More informationAbdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013
Abdullah-Al Mamun CSE 5095 Yufeng Wu Spring 2013 Introduction Data compression is the art of reducing the number of bits needed to store or transmit data Compression is closely related to decompression
More informationRedesde Computadores (RCOMP)
Redesde Computadores (RCOMP) Theoretical-Practical (TP) Lesson 04 2016/2017 Introduction to IPv4 operation. IPv4 addressing and network masks. Instituto Superior de Engenharia do Porto Departamento de
More informationUNIT -2 LEXICAL ANALYSIS
OVER VIEW OF LEXICAL ANALYSIS UNIT -2 LEXICAL ANALYSIS o To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce
More informationCMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays
CMSC423: Bioinformatic Algorithms, Databases and Tools Exact string matching: Suffix trees Suffix arrays Searching multiple strings Can we search multiple strings at the same time? Would it help if we
More informationMore text file manipulation: sorting, cutting, pasting, joining, subsetting,
More text file manipulation: sorting, cutting, pasting, joining, subsetting, Laboratory of Genomics & Bioinformatics in Parasitology Department of Parasitology, ICB, USP Inverse cat Last week we learned
More informationINTRODUCTION TO BIOINFORMATICS
Molecular Biology-2017 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain
More informationCOMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou Administrative! [ALSU03] Chapter 3 - Lexical Analysis Sections 3.1-3.4, 3.6-3.7! Reading for next time [ALSU03] Chapter 3 Copyright (c) 2010 Ioanna
More informationIntroduction to Bioinformatics Problem Set 3: Genome Sequencing
Introduction to Bioinformatics Problem Set 3: Genome Sequencing 1. Assemble a sequence with your bare hands! You are trying to determine the DNA sequence of a very (very) small plasmids, which you estimate
More informationSlide Set 2. for ENCM 335 in Fall Steve Norman, PhD, PEng
Slide Set 2 for ENCM 335 in Fall 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary September 2018 ENCM 335 Fall 2018 Slide Set 2 slide
More informationIndexing Variable Length Substrings for Exact and Approximate Matching
Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of
More informationOverview of C. Basic Data Types Constants Variables Identifiers Keywords Basic I/O
Overview of C Basic Data Types Constants Variables Identifiers Keywords Basic I/O NOTE: There are six classes of tokens: identifiers, keywords, constants, string literals, operators, and other separators.
More informationJava Foundations: Unit 3. Parts of a Java Program
Java Foundations: Unit 3 Parts of a Java Program class + name public class HelloWorld public static void main( String[] args ) System.out.println( Hello world! ); A class creates a new type, something
More informationProblem 3: Theoretical Questions: Complete the midterm exam taken from a previous year attached at the end of the assignment.
CSE 2021: Computer Organization Assignment # 1: MIPS Programming Due Date: October 25, 2010 Please note that a short quiz based on Assignment 1 will be held in class on October 27, 2010 to assess your
More informationDivide and Conquer Strategy. (Page#27)
MUHAMMAD FAISAL MIT 4 th Semester Al-Barq Campus (VGJW01) Gujranwala faisalgrw123@gmail.com Reference Short Questions for MID TERM EXAMS CS502 Design and Analysis of Algorithms Divide and Conquer Strategy
More informationComputational models for bionformatics
Computational models for bionformatics De-novo assembly and alignment-free measures Michele Schimd Department of Information Engineering July 8th, 2015 Michele Schimd (DEI) PostDoc @ DEI July 8th, 2015
More informationComputational biology course IST 2015/2016
Computational biology course IST 2015/2016 Introduc)on to Algorithms! Algorithms: problem- solving methods suitable for implementation as a computer program! Data structures: objects created to organize
More informationConcepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens
Concepts Introduced in Chapter 3 Lexical Analysis Regular Expressions (REs) Nondeterministic Finite Automata (NFA) Converting an RE to an NFA Deterministic Finite Automatic (DFA) Lexical Analysis Why separate
More informationGraph Contraction. Graph Contraction CSE341T/CSE549T 10/20/2014. Lecture 14
CSE341T/CSE549T 10/20/2014 Lecture 14 Graph Contraction Graph Contraction So far we have mostly talking about standard techniques for solving problems on graphs that were developed in the context of sequential
More informationTREES Lecture 10 CS2110 Spring2014
TREES Lecture 10 CS2110 Spring2014 Readings and Homework 2 Textbook, Chapter 23, 24 Homework: A thought problem (draw pictures!) Suppose you use trees to represent student schedules. For each student there
More information1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998
7 Multiple Sequence Alignment The exposition was prepared by Clemens Gröpl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all
More informationSpecial course in Computer Science: Advanced Text Algorithms
Special course in Computer Science: Advanced Text Algorithms Lecture 8: Multiple alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg
More information