BMI/CS 576 Fall 2015 Midterm Exam

Prof. Colin Dewey
Tuesday, October 27th, 2015, 11:00am-12:15pm

Name: KEY

Write your answers on these pages and show your work. You may use the back sides of pages as necessary. Before starting, make sure your exam has every page (numbered 1 through 9). You are allowed 2 (double-sided) pages of notes. Calculators are not allowed.

Problem   Score   Max Score
1                 25
2                 25
3                 25
Total             75

1. (25 points) Suppose that you are given the following k-mer spectrum graph (k = 3) for the assembly of a short DNA sequence.

Vertices: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
Edges: TT → TC, TC → CC, CC → CT, CT → TC, TC → CT, CT → TA

(a) (5 points) Give one Eulerian path through the graph and the superstring to which it corresponds.

There are two possible Eulerian paths. The paths and their corresponding superstrings are:

TT → TC → CC → CT → TC → CT → TA : TTCCTCTA
TT → TC → CT → TC → CC → CT → TA : TTCTCCTA

(b) (5 points) A confused bioinformatician attempts to find a superstring by finding a Hamiltonian path through this graph. Give one Hamiltonian path through this graph, ignoring the vertices that have zero incident edges (i.e., have indegree = outdegree = 0).

There is one Hamiltonian path in this graph:

TT → TC → CC → CT → TA
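The two Eulerian paths in part (a) can be checked by brute force. The following is a minimal sketch, not part of the exam: the edge list is read off the Eulerian paths given in the answer, and the names `EDGES`, `eulerian_paths`, and `superstring` are hypothetical helpers introduced here for illustration.

```python
# Edges of the k-mer spectrum graph (k = 3), as traversed by the Eulerian
# paths in the answer to part (a). Each edge (u, v) stands for the 3-mer u + v[-1].
EDGES = [("TT", "TC"), ("TC", "CC"), ("CC", "CT"),
         ("CT", "TC"), ("TC", "CT"), ("CT", "TA")]


def eulerian_paths(edges):
    """Return every path that uses each edge exactly once (simple backtracking)."""
    paths = []

    def extend(path, remaining):
        if not remaining:
            paths.append(path)
            return
        for idx, (u, v) in enumerate(remaining):
            if u == path[-1]:
                extend(path + [v], remaining[:idx] + remaining[idx + 1:])

    # Try every vertex that has at least one outgoing edge as a start.
    for start in sorted({u for u, _ in edges}):
        extend([start], list(edges))
    return paths


def superstring(path):
    """Spell a path of 2-mers: the first vertex plus the last character of each later vertex."""
    return path[0] + "".join(v[-1] for v in path[1:])


paths = eulerian_paths(EDGES)
strings = sorted(superstring(p) for p in paths)
print(strings)
```

Backtracking is exponential in general and is used here only because the graph has six edges; a linear-time method such as Hierholzer's algorithm would be the usual choice for real assembly graphs.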

(c) (5 points) Briefly explain why a Hamiltonian path in this graph (again ignoring vertices that have zero incident edges) does not solve the assembly problem.

In the k-mer spectrum approach to assembly, a valid superstring must contain exactly the set of k-mers represented by the edges of the graph. A Hamiltonian path through this graph does not traverse all edges of the graph, and thus the superstring corresponding to the Hamiltonian path does not contain all k-mers represented in the graph.

(d) (5 points) Suppose that we add the edge CT → TT to this graph. Does the graph still have an Eulerian path? Briefly justify your answer.

Yes. After the addition of the edge CT → TT, all vertices are balanced except for CT, which has indegree − outdegree = −1, and TA, which has indegree − outdegree = 1. These properties guarantee the existence of an Eulerian path in this graph.

(e) (5 points) Suppose that we add the edge TT → TA to this graph. Does the graph still have an Eulerian path? Briefly justify your answer.

No. After the addition of the edge TT → TA, the vertex TA has indegree − outdegree = 2, and the vertex TT has indegree − outdegree = −2, which precludes the existence of an Eulerian path in the graph.
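The degree arguments in parts (c) through (e) can be reproduced with a short script. This is an illustrative sketch under two assumptions: the base edge list is the one implied by the Eulerian paths of part (a), and the edges are assumed to form a single connected component (true for this graph), so only the degree test is performed. The function names are hypothetical.

```python
from collections import Counter

# Base graph, as implied by the Eulerian paths in part (a).
BASE_EDGES = [("TT", "TC"), ("TC", "CC"), ("CC", "CT"),
              ("CT", "TC"), ("TC", "CT"), ("CT", "TA")]


def degree_test(edges):
    """Return (passes, unbalanced), where unbalanced maps vertex -> indegree - outdegree.

    Assumes the edges form one connected component, so the degree condition
    alone decides whether a directed Eulerian path (or cycle) exists.
    """
    diff = Counter()
    for u, v in edges:
        diff[u] -= 1  # one more outgoing edge
        diff[v] += 1  # one more incoming edge
    unbalanced = {v: d for v, d in diff.items() if d != 0}
    passes = not unbalanced or sorted(unbalanced.values()) == [-1, 1]
    return passes, unbalanced


def kmers(s, k=3):
    """Set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


# Part (c): the Hamiltonian superstring TTCCTA misses two 3-mers of the spectrum.
spectrum = {u + v[-1] for u, v in BASE_EDGES}
missing = spectrum - kmers("TTCCTA")

# Parts (d) and (e): add one edge and rerun the degree test.
with_ct_tt = degree_test(BASE_EDGES + [("CT", "TT")])
with_tt_ta = degree_test(BASE_EDGES + [("TT", "TA")])
print(sorted(missing), with_ct_tt, with_tt_ta)
```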

2. (25 points) Consider the dynamic programming matrix for the global alignment of the sequences GCATT and CCACT with a linear gap penalty with parameters match = 1, mismatch = −3, and space = −1. Recall that the recurrence for this dynamic programming problem is

$$M[i, j] = \max \begin{cases} M[i-1, j-1] + S(x_i, y_j) \\ M[i-1, j] + \text{space} \\ M[i, j-1] + \text{space} \end{cases}$$

            C    C    A    C    T
       0   -1   -2   -3   -4   -5
  G   -1   -2   -3   -4   -5   -6
  C   -2    0   -1   -2   -3   -4
  A   -3   -1   -2    0   -1   -2
  T   -4   -2   -3   -1   -2    0
  T   -5   -3   -4   -2   -3   -1
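The recurrence can be turned into a few lines of code to check the matrix values. This is a minimal sketch, assuming the scoring parameters match = 1, mismatch = −3, and space = −1; `nw_matrix` is a hypothetical name, and traceback pointers are omitted.

```python
def nw_matrix(x, y, match=1, mismatch=-3, space=-1):
    """Fill the global alignment matrix; rows index x, columns index y."""
    rows, cols = len(x) + 1, len(y) + 1
    M = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against spaces.
    for i in range(1, rows):
        M[i][0] = i * space
    for j in range(1, cols):
        M[0][j] = j * space
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,    # align x_i with y_j
                          M[i - 1][j] + space,    # x_i against a space
                          M[i][j - 1] + space)    # y_j against a space
    return M


M = nw_matrix("GCATT", "CCACT")
for row in M:
    print(row)
```

Because entry M[i][j] depends only on the prefixes x[:i] and y[:j], any upper-left sub-matrix is itself the matrix for the corresponding pair of prefixes.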

(a) (5 points) Fill in the values and traceback pointers for the empty cells in the dynamic programming matrix.

See the red values and traceback pointers in the matrix above.

(b) (5 points) Show how to reuse this dynamic programming matrix to find an optimal alignment of the sequences GCA and CCAC. Give both an optimal alignment of the sequences and the optimal score.

GCA is a prefix of GCATT and CCAC is a prefix of CCACT; therefore the matrix consisting of the first four rows and first five columns of the dynamic programming matrix for the original sequences solves the problem for these prefixes. The optimal score is −1, which is obtained by any one of three alignments:

GC-A-     -GCA-     G-CA-
-CCAC     C-CAC     -CCAC

(c) (5 points) Explain how to reuse this dynamic programming matrix to find an optimal alignment of the sequences GCATTC and CCACTG. You do not need to give an optimal alignment or its score.

GCATT is a prefix of GCATTC and CCACT is a prefix of CCACTG; therefore the dynamic programming matrix for the longer sequences is the same as that of the shorter sequences, but with an additional column on the right and an additional row at the bottom. All that is needed is to compute the entries of the last column and last row of this larger matrix. All other entries will have the same values and traceback pointers as their corresponding entries in the smaller matrix.

(d) (5 points) Suppose that we modify the global alignment algorithm such that we find the maximum value in the last column of the dynamic programming matrix (not necessarily the lower-right entry) and trace back from that value to the upper-left entry. Briefly describe what task this algorithm solves in general (i.e., not specifically for the sequences in this problem).

This algorithm solves the problem of finding the highest scoring global alignment of some prefix of the first sequence (which indexes the rows) with the entirety of the second sequence (which indexes the columns). Equivalently, it solves the problem of finding the highest scoring global alignment of the entirety of both sequences with no penalty for insertions in the first sequence at the end of the alignment.

(e) (5 points) Suppose that we are interested in finding the maximum number of matches (aligned characters that are identical) that could be obtained by an alignment of the two sequences. Briefly describe how to modify the values of the scoring parameters such that the dynamic programming algorithm computes this maximum possible number of matches for you.

If we change the scoring parameters to match = 1, mismatch = 0, and space = 0, then the score of an alignment will simply be the number of matches within it. With this scoring function, the dynamic programming algorithm will then find the maximum number of matches possible between two sequences.
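Both modifications in parts (d) and (e) are small changes around the same matrix fill. The sketch below is illustrative only; `global_matrix`, `best_prefix_vs_whole`, and `max_matches` are hypothetical names, and the default scores assume match = 1, mismatch = −3, space = −1 as in the original problem.

```python
def global_matrix(x, y, match, mismatch, space):
    """Fill the global alignment matrix; rows index x, columns index y."""
    M = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        M[i][0] = i * space
    for j in range(1, len(y) + 1):
        M[0][j] = j * space
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i - 1][j] + space,
                          M[i][j - 1] + space)
    return M


def best_prefix_vs_whole(x, y, match=1, mismatch=-3, space=-1):
    """Part (d): best-scoring prefix of x aligned globally against all of y."""
    M = global_matrix(x, y, match, mismatch, space)
    last_col = [row[-1] for row in M]
    # Row index of the maximum value in the last column.
    best_i = max(range(len(last_col)), key=last_col.__getitem__)
    return x[:best_i], last_col[best_i]


def max_matches(x, y):
    """Part (e): with match = 1, mismatch = 0, space = 0 the score counts matches."""
    return global_matrix(x, y, match=1, mismatch=0, space=0)[-1][-1]


print(best_prefix_vs_whole("GCATT", "CCACT"))
print(max_matches("GCATT", "CCACT"))
```

For the sequences of this problem, the last column of the original matrix peaks in the row of the first T, so the prefix GCAT is selected.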

3. (25 points) Suppose that the true unrooted phylogenetic tree relating five biological sequences, $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$, is the one given below. The sequences $x_6$, $x_7$, and $x_8$ represent sequences of ancestral nodes in this tree.

Tree edges and branch lengths:
$x_1$ to $x_6$: 4
$x_5$ to $x_6$: 1
$x_6$ to $x_7$: 1
$x_4$ to $x_7$: 1
$x_7$ to $x_8$: 2
$x_2$ to $x_8$: 2
$x_3$ to $x_8$: 2

(a) (5 points) Suppose that you are given the true distances between each pair of sequences. Which pair of leaves would the UPGMA algorithm choose to join first? Justify your answer.

The leaves for $x_4$ and $x_5$ would be joined first because they have distance = 3, which is the minimum over all pairwise distances between leaves.

(b) (5 points) Suppose that you are given the true distances between each pair of sequences. Recall that the neighbor-joining algorithm uses corrected distances

$$D_{ij} = d_{ij} - (r_i + r_j) \quad \text{where} \quad r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$$

and $L$ is the set of nodes remaining to join. Which pair of leaves would the neighbor-joining algorithm choose to join first? Justify your answer. Hint: to simplify your arithmetic, note that the pair with the minimal value of $D_{ij}$ also has the minimal value of $(|L| - 2) D_{ij}$.

Because we are given the true pairwise distances, which are derived from the true tree, the distances are additive and thus neighbor-joining will reconstruct the true tree. In reconstructing the true tree, starting with the leaf nodes, the first join must either be between leaves $x_1$ and $x_5$ or between $x_2$ and $x_3$. Thus, we must compare $D_{15}$ with $D_{23}$, or, equivalently, $(|L| - 2) D_{15}$ with $(|L| - 2) D_{23}$.

$$(|L| - 2) D_{15} = (|L| - 2)\, d_{15} - (d_{12} + d_{13} + d_{14} + d_{15} + d_{15} + d_{25} + d_{35} + d_{45}) = 3 \cdot 5 - (9 + 9 + 6 + 5 + 5 + 6 + 6 + 3) = 15 - 49 = -34$$

$$(|L| - 2) D_{23} = (|L| - 2)\, d_{23} - (d_{12} + d_{23} + d_{24} + d_{25} + d_{13} + d_{23} + d_{34} + d_{35}) = 3 \cdot 4 - (9 + 4 + 5 + 6 + 9 + 4 + 5 + 6) = 12 - 48 = -36$$

Since $(|L| - 2) D_{23} < (|L| - 2) D_{15}$, we have $D_{23} < D_{15}$, and neighbor-joining would select leaves $x_2$ and $x_3$ to join first.

(c) (5 points) Do the pairwise distances between the leaves of the true tree obey the properties of an ultrametric? Briefly explain why or why not.

No. There are a couple of ways to see this. First, note that if the distances were an ultrametric, then UPGMA would give the correct tree, but we saw in part (a) that UPGMA would join $x_4$ with $x_5$ first, which is not compatible with the true tree. Thus, by contradiction, the distances cannot be an ultrametric. Alternatively, we can examine the distances between all triplets of leaves and find that for leaves $x_1$, $x_4$, and $x_5$, the distances are $d_{14} = 6$, $d_{15} = 5$, and $d_{45} = 3$, which are all different. This violates the definition of an ultrametric, which requires the two largest of the three distances to be equal.
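The arithmetic in parts (a) through (c) can be checked mechanically from the ten pairwise leaf distances used in the answers. This is a rough sketch with hypothetical helper names (`d`, `scaled_corrected`, `violating_triples`); it covers only the first join of each algorithm, not a full UPGMA or neighbor-joining implementation.

```python
from itertools import combinations

# True pairwise distances between leaves x_1..x_5, as used in the answers.
DIST = {(1, 2): 9, (1, 3): 9, (1, 4): 6, (1, 5): 5, (2, 3): 4,
        (2, 4): 5, (2, 5): 6, (3, 4): 5, (3, 5): 6, (4, 5): 3}
LEAVES = [1, 2, 3, 4, 5]


def d(i, j):
    """Symmetric distance lookup with d(i, i) = 0."""
    return 0 if i == j else DIST[(min(i, j), max(i, j))]


def scaled_corrected(i, j):
    """(|L| - 2) * D_ij = (|L| - 2) * d_ij - sum_m d_im - sum_m d_jm."""
    n = len(LEAVES)
    return ((n - 2) * d(i, j)
            - sum(d(i, m) for m in LEAVES)
            - sum(d(j, m) for m in LEAVES))


def violating_triples():
    """Triples whose two largest pairwise distances differ (not ultrametric)."""
    bad = []
    for a, b, c in combinations(LEAVES, 3):
        top_two = sorted([d(a, b), d(a, c), d(b, c)])[1:]
        if top_two[0] != top_two[1]:
            bad.append((a, b, c))
    return bad


upgma_first = min(combinations(LEAVES, 2), key=lambda p: d(*p))
nj_first = min(combinations(LEAVES, 2), key=lambda p: scaled_corrected(*p))
print(upgma_first, nj_first, scaled_corrected(1, 5), scaled_corrected(2, 3))
print((1, 4, 5) in violating_triples())
```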

(d) (5 points) Suppose that the sequences at the leaves are $x_1$ = A, $x_2$ = C, $x_3$ = C, $x_4$ = A, and $x_5$ = G. Give the minimum unweighted parsimony cost of this tree and an assignment of sequences to the ancestral nodes that achieves this cost.

The minimum unweighted parsimony cost is 2, and an assignment to the internal nodes that achieves this cost is $x_6$ = A, $x_7$ = A, $x_8$ = C. One can verify this via Fitch's algorithm. For example, suppose we place a root node $x_9$ on the edge between $x_7$ and $x_8$. Then we have that

$R_6 = \{A, G\}$
$R_7 = \{A\}$
$R_8 = \{C\}$
$R_9 = \{A, C\}$

and regardless of the assignment to the root node, the other internal nodes must be assigned as $x_6$ = A, $x_7$ = A, $x_8$ = C.

(e) (5 points) Suppose that $x_2$ is the outgroup. Where is the root of this tree located?

The root is located somewhere along the edge between nodes $x_2$ and $x_8$.
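The bottom-up pass of Fitch's algorithm from part (d) can be sketched as follows. The rooted topology is an assumption taken from the answers: root $x_9$ placed between $x_7$ and $x_8$, with $x_6$ joining $x_1$ and $x_5$, $x_7$ joining $x_6$ and $x_4$, and $x_8$ joining $x_2$ and $x_3$. The names `CHILDREN`, `LEAF`, and `fitch` are hypothetical, and only the candidate sets and the cost are computed, not the top-down assignment.

```python
# Rooted version of the tree, with a root x_9 on the edge between x_7 and x_8.
CHILDREN = {9: (7, 8), 7: (6, 4), 6: (1, 5), 8: (2, 3)}
LEAF = {1: "A", 2: "C", 3: "C", 4: "A", 5: "G"}

R = {}  # candidate character set R_k for every node k


def fitch(node):
    """Bottom-up Fitch pass: fill R[node] and return the cost of the subtree."""
    if node in LEAF:
        R[node] = {LEAF[node]}
        return 0
    left, right = CHILDREN[node]
    cost = fitch(left) + fitch(right)
    common = R[left] & R[right]
    if common:
        R[node] = common                # children agree: no extra change
    else:
        R[node] = R[left] | R[right]    # children disagree: one more change
        cost += 1
    return cost


cost = fitch(9)
print(cost, {k: sorted(R[k]) for k in (6, 7, 8, 9)})
```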