Algorithms for Bioinformatics

Similar documents
Evolutionary tree reconstruction (Chapter 10)

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

11/17/2009 Comp 590/Comp Fall

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

Lecture 20: Clustering and Evolution

Lecture 20: Clustering and Evolution

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

4/4/16 Comp 555 Spring

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Distance-based Phylogenetic Methods Near a Polytomy

Introduction to Trees

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Recent Research Results. Evolutionary Trees Distance Methods

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

CLUSTERING IN BIOINFORMATICS

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Cluster analysis. Agnieszka Nowak - Brzezinska

BMI/CS 576 Fall 2015 Midterm Exam

Exercise set 2 Solutions

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

human chimp mouse rat

Scaling species tree estimation methods to large datasets using NJMerge

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

2. CONNECTIVITY Connectivity

Clustering of Proteins

Unsupervised Learning Hierarchical Methods

Markovian Models of Genetic Inheritance

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Lower Bound on Comparison-based Sorting

Randomized Algorithms 2017A - Lecture 10 Metric Embeddings into Random Trees

Terminology. A phylogeny is the evolutionary history of an organism

Shortest path problems

FastJoin, an improved neighbor-joining algorithm

Fixed-Parameter Algorithms, IA166

Fast Hierarchical Clustering via Dynamic Closest Pairs

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Special course in Computer Science: Advanced Text Algorithms

Copyright 2000, Kevin Wayne 1

Algorithm Design and Analysis

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

Rapid Neighbour-Joining

Clustering Techniques

Notes on Binary Dumbbell Trees

Tree Models of Similarity and Association. Clustering and Classification Lecture 5

Scribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017

The worst case complexity of Maximum Parsimony

Algebraic method for Shortest Paths problems

DESIGN AND ANALYSIS OF ALGORITHMS (DAA 2017)

Clustering & Bootstrapping

CSE 549: Computational Biology

10701 Machine Learning. Clustering

1 i n (p i + r n i ) (Note that by allowing i to be n, we handle the case where the rod is not cut at all.)

Geometric Modeling. Mesh Decimation. Mesh Decimation. Applications. Copyright 2010 Gotsman, Pauly Page 1. Oversampled 3D scan data

Algorithms for Bioinformatics

Copyright 2000, Kevin Wayne 1

EECS 2011M: Fundamentals of Data Structures

Phylogenetic networks that display a tree twice

Mining Social Network Graphs

Discrete mathematics

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Lesson 3. Prof. Enza Messina

Computational Genomics and Molecular Biology, Fall

Cluster Analysis. Ying Shen, SSE, Tongji University

3. Cluster analysis Overview

Chapter 16. Greedy Algorithms

In this lecture, we ll look at applications of duality to three problems:

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Treewidth and graph minors

Multi-Modal Data Fusion: A Description

CSE 3101: Introduction to the Design and Analysis of Algorithms. Office hours: Wed 4-6 pm (CSEB 3043), or by appointment.

Undirected Graphs. V = { 1, 2, 3, 4, 5, 6, 7, 8 } E = { 1-2, 1-3, 2-3, 2-4, 2-5, 3-5, 3-7, 3-8, 4-5, 5-6 } n = 8 m = 11

Hierarchical Clustering Lecture 9

Ma/CS 6b Class 26: Art Galleries and Politicians

Sergei Silvestrov, Christopher Engström. January 29, 2013

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

Algorithm Design and Analysis

Write an algorithm to find the maximum value that can be obtained by an appropriate placement of parentheses in the expression

Algorithms Dr. Haim Levkowitz

by conservation of flow, hence the cancelation. Similarly, we have

Graph Theory Questions from Past Papers

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Motivation for B-Trees

Dimension reduction : PCA and Clustering

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

10 Clustering and Trees

Lecture 3: Graphs and flows

Hierarchical Clustering

Searching Algorithms/Time Analysis

CSE 5243 INTRO. TO DATA MINING

On the Optimality of the Neighbor Joining Algorithm

Algorithms: Lecture 10. Chalmers University of Technology

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Outline. CS38 Introduction to Algorithms. Approximation Algorithms. Optimization Problems. Set Cover. Set cover 5/29/2014. coping with intractibility

Transcription:

Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and phylogeny 3.10.2013

Outline Distance-based clustering, UPGMA Neighbor joining Study group assignments About the exam 2 / 41

Phylogenetic tree: Bears 3 / 41

Phylogeny by distance method pipeline genome sequences of the species For all pairs of species, find the homologous genes permutations representing the homologs Compute the rearrangement distance for all pairs of species D(A, B) for all species A and B Build the phylogenetic tree from the distances 4 / 41

Clustering Clustering can be loosely stated as the problem of grouping objects into sets called clusters, where the members of the cluster are similar in some sense. Hierarchical clustering: Iteratively join two closest clusters forming a tree hierarchy (agglomerative... also divisive version exists) Distance between clusters can be e.g. max pair-wise distance (complete linkage), min (single linkage), UPGMA (average linkage), neighbor joining Partitional clustering: k-means 5 / 41

Distances in a phylogenetic tree Distance matrix D = (d ij ) gives pairwise distances for leaves of the phylogenetic tree In addition, the phylogenetic tree will now specify distances between internal nodes Denote these with dij as well 6 1 2 3 4 5 Distance d ij states how far apart species i and j are evolutionary. 8 7 6 / 41

Distances in evolutionary context Distance d ij in evolutionary context satisfy the following conditions: Positivity: dij 0 Identity: dij = 0 if and only if i = j Symmetry: d ij = d ji for each i, j Triangle inequality: d ij d ik + d kj for each i, j, k Distance satisfying these conditions is called metric In addition, evolutionary mechanisms may impose additional constraints on the distances additive and ultrametric distances 7 / 41

Additive trees Suppose that every edge in a tree is labeled with a distance d ij A tree is called additive if for every pair of leaves the distance between the leaves is the sum of the edge distances on the path between the leaves. Example: A B C D A 0 2 4 4 B 2 0 4 4 C 4 4 0 2 D 4 4 2 0 A 1 1 B 2 C 1 1 D 8 / 41

Ultrametric trees A rooted additive tree is called an ultrametric tree if the distances between any two leaves i and j and their common ancestor k are equal d ik = d jk d ij /2 corresponds to the time elapsed since divergence of i and j from the common parent In other words, edge lengths are measured by a molecular clock with a constant rate 9 / 41

Identifying ultrametric data We can identify distances to be ultrametric by the three-point condition: D corresponds to an ultrametric tree if and only if for any three species we can label them i, j, and k such that the distances satisfy: d ik = d jk d ij If we find out that the data is ultrametric, we can utilise a simple algorithm to find the corresponding tree 10 / 41

Ultrametric trees Only vertical segments of the tree have correspondence to some distance d ij Horizontal segments act as connectors Time d 8,9 8 7 9 d ik = d jk for any two leaves i, j and any ancestor k of i and j 5 4 3 2 1 6 Observation time 11 / 41

UPGMA algorithm UPGMA (unweighted pair group method using arithmetic averages) constructs a phylogenetic tree via clustering The algorithm works by at the same time Merging two clusters Creating a new node on the tree The tree is built from leaves towards the root UPGMA produces a ultrametric tree 12 / 41

Cluster distances Let distance d ij between clusters C i and C j be d ij = 1 C i C j p C i,q C j d pq, that is, the average distance between points (species) in the cluster. 13 / 41

UPGMA algorithm Initialisation Assign each point i to its own cluster C i Define one leaf for each point and place it at height zero Iteration Find clusters i and j for which dij is minimal Define new cluster k by C k = C i C j and compute d kl for all l Add a node k with children i and j to the tree. Place k at height d ij /2 Remove clusters i and j Termination When only two clusters i and j remain, place root at height d ij /2 14 / 41

UPGMA example 1 2 9 3 4 5 6 7 8 1 d 2 6,8 1 2 4 5 3 15 / 41

UPGMA example 1 2 9 3 4 5 7 6 1 d 2 1,2 1 2 4 5 3 8 1 d 2 6,8 16 / 41

UPGMA example 1 2 9 3 4 5 6 7 8 1 d 2 4,5 1 d 2 6,8 1 2 4 5 3 17 / 41

UPGMA example 1 2 9 3 4 5 6 7 1 8 d 2 6,8 1 d 2 3,7 1 2 4 5 3 18 / 41

UPGMA example 1 2 9 3 4 5 6 7 8 1 d 2 6,8 1 2 4 5 3 19 / 41

UPGMA implementation In naive implementation, each iteration takes O(n 2 ) time with n initial points = algorithm takes O(n 3 ) time The algorithm can be implemented to take only O(n 2 ) time (see Gronau & Moran, 2006, for a survey) 20 / 41

Problem solved? We now have a simple algorithm which finds an ultrametric tree If the data is ultrametric, then there is exactly one ultrametric tree corresponding to the data The tree found is then the correct solution to the phylogeny problem if the assumptions hold Unfortunately, the data is not ultrametric in practice Measurement errors distort distances Basic assumption of a molecular clock does not hold usually very well 21 / 41

Incorrect reconstruction of non-ultrametric data by UPGMA 2 3 1 Tree which corresponds to non-ultrametric distances 4 1 4 2 3 Incorrect ultrametric reconstruction by UPGMA algorithm 22 / 41

Outline Distance-based clustering, UPGMA Neighbor joining Study group assignments About the exam 23 / 41

Checking for additivity Recall: a tree is additive if for every pair of leaves the distance between the leaves is the sum of the edge distances on the path between the leaves. How can we check that the data is additive? Let i, j, k, and l be four distinct species Compute three sums dij + d kl dik + d jl dil + d jk d j ij i k d l kl j i d d ik jl k l j i d d il jk k l 24 / 41

Four-point condition d j ij i k d l kl j i d d ik jl k l j i d d il jk k l Sums represented by the middle and right figures cover all edges Sum represented by the left figure does not cover all edges Four-point condition: i, j, k, and l satisfy the four-point condition if two of the sums d ij + d kl, d ik + d jl, and d il + d jk are equal and the third one is smaller than these two. An n n matrix D is additive if and only if the four-point condition holds for every 4 elements 1 i, j, k, l n. 25 / 41

Checking for additivity: Example A B C D A 0 6 7 5 B 0 11 9 C 0 6 D 0 d AB + d CD = 6 + 6 = 12 d AC + d BD = 7 + 9 = 16 d AD + d BC = 5 + 11 = 16 Two of the sums are equal and the third is smaller = Four-point condition holds = Matrix is additive 26 / 41

Finding an additive phylogenetic tree Additive trees can be found for example by the neighbor joining method (Saitou & Nei, 1987) However, in real data, even additivity usually does not hold very well 27 / 41

Neighbor joining algorithm Neighbor joining works in a similar fashion to UPGMA Find clusters C1 and C 2 that minimize a function f (C 1, C 2 ) Join the two clusters C 1 and C 2 into a new cluster C Add a node to the tree corresponding to C Assign distances to new branches Differences in The choice of function f (C 1, C 2 ) How to assign the distances 28 / 41

Neighbor joining algorithm: Separation of a cluster Let u(c i ) be the separation of cluster C i from other clusters defined as u(c i ) = 1 n 2 where n is the number of clusters. C j d ij 29 / 41

Neighbor joining algorithm Neighbor joining at the same time Minimizes the distance between clusters C i and C j to be joined Maximizes the separation of both Ci and C j from other clusters Recall that UPGMA only minimizes the distance between the clusters C i and C j 30 / 41

Neighbor joining algorithm Initialization as in UPGMA Iteration Find clusters Ci and C j for which d ij u(c i ) u(c j ) is minimum Define a new cluster C k = C i C j and compute d kl for all l: d kl = 1 2 (d il + d jl d ij ) Remove clusters Ci and C j Define a node k with edges to i and j Assign length 1 2 (d ij + u(c i ) u(c j )) to the edge i k Assign length 1 2 (d ij + u(c j ) u(c i )) to the edge j k Termination When two clusters i and j remain, add an edge between them with weight d ij. 31 / 41

Neighbor joining algorithm: Example A B C D A 0 6 7 5 B 0 11 9 C 0 6 D 0 i, j d ij u(c i ) u(c j ) A,B 6 9 13 = 16 A,C 7 9 12 = 14 A,D 5 9 10 = 14 B,C 11 13 12 = 14 B,D 9 13 10 = 14 C,D 6 12 10 = 16 i u(c i ) A (6 + 7 + 5)/2 = 9 B (6 + 11 + 9)/2 = 13 C (7 + 11 + 6)/2 = 12 D (5 + 9 + 6)/2 = 10 Choose either one of the red pairs to join 32 / 41

Neighbor joining algorithm: Example A B C D A 0 6 7 5 B 0 11 9 C 0 6 D 0 i, j d ij u(c i ) u(c j ) A,B 6 9 13 = 16 A,C 7 9 12 = 14 A,D 5 9 10 = 14 B,C 11 13 12 = 14 B,D 9 13 10 = 14 C,D 6 12 10 = 16 i u(c i ) A (6 + 7 + 5)/2 = 9 B (6 + 11 + 9)/2 = 13 C (7 + 11 + 6)/2 = 12 D (5 + 9 + 6)/2 = 10 E 1 A 5 B C D d AE = 1 2 6 + 1 2 (9 13) = 1 d BE = 1 2 6 + 1 2 (13 9) = 5 This is only the first step! 32 / 41

Neighbor joining algorithm: Correctness Theorem: If D is an additive matrix, neighbor joining algorithm correctly constructs the corresponding additive tree. Proof (sketch): By contradiction. Assume i and j with minimum D ij = d ij u(c i ) u(c j ) are not neighbors in the additive tree. Show that there are two neighbors m and n with D mn < D ij (see Durbin et al. Biological Sequence Analysis, pp. 190-191 for details). Then the theorem follows by induction. 33 / 41

Outline Distance-based clustering, UPGMA Neighbor joining Study group assignments About the exam 34 / 41

Study Group 1: Those who did not get any material at lecture Read pages 179 181 from Sung: Algorithms in Bioinformatics, CRC Press, 2010 The 4-point condition for additive trees. Copies distributed at lecture. At study group, explain the proof of Theorem 7.1 (Buneman s 4-point condition). The last part of the proof is very condensed: Why does the path length between a and b in T equal M ab + M bc M cd? (recall that T and T are additive trees) The 4-point condition can be applied because Mad + M bc M ac + M bd. Why is this true? (Recall how c and d were chosen) 35 / 41

Study Group 2: Random allocation at lecture Read pages 184 187 from Sung: Algorithms in Bioinformatics, CRC Press, 2010 (Especially Lemma 7.13). Correctness of UPGMA algorithm Copies distributed at lecture. At study group, summarize the proof for the correctness of UPGMA. 36 / 41

Study Group 3: Random allocation at lecture Read pages 190 191 from Durbin et al.: Biological Sequence Analysis, Cambridge University Press, 1998. Correctness of neighbor joining. Note that their notation of Dij equals our d ij u(c i ) u(c j ). Copies distributed at lecture. At study group, summarize the proof for correctness of neighbor joining. 37 / 41

Outline Distance-based clustering, UPGMA Neighbor joining Study group assignments About the exam 38 / 41

Practicalities The course exam is on Wed 16.10. at 16:00 in hall A111 2.5 hours time You can leave at earliest half an hour after the start of the exam The first separate exam is on Tue 26.11. at 16:00 in hall B123 This can also be taken as renewal exam where points from exercises are still valid! 3.5 hours time this will probably be graded in a month s time No own papers. You will need student id card (or some other proof of identity)! You can answer in English or Finnish. 39 / 41

What to study for the exam? Material covered at the lectures! Take a look at some subjects studied in the study groups. If there are questions regarding subjects in the study groups, you will have a choice so that you can answer to a question about a subject you have studied yourself. Example: Choose one of the (non-trivial) problems studied during the course (in study groups, lectures, or/and exercises) not related to the previous assignments above. Define the problem (input, output), explain how the problem is motivated by molecular biology, and describe an algorithm for the problem either simulating an example or by giving its pseudocode. 40 / 41

What kind of questions? In course exam four questions, some might include subquestions (i.e. several questions that all require short answers) In separate exams five questions. Short answers: Example: Explain in one or two sentences what is the shortest superstring problem. Essay type questions: Example: Define the Motif Finding and Median String problems and explain why they are actually the same problem. Describe briefly the idea of the branch-and-bound solution for solving the problem. Simulate an algorithm on a given input. Solve an exercise. 41 / 41