A space where trees lie?


Susan Holmes
Statistics Department, Stanford, and INRA-Biométrie, Montpellier, France
susan@stat.stanford.edu
http://www-stat.stanford.edu/~susan/

Funded in part by a grant from NSF-DMS. Contains joint work with Karen Vogtmann and Lou Billera.

Do we care about inferences for phylogenetic trees?

Cetaceans: recognising what is being sold as whale meat in Japan? Steve Palumbi, Harvard. Scott Baker, Auckland. Whale www.dna.surveillance. Earth Trust Press Release.

The River without a Paddle?

Human immunodeficiency virus: Phylogeny and the origin of HIV-1

The origin of human immunodeficiency virus type 1 (HIV-1) is controversial. Phylogeny has shown that viruses obtained from the Democratic Republic of Congo in Africa have a quantitatively different phylogenetic tree structure from those sampled in other parts of the world.

Quest for the origin of AIDS

This indicates that the structure of HIV-1 phylogenies is the result of epidemiological processes acting within human populations alone, and is not due to multiple cross-species transmission initiated by oral polio vaccination.

Serial Passage

Conversely, phylogenetic analysis of HIV-1 sequences indicates that group M originated before the vaccination campaign, supporting a model of natural transfer from chimpanzees to humans. If this timescale is correct, then the OPV theory remains a viable hypothesis of HIV-1 origins only if the subtypes of group M differentiated in chimpanzees before their transmission to humans.

Confidence Intervals?

Korber and colleagues extrapolated the timing of the origin of HIV-1 group M back to a single viral ancestor in 1931, give or take about 12 years for 95% confidence limits. Because this calendar of events obviously pre-dated the OPV trials, in the revised version of his book, Hooper suggested that group M first began to diverge in chimpanzees, and that there were then several independent transfers of virus to humans via OPV. In that case, several OPV batches should bear evidence of their production in chimpanzee tissue, yet no such evidence has been found.

Closure: Polio vaccines exonerated. Nature 410, 1035-1036 (2001)

The OPV batch that Hooper considered to be under most suspicion, however, was CHAT 10A-11. An original vial of the batch was found at Britain's National Institute for Biological Standards and Control, and the new tests show that it was prepared from rhesus-macaque cells.

How sure are we of the answers?

Phylogenetic trees and variability.
Aggregating/combining trees.
Stability of sets of trees.
Comparisons of sets of trees of several kinds.
Explanation of one set of trees by another.
Combining trees with other data.
Confidence statements for trees.

A parameter T: a semi-labeled binary tree

[Figure: a binary tree with root 0, inner nodes joined by inner edges, and leaves labeled 1, 3, 2, 4.]

Data $T_1, T_2, \ldots, T_n$: many semi-labeled binary trees

[Figure: six binary trees, each with root 0, inner nodes joined by inner edges, and leaves labeled 1, 3, 2, 4.]

What to do when data are trees?

The data could be considered equivalent to a set of trees.
Building a probability distribution on trees is a complex procedure.
Most biologists agree that the simplest possible probability distribution on tree space, the uniform distribution, is not relevant.
Others: the Yule process; for coalescent trees, Kingman's model.

Another non-standard type of data

Here we give an analog of the non-standard splits-and-trees data. The following example involves data that do not belong to R, commonly called rank data:

(1, 4, 3, 2) (3, 4, 1, 2) (3, 2, 1, 4) ... (4, 2, 1, 3)

We would also like to summarize these. Such data occur in genetics: gene order data are available (for a review see chapter 10 of Pevzner).
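One possible way to start summarizing rank data is to compare the rankings pairwise. The sketch below is only an illustration: the choice of the Kendall tau distance, the function name `kendall_tau_distance`, and the reading of each tuple as an ordering of the items 1 to 4 are assumptions, not something stated on the slide.

```python
from itertools import combinations


def kendall_tau_distance(p, q):
    """Count the pairs of items that the two rankings place in opposite order."""
    pos_p = {item: i for i, item in enumerate(p)}
    pos_q = {item: i for i, item in enumerate(q)}
    return sum(
        1
        for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


# The four rankings that are written out on the slide.
rankings = [(1, 4, 3, 2), (3, 4, 1, 2), (3, 2, 1, 4), (4, 2, 1, 3)]

# Distance for every unordered pair of rankings, keyed by their indices.
pairwise = {
    (i, j): kendall_tau_distance(rankings[i], rankings[j])
    for i, j in combinations(range(len(rankings)), 2)
}
```

A table of pairwise distances like this is the kind of input that distance-based summaries of non-numeric data work from.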


Sufficiency

Confidence Statements

[Figure: regions R 1 to R 6 separated by boundaries, one of them marked as the 9-10 boundary, with points labeled π and π̂ and the labels S and J. From Efron, Halloran, Holmes (1996).]

What is the curvature of the boundary?
How many neighbors does a region have?

Bootstrap Confidence Values

[Figure: primate tree on Macaca mul, Macaca fus, Macaca fas, Macaca syl, Hylobates, Homosapien, Pan, Gorilla, Pongo, Saimiri sc, Tarsius sy and Lemur catt, with bootstrap values on the clades: 100.0, 99.5, 100.0, 79.4, 99.0, 50.2, 100.0, 100.0, 89.0, 100.0.]

Do these values mean anything?

Simple confidence values

Univariate.
Multiple testing.
Composite statements.

In general, the clade frequencies are not sufficient statistics for the data on trees.
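For concreteness, clade frequencies of the kind discussed here can be tabulated as in the following sketch. The representation of a tree as a set of clades (each clade a `frozenset` of leaf labels), the function name `clade_frequencies`, and the toy trees are all hypothetical choices made for illustration.

```python
from collections import Counter


def clade_frequencies(trees):
    """Return, for each clade, the fraction of trees in which it appears."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: count / len(trees) for clade, count in counts.items()}


# Hypothetical trees on leaves 1..4, each described by its non-trivial clades.
tree_a = {frozenset({1, 2}), frozenset({1, 2, 3})}
tree_b = {frozenset({1, 2}), frozenset({3, 4})}
tree_c = {frozenset({1, 3}), frozenset({1, 2, 3})}

sample = [tree_a, tree_a, tree_b, tree_c]
frequencies = clade_frequencies(sample)
```

The table keeps only how often each clade occurs, not which clades occur together in the same tree, which is one way to see why it can lose information about the sample.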

Confidence Statements for trees

Aims

Fill tree space and make meaningful boundaries.
Define distances between trees.
Define neighborhoods, meaningful measures.
Principal directions of variation in tree space, summarizing: structure + noise.
Confidence statements, convex hulls.

Rotation Moves

[Figure: three trees with root 0 and leaves 1, 2, 3, 4, related by rotation moves.]

These are what biologists call NNI (nearest-neighbor interchange) moves.
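A rotation (NNI) move around one inner edge can be sketched as follows. The nested-tuple representation and the function name `nni_neighbors` are assumptions made for this illustration: the inner edge joins the pair `(a, b)` to its sibling subtree `c`, and the move swaps `c` with `a` or with `b`.

```python
def nni_neighbors(a, b, c):
    """Given the arrangement ((a, b), c) around one inner edge,
    return the two alternative arrangements reached by an NNI move."""
    return [((a, c), b), ((b, c), a)]


# Subtrees can be leaves or nested tuples themselves.
current = ((1, 2), 3)
alternatives = nni_neighbors(1, 2, 3)
```

Together with the current arrangement this gives the three topologies that meet along a shared boundary.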

Boundary for trees with 3 leaves

[Figure: four trees with root 0 and leaves 1, 2, 3.]

Extension to 4

The quadrant for one tree

[Figure: the square with corners (0,0), (1,0), (0,1), (1,1) inside the quadrant, with trees on root 0 and leaves 1, 2, 3, 4 drawn at its points; braces mark inner edges of length 1.]


The cube complex

A binary n-tree has the maximal possible number of interior edges, $n-2$. It determines the quadrant of largest possible dimension, which is $(n-2)$-dimensional. The quadrant corresponding to each tree which is not binary appears as a boundary face of the quadrants of at least three binary trees; in particular the origin of each quadrant corresponds to the (unique) tree with no interior edges.

$T_n$ is built by taking one $(n-2)$-dimensional quadrant for each of the $(2n-3)!! = (2n-3)(2n-5)\cdots 5 \cdot 3 \cdot 1$ possible binary trees, and gluing them together along their common faces.
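The count of quadrants grows quickly with n. A minimal sketch of the double factorial (2n - 3)!!, with the hypothetical function name `num_binary_trees`:

```python
def num_binary_trees(n):
    """Return (2n - 3)!!, the number of binary n-trees, for n >= 2."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


counts = {n: num_binary_trees(n) for n in range(2, 8)}
```

For n = 3 and n = 4 this gives the 3 rays and the 15 two-dimensional quadrants that make up the space.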

For n = 3 there are three binary trees, each with 1 interior edge. Each tree thus determines a 1-dimensional quadrant, i.e. a ray from the origin. The three rays are identified at their origins.

[Figure for n = 3.]
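Because the n = 3 space is just three rays glued at the origin, shortest-path distances in it are easy to write down. In this sketch a point is a hypothetical pair `(tree, length)`, where `tree` names one of the three binary trees and `length` is its interior edge length; the function name `t3_distance` is also an assumption.

```python
def t3_distance(p, q):
    """Shortest-path distance between two points of three rays glued at the origin."""
    (tree_p, len_p), (tree_q, len_q) = p, q
    if tree_p == tree_q:
        # Same ray: move along it.
        return abs(len_p - len_q)
    # Different rays: the path must pass through the origin (the star tree).
    return len_p + len_q


same_ray = t3_distance(("A", 1.0), ("A", 0.25))
across_origin = t3_distance(("A", 1.0), ("B", 0.5))
```

Paths between different topologies shrink the interior edge to zero, pass through the star tree, and grow the new edge.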

Three quadrants sharing a ray for n = 4

[Figure: three quadrants glued along a common boundary ray, with trees on root 0 and leaves 1, 2, 3, 4.]

Note that the bottom boundary rays form a copy of $T_3$ embedded in $T_4$. In general, $T_n$ contains many embedded copies of $T_k$ for $k < n$.

CAT(0) space (Gromov)

[Figure: two triangles, each with vertices a, b, c.]

Triangles are thin.
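One standard way to make "triangles are thin" precise is the comparison inequality below. The notation ($\bar{a}, \bar{b}, \bar{c}$ for the Euclidean comparison triangle, $\bar{x}, \bar{y}$ for comparison points) is introduced here for illustration and does not appear on the slide.

```latex
% Geodesic triangle \Delta(a,b,c) in the space (X, d), and a comparison
% triangle \bar\Delta(\bar a, \bar b, \bar c) in the Euclidean plane with
% the same side lengths:
%   d(a,b) = |\bar a - \bar b|, \quad d(b,c) = |\bar b - \bar c|, \quad d(a,c) = |\bar a - \bar c|.
% The space is CAT(0) if, for all points x, y on the sides of \Delta and
% their comparison points \bar x, \bar y on \bar\Delta,
\[
  d(x, y) \;\le\; \lvert \bar{x} - \bar{y} \rvert .
\]
```

In words, no triangle in the space is fatter than the Euclidean triangle with the same side lengths.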

Consequences

Averaging works better than it should (an argument against total evidence computation without decomposing??).
We can build Bayesian priors based on distances.
We can make a useful bootstrap statement.
We can make convex hulls and confidence regions.
We can use Mallows' model.
We know how many neighbors any tree has.

How many neighbors for a given tree? (W.H. Li, 1993)

We know the number of neighbors of each tree.

For a tree with only two inner edges, there is only one way of having two edges small: to be close to the origin (the star tree), which gives 15 neighbors. This same notion of neighborhood, containing 15 different branching orders, applies to all trees on as many leaves as necessary that have two contiguous small edges and all the other inner edges significantly bigger than 0.

This picture of treespace frees us from having to use simulations to find out how many different trees are in a neighborhood of a given radius r around a given tree. All we have to do is check the sets of contiguous edges in the tree smaller than r. Say there is only one set of size k; then the neighborhood will contain $(2k-3)!! = (2k-3)(2k-5)\cdots 3$ different trees. (To match the 15 neighbors above, two contiguous small edges count as a set of size k = 4.) If there are m sets of sizes $(n_1, n_2, \ldots, n_m)$, the counts multiply.

[Figure: a tree on leaves 1 to 11 with three sets of contiguous small edges, contributing 15, 105 and 3 trees.]

In this case the number of trees within r will be $15 \times 105 \times 3 = 4725$; in general:

$(2n_1-3)!!\,(2n_2-3)!!\,(2n_3-3)!! \cdots (2n_m-3)!!$
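The product formula is simple to evaluate. In the sketch below the function names are hypothetical, and the sizes 4, 5 and 3 are inferred from the factors 15, 105 and 3 in the example rather than read off the figure.

```python
from math import prod


def odd_double_factorial(k):
    """Return (2k - 3)!! = (2k - 3)(2k - 5)...3 for k >= 2."""
    return prod(range(3, 2 * k - 2, 2))


def trees_within_radius(sizes):
    """Multiply (2n_i - 3)!! over the sets of contiguous small edges."""
    return prod(odd_double_factorial(n_i) for n_i in sizes)


# Three sets of sizes 4, 5 and 3 give 15 * 105 * 3 trees.
example_count = trees_within_radius([4, 5, 3])
```

No simulation is needed: the count depends only on how the small edges group into contiguous sets.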

A tree near the star tree at the origin will have an exponential number of neighbors. This explosion of the volume of a neighborhood at the origin provides for interesting math problems.

Importance of distance for the bootstrap:

$X$: original data, giving the estimate $\hat{T}$.
Call $X^*$ the bootstrap samples, drawn consistently with the model used for estimating the tree:
Nonparametric multinomial resampling for a parsimony tree.
Seqgen parametric-type resampling with the same parameters for an ML tree.
Bayesian: GAMMA prior on rates, and generation (Yang 2000) of random sequences according to $\hat{T}$.
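The nonparametric multinomial scheme amounts to resampling alignment columns with replacement. The following is a minimal sketch under stated assumptions: the alignment is a hypothetical dictionary of equal-length sequences, the function name `bootstrap_alignment` is invented here, and the tree-estimation step that would follow each resample is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices, each uniformly at random.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Use the same columns for every taxon so that sites stay aligned.
    return {
        taxon: "".join(sequence[c] for c in columns)
        for taxon, sequence in alignment.items()
    }


# Toy alignment, purely illustrative.
alignment = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "TCGAAC"}
rng = random.Random(0)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

Each replicate would then be passed to the same tree-building method that produced the original estimate.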

Bootstrap Theory

$\mathrm{Distribution}\big(d(\hat{T}, T)\big) = \mathrm{Distribution}\big(d(\hat{T}^*, \hat{T})\big)$

Convex Hulls and confidence regions

Perspectives, problems

How tree-like are the data? This translates into how well the data project onto tree space.
Can be solved empirically by making a graph and asking how treelike it is.
Mathematical analysis: hyperbolicity (Moulton et al. 2001).

Tree trajectories

[Figure: a sequence of trees along a trajectory.]

Regression in treespace.
Extension from trees to general graphs (median networks, ...)

How can mathematical statistics help?

Decompositions that are generalisable.
Geometric picture of tree space: a space for comparisons.
Ways of projecting.
Following trees as they change (paths of trees).
Aggregating trees, expectations for various measures.
Neighborhoods (convex hulls of trees)...
Justification of common sense, ground for generalizations.

Proof by direct decomposition

Call $B_{n-1}$ the subgroup of $S_{2n-2}$ that fixes the pairs $\{1,2\}\{3,4\}\cdots\{2n-3,2n-2\}$. Then $M_{n-1} = S_{2n-2}/B_{n-1}$ and

$|M_{n-1}| = \dfrac{(2n-2)!}{2^{n-1}(n-1)!} = (2n-3)!! = (2n-3)(2n-5)\cdots 3 \cdot 1$

This formula for the number of trees was first proved using generating functions by Schröder (1873).
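The identity between the quotient and the double factorial can be checked by splitting the factorial into its even and odd factors; the following short derivation is added as a worked step.

```latex
\begin{align*}
(2n-2)! &= \bigl[2 \cdot 4 \cdot 6 \cdots (2n-2)\bigr]\,\bigl[1 \cdot 3 \cdot 5 \cdots (2n-3)\bigr] \\
        &= 2^{\,n-1}\,(n-1)!\;(2n-3)!! ,
\end{align*}
so that
\[
  \frac{(2n-2)!}{2^{\,n-1}\,(n-1)!} = (2n-3)!! .
\]
% Check for n = 4: 6!/(2^3 \cdot 3!) = 720/48 = 15 = 5 \cdot 3 \cdot 1.
```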

$(S_{2n-2}, B_{n-1})$ form a Gelfand pair (Diaconis and Shahshahani, 1987).

$L(M_{n-1}) = V_1 \oplus V_2 \oplus \cdots \oplus V_\lambda$

A multiplicity-free representation:

$L(M_{n-1}) = \bigoplus_{\lambda \vdash n-1} S^{2\lambda}$

where the direct sum is over all partitions $\lambda$ of $m = n-1$, $2\lambda = (2\lambda_1, 2\lambda_2, \ldots, 2\lambda_k)$, and $S^{2\lambda}$ is the associated irreducible representation of the symmetric group $S_{2m}$.

Just to take the first few:
for $\lambda = (n-1)$, $S^{\lambda}$ are the constants, and this gives the sample size;
for $\lambda = (n-2, 1)$, $S^{\lambda}$ are the number of times each pair appears;
for $\lambda = (n-3, 2)$, $S^{\lambda}$ are the number of times a partition of 4 appears in the tree;
for $\lambda = (n-3, 1, 1)$, $S^{\lambda}$ are the number of times 2 pairs appear simultaneously.

This decomposition is similar to what was done by Diaconis for permutation data.