HPC methods for hidden Markov models (HMMs) in population genetics


1 HPC methods for hidden Markov models (HMMs) in population genetics
Peter Kecskemethy
Supervised by: Chris Holmes
Department of Statistics, University of Oxford
February 20, 2013

2 Outline
- Background to increasing data complexity in modern genomics
- Challenges for statistical models used in medical genomics: biology and dependence structures in the data
- Accurate models and computable models: hidden Markov models (HMMs)
- Efficient algorithms via dynamic programming
- HMMs in genetics: the Li and Stephens model
- Applications: natural selection and chromosome painting
- Trivial parallel HMM methods
- Block-parallel HMM methods

3 The expansion of genetic data
- In 2000 the first draft of the Human Genome was reported. It took 10 years to complete and cost approximately $3 billion, roughly $1 per base.
- In 2012 the UK government announced plans to sequence 100,000 genomes. The cost is in the region of $8k per genome (but I hope they're getting a better deal!), and sequencing a genome takes around 2 days.
- Alongside this, UK Biobank is storing a detailed record of molecular features in blood, together with outcome data (phenotypes), on 500,000 individuals.
- Within 3 to 5 years it will be routine in the UK to have all cancers sequenced, as well as the patient's DNA.

4 Impact on Statistics
- The developments in generating genetic data will have a huge impact on statistics and machine learning.
- We will require new methods that can scale to massive ("Big") data.
- To do this we will also need to exploit advances in computer hardware, allowing us to develop increasingly richer classes of models, where the possibilities and limitations of the hardware structure must be considered at the design stage of method development, e.g. MapReduce, GPGPUs, ...

5 Genetics - dependence structures
- The genome exhibits complex dependence structures, both along a genome within an individual (or cancer cell) and across genomes in populations of individuals (or cancer cells in a tumour).
- These dependence structures are a product of the interplay between cell-division processes.
- Mutations introduce new variation into a population of individuals (i.e. a pool of genes).
- Recombination shuffles the genomes between generations: each recombination event splits parental chromosomes and combines them to produce child chromosomes, introducing independence between positions along the genome.

6 Modelling dependence structures - the coalescent
Figure: Coalescent with recombination (McVean et al., Genetics, 2001).
Formally modelling population and sequential dependence requires graphical models (such as the ancestral recombination graph under the coalescent approach) whose structures make exact computation intractable. Hence we require simplifying (approximating) models that capture the major sources of dependence while still allowing computation.

7 Modelling dependence structures - Markov Models
Perhaps the most important simplifying structure in statistics is the notion of Markov conditional independence: on a set of random variables $S = \{S_1, \dots, S_N\}$ we define a joint probability model $\Pr(S)$ that factorises as
$$\Pr(S_i \mid S_1, \dots, S_{i-1}, S_{i+1}, \dots, S_N) = \Pr(S_i \mid S_j,\ j \in n(i))$$
where $n(i)$ indexes the Markov neighbourhood of $i$, so that $S_i$ is conditionally independent of all variables outside this neighbourhood.

8 Methods - Hidden Markov Models (HMMs)
HMMs are arguably the most widely used probability model in bioinformatics, where the hidden states refer to classifications of loci such as {coding, non-coding}, or {duplication, deletion} events (in cancer), etc. HMMs are defined by:
- a set of unobserved hidden states $S_t \in \{1, \dots, N\}$ and Markov transition probabilities $\Pr(S_{t+1} \mid S_t)$ that define the rules for state transitions
- possible emissions or observations $Y_t$ and emission probabilities (likelihoods) $\Pr(Y_t \mid S_t)$ for each state
- an initial distribution $\pi = \Pr(S_1)$
The state sequence $S$ then forms a Markov chain: the future $S_{t+1}, \dots, S_T$ does not depend on the past $S_1, \dots, S_{t-1}$, given the present $S_t$.

9 Methods - Hidden Markov Models (HMMs)
The Markov property is key to the success of HMMs. Dependencies are represented as edges, and conditional independences as missing edges, in the graph representation.
Figure: HMM depicted as a directed graphical model, with hidden states $s_1, \dots, s_T$ and observations $y_1, \dots, y_T$.
The joint distribution for HMMs is written as:
$$\Pr(y_1, \dots, y_T, s_1, \dots, s_T) = \Pr(s_1)\Pr(y_1 \mid s_1) \prod_{t=2}^{T} \Pr(y_t \mid s_t)\Pr(s_t \mid s_{t-1})$$
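As a concrete illustration of this factorisation, the following toy sketch (my own, not from the talk; the 2-state parameter values are purely illustrative) evaluates the joint log-probability of a given state/observation path:

```python
import numpy as np

# Toy 2-state, 3-symbol HMM; all parameter values are illustrative only.
pi = np.array([0.6, 0.4])            # initial distribution Pr(s_1)
A = np.array([[0.9, 0.1],            # transitions, A[i, j] = Pr(s_{t+1}=j | s_t=i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],       # emissions, E[i, k] = Pr(y_t=k | s_t=i)
              [0.1, 0.3, 0.6]])

def joint_log_prob(states, obs):
    """log Pr(y_{1:T}, s_{1:T}) via the HMM factorisation on the slide."""
    lp = np.log(pi[states[0]]) + np.log(E[states[0], obs[0]])
    for t in range(1, len(obs)):
        lp += np.log(A[states[t - 1], states[t]]) + np.log(E[states[t], obs[t]])
    return lp

print(joint_log_prob(states=[0, 0, 1], obs=[0, 1, 2]))
```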

10 Methods - Efficient HMM Algorithms with Dynamic Programming
There are three basic problems that can be solved efficiently with HMMs:
- How do we compute the probability of an observation sequence $Y = \{Y_1, \dots, Y_T\}$ given a parameterised HMM?
- How do we find the optimal state sequence corresponding to an observation sequence given a parameterised HMM?
- How do we estimate the model parameters?
The Markov structure of HMMs allows for dynamic programming:
- The Forward algorithm computes the probability of an observation sequence.
- Solutions to the second problem depend on the definition of optimality: the Viterbi algorithm finds the most probable (MAP) state sequence, maximising $\hat{s} = \arg\max_s \Pr(s_{1:T} \mid y_{1:T})$ (a code sketch follows this list), while the Forward-Backward algorithm computes the posterior marginal probabilities $\Pr(s_t \mid y_1, \dots, y_T)$ for each state at every $t$.
- All three algorithms have computational cost $O(N^2 T)$, i.e. linear in the sequence length $T$.
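A hedged sketch of the Viterbi recursion in log space (my own illustration, using the same assumed parameter conventions pi, A, E as the toy example above):

```python
import numpy as np

def viterbi(pi, A, E, obs):
    """MAP state path argmax_s Pr(s_{1:T} | y_{1:T}) for a discrete HMM.

    pi: (N,) initial distribution; A: (N, N) transitions; E: (N, M) emissions.
    Works in log space to avoid underflow on long sequences.
    """
    T, N = len(obs), len(pi)
    log_delta = np.log(pi) + np.log(E[:, obs[0]])   # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)              # backpointers for path recovery
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(A)     # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        log_delta = scores[back[t], np.arange(N)] + np.log(E[:, obs[t]])
    # Trace back the most probable path from the best final state.
    path = [int(np.argmax(log_delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```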

11 Applying HMMs in genetics - the Li and Stephens model (LSM)
Figure: The imperfect-mosaic modelling (Li and Stephens, 2003).
The Li and Stephens Model has been one of the most widely used models in population genetics since its development. It:
- is an HMM-based approximation to the coalescent with recombination
- models the complex correlation structure between genetic loci (linkage disequilibrium) by treating each genome as an imperfect mosaic made of the other genomes
- defines a joint model over a collection of sequences as a Product of Approximate Conditionals (PAC) likelihood

12 Using LSM to detect signals of Natural Selection
Figure: Effect of natural selection on haplotypes - LCT (2q21.3), HapMap data.

13 Application of the LSM - Chromosome Painting
Figure: Chromosome painting and derivation of coancestry (Lawson et al.).
The GPU-LSM is currently being applied to chromosome painting, which:
- is a method of relating stretches of DNA sequence to one another
- is a crucial step in producing coancestry matrices when inferring population structure from dense haplotype data

14 Computation details for HMM algorithms
Figure: A single computation step of HMM algorithms, shown on the trellis of states $s_t \in \{1, 2, 3\}$ against observations $y_1, y_2, y_3$.
Each step in the forward recursion fills in one cell of a dynamic programming table:
$$\alpha(s_t) = \Pr(y_t \mid s_t) \sum_{s_{t-1} \in S} \Pr(s_t \mid s_{t-1})\, \alpha(s_{t-1})$$
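A minimal, unscaled numpy sketch of this recursion (my own illustration; real implementations rescale each step or work in log space to avoid underflow):

```python
import numpy as np

def forward(pi, A, E, obs):
    """Forward algorithm: returns Pr(y_{1:T}) and the table of alpha values.

    pi: (N,) initial distribution; A: (N, N) transitions, A[i, j] = Pr(s_t=j | s_{t-1}=i);
    E: (N, M) emissions, E[i, k] = Pr(y=k | s=i).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * E[:, obs[0]]                      # base case: Pr(y_1, s_1)
    for t in range(1, T):
        # alpha[t, j] = Pr(y_t | s_t=j) * sum_i Pr(s_t=j | s_{t-1}=i) * alpha[t-1, i]
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha[-1].sum(), alpha
```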

15 Trivial HMM Parallelisation
The parallelisation of HMM algorithms is straightforward (in theory) over multivariate emissions (likelihoods) and over the state space.
Figure: Trellis diagrams illustrating that the calculations over the observations, and the calculations of $\alpha(s_2 = 1)$, $\alpha(s_2 = 2)$ and $\alpha(s_2 = 3)$, can be performed in parallel.

16 Trivial HMM Parallelisation
- Calculations corresponding to different observation sequences are trivially parallelisable, and are perfectly suited even for distributed computation.
- The standard HMM algorithms all repeat the same operations for each state $s_t$. Moreover, at each timepoint/position the calculations of the $\alpha(s_t)$ (or $\beta$, etc.) values for the different states are independent of one another, and hence the calculations are suitable for parallelisation on GPUs.
- Such computations are known as embarrassingly parallel or trivially parallel: the algorithm, and the number of compute operations, remain the same; you simply exploit the redundancy of the loops.
- Theoretically, the above parallelisations can reduce the overall runtime of the HMM algorithms for $K$ multivariate observations, $y_i = \{y_{i1}, \dots, y_{iK}\}$, from $O(KTN^2)$ to $O(TN)$, while the number of compute operations remains $O(KTN^2)$.
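To make the state- and observation-level parallelism concrete, here is a sketch (mine, not from the talk) in which numpy's batched matrix operations stand in for the per-state and per-sequence threads a GPU would launch:

```python
import numpy as np

def batched_forward_step(alpha_prev, A, E, obs_t):
    """One forward step for K sequences at once.

    alpha_prev: (K, N) alpha values for each of K sequences at time t-1;
    A: (N, N) transitions; E: (N, M) emissions; obs_t: (K,) symbol observed
    by each sequence at time t. Every (sequence, state) cell is independent,
    so the whole update maps onto parallel hardware directly.
    """
    # (K, N) @ (N, N) -> (K, N): the sum over s_{t-1}, for all K sequences at once
    pred = alpha_prev @ A
    # elementwise emission weighting, one row of E per sequence's observed symbol
    return pred * E[:, obs_t].T

# Illustrative shapes: K=4 sequences, N=3 states, M=2 symbols.
K, N, M = 4, 3, 2
rng = np.random.default_rng(0)
alpha = rng.random((K, N))
A = np.full((N, N), 1.0 / N)
E = np.full((N, M), 1.0 / M)
print(batched_forward_step(alpha, A, E, rng.integers(0, M, size=K)).shape)  # (4, 3)
```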

17 GPU-LSM
Parallelisation of the LSM is less trivial than for ordinary HMMs:
- The simplifications that make the LSM so efficient actually cause complications for parallel programming. The summations can be performed as a parallel reduction, but they cannot be hidden in any existing loop and must be run separately; this increases the cost of parallelisation by a log N factor, to $O(KTN^2 \log N)$.
- The LS model is based on a non-homogeneous HMM, which allows the transition probabilities to differ between positions. This generalisation requires storing and loading more data than in the case of ordinary HMMs.
- The datasets are generally big, so achieving memory efficiency is not straightforward.
Despite these complications, our present implementation of the Viterbi algorithm under the LSM achieves acceleration compared to optimised sequential C code, reducing days of runtime to hours, which is crucial for model development.
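For reference, the log N factor comes from the depth of a pairwise tree reduction. A toy sketch of that pattern (my own, not the GPU-LSM code):

```python
import numpy as np

def tree_reduce_sum(values):
    """Sum values with a binary-tree reduction.

    Each while-iteration below is one parallel step: on a GPU all the
    pairwise additions within it run concurrently, so N values need only
    ceil(log2(N)) steps rather than N-1 sequential additions.
    """
    v = np.asarray(values, dtype=float).copy()
    while len(v) > 1:
        if len(v) % 2:                  # odd length: carry the last element over
            v = np.append(v[:-1:2] + v[1::2], v[-1])
        else:
            v = v[::2] + v[1::2]
    return v[0]

print(tree_reduce_sum(np.arange(8)))  # 28.0
```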

18 HMM Parallelisation with Sequence Partitioning
Genetic datasets are generally large, and the length of the sequences is much greater than the size of the state space (T ≫ N). The natural question is whether it is possible to design new parallel HMM algorithms (rather than merely parallelising existing algorithms). We have been investigating GPU algorithms exploiting parallel computation along the sequence.
The algorithm works by partitioning the sequence into blocks (a code sketch follows this list):
- Assume the sequence is partitioned into blocks $b \in \{1, \dots, B\}$, each of length $T_b$.
- The final values of the preceding block (e.g. the $\alpha(s_{kT_b})$ values at the block boundaries) are not available at the beginning of the computation, hence we run the algorithm N times within each block, each time conditioning on a different starting state. Naturally, these sets of N conditional runs may also be run in parallel.
- When the computation is done for all blocks, they can be merged: either sequentially, updating each conditional run with its corresponding starting value, or in a parallel-reduction fashion, merging block pairs according to a binary tree structure.
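Under stated assumptions — a discrete, unscaled HMM in which the N conditional forward runs of a block are packed into an N × N operator — a minimal sketch of this decomposition (my own illustration, not the talk's GPU code) might look like:

```python
import numpy as np

def block_operator(A, E, obs_block):
    """N x N operator for one block: M[i, j] = Pr(block's observations, end in j | start in i).

    Computing M amounts to running the forward recursion N times, once from
    each possible starting state -- exactly the conditional runs described above.
    """
    M = np.eye(A.shape[0])
    for y in obs_block:
        M = M @ (A * E[:, y])        # broadcasting: column j scaled by Pr(y | s_t = j)
    return M

def block_parallel_forward(pi, A, E, obs, T_b):
    """Forward probability Pr(y_{1:T}) via per-block operators."""
    blocks = [obs[i:i + T_b] for i in range(0, len(obs), T_b)]
    # The first block's operator omits y_1, whose emission is folded into alpha below.
    ops = [block_operator(A, E, blocks[0][1:])] + \
          [block_operator(A, E, b) for b in blocks[1:]]  # independent: parallelisable
    alpha = pi * E[:, obs[0]]
    for M in ops:                                        # serial merge
        alpha = alpha @ M
    return alpha.sum()
```

Replacing the serial merge loop with a pairwise tree reduction over `ops` (matrix products merged according to a binary tree) gives the parallel-merge variant shown on the next slide.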

19 Block-parallel fwd algorithm with serial and parallel merge
Figure: Representation of the block-parallel forward algorithm with serial and parallel merge (Nielsen & Sand, 2011).

20 HMM Parallelisation with Sequence Partitioning
Figure: Parallelisation of the Viterbi algorithm with simple observation partitioning, animated step by step over slides 20-25 on a trellis of 2 states over observations $y_1, \dots, y_9$. The number of blocks is B = 2 and the length of each block is $T_b$ = 4.


26 Potential speed up
In theory, with at least P = T/2 processors at hand, it is possible to achieve a runtime of $O(N^3 \log T)$, in comparison with the traditional $O(N^2 T)$, leading to a runtime ratio of
$$R = \frac{N \log T}{T},$$
which is hugely significant when T ≫ N. The number of compute operations is $O(N^3 T)$, so the algorithm involves a factor-of-N overhead in computation relative to the traditional HMM algorithms.
In practice, when P ≪ T we can implement serial merging with B = P blocks, which may lead to some loss of efficiency, but the runtime remains close to linear in the block size $T_b \approx T/B$.
If we had unlimited computational power (some $KTN^2/2$ processing units), it would be possible to combine all the parallelisation approaches to achieve $O(N \log T)$ runtime for $O(KTN^3)$ computation.
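As an illustrative calculation (my own numbers, not from the talk), plugging representative values into the ratio above shows the scale of the potential gain; the base-2 log matches the binary-tree merge:

```python
from math import log2

# Hypothetical problem size: N states, T sites along a chromosome.
N, T = 100, 1_000_000
R = N * log2(T) / T          # block-parallel runtime / traditional runtime
print(f"runtime ratio R = {R:.4f}, i.e. a ~{1 / R:.0f}x potential speed-up")
# ~0.002, i.e. roughly a 500x speed-up when enough processors are available
```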

27 Future work
- We are currently characterising the theoretical applicability of all the possible parallel-algorithm approaches for every combination of values in the (K, N, T) space.
- We aim to include communication costs in our theoretical evaluations.
- We are in the process of applying parallel inference algorithms, such as sequential Monte Carlo samplers, to HMM parameter learning.

28 Conclusions
- Medical genetics and genomics will produce vast data sets over the next few years.
- We need statistical methods that can scale to handle them.
- To do so, exploiting parallel computation at the algorithm design stage will be key, both for model development and for model fitting.
- We believe that parallel algorithms can bring some methods from the overly computation-intensive zone into the practically applicable zone.
