HPC methods for hidden Markov models (HMMs) in population genetics


1 HPC methods for hidden Markov models (HMMs) in population genetics
Peter Kecskemethy
Supervised by: Chris Holmes
Department of Statistics, University of Oxford
February 20, 2013

2 Outline
- Background to increasing data complexity in modern genomics
- Challenges for statistical models used in medical genomics: biology and dependence structures in the data
- Accurate models and computable models: hidden Markov models (HMMs)
- Efficient algorithms via dynamic programming
- HMMs in genetics: the Li and Stephens model
- Applications: natural selection and chromosome painting
- Trivial parallel HMM methods
- Block-parallel HMM methods

3 The expansion of genetic data
- In 2000 the first draft of the Human Genome was reported. It took 10 years to complete and cost approximately $3 billion, roughly $1 per base.
- In 2012 the UK government announced plans to sequence 100,000 genomes. The cost is in the region of $8k per genome (but I hope they're getting a better deal!), and sequencing a genome takes around 2 days.
- Alongside this, UK Biobank is storing a detailed record of molecular features in blood, together with outcome data (phenotypes), on 500,000 individuals.
- Within 3 to 5 years it will be routine in the UK to have all cancers sequenced, as well as the patient's DNA.

4 Impact on Statistics
- The developments in generating genetic data will have a huge impact on statistics and machine learning.
- We will require new methods that can scale to massive ("Big") data.
- To do this we will also need to exploit advances in computer hardware, allowing us to develop increasingly richer classes of models, where the possibilities and limitations of the hardware structure must be considered at the design stage of method development, e.g. MapReduce, GPGPUs, ...

5 Genetics - dependence structures
- The genome exhibits complex dependence structures, both along a genome within an individual (or cancer cell) and across genomes in populations of individuals (or cancer cells in a tumour).
- These dependence structures are a product of the interplay between cell-division processes.
- Mutations introduce new variation into a population of individuals (i.e. a pool of genes).
- Recombination shuffles the genomes between generations: each recombination event splits parental chromosomes and combines them to produce child chromosomes, introducing independence between positions along the genome.

6 Modelling dependence structures - the coalescent
Figure: Coalescent with recombination (McVean et al., Genetics, 2001).
Formally modelling population and sequential dependence requires graphical models (such as the ancestral recombination graph under the coalescent approach) whose structures make exact computation intractable. Hence we require simplifying (approximating) models that capture the major sources of dependence while still allowing computation.

7 Modelling dependence structures - Markov Models
Perhaps the most important simplifying structure in statistics is the notion of Markov conditional independence: on a set of random variables $S = \{S_1, \dots, S_N\}$ we define a joint probability model $\Pr(S)$ that factorises as
$$\Pr(S_i \mid S_1, \dots, S_{i-1}, S_{i+1}, \dots, S_N) = \Pr(S_i \mid S_j,\ j \in n(i))$$
where $n(i)$ indexes the Markov neighbourhood of $i$, so that $S_i$ is conditionally independent of all variables outside this neighbourhood.

8 Methods - Hidden Markov Models (HMMs)
HMMs are arguably the most widely used probability model in bioinformatics, where the hidden states refer to classifications of loci such as {coding, non-coding}, or {duplication, deletion} events (in cancer), etc. HMMs are defined by:
- a set of unobserved hidden states $S_t \in \{1, \dots, N\}$ and Markov transition probabilities $\Pr(S_{t+1} \mid S_t)$ that define the rules for state transitions
- possible emissions or observations $Y_t$ and emission probabilities (likelihoods) $\Pr(Y_t \mid S_t)$ for each state
- an initial distribution $\pi = \Pr(S_1)$
The state sequence $S$ then forms a Markov chain: the future $S_{t+1}, \dots, S_T$ does not depend on the past $S_1, \dots, S_{t-1}$, given the present $S_t$.

9 Methods - Hidden Markov Models (HMMs)
The Markov property is key to the success of HMMs. Dependencies are represented as edges, and conditional independences as missing edges, in the graph representation.
Figure: HMM depicted as a directed graphical model, with hidden states $s_1, \dots, s_T$ and observations $y_1, \dots, y_T$.
The joint distribution for HMMs is written as:
$$\Pr(y_1, \dots, y_T, s_1, \dots, s_T) = \Pr(s_1)\Pr(y_1 \mid s_1) \prod_{t=2}^{T} \Pr(y_t \mid s_t)\Pr(s_t \mid s_{t-1})$$
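As a concrete illustration of this factorisation, the following toy sketch (my own, not from the talk; the 2-state parameter values are purely illustrative) evaluates the joint log-probability of a given state/observation path:

```python
import numpy as np

# Toy 2-state, 3-symbol HMM; all parameter values are illustrative only.
pi = np.array([0.6, 0.4])            # initial distribution Pr(s_1)
A = np.array([[0.9, 0.1],            # transitions, A[i, j] = Pr(s_{t+1}=j | s_t=i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],       # emissions, E[i, k] = Pr(y_t=k | s_t=i)
              [0.1, 0.3, 0.6]])

def joint_log_prob(states, obs):
    """log Pr(y_{1:T}, s_{1:T}) via the HMM factorisation on the slide."""
    lp = np.log(pi[states[0]]) + np.log(E[states[0], obs[0]])
    for t in range(1, len(obs)):
        lp += np.log(A[states[t - 1], states[t]]) + np.log(E[states[t], obs[t]])
    return lp

print(joint_log_prob(states=[0, 0, 1], obs=[0, 1, 2]))
```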

10 Methods - Efficient HMM Algorithms with Dynamic Programming
There are three basic problems that can be solved efficiently with HMMs:
- How do we compute the probability of an observation sequence $Y = \{Y_1, \dots, Y_T\}$ given a parameterised HMM?
- How do we find the optimal state sequence corresponding to an observation sequence given a parameterised HMM?
- How do we estimate the model parameters?
The Markov structure of HMMs allows for dynamic programming:
- The Forward algorithm computes the probability of an observation sequence.
- Solutions to the second problem depend on the definition of optimality: the Viterbi algorithm finds the most probable (MAP) state sequence, maximising $\hat{s} = \arg\max_s \Pr(s_{1:T} \mid y_{1:T})$ (a code sketch follows this list), while the Forward-Backward algorithm computes the posterior marginal probabilities $\Pr(s_t \mid y_1, \dots, y_T)$ for each state at every $t$.
- All three algorithms have computational cost $O(N^2 T)$, i.e. linear in the sequence length $T$.
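A hedged sketch of the Viterbi recursion in log space (my own illustration, using the same assumed parameter conventions pi, A, E as the toy example above):

```python
import numpy as np

def viterbi(pi, A, E, obs):
    """MAP state path argmax_s Pr(s_{1:T} | y_{1:T}) for a discrete HMM.

    pi: (N,) initial distribution; A: (N, N) transitions; E: (N, M) emissions.
    Works in log space to avoid underflow on long sequences.
    """
    T, N = len(obs), len(pi)
    log_delta = np.log(pi) + np.log(E[:, obs[0]])   # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)              # backpointers for path recovery
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(A)     # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        log_delta = scores[back[t], np.arange(N)] + np.log(E[:, obs[t]])
    # Trace back the most probable path from the best final state.
    path = [int(np.argmax(log_delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```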

11 Applying HMMs in genetics - the Li and Stephens model (LSM)
Figure: The imperfect-mosaic modelling (Li and Stephens, 2003).
The Li and Stephens Model has been one of the most widely used models in population genetics since its development. It:
- is an HMM-based approximation to the coalescent with recombination
- models the complex correlation structure between genetic loci (linkage disequilibrium) by treating each genome as an imperfect mosaic made of the other genomes
- defines a joint model over a collection of sequences as a Product of Approximate Conditionals (PAC) likelihood

12 Using LSM to detect signals of Natural Selection
Figure: Effect of natural selection on haplotypes - LCT (2q21.3), HapMap data.

13 Application of the LSM - Chromosome Painting
Figure: Chromosome painting and derivation of coancestry (Lawson et al.).
The GPU-LSM is currently being applied to chromosome painting, which:
- is a method of relating stretches of DNA sequence to one another
- is a crucial step in producing coancestry matrices when inferring population structure from dense haplotype data

14 Computation details for HMM algorithms
Figure: A single computation step of HMM algorithms, shown on the trellis of states $s_t \in \{1, 2, 3\}$ against observations $y_1, y_2, y_3$.
Each step in the forward recursion fills in one cell of a dynamic programming table:
$$\alpha(s_t) = \Pr(y_t \mid s_t) \sum_{s_{t-1} \in S} \Pr(s_t \mid s_{t-1})\, \alpha(s_{t-1})$$
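A minimal, unscaled numpy sketch of this recursion (my own illustration; real implementations rescale each step or work in log space to avoid underflow):

```python
import numpy as np

def forward(pi, A, E, obs):
    """Forward algorithm: returns Pr(y_{1:T}) and the table of alpha values.

    pi: (N,) initial distribution; A: (N, N) transitions, A[i, j] = Pr(s_t=j | s_{t-1}=i);
    E: (N, M) emissions, E[i, k] = Pr(y=k | s=i).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * E[:, obs[0]]                      # base case: Pr(y_1, s_1)
    for t in range(1, T):
        # alpha[t, j] = Pr(y_t | s_t=j) * sum_i Pr(s_t=j | s_{t-1}=i) * alpha[t-1, i]
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha[-1].sum(), alpha
```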

15 Trivial HMM Parallelisation
The parallelisation of HMM algorithms is straightforward (in theory) over multivariate emissions (likelihoods) and over the state space.
Figure: Trellis diagrams illustrating that the calculations over the observations, and the calculations of $\alpha(s_2 = 1)$, $\alpha(s_2 = 2)$ and $\alpha(s_2 = 3)$, can be performed in parallel.

16 Trivial HMM Parallelisation
- Calculations corresponding to different observation sequences are trivially parallelisable, and are perfectly suited even for distributed computation.
- The standard HMM algorithms all repeat the same operations for each state $s_t$. Moreover, at each timepoint/position the calculations of the $\alpha(s_t)$ (or $\beta$, etc.) values for the different states are independent of one another, and hence the calculations are suitable for parallelisation on GPUs.
- Such computations are known as embarrassingly parallel or trivially parallel: the algorithm, and the number of compute operations, remain the same; you simply exploit the redundancy of the loops.
- Theoretically, the above parallelisations can reduce the overall runtime of the HMM algorithms for $K$ multivariate observations, $y_i = \{y_{i1}, \dots, y_{iK}\}$, from $O(KTN^2)$ to $O(TN)$, while the number of compute operations remains $O(KTN^2)$.
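To make the state- and observation-level parallelism concrete, here is a sketch (mine, not from the talk) in which numpy's batched matrix operations stand in for the per-state and per-sequence threads a GPU would launch:

```python
import numpy as np

def batched_forward_step(alpha_prev, A, E, obs_t):
    """One forward step for K sequences at once.

    alpha_prev: (K, N) alpha values for each of K sequences at time t-1;
    A: (N, N) transitions; E: (N, M) emissions; obs_t: (K,) symbol observed
    by each sequence at time t. Every (sequence, state) cell is independent,
    so the whole update maps onto parallel hardware directly.
    """
    # (K, N) @ (N, N) -> (K, N): the sum over s_{t-1}, for all K sequences at once
    pred = alpha_prev @ A
    # elementwise emission weighting, one row of E per sequence's observed symbol
    return pred * E[:, obs_t].T

# Illustrative shapes: K=4 sequences, N=3 states, M=2 symbols.
K, N, M = 4, 3, 2
rng = np.random.default_rng(0)
alpha = rng.random((K, N))
A = np.full((N, N), 1.0 / N)
E = np.full((N, M), 1.0 / M)
print(batched_forward_step(alpha, A, E, rng.integers(0, M, size=K)).shape)  # (4, 3)
```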

17 GPU-LSM
Parallelisation of the LSM is less trivial than for ordinary HMMs:
- The simplifications that make the LSM so efficient actually cause complications for parallel programming. The summations can be performed as a parallel reduction, but they cannot be hidden in any existing loop and must be run separately; this increases the cost of parallelisation by a log N factor, to $O(KTN^2 \log N)$.
- The LS model is based on a non-homogeneous HMM, which allows the transition probabilities to differ between positions. This generalisation requires storing and loading more data than in the case of ordinary HMMs.
- The datasets are generally big, so achieving memory efficiency is not straightforward.
Despite these complications, our present implementation of the Viterbi algorithm under the LSM achieves acceleration compared to optimised sequential C code, reducing days of runtime to hours, which is crucial for model development.
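For reference, the log N factor comes from the depth of a pairwise tree reduction. A toy sketch of that pattern (my own, not the GPU-LSM code):

```python
import numpy as np

def tree_reduce_sum(values):
    """Sum values with a binary-tree reduction.

    Each while-iteration below is one parallel step: on a GPU all the
    pairwise additions within it run concurrently, so N values need only
    ceil(log2(N)) steps rather than N-1 sequential additions.
    """
    v = np.asarray(values, dtype=float).copy()
    while len(v) > 1:
        if len(v) % 2:                  # odd length: carry the last element over
            v = np.append(v[:-1:2] + v[1::2], v[-1])
        else:
            v = v[::2] + v[1::2]
    return v[0]

print(tree_reduce_sum(np.arange(8)))  # 28.0
```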

18 HMM Parallelisation with Sequence Partitioning
Genetic datasets are generally large, and the length of the sequences is much greater than the size of the state space (T ≫ N). The natural question is whether it is possible to design new parallel HMM algorithms (rather than merely parallelising existing algorithms). We have been investigating GPU algorithms exploiting parallel computation along the sequence.
The algorithm works by partitioning the sequence into blocks (a code sketch follows this list):
- Assume the sequence is partitioned into blocks $b \in \{1, \dots, B\}$, each of length $T_b$.
- The final values of the preceding block (e.g. the $\alpha(s_{kT_b})$ values at the block boundaries) are not available at the beginning of the computation, hence we run the algorithm N times within each block, each time conditioning on a different starting state. Naturally, these sets of N conditional runs may also be run in parallel.
- When the computation is done for all blocks, they can be merged: either sequentially, updating each conditional run with its corresponding starting value, or in a parallel-reduction fashion, merging block pairs according to a binary tree structure.
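Under stated assumptions — a discrete, unscaled HMM in which the N conditional forward runs of a block are packed into an N × N operator — a minimal sketch of this decomposition (my own illustration, not the talk's GPU code) might look like:

```python
import numpy as np

def block_operator(A, E, obs_block):
    """N x N operator for one block: M[i, j] = Pr(block's observations, end in j | start in i).

    Computing M amounts to running the forward recursion N times, once from
    each possible starting state -- exactly the conditional runs described above.
    """
    M = np.eye(A.shape[0])
    for y in obs_block:
        M = M @ (A * E[:, y])        # broadcasting: column j scaled by Pr(y | s_t = j)
    return M

def block_parallel_forward(pi, A, E, obs, T_b):
    """Forward probability Pr(y_{1:T}) via per-block operators."""
    blocks = [obs[i:i + T_b] for i in range(0, len(obs), T_b)]
    # The first block's operator omits y_1, whose emission is folded into alpha below.
    ops = [block_operator(A, E, blocks[0][1:])] + \
          [block_operator(A, E, b) for b in blocks[1:]]  # independent: parallelisable
    alpha = pi * E[:, obs[0]]
    for M in ops:                                        # serial merge
        alpha = alpha @ M
    return alpha.sum()
```

Replacing the serial merge loop with a pairwise tree reduction over `ops` (matrix products merged according to a binary tree) gives the parallel-merge variant shown on the next slide.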

19 Block-parallel fwd algorithm with serial and parallel merge
Figure: Representation of the block-parallel forward algorithm with serial and parallel merge (Nielsen & Sand, 2011).

20 HMM Parallelisation with Sequence Partitioning
Figure: Parallelisation of the Viterbi algorithm with simple observation partitioning, animated step by step over slides 20-25 on a trellis of 2 states over observations $y_1, \dots, y_9$. The number of blocks is B = 2 and the length of each block is $T_b$ = 4.


26 Potential speed up
In theory, with at least P = T/2 processors at hand, it is possible to achieve a runtime of $O(N^3 \log T)$, in comparison with the traditional $O(N^2 T)$, leading to a runtime ratio of
$$R = \frac{N \log T}{T},$$
which is hugely significant when T ≫ N. The number of compute operations is $O(N^3 T)$, so the algorithm involves a factor-of-N overhead in computation relative to the traditional HMM algorithms.
In practice, when P ≪ T we can implement serial merging with B = P blocks, which may lead to some loss of efficiency, but the runtime remains close to linear in the block size $T_b \approx T/B$.
If we had unlimited computational power (some $KTN^2/2$ processing units), it would be possible to combine all the parallelisation approaches to achieve $O(N \log T)$ runtime for $O(KTN^3)$ computation.
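As an illustrative calculation (my own numbers, not from the talk), plugging representative values into the ratio above shows the scale of the potential gain; the base-2 log matches the binary-tree merge:

```python
from math import log2

# Hypothetical problem size: N states, T sites along a chromosome.
N, T = 100, 1_000_000
R = N * log2(T) / T          # block-parallel runtime / traditional runtime
print(f"runtime ratio R = {R:.4f}, i.e. a ~{1 / R:.0f}x potential speed-up")
# ~0.002, i.e. roughly a 500x speed-up when enough processors are available
```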

27 Future work
- We are currently characterising the theoretical applicability of all the possible parallel-algorithm approaches for every combination of values in the (K, N, T) space.
- We aim to include communication costs in our theoretical evaluations.
- We are in the process of applying parallel inference algorithms, such as sequential Monte Carlo samplers, to HMM parameter learning.

28 Conclusions
- Medical genetics and genomics will produce vast data sets over the next few years.
- We need statistical methods that can scale to handle them.
- To do so, exploiting parallel computation at the algorithm design stage will be key, both for model development and for model fitting.
- We believe that parallel algorithms can bring some methods from the overly computation-intensive zone into the practically applicable zone.
