HPC methods for hidden Markov models (HMMs) in population genetics
1 HPC methods for hidden Markov models (HMMs) in population genetics
Peter Kecskemethy, supervised by Chris Holmes
Department of Statistics, University of Oxford
February 20, 2013
2 Outline
- Background: increasing data complexity in modern genomics
- Challenges for statistical models used in medical genomics
- Biology and dependence structures in the data
- Accurate models vs computable models: Hidden Markov Models (HMMs)
- Efficient algorithms via dynamic programming
- HMMs in genetics: the Li and Stephens model
- Applications: natural selection and chromosome painting
- Trivial parallel HMM methods
- Block-parallel HMM methods
3 The expansion of genetic data
- In 2000 the first draft of the Human Genome was reported: it took 10 years to complete and cost approx. $3 billion (about 1 dollar per base)
- In 2012 the UK government announced plans to sequence 100,000 genomes
- Cost is in the region of $8k per genome (but I hope they're getting a better deal!), and it takes around 2 days to sequence a genome
- Alongside this, UK Biobank is storing a detailed record of molecular features in blood, plus outcome data (phenotypes), on 500,000 individuals
- Within 3 to 5 years it will be routine in the UK to have all cancers sequenced, as well as the patient's DNA
4 Impact on Statistics
- These developments in generating genetic data will have a huge impact on statistics and machine learning
- We will require new methods that can scale to massive ('Big') data
- To do this we will also need to exploit advances in computer hardware, allowing us to develop increasingly rich classes of models
- The possibilities and limitations of hardware structure will need to be considered at the design stage of method development, e.g. MapReduce, GPGPUs, ...
5 Genetics - dependence structures
- The genome exhibits complex dependence structures, both along a genome (within an individual or cancer cell) and across genomes (in populations of individuals, or cancer cells in a tumour)
- These dependence structures are a product of the interplay between cell-division processes
- Mutations introduce new variation into a population of individuals (i.e. a pool of genes)
- Recombination shuffles genomes between generations: each recombination event splits the parental chromosomes and combines them to produce a child chromosome
- Recombination introduces independence between positions along the genome
6 Modelling dependence structures - the coalescent
Figure: Coalescent with recombination (McVean et al., Genetics, 2001)
- Formally modelling population and sequential dependence requires graphical models (such as the ancestral recombination graph under the coalescent approach) whose structure makes exact computation intractable
- Hence we require simplifying (approximating) models that capture the major sources of dependence while allowing for computation
7 Modelling dependence structures - Markov models
Perhaps the most important simplifying structure in statistics is the notion of Markov conditional independence: on a set of random variables S = {S_1, ..., S_N} we define a joint probability model Pr(S) that factorises as

    Pr(S_i | S_1, ..., S_{i-1}, S_{i+1}, ..., S_N) = Pr(S_i | S_j, j ∈ n(i))

where n(i) indexes the Markov neighbourhood of i, such that S_i is conditionally independent of the variables outside this neighbourhood
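This conditional-independence property can be checked numerically. The following sketch (a hypothetical two-state, three-variable chain S1 → S2 → S3; the numbers are illustrative and not from the talk) builds the joint distribution and verifies that Pr(S1 | S2, S3) does not depend on S3, since the Markov neighbourhood of S1 is just {2}:

```python
import numpy as np

# Minimal check of Markov conditional independence on a chain S1 -> S2 -> S3
# (hypothetical 2-state example; numbers are illustrative only).
pi = np.array([0.6, 0.4])                    # Pr(S1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])       # Pr(S_{i+1} | S_i)

# Joint Pr(S1, S2, S3) = Pr(S1) Pr(S2 | S1) Pr(S3 | S2), shape (2, 2, 2)
joint = pi[:, None, None] * A[:, :, None] * A[None, :, :]

# Conditional Pr(S1 | S2, S3): the A[s2, s3] factor cancels, so the result
# cannot depend on S3 -- exactly the Markov neighbourhood statement n(1) = {2}.
cond = joint / joint.sum(axis=0, keepdims=True)
assert np.allclose(cond[:, :, 0], cond[:, :, 1])
print("Pr(S1 | S2, S3) is independent of S3")
```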
8 Methods - Hidden Markov Models (HMMs)
HMMs are arguably the most widely used probability model in bioinformatics, where the hidden states refer to classifications of loci such as {coding, non-coding}, or {duplication, deletion} events (in cancer), etc. HMMs are defined by:
- a number of unobserved hidden states S = {1, ..., N} and Markov transition probabilities Pr(S_{t+1} | S_t) that define rules for state transitions
- possible emissions or observations Y_t and emission probabilities (likelihoods) Pr(Y_t | S_t) for each state
- an initial distribution π = Pr(S_1)
The state sequence S then forms a Markov chain: the future S_{t+1}, ..., S_T does not depend on the past S_1, ..., S_{t-1}, given the present S_t
9 Methods - Hidden Markov Models (HMMs)
The Markov property is key to the success of HMMs. Dependencies are represented as edges, and conditional independences as missing edges, in the graph representation.
Figure: HMM depicted as a directed graphical model: a chain of hidden states s_1, ..., s_T, each emitting an observation y_1, ..., y_T.
The joint distribution for HMMs is written as:

    Pr(y_1, ..., y_T, s_1, ..., s_T) = Pr(s_1) Pr(y_1 | s_1) ∏_{t=2}^{T} Pr(y_t | s_t) Pr(s_t | s_{t-1})
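This factorisation can be evaluated directly for a small model. A minimal sketch, assuming hypothetical toy two-state parameters (the numbers are illustrative, not from the talk):

```python
import numpy as np

# Direct evaluation of the HMM joint Pr(y_{1:T}, s_{1:T}) via the chain-rule
# factorisation, for a toy 2-state / 2-symbol model (illustrative numbers).
pi = np.array([0.5, 0.5])                 # Pr(s_1)
A  = np.array([[0.9, 0.1], [0.1, 0.9]])   # A[i, j] = Pr(s_t = j | s_{t-1} = i)
E  = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[i, k] = Pr(y_t = k | s_t = i)

def joint_prob(s, y):
    # Pr(s_1) Pr(y_1 | s_1) * product over t >= 2 of Pr(s_t | s_{t-1}) Pr(y_t | s_t)
    p = pi[s[0]] * E[s[0], y[0]]
    for t in range(1, len(s)):
        p *= A[s[t - 1], s[t]] * E[s[t], y[t]]
    return p

print(joint_prob([0, 0, 1], [0, 0, 1]))
```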
10 Methods - Efficient HMM algorithms with dynamic programming
There are three basic problems that can be solved efficiently with HMMs:
- how do we compute the probability of an observation sequence Y = {Y_1, ..., Y_T} given a parameterised HMM?
- how do we find the optimal state sequence corresponding to an observation, given a parameterised HMM?
- how do we estimate the model parameters?
The Markov structure of HMMs allows for dynamic programming:
- the Forward algorithm computes the probability of an observation sequence
- solutions to the second problem depend on the definition of optimality: the Viterbi algorithm finds the most probable (MAP) state sequence, maximising ŝ = arg max_s Pr(s_{1:T} | y_{1:T})
- the Forward-Backward algorithm computes the posterior marginal probabilities Pr(s_t | y_1, ..., y_T) for each state at every t
All three algorithms have computation cost O(N²T), so linear in the sequence length T
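The Forward algorithm's O(N²T) dynamic programme fits in a few lines. A minimal sketch, assuming the same kind of toy two-state parameters as before (illustrative numbers, not from the talk):

```python
import numpy as np

# Forward algorithm: Pr(y_{1:T}) in O(N^2 T) via dynamic programming.
# Toy 2-state / 2-symbol parameters (illustrative only).
pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.1, 0.9]])   # A[i, j] = Pr(s_t = j | s_{t-1} = i)
E  = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[i, k] = Pr(y_t = k | s_t = i)

def forward(y):
    alpha = pi * E[:, y[0]]               # alpha(s_1) = Pr(s_1) Pr(y_1 | s_1)
    for t in range(1, len(y)):
        # alpha(s_t) = Pr(y_t | s_t) * sum_{s_{t-1}} Pr(s_t | s_{t-1}) alpha(s_{t-1})
        alpha = E[:, y[t]] * (alpha @ A)
    return alpha.sum()                    # marginalise the final state: Pr(y_{1:T})

y = [0, 0, 1, 1]
print(forward(y))
```

A practical implementation would rescale α at each step (or work in log space) to avoid numerical underflow on long sequences; that is omitted here for clarity.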
11 Applying HMMs in genetics - the Li and Stephens model (LSM)
Figure: The imperfect mosaic modelling (Li and Stephens, 2003).
The Li and Stephens model has been one of the most widely used models since its development. It:
- is an HMM-based approximation to the coalescent with recombination
- models the complex correlation structure between genetic loci (linkage disequilibrium) by treating each genome as an imperfect mosaic made up of the other genomes
- defines a joint model over a collection of sequences as a Product of Approximate Conditionals (PAC) likelihood
12 Using the LSM to detect signals of natural selection
Figure: Effect of natural selection on haplotypes (LCT: 2q21.3, HapMap).
13 Application of the LSM - chromosome painting
Figure: Chromosome painting and derivation of coancestry (Lawson et al.).
The GPU-LSM is currently being applied to chromosome painting, which:
- is a method of relating stretches of DNA sequence to one another
- is a crucial step in producing coancestry matrices when inferring population structure from dense haplotype data
14 Computation details for HMM algorithms
Figure: A single computation step of the HMM algorithms, shown on the trellis of states s_t = 1, 2, 3 against observations y_1, y_2, y_3.
Each step in the forward recursion means filling in a cell of a dynamic programming table:

    α(s_t) = Pr(y_t | s_t) Σ_{s_{t-1} ∈ S} Pr(s_t | s_{t-1}) α(s_{t-1})
15 Trivial HMM parallelisation
The parallelisation of HMM algorithms is straightforward (in theory) over multivariate emissions (likelihoods) and over the state space.
Figure: The calculations over the observations, and the calculations of α(s_2 = 1), α(s_2 = 2) and α(s_2 = 3), can be performed in parallel.
16 Trivial HMM parallelisation
- Calculations corresponding to different observations are trivially parallelisable, and this is perfectly suited even to distributed computation
- The standard HMM algorithms all repeat the same operations for each state s_t; moreover, the calculations of the α(s_t) (or β, etc.) values are independent of each other at each timepoint/position, and hence are suitable for parallelisation on GPUs
- Such computations are known as embarrassingly parallel or trivially parallel: the algorithm, and the number of compute operations, remains the same; you just exploit the redundancy already present in the for loops
- Theoretically, the above parallelisations can reduce the overall runtime of the HMM algorithms for K multivariate observations, y_i = {y_{i1}, ..., y_{iK}}, from O(KTN²) to O(TN); the number of compute operations remains O(KTN²)
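The state-parallelism described above amounts to the α-update being a matrix-vector product: each of the N output entries could be computed by an independent GPU thread. A minimal NumPy sketch (toy sizes, not the actual GPU kernel) checking that the per-state serial loop and the vectorised form agree:

```python
import numpy as np

# The alpha-update is data-parallel across states: each alpha(s_t) is an
# independent dot product, so on a GPU each output entry maps to one thread.
# The vectorised matrix-vector form is the CPU analogue of that mapping.
rng = np.random.default_rng(0)
N = 4
A = rng.dirichlet(np.ones(N), size=N)     # row-stochastic transition matrix
e = rng.random(N)                          # Pr(y_t | s_t) for the current y_t
alpha_prev = rng.dirichlet(np.ones(N))     # alpha(s_{t-1})

# Serial version: one state at a time (what each GPU thread would compute).
alpha_serial = np.array([e[j] * sum(A[i, j] * alpha_prev[i] for i in range(N))
                         for j in range(N)])

# Vectorised version: all N states in one matrix-vector product.
alpha_vec = e * (alpha_prev @ A)

assert np.allclose(alpha_serial, alpha_vec)
```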
17 GPU-LSM
Parallelisation of the LSM is less trivial than for normal HMMs:
- the simplifications that make the LSM so efficient actually cause complications for parallel programming: the summations can be performed as a parallel reduction, but they cannot be hidden in any existing loop and must be run separately, which increases the cost of parallelisation by a log N factor, to O(KTN² log N)
- the LSM is based on a non-homogeneous HMM, which allows the transition probabilities to differ between positions; this generalisation requires storing and loading more data than for ordinary HMMs
- the datasets are generally big, hence achieving memory efficiency is not straightforward
Despite the complications, our present implementation of the Viterbi algorithm under the LSM achieves acceleration compared to optimised sequential C code, reducing days of runtime to hours, which is crucial for model development
18 HMM parallelisation with sequence partitioning
Genetic datasets are generally large, and the length of the sequences is much greater than the state space (T >> N). The natural question is whether it is possible to design new parallel HMM algorithms (rather than parallelising existing algorithms).
We have been investigating GPU algorithms exploiting parallel computation along the sequence. The algorithm works by partitioning the sequence into blocks:
- assume the sequence is partitioned into blocks b ∈ {1, ..., B}, each of length T_b
- the values at the end of the previous block (e.g. the α(s_{(b-1)T_b}) values) are not available at the beginning of the computation
- hence, within each block, we run the algorithm N times, once from each possible state, each time conditioning on a different starting state; naturally these N conditional runs may also be run in parallel
- when the computation is done for all blocks, they can be merged: sequentially, updating each conditional run with its corresponding starting value, or in a parallel-reduction fashion, merging block pairs according to a binary tree structure
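The block scheme above can be sketched by noting that the N conditional runs for a block together form an N × N matrix M_b[i, j] = Pr(block's observations, block ends in state j | block starts in state i). Blocks are then independent of one another, and merging is matrix multiplication, which is associative, so a binary-tree merge is equally valid. A minimal sketch of the idea (toy parameters and a plain Python loop standing in for the GPU; this is not the actual implementation):

```python
import numpy as np

# Block-parallel forward sketch. Each block is summarised by the matrix
# M[i, j] = Pr(block observations, end state j | start state i), i.e. the
# N conditional runs from every possible starting state, done together.
pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.1, 0.9]])   # A[i, j] = Pr(s_t = j | s_{t-1} = i)
E  = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[j, k] = Pr(y_t = k | s_t = j)

def block_matrix(y_block):
    # (A * E[:, y])[i, j] = Pr(s_t = j | s_{t-1} = i) Pr(y | s_t = j):
    # chaining these per-symbol matrices gives the block's conditional runs.
    M = A * E[:, y_block[0]]
    for y in y_block[1:]:
        M = M @ (A * E[:, y])
    return M

y = [0, 0, 1, 1, 0, 1]
blocks = [y[1:4], y[4:6]]                 # partition of y_2..y_T into B = 2 blocks

alpha1 = pi * E[:, y[0]]                  # alpha(s_1) handled separately
Ms = [block_matrix(b) for b in blocks]    # independent -> parallel in practice
likelihood = (alpha1 @ Ms[0] @ Ms[1]).sum()   # serial merge; a tree merge is
print(likelihood)                             # equivalent by associativity
```

In a real implementation the B block matrices would be computed concurrently and merged in log B parallel steps; here the list comprehension and the left-to-right product stand in for both.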
19 Block-parallel forward algorithm with serial and parallel merge
Figure: Representation of the block-parallel forward algorithm with serial and parallel merge (Nielsen & Sand, 2011).
20-25 HMM parallelisation with sequence partitioning
Figure: Parallelisation of the Viterbi algorithm with simple observation partitioning, animated over six slides on a two-state trellis for observations y_1, ..., y_9. The number of blocks is B = 2; the length of each block is T_b = 4.
26 Potential speed-up
- In theory, with at least P = T/2 processors at hand, it is possible to achieve O(N³ log T) runtime, in comparison with the traditional O(N²T), leading to a speed-up of R = T / (N log T), which is hugely significant for T >> N
- The number of compute operations is O(N³T), so the algorithm involves a factor of N additional computations relative to the traditional HMM algorithms
- In practice we can implement serial merging with B = P when P << T, which may lead to some loss of efficiency, but the runtime order is close to linear in the block size T_b ≈ T/B
- If we had unlimited computational power (of order KTN²/2 processing units), it would be possible to combine all the parallelisation approaches to achieve O(N log T) runtime for O(KTN³) computation
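Plugging hypothetical genetic-scale numbers into the ratio of serial work O(N²T) to parallel runtime O(N³ log T) gives a feel for the theoretical speed-up R = T / (N log T) (the sizes below are illustrative assumptions, not figures from the talk):

```python
import math

# Theoretical speed-up of the block-parallel algorithm over the O(N^2 T)
# serial forward pass, for hypothetical problem sizes (T >> N).
T, N = 10**6, 100                     # sequence length, state-space size
serial_ops    = N**2 * T              # O(N^2 T) serial runtime
parallel_time = N**3 * math.log2(T)   # O(N^3 log T) with ~T/2 processors
R = serial_ops / parallel_time        # simplifies to T / (N log T)
print(f"speed-up ~ {R:.0f}x")
```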
27 Future work
- We are currently characterising the theoretical applicability of all possible parallel-algorithm approaches for every combination of values in the K-N-T space
- We aim to include communication costs in our theoretical evaluations
- We are in the process of applying parallel inference algorithms to HMM parameter learning, such as sequential Monte Carlo samplers
28 Conclusions
- Medical genetics and genomics will produce vast data sets over the next few years
- We need statistical methods that can scale to handle them
- To do so, exploiting parallel computation at the algorithm-design stage will be key, both for model development and for model fitting
- We believe that parallel algorithms can bring some methods from the overly computation-intensive zone into the practically applicable zone
Motivation: Shortcomings of Hidden Markov Model Maximum Entropy Markov Models and Conditional Random Fields Ko, Youngjoong Dept. of Computer Engineering, Dong-A University Intelligent System Laboratory,
More informationPROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota
Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein
More informationQuiz section 10. June 1, 2018
Quiz section 10 June 1, 2018 Logistics Bring: 1 page cheat-sheet, simple calculator Any last logistics questions about the final? Logistics Bring: 1 page cheat-sheet, simple calculator Any last logistics
More informationExact Inference: Elimination and Sum Product (and hidden Markov models)
Exact Inference: Elimination and Sum Product (and hidden Markov models) David M. Blei Columbia University October 13, 2015 The first sections of these lecture notes follow the ideas in Chapters 3 and 4
More informationNeural Network Weight Selection Using Genetic Algorithms
Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks
More informationBayesian Estimation for Skew Normal Distributions Using Data Augmentation
The Korean Communications in Statistics Vol. 12 No. 2, 2005 pp. 323-333 Bayesian Estimation for Skew Normal Distributions Using Data Augmentation Hea-Jung Kim 1) Abstract In this paper, we develop a MCMC
More informationHMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms
HMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms by TIN YIN LAM B.Sc., The Chinese University of Hong Kong, 2006 A THESIS SUBMITTED IN PARTIAL
More informationAutomatic Linguistic Indexing of Pictures by a Statistical Modeling Approach
Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Outline Objective Approach Experiment Conclusion and Future work Objective Automatically establish linguistic indexing of pictures
More informationOptimization of Hidden Markov Model by a Genetic Algorithm for Web Information Extraction
Optimization of Hidden Markov Model by a Genetic Algorithm for Web Information Extraction Jiyi Xiao Lamei Zou Chuanqi Li School of Computer Science and Technology, University of South China, Hengyang 421001,
More informationNearest neighbor classification DSE 220
Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000
More informationData Mining in Bioinformatics Day 1: Classification
Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls
More informationMATE-EC2: A Middleware for Processing Data with Amazon Web Services
MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering
More informationMachine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,
More informationMCMC Methods for data modeling
MCMC Methods for data modeling Kenneth Scerri Department of Automatic Control and Systems Engineering Introduction 1. Symposium on Data Modelling 2. Outline: a. Definition and uses of MCMC b. MCMC algorithms
More informationTutorial using BEAST v2.4.7 MASCOT Tutorial Nicola F. Müller
Tutorial using BEAST v2.4.7 MASCOT Tutorial Nicola F. Müller Parameter and State inference using the approximate structured coalescent 1 Background Phylogeographic methods can help reveal the movement
More informationIntroduction to Graphical Models
Robert Collins CSE586 Introduction to Graphical Models Readings in Prince textbook: Chapters 10 and 11 but mainly only on directed graphs at this time Credits: Several slides are from: Review: Probability
More informationThe k-means Algorithm and Genetic Algorithm
The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective
More informationGenetic type 1 Error Calculator (GEC)
Genetic type 1 Error Calculator (GEC) (Version 0.2) User Manual Miao-Xin Li Department of Psychiatry and State Key Laboratory for Cognitive and Brain Sciences; the Centre for Reproduction, Development
More informationHow Learning Differs from Optimization. Sargur N. Srihari
How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical
More informationECE521 Lecture 18 Graphical Models Hidden Markov Models
ECE521 Lecture 18 Graphical Models Hidden Markov Models Outline Graphical models Conditional independence Conditional independence after marginalization Sequence models hidden Markov models 2 Graphical
More informationData Mining Technologies for Bioinformatics Sequences
Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment
More informationForward Feature Selection Using Residual Mutual Information
Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics
More informationMAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS
In: Journal of Applied Statistical Science Volume 18, Number 3, pp. 1 7 ISSN: 1067-5817 c 2011 Nova Science Publishers, Inc. MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS Füsun Akman
More informationAlignment and clustering tools for sequence analysis. Omar Abudayyeh Presentation December 9, 2015
Alignment and clustering tools for sequence analysis Omar Abudayyeh 18.337 Presentation December 9, 2015 Introduction Sequence comparison is critical for inferring biological relationships within large
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2016 Lecturer: Trevor Cohn 20. PGM Representation Next Lectures Representation of joint distributions Conditional/marginal independence * Directed vs
More informationA noninformative Bayesian approach to small area estimation
A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported
More informationChromHMM: automating chromatin-state discovery and characterization
Nature Methods ChromHMM: automating chromatin-state discovery and characterization Jason Ernst & Manolis Kellis Supplementary Figure 1 Supplementary Figure 2 Supplementary Figure 3 Supplementary Figure
More informationComputational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop
Computational Architecture of Cloud Environments Michael Schatz April 1, 2010 NHGRI Cloud Computing Workshop Cloud Architecture Computation Input Output Nebulous question: Cloud computing = Utility computing
More information