Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data

1 Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data
Zrinka Puljiz and Haris Vikalo
Electrical and Computer Engineering Department, The University of Texas at Austin
8th International Symposium on Turbo Codes & Iterative Information Processing, Bremen, Germany, August 18-22, 2014
Iterative Learning of Single Individual Haplotypes 1 / 22

2 Overview of the Talk
Motivation and background
  DNA sequencing and studies of genetic variations
Haplotype assembly
  data structure and problem formulation
  graphical representation of the problem
  existing methods
Communication systems analogy and belief propagation
  haplotype assembly as a decoding problem
  belief propagation algorithm
  performance analysis, comparison with existing methods
Conclusions and future work

3 DNA Sequencing: Discovering Genetic Blueprint
Determine the order of nucleotides in a DNA sequence
Human Genome Project: mapping the genetic blueprint
  followed by sequencing more individuals, studies of genetic variations

4 Study of Genetic Variations in Humans
Humans are diploid organisms with 23 pairs of chromosomes
  chromosomes in a pair of autosomes are homologous
  the most common type of variation is the single nucleotide polymorphism (SNP)

5 Study of Genetic Variations in Humans (cont'd)
Describing variations
  SNP calling determines the locations and types of polymorphisms
  based on the detected SNPs, perform genotype calling
  example: A/T, A/C, G/T
Genotypes provide only the list of unordered pairs of alleles
  no association of alleles with one of the chromosomes in a pair
The complete information is provided by haplotypes
  the list of alleles at contiguous sites in a region of a chromosome
  example: (A,C,G) and (T,A,T)
  fundamental for many applications (personalized medicine!)
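To make the genotype/haplotype distinction concrete, here is a minimal sketch using the slide's example alleles. The function names are hypothetical and chosen only for illustration. It collapses the two haplotypes into unordered genotypes and then counts how many haplotype pairs are compatible with those genotypes.

```python
from itertools import product

def genotypes(hap1, hap2):
    """Collapse an ordered haplotype pair into unordered per-site genotypes."""
    return [frozenset((a, b)) for a, b in zip(hap1, hap2)]

def consistent_haplotype_pairs(geno):
    """Enumerate unordered haplotype pairs compatible with heterozygous genotypes."""
    sites = [sorted(g) for g in geno]  # two alleles per heterozygous site
    pairs = set()
    for choice in product((0, 1), repeat=len(sites)):
        h1 = tuple(s[c] for s, c in zip(sites, choice))
        h2 = tuple(s[1 - c] for s, c in zip(sites, choice))
        pairs.add(frozenset((h1, h2)))  # the pair itself is unordered
    return pairs

geno = genotypes(("A", "C", "G"), ("T", "A", "T"))  # A/T, A/C, G/T
pairs = consistent_haplotype_pairs(geno)
print(len(pairs))  # 4
```

With three heterozygous sites, the genotypes alone leave four candidate haplotype pairs, which is why phasing information from reads is needed.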

6 Single Individual Haplotyping
Determine a haplotype of an individual using DNA sequencing
The SNP rate is low, typically estimated to be $10^{-3}$
  high-throughput DNA sequencing provides reads that are too short to cover multiple SNP sites
  instead, get pairs of fragments at opposite ends of a strand of known length

7 A Fragment Conflict Graph Interpretation
Represent reads by nodes, conflicts by edges
  fragments are in conflict if they cover a common SNP location but have different nucleotides there (so, they come from different chromosomes)
If the data is error-free, the conflict graph is bipartite
  otherwise, the graph may contain odd-length cycles
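The conflict-graph idea can be sketched in a few lines. The reads below are hypothetical toy data (strings over 0, 1 and x, where x marks an uncovered site), and the helper names are illustrative, not taken from the talk. The sketch builds the conflict graph and tests whether it is bipartite by breadth-first two-coloring.

```python
from collections import deque
from itertools import combinations

def in_conflict(a, b):
    """Two reads conflict if they disagree at some site both of them cover."""
    return any(x != "x" and y != "x" and x != y for x, y in zip(a, b))

def conflict_graph(reads):
    adj = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if in_conflict(reads[i], reads[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def two_color(adj):
    """Return a 2-coloring (read -> chromosome label) or None if not bipartite."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle found
    return color

clean = ["01x", "10x", "x11", "0x1"]   # hypothetical error-free reads
noisy = ["01x", "10x", "x11", "0x0"]   # last base of the last read flipped
print(two_color(conflict_graph(clean)))  # {0: 0, 1: 1, 2: 0, 3: 0}
print(two_color(conflict_graph(noisy)))  # None
```

In the clean case the coloring directly assigns reads to the two chromosomes; the single flipped base creates a triangle among reads 1, 2 and 3, so no such assignment exists.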

8 Various Formulations of the Haplotype Assembly Problem
If the conflict graph is not bipartite, assembly is non-trivial
Approach: minimize the number of transformation steps needed to alter the graph so that it becomes bipartite
  minimum edge removal (MER), minimum fragment removal (MFR), minimum SNP removal (MSR)
Minimum error correction (MEC): find the smallest number of nucleotides in reads whose flipping to a different value resolves conflicts among the fragments from the same chromosome
  essentially, remove cycles in the conflict graph by assuming the fewest possible sequencing errors
  NP-hard, various methods: HapCut [Bansal & Bafna, 2008], HapCompass [Aguiar & Istrail, 2013], HapTree [Berger et al., 2014]

9 Minimum Error Correction Formulation
Label bases in heterozygous sites as $h^1_i, h^2_i \in \{1, 0\}$
  define $h = h^1 = \bar{h}^2 = [h^1_1 \; h^1_2 \; \dots \; h^1_n]$
Each read is a ternary string with entries 0, 1 and x (site not covered by the read)
  organize reads into a matrix $R$, row $r_i$ is the $i$th read
$$R = \begin{bmatrix}
x & x & 0 & x & x & 1 \\
x & 1 & x & x & 0 & x \\
x & x & 0 & x & 0 & x \\
0 & x & x & 1 & x & x \\
1 & x & 1 & x & x & x \\
x & x & 1 & x & 0 & x \\
x & 0 & x & 0 & x & x \\
x & x & x & 0 & x & 0
\end{bmatrix}$$
The MEC formulation is concerned with minimizing $Z$ over $h$,
$$Z = \sum_{i=1}^{m} \min\big(\mathrm{hd}(r_i, h),\, \mathrm{hd}(r_i, \bar{h})\big), \qquad \mathrm{hd}(r_i, h) = \sum_{j=1}^{n} d(r_{i,j}, h_j)$$
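As a rough illustration of the MEC objective, the sketch below evaluates Z by exhaustive search on the 8 x 6 matrix as it appears in this transcription. It assumes that d(r, h) equals 1 when a covered entry differs from the haplotype bit and 0 otherwise (uncovered entries x contribute nothing); that definition and the function names are assumptions, not taken from the slides. Exhaustive search is only feasible for tiny n and is not the method proposed in the talk.

```python
from itertools import product

# Rows of R as transcribed; "x" marks a site the read does not cover.
R = ["xx0xx1", "x1xx0x", "xx0x0x", "0xx1xx",
     "1x1xxx", "xx1x0x", "x0x0xx", "xxx0x0"]

def hd(read, hap):
    """Hamming distance restricted to the sites covered by the read."""
    return sum(1 for r, h in zip(read, hap) if r != "x" and int(r) != h)

def mec_score(reads, hap):
    """Z for a given h: each read is charged against the closer of h and its complement."""
    comp = tuple(1 - b for b in hap)
    return sum(min(hd(r, hap), hd(r, comp)) for r in reads)

def brute_force_mec(reads):
    n = len(reads[0])
    # h and its complement score identically, so the first bit is fixed to 0.
    candidates = ((0,) + rest for rest in product((0, 1), repeat=n - 1))
    return min((mec_score(reads, h), h) for h in candidates)

print(brute_force_mec(R))  # (1, (0, 1, 0, 1, 0, 1))
```

For this matrix a single base flip suffices once every site is treated as heterozygous: the third and sixth reads disagree about the relative phase of sites 3 and 5.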

10 Structure of the Data Matrix
Consider the error-free SNP fragment matrix
$$R = \begin{bmatrix}
x & x & 0 & x & x & 1 \\
x & 1 & x & x & 0 & x \\
x & x & 0 & x & 0 & x \\
0 & x & x & 1 & x & x \\
1 & x & 1 & x & x & x \\
x & x & 1 & x & 0 & x \\
x & 0 & x & 0 & x & x \\
x & x & x & 0 & x & 0
\end{bmatrix}$$
Let $h = [\dots]$, and the origin of the reads in $R$ be $s = [\dots]$. Then for a binary $R_{i,j}$ it holds that $s_i \oplus h_j = R_{i,j}$
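The structure of an error-free data matrix can be sketched as follows, assuming the relation reads s_i XOR h_j = R_{i,j} for every covered entry, with s_i = 0 meaning the read comes from h and s_i = 1 meaning it comes from the complement. The values of h, s and the coverage pattern below are hypothetical, since the numerical values on the slide are not available here.

```python
def error_free_matrix(h, s, covered):
    """Fill R[i][j] = s[i] XOR h[j] at covered positions, 'x' elsewhere."""
    m, n = len(s), len(h)
    R = [["x"] * n for _ in range(m)]
    for i, j in covered:
        R[i][j] = s[i] ^ h[j]
    return R

h = [0, 1, 0, 1, 0, 1]   # hypothetical haplotype
s = [0, 0, 1]            # hypothetical read origins
covered = [(0, 0), (0, 2), (1, 1), (1, 4), (2, 3), (2, 5)]  # two SNPs per read

for row in error_free_matrix(h, s, covered):
    print(" ".join(str(v) for v in row))
```

Each row carries only relative phase information: flipping s_i complements every covered entry of read i.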

11 Haplotype Assembly as a Decoding Problem
Collect indices $\{(i_k, j_k)\}$ identifying positions where the $m \times n$ matrix $R$ has binary entries ($1 \le k \le M$)
Define the code generating matrix $G$,
$$G(l, k) = \begin{cases} 1, & \text{if } l = j_k \text{ or } l = i_k + n, \quad 1 \le k \le M, \\ 0, & \text{otherwise.} \end{cases}$$
Example: [worked example constructing $G$ from a small $R$; matrix entries not recoverable from the transcription]
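The definition of G can be sketched directly in code. The 2 x 3 matrix R used here is a hypothetical stand-in for the slide's example, and indices are 0-based, so row j of G corresponds to SNP j and row n + i to read i.

```python
def binary_positions(R):
    """List (i, j) for every covered (binary) entry of R, in row-major order."""
    return [(i, j) for i, row in enumerate(R) for j, v in enumerate(row) if v != "x"]

def generator_matrix(R):
    """Build the (n + m) x M matrix G with G[j_k][k] = G[n + i_k][k] = 1."""
    m, n = len(R), len(R[0])
    pos = binary_positions(R)
    G = [[0] * len(pos) for _ in range(n + m)]
    for k, (i, j) in enumerate(pos):
        G[j][k] = 1        # row of SNP j_k
        G[n + i][k] = 1    # row of read i_k
    return G, pos

R_small = ["0x1", "x10"]   # hypothetical example, not the one on the slide
G, pos = generator_matrix(R_small)
for row in G:
    print(row)
```

Every column of G has exactly two ones, one selecting a haplotype bit and one selecting a read-origin bit, so each code bit is the XOR of those two.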

12 Haplotype Assembly as a Decoding Problem (cont'd)
Define a message $m = [h \; s]$ and a codeword $c = mG$
  $c$ collects binary entries from an error-free data matrix $R$
Due to sequencing errors, entries in $R$ are erroneously flipped
  this can be interpreted as the effect of a binary symmetric channel on $c = mG$
  formally, $y = c + e = [h \; s]G + e$, where $y_k = R(i_k, j_k)$

13 Graphical Model
[Figure: graphical representation of the problem]

14 Graphical Model (cont'd)
Haplotyping with the MEC criterion is equivalent to minimum distance decoding using the parity-check matrix $H$:
$$\mathrm{MEC} = \min_{e \,:\, H(y+e) = 0} \|e\|_0$$
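The decoding view can be checked on a small instance. The sketch below performs minimum distance decoding by exhaustive search over all messages [h s] rather than through a parity-check matrix or belief propagation, so it is an illustration of the criterion only and not the algorithm from the talk. It uses the 8 x 6 matrix as transcribed and computes each code bit as h[j_k] XOR s[i_k], which is what multiplying [h s] by G yields over GF(2).

```python
from itertools import product

R = ["xx0xx1", "x1xx0x", "xx0x0x", "0xx1xx",
     "1x1xxx", "xx1x0x", "x0x0xx", "xxx0x0"]

def decode_min_distance(R):
    """Return (weight of e, h, s, e) minimizing the Hamming weight of e = y + [h s]G."""
    m, n = len(R), len(R[0])
    pos = [(i, j) for i in range(m) for j in range(n) if R[i][j] != "x"]
    y = [int(R[i][j]) for i, j in pos]            # received word
    best = None
    for msg in product((0, 1), repeat=n + m):     # 2^(n+m) candidate messages
        h, s = msg[:n], msg[n:]
        c = [h[j] ^ s[i] for i, j in pos]         # codeword bits
        e = [a ^ b for a, b in zip(y, c)]         # implied error pattern
        w = sum(e)
        if best is None or w < best[0]:
            best = (w, h, s, e)
    return best

weight, h_hat, s_hat, e_hat = decode_min_distance(R)
print(weight, h_hat)  # 1 (0, 1, 0, 1, 0, 1)
```

The minimum error weight equals the optimal MEC score for this matrix, because choosing s_i amounts to choosing whether read i is compared against h or its complement.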

15 Belief Propagation for Haplotype Assembly
[Figure: graphical model for the belief propagation algorithm]

16 Belief Propagation for Haplotype Assembly (cont'd)

17 Belief Propagation for Haplotype Assembly (cont'd)

18 Belief Propagation for Haplotype Assembly (cont'd)
Stopping criterion: threshold or maximum number of iterations reached

19 Computational Complexity
Belief propagation algorithm:
  allow random restarts, MAXITER iterations
Schemes relying on parity-check need preprocessing:
  parity-check matrix transformation
Complexity for each haplotype block: $O\big((\#\text{SNP} + \#\text{Reads}) \cdot (\#\text{entries in } R)\big)$
  this step depends on the locations of the binary entries in the matrix $R$

20 Results on 1000 Genomes Project Data

21 Performance Guarantees
Found lower bounds on $\Pr\{\hat{h} \ne h\}$, $E[\|\hat{h} - h\|_0]$, $E[\#\text{switch errors}]$ [SVV, ITW 2014]
Consider the haplotype of length $n$, error rate $p$, and probability of assembly error $P_e = \Pr\{\hat{h} \ne h \mid R\}$. The number of reads $m$ necessary for the assembly satisfies
$$m \ge \frac{(1 - P_e)\, n}{2\,[1 - H(p)]}.$$
If $m = \Theta(n \ln n)$, one can determine $h$ accurately with high probability. Specifically, given a target small constant $\epsilon > 0$, there exists $n$ large enough such that by choosing $m = \Theta(n \ln n)$ the probability of error $P_e \le \epsilon$.
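For a feel of the numbers, the sketch below evaluates the necessary-condition bound under the assumption that it reads m >= (1 - P_e) n / (2 [1 - H(p)]), with H(p) taken to be the binary entropy function in bits. Both the reading of the formula and the function names are assumptions made for illustration.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_reads_lower_bound(n, p, pe):
    """Assumed bound: m >= (1 - pe) * n / (2 * (1 - H(p)))."""
    h = binary_entropy(p)
    if h >= 1.0:
        return math.inf  # p = 0.5: the reads carry no information
    return (1 - pe) * n / (2 * (1 - h))

print(min_reads_lower_bound(1000, 0.0, 0.0))  # 500.0
for p in (0.01, 0.05, 0.1):
    print(p, math.ceil(min_reads_lower_bound(1000, p, 0.01)))
```

Under this reading, the noiseless case needs at least n/2 reads, and the requirement grows as the sequencing error rate p approaches 0.5.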

22 Summary and Future Work
Developed a novel framework for haplotype assembly
  rephrased assembly as a decoding problem
  belief propagation algorithm as a solution
  outperforms existing methods on 1000 Genomes Project data
Several possible extensions
  exploit possible prior SNP/genotype information
  develop joint base/SNP/genotype calling and haplotype assembly schemes
Explore other suitable methods and techniques
  sparse low-rank matrix completion, spectral partitioning, correlation clustering
Analyze limits of performance, experimental conditions needed to achieve desired accuracy


Algorithms for Nearest Neighbors Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline

More information

Spectral Clustering and Community Detection in Labeled Graphs

Spectral Clustering and Community Detection in Labeled Graphs Spectral Clustering and Community Detection in Labeled Graphs Brandon Fain, Stavros Sintos, Nisarg Raval Machine Learning (CompSci 571D / STA 561D) December 7, 2015 {btfain, nisarg, ssintos} at cs.duke.edu

More information

Axiom Analysis Suite Release Notes (For research use only. Not for use in diagnostic procedures.)

Axiom Analysis Suite Release Notes (For research use only. Not for use in diagnostic procedures.) Axiom Analysis Suite 4.0.1 Release Notes (For research use only. Not for use in diagnostic procedures.) Axiom Analysis Suite 4.0.1 includes the following changes/updates: 1. For library packages that support

More information

REVIEW ON CONSTRUCTION OF PARITY CHECK MATRIX FOR LDPC CODE

REVIEW ON CONSTRUCTION OF PARITY CHECK MATRIX FOR LDPC CODE REVIEW ON CONSTRUCTION OF PARITY CHECK MATRIX FOR LDPC CODE Seema S. Gumbade 1, Anirudhha S. Wagh 2, Dr.D.P.Rathod 3 1,2 M. Tech Scholar, Veermata Jijabai Technological Institute (VJTI), Electrical Engineering

More information

Outline. Motivation. Introduction of GAs. Genetic Algorithm 9/7/2017. Motivation Genetic algorithms An illustrative example Hypothesis space search

Outline. Motivation. Introduction of GAs. Genetic Algorithm 9/7/2017. Motivation Genetic algorithms An illustrative example Hypothesis space search Outline Genetic Algorithm Motivation Genetic algorithms An illustrative example Hypothesis space search Motivation Evolution is known to be a successful, robust method for adaptation within biological

More information

Graph based codes for distributed storage systems

Graph based codes for distributed storage systems /23 Graph based codes for distributed storage systems July 2, 25 Christine Kelley University of Nebraska-Lincoln Joint work with Allison Beemer and Carolyn Mayer Combinatorics and Computer Algebra, COCOA

More information

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach 1 Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach David Greiner, Gustavo Montero, Gabriel Winter Institute of Intelligent Systems and Numerical Applications in Engineering (IUSIANI)

More information

Addendum to the proof of log n approximation ratio for the greedy set cover algorithm

Addendum to the proof of log n approximation ratio for the greedy set cover algorithm Addendum to the proof of log n approximation ratio for the greedy set cover algorithm (From Vazirani s very nice book Approximation algorithms ) Let x, x 2,...,x n be the order in which the elements are

More information

Generic Topology Mapping Strategies for Large-scale Parallel Architectures

Generic Topology Mapping Strategies for Large-scale Parallel Architectures Generic Topology Mapping Strategies for Large-scale Parallel Architectures Torsten Hoefler and Marc Snir Scientific talk at ICS 11, Tucson, AZ, USA, June 1 st 2011, Hierarchical Sparse Networks are Ubiquitous

More information

Cycles in Random Graphs

Cycles in Random Graphs Cycles in Random Graphs Valery Van Kerrebroeck Enzo Marinari, Guilhem Semerjian [Phys. Rev. E 75, 066708 (2007)] [J. Phys. Conf. Series 95, 012014 (2008)] Outline Introduction Statistical Mechanics Approach

More information

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Parsimony-Based Approaches to Inferring Phylogenetic Trees Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Mark Craven craven@biostat.wisc.edu Fall 0 Phylogenetic tree approaches! three general types! distance:

More information

Summary of Raptor Codes

Summary of Raptor Codes Summary of Raptor Codes Tracey Ho October 29, 2003 1 Introduction This summary gives an overview of Raptor Codes, the latest class of codes proposed for reliable multicast in the Digital Fountain model.

More information

Introduction to GDS. Stephanie Gogarten. August 7, 2017

Introduction to GDS. Stephanie Gogarten. August 7, 2017 Introduction to GDS Stephanie Gogarten August 7, 2017 Genomic Data Structure Author: Xiuwen Zheng CoreArray (C++ library) designed for large-scale data management of genome-wide variants data format (GDS)

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

motifs In the context of networks, the term motif may refer to di erent notions. Subgraph motifs Coloured motifs { }

motifs In the context of networks, the term motif may refer to di erent notions. Subgraph motifs Coloured motifs { } motifs In the context of networks, the term motif may refer to di erent notions. Subgraph motifs Coloured motifs G M { } 2 subgraph motifs 3 motifs Find interesting patterns in a network. 4 motifs Find

More information

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

More information

Mining Social Network Graphs

Mining Social Network Graphs Mining Social Network Graphs Analysis of Large Graphs: Community Detection Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com Note to other teachers and users of these slides: We would be

More information

Set Cover with Almost Consecutive Ones Property

Set Cover with Almost Consecutive Ones Property Set Cover with Almost Consecutive Ones Property 2004; Mecke, Wagner Entry author: Michael Dom INDEX TERMS: Covering Set problem, data reduction rules, enumerative algorithm. SYNONYMS: Hitting Set PROBLEM

More information

Mapping Reads to Reference Genome

Mapping Reads to Reference Genome Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene

More information

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame 1 When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

C LDPC Coding Proposal for LBC. This contribution provides an LDPC coding proposal for LBC

C LDPC Coding Proposal for LBC. This contribution provides an LDPC coding proposal for LBC C3-27315-3 Title: Abstract: Source: Contact: LDPC Coding Proposal for LBC This contribution provides an LDPC coding proposal for LBC Alcatel-Lucent, Huawei, LG Electronics, QUALCOMM Incorporated, RITT,

More information

Haplotype Inference by Pure Parsimony with Constraint Programming

Haplotype Inference by Pure Parsimony with Constraint Programming IT 09 050 Examensarbete 30 hp Oktober 2009 Haplotype Inference by Pure Parsimony with Constraint Programming Xiaoyue Pan Institutionen för informationsteknologi Department of Information Technology Abstract

More information

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome

More information

Section 7.12: Similarity. By: Ralucca Gera, NPS

Section 7.12: Similarity. By: Ralucca Gera, NPS Section 7.12: Similarity By: Ralucca Gera, NPS Motivation We talked about global properties Average degree, average clustering, ave path length We talked about local properties: Some node centralities

More information

Reducing Genome Assembly Complexity with Optical Maps

Reducing Genome Assembly Complexity with Optical Maps Reducing Genome Assembly Complexity with Optical Maps Lee Mendelowitz LMendelo@math.umd.edu Advisor: Dr. Mihai Pop Computer Science Department Center for Bioinformatics and Computational Biology mpop@umiacs.umd.edu

More information