TCGR: A Novel DNA/RNA Visualization Technique

Similar documents
by the Genevestigator program ( Darker blue color indicates higher gene expression.

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

HP22.1 Roth Random Primer Kit A für die RAPD-PCR

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

Appendix A. Example code output. Chapter 1. Chapter 3

Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells

Supplementary Table 1. Data collection and refinement statistics

warm-up exercise Representing Data Digitally goals for today proteins example from nature

Digging into acceptor splice site prediction: an iterative feature selection approach

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA

Machine Learning Classifiers

A relation between trinucleotide comma-free codes and trinucleotide circular codes

DNA Sequencing. Overview

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot)

MLiB - Mandatory Project 2. Gene finding using HMMs

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

Supplementary Materials:

Graph Algorithms in Bioinformatics

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI

Degenerate Coding and Sequence Compacting

10/15/2009 Comp 590/Comp Fall

DNA Fragment Assembly

10/8/13 Comp 555 Fall

Lecture 5: Markov models

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

Purpose of sequence assembly

Sequence Assembly Required!

Genome 373: Genome Assembly. Doug Fowler

Algorithms for Bioinformatics

Supporting Information

PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS

Algorithms for Bioinformatics

Scalable Solutions for DNA Sequence Analysis

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

de Bruijn graphs for sequencing data

WSSP-10 Chapter 7 BLASTN: DNA vs DNA searches

Structural analysis and haplotype diversity in swine LEP and MC4R genes

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

DNA Fragment Assembly

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015

OFFICE OF RESEARCH AND SPONSORED PROGRAMS

Global Alignment. Algorithms in BioInformatics Mandatory Project 1 Magnus Erik Hvass Pedersen (971055) Daimi, University of Aarhus September 2004

A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving on DNA-encoded Data

QuasiAlign: Position Sensitive P-Mer Frequency Clustering with Applications to Genomic Classification and Differentiation

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching

Eulerian Tours and Fleury s Algorithm

BGGN 213 Foundations of Bioinformatics Barry Grant

Solutions Exercise Set 3 Author: Charmi Panchal

Assembly in the Clouds

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Computational Molecular Biology

RESEARCH TOPIC IN BIOINFORMANTIC

Algorithms and Data Structures

How to Run NCBI BLAST on zcluster at GACRC

Graphs and Puzzles. Eulerian and Hamiltonian Tours.

BMI/CS 576 Fall 2015 Midterm Exam

Graph Algorithms in Bioinformatics

Eulerian tours. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck. April 20, 2016

In Silico Modelling and Analysis of Ribosome Kinetics and aa-trna Competition

3. The object system(s)

Computational Genomics and Molecular Biology, Fall

1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Salmonella enterica serotype Enteritidis.

Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 -

GeneR. JORGE ARTURO ZEPEDA MARTINEZ LOPEZ HERNANDEZ JOSE FABRICIO. October 6, 2009

Mining more complex patterns: frequent subsequences and subgraphs. Department of Computers, Czech Technical University in Prague

Tutorial 1: Exploring the UCSC Genome Browser


Multiple Sequence Alignment Gene Finding, Conserved Elements

A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction

Biostrings. Martin Morgan Bioconductor / Fred Hutchinson Cancer Research Center Seattle, WA, USA June 2009

Programming Applications. What is Computer Programming?

Research Article An Improved Scoring Matrix for Multiple Sequence Alignment

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA

SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche

Parallel de novo Assembly of Complex (Meta) Genomes via HipMer

Biostatistics and Bioinformatics Molecular Sequence Databases

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

Shortest Path Algorithm

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

EECS730: Introduction to Bioinformatics

Sequence Alignment 1

DELAMANID SUSCEPTIBILITY TESTING IN AN AUTOMATED LIQUID CULTURE SYSTEM

Finding homologous sequences in databases

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Weighted Finite-State Transducers in Computational Biology

Accelerating Genome Assembly with Power8

Package mirnatap. February 1, 2018

DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats

Chaotic Image Encryption Technique using S-box based on DNA Approach

Alignments BLAST, BLAT

Transcription:

TCGR: A Novel DNA/RNA Visualization Technique Donya Quick and Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 dquick@mail.smu.edu, mhd@engr.smu.edu 10/11/07 1

Table of Contents Introduce TCGR Background FCGR Algorithm Overview TCGR Parameters Input Parameters Sequence Alignment Applications Conclusions 10/11/07 2

FCGR AAA AAC ACA ACC CAA CAC CCA CCC AAG AAT ACG ACT CAG CAT CCG CCT AGA AGC ATA ATC CGA CGC CTA CTC AGG AGT ATG ATT CGG CGT CTG CTT AA AC CA CC GAA GAC GCA GCC TAA TAC TCA TCC AG AT CG CT GAG GAT GCG GCT TAG TAT TCG TCT A C GA GC TA TC GGA GGC GTA GTC TGA TGC TTA TTC G T GG GT TG TT GGG GGT GTG GTT TGG TGT TTG TTT a) Nucleotides b) Dinucleotides c) Trinucletides Courtesy of Eamonn Keogh, UCR 10/11/07 3

FCGR Example Homo sapiens all mature mirna Patterns of length 3 UUC GUG 10/11/07 4

What is TCGR? Temporal Chaos Game Representation (TCGR) A visual and numerical representation of data Can be applied to DNA sequence data as well as other data types Shows general structure of sequences Structure is represented as distribution of subsequence over sequence length. 10/11/07 5

Temporal CGR (TCGR) Temporal version of Frequency CGR In our context temporal means the starting location of a window 2D Array Each Row represents counts for a particular window in sequence First row first window Last row last window We start successive windows at the next character location Each Column represents the counts for the associated pattern in that window Initially we have assumed order of patterns is alphabetic Size of TCGR depends primarily on sequence length and subsequence size As sequence sizes vary, we only examine complete windows We only count patterns completely contained in each window. 10/11/07 6

Frequency Instead of actual frequencies, a normalization (based on largest frequency in sequence) is used. 0.0 means a subsequence did not occur in a window 1.0 for a subsequence means it is the most frequently occurring in the data set. Color schemes for visualization: Frequency 0.0 0.5 1.0 Color White ( cold spot ) Blue Red ( hot spot ) Grayscale White Gray Black 10/11/07 7

TCGR Representation Subsequence Window Hot Spots Possible Cold Spots 10/11/07 8

TCGR Algorithm Overview 1. Counting While windows are left: Count all subsequences present for all strings in current window Move window down by specified overlap and repeat 2. Frequency conversion Divide all subsequence counts by maximum to scale to [0,1]. 10/11/07 9

TCGR Algorithm Overview Counting Process Counts Array Frequency Array TCGR output 10/11/07 10

TCGR Parameters Subsequence size (SS) Maximum Count Value Window Length (WL) Window Overlap (WO) 10/11/07 11

Effects of Subsequence Size Number of columns is 4 n For a constant window length and overlap and increasing subsequence size: The number of columns will increase exponentially The TCGR will become less dense (more white space) As density decreases, white space holds less potential meaning. 10/11/07 12

Effects of Subsequence Size SS=1 SS=2 SS=3 Synthetic data set 10/11/07 13

Effects of Maximum Count Value Affects the scaling of the data at the frequency level. When the maximum count value is low, small differences in frequency are more visible. If comparing TCGRs for two different sequences, should scale both to the same maximum count value to avoid false hot spots. If comparing TCGRs where each represents a set of many sequences, using the default scaling may be better to compare relative structure. 10/11/07 14

Effects of Maximum Count Value Max=23 (default) Max=30 Max=50 (data from slide 15, multiple sequences) 10/11/07 15

Effects of Window Length For a constant SS and maximal WO: The output becomes denser Cold spots may become more meaningful Total number of rows will decrease 10/11/07 16

Effects of Window Length (data from slide 10, multiple sequences) 10/11/07 17

Effects of Window Overlap Gives best results when maximized Risks associated with decreasing WO: Boundary anomaly can occur if last window is only partially filled Maximum count values may be missed Scaling may be off due to missed maximum counts 10/11/07 18

Effects of Window Overlap SS=1, WL=10 WO = 9, 8, 7, and 6 respectively GGGTGAGGTAGTAGGTTGTATAGTTTGGGGCTCTGCCCTGCTATGGGATAACTATACAATCTACTGTCTTTCCT TCAGAGTGAGGTAGTAGATTGTATAGTTGTGGGGTAGTGATTTTACCCTGTTCAGGAGATAACTATACAATCTATTGCCTTCCCTGA CTGGCTGAGGTAGTAGTTTGTGCTGTTGGTCGGGTTGTGACATTGCCCGCTGTGGAGATAACTGCGCAAGCTACTGCCTTGCTA AGGTTGAGGTAGTAGGTTGTATAGTTTAGAATTACATCAAGGGAGATAACTGTACAGCCTCCTAGCTTTCCT CCCGGGCTGAGGTAGGAGGTTGTATAGTTGAGGAGGACACCCAAGGAGATCACTATACGGCCTCCTAGCTTTCCCCAGG (Xu et al.) 10/11/07 19

Effects of Sequence Alignment If used before performing TCGR: Can result in more accurate data representation Hot spots will not be missed due to being misaligned Rows may increase, particularly if gaps are allowed 10/11/07 20

Effects of Sequence Alignment A synthetic data set: CAGAATTTTCGACATGGAGCAACGATATATATTGACCCTATGCCGGATTCTGCTCTCACTAACTTTGCGCACGGGTG CAGAATTTTCGACATTCTAAGAACCCTTTAAGTACACCGAATCTATCAAACGATACATTTGCGCACGGGTGGTAG CAGAATTTTCGACAGAAGAAAATAAAACATCAGAGTCATCCGGACTAAGATAGCCGCGTTTGCGCACGGGTGTTCA CAGAATTTTCGACCATGGAACGCGTGGAGCGTCATTACAGCGAGCCGTAGAGTTTGCGCACGGGTGATATATG CAGAATTTTCGACGTCCTGGCAAGTAACTTGTTCACAGCACTTTAAATGATTTGCGCACGGGTGTCCAATGAGA Conserved regions are marked in red. Sample alignment of the data: CAGAATTTTCGACATTCTAAGAAC_C C_TTTAAGTAC_ACCGAA_TCTATCA AACGATACATTTGC_GCACGGGTGG TAG CAGAATTTTCGACGTCCTGGCAAG_TAA C_TTG TT_C_ACAGCA_CTT_T_A AATGAT_T_TGCGC_ACGGGTGTCCAATGAGA CAGAATTTTCGACAG AAGAAAATAAAACATCAGAGTC ATCCGGACT_AAGAT_AGCCGCGTTTGCGC_ACGGGTGTTCA CAGAATTTTCGACATGGAGCAACGATATAT_ATTGACCCTATGCCGGATTCTGCTCTCACTAACTTTGCGC ACGGGTG CAGAATTTTCGACCATGGAACGCGTGGAGCGTCATTACAGCGAGCCGTAGAGTTTGCGCACGGGTGATATATG 10/11/07 21

Effects of Sequence Alignment Data unaligned Data aligned 2 nd set of hot spots 10/11/07 22

TCGR Applications Visualize Structure Identify motifs or conserved regions Predict locations of DNA/RNA features mirna mirna binding site May be generalized to non DNA/RNA strings (temporal spatial data) Has been linked to a modeling prediction technique - EMM 10/11/07 23

Visualize Structure TCGR Mature mirna (Window=5; Pattern=2) C Elegans Homo Sapiens Mus Musculus All Mature All higher level animals mirna have a noticeable CG cold streak 10/11/07 24

Predict mirna site P O S I T I V E N E G A T I V E Data from: C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine, BMC Bioinformatics, vol 6, no 310. 10/11/07 25

TCGR - Not Just a Pretty Picture 1. Represent potential mirna sequence with TCGR sequence of count vectors 2. Create dynamic Markov chain, EMM, using count vectors for known mirna (mirna stem loops, mirna targets) 3. Predict unknown sequence to be mirna (mirna stem loop, mirna target) based on normalized product of transition probabilities along clustering path in EMM 10/11/07 26

EMM Creation 1: <18,10,3,3,1,0,0> 2: <17,10,2,3,1,0,0> 2/3 2/21/1 1/2 2/3 1/2 3: <16,9,2,3,1,0,0> 4: <14,8,2,3,1,0,0> N1 1/3 N2 1/1 N3 1/2 1/1 5: <14,8,2,3,0,0,0> 6: <18,10,3,3,1,1,0.> 10/11/07 27

Cisco Internal VoIP Traffic Data Time Values VoIP traffic data was provided by Cisco Systems and represents logged VoIP traffic in their Richardson, Texas facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17 11:29:11 Time 2003. 10/11/07 28

Seismic Data Example Sensor location Time 10/11/07 29

Conclusions TCGR is a useful new tool for data where composition varies with respect to distance or time. TCGR can be applied to data mining for event detection. Potential applications of TCGR to biological data include motif detection. Careful use of parameters makes TCGR more useful. 10/11/07 30

References [1] C. S. a. A. Consortium, Initial sequence of the chimpanzee genome and comparison with the human genome, Nature, vol. 437, pp. 69-87, 2005. [2] N. Rajewsky, "microrna target predictions in animals," Nat Genet, vol. 38 Suppl 1, pp. S8-S13, 2006. [3] H. J. Jeffrey, Chaos Game Representation of Gene Structure, Nucleic Acids Research, 1990, vol 18, pp 2163-2170. [4] P.J., Deschavanne, A. Giron, J. Vilain, G. Fagot and B. Fertil, Genomic Signature: Characterization and Classification of Species Assessed by Chaos Game Representation of Sequences, Molecular Biol. Evol, 1999, vol 16, pp 1391-1399. [5] Dunham et al, Visualization of DNA/RNA Structure using Temporal CGRs, IEEE BIBE Conference Proceedings, pp171-178, 2006. [6] Margaret Dunham, Yu Meng, and Jie Huang, Extensible Markov Model, Proc. IEEE Int l Conf. Data Mining (ICDM 04), 2004. [7] E. Berezikov, E. Cuppen, and R. H. Plasterk, "Approaches to microrna discovery," Nat Genet, vol. 38 Suppl 1, pp. S2-7, 2006. [8] I. Bentwich, A. Avniel, Y. Karov, R. Aharonov, S. Gilad, O. Barad, A. Barzilai, P. Einat, U. Einav, E. Meiri, E. Sharon, Y. Spector, and Z. Bentwich, "Identification of hundreds of conserved and nonconserved human micrornas," Nat Genet, vol. 37, pp. 766-70, 2005. [9] J. W. Nam, K. R. Shin, J. Han, Y. Lee, V. N. Kim, and B. T. Zhang, "Human microrna prediction through a probabilistic co-learning model of sequence and structure," Nucleic Acids Res, vol. 33, pp. 3570-81, 2005. 10/11/07 31