Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Similar documents
Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

by the Genevestigator program ( Darker blue color indicates higher gene expression.

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

BIOINFORMATICS. Efficient selection of unique and popular oligos for large EST databases

HP22.1 Roth Random Primer Kit A für die RAPD-PCR

Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells

Appendix A. Example code output. Chapter 1. Chapter 3

Supplementary Table 1. Data collection and refinement statistics

TCGR: A Novel DNA/RNA Visualization Technique

Digging into acceptor splice site prediction: an iterative feature selection approach

Graph Algorithms in Bioinformatics

Machine Learning Classifiers

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

DNA Sequencing. Overview

warm-up exercise Representing Data Digitally goals for today proteins example from nature

Sequence Assembly Required!

10/8/13 Comp 555 Fall

Genome 373: Genome Assembly. Doug Fowler

10/15/2009 Comp 590/Comp Fall

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot)

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame

Scalable Solutions for DNA Sequence Analysis

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization

Algorithms for Bioinformatics

DNA Fragment Assembly

Degenerate Coding and Sequence Compacting

Research Article Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

A relation between trinucleotide comma-free codes and trinucleotide circular codes

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

BLAST & Genome assembly

Purpose of sequence assembly

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

BIOINFORMATICS APPLICATIONS NOTE

How to Run NCBI BLAST on zcluster at GACRC

BLAST. Basic Local Alignment Search Tool. Used to quickly compare a protein or DNA sequence to a database.

de Bruijn graphs for sequencing data

Supplementary Materials:

Algorithms for Bioinformatics

Graph Algorithms in Bioinformatics

Assembly in the Clouds

Genome 373: Mapping Short Sequence Reads I. Doug Fowler

MLiB - Mandatory Project 2. Gene finding using HMMs

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!

Omega: an Overlap-graph de novo Assembler for Metagenomics

arxiv: v1 [cs.db] 30 Sep 2011

Meraculous De Novo Assembly of the Ariolimax dolichophallus Genome. Charles Cole, Jake Houser, Kyle McGovern, and Jennie Richardson

Preliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification

Algorithms and Data Structures

Parallel de novo Assembly of Complex (Meta) Genomes via HipMer

RESEARCH TOPIC IN BIOINFORMANTIC

BLAST & Genome assembly

Hybrid Parallel Programming

Indexing and Searching

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

Study of Data Localities in Suffix-Tree Based Genetic Algorithms

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

High-Performance Graph Traversal for De Bruijn Graph-Based Metagenome Assembly

DNA Fragment Assembly

NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

Bioinformatics I, WS 09-10, D. Huson, February 10,

Read Mapping and Assembly

Application of the BWT Method to Solve the Exact String Matching Problem

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

ngaio.)' Thif. as you ore doubtless aware U the

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Evaluation of Relational Operations: Other Techniques

SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche

An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data

COMP 204: Dictionaries Recap & Sets

Clustering Billions of Images with Large Scale Nearest Neighbor Search

11/5/13 Comp 555 Fall

(for more info see:

Shotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.

Special course in Computer Science: Advanced Text Algorithms

OFFICE OF RESEARCH AND SPONSORED PROGRAMS

Evaluation of Relational Operations: Other Techniques

Implementation of Relational Operations: Other Operations

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

Indexing Variable Length Substrings for Exact and Approximate Matching

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Sequence Design Problems in Discovery of Regulatory Elements

Database Applications (15-415)

NextGenMap and the impact of hhighly polymorphic regions. Arndt von Haeseler

A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving on DNA-encoded Data

Sequencing Alignment I

Evaluation of Relational Operations: Other Techniques

Scrible: Ultra-Accurate Error-Correction of Pooled Sequenced Reads

Transcription:

Efficient Selection of Unique and Popular Oligos for Large EST Databases Stefano Lonardi University of California, Riverside joint work with Jie Zheng, Timothy Close, Tao Jiang University of California, Riverside General problem Input: A list of DNA sequences Output: A list of short DNA strings of length -5 bases (oligos) occur only once in each DNA sequence ( unique oligos problem) or occur in as many DNA sequences as possible ( popular oligos problem)

Barley genome (H. vulgare) Size is 5x 9 bases times the size of Rice 5 times the size of Arabidopsis Too large for whole sequencing Strategy Build a BAC library of Barley Identify/sequence only the BACs containing the genes (expected %) Method An EST database for Barley is available Use the EST db to identify a set of popular oligos that hybridize with as many genes/est as possible (maximize coverage) Use as little oligos/filter/screens as possible (minimize time and money)

Objectives Maximize the coverage ratio (number of covered ESTs/number of oligos) Minimize the computational resources (memory, time) Barley EST db Composed by 5K EST sequences Cleaned (quality-trimming, cleaned of contaminants, etc.) Assembled (pre-clustered, assembled) Final dataset (HarvEST v.7) 46,45 unigenes 8,475,6 bases

Related work Pattern discovery (Meme, Teiresias, Pratt, Gibbs, Projection, Weeder, etc.) cannot be used because of the large input size Primer/probe design typically use all-againstall BLAST (eg., [Li&Stormo ], [Rouillard et al. ]) are extremely slow Rahmann [CSB ] uses suffix arrays (requires 5 hours on Compaq Alpha with 6GB RAM on a dataset of 4Mbases) Def: (c,d)-match Given integers c and d and strings w and y, w = y, we say that w (c,d)-match y iff w and y can be partitioned in substrings w=w w w and y=y y y such that w = y and w = y w w w w =y, w = y =c (core) H(w w,y y )=d acaatatgagaccctt agaatatgagacgcat y y l=6, c=8, d= y

Def: (c,d)-coverage Given a set X={x,,x k }, a string y and integers c and d, the (c,d)-coverage of y is the number of sequences of X containing each at least one (c,d)-match of y Integer l to denote the length of y (l-mer) popular oligos problem Given X={x,,x k } and integers l, d, c and T, find all strings of length l such that their (c,d)-coverage in X is =T We call these strings popular oligos In our experiments l=6, c=, d= or, T= 5

Observations Note that a popular oligo may never appear exactly in X Enumerating/counting all l c ( ) d d Σ possible (c,d)-matches of each l-mer in X is computationally impractical For example, if l=6, c=, d=, S =4, one should count 5K (,)-matches for each 6mer. We have *8M 6mers, for a total of 846B elementary operations Heuristics: phase one Build an hash table for the cores For each core w that appears in =T c (core coverage threshold) sequences Collect all flanking regions w w, such that w w w is an l-mer with popular core w Run phase two on set of all extensions w w

Example: phase one >EST TGGAGTCCTCGGACACGATCACATCGACAATGTGAA GGCGA >EST GTGAAGGAGGTAGATCAAATAGAGCCTGCCCTAAAA AGGCAGCTTATAATCTCCACTGCT >EST TCCGACTACTGCACCCCGAGCGGATCACACAATGGAA GGCCCGTGCGC >EST GTGAAGGAGGTAGATACTCGTATACGATCACTGCCTA AAAAGGCAGCTTATAATCTCCATATCGCTG GAAGG TGAAG ATCAC AAGGC GTGAA GATCA 7 4 6 5 4 6 5 4 5 ACTGC 54 7 9 l=8, c=5, d=, T c = Example: phase one >EST TGGAGTCCTCGGACACGATCACATCGACAATGTGAA GGCGA >EST GTGAAGGAGGTAGATCAAATAGAGCCTGCCCTAAAA AGGCAGCTTATAATCTCCACTGCT >EST TCCGACTACTGCACCCCGAGCGGATCACACAATGGAA GGCCCGTGCGC >EST GTGAAGGAGGTAGATACTCGTATACGATCACTGCCTA AAAAGGCAGCTTATAATCTCCATATCGCTG GAAGG TGAAG ATCAC AAGGC GTGAA GATCA 7 4 6 5 4 6 5 4 5 ACTGC 54 7 9 l=8, c=5, d=, T c =

Heuristics: phase two (UPGMA) Place all w w at the leaves of the tree & merge identical leaves Build the UPGMA* tree on Hamming distance Create a set of d-mutants for each string in the leaves of the tree Traverse the tree bottom-up performing set intersection as soon as intersection is empty, separate the subtree from the rest of the tree The sets at the root of each tree in the forest represent the candidate popular oligo * Unweighed Pair Group Method with Arithmetic Mean Example : phase two (UPGMA) core AAGGC occurrences GTG AAGGC GA. GTG. TGG AAA AAGGC AGC. AAA. AAA TGG AAGGC CCG. TGG. GGC AAA AAGGC AGC 4. AAA 4. AAA set set core flanking region. GGA. AAG. AGC. GCC. CCG 4. AAG. AGC set set 4 l=8, c=5, d=, T c =

Example : phase two (UPGMA) core AAGGC occurrences GTG AAGGC GA. GTG. TGG AAA AAGGC AGC. AAA. AAA TGG AAGGC CCG. TGG. GGC AAA AAGGC AGC 4. AAA 4. AAA set set core flanking region. GGA. AAG. AGC. GCC. CCG 4. AAG. AGC set set 4 l=8, c=5, d=, T c = Example : phase two (UPGMA) set 4 TGG AAA GGC AAA 4 Before compression make tree I I AAA AAA 4 TGG GCG I (, 4) TGG I make tree (, 4) AAA I GGC TGG GCG (, AAA 4) After compression I l=8, c=5, d=, T c =

Example : phase two (UPGMA) set 4 TGG AAA GGC AAA 4 Before compression make tree AAA AAA 4 TGG GCG H(AAA,AAA)= (, 4) I TGG I make tree (, 4) AAA I GGC TGG GCG (, AAA 4) After compression I I I l=8, c=5, d=, T c = Example : phase two (UPGMA) -mutants of AAA: -mutants of TGG: -mutants of GGC: CAA AGG AGC GAA TAA ACA AGA ATA AAC AAG AAT CGG GGG TAG TCG TTG TGA TGC TGT CGC TGC GAC GCC GTC GGA GGG GGT AAA TGG GCG I I (, 4) TGG GCG AAA = emtpy I I cut tree TGG GCG cluster I (, AAA 4) cluster Candidates (from core AAGGC): AAAAGGCA AGAAGGCA GGAAGGCG CAAAGGCA ATAAGGCA ACAAGGCA TGAAGGCC GAAAGGCA AAAAGGCC TAAAGGCA AAAAGGCG AAAAGGCT l=8, c=5, d=, T c =

Example : phase two (UPGMA) -mutants of AAA: -mutants of TGG: -mutants of GGC: CAA AGG AGC GAA TAA ACA AGA ATA AAC AAG AAT CGG GGG TAG TCG TTG TGA TGC TGT CGC TGC GAC GCC GTC GGA GGG GGT AAA TGG GCG I I (, 4) TGG GCG AAA = emtpy I I cut tree TGG GCG cluster I (, AAA 4) cluster Candidates (from core AAGGC): AAAAGGCA AGAAGGCA GGAAGGCG CAAAGGCA ATAAGGCA ACAAGGCA TGAAGGCC GAAAGGCA AAAAGGCC TAAAGGCA AAAAGGCG AAAAGGCT l=8, c=5, d=, T c = Heuristics: phase three Radix sort the candidate oligos to remove duplicates Discard unsuitable oligos low-complexity strings (polya, polyt, etc.) 44% < GC-content < 56% Compute coverage Compress/correct oligos

Overview: phase one Output oligos Input EST Hashing Compute Coverage Table of seeds Table of popular cores Compression & correction Select Compute Coverage Collect flanking regions 7 sets of 6-mers that share the core at a specific position... Discard unsuitable oligos set set set 7 Build tree Compute candidates Cut tree List of candidates UPGMA tree Overview: phase two (UPGMA) Output oligos Input EST Hashing Compute Coverage Table of seeds Table of popular cores Compression & correction Select Compute Coverage Collect flanking regions 7 sets of 6-mers that share the core at a specific position... Discard unsuitable oligos set set set 7 Build tree Compute candidates Cut tree List of candidates UPGMA tree

Overview: phase three Output oligos Input EST Hashing Compute Coverage Table of seeds Table of popular cores Compression & correction Select Compute Coverage Collect flanking regions 7 sets of 6-mers that share the core at a specific position... Discard unsuitable oligos set set set 7 Build tree Compute candidates Cut tree List of candidates UPGMA tree Limitation of the heuristics Cores which have coverage below T c are called unpopular A l-mer can (c,d)-match with any of its l- c+ cores We will miss popular oligos which popularity depend on a combination of several unpopular cores

Simulations Generate {x,,x k } random sequences Inject {I,,I s } popular oligos with d errors outside a core of length c, with coverage {C,,C s } (Gaussian distribution, max coverage R) Run the popular oligo algorithm on {x,,x k } Simulation Obtain {O,,O t } with coverage {C,,C s } (sorted) {O,,O t } is compressed Compare (I,C) with (O,C ) For each =i=u for u=min(s,t) we compute u Ci C ECC (, ') = u C ' i= i i '

Simulation results k=, x i =7, c=, s=, R= E* T c = T c =5 T c = T c =5 T c = d=.89.6.7..5 We never miss any oligo whose coverage is above T c + d= 5..7.6 Experimental results l=6 (oligo length) c= (core length) d=, (max mismatches outside core) T c =varies (core coverage threshold) k=46,45 unigenes n=8 million bases PC with. GHz CPU and GB memory

Coverage graph 78 8½h 896 9 ½h 8 ½h 7 5 5 5 5 4 45 5 55 core coverage threshold unigene covered oligos candidates (M) time (min) coverage ratio Current & Future work Progressive processing to reduce memory requirements Fine tuning & optimization of the code New strategies to improve coverage ratio New definition for popular/unique oligos Parallel implementation

Complexity Build a seed table O(cn) Collect flanking substrings O(nr(l-c)) where r is # occurrences of cores Building UPGMA l c O r d Counting colors for m candidate O(rm(l-c)) d

UPGMA + intersection TGG GGG TGC GCG AAA