Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Efficient Selection of Unique and Popular Oligos for Large EST Databases Stefano Lonardi University of California, Riverside joint work with Jie Zheng, Timothy Close, Tao Jiang University of California, Riverside General problem Input: A list of DNA sequences Output: A list of short DNA strings of length -5 bases (oligos) occur only once in each DNA sequence ( unique oligos problem) or occur in as many DNA sequences as possible ( popular oligos problem)

Barley genome (H. vulgare) Size is 5x 9 bases times the size of Rice 5 times the size of Arabidopsis Too large for whole sequencing Strategy Build a BAC library of Barley Identify/sequence only the BACs containing the genes (expected %) Method An EST database for Barley is available Use the EST db to identify a set of popular oligos that hybridize with as many genes/est as possible (maximize coverage) Use as little oligos/filter/screens as possible (minimize time and money)

Objectives Maximize the coverage ratio (number of covered ESTs/number of oligos) Minimize the computational resources (memory, time) Barley EST db Composed by 5K EST sequences Cleaned (quality-trimming, cleaned of contaminants, etc.) Assembled (pre-clustered, assembled) Final dataset (HarvEST v.7) 46,45 unigenes 8,475,6 bases

Related work Pattern discovery (Meme, Teiresias, Pratt, Gibbs, Projection, Weeder, etc.) cannot be used because of the large input size Primer/probe design typically use all-againstall BLAST (eg., [Li&Stormo ], [Rouillard et al. ]) are extremely slow Rahmann [CSB ] uses suffix arrays (requires 5 hours on Compaq Alpha with 6GB RAM on a dataset of 4Mbases) Def: (c,d)-match Given integers c and d and strings w and y, w = y, we say that w (c,d)-match y iff w and y can be partitioned in substrings w=w w w and y=y y y such that w = y and w = y w w w w =y, w = y =c (core) H(w w,y y )=d acaatatgagaccctt agaatatgagacgcat y y l=6, c=8, d= y

Def: (c,d)-coverage Given a set X={x,,x k }, a string y and integers c and d, the (c,d)-coverage of y is the number of sequences of X containing each at least one (c,d)-match of y Integer l to denote the length of y (l-mer) popular oligos problem Given X={x,,x k } and integers l, d, c and T, find all strings of length l such that their (c,d)-coverage in X is =T We call these strings popular oligos In our experiments l=6, c=, d= or, T= 5

Observations Note that a popular oligo may never appear exactly in X Enumerating/counting all l c ( ) d d Σ possible (c,d)-matches of each l-mer in X is computationally impractical For example, if l=6, c=, d=, S =4, one should count 5K (,)-matches for each 6mer. We have *8M 6mers, for a total of 846B elementary operations Heuristics: phase one Build an hash table for the cores For each core w that appears in =T c (core coverage threshold) sequences Collect all flanking regions w w, such that w w w is an l-mer with popular core w Run phase two on set of all extensions w w

Example: phase one >EST TGGAGTCCTCGGACACGATCACATCGACAATGTGAA GGCGA >EST GTGAAGGAGGTAGATCAAATAGAGCCTGCCCTAAAA AGGCAGCTTATAATCTCCACTGCT >EST TCCGACTACTGCACCCCGAGCGGATCACACAATGGAA GGCCCGTGCGC >EST GTGAAGGAGGTAGATACTCGTATACGATCACTGCCTA AAAAGGCAGCTTATAATCTCCATATCGCTG GAAGG TGAAG ATCAC AAGGC GTGAA GATCA 7 4 6 5 4 6 5 4 5 ACTGC 54 7 9 l=8, c=5, d=, T c = Example: phase one >EST TGGAGTCCTCGGACACGATCACATCGACAATGTGAA GGCGA >EST GTGAAGGAGGTAGATCAAATAGAGCCTGCCCTAAAA AGGCAGCTTATAATCTCCACTGCT >EST TCCGACTACTGCACCCCGAGCGGATCACACAATGGAA GGCCCGTGCGC >EST GTGAAGGAGGTAGATACTCGTATACGATCACTGCCTA AAAAGGCAGCTTATAATCTCCATATCGCTG GAAGG TGAAG ATCAC AAGGC GTGAA GATCA 7 4 6 5 4 6 5 4 5 ACTGC 54 7 9 l=8, c=5, d=, T c =

Heuristics: phase two (UPGMA) Place all w w at the leaves of the tree & merge identical leaves Build the UPGMA* tree on Hamming distance Create a set of d-mutants for each string in the leaves of the tree Traverse the tree bottom-up performing set intersection as soon as intersection is empty, separate the subtree from the rest of the tree The sets at the root of each tree in the forest represent the candidate popular oligo * Unweighed Pair Group Method with Arithmetic Mean Example : phase two (UPGMA) core AAGGC occurrences GTG AAGGC GA. GTG. TGG AAA AAGGC AGC. AAA. AAA TGG AAGGC CCG. TGG. GGC AAA AAGGC AGC 4. AAA 4. AAA set set core flanking region. GGA. AAG. AGC. GCC. CCG 4. AAG. AGC set set 4 l=8, c=5, d=, T c =

Example : phase two (UPGMA) core AAGGC occurrences GTG AAGGC GA. GTG. TGG AAA AAGGC AGC. AAA. AAA TGG AAGGC CCG. TGG. GGC AAA AAGGC AGC 4. AAA 4. AAA set set core flanking region. GGA. AAG. AGC. GCC. CCG 4. AAG. AGC set set 4 l=8, c=5, d=, T c = Example : phase two (UPGMA) set 4 TGG AAA GGC AAA 4 Before compression make tree I I AAA AAA 4 TGG GCG I (, 4) TGG I make tree (, 4) AAA I GGC TGG GCG (, AAA 4) After compression I l=8, c=5, d=, T c =

Example : phase two (UPGMA) set 4 TGG AAA GGC AAA 4 Before compression make tree AAA AAA 4 TGG GCG H(AAA,AAA)= (, 4) I TGG I make tree (, 4) AAA I GGC TGG GCG (, AAA 4) After compression I I I l=8, c=5, d=, T c = Example : phase two (UPGMA) -mutants of AAA: -mutants of TGG: -mutants of GGC: CAA AGG AGC GAA TAA ACA AGA ATA AAC AAG AAT CGG GGG TAG TCG TTG TGA TGC TGT CGC TGC GAC GCC GTC GGA GGG GGT AAA TGG GCG I I (, 4) TGG GCG AAA = emtpy I I cut tree TGG GCG cluster I (, AAA 4) cluster Candidates (from core AAGGC): AAAAGGCA AGAAGGCA GGAAGGCG CAAAGGCA ATAAGGCA ACAAGGCA TGAAGGCC GAAAGGCA AAAAGGCC TAAAGGCA AAAAGGCG AAAAGGCT l=8, c=5, d=, T c =

Example : phase two (UPGMA) -mutants of AAA: -mutants of TGG: -mutants of GGC: CAA AGG AGC GAA TAA ACA AGA ATA AAC AAG AAT CGG GGG TAG TCG TTG TGA TGC TGT CGC TGC GAC GCC GTC GGA GGG GGT AAA TGG GCG I I (, 4) TGG GCG AAA = emtpy I I cut tree TGG GCG cluster I (, AAA 4) cluster Candidates (from core AAGGC): AAAAGGCA AGAAGGCA GGAAGGCG CAAAGGCA ATAAGGCA ACAAGGCA TGAAGGCC GAAAGGCA AAAAGGCC TAAAGGCA AAAAGGCG AAAAGGCT l=8, c=5, d=, T c = Heuristics: phase three Radix sort the candidate oligos to remove duplicates Discard unsuitable oligos low-complexity strings (polya, polyt, etc.) 44% < GC-content < 56% Compute coverage Compress/correct oligos

Overview: phase one Output oligos Input EST Hashing Compute Coverage Table of seeds Table of popular cores Compression & correction Select Compute Coverage Collect flanking regions 7 sets of 6-mers that share the core at a specific position... Discard unsuitable oligos set set set 7 Build tree Compute candidates Cut tree List of candidates UPGMA tree Overview: phase two (UPGMA) Output oligos Input EST Hashing Compute Coverage Table of seeds Table of popular cores Compression & correction Select Compute Coverage Collect flanking regions 7 sets of 6-mers that share the core at a specific position... Discard unsuitable oligos set set set 7 Build tree Compute candidates Cut tree List of candidates UPGMA tree

Overview: phase three Output oligos Input EST Hashing Compute Coverage Table of seeds Table of popular cores Compression & correction Select Compute Coverage Collect flanking regions 7 sets of 6-mers that share the core at a specific position... Discard unsuitable oligos set set set 7 Build tree Compute candidates Cut tree List of candidates UPGMA tree Limitation of the heuristics Cores which have coverage below T c are called unpopular A l-mer can (c,d)-match with any of its l- c+ cores We will miss popular oligos which popularity depend on a combination of several unpopular cores

Simulations Generate {x,,x k } random sequences Inject {I,,I s } popular oligos with d errors outside a core of length c, with coverage {C,,C s } (Gaussian distribution, max coverage R) Run the popular oligo algorithm on {x,,x k } Simulation Obtain {O,,O t } with coverage {C,,C s } (sorted) {O,,O t } is compressed Compare (I,C) with (O,C ) For each =i=u for u=min(s,t) we compute u Ci C ECC (, ') = u C ' i= i i '

Simulation results k=, x i =7, c=, s=, R= E* T c = T c =5 T c = T c =5 T c = d=.89.6.7..5 We never miss any oligo whose coverage is above T c + d= 5..7.6 Experimental results l=6 (oligo length) c= (core length) d=, (max mismatches outside core) T c =varies (core coverage threshold) k=46,45 unigenes n=8 million bases PC with. GHz CPU and GB memory

Coverage graph 78 8½h 896 9 ½h 8 ½h 7 5 5 5 5 4 45 5 55 core coverage threshold unigene covered oligos candidates (M) time (min) coverage ratio Current & Future work Progressive processing to reduce memory requirements Fine tuning & optimization of the code New strategies to improve coverage ratio New definition for popular/unique oligos Parallel implementation

Complexity Build a seed table O(cn) Collect flanking substrings O(nr(l-c)) where r is # occurrences of cores Building UPGMA l c O r d Counting colors for m candidate O(rm(l-c)) d

UPGMA + intersection TGG GGG TGC GCG AAA