12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006

Size: px

Start display at page:

Download "12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006"

Virgil White
5 years ago
Views:

12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 3 Sequence comparison by compression This chapter is based on the following articles, which are all recommended reading: X.

Ma, Chain letters and evolutionary histories, Scientific American, 64-69, June 2003 C.H. Bennett, M. Li and B. Ma, Linking chain letters, manuscript, http://www.cs.uwaterloo. ca/~mli/chain.html.

1 12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, Sequence comparison by compression This chapter is based on the following articles, which are all recommended reading: X. Chen, S. Kwong, and M. Li. A compression algorithm for DNA sequences and its application in genome comparisons. Genome Informatics, 10:51 61, C.H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories, Scientific American, 64-69, June 2003 C.H. Bennett, M. Li and B. Ma, Linking chain letters, manuscript, ca/~mli/chain.html. 3 Motivation We often use similarity as a marker for homology. And use homology to infer function. Sometimes, we are only interested in a numerical distance between two sequences. For example, to infer a phylogeny. Figure adapted from Text- vs DNA compression Programs such as compress, gzip or zip are routinely used to compress text files. Given a text file containing DNA, this can also be compressed using such programs. For example, a text file containing chromosome 19 of human in fasta format is approximately 61 MB. Application of unix compress produces a file of size 8.5 MB. In such a text file, 8 bits are used for each character. However, as DNA consists of only four different bases, encoding requires at most two bits per base, e.g. using: A = 00, C = 01, G = 10, and T = 11. Applying a standard compression algorithm to a file containing DNA encoded using two bits per base will usually not be able to compress the file further.

Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 13 3.2 The repetitive nature of DNA Take advantage of the repetitive nature of DNA!

2 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, The repetitive nature of DNA Take advantage of the repetitive nature of DNA!! LINEs (Long Interspersed Nuclear Elements), SINEs.. UCSC Genome Browser: DNA compression Standard compression algorithms are aimed at natural language text and are not aimed at making use of the regularities in DNA sequences. However, we expect DNA sequences to be very compressible, especially for higher eukaryotes: they containing many repeats of different size, with different numbers of instances and different amounts of identity. A first simple idea. While processing the DNA string from left to right, detect exact repeats and/or palindromes (reverse-complemented repeats) that possess previous instances in the already processed text and encode them by the length and position of an earlier occurrence. For stretches of sequence for which no significant repeat is found, use two-bit encoding. (The program Biocompress is based on this idea.) Data structure for fast access to sequence patterns already encountered. Sliding window along unprocessed sequence. current position processed sequence segment unprocessed sequence segment data structure check for previous occurrences

3 14 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 A second simple idea. Build a suffix tree for the whole sequence and use it to detect maximal repeats of some fixed minimum size. Then code all repeats as above and use two-bit encoding for bases not contained inside repeats. (The program Cfact is based on this idea.) Both of these algorithms are lossless, meaning that the original sequences can be precisely reconstructed from their encodings. An number of lossy algorithms exist, which we will not discuss here. In the following we will discuss the GenCompress algorithm due to Xin Chen, Sam Kwong and Ming Li. 3.4 Edit operations The main idea of GenCompress is to use inexact matching, followed by edit operations. In other words, instances of inexact repeats are encoded by a reference to an earlier instance of the repeat, followed by some edit operations that modify the earlier instance to obtain the current instance. Three standard edit operations are considered: 1. Replace. This operation (R, i, char) consists of replacing the character at position i by character char. 2. Insert. This operation (I, i, char) consists of inserting character char at position i. 3. Delete. This operation (D, i) consists of deleting the character at position i. Let C denote copy. Here are two different ways to convert the string gaccgtcatt to gaccttcatt via different edit operation sequences: (a) (b) C C C C R C C C C C g a c c g t c a t t g a c c t t c a t t C C C C D C I C C C C g a c c g t c a t t g a c c t t c a t t Case (a) involves one replacement and case (b) involves a deletion and an insertion. Clearly, there are an infinite number of ways to convert one string into another. Given two strings q and p. An edit transcript λ(q, p) is a list of edit operations that transforms q into p. For example, in case (a) the edit transcript is: whereas in case (b) it is: λ(gaccgtcatt, gaccttcatt) = (R, 4, t), λ(gaccgtcatt, gaccttcatt) = (D, 4), (I, 4, g). (We are assuming that positions start at 0 and that positions are given relative to current state of the string, as obtained by application of any preceding edit operations.) If we know the string q and are given an edit transcript λ(q, p), then we can construct p correctly from q using the transcript. or 3.5 Encoding DNA We discuss four ways to encode a DNA string, e.g. gaccttcatt.

4 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, (1) Using the two-bit encoding method, gaccttcatt can be encoded in 20 bits: The following three encode gaccttcatt relative to gaccgtcatt: (2) In the exact matching method we use a pair of numbers (repeat position, repeat length) to represent an exact repeat. We can encode gaccttcatt as (0, 4), t, (5, 5), relative to gaccgtcatt. Let us use four bits to encode an integer, two bits to encode a base and one bit to indicate whether the next part is a pair (indicating a repeat) or a base. We obtain an encoding in 21 bits: (3) In the approximate matching method we can encode gaccttcatt as (0, 10), (R, 4, t) relative to gaccgtcatt. Let use encode R by 00, I by 01, D by 11 and use a single bit to indicate whether the next part is a pair or a triplet. We obtain an encoding in 18 bits: (4) For approximate matching, we could also use the edit sequence (D, 4), (I, 4, t), for example, yielding the relative encoding (0, 10), (D, 4), (I, 4, t), which uses 25 bits: The approximate matching method (3) uses the smallest number of bits. (Note that there is a mistake in the discussion of these cases in the paper by Chen et al.) 3.6 GenCompress: An algorithm based on approximate matching GenCompress is a one-pass algorithm based on approximate matching that processes as follows: For a given input string w, assume that a part v has already been compressed and the remaining part is u, with w = vu. The algorithm finds an optimal prefix p of u that approximately matches some substring of v such that p can be encoded economically. After outputting the encoding of p, remove the prefix p from u and append it to v. If no optimal prefix is found, output the next base and then move it from u to v. Repeat until u = ɛ. 3.7 The condition C How do we find a optimal prefix p? The following condition will be used to limit the search. Given two numbers k and b. Let p be a prefix of the unprocessed part u and q a substring of the processed part v. If the length of q > k, then any transcript λ(q, p) is said to satisfy the condition C = (k, b) for compression, if its number of edit operations is b. Experimental studies indicate that C = (k, b) = (12, 3) gives good results. In other words, when attempting to determine the optimal prefix for compression, we will only consider repeats of length k that require at most b edit operations.

5 16 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, The compression gain function We define a compression gain function G to determine whether a particular approximate repeat q, p and edit transcript λ are beneficial for the encoding: where p is a prefix of the unprocessed part u, G(q, p, λ) = max {2 p (i, q ) w λ λ(q, p) c, 0}. q is a substring of the processed part v of length q that starts at position i, 2 p is the number of bits that the two-bit encoding would use, (i, q ) is the encoding size of (i, q ), w λ is the cost of encoding an edit operation, λ(q, p) is the number of edit operations in λ(q, p), and c is the overhead proportional to the size of control bits. Let w = vu be an input sequence and assume that we have already processed v. Given G(q, p, λ) and C = (k, b), an optimal prefix is a prefix p of u such that G(q, p, λ) is maximized over all λ and q, where q is a substring of v and λ is an edit transcript from q to p satisfying condition C. In the algorithm, prefixes are encoding using the repeat representation only if the compression gain function yields a positive value, otherwise the two-bit encoding (or more advanced encoding of the individual bases) is used. 3.9 The GenCompress algorithm Here is a high-level description of the GenCompress algorithm: Algorithm (GenCompress) Input: A DNA sequence w, parameter C = (k, b) Output: A lossless compressed encoding of w Initialization: u = w and v = ɛ while u ɛ do Search for an optimal prefix p of u if an optimal prefix p with repeat q in v is found then Encode the repeat representation (i, q ), where i is the position of q in v, together with the shortest edit transcript λ(q, p). Output the code. else Set p equal to the first character of u, encode and output it. Remove the prefix p from u and append it to v end 3.10 Implementing the optimal prefix search An exhaustive search for the optimal prefix in the main step of the algorithm is practical only for very short sequences. The goal is to reduce the search space of possible repeats without significantly reducing the compression ratio of the algorithm. Some algorithm engineering is necessary to perform the optimal prefix search in reasonable time. We briefly discuss three ingredients that are used.

Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 17 Lemma 3.10.1 An optimal prefix p always ends right before a mismatch. Lemma 3.10.2 Let λ(q, p) be an optimal edit sequence from q to p.

6 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, Lemma An optimal prefix p always ends right before a mismatch. Lemma Let λ(q, p) be an optimal edit sequence from q to p. If qi is copied onto pj in λ, then λ restricted to (q0:i, p0:j ) is an optimal transcript from q0:i = q0 q1... qi to p0:j = p0 p1..., pj. (Proof: see paper for details.) The search is simplified as follows: to find an approximate match for p in v, we look for an exact match of the first l bases in p, where l is a fixed small number. To speed this up, an integer is stored at each position i of v that is determined by the the word of length l starting at i. Here is an outline of the optimal prefix search: 1. Let w = vu where v has already been compressed. 2. Find all occurrences u0:l in v, for some small l. For each such occurrence, try to extend it, allowing mismatches, limited by the above observations and condition C. Return the prefix p with the largest compression gain value G Performance of GenCompress Since any nucleotide can be encoded canonically using 2 bits, we define the compression ratio of a compression algorithm as O 1, 2 I where I is the number of bases in the input DNA sequence and O is the length (number of bits) of the output sequence. Alternatively, if our DNA string is already encoded canonically, we can define the compression ratio of a compression algorithm as O, 1 I where I is the number of bits in the canonical encoding of the input DNA sequence and O is the length (number of bits) of the output sequence. The following table shows the performance of GenCompress on some standard benchmark data sets: (Source: Chen, Kwong and Li, 1999) 3.12 Conditional compression In conditional compression we want to compress a sequence w relative to a given sequence z. E.g., in the case of GenCompress, when compressing w we are also allowed to repeat-encode prefixes of w relative to approximate matches in z.

7 18 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 Let Compress(w z) denote the length of the compression of w, given z. Similarly, let Compress(w) = Compress(w ɛ). Note that, in general, Compress(w z) Compress(z w). This is not only a theoretical result, but also true in practice. For example, the Biocompress-2 program produces: CompressRatio(brucella rochalima) = 55.95%, and CompressRatio(rochalima brucella) = 34.56%. Let us consider two extreme cases. If z = w, then Compress(w z) will be very small, as w is obtained from z as a single repeat. On the other hand, if the relationship between w and z is random, then Compress(w z) Compress(w). This, and the fact that the approximate matching and edit operations employed in the compression algorithm seem biologically reasonable, raises the following question: Can one define a measure of evolutionary distance between sequences based on their conditional compressibility? One idea that has been proposed in the literature is to use D(w, z) := but there is no good reason to do so. Compress(w z) + Compress(z w), A symmetric measure of similarity Let K(w z) denote the Kolmogorov complexity of string w, given z. Informally, this is the length of the shortest program that outputs w given z. Similarly, set K(w) = K(w ɛ). The following result is due to Kolmogorov and Levin: Theorem Within an additive logarithmic factor, K(w z) + K(z) = K(z w) + K(w). This implies K(w) K(w z) = K(z) K(z w). Normalizing by the sequence length we obtain the following symmetric map: R(w, z) = K(w) K(w z) K(wz) = K(z) K(z w), K(wz) describing the mutual information or relatedness of two sequences. ( wz means concatenation of w and z) A distance between the two sequences w and z is given by 1 R(w, z). Unfortunately, the Kolmogorov complexity K(w z) is incomputable!

Hence, we obtain a mutual-information distance between any two DNA sequences w and z as: D(w, z) := 1 1 Compress(w) Compress(w z) Compress(wz) Compress(z)

14 Application to genomic sequences One proposed domain of application is to DNA sequences (e.g. 16S rrna) of distantly related species.

8 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, The authors of GenCompress propose to use Compress(w) to approximate K(w) and to use Compress(w z) to approximate K(w z). Hence, we obtain a mutual-information distance between any two DNA sequences w and z as: D(w, z) := 1 1 Compress(w) Compress(w z) Compress(wz) Compress(z) Compress(z w), Compress(wz) ranging between 0 for identical sequences and 1 for independent ones Application to genomic sequences One proposed domain of application is to DNA sequences (e.g. 16S rrna) of distantly related species. Given the following data: Archaea Bacteria: Archaeoglobus fulgidus, Pyrococcus abyssi and Pyrococcus horikoshii, and Bacteria: Eshcerichia coli K-12 MG1655, Haemophilus influenzae Rd, Helicobacter pylori and Helicbacter pylori, strain J99. Application of GenCompress gives rise to the following data: and, based on this, the following tree:

20 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 3.15 Chain letters Charles Bennet, Ming Li and Bin Ma studied 33 versions of a chain letter, collected 1980-1995.

9 20 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, Chain letters Charles Bennet, Ming Li and Bin Ma studied 33 versions of a chain letter, collected These were in circulation when photocopiers, but not , were already widely used: This chain letter and its variants provide a good analogy to the evolution of genes: they have a similar length to genes, they provide some motivation for their replication (otherwise, you will have a lot of bad luck!), they are subject to copy errors or mutations, such mutations occur with different frequency in different parts of the string, depending on the meaning of the region, and they are attractive to study because they are human readable Differences between the letters The 33 letters can be downloaded from: They differ significantly, e.g. there are: 15 different titles, 23 different names for an office employee, 25 names for the author of the articles, and many misspellings, missing or added phrases, sentences and paragraphs, and swapped sentences. These mutations occurred whenever a photocopy of a photocopy of a photocopy etc became so illegible that a new copy had to be typed.

10 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, Chain letter phylogeny Bennet, Li and Ma have studied the phylogeny of the letters using the above described compressiondistance approach. First, they computed the mutual information distance between every pair of letters. Then they applied a number of different distance based methods to compute trees on the resulting distance matrix. See their article in Scientific American (June 2003) for a detailed discussion. The following result was obtained:

Lecture 9: Core String Edits and Alignments

Biosequence Algorithms, Spring 2005 Lecture 9: Core String Edits and Alignments Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 9: String Edits and Alignments p.1/30 III: