LAB # 3 / Project # 1


DEI, Departamento de Engenharia Informática
Algorithms for Discrete Structures 2011/2012

Matching Proteins

This is a lab guide for Algorithms for Discrete Structures. These exercises are for the classes of October 13, 20 and 27. The resulting software, along with a report, must be delivered by October 28. The project will be discussed in the class of November 3. Students should not use existing code for the algorithms described in the project, either from software libraries or from other electronic sources. It is important to read the full description of the project before starting to design and implement the solution. Students should deliver a working implementation of the project described, along with a report that includes the experimental setups and a time and space analysis, both theoretical and experimental, of the algorithms implemented.

1 Online Search

In this project students will implement an efficient algorithm for matching small patterns against a large text. This problem is recurrent in computer science and especially in bioinformatics applications, so we will use a reference database of protein sequences. Download the Swiss-Prot database from the following URL:

    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz

Decompress the file and take a look inside. The file contains the sequences of several proteins, separated by comment lines. The comment lines have the following structure:

    >sp|REFERENCE|Description

Each comment line starts with >, followed by the string sp, followed by the protein reference between | characters, and ends with a brief description of the protein. This is a very elementary FASTA format. The sequence of the respective protein appears in the following lines, until another comment line or the end of the file is reached. Notice that the lines are no longer than 60 characters, but the protein

sequence is obtained by removing the newline characters and concatenating the lines.

In a first step the program should read a file like the one just described and load it into memory. Some processing is required to remove the newlines and to separate the different sequences, so that occurrences of the pattern do not spread across more than one protein. After preprocessing the database, it is time to implement an efficient matching algorithm. Some of the characters in the database actually represent other characters, or combinations of other characters. We will ignore this fact and interpret the characters literally.

Let us start by implementing the Shift-And algorithm. Recall that for a pattern P = MAFS the Shift-And algorithm simulates the following non-deterministic automaton by using the bitwise & and << operations.

[Figure: non-deterministic automaton for P = MAFS, with consecutive transitions labelled M, A, F, S]

The arrow indicates the initial state and the double-bordered state is the final state. It is safe to assume that we will not be searching for patterns longer than 30 characters. Therefore the automaton needs only one processor word.

When an occurrence is found, it is reported by indicating the reference of the sequence and the position. For instance, a search for "DPLVSAE" in the Swiss-Prot database should return "Q6GZX3 89". Test your implementation with patterns of different sizes and include the experimental results in the report. Also include the amount of memory required by your program. In C the execution time can be measured with the ftime function. Memory can be measured with the massif tool of valgrind.

2 Indexed Search

The results obtained with the previous algorithm should present a convincing case that direct search over the text is slow. Even if the search time is acceptable for a couple of searches, it is not efficient enough for bioinformatics applications, which may involve several thousand patterns in a single search.
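As a baseline for the comparison with indexed search, the Shift-And simulation from Section 1 can be sketched as follows. This is only a minimal sketch under stated assumptions: the function name shift_and_search, its output convention (positions written to a caller-supplied array) and the use of a 32-bit word are illustrative choices, not part of the project specification. It also assumes that the concatenated sequences are separated by a character that never appears in a pattern, so that no match can span two proteins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift-And exact search (sketch; the name and interface are hypothetical).
   Scans text[0..n-1] for pat, stores up to max_out starting positions in
   out, and returns the total number of occurrences found. */
size_t shift_and_search(const char *text, size_t n, const char *pat,
                        size_t *out, size_t max_out)
{
    size_t m = strlen(pat);
    assert(m >= 1 && m <= 32);   /* the automaton must fit in one 32-bit word */

    /* mask[c] has bit j set exactly when pat[j] == c */
    uint32_t mask[256] = {0};
    for (size_t j = 0; j < m; j++)
        mask[(unsigned char)pat[j]] |= (uint32_t)1 << j;

    const uint32_t accept = (uint32_t)1 << (m - 1);  /* final state */
    uint32_t state = 0;                              /* set of active states */
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* advance every active state, re-activate the initial state,
           and keep only the states whose letter matches text[i] */
        state = ((state << 1) | 1u) & mask[(unsigned char)text[i]];
        if (state & accept) {
            if (count < max_out)
                out[count] = i + 1 - m;   /* start of the occurrence */
            count++;
        }
    }
    return count;
}
```

Each text character costs one shift, one OR and one AND, independently of the pattern length, which is what makes a single processor word sufficient for patterns of up to 30 characters.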
To assemble a new genome, for example, a search algorithm must compare millions of DNA fragments, with an average size of 200 characters, against a reference genome that can itself be a sequence of millions of characters or longer.

Use a suffix array to index the protein sequences. It is a generalized suffix array because it indexes more than one sequence. Recall that a suffix array contains the indexes of the suffixes in lexicographical order. Consider the following suffix array for "ABRACADABRA#", where "#" is the terminator character.

    11  #
    10  A#
     7  ABRA#
     0  ABRACADABRA#
     3  ACADABRA#
     5  ADABRA#
     8  BRA#
     1  BRACADABRA#
     4  CADABRA#
     6  DABRA#
     9  RA#
     2  RACADABRA#

On the left we show the suffix indexes and on the right the respective suffixes. Note that the suffixes are not actually stored in the structure; they are shown only for illustration.

A fairly efficient way to construct a suffix array is to use an MSD radix sort. This means that all the suffixes are first sorted by their first letter, and then the procedure continues recursively inside each interval of suffixes that start with the same letter. Unfortunately suffix arrays must be constructed in main memory, since this construction algorithm is very slow in secondary memory. Hence let us first reduce the size of the database to 50MB with a command like

    head -c 50M UniProt.fasta > Small.fasta

Computing exact matches with a suffix array consists of determining the interval of the array that contains the suffixes that start with the pattern P we are searching for. In the above example the corresponding interval for "AB" is [2, 3], i.e., the interval that contains the suffixes "ABRA#" and "ABRACADABRA#". This interval is determined with two binary searches, one for each extreme of the interval.

Also implement the simple accelerant presented by Gusfield [1]. This accelerant is a heuristic that avoids redundantly comparing letters between P and the extremes of the interval of the binary search. At each step, store the length of the common prefix between P and each of the extremes. Whenever we need to compare P against a suffix inside this interval, we have the guarantee that the common prefix between P and that suffix cannot be shorter than the minimum of the two stored values.

Include experimental results in the report, including the time and memory necessary to construct the suffix array, and compare the time of exact matching against the online version. Determine the break-even point, i.e.
the minimum number of online searches beyond which it pays off to build the suffix array.

3 Approximate Search

In bioinformatics, as in other real-world applications of pattern matching, the pattern string may be corrupted by errors, which makes the search harder. In this work we assume only substitution errors, i.e., we assume that a letter may be replaced by another letter. By modifying the automaton already coded in Section 1, it is possible to search with this type of error. Consider the following automaton, which can be used to find patterns with at most one substitution.
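One possible way to simulate an automaton of this kind with bit-parallelism is sketched here. It is a sketch under assumptions rather than the required solution: it keeps one machine word per allowed number of substitutions, and the function name shift_and_subst, the cap MAX_K and the output convention are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_K 8   /* hypothetical cap on the number of substitutions */

/* Shift-And with at most k substitutions (sketch; names are hypothetical).
   row[d] holds the states reachable with at most d substitutions.
   Stores up to max_out starting positions in out and returns the total
   number of approximate occurrences found. */
size_t shift_and_subst(const char *text, size_t n, const char *pat,
                       unsigned k, size_t *out, size_t max_out)
{
    size_t m = strlen(pat);
    assert(m >= 1 && m <= 32 && k <= MAX_K);

    /* mask[c] has bit j set exactly when pat[j] == c */
    uint32_t mask[256] = {0};
    for (size_t j = 0; j < m; j++)
        mask[(unsigned char)pat[j]] |= (uint32_t)1 << j;

    const uint32_t accept = (uint32_t)1 << (m - 1);
    uint32_t row[MAX_K + 1] = {0};
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cmask = mask[(unsigned char)text[i]];
        uint32_t prev_old = row[0];          /* row d-1 before this step */
        row[0] = ((row[0] << 1) | 1u) & cmask;
        for (unsigned d = 1; d <= k; d++) {
            uint32_t cur_old = row[d];
            /* either match text[i] within row d, or spend one
               substitution coming from row d-1 (any letter accepted) */
            row[d] = (((cur_old << 1) | 1u) & cmask)
                   | ((prev_old << 1) | 1u);
            prev_old = cur_old;
        }
        if (row[k] & accept) {               /* at most k substitutions */
            if (count < max_out)
                out[count] = i + 1 - m;
            count++;
        }
    }
    return count;
}
```

With k = 0 the loop over d is skipped and the function reduces to plain Shift-And, so the same code can serve both the exact and the approximate online search.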

[Figure: automaton for P = MAFS allowing at most one substitution, drawn as two rows of states with transitions labelled M, A, F, S]

Modify the previous implementation of Shift-And to simulate this automaton.

Even though the above implementation is efficient, we have just seen that there is a large gap between the performance of online search and that of indexed search. Unfortunately errors are harder to handle over indexes. Still, the following observation provides a way to do just that. Assume that we want to find all the occurrences of P with at most two substitutions. If we divide P into 3 parts, a prefix, a substring and a suffix, at least one of the pieces must occur without errors. Therefore we can search for those pieces exactly over the suffix array. Then we need to check the resulting positions for complete approximate occurrences of P. Fill in the details of this solution, implement it and present experimental results.

4 Validation

To automatically validate the index we use the following conventions. The file containing the texts to index respects the FASTA format described above. The name of this file is passed on the command line, i.e., in C it corresponds to argv[1]. Therefore the binary is executed with the following command:

    ./project Small.fasta < in > out

The file named in contains the input commands, which we describe next. The output is stored in a file named out. The input and output must respect the specification below precisely. The output file will be validated against an expected result, stored in a file named check, with the following command:

    diff out check

This command should produce no output, thus indicating that both files are identical.

Each operation is issued on a separate line. It begins with a letter that identifies it, followed by a sequence of arguments or options.

E followed by a string P, performs an exact online search for P on the database. Assume the string P contains no whitespace characters. The output consists of a line for each occurrence found, with the format ref: pos, where ref is the reference of the sequence and pos is the position. The results should appear in increasing order of reference and, on ties, in increasing order of position. The order of the references is the same as that in which they appear in the database file. If there are no occurrences, this command produces no output.

I builds the generalized suffix array of the database and prints the "Database indexed." message. If the suffix array was already built by a previous call, this call does nothing but still prints the message.

B followed by a string P, performs an exact search over the suffix array. The output consists of a line for each occurrence found, with the format ref: pos, where ref is the reference of the sequence and pos is the position. The results should be ordered according to the lexicographical order of the suffixes, i.e., the suffix array order. If two sequences share the same suffix, i.e., when there is a lexicographical tie, the order should be the increasing sequence order. If the suffix array has not yet been constructed, output the "Please index database." message.

A followed by a number k > 0 and a string P, performs an approximate online search for P on the database, with at most k substitutions. The output uses the same format as the command E.

F followed by a number k > 0 and a string P, performs an approximate search using the suffix array, with at most k substitutions. The output uses the same format as the command E. Note that this is not the same as the command B. If the suffix array has not yet been constructed, output the "Please index database." message.

Consider the following example:

Database

    >sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R
    MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
    EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
    AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
    EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
    SFRKIYTDLGWKFTPL
    >sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L
    MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
    IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
    AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
    KCINRASDPVYQKVKTLHAYPDQCWYVPCAADVGELKMGTQRDTPTNCPTQVCQIVFNML
    DDGSVTMDDVKNTINCDFSKYVPPPPPPKPTPPTPPTPPTPPTPPTPPTPPTPRPVHNRK
    VMFFVAGAVLVAILISTVRW

Input

    E MAFS
    E SAED
    B MAFS
    B SAED
    F 1 SAAD
    I
    I
    B MAF
    B SAED
    A 1 SAAD
    F 1 SAAD

Output

    Q6GZX4 0
    Please index database.
    Please index database.
    Please index database.
    Database indexed.
    Database indexed.
    Q6GZX4 0
    Q6GZX3 141
    Q6GZX3 175
    Q6GZX3 208
    Q6GZX3 141
    Q6GZX3 175
    Q6GZX3 208

References

[1] Gusfield, D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge Univ Press.
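As a supplement to the exact search over suffix arrays described in Section 2, the two binary searches that delimit the interval can be sketched as follows. The names sa_bound and sa_interval are hypothetical, the text is assumed to be a NUL-terminated C string whose suffix array was built with the same byte order that strncmp uses, and the sketch deliberately leaves out Gusfield's accelerant, so every comparison restarts from the first letter of P.

```c
#include <stddef.h>
#include <string.h>

/* Returns the first index i in [0, n] whose suffix compares, on its first
   m characters, >= pat (strict == 0) or > pat (strict != 0).
   Sketch only; the name and interface are hypothetical. */
static size_t sa_bound(const char *text, const size_t *sa, size_t n,
                       const char *pat, size_t m, int strict)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = strncmp(text + sa[mid], pat, m);
        if (c < 0 || (strict && c == 0))
            lo = mid + 1;      /* the answer lies to the right of mid */
        else
            hi = mid;          /* mid itself may be the answer */
    }
    return lo;
}

/* Finds the interval [*first, *last] of suffix array positions whose
   suffixes start with pat. Returns 1 if pat occurs, 0 otherwise. */
int sa_interval(const char *text, const size_t *sa, size_t n,
                const char *pat, size_t *first, size_t *last)
{
    size_t m = strlen(pat);
    size_t lb = sa_bound(text, sa, n, pat, m, 0);  /* left extreme */
    size_t ub = sa_bound(text, sa, n, pat, m, 1);  /* one past the right extreme */
    if (lb == ub)
        return 0;
    *first = lb;
    *last = ub - 1;
    return 1;
}
```

Called on "ABRACADABRA#" with the suffix array 11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2 and the pattern "AB", sa_interval reports the interval [2, 3], in agreement with the example of Section 2. Each search performs O(log n) suffix comparisons of at most |P| characters each.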
