DNA Sequencing Error Correction using Spectral Alignment



Novaldo Caesar, Wisnu Ananta Kusuma, Sony Hartono Wijaya
Department of Computer Science, Faculty of Mathematics and Natural Science, Bogor Agricultural University

Abstract

Second-generation DNA sequencing technology can generate a large number of DNA fragments (reads) in a relatively short time. A DNA sequence assembly step is required to obtain whole genome sequences from the reads. Assembly generally uses a graph-based approach, which is very sensitive to DNA sequencing errors. To obtain good assembly results, error correction can be performed before or after the assembly process. In this research, we developed a software prototype for correcting DNA sequencing errors. We employed the spectral alignment technique as a pre-processing step before DNA sequence assembly. We tested our method on simulated DNA reads containing errors and measured the results by counting the nodes of the assembly graph. The evaluation showed that our method reduces the complexity of the graph, as shown by a decrease in the number of nodes. This indicates that our method successfully corrected DNA reads containing sequencing errors.

I. INTRODUCTION

Advances in DNA sequencing technology have opened opportunities for new approaches to DNA sequence assembly. Reads produced by next-generation DNA sequencing (NGS) are still much shorter than those produced by the traditional Sanger shotgun sequencing method [5]. However, NGS can yield a large number of reads very quickly: the Illumina Genome Analyzer can generate 1.5 billion base pairs (bp) of sequence data in a single-end 60-hour run with a read length of 36 [6]. The short length of NGS reads increases the difficulty of the assembly process. Overlap detection becomes more difficult because of repeats, and repeats can cause mis-assembly.
To overcome this problem, researchers proposed a new assembly approach based on the de Bruijn graph [3-4]. However, this approach is still very sensitive to DNA sequencing errors, which increase the complexity of the graph. Therefore, to reduce this complexity, DNA sequencing error correction is required as a pre-processing step before assembly.

This research aims to develop a software prototype that detects and corrects DNA sequencing errors using the spectral alignment method. The spectral alignment method adopts a statistical approach to classify each read as erroneous or error-free. It can be applied without constructing a graph, and is therefore more efficient than the topology approach, in which error correction is conducted after the graph has been constructed. To evaluate our method, the reads produced by the correction step are assembled by Velvet to obtain the number of nodes, which represents the complexity of the graph. The evaluation compares the reads containing sequencing errors with the corrected reads produced by our method.

II. METHODS

This research focused on the pre-processing step of DNA sequence assembly. Assembly was performed using Velvet, the well-known DNA sequence assembly software developed by Zerbino [8]. The sequence data used in this research are short reads. Test data were simulated with the MetaSim software [9] and stored in FASTA format. This research employed the spectral alignment method to detect and correct DNA sequencing errors [7]. The approach relies on the concepts of solid and weak tuples, the spectrum, T-Strings, and the spectral alignment problem itself. These definitions and concepts are explained below.

A. Solid Tuple and Weak Tuple

Given a set of reads $R = \{r_1, r_2, r_3, \dots, r_k\}$ with $|r_i| = L$ and $r_i \in \{A, C, G, T\}^L$ for all $i$ where $1 \le i \le k$. The symbols A, C, G, and T are nucleotide codes: A for adenine, C for cytosine, G for guanine, and T for thymine. We also define two integers, the multiplicity $m$ ($m > 1$) and the length $l$ ($l < L$). An l-tuple is a DNA string of length l. An l-tuple is classified as solid with respect to R and m if it is a substring of at least m reads, and as weak otherwise. For instance, take multiplicity m = 2 and reads R = {AAAA, AAAC, AATC, ATCA}. An l-tuple of length 3 is called a 3-tuple. The 3-tuple AAA is a solid tuple because it is a substring of at least 2 reads in R, namely AAAA and AAAC.

B. Construction of Spectrum Set

The spectrum of a set of DNA sequences R with multiplicity m and length l is denoted $T_{m,l}(R)$. The spectrum $T_{m,l}(R)$ is the set of all solid l-tuples of R for the specified multiplicity and length. It is therefore built from the l-tuples classified as solid in the previous step. The spectrum serves as the reference for defining T-Strings, which are explained next.

C. Error Correction with Spectral Alignment Method

After the spectrum has been constructed, the next concepts to understand are the T-String and the spectral alignment problem. A DNA sequence s is a $T_{m,l}(R)$-string, or T-String for short, if every l-tuple in s is a member of the spectrum. Given a DNA string s and a spectrum $T_{m,l}(R)$, the spectral alignment problem is to find a T-String s* in the set of all T-Strings that minimizes the distance function d(s, s*). The distance function used in this research is the Levenshtein (edit) distance. For example, let S be a set of T-Strings with members s1 = ATCGAGCT, s2 = ATCCATCT, and s3 = ATCGAACT, and let s = ATCGGGCT be a sequence containing an error. The spectral alignment problem is then to find the element s* of S that minimizes the Levenshtein distance to s. In this example, s1 minimizes the distance (1, compared with 3 for s2 and 2 for s3): s contains an error at the fifth nucleotide from the left, which should have been read as A instead of G. The DNA sequencing error correction process was applied to all DNA reads iteratively.
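As an illustration of the spectral alignment step, the following minimal Python sketch computes the Levenshtein distance between the erroneous sequence and each candidate from the example above, then picks the closest one. The function names `levenshtein` and `nearest_t_string` are illustrative and are not taken from the prototype, and the set of T-Strings is assumed to be given.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_t_string(s: str, t_strings: list[str]) -> str:
    """Return the candidate with the smallest edit distance to s."""
    return min(t_strings, key=lambda t: levenshtein(s, t))


# Example values taken from the text above.
t_strings = ["ATCGAGCT", "ATCCATCT", "ATCGAACT"]
read = "ATCGGGCT"

distances = {t: levenshtein(read, t) for t in t_strings}
corrected = nearest_t_string(read, t_strings)

print(distances)  # {'ATCGAGCT': 1, 'ATCCATCT': 3, 'ATCGAACT': 2}
print(corrected)  # ATCGAGCT
```

Negating the distance gives a score in which more similar sequences rank higher, so maximizing that score selects the same T-String.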
After all sequences have been corrected, a set of corrected sequences R* is produced. The performance of the correction is evaluated with the Velvet assembler using R* together with R (the set of uncorrected sequences).

D. Evaluation

The set of uncorrected sequences and the set of corrected sequences are used as input for the Velvet assembler in order to evaluate the performance of the error correction process. Velvet is a collection of algorithms that manipulate de Bruijn graphs to perform DNA sequence assembly [8]. Velvet constructs de Bruijn graphs from the DNA reads supplied by the user. The complexity of the graphs produced by Velvet for inputs R and R* is evaluated separately. The criterion for graph complexity in this research is the total number of nodes in the graph. A dataset of erroneous sequences tends to produce a more complex graph than an error-free dataset, and this is used to evaluate whether the error correction process was performed properly.

E. Levenshtein / Edit Distance

In spectral alignment, a distance function is used to compute a score representing the similarity between DNA strings. This research employed the Levenshtein (edit) distance as the distance function. The distance is the number of single-character edits needed to transform one string into a reference string. A single-character edit can be an insertion, a deletion, or a substitution, and each edit counts as a score of 1. For instance:

1. kitten → sitting, scored 3 (substitution of k by s, substitution of e by i, and insertion of g at the end of the string)
2. sitten → sittin, scored 1 (substitution of e by i)

In this research, every edit is scored -1 instead of 1. This makes the scoring more intuitive, because more similar sequences receive a higher score than dissimilar sequences.

F. Pattern Matching with Enhanced Suffix Array

The enhanced suffix array is a suffix array with an additional lcp table. In this research, the lcp (longest common prefix) table stores the length of the longest common prefix of two adjacent suffixes, suf[i] and suf[i-1], in the suffix array [2]. Exact pattern matching determines whether a pattern P (needle) occurs in a string S (haystack). For exact pattern matching, the enhanced suffix array has a time complexity of O(m + z), where m here is the pattern length and z the number of occurrences, and a space requirement of 6n for a text of length n [1]. This is better than the suffix array in terms of time and better than the suffix tree in terms of space. For this reason, exact pattern matching in this research was performed using the enhanced suffix array.

Exact pattern matching was used in the spectrum construction step and in the T-String classification step. In spectrum construction, the DNA sequences were indexed by an enhanced suffix array, and the patterns were all possible permutations with repetition of tuples of length l. In T-String classification, the spectrum was indexed by an enhanced suffix array, and the patterns were all substrings of a given sequence. A good implementation of exact pattern matching was required to boost performance, because most DNA sequence data are very large. With the enhanced suffix array, a relatively fast pattern matching process can be performed.

III. RESULTS

A. DNA Sequences Data

The data used in this research were DNA sequences simulated by the MetaSim software. The outputs of MetaSim are DNA fragments containing sequencing errors. Three organisms were picked from the NCBI database for simulation; they are listed in Table 1.

TABLE 1. THE SPECIFICATION OF ORGANISMS DATA USED IN THIS RESEARCH

No   Organism                                   Gi   Complete sequence length
1    aureus subsp. aureus ED98 plasmid pavy
2    pasteurianus IFO plasmid papa
3    plantarum WCFS1 plasmid pwcfs101

For each organism sequence, the simulation was performed once. Each simulation produces DNA sequence data containing errors. The error model used is the Solexa error model with a fragment length of 36 for each read, restricted to substitution errors only. The MetaSim parameter configuration for each simulation is given in Table 2.

TABLE 2. PARAMETER CONFIGURATION TO GENERATE DNA SEQUENCES DATA USING METASIM

Organism       Reads   DNA clone size distribution type   Mean   Second parameter
aureus         1500    Normal                             36     0
pasteurianus   2000    Normal                             36     0
plantarum      2000    Normal                             36     0

The outputs of the simulations performed by MetaSim were three FASTA (.fna) files, one per organism. Each file contains the erroneous sequences produced by the simulation. These three files are the datasets used in this research.

B. Software Workflow to Correct Errors

This research produced software that detects and corrects errors arising from the DNA sequencing step using the spectral alignment method. The software accepts as input the path of a FASTA file containing the DNA sequences to be corrected. The input data are scanned line by line and labeled as reads. Then, all possible permutations with repetition of the nucleotides A, C, G, and T are constructed for a specified length l.
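The enumeration of all length-l tuples over the four nucleotides can be sketched in a few lines of Python. This is only an illustration under assumed names (`NUCLEOTIDES`, `tuple_pool`), not the prototype's code; the tuple length 5 is the value used in this research.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # alphabet used to build the tuples


def tuple_pool(l):
    """Return every string of length l over A, C, G, T (4**l strings)."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=l)]


pool = tuple_pool(5)
print(len(pool))          # 1024
print(pool[0], pool[-1])  # AAAAA TTTTT
```

Because the pool grows as 4^l, full enumeration is only practical for small l; for longer tuples, counting only the tuples that actually occur in the reads avoids building the whole pool.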
The length l used in this research is 5, so the total number of permutations is $4^5 = 1024$ tuples. This set of permutations is referred to as the pool. In the next step, the spectrum set is constructed from the reads and the pool. Each member of the pool is classified as weak or solid, and the spectrum set consists of the solid tuples only. To decide whether a tuple is solid or weak, this research sets the multiplicity parameter to m = 10. The spectrum construction process is shown in Algorithm 1.

ALGORITHM 1
SPECTRUM CONSTRUCTION PROCEDURE FROM SOLID TUPLES ONLY
Input : reads set, pool set, m = 10
Output: spectrum set

    set index = 0
    for i := 1...length(pool)
        set numocc = 0
        for j := 1...length(reads)
            if (pool[i] is a substring of reads[j])
                numocc++
            end if
        end for
        if (numocc >= m)
            spectrum[index++] = pool[i]
        end if
    end for
    return spectrum

In the next step, the set of T-Strings is constructed with reference to the spectrum. Every fragment in the reads is classified as a T-String or a non-T-String. To do this, all substrings of length l of the fragment must be known, and each of these substrings is looked up in the spectrum. If every substring of a fragment is found in the spectrum set, the fragment is classified as a T-String. If one or more substrings do not belong to the spectrum set, the fragment is classified as a non-T-String and detected as an erroneous fragment (read). Detected erroneous fragments are then corrected with the spectral alignment method. Algorithm 2 shows the procedure that generates all substrings of length l of a fragment, while Algorithm 3 shows the T-String classification procedure.

ALGORITHM 2
POSSIBLE SUBSTRING SET CONSTRUCTION OF A FRAGMENT PROCEDURE
Input : read/fragment, substring length l
Output: substrings of read/fragment

    string inf
    set index = 0
    for i := 1...length(read)-l+1
        inf = infix(read, i, i+l)
        substrings[index++] = inf
    end for
    return substrings

ALGORITHM 3
T-STRING CLASSIFICATION PROCEDURE
Input : a read/fragment, substrings set from input read, spectrum set
Output: True if read is classified as T-String, False if read is not T-String

    for i := 1...length(substrings)
        set found = false
        for j := 1...length(spectrum)
            if (substrings[i] equal spectrum[j])
                found = true
                break
            end if
        end for
        if (not found)
            return false   //not T-String
        end if
    end for
    return true   //read is T-String

Every erroneous read is aligned with each element of the T-String set. During alignment, a scoring scheme based on the Levenshtein (edit) distance determines the similarity between the read and every element of the T-String set, so each fragment has a distance score for each T-String. To correct an erroneous fragment, the T-String with the least distance to the fragment is picked; in other words, the T-String most similar to the fragment among all T-Strings. The erroneous fragment is corrected by replacing it with the corresponding T-String. The process is repeated iteratively until all erroneous fragments have been corrected. The output of the software is a FASTA file containing the corrected DNA sequences corresponding to the input sequences. Algorithm 4 shows the procedure to detect and correct DNA sequencing errors with the spectral alignment method.
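The detection steps in Algorithms 1 to 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the prototype's implementation: the function names are hypothetical, tuples observed in the reads are counted directly instead of scanning a pre-built pool, and a hash set replaces the enhanced suffix array for lookups. The toy parameters (m = 2, l = 3) and reads come from the example in Section II.A rather than the values m = 10 and l = 5 used in the experiments.

```python
from collections import Counter


def l_tuples(read, l):
    """All length-l substrings of a read, in order (cf. Algorithm 2)."""
    return [read[i:i + l] for i in range(len(read) - l + 1)]


def build_spectrum(reads, l, m):
    """Solid tuples: l-tuples that occur in at least m reads (cf. Algorithm 1)."""
    counts = Counter()
    for read in reads:
        # set(...) so that a tuple is counted at most once per read
        counts.update(set(l_tuples(read, l)))
    return {t for t, c in counts.items() if c >= m}


def is_t_string(read, spectrum, l):
    """True if every l-tuple of the read is in the spectrum (cf. Algorithm 3)."""
    return all(t in spectrum for t in l_tuples(read, l))


reads = ["AAAA", "AAAC", "AATC", "ATCA"]
spectrum = build_spectrum(reads, l=3, m=2)
erroneous = [r for r in reads if not is_t_string(r, spectrum, 3)]

print(sorted(spectrum))  # ['AAA', 'ATC']
print(erroneous)         # ['AAAC', 'AATC', 'ATCA']
```

Counting observed tuples yields the same spectrum as testing every pool member against every read, since a tuple that never occurs in the reads cannot be solid.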
ALGORITHM 4
DNA SEQUENCING ERROR CORRECTION PROCEDURE WITH SPECTRAL ALIGNMENT
Input : a set of all non T-String positions, T-String set, reads set
Output: corrected reads set

    int pos, tstring_max_pos, max_score
    String[] align
    for i := 1...length(positions)
        set pos = positions[i]
        align[0] = reads[pos]
        set max_score = (minimum possible score)
        for j := 1...length(TString)
            align[1] = TString[j]
            int score = globalalignment(align)
            if (score > max_score)
                max_score = score
                tstring_max_pos = j
            end if
        end for
        reads[pos] = TString[tstring_max_pos]
    end for
    return reads

The three datasets simulated by MetaSim are the input files for the software developed in this research. The software detects and corrects the errors contained in the three datasets, and its output is three FASTA files containing the corrected DNA sequences. Information on the software execution for each of the three datasets is given in Table 3.

TABLE 3. INFORMATION REGARDING EXECUTION OF THE SOFTWARE DEVELOPED FOR EACH THREE PREPARED DATASETS

Organism       Erroneous reads detected   Execution time (ms)   Spectrum elements   T-String elements
aureus         844
pasteurianus
plantarum

C. Correlation between total nodes and success of error correction

In the graph construction step of DNA sequence assembly, the presence or absence of sequencing errors affects graph complexity. Errors in the sequences can produce unnecessary branches in the graph. We therefore assumed that the complexity of a graph can be measured by the total number of nodes in the graph generated during DNA sequence assembly. For example, consider a pair of DNA strings AATGC and GCCAGT. Assume that the first string should be read as AATGC but, because of a sequencing error, was read as AATAC. The graph produced by the erroneous strings will then be more complex than the graph produced by the error-free strings. Fig. 1 shows the graph of the error-free fragments, while Fig. 2 shows the graph of the fragments with a sequencing error. The graph in Fig. 1 has a total of 8 nodes, whereas the graph in Fig. 2 has a total of 9 nodes. As shown in Fig. 2, the substitution error produces an unnecessary branch and results in more nodes. Fig. 1 and Fig. 2 show that a sequencing error increases the complexity of the graph resulting from DNA sequence assembly.

Fig. 1. Graph assembled with two fragments without error

Fig. 2. Graph assembled with one fragment containing error

D. Error correction evaluation using Velvet

The DNA sequences corrected by the spectral alignment method are evaluated using the Velvet assembler. For each organism there is a pair of files, one containing the sequences with errors (without error correction) and the other containing the sequences corrected by the spectral alignment method. These three pairs of datasets for the three organisms were stored in six files in FASTA format. The output of Velvet is a de Bruijn graph assembled from the input DNA reads. Two plain text files, named PreGraph and LastGraph, correspond to the graph generated by a Velvet execution. Both files contain a list of nodes representing the de Bruijn graph built by Velvet, but in this research only PreGraph is considered. The reason is that the graph in LastGraph is the final output of Velvet, produced after Velvet's own error removal techniques and graph simplification, and is therefore irrelevant to the aim of this research. The graph in PreGraph, by contrast, has not been processed by Velvet's error removal techniques and graph simplification, so it can be used as a measure of the success of our DNA sequencing error correction method.

For each DNA sequence assembly run using Velvet, a hash length parameter k must be chosen. The hash length is the length of the k-mers included in the hash table. The k value must be an odd number smaller than MAXKMERHASH (31 for 36 bp reads) and must also be smaller than the length of each input fragment. In this research, k was set to 17, 19, and 21. The DNA sequence assembly results using Velvet for each k value are given in Table 4.

TABLE 4. DNA SEQUENCE ASSEMBLY RESULTS USING VELVET

                                  Total nodes in graph
Organism       Error correction   k = 17   k = 19   k = 21
aureus         No
               Yes
pasteurianus   No
               Yes
plantarum      No                                   8
               Yes                                  8

Table 4 shows the total number of nodes produced by Velvet for each dataset with k = 17, k = 19, and k = 21. The corrected sequences produced fewer nodes than the uncorrected sequences for every k value, with one exception: for plantarum at k = 21, both the corrected and the uncorrected sequences produced 8 nodes. The results show that the error correction software developed in this research was able to detect and correct DNA sequencing errors and also simplified the graph constructed in the DNA sequence assembly step.

IV. CONCLUSION

This research successfully produced a software prototype for detecting and correcting DNA sequencing errors using the spectral alignment method. The graph resulting from the corrected sequences is simpler than the graph generated from the error-containing sequences. The results also show that correcting sequencing errors with the spectral alignment method can simplify the graph resulting from DNA sequence assembly.

REFERENCES

[1] Abouelhoda, Mohamed Ibrahim, Enno Ohlebusch, and Stefan Kurtz. "Optimal exact string matching based on suffix arrays." String Processing and Information Retrieval. Springer Berlin Heidelberg.
[2] Abouelhoda, Mohamed Ibrahim, Stefan Kurtz, and Enno Ohlebusch. "Replacing suffix trees with enhanced suffix arrays." Journal of Discrete Algorithms 2.1 (2004).
[3] Chaisson, Mark, Pavel Pevzner, and Haixu Tang. "Fragment assembly with short reads." Bioinformatics (2004).
[4] Pevzner, Pavel A., Haixu Tang, and Michael S. Waterman. "An Eulerian path approach to DNA fragment assembly." Proceedings of the National Academy of Sciences (2001).
[5] Schröder, Jan, et al. "SHREC: a short-read error correction method." Bioinformatics (2009).
[6] Shi, Haixiang, et al. "Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA." Parallel & Distributed Processing, IPDPS IEEE International Symposium on. IEEE.
[7] Wong, Jason WH, Gerard Cagney, and Hugh M. Cartwright. "SpecAlign processing and alignment of mass spectra datasets." Bioinformatics 21.9 (2005).
[8] Zerbino, Daniel R., and Ewan Birney. "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Research 18.5 (2008).
[9] Richter, D. C., F. Ott, A. F. Auch, R. Schmid, and D. H. Huson. "MetaSim: a sequencing simulator for genomics and metagenomics." PLoS ONE, vol. 3, no. 10, page e3373, 2008.
