A Scalable Coprocessor for Bioinformatic Sequence Alignments

Scott F. Smith
Department of Electrical and Computer Engineering
Boise State University
Boise, ID, U.S.A.

Abstract

A hardware coprocessor for the rapid calculation of bioinformatic sequence alignments is presented. The coprocessor uses a globally-asynchronous locally-synchronous (GALS) design style which makes the coprocessor much easier to scale as CMOS feature sizes decrease. The coprocessor is intended to be implemented on a single integrated circuit along with a simple 32-bit RISC processor and memory system. The specific sequence alignment algorithm implemented is that of Smith and Waterman, but the general design strategy could be extended to other bioinformatic sequence alignment algorithms.

Keywords - coprocessor, bioinformatics, sequence alignment, Smith-Waterman, globally-asynchronous locally-synchronous.

1. Introduction

There is a huge amount of biological sequence data available today and the volume of data is increasing at an exponential rate. This data includes DNA and protein sequences as well as structural data (x, y, and z positions of atoms in organic polymers). A major challenge for molecular biologists is to make sense of these newly available sequences, which run into the billions of units (DNA base pairs or protein amino acids). The only possible way to handle this large amount of data is with automated computer processing, a field now called bioinformatics.

One of the core activities undertaken by bioinformatics is the alignment of sequences. This is similar to string searching in text databases, but with some aspects that are peculiar to biological databases. The simplest of these alignments involves taking a vector of symbols (a query string) and finding ranges in a vector database that are similar to the query string. More complicated alignments use structural data and attempt to find locations in the database with similar relative locations of atoms.
This paper describes a coprocessor to handle the non-structural alignment problem, but it may be used as a base for designing structural alignment hardware.

One of the reasons that text search algorithms cannot be directly used for sequence alignment is that there is rarely an exact match between the query string and the database. The query string might be a protein found in humans and the search is for a protein with a similar function in another organism such as mice. The two proteins will have differences that correspond to mutations, insertions, or deletions. A mutation is a difference in a character at a given position. An insertion or deletion is an addition or removal of one or more characters from the query string or database. Insertions and deletions are symmetric in that an insertion in the query string can be viewed as a deletion in the database and vice versa. A further complication is that certain mutations for proteins (amino acids) have less effect on the protein function than other mutations. In order to generate a high quality alignment which finds similar locations in the database without a high false alarm rate, it is necessary that the alignment algorithm mirror the statistics of the underlying biological process. One such algorithm is the Smith-Waterman alignment algorithm [1], which gives high quality alignments but is very computationally demanding.

Alternatives to dynamic-programming based alignments such as Smith-Waterman exist which are not as high quality, but are less computationally demanding. These include BLAST (Basic Local Alignment Search Tool) [2] and FASTA (Fast Align) [3] and are commonly used on web-based servers such as those maintained by the National Center for Biotechnology Information (NCBI) [4]. Even though these algorithms are less computationally demanding, they still use significant computing resources. In order to get the high quality of the Smith-Waterman algorithm, one typically has to resort to the power of scientific supercomputing. Strategies include both the use of general-purpose supercomputers [5] and special-purpose coprocessors [6][7]. This paper describes a special-purpose coprocessor. The coprocessor presented here differs from existing coprocessors in the literature in that it is intended to be a single-chip implementation and that it is more easily scalable due to its globally-asynchronous locally-synchronous design style [8].

There are at least two companies that manufacture and sell systems with coprocessors to accelerate bioinformatic alignments. One of these uses ASIC (application-specific integrated circuit) technology [9] and the other FPGA (field-programmable gate array) technology [10]. Typically these are used to accelerate BLAST processing, which is already computationally intensive enough to warrant special-purpose hardware. The design of these accelerators is, however, proprietary and very little detailed information about them is available in the public literature.

The main advantage of the GALS design style is that it avoids the problem of designing a large-area low-skew high-frequency global clock tree. The problems associated with global clock tree design are documented in [11]. These problems will only get worse as CMOS feature sizes decrease and include new problems such as crosstalk and wire inductance that could be safely ignored until recently.
Interfaces between the local clock domains of GALS systems have been designed by [12], [13], and [14]. The latter design is used for the coprocessor of this paper because it has the advantage of not requiring any control of the local clocks by the interface.

The design of a single computation unit which resides within one of the local clock domains of the GALS coprocessor system is described in Section 2. The combination of these units into a multi-clock-domain GALS coprocessor system is the topic of Section 3. Conclusions are presented in Section 4 along with future work to be undertaken.

2. Computation Unit Organization

The computation unit is designed to implement the equations of the Smith-Waterman alignment algorithm corresponding to a single character of the query sequence. If there are more characters in the query sequence than there are computation units, then the coprocessor will be accessed multiple times by the main processor. Each access will pass the entire database through the array of computation units. The processor stores the intermediate results collected at the end of each pass, which generates an intermediate results file several times as large as the initial database.

The Smith-Waterman equations are:

$$I_{i,j} = D_{i,j} = M_{i,j} = 0 \quad \text{for all } i \text{ and } j \text{ such that } i = 0 \text{ or } j = 0$$

$$I_{i,j} = \max\{I_{i-1,j} - c,\; M_{i-1,j} - g\}$$

$$D_{i,j} = \max\{D_{i,j-1} - c,\; M_{i,j-1} - g\}$$

$$M_{i,j} = \max\{I_{i-1,j-1} + d(a_i, b_j),\; D_{i-1,j-1} + d(a_i, b_j),\; M_{i-1,j-1} + d(a_i, b_j),\; 0\}$$

where the indices i and j refer to the position within the query string and within the database. Since the algorithm is symmetric, it does not matter which index is chosen for the query and which for the database. I is the current score if an insertion is underway and D if a deletion is underway. The choice of assignment of i and j to query string versus database determines whether these insertions or deletions actually refer to query string insertions and deletions or

database insertions and deletions. The current score if the current pair of characters (one from the query string and one from the database) is taken as a match/mutation is M. The penalty for starting a new insertion or deletion is g, and the penalty for continuing an insertion or deletion is c (g is normally chosen larger than c). The reward for a match is d and this depends in general on how close the match is. Exactly matching characters get the highest reward value and similar characters get a reduced, but positive reward. For DNA alignments, exact matches are usually assigned a positive reward with all other combinations given a reward of zero. For protein alignments, amino acids with similar properties (such as both being hydrophobic) are given non-zero, but lower rewards than exact matches. The d matrix is normally symmetric. There are four possible characters for DNA alignments (A, T, C, and G) and twenty possible characters for protein alignments (C, H, I, M, S, V, A, G, L, P, T, F, R, Y, W, D, N, E, Q, and K) [15]. These characters are normally stored as eight-bit ASCII codes in biological databases.

A block diagram of a single computation unit is shown in Figure 1. The unit is divided into two sections, constants and calculation. These two sections have separate request and acknowledge interfaces and work independently. The constants section is loaded first with the c, g, d, and valid values for a particular character of the query string. The valid bit allows a computation unit to be bypassed if it is not needed as a result of the query string being shorter than the total number of computation units. The g and c values are the start and continuation penalties for insertions and deletions and are eight-bit values. The twenty d values, indexed d(0) through d(19), are each three-bit rewards for matching/mutating characters that pass through the computation unit.
The d values are a single column of the d matrix corresponding to the query string character assigned to the computation unit. In the case of DNA alignment, only the first four d values will be used since the characters corresponding to the other sixteen d values will simply never appear in the database data stream.

Figure 1. Computation unit block diagram.
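As a software point of reference, the recurrences for I, D, and M given in this section can be sketched as follows. This is a minimal illustration and not the hardware design: the function name `smith_waterman_score`, the use of full matrices, and the choice to report the largest M value as the overall score are assumptions made for the example.

```python
def smith_waterman_score(query, database, d, g, c):
    """Best local alignment score from the I/D/M recurrences.

    query, database: strings of symbols.
    d: function (a, b) -> reward for pairing symbols a and b.
    g: penalty for starting an insertion or deletion.
    c: penalty for continuing an insertion or deletion.
    """
    rows, cols = len(query) + 1, len(database) + 1
    # Row 0 and column 0 stay at zero, which is the boundary condition.
    I = [[0] * cols for _ in range(rows)]
    D = [[0] * cols for _ in range(rows)]
    M = [[0] * cols for _ in range(rows)]
    best = 0  # assumption: the reported score is the largest M value seen
    for i in range(1, rows):
        for j in range(1, cols):
            I[i][j] = max(I[i - 1][j] - c, M[i - 1][j] - g)
            D[i][j] = max(D[i][j - 1] - c, M[i][j - 1] - g)
            reward = d(query[i - 1], database[j - 1])
            M[i][j] = max(I[i - 1][j - 1] + reward,
                          D[i - 1][j - 1] + reward,
                          M[i - 1][j - 1] + reward,
                          0)
            best = max(best, M[i][j])
    return best


def dna_reward(a, b):
    """Example DNA reward: positive for an exact match, zero otherwise."""
    return 2 if a == b else 0
```

In this sketch each value of i plays the role of one computation unit (one query character), and the inner loop over j plays the role of the database stream passing through that unit.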

The database characters (Char) are passed through the pipeline of computation units using a twenty-bit one-hot code. The one-hot code makes it easier to select the required d value from the constants section. This makes the computation unit faster and saves transistors inside the unit at the expense of additional bits at the interface between units. The current score at the current position in the database is labeled Max. An additional intermediate variable X has been added to the equations. This X variable represents a portion of the calculation that can be done prior to the arrival of the current Char value. The variable X is not passed between computation units since it is only a temporary internal state.

3. Full Coprocessor System

The connection of two computation units using asynchronous interfaces is shown in Figure 2. Two asynchronous interfaces are needed between each pair of computation units to allow independent passage of constants and data. Each computation unit has its own local clock signal. These clock signals are intended to have the same nominal frequency and are most likely derived from the same clock source. Local clock signals have tightly controlled skew such that the usual synchronous design paradigm can be used within a clock domain. This allows the standard types of digital design tools to be used to design the logic internal to a local clock domain. Even though the clock signals of two different clock domains are nominally the same, minimal effort is employed to maintain low skew between the domains and the clock domain interfaces make no assumptions about the relative phase of the two local clock signals.

Figure 2. Connection of two computation units.

The internal design and performance of the asynchronous interface is described in detail in [14] and [16]. The interface is built around an asynchronous FIFO. The latency of the interface has been estimated using a SPICE model as 1.09 ns plus a clock-phase differential term which varies between zero and one period of the receiving clock signal. This performance estimate is based on a 180 nm TSMC [17] CMOS process available through MOSIS [18].

The full system including processor, coprocessor, and array of computation units is shown in Figure 3. The processor and coprocessor are in the same local clock domain (clock 0) and each of the n computation units occupies its own local clock domain (clock 1 through clock n). The coprocessor loads constants eight bits at a time through the chain of Const connections. Internal to each computation unit there is an eight-bit wide and ten-unit long synchronous FIFO for constant values. After 10n bytes of constants have been sent to the computation unit array by the coprocessor, the constants are fully loaded. There is no need to pass constants from computation unit n back to the coprocessor and therefore only one asynchronous interface is needed at that place.

Data are passed 84 bits at a time through the chain of Data connections. This data is composed of D, I, M, Max, and Char. There is no need to ever pass Char back to the coprocessor from computation unit n, so the 20 bits of Char are omitted and Data is only 64 bits wide at that point. The need to pass D, I, M, and Max from the coprocessor to computation unit 1 occurs only when the query string does not fit in the array of computation units and multiple passes of the database through the array are needed. If the query string completely fits or it is the first pass of a multi-pass run, the D, I, M, and Max values are set to zero by the coprocessor.
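The widths quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions: the 16-bit width for each of D, I, M, and Max is inferred from 64 bits shared by four values, the helper names are hypothetical, and the pass count simply assumes one query character per computation unit.

```python
import math

SCORE_BITS = 16        # assumed: 64 data bits shared equally by D, I, M, Max
CHAR_BITS = 20         # one-hot database character
PENALTY_BITS = 8       # width of g and of c
REWARD_BITS = 3        # width of each d value
NUM_REWARDS = 20       # d(0) through d(19)
CONST_FIFO_BYTES = 10  # eight-bit wide, ten-unit long constants FIFO


def data_width(include_char=True):
    """Bits on a Data connection, with or without the Char field."""
    return 4 * SCORE_BITS + (CHAR_BITS if include_char else 0)


def constant_bits():
    """Bits of constants per unit: g, c, twenty d values, and the valid bit."""
    return 2 * PENALTY_BITS + NUM_REWARDS * REWARD_BITS + 1


def constant_bytes_to_load(num_units):
    """Bytes the coprocessor sends to fill every unit's constants FIFO."""
    return CONST_FIFO_BYTES * num_units


def passes_needed(query_length, num_units):
    """Database passes needed when each unit holds one query character."""
    return math.ceil(query_length / num_units)
```

Under these assumptions the constants for one unit occupy 77 bits, which fits within the 80 bits of the ten-byte FIFO, and the Data connection is 84 bits wide with Char and 64 bits wide without it.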

Figure 3. Full coprocessor system.

One of the functions of the coprocessor is to expand eight-bit ASCII-coded symbols for DNA bases or amino acids into the 20-bit one-hot code used to specify Char in the computation unit array. The processor is responsible for maintaining the d matrix and calculating all of the constants to be loaded via Const. One reason for having the coprocessor do the one-hot expansion is to reduce the bandwidth over the processor-coprocessor interface.

A good possible choice for the processor in this system is an ARM922T CPU core [19][20][21]. This is a small 32-bit RISC processor designed to be used as a core on an ASIC. The ARM922T CPU core is built around an ARM9 processor which has a standard coprocessor interface. One standard coprocessor which uses this interface is the memory management unit (MMU), but additional application-specific coprocessors can be designed to use the interface definition. This coprocessor interface is 32 bits wide. After initialization of the constants, information passing from the processor to the coprocessor is in the form of ASCII characters, so four database units can be sent per transfer. Information passing from the coprocessor to the processor is a series of 16-bit scores, one for each database character passed into the coprocessor. Each group of four database characters therefore needs one inbound transfer and two outbound transfers of two scores each, so the 32-bit processor-coprocessor interface handles an average of 4/3 database characters per clock cycle.

The ARM922T is capable of operating at 200 MHz in the 180 nm TSMC CMOS process. The synchronous logic within the coprocessor and computation units has not yet been designed, but it is not unreasonable to expect these to operate at a similar speed (the coprocessor is required to operate on the same clock as the processor). If so, then the processor will be near full processing capacity moving data into and out of the coprocessor during the database access phase of processing.
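The one-hot expansion performed by the coprocessor, and the way a computation unit could use it to pick a d value, can be illustrated in software. This is a sketch under assumptions: the paper does not state which bit position each symbol occupies, so the symbol orderings below (taken from the order in which the symbols are listed in Section 2) and the function names `to_one_hot` and `select_reward` are hypothetical.

```python
# Assumed bit assignments: symbol k in each string sets bit k of the code.
DNA_SYMBOLS = "ATCG"
PROTEIN_SYMBOLS = "CHIMSVAGLPTFRYWDNEQK"


def to_one_hot(symbol, alphabet):
    """Expand one ASCII symbol into a 20-bit one-hot integer."""
    index = alphabet.index(symbol)  # raises ValueError for unknown symbols
    return 1 << index


def select_reward(one_hot, d_column):
    """Return the d value selected by the single set bit of one_hot.

    Each d value is gated by its select bit and the results are ORed
    together, mirroring an AND-OR selection with no decoder.
    """
    reward = 0
    for index, value in enumerate(d_column):
        if (one_hot >> index) & 1:
            reward |= value
    return reward
```

Because exactly one bit is set, the selection needs no decoding step, which is the reason given in Section 2 for carrying the wider one-hot code between units.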
At 200 million database characters per second, a search of the entire human genome (about 3 billion base-pairs) would take about 15 seconds. This assumes that there are enough computation units to hold the entire query string. Performance with longer query strings would be significantly lower since the processor would need to store and retrieve the intermediate D, I, M, and Max values once for every pass in excess of the first.

4. Conclusion

The main advantage of the GALS approach used in the design of the alignment coprocessor is the ease of scaling to smaller CMOS feature sizes, which allows for an increase in the number of computation units in the coprocessor array. Increasing the number of units allows for longer query strings to be processed without using multiple passes. Alternatively, more than one set of processor, coprocessor, and computation unit array can be placed on a single integrated circuit. This would increase throughput rather than increase the effective query string length.

The next steps in this work will be the design of the synchronous logic within the coprocessor and computation units. This will yield information on the layout size of the computation unit, which in turn will determine how many units can be placed on a single integrated circuit. The design will also allow simulation to determine if the coprocessor and computation units can in fact run at near 200 MHz in a 180 nm CMOS process. It is already known from the asynchronous interface design that the interfaces are not a large contributor to layout area and can easily support 200 MHz throughput.

References

[1] T. Smith and M. Waterman, Identification of Common Molecular Sequences, Journal of Molecular Biology, pp. 195-197, 1981.
[2] S. Altschul, W. Gish, E. Myers, and D. Lipman, Basic Local Alignment Search Tool, Journal of Molecular Biology, pp. 403-410, 1990.
[3] W. Pearson and D. Lipman, Improved Tools for Biological Sequence Comparison, Proceedings of the National Academy of Science, pp. 2444-2448, 1988.
[4] National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov.
[5] S. Smith and J. Frenzel, Bioinformatics Application of a Scalable Supercomputer-on-chip Architecture, Proceedings of the International Conference on Parallel and Distributed Processing Techniques, Volume 1, pp. 385-391, 2003.
[6] L. Grate, M. Diekhans, D. Dahle, and R. Hughey, Sequence Analysis with the Kestrel SIMD Parallel Processor, Proceedings of the Pacific Symposium on Biocomputing, pp. 263-274, 2001.
[7] P. Guerdoux-Jamet and D. Lavenier, SAMBA: Hardware Accelerator for Biological Sequence Comparison, Computer Applications in Biosciences, pp. 609-615, 1997.
[8] D. Chapiro, Globally-Asynchronous Locally-Synchronous Systems, Doctoral Thesis, Stanford University, 1984.
[9] Paracel, Inc. website, http://www.paracel.com.
[10] TimeLogic Corp. website, http://www.timelogic.com.
[11] D. Bailey, Clock Distribution, in Design of High-Performance Microprocessor Circuits, IEEE Press, pp. 261-281, 2001.
[12] J. Muttersbach, T. Villiger, H. Kaeslin, N. Felber, and W. Fichtner, Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems, Proceedings of the 12th IEEE International ASIC/SOC Conference, pp. 317-321, 1999.
[13] K. Yun and A. Dooply, Pausible Clocking-Based Heterogeneous Systems, IEEE Transactions on VLSI Systems, pp. 482-488, 1999.
[14] S. Smith and J. Frenzel, Low-latency Multiple Clock Domain Interfacing Without Alteration of Local Clocks, Proceedings of the 15th Biennial IEEE University / Government / Industry Microelectronics Symposium, pp. 342-343, 2003.
[15] C. Branden and J. Tooze, Introduction to Protein Structure, 2nd Edition, Garland Publishing, 1999.
[16] S. Smith, A Multiple-Clock-Domain Bus Architecture Using Asynchronous FIFOs as Elastic Elements, Doctoral Thesis, University of Idaho, 2003.
[17] Taiwan Semiconductor Manufacturing Company website, http://www.tsmc.com.
[18] MOSIS website, http://www.mosis.org.
[19] S. Furber, ARM System-on-Chip Architecture, 2nd Edition, Addison-Wesley, 2000.
[20] D. Seal, ARM Architecture Reference Manual, 2nd Edition, Addison-Wesley, 2000.
[21] ARM Ltd. website, http://www.arm.com.