Pipelined Multipliers for Reconfigurable Hardware

Mitchell J. Myjak and José G. Delgado-Frias
School of Electrical Engineering and Computer Science, Washington State University
Pullman, WA 99164-2752 USA
{mmyjak, jdelgado}@eecs.wsu.edu

Abstract

Reconfigurable devices used in digital signal processing applications must handle large amounts of data in vector form. Most signal processing algorithms use multiplication extensively; thus, the hardware must support this operation to achieve high performance. However, mapping a multiplier onto traditional fine-grain devices produces a complex structure whose performance is limited by the routing overhead. In this paper, we present a novel pipelined multiplier structure suitable for medium-grain and coarse-grain reconfigurable cell arrays. We first implement an unsigned n-bit multiplier using m-bit cells. Then, we show how the same structure can work with two's-complement data with small changes to the configuration. The structure requires $\lceil n/m \rceil^2$ cells, but can execute vector operations in a pipelined fashion. We also discuss the benefits of using a hierarchical design for large multipliers.

1. Introduction

Reconfigurable hardware has become an attractive option for implementing digital signal processing (DSP), especially in applications that require both high performance and flexibility. The performance of reconfigurable devices typically falls between custom integrated circuits and digital signal processors, while the flexibility may even surpass both alternatives. In addition, reconfigurable hardware incurs a low development cost and can be adapted if the needs of the application change, even after deployment [1].

DSP algorithms place great demands on the processing power of any hardware implementation, due to the large amount of binary arithmetic involved.
For example, the basic operation of a finite impulse-response (FIR) filter contains several multiplications and additions:

$$y[n] = b_0 x[n] + b_1 x[n-1] + \cdots + b_k x[n-k] \tag{1}$$

As another example, the essential component of the Fast Fourier Transform (FFT) also requires these two operations, although the inputs and outputs are complex numbers in this case:

$$Y_0 = X_0 + X_1, \qquad Y_1 = (X_0 - X_1) \cdot W \tag{2}$$

Most DSP algorithms repeat basic operations such as these for every sample in the data set. To achieve maximum performance, the hardware can pipeline the multiplication and addition operations so that multiple samples can be processed simultaneously. This technique dramatically reduces the execution time of DSP algorithms, but incurs a penalty of additional overhead.

Consider the problem of using reconfigurable hardware to perform binary arithmetic on large data sets. In general, reconfigurable devices contain an array of programmable cells and interconnection structures [2]. Traditional fine-grain architectures such as the field-programmable gate array (FPGA) place little functionality in the cells. Implementing an adder on a fine-grain device presents no major problems, as the cells can easily generate the carry and sum bits required for this operation. However, implementing a multiplier is a significant challenge, due to the routing delays associated with inter-cell communication in fine-grain devices. One way around this problem is to include dedicated multiplication hardware in the architecture [3, 4]. Another approach is to extend the capabilities of each cell to work with m-bit data words instead of single bits [5]. In fact, these medium-grain and coarse-grain reconfigurable devices show great promise for DSP applications [6, 7]. With this approach, the problem then becomes the following: given an array of m-bit cells, design a structure to multiply two unsigned n-bit integers A and B:

$$Y_{2n-1:0} = A_{n-1:0} \cdot B_{n-1:0} \tag{3}$$

Note that the output Y contains 2n bits. As an initial approach, consider the familiar carry-save multiplier [8].
Figure 1 illustrates this structure for n=20 and m=4. The multiplier contains a rectangular array of cells with $\lceil n/m \rceil$ cells in the horizontal direction and $\lceil n/m \rceil + 1$ cells in the vertical direction. A and B are divided into m-bit portions and broadcast across the columns and rows of the array.

Figure 1. Carry-save multiplier (n=20, m=4)

Unlike a typical carry-save multiplier, each cell works with m-bit inputs instead of 1-bit inputs. Cells on the top row multiply two m-bit portions of A and B, passing the upper and lower portions of the result to other cells or to the Y output. Cells in the middle rows perform the same multiplication and then add up to two m-bit terms to the result. Finally, cells on the bottom row add up to three m-bit terms together. As a function of n and m, the carry-save multiplier requires $\lceil n/m \rceil^2 + \lceil n/m \rceil$ cells, while the critical path is $2\lceil n/m \rceil$ cells long.

In the remainder of this paper, we describe a novel improvement to the carry-save multiplier that reduces the total number of cells required. As described in Section 2, we modify the interconnection structure so that every cell carries the same workload. The resulting structure can be pipelined easily for high throughput. In Section 3, we demonstrate that the same structure can be used to perform two's-complement multiplication with slight changes to the operations performed by each cell. Section 4 briefly explores a hierarchical architecture in which each cell uses a matrix of smaller elements to implement the necessary operations. Finally, Section 5 gives some concluding remarks.

2. Unsigned Multiplier

The most complex operation performed by any cell in the carry-save multiplier is the m-bit multiply-accumulate (MAC) function

$$y_{2m-1:0} = (a_{m-1:0} \cdot b_{m-1:0}) + c_{m-1:0} + d_{m-1:0} \tag{4}$$

Here a and b represent the two m-bit portions of A and B, and c and d denote the two terms added to the result. Figure 2 illustrates the inputs and outputs of a cell that computes the MAC operation.

Figure 2. Inputs and outputs of MAC cell (m=4)
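A short range check, added here as a worked step and not taken from the original text, suggests why the MAC function (4) can accept exactly two m-bit addends. With all four unsigned inputs at their maximum value of $2^m - 1$:

```latex
\begin{aligned}
y_{\max} &= (2^m - 1)(2^m - 1) + (2^m - 1) + (2^m - 1) \\
         &= (2^m - 1)\bigl((2^m - 1) + 2\bigr) \\
         &= (2^m - 1)(2^m + 1) = 2^{2m} - 1 .
\end{aligned}
```

The largest result therefore fills a 2m-bit output exactly and never overflows it. For m=4 this is 15·15 + 15 + 15 = 255.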
As we proposed in [9], one can reduce the size of the multiplier by ensuring that all cells perform the same operation. Such a reduction is possible because reconfigurable devices typically contain an array of identical cells. Observe in Figure 1 that the bottom row of cells does not perform multiplication, and the cells in the left column add one term instead of the usual two. Hence, we can eliminate the bottom row of cells and rearrange the interconnection scheme so that the cells in the left column take up the slack. Figure 3 depicts the resulting structure for the 20-bit case. This improved design has a size of $\lceil n/m \rceil^2$ cells and a critical path $2\lceil n/m \rceil - 1$ cells long. Notice that two extra n-bit terms C and D can be incorporated into the top row of cells. This enhancement creates a powerful n-bit MAC unit that evaluates the expression

$$Y_{2n-1:0} = (A_{n-1:0} \cdot B_{n-1:0}) + C_{n-1:0} + D_{n-1:0} \tag{5}$$

The primary advantage of using arrays of cells to perform multiplication is that the DSP algorithm can exploit the benefits of pipelining. Suppose that we superpipeline the structure in Figure 3 so that each cell occupies one pipeline stage. The clock cycle time thus includes the time to evaluate the m-bit MAC function as well as the time to transfer the result to adjacent cells. Figure 4 labels each cell in the improved MAC unit with the clock cycle at which it calculates the intermediate result. We can then insert the appropriate number of pipeline registers into the module, as depicted by slashes in the figure.
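The claim that identical m-bit MAC cells suffice for the n-bit MAC of Eq. (5) can be checked with a small functional model. The sketch below is an illustration under stated assumptions, not the wiring of Figure 3: it tracks m-bit terms by their bit weight and lets each cell absorb any two pending terms of its own weight. The names `mac_cell`, `split` and `mac_unit` are hypothetical.

```python
from collections import deque

def mac_cell(a, b, c, d, m):
    """m-bit MAC cell of Eq. (4): returns (upper, lower) halves of a*b + c + d."""
    mask = (1 << m) - 1
    assert all(0 <= v <= mask for v in (a, b, c, d))
    y = a * b + c + d              # at most 2**(2*m) - 1, so it fits in 2m bits
    return y >> m, y & mask

def split(x, k, m):
    """Cut x into k unsigned m-bit portions, least significant first."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]

def mac_unit(A, B, C, D, n, m):
    """Evaluate Y = A*B + C + D with ceil(n/m)**2 MAC cells.

    Assumption: terms are routed by weight only; the physical
    interconnection of the paper's figures is not modelled.
    Returns (Y, number_of_cells_used).
    """
    k = -(-n // m)                 # ceil(n / m)
    a, b, c, d = (split(v, k, m) for v in (A, B, C, D))
    pending = [deque() for _ in range(2 * k)]   # pending[w]: terms of weight 2**(m*w)
    for w in range(k):
        pending[w].extend((c[w], d[w]))
    cells = 0
    for w in range(2 * k - 1):
        # every cell (i, j) with i + j == w multiplies portion a[j] by b[i]
        for i in range(max(0, w - k + 1), min(w, k - 1) + 1):
            j = w - i
            upper, lower = mac_cell(a[j], b[i],
                                    pending[w].popleft(), pending[w].popleft(), m)
            pending[w].append(lower)       # stays at the same weight
            pending[w + 1].append(upper)   # moves to the next weight
            cells += 1
    assert all(len(q) == 1 for q in pending)    # one m-bit portion of Y per weight
    Y = sum(q[0] << (m * w) for w, q in enumerate(pending))
    return Y, cells

if __name__ == "__main__":
    top = (1 << 20) - 1
    Y, cells = mac_unit(top, top, top, top, n=20, m=4)
    print(Y == top * top + 2 * top, cells)
```

Under these assumptions every cell receives exactly two addends and exactly one m-bit term remains at each weight, which is consistent with the cell count of $\lceil n/m \rceil^2$ stated above (25 cells for n=20 and m=4).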

Figure 3. Improved MAC unit (n=20, m=4)

Figure 4. Superpipelined MAC unit (slashes denote pipeline latches). Clock-cycle labels of the cells, row by row from the top: 1 1 1 1 1 / 3 2 2 2 2 / 5 4 3 3 3 / 7 6 5 4 4 / 9 8 7 6 5.

The latency of this superpipelined design equals the length of the critical path, or $2\lceil n/m \rceil - 1$ cycles. However, the structure can initiate one operation per clock cycle, making the resulting throughput very high. Notice that the MAC unit generates the Y output in a staggered fashion: the least significant m bits in the first cycle, the next m bits in the second cycle, and so forth. The B input should also arrive in this staggered fashion.

The pipelining scheme in Figure 4 has one disadvantage: the m-bit portions of the B input must be broadcast across several columns of cells during the same clock cycle. Depending on the architecture of the reconfigurable device, such an operation may not be feasible. One way to eliminate the broadcast is to pipeline all the internal lines of the MAC unit. As shown in Figure 5 for the 20-bit case, this change increases the total latency of the module to $3\lceil n/m \rceil - 2$ cycles. This additional delay should be negligible for most DSP algorithms, since the multiplier still initiates one operation per clock cycle. For example, to multiply two 20-bit vectors of 100 elements, the former design requires 108 cycles (a latency of 9 cycles plus 99 further initiations), whereas the latter design requires 112 cycles (13 plus 99). If the clock cycle time can be reduced due to the absence of broadcast operations, the MAC unit with pipelined internal lines actually achieves higher performance.

Figure 5. Superpipelined MAC unit with pipelined internal lines. Clock-cycle labels of the cells, row by row from the top: 5 4 3 2 1 / 7 6 5 4 3 / 9 8 7 6 5 / 11 10 9 8 7 / 13 12 11 10 9.

3. Two's-Complement Multiplier

As a rule, DSP algorithms work with both positive and negative numbers, so it is reasonable to expect that applications may require two's-complement multiplication. In fact, the same multiplier structure can be used to perform this operation, except that some cells evaluate slightly different functions. Before we discuss these modifications, recall from (4) that each cell in the unsigned MAC unit evaluates the m-bit MAC function

$$y_{2m-1:0} = (a_{m-1:0} \cdot b_{m-1:0}) + c_{m-1:0} + d_{m-1:0} \tag{4}$$

Figure 6 illustrates how these m-bit terms are defined for various cells in the design. For consistency, the c input to the cell always appears to the left of the d input.

Figure 6. Cells in improved MAC unit

Now consider a two's-complement MAC unit that handles n-bit inputs in m-bit portions. From the properties of two's-complement numbers, the most significant m-bit portion has two's-complement format, but the remaining m-bit portions have unsigned format. Hence, if we modify the unsigned MAC unit to perform two's-complement arithmetic, many of the cells will still operate on unsigned inputs. Figure 7 depicts the two's-complement multiplier for n=20 and m=4. Solid lines denote unsigned data; dashed lines denote two's-complement data.

Figure 7. Two's-complement MAC unit (n=20, m=4)

Observe that some of the cells generate two's-complement outputs, whereas other cells do not. In fact, the two's-complement MAC unit contains seven types of cells, labeled A through H in the figure (G is missing for technical reasons). The A cells simply evaluate the unsigned MAC function in (4). However, the B cell must multiply the two's-complement portion of A with an unsigned portion of B. The cell also adds two's-complement portions of C and D to the result. In order to represent the entire range of valid outputs, the B cell must generate a 2m-bit output y whose upper m bits and lower m bits are both two's-complement numbers. This data format is unusual, but is the best choice for representing the result. One can think of the cell as generating two m-bit outputs satisfying the expression

$$2^m \cdot y_{2m-1:m} + y_{m-1:0} = (a_{m-1:0} \cdot b_{m-1:0}) + c_{m-1:0} + d_{m-1:0} \tag{6}$$

where $a_{m-1:0}$, $c_{m-1:0}$, $d_{m-1:0}$, $y_{2m-1:m}$, and $y_{m-1:0}$ are in two's-complement format. Table 1 lists several example 4-bit calculations for the B cell. Recall that a 4-bit two's-complement number ranges from -8 to 7, whereas a 4-bit unsigned number ranges from 0 to 15.

Table 1: Example calculations of the B cell

a_{3:0}   b_{3:0}   c_{3:0}   d_{3:0}   y_{7:4}   y_{3:0}   y = 16·y_{7:4} + y_{3:0}
   5         5        ±5        ∓5         2        -7         25
   5        10         5         5         4        -4         60
  -5         5        ±5        ∓5        -2         7        -25
  -5        10        -5        -5        -4         4        -60
   7        15         7         7         7         7        119
  -8        15        -8        -8        -8        -8       -136

In the first and third rows, c and d are 5 and -5 and therefore cancel. The last two rows are the extreme cases: 119 and -136 are the largest and smallest values the B cell can produce for m=4.

A similar analysis can be performed for the remaining cells used in the multiplier. For example, the C cells generate an unsigned output $y_{2m-1:m}$ and a two's-complement output $y_{m-1:0}$. With the data formats shown in Figure 7, the n-bit multiplier can generate a two's-complement output Y without additional hardware. Table 2 lists the input and output format of each type of cell (including the G cell used later). A + sign denotes unsigned format, and a - sign denotes two's-complement format.

Table 2: Data format requirements for each cell

Type   a_{m-1:0}   b_{m-1:0}   c_{m-1:0}   d_{m-1:0}   y_{2m-1:m}   y_{m-1:0}
A          +           +           +           +            +            +
B          -           +           -           -            -            -
C      four + entries and two - entries (y_{2m-1:m} is +, y_{m-1:0} is -)
D      three + entries and three - entries
E      three + entries and three - entries
F      three + entries and three - entries
G      four + entries and two - entries
H      one + entry and five - entries
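The unusual output format of the B cell in Eq. (6) can be made concrete with a short sketch. It is a behavioural model only, with the hypothetical names `split_signed` and `b_cell`, and it assumes that the lower half is chosen as the unique two's-complement m-bit value congruent to the result modulo $2^m$.

```python
def split_signed(y, m):
    """Split integer y into (upper, lower) with y == (upper << m) + lower,
    where lower lies in the two's-complement range [-2**(m-1), 2**(m-1) - 1]."""
    half = 1 << (m - 1)
    lower = ((y + half) % (1 << m)) - half   # Python's % is non-negative here
    upper = (y - lower) >> m                 # exact division by 2**m
    return upper, lower

def b_cell(a, b, c, d, m=4):
    """Type B cell of Eq. (6): a, c, d are two's-complement, b is unsigned."""
    half = 1 << (m - 1)
    assert all(-half <= v < half for v in (a, c, d))
    assert 0 <= b < (1 << m)
    upper, lower = split_signed(a * b + c + d, m)
    assert -half <= upper < half             # upper half is also an m-bit signed value
    return upper, lower

if __name__ == "__main__":
    for args in [(5, 10, 5, 5), (-5, 10, -5, -5), (7, 15, 7, 7), (-8, 15, -8, -8)]:
        print(args, b_cell(*args))
```

With m=4 the model accepts every legal input combination without tripping the range assertion on the upper half, which illustrates why two signed halves cover the full output range of -136 to 119 whereas an ordinary 8-bit two's-complement word (-128 to 127) would not.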

4. Hierarchical Multiplier

The last two sections have demonstrated that n-bit MAC units in general require seven types of cells. Each cell performs the MAC function on m-bit inputs, but different cells use different data formats. A natural question is how each cell can implement the required m-bit operations. For m=1, a simple combinational circuit suffices, but for larger m, the most practical solution may involve some kind of arithmetic unit. For reconfigurable devices, consider the following alternative: to implement the m-bit operations required by each cell, use an m×m array of 1-bit cells. In other words, the proposed architecture contains a two-level hierarchy of cells and elements, where cells work with m-bit words and elements work with single bits.

The next question is how the m×m array of elements can implement all the functionality required by m-bit cells. For type A cells, the solution is simple: use the unsigned multiplier structure presented in Section 2. As shown in Figure 8 for m=4, each of the elements works with data in unsigned form. Hence, one can classify the elements as type A as well.

Figure 8. Type A cell (m=4)

Each element computes the 1-bit MAC function

$$\psi_{1:0} = (\alpha \cdot \beta) + \gamma + \delta \tag{7}$$

where α, β, γ, and δ denote the inputs to the element, and ψ signifies the 2-bit output. Note that multiplication reduces to the logical AND operation, denoted by $\wedge$, in the 1-bit case. Each bit of the output ψ can be expressed in terms of the combinational logic functions

$$\psi_1 = \mathrm{MAJ}(\alpha \wedge \beta, \gamma, \delta), \qquad \psi_0 = \mathrm{XOR}(\alpha \wedge \beta, \gamma, \delta) \tag{8}$$

where

$$\mathrm{MAJ}(P, Q, R) = (P \wedge Q) \vee (P \wedge R) \vee (Q \wedge R), \qquad \mathrm{XOR}(P, Q, R) = P \oplus Q \oplus R \tag{9}$$

As discussed in the last section, two's-complement MAC units require additional types of cells. Type B cells, for example, assume that a, c, and d have two's-complement format, and that b has unsigned format. Using an m×m array of elements to implement a type B cell produces the result in Figure 9.

Figure 9. Type B cell (m=4)

Knowing the data format for each input to the cell, one can determine the format of every internal line using the information in Table 2. The procedure closely parallels the analysis for the two's-complement multiplier in Figure 7, except that the signal names are Greek symbols instead of lowercase letters. The implementation of the type B cell requires elements of types A, B, and C. Note that both the upper and lower portions of the y output have two's-complement format, as shown in Figure 7.

Continuing on, cells of types C and D have straightforward implementations (Figures 10-11). Type E cells require five types of elements, including type G (Figure 12). Type F cells are similar (Figure 13). Finally, type H cells have the same formatting assignments as the two's-complement multiplier (Figure 14). This property holds because all the inputs and outputs of a type H cell have two's-complement format.
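The element equations (7) to (9) and the idea of building a type A cell from an m×m array of elements can be checked with a brief sketch. It is illustrative only: the names `maj`, `element` and `type_a_cell` are hypothetical, and the routing of 1-bit terms by weight is an assumption standing in for the wiring of Figure 8.

```python
from collections import deque
from itertools import product

def maj(p, q, r):
    """Majority of three bits, Eq. (9)."""
    return (p & q) | (p & r) | (q & r)

def element(alpha, beta, gamma, delta):
    """Type A element: returns (psi1, psi0) as given by Eq. (8)."""
    p = alpha & beta                      # 1-bit multiplication is a logical AND
    return maj(p, gamma, delta), p ^ gamma ^ delta

# Exhaustive check that Eq. (8) realises the 1-bit MAC function of Eq. (7).
for alpha, beta, gamma, delta in product((0, 1), repeat=4):
    psi1, psi0 = element(alpha, beta, gamma, delta)
    assert 2 * psi1 + psi0 == alpha * beta + gamma + delta

def type_a_cell(a, b, c, d, m=4):
    """Unsigned m-bit MAC y = a*b + c + d built from m*m elements.

    Assumption: each element takes any two pending bits of its own weight;
    the physical element interconnection is not modelled.
    """
    def bit(x, i):
        return (x >> i) & 1
    pending = [deque() for _ in range(2 * m)]    # pending[w]: bits of weight 2**w
    for w in range(m):
        pending[w].extend((bit(c, w), bit(d, w)))
    for w in range(2 * m - 1):
        for i in range(max(0, w - m + 1), min(w, m - 1) + 1):
            psi1, psi0 = element(bit(a, w - i), bit(b, i),
                                 pending[w].popleft(), pending[w].popleft())
            pending[w].append(psi0)       # sum bit keeps its weight
            pending[w + 1].append(psi1)   # carry bit moves up one weight
    return sum(q[0] << w for w, q in enumerate(pending))

if __name__ == "__main__":
    print(type_a_cell(15, 15, 15, 15))    # largest 4-bit case
```

The same bookkeeping that serves m-bit cells in an n-bit MAC unit serves 1-bit elements in an m-bit cell, which is the two-level hierarchy described in this section.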

Figure 10. Type C cell (m=4)

Figure 11. Type D cell (m=4)

Figure 12. Type E cell (m=4)

Figure 13. Type F cell (m=4)

Figure 14. Type H cell (m=4)

Now consider the MAC function computed by type B elements. From Table 2, the α, γ, δ, $\psi_1$, and $\psi_0$ signals of type B elements all have two's-complement format. For each of these signals, logic 0 denotes 0 and logic 1 denotes -1. Hence, type B elements compute the expression

$$-2\psi_1 - \psi_0 = (-\alpha \cdot \beta) - \gamma - \delta \tag{10}$$

which simplifies to

$$2\psi_1 + \psi_0 = (\alpha \cdot \beta) + \gamma + \delta \tag{11}$$

Since (11) and (7) are equivalent, elements of types A and B implement the same combinational logic expressions. Performing a similar analysis on the remaining types of cells reveals that only four distinct types of elements are required. In fact, each element implements the same expression for $\psi_0$; the only difference is the expression used to compute $\psi_1$. Table 3 lists the functions corresponding to each type of element. The $\psi_1$ expressions for types C through H are MAJ functions of the same three signals with logical complements applied; elements that share an entry in the last column use the same expression. A reconfigurable architecture could exploit these similarities to implement all necessary operations efficiently.

Table 3: Reduction of element types

Type   ψ_1                               ψ_0                  Same as
A      MAJ(α∧β, γ, δ)                    XOR(α∧β, γ, δ)       A
B      MAJ(α∧β, γ, δ)                    XOR(α∧β, γ, δ)       A
C      MAJ of α∧β, γ, δ (complemented)   XOR(α∧β, γ, δ)       C
D      MAJ of α∧β, γ, δ (complemented)   XOR(α∧β, γ, δ)       C
E      MAJ of α∧β, γ, δ (complemented)   XOR(α∧β, γ, δ)       C
F      MAJ of α∧β, γ, δ (complemented)   XOR(α∧β, γ, δ)       F
G      MAJ of α∧β, γ, δ (complemented)   XOR(α∧β, γ, δ)       F
H      MAJ of α∧β, γ, δ (complemented)   XOR(α∧β, γ, δ)       H

5. Concluding Remarks

In this paper, we have presented a novel scheme for performing n-bit multiply-accumulate (MAC) operations using a reconfigurable array of m-bit cells. Each cell computes an m-bit MAC function with two additive terms. The structure can be superpipelined into m-bit units for extremely high throughput, as required in signal processing applications. With suitable changes to the configuration of each cell, the structure can handle unsigned or two's-complement inputs. To implement the functionality required by each cell, we propose to use an m×m matrix of reconfigurable 1-bit elements. Only four types of elements are required to construct multipliers of any size.

As a final note, we have used the concepts presented in this paper to create a two-level reconfigurable architecture for digital signal processing applications [9]. The architecture contains an array of reconfigurable 4-bit cells, each of which consists of a 4×4 matrix of elements. Each element, in turn, uses a 4-input, 2-bit lookup table to evaluate arithmetic or logic functions. Cells can connect to neighboring cells in any direction. However, the matrix of elements can only assume two structures, one of which is the structure of the MAC unit. Having the capability to compute the MAC operation means that cells can perform the arithmetic functions necessary for digital signal processing.

6. Acknowledgment

M. Myjak is supported by the U.S. Department of Homeland Security Graduate Fellowship.

7. References

[1] R. Tessier and W. Burleson, "Reconfigurable computing for digital signal processing: a survey," in Y. Hu (ed.), Programmable digital signal processors, Marcel Dekker Inc., 2001.
[2] K. Compton and S. Hauck, "Reconfigurable computing: a survey of systems and software," ACM Computing Surveys, vol. 34, no. 2, Jun 2002, pp. 171-210.
[3] K. Rajagopalan and P. Sutton, "A flexible multiplication unit for an FPGA logic block," in Proc. 2001 IEEE International Symposium on Circuits and Systems, 2001, pp. 546-549.
[4] S. Haynes and P. Cheung, "Configurable multiplier blocks for embedding in FPGAs," Electronics Letters, vol. 34, iss. 7, Apr 1998, pp. 638-639.
[5] R. Hartenstein, "Coarse grain reconfigurable architectures," in Proc. 6th Asia South Pacific Design Automation Conference, Yokohama, Japan, 2001, pp. 564-570.
[6] J. Smit et al, "Low cost and fast turnaround: reconfigurable graph-based execution units," in Proc. 7th BELSIGN Workshop, Enschede, Netherlands, 1998.
[7] P. Heysters and G. Smit, "Mapping of DSP algorithms on the MONTIUM architecture," in Proc. International Parallel and Distributed Processing Symposium, Apr 2003, pp. 180-185.
[8] J. Rabaey et al, Digital Integrated Circuits: A Design Perspective, 2nd ed., Upper Saddle River, NJ: Pearson Education, Inc., 2003, pp. 591-592.
[9] M. Myjak and J. Delgado-Frias, "A two-level reconfigurable architecture for digital signal processing," in Proc. 2003 International Conference on VLSI, Las Vegas, NV, Jun 2003, pp. 21-27.