Pipelined Multipliers for Reconfigurable Hardware

Mitchell J. Myjak and José G. Delgado-Frias
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164-2752 USA
{mmyjak, jdelgado}@eecs.wsu.edu

Abstract

Reconfigurable devices used in digital signal processing applications must handle large amounts of data in vector form. Most signal processing algorithms use multiplication extensively; thus, the hardware must support this operation to achieve high performance. However, mapping a multiplier on traditional fine-grain devices produces a complex structure whose performance is limited by the routing overhead. In this paper, we present a novel pipelined multiplier structure suitable for medium-grain and coarse-grain reconfigurable cell arrays. We first implement an unsigned n-bit multiplier using m-bit cells. Then, we show how the same structure can work with two's-complement data with small changes to the configuration. The structure requires $\lceil n/m \rceil^2$ cells, but can execute vector operations in a pipelined fashion. We also discuss the benefits of using a hierarchical design for large multipliers.

1. Introduction

Reconfigurable hardware has become an attractive option for implementing digital signal processing (DSP), especially in applications that require both high performance and flexibility. The performance of reconfigurable devices typically falls between custom integrated circuits and digital signal processors, while the flexibility may even surpass both alternatives. In addition, reconfigurable hardware incurs a low development cost and can be adapted if the needs of the application change, even after deployment [1].

DSP algorithms place great demands on the processing power of any hardware implementation, due to the large amount of binary arithmetic involved.
For example, the basic operation of a finite impulse-response (FIR) filter contains several multiplications and additions:

$$y[n] = b_0 x[n] + b_1 x[n-1] + \cdots + b_k x[n-k] \qquad (1)$$

As another example, the essential component of the Fast Fourier Transform (FFT) also requires these two operations, although the inputs and outputs are complex numbers in this case:

$$Y_0 = X_0 + X_1, \qquad Y_1 = (X_0 - X_1)\,W. \qquad (2)$$

Most DSP algorithms repeat basic operations such as these for every sample in the data set. To achieve maximum performance, the hardware can pipeline the multiplication and addition operations so that multiple samples can be processed simultaneously. This technique dramatically reduces the execution time of DSP algorithms, but incurs a penalty of additional overhead.

Consider the problem of using reconfigurable hardware to perform binary arithmetic on large data sets. In general, reconfigurable devices contain an array of programmable cells and interconnection structures [2]. Traditional fine-grain architectures such as the field-programmable gate array (FPGA) place little functionality in the cells. Implementing an adder on a fine-grain device presents no major problems, as the cells can easily generate the carry and sum bits required for this operation. However, implementing a multiplier is a significant challenge, due to the routing delays associated with inter-cell communication in fine-grain devices. One way around this problem is to include dedicated multiplication hardware in the architecture [3, 4]. Another approach is to extend the capabilities of each cell to work with m-bit data words instead of single bits [5]. In fact, these medium-grain and coarse-grain reconfigurable devices show great promise for DSP applications [6, 7].

With this approach, the problem then becomes the following: given an array of m-bit cells, design a structure to multiply two unsigned n-bit integers A and B:

$$Y_{2n-1:0} = A_{n-1:0} \times B_{n-1:0}. \qquad (3)$$

Note that the output Y contains 2n bits. As an initial approach, consider the familiar carry-save multiplier [8].
Figure 1 illustrates this structure for n=20 and m=4. The multiplier contains a rectangular array of cells with $\lceil n/m \rceil$ cells in the horizontal direction and $\lceil n/m \rceil + 1$ cells in the vertical direction. A and B are divided into m-bit portions and broadcast across the columns and rows of the array.

Figure 1. Carry-save multiplier (n=20, m=4)

Unlike a typical carry-save multiplier, each cell works with m-bit inputs instead of 1-bit inputs. Cells on the top row multiply two m-bit portions of A and B, passing the upper and lower portions of the result to other cells or the Y output. Cells in the middle rows perform the same multiplication and then add up to two m-bit terms to the result. Finally, cells on the bottom row add up to three m-bit terms together. As a function of n and m, the carry-save multiplier requires $\lceil n/m \rceil^2 + \lceil n/m \rceil$ cells, while the critical path is $2\lceil n/m \rceil$ cells long.

In the remainder of this paper, we describe a novel improvement to the carry-save multiplier that reduces the total number of cells required. As described in Section 2, we modify the interconnection structure so that every cell carries the same workload. The resulting structure can be pipelined easily for high throughput. In Section 3, we demonstrate that the same structure can be used to perform two's-complement multiplication with slight changes to the operations performed by each cell. Section 4 briefly explores a hierarchical architecture in which each cell uses a matrix of smaller elements to implement the necessary operations. Finally, Section 5 gives some concluding remarks.

2. Unsigned Multiplier

The most complex operation performed by any cell in the carry-save multiplier is the m-bit multiply-accumulate (MAC) function

$$y_{2m-1:0} = (a_{m-1:0} \times b_{m-1:0}) + c_{m-1:0} + d_{m-1:0}. \qquad (4)$$

Here a and b represent the two m-bit portions of A and B, and c and d denote the two terms added to the result. Figure 2 illustrates the inputs and outputs of a cell that computes the MAC operation.

Figure 2. Inputs and outputs of MAC cell (m=4)
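A short worked bound, not spelled out in the text, may help explain why the MAC function in (4) can absorb exactly two additive terms: with all four m-bit inputs at their maximum value $2^m - 1$, the result just reaches the largest value a 2m-bit output can hold.

```latex
\begin{align*}
y_{\max} &= (2^m - 1)(2^m - 1) + (2^m - 1) + (2^m - 1) \\
         &= \left(2^{2m} - 2^{m+1} + 1\right) + \left(2^{m+1} - 2\right) \\
         &= 2^{2m} - 1 .
\end{align*}
```

For m=4 this reads 15 × 15 + 15 + 15 = 255, the largest 8-bit value, so a third m-bit term could overflow the output.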
As we proposed in [9], one can reduce the size of the multiplier by ensuring that all cells perform the same operation. Such a reduction is possible because reconfigurable devices typically contain an array of identical cells. Observe in Figure 1 that the bottom row of cells does not perform multiplication, and the cells in the left column add one term instead of the usual two. Hence, we can eliminate the bottom row of cells and rearrange the interconnection scheme so that the cells in the left column take up the slack. Figure 3 depicts the resulting structure for the 20-bit case. This improved design has a size of $\lceil n/m \rceil^2$ cells and a critical path $2\lceil n/m \rceil - 1$ cells long. Notice that two extra n-bit terms C and D can be incorporated into the top row of cells. This enhancement creates a powerful n-bit MAC unit that evaluates the expression

$$Y_{2n-1:0} = (A_{n-1:0} \times B_{n-1:0}) + C_{n-1:0} + D_{n-1:0}. \qquad (5)$$

The primary advantage of using arrays of cells to perform multiplication is that the DSP algorithm can exploit the benefits of pipelining. Suppose that we superpipeline the structure in Figure 3 so that each cell occupies one pipeline stage. The clock cycle time thus includes the time to evaluate the m-bit MAC function as well as the time to transfer the result to adjacent cells. Figure 4 labels each cell in the improved MAC unit with the clock cycle at which it calculates the intermediate result. We can then insert the appropriate number of pipeline registers into the module, as depicted by slashes in the figure.
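The following Python sketch models an n-bit MAC unit built from a square grid of m-bit MAC cells. It is only an illustration under stated assumptions: the function names (`mac_cell`, `portions`, `mac_array`) are hypothetical, and the wiring is one possible interconnection chosen so that every cell receives exactly two additive terms, not necessarily the wiring drawn in Figure 3. In this assumed wiring, the top row takes the portions of C and D, cells on or to the right of the main diagonal take both terms from the row above, and cells to the left of the diagonal take the upper half of their right-hand neighbour's result.

```python
def mac_cell(a, b, c, d, m):
    """m-bit MAC function of (4): return the (upper, lower) halves of a*b + c + d."""
    mask = (1 << m) - 1
    assert all(0 <= v <= mask for v in (a, b, c, d))
    y = a * b + c + d                     # at most 2**(2*m) - 1, so it fits in 2m bits
    return y >> m, y & mask


def portions(x, m, k):
    """Split x into k m-bit portions, least significant first."""
    return [(x >> (m * i)) & ((1 << m) - 1) for i in range(k)]


def mac_array(A, B, C, D, n, m):
    """Evaluate A*B + C + D with a k-by-k grid of MAC cells, k = ceil(n/m).

    Row j uses portion j of B; column p (counted from the left) uses
    portion k-1-p of A. Returns (Y, latency in cell delays).
    The wiring is an assumed reconstruction for illustration only.
    """
    k = -(-n // m)                        # ceiling division
    a, b, c, d = (portions(v, m, k) for v in (A, B, C, D))
    hi = [[0] * k for _ in range(k)]      # upper half of each cell's result
    lo = [[0] * k for _ in range(k)]      # lower half of each cell's result
    t = [[0] * k for _ in range(k)]       # cycle in which each cell evaluates
    for j in range(k):
        for p in range(k - 1, -1, -1):    # right to left within a row
            i = k - 1 - p
            if j == 0:                    # top row: portions of C and D
                terms = [(c[i], 0), (d[i], 0)]
            elif p >= j:                  # on or right of the diagonal
                terms = [(hi[j - 1][p], t[j - 1][p]),
                         (lo[j - 1][p - 1], t[j - 1][p - 1])]
            elif p >= 1:                  # left of the diagonal
                terms = [(lo[j - 1][p - 1], t[j - 1][p - 1]),
                         (hi[j][p + 1], t[j][p + 1])]
            else:                         # leftmost column
                terms = [(hi[j - 1][0], t[j - 1][0]),
                         (hi[j][1], t[j][1])]
            hi[j][p], lo[j][p] = mac_cell(a[i], b[j], terms[0][0], terms[1][0], m)
            t[j][p] = 1 + max(terms[0][1], terms[1][1])
    # The rightmost column emits portions 0..k-1 of Y; the bottom row emits the rest.
    digits = [lo[j][k - 1] for j in range(k)]
    digits += [lo[k - 1][p] for p in range(k - 2, -1, -1)]
    digits.append(hi[k - 1][0])
    Y = sum(v << (m * w) for w, v in enumerate(digits))
    return Y, t[k - 1][0]
```

For n=20 and m=4 the grid has 25 cells, and the latency returned by this model is 9 cell delays, which agrees with the critical path length of $2\lceil n/m \rceil - 1$ quoted above. The low-order portions of Y come out of the rightmost column one row at a time, which is the behaviour a pipelined version would exploit.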

Figure 3. Improved MAC unit (n=20, m=4)

Figure 4. Superpipelined MAC unit (slashes denote pipeline latches). Cycle labels by row, left to right: 1 1 1 1 1; 3 2 2 2 2; 5 4 3 3 3; 7 6 5 4 4; 9 8 7 6 5.

The latency of this superpipelined design equals the length of the critical path, or $2\lceil n/m \rceil - 1$ cycles. However, the structure can initiate one operation per clock cycle, making the resulting throughput very high. Notice that the MAC unit generates the Y output in a staggered fashion: the least significant m bits in the first cycle, the next m bits in the second cycle, and so forth. The B input should also arrive in this staggered fashion.

The pipelining scheme in Figure 4 has one disadvantage: the m-bit portions of the B input must be broadcast across several columns of cells during the same clock cycle. Depending on the architecture of the reconfigurable device, such an operation may not be feasible. One way to eliminate the broadcast is to pipeline all the internal lines of the MAC unit. As shown in Figure 5 for the 20-bit case, this change increases the total latency of the module to $3\lceil n/m \rceil - 2$. This additional delay should be negligible for most DSP algorithms, since the multiplier still initiates one operation per clock cycle. For example, to multiply two 20-bit vectors of 100 elements, the former design requires 108 cycles, whereas the latter design requires 112 cycles. If the clock cycle time can be reduced due to the absence of broadcast operations, the MAC unit with pipelined internal lines actually achieves higher performance.

Figure 5. Superpipelined MAC unit with pipelined internal lines. Cycle labels by row, left to right: 5 4 3 2 1; 7 6 5 4 3; 9 8 7 6 5; 11 10 9 8 7; 13 12 11 10 9.

3. Two's-complement Multiplier

As a rule, DSP algorithms work with both positive and negative numbers, so it is reasonable to expect that applications may require two's-complement multiplication. In fact, the same multiplier structure can be used to perform this operation, except that some cells evaluate slightly different functions. Before we discuss these modifications, recall from (4) that each cell in the unsigned MAC unit evaluates the m-bit MAC function

$$y_{2m-1:0} = (a_{m-1:0} \times b_{m-1:0}) + c_{m-1:0} + d_{m-1:0}. \qquad (4)$$

Figure 6 illustrates how these m-bit terms are defined for various cells in the design. For consistency, the c input to the cell always appears to the left of the d input.

Figure 6. Cells in improved MAC unit

Now consider a two's-complement MAC unit that handles n-bit inputs in m-bit portions. From the properties of two's-complement numbers, the most significant m-bit portion has two's-complement format, but the remaining m-bit portions have unsigned format. Hence, if we modify the unsigned MAC unit to perform two's-complement arithmetic, many of the cells will still operate on unsigned inputs. Figure 7 depicts the two's-complement multiplier for n=20 and m=4. Solid lines denote unsigned data; dashed lines denote two's-complement data.

Figure 7. Two's-complement MAC unit (n=20, m=4)

Observe that some of the cells generate two's-complement outputs, whereas other cells do not. In fact, the two's-complement MAC unit contains seven types of cells, labeled A through H in the figure (G is missing for technical reasons). The A cells simply evaluate the unsigned MAC function in (4). However, the B cell must multiply the two's-complement portion of A with an unsigned portion of B. The cell also adds two's-complement portions of C and D to the result. In order to represent the entire range of valid outputs, the B cell must generate a 2m-bit output y whose upper m bits and lower m bits are both two's-complement numbers. This data format is unusual, but is the best choice for representing the result. One can think of the cell as generating two m-bit outputs satisfying the expression

$$2^m\, y_{2m-1:m} + y_{m-1:0} = (a_{m-1:0} \times b_{m-1:0}) + c_{m-1:0} + d_{m-1:0}, \qquad (6)$$

where $a_{m-1:0}$, $c_{m-1:0}$, $d_{m-1:0}$, $y_{2m-1:m}$, and $y_{m-1:0}$ are in two's-complement format. Table 1 lists several example 4-bit calculations for the B cell. Recall that a 4-bit two's-complement number ranges from −8 to 7, whereas a 4-bit unsigned number ranges from 0 to 15.

Table 1: Example calculations of the B cell

a_{3:0}   b_{3:0}   c_{3:0}   d_{3:0}   y_{7:4}   y_{3:0}   y_{7:0} = 16 y_{7:4} + y_{3:0}
   5         5        ±5        ∓5         2        −7         25
   5        10         5         5         4        −4         60
  −5         5        ±5        ∓5        −2         7        −25
  −5        10        −5        −5        −4         4        −60
   7        15         7         7         7         7        119
  −8        15        −8        −8        −8        −8       −136

(In the first and third rows, c and d are 5 and −5 in an order that this copy does not preserve; the two terms cancel either way.)

A similar analysis can be performed for the remaining cells used in the multiplier. For example, the C cells generate an unsigned output $y_{2m-1:m}$ and a two's-complement output $y_{m-1:0}$. With the data formats shown in Figure 7, the n-bit multiplier can generate a two's-complement output Y without additional hardware. Table 2 lists the input and output format of each type of cell (including the G cell used later). A + sign denotes unsigned format, and a − sign denotes two's-complement format.

Table 2: Data format requirements for each cell

Type   a_{m-1:0}   b_{m-1:0}   c_{m-1:0}   d_{m-1:0}   y_{2m-1:m}   y_{m-1:0}
A          +           +           +           +            +            +
B          −           +           −           −            −            −

(The − entries of rows C through H did not survive in this copy, so those rows cannot be placed column by column; they contain 4, 3, 3, 3, 4, and 1 unsigned entries respectively.)
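As a rough illustration of the output format in (6), the sketch below splits the result of a type B cell into two signed m-bit halves. The helper names `split_signed` and `b_cell` are hypothetical, and the rounding rule (choose the lower half in the two's-complement range, then take whatever remains as the upper half) is one straightforward way to satisfy (6), assumed here rather than taken from the paper.

```python
def split_signed(y, m):
    """Split y into (upper, lower) with 2**m * upper + lower == y,
    where lower is an m-bit two's-complement value."""
    half = 1 << (m - 1)
    lower = ((y + half) % (1 << m)) - half   # lands in [-2**(m-1), 2**(m-1) - 1]
    upper = (y - lower) >> m                 # exact: y - lower is a multiple of 2**m
    return upper, lower


def b_cell(a, b, c, d, m=4):
    """Type B cell: a, c, d are two's-complement, b is unsigned.
    Both halves of the result are m-bit two's-complement values."""
    half = 1 << (m - 1)
    assert all(-half <= v < half for v in (a, c, d))
    assert 0 <= b < (1 << m)
    upper, lower = split_signed(a * b + c + d, m)
    assert -half <= upper < half             # the signed upper half never overflows
    return upper, lower
```

With m=4, `b_cell(7, 15, 7, 7)` gives (7, 7), since 119 = 16 × 7 + 7, and `b_cell(-8, 15, -8, -8)` gives (−8, −8), since −136 = 16 × (−8) + (−8). These two extremes use the full range of both halves, which suggests why a format with two signed halves suits this cell.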

4. Hierarchical Multiplier

The last two sections have demonstrated that n-bit MAC units in general require seven types of cells. Each cell performs the MAC function on m-bit inputs, but different cells use different data formats. A natural question is how each cell can implement the required m-bit operations. For m=1, a simple combinational circuit suffices, but for larger m, the most practical solution may involve some kind of arithmetic unit. For reconfigurable devices, consider the following alternative: to implement the m-bit operations required by each cell, use an m × m array of 1-bit cells. In other words, the proposed architecture contains a two-level hierarchy of cells and elements, where cells work with m-bit words and elements work with single bits.

The next question is how the m × m array of elements can implement all the functionality required by m-bit cells. For type A cells, the solution is simple: use the unsigned multiplier structure presented in Section 2. As shown in Figure 8 for m=4, each of the elements works with data in unsigned form. Hence, one can classify the elements as type A as well.

Figure 8. Type A cell (m=4)

Each element computes the 1-bit MAC function

$$\psi_{1:0} = (\alpha \wedge \beta) + \gamma + \delta, \qquad (7)$$

where α, β, γ, and δ denote the inputs to the element, and ψ signifies the 2-bit output. Note that multiplication reduces to the logical AND operation, denoted by ∧, in the 1-bit case. Each bit of the output ψ can be expressed in terms of the combinational logic functions

$$\psi_1 = \mathrm{MAJ}(\alpha \wedge \beta, \gamma, \delta), \qquad \psi_0 = \mathrm{XOR}(\alpha \wedge \beta, \gamma, \delta), \qquad (8)$$

where

$$\mathrm{MAJ}(P, Q, R) = (P \wedge Q) \vee (P \wedge R) \vee (Q \wedge R), \qquad \mathrm{XOR}(P, Q, R) = P \oplus Q \oplus R. \qquad (9)$$

As discussed in the last section, two's-complement MAC units require additional types of cells. Type B cells, for example, assume that a, c, and d have two's-complement format, and that b has unsigned format. Using an m × m array of elements to implement a type B cell produces the result in Figure 9.

Figure 9. Type B cell (m=4)

Knowing the data format for each input to the cell, one can determine the format of every internal line using the information in Table 2. The procedure closely parallels the analysis for the two's-complement multiplier in Figure 7, except that the signal names are Greek symbols instead of lowercase letters. The implementation of the type B cell requires elements of types A, B, and C. Note that both the upper and lower portions of the y output have two's-complement format, as shown in Figure 7. Continuing on, cells of types C and D have straightforward implementations (Figures 10-11). Type E cells require five types of elements, including type G (Figure 12). Type F cells are similar (Figure 13). Finally, type H cells have the same formatting assignments as the two's-complement multiplier (Figure 14). This property holds because all the inputs and outputs of a type H cell have two's-complement format.
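The relationship between the 1-bit MAC function in (7) and the logic functions in (8) and (9) can be checked exhaustively, since a type A element has only 16 input combinations. The sketch below does so; the function names (`maj`, `xor3`, `element_a`, `check_element_a`) are hypothetical and chosen only for this illustration.

```python
from itertools import product


def maj(p, q, r):
    """Majority of three bits, as in (9)."""
    return (p & q) | (p & r) | (q & r)


def xor3(p, q, r):
    """Three-input exclusive OR, as in (9)."""
    return p ^ q ^ r


def element_a(alpha, beta, gamma, delta):
    """Type A element: return (psi1, psi0) computed with the logic of (8)."""
    ab = alpha & beta                      # 1-bit multiplication is a logical AND
    return maj(ab, gamma, delta), xor3(ab, gamma, delta)


def check_element_a():
    """Return True if (8) reproduces the arithmetic of (7) for every input."""
    for alpha, beta, gamma, delta in product((0, 1), repeat=4):
        psi1, psi0 = element_a(alpha, beta, gamma, delta)
        if 2 * psi1 + psi0 != (alpha & beta) + gamma + delta:
            return False
    return True
```

In effect the element is a full adder whose first operand is the AND of α and β: the majority function yields the carry bit and the exclusive OR yields the sum bit.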

Figure 10. Type C cell (m=4)

Figure 11. Type D cell (m=4)

Figure 12. Type E cell (m=4)

Figure 13. Type F cell (m=4)

Figure 14. Type H cell (m=4)

Now consider the MAC function computed by type B elements. From Table 2, the α, γ, δ, $\psi_1$, and $\psi_0$ signals of type B elements all have two's-complement format. For each of these signals, logic 0 denotes 0 and logic 1 denotes −1. Hence, type B elements compute the expression

$$-2\psi_1 - \psi_0 = (-\alpha \wedge \beta) - \gamma - \delta, \qquad (10)$$

which simplifies to

$$2\psi_1 + \psi_0 = (\alpha \wedge \beta) + \gamma + \delta. \qquad (11)$$

Since (11) and (7) are equivalent, elements of types A and B implement the same combinational logic expressions. Performing a similar analysis on the remaining types of cells reveals that only four distinct types of elements are required. In fact, each element implements the same expression for $\psi_0$; the only difference is the expression used to compute $\psi_1$. Table 3 lists the functions corresponding to each type of element. (Note that an overbar denotes the logical complement.) A reconfigurable architecture could exploit these similarities to implement all necessary operations efficiently.

Table 3: Reduction of element types

Type   ψ_1                   ψ_0                   Same as
A      MAJ(α∧β, γ, δ)        XOR(α∧β, γ, δ)        A
B      MAJ(α∧β, γ, δ)        XOR(α∧β, γ, δ)        A
C      MAJ(α∧β, γ, δ) *      XOR(α∧β, γ, δ)        C
D      MAJ(α∧β, γ, δ) *      XOR(α∧β, γ, δ)        C
E      MAJ(α∧β, γ, δ) *      XOR(α∧β, γ, δ)        C
F      MAJ(α∧β, γ, δ) *      XOR(α∧β, γ, δ)        F
G      MAJ(α∧β, γ, δ) *      XOR(α∧β, γ, δ)        F
H      MAJ(α∧β, γ, δ) *      XOR(α∧β, γ, δ)        H

* One or more arguments of MAJ are complemented in these rows; the complement marks did not survive in this copy.

5. Concluding Remarks

In this paper, we have presented a novel scheme for performing n-bit multiply-accumulate (MAC) operations using a reconfigurable array of m-bit cells. Each cell computes an m-bit MAC function with two additive terms. The structure can be superpipelined into m-bit units for extremely high throughput, as required in signal processing applications. With suitable changes to the configuration of each cell, the structure can handle unsigned or two's-complement inputs. To implement the functionality required by each cell, we propose to use an m × m matrix of reconfigurable 1-bit elements. Only four types of elements are required to construct multipliers of any size.

As a final note, we have used the concepts presented in this paper to create a two-level reconfigurable architecture for digital signal processing applications [9]. The architecture contains an array of reconfigurable 4-bit cells, each of which consists of a 4 × 4 matrix of elements. Each element, in turn, uses a 4-input, 2-bit lookup table to evaluate arithmetic or logic functions. Cells can connect to neighboring cells in any direction. However, the matrix of elements can only assume two structures, one of which is the structure of the MAC unit. Having the capability to compute the MAC operation means that cells can perform the arithmetic functions necessary for digital signal processing.

6. Acknowledgment

M. Myjak is supported by the U.S. Department of Homeland Security Graduate Fellowship.

7. References

[1] R. Tessier and W. Burleson, "Reconfigurable computing for digital signal processing: a survey," in Y. Hu (ed.), Programmable Digital Signal Processors, Marcel Dekker Inc., 2001.
[2] K. Compton and S. Hauck, "Reconfigurable computing: a survey of systems and software," ACM Computing Surveys, vol. 34, no. 2, Jun 2002, pp. 171-210.
[3] K. Rajagopalan and P. Sutton, "A flexible multiplication unit for an FPGA logic block," in Proc. 2001 IEEE International Symposium on Circuits and Systems, 2001, pp. 546-549.
[4] S. Haynes and P. Cheung, "Configurable multiplier blocks for embedding in FPGAs," Electronics Letters, vol. 34, iss. 7, Apr 1998, pp. 638-639.
[5] R. Hartenstein, "Coarse grain reconfigurable architectures," in Proc. 6th Asia South Pacific Design Automation Conference, Yokohama, Japan, 2001, pp. 564-570.
[6] J. Smit et al, "Low cost and fast turnaround: reconfigurable graph-based execution units," in Proc. 7th BELSIGN Workshop, Enschede, Netherlands, 1998.
[7] P. Heysters and G. Smit, "Mapping of DSP algorithms on the MONTIUM architecture," in Proc. International Parallel and Distributed Processing Symposium, Apr 2003, pp. 180-185.
[8] J. Rabaey et al, Digital Integrated Circuits: A Design Perspective, 2nd ed., Upper Saddle River, NJ: Pearson Education, Inc., 2003, pp. 591-592.
[9] M. Myjak and J. Delgado-Frias, "A two-level reconfigurable architecture for digital signal processing," in Proc. 2003 International Conference on VLSI, Las Vegas, NV, Jun 2003, pp. 21-27.