Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Steven J.E. Wilton
Department of Electrical and Computer Engineering
University of British Columbia
Vancouver, BC, Canada, V6T 1Z4
stevew@ece.ubc.ca

This work was supported by the Natural Sciences and Engineering Research Council of Canada, and UBC's Centre for Integrated Computer Systems Research.

Abstract

It has become clear that large embedded configurable memory arrays will be essential in future FPGAs. Embedded arrays provide high-density, high-speed implementations of the storage parts of circuits. Unfortunately, they require the FPGA vendor to partition the device into memory and logic resources at manufacture time. This wastes chip area for customers that do not use all of the storage provided. This chip area need not be wasted, and can in fact be used very efficiently, if the arrays are configured as large multi-output ROMs and used to implement logic. In this paper, we investigate how the architecture of the FPGA embedded arrays affects their ability to implement logic. Specifically, we focus on architectures which contain more than one size of memory array. We show that these heterogeneous architectures result in significantly denser implementations of logic than architectures with only one size of memory array. We also show that the best heterogeneous architecture contains both bit arrays and bit arrays.

1 Introduction

On-chip storage has become an essential component of high-density FPGAs. The large systems that will be implemented on these FPGAs often require storage; implementing this storage on-chip results in faster clock frequencies and lower system costs.

Two implementations of on-chip memory in FPGAs have emerged: fine-grained and coarse-grained. In FPGAs employing fine-grained on-chip storage, such as the Xilinx 4000 FPGAs, each lookup table can be configured as a small RAM, and these RAMs can be combined to implement larger user memories [1].
FPGAs employing the coarse-grained approach, on the other hand, contain large embedded arrays which are used to implement the storage parts of circuits. Examples of such devices are the Altera 10K, Apex, and Stratix devices [2, 3, 4], the Xilinx Virtex and Virtex II FPGAs [5], the Actel 3200DX and SPGA parts [6, 7], and the Lattice ispLSI FPGAs [8]. The coarse-grained approach results in significantly denser memory implementations, since the per-bit overhead is much smaller [9]. Unfortunately, it also requires the FPGA vendor to partition the chip into memory and logic regions when the FPGA is designed. Since circuits have widely varying memory requirements, this average-case partitioning may result in poor device utilization for logic-intensive or memory-intensive circuits. In particular, if a circuit does not use all the available memory arrays to implement storage, the chip area devoted to the unused arrays is wasted.

This chip area need not be wasted, however, if the unused memory arrays are used to implement logic. Configuring the arrays as ROMs results in large multi-output lookup tables that can very efficiently implement some logic circuits. In [10], a new tool, SMAP, was presented that packs as much circuit information as possible into the available memory arrays, and maps the rest of the circuit into four-input lookup tables. It was shown that this technique results in extremely dense logic implementations for many circuits; not only is the chip area of the unused arrays not wasted, but it is used more efficiently than if the arrays were replaced by logic blocks. Thus, even customers that do not require storage can benefit from embedded memory arrays.

The effectiveness of this mapping technique, however, is very dependent on the architecture of the embedded memory arrays. If the arrays are too small, the amount of logic that can be packed into each will be small, while if the arrays are too large, much of each array will be unused. Previous studies have focused on the architecture of these memory resources when implementing storage [11, 12, 13]. Since the arrays are so effective at implementing logic, however, it is important that their design also take logic implementation into account. In [14], the effects of the depth, width, and flexibility of memory arrays used to implement logic were explored. That paper, however, only considered homogeneous memory architectures, i.e., architectures in which each memory array is identical. In this paper, we show that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array.

The goals of this paper are as follows:

1. The first goal is to quantify the density improvements that are possible with a heterogeneous memory architecture (compared to a homogeneous memory architecture) when used to implement logic.

2. There are many possible heterogeneous memory architectures (different array sizes, numbers, etc.). The second goal of this paper is to find the heterogeneous memory architecture that can most efficiently implement logic.

The architectural space explored in this paper is described in Section 2. Section 3 describes the experimental methodology and reviews the SMAP algorithm. Finally, Section 4 presents experimental results.

2 Embedded Array Architectures

Table 1 summarizes the parameters that define the FPGA embedded memory array architecture, along with values of these parameters for several commercial devices. In this paper we consider architectures with two different array sizes; we denote the number of bits in each type of array as B1 and B2. The number of arrays of each type is denoted N1 and N2. We assume that all arrays have the same set of allowable data widths, and denote that set by w_eff. For a fixed size, a wider memory implies fewer memory words in each array.
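As a rough illustration of this width/depth trade-off, the sketch below enumerates the configurations available to a single array. The function name `array_configurations` and the 2048-bit array size are assumptions made for the example, not values taken from the text. The address-input count is derived as the number of bits needed to index every word.

```python
def array_configurations(bits, widths):
    """List (depth, width, address_inputs) for one array of `bits` bits.

    `bits` is the fixed array size (B1 or B2) and `widths` is the set of
    allowable data widths (w_eff). Hypothetical helper for illustration.
    """
    configs = []
    for width in sorted(widths):
        if bits % width:
            raise ValueError(f"width {width} does not divide {bits} bits")
        depth = bits // width                      # number of memory words
        address_inputs = (depth - 1).bit_length()  # bits needed to address them
        configs.append((depth, width, address_inputs))
    return configs


# Illustrative values only: a 2048-bit array with w_eff = {1, 2, 4, 8}.
for depth, width, addr in array_configurations(2048, {1, 2, 4, 8}):
    print(f"{depth} x {width}: {addr} address inputs, {width} outputs")
```

Doubling the width halves the depth and removes one address input, which is why a wider configuration offers more ROM outputs but accepts fewer logic inputs.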
In the Altera FLEX10K, for example, each array has a fixed size of B bits and w_eff = {1, 2, 4, 8}, meaning each array can be configured with a data width of x1, x2, x4, or x8.

3 Methodology

To compare memory array architectures, we employed an experimental methodology in which we varied the architectural parameters and mapped a set of 28 benchmark circuits to each architecture. Each circuit contained between 527 and 6598 4-LUTs. Fifteen of the circuits were sequential. The combinational circuits and 9 of the sequential circuits were obtained from the Microelectronics Corporation of North Carolina (MCNC) benchmark suite, while the remaining sequential circuits were obtained from the University of Toronto and were the result of synthesis from VHDL and Verilog. All circuits were optimized using SIS [15] and mapped to four-input lookup tables using Flowmap and Flowpack [16]. The SMAP algorithm was then used to pack as much circuit information as possible into the available memory arrays. The number of nodes that can be packed into the available arrays is used as a metric to compare memory array architectures.

Figure 1: Example Mapping to an 8-Input, 3-Output Memory Block. a) Original Circuit b) Final Implementation

The results in this paper depend heavily on the SMAP algorithm, which was originally developed for architectures in which all arrays are the same size. The following subsection reviews SMAP, while the subsequent subsection shows how SMAP can be used to map logic to a heterogeneous memory architecture.

3.1 Review of SMAP

This section briefly reviews SMAP; for more details, see [10]. The SMAP algorithm is based on Flowpack, a post-processing step of Flowmap [16]. Given a seed node, the algorithm finds the maximum-volume k-feasible cut, where k is the number of address inputs to each memory array.
A k-feasible cut is a set of no more than k nodes in the fan-in network of the seed such that the seed can be expressed entirely as a function of the k nodes; the maximum-volume k-feasible cut is the cut which contains the most nodes between the cut and the seed. The nodes that make up the cut become the memory array inputs. Figure 1(a) shows an example circuit along with the maximum 8-feasible cut for seed node A.

Given a seed node and a cut, SMAP then selects which nodes will become the memory array outputs. Any node that can be expressed as a function of the cut nodes is a potential memory array output. The selection of the outputs is an optimization problem, since different combinations of outputs will lead to different numbers of nodes that can be packed into the arrays. In [10], a heuristic was presented: the outputs with the largest number of nodes in their maximum fanout-free cone (the maximum cone rooted at the potential output such that no node in the cone drives a node not in the cone) are selected. As shown in [10], those nodes in the maximum fanout-free cones of the outputs can be packed into the array. All other nodes in the network must be implemented using logic blocks. In Figure 1(a), nodes C, A, and F are the selected outputs; Figure 1(b) shows the resulting circuit implementation.

Since the selection of the seed node is so important, we repeat the algorithm for each seed node, and choose the best result. If there is more than one array available, we map to the first array as described above. Then, we remove the nodes implemented by that array, and repeat the entire algorithm for the second array. This is repeated for each available array.

Table 1: Architectural Parameters

                                      Commercial Devices
Parameter  Meaning                    Altera 10K   Vantis VF1   Lattice isp6192   Range in this paper
N1         Number of Type-1 Arrays    3-16         28-48        1                 1-9
N2         Number of Type-2 Arrays    -            -            -                 1-9
B1         Bits per Type-1 Array                                4608              -
B2         Bits per Type-2 Array      -            -            -                 -
w_eff      Allowable Data Widths      {1,2,4,8}    {4}          {9,18}            {1,2,4,8}

3.2 Extension to Heterogeneous Memory Architectures

The SMAP algorithm was developed assuming a homogeneous memory architecture, that is, one in which each memory array is identical. Since the arrays are packed one at a time, the above algorithm can be applied directly to architectures with different-sized memory arrays. The only issue is whether the large or small arrays should be filled first. Experimentally, we have determined that the best results are obtained if we fill all of the large arrays first.
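The per-array packing step from Section 3.1, which both the homogeneous and heterogeneous flows rely on, can be sketched as follows. This is a simplified illustration rather than the SMAP implementation from [10]. It assumes the cut is already given (SMAP derives it with a network-flow computation that is omitted here), and it picks outputs greedily by the number of not-yet-packed nodes each maximum fanout-free cone would add, which is one possible reading of the heuristic. The netlist representation and all function names are hypothetical.

```python
from collections import defaultdict


def fanout_map(fanins):
    """Invert a node -> fanin-list map into a node -> fanout-set map."""
    fanouts = defaultdict(set)
    for node, sources in fanins.items():
        for src in sources:
            fanouts[src].add(node)
    return fanouts


def cone_between(fanins, seed, cut):
    """Nodes between the cut and the seed (seed included, cut excluded).

    Returns None if a primary input is reachable without crossing the cut,
    i.e. the seed is not a function of the cut nodes alone.
    """
    cut, cone, stack = set(cut), set(), [seed]
    while stack:
        node = stack.pop()
        if node in cut or node in cone:
            continue
        if not fanins.get(node):      # primary input outside the cut
            return None
        cone.add(node)
        stack.extend(fanins[node])
    return cone


def mffc(fanouts, root, allowed):
    """Maximum fanout-free cone of `root`, restricted to `allowed` nodes."""
    cone, changed = {root}, True
    while changed:
        changed = False
        for node in allowed - cone:
            outs = fanouts.get(node, set())
            if outs and outs <= cone:  # every fanout stays inside the cone
                cone.add(node)
                changed = True
    return cone


def select_outputs(fanins, seed, cut, n_outputs):
    """Greedily choose up to n_outputs array outputs; return (outputs, packed)."""
    cone = cone_between(fanins, seed, cut)
    if cone is None or len(set(cut)) == 0:
        raise ValueError("seed is not a function of the cut nodes alone")
    fanouts = fanout_map(fanins)
    outputs, packed = [], set()
    for _ in range(n_outputs):
        best, best_gain = None, set()
        for cand in sorted(cone - set(outputs)):
            gain = mffc(fanouts, cand, cone) - packed
            if len(gain) > len(best_gain):
                best, best_gain = cand, gain
        if best is None:
            break
        outputs.append(best)
        packed |= best_gain
    return outputs, packed


# Toy netlist: a-d are primary inputs; y also drives w, which lies outside the cone.
example = {"a": [], "b": [], "c": [], "d": [],
           "x": ["a", "b"], "y": ["c", "d"], "z": ["x", "y"], "w": ["y", "d"]}
print(select_outputs(example, "z", {"a", "b", "c", "d"}, n_outputs=2))
```

In the toy netlist, node y cannot be absorbed into the cone of z because it also drives w, so it is packed only when a second output is available to expose it.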
The SMAP algorithm is greedy in that, for each array, the largest portion of logic that can be mapped to the array is selected. Thus, the largest gains are likely to be obtained from the first few arrays that are filled, so it makes sense for these first few arrays to be the large ones.

4 Results

4.1 Homogeneous Architecture Results

We first consider architectures in which all arrays are of the same size (this is the homogeneous case considered in [14]). Figure 2 shows how the effectiveness of each memory array in implementing logic depends on the array size, assuming 8 arrays are available. Figure 2(a) shows the number of logic blocks that can be packed into the arrays (averaged over our 28 benchmark circuits) vs. array size. Figure 2(b) shows the estimated chip area of the 8 memory arrays, also as a function of array size. The area estimates were obtained from a detailed area model [17] and are expressed in logic block equivalents (LBE). One LBE is the area required to implement one logic block.

Figure 2(c) shows the packing density as a function of array size. The packing density is defined as the ratio of the number of logic blocks that can be packed into the available memory arrays to the area required to implement the memory arrays (in LBEs). A packing density of 1 means that logic implemented in memory arrays is exactly as dense as the same logic implemented in logic blocks. A packing density greater than 1 means that logic implemented in memory arrays is denser than it would be if logic blocks were used. As Figure 2(c) shows, the packing density is greater than 1 for all but the largest memory array. The highest packing density occurs when the arrays each contain bits. See [14] for a more thorough coverage of homogeneous architectures.

4.2 Heterogeneous Architecture Results

In this section, we consider architectures which contain two different sizes of memory arrays.
Using the terminology of Section 2, each FPGA will have N1 arrays of B1 bits each and N2 arrays of B2 bits each. We restrict our attention to architectures with three different ratios N1:N2: 1:1, 1:2, and 1:3.

Figure 2: Homogeneous Architecture Results, 8 arrays. a) Logic Blocks Packed b) Area (equiv. logic blocks) c) Packing Ratio. Horizontal axis in each panel: Bits per Array.

Figure 3 shows the packing density for several sizes of B1 and B2, assuming N1 = N2 = 4 (that is, there are four of each kind of array). As the results show, the best packing density occurs when there are four arrays of bits each, and four arrays of bits each (we did not consider array sizes smaller than bits, since such small arrays would not be suitable for implementing the memory parts of circuits, and thus would not likely be considered by an FPGA manufacturer). The packing density at this point is 23% higher than the best packing density obtained for homogeneous architectures.

Figure 3: Heterogeneous Architectures, 4 arrays of each type. a) Numerical Results b) Graphical Results. Axes: Array 1 size (B1) and Array 2 size (B2). Packing-density values listed in panel (a): 2.17, 2.10, 2.67, 2.61, 2.77, 2.79, 2.73, 2.86, 2.73, 3.42, 3.33, 3.27, 2.98, 2.63, 2.43, 2.41, 2.40, 2.28, 1.63, 1.43, 1.24, 0.99.

We repeated the experiments for several values of N1 and N2; selected graphical results are shown in Figure 4. In Figure 4(a), one of each type of array is assumed. In this case, the best architecture is a homogeneous architecture in which both arrays contain bits. This was the only configuration for which a homogeneous architecture was found to be the best.

Results for FPGAs with the ratio N1:N2 = 1:2 (that is, FPGAs for which there are twice as many type-2 arrays as type-1 arrays) are shown in Figure 4(c) and (d). Results for FPGAs with the ratio N1:N2 = 1:3 (three times as many type-2 arrays as type-1 arrays) are shown in Figure 4(e) and (f). In both cases, the best architecture was found to consist of bit arrays and bit arrays (this was the case for all architectures which we investigated, except the N1 = N2 = 1 case described above).
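The packing density defined in Section 4.1, and the kind of relative comparison quoted above, reduce to two short formulas. The sketch below is illustrative only: the function names are hypothetical and the numbers in the usage lines are made up for the example, not taken from the figures.

```python
def packing_density(packed_logic_blocks, array_area_lbe):
    """Logic blocks absorbed by the arrays per logic-block equivalent of array area."""
    if array_area_lbe <= 0:
        raise ValueError("array area must be positive")
    return packed_logic_blocks / array_area_lbe


def relative_gain(heterogeneous_density, homogeneous_density):
    """Fractional improvement of one architecture's density over another's."""
    return heterogeneous_density / homogeneous_density - 1.0


# Made-up example: arrays occupying 100 LBEs that absorb 300 logic blocks.
density = packing_density(300, 100)
print(density)                                 # a value above 1 beats plain logic blocks
print(f"{relative_gain(density, 2.4):.0%}")    # gain over a hypothetical density of 2.4
```

A density above 1 means the arrays implement more logic than the same silicon area would if it were built from logic blocks, which is the threshold used throughout Section 4.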
It is interesting to note that although an FPGA with both bit arrays and bit arrays was found to be best, in some cases (Figures 4(c) and (e)) the majority of the arrays should contain bits, while in other cases the majority of the arrays should contain bits (Figures 4(d) and (f)). This can be observed in the graphs by noticing that in Figures 4(c) and (e), the highest point is to the left of the center of the graph, while in Figures 4(d) and (f), the highest point is to the right of the center of the graph. We have investigated other architectures with N1:N2 ratios of 1:2 and 1:3, and have confirmed that, as the total number of arrays increases, the preference for smaller arrays increases. Intuitively, if there are more arrays, the SMAP tool is less able to effectively fill the larger arrays with logic.

A second conclusion that can be drawn from the results in Figure 4 (and confirmed by other experiments we have performed) is that as the total number of arrays increases, the advantage due to heterogeneous architectures (compared to homogeneous architectures) tends to increase. If there are only two arrays, a homogeneous architecture is better, while if there are 12 arrays (Figures 4(d) and (f)), the heterogeneous architecture is considerably better (22% better in each case).

5 Conclusions

Although embedded arrays in FPGAs were developed in order to implement on-chip storage, it is clear that these arrays can also be configured as ROMs and used to implement logic. In this paper, we have shown that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array. The amount of improvement depends on how many memory arrays are present; if there are eight arrays, we have shown that the best heterogeneous architecture can implement logic 23% more efficiently than the best homogeneous architecture.

In virtually all cases, we have found that the best heterogeneous architecture consists of some bit arrays, and some bit arrays. The exact number of each size of array depends on the total number of arrays available; the more arrays that are present, the larger the proportion that should be bits.

We have also shown that the benefits of heterogeneous architectures become more significant as the number of arrays increases. This is a compelling argument for heterogeneous memory architectures. Future architectures are likely to contain more memory than they do now; FPGAs with such large memory capacities would benefit significantly from a heterogeneous architecture.

References

[1] Xilinx, Inc., "Virtex V Field Programmable Gate Arrays," ver. 1.6, July 1999.
[2] Altera Corporation, "FLEX 10K Embedded Programmable Logic Family Data Sheet," ver. 4.1, Mar. 2001.
[3] Altera Corporation, "APEX 20K Programmable Logic Device Family Data Sheet," ver. 2.1, Feb. 2002.
[4] Altera Corporation, "Stratix Programmable Logic Device Family Datasheet," 2002.
[5] Xilinx, Inc., "XC4000E and XC4000X Series Field Programmable Gate Arrays," ver. 1.6, May 1999.
[6] Actel Corporation, "Datasheet: 3200DX Field-Programmable Gate Arrays," 1995.
[7] Actel Corporation, "Actel's Reprogrammable SPGAs," 1996.
[8] Lattice Semiconductor Corporation, "Datasheet: ispLSI and pLSI 6192 High Density Programmable Logic with Dedicated Memory and Register/Counter Modules," July 1996.
[9] T. Ngai, J. Rose, and S. J. E. Wilton, "An SRAM-Programmable field-configurable memory," in Proceedings of the IEEE 1995 Custom Integrated Circuits Conference, pp. 499-502, May 1995.
[10] S. J. E. Wilton, "SMAP: heterogeneous technology mapping for FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 171-178, February 1998.
[11] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Architecture of centralized field-configurable memory," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 97-103, 1995.
[12] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory/logic interconnect flexibility in FPGAs with large embedded memory arrays," in Proceedings of the IEEE 1996 Custom Integrated Circuits Conference, pp. 144-147, May 1996.
[13] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory-to-memory connection structures in FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 10-16, February 1997.
[14] S. J. E. Wilton, "Implementing logic in FPGA embedded memory arrays: Architectural implications," in IEEE Custom Integrated Circuits Conference, May 1998.
[15] E. Sentovich, "SIS: A system for sequential circuit analysis," Tech. Rep. UCB/ERL M92/41, Electronics Research Laboratory, University of California, Berkeley, May 1992.
[16] J. Cong and Y. Ding, "FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, pp. 1-12, January 1994.
[17] S. J. E. Wilton, Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memory. PhD thesis, University of Toronto, 1997.

Figure 4: Other Selected Heterogeneous Architecture Results. a) N1 = 1, N2 = 1 b) N1 = 8, N2 = 8 c) N1 = 1, N2 = 2 d) N1 = 4, N2 = 8 e) N1 = 1, N2 = 3 f) N1 = 3, N2 = 9. Horizontal axis in each panel: B1.