Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Steven J.E. Wilton *
Department of Electrical and Computer Engineering
University of British Columbia
Vancouver, BC, Canada, V6T 1Z4
stevew@ece.ubc.ca

* This work was supported by the Natural Sciences and Engineering Research Council of Canada, and UBC's Centre for Integrated Computer Systems Research.

Abstract

It has become clear that large embedded configurable memory arrays will be essential in future FPGAs. Embedded arrays provide high-density, high-speed implementations of the storage parts of circuits. Unfortunately, they require the FPGA vendor to partition the device into memory and logic resources at manufacture time. This wastes chip area for customers who do not use all of the storage provided. This chip area need not be wasted, and can in fact be used very efficiently, if the arrays are configured as large multi-output ROMs and used to implement logic. In this paper, we investigate how the architecture of the FPGA embedded arrays affects their ability to implement logic. Specifically, we focus on architectures which contain more than one size of memory array. We show that these heterogeneous architectures result in significantly denser implementations of logic than architectures with only one size of memory array. We also show that the best heterogeneous architecture contains both bit arrays and bit arrays.

1 Introduction

On-chip storage has become an essential component of high-density FPGAs. The large systems that will be implemented on these FPGAs often require storage; implementing this storage on-chip results in faster clock frequencies and lower system costs.

Two implementations of on-chip memory in FPGAs have emerged: fine-grained and coarse-grained. In FPGAs employing fine-grained on-chip storage, such as the Xilinx 4000 FPGAs, each lookup table can be configured as a small RAM, and these RAMs can be combined to implement larger user memories [1].
FPGAs employing the coarse-grained approach, on the other hand, contain large embedded arrays which are used to implement the storage parts of circuits. Examples of such devices are the Altera 10K, Apex, and Stratix devices [2, 3, 4], the Xilinx Virtex and Virtex II FPGAs [5], the Actel 3200DX and SPGA parts [6, 7], and the Lattice ispLSI FPGAs [8]. The coarse-grained approach results in significantly denser memory implementations, since the per-bit overhead is much smaller [9]. Unfortunately, it also requires the FPGA vendor to partition the chip into memory and logic regions when the FPGA is designed. Since circuits have widely varying memory requirements, this average-case partitioning may result in poor device utilization for logic-intensive or memory-intensive circuits. In particular, if a circuit does not use all the available memory arrays to implement storage, the chip area devoted to the unused arrays is wasted.

This chip area need not be wasted, however, if the unused memory arrays are used to implement logic. Configuring the arrays as ROMs results in large multi-output lookup-tables that can very efficiently implement some logic circuits. In [10], a new tool, SMAP, was presented that packs as much circuit information as possible into the available memory arrays, and maps the rest of the circuit into four-input lookup-tables. It was shown that this technique results in extremely dense logic implementations for many circuits; not only is the chip area of the unused arrays not wasted, but it is used more efficiently than if the arrays were replaced by logic blocks. Thus, even customers that do not require storage can benefit from embedded memory arrays.

The effectiveness of this mapping technique, however, is very dependent on the architecture of the embedded memory arrays. If the arrays are too small, the amount of logic that can be packed into each will be small, while if the arrays are too large, much of each array will be unused. Previous studies have focused on the architecture of these memory resources when implementing storage [11, 12, 13]. Since the arrays are so effective at implementing logic, however, the design of the embedded memory arrays should also take this use into account. In [14], the effects of array depth, width, and flexibility of memory arrays when they are used to implement logic were explored. That paper, however, only considered homogeneous memory architectures, i.e., architectures in which each memory array is identical. In this paper, we show that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array.

The goals of this paper are as follows:

1. The first goal is to quantify the density improvements that are possible with a heterogeneous memory architecture (compared to a homogeneous memory architecture) when used to implement logic.

2. There are many possible heterogeneous memory architectures (different array sizes, numbers, etc.). The second goal of this paper is to find the heterogeneous memory architecture that can most efficiently implement logic.

The architectural space explored in this paper is described in Section 2. Section 3 describes the experimental methodology and reviews the SMAP algorithm. Finally, Section 4 presents experimental results.

2 Embedded Array Architectures

Table 1 summarizes the parameters that define the FPGA embedded memory array architecture, along with values of these parameters for several commercial devices. In this paper we consider architectures with two different array sizes; we denote the number of bits in each type of array as B1 and B2. The number of arrays of each type is denoted N1 and N2. We assume that all arrays have the same set of allowable data widths, and denote that set by w_eff. For a fixed size, a wider memory implies fewer memory words in each array.
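The trade-off between width and depth for a fixed array size can be sketched as follows. This is a minimal illustration only: the function name `array_configs` and the 2048-bit example size are assumptions made for the sketch and are not taken from the text, and depths are assumed to be powers of two so that the number of address inputs is well defined.

```python
def array_configs(bits, widths):
    """Return (depth, width, address_inputs) for each allowable data width.

    Assumes `bits` is divisible by every width and that every resulting
    depth is a power of two (both are assumptions of this sketch).
    """
    configs = []
    for width in sorted(widths):
        depth, remainder = divmod(bits, width)
        if remainder != 0:
            raise ValueError(f"{bits} bits is not divisible by width {width}")
        address_inputs = depth.bit_length() - 1
        if 1 << address_inputs != depth:
            raise ValueError(f"depth {depth} is not a power of two")
        configs.append((depth, width, address_inputs))
    return configs


# Hypothetical 2048-bit array with w_eff = {1, 2, 4, 8}.
for depth, width, k in array_configs(2048, {1, 2, 4, 8}):
    print(f"{depth}x{width}: {k} address inputs, {width} data outputs")
```

When an array is used as a ROM to implement logic, each configuration corresponds to a lookup table with `k` inputs and `width` outputs, so a wider configuration trades inputs for outputs.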
In the Altera FLEX10K, for example, B = bits, and w_eff = {1, 2, 4, 8}, meaning each array can be configured to be one of x1, x2, x4, or x8.

3 Methodology

To compare memory array architectures, we employed an experimental methodology in which we varied the architectural parameters, and mapped a set of 28 benchmark circuits to each architecture. Each circuit contained between 527 and 6598 4-LUTs. Fifteen of the circuits were sequential. The combinational circuits and 9 of the sequential circuits were obtained from the Microelectronics Corporation of North Carolina (MCNC) benchmark suite, while the remaining sequential circuits were obtained from the University of Toronto and were the result of synthesis from VHDL and Verilog. All circuits were optimized using SIS [15] and mapped to four-input lookup-tables using Flowmap and Flowpack [16]. The SMAP algorithm was then used to pack as much circuit information as possible into the available memory arrays. The number of nodes that can be packed into the available arrays is used as a metric to compare memory array architectures.

[Figure 1: Example Mapping to an 8-Input, 3-Output Memory Block. (a) Original Circuit; (b) Final Implementation.]

The results in this paper depend heavily on the SMAP algorithm, which was originally developed for architectures in which all arrays are the same size. The following subsection reviews SMAP, while the subsequent subsection shows how SMAP can be used to map logic to a heterogeneous memory architecture.

3.1 Review of SMAP

This section briefly reviews SMAP; for more details, see [10]. The SMAP algorithm is based on Flowpack, a post-processing step of Flowmap [16]. Given a seed node, the algorithm finds the maximum-volume k-feasible cut, where k is the number of address inputs to each memory array.
A k-feasible cut is a set of no more than k nodes in the fan-in network of the seed such that the seed can be expressed entirely as a function of the k nodes; the maximum-volume k-feasible cut is the cut which contains the most nodes between the cut and the seed. The nodes that make up the cut become the memory array inputs. Figure 1(a) shows an example circuit along with the maximum 8-feasible cut for seed node A.

Given a seed node and a cut, SMAP then selects which nodes will become the memory array outputs. Any node that can be expressed as a function of the cut nodes is a potential memory array output. The selection of the outputs is an optimization problem, since different combinations of outputs will lead to different numbers of nodes that can be packed into the arrays. In [10], a heuristic was presented; the outputs with the largest number of nodes in their maximum fanout-free cone (the maximum cone rooted at the potential output such that no node in the cone drives a node not in the cone) are selected. As shown in [10], those nodes in the maximum fanout-free cones of the outputs can be packed into the array. All other nodes in the network must be implemented using logic blocks. In Figure 1(a), nodes C, A, and F are the selected outputs; Figure 1(b) shows the resulting circuit implementation.

Since the selection of the seed node is so important, we repeat the algorithm for each seed node, and choose the best result. If there is more than one array available, we map to the first array as described above. Then, we remove the nodes implemented by that array, and repeat the entire algorithm for the second array. This is repeated for each available array.

Table 1: Architectural Parameters

Parameter  Meaning                   Altera 10K    Vantis VF1  Lattice isp6192  Range in this paper
N1         Number of Type-1 Arrays   3-16          28-48       1                1-9
N2         Number of Type-2 Arrays   -             -           -                1-9
B1         Bits per Type-1 Array     4608 -
B2         Bits per Type-2 Array     - - - -
w_eff      Allowable Data Widths     {1, 2, 4, 8}  {4}         {9, 18}          {1, 2, 4, 8}

3.2 Extension to Heterogeneous Memory Architectures

The SMAP algorithm was developed assuming a homogeneous memory architecture; that is, one in which each memory array is identical. Since the arrays are packed one at a time, the above algorithm can be applied directly to architectures with different sized memory arrays. The only issue is whether the large or small arrays should be filled first. Experimentally, we have determined that the best results are obtained if we fill all of the large arrays first.
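As a rough illustration of the output-selection heuristic reviewed in Section 3.1, the sketch below computes the maximum fanout-free cone of each candidate output and keeps the candidates with the largest cones. It is a simplified sketch and not the SMAP implementation of [10]: the function names, the dictionary-based netlist representation, and the toy circuit are all assumptions made here, and overlap between cones is handled only by taking their union.

```python
def max_fanout_free_cone(root, fanins, fanouts, cut):
    """Return the set of nodes in the maximum fanout-free cone of `root`.

    `fanins` maps a node to the nodes driving it; `fanouts` maps a node to
    the nodes it drives. Nodes in `cut` (the array inputs) never join the cone.
    """
    cone = {root}
    changed = True
    while changed:
        changed = False
        for node in list(cone):
            for pred in fanins.get(node, ()):
                if pred in cone or pred in cut:
                    continue
                # pred joins only if every node it drives is already in the cone.
                if all(succ in cone for succ in fanouts.get(pred, ())):
                    cone.add(pred)
                    changed = True
    return cone


def select_outputs(candidates, num_outputs, fanins, fanouts, cut):
    """Pick the `num_outputs` candidates with the largest cones (ties by name)."""
    cones = {n: max_fanout_free_cone(n, fanins, fanouts, cut) for n in candidates}
    ranked = sorted(candidates, key=lambda n: (-len(cones[n]), n))
    chosen = ranked[:num_outputs]
    packed = set()
    for n in chosen:
        packed |= cones[n]
    return chosen, packed


# Toy circuit (hypothetical): cut nodes x, y, z; w is an input outside the cut.
# g = f(x, y), h = f(y, z), a = f(g, h), b = f(h), c = f(h, w).
cut = {"x", "y", "z"}
fanins = {"g": ["x", "y"], "h": ["y", "z"], "a": ["g", "h"], "b": ["h"], "c": ["h", "w"]}
fanouts = {"x": ["g"], "y": ["g", "h"], "z": ["h"], "w": ["c"],
           "g": ["a"], "h": ["a", "b", "c"], "a": [], "b": [], "c": []}

chosen, packed = select_outputs(["a", "b", "g", "h"], 2, fanins, fanouts, cut)
print(chosen, sorted(packed))
```

In this toy circuit, node g is absorbed into the cone of a because it drives nothing else, while h stays outside every other cone because it also drives c, which depends on an input outside the cut.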
The SMAP algorithm is greedy, in that, for each array, the largest portion of logic that can be mapped to the array is selected. Thus, the largest gains are likely to be obtained from the first few arrays that are filled; it therefore makes sense that these first few arrays are the large ones.

4 Results

4.1 Homogeneous Architecture Results

We first consider architectures in which all arrays are of the same size (this is the homogeneous case considered in [14]). Figure 2 shows how the effectiveness of each memory array in implementing logic depends on the array size, assuming 8 arrays are available. Figure 2(a) shows the number of logic blocks that can be packed into the arrays (averaged over our 28 benchmark circuits) vs. array size. Figure 2(b) shows the estimated chip area of the 8 memory arrays, also as a function of array size. The area estimates were obtained from a detailed area model [17] and are expressed in logic block equivalents (LBE). One LBE is the area required to implement one logic block.

Figure 2(c) shows the packing density as a function of array size. The packing density is defined as the ratio of the number of logic blocks that can be packed into the available memory arrays to the area required to implement the memory arrays (in LBEs). A packing density of 1 means that logic implemented in memory arrays is exactly as dense as the same logic implemented in logic blocks. A packing density greater than 1 means that logic implemented in memory arrays is denser than it would be if logic blocks were used. As Figure 2(c) shows, the packing density is greater than 1 for all but the largest memory array. The highest packing density occurs when the arrays each contain bits. See [14] for a more thorough coverage of homogeneous architectures.

4.2 Heterogeneous Architecture Results

In this section, we consider architectures which contain two different sizes of memory arrays.
Using the terminology of Section 2, each FPGA will have N1 arrays of B1 bits each and N2 arrays of B2 bits each. We restrict our attention to architectures with three different ratios of N1:N2: 1:1, 1:2, and 1:3.

[Figure 2: Homogeneous Architecture Results, 8 arrays. (a) Logic Blocks Packed (packed logic blocks vs. bits per array); (b) Area (area in equivalent logic blocks vs. bits per array); (c) Packing Ratio (packing ratio vs. bits per array).]

[Figure 3: Heterogeneous Architectures, 4 arrays of each type. (a) Numerical Results; (b) Graphical Results. Axes: Array 1 size (B1) and Array 2 size (B2). Packing-density values from panel (a), in extraction order: 2.17, 2.10, 2.67, 2.61, 2.77, 2.79, 2.73, 2.86, 2.73, 3.42, 3.33, 3.27, 2.98, 2.63, 2.43, 2.41, 2.40, 2.28, 1.63, 1.43, 1.24, 0.99.]

Figure 3 shows the packing density for several sizes of B1 and B2, assuming N1 = N2 = 4 (that is, there are four of each kind of array). As the results show, the best packing density occurs when there are four arrays of bits each, and four arrays of bits each (we did not consider array sizes smaller than bits, since such small arrays would not be suitable for implementing the memory parts of circuits, and thus would not likely be considered by an FPGA manufacturer). The packing density at this point is 23% higher than the best packing density obtained for homogeneous architectures.

We repeated the experiments for several values of N1 and N2; selected graphical results are shown in Figure 4. In Figure 4(a), one of each type of array is assumed. In this case, the best architecture is a homogeneous architecture in which both arrays contain bits. This was the only configuration for which a homogeneous architecture was found to be the best.

Results for FPGAs with the ratio N1:N2 = 1:2 (that is, FPGAs for which there are twice as many type-2 arrays as type-1 arrays) are shown in Figure 4(c) and (d). Results for FPGAs with the ratio N1:N2 = 1:3 (three times as many type-2 arrays as type-1 arrays) are shown in Figure 4(e) and (f). In both cases, the best architecture was found to consist of bit arrays and bit arrays (this was the case for all architectures which we investigated, except the N1 = N2 = 1 case described above).
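The packing-density metric used in these comparisons, and the relative gain of one architecture over another, can be sketched as follows. The function names and all numeric inputs below are made up for illustration; they are not measurements from this paper or values from the area model of [17].

```python
def packing_density(packed_blocks, arrays):
    """Packed logic blocks divided by total memory-array area in LBEs.

    `arrays` is a list of (count, area_per_array_in_lbe) pairs, one pair per
    array type, so a heterogeneous architecture simply lists two pairs.
    """
    total_area = sum(count * area for count, area in arrays)
    if total_area <= 0:
        raise ValueError("total array area must be positive")
    return packed_blocks / total_area


def relative_gain(density, baseline):
    """Fractional improvement of `density` over `baseline`."""
    return (density - baseline) / baseline


# Hypothetical inputs: four arrays of 15 LBE plus four of 10 LBE packing
# 300 logic blocks, versus eight arrays of 12.5 LBE packing 250 logic blocks.
heterogeneous = packing_density(300, [(4, 15.0), (4, 10.0)])
homogeneous = packing_density(250, [(8, 12.5)])
print(f"heterogeneous: {heterogeneous:.2f}, homogeneous: {homogeneous:.2f}")
print(f"relative gain: {relative_gain(heterogeneous, homogeneous):.0%}")
```

Because both the numerator and the denominator change with the array sizes, an architecture that packs fewer logic blocks can still have the higher packing density if its arrays are sufficiently smaller.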
It is interesting to note that although an FPGA with both bit arrays and bit arrays was found to be best, in some cases (Figures 4(c) and (e)) the majority of the arrays should contain bits, while in other cases the majority of the arrays should contain bits (Figures 4(d) and (f)). This can be observed in the graphs by noticing that in Figures 4(c) and (e), the highest point is to the left of the center of the graph, while in Figures 4(d) and (f), the highest point is to the right of the center of the graph. We have investigated other architectures with N1:N2 ratios of 1:2 and 1:3, and have confirmed that, as the total number of arrays increases, the preference for smaller arrays increases. Intuitively, if there are more arrays, the SMAP tool is less able to effectively fill the larger arrays with logic.

A second conclusion that can be drawn from the results in Figure 4 (and confirmed by other experiments we have performed) is that as the total number of arrays increases, the advantage due to heterogeneous architectures (compared to homogeneous architectures) tends to increase. If there are only two arrays, a homogeneous architecture is better, while if there are 12 arrays (Figures 4(d) and (f)), the heterogeneous architecture is considerably better (22% better in each case).

5 Conclusions

Although embedded arrays in FPGAs were developed in order to implement on-chip storage, it is clear that these arrays can also be configured as ROMs and used to implement logic. In this paper, we have shown that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array. The amount of improvement depends on how many memory arrays are present; if there are eight arrays, we have shown that the best heterogeneous architecture can implement logic 23% more efficiently than the best homogeneous architecture.

In virtually all cases, we have found that the best heterogeneous architecture consists of some bit arrays, and some bit arrays. The exact number of each size of array depends on the total number of arrays available; the more arrays that are present, the larger the proportion that should be bits.

We have also shown that the benefits of heterogeneous architectures become more significant as the number of arrays increases. This is a compelling argument for heterogeneous memory architectures. Future architectures are likely to contain more memory than they do now; FPGAs with such large memory capacities would benefit significantly from a heterogeneous architecture.

References

[1] Xilinx, Inc., Virtex V Field Programmable Gate Arrays, ver. 1.6, July 1999.
[2] Altera Corporation, FLEX 10K Embedded Programmable Logic Family Data Sheet, ver. 4.1, Mar 2001.
[3] Altera Corporation, APEX 20K Programmable Logic Device Family Data Sheet, ver. 2.1, Feb 2002.
[4] Altera Corporation, Stratix Programmable Logic Device Family Datasheet, 2002.
[5] Xilinx, Inc., XC4000E and XC4000X Series Field Programmable Gate Arrays, ver. 1.6, May 1999.
[6] Actel Corporation, Datasheet: 3200DX Field-Programmable Gate Arrays, 1995.
[7] Actel Corporation, Actel's Reprogrammable SPGAs, 1996.
[8] Lattice Semiconductor Corporation, Datasheet: ispLSI and pLSI 6192 High Density Programmable Logic with Dedicated Memory and Register/Counter Modules, July 1996.
[9] T. Ngai, J. Rose, and S. J. E. Wilton, "An SRAM-Programmable field-configurable memory," in Proceedings of the IEEE 1995 Custom Integrated Circuits Conference, pp. 499-502, May 1995.
[10] S. J. E. Wilton, "SMAP: heterogeneous technology mapping for FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 171-178, February 1998.
[11] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Architecture of centralized field-configurable memory," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 97-103, 1995.
[12] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory/logic interconnect flexibility in FPGAs with large embedded memory arrays," in Proceedings of the IEEE 1996 Custom Integrated Circuits Conference, pp. 144-147, May 1996.
[13] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory-to-memory connection structures in FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 10-16, February 1997.
[14] S. J. E. Wilton, "Implementing logic in FPGA embedded memory arrays: Architectural implications," in IEEE Custom Integrated Circuits Conference, May 1998.
[15] E. Sentovich, "SIS: A system for sequential circuit analysis," Tech. Rep. UCB/ERL M92/41, Electronics Research Laboratory, University of California, Berkeley, May 1992.
[16] J. Cong and Y. Ding, "FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, pp. 1-12, January 1994.
[17] S. J. E. Wilton, Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memory. PhD thesis, University of Toronto, 1997.

[Figure 4: Other Selected Heterogeneous Architecture Results. Panels: (a) N1 = 1, N2 = 1; (b) N1 = 8, N2 = 8; (c) N1 = 1, N2 = 2; (d) N1 = 4, N2 = 8; (e) N1 = 1, N2 = 3; (f) N1 = 3, N2 = 9. Horizontal axis: B1.]