
Circuit | Memory Requirements | Previous Algorithm [6]: Number of Arrays Req'd | Previous Algorithm [6]: Utilization | SPACK Algorithm: Number of Arrays Req'd | SPACK Algorithm: Utilization
Variable Length CODEC | one 768x16, two 32x7, one 512x1 | | | |
Discrete Cosine Transform Chip | two 16x | | | |
Video Compression | one 24x112 (dual port), one 16x96 (dual port) | | | |
Encryption Circuit | one 256x | | | 1 | 100%
Robot Controller | one 172x | | | 2 | 42%
Filter | two 8x24 (dual port), one 320x24 | | | |
Neural Network Chip 1 | one 160x8, one 32x | | | |
Neural Network Chip 2 | one 1310x24, one 1024x16 | | | |
DMA Chip for LAN | one 15x24, one 16x4, one 256x32 | | | |
Translation Look-aside Buffer | two 256x59, one 16x18 (dual port) | | | |
Proof-of-Concept Viterbi Decoder | three 128x8, one 28x3 (dual port) | | | |
Image Backprojector | two 128x | | | |
DSP Control Unit | one 1024x32, one 128x16, three 64x16, two 24x16 | | | |
Vector Processing Unit | two 256x9, two 256x8, three 128x9, three 128x16 (dual port) | | | |
Communications Circuit 1 | six 88x8, one 64x | | | |
Communications Circuit 2 | three 736x | | | |
Communications Circuit 3 | four 368x16, one 736x16 | | | |
Communications Circuit 4 | two 1620x3, two 168x12, two 366x11 | | | |
Communications Circuit 5 | one 192x | | | |
Average | | 9.89 | 35.4% | 7.1 | 51.7%

Table 2. Experimental Results

8. J. Cong and S. Xu, "Technology mapping for FPGAs with embedded memory blocks," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 179-187, February.
9. Altera Corporation, "Implementing RAM functions in FLEX 10K devices." Technical Note, Nov.
10. P. K. Jha and N. D. Dutt, "Library mapping for memories," in Proceedings of the 1997 European Design and Test Conference, March.
11. D. Karchmer and J. Rose, "Definition and solution of the memory packing problem for field-programmable systems," in Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 20-26.
12. H. Schmit and D. Thomas Jr., "Address generation for memories containing multiple arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, May.
13. P. R. Panda and N. D. Dutt, "Behavioral array mapping into multiport memories targeting low power," in Proceedings of the 10th International Conference on VLSI Design, Jan.
14. M. Balakrishnan, A. Majumdar, D. Banerji, J. Linders, and J. Majithia, "Allocation of multiport memories in datapath synthesis," IEEE Transactions on Computer-Aided Design, vol. 7, April 1988.

wasted (because each physical array could not be completely filled with logical memories). Using the previous algorithm, the utilization is only 35.4% averaged over all circuits. SPACK results in a significantly higher utilization of 51.7%.

5 Conclusions

In this paper, we have presented a new logical-to-physical mapping algorithm that targets FPGAs with dual-port embedded arrays. The purpose of the algorithm is to map the memories required by a circuit to the physical FPGA memory resources. This is an important problem, since an implementation of a user's memory that requires even one more physical array than necessary could very easily cause a circuit to not fit on a given FPGA.

Previous work has studied FPGAs with single-port embedded arrays. Most current FPGAs, however, contain dual-port arrays. We have shown that by explicitly taking advantage of the dual-port nature of these arrays, our algorithm produces considerably more efficient implementations of the memory parts of circuits. Specifically, we have shown that under the right conditions, we can pack two single-port user memories (or parts of two single-port user memories) into a dual-port array. Our algorithm results in memory implementations that use, on average, 28% fewer arrays than an algorithm that does not take advantage of the dual-port arrays in this way.

Acknowledgments

This work was supported by Cypress Semiconductor, the Natural Sciences and Engineering Research Council of Canada, and UBC's Centre for Integrated Computer Systems Research.

References

1. Altera Corporation, Datasheet: FLEX 10K Embedded Programmable Logic Family, May.
2. Altera Corporation, Datasheet: FLEX 10KE Embedded Programmable Logic Family, August.
3. Xilinx, Inc., "Virtex: Our new million-gate 100-MHz FPGA technology." XCell: The Quarterly Journal for Xilinx Programmable Logic Users, First Quarter.
4. Actel Corporation, Datasheet: Integrator Series FPGAs: 40MX and 42MX Families, April.
5. Lattice Semiconductor Corporation, Datasheet: ispLSI and pLSI 6192 High Density Programmable Logic with Dedicated Memory and Register/Counter Modules, July.
6. S. J. E. Wilton, Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memory. PhD thesis, University of Toronto.
7. S. J. E. Wilton, "SMAP: heterogeneous technology mapping for FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 171-178, February 1998.

phase2:
    bin list = ∅
    sort component list (largest component first)
    for each component i {
        for each bin j in the bin list {
            if fits(component i, bin j) {
                add component i to bin j
                go to next component
            }
        }
        b = new bin with component i as its only occupant
        bin list = bin list ∪ b
    }

fits(component i, bin j) {
    k = component already in bin j
    if (P ≥ pc_i + pc_k) and (mc_i = mc_k) and
       [(W_{mc_i} ≥ wc_i + wc_k) or
        (B/W_{mc_i} ≥ dc_i + pow2(dc_k)) or
        (B/W_{mc_k} ≥ dc_k + pow2(dc_i))] {
        return(yes)
    } else {
        return(no)
    }
}

Fig. 5. Summary of Phase 2 of the Algorithm

Note that the purpose of Table 2 is not to compare FPGAs with single-port arrays to those with dual-port arrays. If the FPGA had only single-port arrays, many of the circuits in the table could not have even been implemented (unless the dual-port user memories were implemented by time-multiplexing two ports onto a single physical port). Rather, the purpose of the table is to show that when targeting FPGAs with dual-port arrays, our algorithm performs considerably better than the previous algorithm. Since we expect most future FPGAs to contain dual-port arrays, this is an important result.

The fourth and sixth columns of Table 2 show the utilization of the arrays. The utilization is defined as:

$\text{utilization} = 100 \times \dfrac{\text{number of bits in logical memory configuration}}{(\text{number of bits in each array})\,(\text{number of arrays used})}$

A utilization of 100% means that every bit in the physical arrays was used, while a utilization lower than 100% means that some bits in the arrays were

general, the following condition must be true to combine arrays:

$P \ge pc_i + pc_j \qquad (6)$

Most FPGAs with dual-port arrays require that each port in a physical array be used in the same mode (e.g. one port cannot be used as a 4Kx1 while the other is used as a 512x8). Thus, one final constraint is that two components i and j can only be packed together if:

$mc_i = mc_j \qquad (7)$

With these constraints, we can formulate the packing problem as a multidimensional bin-packing problem. The physical arrays are bins, and the components are the objects to be packed. In order for two components to be packed in the same bin, constraints 6 and 7 as well as either 4 or 5 must be satisfied. Figure 5 summarizes this phase of the algorithm.

3.3 Phase 3: Wire together the Memories

After phase 2, the components implementing each logical memory may be scattered among several physical arrays. The final step in the algorithm is to combine the arrays and connect them to the rest of the circuit. If the "horizontal" partitioning was used in phase 1, the address ports can be simply wired together. If the "vertical" partitioning was used in phase 1, a multiplexor and decoder are needed to connect the components. Both of these techniques are described in [6] and [9] and so will not be discussed further here.

4 Results and Discussion

Our algorithm was implemented in a program called SPACK. To evaluate SPACK, we compare it to results obtained from the algorithm presented in [6]. That algorithm maps single-port logical memories to single-port arrays. Each logical memory is broken into components and each component is implemented by a single physical array (this is the same as our algorithm, without Phase 2). Although all logical memories considered in [6] were single-port, the algorithm will support dual-port memories, as long as the physical arrays are dual-port. Thus, we can compare it directly with our algorithm using benchmark circuits containing both single and dual-port logical memories.

Table 2 shows the results from SPACK and the previous algorithm for 19 benchmark circuits. The circuits and their logical memory configurations are shown in the first two columns of Table 2. Each circuit was mapped to 4-Kbit physical arrays, each of which can be used as a 4Kx1, 2Kx2, 1Kx4, or 512x8. The number of arrays required to implement each benchmark using each algorithm is shown in the third and fifth columns of the table. Averaged over all benchmark circuits, the previous algorithm required 9.89 arrays, while SPACK required only 7.1 arrays.
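As a quick arithmetic check on the averages quoted above, the sketch below recomputes the relative saving in arrays and applies the paper's utilization definition to the two 192x8 example from the Phase 2 discussion. The function names `utilization` and `reduction_percent` are illustrative and do not come from the paper.

```python
def utilization(logical_bits, bits_per_array, arrays_used):
    """Percentage of physical array bits that hold logical memory bits."""
    return 100.0 * logical_bits / (bits_per_array * arrays_used)


def reduction_percent(before, after):
    """Relative saving, in percent, when going from `before` to `after` arrays."""
    return 100.0 * (before - after) / before


# Average array counts reported in the text: 9.89 (previous) vs 7.1 (SPACK).
avg_reduction = reduction_percent(9.89, 7.1)
print(f"{avg_reduction:.1f}% fewer arrays on average")  # 28.2% fewer arrays on average

# Two single-port 192x8 memories (3072 bits) in 4096-bit arrays.
logical_bits = 2 * 192 * 8
print(utilization(logical_bits, 4096, 2))  # 37.5 (one array per memory)
print(utilization(logical_bits, 4096, 1))  # 75.0 (both packed into one array)
```

The first figure agrees with the 28% saving stated in the abstract and conclusions.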

Fig. 4. Combining Single-Port Memories into Dual-Port Physical Arrays (panels (a), (b) and (c))

Figure 4(b) shows how two components i and j can be combined "vertically". In this case, dc_i of the words are used to implement component i and dc_j are used to implement component j. By accessing words 0 to dc_i - 1 through one port and words dc_i to dc_i + dc_j - 1 through the other port, we can access each component independently. A straightforward implementation of this, however, would require an adder on the path feeding the second address port, since an offset of dc_i must be added to the second component's address. This adder would be on the critical path of the memory access, which could slow down the circuit.

We can eliminate the need for an adder as long as the following condition holds:

$B/W \ge dc_i + \mathrm{pow2}(dc_j) \qquad (5)$

where pow2(x) means x is "rounded-up" to the next highest power-of-two. As long as condition 5 holds, we can pack the components as shown in Figure 4(c). Component i is implemented starting at word 0, so no adder is needed on the first address port. Component j is implemented in the top pow2(dc_j) words. Then, log_2(pow2(dc_j)) address lines in the second address port can be used to address component j, while the remaining lines in the second address port are set to '1'. In this way, both memories can be accessed independently, and no adder is needed on either address port. The example in Figure 3(b) was implemented in this way.

Note that condition 5 is sufficient but not necessary. By employing the addressing techniques in [12], the constraint could be relaxed somewhat. Experimentation has shown, however, that for our problem, these more complex addressing schemes rarely lead to a final implementation that uses fewer memory arrays.

The above discussion has assumed that both components are single-ported.
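The adder-free addressing of Figure 4(c) can be illustrated with a minimal sketch. It assumes the array depth B/W is a power of two (true for every mode mentioned in the paper) and uses hypothetical helper names (`pow2`, `port2_address`); it models only the address wiring, not the data path.

```python
def pow2(x):
    """Round x up to the next power of two (exact powers are left unchanged)."""
    return 1 << (x - 1).bit_length()


def port2_address(addr_j, depth_j, array_depth):
    """Physical address seen on the second port for word addr_j of component j.

    The low log2(pow2(depth_j)) address lines carry addr_j; the remaining
    high lines are tied to '1', so no adder is needed.
    """
    if not 0 <= addr_j < depth_j:
        raise ValueError("address out of range for component j")
    span = pow2(depth_j)
    high_ones = array_depth - span  # mask of the upper address lines
    return high_ones | addr_j


# Example from Figure 3(b): two 192-word components in a 512-word array.
depth_i, depth_j, array_depth = 192, 192, 512
assert depth_i + pow2(depth_j) <= array_depth  # condition 5 holds

words_i = set(range(depth_i))  # port 1 uses addresses 0..191 directly
words_j = {port2_address(a, depth_j, array_depth) for a in range(depth_j)}
print(min(words_j), max(words_j))   # 256 447
print(words_i.isdisjoint(words_j))  # True
```

In this example the single upper address line of port 2 is tied to 1, which matches the description of Figure 3(b).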
If either component requires two access ports, then it cannot be packed with any other component, and must be implemented in its own physical array. In

Fig. 3. Two ways of implementing two single-port 192x8 logical memories (panels (a) and (b), each using dual-port 512x8 arrays)

192x8 logical memories using an architecture with B = 4096 and P = 2. Phase 1 would create two components, each 192x8. If these components were implemented directly, two memory arrays would be required, as shown in Figure 3(a). Even though the original logical memory configuration only consists of 3072 bits, a total of 8192 bits (two physical arrays) are used to implement it.

An alternative implementation is shown in Figure 3(b); in this implementation, each logical memory is mapped to a portion of a single array, and one of the array's two ports is used for each logical memory. The upper order address bit of port 1 is tied to 0, while the upper order address bit of port 2 is tied to 1. This ensures that each port sees a different 256x8 portion of the physical array. Since both ports are independent, both logical memories can be accessed independently.

This example illustrates the goal in phase 2: the components from phase 1 are packed into the available memory arrays such that the total number of required arrays is as small as possible. Informally, two single-port components can be packed into a dual-port physical array "vertically" or "horizontally".

Consider packing two arrays i and j "horizontally" as shown in Figure 4(a). In this case, the physical array is of width W, and the two components are of width wc_i and wc_j. Each word in the memory contains wc_i bits for component i and wc_j bits for component j. By supplying an address to the first port's address bus, and accessing data through bits 0 to wc_i - 1 of the first port's data bus, the first component can be accessed. The second component can be accessed in the same way using the second port's address bus and bits wc_i to wc_i + wc_j - 1 of the second port's data bus. Since the ports are independent, both components can be accessed independently. In order to combine arrays in this way, it is sufficient that:

$W \ge wc_i + wc_j \qquad (4)$

where W is the physical array width in the mode chosen to implement the components.
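The packing test and the first-fit loop summarized in Figure 5 might be sketched as follows. This is a sketch under stated assumptions, not the SPACK implementation: the `Component` class, the use of bit count as the "largest first" sort key, and the rule that a bin already holding two components is full (which follows from P = 2 and every component needing at least one port) are choices made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    width: int       # wc
    depth: int       # dc
    ports: int       # pc
    mode_width: int  # physical array width W of the chosen mode mc


def pow2(x):
    """Round x up to the next power of two."""
    return 1 << (x - 1).bit_length()


def fits(a, b, bits_per_array=4096, ports_per_array=2):
    """True if components a and b can share one physical array."""
    if a.ports + b.ports > ports_per_array:  # condition 6
        return False
    if a.mode_width != b.mode_width:         # condition 7
        return False
    width = a.mode_width
    depth = bits_per_array // width
    horizontal = a.width + b.width <= width            # condition 4
    vertical = (a.depth + pow2(b.depth) <= depth       # condition 5,
                or b.depth + pow2(a.depth) <= depth)   # either ordering
    return horizontal or vertical


def pack(components, bits_per_array=4096, ports_per_array=2):
    """First-fit packing, largest component (by bit count, an assumption) first."""
    bins = []
    for c in sorted(components, key=lambda c: c.width * c.depth, reverse=True):
        for b in bins:
            if len(b) == 1 and fits(c, b[0], bits_per_array, ports_per_array):
                b.append(c)
                break
        else:
            bins.append([c])  # no existing bin accepted c
    return bins


# The two single-port 192x8 memories of Figure 3, using the 512x8 mode.
mems = [Component(8, 192, 1, 8), Component(8, 192, 1, 8)]
print(len(pack(mems)))  # 1
```

For the Figure 3 example the horizontal test fails (8 + 8 > 8) but the vertical test succeeds (192 + 256 <= 512), so both memories land in a single array.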

Phase1:
    component list = ∅
    for each logical memory i {
        c = ∞
        for each physical array mode j (starting from widest) {
            c_j = ⌈w_i / W_j⌉ ⌈d_i / (B/W_j)⌉
            if c_j < c then {
                c = c_j
                m = j
            }
        }
        construct c new components for this logical memory
        calculate wc_k, dc_k, pc_k for each new component k
        mc_k = m for each new component k
        component list = component list ∪ new components
    }

Fig. 2. Summary of Phase 1 of the Algorithm

In general, there are many ways to partition each logical memory. To simplify the task, we only consider partitions in which each component has the same mc_i (that is, each component can be implemented by a physical array in the same mode). The partitions in Figures 1(a) and 1(b) would be considered, therefore, while the one in Figure 1(c) would not. Note that this only applies to components that make up a single logical memory. A logical memory configuration typically has several logical memories; components from different logical memories may correspond to different physical array modes.

Given this assumption, the number of components required to implement a logical memory i using physical array mode j is:

$c = \left\lceil \dfrac{w_i}{W_j} \right\rceil \left\lceil \dfrac{d_i}{B/W_j} \right\rceil \qquad (3)$

To find the partition that results in the smallest c, we cycle through all possible array modes and choose the best result. This partitioning is done independently for each logical memory in the logical memory configuration. This is summarized in Figure 2.

3.2 Phase 2: Bin Packing

Given the list of components found in phase 1, it is possible to implement the logical memory configuration directly by using one physical array for each component. As will be shown in Section 4, this often results in very poor utilization of the memory arrays. As an example, consider implementing two single-port

Fig. 1. Three ways to partition a 3584x3 logical memory (panels (a), (b) and (c))

3 Logical-to-physical Mapping Algorithm

In this section, our new logical-to-physical mapping algorithm is described. The algorithm consists of three phases: during the first phase, the logical memories are broken into components; in the second phase, these components are packed into the physical arrays; and in the third phase, these physical arrays are wired together to implement the original logical memory configuration.

3.1 Phase 1: Break Memories into Components

The first phase of the algorithm partitions each logical memory into several components, each of which is small enough to fit into a single physical array. Each component represents a portion of the bits in the original logical memory, and can be described by its width, wc_i, its depth, dc_i, and the number of ports required, pc_i. In order for the component to fit in a single physical array, wc_i and dc_i must satisfy the following inequalities:

$wc_i \le W_j \qquad (1)$

$dc_i \le B/W_j \qquad (2)$

for some value of j between 0 and M - 1 (recall that a physical array can be used in one of M modes, each of which has a different width/depth). The number of ports required by each component, pc_i, is the same as the number of ports required by the original logical memory. We also define a quantity mc_i for each component which indicates the physical array mode(s) that can be used to implement this component.

As an example, Figure 1 shows three ways in which a 3584x3 logical memory could be broken into components. In each case, it is assumed that each physical array consists of 4096 bits (B = 4096) and can be used as a 4096x1, 2048x2, 1024x4, or a 512x8. In Figure 1(a), wc_i = 1 for each component, while in Figure 1(b), all components have wc_i = 3. In Figure 1(c), one of the components has wc_i = 1, while the others have wc_i = 2.
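To make the 3584x3 example concrete, the sketch below counts the components each array mode would need when all components of a memory share one mode, using the product of ceilings that Phase 1 minimizes. The function names are hypothetical, and the tie-breaking (widest mode first, strict improvement only) follows the loop order given in the Phase 1 summary.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def components_needed(width, depth, bits_per_array, mode_width):
    """Components needed for a depth x width memory when every component uses one mode."""
    mode_depth = bits_per_array // mode_width
    return ceil_div(width, mode_width) * ceil_div(depth, mode_depth)


def best_mode(width, depth, bits_per_array, mode_widths):
    """Return (mode_width, count), scanning from the widest mode and keeping strict improvements."""
    best = None
    for w in sorted(mode_widths, reverse=True):
        c = components_needed(width, depth, bits_per_array, w)
        if best is None or c < best[1]:
            best = (w, c)
    return best


# 3584x3 logical memory, B = 4096, modes 4096x1, 2048x2, 1024x4, 512x8.
counts = {w: components_needed(3, 3584, 4096, w) for w in (1, 2, 4, 8)}
print(counts)                                   # {1: 3, 2: 4, 4: 4, 8: 7}
print(best_mode(3, 3584, 4096, (1, 2, 4, 8)))   # (1, 3)
```

The x1 mode needs three components, as in Figure 1(a), while the x4 mode needs four, as in Figure 1(b).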

used in many architectures today, we feel that true dual-port memories will be available in future devices. Thus, we focus our efforts on studying algorithms that target true dual-port memories.

2.2 User Circuit Assumptions

It is assumed that the user circuit to be implemented on the FPGA contains both logic and memory portions. In this paper, we are only concerned with the memory portion. We assume that the memory portion of the circuit consists of l independent user memories. We refer to each of these user memories as a logical memory. The set of all logical memories required for a circuit will be referred to as that circuit's logical memory configuration. The depth of logical memory k (0 ≤ k ≤ l - 1) will be denoted d_k, the width of logical memory k will be denoted w_k, and the number of ports required by logical memory k (maximum number of simultaneous accesses to memory k) will be denoted p_k. Unlike [6], we allow user memories that require either one or two ports. These parameters are summarized in the bottom half of Table 1.

2.3 Problem Statement

The problem studied in this paper can be stated as follows:

Given:
1. An FPGA architecture described by B, M and W_i (0 ≤ i < M) as described in Subsection 2.1 (this paper only considers architectures with P = 2),
2. A Memory Configuration described by l, d_k, w_k, and p_k (0 ≤ k < l), as described in Subsection 2.2 (in this paper, 1 ≤ p_k ≤ 2 for all k),

Find: An implementation of the logical memory configuration using n embedded memories.

Such that: n is as small as possible.

Note that the goal is to implement the logical memory configuration using as few physical arrays as possible. In an FPGA with N arrays, it may appear that minimizing the number of arrays required to implement the logical memory configuration is immaterial, as long as the implementation requires N or fewer arrays. Minimizing the number of arrays is important, however, since the remaining arrays can be configured as ROMs, and be very efficiently used to implement the logic part of the user circuit [7, 8]. The fewer arrays that are used to implement memory, the more that will be available to implement logic.
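The inputs of the problem statement can be captured in a small data model. The sketch below is only one possible encoding; the class and field names are hypothetical, and the example values are the Altera FLEX 10KE parameters quoted in Subsection 2.1 (B = 4096, P = 2, widths 2, 4, 8, 16).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Architecture:
    bits_per_array: int           # B
    ports_per_array: int          # P
    mode_widths: Tuple[int, ...]  # W_0 .. W_{M-1}

    def depth(self, mode: int) -> int:
        """Depth of an array in the given mode, B / W_mode."""
        return self.bits_per_array // self.mode_widths[mode]


@dataclass(frozen=True)
class LogicalMemory:
    depth: int  # d_k
    width: int  # w_k
    ports: int  # p_k

    def __post_init__(self):
        # The paper restricts user memories to one or two ports.
        if self.ports not in (1, 2):
            raise ValueError("p_k must be 1 or 2")


arch = Architecture(bits_per_array=4096, ports_per_array=2, mode_widths=(2, 4, 8, 16))
shapes = [(arch.depth(m), w) for m, w in enumerate(arch.mode_widths)]
print(shapes)  # [(2048, 2), (1024, 4), (512, 8), (256, 16)]

config = [LogicalMemory(depth=192, width=8, ports=1),
          LogicalMemory(depth=192, width=8, ports=1)]
print(sum(m.depth * m.width for m in config))  # 3072
```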

Parameter | Meaning
N | Number of Arrays
B | Bits per Array
P | Ports per Array
M | Number of Modes for each array
W_i | Data width of array in mode i
l | Number of Memories
d_k | Depth of Memory k
w_k | Width of Memory k
p_k | Ports in Memory k

Table 1. Architectural and Circuit Parameters

those obtained by simply extending a previous algorithm that was developed to target single-port arrays.

2 Problem Definition

In this section, we first describe our assumptions regarding the target FPGA architecture and the user circuit that is to be mapped, and then present a precise definition of the Logical-to-physical Memory Mapping problem.

2.1 Architectural Framework

The top half of Table 1 summarizes the parameters that define the FPGA embedded memory array architecture. The number of embedded memory arrays is denoted by N, the number of bits in each array is denoted by B, and the number of independent access ports in each array is denoted by P. Each array can be used in one of M different modes; each mode has a different width and depth. The width of each array in mode i is denoted W_i; the depth can be calculated as B/W_i. In the Altera FLEX10KE, B = 4096 bits, P = 2, M = 4, and {W_0, W_1, W_2, W_3} = {2, 4, 8, 16}, meaning each array is dual-port and can be configured to be one of 2048x2, 1024x4, 512x8, or 256x16.

In this paper, we will only consider dual-port arrays, i.e. P = 2. Note that some FPGA architectures, such as the Altera FLEX10KE, contain two independent ports, but one port is a dedicated read port and one port is a dedicated write port. This works well for many applications (such as a first-in first-out buffer that is used to temporarily hold data in a communication system), but there are many applications for which this is insufficient (a dual-port register file in a processor which must be read by two functional units simultaneously, for example). To implement these sorts of circuits, true dual-port memory arrays are required, in which the two accesses are independent, and either can be a read or write. With the increasing importance of embedded memory in FPGAs, and since true dual-port arrays appear to be a natural evolution from the restricted dual-port model

depth. As an example, the Altera 10KE devices contain between six and twenty 4-Kbit blocks, each of which can be used as a 2Kx2, a 1Kx4, a 512x8, or a 256x16 array.¹ These arrays can be combined to implement larger user memories.

The task of implementing the memories required by a user circuit using the FPGA embedded arrays is called logical-to-physical mapping [6]. Because of the large number of ways in which arrays can be combined, and because each array can be used in one of several modes (widths/depths), this problem is not trivial. Yet it is vitally important: since each FPGA contains only a few memory arrays, a sub-optimal implementation that wastes even one memory array could very easily cause a circuit to not fit on a given FPGA. Even if the memory configuration does fit on the FPGA, minimizing the number of arrays needed to implement the storage part of the circuit is beneficial because unused memory arrays can be configured as ROM and used to implement logic [7, 8].

In [6, 9], logical-to-physical mapping for FPGAs with single-port embedded arrays (arrays in which only one access can be performed at a time) is discussed. Many recent FPGAs, however, contain dual-port arrays (so that two accesses can be performed by each array concurrently) [2-4]. Many applications require memories that can be accessed simultaneously by two separate subcircuits; these applications can most efficiently be implemented if the FPGA has dual-port arrays.

In this paper, a new logical-to-physical mapping algorithm that targets dual-port arrays is presented. We show that this new algorithm results in much more efficient implementations than if we simply extend the techniques targeting single-port arrays [6, 9]. The user circuits are assumed to consist of both single and dual-port user memories; our improvement is obtained by intelligently packing the single-port user memories into the dual-port physical arrays. Under the right conditions, each dual-port array can implement two single-port memories (or parts of two single-port memories).

Besides [6] and [9], little work has been done in this area. Jha and Dutt describe an algorithm to map logical memories to physical library elements [10], but do not consider the optimizations that are possible when the physical elements are dual-port. Karchmer and Rose show how user memories can be implemented by larger physical memory chips, but only consider single-port physical arrays [11]. Their work is also different in that they consider discrete memory devices, which do not have the variety of modes that FPGA memory arrays have. There has also been considerable work mapping variables to both single and dual-port memories during high-level synthesis in an attempt to minimize the execution time of an algorithm [12-14]. None of these papers consider physical memories with the configurability of FPGA arrays, however.

This paper is organized as follows. Section 2 presents our assumptions regarding the FPGA architecture and the application circuits, and then gives a precise definition of the problem solved in this paper. Section 3 then describes our algorithm. Finally, Section 4 compares the results from our algorithm with

¹ In this paper, an axb memory has a words of b bits each.

Logical-to-physical Memory Mapping for FPGAs with Dual-Port Embedded Arrays

William K.C. Ho and Steven J.E. Wilton
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, B.C., Canada
{williamh, stevew}

Abstract. On-chip storage has become critical in large FPGAs. This has led most FPGA vendors to include configurable embedded arrays in their devices. Because of the large number of ways in which the arrays can be combined, and because of the configurability of each array, there are often many ways to implement the memories required by a circuit. Implementing user memories using physical arrays is called logical-to-physical mapping, and has previously been studied for single-port FPGA memory arrays. Most current FPGAs, however, contain dual-port arrays. In this paper, we present a logical-to-physical algorithm that specifically targets dual-port FPGA arrays. We show that this algorithm results in 28% denser memory implementations than the only previously published algorithm.

1 Introduction

It has become clear that on-chip storage is critical in large FPGAs. As FPGAs grow, they are being used to implement entire systems, rather than the small logic subcircuits that have traditionally been targeted to FPGAs. One of the important differences between these large systems and smaller logic subcircuits is that the large systems often require storage. Although this storage could be implemented off-chip, on-chip storage has a number of advantages. Besides the obvious advantages of integration, on-chip storage will often lead to higher clock frequencies, since I/O pins need not be driven with each memory access. In addition, on-chip storage will relax I/O pin requirements, since pins need not be devoted to external memory connections. These advantages have led most FPGA vendors to produce architectures with significant amounts of on-chip storage.

Since the storage requirements of circuits vary widely, the FPGA memory architecture must be flexible enough to implement different numbers of independently addressable memories as well as different memory shapes and sizes. Many recent commercial devices, such as the Altera 10K and 10KE devices [1, 2], the Xilinx Virtex FPGAs [3], the Actel 42MX [4], and the Lattice ispLSI 6192 FPGAs [5], provide several large arrays embedded into the FPGA. Each array can typically be used in one of several modes, each with a different width and
