WWW.IJITECH.ORG ISSN 2321-8665 Vol.05, Issue.09, September-2017, Pages:1693-1697
AJJAM PUSHPA 1, C. H. RAMA MOHAN 2
1 PG Scholar, Dept. of ECE (DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu, AP, India.
2 Assistant Professor, Dept. of ECE (DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu, AP, India.
Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multi-ported memory designs on field-programmable gate arrays (FPGAs). Excessive demand for BRAMs not only blocks their use by other parts of a design; the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a new perspective and a more efficient way of using a conventional two-read one-write (2R1W) memory as a 2R1W/4R memory. Using the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous XOR-based and live-value-table (LVT)-based approaches, the proposed designs can reduce BRAM usage by up to 53% and 69%, respectively, for 4R2W memory designs with 8K-depth.
Keywords: Block RAM (BRAM), Field-Programmable Gate Array (FPGA), Multi-Ported Memory, Performance.
I. INTRODUCTION
Field-programmable gate arrays (FPGAs) have been broadly used for fast prototyping of complex digital systems. FPGAs contain programmable logic arrays, usually referred to as slices [4]. Slices can be configured into different logic functions, and flexible routing channels support data transfer between slices. In addition to implementing logic operations, slices can also be used as storage elements, such as flip-flops, register files, or other memory modules.
Due to the increasing complexity of digital systems, there is a growing demand for in-system memory modules. Synthesizing a large number of memory modules would consume a significant number of slices and therefore result in an inefficient design. Excessive usage of slices could also limit the maximum size of a system that can be prototyped on an FPGA. To support in-system memory more efficiently, modern FPGAs deploy block RAMs (BRAMs): hard memory blocks integrated within an FPGA. Compared with storage modules synthesized from slices, BRAMs are more area and power efficient while achieving higher operating frequencies. Multi-ported memories, which allow multiple concurrent reads and writes, are frequently used in digital designs on FPGAs to achieve high memory bandwidth. For example, the register file of an FPGA-based scalar MIPS-like soft processor [3] requires one write port and two read ports. Processors that issue multiple instructions require even more access ports. A cache shared among multiple soft processors on an FPGA must support multiple concurrent accesses, and a routing table in a network switch needs to serve multiple requests from different ingress ports. Time multiplexing and task scheduling are alternative ways to support multiple accesses. However, these schemes usually lead to more complex designs with many corner cases, and therefore require extra verification effort. A generic multi-ported memory module can greatly ease the design of a system that must provide concurrent data accesses. This is especially important because FPGAs are often used for quick prototyping and functional verification.
II. PROPOSED DESIGN
This section proposes efficient solutions for implementing multi-ported memories on FPGAs.
Unlike the replication method in [1] and [3], the approach proposed in this paper supports multiple reads with XOR operations, while multiple writes are enabled using additional BRAMs. A remap table is added to track the location of the correct data. The main memory architecture is similar to that introduced in our previous work.
Fig 1. Example of a 2R1W memory implemented with the BDX technique.
Copyright @ 2017 IJIT. All rights reserved.
On top of the main architecture, this paper
introduces a new perspective of using a 2R1W module as either a 2R1W or a 4R module, denoted as a 2R1W/4R memory. By applying the 2R1W/4R, this paper exploits this versatile usage mode and proposes a hierarchical XOR-based design of 4R1W memory that requires fewer BRAMs than previous designs. Memories with more read/write ports can be supported by extending the proposed 2R1W/4R memory and the hierarchical 4R1W memory.
A. Techniques to Increase Read Ports
1. Bank Division With XOR Design Scheme: Bank division with XOR (BDX) is an approach to increase read ports proposed in [9]. Unlike the method used in [1] and [3], BDX avoids replicating the storage elements of the whole memory space. With BDX, multiple reads can be supported using XOR operations. Note that BDX is different from the XOR-based design in [2]: the approach in [2] uses XOR operations to increase write ports, storing encoded data to maintain coherence between memory modules, whereas BDX uses XOR operations to increase read ports, retrieving the target data from the encoded value.
Fig 2. Example of an mR1W memory implemented with multiple 2R1W modules.
Fig. 1 illustrates an example of a 2R1W memory implemented with the BDX scheme. As shown in Fig. 1(a), the memory space is distributed across four data banks (banks 0 to 3). One XOR-bank is added to keep the XOR values of the data banks. Each of these memory banks is assumed to be a 2RW module that can support two reads, or two writes, or one read and one write. The BDX design therefore becomes a 2R1W memory module that can support two reads and one write concurrently. The XOR-bank stores the XOR-ed value of all the data at the same offset in every data bank. The XOR operation for this example is shown in (1), where P_n represents the value stored at offset n of the XOR-bank and D(m,n) represents the data at offset n of bank m. Since there are four data banks in the example shown in Fig.
1(a), the value at the same offset (offset n) of each bank (banks 0 to 3) is XOR-ed and stored at the same offset (offset n) of the XOR-bank:
P_n = D(0,n) ⊕ D(1,n) ⊕ D(2,n) ⊕ D(3,n). (1)
As shown in Fig. 1(a), two reads R0 and R1 both go to bank 1. One read (R0) accesses the data directly from bank 1 (R0 = D(1,0)). Due to the bank conflict, the other read (R1) cannot access bank 1. Instead, the target data of R1 can be recovered by XOR-ing the values at the same offset of the other data banks as well as the XOR-bank. As shown in the following equation, the target data at offset n of data bank 1 can be recovered by XOR-ing the values at the same offset (offset n) of banks 0, 2, and 3, as well as the XOR-bank:
D(1,n) = D(0,n) ⊕ D(2,n) ⊕ D(3,n) ⊕ P_n. (2)
The write request in this example is implemented with a two-cycle pipeline architecture. As illustrated in Fig. 1(b), in the first cycle W0 writes the data D(1,n) directly to bank 1. In the same cycle, the data at the same offset of the other banks [D(0,n), D(2,n), and D(3,n)] are read and XOR-ed together with D(1,n). In the second cycle, the XOR-ed value is written back to the XOR-bank at offset n. With this pipeline architecture, the design can process one write request every cycle. The 2R1W memory introduced above only needs an additional XOR-bank of the same storage size as each data bank. Assuming all the data banks are of equal size, the number of memory entries in the XOR-bank is MemoryDepth/#DataBanks, where MemoryDepth is the total number of memory entries in the memory space and #DataBanks is the number of data banks in the multi-ported memory. To support more than two read ports, BDX deploys multiple 2R1W memories. Fig. 2 shows an example of an mR1W memory. Each 2R1W module supports two of the m reads; in this way, all m reads can be serviced concurrently by multiple 2R1W modules.
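To make the read-recovery and write-update steps concrete, the following Python sketch models the BDX 2R1W organization of Fig. 1 in software. It is a minimal behavioral model, not the hardware design: the bank count, depth, and function names are illustrative, and the two-cycle write pipeline of Fig. 1(b) is collapsed into a single call.

```python
NUM_BANKS = 4          # data banks, as in Fig. 1
BANK_DEPTH = 4         # entries per bank (hypothetical size for the sketch)

banks = [[0] * BANK_DEPTH for _ in range(NUM_BANKS)]
xor_bank = [0] * BANK_DEPTH

def write(bank, offset, value):
    """W0 path: store to the target bank, then refresh the XOR-bank entry.
    In hardware this takes the two-cycle pipeline of Fig. 1(b)."""
    banks[bank][offset] = value
    acc = 0
    for b in range(NUM_BANKS):
        acc ^= banks[b][offset]      # Eq. (1): P_n = XOR of all D(m, n)
    xor_bank[offset] = acc

def read_pair(bank0, off0, bank1, off1):
    """Service two reads. If both target the same bank, the second value is
    recovered by XOR-ing the other banks with the XOR-bank, as in Eq. (2)."""
    r0 = banks[bank0][off0]
    if bank1 != bank0:
        r1 = banks[bank1][off1]      # no conflict: direct read
    else:
        r1 = xor_bank[off1]          # conflict: recover via the XOR-bank
        for b in range(NUM_BANKS):
            if b != bank1:
                r1 ^= banks[b][off1]
    return r0, r1

write(1, 0, 0xAB)
write(2, 0, 0x37)
# Two conflicting reads to bank 1: R1 is recovered via XOR and matches R0.
print(read_pair(1, 0, 1, 0))
```

Note that the model only checks data consistency; it does not account for the port usage of the helper reads that the recovery performs on the other banks.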
Note that when implementing a multi-ported memory on FPGAs, replicating memory modules can significantly increase the usage of the limited BRAMs on an FPGA. To address this issue, this paper proposes a new architecture to increase read ports, referred to as hierarchical bank division with XOR (HBDX).
Fig 3. Two modes of the 2R1W/4R module. The module is implemented with four data banks and one XOR-bank. (a) 2R1W mode. (b) 4R mode.
2. 2R1W/4R (An Efficient Two-Mode Memory): To implement HBDX in an efficient way, this paper introduces a new perspective of using a 2R1W module as either a 2R1W or a 4R module. This new way of using the 2R1W module is denoted as 2R1W/4R. This hybrid module can support either 2R and 1W, or
4R. Note that the 2R1W/4R module uses exactly the same design as the 2R1W module introduced in Fig. 1. Fig. 3 illustrates how the two modes work. Fig. 3(a) shows the 2R1W mode. When there is a write request W0, this design can support up to two conflicting reads. The write request W0 stores the data directly to the target data bank, while the data at the same offset of the other data banks are read (R_update) to update the XOR-bank. Fig. 3(b) shows the 4R mode. This mode only works when there is no write request. In this case, the design can support up to four conflicting reads. Consider one of the worst cases, when all four reads (R0 to R3) go to bank 0. As illustrated in Fig. 3(b), R0 and R2 access bank 0 directly. At the same time, R1 and R3 can retrieve the target data by XOR-ing the values at the same offset of the other data banks as well as the XOR-bank.
3. HBDX Designs With the 2R1W/4R Module: Fig. 2 illustrates a design scheme that supports more read ports by replicating the 2R1W module. However, this scheme can significantly increase the usage of the limited BRAMs on an FPGA. To achieve a more BRAM-efficient design, this paper proposes HBDX, which adopts a hierarchical structure that organizes 2R1W modules into a 4R1W memory without replicating the 2R1W module. HBDX leverages the 2R1W/4R scheme introduced in the previous section as the basic building module to implement a 4R1W module. Fig. 4 illustrates a 4R1W memory design using the HBDX scheme. In this 4R1W design, each basic building block is a 2R1W module of the BDX scheme introduced in Fig. 1. According to the 2R1W/4R memory proposed in the previous section, this 2R1W module can be used as either a 2R1W or a 4R module. The HBDX design in Fig. 4 utilizes this versatile usage mode to achieve a more efficient 4R1W design.
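Under the same assumptions as the earlier sketch, the 4R mode of the 2R1W/4R module can be modeled as follows: when no write is present, each 2RW bank exposes two read ports, and any third or fourth read that conflicts on a bank falls back to XOR recovery. This is an illustrative model; the port-arbitration order and all names are assumptions, not the paper's implementation, and the helper-port usage of the recovery reads is not tracked.

```python
NUM_BANKS = 4
DEPTH = 4

banks = [[0] * DEPTH for _ in range(NUM_BANKS)]
xor_bank = [0] * DEPTH

def load(bank, offset, value):
    # Initialise the model while keeping the XOR-bank consistent (Eq. (1)).
    banks[bank][offset] = value
    acc = 0
    for b in range(NUM_BANKS):
        acc ^= banks[b][offset]
    xor_bank[offset] = acc

def xor_recover(bank, offset):
    # Rebuild D(bank, offset) from the other banks and the XOR-bank (Eq. (2)).
    val = xor_bank[offset]
    for b in range(NUM_BANKS):
        if b != bank:
            val ^= banks[b][offset]
    return val

def read4(requests):
    """4R mode (Fig. 3(b)): serve four (bank, offset) reads with no write.
    The first two reads to a bank use its two physical ports; any further
    read to that bank is recovered by XOR."""
    ports_used = [0] * NUM_BANKS
    results = []
    for bank, offset in requests:
        if ports_used[bank] < 2:
            ports_used[bank] += 1
            results.append(banks[bank][offset])        # direct read
        else:
            results.append(xor_recover(bank, offset))  # recovered read
    return results

load(0, 2, 0x5A)
# Worst case: all four reads hit bank 0; every read returns the same data.
print(read4([(0, 2), (0, 2), (0, 2), (0, 2)]))
```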
Consider one of the worst cases, when all the 4R and 1W go to bank 0. Two read requests, R0 and R1 in this example, read directly from bank 0. The other two read requests, R2 and R3, read the same offset from the other data banks (banks 1 to 3) and the XOR-bank to recover the target data. The write request W0 can be supported with a pipeline architecture similar to the one shown in Fig. 1(b): W0 stores the data directly to bank 0; the data at the same offset of the other data banks are read and XOR-ed in the same cycle; and the XOR-ed value is stored back to the XOR-bank in the next cycle. In this example, bank 0 is used in the 2R1W mode, while the other banks are used in the 4R mode. The HBDX-based 4R1W memory has been demonstrated to be more BRAM efficient than the previous approach of simply duplicating the 2R1W module [9].
Fig 4. HBDX 4R1W implemented with 2R1W/4R modules.
B. Techniques to Increase Write Ports
Bank division with remap table (BDRT) is an approach to increase write ports proposed in [9]. Unlike the LVT design used in [1] and [3], BDRT avoids replicating the whole memory space; it supports multiple writes using additional BRAMs and a remap table that tracks the location of the latest data. Fig. 5 shows an example of the design for a 1R2W memory. This example consists of two data banks (banks 0 and 1), one bank buffer, and a remap table. The Null entries in a memory bank are entries that do not store any valid data. When W0 and W1 are received, these requests first look up the remap table to identify the BRAM that stores the latest data. According to the remap table, W0 and W1 are, respectively, going to address 0 and address 1 in bank 0. In this case, W0 stores its value directly to address 0 in bank 0. W1 is directed to the bank buffer, whose offset 1 is a Null entry. At the same time, the remap table is updated to reflect the modified location of W1, as shown in Fig. 5(b). Address 1 of the remap table now stores the value 2, which is the identification number of the bank buffer; this state of the remap table indicates that the data of address 1 is now stored in the bank buffer.
Fig 5. Example of a 1R2W memory implemented with the BDRT technique.
Compared with the LVT approach used in previous works, BDRT requires smaller storage space. The extra cost of BDRT is the additional registers that implement the remap table. The required number of extra registers is shown in the following equation, where MemoryDepth is the total number of
memory entries in the memory space and #DataBanks is the number of data banks in the multi-ported memory:
#Registers for remap table = ⌈log2(#DataBanks + 1)⌉ × MemoryDepth. (3)
C. Integrating the Read/Write Techniques
This section shows the design of a multi-ported memory that integrates HBDX and BDRT. Fig. 6 shows an architecture that uses HBDX and BDRT to implement an mRnW memory. The memory architecture is divided into k data banks. Based on BDRT, supporting all the writes requires a total of n - 1 bank buffers. A hash mechanism is added to distribute the writes across the banks.
Fig 6. Architecture that integrates HBDX and BDRT to achieve an mRnW memory.
1. Example of 2R2W Memory: Fig. 7 shows an example of a 2R2W memory that applies this architecture. In Fig. 7(a), assume there are two write requests and two read requests. The writes W0 and W1 go to addresses 0 and 1, respectively. The reads R0 and R1 retrieve the data from addresses 2 and 3, respectively. Each bank is a 2R1W module that can support two read ports and one write port; the 2R1W module can be implemented with BDX as illustrated in Fig. 1. A read request accesses all the banks and chooses the data returned by the bank that stores the latest value of the target data, based on the record in the remap table. The two write requests, W0 and W1, look up the remap table for the correct banks to store their values. In this example, the addresses of the two writes, addresses 0 and 1, both belong to bank 0. In this case, W0 stores its value directly to bank 0, while W1 is directed to a bank buffer that has a Null entry at the same offset as W1 in bank 0. At the same time, the remap table is updated to reflect the modified location of W1, as shown in Fig. 7(b).
Fig 7. Example of a 2R2W memory that integrates BDX and BDRT.
2. Extend to 4RnW Memory With the 2R1W/4R Module: This section further extends the design to a 4RnW memory. According to the architecture presented in Fig. 6, a 4RnW memory requires 4R1W modules as building blocks. A 4R1W module can be implemented simply by replication.
IV. SIMULATION RESULTS
Each BRAM is 32-bit data width by 1K-depth, and can be configured in two-port mode (supporting one read and one write) or dual-port mode (supporting two reads, or two writes, or one read and one write). A slice contains four 6-input look-up tables, eight registers, and multiplexers.
Fig 8. Simulation diagram of BDX.
Fig 9. Schematic diagram of BDX.
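The BDRT write path walked through above (the 1R2W example of Fig. 5 and the 2R2W example of Fig. 7) can be sketched in software as below. The sketch uses two data banks, one bank buffer, and a remap table; the address-to-bank mapping, sizes, and helper names are assumptions for illustration, and only a single diversion per address is handled (the full design also recycles Null entries between banks and buffers). The last line evaluates the register-count formula of Eq. (3) for this geometry.

```python
import math

NUM_BANKS = 2
DEPTH = 4                    # addresses per bank (hypothetical)
BUFFER = NUM_BANKS           # identification number of the bank buffer
NULL = None                  # a Null entry holds no valid data

# banks[0..1] are data banks; banks[BUFFER] is the bank buffer.
banks = [[NULL] * DEPTH for _ in range(NUM_BANKS + 1)]
# Remap table: for each address, which physical bank holds the latest data.
remap = [addr // DEPTH for addr in range(NUM_BANKS * DEPTH)]

def write2(w0, w1):
    """Service two writes per cycle; each w is an (address, value) pair.
    W0 writes in place; if W1 maps to the same bank, it is diverted to the
    bank buffer and the remap table is updated, as in Fig. 5(b)."""
    (a0, v0), (a1, v1) = w0, w1
    b0 = remap[a0]
    banks[b0][a0 % DEPTH] = v0           # W0 writes directly
    b1 = remap[a1]
    if b1 == b0:                         # bank conflict: divert W1
        off = a1 % DEPTH
        assert banks[BUFFER][off] is NULL, "buffer entry must be free"
        banks[BUFFER][off] = v1
        banks[b1][off] = NULL            # the old copy is now stale
        remap[a1] = BUFFER               # latest data lives in the buffer
    else:
        banks[b1][a1 % DEPTH] = v1

def read(addr):
    # A read consults the remap table to find the bank with the latest data.
    return banks[remap[addr]][addr % DEPTH]

write2((0, 11), (1, 22))                 # both addresses map to bank 0
print(read(0), read(1), remap[1])        # address 1 now lives in the buffer

# Eq. (3): remap-table registers = ceil(log2(#DataBanks + 1)) * MemoryDepth
print(math.ceil(math.log2(NUM_BANKS + 1)) * (NUM_BANKS * DEPTH))
```

For this geometry (two banks, one buffer, eight addresses) the remap table needs 2 bits per entry, i.e., 16 registers in total.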
Fig 10. RTL diagram of BDX.
A. Area Description
A significant area overhead of 42% results from the inclusion of the BDX. However, the overhead reduces to 8% when estimated with respect to the router area, and further to 2% when estimated with respect to the area of the entire design. This is because the mapping tables are synthesized from the registers and connections in FPGA slices. Large mapping tables not only occupy slices but also introduce more complex decision logic and routing. Extra consumption of slices also blocks the usage of slices by other logic modules, and consequently impacts the overall timing of the design. The experimental results also reveal that when slice utilization approaches the capacity of the target FPGA, the timing degradation can be very significant.
B. Power Report
The proposed designs can reduce the power of the BDX by up to 5.5% compared with the XOR-based design. The design environment is based on Xilinx ISE 14.6, with the synthesis tool set to favor clock frequency. The data width is 32 bits for all the multi-ported designs in this paper. The designs of multi-ported memory assume no stalls.
V. CONCLUSION
This paper proposes efficient BRAM-based multi-ported memory designs on FPGAs. Existing design methods require significant amounts of BRAMs to implement a memory module that supports multiple read and write ports. Occupying too many BRAMs for multi-ported memory can seriously restrict the usage of BRAMs by other parts of a design. This paper proposes techniques that attain efficient multi-ported memory designs. It introduces a novel 2R1W/4R memory and, by exploiting the 2R1W/4R as the building block, proposes a hierarchical design of 4R1W memory that requires 33% fewer BRAMs than the previous designs based on replication. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory.
Compared with XOR-based and LVT-based approaches, the proposed designs can reduce BRAM usage by up to 53% and 69%, respectively, for 4R2W memory designs with 8K-depth. For complex multi-ported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For a 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs while at the same time enhancing the operating frequency by 20%.
VI. REFERENCES
[1] C. E. LaForest and J. G. Steffan, "Efficient multi-ported memories for FPGAs," in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2010, pp. 41-50.
[2] C. E. LaForest, M. G. Liu, E. Rapati, and J. G. Steffan, "Multi-ported memories for FPGAs via XOR," in Proc. 20th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2012, pp. 209-218.
[3] C. E. LaForest, Z. Li, T. O'Rourke, M. G. Liu, and J. G. Steffan, "Composing multi-ported memories on FPGAs," ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 3, Aug. 2014, Art. no. 16.
[4] Xilinx, 7 Series FPGAs Configurable Logic Block User Guide, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
[5] Xilinx, Zynq-7000 All Programmable SoC Overview, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds190-zynq-7000-Overview.pdf
[6] G. A. Malazgirt, H. E. Yantir, A. Yurdakul, and S. Niar, "Application specific multi-port memory customization in FPGAs," in Proc. IEEE Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2014, pp. 1-4.
[7] H. E. Yantir and A. Yurdakul, "An efficient heterogeneous register file implementation for FPGAs," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2014, pp. 293-298.
[8] H. E. Yantir, S. Bayar, and A. Yurdakul, "Efficient implementations of multi-pumped multi-port register files in FPGAs," in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Sep. 2013, pp. 185-192.
[9] J.-L. Lin and B.-C. C. Lai, "BRAM efficient multi-ported memory on FPGA," in Proc. Int. Symp. VLSI Design, Autom. Test (VLSI-DAT), Apr. 2015, pp. 1-4.