ISSN Vol.05,Issue.09, September-2017, Pages:

Similar documents
Efficient Designs of Multiported Memory on FPGA

Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

Resource Efficient Multi Ported Sram Based Ternary Content Addressable Memory

Area Efficient Z-TCAM for Network Applications

Design and Implementation of CVNS Based Low Power 64-Bit Adder

FPGA Based Low Area Motion Estimation with BISCD Architecture

DESIGN OF PARAMETER EXTRACTOR IN LOW POWER PRECOMPUTATION BASED CONTENT ADDRESSABLE MEMORY

HIGH-PERFORMANCE RECONFIGURABLE FIR FILTER USING PIPELINE TECHNIQUE

Reliability of Memory Storage System Using Decimal Matrix Code and Meta-Cure

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

GENERATION OF PSEUDO-RANDOM NUMBER BY USING WELL AND RESEEDING METHOD. V.Divya Bharathi 1, Arivasanth.M 2

Novel Design of Dual Core RISC Architecture Implementation

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

POWER REDUCTION IN CONTENT ADDRESSABLE MEMORY

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

VLSI Implementation of Parallel CRC Using Pipelining, Unfolding and Retiming

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

A Countermeasure Circuit for Secure AES Engine against Differential Power Analysis

Xilinx Based Simulation of Line detection Using Hough Transform

HDL IMPLEMENTATION OF SRAM BASED ERROR CORRECTION AND DETECTION USING ORTHOGONAL LATIN SQUARE CODES

Modular Multi-ported SRAMbased. Ameer M.S. Abdelhadi Guy G.F. Lemieux

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Three-D DWT of Efficient Architecture

Design and Implementation of 3-D DWT for Video Processing Applications

of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture

Design and Implementation of Buffer Loan Algorithm for BiNoC Router

FPGA Implementation of Efficient Carry-Select Adder Using Verilog HDL

Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification

Fast FPGA Routing Approach Using Stochestic Architecture

Composing Multi-Ported Memories on FPGAs

Area Delay Power Efficient Carry Select Adder

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

Design and Implementation of FPGA Logic Architectures using Hybrid LUT/Multiplexer

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

DESIGN AND IMPLEMENTATION OF APPLICATION SPECIFIC 32-BITALU USING XILINX FPGA

New Approach for Affine Combination of A New Architecture of RISC cum CISC Processor

VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier

Power Optimized Programmable Truncated Multiplier and Accumulator Using Reversible Adder

Novel Implementation of Low Power Test Patterns for In Situ Test

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

ISSN Vol.03, Issue.02, March-2015, Pages:

ISSN Vol.08,Issue.12, September-2016, Pages:

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech)

AN IMPLEMENTATION THAT FACILITATE ANTICIPATORY TEST FORECAST FOR IM-CHIPS

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

Online Heavy Hitter Detector on FPGA

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

DUE to the high computational complexity and real-time

Design and Implementation of VLSI 8 Bit Systolic Array Multiplier

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

High Performance Interconnect and NoC Router Design

Implementation of a Bi-Variate Gaussian Random Number Generator on FPGA without Using Multipliers

[Kalyani*, 4.(9): September, 2015] ISSN: (I2OR), Publication Impact Factor: 3.785

Fault Tolerant Parallel Filters Based on ECC Codes

Effective Implementation of LDPC for Memory Applications

Design of AHB Arbiter with Effective Arbitration Logic for DMA Controller in AMBA Bus

Design of FM0 and Manchester Encoder and Decoder for DSRC Application Using SOLS Technique

Available online at ScienceDirect. Procedia Technology 25 (2016 )

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders

IMPLEMENTATION OF CONFIGURABLE FLOATING POINT MULTIPLIER

LOW-POWER SPLIT-RADIX FFT PROCESSORS

Design and Characterization of High Speed Carry Select Adder

Run-time power and performance scaling in 28 nm FPGAs

Memory-efficient and fast run-time reconfiguration of regularly structured designs

FPGA Implementation of ALU Based Address Generation for Memory

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

Area Delay Power Efficient Carry-Select Adder

ISSN: [Shubhangi* et al., 6(8): August, 2017] Impact Factor: 4.116

Design of Delay Efficient Carry Save Adder

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

Carry Select Adder with High Speed and Power Efficiency

Implementation of SCN Based Content Addressable Memory

DETECTION AND CORRECTION OF CELL UPSETS USING MODIFIED DECIMAL MATRIX

DESIGN AND IMPLEMENTATION ARCHITECTURE FOR RELIABLE ROUTER RKT SWITCH IN NOC

High-throughput Online Hash Table on FPGA*

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs

Power Optimized Transition and Forbidden Free Pattern Crosstalk Avoidance

Prachi Sharma 1, Rama Laxmi 2, Arun Kumar Mishra 3 1 Student, 2,3 Assistant Professor, EC Department, Bhabha College of Engineering

An Energy Improvement in Cache System by Using Write Through Policy

ISSN Vol.03,Issue.05, July-2015, Pages:

IMPLEMENTATION OF DIGITAL CMOS COMPARATOR USING PARALLEL PREFIX TREE

RESEARCH ARTICLE Software/Hardware Parallel Long Period Random Number Generation Framework Based on the Well Method

the main limitations of the work is that wiring increases with 1. INTRODUCTION

CHAPTER 4 BLOOM FILTER

Unique Journal of Engineering and Advanced Sciences Available online: Research Article

FPGA Implementation of High Speed AES Algorithm for Improving The System Computing Speed

/$ IEEE

FPGA Implementation of Double Error Correction Orthogonal Latin Squares Codes

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC

Synthesis of VHDL Code for FPGA Design Flow Using Xilinx PlanAhead Tool

Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems.

Transcription:

WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu, AP, India. 2 Assistant Professor, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu, AP, India. Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multi ported memory designs on field programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous xor-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. Keywords: Block RAM (BRAM), Field-Programmable Gate Array (FPGA), Multi Ported Memory, Performance. I. INTRODUCTION Field-Programmable gate arrays (FPGAs) have been broadly used in fast prototyping of complex digital systems. FPGAs contain programmable logic arrays, usually referred to as slices [4]. Slices can be configured into different logic functions. The flexible routing channels can support data transferring between logic slices. In addition to implementing logic operations, if needed, the slices can also be used as storage elements, such as flip-flops, register files, or other memory modules. Due to the increasing complexity of digital systems, there is a growing demand for in-system memory m1odules. Synthesizing a large number of memory modules would consume a significant amount of slices, and would therefore result in an inefficient design. The excessive usage of slices could also pose a limiting factor to the maximum size of a system that can be prototyped on an FPGA. To more efficiently support the in-system memory, modern FPGAs deploy block RAMs (BRAMs) that are hardcore memory blocks integrated within an FPGA to support efficient memory usage in a design. Compared with the storage module synthesized by slices, BRAMs are more area and power efficient while at the same time achieving higher operating frequencies. Multi ported memories, which allow multiple concurrent reads and writes, are frequently used in various digital designs on FPGAs to achieve high memory bandwidth. For example, the register file of an FPGA-based scalar MIPS-like soft processor [3] requires one write port and two read ports. Processors that issue multiple instructions require even more access ports. The shared cache system among multiple soft processors on FPGA should support multiple concurrent accesses. A routing table in a network switching function would also need to enable multiple accesses in order to support multiple requests from different ingress ports. Time multiplexing and task scheduling are alternative solutions to support multiple accesses. However, these schemes would usually cause more complex designs with many corner cases and therefore requiring extra effort of verification. Having a generic multi ported memory module can greatly ease the design when a system really needs to provide concurrent data accesses. This concern becomes more important especially when FPGAs are usually used for quick prototyping and functional verification. II. PROPOSED DESIGN This section proposes efficient solutions to implement multi ported memories on FPGAs. Unlike the replication method in [1] and [3], the approach proposed in this paper supports multiple reads with XOR operations, while multiple writes can be enabled using additional BRAMs. A remap table is Fig 1. Example of a 2R1W memory implemented by BDX technique. Added to track the location of the correct data the main memory architecture is similar to that of our previous work introduced. On top of the main architecture, this paper Copyright @ 2017 IJIT. All rights reserved.

introduces a brand new perspective of using a 2R1W module as either a 2R1W or a 4R module, denoted as a 2R1W/4R memory. By applying the 2R1W/4R, this paper exploits the versatile usage mode and proposes a hierarchical XOR-based design of 4R1W memory that requires fewer BRAMs than previous designs. Memories with more read/write ports can be supported by extending the proposed 2R1W/4R memory and the hierarchical 4R1W memory. A. Techniques to Increase Read Ports 1.Bank Division With XOR Design Scheme: Bank Division With XOR (BDX) is an approach to increase read ports proposed in [9]. Unlike the method used in [1] and [3], BDX avoids replicating the storage elements of the whole memory space. With BDX, multiple reads can be supported by using the XOR operations. Note that BDX is different from the XORbased design in [2]. The XOR-based approach in [2] uses XOR operations to increase write ports by storing the encoded data to maintain the data coherence between memory modules. BDX uses XOR operations to increase read ports by retrieving the target data from the encoded value. Fig 2.Example of mr1w memory implemented with multiple 2R1W modules. Fig1 illustrates an example of a 2R1W memory implemented with the BDX scheme. As shown in Fig. 1(a), the memory space is distributed to four data banks (banks 0 3). One XOR-bank is added to keep the XOR values of the data banks. Each of these memory banks is assumed to be a 2RW module that can support two reads, or two writes, or one read and one write. Therefore, the BDX design becomes a 2R1W memory module which can support two reads and one write concurrently. The XOR-bank stores the XOR-ed value of all the data at the same offset in every data bank. The XOR operation for this example is shown in (4). P n represents the value stored at offset n of the XOR-bank. D( m,n ) represents the data at offset n of bank m Since there are four data banks in the example shown in Fig. 6(a), the value at the same offset (offset n) of each bank (banks 0 3) will be XOR-ed and stored to the same offset (offset n) at the XOR-bank P n = D (0,n) D (1,n) D (2,n) D (3,n). (1) As shown in Fig. 1(a), two reads R 0 and R 1 are both going to bank 1. In this example, one read (R0) will access the data directly from bank 1 (R 0 = D (1,0 )). Due to the bank conflict, the other read R1 cannot access bank 1. Instead, the target data of R1 can be recovered by XOR-ing the values at the same offset AJJAM PUSHPA, C. H. RAMA MOHAN of the other data banks as well as the XOR-bank. As shown in the following equation, the target data at offset n of data bank 1 can be recovered by XOR-ing the values at the same offset (offset n) from banks 0 to 3, as well as the 2. XOR-bank: D (1,n) = D (0,n) D (2,n) D (3,n) Pn. (2) The write request in this example is implemented in the twocycle pipeline architecture. As illustrated in Fig. 1(b), at the first cycle W 0 writes the data D (1,n) directly to bank 1. At the same cycle, the data at the same offset of the other banks [D (0,n) D (2,n) and D (3,n) ] together with D (1,n) are read and XOR-ed. At the second cycle, the XOR-ed value will be written back to the XOR bank at offset n. With the pipeline architecture, this design can process one write request every cycle. The 2R1W memory introduced previously only needs an additional XOR-bank that requires the same storage size of each data bank. In this case, assume all the data banks are of equal size. The number of memory entries in the XOR-bank is Memory Depth/#Data Banks, where Memory Depth represents the total number of memory entries in the memory space and #Data Banks denotes the number of data banks in the multi ported memory. To support more than two read ports, BDX needs to deploy multiple 2R1W memories. Fig. 2 shows an example of mr1w memory. Each 2R1W module supports two out of m reads. In this way, all the m reads can be serviced concurrently by multiple 2R1W modules. Note that when implementing a multi ported memory on FPGAs, replicating memory modules could significantly increase the usage of the limited BRAMs on an FPGA. To address this issue, this paper proposes a new architecture to increase read ports, referred to as hierarchical bank division with XOR (HBDX). HBDX applies Fig 3. Two modes of 2R1W/4R module. The module is implemented with four data banks and one XOR-bank. (a) 2R1W mode. (b) 4R mode. 2.2R1W/4R(An Efficient Two-Mode Memory): To implement HBDX in an efficient way, this paper introduces a brand new perspective of using a 2R1W module as either a 2R1W or a 4R module. This new way of using the 2R1W module is denoted as 2R1W/4R. This hybrid module can support either 2R and 1Wor

4R. Note that the 2R1W/4R module uses exactly the same design as the 2R1W module introduced in Fig. 1. Fig. 3 illustrates how the two modes work. Fig. 3(a) shows the 2R1W mode. When there is a write request W 0, this design can support up to two conflicting reads. The write request W0 stores the data directly to the target data bank, and reads all the data at the same offset from the other data banks (R update) to update the XOR-bank. Fig. 3(b) shows the 4R mode. This mode only works when there is no write request. In this case, the design can support up to four conflicting reads. Consider one of the worst cases when all the four reads (R0 to R3) are going to bank 0. As illustrated in Fig. 8(b), R0 and R2 access bank 0 directly. At the same time, R1 and R3 can retrieve the target data by XOR-ing the values at the same offset of the other data banks as well as the XOR-bank. 3. HBDX Designs With 2R1W/4R Module: Fig. 2 illustrates a design scheme that can support more read ports by replicating the 2R1W module. However, this design scheme could significantly increase the usage of the limited BRAMs on an FPGA. To achieve a more BRAM-efficient design, this paper proposes HBDX, which adopts a hierarchical structure that organizes the 2R1W to achieve 4R1W without replicating the 2R1W module. To further enhance the design, HBDX in this section leverages the 2R1W/4R scheme introduced in the previous section as the basic building module to implement a 4R1W module. Fig. 4 illustrates a 4R1W memory design by using the HBDX scheme. In this 4R1W design, each basic building block is a 2R1W module of the BDX scheme introduced in Fig. 1. According to the 2R1W/4R memory proposed in the previous section, this 2R1W module can be used as either a 2R1W or a 4R module. The HBDX in Fig. 4 will utilize this versatile usage mode to achieve a more efficient 4R1W design. data. And the write request W 0 can be supported with a pipeline architecture similar to the one shown in the example of Fig. 1(b). In this case, W 0 stores the data directly to bank 0. The data at the same offset of the other data banks are read and XOR-ed at the same cycle. The XOR-ed value will be stored back to the XOR bank in the next cycle. In this example, bank 0 is used in the 2R1W mode, while other banks are used in the 4R mode. The HBDX-based 4R1W memory has demonstrated to be more BRAM efficient than the previous approach of simply duplicating the 2R1W module [9]. B. Techniques to Increase Write Ports Bank division with remap table (BDRT) is an approach to increase write ports proposed in [9]. Unlike the LVT design used in [1] and [3], BDRT avoids replicating the whole memory space and supports multiple writes using additional BRAMs and a remap table to track the location of the latest data. Fig. 5 shows an example of the design for a 1R2W memory. This example consists of two data banks (banks 0 and 1), one bank buffer, and a remap table. The Null entries in a memory bank are the entries that do not store any valid data. When receiving W 0 and W 1, these requests will first look up the remap table to identify the correct BRAM that stores the latest data. According to the remap table, W 0 and W1 are, respectively, going to address 0 and address 1 in bank 0. In this case, W 0 will store the value directly to address 0 in bank 0. W 1 will be directed to the bank buffer whose offset 1 is a Null entry. At the same time, the remap table needs to be updated to reflect the modified location of W 1, as shown in Fig. 5(b). The address 1 of the remap table stores the value 2 which is the identification number of the bank buffer. This state of remap table shows that the data of address 1 is now stored in the bank buffer. Fig 4. HBDX 4R1W implemented with 2R1W/4R modules. Consider one of the worst cases when all the 4R and 1W are going to bank 0. Two read requests, R 0 and R 1 in this example, would read directly from bank 0. The other two read requests, R 2 and R 3, would read the same offset from the other data banks (banks 1 3) and the XOR-bank to recover the target Fig 5. Example of a 1R2W memory implemented with BDRT technique Compared with the LVT approach used in previous works, the BDRT requires smaller storage space. The extra cost of BDRT is the additional registers to implement the remap table. The required number of these extra registers is shown in the following equation, where Memory Depth is the total number of

memory entries in the memory space and #Data Banks denotes the number of data banks in the multi ported memory # of registers for remap table =([log2(#databanks+1)] 1) MemoryDepth. (3) C. Integrating the Read/Write Techniques This section shows the design of a multiported memory that integrates HBDX and BDRT. Fig.5 shows an architecture that uses HBDX and BDRT to implement an mrnw memory. This memory architecture is divided into k data banks. Based on BDRT, to support all the writes, there require total n 1 bank buffers. A hash mechanism is added to distribute the writes to banks. AJJAM PUSHPA, C. H. RAMA MOHAN write requests, W 0 and W 1, will look up the remap table for the correct banks to store the values. In this example, the addresses of the two writes, addresses 0 and 1, both belong to bank 0. In this case, W 0 will store the value directly to bank 0. W 1 will be directed to a bank buffer that has a Null entry at the same offset of W 1 on bank 0. At the same time, the remap table needs to be updated to reflect the modified location of W 1, as shown in Fig. 7(b). 2. Extend to 4RnW Memory With 2R1W/4R Module: This section further extends the design to a 4RnWmemory. According to the architecture presented in Fig. 6, a 4RnW memory requires 4R1W modules as building blocks. A 4R1W module can be implemented simply by the replication. IV. SIMULATION RESULTS Each BRAM is 32-bit data width by 1K-depth, and can be configured as two-port mode (supporting one read and one write), or dual-port mode (supporting two reads, or two writes, or one read and one write). A slice contains four 6-input look-up tables, eight registers, and multiplexers. Fig 6. Architecture that integrates HBDX and BDRT to achieve a mrnw memory. 1. Example of 2R2W Memory: Fig. 7 shows an example of 2R2W memory that applies this architecture. In Fig. 7(a), assume there are two write requests and two read requests. The writes W 0 and W 1 are going to addresses 0 and 1, respectively. The reads R 0 and R 1 are retrieving the data from Fig 8. Simulation diagram of BDX. Fig 7. Example of a 2R2W memory that integrates BDX and BDRT. Addresses 2 and 3, respectively each bank is a 2R1W module that can support two read ports and one write port. The 2R1W module can be implemented with BDX as illustrated in Fig. 1. A read request will access all the banks and choose the data returned by the bank that stores the latest value of the target data based on the record in the remap table. The two Fig 9. Schematic Diagram of BDX.

Fig 10. RTL Diagram of BDX. A. AREA Description A significant amount of 42% area overhead results due to inclusion of the BDX. However, the overhead when estimated with respect to the router area reduces to 8% and still further to 2% when estimated with respect to the area of the entire BDX.This is because the mapping tables are synthesized by the registers and connections in slices of FPGAs. Having large mapping tables would not only occupy slices, but also introduce more complex decision logics and routing. Extra consumption of slices also blocks the usage of slices by other logic modules, and would consequently impact on the overall timing of the design. The experimental results have also revealed that when the slice utilization approaches the capacity of the target FPGA, the timing degradation could be very significant. B. Power Report The proposed designs can reduce the power for a BDX by up to 5.5 percent compared with XOR Based design. The design environment is based on Xilinx ISE 14.6. The synthesis tool is set to favor clock frequency. The data width is 32 bit for all the multi ported designs in this paper. The designs of multi ported memory assume no stalls. V. CONCLUSION This paper proposes efficient BRAM-based multiported memory designs on FPGAs. The existing design methods require significant amounts of BRAMs to implement a memory module that supports multiple read and write ports. Occupying too many BRAMs for multiported memory could seriously restrict the usage of BRAMs for other parts of a design. This paper proposes techniques that can attain efficient multiported memory designs. This paper introduces a novel 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper proposes a hierarchical design of 4R1W memory that requires 33% fewer BRAMs than the previous designs based on replication. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with XOR-based and LVT-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multi ported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs while at the same time enhance the operating frequency by 20%. VI. REFERENCES [1] C. E. LaForest and J. G. Steffan, Efficient multi-ported memories for FPGAs, in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2010, pp. 41 50. [2] C. E. LaForest, M. G. Liu, E. Rapati, and J. G. Steffan, Multi-ported memories for FPGAs via XOR, in Proc. 20th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2012, pp. 209 218. [3] C. E. Laforest, Z. Li, T. O Rourke, M. G. Liu, and J. G. Steffan, Composing multi-ported memories on FPGAs, ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 3, Aug. 2014, Art. no. 16. [4] Xilinx. 7 Series FPGAs Configurable Logic Block User Guide, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug47 4_7Series_CLB.pdf [5] Xilinx. Zynq-7000 All Programmable SoC Overview, accessedonmay30,2016.[online].available:ttp://www.xilinx.co m/support/documentation/data_sheets/ds190-zynq-7000- Overview.pdf [6] G. A. Malazgirt, H. E. Yantir, A. Yurdakul, and S. Niar, Application specific multi-port memory customization in FPGAs, in Proc. IEEE Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2014, pp. 1 4. [7] H. E. Yantir and A. Yurdakul, An efficient heterogeneous register file implementation for FPGAs, in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2014, pp. 293 298. [8] H. E. Yantir, S. Bayar, and A. Yurdakul, Efficient implementations of multi-pumped multi-port register files in FPGAs, in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Sep. 2013, pp. 185 192. [9] J.-L. Lin and B.-C. C. Lai, BRAM efficient multi-ported memory on FPGA, in Proc. Int. Symp. VLSI Design, Autom. Test (VLSI-DAT), Apr. 2015, pp. 1 4.