A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs


Harrys Sidiropoulos, Kostas Siozios and Dimitrios Soudris
School of Electrical & Computer Engineering, National Technical University of Athens, Greece
{harrys, ksiop,

Abstract. This paper introduces a novel methodology for the rapid exploration of memory hierarchies on FPGA devices. The methodology is supported in software by a new open-source tool framework, named NAROUTO. Among other things, the framework enables critical tasks during architecture design, such as memory hierarchy selection and floor-planning. Furthermore, the NAROUTO framework is the only available solution for power/energy evaluation of different memory organizations. Experimental results show that the NAROUTO framework leads to significant area, power (about 82%) and performance (about 46%) improvements compared to existing solutions.

Keywords: FPGA, CAD Tool, Exploration Framework.

1 Introduction

Recently, reconfigurable architectures, and more specifically Field-Programmable Gate Arrays (FPGAs), have become efficient alternatives to Application-Specific Integrated Circuits (ASICs) due to their inherent re-programmability. Apart from logic and routing infrastructure, FPGA platforms include more complex components (e.g. memory blocks, DSP cores, embedded CPUs, etc.) that further improve their efficiency. One of the most important tasks in designing an efficient FPGA device is architecture-level exploration, which determines the architecture of the building blocks/components as well as their optimal organization. This problem has become even more important due to the increased complexity posed by additional (heterogeneous) IP blocks. To accomplish this task, many tools have been released that automate the exploration procedure, starting from synthesis and technology mapping [1, 2], up to placement and routing (P&R) [3], and power/energy estimation [5].
Since these tools support only devices consisting of configurable logic blocks (CLBs) and routing infrastructure, they cannot be employed for architecture-level exploration of FPGA platforms that also contain more complex IP blocks (e.g. memories, DSP cores, etc.). Although commercial frameworks (e.g. [6]) support heterogeneity and power estimation, they allow only a small degree of architecture-level exploration. Recently, two frameworks that support application mapping onto FPGAs with such IP cores were introduced [1, 11]. These frameworks are based on a commercial synthesizer [6], while the P&R step is performed with [4]. Even though these solutions alleviate the limitation regarding heterogeneity support, they do not provide results about power consumption (dynamic or static) or energy dissipation.

In this paper we propose a new framework that supports architecture-level exploration and power estimation for FPGAs incorporating different memory hierarchies and organizations, in terms of delay, area and power/energy consumption. More specifically, the contributions of this work, as compared to prior publications, are summarized as follows:

- Introduction of a novel methodology for exploring memory hierarchies and organizations targeting FPGAs.
- Extension of an existing tool for power/energy consumption estimation [5, 9], so that it also handles designs with different types, as well as multiple instantiations, of memory blocks.
- Development of a new open-source tool framework, named NAROUTO (publicly available at [10]), that supports the proposed methodology in software.

The rest of the paper is organized as follows: Section 2 highlights the dominant problems of existing tools regarding heterogeneity support. The proposed framework is described in Section 3, while Section 4 discusses the experimental results that demonstrate the efficiency of the proposed framework. Finally, conclusions are summarized in Section 5.

This work was supported by the HiPEAC Grant entitled "On Providing Dynamic Reliability Improvement in FPGA".

2 LIMITATIONS IN HETEROGENEITY SUPPORT

In this section we highlight the main limitations in heterogeneity support of the available frameworks [1, 11]. For both of these solutions, synthesis and technology mapping are performed with Altera Quartus II [6], while placement and routing (P&R) is supported by the VPR 5.0 tool [4]. As we show below, these tools cannot handle, with a push-button approach, designs containing functionality that is mapped onto the heterogeneous IP blocks of modern FPGAs (e.g. memories, DSPs, CPUs, etc.).
Even though the synthesis output from Quartus is in BLIF (Berkeley Logic Interchange Format) [7] with IP blocks, the resulting netlist is not logically equivalent to the original RTL description, since any IP block of the design is translated into a black-box (BB) instantiation. More specifically, whenever the functionality of an IP block cannot be mapped onto LUTs and F/Fs, the block is replaced with a BB. A BB provides the same number of input/output pins as the IP block it replaces (in order to enable transparent signal propagation). BBs are incorporated inside BLIF files so that academic frameworks can handle state-of-the-art designs (which often contain many heterogeneous blocks).

Since the BLIF format has no built-in support for these heterogeneous blocks, several serious limitations need to be alleviated so that the application's functionality is not disturbed during synthesis and technology mapping. For instance, assuming a design with an 8,192 × 8 bit RAM block, existing synthesis and technology mapping tools will produce a BLIF file that contains 8,192 unique BBs (the synthesis output for a memory block is reported at word level). The limitations of such an approach are summarized as follows:

- The application's functionality is altered, since the BBs (both their total number and their connections to the remaining BBs/CLBs) do not correspond to the application's initial RTL description.
- For a given design, all BBs are marked with a single keyword (.blackbox) regardless of their actual functionality. This implies that each design can employ at most one type of BB. However, existing applications contain numerous BBs, each of which has its own characteristics (e.g. size, throughput, power/energy consumption, etc.).
- The increased number of BBs leads to delay, power and area penalties due to the additional routing infrastructure needed for signal communication.

3 PROPOSED FRAMEWORK

This section describes the proposed framework, named NAROUTO [9], which is depicted in Figure 1. The framework allows architecture-level exploration of FPGAs with memory blocks, in terms of delay, area and power/energy consumption. To support the NAROUTO framework in software, a number of new open-source CAD tools have been developed. Due to lack of space, it is not possible to give details about the algorithms employed at each step of the NAROUTO framework; more information can be found in [9].

3.1 Synthesis and Technology Mapping

The first step of the NAROUTO framework deals with the application's synthesis and technology mapping. These tasks are supported by the Quartus tool [6], and the output is a hierarchical netlist in BLIF format [7]. Such a format is a prerequisite not only for the remaining tools of the NAROUTO framework, but also for the majority of academic tools targeting FPGAs. To extract the technology-mapped netlist, in which the heterogeneous IP blocks (e.g. memories, DSPs, etc.) are replaced with BBs, the following macro is employed in the Quartus tool:

    set_global_assignment -name INI_VARS "no_add_ops=on; dump_blif_after_lut_map=on"

3.2 Generation of input files for power estimation

The next step deals with the generation of activity files for the estimation of power/energy consumption. Since the existing version of the ACE 2.0 tool [5] cannot handle BLIF netlists with BBs, we have developed a preprocessing step that enables the calculation of static probabilities and transition densities, from primary inputs to primary outputs, for all nets of a design with BBs.
The new tool, named Hb_for_ACE, first annotates the application's netlist by removing all BB instantiations from the BLIF file, and then connects the BB input and output pins to the BLIF's primary outputs and primary inputs, respectively. After applying Hb_for_ACE, the resulting design does not contain any BBs, and hence the ACE 2.0 tool can be employed. Regarding the calculation of power/energy consumption for BBs, we assume that the BBs are connected through nets with static probability 0.5 and transition density 0.2 (unless different values are given by the designer).

3.3 Technology mapping onto target FPGA

Next, the netlist in BLIF format (with BBs), as retrieved from technology mapping, is mapped onto the target FPGA. For this purpose we use the HBT-VPACK tool [4] in conjunction with a new set of tools that provide efficient handling of BBs. More specifically, the new tools focus on alleviating the limitation of the Quartus tool, which splits a single heterogeneous block into multiple (partial) BBs. The following subsections describe the main features of the newly developed tools in more detail.

Figure 1: The proposed NAROUTO Framework

BlackBox_Profiler

The BlackBox_Profiler identifies the number of individual BBs incorporated in a design, as well as the specifications of each of them (e.g. functionality, size, number of pins, etc.). This task is accomplished by finding all the partial BBs that belong to the same IP block. This is feasible since all the partial BBs of an IP block have the same signals for control and communication with the rest of the FPGA components (e.g. the read/write enable inputs of a RAM). After identifying the instantiations of the different BBs, the specifications of each of them are retrieved from a technology library (e.g. based on datasheets).

BlackBox_Packing

The output of BlackBox_Profiler gives guidelines on how to cluster all the partial BBs belonging to the same IP block into a single BB. This task, referred to as Single-Packing or SP, is supported in the NAROUTO framework by the BlackBox_Packing tool. To further improve the flexibility of the proposed framework, BlackBox_Profiler supports one more level of packing (referred to as Full-Packed or FP). The goal of this additional packing is to cluster all the BBs of the same type into a larger super-BB. For instance, assume that a design requires 16 1Kbyte RAM blocks. The BLIF netlist output by Quartus will report that the design has 16,000 BBs, each of which corresponds to 1 byte. After applying SP with NAROUTO, the resulting netlist incorporates 16 BBs, each of which corresponds to 1 Kbyte. With the second level of packing (FP), the netlist will contain only 1 BB with a size of 16 Kbytes. Hence, by using the NAROUTO framework, it is possible to evaluate different memory hierarchies.

Pin_Multiplexing

Apart from the limitation of the Quartus tool that it generates an excessively high number of partial BBs per IP block, each of these BBs has many more I/O pins than actually exist in the IP block. This forces the target FPGA to incorporate wider channels, which in turn leads to higher delay, power and area overheads. To overcome this limitation, we developed a new tool, named Pin_Multiplexing, which aims to reduce the number of input/output (I/O) pins of BBs. During this task we do not merge multiple signals into one, since this would undermine the structural and functional integrity of the final netlist.
Instead, the reduction of pins is based on implementing a set of multiplexers in CLBs. More specifically, the input signals of a BB first pass through multiplexing CLBs, and the multiplexed signals are the actual inputs of the BB. Similarly, the output signals of a BB are multiplexed and pass through de-multiplexing CLBs before connecting to the rest of the netlist. More information regarding this multiplexing/de-multiplexing strategy can be found in [9]. Depending on the design specifications, the inputs/outputs of a BB can be multiplexed recursively, in order to further reduce the number of required pins. This allows deriving BBs with the same number of I/O pins as the corresponding IP cores (these values were already extracted from the component library during the BlackBox_Profiler task).

3.4 Placement and Routing

The last step of the proposed framework deals with the application's P&R. After that, delay, power/energy and area metrics are extracted in order to evaluate the design implementation. This task is accomplished by a new tool, named HBVPR, which is based on the VPR [4] and Powermodel [5], [8] tools.
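Looking back at the packing step of Section 3.3, the two packing levels (SP and FP) can be illustrated with a small sketch. This is not the NAROUTO implementation: the PartialBB record, its field names and the grouping key (the set of shared control nets) are assumptions made for illustration, following the observation that all partial BBs of one IP block share the same control signals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialBB:
    """Hypothetical record for one (partial) black box in a netlist."""
    kind: str            # e.g. "ram"
    control: frozenset   # control nets shared by all parts of one IP block
    bits: int            # storage covered by this black box


def single_pack(parts):
    """SP: merge partial BBs with the same kind and control nets into one BB."""
    groups = defaultdict(int)
    for p in parts:
        groups[(p.kind, p.control)] += p.bits
    return [PartialBB(kind, ctrl, bits) for (kind, ctrl), bits in groups.items()]


def full_pack(bbs):
    """FP: merge all BBs of the same kind into one super-BB per kind."""
    totals = defaultdict(int)
    controls = defaultdict(set)
    for b in bbs:
        totals[b.kind] += b.bits
        controls[b.kind] |= b.control
    return [PartialBB(k, frozenset(controls[k]), totals[k]) for k in totals]


# Two hypothetical RAMs, each split by synthesis into four 8-bit partial BBs.
parts = [PartialBB("ram", frozenset({f"we{r}", f"clk{r}"}), 8)
         for r in range(2) for _ in range(4)]
sp = single_pack(parts)   # one BB per RAM
fp = full_pack(sp)        # one super-BB for all RAMs
```

In this toy input, eight partial BBs collapse to two BBs after SP and to a single super-BB after FP, mirroring the 16,000 to 16 to 1 progression of the example in Section 3.3.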

As part of HBVPR, we have developed an additional tool that automatically generates the XML descriptions of target FPGA architectures. This tool allows the generation of architectural templates for FPGAs with many types of BBs, each of which may have different properties (e.g. number of pins, functionality, size, etc.).

4 EXPERIMENTAL RESULTS

This section provides a number of qualitative and quantitative comparisons between the proposed framework (NAROUTO) and two available approaches found in the relevant literature [1, 11], using a number of DSP applications from [10]. Table 1 gives a qualitative comparison of the frameworks discussed throughout this paper. Based on this table we can claim that NAROUTO supports designs with BBs more efficiently, while the power/energy estimation feature is an addition over the existing frameworks.

Table 1: Qualitative comparison in supported features

Feature                            NAROUTO     [6], [12]
Different types of BBs             Unlimited   1
Realistic number of BBs            Yes         No
Realistic number of I/Os per BBs   Yes         No
Power estimation                   Yes         No
Part of complete framework         Yes         Yes
Open source                        Yes         Yes

The target FPGA used in this paper incorporates a cluster (CLB) size of 10, 4-input LUTs and 22/10 inputs/outputs per CLB. The FPGA array size, as well as the routing channel width, is the minimum for which each application is routable. Table 2 summarizes some statistics about the application mapping onto such an FPGA device with the NAROUTO framework.

Table 2: Employed benchmark suite from [11]

Columns: Benchmark, Functionality, 4LUT, F/Fs, RAM bits, I/Os
oc_aes_core_inv   Encryption   5, ,
oc_ata_ocidec3    Processor    1,
oc_hdlc           Processor    ,
oc_minirisc       Processor    ,
oc_oc8051         Processor    4, ,
os_blowfish       Encryption   5, ,
Average:                       3, ,

Next, we discuss three possible floor-plans for the memory blocks. These floor-plans, depicted in Figure 2, correspond to FPGA architectures where the memories are assigned to the borders of the device (Figure 2(a)), to the center (Figure 2(b)), and to a scenario where memories are uniformly distributed over the FPGA (Figure 2(c)). The three alternative floor-plans are denoted as Border, Center and Uniform, respectively. Note that different floor-plans result in different performance for application mapping, since each of them imposes a different placement and routing. In order to quantify these floor-plans, we place and route a number of applications onto FPGA devices where memory blocks are assigned according to Figure 2.
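The three candidate floor-plans can be thought of as a choice of which columns of the FPGA array hold memory blocks. The sketch below illustrates that idea only; the column-based placement, the zero-based indexing and the function and parameter names are assumptions, since the paper does not specify how NAROUTO encodes a floor-plan.

```python
def memory_columns(width, n_mem, plan):
    """Return the 0-based column indices assigned to memory blocks.

    width: number of columns in the FPGA array (illustrative model)
    n_mem: number of columns reserved for memories
    plan:  "border", "center" or "uniform"
    """
    if not 0 < n_mem <= width:
        raise ValueError("n_mem must be between 1 and width")
    if plan == "border":
        # Split the memory columns between the left and right edges.
        left = (n_mem + 1) // 2
        right = n_mem - left
        return list(range(left)) + list(range(width - right, width))
    if plan == "center":
        # One contiguous group in the middle of the array.
        start = (width - n_mem) // 2
        return list(range(start, start + n_mem))
    if plan == "uniform":
        # Column i sits at the middle of the i-th of n_mem equal slices.
        return [(2 * i + 1) * width // (2 * n_mem) for i in range(n_mem)]
    raise ValueError(f"unknown floor-plan: {plan}")
```

For a hypothetical array of 12 columns with 4 memory columns, the three plans select columns 0, 1, 10, 11 (Border), 4 to 7 (Center) and 1, 4, 7, 10 (Uniform). Any other floor-plan could be described the same way by supplying a different list of columns.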


Table 4 summarizes the performance and power metrics of the employed benchmark suite under the three candidate floor-plans for memory blocks. As this table shows, whenever memories are uniformly distributed over the FPGA (Figure 2(c)), applications are mapped at higher operating frequencies (smaller delay), but this selection also imposes the highest power consumption. On the other hand, if we aim to design a power-aware FPGA architecture, memories should be floor-planned at the center of the device (Figure 2(b)), since this selection leads to lower power dissipation (with an almost negligible penalty in performance).

Figure 2: Different floor-plans for memory blocks: (a) placed in borders, (b) placed in center, and (c) uniformly distributed.

Table 4: Exploration results for topology selection of memory blocks

Columns: Benchmark; Operation Frequency (MHz) for Border, Center, Uniform; Power Consumption (mWatt) for Border, Center, Uniform
Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average

Since target FPGA devices have to meet both timing and power constraints, the most suitable memory floor-plan is selected under both criteria. For this purpose, Table 5 gives the Energy Delay Product (EDP) for a number of applications. Based on these results, it is evident that assigning memory blocks to the center of the FPGA leads to the minimum EDP value. More specifically, the reduction in EDP is up to 33% compared to the floor-plan where memories are uniformly distributed over the device. Hence, for the rest of this paper, this organization of memory blocks is assumed. Note that, apart from these three candidate floor-plans, any other floor-plan can also be explored with the NAROUTO framework.

Table 5: Exploration results for topology selection of memory blocks

Columns: Benchmark; Energy Delay Product (×10^-6) for Border, Center, Uniform
Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average, Ratio

Table 6 depicts the required number of BBs for different designs. The second column corresponds to the number of BBs as retrieved from Quartus synthesis (using the existing approaches [6, 12]), while the third and fifth columns give the corresponding values after SP and FP, respectively. For some designs, the BlackBox_Profiler clusters all the BBs into one BB during the first level of packing (SP); hence, there is no further reduction during FP. The fourth column depicts the estimated RAM bits for each BB after SP. The corresponding value after FP for a given design is obtained by summing all the partial values (shown in the fourth column).

Table 6: Number and size of BBs before and after packing

Columns: Benchmark; Existing [6], [12]: # of BBs; SP: # of BBs, Size of BBs; FP: # of BBs
oc_aes_core_inv   ,176 1
oc_ata_ocidec3
oc_hdlc           ,024 1
oc_minirisc       ,024 1
oc_oc8051
os_blowfish       ,434 1
Average:          ,282 1

Table 6 supports our claim that neither the Quartus synthesizer [6] nor the existing frameworks [1, 11] can handle designs with BBs efficiently. More specifically, this is the first time that a publicly available framework supports realistic application mapping onto heterogeneous FPGA architectures, by clustering BBs with the same functionality and type (e.g. memories with different sizes) into a super-BB (e.g. BlockRam). Based on the experimental results, existing approaches require an average of 68.5 BBs per application, while the proposed one (after SP) requires only 5 BBs. Note that the additional partial BBs used in [1, 11] introduce constraints during P&R, which in turn result in delay, power/energy and area overheads.

Table 7 gives the total number of I/O pins for all the BBs of each design, before and after pin multiplexing, as derived from SP. Based on the results we can conclude that before multiplexing (which corresponds to the solution retrieved from [1, 11]) there is an average demand for 162 I/O pins for BBs, while after SP the pin requirement is reduced to 30.5 (a reduction of about 80% in the number of pins).

Such an unrealistic demand for pins posed by [1, 11] introduces, among other things, constraints that do not allow some of the benchmarks to be mapped onto heterogeneous FPGAs. These constraints are mainly tied to the routing channel width, which in many cases (especially when the number of I/O pins of the BBs is extremely high) exceeded the maximum value the design tools could manage.
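As a quick check of the averages quoted above, the relative reductions can be recomputed directly from the reported figures (68.5 to 5 BBs, and 162 to 30.5 I/O pins). The helper name is hypothetical; only the four numbers come from the text.

```python
def reduction_percent(before, after):
    """Relative reduction of `after` with respect to `before`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Average number of BBs per application: existing approaches vs. after SP.
bb_reduction = reduction_percent(68.5, 5)
# Average I/O pins of all BBs: before vs. after pin multiplexing.
pin_reduction = reduction_percent(162, 30.5)

print(f"BBs:  {bb_reduction:.1f}%")
print(f"pins: {pin_reduction:.1f}%")
```

The pin figure works out to roughly 81%, which is consistent with the "about 80%" stated in the text.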


In order to quantify the gains of the proposed framework in terms of delay and power consumption, Figures 3 and 4 plot these metrics for different applications. For each design in these graphs we provide three solutions, namely (i) Initial [1, 11], (ii) SP (Single Packed), and (iii) FP (Full Packed).

Table 7: Total number of I/O pins for BBs before and after SP.

Columns: Benchmark; Total pins of all BBs: Before multiplexing, After multiplexing
Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average

Based on the results, we can conclude that SP and FP lead to an average delay reduction of about 46% compared to the existing frameworks [1, 11]. Similarly, regarding power consumption, the proposed solutions (SP and FP) achieve an average reduction of about 82%. As already mentioned, these gains are due to the better handling of BBs inside designs. Since the designs retrieved with the NAROUTO framework incorporate fewer BBs, and hence fewer I/O pins around each of them, the proposed framework leads to smaller FPGA devices which, among other things, have fewer tracks per routing channel.

Figure 3: Delay evaluation for alternative application implementations.

5 CONCLUSIONS

A novel methodology, together with its supporting tool framework, for enabling rapid memory exploration in FPGA devices was proposed. The framework handles designs with IP cores more efficiently than existing solutions, and it is the first tool that also provides measurements of power/energy consumption. Experimental results showed average gains in delay and power consumption of about 46% and 82%, respectively, compared to relevant solutions, whereas different memory floor-plans lead to an EDP reduction of up to 33%.

Figure 4: Power consumption for alternative application implementations.

REFERENCES

[1] J. Pistorius, et al., "Benchmarking method and designs targeting logic synthesis for FPGAs", Proc. IWLS, pp ,
[2] M. Gao, J.H. Jiang, Y. Jiang, Y. Li, S. Sinha, and R. Brayton, "MVSIS", International Workshop on Logic Synthesis,
[3] V. Betz and J. Rose, "VPR: A New Packing, Placement and Routing Tool for FPGA Research", Int. Workshop on Field-Programmable Logic and Applications, 1997, pp
[4] L. Jason, et al., "VPR 5.0: FPGA cad and architecture exploration tools with single-driver routing, heterogeneity and process scaling", Int. Symp. on FPGA, pp ,
[5] K. Poon, et al., "A detailed power model for field-programmable gate arrays", ACM Trans. on TODAES, Vol.10 No.2, pp , April
[6] Altera Corporation, Quartus II Software.
[7] Berkeley Logic Interchange Format (BLIF), University of California, Berkeley,
[8] P. Jamieson, et al., "An Energy and Power Consumption Analysis of FPGA Routing Architectures", Field-Programmable Technology, pp ,
[9] C. Sidiropoulos, "Development of a design framework for Power/Energy consumption estimation in heterogeneous FPGA architectures", Master thesis, NTUA, Greece, 2010 (available at
[10] Altera Corporation, Quartus-II University Interface Program.
[11] S. Dai and E. Bozorgzadeh, "CAD Tool for FPGAs with Embedded Hard Cores for Design Space Exploration of Future Architectures", 14th Symp. FCCM,
