Journal of Systems Architecture


Journal of Systems Architecture 59 (2013) 78-90


On supporting rapid exploration of memory hierarchies onto FPGAs

Harry Sidiropoulos, Kostas Siozios, Dimitrios Soudris
9 Heroon Polytechneiou, Zographou Campus, Athens, Greece
Corresponding author: K. Siozios (e-mail address: ksiop@microlab.ntua.gr)

Article history: Available online 21 November 2012
Keywords: Heterogeneous FPGA; CAD tool; Exploration framework

Abstract

This paper introduces a novel methodology for enabling fast yet accurate exploration of memory organizations onto FPGA devices. The proposed methodology is supported in software by a new open-source tool framework, named NAROUTO. This framework is the only publicly available solution for performing architecture-level exploration, as well as application mapping, onto FPGA devices with different memory organizations under a variety of design criteria (e.g. delay improvement, power optimization, area savings, etc.). Experimental results with a number of industry-oriented kernels demonstrate the efficiency of the proposed solution compared to similar approaches: it provides better manipulation of memory blocks, leading to architectures with higher performance in terms of area, power and delay.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, reconfigurable architectures, and more specifically Field Programmable Gate Arrays (FPGAs), have become efficient alternatives to Application Specific Integrated Circuits (ASICs). The characteristics and capabilities of these architectures have changed and improved significantly over the last two decades, from arrays of Look-Up Tables (LUTs) to heterogeneous devices that integrate a number of hardware components (e.g. LUTs with different sizes, microprocessors, DSP modules, RAM blocks, etc.).
In other words, the logic fabric of an FPGA has changed gradually from a homogeneous and regular architecture to a heterogeneous (or piece-wise homogeneous) device. Previous studies [12-14] show that one of the most important tasks in designing an efficient FPGA device is architecture-level exploration. Among other things, this task determines the number and the organization (i.e. floor-plan) of the device components, as well as their parameters (e.g. look-up table size, channel width, array size, etc.). Note that sufficient and accurate architecture-level exploration has become far more important nowadays, due to the increased complexity posed by the heterogeneous IP blocks found in FPGA platforms.

In order to accomplish this task, a number of methodologies and Computer-Aided Design (CAD) tools have been proposed. These solutions involve, among others, synthesis and technology mapping [1,2], placement and routing (P&R) [3,13], as well as power and energy estimation [6] techniques. The development of new tools targeting the reconfigurable domain is tackled by both academia and industry. More specifically, tools developed in academia have mainly focused on architecture-level exploration for homogeneous FPGAs (i.e. devices consisting solely of configurable logic blocks (CLBs)). Even though these solutions are sufficient for evaluating new CAD algorithms, they cannot handle the additional Intellectual Property (IP) blocks (e.g. memories, DSPs, embedded CPUs, etc.) found in reconfigurable architectures. On the other hand, commercial frameworks support FPGA devices with numerous heterogeneous IP blocks, but they allow only a small degree of architecture-level exploration.

Recently, two frameworks, one from academia and the other from industry, were released that provide some flexibility in performing architecture-level exploration for heterogeneous FPGAs.
These frameworks are based on a commercial synthesizer, Altera's Quartus [7], while the P&R step is performed with algorithms found in the VPR tool [3]. Even though the combination of these two solutions can potentially alleviate the limited support for heterogeneity, the derived results lack accuracy. In addition, the application's implementation cannot be evaluated in terms of power and energy dissipation. Since FPGAs are usually power-limited devices [4,5,15], this is a crucial drawback when scoring the efficiency of the retrieved architectural solutions.

In this paper we propose a new framework for supporting the tasks of architecture-level exploration and application mapping onto heterogeneous FPGAs. The proposed framework, named NAROUTO, is based on a number of open-source tools. This flow is publicly available for downloading, extending and improving [8], in order to support more advanced heterogeneous blocks (e.g. CPUs) [14,21]. The contributions of this work, as compared to prior publications, are summarized as follows:

- Introduction of a novel software-supported methodology for enabling rapid architecture-level exploration for heterogeneous FPGAs that consist of different memory organizations and/or hierarchies.

- Development of a new tool framework that enables application mapping onto these heterogeneous FPGAs. Unlike similar frameworks that support only one type of heterogeneous block (e.g. memory of a given size), our solution exhibits additional flexibility, enabling among others the simultaneous handling of heterogeneous blocks with different types and/or properties.
- Apart from the delay metric, the evaluation of application implementations onto a target heterogeneous FPGA can also be performed in terms of power and energy dissipation (both static and dynamic).

The rest of the paper is organized as follows: Section 2 highlights the main limitations found in similar approaches targeting architecture-level exploration, whereas Section 3 gives an overview of the employed heterogeneous FPGA. The proposed methodology and the supporting tool framework are described in Sections 4 and 5, respectively. Section 6 provides a number of qualitative and quantitative comparisons that demonstrate the efficiency of the proposed solution compared to the state-of-the-art approach. Finally, conclusions are summarized in Section 7.

2. Motivation example

A common limitation of existing software frameworks that perform architecture-level exploration is that none of them can handle macro-blocks; they handle only logic resources (slices) and the interconnect fabric. On the other hand, commercial tools are not easily adapted to evaluate reconfigurable architectures that differ from the actually fabricated devices. Additionally, since the academic solutions are based exclusively on academic tools, they are usually evaluated with synthetic benchmarks (as the available academic synthesizers are able to tackle only designs of reduced complexity). Hence, there is a lack of software tools that are able to perform fast and accurate evaluation of different architectural selections.
This section highlights the main limitations found in existing tools for supporting architecture-level exploration, as well as application mapping, onto FPGAs consisting of heterogeneous blocks. Starting from an application's description in VHDL or Verilog, we first perform synthesis with the Altera Quartus framework [7], and the output is reported in BLIF (Berkeley Logic Interchange Format) [9]. This format corresponds to a gate-level netlist with basic primitives for inputs, outputs, logic gates, flip-flops, etc. Even though BLIF is widely accepted among academic tools, it is rather restrictive, as it is unable to express heterogeneous components such as RAM blocks, DSP blocks (e.g. multipliers), processors, etc. Furthermore, it cannot express arithmetic carry chains without converting them to gates. In place of these components, the BLIF netlist uses BlackBoxes (BBs) to enable transparent signal propagation. However, since BBs do not have any meaningful functionality, the derived netlist lacks accuracy. Additionally, as we will show later, existing tools handle designs with BBs in a non-optimal way. Next, we summarize the main drawbacks of existing (academic/commercial) software solutions:

- The application's functionality described in the BLIF netlist differs from the application's RTL description, since the BBs do not provide any functionality.
- For a given design, all the BBs are marked with the same keyword (.blackbox), regardless of their actual functionality. This implies that each design can employ only one type of BB (e.g. only memory, DSP, or embedded CPU). Additionally, all these BBs are assumed to have the same properties (e.g. size, throughput, power/energy consumption, etc.), regardless of their usage.
- In case the design incorporates BlockRAMs, existing tools (Quartus and VPR-5.0) assume an excessive number of distinct BBs, each of which corresponds to a few memory words that are part of a whole memory block. This overestimation of the number of BBs results in considerable delay, power and area overheads, due to the additional routing resources needed for signal communication.
- Finally, existing approaches cannot support the evaluation of architectural selections based on different memory organizations and/or hierarchies.

3. Target architecture

Our target architecture is a generic FPGA device similar to recent FPGAs from the Altera (Stratix) [10] and Xilinx (Virtex) [11] families, consisting of logic resources, memory blocks, special-purpose components (e.g. embedded processors, DSP blocks, etc.) and input/output pads. The glue logic of our FPGA device is organized into an array of slices, while the communication among hardware blocks is provided through a hierarchical interconnection network of fast and versatile routing resources. By the term slice we refer to the CLB, the up and right routing segments, as well as the corresponding switch box. At the next level of hierarchy, each CLB is formed by a number of Basic Logic Elements (BLEs), while each BLE is formed by a Look-Up Table (LUT), a flip-flop, a number of multiplexers (at inputs and outputs), as well as the wires required for local connectivity. Such an architectural arrangement allows local interconnects between BLEs to be optimized [13]. Fig. 1 depicts a template of the employed architecture with embedded RAM and DSP blocks [12].

The architecture parameters for CLBs mentioned above differ among vendors and FPGA families, since their values affect the device performance and power/energy consumption. For instance, the Altera Stratix FPGAs group 10 BLEs in order to form a Logic Array Block (LAB) [10].
Similarly, in the Xilinx Virtex-II Pro devices, 2 LUTs are contained in a BLE, while 4 BLEs are joined to form a slice [11].

Apart from the logic and routing infrastructure, our FPGA architecture also incorporates a number of heterogeneous blocks. Throughout this paper, we employ this feature in order to study the impact of different memory hierarchies. More specifically, two different approaches, depicted in Figs. 2 and 3, are evaluated with our software-supported framework. These memory hierarchies are summarized as follows:

- Scenario 1, depicted schematically in Fig. 2, concerns the shared memory architecture. Typically, this memory organization assumes a large block of RAM which is accessible by several different CLBs. Even though application mapping onto a device that provides such a memory hierarchy is a relatively easy task, a number of limitations might arise when multiple CLBs need fast access to memory. Additionally, an architecture with shared memory does not scale very well.
- Scenario 2 concerns the shared-distributed memory architecture. This approach, depicted schematically in Fig. 3, incorporates, apart from a number of shared memories (as discussed previously), a mechanism that gives each CLB direct access to a private (dedicated) memory. The key advantage of shared-distributed memory is the unified address space in which all data can be found. Additionally, this memory hierarchy scales more easily with an application's requirements.
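The difference between the two scenarios can be made concrete with a small data model. The following is only a minimal sketch, not part of the NAROUTO framework: the class names, the CLB names and the memory sizes are hypothetical and chosen purely for illustration. A memory with an empty owner set is treated as shared by every CLB.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    name: str
    size_bytes: int
    owners: frozenset = frozenset()  # empty set = shared by every CLB


@dataclass
class Hierarchy:
    clbs: list
    memories: list = field(default_factory=list)

    def reachable(self, clb):
        """Names of the memories a given CLB can access directly."""
        return [m.name for m in self.memories
                if not m.owners or clb in m.owners]


def shared(clbs, size):
    # Scenario 1: one large RAM block shared by all CLBs.
    return Hierarchy(clbs, [Memory("shared0", size)])


def shared_distributed(clbs, shared_size, private_size):
    # Scenario 2: a shared RAM plus one private memory per CLB.
    mems = [Memory("shared0", shared_size)]
    mems += [Memory(f"private_{c}", private_size, frozenset([c]))
             for c in clbs]
    return Hierarchy(clbs, mems)


clbs = ["clb0", "clb1", "clb2"]          # hypothetical CLB names
s1 = shared(clbs, 16 * 1024)             # illustrative sizes only
s2 = shared_distributed(clbs, 16 * 1024, 1024)
print(s1.reachable("clb1"))  # ['shared0']
print(s2.reachable("clb1"))  # ['shared0', 'private_clb1']
```

In this model, every access in scenario 1 goes through the single shared block, whereas in scenario 2 each CLB additionally reaches its own private memory, which is the property the text associates with better scaling.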

Fig. 1. Template of the employed heterogeneous FPGA device.

Fig. 2. An instantiation of the shared memory architecture (denoted as scenario 1).

Fig. 3. An instantiation of the shared-distributed memory architecture (denoted as scenario 2).

For both data-memory architecture models, a shared background (usually off-chip) memory module is assumed. Throughout this paper, we do not study issues related to how data are mapped onto these memories, since this task is tackled by the synthesis and technology mapping tools. Also, for both hierarchies we assume that shared memories may be simultaneously accessed by multiple CLBs. In order to physically implement these hierarchies, a number of special-purpose routing tracks that provide signal connectivity among memory blocks are employed. Note that the performance metrics (e.g. delay and power/energy consumption) of these dedicated routing paths are taken into consideration during the application mapping. Even though our framework can handle any memory hierarchy, provided that it is appropriately modeled, throughout this paper we study these two scenarios because they are widely accepted in the computer architecture field.

4. Proposed methodology

This section describes in detail the proposed methodology for performing architecture-level exploration for heterogeneous FPGAs. More specifically, the introduced methodology, depicted schematically in Fig. 4, studies two complementary design problems: problem (1), the architecture-level exploration that determines a number of architectural parameters affecting heterogeneous components, and problem (2), the application implementation onto these heterogeneous FPGA devices. Even though this methodology is able to handle devices consisting of different types of heterogeneous blocks, throughout this paper we evaluate its efficiency only in terms of handling architectures with alternative memory organizations. In this case, the heterogeneity concerns the properties of these memory blocks (e.g. size, delay, power/energy consumption, etc.).

Fig. 4. The proposed methodology.

As input to our methodology we use the application's description in VHDL or Verilog, which is synthesized and technology mapped, while the output is extracted in BLIF format. We have already mentioned that the BLIF format exhibits limited support for designs with heterogeneous components. Hence, in order to preserve the functionality of the design, the derived netlist has to be appropriately modified. However, before applying such a modification, it is crucial to perform application profiling in order to determine the different types of BlackBoxes (BBs) found in the design (e.g. memories with different properties), as well as the number of instantiations per BB (each of which has different properties). The profiling task becomes even more important because a single heterogeneous block is usually reported as multiple BBs by the synthesis and technology mapping tools.

Next, the architecture selection picks from the component library the appropriate instances for the BBs. During this task, the efficiency of multiple components per functionality (e.g. memories with different properties), or the organization of these components (e.g. memory hierarchies), can be evaluated. For additional accuracy, the delay, power/energy dissipation and silicon area characterization of the heterogeneous blocks found in these libraries is based on a number of well-established models [18-20]. By appropriately selecting combinations among these BBs, it is possible to perform a sufficient architecture-level exploration in terms of the number of BBs, as well as their organization. The outcome of this task is a set of Pareto curves that balance the studied criteria. Based on these curves, an architect is able to design an optimized FPGA device.

Then, the application's netlist is placed and routed (P&R) onto the selected FPGA. The output of this task provides a number of metrics (e.g. delay, power, area) that allow sufficient evaluation of the application's implementation. In case the derived solution does not meet the system specifications, there is a feedback loop for additional improvements. More specifically, if we are primarily interested in finding the optimal organization of hardware resources, or BBs, over an FPGA (referred to as Problem i), the feedback loop affects the architectural selections. During this step, different topologies and/or instantiations of BBs (e.g. memory blocks with a different organization) are selected.
On the other hand, whenever our goal is to maximize the performance metrics by enabling a more effective application implementation (Problem ii), the feedback loop goes to the P&R step.

5. The proposed NAROUTO framework

This section introduces the NAROUTO framework [14], which supports in software the proposed architecture-level exploration methodology for heterogeneous FPGA devices. This framework, depicted schematically in Fig. 5, is composed of a number of open-source CAD tools that either have been developed from scratch, or have been extensively modified to be aware of the additional functionality required for sufficient handling of designs with multiple BBs. Even though the NAROUTO framework supports devices consisting of different types of heterogeneous components, throughout this study the BBs are tuned to represent BlockRAMs. For this purpose, two candidate memory hierarchies are evaluated (depicted in Figs. 2 and 3).

5.1. Synthesis and technology mapping

The first task of the NAROUTO framework deals with application synthesis and technology mapping. Even though a number of academic tools (e.g. ABC [1], SIS [2]) could be employed, we prefer to accomplish it with a well-established commercial tool. For this purpose, the Altera Quartus tool [7] is employed, since its output (a hierarchical netlist in BLIF format) is compatible with the academic tools. Note that the BLIF format is a prerequisite for the majority of academic tools dealing with FPGAs. In order to make Quartus report its output in BLIF format, where the heterogeneous components are replaced with BBs, the following TCL command is applied:

    set_global_assignment -name INI_VARS "no_add_ops=on; dump_blif_after_lut_map=on"

A limitation of the derived output is the excessively high number of BBs found in the BLIF netlist, which does not correspond to the actual number of utilized macro blocks.
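To see how the BB count of a netlist can be inspected, the sketch below counts the instances of every black-box model in a BLIF description. It is a hedged illustration and not one of the NAROUTO tools: the function name and the tiny example netlist (including the model name ram_byte) are hypothetical, and the parser deliberately ignores BLIF comments and backslash line continuations. It relies only on the standard BLIF keywords .model, .subckt and .blackbox.

```python
from collections import Counter


def count_blackboxes(blif_text):
    """Return {black-box model name: number of .subckt instances}."""
    bb_models = set()      # models whose body contains .blackbox
    instances = Counter()  # .subckt references per model name
    current = None
    for raw in blif_text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if tokens[0] == ".model":
            current = tokens[1]
        elif tokens[0] == ".blackbox" and current is not None:
            bb_models.add(current)
        elif tokens[0] == ".subckt":
            instances[tokens[1]] += 1
    return {name: n for name, n in instances.items() if name in bb_models}


# Hypothetical netlist: one memory reported as two partial black boxes.
example = """
.model top
.inputs a b
.outputs y
.subckt ram_byte addr=a we=b q=n1
.subckt ram_byte addr=a we=b q=n2
.names n1 n2 y
11 1
.end

.model ram_byte
.inputs addr we
.outputs q
.blackbox
.end
"""
print(count_blackboxes(example))  # {'ram_byte': 2}
```

Note that both instances carry the same .blackbox marker and the same model name, so nothing in the netlist itself says which physical macro block each partial BB belongs to.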
To make matters worse, there is no distinction between BBs belonging to different heterogeneous blocks (e.g. memory contents that are stored in different BlockRAMs). Hence, the tools of the NAROUTO framework that are described in this section provide a mechanism to alleviate this limitation.

5.2. Activity estimation

The next step in our framework involves the generation of activity files for power/energy estimation. For this purpose, a number of well-established models are employed [6,18-20]. Additionally, since existing versions of the ACE tool [6] cannot support BLIF netlists with BBs, a special pre-processing step has been introduced. This step deals with the computation of static probabilities and transition densities, from primary inputs to primary outputs, for all the networks of the design that include at least one BB. The new tool, named HB_for_ACE, initially removes all the BBs from the BLIF netlist, and then connects the BB input and output pins to the BLIF's primary outputs and primary inputs, respectively. By applying this technique, it is feasible to remove all the BBs from the design description, and hence enable the ACE 2.0 tool to be applied. On the other hand, regarding networks that include at least one BB, the corresponding values of static probability and transition density are retrieved from an exhaustive simulation. Algorithm 1 provides the pseudo-code for the open-source HB_for_ACE (transform Hierarchical Blifs for ACE) tool.

Algorithm 1. Pseudo-code for HB_for_ACE tool.
    function hb_for_ace (Input_blif) {
        // Input: blif netlist with BBs
        // Output: blif netlist compatible with ACE
        BB_inputs[ ];        // Array for storing all BBs input pins
        BB_outputs[ ];       // Array for storing all BBs output pins
        primary_inputs[ ];   // Array for storing primary input pins
        primary_outputs[ ];  // Array for storing primary output pins

        // Get the primary I/O pins of the design
        primary_inputs[ ] = get_primary_inputs (Input_blif);
        primary_outputs[ ] = get_primary_outputs (Input_blif);

        // Get the blackboxes I/O pins
        BB_inputs[ ] = get_blackbox_inputs (Input_blif);
        BB_outputs[ ] = get_blackbox_outputs (Input_blif);

        // Delete any reference to blackboxes from the blif netlist
        delete_blackbox_subcircuits (Input_blif);
        delete_blackbox_models (Input_blif);

        // Connect the BBs I/Os to the design's primary O/I pins
        append (primary_inputs[ ], BB_outputs[ ]);
        append (primary_outputs[ ], BB_inputs[ ]);

        // Print the ACE compatible blif netlist
        printout_final_blif (Output_blif_filename);
    }

5.3. Technology mapping onto heterogeneous FPGAs

Having as input the application's BLIF description, which also includes information about the BBs, the next task in our methodology deals with the packing of technology-mapped cells into logic blocks (CLBs). The size of the derived clusters depends on the underlying FPGA architecture. This task is supported by a set of CAD tools that are based on T-VPack [3,13]. These tools were appropriately extended in order to be aware of multiple types of BBs, each of which might have different properties. Additionally, these tools alleviate the limitation of the Quartus synthesizer in effectively handling netlists with BBs. The following subsections describe in more detail the tools developed to support technology mapping onto heterogeneous FPGAs.

Fig. 5. The proposed NAROUTO framework.

5.3.1. BlackBox profiler

The BlackBox_Profiler parses the application description in order to identify the different types of BBs, as well as how many instances of each of them are utilized for the application implementation. Part of this procedure also deals with the appropriate modeling of these BBs, so that they better match the specifications of the heterogeneous components they actually replace. Typical examples of these specifications are the functionality of a heterogeneous component (e.g. memory, DSP, etc.), its size, as well as its number of I/O pins. In order to retrieve these properties, we parse the application netlist to identify all the partial BBs that belong to a single macro block. This is feasible because all these partial BBs use the same signals (e.g. the read/write enable inputs of a RAM) for control and communication with the rest of the FPGA components. Then, the specifications for each BB are retrieved from the corresponding technology library, as discussed in Section 4. These values are employed later for evaluating the application in terms of delay, power/energy dissipation, and area metrics. Algorithm 2 depicts the pseudo-code for BB-aware profiling.

Algorithm 2. Pseudo-code for Blackbox_Profiler.
    function blackbox_aware_technology_mapping {
        struct Blackbox {
            blackbox_name;
            blackbox_inputs[];
            blackbox_outputs[];
        };
        struct Type {
            blackbox_name;
            blackbox_inputs[];
            blackbox_outputs[];
            instances_num;
            blackbox_func;
        };
        struct Type blackbox_types[];
        struct Blackbox blackboxes[];

        // Find BBs utilized into the design
        blackboxes[] = get_blackboxes_instances ();
        blackboxes_array_size = get_size (blackboxes[]);
        blackbox_types_array_size = 0;

        for (i = 0; i < blackboxes_array_size; i++) {
            // Assume a new BB type until a matching one is found
            new_type_flag = 1;
            for (j = 0; j < blackbox_types_array_size; j++) {
                // Search all known BB types by comparing control signals
                if (control_pins_match (blackboxes[i], blackbox_types[j])) {
                    blackbox_types[j].instances_num++;
                    new_type_flag = 0;
                    break;
                }
            }
            if (new_type_flag == 1) {
                // Create a new instance for this BB type
                struct Type new;
                new.blackbox_name = blackboxes[i].blackbox_name;
                new.blackbox_inputs = blackboxes[i].blackbox_inputs;
                new.blackbox_outputs = blackboxes[i].blackbox_outputs;
                new.instances_num = 1;
                add_element_to_array (new, blackbox_types[]);
                blackbox_types_array_size++;
            }
        }

        // Find properties for this BB from a technology library
        for (i = 0; i < blackbox_types_array_size; i++) {
            blackbox_types[i].blackbox_func = get_info_from_tech_lib ();
        }
    }

5.3.2. BlackBox packing

The output of the BlackBox_Profiler gives a number of guidelines regarding how to collapse all the partial BBs that belong to the same macro block into a single BB. This task, referred to as Single-Packing or SP, is supported in the NAROUTO framework by the BlackBox_Packing tool. Additionally, the introduced framework supports one more level of packing, referred to as Full-Packing or FP. The goal of this additional packing is to collapse recursively all the BBs of the same type into a larger super-BB. For instance, assume that the memory requirement of a given application is 16 RAM blocks of 1 kbyte each. The BLIF netlist, as it is retrieved from Quartus, reports that the design contains 16,384 (16 × 1024) BBs, each of which actually corresponds to one byte.
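The grouping criterion used by the profiler, and its effect on the BB count for the 16 × 1 kbyte example, can be sketched as follows. This is an illustrative model only: the function names, the control-signal names (we_N, re_N) and the representation of a partial BB as a (control signals, size) pair are assumptions, not the data structures of the BlackBox_Profiler or BlackBox_Packing tools.

```python
from collections import defaultdict


def single_packing(partial_bbs):
    """SP: merge partial BBs that share the same control signals."""
    groups = defaultdict(int)
    for control_signals, size_bytes in partial_bbs:
        groups[control_signals] += size_bytes
    return dict(groups)


def full_packing(packed):
    """FP: collapse all packed BBs of one type into one super-BB."""
    return sum(packed.values())


# 16 RAM blocks of 1 kbyte, each reported as 1024 one-byte partial BBs.
# The control-signal names are hypothetical.
partial = [((f"we_{ram}", f"re_{ram}"), 1)
           for ram in range(16) for _ in range(1024)]

sp = single_packing(partial)
fp = full_packing(sp)
print(len(partial), len(sp), fp)  # 16384 16 16384
```

Grouping by shared control signals recovers one packed BB of 1024 bytes per RAM block, and the second level merges those into a single 16,384-byte block.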
After applying SP, the resulting netlist has 16 BBs, each of which represents 1 kbyte, whereas with the second level of packing (FP), the netlist contains only 1 super-BB with a size of 16 kbytes. Note that during SP and FP packing, we take into consideration the desired memory hierarchy (as defined by the employed architecture description file). Additional details about how this is applied in our framework can be found in Section 3. Algorithms 3 and 4 give the corresponding pseudo-code for BB packing level 1 (SP) and level 2 (FP), respectively.

Algorithm 3. Algorithm for black-box Packing level 1.

    function BB_Packing_Level_1 {
        // Stores the BB types. This info was already extracted during
        // BB profiling
        blackbox_types[];
        // Stores all the BB instances, as they were found during BB profiling
        blackboxes[];
        // Stores the new packed BBs
        packed_blackboxes[] = blackbox_types[];

        for (i = 0; i < blackboxes_array_size; i++) {
            // For each BB instance
            for (j = 0; j < blackbox_types_array_size; j++) {
                // Search all known BB types by comparing their control signals
                if (control_pins_match (blackboxes[i], blackbox_types[j])) {
                    // If the BB's type is found, then it is merged with the
                    // BB instance
                    packed_blackboxes[j] = merge (packed_blackboxes[j], blackboxes[i]);
                    break;
                }
            } // End of the BB types loop
        } // End of the BB instances loop
    }

Algorithm 4. Algorithm for black-box Packing level 2.

    function BB_Packing_Level_2 {
        // Stores the packed BBs, as they were already retrieved from BB
        // packing level 1
        packed_blackboxes[];
        // A super-block which stores the FP BB
        full_packed_blackbox;

        for (i = 0; i < packed_blackboxes_array_size; i++) {
            full_packed_blackbox = merge (full_packed_blackbox, packed_blackboxes[i]);
        } // End of the packed BB loop
    }

5.3.3. Pin multiplexing

Apart from the number of partial BBs that are retrieved after synthesis and technology mapping, we will show later that each of these BBs also exhibits an excessive requirement for I/O pins. This implies that the target FPGA needs a wider routing channel, which in turn leads to delay, power/energy dissipation, and area penalties. More specifically, based on our analysis, we found that only a subset of the I/O pins of each BB is actually required for preserving the application's functionality. Hence, the NAROUTO framework provides a mechanism that initially identifies the required pins for each BB, and then eliminates the redundant I/Os. The pseudo-code of this tool, named Pin_Multiplexing, is depicted in Algorithm 5.

Note that no signals are merged during this task, since this would undermine the structural and functional integrity of the final netlist. In contrast, the reduction of pins is based on implementing a set of multiplexers in CLBs. More specifically, the input signals of a BB initially pass through multiplexing CLBs, and the new multiplexed signals are fed as inputs to the BB. Similarly, the output signals of a BB are multiplexed and pass through de-multiplexing CLBs before connecting to the rest of the netlist. Based on the design specifications, as they are retrieved from the component library depicted in Fig. 4, the I/O pins of each BB can be recursively multiplexed many times, in order to match the number of pins found in the heterogeneous block that it actually replaces.

Algorithm 5. Algorithm for pin multiplexing.

    function pin_multiplexing {
        // Array for storing the packed BBs, as derived from the
        // packing step (SP or FP)
        sp_fp_blackboxes[];
        // Define the aggressiveness for pin multiplexing.
        // Levels 1, 2,... denote that I/Os of BBs will be
        // multiplexed once, twice, etc.
        multiplexion_level;
        // Each CLB multiplexes a number of I/O pins equal to its number
        // of inputs minus 1 (for clock input)
        clb_mux_pin_num = CLB_input_num - 1;
        // Each CLB demultiplexes a number of I/O pins equal to its number
        // of LUTs
        clb_demux_pin_num = CLB_LUT_num;

        for (i = 0; i < sp_fp_blackboxes_array_size; i++) { // For each BB
            // Temporary storage of I/Os for a BB
            input_pins[] = get_inputs (sp_fp_blackboxes[i]);
            output_pins[] = get_outputs (sp_fp_blackboxes[i]);

            for (j = 0; j < multiplexion_level; j++) {
                // Multiplex the I/O of BBs multiplexion_level times
                in_pin_num = get_length (input_pins[]);
                out_pin_num = get_length (output_pins[]);
                for (k = 0; k < in_pin_num; k += clb_mux_pin_num) {
                    // Multiplex clb_mux_pin_num pins in every
                    // multiplexing CLB
                    create_mux_clb (input_pins[k], input_pins[k + clb_mux_pin_num]);
                }
                for (k = 0; k < out_pin_num; k += clb_demux_pin_num) {
                    // Demultiplex clb_demux_pin_num pins in every
                    // demultiplexing CLB
                    create_demux_clb (output_pins[k], output_pins[k + clb_demux_pin_num]);
                }
                // I/Os are updated with the new multiplexed pins to enable
                // re-multiplexing
                input_pins[] = get_multiplexed_input_pins ();
                output_pins[] = get_multiplexed_output_pins ();
            }
        }
    }

5.3.4. Update activity

The pin multiplexing technique discussed previously introduces variations in the application's routing. These variations occur mainly because BBs have to be connected with the rest of the design through fewer I/O pins. In order to take into account the impact of pin multiplexing during power analysis, the information regarding signal activity has to be appropriately updated. Note that during this task we also take into consideration the additional networks that implement the pin multiplexing functionality, by computing the proper activity values for these additional networks.
Algorithm 6 gives the pseudo-code for computing the average static probability and transition density of the multiplexed signals.

Algorithm 6. Algorithm for Activity_Updater.

    function update_activities {
        // Identify all the I/O signals of BBs
        io_signals_of_bbs[];
        // Identify static_probability and transitional_density for each
        // BB signal
        activities_of_bbs[];

        for (i = 0; i < io_signals_of_bbs_array_size; i++) {
            // For all the multiplexed signals
            // tmp_signals[] array stores all the multiplexed signals
            tmp_signals[] = get_all_signals_multiplexed_in (io_signals_of_bbs[i]);
            // Store the static_probability and transitional_density
            // of a multiplexed signal
            tmp_prob = get_signal_probability (io_signals_of_bbs[i]);
            tmp_dens = get_signal_density (io_signals_of_bbs[i]);
            for (j = 0; j < tmp_signals_array_size; j++) {
                // Compute static_probability and transitional_density for
                // multiplexed signals
                static_probability = calculate_static_prob (tmp_prob);
                transitional_density = calculate (tmp_dens);
                // Update the signal's activity
                update_activity (tmp_signals[j], static_probability, transitional_density);
            }
        }
    }

5.4. Placement and routing

The last task of the proposed framework is application placement and routing onto the FPGA. For this purpose we employ a simulated annealing algorithm for placement and a congestion-driven Pathfinder router. Both of these algorithms are based on VPR [3,13], but they have been extensively modified in order to be aware of the inherent constraints posed by heterogeneous components. More specifically, the implementation of these algorithms in the NAROUTO framework provides techniques for efficiently handling multiple types of heterogeneous BBs, as well as for estimating power/energy dissipation (through an appropriate extension of the Powermodel tool [6]). Note that the new tool can handle heterogeneous FPGAs with embedded macro blocks other than memories, represented as new types of BBs, if they are appropriately modeled in the component library.

6. Experimental results

This section provides a number of qualitative and quantitative comparisons that demonstrate the efficiency of the introduced framework, named NAROUTO, compared to the state-of-the-art solution (the VPR-5.0 tool [3]). Note that, for the sake of completeness, application synthesis and technology mapping for both the proposed and the existing solution were performed with the Quartus toolset [7]. Table 1 gives a qualitative comparison among the introduced framework, the state-of-the-art solution, and a commercially available toolset.
This comparison is performed under a number of different criteria that span architecture-oriented (e.g. heterogeneity support), application-oriented (e.g. constrained application mapping), and implementation-oriented (e.g. complete framework) parameters. A number of conclusions can be derived from this table. The proposed framework supports designs with BBs more efficiently, whereas its power and energy estimation features are similar to those found in relevant commercial approaches. Additionally, we note that only the academic flows (i.e. NAROUTO and VPR-5.0) enable architecture-level exploration. Hence, the commercial flow tackles exclusively Problem No. 2 (see Fig. 1), whereas the proposed solution also supports Problem No. 1. Even though the first problem could be handled by VPR-5.0, its lack of power/energy support, as well as its inefficient handling of BBs, introduce a number of problems.

For evaluation purposes, the alternative toolflows are quantified with DSP applications from Altera's Quip toolkit [16]. Table 2 summarizes the main characteristics of the employed benchmark suite, whereas the complexity of these applications guarantees that the derived conclusions are valid for the majority of digital designs implemented onto FPGAs. Note that our framework does not focus on minimizing either the memory requirements or the memory accesses, since we assume that these problems were tackled during application synthesis with Altera Quartus. Regarding the glue logic of the target FPGA, each CLB consists of 10 4-input LUTs and has 22/10 input/output pins, whereas the size of the FPGA array, as well as the routing channel width, depends on the target application.
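To relate these CLB parameters to the pin multiplexing step of Section 5, the short Python sketch below works through the arithmetic for a single multiplexing level. With 22 inputs and 10 LUTs per CLB, one CLB multiplexes 22 - 1 = 21 pins (one input is reserved for the clock) and demultiplexes 10 pins. The function name clbs_for_one_level and the example pin counts are hypothetical; only the two capacity rules and the CLB parameters are taken from the text.

```python
import math


def clbs_for_one_level(in_pins, out_pins, clb_inputs=22, clb_luts=10):
    """Count the extra CLBs needed to multiplex the I/Os of one BB once.

    Capacity rules follow the pin multiplexing pseudo-code:
      - a multiplexing CLB handles (clb_inputs - 1) pins (clock excluded),
      - a demultiplexing CLB handles clb_luts pins.
    Only a single multiplexing level is modeled in this sketch.
    """
    if in_pins < 0 or out_pins < 0:
        raise ValueError("pin counts must be non-negative")
    mux_capacity = clb_inputs - 1      # 21 for the CLB used in this study
    demux_capacity = clb_luts          # 10 for the CLB used in this study
    mux_clbs = math.ceil(in_pins / mux_capacity)
    demux_clbs = math.ceil(out_pins / demux_capacity)
    return mux_clbs, demux_clbs


if __name__ == "__main__":
    # Hypothetical BB with 50 input pins and 30 output pins.
    print(clbs_for_one_level(50, 30))
```

The counts correspond to the number of iterations of the two inner loops of the pseudo-code, each of which creates one multiplexing or demultiplexing CLB.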
More specifically, the FPGA array size and the routing channel width correspond to the minimum array and the minimum channel width, respectively, that allow successful application P&R.

Table 1
Qualitative comparison in supported features.

    Feature                                  NAROUTO                        VPR-5.0 [3]   QUARTUS [7]
    Support BBs                              Yes                            Yes           Yes
    Different types of BBs                   Unlimited                      1             Unlimited
    Realistic number of BBs                  Yes                            No            No
    Realistic number of I/Os per BBs         Yes                            No            No
    Power estimation                         Yes                            No            Yes
    Constraints during application mapping   Timing-power-area trade-off    Timing        Timing-power-area trade-off
    Modular tools                            Yes                            Yes           No
    Part of complete framework               Yes                            No            Yes
    Open source                              Yes                            Yes           No

Table 2
Employed benchmark suite from [16]. Columns: Benchmark, Functionality, 4-LUT, F/Fs, RAM bits, I/Os [numeric entries not recoverable].

    Benchmark         Functionality
    oc_aes_core_inv   Encryption
    oc_ata_ocidec3    Processor
    oc_hdlc           Processor
    oc_minirisc       Processor
    oc_oc8051         Processor
    os_blowfish       Encryption

Evaluation of different memory hierarchies

Initially, we evaluate the maximum operation frequency and the power consumption of the two memory hierarchies studied throughout this paper. For this purpose, Table 3 quantifies the maximum operation frequency for the alternative memory hierarchies, referred to as Scenario 1 and Scenario 2 in Figs. 2 and 3, respectively. As a reference for this analysis, we also provide the corresponding results when using the VPR-5.0 tool [3]. Based on Table 3, we conclude that the proposed methodology leads to a notable performance enhancement compared to an implementation performed with VPR-5.0. More specifically, Scenarios 1 and 2 achieve an average performance enhancement of 1.96× and 2.07×, respectively. Apart from the performance improvement, the proposed methodology also achieves notable power savings. The results of this analysis are summarized in Table 4. Based on them, our two case studies (Scenario 1 and Scenario 2) lead to an average power reduction, compared to the reference implementation (with VPR-5.0), of 13.5% and 43.7%, respectively. These results denote that memory hierarchies lead to superior performance due to better manipulation of data transfers. Since the target FPGA should exhibit as high performance as possible, for the rest of the paper we employ an architecture where memory blocks are organized based on the hierarchy depicted in Fig. 3 (Scenario 2). Note that throughout this study we do not aim to find the optimal memory hierarchy that maximizes the performance improvement. On the contrary, our framework can quantify a number of performance metrics for a given memory hierarchy, whereas it also supports an efficient application mapping onto this device. Additional memory hierarchies and/or organizations that further improve the performance can be found in relevant references, but this goal is beyond the scope of this paper.

Table 3
Evaluation in terms of maximum operation frequency (MHz) for different memory hierarchies. Columns: Benchmark, Reference [3], Scenario 1 (Proposed), Scenario 2 (Proposed). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, ucsb_152_tap_fir, Average, Ratio [numeric entries not recoverable].

Evaluation of alternative memory floor-plans

In this subsection we study a number of different floor-plans for the memory blocks that follow the hierarchy depicted in Scenario 2. The output of this analysis defines the spatial assignment of memory blocks over the target FPGA architecture. For this purpose, we evaluate three representative floor-plans, as depicted in Fig. 6. More specifically, we study FPGAs where the memories are assigned to the borders of the device (Fig. 6(a)), to the center of the device (Fig. 6(b)), as well as a scenario where memories are uniformly distributed over the FPGA architecture (Fig. 6(c)).
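The three spatial assignments can be pictured with a small Python sketch that selects which columns of the FPGA array are reserved for memory blocks. It is only an illustration under a stated assumption: memory blocks are taken to occupy whole columns of the array, which is a modeling choice made here and not a detail given in the text. The function name memory_columns and its parameters are hypothetical.

```python
def memory_columns(width, n_mem, plan):
    """Return the 0-based column indices reserved for memory blocks.

    width : number of columns in the FPGA array
    n_mem : number of columns dedicated to memory (assumed column-based)
    plan  : "border", "center" or "uniform"
    """
    if not 0 < n_mem < width:
        raise ValueError("need 0 < n_mem < width")
    if plan == "border":
        # Split the memory columns between the left and right edges;
        # an odd leftover column goes to the left edge.
        left = (n_mem + 1) // 2
        right = n_mem - left
        return list(range(left)) + list(range(width - right, width))
    if plan == "center":
        # One contiguous group of columns in the middle of the array.
        start = (width - n_mem) // 2
        return list(range(start, start + n_mem))
    if plan == "uniform":
        # Evenly spaced columns across the array.
        return [(i + 1) * width // (n_mem + 1) for i in range(n_mem)]
    raise ValueError(f"unknown floor-plan: {plan}")


if __name__ == "__main__":
    for plan in ("border", "center", "uniform"):
        print(plan, memory_columns(12, 4, plan))
```

For a 12-column array with four memory columns, the sketch reserves columns at the two edges, in the middle, or at regular intervals, mirroring the three alternatives under evaluation.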
For the rest of the paper, these floor-plans are denoted as Border, Center and Uniform, respectively. In Fig. 6, the gray square boxes denote logic cells (CLBs), whereas the memory blocks (BBs) are depicted with different colors. Note that apart from these floor-plans, any other floor-plan can also be evaluated with the NAROUTO framework. The spatial assignment of memory blocks, as retrieved from the alternative floor-plans discussed in this subsection, results in notable wire-length variations for routing paths, and hence it is expected to strongly affect the application's delay and power dissipation. Since our device is a general-purpose FPGA, the selection of the preferable memory floor-plan is based on the minimization of the Power×Delay product (PDP). Fig. 7 plots the PDP for the studied benchmark suite, whereas Table 5 gives the average values for the three alternative solutions. Based on these results we conclude that assigning the memory blocks to the center of the FPGA leads to the minimum PDP value. More specifically, the Border and Uniform distributions of BBs exhibit, on average, 29% and 49% higher PDP than the Center floor-plan, respectively. Hence, for the rest of the paper, the Center memory floor-plan is assumed.

Evaluation of different packing techniques

This subsection evaluates the efficiency of the NAROUTO framework in handling designs with heterogeneous components. As already mentioned, Quartus synthesis and technology mapping translate these components into a single type of BBs, ignoring their functionality. The results of this analysis are summarized in Table 6. The second column gives the number of BBs found in VPR-5.0 (equal to the number of BBs retrieved from Quartus synthesis), while the third and fifth columns give the corresponding values after SP and FP, respectively. Note that for some designs SP leads to a single BB (in this case, the design uses only one BlockRAM).
Hence, during the FP there is no further reduction. Furthermore, the fourth column in Table 6 gives the size of each memory block after SP, whereas the corresponding value after FP for a given design is obtained by summing all the partial memory sizes reported at SP. Note that we cannot provide the size of BBs for the VPR-5.0 tool, because this value cannot be identified (BBs in VPR-5.0 do not correspond to actual memory components). A number of conclusions can be derived from Table 6. The number of BBs retrieved from the Quartus tool is excessively high, while it also does not represent actual macro blocks (e.g. BlockRAMs). For instance, regarding our benchmark suite, VPR-5.0 requires an average of 68.5 BBs per application, while the introduced framework reports about 5 BBs per benchmark (for the SP technique). This overestimation of heterogeneous components implies that VPR-5.0 cannot be employed for adequate architecture-level exploration and/or application mapping onto heterogeneous FPGAs.

Table 4
Evaluation in terms of application's power consumption (mW) for different memory hierarchies. Columns: Benchmark, Reference based on [3], Scenario 1 (Proposed), Scenario 2 (Proposed). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, ucsb_152_tap_fir, Average, Ratio [numeric entries not recoverable].

Apart from the increased number of BBs, the application's netlist after synthesis also incorporates an excessive number of

input/output pins. In order to evaluate the efficiency of the NAROUTO framework in handling designs with a realistic number of I/Os, Table 7 summarizes the total and the average number of I/O pins per BB. More specifically, the second and fourth columns refer to the number of I/Os retrieved from the Quartus tool, whereas the third and fifth columns give the corresponding values after the SP approach (by applying the Pin_Multiplexing tool). Note that for this study we assume, without affecting the efficiency of the proposed methodology, that only SP is applied.

Fig. 6. Alternative floor-plans for memory blocks: (a) placed in borders, (b) placed in center, and (c) uniformly distributed.

Table 5
Average PDP for different floor-plans of memory blocks.

                  Border   Center   Uniform
    Average PDP   0.57     0.44     0.66
    Ratio         1.29     1.00     1.49

Fig. 7. Power×Delay product for different floor-plans of memory blocks.

Based on the results depicted in Table 7, the pin multiplexing technique leads to designs where each BB incorporates a more

realistic number of I/O pins, as compared to existing approaches. More specifically, the existing version of VPR-5.0, which does not incorporate the pin multiplexing technique, assumes that on average each BB contains about 80 I/O pins, whereas our study found that only 16 of them actually exist (about 5× fewer I/Os per BB). A consequence of having an excessive number of BBs and I/O pins is that the wire-length needed for successful P&R increases considerably. This problem becomes far more important in the highly utilized regions of the device, where routing algorithms employ a wider routing channel in order to avoid congestion. However, such a selection introduces considerable performance degradation.

Table 6
Number and size of BBs before and after packing. Columns: Benchmark, # of BBs (Existing [3]), # of BBs (SP), Size of BBs (SP), # of BBs (FP). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average.

    Surviving entries (column alignment uncertain for all but the last row):
    os_blowfish   160 5 5 13,434 1
    Average       68.5   5.17   18,289   1.5

Table 7
Number of I/O pins for BBs before and after SP. Columns: Benchmark, Total pins for BBs (Before SP, After SP), Average pins per BBs (Before SP, After SP). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average [numeric entries not recoverable].

Table 8
Average PDP for different CMOS technologies. Columns: 45 nm, 65 nm, 90 nm, 130 nm, 180 nm. Row: PDP improvement as compared to 180 nm [numeric entries not recoverable].

Fig. 8. Evaluation in terms of delay for alternative application implementations.

Fig. 10. Power×Delay product for FPGAs under different CMOS technologies.

Figs. 8 and 9 highlight the consequences of the limited efficiency of VPR-5.0 in handling designs with BBs. More specifically, the figures give the delay and the power consumption, respectively, for the employed benchmark suite. For both figures, three alternative application implementations are studied: (i) initial (corresponding to the existing way of implementing applications with the VPR-5.0 tool), (ii) single packed (SP) and (iii) full packed (FP).
From these graphs, it is evident that both SP and FP lead to considerable delay and power savings compared to the initial solution. This improvement occurs mainly due to better manipulation of memory blocks (both fewer BBs and fewer I/Os per BB). Additionally, we note that for a number of benchmarks the initial solution (VPR-5.0) cannot provide results due to limitations (memory overflows) of the host PC (for our study we employed a quad-core machine with 8 GB of RAM).

Fig. 9. Evaluation in terms of power consumption for alternative application implementations.

Evaluation of different CMOS technologies

The last parameter in our exploration is the CMOS technology. For this purpose the target FPGA is appropriately described with a number of well-established models found in relevant references [17-19]. Fig. 10 evaluates, in terms of PDP, the application mapping when FPGA devices are modeled at 45 nm, 65 nm,

90 nm, 130 nm and 180 nm CMOS technologies, whereas Table 8 gives the average PDP values among the studied benchmarks. For demonstration purposes, the values plotted in this figure are normalized over the maximum PDP for each benchmark. A number of conclusions can be derived from this analysis. More specifically, the maximum PDP occurs when the 180 nm technology is assumed, whereas the PDP improvement is not linear with technology scaling. Additionally, the performance enhancement between alternative CMOS technologies seems to be application independent. The last conclusion is very important, since it enables the proposed NAROUTO framework to evaluate the architectural selections of the underlying FPGA device.

7. Conclusions

A novel methodology, as well as the supporting tool framework, for enabling architecture-level exploration of heterogeneous FPGAs was proposed. This framework was tuned in order to enable efficient handling of memory hierarchies onto general-purpose reconfigurable devices. Experimental results demonstrate the efficiency of the proposed solution, since we achieve notable delay, power, and area savings compared to the state-of-the-art approach. Finally, the introduced NAROUTO framework is the only software-supported approach that enables evaluation of power and energy dissipation metrics of heterogeneous FPGA devices.

References

[1] J. Pistorius, M. Hutton, A. Mishchenko, R. Brayton, Benchmarking method and designs targeting logic synthesis for FPGAs, in: Proc. of International Workshop on Logic and Synthesis (IWLS), 2007, pp.
[2] M. Gao, J.H. Jiang, Y. Jiang, Y. Li, S. Sinha, R. Brayton, MVSIS, International Workshop on Logic Synthesis.
[3] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W.M. Fang, J. Rose, VPR 5.0: FPGA CAD and Architecture Exploration Tools with Single-Driver Routing, heterogeneity and process scaling, in: Proc.
of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2009, pp.
[4] S. Sharp, Conquering the Three Challenges of Power Consumption: Why is power such an issue? Power Management, vol. 1, p. 5. August.
[5] K. Nowak, J. Meerbergen, An FPGA architecture with enhanced datapath functionality, in: Proc. of the 2003 ACM/SIGDA 11th International Symposium on Field Programmable Gate Arrays (FPGA), 2003, pp.
[6] K. Poon, S. Wilton, A. Yan, A detailed power model for field-programmable gate arrays, in: ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 10(2), April 2005, pp.
[7] Altera, Corporation, Quartus II Software.
[8] CAD tools for FPGAs. Available at: < software.html>.
[9] Berkeley Logic Interchange Format (BLIF), University of California, Berkeley.
[10] Altera Stratix Device Handbook. Available at: < literature/hb/stx/stratix_handbook.pdf>.
[11] Xilinx Virtex-II Pro Handbook. Available at: < documentation/virtex-ii_pro.htm>.
[12] S. Vassiliadis, D. Soudris, Fine and Coarse-Grain Reconfigurable Systems, Springer.
[13] V. Betz, J. Rose, A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers.
[14] C. Sidiropoulos, Development of a design framework for Power/Energy consumption estimation in heterogeneous FPGA architectures, Master thesis, NTUA, Greece. Available at: < software/narouto>.
[15] International Technology Roadmap for Semiconductors (ITRS), Chapter Interconnect, Edition.
[16] Altera, Corporation, Quartus-II University Interface Program.
[17] W. Zhao, Y. Cao, New generation of Predictive Technology Model for sub-45 nm early design exploration, IEEE Transactions on Electron Devices 53 (11) (2006).
[18] Available from: < architecture_table.html>.
[19] S. Wilton, N. Jouppi, CACTI: an enhanced cache access and cycle time model, IEEE Journal of Solid-State Circuits 31 (5) (1996).
[20] J.M.
Rabaey, Low Power Design Essentials, Series on Integrated Circuits and Systems, Springer, New York, NY.
[21] Available from: <

Harry Sidiropoulos received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece. He is currently working towards his Ph.D. at the same university. His research interests include FPGAs and CAD algorithms.

Dr. Kostas Siozios received his Diploma, Master and Ph.D. Degree in Electrical and Computer Engineering from the Democritus University of Thrace, Greece, in 2001, 2003 and 2009, respectively. He is now working as a research associate at the National Technical University of Athens, Greece. His research interests include CAD algorithms, low-power reconfigurable architectures and parallel architectures. He has published more than 53 papers in international journals and conferences. He has also contributed to 4 books of Kluwer and Springer. In recent years he has worked as principal investigator in numerous research projects funded by the European Commission (EC), as well as by the Greek Government and Industry.

Prof. Dimitrios Soudris received his Diploma in Electrical Engineering from the University of Patras, Greece. He received the Ph.D. Degree in Electrical Engineering from the University of Patras. He worked as a Professor in the Dept. of Electrical and Computer Engineering, Democritus University of Thrace, for 13 years. He is currently working as Ass. Professor in the School of Electrical and Computer Engineering, Dept. Computer Science, of the National Technical University of Athens, Greece. His research interests include embedded systems design, low power VLSI design and reconfigurable architectures. He has published more than 210 papers in international journals and conferences. Also, he is coauthor/coeditor of five books of Kluwer and Springer.
He is leader and principal investigator in numerous research projects funded by the Greek Government and Industry, as well as the European Commission (ESPRIT II-III-IV and 5th & 7th IST). He has served as General Chair and Program Chair for PATMOS '99 and 2000, respectively, and as General Chair of IFIP-VLSI-SOC. Also, he received an award from INTEL and IBM for the EU project LPGD, and awards in ASP-DAC '05 and VLSI '05 for the EU AMDREL project IST. He is a member of the IEEE, the VLSI Systems and Applications Technical Committee of IEEE CAS and the ACM.
