Journal of Systems Architecture

Journal of Systems Architecture 59 (2013)

On supporting rapid exploration of memory hierarchies onto FPGAs

Harry Sidiropoulos, Kostas Siozios, Dimitrios Soudris

9 Heroon Polytechneiou, Zographou Campus, Athens, Greece

Corresponding author. E-mail address: ksiop@microlab.ntua.gr (K. Siozios).

Article history: Available online 21 November 2012

Keywords: Heterogeneous FPGA; CAD tool; Exploration framework

Abstract

This paper introduces a novel methodology for enabling fast yet accurate exploration of memory organizations onto FPGA devices. The proposed methodology is supported in software by a new open-source tool framework, named NAROUTO. This framework is the only publicly available solution for performing architecture-level exploration, as well as application mapping onto FPGA devices with different memory organizations, under a variety of design criteria (e.g. delay improvement, power optimization, area savings, etc.). Experimental results with a number of industry-oriented kernels demonstrate the efficiency of the proposed solution, as compared to similar approaches, since it provides better manipulation of memory blocks, leading to architectures with higher performance in terms of area, power and delay.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, reconfigurable architectures, and more specifically Field Programmable Gate Arrays (FPGAs), have become efficient alternatives to Application Specific Integrated Circuits (ASICs). The characteristics and capabilities of these architectures have changed and improved significantly over the last two decades, from arrays of Look-Up Tables (LUTs) to heterogeneous devices that integrate a number of hardware components (e.g. LUTs with different sizes, microprocessors, DSP modules, RAM blocks, etc.).
In other words, the logic fabric of an FPGA has changed gradually from a homogeneous and regular architecture to a heterogeneous (or piece-wise homogeneous) device. Previous studies [12–14] show that one of the most important tasks in designing an efficient FPGA device is architecture-level exploration. This task determines, among others, the number, the organization (i.e. floor-plan), as well as the parameters of the device components (e.g. look-up table size, channel width, array size, etc.). Note that the problem of sufficient and accurate architecture-level exploration has become far more important nowadays, due to the increased complexity posed by the heterogeneous IP blocks found in FPGA platforms.

In order to accomplish this task, a number of methodologies and Computer-Aided Design (CAD) tools have been proposed. These solutions involve, among others, synthesis and technology mapping [1,2], placement and routing (P&R) [3,13], as well as power and energy estimation [6] techniques. The development of new tools targeting the reconfigurable domain is tackled both by academia and industry. More specifically, tools developed in academia have mainly focused on architecture-level exploration for homogeneous FPGAs (i.e. devices consisting solely of configurable logic blocks (CLBs)). Even though these solutions are sufficient for evaluating new CAD algorithms, they cannot handle the additional Intellectual Property (IP) blocks (e.g. memories, DSPs, embedded CPUs, etc.) found in reconfigurable architectures. On the other hand, commercial frameworks support FPGA devices with numerous heterogeneous IP blocks, but unfortunately they allow only a small degree of architecture-level exploration.

Recently, two frameworks, one from academia and the other from industry, were released that provide some flexibility in performing architecture-level exploration for heterogeneous FPGAs.
These frameworks are based on a commercial synthesizer, Altera's Quartus [7], while the P&R step is performed with algorithms found in the VPR tool [3]. Even though the combination of these two solutions can potentially alleviate the limitation regarding heterogeneity support, the derived results lack accuracy. In addition, the application's implementation cannot be evaluated in terms of power and energy dissipation. Since FPGAs are usually power-limited devices [4,5,15], this limitation is a crucial drawback when scoring the efficiency of the retrieved architectural solutions.

In this paper we propose a new framework for supporting the tasks of architecture-level exploration and application mapping onto heterogeneous FPGAs. The proposed framework, named NAROUTO, is based on a number of open-source tools. This flow is publicly available for downloading, extending and improving [8], in order to support more advanced heterogeneous blocks (e.g. CPUs) [14,21]. The contributions of this work, as compared to prior publications, are summarized as follows:

- Introduction of a novel software-supported methodology for enabling rapid architecture-level exploration for heterogeneous FPGAs that consist of different memory organizations and/or hierarchies.

- Development of a new tool framework that enables application mapping onto these heterogeneous FPGAs. Unlike similar frameworks that support only one type of heterogeneous block (e.g. memory of a given size), our solution exhibits additional flexibility, enabling among others simultaneous handling of heterogeneous blocks with different types and/or properties.
- Apart from the delay metric, the evaluation of application implementations onto a target heterogeneous FPGA can also be performed in terms of power and energy dissipation (both static and dynamic).

The rest of the paper is organized as follows: Section 2 highlights the main limitations found in similar approaches targeting architecture-level exploration, whereas Section 3 gives an overview of the employed heterogeneous FPGA. The proposed methodology, as well as the supporting tool framework, are described in Sections 4 and 5, respectively. Section 6 provides a number of qualitative and quantitative comparisons that demonstrate the efficiency of the proposed solution, as compared to the state-of-the-art approach. Finally, conclusions are summarized in Section 7.

2. Motivation example

A common limitation of existing software frameworks that perform architecture-level exploration is that none of them can handle macro-blocks, apart from logic resources (slices) and the interconnect fabric. On the other hand, commercial tools are not easily adapted to evaluate reconfigurable architectures that differ from the actually fabricated devices. Additionally, since the academic solutions are based exclusively on academic tools, they are usually evaluated with synthetic benchmarks (as the available academic synthesizers are able to tackle only designs of reduced complexity). Hence, there is a lack of software tools that are able to perform fast and accurate evaluation of different architectural selections.
This section highlights the main limitations found in existing tools for supporting architecture-level exploration, as well as application mapping onto FPGAs consisting of heterogeneous blocks. Starting from an application's description in VHDL or Verilog, we first perform synthesis with the Altera Quartus framework [7], whose output is reported in BLIF (Berkeley Logic Interchange Format) [9]. This format corresponds to a gate-level netlist with basic primitives for inputs, outputs, logic gates, flip-flops, etc. Even though BLIF is a widely accepted format for academic tools, it is rather restrictive, as it is unable to express heterogeneous components, such as RAM blocks, DSP blocks (e.g. multipliers), processors, etc. Furthermore, it cannot express arithmetic carry chains without converting them to gates. In place of these components, the BLIF netlist uses BlackBoxes (BBs) to enable transparent signal propagation. However, since BBs do not have any meaningful functionality, the derived netlist lacks accuracy. Additionally, as we will show later, existing tools handle designs with BBs in a non-optimal way.

Next, we summarize the main drawbacks of existing (academic/commercial) software solutions:

- The application's functionality as described in the BLIF netlist differs from the application's RTL description, since the BBs do not provide any functionality.
- For a given design, all the BBs are marked with the same keyword (.blackbox), regardless of their actual functionality. This implies that each design can employ only one type of BB (e.g. only memory, DSP, or embedded CPU). Additionally, all these BBs are assumed to have the same properties (e.g. size, throughput, power/energy consumption, etc.), regardless of their usage.
- In case the design incorporates BlockRAMs, the existing tools (Quartus and VPR-5.0) produce an excessive number of distinct BBs, each of which corresponds to only a few words of a whole memory block. This overestimation of the number of BBs results in considerable delay, power and area overheads, due to the additional routing resources needed for signal communication.
- Finally, existing approaches cannot support the evaluation of architectural selections based on different memory organizations and/or hierarchies.

3. Target architecture

Our target architecture is a generic FPGA device similar to recent FPGAs from the Altera (Stratix) [10] and Xilinx (Virtex) [11] families, consisting of logic resources, memory blocks, special-purpose components (e.g. embedded processors, DSP blocks, etc.) and input/output pads. The glue logic of our FPGA device is organized into an array of slices, while the communication among hardware blocks is provided through a hierarchical interconnection network of fast and versatile routing resources. By the term slice we refer to the CLB, the upper and right routing segments, as well as the corresponding switch box. At the next level of hierarchy, each CLB is formed by a number of Basic Logic Elements (BLEs), while each BLE is formed by a Look-Up Table (LUT), a flip-flop, a number of multiplexers (at inputs and outputs), as well as the required wires for local connectivity. Such an architectural arrangement allows the local interconnects between BLEs to be optimized [13]. Fig. 1 depicts a template of the employed architecture with embedded RAM and DSP blocks [12].

The previously mentioned architecture parameters for CLBs differ among vendors and FPGA families, since their values affect the device performance and power/energy consumption. For instance, the Altera Stratix FPGAs group 10 BLEs in order to form a Logic Array Block (LAB) [10].
Similarly, in the Xilinx Virtex-II Pro devices, 2 LUTs are contained in a BLE, while 4 BLEs are joined to form a slice [11].

Apart from the logic and routing infrastructure, our FPGA architecture also incorporates a number of heterogeneous blocks. Throughout this paper, we employ this feature in order to study the impact of different memory hierarchies. More specifically, two different approaches, depicted in Figs. 2 and 3, are evaluated with our software-supported framework. These memory hierarchies are summarized as follows:

- Scenario 1, depicted schematically in Fig. 2, concerns the shared memory architecture. Typically, this memory organization assumes a large block of RAM which is accessible by several different CLBs. Even though application mapping onto a device that provides such a memory hierarchy is a relatively easy task, a number of limitations might arise when multiple CLBs need fast access to memory. Additionally, an architecture with shared memory does not scale very well.
- Scenario 2 concerns the shared-distributed memory architecture. This approach, depicted schematically in Fig. 3, incorporates, apart from a number of shared memories (as discussed previously), a mechanism that gives each CLB direct access to a private (dedicated) memory. The key advantage of shared-distributed memory is the unified address space in which all data can be found. Additionally, this memory hierarchy scales more easily with an application's requirements.

Fig. 1. Template of the employed heterogeneous FPGA device.

Fig. 2. An instantiation of the shared memory architecture (denoted as scenario 1).

For both data-memory architecture models, a shared background (usually off-chip) memory module is assumed. Throughout this paper, we do not study issues related to how data are mapped onto these memories, since this task is tackled by the synthesis and technology mapping tools. Also, for both hierarchies we assume that shared memories may be simultaneously accessed by multiple CLBs. In order to physically implement these hierarchies, a number of special-purpose routing tracks that provide signal connectivity among memory blocks are employed. Note that the performance metrics (e.g. delay and power/energy consumption) of these dedicated routing paths are taken into consideration during application mapping. Even though our framework can handle any memory hierarchy, provided it is appropriately modeled, throughout this paper we choose to study these two scenarios because they are widely accepted in the computer architecture field.

4. Proposed methodology

This section describes in detail the proposed methodology for performing architecture-level exploration for heterogeneous FPGAs. More specifically, the introduced methodology, depicted schematically in Fig. 4, studies two complementary design problems: problem (1), the architecture-level exploration in order to determine a number of architectural parameters that affect heterogeneous components, and problem (2), the application implementation onto these heterogeneous FPGA devices. Even though this methodology

is able to handle devices consisting of different types of heterogeneous blocks, throughout this paper we evaluate its efficiency only in terms of handling architectures with alternative memory organizations. In this case, the heterogeneity concerns the properties of these memory blocks (e.g. size, delay, power/energy consumption, etc.).

Fig. 3. An instantiation of the shared-distributed memory architecture (denoted as scenario 2).

As input to our methodology we use the application's description in VHDL or Verilog, which is synthesized and technology mapped, while the output is extracted in BLIF format. We have already mentioned that the BLIF format exhibits limited support for designs with heterogeneous components. Hence, in order to preserve the functionality of the design, the derived netlist has to be appropriately modified. However, before applying such a modification, it is crucial to perform application profiling in order to determine the different types of BlackBoxes (BBs) found in the design (e.g. memories with different properties), as well as the number of instantiations per BB (each of which has different properties). The profiling task becomes even more important because a single heterogeneous block is usually reported as multiple BBs by the synthesis and technology mapping tools.

Fig. 4. The proposed methodology.

Next, the architecture selection step picks the appropriate instances for the BBs from the component library. During this task, the efficiency of multiple components per functionality (e.g. memories with different properties), or the organization of these components (e.g. memory hierarchies), can be evaluated. For additional accuracy, the delay, power/energy dissipation and silicon area characterization of the heterogeneous blocks found in these libraries is based on a number of well-established models [18–20]. By appropriately selecting combinations among these BBs, it is possible to perform a sufficient architecture-level exploration in terms of the number of BBs, as well as their organization. The outcome of this task is a set of Pareto curves that balance the studied criteria. Based on these curves, an architect is able to design an optimized FPGA device.

Then, the application's netlist is placed and routed (P&R) onto the selected FPGA. The output of this task provides a number of metrics (e.g. delay, power, area) that allow sufficient evaluation of the application's implementation. In case the derived solution does not meet the system specifications, there is a feedback loop for additional improvements. More specifically, if we are primarily interested in finding the optimal organization of hardware resources, or BBs, over an FPGA (referred to as Problem i), the feedback loop affects the architectural selections. During this step, different topologies and/or instantiations of BBs (e.g. memory blocks with different organization) are selected.
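The Pareto curves mentioned above can be illustrated with a small filter over candidate architectures. The following is a minimal sketch, not the NAROUTO implementation: the `Candidate` record, its field names and all numeric values are hypothetical, and the three metrics are simply assumed to be minimized.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One evaluated architecture (hypothetical record); all metrics are minimized."""
    name: str
    delay_ns: float
    power_mw: float
    area_clbs: float


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    metrics_a, metrics_b = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(metrics_a, metrics_b))
    better = any(x < y for x, y in zip(metrics_a, metrics_b))
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up numbers, for illustration only (not measurements from the paper).
EXAMPLE = [
    Candidate("shared_1x16k", delay_ns=12.0, power_mw=90.0, area_clbs=400),
    Candidate("shared_dist_4x4k", delay_ns=9.0, power_mw=70.0, area_clbs=450),
    Candidate("shared_2x8k", delay_ns=12.5, power_mw=95.0, area_clbs=420),
]

if __name__ == "__main__":
    for candidate in pareto_front(EXAMPLE):
        print(candidate.name)
```

In this toy data set the third candidate is worse than the first on all three metrics, so only the first two remain as trade-off points for the architect to choose from.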
On the other hand, whenever our goal is to maximize the performance metrics by enabling a more effective application implementation (Problem ii), the feedback loop goes to the P&R step.

5. The proposed NAROUTO framework

This section introduces the NAROUTO framework [14], which supports in software the proposed architecture-level exploration methodology for heterogeneous FPGA devices. This framework, depicted schematically in Fig. 5, is composed of a number of open-source CAD tools that either have been developed from scratch, or have been extensively modified to be aware of the additional functionality required for sufficient handling of designs with multiple BBs. Even though the NAROUTO framework supports devices consisting of different types of heterogeneous components, throughout this study the BBs are tuned to represent BlockRAMs. For this purpose, two candidate memory hierarchies are evaluated (described in Figs. 2 and 3).

5.1. Synthesis and technology mapping

The first task of the NAROUTO framework deals with application synthesis and technology mapping. Even though a number of academic tools (e.g. ABC [1], SIS [2]) could be employed, we prefer to accomplish it with a well-established commercial tool. For this purpose, the Altera Quartus tool [7] is employed, since its output (a hierarchical netlist in BLIF format) is compatible with the academic tools. Note that the BLIF format is a prerequisite for the majority of academic tools dealing with FPGAs. In order to enable Quartus to report its output in BLIF format, where the heterogeneous components are replaced with BBs, the following TCL command is applied:

    set_global_assignment -name INI_VARS "no_add_ops=on; dump_blif_after_lut_map=on"

A limitation of the derived output is the excessively high number of BBs found in the BLIF netlist, which does not correspond to the actual number of utilized macro blocks.
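To make the BB count of such a netlist concrete, the sketch below counts how many times each black-box model is instantiated in a BLIF text. It is only an illustration under stated assumptions: it relies on the common hierarchical BLIF conventions (`.model`, `.subckt`, `.blackbox`, `#` comments, trailing `\` continuation), the function names and the tiny netlist are hypothetical, and it is not part of the NAROUTO tools.

```python
from collections import Counter
from typing import Iterator


def logical_lines(blif_text: str) -> Iterator[str]:
    """Yield non-empty BLIF lines with comments removed and '\\' continuations joined."""
    buffer = ""
    for raw in blif_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()
        if line.endswith("\\"):
            buffer += line[:-1] + " "
            continue
        buffer += line
        if buffer.strip():
            yield buffer.strip()
        buffer = ""


def count_blackbox_instances(blif_text: str) -> Counter:
    """Count .subckt instances whose model body is declared as .blackbox."""
    blackbox_models = set()
    subckt_refs = []
    current_model = None
    for line in logical_lines(blif_text):
        tokens = line.split()
        if tokens[0] == ".model" and len(tokens) > 1:
            current_model = tokens[1]
        elif tokens[0] == ".blackbox" and current_model is not None:
            blackbox_models.add(current_model)
        elif tokens[0] == ".subckt" and len(tokens) > 1:
            subckt_refs.append(tokens[1])
    return Counter(ref for ref in subckt_refs if ref in blackbox_models)


# Hypothetical two-bit memory split into two partial black boxes.
EXAMPLE_BLIF = """
.model top
.inputs a b we
.outputs q0 q1
.subckt ram_slice d=a we=we q=q0
.subckt ram_slice d=b we=we q=q1
.end

.model ram_slice
.inputs d we
.outputs q
.blackbox
.end
"""

if __name__ == "__main__":
    print(count_blackbox_instances(EXAMPLE_BLIF))
```

Both instances share the same write-enable net `we`, which is the kind of information the later profiling step uses to decide that partial BBs belong to one macro block.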
To make matters worse, there is no distinction between BBs belonging to different heterogeneous blocks (e.g. memory contents that are stored in different BlockRAMs). Hence, the tools of the NAROUTO framework that are described in this section provide a mechanism to alleviate this limitation.

5.2. Activity estimation

The next step in our framework involves the generation of activity files for power/energy estimation. For this purpose, a number of well-established models are employed [6,18–20]. Additionally, since existing versions of the ACE tool [6] cannot support BLIF netlists with BBs, a special pre-processing step has been introduced. This step deals with the computation of static probabilities and transition densities from primary inputs to primary outputs for all the networks of the design that include at least one BB. The new tool, named HB_for_ACE, initially removes all the BBs from the BLIF netlist, and then connects the BB input and output pins to the BLIF's primary outputs and primary inputs, respectively. By applying this technique, it is feasible to remove all the BBs from the design description, and hence enable the ACE 2.0 tool to be applied. On the other hand, regarding networks that include at least one BB, the corresponding values of static probability and transition density are retrieved from an exhaustive simulation. Algorithm 1 provides the pseudo-code for the open-source HB_for_ACE (transform Hierarchical Blifs for ACE) tool:

Algorithm 1. Pseudo-code for HB_for_ACE tool.
function hb_for_ace (Input_blif) {
    // Input: blif netlist with BBs
    // Output: blif netlist compatible with ACE
    BB_inputs[];        // Array for storing all BBs' input pins
    BB_outputs[];       // Array for storing all BBs' output pins
    primary_inputs[];   // Array for storing primary input pins
    primary_outputs[];  // Array for storing primary output pins

    // Get the primary I/O pins of the design
    primary_inputs[] = get_primary_inputs (Input_blif);
    primary_outputs[] = get_primary_outputs (Input_blif);

    // Get the blackboxes' I/O pins
    BB_inputs[] = get_blackbox_inputs (Input_blif);
    BB_outputs[] = get_blackbox_outputs (Input_blif);

    // Delete any reference to blackboxes from the blif netlist
    delete_blackbox_subcircuits (Input_blif);
    delete_blackbox_models (Input_blif);

    // Connect the BBs' I/Os to the design's primary O/I pins
    append (primary_inputs[], BB_outputs[]);
    append (primary_outputs[], BB_inputs[]);

    // Print the ACE compatible blif netlist
    printout_final_blif (Output_blif_filename);
}

5.3. Technology mapping onto heterogeneous FPGAs

Having as input the application's BLIF description, which also includes information about the BBs, the next task in our methodology

deals with the packing of technology-mapped cells into logic blocks (CLBs). The size of the derived clusters depends on the underlying FPGA architecture. This task is supported by a set of CAD tools, which are based on T-VPack [3,13]. These tools were appropriately extended in order to be aware of multiple types of BBs, each of which might have different properties. Additionally, these tools alleviate the limitation of the Quartus synthesizer in effectively handling netlists with BBs. The following subsections describe in more detail the tools developed to support technology mapping onto heterogeneous FPGAs.

Fig. 5. The proposed NAROUTO framework.

5.3.1. BlackBox profiler

The BlackBox_Profiler parses the application description in order to identify the different types of BBs, as well as how many instances of each of them are utilized for the application implementation. Part of this procedure also deals with appropriate modeling

of these BBs, in order to better meet the specifications of the heterogeneous components that they actually replace. Typical examples of these specifications are the functionality of the heterogeneous component (e.g. memory, DSP, etc.), its size, as well as its number of I/O pins. In order to retrieve these properties, we parse the application netlist to identify all the partial BBs that belong to a single macro block. This is feasible because all these partial BBs use the same signals (e.g. the read/write enable inputs of a RAM) for control and communication with the rest of the FPGA components. Then, the specifications for each BB are retrieved from the corresponding technology library, as discussed in Section 4. These values are employed later for evaluating the application in terms of delay, power/energy dissipation, and area. Algorithm 2 depicts the pseudo-code for BB-aware profiling.

Algorithm 2. Pseudo-code for Blackbox_Profiler.
function blackbox_aware_technology_mapping {
    struct Blackbox {
        blackbox_name;
        blackbox_inputs[];
        blackbox_outputs[];
    };
    struct Type {
        blackbox_name;
        blackbox_inputs[];
        blackbox_outputs[];
        instances_num;
        blackbox_func;
    };
    struct Type blackbox_types[];
    struct Blackbox blackboxes[];

    // Find the BBs utilized in the design
    blackboxes[] = get_blackboxes_instances ();
    blackboxes_array_size = get_size (blackboxes[]);
    blackbox_types_array_size = 0;

    for (i = 0; i < blackboxes_array_size; i++) {
        new_type_flag = 1;  // Reset for every BB instance
        for (j = 0; j < blackbox_types_array_size; j++) {
            // Search all known BB types by comparing control signals
            if (control_pins_match (blackboxes[i], blackbox_types[j])) {
                blackbox_types[j].instances_num++;
                new_type_flag = 0;
                break;
            }
        }
        if (new_type_flag == 1) {
            // Create a new entry for this BB type
            struct Type new;
            new.blackbox_name = blackboxes[i].blackbox_name;
            new.blackbox_inputs = blackboxes[i].blackbox_inputs;
            new.blackbox_outputs = blackboxes[i].blackbox_outputs;
            new.instances_num = 1;
            add_element_to_array (new, blackbox_types[]);
            blackbox_types_array_size++;
        }
    }

    // Find the properties of each BB type from a technology library
    for (i = 0; i < blackbox_types_array_size; i++) {
        blackbox_types[i].blackbox_func = get_info_from_tech_lib ();
    }
}

5.3.2. BlackBox packing

The output of the BlackBox_Profiler gives a number of guidelines regarding how to collapse all the partial BBs that belong to the same macro block into a single BB. This task, referred to as Single-Packing or SP, is supported in the NAROUTO framework by the BlackBox_Packing tool. Additionally, the introduced framework supports one more level of packing, referred to as Full-Packed or FP. The goal of this additional packing is to recursively collapse all the BBs of the same type into a larger super-BB. For instance, assume that the memory requirements of a given application are 16 RAM blocks of 1 kbyte each. The BLIF netlist, as retrieved from Quartus, reports that the design contains 16,384 (16 × 1024) BBs, each of which actually corresponds to one byte.
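The two packing levels of this subsection can be sketched on the 16 × 1 kbyte example. This is a simplified illustration, not the BlackBox_Packing tool: the `PartialBB` record, the net names and the choice to identify a macro block purely by its tuple of control pins are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PartialBB:
    """A partial black box as reported by synthesis (hypothetical record)."""
    name: str
    control_pins: Tuple[str, ...]  # e.g. shared read/write enable nets
    size_bytes: int


def single_pack(partials: List[PartialBB]) -> Dict[Tuple[str, ...], int]:
    """SP: merge partial BBs with identical control pins; map them to total bytes."""
    packed: Dict[Tuple[str, ...], int] = defaultdict(int)
    for bb in partials:
        packed[bb.control_pins] += bb.size_bytes
    return dict(packed)


def full_pack(packed: Dict[Tuple[str, ...], int]) -> int:
    """FP: collapse all SP blocks of the same type into one super-BB (size in bytes)."""
    return sum(packed.values())


def example_netlist() -> List[PartialBB]:
    """16 RAM blocks, each reported as 1024 one-byte partial BBs."""
    return [
        PartialBB(f"ram{ram}_byte{i}", (f"we_{ram}", f"re_{ram}"), 1)
        for ram in range(16)
        for i in range(1024)
    ]


if __name__ == "__main__":
    partials = example_netlist()
    sp = single_pack(partials)
    print(len(partials), len(sp), full_pack(sp))
```

Grouping by a dictionary key avoids the nested type search of Algorithm 2; it assumes that matching control pins can be expressed as simple equality.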
After applying SP, the resulting netlist has 16 BBs, each of which represents 1 kbyte, whereas with the second level of packing (FP) the netlist contains only 1 super-BB with a size of 16 kbytes. Note that during SP and FP packing, we take into consideration the desired memory hierarchy (as defined by the employed architecture description file). Additional details about how this is applied in our framework can be found in Section 3. Algorithms 3 and 4 give the corresponding pseudo-code for BB packing level 1 (SP) and level 2 (FP), respectively.

Algorithm 3. Algorithm for black-box Packing level 1.

function BB_Packing_Level_1 {
    // Stores the BB types. This info was already extracted during
    // BB profiling
    blackbox_types[];
    // Stores all the BB instances, as they were found during BB profiling
    blackboxes[];
    // Stores the new packed BBs
    packed_blackboxes[] = blackbox_types[];

    for (i = 0; i < blackboxes_array_size; i++) {         // For each BB instance
        for (j = 0; j < blackbox_types_array_size; j++) {
            // Search all known BB types by comparing their control signals
            if (control_pins_match (blackboxes[i], blackbox_types[j])) {
                // If the BB's type is found, then it is merged with the
                // BB instance
                packed_blackboxes[j] = merge (packed_blackboxes[j], blackboxes[i]);
                break;
            }
        }  // End of the BB types loop
    }      // End of the BB instances loop
}

Algorithm 4. Algorithm for black-box Packing level 2.

function BB_Packing_Level_2 {
    // Stores the packed BBs, as they were already retrieved from BB
    // packing level 1
    packed_blackboxes[];
    // A super-block which stores the FP BB
    full_packed_blackbox;

    for (i = 0; i < packed_blackboxes_array_size; i++) {
        full_packed_blackbox = merge (full_packed_blackbox, packed_blackboxes[i]);
    }  // End of the packed BB loop
}

5.3.3. Pin multiplexing

Apart from the number of partial BBs that are retrieved after synthesis and technology mapping, we will show later that each of these BBs also exhibits an excessive requirement for I/O pins. This implies that the target FPGA needs a wider routing channel, which in turn leads to delay, power/energy dissipation, and area penalties. More specifically, based on our analysis, we found that only a subset of the I/O pins of each BB is actually required for preserving the application's functionality. Hence, the NAROUTO framework provides a mechanism that initially identifies the required pins of each BB and eliminates the redundant I/Os. The pseudo-code of this tool, named Pin_Multiplexing, is depicted in Algorithm 5. Note that during this task there is no signal merging, since this would undermine the structural and functional integrity of the final netlist. In contrast, the reduction of pins is based on implementing a set of multiplexers in CLBs. More specifically, the input signals of a BB initially pass through multiplexing CLBs, and the new multiplexed signals are fed as inputs to the BB. Similarly, the output signals of a BB are multiplexed and pass through de-multiplexing CLBs before connecting to the rest of the netlist. Based on the design specifications, as retrieved from the component library depicted in Fig. 4, the I/O pins of each BB can be recursively multiplexed many times, in order to match the number of pins of the heterogeneous block that the BB actually replaces.

Algorithm 5. Algorithm for pin multiplexing.

function pin_multiplexing {
    // Array for storing the packed BBs, as derived from the packing step (SP or FP)
    sp/fp_blackboxes[];
    // Define the aggressiveness for pin multiplexing.
    // Levels 1, 2, ... denote that I/Os of BBs will be
    // multiplexed once, twice, etc.
    multiplexion_level;
    // Each CLB multiplexes a number of I/O pins equal to its number
    // of inputs minus 1 (for the clock input)
    clb_mux_pin_num = CLB_input_num - 1;
    // Each CLB demultiplexes a number of I/O pins equal to its number
    // of LUTs
    clb_demux_pin_num = CLB_LUT_num;

    for (i = 0; i < sp/fp_blackboxes_array_size; i++) {  // For each BB
        // Temporary storage of I/Os for a BB
        input_pins[] = get_inputs (sp/fp_blackboxes[i]);
        output_pins[] = get_outputs (sp/fp_blackboxes[i]);

        for (j = 0; j < multiplexion_level; j++) {
            // Multiplex the I/Os of the BB multiplexion_level times;
            // the pin counts are refreshed at every level
            in_pin_num = get_length (input_pins[]);
            out_pin_num = get_length (output_pins[]);
            for (k = 0; k < in_pin_num; k += clb_mux_pin_num) {
                // Multiplex clb_mux_pin_num pins in every
                // multiplexing CLB
                create_mux_clb (input_pins[k], input_pins[k + clb_mux_pin_num]);
            }
            for (k = 0; k < out_pin_num; k += clb_demux_pin_num) {
                // Demultiplex clb_demux_pin_num pins in every
                // demultiplexing CLB
                create_demux_clb (output_pins[k], output_pins[k + clb_demux_pin_num]);
            }
            // I/Os are updated with the new multiplexed pins to enable
            // re-multiplexing
            input_pins[] = get_multiplexed_input_pins ();
            output_pins[] = get_multiplexed_output_pins ();
        }
    }
}

5.3.4. Update activity

The pin multiplexing technique discussed previously imposes variations in the application's routing. These variations occur mainly because BBs have to be connected with the rest of the design through fewer I/O pins. In order to take into account the impact of pin multiplexing during power analysis, the information regarding signal activity has to be appropriately updated. Note that during this task we also take into consideration the additional networks that implement the functionality of pin multiplexing, by computing the proper activity values for these additional networks.
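The effect of the multiplexing level in Algorithm 5 on the pin count of a BB can be estimated with a few lines of arithmetic. The sketch below is an illustration only: it assumes that every multiplexing CLB collapses up to `pins_per_clb` pins into a single pin, which is one reading of the pseudo-code rather than a documented property of the Pin_Multiplexing tool, and the pin count of 1000 is a made-up value.

```python
import math
from typing import List


def pins_per_level(pin_count: int, pins_per_clb: int, levels: int) -> List[int]:
    """Return the remaining pin count after each multiplexing level.

    Assumption: each multiplexing CLB merges up to pins_per_clb pins into one,
    so one level needs ceil(pin_count / pins_per_clb) CLBs and leaves as many pins.
    """
    if pins_per_clb < 2:
        raise ValueError("a CLB must multiplex at least two pins")
    history = []
    for _ in range(levels):
        pin_count = math.ceil(pin_count / pins_per_clb)
        history.append(pin_count)
    return history


if __name__ == "__main__":
    # A CLB with 22 inputs multiplexes 22 - 1 = 21 pins (one input is the clock).
    clb_mux_pin_num = 22 - 1
    print(pins_per_level(1000, clb_mux_pin_num, levels=2))
```

Each entry of the returned list is also the number of multiplexing CLBs added at that level, which is the logic overhead traded for the narrower routing channel.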
Algorithm 6 gives the pseudo-code for computing the average static probability and transition density of the multiplexed signals.

Algorithm 6. Algorithm for Activity_Updater.

function update_activities {
    // Identify all the I/O signals of BBs
    io_signals_of_bbs[];
    // Identify static_probability and transitional_density for each
    // BB signal
    activities_of_bbs[];

    for (i = 0; i < io_signals_of_bbs_array_size; i++) {  // For all the multiplexed signals
        // tmp_signals[] array stores all the multiplexed signals
        tmp_signals[] = get_all_signals_multiplexed_in (io_signals_of_bbs[i]);
        // Store the static_probability and transitional_density
        // of a multiplexed signal
        tmp_prob = get_signal_probability (io_signals_of_bbs[i]);
        tmp_dens = get_signal_density (io_signals_of_bbs[i]);

        for (j = 0; j < tmp_signals_array_size; j++) {
            // Compute static_probability and transitional_density for
            // multiplexed signals
            static_probability = calculate_static_prob (tmp_prob);
            transitional_density = calculate (tmp_dens);
            // Update the signal's activity
            update_activity (tmp_signals[j], static_probability, transitional_density);
        }
    }
}

5.4. Placement and routing

The last task of the proposed framework is application placement and routing onto the FPGA. For this purpose we employ a simulated annealing algorithm for placement and a congestion-driven Pathfinder algorithm for routing. Both of these algorithms are based on VPR [3,13], but they have been extensively modified in order to be aware of the inherent constraints posed by heterogeneous components. More specifically, the implementation of these algorithms in the NAROUTO framework provides techniques for efficient handling of multiple types of heterogeneous BBs, as well as estimation of power/energy dissipation (through an appropriate extension of the Powermodel tool [6]). Note that the new tool can handle heterogeneous FPGAs with embedded macro blocks other than memories, represented as new types of BBs, if they are appropriately modeled in the component library.

6. Experimental results

This section provides a number of qualitative and quantitative comparisons that demonstrate the efficiency of the introduced framework, named NAROUTO, as compared to the state-of-the-art solution (the VPR-5.0 tool [3]). Note that, for the sake of completeness, application synthesis and technology mapping for both the proposed and the existing solution were performed with the Quartus toolset [7]. Table 1 gives a qualitative comparison among the introduced framework, the state-of-the-art solution, as well as a commercially available toolset.
This comparison is performed under a number of different criteria that span architecture-oriented (e.g. heterogeneity support), application-oriented (e.g. constrained application mapping), and implementation-oriented (e.g. complete framework) parameters. A number of conclusions can be derived from this table. The proposed framework supports designs with BBs more efficiently, whereas its power and energy estimation features are similar to those found in relevant commercial approaches. Additionally, only the academic flows (NAROUTO and VPR-5.0) enable architecture-level exploration. Hence, the commercial flow tackles exclusively Problem No. 2 (see Fig. 1), whereas the proposed solution also supports Problem No. 1. Even though the first problem could be handled by VPR-5.0, its lack of power/energy support, as well as its inefficient usage of BBs, introduces a number of problems.

For evaluation purposes, the alternative toolflows are quantified using DSP applications from Altera's QUIP toolkit [16]. Table 2 summarizes the main characteristics of the employed benchmark suite, whereas the complexity of these applications guarantees that the derived conclusions are valid for the majority of digital designs implemented onto FPGAs. Note that our framework does not focus on minimizing either the memory requirements or the memory accesses, since we assume that these problems were tackled during application synthesis with Altera Quartus. Regarding the glue logic of the target FPGA, each CLB consists of 10 4-input LUTs with 22 input and 10 output pins, whereas the FPGA array size and the routing channel width depend on the target application.
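Because the routing channel width is sized per application, one common approach (used, for example, by VPR's default flow) is a binary search for the narrowest channel at which routing still succeeds. The paper does not describe how NAROUTO performs this search, so the sketch below is only an illustration. `is_routable` is a hypothetical callback standing in for a full routing attempt, the bounds are arbitrary, and routability is assumed to be monotonic in the channel width.

```python
from typing import Callable


def min_channel_width(is_routable: Callable[[int], bool],
                      lo: int = 1, hi: int = 1024) -> int:
    """Return the smallest width in [lo, hi] for which is_routable(width) is True.

    Assumes monotonicity: if a width routes, every larger width routes too.
    is_routable is a hypothetical stand-in for one complete routing run.
    """
    if not is_routable(hi):
        raise ValueError("design does not route even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if is_routable(mid):
            hi = mid          # mid works, so the answer is mid or narrower
        else:
            lo = mid + 1      # mid fails, so the answer is strictly wider
    return lo


# Toy stand-in: pretend the design needs at least 37 tracks per channel.
width = min_channel_width(lambda w: w >= 37)
```

Each probe corresponds to one routing attempt, so the search needs roughly log2(hi - lo) attempts instead of trying every width in turn.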
More specifically, the FPGA array size and the routing channel width are set to the minimum values that allow successful application P&R.

Table 1. Qualitative comparison in supported features.

Feature                                  NAROUTO                        VPR-5.0 [3]   QUARTUS [7]
Support BBs                              Yes                            Yes           Yes
Different types of BBs                   Unlimited                      1             Unlimited
Realistic number of BBs                  Yes                            No            No
Realistic number of I/Os per BBs         Yes                            No            No
Power estimation                         Yes                            No            Yes
Constraints during application mapping   Timing-power-area trade-off    Timing        Timing-power-area trade-off
Modular tools                            Yes                            Yes           No
Part of complete framework               Yes                            No            Yes
Open source                              Yes                            Yes           No

Table 2. Employed benchmark suite from [16] (columns: Benchmark, Functionality, 4-LUT, F/Fs, RAM bits, I/Os).

Benchmark         Functionality
oc_aes_core_inv   Encryption
oc_ata_ocidec3    Processor
oc_hdlc           Processor
oc_minirisc       Processor
oc_oc8051         Processor
os_blowfish       Encryption
Average:

6.1. Evaluation of different memory hierarchies

Initially, we evaluate the maximum operation frequency and the power consumption of the two memory hierarchies studied throughout this paper. For this purpose, Table 3 quantifies the maximum operation frequency for the alternative memory hierarchies, denoted as Scenario 1 and Scenario 2 in Figs. 2 and 3, respectively. As a reference, we also provide the corresponding results obtained with the VPR-5.0 tool [3]. Based on Table 3, the proposed methodology leads to a notable performance enhancement compared to application implementation with VPR-5.0. More specifically, Scenarios 1 and 2 achieve average performance enhancements of 1.96× and 2.07×, respectively.

Table 3. Evaluation in terms of maximum operation frequency (MHz) for different memory hierarchies. Columns: Benchmark, Reference [3], Scenario 1 (Proposed), Scenario 2 (Proposed). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, ucsb_152_tap_fir, Average, Ratio.

Apart from the performance improvement, the proposed methodology is also expected to achieve notable power savings. The results of this analysis are summarized in Table 4. Based on them, our two case studies (Scenario 1 and Scenario 2) lead to average power reductions, compared to the reference implementation with VPR-5.0, of 13.5% and 43.7%, respectively. These results indicate that memory hierarchies lead to superior performance due to better manipulation of data transfers. Since the target FPGA should exhibit as high performance as possible, for the rest of the paper we employ an architecture where memory blocks are organized based on the hierarchy depicted in Fig. 3 (Scenario 2).

Note that throughout this study we do not aim to find the optimal memory hierarchy that maximizes the performance improvement. Rather, our framework can quantify a number of performance metrics for a given memory hierarchy, and it also supports efficient application mapping onto such a device. Additional memory hierarchies and/or organizations that further improve performance can be found in relevant references, but this goal is beyond the scope of this paper.

6.2. Evaluation of alternative memory floor-plans

In this subsection we study a number of different floor-plans for memory blocks that follow the hierarchy of Scenario 2. The output of this analysis defines the spatial assignment of memory blocks over the target FPGA architecture. For this purpose, we evaluate three representative floor-plans, depicted in Fig. 6. More specifically, we study FPGAs where the memories are assigned to the borders of the device (Fig. 6(a)), to the center of the device (Fig. 6(b)), or distributed uniformly over the FPGA architecture (Fig. 6(c)).
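Candidate floor-plans such as these are compared in this section by their power-delay product (PDP), that is, power multiplied by delay, with the lowest value preferred. The following is a minimal sketch of that selection, assuming hypothetical per-floor-plan measurements. The class and function names are illustrative, and the numbers are placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Measurement:
    power_mw: float  # average power in mW (placeholder value)
    delay_ns: float  # critical-path delay in ns (placeholder value)

    @property
    def pdp(self) -> float:
        # mW * ns = pJ
        return self.power_mw * self.delay_ns


def best_floorplan(results: Dict[str, Measurement]) -> str:
    """Return the name of the floor-plan with the smallest PDP."""
    return min(results, key=lambda name: results[name].pdp)


def pdp_saving(results: Dict[str, Measurement], chosen: str, other: str) -> float:
    """Fractional PDP saving of `chosen` relative to `other`."""
    return 1.0 - results[chosen].pdp / results[other].pdp


# Placeholder measurements, NOT taken from the paper.
results = {
    "border": Measurement(power_mw=100.0, delay_ns=10.0),
    "center": Measurement(power_mw=80.0, delay_ns=8.0),
    "uniform": Measurement(power_mw=120.0, delay_ns=10.0),
}
winner = best_floorplan(results)
```

Using the product rather than power or delay alone avoids favoring a floor-plan that wins on one metric while losing heavily on the other.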
For the rest of the paper, these floor-plans are denoted as Border, Center and Uniform, respectively. In Fig. 6, the gray square boxes denote logic cells (CLBs), whereas the memory blocks (BBs) are depicted with different colors. Note that, apart from these floor-plans, any other floor-plan can also be evaluated with the NAROUTO framework.

The spatial assignment of memory blocks under the alternative floor-plans discussed in this subsection results in notable wire-length variations for routing paths, and hence it is expected to strongly affect the application's delay and power dissipation. Since our device is a general-purpose FPGA, the preferred memory floor-plan is selected by minimizing the Power-Delay product (PDP). Fig. 7 plots the PDP for the studied benchmark suite, whereas Table 5 gives the average values for the three alternative solutions. Based on these results, assigning memory blocks to the center of the FPGA leads to the minimum PDP. More specifically, the average PDP savings for this floor-plan, compared to the Border and Uniform distributions of BBs, are 29% and 49%, respectively. Hence, for the rest of the paper, this memory floor-plan is assumed.

6.3. Evaluation of different packing techniques

This subsection evaluates the efficiency of the NAROUTO framework in handling designs with heterogeneous components. As already mentioned, Quartus synthesis and technology mapping translate these components into a single type of BB, ignoring their functionality. The results of this analysis are summarized in Table 6. The second column gives the number of BBs found in VPR-5.0 (equal to the number of BBs retrieved from Quartus synthesis), while the third and fifth columns give the corresponding values after SP and FP, respectively. Note that for some designs SP leads to a single BB (in this case, the design uses only one BlockRAM).
Hence, during FP there is no further reduction. Furthermore, the fourth column in Table 6 gives the size of each memory block after SP, whereas the corresponding value after FP for a given design is obtained by summing all the partial memory sizes reported at SP. Note that we cannot provide the size of BBs for the VPR-5.0 tool, because this value cannot be identified (BBs in VPR-5.0 do not correspond to actual memory components).

A number of conclusions can be derived from Table 6. The number of BBs retrieved from the Quartus tool is excessively high, and these BBs do not represent actual macro blocks (e.g. BlockRAMs). For instance, for our benchmark suite, VPR-5.0 requires an average of 68.5 BBs per application, while the introduced framework reports about 5 BBs per benchmark (for the SP technique). This overestimation of heterogeneous components implies that VPR-5.0 cannot be employed for adequate architecture-level exploration and/or application mapping onto heterogeneous FPGAs.

Table 4. Evaluation in terms of application's power consumption (mW) for different memory hierarchies. Columns: Benchmark, Reference based on [3], Scenario 1 (Proposed), Scenario 2 (Proposed). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, ucsb_152_tap_fir, Average, Ratio.

Fig. 6. Alternative floor-plans for memory blocks: (a) placed in borders, (b) placed in center, and (c) uniformly distributed.

Table 5. Average PDP for different floor-plans of memory blocks. Columns: Border, Center, Uniform. Rows: Average PDP, Ratio.

Fig. 7. Power-Delay product for different floor-plans of memory blocks.

Apart from the increased number of BBs, the application's netlist after synthesis also incorporates an excessive number of input/output pins. In order to evaluate the efficiency of the NAROUTO framework in handling designs with a realistic number of I/Os, Table 7 summarizes the total and the average number of I/O pins per BB. More specifically, the second and fourth columns refer to the number of I/Os retrieved from the Quartus tool, whereas the third and fifth columns give the corresponding values after the SP approach (by applying the Pin_Multiplexing tool). Note that for this study we assume, without affecting the efficiency of the proposed methodology, that only SP is applied. Based on the results in this table, the pin multiplexing technique leads to designs where each BB incorporates a more realistic number of I/O pins, as compared to existing approaches. More specifically, the existing version of VPR-5.0, which does not incorporate the pin multiplexing technique, assumes that on average each BB contains about 80 I/O pins, whereas we found that only 16 of them actually exist (about 5× fewer I/Os per BB).

Table 6. Number and size of BBs before and after packing. Columns: Benchmark, Existing [3] (# of BBs), SP (# of BBs, Size of BBs), FP (# of BBs). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average.

Table 7. Number of I/O pins for BBs before and after SP. Columns: Benchmark, Total pins for BBs (Before SP, After SP), Average pins per BB (Before SP, After SP). Rows: oc_aes_core_inv, oc_ata_ocidec3, oc_hdlc, oc_minirisc, oc_oc8051, os_blowfish, Average.

Fig. 8. Evaluation in terms of delay for alternative application implementations.

Fig. 10. Power-Delay product for FPGAs under different CMOS technologies.

Table 8. Average PDP for different CMOS technologies. Columns: 45 nm, 65 nm, 90 nm, 130 nm, 180 nm. Rows: PDP, Improvement as compared to 180 nm.

A consequence of having an excessive number of BBs and I/O pins is that the wire-length needed for successful P&R increases considerably. This problem becomes far more important in the highly utilized regions of the device, where routing algorithms employ a wider routing channel in order to avoid congestion. However, such a choice introduces considerable performance degradation. Figs. 8 and 9 highlight the consequences of the limited efficiency of VPR-5.0 in handling designs with BBs. More specifically, the figures give the delay and the power consumption, respectively, for the employed benchmark suite. For both figures, three alternative application implementations are studied: (i) initial (the existing way of implementing applications with the VPR-5.0 tool), (ii) single packed (SP) and (iii) full packed (FP).
From these graphs, it is evident that both SP and FP lead to considerable delay and power savings compared to the initial solution. This improvement is mainly due to better manipulation of memory blocks (fewer blocks, as well as fewer I/Os per BB). Additionally, for a number of benchmarks the initial solution (VPR-5.0) cannot provide results due to limitations (memory overflows) of the host PC (for our study we employed a quad-core machine with 8 GB of RAM).

Fig. 9. Evaluation in terms of power consumption for alternative application implementations.

6.4. Evaluation of different CMOS technologies

The last parameter in our exploration is the CMOS technology. For this purpose the target FPGA is described with a number of well-established models found in relevant references [17-19]. Fig. 10 evaluates, in terms of PDP, the application mapping when FPGA devices are modeled at 45 nm, 65 nm, 90 nm, 130 nm and 180 nm CMOS technologies, whereas Table 8 gives the average PDP values among the studied benchmarks. For demonstration purposes, the values plotted in this figure are normalized over the maximum PDP for each benchmark. A number of conclusions can be derived from this analysis. More specifically, the maximum PDP occurs when the 180 nm technology is assumed, whereas the PDP improvement does not scale linearly with technology. Additionally, the performance enhancement between alternative CMOS technologies seems to be application independent. The last conclusion is very important, since it enables our proposed NAROUTO framework to evaluate the architectural selections of the underlying FPGA device.

7. Conclusions

A novel methodology, together with its supporting tool framework, for enabling architecture-level exploration of heterogeneous FPGAs was proposed. This framework was tuned to enable efficient handling of memory hierarchies on general-purpose reconfigurable devices. Experimental results demonstrate the efficiency of the proposed solution, since we achieve notable delay, power, and area savings compared to the state-of-the-art approach. Finally, the introduced NAROUTO framework is the only software-supported approach that enables evaluation of power and energy dissipation metrics of heterogeneous FPGA devices.

References

[1] J. Pistorius, M. Hutton, A. Mishchenko, R. Brayton, Benchmarking method and designs targeting logic synthesis for FPGAs, in: Proc. of International Workshop on Logic and Synthesis (IWLS), 2007.
[2] M. Gao, J.H. Jiang, Y. Jiang, Y. Li, S. Sinha, R. Brayton, MVSIS, International Workshop on Logic Synthesis.
[3] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W.M. Fang, J. Rose, VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling, in: Proc. of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2009.
[4] S. Sharp, Conquering the Three Challenges of Power Consumption: Why is power such an issue? Power Management, vol. 1, p. 5, August.
[5] K. Nowak, J. Meerbergen, An FPGA architecture with enhanced datapath functionality, in: Proc. of the 2003 ACM/SIGDA 11th International Symposium on Field Programmable Gate Arrays (FPGA), 2003.
[6] K. Poon, S. Wilton, A. Yan, A detailed power model for field-programmable gate arrays, ACM Transactions on Design Automation of Electronic Systems (TODAES) 10 (2), April 2005.
[7] Altera Corporation, Quartus II Software.
[8] CAD tools for FPGAs. Available at: < software.html>.
[9] Berkeley Logic Interchange Format (BLIF), University of California, Berkeley.
[10] Altera Stratix Device Handbook. Available at: < literature/hb/stx/stratix_handbook.pdf>.
[11] Xilinx Virtex-II Pro Handbook. Available at: < documentation/virtex-ii_pro.htm>.
[12] S. Vassiliadis, D. Soudris, Fine and Coarse-Grain Reconfigurable Systems, Springer.
[13] V. Betz, J. Rose, A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers.
[14] C. Sidiropoulos, Development of a design framework for Power/Energy consumption estimation in heterogeneous FPGA architectures, Master thesis, NTUA, Greece. Available at: < software/narouto>.
[15] International Technology Roadmap for Semiconductors (ITRS), Chapter Interconnect.
[16] Altera Corporation, Quartus-II University Interface Program.
[17] W. Zhao, Y. Cao, New generation of Predictive Technology Model for sub-45 nm early design exploration, IEEE Transactions on Electron Devices 53 (11) (2006).
[18] Available from: < architecture_table.html>.
[19] S. Wilton, N. Jouppi, CACTI: an enhanced cache access and cycle time model, IEEE Journal of Solid-State Circuits 31 (5) (1996).
[20] J.M. Rabaey, Low Power Design Essentials, Series on Integrated Circuits and Systems, Springer, New York, NY.
[21] Available from: <

Harry Sidiropoulos received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece. He is currently working towards his Ph.D. at the same university. His research interests include FPGAs and CAD algorithms.

Dr. Kostas Siozios received his Diploma, Master and Ph.D. degrees in Electrical and Computer Engineering from the Democritus University of Thrace, Greece, in 2001, 2003 and 2009, respectively. He is now working as a research associate at the National Technical University of Athens, Greece. His research interests include CAD algorithms, low-power reconfigurable architectures and parallel architectures. He has published more than 53 papers in international journals and conferences, and he has contributed to 4 books of Kluwer and Springer. In recent years he has worked as principal investigator in numerous research projects funded by the European Commission (EC), as well as the Greek Government and Industry.

Prof. Dimitrios Soudris received his Diploma in Electrical Engineering and his Ph.D. degree in Electrical Engineering from the University of Patras, Greece. He worked as a Professor in the Dept. of Electrical and Computer Engineering, Democritus University of Thrace, for 13 years. He is currently working as Ass. Professor in the School of Electrical and Computer Engineering, Dept. Computer Science, of the National Technical University of Athens, Greece. His research interests include embedded systems design, low power VLSI design and reconfigurable architectures. He has published more than 210 papers in international journals and conferences, and he is coauthor/coeditor of five books of Kluwer and Springer. He is leader and principal investigator in numerous research projects funded by the Greek Government and Industry, as well as the European Commission (ESPRIT II-III-IV and 5th & 7th IST). He has served as General Chair and Program Chair for PATMOS 99 and 2000, respectively, and as General Chair of IFIP-VLSI-SOC. He also received an award from INTEL and IBM for the EU project LPGD, and awards in ASP-DAC 05 and VLSI 05 for the EU AMDREL project. He is a member of the IEEE, the VLSI Systems and Applications Technical Committee of IEEE CAS, and the ACM.


More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Abbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University

Abbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University Abbas El Gamal Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program Stanford University Chip stacking Vertical interconnect density < 20/mm Wafer Stacking

More information

A System-Level Stochastic Circuit Generator for FPGA Architecture Evaluation

A System-Level Stochastic Circuit Generator for FPGA Architecture Evaluation A System-Level Stochastic Circuit Generator for FPGA Architecture Evaluation Cindy Mark, Ava Shui, Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver,

More information

THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS

THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS Chi Wai Yu 1, Julien Lamoureux 2, Steven J.E. Wilton 2, Philip H.W. Leong 3, Wayne Luk 1 1 Dept

More information

Application-Specific Mesh-based Heterogeneous FPGA Architectures

Application-Specific Mesh-based Heterogeneous FPGA Architectures Application-Specific Mesh-based Heterogeneous FPGA Architectures Husain Parvez H abib Mehrez Application-Specific Mesh-based Heterogeneous FPGA Architectures Husain Parvez Habib Mehrez Université Pierre

More information

The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays

The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays Steven J.E. Wilton 1, Su-Shin Ang 2 and Wayne Luk 2 1 Dept. of Electrical and Computer Eng. University of British Columbia

More information

ECEN 449 Microprocessor System Design. FPGAs and Reconfigurable Computing

ECEN 449 Microprocessor System Design. FPGAs and Reconfigurable Computing ECEN 449 Microprocessor System Design FPGAs and Reconfigurable Computing Some of the notes for this course were developed using the course notes for ECE 412 from the University of Illinois, Urbana-Champaign

More information

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp Scientia Iranica, Vol. 11, No. 3, pp 159{164 c Sharif University of Technology, July 2004 On Routing Architecture for Hybrid FPGA M. Nadjarbashi, S.M. Fakhraie 1 and A. Kaviani 2 In this paper, the routing

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I Overview Anti-fuse and EEPROM-based devices Contemporary SRAM devices - Wiring - Embedded New trends - Single-driver wiring -

More information

Digital Design Methodology

Digital Design Methodology Digital Design Methodology Prof. Soo-Ik Chae Digital System Designs and Practices Using Verilog HDL and FPGAs @ 2008, John Wiley 1-1 Digital Design Methodology (Added) Design Methodology Design Specification

More information

Saving Power by Mapping Finite-State Machines into Embedded Memory Blocks in FPGAs

Saving Power by Mapping Finite-State Machines into Embedded Memory Blocks in FPGAs Saving Power by Mapping Finite-State Machines into Embedded Memory Blocks in FPGAs Anurag Tiwari and Karen A. Tomko Department of ECECS, University of Cincinnati Cincinnati, OH 45221-0030, USA {atiwari,

More information

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas Power Solutions for Leading-Edge FPGAs Vaughn Betz & Paul Ekas Agenda 90 nm Power Overview Stratix II : Power Optimization Without Sacrificing Performance Technical Features & Competitive Results Dynamic

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

Workspace for '4-FPGA' Page 1 (row 1, column 1)

Workspace for '4-FPGA' Page 1 (row 1, column 1) Workspace for '4-FPGA' Page 1 (row 1, column 1) Workspace for '4-FPGA' Page 2 (row 2, column 1) Workspace for '4-FPGA' Page 3 (row 3, column 1) ECEN 449 Microprocessor System Design FPGAs and Reconfigurable

More information

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning By: Roman Lysecky and Frank Vahid Presented By: Anton Kiriwas Disclaimer This specific

More information

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic ECE 42/52 Rapid Prototyping with FPGAs Dr. Charlie Wang Department of Electrical and Computer Engineering University of Colorado at Colorado Springs Evolution of Implementation Technologies Discrete devices:

More information

A Direct Memory Access Controller (DMAC) IP-Core using the AMBA AXI protocol

A Direct Memory Access Controller (DMAC) IP-Core using the AMBA AXI protocol SIM 2011 26 th South Symposium on Microelectronics 167 A Direct Memory Access Controller (DMAC) IP-Core using the AMBA AXI protocol 1 Ilan Correa, 2 José Luís Güntzel, 1 Aldebaro Klautau and 1 João Crisóstomo

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique P. Durga Prasad, M. Tech Scholar, C. Ravi Shankar Reddy, Lecturer, V. Sumalatha, Associate Professor Department

More information

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today.

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today. Overview This set of notes introduces many of the features available in the FPGAs of today. The majority use SRAM based configuration cells, which allows fast reconfiguation. Allows new design ideas to

More information

Performance Improvement and Size Reduction Scheme over Circuits by Using LUT/MUX Architecture

Performance Improvement and Size Reduction Scheme over Circuits by Using LUT/MUX Architecture Performance Improvement and Size Reduction Scheme over Circuits by Using LUT/MUX Architecture R. Pradeepa 1, S.P. Senthil Kumar 2 M.E. VLSI Design, Shanmuganathan Engineering College, Arasampatti, Pudukkottai-622507,

More information

RTL Coding General Concepts

RTL Coding General Concepts RTL Coding General Concepts Typical Digital System 2 Components of a Digital System Printed circuit board (PCB) Embedded d software microprocessor microcontroller digital signal processor (DSP) ASIC Programmable

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017 Design of Low Power Adder in ALU Using Flexible Charge Recycling Dynamic Circuit Pallavi Mamidala 1 K. Anil kumar 2 mamidalapallavi@gmail.com 1 anilkumar10436@gmail.com 2 1 Assistant Professor, Dept of

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information

Research Article Architecture-Level Exploration of Alternative Interconnection Schemes Targeting 3D FPGAs: A Software-Supported Methodology

Research Article Architecture-Level Exploration of Alternative Interconnection Schemes Targeting 3D FPGAs: A Software-Supported Methodology International Journal of Reconfigurable Computing Volume 2008, Article ID 764942, 18 pages doi:10.1155/2008/764942 Research Article Architecture-Level Exploration of Alternative Interconnection Schemes

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

COE 561 Digital System Design & Synthesis Introduction

COE 561 Digital System Design & Synthesis Introduction 1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design

More information

Digital Design Methodology (Revisited) Design Methodology: Big Picture

Digital Design Methodology (Revisited) Design Methodology: Big Picture Digital Design Methodology (Revisited) Design Methodology Design Specification Verification Synthesis Technology Options Full Custom VLSI Standard Cell ASIC FPGA CS 150 Fall 2005 - Lec #25 Design Methodology

More information

Available online at ScienceDirect. Procedia Technology 24 (2016 )

Available online at   ScienceDirect. Procedia Technology 24 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1120 1126 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) FPGA

More information

FPGA Implementation and Validation of the Asynchronous Array of simple Processors

FPGA Implementation and Validation of the Asynchronous Array of simple Processors FPGA Implementation and Validation of the Asynchronous Array of simple Processors Jeremy W. Webb VLSI Computation Laboratory Department of ECE University of California, Davis One Shields Avenue Davis,

More information

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis VLSI Design Laboratory, Electrical and Computer Engineering

More information

Programmable Logic Devices HDL-Based Design Flows CMPE 415

Programmable Logic Devices HDL-Based Design Flows CMPE 415 HDL-Based Design Flows: ASIC Toward the end of the 80s, it became difficult to use schematic-based ASIC flows to deal with the size and complexity of >5K or more gates. HDLs were introduced to deal with

More information

Performance Imrovement of a Navigataion System Using Partial Reconfiguration

Performance Imrovement of a Navigataion System Using Partial Reconfiguration Performance Imrovement of a Navigataion System Using Partial Reconfiguration S.S.Shriramwar 1, Dr. N.K.Choudhari 2 1 Priyadarshini College of Engineering, R.T.M. Nagpur Unversity,Nagpur, sshriramwar@yahoo.com

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 1502 Design and Characterization of Koggestone, Sparse Koggestone, Spanning tree and Brentkung Adders V. Krishna

More information

Intel Arria 10 FPGA Performance Benchmarking Methodology and Results

Intel Arria 10 FPGA Performance Benchmarking Methodology and Results white paper FPGA Intel Arria 10 FPGA Performance Benchmarking Methodology and Results Intel Arria 10 FPGAs deliver more than a speed grade faster core performance and up to a 20% advantage for publicly

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

DESIGN STRATEGIES & TOOLS UTILIZED

DESIGN STRATEGIES & TOOLS UTILIZED CHAPTER 7 DESIGN STRATEGIES & TOOLS UTILIZED 7-1. Field Programmable Gate Array The internal architecture of an FPGA consist of several uncommitted logic blocks in which the design is to be encoded. The

More information

Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools

Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools Shamik Das, Anantha Chandrakasan, and Rafael Reif Microsystems Technology Laboratories Massachusetts Institute of Technology

More information

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech)

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech) DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech) K.Prasad Babu 2 M.tech (Ph.d) hanumanthurao19@gmail.com 1 kprasadbabuece433@gmail.com 2 1 PG scholar, VLSI, St.JOHNS

More information

Conclusions and Future Work. We introduce a new method for dealing with the shortage of quality benchmark circuits

Conclusions and Future Work. We introduce a new method for dealing with the shortage of quality benchmark circuits Chapter 7 Conclusions and Future Work 7.1 Thesis Summary. In this thesis we make new inroads into the understanding of digital circuits as graphs. We introduce a new method for dealing with the shortage

More information

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) 1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.

More information

High Performance and Area Efficient DSP Architecture using Dadda Multiplier

High Performance and Area Efficient DSP Architecture using Dadda Multiplier 2017 IJSRST Volume 3 Issue 6 Print ISSN: 2395-6011 Online ISSN: 2395-602X Themed Section: Science and Technology High Performance and Area Efficient DSP Architecture using Dadda Multiplier V.Kiran Kumar

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Introduction to FPGA design Rakesh Gangarajaiah Rakesh.gangarajaiah@eit.lth.se Slides from Chenxin Zhang and Steffan Malkowsky WWW.FPGA What is FPGA? Field

More information

Introduction to VHDL Design on Quartus II and DE2 Board

Introduction to VHDL Design on Quartus II and DE2 Board ECP3116 Digital Computer Design Lab Experiment Duration: 3 hours Introduction to VHDL Design on Quartus II and DE2 Board Objective To learn how to create projects using Quartus II, design circuits and

More information

Leso Martin, Musil Tomáš

Leso Martin, Musil Tomáš SAFETY CORE APPROACH FOR THE SYSTEM WITH HIGH DEMANDS FOR A SAFETY AND RELIABILITY DESIGN IN A PARTIALLY DYNAMICALLY RECON- FIGURABLE FIELD-PROGRAMMABLE GATE ARRAY (FPGA) Leso Martin, Musil Tomáš Abstract:

More information

System Verification of Hardware Optimization Based on Edge Detection

System Verification of Hardware Optimization Based on Edge Detection Circuits and Systems, 2013, 4, 293-298 http://dx.doi.org/10.4236/cs.2013.43040 Published Online July 2013 (http://www.scirp.org/journal/cs) System Verification of Hardware Optimization Based on Edge Detection

More information

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage silage@temple.edu ECE Temple University www.temple.edu/scdl Signal Processing Algorithms into Fixed Point FPGA Hardware Motivation

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering

More information

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

More information

Lossless Compression using Efficient Encoding of Bitmasks

Lossless Compression using Efficient Encoding of Bitmasks Lossless Compression using Efficient Encoding of Bitmasks Chetan Murthy and Prabhat Mishra Department of Computer and Information Science and Engineering University of Florida, Gainesville, FL 326, USA

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

ANALYZING THE PERFORMANCE OF CARRY TREE ADDERS BASED ON FPGA S

ANALYZING THE PERFORMANCE OF CARRY TREE ADDERS BASED ON FPGA S ANALYZING THE PERFORMANCE OF CARRY TREE ADDERS BASED ON FPGA S RENUKUNTLA KIRAN 1 & SUNITHA NAMPALLY 2 1,2 Ganapathy Engineering College E-mail: kiran00447@gmail.com, nsunitha566@gmail.com Abstract- In

More information

CS310 Embedded Computer Systems. Maeng

CS310 Embedded Computer Systems. Maeng 1 INTRODUCTION (PART II) Maeng Three key embedded system technologies 2 Technology A manner of accomplishing a task, especially using technical processes, methods, or knowledge Three key technologies for

More information

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Design and Implementation of CVNS Based Low Power 64-Bit Adder Design and Implementation of CVNS Based Low Power 64-Bit Adder Ch.Vijay Kumar Department of ECE Embedded Systems & VLSI Design Vishakhapatnam, India Sri.Sagara Pandu Department of ECE Embedded Systems

More information

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Bradley F. Dutton, Graduate Student Member, IEEE, and Charles E. Stroud, Fellow, IEEE Dept. of Electrical and Computer Engineering

More information

Placement Algorithm for FPGA Circuits

Placement Algorithm for FPGA Circuits Placement Algorithm for FPGA Circuits ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,

More information