Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors

Similar documents
Design Space Exploration Using Parameterized Cores

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

The Nios II Family of Configurable Soft-core Processors

EMBEDDED SOPC DESIGN WITH NIOS II PROCESSOR AND VHDL EXAMPLES

USING C-TO-HARDWARE ACCELERATION IN FPGAS FOR WAVEFORM BASEBAND PROCESSING

ECE332, Week 2, Lecture 3. September 5, 2007

ECE332, Week 2, Lecture 3

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection

FPGA-Based Rapid Prototyping of Digital Signal Processing Systems

Pilot: A Platform-based HW/SW Synthesis System

CHAPTER 7 FPGA IMPLEMENTATION OF HIGH SPEED ARITHMETIC CIRCUIT FOR FACTORIAL CALCULATION

Mixed Signal Verification of an FPGA-Embedded DDR3 SDRAM Memory Controller using ADMS

Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y.

Implementation of Optimized ALU for Digital System Applications using Partial Reconfiguration

Keywords: Soft Core Processor, Arithmetic and Logical Unit, Back End Implementation and Front End Implementation.

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

A High Performance Bus Communication Architecture through Bus Splitting

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems

EN2911X: Reconfigurable Computing Lecture 01: Introduction

Digital Systems Design. System on a Programmable Chip

Memory Systems IRAM. Principle of IRAM

Interfacing a High Speed Crypto Accelerator to an Embedded CPU

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

Networks-on-Chip Router: Configuration and Implementation

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141

Building Data Path for the Custom Instruction. Yong ZHU *

N-Model Tests for VLSI Circuits

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

VLSI DESIGN OF REDUCED INSTRUCTION SET COMPUTER PROCESSOR CORE USING VHDL

Design and Implementation of A Reconfigurable Arbiter

xpilot: A Platform-Based Behavioral Synthesis System

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

Multiple Sequence Alignment Using Reconfigurable Computing

Digital Integrated Circuits

Qsys and IP Core Integration

Design of Embedded Hardware and Firmware

CHAPTER 1 INTRODUCTION

Park Sung Chul. AE MentorGraphics Korea

DESIGN STRATEGIES & TOOLS UTILIZED

Supporting Multithreading in Configurable Soft Processor Cores

COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS

Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs

Performance Analysis of CORDIC Architectures Targeted by FPGA Devices

FPGAs: FAST TRACK TO DSP

Soft processors as a prospective platform of the future

Estimation of worst case latency of periodic tasks in a real time distributed environment

System-on Solution from Altera and Xilinx

Design and Implementation of MP3 Player Based on FPGA Dezheng Sun

Hardware Design. MicroBlaze 7.1. This material exempt per Department of Commerce license exception TSU Xilinx, Inc. All Rights Reserved

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

TIERS: Topology IndependEnt Pipelined Routing and Scheduling for VirtualWire TM Compilation

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

EITF35: Introduction to Structured VLSI Design

A VARIETY OF ICS ARE POSSIBLE DESIGNING FPGAS & ASICS. APPLICATIONS MAY USE STANDARD ICs or FPGAs/ASICs FAB FOUNDRIES COST BILLIONS

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Instantaneous Cylinder Pressure Estimation

COE 561 Digital System Design & Synthesis Introduction

AC : INTRODUCING LABORATORIES WITH SOFT PROCES- SOR CORES USING FPGAS INTO THE COMPUTER ENGINEERING CURRICULUM

Design of a System-on-Chip Switched Network and its Design Support Λ

The University of Reduced Instruction Set Computer (MARC)

Power Optimization in FPGA Designs

AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM

Design and Test Solutions for Networks-on-Chip. Jin-Ho Ahn Hoseo University

Controller Synthesis for Hardware Accelerator Design

EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline

The S6000 Family of Processors

LCD DISPLAY CONTROLLER APPLICATION BY DYNAMICALLY RECONFIGURABLE PLL USING NIOS II

SF-LRU Cache Replacement Algorithm

A Direct Memory Access Controller (DMAC) IP-Core using the AMBA AXI protocol

Hardware Software Co-design and SoC. Neeraj Goel IIT Delhi

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

FPGA Based Digital Design Using Verilog HDL

Lab 1: Using the LegUp High-level Synthesis Framework

Design Methodologies. Full-Custom Design

Design and Implementation of a Data Compression Scheme: A Partial Matching Approach

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

ECE1387 Exercise 3: Using the LegUp High-level Synthesis Framework

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Chapter 5: ASICs Vs. PLDs

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University

Employing Multi-FPGA Debug Techniques

Systems Development Tools for Embedded Systems and SOC s

FPGA Based Intelligent Co-operative Processor in Memory Architecture

Fast Flexible FPGA-Tuned Networks-on-Chip

Embedded Real-Time Video Processing System on FPGA

Implementation of Reduced Power Open Core Protocol Compliant Memory System using VHDL

High-Performance Linear Algebra Processor using FPGA

Fast Evaluation of the Square Root and Other Nonlinear Functions in FPGA

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA

A Dynamic Memory Management Unit for Embedded Real-Time System-on-a-Chip

Digital Design Methodology (Revisited) Design Methodology: Big Picture

Transcription:

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg) Mohammed Shoaib Indian Institute of Technology, Madras, India (ee03b111@ee.iitm.ac.in) Thambipillai Srikanthan Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (astsrikan@ntu.edu.sg) Abstract Instruction set customization is becoming a preferred approach for accelerating high-speed demanding applications. In this paper, we present performance and delay-area product estimation models to accelerate the design of custom instructions on the Nios II configurable processor platform. The proposed models outline the performance bandwidth and delayarea product to enable profitable selection on the type and number of custom instructions, without the need to undertake time-consuming hardware synthesis in the design exploration stage. The models exhibit a high degree of accuracy as they incorporate the architectural dependencies of the arbitrator logic between the Nios II processor and custom hardware. Experimental results reveal that the area-time implications of the arbitrator logic with respect to the number of custom instructions can significantly affect the system s performance and area utilization. 1. Introduction General-purpose processors tend to suffer from performance when executing non-standard operations that are not supported by the instruction set. On the other hand, full hardware implementations necessitate lengthy design time that increases the TTM (Time-To- Market) pressure. In recent years, configurable processors [1][2][3][4] have emerged to bridge the full hardware and processor based implementation gap in the SoC (System-on-a-Chip) design continuum. The fundamental premise of configurable processors is that they can be modified or extended to address specific design issues by changing the processor s feature set without prohibitive cost and lengthy development time. In order to support post-manufacturing configurability, commercial RISPs (Reconfigurable Instruction Set Processors) [5] such as the Altera Nios [6] and Xilinx Microblaze [7] platforms have been introduced. A RISP facilitates critical parts of the application to be implemented on a reconfigurable fabric using a specialized instruction set. Despite the advantages of configurable processors and RISPs, selecting custom instructions from a large number of possible candidates still remains an open issue [2]. Existing methodologies for custom instruction selection that incorporates a hardware synthesis flow in the design exploration process such as that proposed in [8] can become too time consuming. Others methods such as [9] aims to expedite the custom instruction selection process by modeling the performance gain of the custom instruction candidates, but fail to consider the inherent micro-architectural dependencies resulting from the generation of custom logic in RISPs. In this paper, we introduce accurate performance and delay-area product estimation models that can be used to effectively select custom instructions for the Altera Nios II platform. We will show that the areatime implications of the arbitrator logic are substantial enough to affect the overall system performance and area utilization. Unlike previous works, the proposed models incorporate the architectural dependencies of the arbitrator logic between the Nios II processor and custom logic. In the next section, we will provide an introduction to the Nios II platform. Section 3 illustrates an example of the Nios II system development flow that has been modified to include model-based custom instruction selection. Section 4 describes the proposed performance estimation model that incorporates the arbitrator s dependencies on the number of custom instructions. In the next section, the delay-area product estimation model is introduced with an example to justify its significance for area-time trade-off study. Finally, the paper concludes with some considerations for future directions.

2. The Nios II Configurable Processor Platform The Nios II configurable processor platform used to derive the estimation models consists of the Altera Stratix II EP2S60F672-C5ES device and can cater up to 256 custom instructions. As shown in Figure 1, the custom instruction logic connects directly to the Nios II ALU (Arithmetic Logic Unit). There are different custom instruction architectures available to suit the application s requirements. The architectures range from simple, single-cycle combinatorial architectures to extended variable-length, multi-cycle custom instruction architectures. The most common type of implementation is the single-cycle combinatorial architecture that allows for custom logic realizations with two inputs and one output port. components. In this paper, we will show that the delay and area of the arbitration logic grows deterministically with the number of custom instructions, which facilitates the design of accurate models. 3. Modified Development Flow Figure 3 describes the Nios II system development flow with emphasis on instruction set customization. The process to analyze the system requirements have been modified to include two tasks 1) Profiling to identify critical portions of the application, and 2) Model-based custom instruction selection. In the latter, the proposed estimation models can be employed to provide more reliable design exploration for selecting custom instructions in order to meet the application requirements. Application Analyze System Requirements Profiling Application Requirements Custom Instruction Selection Custom Logic NIOS II Cores & Standard Peripherals Define and Generate System Figure 1 The Nios II configurable processor architecture The Nios II platform utilizes the Avalon bus module to interface the Nios II processor with the custom logic, and other peripherals as shown in Figure 2. Figure 2 The Avalon bus module The module is a configurable bus architecture that is auto-generated to fit the interconnection needs of the designer s peripherals. The Avalon bus module consists of the control, data and address signals, and arbitration logic that are connected to the peripheral Hardware Development Flow Verification Software Development Flow Figure 3 Modified NIOS II development flow Next, the Nios II processor, required peripherals and custom logic are defined and automatically generated. The resulting system becomes the target platform for the hardware and software development flow. These development flows may undergo several iterations until the system has been successfully verified. In addition, it may be necessary to predefine and generate a new system with different choices of custom instruction logic in order to meet the application s requirements. This will incur iterations (denoted by the dashed line) that significantly increase the design time. The modified flow aims to reduced such iterations by providing more reliable custom instruction selection through accurate estimation models early in the design phase.

4. Proposed Performance Estimation Models This section describes the proposed performance estimation model. It is noteworthy that the models were derived from experimentation based on unconstrained designs. Although more optimal solutions can be realized by setting tighter constraints for the synthesis process, this would lead to substantial expansion in the area utilization. Moreover, our attempt to formulate a model for constrained designs is hampered by the difficulty in obtaining deterministic results. As such, the proposed models are well suited for early design considerations to reduce subsequent optimization efforts (i.e. during the hardware development flow in Figure 3). This is justified by the fact that constraint settings are usually application dependent, and constrained-based design exploration can be a time-consuming process. 4.1. Modeling the Arbitrator Delay Dependencies Data a which serve as the BCD decoder will increase along with number of And gates and Or gate levels. Hence, although the arbitrator logic hierarchy remains the same (i.e. Nor-And-Or) with the increase of custom logic, the number of gates will grow accordingly, introducing combinatorial and routing delays through the arbitrator. The critical path through the arbitrator was experimentally obtained and was found to follow a deterministic pattern with increasing number of custom instructions as shown in Figure 5. It is noteworthy that the delay through the arbitrator can significantly affect the maximum clock frequency of the system. Best Case - Delay (ns) 4.3 4.2 4.1 4.0 3.9 3.8 3.7 3.6 1 2 3 4 5 6 7 8 9 10 11 Data b Chip select 8 Arbitrator (Sim) Aribitrator (Est) Custom Logic 1 Custom Logic 2 Custom Logic 3 Arbitrator Logic Nios II Processor Chip select 8 Results from custom logic And1 And2 And And1 Result Or1 Worst Case - Delay (ns) 7.6 7.4 7.2 7.0 6.8 6.6 6.4 And2 And And1 Or2 6.2 1 2 3 4 5 6 7 8 9 10 11 And2 Or Arbitrator (Sim) Aribitrator (Est) And Figure 4 The arbitrator logic The arbitrator dependencies on the number of custom instructions were first studied. Figure 4 shows the overview of the arbitrator between the Nios II processor and the custom logic, and a schematic diagram of the arbitrator logic for a three custom logic implementation. The arbitrator employs a complemented BCD coding using the chip select signals to select the required custom logic result. It is evident that the complexity of the arbitrator will grow with the number of custom instructions. For example, as the number of custom instructions increases, the number of Nor gates Figure 5 Simulated (dotted lines) and modeled arbitrator delay with number of custom instructions The variation in the delay observed from Figure 5 is step-wise predictable due to the regular increase in the arbitrator s logic with the number of custom instructions. In addition, the height of the step reduces due to the marginal delay incurred in the arbitrator hierarchy with increasing number of custom logic. Equations (1) and (2) describe the proposed delay models of the arbitrator with respect to the number of custom instructions (c). When compared to the simulated results in Figure 5, the models exhibit an average percentage error of only 0.12% and

0.3476% for the best and worst case respectively. In addition, the maximum percentage error for the arbitrator delay model is only 1.414% and 2.041% for the best and worst case. best Best case arbitrator delay ( ns) = Tarb = 3.70 n < 1 n 3.70 + (0.25 0.04( m 1)) m= 1 n 1 Worst case arbitrator delay ( ns) = T worst arb (1) 6.45 n < 1 (2) n m 6.45 + 2 n 1 m= 1 n = log 2 c (3) 2 4.2. Performance Estimation Model Based on the arbitrator delay model, we introduce a performance estimation model that provides an indication of the speedup achieved over the full Nios II processor implementation, when custom instructions are introduced. The proposed performance estimation model P is shown in equation (4), where ( Clock cycles) proc and ( Clock cycles) proc+ is the number of clock cycles when the entire application is implemented on the Nios II processor, and when custom instructions are introduced respectively. ( Clock cycles) P = ( Clock cycles) T T proc proc+ proc proc+ = (4) T proc is the clock delay when no custom instructions are introduced and can be potentially very low (i.e. 7.22 ns). T + is the clock delay when proc custom instructions are introduced and is limited by the arbitrator and the custom logic delays as described in (5), where T arb can be calculated from (1) and (2), and T denotes the longest critical path of the custom instructions. proc+ proc, ( T T T ) T = max + (5) arb It is noteworthy that although ( Clock cycles) proc+ is generally lower than ( Clock cycles) proc, the clock frequency that can be applied after introducing custom instructions (i.e. 1 T proc+ ) is restricted by the arbitrator delay and critical path of the custom logic. Hence, in general, we find that Tproc+ Tproc. The proposed performance estimation model P therefore provides a more effective means for selecting profitable custom instructions by taking into consideration the reduced clock cycles and maximum clock frequency that can be applied. 4.3. Validating the Proposed Performance Estimation Model In order to validate the proposed performance model, a test program was written with up to two million custom instruction execution calls. This is a reasonable number in data-intensive applications such as image processing. For example, if every pixel in a 1024x768 image necessitates only one custom instruction call, this would already incur close to 0.8 million calls for a single image frame processing. In order to validate the models with respect to the number of custom instruction, each custom logic were made identical to that shown in Figure 6. Data a Data b Constant 1 + 5 Constant 2 Constant 3 Left Shifter 5 X Result Area 24 ALUTs Critical path 4.418 ns Figure 6 Custom logic implementation The delay introduced by the arbitrator due to the critical path and arbitrator delay is significant and becomes a performance bottleneck, which restricts the maximum clock frequency that can be applied. For example, a full Nios II processor implementation can be clocked at 138.5 MHz, while the maximum frequency that is applied for increasing number of custom instructions varies as shown in Figure 7. The limitation in the operation frequency of the customized architecture introduced by the delay through the arbitrator can be observed to be dependant on the number of custom instructions as we maintained all the custom logic to be of the same type.

Maximum Frequency (MHz) 125 120 115 110 105 100 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Simulated Average 5. Modeling the Arbitrator Delay-Area Dependencies In cases where area-time trade-offs are of concern, the delay-area product provides a more effective metric for custom instruction selection. To illustrate the significance of delay-area product in design exploration, we use an example to evaluate the feasibility of employing extended custom instruction architecture on the Nios II platform. Figure 7 Simulated (dotted line) and average maximum operating frequency with number of custom instructions By substituting T proc = 7.22 ns, and the maximum operating frequency obtained in Figure 7 in equation (4), the performance estimates of the system can be obtained. Figure 8 shows the computed best and worstcase performance estimation model, and the actual simulated result (dotted line). We observe the expected performance falls well within the proposed bandwidth. It is noteworthy, that the simulated results will vary for different data sets, but it is guaranteed that they will always fall within the performance bandwidth. In addition, the average simulated performance (nondotted line) exhibits a similar pattern to the proposed model, further justifying the accuracy of the proposed model. Performance 1.90 1.80 1.70 1.60 1.50 1.40 Figure 9 Extended custom instruction implementation Figure 9 shows the extended custom instruction architecture that allows for a single custom logic block to output results for different operations through multiplexers. The choice of using extended custom instruction implementation over the combinatorial architecture may be motivated by the area gain that can be obtained. As shown in the area comparison plot in Figure 10, the area of the multiplexer is lower than the arbitrator for different number of custom instructions. It is also evident that the area incurred by the multiplexer and arbitrator is substantial when compared to the area utilization of the custom logic in Figure 6. 1.30 1.20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Best Case Worst Case Simulated Average Figure 8 Validating the performance model The proposed performance models can be easily extended to multi-cycle custom instruction architectures as it is based only on the number of clock cycles and clock delay, where the estimated clock delay can be derived using (5). Area (ALUT) 300 250 200 150 100 50 0 2 3 4 5 6 7 8 9 10 11 12 Arbitrator Multiplexer Custom Logic Figure 10 Area comparison between the arbitrator and multiplexer (estimated area of custom logic is also shown)

3000 6. Conclusion Delay-Area Product 2500 2000 1500 1000 500 0 2 3 4 5 6 7 8 9 10 11 12 Arbitrator Multiplexer Figure 11 Delay-area product comparison between the arbitrator and multiplexer However, the delay-area product comparison shown in Figure 11 reveals that the combinatorial implementation provides the best area-time trade-offs, and could therefore be a better choice for custom instruction implementation. The proposed delay-area product estimation models are shown in (6) and (7), where c is the number of custom instructions. Best case delay area product = 40(2c 1) (6) Worst case delay area product = 150( c 1) (7) Figure 12 shows the best case and worst-case delayarea product of the arbitrator with increasing number of custom instructions. The simulated values of the delay-area product are shown in dotted lines, which increase in a near-linear fashion for higher orders of custom instructions. Delay-Area Product 1800 1600 1400 1200 1000 800 600 400 200 0 2 3 4 5 6 7 8 9 10 11 12 Worst Case (Sim) Worst Case (Est) Best Case (Sim) Best Case (Est) Figure 12 Delay-area product of arbitrator with number of custom instructions We have proposed accurate estimation models to facilitate efficient custom instruction selection for the Nios II configurable processor platform. The models incorporate the arbitrator s dependencies, which have been shown to be significant enough to affect the overall system s performance and area utilization. An analysis of the arbitrator reveals that the critical path increases deterministically with increasing number of custom instructions. The performance estimation model takes into consideration, the reduced clock cycles and maximum operating frequency when custom instructions are incorporated. In addition, delay-area product estimation model was proposed to facilitate area-time trade-off based design exploration. The proposed models provide for rapid analysis in the custom instruction selection process. Finally, the proposed models are being explored to further improve their accuracy for highly constrained designs by taking into account other inherent micro-architectural constraints and dependencies in RISPs. 7. References 1. Dutt, N., Choi, K. Configurable Processors for Embedded Computing, Computer, Vol. 36, No. 1, 2003, 120-123 2. Henkel, J. Closing the SoC Design Gap, IEEE Computer, Vol. 36, No. 9, 2003, 119-121. 3. Xtensa Microprocessor http//www.tensilica.com 4. ARCtangent Processor http//www.arc.com 5. Barat, F., Lauwereins, R., Deconinck, G. Reconfigurable Instruction Set Processors from a Hardware/Software Perspective, IEEE Transactions on Software Engineering, Vol. 28, No. 9 (2002) 847-862 6. Altera Nios Soft Core Embedded Processor http//www.altera.com 7. Xilinx Platform FPGAs http//www.xilinx.com 8. Fei Sun, Ravi, S., Raghunathan, A., Jha, N.K., Custom- Instruction Synthesis for Extensible-Processor Platforms, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 23, No. 2, February 2004, pp. 216-228. 9. Jason Cong, Yiping Fan, Guoling Han and Zhiru Zhang, "Application-Specific Instruction Generation for Configurable Processor Architectures", International Symposium on Field Programmable Gate Arrays, February 2005, pp. 99-106.