A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT


N. Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, Thessaloniki, Greece

ABSTRACT

In this paper, the architecture of an embedded processor extended with a tightly-coupled coarse-grain Reconfigurable Functional Unit (RFU) is proposed. The efficient integration of the RFU with the control unit and the datapath of the processor eliminates the communication overhead between them. To speed up execution, the RFU exploits Instruction Level Parallelism (ILP) and spatial computation. Also, the proposed integration of the RFU efficiently exploits the pipeline structure of the processor, leading to further performance improvements. Furthermore, a development framework for the introduced architecture is presented. The framework is fully automated, hiding all reconfigurable-hardware-related issues from the user. The hardware model of the architecture was synthesized in a 0.13um process, and area and delay figures were estimated and are reported. A set of benchmarks is used to evaluate the architecture and the development framework. Experimental results show performance improvements along with potential energy reduction.

KEYWORDS

RISP, RFU, tightly-coupled, coarse-grain

1. INTRODUCTION

Reconfigurable Computing is emerging as a challenging opportunity for implementing computationally intensive kernels on embedded systems [DeHon and Wawrzynek, 1999]. By combining the post-fabrication programmability of embedded processors with the computational style most commonly employed in ASIC designs, high performance and flexibility are achieved. One such appealing architecture is the Reconfigurable Instruction Set Processor (RISP) [Barat and Lauwereins, 2000]. RISPs couple reconfigurable hardware to a standard processor, featuring dynamic instruction set extensions. The presence of the reconfigurable hardware allows reuse of hardware resources by adapting the instruction set to the currently executed algorithm.

In this paper, the architecture of an embedded single-issue RISC processor extended with a tightly-coupled coarse-grain RFU is introduced. In our solution, the efficient integration of the RFU into the control unit and the datapath of the processor eliminates the communication overhead between them. To speed up execution, the RFU executes Multiple-Input-Single-Output (MISO) clusters of primitive instructions as reconfigurable instructions. A number of data-independent instructions can be executed in parallel in the RFU, exploiting ILP. In this way, the processor's parallelism increases without the need for an extremely long instruction word, as is the case for VLIW processors. To further increase performance, the introduced architecture uses spatial computation. In particular, a chain of operations with a delay that fits in the processor's clock cycle is executed in a single cycle. Furthermore, the careful integration of the RFU efficiently exploits the pipeline structure of the processor. The RFU floats between two consecutive pipeline stages and can operate in both of them, combining spatial and temporal computation. This gives the opportunity to execute longer chains of operations in one execution cycle with better utilization of the available hardware.

In addition, a development framework for the introduced architecture is proposed.

The framework is fully automated in the sense that it hides all reconfigurable-hardware-related issues, requiring no interaction with the user other than that of a traditional software design flow. The hardware model of an evaluation version of the proposed architecture has been synthesized in a 0.13um technology, and all components have been evaluated in terms of area and performance. Experimental results show that the area overhead due to the integration of the RFU in the processor core is small. A set of benchmarks was implemented on the architecture using the proposed development framework. Results show significant performance improvements compared to the standalone processor. Moreover, the reduced instruction memory accesses of the proposed architecture can potentially lead to reduced energy consumption.

The paper is organized as follows. Section 2 discusses related work. In Section 3 the proposed architecture is presented in detail, while the development framework is described in Section 4. Experimental results derived after the synthesis of the architecture and the execution of a benchmark set are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. RELATED WORK

The overwhelming majority of the proposed reconfigurable systems fall into two main categories based on the coupling type between the processor and the reconfigurable hardware: 1) the reconfigurable hardware is a co-processor communicating with the main processor, and 2) the reconfigurable hardware is a functional unit of the processor pipeline (we will refer to this category as RFU from now on).

The first category includes, among others, Garp, NAPA, Molen, REMARC, and PipeRench [Callahan T. J. et al.], [Gokhale M. B. and Stone J. M., 1998], [Vassiliadis S. et al., 2004], [Miyamori T. and Olukotun K., 1999], [Goldstein S. C. et al.]. In this case, the coupling between the processor and the reconfigurable hardware (RH) is loose; communication is performed explicitly, using special instructions to move data and control directives to and from the RH. To hide the overhead introduced by this type of communication, the number of clock cycles for each use of the RH must be high. Furthermore, the RH usually features direct connections to memory and state registers, and can operate in parallel with the processor. In this way, the achievable performance increase is significant. However, only parts of the code that interact weakly with the rest of the code can be mapped to the RH and exploit this performance gain. These parts of the code must be identified and replaced with the appropriate special instructions. Garp and Molen feature automation of this process, but only for loop bodies and complete functions, respectively. For NAPA and PipeRench this process is performed manually.

Examples of the second category are systems such as PRISC, Chimaera, and XiRisc [Razdan R. and Smith M. D., 1994], [Ye Z. A. et al., 2000], [La Rosa A. et al., 2005]. Here, communication is performed implicitly and the coupling is tighter. Data is read from and written directly to the processor's register file, while the RH is treated as another functional unit of the processor. This makes the control logic simple and eliminates the communication overhead, but an opcode space explosion is likely. In this case, the parts of the code implemented in the RH are smaller and can be seen as dynamic extensions of the processor's instruction set. Fully automated compilers have not been reported in the literature for this category either. For example, in XiRisc the identification of the extracted computational kernel must be performed manually, while PRISC and Chimaera feature no selection process for the identified instructions.

Our approach falls into the second category, since it tightly couples an RFU to the processor core. The implicit communication offers the possibility of considering the whole application for acceleration, and not just kernels, which is usually the case for the co-processor approach. Even though smaller speedups are achieved for the kernels compared to the co-processor approach, they are achieved across the whole application. Thus, the average speedup of the application is preserved. In addition, most of the proposed architectures use fine-grain FPGAs (Garp, NAPA, Molen, PRISC, Chimaera, and XiRisc) rather than coarse-grain (REMARC and PipeRench) reconfigurable hardware. FPGAs exhibit higher flexibility but require large configuration memories, suffer from large reconfiguration overhead, and require expensive multi-context structures to alleviate this overhead. Furthermore, FPGAs require a complex implementation process involving HDL description, synthesis, and place-and-route, which must usually be performed by the user (only Garp has reported an automatic implementation through module mapping). On the other hand, coarse-grain hardware is more suitable for word-level operations, much like the instruction set of a standard processor.

Our intention is to extend a base processor with reconfigurable instruction set extensions. These instructions are clusters of the primitive instructions of the processor's instruction set. Therefore, we choose to incorporate a coarse-grain RFU that can execute these clusters in a more efficient way. Finally, the fact that coarse-grain hardware can be more easily adapted to traditional compilation techniques results in an automated development framework that requires no interaction with the user other than that of a traditional compiler flow.

3. PROPOSED ARCHITECTURE

The proposed ReRISC (Reconfigurable Reduced Instruction Set Computer) architecture consists of a RISC processor core tightly coupled with an RFU. The organization of ReRISC is depicted in Figure 1. At the top level there are the Processor Core, the RFU, and the Interface responsible for the communication between them. The Processor Core is composed of the DataPath and the Control Logic. The Control Logic comprises the Core Control and the Coupling Control, which decode standard and reconfigurable instructions, respectively. The RFU is realized in three layers. The Processing Layer features an array of coarse-grain reconfigurable Processing Elements (PEs) plus a local memory that provides extra read-only operands to the PEs. Communication between PEs is performed by the reconfigurable Interconnection Layer. Finally, the Configuration Layer properly configures the two other layers to execute the required operations. The components of the RFU are configured at design time to achieve better adaptation to the targeted application in terms of performance and hardware utilization.

Figure 1. ReRISC Organization

The ReRISC architecture is presented in more detail in Figure 2. On every execution cycle an instruction is fetched from the Instruction Memory during the first pipeline stage. The instruction opcode is forwarded to the Instruction Decode stage, where a reserved bit indicates its type (i.e. whether it is going to be executed by the RISC core or by the RFU). If the instruction belongs to the Processor Core's standard instruction set, the Core Control decodes the opcode to produce the necessary control signals for the DataPath. Also, two operands are fetched from the register file. If the instruction is going to be executed by the RFU (a reconfigurable instruction), the opcode is decoded by the Coupling Control, which generates control signals for the Processor Core/RFU Interface. In addition, the opcode is forwarded to the RFU's Configuration Layer to configure the Processing and Interconnection Layers. In that case, four operands are fetched from the Processor Core's register file to the RFU. Based on the reconfigurable instruction type, the RFU can execute an instruction across two pipeline stages. Thus, the results of computations are delivered back to the Processor Core either at the Execution or at the Memory Access pipeline stage.

Due to the tight coupling, the RFU is treated by the Processor Core as an extra functional unit capable of executing reconfigurable instructions. Each reconfigurable instruction is expanded into a set of the Processor Core's standard instructions executed by the PEs inside the RFU. The first feature the RFU supports to improve performance is ILP. Specifically, instructions with no data dependencies can be executed in parallel by the RFU. To further improve performance, the RFU can perform spatial computation. In that case, a chain of operations that fits in a certain time budget, which is determined by the Processor Core's critical path, is performed in one cycle. Finally, to increase performance even further, the RFU exploits the Processor Core's pipeline structure to combine spatial and temporal computation. In this way the RFU can execute long chains of operations that do not fit in the processor's critical path by breaking them into two smaller chains, each one executed in one of two successive pipeline stages, namely Execution and Memory Access.

In addition, since the RFU operates on two pipeline stages, two reconfigurable instructions can be processed simultaneously on two different stages, offering better utilization of the available hardware.

Figure 2. ReRISC architecture

3.1 Processor Core

The core is a 32-bit single-issue RISC processor based on the classic five-stage pipelined DLX processor [Hennessy and Patterson, 1991]. It is divided into two main logic regions: the datapath and the control logic, both of which have been properly extended to support the coupling of the RFU. The DataPath can perform all basic operations (i.e. Arithmetic, Logic, Multiplication, Shifting, Memory Access and Conditional Branches). The Control Logic consists of the Core Control and the Coupling Control. The Core Control is responsible for decoding the fetched instruction and generating the control signals for all Processor Core components, while it also resolves any kind of control and data hazards. The Coupling Control extends the capabilities of the Control Logic to support the coupling with the RFU. If the decoded instruction is destined for the RFU for execution, the Coupling Control forwards the opcode of the instruction to the RFU. Also, it generates the control signals necessary for the Interface between the RISC and the RFU. Finally, since the RFU can support up to four operands, the Coupling Control properly extends the data hazard resolution unit.

The opcode of each reconfigurable instruction also encodes the type of operation that must be performed by the RFU. The Control Logic properly decodes the opcode and generates the appropriate control signals for the core data channel in order to support these instructions. The RFU supports the following types of operations: (i) complex arithmetic/logic computations, (ii) data transfer operations with complex addressing modes, and (iii) complex control flow operations. Each reconfigurable instruction can be performed in more than one cycle, leading to performance improvement in addition to reduced instruction memory accesses. The number of cycles required by each reconfigurable instruction is part of its configuration. The Control Logic receives this number and properly stalls the pipeline to support such instructions. The instruction types supported by the RFU ensure the adaptation of the processor to any targeted application.

3.2 Processor Core/RFU Interface

The Interface, depicted in Figure 2, provides on each cycle, at the Operand Fetch pipeline stage, up to four operand values to the RFU from the register file. After the requested computations have been performed, the RFU's results are forwarded back to the core data channel. Multiplexers at the end of the Execution and Memory Access pipeline stages, controlled by the Coupling Control, determine whether results come from the DataPath or the RFU.

A powerful feature of the proposed architecture is the efficient utilization of the pipeline stages. Specifically, the RFU can operate, and make its resources available, not only in the Execution stage but also in the Memory Access stage when performing complex arithmetic/logic computations, producing results that can be delivered to the core channel at the end of that stage.

3.3 RFU

As has been mentioned, the RFU consists of three main architectural layers. The first is the Processing Layer, in which all the computations are performed. The second is the Interconnection Layer, which manages the communication between the PEs. Finally, the Configuration Layer configures the two previously mentioned layers in order to perform the requested operation. When a reconfigurable instruction is identified in the Instruction Decode pipeline stage, its opcode, together with the appropriate operands retrieved from the RISC's register file, is forwarded to the RFU. The Configuration Layer receives the opcode and performs all necessary actions for configuring the RFU. Specifically, if the configuration bit stream of the instruction is stored locally in the configuration memory, the layer retrieves the configuration bits and configures the Processing and Interconnection Layers. Otherwise, a flag indicates that the configuration is missing and the processor stalls until all necessary bits are downloaded to the local configuration memory.

3.3.1 Processing Layer

The Processing Layer is a reconfigurable array of coarse-grain PEs. As we target a high-performance architecture, we choose coarse-grain PEs since they offer great advantages in terms of performance, reconfiguration time, and reconfiguration memory compared with fine-grain PEs. In the current version of the architecture, each PE has been designed to perform the same operations as the processor's DataPath. The structure of the PE is shown in Figure 3. On each execution cycle, two results are produced: the unregistered and the registered output of the PE. Through a multiplexer controlled by a configuration bit, the appropriate output is selected. The proposed structure of the PE offers important features that are discussed in the following.

Figure 3. PE Basic Structure

Spatial Computation: The unregistered output of a PE can be directly connected to the input of another PE, constructing chains of operations that can be processed in just one cycle. If the required time for executing a chain does not exceed the critical path of the Processor Core, the performance improvement offered by the spatial computation is proportional to the depth of the chain.

Spatial-Temporal Computation: The registered output of a basic structure holds the result of an operation computed while the reconfigurable instruction was in the Execution pipeline stage of the Processor Core. This result can be connected to the input of another PE for further computation in the Memory Access stage. In this way temporal computation is performed. Also, the registered output of a basic structure can hold the result of a spatial computation. Furthermore, this result can be connected to another PE participating in another spatial computation. Thus, spatial and temporal computation are combined, providing the opportunity to execute longer chains of normal instructions by fully exploiting the pipeline structure of the Processor Core.

Floating PEs: On each execution cycle, parts of two successive reconfigurable instructions can be executed on the RFU. The stage in which each PE operates is determined by the configuration of the reconfigurable instruction. From this point of view, PEs can be seen as floating between the Execution and Memory Access pipeline stages of the core processor. This structure offers the opportunity for maximum utilization of the available hardware.
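
To make the spatial/temporal trade-off concrete, the following C++ fragment is a minimal sketch of how a mapper could decide, for a linear chain of dependent operations, which PEs keep their result combinational (staying in the same execution stage) and where a registered output is inserted so that the chain continues in the next stage. The delay values, the names, and the simple greedy policy are illustrative assumptions, not the actual mapper of the development framework described in Section 4.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Illustrative delay model in nanoseconds; real values come from synthesis.
    constexpr double kClockPeriodNs = 4.0;   // processor critical path

    struct Op {
        const char* name;
        double delay_ns;   // PE delay plus interconnect for this operation
    };

    // Greedily pack a linear chain of operations into at most two RFU stages.
    // Returns the stage (0 or 1) of each operation, or an empty vector if the
    // chain does not fit in two stages and must become a multicycle instruction.
    std::vector<int> assign_stages(const std::vector<Op>& chain) {
        std::vector<int> stage(chain.size(), 0);
        int current_stage = 0;
        double accumulated = 0.0;
        for (std::size_t i = 0; i < chain.size(); ++i) {
            if (accumulated + chain[i].delay_ns > kClockPeriodNs) {
                // Register the previous PE's output; continue in the next stage.
                ++current_stage;
                accumulated = 0.0;
                if (current_stage > 1) return {};
            }
            accumulated += chain[i].delay_ns;
            stage[i] = current_stage;
        }
        return stage;
    }

    int main() {
        // Hypothetical chain add -> shift -> add -> sub with made-up delays.
        std::vector<Op> chain = {{"add", 1.6}, {"shl", 0.8}, {"add", 1.6}, {"sub", 1.6}};
        std::vector<int> stages = assign_stages(chain);
        if (stages.empty()) {
            std::printf("chain needs a multicycle reconfigurable instruction\n");
        } else {
            for (std::size_t i = 0; i < chain.size(); ++i)
                std::printf("%s -> stage %d\n", chain[i].name, stages[i]);
        }
        return 0;
    }

In this sketch the first three operations fit within the 4 ns budget and stay in the Execution stage as a spatial chain, while the last one is pushed to the Memory Access stage through a registered output, exactly the combined spatial-temporal behavior described above.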

Combined with the PEs, the Processing Layer features a Local Memory Structure, which is a register file for storing read-only values used by instructions executed in the RFU. Thus, the four operands from the RISC's register file can be extended with constant values, offering the possibility to create larger Multiple-Input-Single-Output (MISO) instructions. MISOs are candidates for execution as reconfigurable instructions in the RFU; therefore, by increasing their size more performance gain can be obtained. The number of stored values that can be addressed, as well as their total number, can be configured at design time. The address of the constant operands of each reconfigurable instruction is part of the instruction's configuration bit stream.

3.3.2 Interconnection Layer

The structure of this layer is illustrated in Figure 4. Conventionally, the Interconnection Layer is constructed with buses and steering logic. As the size of the reconfigurable array is kept relatively small, this structure can be efficient in terms of area and delay. Moreover, it requires a small number of configuration bits, which provides significant performance and area improvements. In addition, full connectivity between the PEs is offered, resulting in the possibility of maximum utilization of the available hardware. The introduced structure features two global blocks for the inter-communication of the RFU: the Input Network and the Output Network. The former is responsible for receiving the operands from the RISC's register file and the local memory, delivering their registered and unregistered values to the following blocks. In this way, operands for both execution stages of the RFU are constructed. The Output Network can be configured to select the appropriate PE result to be delivered to the output of each stage of the RFU.

Figure 4. Interconnection Layer Block Structure

For the intra-communication between the PEs, two blocks are provided for each PE Basic Structure: the Stage Selector and the Operand Selector. The first is configured to select the stage from which the PE receives operands; thus, this block configures the stage in which each PE will operate. The Operand Selector receives the final operands, together with the feedbacks from each PE, and is configured to forward the appropriate values.

3.3.3 Configuration Layer

The components and operation of the Configuration Layer are depicted in Figure 5. On each execution cycle the opcode of the reconfigurable instruction is delivered from the core processor's Instruction Decode stage to the RFU. The opcode is forwarded to a local structure that stores the configuration bits of the locally available instructions. If the required instruction is available, the configuration bits for the Processing and Interconnection Layers are retrieved. Otherwise, a control signal indicates that new configuration bits must be downloaded from an external configuration memory to the local storage structure, and the processor execution stalls.
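
As an illustration only, the C++ sketch below shows one hypothetical way to represent the per-PE configuration that the Interconnection and Configuration Layers control: the PE function, the Stage Selector choice, the Operand Selector sources, and whether the registered or unregistered output is used. The field names, widths and grouping are assumptions made for clarity; the actual bit-level format of the configuration stream is not specified at this level of detail.

    #include <cstdint>

    // Hypothetical per-PE configuration record (illustrative, not the real format).
    enum class PeFunction : std::uint8_t { Add, Sub, Logic, Shift, Multiply };
    enum class Stage      : std::uint8_t { Execution = 0, MemoryAccess = 1 };

    struct PeConfig {
        PeFunction    function;            // operation performed by the PE
        Stage         stage;               // Stage Selector: stage the PE serves
        std::uint8_t  operand_a;           // Operand Selector: register operand,
        std::uint8_t  operand_b;           //   local constant, or PE feedback
        bool          registered_output;   // registered vs. unregistered result
    };

    // One reconfigurable instruction configures every PE plus the Output Network.
    struct ReconfigurableInstruction {
        static constexpr int kNumPEs = 8;  // array size used in the evaluation model
        PeConfig      pe[kNumPEs];
        std::uint8_t  output_select[2];    // Output Network: result chosen per stage
        std::uint8_t  latency_cycles;      // cycles the core must stall for
    };

A record like this would be filled in by the mapper of the development framework (Section 4) and retrieved by the Configuration Layer, whose detailed operation is described next.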

Figure 5. Configuration Layer Structure

In addition, as part of the configuration bit stream of each instruction, the storage structure delivers two words, each of which indicates the resource occupation required for the execution of the instruction in the corresponding stage. These words are forwarded to the Resource Availability Control Logic, which stores the 2nd-Stage Resource Occupation word for one cycle. On each cycle this logic compares the 1st-Stage Resource Occupation of the current instruction with the 2nd-Stage Resource Occupation of the previous instruction. If a resource conflict arises, a control signal indicates to the processor core to stall the pipeline execution for one cycle. Finally, the retrieved configuration bits move through pipeline registers to the first and second execution stages of the RFU. A multiplexer, controlled by the Resource Configuration bits, selects the correct configuration bits for each PE and its corresponding interconnection network.

The inner structure of the Configuration Bits Local Storage Structure is presented in more detail in Figure 6. Every time a new opcode is received, the Availability Check component compares its value with those in the Table of Available Configurations. Then, a Configuration ID is produced and decoded by the ID Decoder in order to produce the address of the Local Configuration Memory in which the configuration bits are stored. The ID Decoder can also produce a Miss Flag, if the configuration is not available locally, or an Error Flag, if more than one configuration matches the requested instruction. In the second case the processor halts with an error code.

Figure 6. Configuration Bits Local Storage Structure

The Miss Flag indicates that new configuration bits must be downloaded to the local configuration memory. To perform such an operation, only the opcode in the Table and the corresponding configuration bits at the correct address of the local memory must be downloaded. The number of cycles required for a new instruction to be downloaded depends on the number of PEs available. For an array with eight PEs, designed for evaluation purposes, 134 bits are required for each reconfigurable instruction. If a 32-bit bandwidth to the external configuration memory is available, only five cycles are required to download the configuration bits of a new instruction. For a memory structure with sixteen available reconfigurable instructions, again used in the evaluation version of the proposed architecture, eighty cycles are required for refreshing the whole structure.

The overhead is very small and can be further reduced with a higher bandwidth.

4. DEVELOPMENT FRAMEWORK

Our approach to compiling for ReRISC primarily involves the incorporation, transparently to the user, of compiler extensions that support the reconfigurable instruction set extensions. Under this requirement, we developed an automated development framework for ReRISC, whose organization is depicted in Figure 7. The complete flow is divided into five distinct stages, namely: 1) Front-End, 2) Profiling, 3) Instruction Generation, 4) Instruction Selection, and 5) Back-End. Each stage of the flow is presented in detail below.

Figure 7. Proposed Compilation System Flow

Front-End: The framework supports C/C++ code, which is first fed to the front-end. MachSUIF [Smith M.D. and Holloway G., 2002] is used to generate the Control and Data Flow Graph (CDFG) of the application using the SUIFvm Intermediate Representation (IR). In addition, a number of machine-independent optimizations (e.g. dead code elimination, strength reduction) are performed on the CDFG. The output of this stage is an optimized IR in the form of a CDFG.

Profiling: A MachSUIF pass has been developed that instruments the CDFG with profiling annotations, which mark the entrances and exits of basic blocks (we will refer to DFGs as basic blocks from now on). A modified m2c pass (the original is supplied with MachSUIF) translates the CDFG to equivalent C code, while the annotations regarding the basic blocks are converted to program counters. By compiling and executing the generated code, profiling information for the execution frequency of the basic blocks is collected.

Instruction Generation: The instruction generation stage is divided into two steps. The goal of the first step (pattern generation) is the identification of complex patterns of primitive operations (e.g. SUIFvm operations) that can be merged into one reconfigurable instruction. Pattern generation is performed using an in-house framework for the automated extension of embedded processors described in [Kavvadias N. and Nikolaidis S., 2005]. The pattern generation engine is based on the MaxMISO (maximal multiple-input single-output) algorithm [Alippi A. et al., 1999], which identifies the maximal non-overlapping connected subgraphs of the basic blocks that produce a single computation result. An enhanced version of the algorithm implemented in [Alippi A. et al., 1999] has been used. These enhancements consist of user-defined parameters which control: 1) the maximum number of inputs of the pattern, 2) the types of operations included in the pattern (e.g. ALU, multiply, etc.), 3) the permitted types of the pattern (e.g. computation, addressing or control flow, as described in Section 3), and 4) the maximum number of operations in the pattern. Exploiting these features, the user can configure the architecture at design time to fine-tune it towards an application domain.
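
As a rough illustration of this pattern-generation step, the C++ fragment below sketches MISO cluster growth under such user constraints: starting from a seed operation, a predecessor is absorbed only if all of its consumers already lie inside the cluster, so the cluster keeps a single output. The data structures, the constraint checks, and the growth order are simplified assumptions; they do not reproduce the exact algorithm of [Alippi A. et al., 1999] or of the in-house framework.

    #include <cstddef>
    #include <set>
    #include <vector>

    struct Node {
        std::vector<int> preds;   // producers of this operation's operands
        std::vector<int> succs;   // consumers of this operation's result
        bool allowed;             // operation class permitted by the user (e.g. ALU)
    };

    struct Constraints {
        std::size_t max_ops;      // maximum number of operations per pattern
        std::size_t max_inputs;   // maximum number of pattern inputs
    };

    // Grow a single-output cluster rooted at 'root'. A predecessor joins only if
    // all of its consumers are already inside, preserving the MISO property.
    std::set<int> grow_miso(const std::vector<Node>& dfg, int root, const Constraints& c) {
        std::set<int> cluster = {root};
        bool grew = true;
        while (grew) {
            grew = false;
            for (int n : std::set<int>(cluster)) {            // iterate over a snapshot
                for (int p : dfg[n].preds) {
                    if (cluster.count(p) || !dfg[p].allowed) continue;
                    bool only_internal_uses = true;
                    for (int s : dfg[p].succs)
                        if (!cluster.count(s)) { only_internal_uses = false; break; }
                    if (only_internal_uses && cluster.size() < c.max_ops) {
                        cluster.insert(p);
                        grew = true;
                    }
                }
            }
        }
        // Count external inputs and reject patterns that need too many operands.
        std::set<int> inputs;
        for (int n : cluster)
            for (int p : dfg[n].preds)
                if (!cluster.count(p)) inputs.insert(p);
        if (inputs.size() > c.max_inputs) cluster.clear();
        return cluster;
    }

Clusters of this kind are then handed to the mapper described next, which checks their delay against the pipeline stages of the RFU.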

In the second step, the mapping of the previously identified patterns onto the RFU is performed and the actual reconfigurable instruction set extensions are generated. A mapper for the target RFU has been developed for this purpose. Since all resource constraints have been resolved in the pattern generation step and the 1-D array of the RFU offers full connectivity, the implementation of the mapper is significantly simplified. The steps performed by the mapper are:
1. Calculate the latency of each operation in the pattern. This latency includes the accumulated latencies of the operation's predecessors in the pattern's chains. The latency is calculated using user parameters defining the delay of the modules of the RFU (PEs, interconnection, etc.). These parameters are provided by the designer of the architecture.
2. Place each operation in a PE and appropriately configure its functionality.
3. Assign the PE to the appropriate pipeline stage for execution, based on the calculated delay and the type of the pattern (e.g. computation, addressing, etc.). This is performed by selecting the registered or unregistered output of the PE.
4. Configure the multiplexers of the 1-D array for the appropriate interconnection of the PEs.
5. Report the reconfigurable instruction semantics (e.g. latency, type, resources, etc.).

Instruction Selection: In this stage, the final instruction set extensions are selected. The only metric for the selection of an instruction is the offered speed-up. The instruction selection stage has also been implemented to automatically estimate the speed-up of each instruction. Firstly, the static speed-up of each instruction is calculated. This is accomplished by comparing the software versus the hardware (RFU) execution cycles of the instruction. The software execution cycles are equal to the number of operations of which the instruction consists, while the hardware cycles have been reported by the mapper in the previous step. The static speed-ups are multiplied by the execution frequency of the basic block (derived at the profiling stage) to obtain the dynamic speed-ups. Finally, we perform pair-wise graph isomorphism checks on the set of instructions. The VFLib2 graph matching library [Foggia P.] is used for this purpose. A set of isomorphic instructions defines a group, for which the offered speed-up is calculated by summing the dynamic speed-ups of the group members. The instructions/groups are ranked based on their dynamic speed-ups and the best are selected. The output of this stage is the reconfigurable instruction set extensions, in addition to statistics (speed-up, number of instructions, etc.) that are presented in Section 5.

Back-End: The back-end of the framework flow is the only stage that has not yet been fully implemented. However, since reconfigurable instructions do not require any special manipulation for the communication and synchronization between the processor and the RFU, the back-end is much like any traditional processor back-end, performing tasks like scheduling, register allocation, etc.

5. EXPERIMENTAL RESULTS

5.1 Architecture Synthesis

For the evaluation of the proposed architecture, a hardware model described in VHDL was designed. The configuration used for the evaluation model of the RFU is presented in Table 1. The model was synthesized with an STM 0.13um process. Area and delay values were collected for all components of the architecture and are presented in Table 2. The RFU requires approximately 3.3 times the area required by the Processor Core. However, the presented version of the proposed architecture does not target any specific application domain. Tuning the RFU for an application domain can result in fewer PEs and reduced PE functionality, which can dramatically decrease the RFU area requirements. Furthermore, it must be pointed out that no instruction and data caches were taken into account. Caches dominate the area of embedded processors, and thus the RFU area overhead is comparatively small. Measurements of the delay of the Processor Core and of critical components of the RFU structure are presented in Table 3. The critical path of the processor core is 4 nsec and is not influenced by the incorporation of the RFU. Since the RFU extends the capabilities of the Processor Core, this critical path determines the clock frequency of the proposed architecture. The table also presents delays for all types of processing elements and interconnections of the RFU. These values are provided to the development framework to accomplish the mapping process as described in Section 4.
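
As an illustration of how these delays constrain a reconfigurable instruction, consider a chain of two dependent arithmetic operations mapped onto a single execution stage and assume, as a simplification, that the stage delay is the sum of the traversed input, PE, feedback and output delays of Table 3: 0.4 + 1.2 + 0.3 + 1.2 + 0.2 = 3.3 nsec, which is within the 4 nsec clock budget, so the two operations can execute as a spatial computation in one cycle. A chain involving the 2.9 nsec multiplier, by contrast, quickly exceeds the budget and must be split across the two stages or over multiple cycles, as illustrated in the next subsection.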

Table 1. Configuration used for the RFU Evaluation Model
Granularity (except the 16x16 multiplier): 32 bits
Number of Processing Elements: 8
Processing Elements Functionality: Arithmetic, Logic, Shifter, Multiplier
Configuration Memory Size: 16 words of 134 bits
Local Operand Memory Size: 8 operands of 32 bits
Number of Provided Local Operands: 4

Table 2. Area Requirements for Processor Core and RFU
Components (area in mm2): Processor Core; RFU Processing Layer; RFU Interconnection Layer; RFU Configuration Layer; RFU Total

Table 3. Delay Estimations for Processor Core and RFU / Reconfigurable Instruction Delay Model
Processor Core: 4 nsec | Node with I/O edges: 1 cycle
Arithmetic PE: 1.2 nsec | Arithmetic Node: 0.3 cycles
Logic PE: 0.1 nsec | Logic Node:
Shifter PE: 0.4 nsec | Shifter Node: 0.1 cycles
Multiplier PE: 2.9 nsec | Multiplier Node:
Input to PE: 0.4 nsec | Primary Input Edge: 0.1 cycles
Feedback to PE: 0.3 nsec | Edge connecting nodes:
PE to Output: 0.2 nsec | Primary Output Edge:

5.2 Demonstration

In order to demonstrate the operation of the RFU, two reconfigurable instructions derived from the quantization algorithm are used. The DFGs of the instructions are shown in Figure 8. The instruction in Figure 8a consists of six operations and requires three register and two constant operands. The mapping of the operations can be performed on both execution stages of the RFU, as presented in Figure 8a. Even though the full instruction cannot be performed as a spatial computation in one execution cycle due to the clock constraint, combining spatial and temporal computation makes this possible. Spatial computation is performed within each of the two execution stages, while temporal computation is achieved by forwarding the result from the first stage to the second. As a result, the reconfigurable instruction that would require six cycles to execute on the Processor Core can be executed in just one cycle in the RFU. In addition, only one Instruction Memory access, rather than six, is required for the execution of the reconfigurable instruction on the RFU, resulting in a reduction of the energy consumption.

Figure 8b presents the second reconfigurable instruction. Because of the clock constraint, there is no way to execute the full instruction on the RFU in one cycle. However, it can be executed as a two-cycle reconfigurable instruction. The two successive multiplications are performed in the 1st Execution Stage of the RFU, while the processor stalls for one cycle, offering two execution cycles for the completion of the operations. The rest of the instruction is executed in the 2nd Execution Stage of the RFU. Two execution cycles are required for the execution of this instruction on the RFU, rather than the five that would be required for execution on the Processor Core. Also, one access to the instruction memory is required, rather than the two that would be needed if the RFU did not support multicycle instructions.
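
Summarizing the cycle counts of this example, and assuming that each primitive operation costs one cycle on the Processor Core, the first instruction executes in 1 RFU cycle instead of 6 core cycles (a 6x reduction in execution cycles, with 1 instruction fetch instead of 6), while the second executes in 2 RFU cycles instead of 5 (a 2.5x reduction, with 1 fetch instead of the 2 that would be needed without multicycle support).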

Figure 8. Implementation example of two reconfigurable instructions in the proposed architecture

5.3 Performance Evaluation

To evaluate the performance of the proposed architecture, we have considered the set of benchmarks shown in Table 4, which includes a brief description of each. Since we are currently not targeting any specific application domain, we selected a set of applications from various application fields. The set includes both kernels (1-3) and complete applications (4-10). The C source code of each benchmark has been fed to the framework. All experimental results have been automatically generated by the framework and are instruction accurate. Since no Operating System (OS) support is currently available for our architecture, all OS calls (like printf, fopen, etc.) are taken over by the host operating system without being considered in the evaluation. Speedups have been calculated by comparing the base RISC processor of the architecture with and without support of the RFU.

Table 4. Benchmarks Description
1. dct: Discrete Cosine Transform used on 8x8 image blocks
2. quant: Quantization operation used in JPEG compression
3. vlc: Variable Length Coding for JPEG and MPEG compression algorithms
4. dijkstra: Dijkstra's algorithm to find the shortest path
5. stringsearch: A Pratt-Boyer-Moore string search algorithm
6. crc32: Checksum algorithm, 32-bit Cyclic Redundancy Code
7. gost: Cryptographic algorithm, the Russian analog of DES
8. sha: NIST Secure Hash Algorithm
9. rs-encode: Reed-Solomon encoder
10. mpeg4_senc: Shape Coding for MPEG-4

Figure 9 presents the achieved speedups for the implementation of the benchmarks on the proposed architecture. The speedups vary from x2.2 for the crc32 algorithm to x3.4 for the gost cryptographic algorithm. The average value is x2.9, which clearly indicates the performance efficiency of the proposed architecture. As can be observed, similar speedups are produced for kernels and complete applications. Based on the well-known Amdahl's law, accelerating only part of an application (usually referred to as the kernels) by a factor S produces an overall application speedup that is only a fraction of S. This is usually the case for the co-processor approach. Our approach attempts to accelerate the whole application. Thus, even though a smaller speedup is achieved for the kernels compared to the co-processor approach, the speedup is maintained for the complete application.
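
To see the difference quantitatively, suppose (purely as an illustration) that kernels accounting for 80% of the execution time are accelerated by a factor of 4 while the remaining code is untouched; Amdahl's law then gives an overall speedup of 1 / (0.2 + 0.8/4) = 2.5x. Accelerating the whole application, even by the smaller average factor of x2.9 measured here, preserves that x2.9 at the application level.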

Furthermore, Figure 10 illustrates another benefit produced by the incorporation of the RFU: the reduction of the instruction fetches required to execute an application, which in turn means a reduction in instruction memory accesses. The reduction varies from 53% to 75%, the latter produced for MPEG4 shape coding, with an average value of 63%. Since a major source of energy consumption in embedded processors is instruction memory accesses [Benini L. et al., 2001], significant energy savings can be produced.

Figure 9. Benchmark speedups

Figure 10. Instruction fetches reduction for each benchmark

6. CONCLUSION

A RISP processor, which tightly couples a coarse-grain RFU, has been proposed, and the complete architecture has been presented. The supported features of the RFU, namely ILP and spatial computation, together with its integration in the processor's structure, aim to improve performance. A hardware description model of the architecture has been designed and synthesized using a 0.13um technology. Synthesis results showed that the incorporation of the RFU into the base processor introduces no delay overhead and a reasonable area overhead. Furthermore, a development framework for the introduced architecture has been presented. Using this framework, a set of benchmarks has been implemented. Results show significant performance improvements in addition to reduced instruction memory accesses, which can potentially reduce energy consumption.

ACKNOWLEDGMENT

This work was supported by the General Secretariat of Research and Technology of Greece and the European Union.

REFERENCES

Alippi A. et al., 1999. A DAG-Based Design Approach for Reconfigurable VLIW Processors. IEEE International Conference on Design and Test in Europe, Munich, Germany.
Barat F. and Lauwereins R., 2000. Reconfigurable Instruction Set Processors: A Survey. IEEE International Workshop on Rapid System Prototyping.
Benini L. et al., 2001. A Power Modeling and Estimation Framework for VLIW-based Embedded Systems. Proceedings of the Int. Workshop on Power And Timing Modeling, Optimization and Simulation (PATMOS), Sept. 2001.
Callahan T. J. et al., 2000. The Garp Architecture and C Compiler. IEEE Computer, vol. 33, no. 4.
DeHon A. and Wawrzynek J., 1999. Reconfigurable Computing: What, Why, and Implications for Design Automation. Design Automation Conference (DAC).
Foggia P. The VFLib Graph Matching Library.
Gokhale M. B. and Stone J. M., 1998. NAPA C: Compiling for a Hybrid RISC/FPGA Architecture. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM).
Goldstein S. C. et al., 1999. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. 26th Annual Int. Symposium on Computer Architecture.
Hennessy J. and Patterson D., 1991. Computer Architecture: A Quantitative Approach. Morgan Kaufmann.
Kavvadias N. and Nikolaidis S., 2005. Automated Instruction-Set Extension of Embedded Processors with Application to MPEG-4 Video Encoding. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP'05).
La Rosa A. et al., 2005. Software Development for High-Performance, Reconfigurable, Embedded Multimedia Systems. IEEE Design and Test of Computers, vol. 22, no. 1.
Miyamori T. and Olukotun K., 1999. REMARC: Reconfigurable Multimedia Array Co-Processor. IEICE Trans. Information Systems, vol. E82-D, no. 2.
Razdan R. and Smith M. D., 1994. A High-Performance Microarchitecture with Hardware-Programmable Functional Units. 27th Annual Int. Symposium on Microarchitecture (MICRO-27).
Smith M.D. and Holloway G., 2002. An Introduction to Machine SUIF and Its Portable Libraries for Analysis and Optimization. Technical report, Division of Engineering and Applied Sciences, Harvard University, USA.
Vassiliadis S. et al., 2004. The MOLEN Polymorphic Processor. IEEE Transactions on Computers, vol. 53, no. 11.
Ye Z. A. et al., 2000. A C Compiler for a Processor with a Reconfigurable Functional Unit. Int. Symposium on Field Programmable Gate Arrays (FPGA).


More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

Compiling for the Molen Programming Paradigm

Compiling for the Molen Programming Paradigm Compiling for the Molen Programming Paradigm Elena Moscu Panainte 1, Koen Bertels 1, and Stamatis Vassiliadis 1 Computer Engineering Lab Electrical Engineering Department, TU Delft, The Netherlands {E.Panainte,K.Bertels,S.Vassiliadis}@et.tudelft.nl

More information

Data Parallel Architectures

Data Parallel Architectures EE392C: Advanced Topics in Computer Architecture Lecture #2 Chip Multiprocessors and Polymorphic Processors Thursday, April 3 rd, 2003 Data Parallel Architectures Lecture #2: Thursday, April 3 rd, 2003

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor

Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor Vu Manh Tuan, Yohei Hasegawa, Naohiro Katsura and Hideharu Amano Graduate School of Science

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Automatic Compilation to a Coarse-Grained Reconfigurable System-opn-Chip

Automatic Compilation to a Coarse-Grained Reconfigurable System-opn-Chip Automatic Compilation to a Coarse-Grained Reconfigurable System-opn-Chip GIRISH VENKATARAMANI and WALID NAJJAR University of California, Riverside FADI KURDAHI and NADER BAGHERZADEH University of California,

More information

Instructor Information

Instructor Information CS 203A Advanced Computer Architecture Lecture 1 1 Instructor Information Rajiv Gupta Office: Engg.II Room 408 E-mail: gupta@cs.ucr.edu Tel: (951) 827-2558 Office Times: T, Th 1-2 pm 2 1 Course Syllabus

More information

Course web site: teaching/courses/car. Piazza discussion forum:

Course web site:   teaching/courses/car. Piazza discussion forum: Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

More information

Design of a Processor to Support the Teaching of Computer Systems

Design of a Processor to Support the Teaching of Computer Systems Design of a Processor to Support the Teaching of Computer Systems Murray Pearson, Dean Armstrong and Tony McGregor Department of Computer Science University of Waikato Hamilton New Zealand fmpearson,daa1,tonymg@cs.waikato.nz

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Abstract The proposed work is the design of a 32 bit RISC (Reduced Instruction Set Computer) processor. The design

More information

Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip

Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip Girish Venkataramani Carnegie-Mellon University girish@cs.cmu.edu Walid Najjar University of California Riverside {girish,najjar}@cs.ucr.edu

More information

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas The CPU Design Kit: An Instructional Prototyping Platform for Teaching Processor Design Anujan Varma, Lampros Kalampoukas Dimitrios Stiliadis, and Quinn Jacobson Computer Engineering Department University

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

RED: A Reconfigurable Datapath

RED: A Reconfigurable Datapath RED: A Reconfigurable Datapath Fernando Rincón, José M. Moya, Juan Carlos López Universidad de Castilla-La Mancha Departamento de Informática {frincon,fmoya,lopez}@inf-cr.uclm.es Abstract The popularity

More information

MARKET demands urge embedded systems to incorporate

MARKET demands urge embedded systems to incorporate IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 3, MARCH 2011 429 High Performance and Area Efficient Flexible DSP Datapath Synthesis Sotirios Xydis, Student Member, IEEE,

More information

A Streaming Multi-Threaded Model

A Streaming Multi-Threaded Model A Streaming Multi-Threaded Model Extended Abstract Eylon Caspi, André DeHon, John Wawrzynek September 30, 2001 Summary. We present SCORE, a multi-threaded model that relies on streams to expose thread

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

An ASIP Design Methodology for Embedded Systems

An ASIP Design Methodology for Embedded Systems An ASIP Design Methodology for Embedded Systems Abstract A well-known challenge during processor design is to obtain the best possible results for a typical target application domain that is generally

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor

Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor Ying Chen, Simon Y. Chen 2 School of Engineering San Francisco State University 600 Holloway Ave San Francisco, CA 9432

More information

Introduction to reconfigurable systems

Introduction to reconfigurable systems Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

Computer Architecture V Fall Practice Exam Questions

Computer Architecture V Fall Practice Exam Questions Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL ARCHITECTURAL-LEVEL SYNTHESIS Motivation. Outline cgiovanni De Micheli Stanford University Compiling language models into abstract models. Behavioral-level optimization and program-level transformations.

More information

Supporting Multithreading in Configurable Soft Processor Cores

Supporting Multithreading in Configurable Soft Processor Cores Supporting Multithreading in Configurable Soft Processor Cores Roger Moussali, Nabil Ghanem, and Mazen A. R. Saghir Department of Electrical and Computer Engineering American University of Beirut P.O.

More information

Instruction Pipelining

Instruction Pipelining Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages

More information

Application of Power-Management Techniques for Low Power Processor Design

Application of Power-Management Techniques for Low Power Processor Design 1 Application of Power-Management Techniques for Low Power Processor Design Sivaram Gopalakrishnan, Chris Condrat, Elaine Ly Department of Electrical and Computer Engineering, University of Utah, UT 84112

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Chapter 2 Lecture 1 Computer Systems Organization

Chapter 2 Lecture 1 Computer Systems Organization Chapter 2 Lecture 1 Computer Systems Organization This chapter provides an introduction to the components Processors: Primary Memory: Secondary Memory: Input/Output: Busses The Central Processing Unit

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection Sunil Shukla 1,2, Neil W. Bergmann 1, Jürgen Becker 2 1 ITEE, University of Queensland, Brisbane, QLD 4072, Australia {sunil, n.bergmann}@itee.uq.edu.au

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

ECE 486/586. Computer Architecture. Lecture # 7

ECE 486/586. Computer Architecture. Lecture # 7 ECE 486/586 Computer Architecture Lecture # 7 Spring 2015 Portland State University Lecture Topics Instruction Set Principles Instruction Encoding Role of Compilers The MIPS Architecture Reference: Appendix

More information

Embedded Systems Ch 15 ARM Organization and Implementation

Embedded Systems Ch 15 ARM Organization and Implementation Embedded Systems Ch 15 ARM Organization and Implementation Byung Kook Kim Dept of EECS Korea Advanced Institute of Science and Technology Summary ARM architecture Very little change From the first 3-micron

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

Scalable Multi-cores with Improved Per-core Performance Using Off-the-critical Path Reconfigurable Hardware

Scalable Multi-cores with Improved Per-core Performance Using Off-the-critical Path Reconfigurable Hardware Scalable Multi-cores with Improved Per-core Performance Using Off-the-critical Path Reconfigurable Hardware Tameesh Suri and Aneesh Aggarwal Department of Electrical and Computer Engineering State University

More information

CONTACT: ,

CONTACT: , S.N0 Project Title Year of publication of IEEE base paper 1 Design of a high security Sha-3 keccak algorithm 2012 2 Error correcting unordered codes for asynchronous communication 2012 3 Low power multipliers

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

High-Level Synthesis (HLS)

High-Level Synthesis (HLS) Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information