A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT


N. Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, Thessaloniki, Greece

ABSTRACT

In this paper, the architecture of an embedded processor extended with a tightly-coupled coarse-grain Reconfigurable Functional Unit (RFU) is proposed. The efficient integration of the RFU with the control unit and the datapath of the processor eliminates the communication overhead between them. To speed up execution, the RFU exploits Instruction Level Parallelism (ILP) and spatial computation. Also, the proposed integration of the RFU efficiently exploits the pipeline structure of the processor, leading to further performance improvements. Furthermore, a development framework for the introduced architecture is presented. The framework is fully automated, hiding all reconfigurable-hardware-related issues from the user. The hardware model of the architecture was synthesized in a 0.13um process, and area and delay figures were estimated and are reported. A set of benchmarks is used to evaluate the architecture and the development framework. Experimental results show performance improvements along with potential energy reduction.

KEYWORDS

RISP, RFU, tightly-coupled, coarse-grain

1. INTRODUCTION

Reconfigurable Computing is emerging as a challenging opportunity for implementing computationally intensive kernels on embedded systems [DeHon and Wawrzynek, 1999]. By combining the post-fabrication programmability of embedded processors with the computational style most commonly employed in ASIC designs, high performance and flexibility are achieved. One such appealing architecture is the Reconfigurable Instruction Set Processor (RISP) [Barat and Lauwereins, 2000]. RISPs couple reconfigurable hardware to a standard processor, featuring dynamic instruction set extensions. The presence of the reconfigurable hardware allows reuse of hardware resources by adapting the instruction set to the currently executed algorithm.

In this paper, the architecture of an embedded single-issue RISC processor extended with a tightly-coupled coarse-grain RFU is introduced. In our solution, the efficient integration of the RFU into the control unit and the datapath of the processor eliminates the communication overhead between them. To speed up execution, the RFU executes Multiple-Input-Single-Output (MISO) clusters of primitive instructions as reconfigurable instructions. A number of data-independent instructions can be executed in parallel in the RFU, exploiting ILP. In this way, the processor's parallelism increases without the need for an extremely long instruction word, as is the case for VLIW processors. To further increase performance, the introduced architecture uses spatial computation. In particular, a chain of operations with a delay that fits in the processor's clock cycle is executed in a single cycle. Furthermore, the careful integration of the RFU efficiently exploits the pipeline structure of the processor. The RFU floats between two consecutive pipeline stages and can operate in both of them, combining spatial and temporal computation. This gives the opportunity to execute longer chains of operations in one execution cycle with better utilization of the available hardware.

In addition, a development framework for the introduced architecture is proposed.

The framework is fully automated in the sense that it hides all reconfigurable-hardware-related issues, requiring no interaction with the user other than that of a traditional software design flow. The hardware model of an evaluation version of the proposed architecture has been synthesized in a 0.13um technology, and all components have been evaluated in terms of area and performance. Experimental results show that the area overhead due to the integration of the RFU in the processor core is small. A set of benchmarks was implemented on the architecture using the proposed development framework. Results show significant performance improvements compared to the standalone processor. Moreover, the reduced instruction memory accesses of the proposed architecture can potentially lead to reduced energy consumption.

The paper is organized as follows. Section 2 discusses related work. In Section 3 the proposed architecture is presented in detail, while the development framework is described in Section 4. Experimental results derived after the synthesis of the architecture and the execution of a benchmark set are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. RELATED WORK

The overwhelming majority of the proposed reconfigurable systems fall into two main categories based on the coupling type between the processor and the reconfigurable hardware: 1) the reconfigurable hardware is a co-processor communicating with the main processor, and 2) the reconfigurable hardware is a functional unit of the processor pipeline (we will refer to this category as RFU from now on).

The first category includes, among others, Garp, NAPA, Molen, REMARC, and PipeRench [Callahan T. J. et al.], [Gokhale M. B. and Stone J. M., 1998], [Vassiliadis S. et al., 2004], [Miyamori T. and Olukotun K., 1999], [Goldstein S. C. et al.]. In this case, the coupling between the processor and the reconfigurable hardware (RH) is loose; communication is performed explicitly, using special instructions to move data and control directives to and from the RH. To hide the overhead introduced by this type of communication, the number of clock cycles for each use of the RH must be high. Furthermore, the RH usually features direct connections to memory and state registers, and can operate in parallel with the processor. In this way, the achievable performance increase is significant. However, only parts of the code that interact weakly with the rest of the code can be mapped to the RH and exploit this performance gain. These parts of the code must be identified and replaced with the appropriate special instructions. Garp and Molen feature automation of this process, but only for loop bodies and complete functions, respectively. For NAPA and PipeRench this process is performed manually.

Examples of the second category are systems such as PRISC, Chimaera, and XiRisc [Razdan R. and Smith M. D., 1994], [Ye Z. A. et al., 2000], [La Rosa A. et al., 2005]. Here, communication is performed implicitly and the coupling is tighter. Data is read from and written directly to the processor's register file, while the RH is treated as another functional unit of the processor. This makes the control logic simple and eliminates the communication overhead, but an opcode space explosion is likely. In this case, the parts of the code implemented in the RH are smaller and can be seen as dynamic extensions of the processor's instruction set. Fully automated compilers have not been reported in the literature for this category either. For example, in XiRisc the identification of the extracted computational kernel must be performed manually, while PRISC and Chimaera feature no selection process for the identified instructions.

Our approach falls into the second category, since it tightly couples an RFU to the processor core. The implicit communication offers the possibility of considering the whole application for acceleration, and not just kernels, which is usually the case for the co-processor approach. Even though smaller speedups are achieved for the kernels compared to the co-processor approach, they are achieved across the whole application. Thus, the average speedup of the application is preserved. In addition, most of the proposed architectures use fine-grain FPGAs (Garp, NAPA, Molen, PRISC, Chimaera, and XiRisc) rather than coarse-grain (REMARC and PipeRench) reconfigurable hardware. FPGAs exhibit higher flexibility but require large configuration memories, suffer from large reconfiguration overhead, and require expensive multi-context structures to alleviate this overhead. Furthermore, FPGAs require a complex implementation process involving HDL description, synthesis, and place-and-route, which must usually be performed by the user (only Garp has reported an automatic implementation through module mapping). On the other hand, coarse-grain hardware is more suitable for word-level operations, much like the instruction set of a standard processor.

Our intention is to extend a base processor with reconfigurable instruction set extensions. These instructions are clusters of the primitive instructions of the processor's instruction set. Therefore, we choose to incorporate a coarse-grain RFU that can execute these clusters in a more efficient way. Finally, the fact that coarse-grain hardware can be more easily adapted to traditional compilation techniques results in an automated development framework that requires no interaction with the user other than that of a traditional compiler flow.

3. PROPOSED ARCHITECTURE

The proposed ReRISC (Reconfigurable Reduced Instruction Set Computer) architecture consists of a RISC processor core tightly coupled with an RFU. The organization of ReRISC is depicted in Figure 1. At the top level there are the Processor Core, the RFU, and the Interface responsible for the communication between them. The Processor Core is composed of the DataPath and the Control Logic. The Control Logic comprises the Core Control and the Coupling Control, which decode standard and reconfigurable instructions, respectively. The RFU is realized in three layers. The Processing Layer features an array of coarse-grain reconfigurable Processing Elements (PEs) plus a local memory that provides extra read-only operands to the PEs. Communication between PEs is performed by the reconfigurable Interconnection Layer. Finally, the Configuration Layer properly configures the two other layers to execute the required operations. The components of the RFU are configured at design time to achieve better adaptation to the targeted application in terms of performance and hardware utilization.

Figure 1. ReRISC Organization

The ReRISC architecture is presented in more detail in Figure 2. On every execution cycle an instruction is fetched from the Instruction Memory during the first pipeline stage. The instruction opcode is forwarded to the Instruction Decode stage, where a reserved bit indicates its type (i.e. whether it is going to be executed by the RISC core or by the RFU). If the instruction belongs to the Processor Core's standard instruction set, the Core Control decodes the opcode to produce the necessary control signals for the DataPath. Also, two operands are fetched from the register file. If the instruction is going to be executed by the RFU (a reconfigurable instruction), the opcode is decoded by the Coupling Control, which generates control signals for the Processor Core/RFU Interface. In addition, the opcode is forwarded to the RFU's Configuration Layer to configure the Processing and Interconnection Layers. In that case, four operands are fetched from the Processor Core's register file to the RFU. Based on the reconfigurable instruction type, the RFU can execute an instruction across two pipeline stages. Thus, the results of computations are delivered back to the Processor Core either at the Execution or at the Memory Access pipeline stage.

Due to the tight coupling, the RFU is treated by the Processor Core as an extra functional unit capable of executing reconfigurable instructions. Each reconfigurable instruction is expanded into a set of the Processor Core's standard instructions executed by the PEs inside the RFU. The first feature the RFU supports to improve performance is ILP. Specifically, instructions with no data dependencies can be executed in parallel by the RFU. To further improve performance, the RFU can perform spatial computation. In that case, a chain of operations that fits in a certain time budget, which is determined by the Processor Core's critical path, is performed in one cycle. Finally, to increase performance even further, the RFU exploits the Processor Core's pipeline structure to combine spatial and temporal computation. In this way the RFU can execute long chains of operations that do not fit in the processor's critical path by breaking them into two smaller chains, each one executed in one of two successive pipeline stages, namely Execution and Memory Access.

In addition, since the RFU operates on two pipeline stages, two reconfigurable instructions can be processed simultaneously on two different stages, offering better utilization of the available hardware.

Figure 2. ReRISC architecture

3.1 Processor Core

The core is a 32-bit single-issue RISC processor based on the classic five-stage pipelined DLX processor [Hennessy and Patterson, 1991]. It is divided into two main logic regions: the datapath and the control logic, both of which have been properly extended to support the coupling of the RFU. The DataPath can perform all basic operations (i.e. Arithmetic, Logic, Multiplication, Shifting, Memory Access and Conditional Branches). The Control Logic consists of the Core Control and the Coupling Control. The Core Control is responsible for decoding the fetched instruction and generating the control signals for all Processor Core components, while it also resolves any kind of control and data hazards. The Coupling Control extends the capabilities of the Control Logic to support the coupling with the RFU. If the decoded instruction is destined for the RFU for execution, the Coupling Control forwards the opcode of the instruction to the RFU. Also, it generates the control signals necessary for the Interface between the RISC and the RFU. Finally, since the RFU can support up to four operands, the Coupling Control properly extends the data hazard resolution unit.

The opcode of each reconfigurable instruction also encodes the type of operation that must be performed by the RFU. The Control Logic properly decodes the opcode and generates the appropriate control signals for the core data channel in order to support these instructions. The RFU supports the following types of operations: (i) complex arithmetic/logic computations, (ii) data transfer operations with complex addressing modes, and (iii) complex control flow operations. Each reconfigurable instruction can be performed in more than one cycle, leading to performance improvement in addition to reduced instruction memory accesses. The number of cycles required by each reconfigurable instruction is part of its configuration. The Control Logic receives this number and properly stalls the pipeline to support such instructions. The instruction types supported by the RFU ensure the adaptation of the processor to any targeted application.

3.2 Processor Core/RFU Interface

The Interface, depicted in Figure 2, provides on each cycle, at the Operand Fetch pipeline stage, up to four operand values to the RFU from the register file. After the requested computations have been performed, the RFU's results are forwarded back to the core data channel. Multiplexers at the end of the Execution and Memory Access pipeline stages, controlled by the Coupling Control, determine whether results come from the DataPath or the RFU.

A powerful feature of the proposed architecture is the efficient utilization of the pipeline stages. Specifically, the RFU can operate, and make its resources available, not only in the Execution stage but also in the Memory Access stage when performing complex arithmetic/logic computations, producing results that can be delivered to the core channel at the end of that stage.

3.3 RFU

As has been mentioned, the RFU consists of three main architectural layers. The first is the Processing Layer, in which all the computations are performed. The second is the Interconnection Layer, which manages the communication between the PEs. Finally, the Configuration Layer configures the two previously mentioned layers in order to perform the requested operation. When a reconfigurable instruction is identified in the Instruction Decode pipeline stage, its opcode, together with the appropriate operands retrieved from the RISC's register file, is forwarded to the RFU. The Configuration Layer receives the opcode and performs all necessary actions for configuring the RFU. Specifically, if the configuration bit stream of the instruction is stored locally in the configuration memory, the layer retrieves the configuration bits and configures the Processing and Interconnection Layers. Otherwise, a flag indicates that the configuration is missing and the processor stalls until all necessary bits are downloaded to the local configuration memory.

3.3.1 Processing Layer

The Processing Layer is a reconfigurable array of coarse-grain PEs. As we target a high-performance architecture, we choose coarse-grain PEs since they offer great advantages in terms of performance, reconfiguration time, and reconfiguration memory compared with fine-grain PEs. In the current version of the architecture, each PE has been designed to perform the same operations as the processor's DataPath. The structure of the PE is shown in Figure 3. On each execution cycle, two results are produced: the unregistered and the registered output of the PE. Through a multiplexer controlled by a configuration bit, the appropriate output is selected. The proposed structure of the PE offers important features that are discussed in the following.

Figure 3. PE Basic Structure

Spatial Computation: The unregistered output of a PE can be directly connected to the input of another PE, constructing chains of operations that can be processed in just one cycle. If the required time for executing a chain does not exceed the critical path of the Processor Core, the performance improvement offered by the spatial computation is proportional to the depth of the chain.

Spatial-Temporal Computation: The registered output of a basic structure holds the result of an operation computed while the reconfigurable instruction was in the Execution pipeline stage of the Processor Core. This result can be connected to the input of another PE for further computation in the Memory Access stage. In this way temporal computation is performed. Also, the registered output of a basic structure can hold the result of a spatial computation. Furthermore, this result can be connected to another PE participating in another spatial computation. Thus, spatial and temporal computation are combined, providing the opportunity to execute longer chains of normal instructions by fully exploiting the pipeline structure of the Processor Core.

Floating PEs: On each execution cycle, parts of two successive reconfigurable instructions can be executed on the RFU. The stage in which each PE operates is determined by the configuration of the reconfigurable instruction. From this point of view, PEs can be seen as floating between the Execution and Memory Access pipeline stages of the core processor. This structure offers the opportunity for maximum utilization of the available hardware.
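
To make the spatial/temporal trade-off concrete, the following C++ fragment is a minimal sketch of how a mapper could decide, for a linear chain of dependent operations, which PEs keep their result combinational (staying in the same execution stage) and where a registered output is inserted so that the chain continues in the next stage. The delay values, the names, and the simple greedy policy are illustrative assumptions, not the actual mapper of the development framework described in Section 4.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Illustrative delay model in nanoseconds; real values come from synthesis.
    constexpr double kClockPeriodNs = 4.0;   // processor critical path

    struct Op {
        const char* name;
        double delay_ns;   // PE delay plus interconnect for this operation
    };

    // Greedily pack a linear chain of operations into at most two RFU stages.
    // Returns the stage (0 or 1) of each operation, or an empty vector if the
    // chain does not fit in two stages and must become a multicycle instruction.
    std::vector<int> assign_stages(const std::vector<Op>& chain) {
        std::vector<int> stage(chain.size(), 0);
        int current_stage = 0;
        double accumulated = 0.0;
        for (std::size_t i = 0; i < chain.size(); ++i) {
            if (accumulated + chain[i].delay_ns > kClockPeriodNs) {
                // Register the previous PE's output; continue in the next stage.
                ++current_stage;
                accumulated = 0.0;
                if (current_stage > 1) return {};
            }
            accumulated += chain[i].delay_ns;
            stage[i] = current_stage;
        }
        return stage;
    }

    int main() {
        // Hypothetical chain add -> shift -> add -> sub with made-up delays.
        std::vector<Op> chain = {{"add", 1.6}, {"shl", 0.8}, {"add", 1.6}, {"sub", 1.6}};
        std::vector<int> stages = assign_stages(chain);
        if (stages.empty()) {
            std::printf("chain needs a multicycle reconfigurable instruction\n");
        } else {
            for (std::size_t i = 0; i < chain.size(); ++i)
                std::printf("%s -> stage %d\n", chain[i].name, stages[i]);
        }
        return 0;
    }

In this sketch the first three operations fit within the 4 ns budget and stay in the Execution stage as a spatial chain, while the last one is pushed to the Memory Access stage through a registered output, exactly the combined spatial-temporal behavior described above.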

Combined with the PEs, the Processing Layer features a Local Memory Structure, which is a register file for storing read-only values used by instructions executed in the RFU. Thus, the four operands from the RISC's register file can be extended with constant values, offering the possibility to create larger Multiple-Input-Single-Output (MISO) instructions. MISOs are candidates for execution as reconfigurable instructions in the RFU; therefore, by increasing their size more performance gain can be obtained. The number of stored values that can be addressed, as well as their total number, can be configured at design time. The address of the constant operands of each reconfigurable instruction is part of the instruction's configuration bit stream.

3.3.2 Interconnection Layer

The structure of this layer is illustrated in Figure 4. Conventionally, the Interconnection Layer is constructed with buses and steering logic. As the size of the reconfigurable array is kept relatively small, this structure can be efficient in terms of area and delay. Moreover, it requires a small number of configuration bits, which provides significant performance and area improvements. In addition, full connectivity between the PEs is offered, resulting in the possibility of maximum utilization of the available hardware. The introduced structure features two global blocks for the inter-communication of the RFU: the Input Network and the Output Network. The former is responsible for receiving the operands from the RISC's register file and the local memory, delivering their registered and unregistered values to the following blocks. In this way, operands for both execution stages of the RFU are constructed. The Output Network can be configured to select the appropriate PE result to be delivered to the output of each stage of the RFU.

Figure 4. Interconnection Layer Block Structure

For the intra-communication between the PEs, two blocks are provided for each PE Basic Structure: the Stage Selector and the Operand Selector. The first is configured to select the stage from which the PE receives operands; thus, this block configures the stage in which each PE will operate. The Operand Selector receives the final operands, together with the feedbacks from each PE, and is configured to forward the appropriate values.

3.3.3 Configuration Layer

The components and operation of the Configuration Layer are depicted in Figure 5. On each execution cycle the opcode of the reconfigurable instruction is delivered from the core processor's Instruction Decode stage to the RFU. The opcode is forwarded to a local structure that stores the configuration bits of the locally available instructions. If the required instruction is available, the configuration bits for the Processing and Interconnection Layers are retrieved. Otherwise, a control signal indicates that new configuration bits must be downloaded from an external configuration memory to the local storage structure, and the processor execution stalls.
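
As an illustration only, the C++ sketch below shows one hypothetical way to represent the per-PE configuration that the Interconnection and Configuration Layers control: the PE function, the Stage Selector choice, the Operand Selector sources, and whether the registered or unregistered output is used. The field names, widths and grouping are assumptions made for clarity; the actual bit-level format of the configuration stream is not specified at this level of detail.

    #include <cstdint>

    // Hypothetical per-PE configuration record (illustrative, not the real format).
    enum class PeFunction : std::uint8_t { Add, Sub, Logic, Shift, Multiply };
    enum class Stage      : std::uint8_t { Execution = 0, MemoryAccess = 1 };

    struct PeConfig {
        PeFunction    function;            // operation performed by the PE
        Stage         stage;               // Stage Selector: stage the PE serves
        std::uint8_t  operand_a;           // Operand Selector: register operand,
        std::uint8_t  operand_b;           //   local constant, or PE feedback
        bool          registered_output;   // registered vs. unregistered result
    };

    // One reconfigurable instruction configures every PE plus the Output Network.
    struct ReconfigurableInstruction {
        static constexpr int kNumPEs = 8;  // array size used in the evaluation model
        PeConfig      pe[kNumPEs];
        std::uint8_t  output_select[2];    // Output Network: result chosen per stage
        std::uint8_t  latency_cycles;      // cycles the core must stall for
    };

A record like this would be filled in by the mapper of the development framework (Section 4) and retrieved by the Configuration Layer, whose detailed operation is described next.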

Figure 5. Configuration Layer Structure

In addition, as part of the configuration bit stream of each instruction, the storage structure delivers two words, each of which indicates the resource occupation required for the execution of the instruction in the corresponding stage. These words are forwarded to the Resource Availability Control Logic, which stores the 2nd-Stage Resource Occupation word for one cycle. On each cycle this logic compares the 1st-Stage Resource Occupation of the current instruction with the 2nd-Stage Resource Occupation of the previous instruction. If a resource conflict arises, a control signal indicates to the processor core to stall the pipeline execution for one cycle. Finally, the retrieved configuration bits move through pipeline registers to the first and second execution stages of the RFU. A multiplexer, controlled by the Resource Configuration bits, selects the correct configuration bits for each PE and its corresponding interconnection network.

The inner structure of the Configuration Bits Local Storage Structure is presented in more detail in Figure 6. Every time a new opcode is received, the Availability Check component compares its value with those in the Table of Available Configurations. Then, a Configuration ID is produced and decoded by the ID Decoder in order to produce the address of the Local Configuration Memory in which the configuration bits are stored. The ID Decoder can also produce a Miss Flag, if the configuration is not available locally, or an Error Flag, if more than one configuration matches the requested instruction. In the second case the processor halts with an error code.

Figure 6. Configuration Bits Local Storage Structure

The Miss Flag indicates that new configuration bits must be downloaded to the local configuration memory. To perform such an operation, only the opcode in the Table and the corresponding configuration bits at the correct address of the local memory must be downloaded. The number of cycles required for a new instruction to be downloaded depends on the number of PEs available. For an array with eight PEs, designed for evaluation purposes, 134 bits are required for each reconfigurable instruction. If a 32-bit bandwidth to the external configuration memory is available, only five cycles are required to download the configuration bits of a new instruction. For a memory structure with sixteen available reconfigurable instructions, again used in the evaluation version of the proposed architecture, eighty cycles are required for refreshing the whole structure.

The overhead is very small and can be further reduced with a higher bandwidth.

4. DEVELOPMENT FRAMEWORK

Our approach to compiling for ReRISC primarily involves the incorporation, transparently to the user, of compiler extensions that support the reconfigurable instruction set extensions. Under this requirement, we developed an automated development framework for ReRISC, whose organization is depicted in Figure 7. The complete flow is divided into five distinct stages, namely: 1) Front-End, 2) Profiling, 3) Instruction Generation, 4) Instruction Selection, and 5) Back-End. Each stage of the flow is presented in detail below.

Figure 7. Proposed Compilation System Flow

Front-End: The framework supports C/C++ code, which is first fed to the front-end. MachSUIF [Smith M.D. and Holloway G., 2002] is used to generate the Control and Data Flow Graph (CDFG) of the application using the SUIFvm Intermediate Representation (IR). In addition, a number of machine-independent optimizations (e.g. dead code elimination, strength reduction) are performed on the CDFG. The output of this stage is an optimized IR in the form of a CDFG.

Profiling: A MachSUIF pass has been developed that instruments the CDFG with profiling annotations, which mark the entrances and exits of basic blocks (we will refer to DFGs as basic blocks from now on). A modified m2c pass (the original is supplied with MachSUIF) translates the CDFG to equivalent C code, while the annotations regarding the basic blocks are converted to program counters. By compiling and executing the generated code, profiling information for the execution frequency of the basic blocks is collected.

Instruction Generation: The instruction generation stage is divided into two steps. The goal of the first step (pattern generation) is the identification of complex patterns of primitive operations (e.g. SUIFvm operations) that can be merged into one reconfigurable instruction. Pattern generation is performed using an in-house framework for the automated extension of embedded processors described in [Kavvadias N. and Nikolaidis S., 2005]. The pattern generation engine is based on the MaxMISO (maximal multiple-input single-output) algorithm [Alippi A. et al., 1999], which identifies the maximal non-overlapping connected subgraphs of the basic blocks that produce a single computation result. An enhanced version of the algorithm implemented in [Alippi A. et al., 1999] has been used. These enhancements consist of user-defined parameters which control: 1) the maximum number of inputs of the pattern, 2) the types of operations included in the pattern (e.g. ALU, multiply, etc.), 3) the permitted types of the pattern (e.g. computation, addressing or control flow, as described in Section 3), and 4) the maximum number of operations in the pattern. Exploiting these features, the user can configure the architecture at design time to fine-tune it towards an application domain.
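
As a rough illustration of this pattern-generation step, the C++ fragment below sketches MISO cluster growth under such user constraints: starting from a seed operation, a predecessor is absorbed only if all of its consumers already lie inside the cluster, so the cluster keeps a single output. The data structures, the constraint checks, and the growth order are simplified assumptions; they do not reproduce the exact algorithm of [Alippi A. et al., 1999] or of the in-house framework.

    #include <cstddef>
    #include <set>
    #include <vector>

    struct Node {
        std::vector<int> preds;   // producers of this operation's operands
        std::vector<int> succs;   // consumers of this operation's result
        bool allowed;             // operation class permitted by the user (e.g. ALU)
    };

    struct Constraints {
        std::size_t max_ops;      // maximum number of operations per pattern
        std::size_t max_inputs;   // maximum number of pattern inputs
    };

    // Grow a single-output cluster rooted at 'root'. A predecessor joins only if
    // all of its consumers are already inside, preserving the MISO property.
    std::set<int> grow_miso(const std::vector<Node>& dfg, int root, const Constraints& c) {
        std::set<int> cluster = {root};
        bool grew = true;
        while (grew) {
            grew = false;
            for (int n : std::set<int>(cluster)) {            // iterate over a snapshot
                for (int p : dfg[n].preds) {
                    if (cluster.count(p) || !dfg[p].allowed) continue;
                    bool only_internal_uses = true;
                    for (int s : dfg[p].succs)
                        if (!cluster.count(s)) { only_internal_uses = false; break; }
                    if (only_internal_uses && cluster.size() < c.max_ops) {
                        cluster.insert(p);
                        grew = true;
                    }
                }
            }
        }
        // Count external inputs and reject patterns that need too many operands.
        std::set<int> inputs;
        for (int n : cluster)
            for (int p : dfg[n].preds)
                if (!cluster.count(p)) inputs.insert(p);
        if (inputs.size() > c.max_inputs) cluster.clear();
        return cluster;
    }

Clusters of this kind are then handed to the mapper described next, which checks their delay against the pipeline stages of the RFU.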

In the second step, the mapping of the previously identified patterns onto the RFU is performed and the actual reconfigurable instruction set extensions are generated. A mapper for the target RFU has been developed for this purpose. Since all resource constraints have been resolved in the pattern generation step and the 1-D array of the RFU offers full connectivity, the implementation of the mapper is significantly simplified. The steps performed by the mapper are:
1. Calculate the latency of each operation in the pattern. This latency includes the accumulated latencies of the operation's predecessors in the pattern's chains. The latency is calculated using user parameters defining the delay of the modules of the RFU (PEs, interconnection, etc.). These parameters are provided by the designer of the architecture.
2. Place each operation in a PE and appropriately configure its functionality.
3. Assign the PE to the appropriate pipeline stage for execution, based on the calculated delay and the type of the pattern (e.g. computation, addressing, etc.). This is performed by selecting the registered or unregistered output of the PE.
4. Configure the multiplexers of the 1-D array for the appropriate interconnection of the PEs.
5. Report the reconfigurable instruction semantics (e.g. latency, type, resources, etc.).

Instruction Selection: In this stage, the final instruction set extensions are selected. The only metric for the selection of an instruction is the offered speed-up. The instruction selection stage has also been implemented to automatically estimate the speed-up of each instruction. Firstly, the static speed-up of each instruction is calculated. This is accomplished by comparing the software versus the hardware (RFU) execution cycles of the instruction. The software execution cycles are equal to the number of operations of which the instruction consists, while the hardware cycles have been reported by the mapper in the previous step. The static speed-ups are multiplied by the execution frequency of the basic block (derived at the profiling stage) to obtain the dynamic speed-ups. Finally, we perform pair-wise graph isomorphism checks on the set of instructions. The VFLib2 graph matching library [Foggia P.] is used for this purpose. A set of isomorphic instructions defines a group, for which the offered speed-up is calculated by summing the dynamic speed-ups of the group members. The instructions/groups are ranked based on their dynamic speed-ups and the best are selected. The output of this stage is the reconfigurable instruction set extensions, in addition to statistics (speed-up, number of instructions, etc.) that are presented in Section 5.

Back-End: The back-end of the framework flow is the only stage that has not yet been fully implemented. However, since reconfigurable instructions do not require any special manipulation for the communication and synchronization between the processor and the RFU, the back-end is much like any traditional processor back-end, performing tasks like scheduling, register allocation, etc.

5. EXPERIMENTAL RESULTS

5.1 Architecture Synthesis

For the evaluation of the proposed architecture, a hardware model described in VHDL was designed. The configuration used for the evaluation model of the RFU is presented in Table 1. The model was synthesized with an STM 0.13um process. Area and delay values were collected for all components of the architecture and are presented in Table 2. The RFU requires approximately 3.3 times the area required by the Processor Core. However, the presented version of the proposed architecture does not target any specific application domain. Tuning the RFU for an application domain can result in fewer PEs and reduced PE functionality, which can dramatically decrease the RFU area requirements. Furthermore, it must be pointed out that no instruction and data caches were taken into account. Caches dominate the area of embedded processors, and thus the RFU area overhead is comparatively small. Measurements of the delay of the Processor Core and of critical components of the RFU structure are presented in Table 3. The critical path of the processor core is 4 nsec and is not influenced by the incorporation of the RFU. Since the RFU extends the capabilities of the Processor Core, this critical path determines the clock frequency of the proposed architecture. The table also presents delays for all types of processing elements and interconnections of the RFU. These values are provided to the development framework to accomplish the mapping process as described in Section 4.
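
As an illustration of how these delays constrain a reconfigurable instruction, consider a chain of two dependent arithmetic operations mapped onto a single execution stage and assume, as a simplification, that the stage delay is the sum of the traversed input, PE, feedback and output delays of Table 3: 0.4 + 1.2 + 0.3 + 1.2 + 0.2 = 3.3 nsec, which is within the 4 nsec clock budget, so the two operations can execute as a spatial computation in one cycle. A chain involving the 2.9 nsec multiplier, by contrast, quickly exceeds the budget and must be split across the two stages or over multiple cycles, as illustrated in the next subsection.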

Table 1. Configuration used for the RFU Evaluation Model
Granularity (except the 16x16 multiplier): 32 bits
Number of Processing Elements: 8
Processing Elements Functionality: Arithmetic, Logic, Shifter, Multiplier
Configuration Memory Size: 16 words of 134 bits
Local Operand Memory Size: 8 operands of 32 bits
Number of Provided Local Operands: 4

Table 2. Area Requirements for Processor Core and RFU
Components (area in mm2): Processor Core; RFU Processing Layer; RFU Interconnection Layer; RFU Configuration Layer; RFU Total

Table 3. Delay Estimations for Processor Core and RFU / Reconfigurable Instruction Delay Model
Processor Core: 4 nsec | Node with I/O edges: 1 cycle
Arithmetic PE: 1.2 nsec | Arithmetic Node: 0.3 cycles
Logic PE: 0.1 nsec | Logic Node:
Shifter PE: 0.4 nsec | Shifter Node: 0.1 cycles
Multiplier PE: 2.9 nsec | Multiplier Node:
Input to PE: 0.4 nsec | Primary Input Edge: 0.1 cycles
Feedback to PE: 0.3 nsec | Edge connecting nodes:
PE to Output: 0.2 nsec | Primary Output Edge:

5.2 Demonstration

In order to demonstrate the operation of the RFU, two reconfigurable instructions derived from the quantization algorithm are used. The DFGs of the instructions are shown in Figure 8. The instruction in Figure 8a consists of six operations and requires three register and two constant operands. The mapping of the operations can be performed on both execution stages of the RFU, as presented in Figure 8a. Even though the full instruction cannot be performed as a spatial computation in one execution cycle due to the clock constraint, combining spatial and temporal computation makes this possible. Spatial computation is performed within each of the two execution stages, while temporal computation is achieved by forwarding the result from the first stage to the second. As a result, the reconfigurable instruction that would require six cycles to execute on the Processor Core can be executed in just one cycle in the RFU. In addition, only one Instruction Memory access, rather than six, is required for the execution of the reconfigurable instruction on the RFU, resulting in a reduction of the energy consumption.

Figure 8b presents the second reconfigurable instruction. Because of the clock constraint, there is no way to execute the full instruction on the RFU in one cycle. However, it can be executed as a two-cycle reconfigurable instruction. The two successive multiplications are performed in the 1st Execution Stage of the RFU, while the processor stalls for one cycle, offering two execution cycles for the completion of the operations. The rest of the instruction is executed in the 2nd Execution Stage of the RFU. Two execution cycles are required for the execution of this instruction on the RFU, rather than the five that would be required for execution on the Processor Core. Also, one access to the instruction memory is required, rather than the two that would be needed if the RFU did not support multicycle instructions.
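
Summarizing the cycle counts of this example, and assuming that each primitive operation costs one cycle on the Processor Core, the first instruction executes in 1 RFU cycle instead of 6 core cycles (a 6x reduction in execution cycles, with 1 instruction fetch instead of 6), while the second executes in 2 RFU cycles instead of 5 (a 2.5x reduction, with 1 fetch instead of the 2 that would be needed without multicycle support).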

Figure 8. Implementation example of two reconfigurable instructions in the proposed architecture

5.3 Performance Evaluation

To evaluate the performance of the proposed architecture, we have considered the set of benchmarks shown in Table 4, which includes a brief description of each. Since we are currently not targeting any specific application domain, we selected a set of applications from various application fields. The set includes both kernels (1-3) and complete applications (4-10). The C source code of each benchmark has been fed to the framework. All experimental results have been automatically generated by the framework and are instruction accurate. Since no Operating System (OS) support is currently available for our architecture, all OS calls (like printf, fopen, etc.) are taken over by the host operating system without being considered in the evaluation. Speedups have been calculated by comparing the base RISC processor of the architecture with and without support of the RFU.

Table 4. Benchmarks Description
1. dct: Discrete Cosine Transform used on 8x8 image blocks
2. quant: Quantization operation used in JPEG compression
3. vlc: Variable Length Coding for JPEG and MPEG compression algorithms
4. dijkstra: Dijkstra's algorithm to find the shortest path
5. stringsearch: A Pratt-Boyer-Moore string search algorithm
6. crc32: Checksum algorithm, 32-bit Cyclic Redundancy Code
7. gost: Cryptographic algorithm, the Russian analog of DES
8. sha: NIST Secure Hash Algorithm
9. rs-encode: Reed-Solomon encoder
10. mpeg4_senc: Shape Coding for MPEG-4

Figure 9 presents the achieved speedups for the implementation of the benchmarks on the proposed architecture. The speedups vary from x2.2 for the crc32 algorithm to x3.4 for the gost cryptographic algorithm. The average value is x2.9, which clearly indicates the performance efficiency of the proposed architecture. As can be observed, similar speedups are produced for kernels and complete applications. Based on the well-known Amdahl's law, accelerating only part of an application (usually referred to as the kernels) by a factor S produces an overall application speedup that is only a fraction of S. This is usually the case for the co-processor approach. Our approach attempts to accelerate the whole application. Thus, even though a smaller speedup is achieved for the kernels compared to the co-processor approach, the speedup is maintained for the complete application.
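
To see the difference quantitatively, suppose (purely as an illustration) that kernels accounting for 80% of the execution time are accelerated by a factor of 4 while the remaining code is untouched; Amdahl's law then gives an overall speedup of 1 / (0.2 + 0.8/4) = 2.5x. Accelerating the whole application, even by the smaller average factor of x2.9 measured here, preserves that x2.9 at the application level.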

Furthermore, Figure 10 illustrates another benefit produced by the incorporation of the RFU: the reduction of the instruction fetches required to execute an application, which in turn means a reduction in instruction memory accesses. The reduction varies from 53% to 75%, the latter produced for MPEG4 shape coding, with an average value of 63%. Since a major source of energy consumption in embedded processors is instruction memory accesses [Benini L. et al., 2001], significant energy savings can be produced.

Figure 9. Benchmark speedups

Figure 10. Instruction fetches reduction for each benchmark

6. CONCLUSION

A RISP processor, which tightly couples a coarse-grain RFU, has been proposed, and the complete architecture has been presented. The supported features of the RFU, namely ILP and spatial computation, together with its integration in the processor's structure, aim to improve performance. A hardware description model of the architecture has been designed and synthesized using a 0.13um technology. Synthesis results showed that the incorporation of the RFU into the base processor introduces no delay overhead and a reasonable area overhead. Furthermore, a development framework for the introduced architecture has been presented. Using this framework, a set of benchmarks has been implemented. Results show significant performance improvements in addition to reduced instruction memory accesses, which can potentially reduce energy consumption.

ACKNOWLEDGMENT

This work was supported by the General Secretariat of Research and Technology of Greece and the European Union.

REFERENCES

Alippi A. et al., 1999. A DAG-Based Design Approach for Reconfigurable VLIW Processors. IEEE International Conference on Design and Test in Europe, Munich, Germany.
Barat F. and Lauwereins R., 2000. Reconfigurable Instruction Set Processors: A Survey. IEEE International Workshop on Rapid System Prototyping.
Benini L. et al., 2001. A Power Modeling and Estimation Framework for VLIW-based Embedded Systems. Proceedings of the Int. Workshop on Power And Timing Modeling, Optimization and Simulation (PATMOS), Sept. 2001.
Callahan T. J. et al., 2000. The Garp Architecture and C Compiler. IEEE Computer, vol. 33, no. 4.
DeHon A. and Wawrzynek J., 1999. Reconfigurable Computing: What, Why, and Implications for Design Automation. Design Automation Conference (DAC).
Foggia P. The VFLib Graph Matching Library.
Gokhale M. B. and Stone J. M., 1998. NAPA C: Compiling for a Hybrid RISC/FPGA Architecture. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM).
Goldstein S. C. et al., 1999. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. 26th Annual Int. Symposium on Computer Architecture.
Hennessy J. and Patterson D., 1991. Computer Architecture: A Quantitative Approach. Morgan Kaufmann.
Kavvadias N. and Nikolaidis S., 2005. Automated Instruction-Set Extension of Embedded Processors with Application to MPEG-4 Video Encoding. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP'05).
La Rosa A. et al., 2005. Software Development for High-Performance, Reconfigurable, Embedded Multimedia Systems. IEEE Design and Test of Computers, vol. 22, no. 1.
Miyamori T. and Olukotun K., 1999. REMARC: Reconfigurable Multimedia Array Co-Processor. IEICE Trans. Information Systems, vol. E82-D, no. 2.
Razdan R. and Smith M. D., 1994. A High-Performance Microarchitecture with Hardware-Programmable Functional Units. 27th Annual Int. Symposium on Microarchitecture (MICRO-27).
Smith M.D. and Holloway G., 2002. An Introduction to Machine SUIF and Its Portable Libraries for Analysis and Optimization. Technical report, Division of Engineering and Applied Sciences, Harvard University, USA.
Vassiliadis S. et al., 2004. The MOLEN Polymorphic Processor. IEEE Transactions on Computers, vol. 53, no. 11.
Ye Z. A. et al., 2000. A C Compiler for a Processor with a Reconfigurable Functional Unit. Int. Symposium on Field Programmable Gate Arrays (FPGA).


More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

Compiling for the Molen Programming Paradigm

Compiling for the Molen Programming Paradigm Compiling for the Molen Programming Paradigm Elena Moscu Panainte 1, Koen Bertels 1, and Stamatis Vassiliadis 1 Computer Engineering Lab Electrical Engineering Department, TU Delft, The Netherlands {E.Panainte,K.Bertels,S.Vassiliadis}@et.tudelft.nl

More information

Data Parallel Architectures

Data Parallel Architectures EE392C: Advanced Topics in Computer Architecture Lecture #2 Chip Multiprocessors and Polymorphic Processors Thursday, April 3 rd, 2003 Data Parallel Architectures Lecture #2: Thursday, April 3 rd, 2003

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor

Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor Vu Manh Tuan, Yohei Hasegawa, Naohiro Katsura and Hideharu Amano Graduate School of Science

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Automatic Compilation to a Coarse-Grained Reconfigurable System-opn-Chip

Automatic Compilation to a Coarse-Grained Reconfigurable System-opn-Chip Automatic Compilation to a Coarse-Grained Reconfigurable System-opn-Chip GIRISH VENKATARAMANI and WALID NAJJAR University of California, Riverside FADI KURDAHI and NADER BAGHERZADEH University of California,

More information

Instructor Information

Instructor Information CS 203A Advanced Computer Architecture Lecture 1 1 Instructor Information Rajiv Gupta Office: Engg.II Room 408 E-mail: gupta@cs.ucr.edu Tel: (951) 827-2558 Office Times: T, Th 1-2 pm 2 1 Course Syllabus

More information

Course web site: teaching/courses/car. Piazza discussion forum:

Course web site:   teaching/courses/car. Piazza discussion forum: Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

More information

Design of a Processor to Support the Teaching of Computer Systems

Design of a Processor to Support the Teaching of Computer Systems Design of a Processor to Support the Teaching of Computer Systems Murray Pearson, Dean Armstrong and Tony McGregor Department of Computer Science University of Waikato Hamilton New Zealand fmpearson,daa1,tonymg@cs.waikato.nz

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Abstract The proposed work is the design of a 32 bit RISC (Reduced Instruction Set Computer) processor. The design

More information

Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip

Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip Girish Venkataramani Carnegie-Mellon University girish@cs.cmu.edu Walid Najjar University of California Riverside {girish,najjar}@cs.ucr.edu

More information

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas The CPU Design Kit: An Instructional Prototyping Platform for Teaching Processor Design Anujan Varma, Lampros Kalampoukas Dimitrios Stiliadis, and Quinn Jacobson Computer Engineering Department University

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

RED: A Reconfigurable Datapath

RED: A Reconfigurable Datapath RED: A Reconfigurable Datapath Fernando Rincón, José M. Moya, Juan Carlos López Universidad de Castilla-La Mancha Departamento de Informática {frincon,fmoya,lopez}@inf-cr.uclm.es Abstract The popularity

More information

MARKET demands urge embedded systems to incorporate

MARKET demands urge embedded systems to incorporate IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 3, MARCH 2011 429 High Performance and Area Efficient Flexible DSP Datapath Synthesis Sotirios Xydis, Student Member, IEEE,

More information

A Streaming Multi-Threaded Model

A Streaming Multi-Threaded Model A Streaming Multi-Threaded Model Extended Abstract Eylon Caspi, André DeHon, John Wawrzynek September 30, 2001 Summary. We present SCORE, a multi-threaded model that relies on streams to expose thread

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

An ASIP Design Methodology for Embedded Systems

An ASIP Design Methodology for Embedded Systems An ASIP Design Methodology for Embedded Systems Abstract A well-known challenge during processor design is to obtain the best possible results for a typical target application domain that is generally

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor

Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor Ying Chen, Simon Y. Chen 2 School of Engineering San Francisco State University 600 Holloway Ave San Francisco, CA 9432

More information

Introduction to reconfigurable systems

Introduction to reconfigurable systems Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

Computer Architecture V Fall Practice Exam Questions

Computer Architecture V Fall Practice Exam Questions Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL ARCHITECTURAL-LEVEL SYNTHESIS Motivation. Outline cgiovanni De Micheli Stanford University Compiling language models into abstract models. Behavioral-level optimization and program-level transformations.

More information

Supporting Multithreading in Configurable Soft Processor Cores

Supporting Multithreading in Configurable Soft Processor Cores Supporting Multithreading in Configurable Soft Processor Cores Roger Moussali, Nabil Ghanem, and Mazen A. R. Saghir Department of Electrical and Computer Engineering American University of Beirut P.O.

More information

Instruction Pipelining

Instruction Pipelining Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages

More information

Application of Power-Management Techniques for Low Power Processor Design

Application of Power-Management Techniques for Low Power Processor Design 1 Application of Power-Management Techniques for Low Power Processor Design Sivaram Gopalakrishnan, Chris Condrat, Elaine Ly Department of Electrical and Computer Engineering, University of Utah, UT 84112

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Chapter 2 Lecture 1 Computer Systems Organization

Chapter 2 Lecture 1 Computer Systems Organization Chapter 2 Lecture 1 Computer Systems Organization This chapter provides an introduction to the components Processors: Primary Memory: Secondary Memory: Input/Output: Busses The Central Processing Unit

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection Sunil Shukla 1,2, Neil W. Bergmann 1, Jürgen Becker 2 1 ITEE, University of Queensland, Brisbane, QLD 4072, Australia {sunil, n.bergmann}@itee.uq.edu.au

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

ECE 486/586. Computer Architecture. Lecture # 7

ECE 486/586. Computer Architecture. Lecture # 7 ECE 486/586 Computer Architecture Lecture # 7 Spring 2015 Portland State University Lecture Topics Instruction Set Principles Instruction Encoding Role of Compilers The MIPS Architecture Reference: Appendix

More information

Embedded Systems Ch 15 ARM Organization and Implementation

Embedded Systems Ch 15 ARM Organization and Implementation Embedded Systems Ch 15 ARM Organization and Implementation Byung Kook Kim Dept of EECS Korea Advanced Institute of Science and Technology Summary ARM architecture Very little change From the first 3-micron

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

Scalable Multi-cores with Improved Per-core Performance Using Off-the-critical Path Reconfigurable Hardware

Scalable Multi-cores with Improved Per-core Performance Using Off-the-critical Path Reconfigurable Hardware Scalable Multi-cores with Improved Per-core Performance Using Off-the-critical Path Reconfigurable Hardware Tameesh Suri and Aneesh Aggarwal Department of Electrical and Computer Engineering State University

More information

CONTACT: ,

CONTACT: , S.N0 Project Title Year of publication of IEEE base paper 1 Design of a high security Sha-3 keccak algorithm 2012 2 Error correcting unordered codes for asynchronous communication 2012 3 Low power multipliers

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

High-Level Synthesis (HLS)

High-Level Synthesis (HLS) Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information