Survey of the Counterflow Pipeline Processor Architectures

Pic Balaji, Esther Ososanya, Wagdy Mahmoud, Karthik Thangarajan
Department of Electrical Engineering, University of the District of Columbia, Van Ness Campus, Washington, D.C., USA

Keywords - Counterflow pipeline, synchronous and asynchronous systems, RISC processors, MIPS architecture.

Abstract - The Counterflow Pipeline Processor (CFPP) Architecture is a RISC-based pipeline processor [1]. It was proposed in 1994 as an asynchronous processor architecture. More recently, researchers have implemented it as a synchronous processor architecture and have improved its design in terms of speed and performance by reducing the average execution latency of instructions and minimizing pipeline stalling. In this paper, we survey the architecture and its key design issues, such as synchronous versus asynchronous implementation, and discuss the advantages and disadvantages of each. We also discuss our research on evaluating the performance of the counterflow pipeline processor architecture against that of the traditional MIPS processor architecture [4].

I. INTRODUCTION

The Counterflow Pipeline Processor (CFPP) Architecture was introduced by Sproull et al. [1] of Sun Microsystems Laboratories as a simple and regular pipeline processor structure. It belongs to the family of RISC processor architectures. The architecture has two pipelines in which the data structures representing instructions and results flow in opposite directions. This allows each instruction and each counterflowing result to interact at every stage. Though the initial architecture was intended for asynchronous implementation, recent research has proposed synchronous implementations of the CFPP architecture. Other variations and modifications to the original design have also been proposed. In this paper, we present the original structure of the CFPP architecture, the pipeline rules that govern it, and the way it handles branches and traps.
We also discuss the different implementations of the architecture and their advantages and disadvantages. Finally, we discuss our research to evaluate the performance of the architecture.

II. THE ORIGINAL CFPP

This section covers the original CFPP properties, structure, functional units and sidings, pipeline rules, conditional branches and traps, and advantages and disadvantages.

A. CFPP Properties

The CFPP is a simple and regular structure with the following properties:

Local Control: Only local information decides whether an instruction in the CFPP should advance.
Regularity: The CFPP architecture seeks geometric regularity in the processor chip layout.
Communication: Every stage communicates primarily with its nearest neighbors. This allows for short and fast communication paths.
Modularity: Stages may differ in their computational logic; however, all stages adopt the same communication protocol.

B. CFPP Structure

The original structure of the Counterflow Pipeline Processor (CFPP) Architecture is shown in Figure 1 [2][5]. The CFPP has an instruction fetch unit at one end of the pipeline stages and a register file at the other end. In between these two stages, instructions and results flow in opposite directions. The pipeline through which the instructions flow is called the instruction pipeline, and the pipeline through which the results flow is called the result pipeline. Pipeline stages operate concurrently, and each stage has an independent function to complete. Alongside the series of pipeline stages, side units perform various arithmetic, logical, and memory operations. These side units are called sidings.
Instructions and results flow through the pipeline as packets. The instruction packets fetched from the instruction memory are decoded by the Decode and Fetch unit before being sent to the adjacent pipeline stage. The register file is one of the sources of the result packets that are sent into the pipeline. The instruction packets in the instruction pipeline and the result packets in the result pipeline interact with each other at every stage. Within each stage, instruction and result packets consist of smaller records called bindings. A binding contains a register address, the register contents, and a 1-bit flag indicating whether or not the content of the register is valid. A typical binding is shown in Figure 2 [1]. Each stage also contains hardware for comparing the addresses of the different bindings to determine whether any information can be exchanged between the instruction and result packets. An instruction packet consists of the instruction operation code (opcode), three bindings, and the program counter value. The first two bindings contain the instruction operands, and the third binding contains the result. A result packet consists of two bindings. The interaction of the instruction and result pipelines is one of the key features of the architecture. An instruction packet and a result packet that occupy the same stage at the same time exchange information.

Fig. 2: Instruction / Result Binding [1]

During their interaction, the register addresses of the instruction's operands are compared to the register addresses of the result, and in case of a match, the result value is copied into the register content field of the instruction. This is called a garner operation. Similarly, the result bindings are updated with valid register values.
This operation, called the update operation, allows newly computed result values to be available to subsequent instructions even before they are stored in the register file. When an instruction reaches the end of the pipeline, the data values stored in its destination binding are written into the corresponding location in the register file. Until this happens, instructions are considered speculative and may be cancelled in case of a trap or branch.

C. Functional Units and Sidings

Functional units called sidings can be connected to the pipeline. These functional units perform memory, logic, and arithmetic operations. Sidings are connected to the pipeline through launch and return stages, and are themselves pipelined. A stage of the pipeline that launches an instruction into a siding is a launch stage, and the launch action is called a launch sequence. The results from a siding are returned to the processor a few stages later, at the return stage; this is a return sequence. While a siding is performing an operation on the instructions launched into it, the processor may perform other operations simultaneously. Thus, sidings allow several operations to progress concurrently. However, sidings need not be part of the architecture; instructions may also be executed in the pipeline stages without using any siding. Typically, instructions with long computation delays (long-latency instructions) are executed in sidings.

D. Pipeline Rules

The pipeline follows a set of execution and matching rules [5][1].

Fig. 1: Counterflow Pipeline Processor Architecture [2][5]

Execution Rules: Four execution rules direct the flow of information between the various stages of the pipeline.

E1. No Overtaking: Instructions cannot deviate from program order in the instruction pipeline, i.e., instructions cannot pass each other.
E2. Execution: An instruction can be executed only if all of its source bindings are valid and if it occupies a stage with suitable computing logic. At the end of the instruction's execution, its destination binding flag is marked valid and its destination binding value is filled with the result.

E3. Insert Result: On completing the execution of an instruction, the destination binding is marked valid and one or more copies of it are made for later instructions awaiting that particular value.

E4. Stalling for Operands: No operation can retire into the register file without being executed. An unexecuted instruction must wait at the last stage with suitable computing logic until it can be executed.

Matching Rules: These rules govern the exchange of bindings between instruction and result packets occupying the same stage at the same time.

M1. Garner Instruction Operands: When a valid result binding matches an invalid instruction operand binding, replace the operand value with the result value and mark the operand binding valid.

M2. Kill Result: When an invalid destination binding matches a valid result binding, mark the result binding invalid.

M3. Update Results: When a valid destination binding matches a result binding, copy the destination value into the result value and mark it valid.

E. Conditional Branches and Traps

One of the features of the CFPP architecture is the way it handles traps and branch instructions. A single-bit identifier in the instruction and result bindings helps in handling branches and traps efficiently. The CFPP usually predicts that a conditional branch will not be taken. In the case of either a trap or a wrongly-predicted branch, a specially-marked result (poison pill) travels down the result pipeline, invalidating all instructions in the pipeline after the trap or the wrongly-predicted instruction (kill instruction).
When the poison pill reaches the end of the result pipeline, it is intercepted by the stage responsible for program counter control (the decode unit). In case of a trap, the address of the trap handler is loaded into the program counter. Similarly, in case of a wrongly-predicted branch, the branch target address is loaded into the program counter. Thus the architecture can recover from erroneous branch predictions and can support precise interrupts.

F. Advantages and Disadvantages

The CFPP has several advantages and disadvantages.

CFPP Design Advantages:

1. Speculative Execution: This is one of the most important advantages of the CFPP. Branches are predicted at the beginning of the pipeline, and a single-bit identifier helps in handling branch predictions and traps.

2. Out-of-order Execution: The CFPP can execute instructions out of order in different stages of the pipeline at the same time.

3. Asynchronous vs. Synchronous: The CFPP was primarily designed as an asynchronous processor, but it can also be implemented as a synchronous processor. In a synchronous implementation [7], (a) instructions can move only one stage, and only on the clock signal; (b) the speed of the clock is determined by the slowest stage in the pipeline; and (c) power consumption is usually higher than for an asynchronous implementation. In an asynchronous implementation [8], instruction execution is event-triggered, so an instruction can move to the next stage as soon as it is able, minimizing pipeline stalling.

4. Others: The CFPP design exhibits instruction-level parallelism [6] features such as super-pipelining, thereby improving execution speed.

CFPP Design Disadvantages:

Enforcing the pipeline matching rules may be expensive. As the design involves two pipelines along with sidings, it may use more chip area. Average execution latency increases as the number of stages in the pipeline increases.
The design may also introduce delays, such as the time between an instruction's issue and the acquisition of all its operands. Though the CFPP provides register renaming, data forwarding, and a simple, efficient implementation for handling interrupts and branching, its performance may be crippled by the greater likelihood of pipeline stalling. For example, consider a dependent memory instruction following an add instruction. In a conventional pipeline, the add operation would be performed and its result stored in the register; the memory operation can then use that result for its execution. In the CFPP, if the memory instruction is launched before the add instruction, it is stalled until the add operation has been performed, and it must then pass through all the pipeline stages again before it can be executed. Due to the instruction advancement rules, the maximum throughput of the pipeline is achieved when it is half full. As the probability of pipeline stalling is higher in the CFPP than in conventional processors, a very efficient compiler that handles some of the data dependencies is highly recommended.
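As a concrete illustration of the matching rules (M1-M3) described above, the following is a minimal behavioral sketch in Python. It is our own model, not part of the original design: the `Binding` record mirrors the register-address/value/valid-flag structure of Figure 2, and `match_stage` is a hypothetical name for the comparison hardware in one stage.

```python
from dataclasses import dataclass

@dataclass
class Binding:
    """A register binding: address, contents, and a 1-bit validity flag."""
    addr: int
    value: int = 0
    valid: bool = False

def match_stage(sources, dest, result):
    """Apply matching rules M1-M3 when an instruction packet (two source
    bindings and one destination binding) meets a result binding in a stage."""
    # M1 (garner): a valid result fills a matching invalid source operand.
    for src in sources:
        if result.valid and not src.valid and src.addr == result.addr:
            src.value, src.valid = result.value, True
    # M2 (kill): an invalid destination invalidates a matching result binding.
    if not dest.valid and result.valid and dest.addr == result.addr:
        result.valid = False
    # M3 (update): a valid destination refreshes a matching result binding.
    if dest.valid and dest.addr == result.addr:
        result.value, result.valid = dest.value, True

# An ADD r3 <- r1, r2 still waiting on r1; a result carrying r1 = 7 flows past.
srcs = [Binding(addr=1), Binding(addr=2, value=5, valid=True)]
dst = Binding(addr=3)
res = Binding(addr=1, value=7, valid=True)
match_stage(srcs, dst, res)
print(srcs[0].valid, srcs[0].value)   # the r1 operand is garnered: True 7
```

Note that M2 fires only when the destination addresses match, so the passing result for r1 here survives to serve later instructions further up the pipeline.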
III. ADVANCES IN SYNCHRONOUS IMPLEMENTATION

Researchers at Oregon State University [2][3] explored synchronous implementations of the counterflow pipeline processor. Janik et al. [2] first attempted to design a general synchronous pipeline structure called the Virtual Register Processor (VRP) [3]. Miller et al. [9] identified three conditions under which pipeline stalls occur in counterflow processors. The first is that an instruction requiring registers that have not been used before must travel up half the pipeline before it can obtain their operand values from the register file. The second is that, since an instruction stays in the pipeline until it reaches the register file, a stall in any intermediate instruction can stall all subsequent instructions. The third is that dependencies between the instructions being issued must be resolved in order to issue more than one instruction per cycle. The first problem was resolved by placing the register file on the same side of the pipeline as the decode unit. To overcome the second problem, the VRP was further modified into the VRP+ processor [9] by adding a reorder buffer (ROB). The basic architecture of the VRP+ is shown in Fig. 3. By wrapping the instruction pipeline back onto itself, pipeline stalling was minimized: as there are always sidings capable of execution further down the pipeline, unexecuted instructions do not stall. Also, as there is no last siding, the need to check dependencies between concurrently issued instructions is eliminated.

IV. APPLICATION-SPECIFIC COUNTERFLOW PROCESSORS

Features of the CFPP such as its simple and regular structure, modularity, local control, and inherent handling of complex mechanisms such as register renaming and speculative execution led Childers et al. [10] to target this architecture for Application-Specific Instruction-set Processors (ASIPs).
ASIPs use minimal instruction-set and micro-architecture elements and give good performance at low cost; hence they are widely used in embedded systems. Childers designed a counterflow pipeline [11] customized to a kernel loop. He modified the original counterflow pipeline into a very long instruction word (VLIW) architecture called the wide counterflow pipeline (WCFP) [13][14][15] to exploit ILP in kernel loops, and demonstrated that CFPPs are appropriate for constructing application-specific processors. The WCFPs had low design complexity while achieving performance comparable to general-purpose architectures. The custom WCFPs had an instruction width of 4, with memory, multiplier, and divider sidings. The simulations showed that the custom pipelines could achieve speed-ups with an average speed-up of 3.7 for several benchmarks (fir, k1, k5, k7, k12, gsm, dither, dot, dct, and mexp). Through these works [10][11][12], he showed that the CFPP is a flexible target for high-level synthesis of application-specific processors. Childers et al. [6] determined that the speedup (Fig. 4) of an asynchronous CFPP could reach up to 6 times that of a synchronous general-purpose pipeline processor. He attributed the improved speedup of the asynchronous CFPP to the following reasons: custom CFPPs eliminate resource contention, as they are tailored to the resource requirements of the graphs; the stages are arranged to minimize the latency of conveying source operands; and they achieve average-case execution time.

Fig. 3: Basic VRP+ Architecture [9]

Fig. 4: Speedup of custom asynchronous CFPPs over general-purpose synchronous CFPPs [6]
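The average speed-up quoted above can be illustrated with a small sketch. The benchmark figures below are hypothetical placeholders, not the paper's data, and the paper does not state whether its 3.7 average is arithmetic or geometric, so both are shown; the geometric mean is often preferred for averaging speed-up ratios.

```python
from math import prod

def arithmetic_mean(speedups):
    """Plain average of per-benchmark speed-up ratios."""
    return sum(speedups) / len(speedups)

def geometric_mean(speedups):
    """n-th root of the product; less sensitive to one outlier ratio."""
    return prod(speedups) ** (1.0 / len(speedups))

# Hypothetical speed-ups of a custom WCFP over a baseline on four kernels.
speedups = {"fir": 2.0, "k1": 4.0, "dct": 4.0, "gsm": 2.0}
vals = list(speedups.values())
print(arithmetic_mean(vals))   # 3.0
print(geometric_mean(vals))    # (2*4*4*2) ** 0.25, about 2.83
```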
V. EVALUATION OF THE COUNTERFLOW PROCESSOR

Though all this research has brought out the salient features of the counterflow processor, we could not find any work that evaluates CFPP performance against that of a conventional pipeline processor using standard metrics. We are currently working on a synchronous implementation and an evaluation of the performance of the counterflow pipeline processor against the traditional MIPS pipeline processor. As the MIPS processor [4] was one of the first RISC processors proposed, we chose it as the baseline for evaluating CFPP architecture performance. The research uses Xilinx Foundation CAD tools, and the designs are coded in VHDL. Both processors are implemented on the same FPGA device (Virtex), and performance parameters such as speed, average CPI (clocks per instruction), number of logic cells and system gates, and performance-to-cost ratio are to be evaluated. Both processors are implemented to execute the same chosen instruction set. Our research also includes a comparison and evaluation of the performance of a branch prediction technique (BTB-HIPT) for the traditional MIPS pipeline processor architecture and the CFPP architecture.

VI. CONCLUSION

This paper surveys the different works on the counterflow pipeline architecture. Since the architecture was proposed in 1994, there have been several research efforts and simulations around it. The simulations claim that the counterflow pipeline processor would prove better than conventional processors. It can be implemented both synchronously and asynchronously, and Childers chose the CFPP over other architectures for ASIPs and embedded systems. However, there has not yet been any asynchronous implementation of the processor. It remains a statement that, if implemented successfully, the Counterflow Pipeline Processor will be the first asynchronous processor [5].
We hence conclude with the expectation that our research on a synchronous implementation will identify the advantages and disadvantages of the counterflow pipeline processor (CFPP) architecture relative to the traditional pipeline processor architecture.

VII. ACKNOWLEDGEMENT

The authors gratefully acknowledge Dr. Ken Currie, Center for Manufacturing Research, Tennessee Technological University, Cookeville, TN, for supporting their research.

VIII. REFERENCES

[1] R.F. Sproull, I.E. Sutherland, and C.E. Molnar, "Counterflow Pipeline Processor Architecture," IEEE Design and Test of Computers, Fall 1994, Vol. 11, No. 3.
[2] K.J. Janik and S. Lu, "Synchronous Implementation of a Counterflow Pipeline Processor," IEEE International Symposium on Circuits and Systems, 1996, Vol. 4.
[3] K.J. Janik, S. Lu, and M.F. Miller, "Advances of the Counterflow Pipeline Microarchitecture," Third Symposium on High Performance Computer Architecture, Feb. 1997.
[4] J.L. Hennessy and D.A. Patterson, Computer Architecture - Hardware Software Co-design, Morgan Kaufmann Publications.
[5] M.D. Jones, "A New Approach to Microprocessors," Department of Computer Science, Brigham Young University, Utah, jones/latex/sproull.html/sproull.html.html.
[6] B. Childers and J. Davidson, "Application-Specific Pipelines for Exploiting Instruction-Level Parallelism," University of Virginia, Technical Report No. CS-98-14, May 1.
[7] C.H. Van Berkel, M.B. Josephs, and S.M. Nowick, "Scanning the Technology: Applications of Asynchronous Circuits," Proceedings of the IEEE, Feb., Vol. 87, No. 2.
[8] A.L. Davis and S.M. Nowick, "An Introduction to Asynchronous Circuit Design," Technical Report UUCS, University of Utah.
[9] M.F. Miller, K.J. Janik, and S. Lu, "Non-Stalling Counterflow Architecture," Fourth Symposium on High Performance Computer Architecture, Las Vegas, NV, Feb. 1998.
[10] B. Childers, "Custom Embedded Counterflow Pipelines," Ph.D. Thesis, University of Virginia, Charlottesville, Virginia, Jan.
[11] B. Childers and J. Davidson, "Automatic Counterflow Pipeline Synthesis," University of Virginia, Technical Report No. CS-98-01, January.
[12] B. Childers and J. Davidson, "A Design Environment for Counterflow Pipeline Synthesis," University of Virginia, Technical Report No. CS-98-05, March.
[13] B. Childers and J. Davidson, "An Infrastructure for Designing Custom Embedded Counterflow Pipelines," Hawaii International Conference on System Sciences, Maui, Hawaii, January 3-7, 2000.
[14] B. Childers and J. Davidson, "Architectural Considerations for Application-Specific Counterflow Pipelines," Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI, Atlanta, Georgia, March 1999.
[15] B. Childers and J. Davidson, "Automatic Design of Custom Wide-Issue Counterflow Pipelines," University of Virginia, CS Technical Report, January.
More informationThe counterow pipeline processor architecture (cfpp) is a proposal for a family of microarchitectures
Counterow Pipeline Processor Architecture Robert F. Sproull Ivan E. Sutherland Sun Microsystems Laboratories, Inc. Charles E. Molnar Institute for Biomedical Computing Washington University SMLI TR-94-25
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationPipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome
Thoai Nam Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy & David a Patterson,
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 05
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More information1.3 Data processing; data storage; data movement; and control.
CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationInstruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction
Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationInstructional Level Parallelism
ECE 585/SID: 999-28-7104/Taposh Dutta Roy 1 Instructional Level Parallelism Taposh Dutta Roy, Student Member, IEEE Abstract This paper is a review of the developments in Instruction level parallelism.
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationA Mechanism for Verifying Data Speculation
A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationCS Mid-Term Examination - Fall Solutions. Section A.
CS 211 - Mid-Term Examination - Fall 2008. Solutions Section A. Ques.1: 10 points For each of the questions, underline or circle the most suitable answer(s). The performance of a pipeline processor is
More informationA Synthesizable RTL Design of Asynchronous FIFO Interfaced with SRAM
A Synthesizable RTL Design of Asynchronous FIFO Interfaced with SRAM Mansi Jhamb, Sugam Kapoor USIT, GGSIPU Sector 16-C, Dwarka, New Delhi-110078, India Abstract This paper demonstrates an asynchronous
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationDesign of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationPipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome
Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy
More informationEE 8217 *Reconfigurable Computing Systems Engineering* Sample of Final Examination
1 Student name: Date: June 26, 2008 General requirements for the exam: 1. This is CLOSED BOOK examination; 2. No questions allowed within the examination period; 3. If something is not clear in question
More informationDesign and Implementation of a FPGA-based Pipelined Microcontroller
Design and Implementation of a FPGA-based Pipelined Microcontroller Rainer Bermbach, Martin Kupfer University of Applied Sciences Braunschweig / Wolfenbüttel Germany Embedded World 2009, Nürnberg, 03.03.09
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationIntroduction to CPU Design
١ Introduction to CPU Design Computer Organization & Assembly Language Programming Dr Adnan Gutub aagutub at uqu.edu.sa [Adapted from slides of Dr. Kip Irvine: Assembly Language for Intel-Based Computers]
More informationAdvanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017
Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationof Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture
Enhancement of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture Sushmita Bilani Department of Electronics and Communication (Embedded System & VLSI Design),
More informationPredict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch
branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)
More informationCS433 Homework 2 (Chapter 3)
CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationPIPELINE AND VECTOR PROCESSING
PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction
More informationChapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction
More informationComputer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014
18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies
VLSI IMPLEMENTATION OF HIGH PERFORMANCE DISTRIBUTED ARITHMETIC (DA) BASED ADAPTIVE FILTER WITH FAST CONVERGENCE FACTOR G. PARTHIBAN 1, P.SATHIYA 2 PG Student, VLSI Design, Department of ECE, Surya Group
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationINSTRUCTION LEVEL PARALLELISM
INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,
More informationSF-LRU Cache Replacement Algorithm
SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationInstructor Information
CS 203A Advanced Computer Architecture Lecture 1 1 Instructor Information Rajiv Gupta Office: Engg.II Room 408 E-mail: gupta@cs.ucr.edu Tel: (951) 827-2558 Office Times: T, Th 1-2 pm 2 1 Course Syllabus
More informationDepartment of Computer Science and Engineering
Department of Computer Science and Engineering UNIT-III PROCESSOR AND CONTROL UNIT PART A 1. Define MIPS. MIPS:One alternative to time as the metric is MIPS(Million Instruction Per Second) MIPS=Instruction
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More information4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16
4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt
More informationElectronics Engineering, DBACER, Nagpur, Maharashtra, India 5. Electronics Engineering, RGCER, Nagpur, Maharashtra, India.
Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Design and Implementation
More information