Survey of the Counterflow Pipeline Processor Architectures

Pic Balaji, Esther Ososanya, Wagdy Mahmoud, Karthik Thangarajan
Department of Electrical Engineering, University of the District of Columbia, Van Ness Campus, Washington, D.C., USA

Keywords - Counterflow pipeline, synchronous and asynchronous systems, RISC processors, MIPS architecture.

Abstract - The Counterflow Pipeline Processor (CFPP) Architecture is a RISC-based pipeline processor [1]. It was proposed in 1994 as an asynchronous processor architecture. More recently, researchers have implemented it as a synchronous processor architecture and have improved its design in terms of speed and performance by reducing the average execution latency of instructions and minimizing pipeline stalling. In this paper, we survey the architecture and its key design issues, such as synchronous versus asynchronous implementation, and discuss the advantages and disadvantages of each. We also discuss our research on evaluating the performance of the counterflow pipeline processor architecture against that of the traditional MIPS processor architecture [4].

I. INTRODUCTION

The Counterflow Pipeline Processor (CFPP) Architecture was introduced by Sproull et al. [1] of Sun Microsystems Laboratories as a simple and regular pipeline processor structure. It belongs to the family of RISC processor architectures. The architecture has two pipelines in which the data structures representing instructions and results flow in opposite directions. This allows each instruction and each counterflowing result to interact at every stage. Though the initial architecture was intended for asynchronous implementation, recent research has proposed synchronous implementations of the CFPP architecture. Other variations and modifications to the original design have also been proposed. In this paper, we present the original structure of the CFPP architecture, the pipeline rules that govern it, and the way it handles branches and traps.
We also discuss the different implementations of the architecture and their advantages and disadvantages. Finally, we discuss our research to evaluate the performance of the architecture.

II. THE ORIGINAL CFPP

This section covers the original CFPP properties, structure, functional units and sidings, pipeline rules, conditional branches and traps, and advantages and disadvantages.

A. CFPP Properties

The CFPP is a simple and regular structure with the following properties:

Local Control: Only local information decides whether an instruction in the CFPP should advance.
Regularity: The CFPP architecture seeks geometric regularity in the processor chip layout.
Communication: Every stage communicates primarily with its nearest neighbors. This allows for short and fast communication paths.
Modularity: Stages may differ in their computational logic; however, all stages adopt the same communication protocol.

B. CFPP Structure

The original structure of the Counterflow Pipeline Processor (CFPP) Architecture is shown in Figure 1 [2][5]. The CFPP has an instruction fetch unit at one end of the pipeline stages and a register file at the other end. In between these two stages, instructions and results flow in opposite directions. The pipeline through which the instructions flow is called the instruction pipeline, and the pipeline through which the results flow is called the result pipeline. Pipeline stages operate concurrently, and each stage has an independent function to complete. Alongside the series of pipeline stages, side units perform various arithmetic, logical, and memory operations. These side units are called sidings.
Instructions and results flow through the pipeline as packets. The instruction packets fetched from the instruction memory are decoded by the Decode and Fetch unit before being sent to the adjacent pipeline stage. The register file is one of the sources of the result packets that are sent into the pipeline. The instruction packets in the instruction pipeline and the result packets in the result pipeline interact with each other at every stage. Within each stage, instruction and result packets consist of smaller records called bindings. A binding contains a register address, the register contents, and a 1-bit flag indicating whether or not the content of the register is valid. A typical binding is shown in Figure 2 [1]. Each stage also contains hardware for comparing the addresses of the different bindings to determine whether any information can be exchanged between the instruction and result packets. An instruction packet consists of the instruction operation code (opcode), three bindings, and the program counter value. The first two bindings contain the instruction operands, and the third binding contains the result. A result packet consists of two bindings. The interaction of the instruction and result pipelines is one of the key features of the architecture. An instruction packet and a result packet that occupy the same stage at the same time exchange information.

Fig. 2: Instruction / Result Binding [1]

During their interaction, the register addresses of the instruction's operands are compared to the register addresses of the result, and in case of a match, the result value is copied into the register content field of the instruction. This is called a garner operation. Similarly, the result bindings are updated with valid register values.
This operation, called the update operation, allows newly computed result values to be available to subsequent instructions even before they are stored in the register file. When an instruction reaches the end of the pipeline, the data values stored in its destination binding are written into the corresponding location in the register file. Until this happens, instructions are considered speculative and may be cancelled in case of a trap or branch.

C. Functional Units and Sidings

Functional units called sidings can be connected to the pipeline. These functional units perform memory, logic, and arithmetic operations. Sidings are connected to the pipeline through launch and return stages, and are themselves pipelined. A stage of the pipeline that launches an instruction into a siding is a launch stage, and the launch action is called a launch sequence. The results from a siding are returned to the processor a few stages later, at the return stage; this is a return sequence. While a siding is performing an operation on the instructions launched into it, the processor may perform other operations simultaneously. Thus, sidings allow several operations to progress concurrently. However, sidings need not be part of the architecture; instructions may also be executed in the pipeline stages without using any siding. Typically, instructions with long computation delays (long-latency instructions) are executed in sidings.

D. Pipeline Rules

The pipeline follows a set of execution and matching rules [5][1].

Fig. 1: Counterflow Pipeline Processor Architecture [2][5]

Execution Rules: Four execution rules direct the flow of information between the various stages of the pipeline.

E1. No Overtaking: Instructions cannot deviate from program order in the instruction pipeline, i.e., instructions cannot pass each other.
E2. Execution: An instruction can be executed only if all of its source bindings are valid and if it occupies a stage with suitable computing logic. At the end of the instruction's execution, its destination binding flag is marked valid and its destination binding value is filled with the result.

E3. Insert Result: On completing the execution of an instruction, the destination binding is marked valid and one or more copies of it are made for later instructions awaiting that particular value.

E4. Stalling for Operands: No operation can retire into the register file without being executed. An unexecuted instruction must wait at the last stage with suitable computing logic until it can be executed.

Matching Rules: These rules govern the exchange of bindings between instruction and result packets occupying the same stage at the same time.

M1. Garner Instruction Operands: When a valid result binding matches an invalid instruction operand binding, replace the operand value with the result value and mark the operand binding valid.

M2. Kill Result: When an invalid destination binding matches a valid result binding, mark the result binding invalid.

M3. Update Results: When a valid destination binding matches a result binding, copy the destination value into the result value and mark it valid.

E. Conditional Branches and Traps

One of the features of the CFPP architecture is the way it handles traps and branch instructions. A single-bit identifier in the instruction and result bindings helps in handling branches and traps efficiently. The CFPP usually predicts that a conditional branch will not be taken. In the case of either a trap or a wrongly-predicted branch, a specially-marked result (poison pill) travels down the result pipeline, invalidating all instructions in the pipeline after the trap or the wrongly-predicted instruction (kill instruction).
When the poison pill reaches the end of the result pipeline, it is intercepted by the stage responsible for program counter control (the decode unit). In case of a trap, the address of the trap handler is loaded into the program counter. Similarly, in case of a wrongly-predicted branch, the branch target address is loaded into the program counter. Thus the architecture can recover from erroneous branch predictions and can support precise interrupts.

F. Advantages and Disadvantages

The CFPP has several advantages and disadvantages.

CFPP Design Advantages:

1. Speculative Execution: This is one of the most important advantages of the CFPP. Branches are predicted at the beginning of the pipeline, and a single-bit identifier helps in handling branch predictions and traps.

2. Out-of-order Execution: The CFPP can execute instructions out of order in different stages of the pipeline at the same time.

3. Asynchronous vs. Synchronous: The CFPP was primarily designed as an asynchronous processor, but it can also be implemented as a synchronous processor. In a synchronous implementation [7], (a) instructions can move only one stage, and only on the clock signal; (b) the speed of the clock is determined by the slowest stage in the pipeline; and (c) power consumption is usually higher than for an asynchronous implementation. In an asynchronous implementation [8], instruction execution is event-triggered, so an instruction can move to the next stage as soon as it is able, minimizing pipeline stalling.

4. Others: The CFPP design exhibits instruction-level parallelism [6] features such as super-pipelining, thereby improving execution speed.

CFPP Design Disadvantages:

Enforcing the pipeline matching rules may be expensive. As the design involves two pipelines along with sidings, it may use more chip area. Average execution latency increases as the number of stages in the pipeline increases.
The design may also introduce delays, such as the time between an instruction's issue and the acquisition of all its operands. Though the CFPP provides register renaming, data forwarding, and a simple, efficient implementation for handling interrupts and branching, its performance may be crippled by the greater likelihood of pipeline stalling. For example, consider a dependent memory instruction following an add instruction. In a conventional pipeline, the add operation would be performed and its result stored in the register; the memory operation can then use that result for its execution. In the CFPP, if the memory instruction is launched before the add instruction, it is stalled until the add operation has been performed, and it must then pass through all the pipeline stages again before it can be executed. Due to the instruction advancement rules, the maximum throughput of the pipeline is achieved when it is half full. As the probability of pipeline stalling is higher in the CFPP than in conventional processors, a very efficient compiler that handles some of the data dependencies is highly recommended.
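As a concrete illustration of the matching rules (M1-M3) described above, the following is a minimal behavioral sketch in Python. It is our own model, not part of the original design: the `Binding` record mirrors the register-address/value/valid-flag structure of Figure 2, and `match_stage` is a hypothetical name for the comparison hardware in one stage.

```python
from dataclasses import dataclass

@dataclass
class Binding:
    """A register binding: address, contents, and a 1-bit validity flag."""
    addr: int
    value: int = 0
    valid: bool = False

def match_stage(sources, dest, result):
    """Apply matching rules M1-M3 when an instruction packet (two source
    bindings and one destination binding) meets a result binding in a stage."""
    # M1 (garner): a valid result fills a matching invalid source operand.
    for src in sources:
        if result.valid and not src.valid and src.addr == result.addr:
            src.value, src.valid = result.value, True
    # M2 (kill): an invalid destination invalidates a matching result binding.
    if not dest.valid and result.valid and dest.addr == result.addr:
        result.valid = False
    # M3 (update): a valid destination refreshes a matching result binding.
    if dest.valid and dest.addr == result.addr:
        result.value, result.valid = dest.value, True

# An ADD r3 <- r1, r2 still waiting on r1; a result carrying r1 = 7 flows past.
srcs = [Binding(addr=1), Binding(addr=2, value=5, valid=True)]
dst = Binding(addr=3)
res = Binding(addr=1, value=7, valid=True)
match_stage(srcs, dst, res)
print(srcs[0].valid, srcs[0].value)   # the r1 operand is garnered: True 7
```

Note that M2 fires only when the destination addresses match, so the passing result for r1 here survives to serve later instructions further up the pipeline.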
III. ADVANCES IN SYNCHRONOUS IMPLEMENTATION

Researchers at Oregon State University [2][3] explored synchronous implementations of the counterflow pipeline processor. Janik et al. [2] first attempted to design a general synchronous pipeline structure called the Virtual Register Processor (VRP) [3]. Miller et al. [9] identified three conditions under which pipeline stalls occur in counterflow processors. The first is that an instruction requiring registers that have not been used before must travel up half the pipeline before it can obtain their operand values from the register file. The second is that, since an instruction stays in the pipeline until it reaches the register file, a stall in any intermediate instruction can stall all subsequent instructions. The third is that dependencies between the instructions being issued must be resolved in order to issue more than one instruction per cycle. The first problem was resolved by placing the register file on the same side of the pipeline as the decode unit. To overcome the second problem, the VRP was further modified into the VRP+ processor [9] by adding a reorder buffer (ROB). The basic architecture of the VRP+ is shown in Fig. 3. By wrapping the instruction pipeline back onto itself, pipeline stalling was minimized: as there are always sidings capable of execution further down the pipeline, unexecuted instructions do not stall. Also, as there is no last siding, the need to check dependencies between concurrently issued instructions is eliminated.

IV. APPLICATION-SPECIFIC COUNTERFLOW PROCESSORS

Features of the CFPP such as its simple and regular structure, modularity, local control, and inherent handling of complex mechanisms such as register renaming and speculative execution led Childers et al. [10] to target this architecture for Application-Specific Instruction-set Processors (ASIPs).
ASIPs use minimal instruction-set and micro-architecture elements and give good performance at low cost; hence they are widely used in embedded systems. Childers designed a counterflow pipeline [11] customized to a kernel loop. He modified the original counterflow pipeline into a very long instruction word (VLIW) architecture called the wide counterflow pipeline (WCFP) [13][14][15] to exploit ILP in kernel loops, and demonstrated that CFPPs are appropriate for constructing application-specific processors. The WCFPs had low design complexity while achieving performance comparable to general-purpose architectures. The custom WCFPs had an instruction width of 4, with memory, multiplier, and divider sidings. The simulations showed that the custom pipelines could achieve speed-ups with an average speed-up of 3.7 for several benchmarks (fir, k1, k5, k7, k12, gsm, dither, dot, dct, and mexp). Through these works [10][11][12], he showed that the CFPP is a flexible target for high-level synthesis of application-specific processors. Childers et al. [6] determined that the speedup (Fig. 4) of an asynchronous CFPP could reach up to 6 times that of a synchronous general-purpose pipeline processor. He attributed the improved speedup of the asynchronous CFPP to the following reasons: custom CFPPs eliminate resource contention, as they are tailored to the resource requirements of the graphs; the stages are arranged to minimize the latency of conveying source operands; and they achieve average-case execution time.

Fig. 3: Basic VRP+ Architecture [9]

Fig. 4: Speedup of custom asynchronous CFPPs over general-purpose synchronous CFPPs [6]
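The average speed-up quoted above can be illustrated with a small sketch. The benchmark figures below are hypothetical placeholders, not the paper's data, and the paper does not state whether its 3.7 average is arithmetic or geometric, so both are shown; the geometric mean is often preferred for averaging speed-up ratios.

```python
from math import prod

def arithmetic_mean(speedups):
    """Plain average of per-benchmark speed-up ratios."""
    return sum(speedups) / len(speedups)

def geometric_mean(speedups):
    """n-th root of the product; less sensitive to one outlier ratio."""
    return prod(speedups) ** (1.0 / len(speedups))

# Hypothetical speed-ups of a custom WCFP over a baseline on four kernels.
speedups = {"fir": 2.0, "k1": 4.0, "dct": 4.0, "gsm": 2.0}
vals = list(speedups.values())
print(arithmetic_mean(vals))   # 3.0
print(geometric_mean(vals))    # (2*4*4*2) ** 0.25, about 2.83
```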
V. EVALUATION OF THE COUNTERFLOW PROCESSOR

Though all this research has brought out the salient features of the counterflow processor, we could not find any work that evaluates CFPP performance against that of a conventional pipeline processor using standard metrics. We are currently working on a synchronous implementation and an evaluation of the performance of the counterflow pipeline processor against the traditional MIPS pipeline processor. As the MIPS processor [4] was one of the first RISC processors proposed, we chose it as the baseline for evaluating CFPP architecture performance. The research uses Xilinx Foundation CAD tools, and the designs are coded in VHDL. Both processors are implemented on the same FPGA device (Virtex), and performance parameters such as speed, average CPI (clocks per instruction), number of logic cells and system gates, and performance-to-cost ratio are to be evaluated. Both processors are implemented to execute the same chosen instruction set. Our research also includes a comparison and evaluation of the performance of a branch prediction technique (BTB-HIPT) for the traditional MIPS pipeline processor architecture and the CFPP architecture.

VI. CONCLUSION

This paper surveys the different works on the counterflow pipeline architecture. Since the architecture was proposed in 1994, there have been several research efforts and simulations around it. The simulations claim that the counterflow pipeline processor would prove better than conventional processors. It can be implemented both synchronously and asynchronously, and Childers chose the CFPP over other architectures for ASIPs and embedded systems. However, there has not yet been any asynchronous implementation of the processor. It remains a statement that, if implemented successfully, the Counterflow Pipeline Processor will be the first asynchronous processor [5].
We hence conclude with the expectation that our research on a synchronous implementation will identify the advantages and disadvantages of the counterflow pipeline processor (CFPP) architecture relative to the traditional pipeline processor architecture.

VII. ACKNOWLEDGEMENT

The authors gratefully acknowledge Dr. Ken Currie, Center for Manufacturing Research, Tennessee Technological University, Cookeville, TN, for supporting their research.

VIII. REFERENCES

[1] R.F. Sproull, I.E. Sutherland, and C.E. Molnar, "Counterflow Pipeline Processor Architecture," IEEE Design and Test of Computers, Fall 1994, Vol. 11, No. 3.
[2] K.J. Janik and S. Lu, "Synchronous Implementation of a Counterflow Pipeline Processor," IEEE International Symposium on Circuits and Systems, 1996, Vol. 4.
[3] K.J. Janik, S. Lu, and M.F. Miller, "Advances of the Counterflow Pipeline Microarchitecture," Third Symposium on High Performance Computer Architecture, Feb. 1997.
[4] J.L. Hennessy and D.A. Patterson, Computer Architecture - Hardware Software Co-design, Morgan Kaufmann Publications.
[5] M.D. Jones, "A New Approach to Microprocessors," Department of Computer Science, Brigham Young University, Utah, jones/latex/sproull.html/sproull.html.html.
[6] B. Childers and J. Davidson, "Application-Specific Pipelines for Exploiting Instruction-Level Parallelism," University of Virginia, Technical Report No. CS-98-14, May 1.
[7] C.H. Van Berkel, M.B. Josephs, and S.M. Nowick, "Scanning the Technology: Applications of Asynchronous Circuits," Proceedings of the IEEE, Feb., Vol. 87, No. 2.
[8] A.L. Davis and S.M. Nowick, "An Introduction to Asynchronous Circuit Design," Technical Report UUCS, University of Utah.
[9] M.F. Miller, K.J. Janik, and S. Lu, "Non-Stalling Counterflow Architecture," Fourth Symposium on High Performance Computer Architecture, Las Vegas, NV, Feb. 1998.
[10] B. Childers, "Custom Embedded Counterflow Pipelines," Ph.D. Thesis, University of Virginia, Charlottesville, Virginia, Jan.
[11] B. Childers and J. Davidson, "Automatic Counterflow Pipeline Synthesis," University of Virginia, Technical Report No. CS-98-01, January.
[12] B. Childers and J. Davidson, "A Design Environment for Counterflow Pipeline Synthesis," University of Virginia, Technical Report No. CS-98-05, March.
[13] B. Childers and J. Davidson, "An Infrastructure for Designing Custom Embedded Counterflow Pipelines," Hawaii International Conference on System Sciences, Maui, Hawaii, January 3-7, 2000.
[14] B. Childers and J. Davidson, "Architectural Considerations for Application-Specific Counterflow Pipelines," Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI, Atlanta, Georgia, March 1999.
[15] B. Childers and J. Davidson, "Automatic Design of Custom Wide-Issue Counterflow Pipelines," University of Virginia, CS Technical Report, January.
More informationThe counterow pipeline processor architecture (cfpp) is a proposal for a family of microarchitectures
Counterow Pipeline Processor Architecture Robert F. Sproull Ivan E. Sutherland Sun Microsystems Laboratories, Inc. Charles E. Molnar Institute for Biomedical Computing Washington University SMLI TR-94-25
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationPipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome
Thoai Nam Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy & David a Patterson,
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 05
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More information1.3 Data processing; data storage; data movement; and control.
CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationInstruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction
Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationInstructional Level Parallelism
ECE 585/SID: 999-28-7104/Taposh Dutta Roy 1 Instructional Level Parallelism Taposh Dutta Roy, Student Member, IEEE Abstract This paper is a review of the developments in Instruction level parallelism.
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationA Mechanism for Verifying Data Speculation
A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationCS Mid-Term Examination - Fall Solutions. Section A.
CS 211 - Mid-Term Examination - Fall 2008. Solutions Section A. Ques.1: 10 points For each of the questions, underline or circle the most suitable answer(s). The performance of a pipeline processor is
More informationA Synthesizable RTL Design of Asynchronous FIFO Interfaced with SRAM
A Synthesizable RTL Design of Asynchronous FIFO Interfaced with SRAM Mansi Jhamb, Sugam Kapoor USIT, GGSIPU Sector 16-C, Dwarka, New Delhi-110078, India Abstract This paper demonstrates an asynchronous
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationDesign of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationPipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome
Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy
More informationEE 8217 *Reconfigurable Computing Systems Engineering* Sample of Final Examination
1 Student name: Date: June 26, 2008 General requirements for the exam: 1. This is CLOSED BOOK examination; 2. No questions allowed within the examination period; 3. If something is not clear in question
More informationDesign and Implementation of a FPGA-based Pipelined Microcontroller
Design and Implementation of a FPGA-based Pipelined Microcontroller Rainer Bermbach, Martin Kupfer University of Applied Sciences Braunschweig / Wolfenbüttel Germany Embedded World 2009, Nürnberg, 03.03.09
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationIntroduction to CPU Design
١ Introduction to CPU Design Computer Organization & Assembly Language Programming Dr Adnan Gutub aagutub at uqu.edu.sa [Adapted from slides of Dr. Kip Irvine: Assembly Language for Intel-Based Computers]
More informationAdvanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017
Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationof Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture
Enhancement of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture Sushmita Bilani Department of Electronics and Communication (Embedded System & VLSI Design),
More informationPredict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch
branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)
More informationCS433 Homework 2 (Chapter 3)
CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationPIPELINE AND VECTOR PROCESSING
PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction
More informationChapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction
More informationComputer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014
18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies
VLSI IMPLEMENTATION OF HIGH PERFORMANCE DISTRIBUTED ARITHMETIC (DA) BASED ADAPTIVE FILTER WITH FAST CONVERGENCE FACTOR G. PARTHIBAN 1, P.SATHIYA 2 PG Student, VLSI Design, Department of ECE, Surya Group
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationINSTRUCTION LEVEL PARALLELISM
INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,
More informationSF-LRU Cache Replacement Algorithm
SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationInstructor Information
CS 203A Advanced Computer Architecture Lecture 1 1 Instructor Information Rajiv Gupta Office: Engg.II Room 408 E-mail: gupta@cs.ucr.edu Tel: (951) 827-2558 Office Times: T, Th 1-2 pm 2 1 Course Syllabus
More informationDepartment of Computer Science and Engineering
Department of Computer Science and Engineering UNIT-III PROCESSOR AND CONTROL UNIT PART A 1. Define MIPS. MIPS:One alternative to time as the metric is MIPS(Million Instruction Per Second) MIPS=Instruction
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More information4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16
4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt
More informationElectronics Engineering, DBACER, Nagpur, Maharashtra, India 5. Electronics Engineering, RGCER, Nagpur, Maharashtra, India.
Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Design and Implementation
More information