CAD for VLSI 2 Project - Superscalar Processor Implementation
- Trevor Harrison
1 Superscalar Processor

Objective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may be divided into three parts:

- The Arithmetic Logic Unit
- The Pipelined Processor Architecture
- Cache design

2 The ALU

The objective in this phase is to implement the ALU for integer operations. For a fast ALU, the following are required:

- Fully pipelined Carry Lookahead Adder (CLA) - 32 bit
- Fully pipelined Wallace Tree Multiplier (WTM) - 32 bit
- Fully pipelined Load-Store Unit (LSU)

Refer to lab class notes and your earlier homework for this. Each processor in your design must have A1 CLAs, A2 WTMs and A3 LSUs, where A1, A2 and A3 are parameters. The addition, multiplication and load-store operations may take several cycles to complete, but the pipelining above ensures that at every cycle a new set of operands can be pushed into the arithmetic units for computation. Note that if there is a structural hazard due to non-availability of functional units, the pipeline may stall, and all the instructions that follow the stalled instruction should not be scheduled. In other words, instructions are issued in program order. It is interesting to note that if issue were not in program order, the Tomasulo technique described in class would not correctly handle the data hazards. An id for every instruction should also be passed through the units. This is needed because, if a reservation station R is waiting for a result from an execution unit E, it must identify that instruction among the several instructions that may currently be pipelined and executing in E.
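The CLA's combinational core can be illustrated with a behavioural sketch (in Python, not the required Verilog): each bit position computes a generate and a propagate signal, and all carries follow from the recurrence c[i+1] = g[i] | (p[i] & c[i]). In the real CLA the carries are produced by two-level lookahead logic rather than by evaluating the recurrence serially; the sketch only checks the arithmetic.

```python
# Behavioural sketch of 32-bit carry-lookahead addition (not the RTL).
MASK32 = (1 << 32) - 1

def cla_add32(a: int, b: int, cin: int = 0):
    g = [(a >> i) & (b >> i) & 1 for i in range(32)]    # generate: a & b
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(32)]  # propagate: a ^ b
    c = [cin]
    for i in range(32):
        c.append(g[i] | (p[i] & c[i]))                  # c[i+1] = g | p & c
    s = 0
    for i in range(32):
        s |= (p[i] ^ c[i]) << i                         # sum bit = p ^ carry-in
    return s & MASK32, c[32]                            # (sum, carry-out)

print(cla_add32(5, 7))                                  # -> (12, 0)
```

In the hardware version the same g/p signals are grouped 4 bits at a time into group generate/propagate terms, giving the fully pipelined two-level structure covered in the lab notes.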
3 The Pipelined Processor

3.1 The Basic Pipeline

The processor that you have to design is a RISC (Reduced Instruction Set Computer), also called a Load-Store architecture, with the following instruction set.

General purpose registers: Assume that there are thirty-two 32-bit registers, named R0, ..., R31. R0 always stores the value 0 to facilitate many calculations involving zero (jump on zero, for example).

Instruction set: The instruction set of the processor includes

3 Arithmetic instructions:
ADD R1, R2, R3 ; //R1 = R2 + R3
SUB R1, R2, R3 ; //R1 = R2 - R3
MUL R1, R2, R3 ; //R1 = R2 * R3
All operations are two's complement operations. Exactly one of the source operands of an arithmetic instruction can be a signed immediate operand of 16 bits stored in two's complement format. For example, ADD R1, R0, #5 makes R1 = 5.

2 Data transfer instructions:
LD R1, [Reg]; //R1 = content of the memory location; address is specified by Reg
SD [Reg], R1; //[Reg] = R1

2 Control transfer instructions:
JMP L1; //Unconditional jump to location L1
BEQZ (Reg), L1; //Jump to L1 if Reg content is zero
L1 is given as an offset from the current Program Counter (PC). This is called PC-relative addressing.

Halt instruction: HLT

There are basically 5 stages of instruction execution, as shown in Figure 1. The instructions are assumed to be of fixed length, 4 bytes each. In a store instruction, the WB stage is non-existent. In an arithmetic instruction, the MEM stage is non-existent. The processor is pipelined at the instruction level also.

1. Instruction fetch cycle (IF):
IR <- Mem[PC]; NPC <- PC + 4;
Operation: Send out the Program Counter (PC) and fetch the instruction from memory into the Instruction Register (IR); increment the PC by 4 to address the next sequential instruction. The IR holds the instruction that will be needed on subsequent clock cycles; likewise, the register NPC holds the next sequential PC. The above describes fetching one instruction at a time.
You should fetch P1 instructions at a time in the superscalar architecture. Note that our desire is to execute more than one instruction in every cycle.
Figure 1: The five stages of instruction execution (Instruction Fetch - IF, Instruction Decoding - ID, Execution or Address Evaluation - EX, Memory Access/Branch Completion - MEM, Write Back Results - WB)

2. Instruction Decode/Register fetch cycle (ID):
A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate field of IR;
Operation: Decode the instruction and access the register file to read the registers (rs and rt are the register specifiers). The outputs of the general purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle. Decoding is done in parallel with reading registers, which is possible because these fields are at fixed locations in the instruction format. Since the immediate portion of an instruction is located in an identical place in every instruction, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.

The above describes how to decode one instruction. You should decode P1 instructions in parallel. In addition, in superscalar execution, before registers are fetched, the register status indicators have to be consulted. Also beware of Load and Store instructions, which read registers for calculating memory addresses; these register reads can lead to RAW hazards. This stage is responsible for dynamically scheduling P1 instructions at a time into the respective A1, A2 and A3 units. If units are not available, then stall the pipeline, as a structural hazard is caused. The memory aliasing problem is to be handled using an associative memory as the memory status indicator. Note that the size of this associative memory will be A3 x (number of pipeline stages in the Load-Store unit); this is the maximum number of memory addresses that could be accessed at a time.
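The in-order issue rule above can be sketched as follows (a Python sketch with illustrative names, not the required RTL): each cycle, up to P1 decoded instructions are dispatched to free functional units, and the first instruction that finds no free unit stalls, blocking everything younger than it.

```python
# Hypothetical sketch of in-order issue with structural-hazard stalls.
def issue_in_order(window, free_units):
    """window: opcodes of up to P1 decoded instructions, in program order.
    free_units: counts of free CLAs/WTMs/LSUs. Returns the issued prefix."""
    unit_for = {"ADD": "CLA", "SUB": "CLA", "MUL": "WTM", "LD": "LSU", "SD": "LSU"}
    issued = []
    for op in window:
        u = unit_for[op]
        if free_units.get(u, 0) == 0:   # structural hazard: stall here, and
            break                       # nothing after this point is scheduled
        free_units[u] -= 1
        issued.append(op)
    return issued

# With one CLA free but no WTM, the MUL stalls and blocks the trailing LD
# even though an LSU is available:
print(issue_in_order(["ADD", "MUL", "LD"], {"CLA": 1, "WTM": 0, "LSU": 1}))
# -> ['ADD']
```

Note that only a *prefix* of the window issues: this is exactly the program-order property that the Tomasulo scheme relies on.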
3. Execution/Effective Address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of the following four functions depending on the instruction type.

Memory reference (LD and SD):
ALUOutput <- R0 + Reg;
Operation: The ALU adds R0 to the contents of Reg fetched in the earlier cycle to form the effective address and places the result into the register ALUOutput. Consult the memory status indicator for resolving the memory aliasing problem.

Register-Register ALU instruction (ADD, SUB and MUL):
ALUOutput <- A op B;
Operation: The ALU performs the operation specified by the function code on the value in register A and the value in register B. The result is placed in the temporary register ALUOutput.

Register-Immediate ALU instruction (ADD, SUB and MUL):
ALUOutput <- A op Imm;
Operation: The ALU performs the operation specified by the opcode on the value in register A and the operand Imm. The result is placed in the temporary register ALUOutput.

Branch:
ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0);
Operation: The ALU adds the NPC to the sign-extended immediate value in Imm, shifted left by 2 bits to create a word offset, to compute the address of the branch target. Register A, which was read in the prior cycle, is checked to determine whether the branch is taken. Since we are considering only one form of branch (BEQZ), the comparison is against 0. Note that BEQZ is actually a pseudo-instruction that translates to a BEQ with R0 as an operand. For simplicity, this is the only form of branch we consider.

To reduce the penalty due to control hazards, jumps can be treated specially. Both unconditional and conditional jumps may be decoded in the IF cycle itself. Note that unconditional jumps can be executed in the IF cycle and conditional jumps in the ID cycle. This is straightforward to implement.
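The branch-target arithmetic above can be checked with a small sketch: the 16-bit immediate is interpreted as a two's complement value, shifted left by 2 to form a word offset, and added to NPC, with the result wrapped to the 32-bit datapath width.

```python
# Sketch of ALUOutput <- NPC + (Imm << 2) with a sign-extended 16-bit Imm.
MASK32 = (1 << 32) - 1

def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's complement value."""
    return imm16 - (1 << 16) if imm16 & 0x8000 else imm16

def branch_target(npc, imm16):
    return (npc + (sign_extend16(imm16) << 2)) & MASK32

# A backward branch: the field 0xFFFF encodes -1, i.e. 4 bytes back.
print(hex(branch_target(0x1000, 0xFFFF)))   # -> 0xffc
```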
Note that out of the P1 instructions fetched along with a JMP instruction, all the instructions that appear after the jump instruction should not be scheduled. In the case of a conditional jump, the pipeline should be stalled for one cycle due to the control hazard. The load-store architecture enables the effective address calculation and execution cycle to be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address, and perform an operation on the data.

4. Memory access cycle (MEM):
Memory reference: LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B;
Operation: Access memory, if needed. If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register; if it is a store, then the data from the B register is written into memory. In either case
the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Note: Each processor has two caches - the Instruction cache and the Data cache. The memory has two ports - a read port for accessing instructions and a read/write port for accessing data. Conflicts in addressing on these ports, namely the same address loaded on both ports, should be resolved. When two or more Load/Store units try to access the cache, there would be a structural hazard for accessing the data cache, resulting in stalling of the pipeline inside the Load/Store units. In your implementation, assume that a cache-based structural hazard takes one extra cycle for simultaneous access by two LSUs. In the worst case you may waste A3 - 1 cycles due to cache-based structural hazards. In the case of a cache miss, after the miss is detected, assume it takes two clock cycles to access memory and read/write data.

5. Write-back cycle (WB):
Register-Register ALU instruction: Regs[rd] <- ALUOutput;
Load instruction: Regs[rt] <- LMD;
Operation: Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput); the register destination field is in one of two positions (rd or rt) depending on the effective opcode.

The write-back in the superscalar is on the Common Data Bus (CDB), which is communicated back to the reservation stations. The CDB is shared by several execution units to write back results. The CDB should be designed to handle C1 units committing their results at a time; accordingly, it has 32 x C1 data lines. Note that C1 <= A1 + A2 + A3. The bus arbiter has a simple circular-token protocol. It has a register which stores an integer k <= K = A1 + A2 + A3. In the current cycle, the bus arbiter permits the next C1 units, counted circularly from the k-th execution unit, that have a request for the CDB to write onto the CDB.
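A minimal sketch of the circular-token arbitration, under the reading above (names and the token-update rule are illustrative, not the required RTL): units are numbered 1..K; each cycle the arbiter scans circularly from the token position k and grants the CDB to the first C1 requesting units, then advances the token past the last grant.

```python
# Illustrative circular-token CDB arbiter.
def cdb_arbitrate(k, requests, K, C1):
    """requests: set of unit numbers (1..K) requesting the CDB.
    Returns (granted_units_in_scan_order, new_token_position)."""
    granted = []
    for step in range(K):                       # scan at most one full circle
        unit = ((k - 1 + step) % K) + 1         # k, k+1, ..., wrapping at K
        if unit in requests:
            granted.append(unit)
            if len(granted) == C1:              # CDB can commit C1 per cycle
                break
    new_k = (granted[-1] % K) + 1 if granted else k   # resume after last grant
    return granted, new_k

# K = 5 units, C1 = 2 commits per cycle, token at unit 2:
print(cdb_arbitrate(2, {1, 3, 5}, K=5, C1=2))   # -> ([3, 5], 1)
```

Advancing the token past the last granted unit keeps the protocol fair: a unit that just wrote must wait a full circle before it can pre-empt others.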
Note: The write-back cycle resets the register status indicator and the memory status indicator (if applicable).

3.2 Implementation of the Parallelism

The ideal CPI (Cycles Per Instruction) of a pipelined processor is 1, so we cannot do better than that without introducing redundancy. This redundancy takes the form of parallel execution units in the EX stage, as shown in Figure 2. This arrangement permits overlapped and out-of-order execution of instructions in the EX stage, in addition to conventional pipelining, and has the potential to achieve a CPI < 1.
Figure 2: Duplication of functional units for parallelism (IF and ID stages feeding parallel units EX1, EX2, ..., EXN)

3.3 Pipelining hazards

Hazards are situations which prevent the next instruction in the instruction stream from executing in its designated clock cycle. Hazards may stall the pipeline. There are three types of hazards:

Structural - If functional units are duplicated to accommodate overlap in execution but some combination of instructions cannot run in parallel, a structural hazard results. For example, we have only one write port but pipelining requires 2 writes in the same clock cycle.

Data - to be explained shortly.

Control - these arise from pipelining of branches and other instructions that change the Program Counter (PC). For example, in a conditional jump instruction, until the condition is evaluated the new PC can be either the incremented PC value or the address computed by that instruction. To avoid this we either stall the pipeline for 2 cycles or use branch predictors. In this project, assume no branch predictors are used; instead, we choose to stall the processor. When a conflict is encountered, all instructions before the stalled instruction continue, and all instructions after it are stalled.

Data hazard classification:

1. RAW - Read After Write. Consider the instruction sequence below.
ADD R1, R2, R3
SUB R4, R1, R5
The result of the ADD instruction, written into R1, is required for the SUB instruction to proceed.

2. WAW - Write After Write
LD R1, [addr]
SUB R4, R1, R6
ADD R1, R2, R3
The result of the ADD cannot be written to R1 before the value loaded by the LD is written into R1, since that value is needed by the SUB. Moreover, if the LD suffers a cache miss, the ADD may reach the WB stage before the LD does, leaving R1 with the older value at the end of the sequence.

3. WAR - Write After Read
SD [addr], R4
ADD R4, R3, R2
The older value of R4 should get stored into [addr] by the SD instruction before the new value of R4 is produced by the ADD instruction.

Figure 3: Hardware for handling the pipelining hazards (issue unit, register file, register status indicator and memory status cache feeding reservation stations RS1, ..., RSN, which drive execution units EX1, ..., EXN connected to the Common Data Bus (CDB); each reservation station holds Operation, Qj, Qk, Vj, Vk, Address and Busy fields)

3.4 Hardware for handling pipelining hazards

The hardware used to overcome data hazards is shown in Figure 3. There are K = A1 + A2 + A3 execution units running in parallel, delivering data to a common bus (the Common Data Bus, CDB). Each execution unit has an identification number, an integer in the range [1..K]. The register file is an array of registers which supplies inputs to the execution units. It has K triples of a 5-bit input to specify the register, a read/write input signal and a 32-bit output port. The memory status cache is an associative memory with each entry as shown in Figure 4; it is implemented to avoid the memory aliasing problem. The register status indicator is implemented for handling the RAW and WAR hazards. Each execution unit is driven by an intermediate block called a reservation station. The bits of a reservation station are changed by the issue unit. The register status bits indicate the following:

(0, 0): the register is not currently being written by any other instruction.
(i, j): execution unit i is currently evaluating the instruction with id j, whose result is to be written to this register.

Whenever an execution unit finishes evaluation, it puts its result and its id on the CDB.
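The three data-hazard classes defined above can be summarized mechanically (a purely pedagogical sketch, not part of the RTL): given the register sets an older and a younger instruction write and read, each class corresponds to one intersection test.

```python
# Illustrative classifier for the RAW/WAW/WAR definitions above.
def classify_hazards(older, younger):
    """older, younger: (writes, reads) register-name sets for each instruction."""
    older_w, older_r = older
    younger_w, younger_r = younger
    hazards = set()
    if older_w & younger_r:
        hazards.add("RAW")    # younger reads a register the older writes
    if older_w & younger_w:
        hazards.add("WAW")    # both write the same register
    if older_r & younger_w:
        hazards.add("WAR")    # younger writes a register the older reads
    return hazards

# ADD R1,R2,R3 followed by SUB R4,R1,R5: the SUB reads R1.
print(classify_hazards(({"R1"}, {"R2", "R3"}), ({"R4"}, {"R1", "R5"})))
# -> {'RAW'}
```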
The reservation stations of the other execution units wait for the result from a particular execution unit by constantly snooping the CDB. The format of the bits in a reservation station is given below.
Qj = 0 indicates that Vj holds the value of operand 1.
Qk = 0 indicates that Vk holds the value of operand 2.
Qj = (m, j), where m != 0, indicates that operand 1 is to be taken from the output of the instruction with id 'j' currently executing in the m-th unit.
Qk = (m, j), where m != 0, indicates that operand 2 is to be taken from the output of the instruction with id 'j' currently executing in the m-th unit.
Busy = 0: the execution unit is free. Busy = 1: it is waiting for input.

Figure 4: An entry in the associative memory (Effective address, Unit number accessing it, Instruction id)

Using these units the various hazards are handled. The memory status register needs some explanation. It is used to handle the memory aliasing problem, which occurs in situations such as:

SD [R3+300], R4
LD R2, [R0+100]

A read-after-write conflict will occur if R3+300 = R0+100. To handle this problem, the associative memory status register is used; each entry in this associative memory is shown in Figure 4. Whenever a load is issued, it checks whether the associative memory already holds its address; if so, the load takes its value from the CDB when the unit recorded in the matching entry completes the instruction specified by that entry. This is called Tomasulo's scoreboard technique.

The architecture shown above is basically meant for handling data hazards. For the other two hazards, a separate kind of architecture is not necessary. Firstly, the structural hazard cannot be avoided. To handle the control hazard we can do one of the following:

- Stall the pipeline until completion of the branch instruction
- Use branch predictors

In this project, you will stall instructions till the branch condition is evaluated.

3.5 The CACHE

The cache is used to bridge the gap between the speed of the fast processor and that of the slow main memory. The cache memory is smaller than the main memory and faster than it.
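The Qj/Qk snooping behaviour can be sketched as follows (illustrative names, not the required RTL): when unit m broadcasts the result of instruction id j on the CDB, every reservation station whose Qj or Qk tag equals (m, j) captures the value into Vj or Vk and clears the tag to (0, 0).

```python
# Sketch of a reservation station snooping the CDB.
class ReservationStation:
    def __init__(self, op):
        self.op = op
        self.Qj = self.Qk = (0, 0)   # (0, 0): operand value already in Vj/Vk
        self.Vj = self.Vk = None

    def snoop(self, unit, instr_id, value):
        """Called on every CDB broadcast: (producing unit, instruction id, result)."""
        if self.Qj == (unit, instr_id):
            self.Vj, self.Qj = value, (0, 0)
        if self.Qk == (unit, instr_id):
            self.Vk, self.Qk = value, (0, 0)

    def ready(self):
        return self.Qj == (0, 0) and self.Qk == (0, 0)

rs = ReservationStation("ADD")
rs.Qj = (2, 7)                 # waiting on instruction id 7 in unit 2
rs.Vk = 10                     # operand 2 already available
rs.snoop(2, 7, 32)             # unit 2 broadcasts result 32 on the CDB
print(rs.ready(), rs.Vj)       # -> True 32
```

This is why every instruction carries an id through the execution units: (m, j) distinguishes the awaited result from the other instructions pipelined inside unit m.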
It sits between the processor and the main memory and holds data from the portion of main memory that is currently being referenced. The use of a cache is motivated by the principle of locality of reference. There are basically 2 types of cache, viz. the fully associative and the direct-mapped cache. We use a cache which is a combination of both: the set-associative cache.
The structure of a cache entry is given below.

Figure 5: An entry (line) in the set associative cache memory, holding C3 ways, each with Tag, Data, V and D fields

V: validity of the data.
D: dirty bit. If 1, the data has been written by the processor and is inconsistent with the data in memory. If 0, it has not been modified by the processor.

Caches use two policies for writing to memory:
1. Write through: If a value is to be written, it is updated in the cache and also written to main memory immediately.
2. Write back: The value is written only to the cache, and written to memory only when a location with D=1 and V=1 is to be replaced.

You will design a cache unit with a write-back policy for this project. The set associative cache has C2 cache lines; each can hold up to C3 cache entries. In other words, we design a C3-way set associative cache with C2 lines, so up to C3 collisions can be handled without having to replace a cache entry. The tag is the MSB portion of the address that is not used in cache address generation, and hence identifies the entry uniquely. The least significant log2(C2) bits of the main memory address are decoded to select a particular cache line; hence assume C2 to be a power of 2. The system bus has separate data lines and address lines.

1. Read hit - the cache line holds the value being searched for.
2. Read miss - the cache line does not hold the data, which hence needs to be fetched from memory.
3. Write hit - the cache entry to be written is already in the cache, so the update can be done in the cache only.
4. Write miss - the data already in the cache entry has to be written back to main memory, and the new data written into this cache entry.

Parameter list: A1, A2, A3, P1, C1, C2, C3, N

A1 - Number of CLAs in the processor.
A2 - Number of WTMs in the processor.
A3 - Number of LSUs in the processor.
P1 - Number of instructions fetched at a time.
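The tag/index split described above can be sketched as follows (an illustrative, word-addressed simplification, not the required RTL): the low log2(C2) bits of the address select the cache line, and the remaining high bits form the tag compared against all C3 ways of that line.

```python
# Sketch of set-associative lookup with C2 lines (C2 a power of 2).
def split_address(addr, C2):
    index_bits = C2.bit_length() - 1        # log2(C2)
    index = addr & (C2 - 1)                 # low bits select the line
    tag = addr >> index_bits                # remaining MSBs identify the entry
    return tag, index

def lookup(cache, addr, C2):
    """cache: list of C2 lines, each a list of up to C3 (tag, valid, data) ways."""
    tag, index = split_address(addr, C2)
    for way_tag, valid, data in cache[index]:
        if valid and way_tag == tag:
            return data                     # read hit
    return None                             # read miss

C2 = 4                                      # 4 lines -> 2 index bits
cache = [[] for _ in range(C2)]
tag, index = split_address(0b101101, C2)    # index = 0b01, tag = 0b1011
cache[index].append((tag, True, "hello"))
print(lookup(cache, 0b101101, C2))          # -> hello
```

A full design would also carry the D bit per way and, on a write miss to a full line, write back the evicted dirty entry as the write-back policy requires.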
C1 - Number of execution units whose results are to be committed simultaneously.
C2 - Number of cache lines in the set associative cache.
C3 - Number of cache entries held by a cache line in the set associative cache; in other words, the number of ways of the set-associative cache.

Once the RTL is developed, the next document will give you the verification plan, which will enable you to do the functional verification of your RTL.
4 Implementation

Your Verilog code must follow the synthesis guidelines discussed in class. You will be required to take the design through the various steps of the design flow later; the primary requirement for those stages is that the code is synthesizable. Further instructions will be given as you proceed. Remember that this is a group project, and partitioning of your design is an absolute requirement. Use your time judiciously. Unlike the project specifications for other groups, a grading scheme for the report is not provided; I will talk to the groups and decide on the grading policy.
More informationECSE 425 Lecture 6: Pipelining
ECSE 425 Lecture 6: Pipelining H&P, Appendix A Vu, Meyer Textbook figures 2007 Elsevier Science Last Time Processor Performance EquaQon System performance Benchmarks 2 Today Pipelining Basics RISC InstrucQon
More informationPipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!
Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!
More informationOverview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP
Overview Appendix A Pipelining: Basic and Intermediate Concepts Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 1 2 Unpipelined
More information(Basic) Processor Pipeline
(Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might
More informationPipelining. Each step does a small fraction of the job All steps ideally operate concurrently
Pipelining Computational assembly line Each step does a small fraction of the job All steps ideally operate concurrently A form of vertical concurrency Stage/segment - responsible for 1 step 1 machine
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationComputer Architecture
Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationVery Simple MIPS Implementation
06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationLecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1
Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number
More informationAppendix A. Overview
Appendix A Pipelining: Basic and Intermediate Concepts 1 Overview Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 2 1 Unpipelined
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationAppendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,
Appendix C Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Pipelining Multiple instructions are overlapped in execution Each is in a different stage Each stage is called
More informationPipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction
More informationare Softw Instruction Set Architecture Microarchitecture are rdw
Program, Application Software Programming Language Compiler/Interpreter Operating System Instruction Set Architecture Hardware Microarchitecture Digital Logic Devices (transistors, etc.) Solid-State Physics
More informationAppendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationLecture 7 Pipelining. Peng Liu.
Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt
More informationPIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PIPELINING: HAZARDS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 1 submission deadline: Jan. 30 th This
More informationPipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationECE 505 Computer Architecture
ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationThe Tomasulo Algorithm Implementation
2162 Term Project The Tomasulo Algorithm Implementation Assigned: 11/3/2015 Due: 12/15/2015 In this project, you will implement the Tomasulo algorithm with register renaming, ROB, speculative execution
More informationImproving Performance: Pipelining
Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute
More informationmywbut.com Pipelining
Pipelining 1 What Is Pipelining? Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Today, pipelining is the key implementation technique used to make
More informationScoreboard information (3 tables) Four stages of scoreboard control
Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationVery Simple MIPS Implementation
06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,
More informationPipelining. CSC Friday, November 6, 2015
Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not
More informationLecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1
Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)
More informationStructure of Computer Systems
288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram
More informationDepartment of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri
Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationTomasulo s Algorithm
Tomasulo s Algorithm Architecture to increase ILP Removes WAR and WAW dependencies during issue WAR and WAW Name Dependencies Artifact of using the same storage location (variable name) Can be avoided
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationControl Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.
Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationAdvanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012
Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code
More informationComputer Architecture V Fall Practice Exam Questions
Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationLecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2
Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time
More information