CAD for VLSI 2 Project - Superscalar Processor Implementation



1 Superscalar Processor

Objective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may be divided into three parts: the Arithmetic Logic Unit, the pipelined processor architecture, and the cache design.

2 The ALU

The objective in this phase is to implement the ALU for integer operations. A fast ALU requires a fully pipelined 32-bit Carry Lookahead Adder (CLA), a fully pipelined 32-bit Wallace Tree Multiplier (WTM), and a fully pipelined Load-Store Unit (LSU). Refer to the lab class notes and your earlier homework for these. Each processor in your design must have A1 CLAs, A2 WTMs and A3 LSUs, where A1, A2 and A3 are parameters.

An addition, multiplication or load-store operation may take several cycles to complete, but pipelining ensures that a new set of operands can be pushed into each arithmetic unit in every cycle. Note that if there is a structural hazard because no functional unit is available, the pipeline may stall, and none of the instructions that follow the stalled instruction should be scheduled. In other words, instructions are issued in program order. If instructions were not issued in program order, the Tomasulo technique described in class would not correctly handle the data hazards.

An id must also be passed through the units along with every instruction. This is needed because, if a reservation station R is waiting for a result from an execution unit E, it must specify which of the several instructions currently being pipelined and executed in E it is waiting for.
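As a software reference for checking the CLA RTL, the sketch below models a 32-bit two-level carry-lookahead adder: eight 4-bit blocks compute group generate/propagate signals, and a second lookahead level computes the carry into each block. The 4-bit block size and the function names `lookahead_carries` and `cla32` are assumptions for illustration, and the pipeline registers of the real, fully pipelined adder are not modeled; only the combinational result is.

```python
MASK32 = 0xFFFFFFFF

def lookahead_carries(g, p, cin):
    """Carries c[0..n] from generate/propagate bits using the expanded lookahead form.

    c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[1]g[0] | p[i]...p[0]cin,
    so every carry depends only on g, p and cin (no ripple through c[i]).
    """
    carries = [cin]
    for i in range(len(g)):
        c = cin
        for k in range(i + 1):          # term p[i]...p[0]cin
            c &= p[k]
        for j in range(i + 1):          # terms g[j] p[j+1]...p[i]
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        carries.append(c)
    return carries

def cla32(a, b, cin=0):
    """32-bit two-level CLA built from eight 4-bit blocks; returns (sum, carry_out)."""
    a &= MASK32
    b &= MASK32
    g = [(a >> i) & (b >> i) & 1 for i in range(32)]     # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(32)]   # bit propagate
    G, P = [], []
    for blk in range(8):                                 # level 1: group signals
        gb, pb = g[4 * blk:4 * blk + 4], p[4 * blk:4 * blk + 4]
        G.append(lookahead_carries(gb, pb, 0)[4])        # group generate
        P.append(pb[0] & pb[1] & pb[2] & pb[3])          # group propagate
    block_cin = lookahead_carries(G, P, cin)             # level 2: carry into each block
    s = 0
    for blk in range(8):
        lo = 4 * blk
        c = lookahead_carries(g[lo:lo + 4], p[lo:lo + 4], block_cin[blk])
        for i in range(4):
            s |= (p[lo + i] ^ c[i]) << (lo + i)          # sum bit = p XOR carry-in
    return s, block_cin[8]
```

SUB can reuse the same adder as A + ~B + 1; for example, `cla32(5, ~3 & MASK32, 1)` returns `(2, 1)`.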

3 The Pipelined Processor

3.1 The Basic Pipeline

The processor that you have to design is a RISC (Reduced Instruction Set Computer) processor with a load-store architecture and the following instruction set.

General purpose registers: Assume that there are thirty-two 32-bit registers, named R0, ..., R31. R0 always holds the value 0, which simplifies many operations involving zero (jump on zero, for example).

Instruction set: The instruction set of the processor includes:

3 arithmetic instructions
    ADD R1, R2, R3 ; //R1 = R2 + R3
    SUB R1, R2, R3 ; //R1 = R2 - R3
    MUL R1, R2, R3 ; //R1 = R2 * R3
All operations are two's complement operations. At most one of the source operands of an arithmetic instruction can be a 16-bit signed immediate operand stored in two's complement format; for example, ADD R1, R0, #5; makes R1 = 5.

2 data transfer instructions
    LD R1, [Reg] ; //R1 = content of the memory location whose address is in Reg
    SD [Reg], R1 ; //[Reg] = R1

2 control transfer instructions
    JMP L1 ; //Unconditional jump to location L1
    BEQZ (Reg), L1 ; //Jump to L1 if the content of Reg is zero
L1 is given as an offset from the current Program Counter (PC). This is called PC-relative addressing.

Halt instruction
    HLT

There are basically 5 stages of instruction execution, as shown in Figure 1. The instructions are assumed to have a fixed length of 4 bytes each. In a store instruction the WB stage is non-existent, and in an arithmetic instruction the MEM stage is non-existent. The processor is also pipelined at the instruction level.

1. Instruction fetch cycle (IF):
    IR ← Mem[PC];
    NPC ← PC + 4;
Operation: Send out the Program Counter (PC) and fetch the instruction from memory into the Instruction Register (IR); increment the PC by 4 to address the next sequential instruction. The IR holds the instruction that will be needed on subsequent clock cycles; likewise, the register NPC holds the next sequential PC. The above describes fetching one instruction at a time. In the superscalar architecture you should fetch P1 instructions at a time, since the goal is to execute more than one instruction in every cycle.
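A sequential (non-pipelined) reference model of the instruction set listed above can serve as a golden model when verifying the superscalar RTL: run the same program on both and compare registers and memory at HLT. This sketch makes several assumptions that the spec does not fix: instructions are already decoded into tuples (no binary encoding is defined), only the second source operand may be an immediate, data memory is a flat dictionary, and JMP/BEQZ offsets are in words relative to NPC = PC + 4, matching the EX-stage formula NPC + (Imm << 2). The names `run` and `to_signed` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit register value as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def run(program, mem=None, max_steps=10_000):
    """Execute decoded instructions sequentially until HLT.

    program: dict mapping byte address -> instruction tuple, for example
             ("ADD", rd, rs, rt) or ("ADD", rd, rs, ("#", imm)) for an immediate.
    mem:     dict mapping address -> 32-bit word (hypothetical flat data memory).
    """
    regs = [0] * 32
    mem = dict(mem or {})
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        npc = pc + 4                                   # NPC <- PC + 4
        if op in ("ADD", "SUB", "MUL"):
            rd, rs, src2 = args
            a = regs[rs]
            # Assumption: only the second source may be an immediate, tagged ("#", value).
            b = (src2[1] & MASK32) if isinstance(src2, tuple) else regs[src2]
            if op == "ADD":
                res = a + b
            elif op == "SUB":
                res = a - b
            else:
                res = a * b        # low 32 bits are identical for signed and unsigned
            if rd != 0:                                # R0 always reads as 0
                regs[rd] = res & MASK32
        elif op == "LD":                               # LD rd, [rs]
            rd, rs = args
            if rd != 0:
                regs[rd] = mem.get(regs[rs], 0)
        elif op == "SD":                               # SD [rs], rt
            rs, rt = args
            mem[regs[rs]] = regs[rt]
        elif op == "JMP":                              # offset in words from NPC
            (off,) = args
            npc += off << 2
        elif op == "BEQZ":                             # BEQZ (rs), offset
            rs, off = args
            if regs[rs] == 0:
                npc += off << 2
        elif op == "HLT":
            return regs, mem
        else:
            raise ValueError(f"unknown opcode {op}")
        pc = npc
    raise RuntimeError("no HLT within max_steps")
```

A program is passed as `{4 * i: ins for i, ins in enumerate(listing)}`, so that instruction i sits at byte address 4i, consistent with the 4-byte fixed instruction length.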

Figure 1: The five stages of instruction execution: Instruction Fetch (IF), Instruction Decoding (ID), Execution or address evaluation (EX), Memory access/branch completion (MEM), and Write back results (WB).

2. Instruction Decode/Register fetch cycle (ID):
    A ← Regs[rs];
    B ← Regs[rt];
    Imm ← sign-extended immediate field of IR;
Operation: Decode the instruction and access the register file to read the registers (rs and rt are the register specifiers). The outputs of the general purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm for use in the next cycle. Decoding is done in parallel with reading the registers, which is possible because these fields are at a fixed location in the instruction format. Since the immediate portion of an instruction is assumed to be located in the same place in every instruction, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.

The above describes how to decode one instruction; you should decode P1 instructions in parallel. In addition, in superscalar execution the register status indicators must be consulted before registers are fetched. Also beware of Load and Store instructions, which read registers to calculate memory addresses; these register reads can lead to RAW hazards. This stage is responsible for dynamically scheduling the P1 instructions into the A1 CLAs, A2 WTMs and A3 LSUs. If the required units are not available, a structural hazard occurs and the pipeline is stalled.

The memory aliasing problem is to be handled using an associative memory as the memory status indicator. The size of this associative memory is A3 × (number of pipeline stages in the Load-Store unit), which is the maximum number of memory addresses that can be accessed at a time.
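The in-order issue rule of the ID stage (issue stops at the first instruction whose unit type has no free slot, and every instruction after it waits, even if its own unit is free) can be sketched as follows. This is a minimal model under stated assumptions: a decoded group is represented as a list of opcodes, `free` counts the units of each type that can accept an instruction this cycle, and JMP, BEQZ and HLT are assumed to need no execution unit. The register and memory status checks are not modeled, and the names `UNIT_FOR_OP` and `issue_in_order` are hypothetical.

```python
UNIT_FOR_OP = {"ADD": "CLA", "SUB": "CLA", "MUL": "WTM", "LD": "LSU", "SD": "LSU"}

def issue_in_order(group, free):
    """Issue a decoded group (at most P1 opcodes) in program order.

    free: units able to accept an instruction this cycle,
          e.g. {"CLA": A1, "WTM": A2, "LSU": A3}.
    Returns (issued, stalled): stalled holds the first instruction that found
    no free unit and every instruction after it.
    """
    free = dict(free)                 # do not modify the caller's counts
    for i, op in enumerate(group):
        unit = UNIT_FOR_OP.get(op)
        if unit is None:              # assumption: ops without a unit issue freely
            continue
        if free.get(unit, 0) == 0:    # structural hazard: stall from here on
            return group[:i], group[i:]
        free[unit] -= 1
    return list(group), []
```

For example, with one WTM free, the group ADD, MUL, MUL, ADD issues only ADD and MUL; the trailing ADD waits behind the stalled MUL even though a CLA is available, because issue is in program order.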

3. Execution/Effective Address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of the following four functions depending on the instruction type.

Memory reference (LD and SD):
    ALUOutput ← R0 + Reg;
Operation: The ALU adds R0 to the contents of Reg fetched in the earlier cycle to form the effective address and places the result into the register ALUOutput. Consult the memory status indicator to resolve the memory aliasing problem.

Register-Register ALU instruction (ADD, SUB and MUL):
    ALUOutput ← A op B;
Operation: The ALU performs the operation specified by the function code on the value in register A and the value in register B. The result is placed in the temporary register ALUOutput.

Register-Immediate ALU instruction (ADD, SUB and MUL):
    ALUOutput ← A op Imm;
Operation: The ALU performs the operation specified by the opcode on the value in register A and the operand Imm. The result is placed in the temporary register ALUOutput.

Branch:
    ALUOutput ← NPC + (Imm << 2);
    Cond ← (A == 0);
Operation: The ALU adds the NPC to the sign-extended immediate value in Imm, which is shifted left by 2 bits to create a word offset, to compute the address of the branch target. Register A, which was read in the prior cycle, is checked to determine whether the branch is taken. Since we are considering only one form of branch (BEQZ), the comparison is against 0. Note that BEQZ is actually a pseudo-instruction that translates to a BEQ with R0 as an operand. For simplicity, this is the only form of branch we consider.

To reduce the penalty due to control hazards, jumps can be treated specially. Both unconditional and conditional jumps may be decoded in the IF cycle itself: unconditional jumps can be executed in the IF cycle and conditional jumps in the ID cycle. This is straightforward to implement.
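The branch computation ALUOutput ← NPC + (Imm << 2), Cond ← (A == 0) can be checked against a small model. This sketch assumes, as stated for the ID stage, that the 16-bit immediate occupies the low half of IR, and that addresses wrap at 32 bits; the names `sign_extend16` and `beqz_ex` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(ir):
    """Sign-extend the low 16 bits of the instruction register to 32 bits."""
    imm = ir & 0xFFFF
    return (imm | 0xFFFF0000) if imm & 0x8000 else imm

def beqz_ex(npc, ir, a):
    """Return (branch target, taken) for BEQZ as computed in the EX stage."""
    imm = sign_extend16(ir)
    target = (npc + ((imm << 2) & MASK32)) & MASK32   # word offset -> byte offset
    return target, a == 0
```

For example, with NPC = 0x104 and an immediate field of 0xFFFC (that is, -4 words), the target is 0x104 - 16 = 0xF4, and the branch is taken only if A is zero.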
Of the P1 instructions fetched together with a JMP instruction, those that appear after the JMP must not be scheduled. In the case of a conditional jump, the pipeline should be stalled for one cycle due to the control hazard.

The load-store architecture allows the effective address calculation and execution cycles to be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address, and perform an operation on the data.

4. Memory access cycle (MEM):
Memory reference:
    LMD ← Mem[ALUOutput]; or
    Mem[ALUOutput] ← B;
Operation: Access memory, if needed. If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register; if it is a store, the data from the B register is written into memory. In either case, the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Note: Each processor has two caches, the Instruction cache and the Data cache. The memory has two ports: a read port for accessing instructions and a read/write port for accessing data. Conflicts between these ports, namely the same address presented on both ports, must be resolved. When two or more Load/Store units try to access the data cache at the same time, there is a structural hazard, which stalls the pipeline inside the Load/Store units. In your implementation, assume that simultaneous access by two LSUs costs one extra cycle; in the worst case (all A3 LSUs accessing at once) you may waste A3 - 1 cycles due to cache-based structural hazards. In the case of a cache miss, assume that, once the miss is detected, it takes two clock cycles to access memory and read/write the data.

5. Write-back cycle (WB):
Register-Register ALU instruction:
    Regs[rd] ← ALUOutput;
Load instruction:
    Regs[rt] ← LMD;
Operation: Write the result into the register file, whether it comes from the memory system (in LMD) or from the ALU (in ALUOutput); the register destination field is in one of two positions (rd or rt) depending on the effective opcode.

In the superscalar processor, write-back takes place over the Common Data Bus (CDB), which communicates the results back to the reservation stations. The CDB is shared by several execution units to write back their results. It should be designed so that C1 units can commit their results at a time, so the CDB has 32 × C1 data lines. Note that C1 ≤ A1 + A2 + A3. The bus arbiter uses a simple circular-token protocol: it has a register that stores an integer k ≤ K = A1 + A2 + A3. In each cycle, scanning circularly from the k-th execution unit, the arbiter allows the next C1 units that have a pending request for the CDB to write into it.
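The circular-token CDB arbiter can be modeled in a few lines. The spec does not say how the token k advances between cycles; this sketch assumes it moves to the unit just after the last one granted (a common round-robin choice) and stays put when nothing is granted. Units are numbered 0..K-1 here for convenience, and the name `cdb_arbitrate` is hypothetical.

```python
def cdb_arbitrate(requests, k, c1):
    """Grant the CDB to up to c1 requesting units, scanning circularly from unit k.

    requests: list of K booleans; requests[u] is True if unit u has a result ready.
    k:        token position (index of the first unit considered this cycle).
    Returns (granted unit indices in scan order, next token position).
    """
    K = len(requests)
    granted = []
    for step in range(K):
        u = (k + step) % K
        if requests[u]:
            granted.append(u)
            if len(granted) == c1:
                break
    # Assumption: the token moves just past the last granted unit;
    # if no unit was granted it stays where it is.
    next_k = (granted[-1] + 1) % K if granted else k
    return granted, next_k
```

With requests from units 0, 2, 3 and 5 out of six and C1 = 2, a token at unit 2 grants units 2 and 3 and moves the token to 4; the next cycle then grants units 5 and 0, so no requester is starved.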
Note: The write-back cycle resets the register status indicator and, if applicable, the memory status indicator.

3.2 Implementation of the Parallelism

The ideal CPI (Cycles per Instruction) of a scalar pipelined processor is 1, and we cannot do better than that without introducing redundancy. This redundancy takes the form of parallel execution units in the EX stage, as shown in Figure 2. This arrangement enables overlapped and out-of-order execution of instructions in the EX stage, in addition to conventional pipelining, and has the potential to achieve a CPI < 1.

Figure 2: Duplication of functional units for parallelism (IF and ID stages feeding parallel execution units EX1, EX2, EX3, ..., EXN)

3.3 Pipelining hazards

Hazards are situations that prevent the next instruction in the instruction stream from executing in its designated clock cycle. Hazards may stall the pipeline. There are three types of hazards.

Structural hazards arise when some combination of instructions cannot run in parallel because a resource has not been duplicated enough to accommodate overlapped execution. For example, the register file has only one write port, but the pipeline requires two writes in the same clock cycle.

Data hazards are explained shortly.

Control hazards arise from the pipelining of branches and other instructions that change the Program Counter (PC). For example, in a conditional jump instruction, until the condition is evaluated the new PC can be either the incremented PC value or the target address specified in that instruction. To avoid this, we either stall the pipeline for 2 cycles or use branch predictors. In this project, assume that no branch predictors are used; instead, we choose to stall the processor. When a conflict is encountered, all instructions before the stalled instruction continue, and all instructions after the stalled instruction are stalled.

Data hazard classification

1. RAW - Read After Write. Consider the instruction sequence given below.
    ADD R1, R2, R3
    SUB R4, R1, R5
The result of the ADD instruction, which is written into R1, is required for the SUB instruction to proceed.

2. WAW - Write After Write
    LD R1, [addr]
    SUB R4, R1, R6
    ADD R1, R2, R3

The ADD must not write R1 before the load does: the SUB needs the loaded value, and at the end of the sequence R1 must hold the result of the ADD. If the load suffers a cache miss, the ADD can reach the WB stage before the load; the load then overwrites R1, leaving the older value in R1 at the end of the sequence.

3. WAR - Write After Read
    SD [addr], R4
    ADD R4, R3, R2
The SD instruction must store the older value of R4 into [addr] before the ADD instruction updates R4 with the new value.

Figure 3: Hardware for handling the pipelining hazards (issue unit, register file, register status indicator and memory status cache feeding reservation stations RS1, RS2, ..., RSN, which drive execution units EX1, EX2, ..., EXN connected to the Common Data Bus (CDB); each reservation station holds the fields Operation, Qj, Qk, Vj, Vk, Address and Busy)

3.4 Hardware for handling pipelining hazards

The hardware used to overcome data hazards is shown in Figure 3. There are K = A1 + A2 + A3 execution units running in parallel, delivering their results to a common bus (the Common Data Bus, CDB). Each execution unit has an identification number, which is an integer in the range [1..K]. The register file is an array of registers that provides the inputs to the execution units. It has K triples, each consisting of a 5-bit input to specify the register, a read/write input signal and a 32-bit output port. The memory status cache is an associative memory, with each entry as shown in Figure 4, and is implemented to avoid the memory aliasing problem. The register status indicator is implemented for handling the RAW and WAR hazards. Each execution unit is driven by an intermediate block called a reservation station, whose bits are set by the issue unit. The register status bits indicate the following:
    (0, 0): the register is not currently being written by any instruction;
    (i, j): execution unit i is currently evaluating the instruction with id j, whose result is to be written to this register.
Whenever an execution unit finishes evaluation, it puts its result and its id on the CDB.
Reservation stations that are waiting for the result of a particular execution unit obtain it by constantly snooping the CDB. The format of the bits in a reservation station is given below.

Qj = 0 indicates that Vj holds the value of operand 1.
Qk = 0 indicates that Vk holds the value of operand 2.
Qj = (m, j), where m ≠ 0, indicates that the first operand must be taken from the output of the instruction with id 'j' currently being executed in the m-th unit.
Qk = (m, j), where m ≠ 0, indicates that the second operand must be taken from the output of the instruction with id 'j' currently being executed in the m-th unit.
Busy = 0: the execution unit is free.
Busy = 1: the execution unit is waiting for input.

Figure 4: An entry in the associative memory (fields: Effective address | Unit number accessing it | Instruction id)

Using these units, the various hazards are handled. The memory status register needs further explanation: it is used to handle the memory aliasing problem, which occurs in situations like the following:
    SD [R3+300], R4
    LD R2, [R0+100]
A read-after-write conflict occurs if R3+300 = R0+100. To handle this problem, the associative memory status register is used; each of its entries is shown in Figure 4. Whenever a load is issued, it checks whether its address is present in the associative memory. If it is, the load reads the value from the CDB itself when the unit pointed to by the matching entry completes the instruction specified by that entry. This follows Tomasulo's technique.

The architecture shown above is primarily meant for handling data hazards; the other two kinds of hazards do not need a separate architecture. First, structural hazards cannot be avoided (the pipeline simply stalls). Second, to handle control hazards we can either stall the pipeline until the branch instruction completes, or use branch predictors. In this project, you will stall instructions until the branch condition is evaluated.

3.5 The CACHE

The cache is used to bridge the gap between the speeds of the fast processor and the slow main memory. The cache memory is smaller and faster than the main memory.
It sits between the processor and the main memory and holds copies of the portions of main memory that are currently being referenced. The use of a cache is motivated by the principle of locality of reference. There are basically two types of cache, namely the fully associative and the direct mapped cache. We use a cache that is a combination of both: the set associative cache.

The structure of a cache entry is given below.

Figure 5: An entry in the set associative cache memory (a row of ways, drawn with four ways, each holding the fields Tag | Data | V | D)

V: validity of data.
D: dirty bit. If 1, the data has been written by the processor and is inconsistent with the data in memory; if 0, it has not been modified by the processor.

Caches use two policies for writing to memory:
1. Write through: a value to be written is updated in the cache and also written to main memory immediately.
2. Write back: the value is written only to the cache, and is written into memory only when a location with D=1 and V=1 is replaced.

You will design a cache unit with a write-back policy for this project. The set associative cache has C2 cache lines, each of which can hold up to C3 cache entries. In other words, we design a C3-way set associative cache with C2 lines (sets). So up to C3 addresses that map to the same cache line can be held at once without having to replace a cache entry. The tag is the MSB portion of the address, which is not used in selecting the cache line and hence is used to identify the entry uniquely. The least significant log2(C2) bits of the main memory address are used to select a particular cache line; hence assume C2 to be a power of 2. The system bus has separate data lines and address lines. The four access cases are:
1. Read hit: the cache line holds the value being searched for.
2. Read miss: the cache line does not hold the data, so it must be accessed from memory.
3. Write hit: the cache entry to be written is already in the cache, so the update is done in the cache only.
4. Write miss: the data already in the cache entry to be replaced has to be written to main memory (if it is dirty), and the new data is written into this cache entry.

Parameter list: A1, A2, A3, P1, C1, C2, C3, N
A1 - Number of CLAs in the processor.
A2 - Number of WTMs in the processor.
A3 - Number of LSUs in the processor.
P1 - Number of instructions fetched at a time.
C1 - Number of execution units whose results can be committed simultaneously.
C2 - Number of cache lines in the set associative cache.
C3 - Number of cache entries held by a cache line in the set associative cache, in other words, the number of ways in the set associative cache.

Once the RTL is developed, the next document will give you the verification plan, which will enable you to do the functional verification of your RTL.
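A behavioral model of the C3-way set-associative, write-back cache helps check the index/tag split and the four access cases against the RTL. The spec fixes neither a replacement policy nor a block size, so this sketch assumes LRU replacement, one word per entry, word addresses, and write-allocate on a write miss; the V bit is represented by an entry's presence in the line, and the two-cycle miss latency is not modeled. The class name `WriteBackCache` and its methods are hypothetical.

```python
from collections import OrderedDict

class WriteBackCache:
    """C3-way set-associative, write-back, write-allocate cache over a word-addressed memory."""

    def __init__(self, c2, c3, memory):
        assert c2 > 0 and c2 & (c2 - 1) == 0, "C2 must be a power of 2"
        self.c2, self.c3 = c2, c3
        self.index_bits = c2.bit_length() - 1          # log2(C2)
        self.memory = memory                           # dict: address -> word
        # One OrderedDict per cache line: key = tag, value = [data, dirty].
        # Valid entries are exactly the keys present; order runs LRU -> MRU.
        self.lines = [OrderedDict() for _ in range(c2)]
        self.stats = {"hit": 0, "miss": 0, "writeback": 0}

    def _split(self, addr):
        return addr & (self.c2 - 1), addr >> self.index_bits   # (index, tag)

    def _lookup(self, addr):
        index, tag = self._split(addr)
        line = self.lines[index]
        if tag in line:
            self.stats["hit"] += 1
            line.move_to_end(tag)                      # mark most recently used
            return line[tag]
        self.stats["miss"] += 1
        if len(line) == self.c3:                       # all C3 ways valid: evict LRU
            old_tag, (old_data, dirty) = line.popitem(last=False)
            if dirty:                                  # write back only if D = 1
                self.memory[(old_tag << self.index_bits) | index] = old_data
                self.stats["writeback"] += 1
        line[tag] = [self.memory.get(addr, 0), False]  # fill from memory, D = 0
        return line[tag]

    def read(self, addr):
        return self._lookup(addr)[0]

    def write(self, addr, value):
        entry = self._lookup(addr)
        entry[0], entry[1] = value, True               # update cache only, set D = 1
```

With C2 = 4 and C3 = 2, addresses 1, 5 and 9 all map to line 1; writing all three forces an eviction, and only the dirty victim is written back to memory, which is the write-back behavior described above.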

4 Implementation

Your Verilog code must follow the synthesis guidelines discussed in class. You will be required to take the design through the various steps of the design flow later; the primary requirement for those stages is that the code is synthesizable. Further instructions will be given as you proceed. Remember that this is a group project and partitioning of your design is an absolute requirement. Use your time judiciously. Unlike the project specifications for other groups, a grading scheme for the report is not provided; I will talk to the groups and decide on the grading policy.


More information

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Consider: a = b + c; d = e - f; Assume loads have a latency of one clock cycle:

More information

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods 10 1 Dynamic Scheduling 10 1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods Not yet complete. (Material below may repeat

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

HY425 Lecture 05: Branch Prediction

HY425 Lecture 05: Branch Prediction HY425 Lecture 05: Branch Prediction Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS October 19, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 05: Branch Prediction 1 / 45 Exploiting ILP in hardware

More information

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts CPE 110408443 Computer Architecture Appendix A: Pipelining: Basic and Intermediate Concepts Sa ed R. Abed [Computer Engineering Department, Hashemite University] Outline Basic concept of Pipelining The

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

CIS 662: Midterm. 16 cycles, 6 stalls

CIS 662: Midterm. 16 cycles, 6 stalls CIS 662: Midterm Name: Points: /100 First read all the questions carefully and note how many points each question carries and how difficult it is. You have 1 hour 15 minutes. Plan your time accordingly.

More information

ECSE 425 Lecture 6: Pipelining

ECSE 425 Lecture 6: Pipelining ECSE 425 Lecture 6: Pipelining H&P, Appendix A Vu, Meyer Textbook figures 2007 Elsevier Science Last Time Processor Performance EquaQon System performance Benchmarks 2 Today Pipelining Basics RISC InstrucQon

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP Overview Appendix A Pipelining: Basic and Intermediate Concepts Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 1 2 Unpipelined

More information

(Basic) Processor Pipeline

(Basic) Processor Pipeline (Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might

More information

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently Pipelining Computational assembly line Each step does a small fraction of the job All steps ideally operate concurrently A form of vertical concurrency Stage/segment - responsible for 1 step 1 machine

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Computer Architecture

Computer Architecture Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

Very Simple MIPS Implementation

Very Simple MIPS Implementation 06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,

More information

Processor: Superscalars Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

Appendix A. Overview

Appendix A. Overview Appendix A Pipelining: Basic and Intermediate Concepts 1 Overview Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 2 1 Unpipelined

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002, Appendix C Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Pipelining Multiple instructions are overlapped in execution Each is in a different stage Each stage is called

More information

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Pipeline Hazards Jin-Soo Kim ( Computer Systems Laboratory Sungkyunkwan University Hazards What are hazards? Situations that prevent starting the next instruction

More information

are Softw Instruction Set Architecture Microarchitecture are rdw

are Softw Instruction Set Architecture Microarchitecture are rdw Program, Application Software Programming Language Compiler/Interpreter Operating System Instruction Set Architecture Hardware Microarchitecture Digital Logic Devices (transistors, etc.) Solid-State Physics

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information


ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

More information

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PIPELINING: HAZARDS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 1 submission deadline: Jan. 30 th This

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

ECE 505 Computer Architecture

ECE 505 Computer Architecture ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

The Tomasulo Algorithm Implementation

The Tomasulo Algorithm Implementation 2162 Term Project The Tomasulo Algorithm Implementation Assigned: 11/3/2015 Due: 12/15/2015 In this project, you will implement the Tomasulo algorithm with register renaming, ROB, speculative execution

More information

Improving Performance: Pipelining

Improving Performance: Pipelining Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute

More information Pipelining Pipelining Pipelining 1 What Is Pipelining? Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Today, pipelining is the key implementation technique used to make

More information

Scoreboard information (3 tables) Four stages of scoreboard control

Scoreboard information (3 tables) Four stages of scoreboard control Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Very Simple MIPS Implementation

Very Simple MIPS Implementation 06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,

More information

Pipelining. CSC Friday, November 6, 2015

Pipelining. CSC Friday, November 6, 2015 Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not

More information

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

1 Hazards COMP2611 Fall 2015 Pipelined Processor

1 Hazards COMP2611 Fall 2015 Pipelined Processor 1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add

More information

Tomasulo s Algorithm

Tomasulo s Algorithm Tomasulo s Algorithm Architecture to increase ILP Removes WAR and WAW dependencies during issue WAR and WAW Name Dependencies Artifact of using the same storage location (variable name) Can be avoided

More information


COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions. Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012 Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code

More information

Computer Architecture V Fall Practice Exam Questions

Computer Architecture V Fall Practice Exam Questions Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information