PIPELINES AND ILP
STEVEN R. BAGLEY
(Photo: David Wright)



2 INTRODUCTION We have been considering what makes the CPU run at a particular speed. We spent the last two weeks looking at memory latency, and how caching can help speed things up by reducing the time taken to fetch instructions and data. Today we look at other tricks used by CPU designers to make the CPU run fast.

3 (Timing diagram: FETCH, DECODE, EXECUTE repeated serially, one instruction at a time, against the clock.) This sets the minimum time any instruction will take to run at 3 cycles (one cycle for each stage).

4 (Timing diagram: the FETCH, DECODE and EXECUTE stages of successive instructions overlap, so while INST 1 executes, INST 2 is being decoded and INST 3 fetched.) KEY: FETCH, DECODE, EXECUTE. Once we get going, we finish execution of an instruction every clock cycle.
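The diagram follows a simple formula: with a k-stage pipeline and no hazards, n instructions take k + (n - 1) cycles, because we pay the pipeline's depth once and then retire one instruction per clock. A minimal sketch (the helper name is mine, not from the slides):

```python
def pipeline_cycles(n_instructions, n_stages):
    """Total cycles for an ideal, hazard-free pipeline: fill the
    pipeline once (n_stages cycles for the first instruction),
    then retire one instruction every clock after that."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

assert pipeline_cycles(1, 3) == 3    # a lone instruction still takes 3 cycles
assert pipeline_cycles(10, 3) == 12  # but 10 instructions take 12, not 30
```

The 10-instructions-in-12-cycles case is the situation the diagram shows for the 3-stage pipeline.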

5 BUBBLES Pipeline hazards introduce bubbles into the pipeline: points in time where the CPU isn't executing an instruction, because the hazard forced the delay of an earlier stage. Also known as a pipeline stall. The size of the bubble depends on the instructions; in the worst case, we can end up with instructions effectively executing serially. We can rewrite our code to be more pipeline-friendly by reordering the instructions.

6 CONTROL HAZARD Branches cause another form of pipeline hazard, a control hazard: the proper instruction cannot execute in the next clock cycle because a different instruction was fetched. With a conditional branch, you cannot know until the branch is executed whether you'll get a control hazard; you might have fetched the right instruction, you might not. AKA a branch hazard.

7 CONTROL HAZARD In our case, the unconditional branch means we definitely haven't fetched the correct instruction. We need to discard the currently fetched and decoded instructions and start again, causing a stall almost as long as the pipeline. And it's not just branches: any instruction which alters the normal flow of execution can do this.

8 (Pipeline diagram: the branch executes while the SWI 4 and SWI 2 fetched after it sit in the earlier stages; they are discarded and the CMP is fetched instead.) The program:
0x00        B    main
0x04 _a     DEFW 163
0x08 _b     DEFW 173
0x0C main   LDR  R0, _a
0x10        LDR  R1, _b
            ; euclid routine goes here
0x14        B    _cmp
0x18 loop   CMP  R0, R1
0x1C        BLE  skiptoelse
0x20        SUB  R0, R0, R1
0x24        B    end
0x28 skiptoelse SUB R1, R1, R0
     end
0x2C _cmp   CMP  R0, R1
0x30        BNE  loop
            ; R0 (and R1) contain result
0x34        SWI  4
0x38        SWI  2
We fetch the correct instruction (a CMP R0,R1); in this case it happens to be the same instruction that was discarded, but it could be any instruction. The pipeline then continues as before, until we reach another branch, when the same thing happens, although in that case it is a conditional branch, so we might be in a position where the condition matches.

9 MITIGATING CONTROL HAZARD Control hazards introduce a bubble one stage shorter than the pipeline: in this case, the pipeline is 3 stages, so the stall is 2 cycles long. It's possible to design the CPU instruction set to mitigate this in some circumstances, e.g. by always executing the instruction after a branch (a branch delay slot). This can lead to some interesting instruction sets.
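The cost of these stalls can be folded into an average cycles-per-instruction figure. The sketch below uses an invented workload mix (20% branches, 60% of them taken) purely for illustration:

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, stall_cycles):
    """Average cycles per instruction when every taken branch inserts
    stall_cycles of bubble (no prediction, no delay slots)."""
    return base_cpi + branch_fraction * taken_fraction * stall_cycles

# 3-stage pipeline: each taken branch costs the 2-cycle stall from above
cpi = effective_cpi(1.0, 0.20, 0.60, 2)   # -> 1.24 cycles per instruction
```

Even modest branch frequencies noticeably drag the average above the ideal 1 cycle per instruction.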

10 CONDITIONAL INSTRUCTIONS The ARM designers took a different approach. They realised that some branches only exist to skip one or two instructions, and decided to make every instruction conditional (not just branches). Any ARM instruction can have a condition code placed on it, and the instruction is only executed if the condition is met. This means we only get a one-cycle bubble (in the execute phase), as in our Euclid example, which we can rewrite in three lines using this.
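On ARM this is commonly written as one CMP followed by conditionally executed subtracts, something like CMP R0,R1 / SUBGT R0,R0,R1 / SUBLT R1,R1,R0 / BNE loop (my reconstruction, not the lecture's listing). A Python model of that control flow, with the flags captured once per iteration by the CMP:

```python
def euclid_gcd(r0, r1):
    """Model of the conditional-execution version of Euclid's GCD:
    one CMP sets the flags, then SUBGT / SUBLT execute or are
    skipped (a one-cycle bubble) based on those same flags."""
    while r0 != r1:          # CMP R0, R1 ... BNE loop
        greater = r0 > r1    # flags set once by the CMP
        if greater:
            r0 -= r1         # SUBGT R0, R0, R1
        else:
            r1 -= r0         # SUBLT R1, R1, R0
    return r0

assert euclid_gcd(163, 173) == 1   # the _a/_b values from the listing
assert euclid_gcd(48, 36) == 12
```

The point of the ARM form is that neither subtract is a branch, so a wrong "guess" costs one skipped instruction rather than a pipeline flush.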

11 PIPELINE LENGTH Pipeline length depends on the implementation of the CPU. For example, the MIPS CPU has a five-stage pipeline: Instruction Fetch from memory (IF); Decode and read values from registers (ID); Execute operation or calculate address, i.e. use the ALU (EX); Access operand in data memory (MEM); Write back result into registers (WB). The instruction set is designed to allow this to happen.

12 PIPELINE LENGTH As the pipeline is broken down into smaller steps, each step does less and takes less time to run, so the clock can run faster. But the cost of a stall (e.g. for a branch) becomes much greater, and more types of hazard can appear; another common one is the data hazard. The Pentium 4 had a 20-stage pipeline, so a branch stall would take many clock cycles.
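This tradeoff can be made concrete with a toy model: splitting a fixed amount of logic into more stages shortens the clock (plus some latch overhead per stage) but makes each flush cost more cycles. All the numbers below are invented for illustration:

```python
def avg_ns_per_instr(total_work_ns, n_stages, branch_rate, overhead_ns=0.5):
    """Toy model of pipeline depth: splitting total_work_ns of logic
    into n_stages shortens the clock to total/n plus latch overhead,
    but a flush (a taken branch, here) wastes n_stages - 1 cycles.
    Illustrative numbers, not measurements of any real CPU."""
    cycle = total_work_ns / n_stages + overhead_ns
    flush = (n_stages - 1) * cycle
    return cycle + branch_rate * flush

shallow = avg_ns_per_instr(15.0, 3, branch_rate=0.1)    # 6.6 ns/instruction
deep = avg_ns_per_instr(15.0, 15, branch_rate=0.1)      # 3.6 ns/instruction
# the deeper pipeline is still faster overall here, but flushes now
# account for a far larger share of its time (2.1 of 3.6 ns vs 1.1 of 6.6)
```

This is why very deep pipelines lean so heavily on good branch prediction.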

13 DATA HAZARD A data hazard occurs when an instruction needs a value that hasn't yet been calculated by a previous instruction. Take the following ARM code: ADD R0,R1,R2 followed by SUB R2,R0,#5. The second instruction cannot begin executing until the value for R0 is calculated. Now let's look at how this would play out in a MIPS-like pipeline.

14 (Pipeline diagram: ADD R0,R1,R2 passes through IF ID EX MEM WB; SUB R2,R0,#5 follows one cycle behind. R0 is updated in the ADD's WB stage, but the SUB fetches R0 in its ID stage, which comes earlier.) The ADD instruction doesn't update the register R0 until its WB phase, but the SUB needs R0 before it is updated. So do we need to stall the CPU and shift the final stages of the SUB until after the ADD's WB?


19 MITIGATING DATA HAZARDS We can use an approach called forwarding (or bypassing) to mitigate a data hazard. Rather than having the instruction wait for the data to be written back, we provide a shortcut from the internal buffers in the CPU to supply the data, rather than needing to fetch it from the register file.

20 (Pipeline diagram: the R0 value is calculated in the ADD's EX stage and needed at the SUB's EX stage one cycle later.) The ADD instruction calculates the value of R0 in its EX phase, and the SUB doesn't need it until its own EX phase, so we provide a shortcut in the CPU design to get the value into the right place.
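Whether forwarding removes the stall can be counted from stage positions. A simplified model of this ADD/SUB pair in the five-stage pipeline (it assumes the register file cannot be written and read in the same cycle, which is only one possible design choice):

```python
def stall_cycles(forwarding):
    """Bubbles between ADD R0,R1,R2 and the dependent SUB R2,R0,#5
    in a 5-stage pipeline.  Simplified model: no same-cycle
    register-file write/read."""
    IF, ID, EX, MEM, WB = range(5)   # stage positions
    gap = 1                          # SUB enters the pipeline 1 cycle after ADD
    if forwarding:
        # result forwarded from the end of ADD's EX to the start of SUB's EX
        available, needed = EX + 1, EX + gap
    else:
        # result visible only after ADD's WB; SUB reads registers in ID
        available, needed = WB + 1, ID + gap
    return max(0, available - needed)

assert stall_cycles(forwarding=True) == 0    # forwarding hides the hazard
assert stall_cycles(forwarding=False) == 3   # otherwise SUB waits for WB
```

The shortcut works for ALU results because the value exists at the end of EX; a load's value only exists after MEM, which is why load-use hazards can still cost a cycle even with forwarding.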

21 PIPELINES All pipeline stages must take the same amount of time to complete; or rather, the longest step defines the time every step of the pipeline takes to run. It doesn't matter if a step completes early. We can design our instruction set to help with this.

22 DESIGNING INSTRUCTION SETS FOR PIPELINING It helps if all instructions are the same length, so the instruction fetch always takes the same amount of time. It also helps if there is regularity in the bit patterns used to express instructions (e.g. the bits for a register are in the same place in every instruction), and if memory access is separated from other instructions. Compare ARM, where each instruction is 4 bytes, with x86, where instructions vary from 1 to 15 bytes (and the length isn't known until you start decoding). Modern x86 CPUs translate x86 instructions into RISC-like instructions internally.

23 BRANCH PREDICTION Control hazards happen when the CPU has started to fetch the wrong instruction. Those instructions pass through the early stages of the pipeline, but they aren't needed, so the work gets thrown away and the CPU has to start again and fetch the correct instruction.

24 (Pipeline diagram over the Euclid listing from slide 8: after the branch executes, the SWI 4 and SWI 2 already in the pipeline are discarded.) Here we had started to fetch the SWI 4 and SWI 2 after our branch, which we don't need, so the pipeline next has to fetch the correct CMP instruction, causing a stall.

25 (Diagram build: the correct CMP instruction now enters the fetch stage, after the discarded instructions.)

26 BRANCH PREDICTION Our CPU is using a very naive approach to fetching the next instruction: it always fetches the next one linearly in memory. But with loops this is almost always going to be the wrong instruction; the loop will usually happen several times, and only on the last iteration does the next instruction in memory get executed. Surely it'd make more sense to assume the branch was taken? This makes the pipeline construction more complex, but it is doable.

27 (Pipeline diagram over the Euclid listing, now assuming branches are taken: rather than fetching the SWI 4 that follows the branch in memory, we'd want to fetch the CMP R0,R1 at the branch target, and then the BLE, moving these up the stages. But when we reach the conditional BLE, what do we fetch next?)


34 BRANCH PREDICTION It is relatively easy to predict which way a loop will branch (i.e. to loop). However, for branches used to implement a conditional statement it is much harder: which is the best path to take by default? We need the CPU to be able to predict which way the branch will go.

35 BRANCH PREDICTION The CPU uses the past to predict how a branch will be taken. For the branch instructions it has seen recently, it keeps track of how many times each branched and how many times it didn't, and uses these statistics to work out which instruction is the best one to fetch next. This requires considerable logic to implement.
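A common mechanism for this (widely used in real CPUs, though not detailed on the slide) is a 2-bit saturating counter per branch, so that a single surprise does not flip the prediction. A sketch:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken.  Two wrong guesses in a row are
    needed to flip the prediction."""
    def __init__(self):
        self.state = 2          # start weakly predicting "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A 9-iteration loop (branch taken 9 times, then falls through), run twice:
p = TwoBitPredictor()
history = [True] * 9 + [False] + [True] * 9 + [False]
correct = 0
for taken in history:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
# correct == 18: only the two loop exits are mispredicted (90%)
```

A 1-bit predictor would mispredict twice per loop (at the exit and again on re-entry); the second bit is what makes loop branches almost free.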

36 SPECULATIVE EXECUTION Branch prediction is an example of speculative execution: the CPU is doing some work on the assumption that it'll probably be needed, but it might end up being thrown away. Depending on the pipeline design, this could get as far as actually calculating results.

37 INSTRUCTION-LEVEL PARALLELISM Pipelining speeds up the CPU by enabling many instructions to execute at once, known as instruction-level parallelism. It is largely invisible to the programmer, but limited in the amount of parallelism it can exploit, due to the structure of the CPU datapath. Although if you know how things work, you can construct code to benefit.

38 (Recap diagram: highlighting how data flows through the CPU.)

39 SUPERSCALAR But what if we built the CPU with more than one ALU? The CPU could perform two additions at the same time, and so execute two instructions at the same time. A CPU designed like this is described as superscalar. For certain instructions, this can get the time taken to execute an instruction below one clock cycle.

40 SUPERSCALAR The CPU fetches two instructions in one clock cycle, decodes two instructions in one clock cycle, and executes two instructions in one clock cycle. The result is that each instruction appears to complete in 0.5 clock cycles, where possible: it is not possible if the second instruction depends on the output of the first, or if the first is a branch.
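The two pairing rules on this slide can be written down directly. A toy issue checker (the dict-based instruction format is invented for this sketch):

```python
def can_dual_issue(first, second):
    """Check whether two adjacent instructions can issue together.
    Pairing fails if the first is a branch, if the second reads the
    first's result (a data dependence), or if both write the same
    register.  The dict format is invented for this sketch."""
    if first['op'].startswith('B'):
        return False
    if first['dst'] is not None:
        if first['dst'] in second['src'] or first['dst'] == second['dst']:
            return False
    return True

add = {'op': 'ADD', 'dst': 'R0', 'src': ['R1', 'R2']}
sub = {'op': 'SUB', 'dst': 'R2', 'src': ['R0']}   # reads R0: depends on add
mov = {'op': 'MOV', 'dst': 'R3', 'src': ['R4']}   # independent of add

assert can_dual_issue(add, sub) is False   # dependence blocks pairing
assert can_dual_issue(add, mov) is True    # independent: issue together
```

Real issue logic checks many more conditions (functional-unit availability, memory ordering), but the dependence test is the heart of it.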

41 APPLE A8 CPU (Diagram taken from a published third-party analysis; the source link was not captured in transcription.) There are several different data paths that instructions can take through the CPU. Not all are equal; it is up to the control logic to make sure each instruction follows the correct path.

42 IN-ORDER The CPU we have considered would be described as in-order: it executes the instructions in the order they appear in memory. The program needs to be written so that a superscalar CPU can execute the instructions in parallel, and it is up to the programmer/compiler to arrange the code carefully to get the best order. The problem is that the best order varies from CPU implementation to implementation. This works OK in some applications.

44 Before: LDR R0,_a; LDR R1,_b; ADD R0,R0,#5. Reordered: LDR R0,_a; ADD R0,R0,#5; LDR R1,_b. We saw a situation like this earlier on our simple CPU: it caused a stall, because we couldn't fetch the ADD instruction until after LDR R1,_b had completed executing. But if we reorder the instructions, we get the same effect and reduce the stall to one cycle.

45 OUT-OF-ORDER EXECUTION Some CPUs, however, go one step further: they reorder the instructions themselves, to execute them in the best manner for the CPU design. This is known as out-of-order execution. Lots of tricks are used to implement it, e.g. register renaming.
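Register renaming can be sketched in a few lines: give every write a fresh physical register, so later writes to the same architectural register no longer conflict with earlier ones and only true dependences remain. The tuple-based instruction format below is invented for illustration:

```python
def rename(instrs):
    """Rename architectural destination registers to fresh physical
    registers.  This removes false dependences (write-after-write,
    write-after-read) so independent work can be reordered.
    instrs: list of (dst, srcs) tuples; the format is invented."""
    mapping = {}                                  # architectural -> physical
    fresh = iter('P%d' % i for i in range(1000))
    out = []
    for dst, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]  # read current mappings
        mapping[dst] = next(fresh)                # every write gets a new reg
        out.append((mapping[dst], srcs))
    return out

prog = [('R0', ['R1', 'R2']),   # ADD R0, R1, R2
        ('R2', ['R0']),         # SUB R2, R0, #5  (true dependence, kept)
        ('R0', ['R5', 'R6'])]   # ADD R0, R5, R6  (reuses R0: false dependence)
renamed = rename(prog)
# the two writes to R0 now target different physical registers, so the
# third instruction no longer has to wait behind the first two
```

The true dependence (the SUB reading the first ADD's result) survives renaming; only the artificial reuse of R0 disappears.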

46 MULTI-CORE These kinds of tricks can only get us so far, and they require a lot of logic to implement. The alternative is to have lots of separate CPU cores, and rewrite our programs to run in parallel. But that brings its own issues.

47 (Recap of the pipelining diagram from slide 4: once we get going, we finish execution of an instruction every clock cycle.)


More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput

More information

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. The Processor Pipeline Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. Pipeline A Basic MIPS Implementation Memory-reference instructions Load Word (lw) and Store Word (sw) ALU instructions

More information

EE 4980 Modern Electronic Systems. Processor Advanced

EE 4980 Modern Electronic Systems. Processor Advanced EE 4980 Modern Electronic Systems Processor Advanced Architecture General Purpose Processor User Programmable Intended to run end user selected programs Application Independent PowerPoint, Chrome, Twitter,

More information

CS2100 Computer Organisation Tutorial #10: Pipelining Answers to Selected Questions

CS2100 Computer Organisation Tutorial #10: Pipelining Answers to Selected Questions CS2100 Computer Organisation Tutorial #10: Pipelining Answers to Selected Questions Tutorial Questions 2. [AY2014/5 Semester 2 Exam] Refer to the following MIPS program: # register $s0 contains a 32-bit

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2018 Handling Memory Operations Review pipeline for out of order, superscalar processors

More information

CSE 490/590 Computer Architecture Homework 2

CSE 490/590 Computer Architecture Homework 2 CSE 490/590 Computer Architecture Homework 2 1. Suppose that you have the following out-of-order datapath with 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined

More information

ECE 505 Computer Architecture

ECE 505 Computer Architecture ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Advanced Instruction-Level Parallelism

Advanced Instruction-Level Parallelism Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu

More information

EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

More information

IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB

IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB EE 4720 Homework 4 Solution Due: 22 April 2002 To solve Problem 3 and the next assignment a paper has to be read. Do not leave the reading to the last minute, however try attempting the first problem below

More information

Control Flow and Loops. Steven R. Bagley

Control Flow and Loops. Steven R. Bagley Control Flow and Loops Steven R. Bagley Introduction Started to look at writing ARM Assembly Language Saw the structure of various commands Load (LDR), Store (STR) for accessing memory SWIs for OS access

More information

EXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM

EXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM EXAM #1 CS 2410 Graduate Computer Architecture Spring 2016, MW 11:00 AM 12:15 PM Directions: This exam is closed book. Put all materials under your desk, including cell phones, smart phones, smart watches,

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University The Processor (3) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Chapter 06: Instruction Pipelining and Parallel Processing

Chapter 06: Instruction Pipelining and Parallel Processing Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction

More information

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

More information

The Processor: Improving the performance - Control Hazards

The Processor: Improving the performance - Control Hazards The Processor: Improving the performance - Control Hazards Wednesday 14 October 15 Many slides adapted from: and Design, Patterson & Hennessy 5th Edition, 2014, MK and from Prof. Mary Jane Irwin, PSU Summary

More information

Writing ARM Assembly. Steven R. Bagley

Writing ARM Assembly. Steven R. Bagley Writing ARM Assembly Steven R. Bagley Introduction Previously, looked at how the system is built out of simple logic gates Last week, started to look at the CPU Writing code in ARM assembly language Assembly

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency? Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

Superscalar Processors Ch 14

Superscalar Processors Ch 14 Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

5008: Computer Architecture HW#2

5008: Computer Architecture HW#2 5008: Computer Architecture HW#2 1. We will now support for register-memory ALU operations to the classic five-stage RISC pipeline. To offset this increase in complexity, all memory addressing will be

More information

Hakim Weatherspoon CS 3410 Computer Science Cornell University

Hakim Weatherspoon CS 3410 Computer Science Cornell University Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer. memory inst register

More information

Writing ARM Assembly. Steven R. Bagley

Writing ARM Assembly. Steven R. Bagley Writing ARM Assembly Steven R. Bagley Hello World B main hello DEFB Hello World\n\0 goodbye DEFB Goodbye Universe\n\0 ALIGN main ADR R0, hello ; put address of hello string in R0 SWI 3 ; print it out ADR

More information

Pipelining: Overview. CPSC 252 Computer Organization Ellen Walker, Hiram College

Pipelining: Overview. CPSC 252 Computer Organization Ellen Walker, Hiram College Pipelining: Overview CPSC 252 Computer Organization Ellen Walker, Hiram College Pipelining the Wash Divide into 4 steps: Wash, Dry, Fold, Put Away Perform the steps in parallel Wash 1 Wash 2, Dry 1 Wash

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information