PIPELINES AND ILP
Steven R. Bagley
Photo: David Wright
2 INTRODUCTION We've been considering what makes the CPU run at a particular speed. We spent the last two weeks looking at memory latency, and at how caching can help speed things up by reducing the time to fetch instructions and data. Today, we look at other tricks used by CPU designers to make the CPU run fast.
3 [Diagram: FETCH, DECODE and EXECUTE repeating in sequence along a clock-cycle axis] This sets the minimum time any instruction will take to run at 3 cycles (one cycle for each stage).
4 [Pipeline diagram: INST 1 through INST 12 overlapping in the FETCH, DECODE and EXECUTE stages across successive clock cycles] Once we get going, we finish execution of an instruction every clock cycle.
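The speedup the diagram shows can be sketched with a quick timing model (a simplification that ignores hazards; the 3-stage depth matches the fetch/decode/execute pipeline above):

```python
# Simple timing model for an ideal pipeline (no hazards).
# A non-pipelined CPU takes `stages` cycles per instruction; a
# pipelined one fills up once, then retires one instruction per cycle.
def cycles_nonpipelined(n_instructions, stages):
    return n_instructions * stages

def cycles_pipelined(n_instructions, stages):
    return stages + (n_instructions - 1)

# 10 instructions on the 3-stage fetch/decode/execute pipeline:
print(cycles_nonpipelined(10, 3))  # 30
print(cycles_pipelined(10, 3))     # 12
```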
5 BUBBLES Pipeline hazards introduce bubbles into the pipeline: points in time where the CPU isn't executing an instruction, because the hazard forced the delay of an earlier stage. Also known as a pipeline stall. The size of the bubble depends on the instructions; in the worst case, we can end up with instructions effectively executing serially. We can rewrite our code to be more pipeline-friendly by reordering the instructions.
6 CONTROL HAZARD Branches cause another form of pipeline hazard, a control hazard: the proper instruction cannot execute in the next clock cycle because a different instruction was fetched. With a conditional branch, you cannot know until the branch is executed whether you'll get a control hazard; you might have fetched the right instruction, you might not. AKA a branch hazard.
7 CONTROL HAZARD In our case, the unconditional branch means we definitely haven't fetched the correct instruction. We need to discard the currently fetched and decoded instructions and start again, causing a stall as long as the pipeline. Not just branches: any instruction which alters the CPU's program counter causes this.
8 [Pipeline diagram: B _cmp, then the wrongly fetched SWI 4 and SWI 2, then CMP and BLE, plotted against the clock-cycle axis]

0x00            B main
0x04 _a         DEFW 163
0x08 _b         DEFW 173
0x0C main       LDR R0, _a
0x10            LDR R1, _b
                ; euclid routine goes here
0x14            B _cmp
0x18 loop       CMP R0, R1
0x1C            BLE skiptoelse
0x20            SUB R0, R0, R1
0x24            B end
0x28 skiptoelse SUB R1, R1, R0
     end
0x2C _cmp       CMP R0, R1
0x30            BNE loop
                ; R0 (and R1) contain result
0x34            SWI 4
0x38            SWI 2

Fetch the correct instruction (a CMP R0, R1); in this case it happens to be the same instruction, but it could be any instruction. The pipeline then continues as before, until we reach another branch, when the same thing happens. Although in this case it is a conditional branch, so we might be in a position where the condition matches.
9 MITIGATING CONTROL HAZARDS Control hazards introduce a bubble that is one stage shorter than the pipeline. It's possible to design the CPU instruction set to mitigate this in some circumstances, which can lead to some interesting instruction sets, e.g. always executing the instruction after the branch (a branch delay slot). In this case, the pipeline is 3 stages, so the stall is 2 cycles long.
10 CONDITIONAL INSTRUCTIONS ARM's designers took a different approach. They realised that some branches only exist to skip one or two instructions, and decided to make every instruction conditional (not just branches): any ARM instruction can have a condition code placed on it, and the instruction is only executed if the condition is met. This means we only have a one-cycle bubble (in the execute phase), as in our Euclid example. We can rewrite our Euclid example in three lines using this.
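A sketch of that rewrite, with each subtraction guarded by a condition in the style of ARM's predicated instructions. The SUBGT/SUBLT mnemonics in the comments are the classic conditional form this is assumed to model; the slides themselves don't show the final listing:

```python
# Euclid's algorithm in a predicated style: every subtraction is
# guarded by a condition, mirroring ARM conditional execution.
def gcd(a, b):
    while a != b:        # loop: CMP R0, R1 ... BNE loop
        if a > b:        # SUBGT R0, R0, R1 (only runs if R0 > R1)
            a -= b
        if a < b:        # SUBLT R1, R1, R0 (only runs if R0 < R1)
            b -= a
    return a             # R0 (and R1) contain the result

print(gcd(163, 173))  # the values from the listing: both prime, so 1
```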
11 PIPELINE LENGTH Pipeline length depends on the implementation of the CPU. For example, the MIPS CPU has a five-stage pipeline: Instruction Fetch from memory (IF); Decode and read values from registers (ID); Execute operation or calculate address, i.e. use the ALU (EX); Access operand in data memory (MEM); Write back result into registers (WB). The instruction set is designed to allow this to happen.
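The five stages can be sketched as a tiny model of which stage each instruction occupies on a given cycle, assuming the ideal case with no stalls:

```python
# Which stage instruction i occupies at a given cycle in an ideal
# 5-stage pipeline (stage names follow the MIPS description above).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(instruction_index, cycle):
    """Return the stage, or None if the instruction is not in flight."""
    s = cycle - instruction_index   # each instruction starts 1 cycle later
    return STAGES[s] if 0 <= s < len(STAGES) else None

# At cycle 2, instruction 0 is in EX while instruction 1 is in ID:
print(stage_at(0, 2), stage_at(1, 2))  # EX ID
```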
12 PIPELINE LENGTH As the pipeline is broken down into smaller steps, the steps do less and take less time to run, so the clock can run faster. But the cost of a stall (e.g. for a branch) becomes much greater, and more types of hazard can appear; another common one is the data hazard. The Pentium 4 had a 20-stage pipeline, so a branch stall would take many clock cycles.
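The growing cost of a branch stall can be put into rough numbers with the standard effective-CPI formula (a textbook model, not from the slides; the branch fraction is an illustrative assumption), using the earlier rule that the bubble is one stage shorter than the pipeline:

```python
# Rough effective CPI when a fraction of instructions are branches
# that pay a flush penalty of (stages - 1) cycles.
def effective_cpi(stages, branch_fraction, flush_rate):
    penalty = stages - 1  # bubble is one stage shorter than the pipeline
    return 1 + branch_fraction * flush_rate * penalty

# Assume 20% branches, all of them flushing the pipeline:
print(effective_cpi(3, 0.2, 1.0))   # ~1.4 on the 3-stage pipeline
print(effective_cpi(20, 0.2, 1.0))  # ~4.8 on a Pentium 4-style depth
```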
13 DATA HAZARD A data hazard occurs when an instruction needs a value that hasn't yet been calculated by a previous instruction. Take the following ARM code: ADD R0,R1,R2 then SUB R2,R0,#5. The second instruction cannot begin executing until the value for R0 is calculated. Now let's look at how this would play out in a MIPS-like pipeline.
14-18 [Pipeline diagram, built up over five slides: ADD R0, R1, R2 passes through IF ID EX MEM WB, with SUB R2, R0, #5 one cycle behind. Annotations mark where R0 is UPDATED (the ADD's WB phase) and where R0 is FETCHED BY SUB (the SUB's ID phase, before it has been updated).] The ADD instruction doesn't update the register R0 until its WB phase, but the SUB reads R0 before it is updated. So do we need to stall the CPU and shift the final stages of the SUB until after the WB?
19 MITIGATING DATA HAZARDS We can use an approach called forwarding (or bypassing) to mitigate a data hazard. Rather than have the instruction wait for the data to be written back, we provide a shortcut from the internal buffers in the CPU to supply the data, rather than needing to fetch it from the register file.
20 [Pipeline diagram: R0's value is CALCULATED in the ADD's EX phase and NEEDED BY SUB in its own EX phase one cycle later.] The ADD instruction calculates the value of R0 in its EX phase, and the SUB doesn't need it until its own EX phase, so we provide a shortcut in the CPU design to get the value into the right place.
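The check a forwarding unit performs can be sketched as follows. The tuple encoding of an instruction as (destination, sources) is an assumption for illustration:

```python
# Does the consumer read the register the producer writes?
# If so, the value must be forwarded from the producer's EX output
# straight into the consumer's EX input instead of stalling.
def needs_forwarding(producer, consumer):
    dest, _ = producer
    _, sources = consumer
    return dest in sources

add = ("R0", ["R1", "R2"])   # ADD R0, R1, R2
sub = ("R2", ["R0"])         # SUB R2, R0, #5

print(needs_forwarding(add, sub))  # True: R0 must be bypassed
```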
21 PIPELINES All pipeline stages must take the same amount of time to complete; or rather, the longest step defines the time each step of the pipeline takes to run. It doesn't matter if a step completes early. We can design our instruction set to help with this.
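That "longest step wins" rule is just a maximum over the stage delays. The delay figures below are made-up numbers for illustration:

```python
# The clock period is set by the slowest stage, even if the
# other stages could finish sooner (illustrative delays).
stage_delays_ns = {"IF": 2.0, "ID": 1.5, "EX": 2.5, "MEM": 3.0, "WB": 1.0}
clock_period_ns = max(stage_delays_ns.values())
print(clock_period_ns)  # 3.0: every stage gets MEM's 3.0 ns
```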
22 DESIGNING INSTRUCTION SETS FOR PIPELINING It helps if all instructions are the same length, so the instruction fetch always takes the same amount of time. It also helps if there is regularity in the bit patterns used to express instructions (e.g. the bits for a register are in the same place for each instruction), and if memory access is separated from other instructions. Compare ARM, where each instruction is 4 bytes, with x86, where instructions vary from 1 to 15 bytes (and the length isn't known until you start decoding). Modern CPUs translate x86 instructions into RISC-like instructions internally.
23 BRANCH PREDICTION Control hazards happen when the CPU has started to fetch the wrong instruction. Instructions pass through the early stages of the pipeline but are not needed, so the work gets thrown away and the CPU has to start again and fetch the correct instruction.
24-25 [Pipeline diagram over the Euclid listing from slide 8, built up over two slides.] Here we had started to fetch SWI 4 and SWI 2 after our branch, which we don't need. So the next fetch has to get the correct CMP instruction, causing a stall.
26 BRANCH PREDICTION Our CPU is using a very naive approach to fetching the next instruction: it always fetches the next one linearly in memory. But with loops this is almost always going to be the wrong instruction; the loop will usually execute several times, and only on the last iteration does the next instruction in memory get executed. Surely it'd make more sense to assume the branch is taken? This makes the pipeline construction more complex, but it is doable.
27-33 [Pipeline diagram over the Euclid listing from slide 8, built up over seven slides, with the BLE moving up the stages.] Assuming the branch is taken: rather than fetching SWI 4 here, we'd want to fetch CMP R0, R1, then the BLE, and move these up the stages. But what do we fetch after the conditional BLE?
34 BRANCH PREDICTION It is relatively easy to predict which way a loop will branch (i.e. back to the loop). However, for branches used to implement a conditional statement it is much harder: which is the best path to take by default? We need the CPU to be able to predict which way the branch will go.
35 BRANCH PREDICTION The CPU uses the past to predict how a branch will go: for the branch instructions it has seen recently, it keeps track of how many times each branched and how many times it didn't, and uses these statistics to work out which instruction is the best one to fetch next. This requires considerable logic to implement.
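One widely used way to keep those statistics is a 2-bit saturating counter per branch. This is a standard scheme shown as an illustration; the slides don't name a particular predictor:

```python
# Classic 2-bit saturating-counter branch predictor.
# States 0-1 predict not-taken; states 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times, then falling through once:
p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # 9: only the final fall-through is mispredicted
```

The two bits mean a single surprise (like the loop exit) doesn't immediately flip the prediction for the next run of the loop.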
36 SPECULATIVE EXECUTION Branch prediction is an example of speculative execution: the CPU is doing some work on the assumption that it'll probably be needed, but it might also end up being thrown away. Depending on the pipeline design, this could get as far as actually calculating results.
37 INSTRUCTION-LEVEL PARALLELISM Pipelining speeds up the CPU by enabling many instructions to execute at once, known as instruction-level parallelism. It is largely invisible to the programmer, but limited in the amount of parallelism we can exploit, due to the structure of the CPU data path. Although if you know how things work, you can construct code to benefit.
38 We also saw how the data flows through the CPU, highlighting the data path.
39 SUPERSCALAR But what if we built the CPU with more than one ALU? The CPU could perform two additions at the same time, and so execute two instructions at the same time. A CPU designed like this is described as superscalar. We can get the time taken to execute an instruction below one clock cycle, for certain instructions.
40 SUPERSCALAR The CPU fetches two instructions in one clock cycle, decodes two instructions in one clock cycle, and executes two instructions in one clock cycle. The result is that each instruction appears to complete in 0.5 clock cycles, where possible. It is not possible if the second instruction depends on the output of the first, or if the first is a branch.
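The two conditions that break dual issue can be sketched directly. The (op, destination, sources) encoding is an illustrative assumption:

```python
# Can a pair of instructions issue together on a 2-way superscalar?
# Not if the second reads the first's result, or the first is a branch.
def can_dual_issue(first, second):
    op1, dest1, _ = first
    _, _, src2 = second
    if op1 == "B":             # first is a branch: next fetch is unknown
        return False
    return dest1 not in src2   # a data dependence blocks pairing

add_ = ("ADD", "R0", ["R1", "R2"])
sub_ = ("SUB", "R2", ["R0"])   # reads R0, so it must wait
mov_ = ("MOV", "R3", ["R4"])   # independent, so it can pair

print(can_dual_issue(add_, sub_))  # False
print(can_dual_issue(add_, mov_))  # True
```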
41 APPLE A8 CPU Taken from analysis at . There are several different data paths that instructions can take through the CPU. Not all are equal; it is up to the control logic to make sure each instruction follows the correct path.
42 IN-ORDER The CPU we have considered would be described as in-order: it executes the instructions in the order they appear in memory. The program needs to be written so that a superscalar CPU can execute the instructions in parallel; it is up to the programmer/compiler to order the code carefully to get the best schedule. The problem is that the best order varies from CPU implementation to implementation. This works OK in some applications.
43-44

LDR R0,_a
LDR R1,_b
ADD R0,R0,#5

reordered to:

LDR R0,_a
ADD R0,R0,#5
LDR R1,_b

We saw a situation like this earlier on our simple CPU. It caused a stall because we can't fetch the ADD instruction until after LDR R1,_b has completed executing. But if we reorder the instructions, we get the same effect while reducing the stall to one cycle.
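The reordering idea above can be sketched as a tiny scheduling pass. The (text, destination, sources) encoding is an illustrative assumption:

```python
# Pull a dependent instruction up next to its producer when the
# instruction between them is independent of it, matching the
# reordering shown in the listing.
def reorder_pair(program):
    prog = list(program)
    if len(prog) == 3:
        _, dest1, _ = prog[1]
        _, _, src2 = prog[2]
        if dest1 not in src2:          # instr 3 doesn't need instr 2's result
            prog[1], prog[2] = prog[2], prog[1]
    return prog

code = [("LDR R0,_a", "R0", []),
        ("LDR R1,_b", "R1", []),
        ("ADD R0,R0,#5", "R0", ["R0"])]
print([text for text, _, _ in reorder_pair(code)])
# ['LDR R0,_a', 'ADD R0,R0,#5', 'LDR R1,_b']
```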
45 OUT-OF-ORDER EXECUTION Some CPUs, however, go one step further: they reorder the instructions themselves, to execute them in the best manner for the CPU design. This is known as out-of-order execution. Lots of tricks are used to implement it, e.g. register renaming.
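Register renaming, one of the tricks mentioned, can be sketched as follows: each write to an architectural register is given a fresh physical register, which removes false dependences between unrelated reuses of the same register name. The encoding and register names are illustrative:

```python
# Minimal register-renaming pass: map each architectural destination
# to a fresh physical register, and rewrite sources through the map.
def rename(program):
    mapping = {}       # architectural register -> current physical register
    next_phys = 0
    out = []
    for dest, sources in program:
        srcs = [mapping.get(s, s) for s in sources]
        mapping[dest] = f"P{next_phys}"
        next_phys += 1
        out.append((mapping[dest], srcs))
    return out

# R0 is written twice; after renaming the two writes go to different
# physical registers, so the second no longer has to wait for the first.
prog = [("R0", ["R1", "R2"]),   # ADD R0, R1, R2
        ("R2", ["R0"]),         # SUB R2, R0, #5
        ("R0", ["R3", "R4"])]   # ADD R0, R3, R4
print(rename(prog))
# [('P0', ['R1', 'R2']), ('P1', ['P0']), ('P2', ['R3', 'R4'])]
```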
46 MULTI-CORE These kinds of tricks can only get us so far, and require a lot of logic to implement. The alternative is to have lots of separate CPU cores, and rewrite our programs to run in parallel. But that brings its own issues.
47 [Recap of the pipeline diagram from slide 4: INST 1 through INST 12 overlapping in the FETCH, DECODE and EXECUTE stages.] Once we get going, we finish execution of an instruction every clock cycle.
MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationSISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:
SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs
More informationBackground: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation
Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined
More informationECE 505 Computer Architecture
ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationChapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction
More informationCS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your
More informationArchitectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.
Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationEIE/ENE 334 Microprocessors
EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/
More informationIF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB
EE 4720 Homework 4 Solution Due: 22 April 2002 To solve Problem 3 and the next assignment a paper has to be read. Do not leave the reading to the last minute, however try attempting the first problem below
More informationControl Flow and Loops. Steven R. Bagley
Control Flow and Loops Steven R. Bagley Introduction Started to look at writing ARM Assembly Language Saw the structure of various commands Load (LDR), Store (STR) for accessing memory SWIs for OS access
More informationEXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM
EXAM #1 CS 2410 Graduate Computer Architecture Spring 2016, MW 11:00 AM 12:15 PM Directions: This exam is closed book. Put all materials under your desk, including cell phones, smart phones, smart watches,
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationPipelining and Vector Processing
Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC
More informationThe Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
The Processor (3) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationINSTRUCTION LEVEL PARALLELISM
INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationChapter 06: Instruction Pipelining and Parallel Processing
Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationThomas Polzer Institut für Technische Informatik
Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationDYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING
DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationChapter 4. The Processor
Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA
More informationThe Processor: Improving the performance - Control Hazards
The Processor: Improving the performance - Control Hazards Wednesday 14 October 15 Many slides adapted from: and Design, Patterson & Hennessy 5th Edition, 2014, MK and from Prof. Mary Jane Irwin, PSU Summary
More informationWriting ARM Assembly. Steven R. Bagley
Writing ARM Assembly Steven R. Bagley Introduction Previously, looked at how the system is built out of simple logic gates Last week, started to look at the CPU Writing code in ARM assembly language Assembly
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationSuperscalar Processors Ch 14
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More information5008: Computer Architecture HW#2
5008: Computer Architecture HW#2 1. We will now support for register-memory ALU operations to the classic five-stage RISC pipeline. To offset this increase in complexity, all memory addressing will be
More informationHakim Weatherspoon CS 3410 Computer Science Cornell University
Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer. memory inst register
More informationWriting ARM Assembly. Steven R. Bagley
Writing ARM Assembly Steven R. Bagley Hello World B main hello DEFB Hello World\n\0 goodbye DEFB Goodbye Universe\n\0 ALIGN main ADR R0, hello ; put address of hello string in R0 SWI 3 ; print it out ADR
More informationPipelining: Overview. CPSC 252 Computer Organization Ellen Walker, Hiram College
Pipelining: Overview CPSC 252 Computer Organization Ellen Walker, Hiram College Pipelining the Wash Divide into 4 steps: Wash, Dry, Fold, Put Away Perform the steps in parallel Wash 1 Wash 2, Dry 1 Wash
More informationInstruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties
Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,
More information4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16
4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More information