PIPELINES AND ILP
STEVEN R. BAGLEY
Photo: David Wright (https://www.flickr.com/photos/dhwright/3312563248)


INTRODUCTION Been considering what makes the CPU run at a particular speed Spent the last two weeks looking at memory latency And how caching can help speed things up, by reducing the time to fetch instructions and data Today, look at other tricks used by CPU designers to make the CPU run fast

[Diagram: the FETCH, DECODE and EXECUTE stages repeated for successive instructions against the clock cycle] Sets the minimum time any instruction will take to run at 3 cycles (one cycle for each stage).

[Diagram: INST 1 to INST 12 with their FETCH, DECODE and EXECUTE stages overlapped across successive clock cycles] Once we get going, we finish execution of an instruction every clock cycle
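The throughput claim above can be checked with a small sketch (a hypothetical helper, not from the lecture): on an ideal pipeline the first instruction takes one cycle per stage to drain through, then one instruction completes every cycle.

```python
def pipeline_cycles(stages, instructions):
    """Total cycles to run `instructions` instructions on an ideal
    pipeline with no hazards: `stages` cycles for the first instruction
    to pass through, then one completion every subsequent cycle."""
    return stages + (instructions - 1)

# The 3-stage fetch/decode/execute pipeline running 12 instructions:
print(pipeline_cycles(3, 12))   # 14 cycles, versus 3 * 12 = 36 unpipelined
```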

BUBBLES Pipeline hazards introduce bubbles into the pipeline Points in time where the CPU isn't executing an instruction because the hazard forced the delay of an earlier stage Also known as a pipeline stall Size of the bubble depends on the instructions In the worst case, we can end up with instructions effectively executing serially Can rewrite our code to be more pipeline friendly by reordering the instructions

CONTROL HAZARD Branches cause another form of pipeline hazard, a Control Hazard When the proper instruction cannot execute in the next clock cycle because a different instruction was fetched With a conditional branch, cannot know until the branch is executed whether you'll get a control hazard You might have fetched the right instruction, you might not AKA a branch hazard

CONTROL HAZARD In our case, the unconditional branch means we definitely haven't fetched the correct instruction Need to discard the currently fetched and decoded instructions and start again Causes a stall almost as long as the pipeline Not just branches: any instruction which alters the PC can cause this

[Diagram: the pipeline fetches SWI 4 and SWI 2 after the B _cmp, discards them, and restarts with CMP, then BLE] The program:

0x00            B    main
0x04 _a         DEFW 163
0x08 _b         DEFW 173
0x0C main       LDR  R0, _a
0x10            LDR  R1, _b
                ; euclid routine goes here
0x14            B    _cmp
0x18 loop       CMP  R0, R1
0x1C            BLE  skiptoelse
0x20            SUB  R0, R0, R1
0x24            B    end
0x28 skiptoelse SUB  R1, R1, R0
0x2C end
     _cmp       CMP  R0, R1
0x30            BNE  loop
                ; R0 (and R1) contain result
0x34            SWI  4
0x38            SWI  2

Fetch the correct instruction (a CMP R0, R1) in this case it happens to be the same instruction but it could be any instruction Pipeline then continues as before Until we reach another branch, when the same thing happens Although in this case it is a conditional branch, so we might be in a position where the condition matches

MITIGATING CONTROL HAZARD Control Hazards introduce a bubble that is one stage shorter than the pipeline It's possible to design the CPU instruction set to mitigate this in some circumstances Can lead to some interesting instruction sets E.g. always execute the instruction after the branch (a branch delay slot) In this case, the pipeline is 3 stages, so the stall is 2 cycles long

CONDITIONAL INSTRUCTIONS ARM designers took a different approach Realised that some branches exist only to skip one or two instructions Decided to make every instruction conditional (not just branches) Any ARM instruction can have a condition code placed on it Instruction is only executed if the condition is met Means we only have a one-cycle bubble (in the execute phase) As in our Euclid example Show how we can rewrite our Euclid example in three lines using this
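The rewrite the slide alludes to can be mirrored in Python (a sketch, since the slide's actual code is not in the transcript); the ARM mnemonics in the comments are the standard conditional forms and are my assumption about what the lecture shows.

```python
def euclid_gcd(a, b):
    """Subtraction-based Euclid's algorithm. With ARM conditional
    instructions the loop body needs no forward branches: each
    subtraction only 'executes' when its condition holds."""
    while a != b:        # loop: CMP R0, R1 ... BNE loop
        if a > b:        # SUBGT R0, R0, R1 (runs only if R0 > R1)
            a -= b
        else:            # SUBLT R1, R1, R0 (runs only if R0 < R1)
            b -= a
    return a

print(euclid_gcd(163, 173))   # the _a and _b values from the listing -> 1
```

Because the two subtractions are predicated on the comparison, the only branch left in the ARM version is the backwards loop branch, so the skiptoelse/end branches (and their control hazards) disappear.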

PIPELINE LENGTH Pipeline length depends on the implementation of the CPU For example, the MIPS CPU has a five stage pipeline Instruction Fetch (from memory) (IF) Decode and read values from registers (ID) Execute operation or calculate address (i.e. use ALU) (EX) Access operand in data memory (MEM) Write back result into registers (WB) Instruction set is designed to allow this to happen

PIPELINE LENGTH As the pipeline is broken down into smaller steps The steps do less and take less time to run So the clock can run faster But the cost of a stall (e.g. for a branch) becomes much greater And more types of hazard can appear Another common hazard is the data hazard The Pentium 4 had a 20-stage pipeline, so a branch stall would take many clock cycles
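A rough model (all numbers made up, not from the lecture) shows why deeper pipelines pay more per branch: each flush costs roughly one cycle per stage, so a faster clock can be cancelled out by bigger stalls.

```python
def run_time_ns(n_instr, stages, cycle_ns, flush_frac):
    """Crude model: one instruction completes per cycle, except that a
    fraction `flush_frac` of instructions cause a full pipeline flush
    costing (stages - 1) extra cycles each."""
    cycles = n_instr * (1 + flush_frac * (stages - 1))
    return cycles * cycle_ns

# Hypothetical numbers: a 20-stage pipeline clocks 2.5x faster than a
# 5-stage one, but if 20% of instructions flush it, it ends up slower.
shallow = run_time_ns(1_000_000, 5, cycle_ns=1.0, flush_frac=0.2)
deep = run_time_ns(1_000_000, 20, cycle_ns=0.4, flush_frac=0.2)
print(shallow < deep)   # True
```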

DATA HAZARD Data hazard occurs when an instruction needs a value That hasn't yet been calculated by a previous instruction Take the following ARM code:

ADD R0, R1, R2
SUB R2, R0, #5

Second instruction cannot begin executing until the value for R0 is calculated Now let's look at how this would play out in a MIPS-like pipeline

ADD R0, R1, R2: IF ID EX MEM WB SUB R2, R0, #5: IF ID EX MEM WB [Diagram, built up over several slides: R0 UPDATED in the ADD's WB stage, R0 FETCHED BY SUB during its ID stage, before the update] The ADD instruction doesn't update the register R0 until the WB phase The SUB reads R0 before it is updated So do we need to stall the CPU and shift the final stages of the SUB until after the WB?

MITIGATING DATA HAZARDS Can use an approach called forwarding or bypassing to mitigate a data hazard Rather than have the instruction wait for the data to be written back We provide a shortcut from the internal buffers in the CPU to provide the data Rather than needing to fetch it from the register file

ADD R0, R1, R2: IF ID EX MEM WB SUB R2, R0, #5: IF ID EX MEM WB [Diagram: R0 VALUE CALCULATED at the end of the ADD's EX stage, R0 NEEDED BY SUB at the start of its own EX stage, one cycle later] The ADD instruction calculates the value of R0 in the EX phase The SUB doesn't need it until its EX phase, so we provide a shortcut in the CPU design to get the value into the right place
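The timing argument can be made concrete with a toy calculation; the stage numbering is my assumption (the classic 5-stage pipeline, registers read in ID and written in WB, with no same-cycle write-then-read).

```python
def dependent_pair_stalls(forwarding):
    """Stall cycles for back-to-back dependent instructions, e.g.
    ADD R0, R1, R2 followed by SUB R2, R0, #5.
    Cycle numbers for the ADD: IF=1, ID=2, EX=3, MEM=4, WB=5."""
    ready = 3 if forwarding else 5    # forwarded from EX, or written in WB
    # The SUB issues one cycle behind: without forwarding it reads R0 in
    # its ID stage (cycle 3); with forwarding it needs the value at the
    # input of its EX stage (cycle 4).
    needed = 4 if forwarding else 3
    return max(0, (ready + 1) - needed)

print(dependent_pair_stalls(forwarding=False))  # 3 stall cycles
print(dependent_pair_stalls(forwarding=True))   # 0 stall cycles
```

The exact penalty without forwarding depends on register-file details (some designs write in the first half of a cycle and read in the second, saving a cycle), but forwarding removes the bubble entirely for this ALU-to-ALU case.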

PIPELINES All pipeline stages must take the same amount of time to complete Or rather, the longest stage defines the time every stage of the pipeline takes to run Doesn't matter if a stage completes early We can design our instruction set to help this
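In other words, the clock period is set by the slowest stage; a tiny illustration with made-up stage latencies:

```python
def clock_period_ns(stage_latencies):
    """All stages advance together, so the slowest one sets the clock;
    a stage that finishes early just waits for the next clock edge."""
    return max(stage_latencies)

# Made-up latencies: splitting the slow 0.9 ns stage into two 0.45 ns
# stages lets the whole pipeline clock at 0.5 ns instead of 0.9 ns.
print(clock_period_ns([0.5, 0.9, 0.4]))          # 0.9
print(clock_period_ns([0.5, 0.45, 0.45, 0.4]))   # 0.5
```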

DESIGNING INSTRUCTION SETS FOR PIPELINING Helps if all instructions are the same length Means the instruction fetch always takes the same amount of time Also helps if there is regularity in the bit patterns used to express instructions E.g. the bits for a register are in the same place for each instruction And if memory access is separated from other instructions Compare ARM, where each instruction is 4 bytes With x86, where instructions vary from 1 to 16 bytes (and the length isn't known until you start decoding) Modern CPUs translate x86 instructions into RISC-like instructions internally

BRANCH PREDICTION Control Hazards happen when the CPU has started to fetch the wrong instruction Instructions pass through the early stages of the pipeline But are not needed, so the work gets thrown away And the CPU has to start again and fetch the correct instruction

[Diagram, over two slides: after the B _cmp the pipeline has fetched SWI 4 and SWI 2, which are discarded while the correct CMP is fetched; same program listing as before] Here we had started to fetch SWI 4 and SWI 2 after our branch, which we don't need So the next fetch has to get the correct CMP instruction, causing a stall

BRANCH PREDICTION Our CPU is using a very naive approach to fetching the next instruction Always fetches the next one linearly in memory But with loops this is almost always going to be the wrong instruction The loop will usually happen several times And only on the last iteration does the next instruction in memory get executed Surely it'd make more sense to assume the branch was taken? Makes the pipeline construction more complex, but doable

[Diagram, built up over several slides: the pipeline now assumes the branch is taken, so after the B _cmp it fetches CMP R0, R1 and then BLE rather than SWI 4; same program listing as before] Assuming the branch is taken Rather than fetching SWI 4 here, we'd want to fetch CMP R0, R1 and then BLE, and move these up the stages But what do we fetch after the conditional BLE?

BRANCH PREDICTION Relatively easy to predict which way a loop will branch (i.e. to loop) However, for branches used to implement a conditional statement it is much harder Which is the best path to take by default? Need the CPU to be able to predict which way the branch will go

BRANCH PREDICTION CPU uses the past to predict how a branch will be taken Keeps track of how many times it branched and how many times it didn't For the branch instructions it has seen recently Uses these statistics to work out which instruction is the best one to fetch next Requires considerable logic to implement
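A common concrete form of this bookkeeping is a two-bit saturating counter per branch; this sketch is a standard textbook scheme, not necessarily the exact one the lecture has in mind.

```python
def two_bit_predictor(outcomes):
    """Two-bit saturating counter: state 0-3, predict 'taken' when the
    state is 2 or 3. A single surprise nudges the counter but doesn't
    flip the prediction, which suits loop branches.
    Returns the number of correct predictions."""
    state, correct = 2, 0                 # start weakly 'taken'
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move towards 3 on taken, towards 0 on not taken, saturating.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct

# A loop branch taken 9 times then not taken at exit, run twice:
pattern = ([True] * 9 + [False]) * 2
print(two_bit_predictor(pattern))   # 18 correct out of 20
```

The single not-taken at loop exit is mispredicted, but because the counter only drops from 3 to 2, the next run of the loop is still predicted correctly from its first iteration.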

SPECULATIVE EXECUTION Branch prediction is an example of speculative execution CPU is doing some work on the assumption that it'll probably be needed But it might also end up being thrown away Depending on the pipeline design this could get as far as actually calculating results

INSTRUCTION-LEVEL PARALLELISM Pipelining speeds up the CPU by enabling many instructions to execute at once Known as Instruction-level parallelism Largely invisible to the programmer But limited in the amount of parallelism we can exploit Due to the structure of the CPU data path Although if you know how things work, you can construct code to benefit

Also saw how the data flows through the CPU Highlight how data flows

SUPERSCALAR But what if we built the CPU with more than one ALU? CPU could perform two additions at the same time Would be able to execute two instructions at the same time A CPU designed like this is described as superscalar Can get the time taken to execute an instruction below one clock cycle For certain instructions

SUPERSCALAR CPU fetches two instructions in one clock cycle CPU decodes two instructions in one clock cycle CPU executes two instructions in one clock cycle Result is that each instruction appears to complete in 0.5 clock cycles Where possible Not possible if the second instruction depends on the output of the first Or the first is a branch
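The pairing rules above can be sketched as a greedy dual-issue model (the tuple encoding of instructions is mine, purely illustrative):

```python
def dual_issue_cycles(program):
    """Greedy dual-issue: each cycle issues up to two instructions, but
    the second slot stays empty if it reads a register the first slot
    writes, or if the first instruction is a branch.
    Each instruction is (dest_regs, source_regs, is_branch)."""
    cycles = i = 0
    while i < len(program):
        cycles += 1
        dests, _, is_branch = program[i]
        if i + 1 < len(program):
            _, sources, _ = program[i + 1]
            if not is_branch and not set(dests) & set(sources):
                i += 2          # both instructions issue this cycle
                continue
        i += 1                  # only one instruction issues
    return cycles

# ADD R0,R1,R2 then SUB R2,R0,#5: the SUB reads R0, so they can't pair.
dependent = [(["R0"], ["R1", "R2"], False), (["R2"], ["R0"], False)]
print(dual_issue_cycles(dependent))   # 2 cycles rather than 1
```

Two independent instructions issue together in one cycle, so over a long run of independent code this model approaches the 0.5 cycles per instruction the slide describes.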

APPLE A8 CPU Taken from analysis at http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed Several different data paths that instructions can take through the CPU Not all are equal: it's up to the control logic to make sure each instruction follows the correct path

IN-ORDER The CPU we have considered would be described as being in-order Executes the instructions in the order they appear in memory Program needs to be written to ensure a superscalar CPU can execute the instructions in parallel Up to the programmer/compiler to order the code carefully to get the best performance Problem: the best order varies from CPU implementation to implementation Works OK in some applications

LDR R0, _a
LDR R1, _b
ADD R0, R0, #5

reordered to:

LDR R0, _a
ADD R0, R0, #5
LDR R1, _b

Saw a situation like this earlier on our simple CPU Caused a stall because we can't fetch the ADD instruction until after LDR R1, _b has completed executing But if we reorder the instructions Same effect, but we reduce the stall to one cycle

OUT-OF-ORDER EXECUTION Some CPUs however go one step further Will reorder the instructions in hardware to execute them in the best manner for the CPU design Known as out-of-order execution Lots of tricks are used to implement this, e.g. register renaming

MULTI-CORE These kinds of tricks can only get us so far They require a lot of logic to implement The alternative is to have lots of separate CPU cores And rewrite our programs to run in parallel But that brings its own issues
