Improve performance by increasing instruction throughput


Improve performance by increasing instruction throughput.

[Figure: program execution timelines for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). In the single-cycle design each instruction takes 8 ns (instruction fetch, ALU, data access), so a new instruction starts every 8 ns. In the pipelined design every stage is stretched to the 2 ns of the slowest stage and a new instruction starts every 2 ns.]
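A quick check with the figure's numbers (8 ns per single-cycle instruction, five pipeline stages each stretched to 2 ns):

    single-cycle: 3 instructions x 8 ns        = 24 ns
    pipelined:    5 stages x 2 ns + 2 x 2 ns   = 14 ns
    speedup:      24 / 14 ~ 1.7  (approaching 8/2 = 4 as the instruction count grows)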

What makes it easy?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores

What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

We'll build a simple pipeline and look at these issues. Then we'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.

[Figure: the single-cycle datapath divided into five stages. IF: instruction fetch; ID: instruction decode/register file read; EX: execute/address calculation; MEM: memory access; WB: write back.]

What do we need to add to actually split the datapath into stages?

[Figure: the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

[Figure: multiple-clock-cycle pipeline diagram for lw $10, 20($1) followed by sub $11, $2, $3, showing each instruction flowing through IM, Reg, ALU, and DM across clock cycles CC 1 to CC 6.]

This representation can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?

Use it to help understand datapaths.
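The first question has a closed form: with k pipeline stages and N instructions (and no stalls), the code takes

    k + (N - 1) cycles

Here k = 5 and N = 2, which is why the diagram spans CC 1 through CC 6.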

[Figure: the pipelined datapath annotated with its control signals: PCSrc on the PC mux; RegWrite on the register file; ALUSrc, ALUOp, and RegDst in EX; Branch, MemRead, and MemWrite in MEM; MemtoReg on the write-back mux.]

We have 5 stages. What needs to be controlled in each stage?
- Instruction Fetch and PC Increment
- Instruction Decode / Register Fetch
- Execution
- Memory Stage
- Write Back

Pass control signals along just like the data.

- Execution/address calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
- Memory access stage control lines: Branch, MemRead, MemWrite
- Write-back stage control lines: RegWrite, MemtoReg

              RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
    R-format    1       1       0       0       0        0        0         1         0
    lw          0       0       0       1       0        1        0         1         1
    sw          X       0       0       1       0        0        1         0         X
    beq         X       0       1       0       1        0        0         0         X

[Figure: the WB, M, and EX control fields carried forward through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

[Figure: the complete pipelined datapath with control. The control unit generates the WB, M, and EX signal groups in ID, and they travel through the ID/EX, EX/MEM, and MEM/WB registers to the stage that uses them.]

Problem with starting the next instruction before the first has finished: dependencies that go backward in time are data hazards.

[Figure: pipeline diagram (CC 1 to CC 9) for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). The value of register $2 is 10 through CC 4, 10/20 during CC 5 (written in WB), and 20 afterward, so the and and or read the stale value.]

Have the compiler guarantee no hazards. Where do we insert the nops?

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Problem: this really slows us down!
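One possible placement, a sketch assuming the usual register file timing (writes in the first half of a cycle, reads in the second half), so two nops after the sub cover every later reader of $2:

    sub $2, $1, $3
    nop                  # $2 not yet written back
    nop                  # $2 is written during WB in this cycle
    and $12, $2, $5      # ID now reads the new $2
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)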

Use temporary results, don't wait for them to be written:
- register file forwarding to handle a read and write to the same register in one cycle
- ALU forwarding

[Figure: the same sequence (CC 1 to CC 9) with forwarding. Register $2 is 10/20 during CC 5 and 20 afterward; EX/MEM holds 20 during CC 4 and MEM/WB holds 20 during CC 5, and those values are forwarded to the ALU inputs of the dependent instructions.]
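Reading the diagram as code, a sketch of which path serves each dependent read (the standard EX/MEM and MEM/WB forwarding paths shown in the figure):

    sub $2, $1, $3      # $2 computed in EX (CC 3), written back in CC 5
    and $12, $2, $5     # EX in CC 4: gets $2 forwarded from EX/MEM
    or  $13, $6, $2     # EX in CC 5: gets $2 forwarded from MEM/WB
    add $14, $2, $2     # ID in CC 5: register file write-then-read suffices
    sw  $15, 100($2)    # ID in CC 6: ordinary register read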

[Figure: the datapath with a forwarding unit. The unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the Rs and Rt register numbers carried into EX through ID/EX, and steers the ALU input muxes to the forwarded values when they match.]
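The figure's comparisons correspond to the textbook's forwarding conditions; the EX-hazard case for the first ALU operand, for example, is:

    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRs))
            ForwardA = 10    # take the ALU's first operand from EX/MEM

with a matching MEM-hazard test against MEM/WB.RegisterRd.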

Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.

[Figure: pipeline diagram (CC 1 to CC 9) for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The loaded value leaves the data memory only at the end of the lw's MEM stage, one cycle too late for the and's EX stage even with forwarding.]

Thus, we need a hazard detection unit to stall the instruction that uses the load result.
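A sketch of the resulting stall (the standard one-bubble load-use delay):

    lw  $2, 20($1)      # $2 available only after MEM (CC 4)
    # the hazard detection unit inserts one bubble here
    and $4, $2, $5      # EX delayed to CC 5: $2 forwarded from MEM/WB
    or  $8, $2, $6      # continues normally after the stall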

We can stall the pipeline by keeping an instruction in the same stage.

[Figure: the same sequence with the stall shown (CC 1 to CC 10): the and repeats its ID stage, a bubble occupies the vacated EX slot, and the or and everything after it are delayed by one cycle.]

Stall by letting an instruction that won't write anything go forward.

[Figure: the datapath with a hazard detection unit. It examines ID/EX.MemRead and the ID/EX.RegisterRt field against the Rs and Rt fields in IF/ID; on a load-use hazard it deasserts PCWrite and IF/IDWrite to freeze fetch, and a mux forces the control values for the instruction in ID to 0, creating the bubble.]
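In the textbook's formulation (all of these signal names appear in the figure), the stall condition is:

    if (ID/EX.MemRead
        and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
          or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
            stall the pipeline    # freeze PC and IF/ID, zero the control signals in ID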

When we decide to branch, other instructions are in the pipeline!

[Figure: pipeline diagram (CC 1 to CC 9) for 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7). By the time the beq resolves, the three sequential instructions after it have already entered the pipeline; the lw at target address 72 is fetched next.]

We are predicting "branch not taken"; we need to add hardware for flushing instructions if we are wrong.

[Figure: the datapath with branch-flush hardware: the branch target adder and an equality comparator (=) operate in ID alongside the hazard detection and forwarding units, and an IF.Flush signal zeroes the instruction in the IF/ID register when the branch is taken, turning the wrongly fetched instruction into a nop.]

Try to avoid stalls! E.g., reorder these instructions (a reordering is sketched below):

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)

Add a branch delay slot: the instruction after a branch is always executed; rely on the compiler to fill the slot with something useful.

Superscalar: start more than one instruction in the same cycle.
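The sketch: sw $t2 uses $t2 in the cycle right after its load, which forces a load-use stall. Swapping the two stores removes it (safe here because 0($t1) and 4($t1) are different words):

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)    # moved up: $t0 has long been available
    sw $t2, 0($t1)    # now two instructions after its lw, so no stall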

High-speed execution is based on instruction-level parallelism (ILP): the potential of short instruction sequences to execute in parallel. High-speed microprocessors exploit ILP by:
- pipelined execution: overlap instructions
- superscalar execution: issue and execute multiple instructions per clock cycle
- out-of-order execution

Memory accesses in a high-speed microprocessor: data cache, possibly multiported, with multiple levels.

- Pipeline
- Superscalar
- Superpipeline
- Vector Processing
- VLIW
- Dynamic Scheduling
- Branch Prediction
- Cache

[Figure: clock diagrams contrasting the three implementations. Single-cycle: Load then Store, each in one long cycle, with wasted slack in the cycle. Multiple-cycle (Cycle 1 to Cycle 10): Load = Ifetch, Exec, Mem, Wr, with Store and R-type taking fewer cycles. Pipelined: Load, Store, and R-type overlapped, one instruction completing per cycle.]

- Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipeline stages
- Pipeline hazards reduce performance below the ideal speedup gained by pipelining

Superscalar: execute different instructions in parallel; dependencies (data or control) must be checked.

[Figure: superscalar execution with degree n = 3.]

For an m-issue machine running N instructions through k pipeline stages:
- Time on the scalar base machine: T(1,1) = N + k - 1 (base cycles)
- Ideal execution time of the m-issue superscalar machine: T(m,1) = k + (N - m)/m (base cycles)
- Ideal speedup: S(m,1) = T(1,1) / T(m,1) = (N + k - 1) / (N/m + k - 1)
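A worked example with assumed values k = 5, N = 120, m = 3:

    T(1,1) = 120 + 5 - 1        = 124 base cycles
    T(3,1) = 5 + (120 - 3)/3    = 44 base cycles
    S(3,1) = 124 / 44           ~ 2.8

so the ideal speedup approaches m as N grows.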

Superpipeline of degree n: the cycle time is 1/n of the base cycle.

[Figure: superpipelined execution with degree n = 3.]

A superpipelined superscalar of degree (m, n) executes m instructions every cycle with a pipeline cycle time 1/n of the base cycle.

[Figure: superpipelined superscalar execution of degree (3, 3).]

Vector processors have high-level operations that work on linear arrays of numbers: "vectors".

    SCALAR (1 operation):          VECTOR (N operations):
    r3 = r1 + r2                   v3 = v1 + v2, over the vector length
    add r3, r1, r2                 add.vv v3, v1, v2
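As a sketch of why one vector instruction replaces an entire loop, compare MIPS-style scalar code with the slide's vector form (the vector instruction and its setup are illustrative, in the style of add.vv above):

    # scalar: C[i] = A[i] + B[i], one element per iteration
    loop: lw   $t0, 0($a0)        # load A[i]
          lw   $t1, 0($a1)        # load B[i]
          add  $t2, $t0, $t1      # compute the sum
          sw   $t2, 0($a2)        # store C[i]
          addi $a0, $a0, 4        # advance the three pointers
          addi $a1, $a1, 4
          addi $a2, $a2, 4
          addi $t3, $t3, -1       # one fewer element to go
          bne  $t3, $zero, loop   # many instruction fetches per element

    # vector: assuming v1 and v2 hold A and B and the vector length is set
          add.vv v3, v1, v2       # the whole array in one instruction fetch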

- Each result is independent of the previous result => long pipeline, compiler ensures no dependencies => high clock rate
- Vector instructions access memory with a known pattern => highly interleaved memory => amortized memory latency => no data caches required! (do use an instruction cache)
- Reduces branches and branch problems in pipelines
- A single vector instruction implies lots of work => fewer instruction fetches

VLIW: originated from parallel microcode; code compaction is done by the compiler.

- Easy decoding
- Possibly low code density
- Different kinds of parallelism require different instruction sets: random parallelism suits VLIW, regular parallelism suits vector and SIMD

Hardware rearranges instruction execution to prevent stalls:
- enables handling dependencies unknown at compile time (e.g., memory references) and simplifies the compiler
- enables compiled code to run efficiently on different platforms
- but complicates the hardware
- and complicates exception handling: imprecise exceptions, difficult to restart after an interrupt

Dynamic ordering applies to both issue and completion.

Central window:
- complex (de)allocation
- each entry must be able to hold any type of instruction
- selects among a larger number of instructions

Reservation stations:
- instructions partitioned by functional unit
- simple, duplicated control logic
- need a larger number of entries for equivalent performance

Modern microprocessors have deeper and deeper pipelines and wider and wider execution, so control hazards become a frequent bubble maker and a significant performance factor. How do we solve the hazard?
- loop counters
- conditional instructions
- branch prediction

Static branch prediction:
- predict-taken: MIPS-X
- predict-not-taken: Motorola MC88000
- prediction bit in the instruction, with compiler support: PowerPC, SPARC V9

Dynamic branch prediction:
- branch history: taken/not-taken outcomes of previous branches, indexed by branch address; two-level adaptive, correlated, and hybrid predictors
- Branch Target Buffer: a cache looked up with the branch address that returns the target address without computing it, so the target instruction can be fetched without a stall

The dilemma: we want memory that is fast, large, and not expensive. Locality solves the dilemma.

Memory hierarchy: CPU - Cache - Main Memory.

Design considerations: cache size, block mapping, replacement strategy, write policy, unified vs. separated caches, cache coherency.

Three ways to improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
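These three targets are exactly the terms of the standard average memory access time equation:

    AMAT = hit time + miss rate x miss penalty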

Classifying misses: the 3 Cs
- Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
- Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)
- 4th C, Coherence: misses caused by cache coherence.

Reducing the miss rate:
- Fast hit time + low conflict => victim cache: add a small buffer to hold data recently discarded from the cache
- Hardware prefetching of instructions and data: relies on extra memory bandwidth that can be used without penalty
- Software prefetching of data
- Compiler optimizations for instructions and data, e.g., blocking

Reducing the miss penalty:
- Read priority over write on a miss: write-through with write buffers creates RAW conflicts with main memory reads on cache misses; write-back wants a buffer to hold displaced blocks
- Early restart and critical word first: don't wait for the full block to be loaded before restarting the CPU; generally useful only with large blocks, and since spatial locality means we tend to want the next sequential word anyway, the benefit of early restart is not clear
- Non-blocking caches to reduce stalls on misses
- Add a second-level cache

Reducing the hit time:
- Small and simple caches
- Avoiding address translation: if the index is in the physical part of the address, tag access can start in parallel with translation and be compared to the physical tag; this limits the cache index to the page size
- Pipelining the cache

References:
- David A. Patterson, John L. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann
- David A. Patterson, John L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Morgan Kaufmann
- IDEC