Lecture 17: Short Pipelining Review (CSE 30321)


Slide 1: Lecture 17 - Short Pipelining Review

Slide 2: Suggested Readings
- H&P: Chapter 4.5-4.7 (over the next 3-4 lectures)

Slide 3: Goal
- Describe the fundamental components required in a single core of a modern microprocessor as well as how they interact with each other, with main memory, and with external storage media.
- Related themes: processor components, multicore processors and programming, writing more efficient code, HLL code translation, the right HW for the right application, processor comparison.

Slide 4: Recap: Pipelining improves throughput
- [Pipeline diagram: clock numbers 1-8 across the top, instructions i through i+3 listed in program execution order, each entering the pipe one cycle after the previous; unpipelined vs. pipelined execution over time.]
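The overlap in that diagram is what raises throughput: once the pipe is full, a new instruction can start every cycle. A minimal sketch of the occupancy table, assuming the classic IF/ID/EX/MEM/WB stage names and four instructions (both are illustrative assumptions, not details taken from the slides):

STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed 5-stage MIPS-style pipeline

def pipeline_chart(n_instructions: int) -> None:
    """Print which stage each instruction occupies in each clock cycle."""
    total_cycles = n_instructions + len(STAGES) - 1   # N + (S-1) cycles overall
    print("cycle:    " + " ".join(f"{c:>4}" for c in range(1, total_cycles + 1)))
    for i in range(n_instructions):
        row = []
        for c in range(1, total_cycles + 1):
            stage_idx = c - 1 - i          # instruction i enters IF in cycle i+1
            row.append(f"{STAGES[stage_idx]:>4}" if 0 <= stage_idx < len(STAGES) else "    ")
        print(f"inst i+{i}: " + " ".join(row))

pipeline_chart(4)   # instructions i .. i+3, as in the slide's diagram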

Slide 5: Recap: pipeline math
- If the times for all S stages are equal to T:
  - Time for one initiation to complete is still ST
  - Time between 2 initiations = T, not ST
  - Initiations per second = 1/T
- Time for N initiations to complete: NT + (S-1)T
- Throughput: time per initiation = T + (S-1)T/N, which approaches T as N grows
- Pipelining: overlap multiple executions of the same sequence
  - Improves THROUGHPUT, not the time to perform a single operation

Slide 6: Recap: Stalls and performance
- Stalls impede progress of a pipeline and cause deviation from 1 instruction executing per clock cycle
- Pipelining can be viewed as decreasing either CPI or the clock cycle time per instruction
- Let's see what effect stalls have on CPI:
  - CPI_pipelined = Ideal CPI + pipeline stall cycles per instruction
                  = 1 + pipeline stall cycles per instruction
- Ignoring overhead and assuming the stages are balanced: with no stalls, speedup equals the number of pipeline stages in the ideal case

Slide 7: Recap: Structural hazards
- One way to avoid structural hazards is to duplicate resources
  - e.g., an ALU to perform an arithmetic operation and a separate adder to increment the PC
- If not all possible combinations of instructions can be executed simultaneously, structural hazards occur
- Most common causes:
  - A functional unit that is not fully pipelined
  - A resource that is not duplicated enough
- Pipeline stalls result from such hazards, and CPI rises above the usual 1

Slide 8: Recap: Structural hazard example
- [Diagram: a Load followed by Instructions 1-4, several of which need the single memory ("Mem") in the same cycle over time.]
- What's the problem here?
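A short numeric sketch of the two formulas on slides 5-6 (the 5-stage pipeline and 1 ns stage time are illustrative assumptions, not values from the lecture):

def pipeline_time(n_instructions: int, stages: int, stage_time_ns: float) -> float:
    """Total time for N initiations on an S-stage pipeline: NT + (S-1)T."""
    return n_instructions * stage_time_ns + (stages - 1) * stage_time_ns

def pipelined_speedup(stages: int, stall_cycles_per_instr: float) -> float:
    """Ideal speedup (= number of stages) degraded by stalls: S / (1 + stalls per instruction)."""
    cpi_pipelined = 1.0 + stall_cycles_per_instr
    return stages / cpi_pipelined

print(pipeline_time(1000, 5, 1.0))   # 1004 ns: NT + (S-1)T for 1000 instructions
print(pipelined_speedup(5, 0.0))     # 5.0 -> no stalls, speedup equals the stage count
print(pipelined_speedup(5, 0.5))     # ~3.33 -> stalls eat into the ideal speedup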

Slide 9: Recap: Data hazards
- These exist because of pipelining
- Why do they exist?
  - Pipelining changes the order of read/write accesses to operands
  - That order differs from the order seen by sequentially executing instructions on an un-pipelined machine
- Consider this example:
  - ADD R1, R2, R3
  - SUB R4, R1, R5
  - AND R6, R1, R7
  - OR  R8, R1, R9
  - XOR R10, R1, R11
- All instructions after the ADD use the result of the ADD
- The ADD writes the register in WB, but the SUB needs it in ID. This is a data hazard.

Slide 10: Recap: Forwarding & data hazards
- The problem illustrated on the previous slide can actually be solved relatively easily with forwarding
- In this example, the result of the ADD instruction is not really needed until after the ADD actually produces it
- Can we move the result from the EX/MEM register to the beginning of the ALU (where the SUB needs it)? Yes - hence this slide.
- Generally speaking:
  - Forwarding occurs when a result is passed directly to the functional unit that requires it
  - The result goes from the output of one unit to the input of another

Slide 11: Recap: Forwarding doesn't always work
- LW  R1, 0(R2)
- SUB R4, R1, R5
- AND R6, R1, R7
- OR  R8, R1, R9
- A load has a latency that forwarding can't solve: the data can't reach the subtract because the result is needed at the beginning of CC #4 but is not produced until the end of CC #4
- The pipeline must stall until the hazard clears (starting with the instruction that wants to use the data, until the source produces it)

Slide 12: Recap: HW change for forwarding
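A small sketch of the distinction slides 9-11 draw: a RAW dependence on an ALU result can be covered by forwarding, while a dependence on a load in the very next instruction still costs one stall cycle. The tuple encoding of instructions below is a hypothetical simplification for illustration only:

# Each instruction: (opcode, dest_reg, source_regs); hypothetical toy encoding.
program = [
    ("LW",  "R1", ["R2"]),
    ("SUB", "R4", ["R1", "R5"]),   # uses R1 right after the load -> load-use stall
    ("ADD", "R6", ["R4", "R7"]),   # uses R4 right after the SUB  -> forwarding covers it
]

def classify_raw_hazards(prog):
    """For each adjacent pair, report whether forwarding suffices or a stall is needed."""
    for prev, curr in zip(prog, prog[1:]):
        p_op, p_dest, _ = prev
        c_op, _, c_srcs = curr
        if p_dest in c_srcs:                      # RAW dependence on the previous result
            if p_op == "LW":
                print(f"{c_op}: load-use hazard on {p_dest} -> 1 stall cycle, then forward")
            else:
                print(f"{c_op}: RAW hazard on {p_dest} -> forward the EX/MEM result, no stall")

classify_raw_hazards(program)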

Slide 13: Recap: Hazards vs. dependencies
- Dependence: a fixed property of the instruction stream (i.e., the program)
- Hazard: a property of the program and the processor organization
  - Implies the potential for executing things in the wrong order
  - The potential only exists if instructions can be simultaneously in flight
  - A property of the dynamic distance between instructions vs. the pipeline depth
- For example, you can have a RAW dependence with or without a hazard; it depends on the pipeline

Slide 14: Recap: Branch/control hazards
- So far, we've limited the discussion of hazards to:
  - Arithmetic/logic operations
  - Data transfers
- We also need to consider hazards involving branches. Example:
  - 40: beq $1, $3, $28   # ($28 gives address 72)
  - 44: and $12, $2, $5
  - 48: or  $13, $6, $2
  - 52: add $14, $2, $2
  - ...
  - 72: lw  $4, 50($7)
- How long will it take before the branch decision takes effect?
- What happens in the meantime?

Slide 15: Recap: How branches impact a pipeline

Slide 16: Recap: Or assume the branch is not taken
- On average, branches are taken about half the time
- If the branch is not taken, continue normal processing
- Else, if the branch is taken, flush the improper instructions from the pipeline
- This cuts the overall time for branch processing roughly in half
- If the branch condition is true, we must skip 44, 48, and 52
  - But these have already started down the pipeline
  - They will complete unless we do something about it
- How do we deal with this? We'll consider 2 possibilities.
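A minimal sketch of the predict-not-taken arithmetic on slide 16: not-taken branches cost nothing extra, taken branches flush the wrong-path instructions already in the pipe. The 3-cycle flush matches skipping the three instructions at 44, 48, and 52 in the example; the 50% taken rate is the slide's "taken about half the time".

def avg_branch_penalty(taken_fraction: float, flush_penalty_cycles: int) -> float:
    """Expected cycles lost per branch when we always assume not-taken."""
    return taken_fraction * flush_penalty_cycles

print(avg_branch_penalty(0.5, 3))   # 1.5 cycles per branch on average
print(avg_branch_penalty(1.0, 3))   # 3.0 -> always flushing; assuming not-taken halves the cost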

Slide 17: Recap: Branch penalty impact
- Assume 16% of all instructions are branches:
  - 4% unconditional branches: 3-cycle penalty
  - 12% conditional branches: 50% taken
- For a sequence of N instructions (assume N is large):
  - N cycles to initiate each
  - 3 * 0.04 * N delay cycles due to unconditional branches
  - 0.5 * 3 * 0.12 * N delay cycles due to taken conditional branches
  - Plus an extra 4 cycles for the pipeline to empty
- Total: 1.3N + 4 cycles, or about 1.3 cycles/instruction (CPI)
- A 30% performance hit (a bad thing)

Slide 18: Lecture 17 wrap-up: Discussion of hazards
- Some solutions follow.

Slide 19: Branch penalty impact
- In the ISA: branches always execute the next 1 or 2 instructions
  - An instruction executed this way is said to be in the delay slot
  - See the SPARC ISA (example: a loop-counter update)
- In the organization: move the comparator to the ID stage and decide there
  - Reduces the branch delay by 2 cycles
  - Increases the cycle time

Slide 20: Branch prediction
- The prior solutions are ugly
- Better (and more common): guess in the IF stage
- The technique is called branch prediction; it needs 2 parts:
  - A predictor to guess whether an instruction will branch (and to where)
  - A recovery mechanism, i.e., a way to fix your mistake
- Prior strategy:
  - Predictor: always guess the branch is never taken
  - Recovery: flush instructions if the branch is taken
- Alternative: accumulate information in the IF stage about
  - Whether or not a branch was taken next for any particular PC value
  - To where it is taken
  - How to update with information from later stages
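A worked version of slide 17's count, using only the numbers given there; it confirms the ~1.3 CPI and the ~30% hit for large N:

def branch_penalty_cpi(n: int) -> float:
    """Cycles per instruction: base cycles plus branch delay cycles, per slide 17."""
    uncond_delays = 3 * 0.04 * n            # 4% unconditional branches, 3-cycle penalty each
    cond_taken_delays = 0.5 * 3 * 0.12 * n  # 12% conditional, half taken, 3 cycles each
    drain = 4                               # extra cycles for the pipeline to empty
    total_cycles = n + uncond_delays + cond_taken_delays + drain
    return total_cycles / n

n = 1_000_000
cpi = branch_penalty_cpi(n)
print(round(cpi, 3))                                    # ~1.3 cycles per instruction
print(f"{(cpi - 1.0) * 100:.0f}% slower than the ideal CPI of 1")   # ~30% performance hit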

Slide 21: A branch predictor

Slide 22: Computing performance (Example 5)
- Program assumptions:
  - 23% loads, and in half of those cases the next instruction uses the load value
  - 13% stores
  - 19% conditional branches
  - 2% unconditional branches
  - 43% other
- Machine assumptions:
  - 5-stage pipe with all forwarding
  - The only load penalty is 1 cycle on use of the load value immediately after a load
  - Jumps are fully resolved in the ID stage, for a 1-cycle branch penalty
  - 75% branch prediction accuracy
  - 1-cycle delay on a misprediction

Slide 23: Examples

Slide 24: Exception hazards (Examples 6-9)
- 40_hex:       sub $11, $2, $4
- 44_hex:       and $12, $2, $5
- 48_hex:       or  $13, $6, $2
- 4C_hex:       add $1,  $2, $1    (overflow in the EX stage)
- 50_hex:       slt $15, $6, $7    (already in the ID stage)
- 54_hex:       lw  $16, 50($7)    (already in the IF stage)
- 40000040_hex: sw  $25, 1000($0)  (exception handler)
- 40000044_hex: sw  $26, 1004($0)
- Need to transfer control to the exception handler ASAP
  - Don't want invalid data to contaminate registers or memory
  - Need to flush instructions already in the pipeline
  - Start fetching instructions from 40000040_hex
  - Save the address following the offending instruction (50_hex) in TrapPC (EPC)
  - Don't clobber $1; it is used for debugging
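The slides leave Example 5 to be worked in class. Below is one plausible reading of slide 22's assumptions in code; treating the load-use stalls, the 2% jumps (1 cycle each), and the 25% mispredicted conditional branches (1 cycle each) as the only stall sources is an interpretation, not the lecture's official answer.

# One possible interpretation of slide 22's assumptions (not the official worked answer).
loads, stores, cond_br, uncond_br, other = 0.23, 0.13, 0.19, 0.02, 0.43

load_use_stalls   = loads * 0.5 * 1            # half of loads feed the next instruction: 1 cycle
jump_stalls       = uncond_br * 1              # jumps resolved in ID: 1-cycle penalty each
mispredict_stalls = cond_br * (1 - 0.75) * 1   # 75% prediction accuracy, 1 cycle per miss

cpi = 1.0 + load_use_stalls + jump_stalls + mispredict_stalls
print(round(cpi, 4))    # ~1.18 cycles per instruction under this reading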

25! Flushing pipeline after exception! 26! Discussion!! How does instruction set design impact pipelining?!! Does increasing the depth of pipelining always increase performance?!! Cycle 6:!! Exception detected, flush signals generated, bubbles injected!! Cycle 7!! 3 bubbles appear in ID, EX, MEM stages!! PC gets 40000040 hex, TrapPC gets 50 hex! 27! Comparative Performance!! Throughput: instructions per clock cycle = 1/cpi!! Pipeline has fast throughput and fast clock rate!! Latency: inherent execution time, in cycles!! High latency for pipelining causes problems!! Increased time to resolve hazards!! Performance:! 28! Summary!!Execution time *or* throughput!!amdahl#s law!! Multi-bus/multi-unit circuits!!one long clock cycle or N shorter cycles!! Pipelining!!overlap independent tasks!! Pipelining in processors!! hazards limit opportunities for overlap! Board!