Lecture 5: Instruction Pipelining. Pipeline hazards.


Lecture 5: Instruction Pipelining
Basic concepts
Pipeline hazards
Branch handling and prediction
Zebo Peng, IDA, LiTH

Basic Concepts
Sequential execution of an N-stage task (stages 1, 2, ..., N performed one after the other):
Production time: N time units.
Resource needed: one general-purpose machine.
Productivity: one product per N time units.
Pipelined execution of an N-stage task (one machine per stage, all stages working in parallel on successive tasks):
Production time: N time units for the first product, then one product per time unit.
Resource needed: N special-purpose machines.
Productivity: one product per time unit.
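The timing model above can be sketched numerically. This is a minimal illustration, assuming one time unit per stage; the function names and the 6-stage/100-task figures are chosen for the example, not taken from the lecture:

```python
# Sketch of the slide's timing model: an N-stage task executed
# sequentially vs. in an N-stage pipeline (one time unit per stage).

def sequential_time(n_stages, n_tasks):
    # Each task occupies the single machine for N time units.
    return n_stages * n_tasks

def pipelined_time(n_stages, n_tasks):
    # The first task takes N units; each following task completes
    # one time unit after its predecessor.
    return n_stages + (n_tasks - 1)

n_stages, n_tasks = 6, 100
seq = sequential_time(n_stages, n_tasks)    # 600 time units
pipe = pipelined_time(n_stages, n_tasks)    # 105 time units
print(seq, pipe, round(seq / pipe, 2))      # speedup approaches 6 as tasks grow
```

With many tasks the speedup approaches the number of stages, which is the "ideal case" the lecture refers to.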

Instruction Execution Stages
A typical instruction execution sequence:
1. Fetch Instruction (FI): fetch the instruction from memory.
2. Decode Instruction (DI): determine the op-code and the operand specifiers.
3. Calculate Operands (CO): calculate the effective addresses (e.g., base addr. + offset -> effective addr.).
4. Fetch Operands (FO): fetch the operands from memory.
5. Execute Instruction (EI): perform the operation.
6. Write Operand (WO): store the result in memory.

Instruction Pipelining
With a six-stage pipeline, instructions I1, I2, ..., I9 overlap in time, each one stage behind its predecessor. This is the ideal case: a speed-up by 6 times.

Actual Instruction Pipelining
In practice, there are many holes (bubbles) in the pipeline, which reduce the speed-up factor. Different instructions have different execution patterns!

Number of Pipeline Stages
In general, a larger number of stages gives better performance. However:
A larger number of stages increases the overhead of moving information between stages and of synchronization between stages.
The complexity of the CPU grows with the number of stages.
It is difficult to keep a long pipeline at maximum rate due to hazards.
A stage may not be divisible into sub-stages.
Examples: the Intel 80486 and Pentium use a five-stage pipeline for integer instructions and an eight-stage pipeline for floating-point (FP) instructions. The Pentium 4 has 20 stages (!). IBM PowerPC processors have different numbers of stages (3-9) for different machines; e.g., the PowerPC 440 has 7 stages.

Pipeline Hazards (Conflicts)
Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. The instruction is said to be stalled. When an instruction is stalled:
All instructions later in the pipeline than the stalled instruction are also stalled;
No new instructions are fetched during the stall;
Instructions earlier than the stalled one continue as usual.
There are three types of hazards: structural hazards, data hazards and control hazards.

Structural (Resource) Hazards
Hardware conflicts caused by the use of the same hardware resource at the same time (e.g., memory conflicts). Penalty: 1 cycle per conflict (note: the performance lost is multiplied by the number of stages).

Structural Hazard Solutions
In general, the hardware resources in conflict are duplicated in order to avoid structural hazards, e.g., two ALUs. Functional units (ALU, FP unit) can also be pipelined themselves to support several instructions at the same time. Memory conflicts can be solved by:
having two separate caches, one for instructions and the other for data (Harvard architecture);
using multiple banks of the main memory; or
keeping as many intermediate results as possible in the registers (!).

Data Hazards
Caused by reversing the order of data-dependent operations due to the pipeline (e.g., WRITE/READ conflicts):
ADD A, R1; Mem(A) <- Mem(A) + R1
SUB A, R2; Mem(A) <- Mem(A) - R2
Example: A = 100, Mem(100) = 100, R1 = 30, R2 = 50. The SUB needs the value of Mem(A) in an early stage, but the ADD makes it available only in a later stage.

Data Hazard Penalty
The SUB must therefore be stalled until the value of Mem(A) is available. Penalty: 2 cycles. Data hazards are an important issue: they happen very often, since programs contain many data dependencies.

Data Hazard Solutions
The penalty due to data hazards can be reduced by a technique called forwarding (bypassing): a bypass path feeds the ALU result directly back to the ALU input multiplexer, ahead of the memory system (registers, cache and main memory). If the hardware detects that the value needed for an operation is the one produced by the previous instruction and not yet written back, the ALU selects the forwarded result instead of the value from the memory system.

Control Hazards
Caused by branch instructions, which change the instruction execution order. Consider a conditional branch "BRA 15 IF Zero": by the time the branch is resolved (and should be taken), the sequential instructions following it have already entered the pipeline. If they are allowed to complete, this may lead to unintended changes of memory contents (!)
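The forwarding decision described above can be sketched as follows. This is a minimal illustration, not the lecture's hardware design: register names and values are hypothetical, and a real bypass network compares pipeline-register fields rather than dictionary keys:

```python
# Minimal sketch of the forwarding (bypass) decision: if the previous
# instruction's destination matches the operand we need and its result has
# not been written back yet, take the value from the bypass path instead
# of reading the (stale) value from the memory system.

def read_operand(reg, regfile, prev_dest, prev_alu_result):
    if prev_dest == reg:          # value is still in the pipeline
        return prev_alu_result    # forwarded from the ALU output
    return regfile[reg]           # normal read from the register file

regfile = {"R1": 100, "R2": 30}
# Previous instruction: ADD R1, R2 (R1 <- R1 + R2 = 130), not yet written back.
prev_dest, prev_result = "R1", 130
# The current instruction reads R1: without forwarding it would see 100.
print(read_operand("R1", regfile, prev_dest, prev_result))  # 130 (bypassed)
print(read_operand("R2", regfile, prev_dest, prev_result))  # 30 (regular read)
```

The bypass removes the stall cycles in this WRITE/READ case because the consumer gets the value as soon as the ALU produces it.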

Branch Handling (1)
Stall the pipeline until the branch outcome is known. For the branch "BRA 15 IF Zero", the pipeline stalls for four cycles before the target instruction 15 can be fetched. This leads to a very large loss of performance, in particular since 20%-35% of the executed instructions are branches.

Branch Handling (2)
Multiple streams: implement hardware resources to deal with the different branch alternatives, fetching both the fall-through stream (instructions 3, 4, 5, 6) and the target stream (instructions 15, 16, 17, 18) until the branch condition is known. This is an expensive solution, and special memory techniques and control are needed to fully utilize it!

Branch Handling (3)
Pre-fetch branch target: when a conditional branch is recognized, the following instruction is fetched, and the branch target is also pre-fetched.
Loop buffer: use a small, high-speed memory to keep the n most recently fetched instructions. If a branch is to be taken, the buffer is first checked to see whether the branch target is already in it.
Special cache for branch target instructions.
Delayed branch: re-arrange the instructions so that branching takes effect later than originally specified. This is a software solution.

Delayed Branch Example (2-Stage Pipeline)
Original instruction sequence (ADD X has no data dependence with the branch):
ADD X;
BRA L;
With a delayed branch, the two are swapped:
BRA L;
ADD X;
The branch condition becomes known while ADD X executes in the branch delay slot, so no cycle is wasted. The compiler or the programmer has to find an instruction, one that will be executed regardless of the branch outcome, which can be moved from its original place into the branch delay slot. This succeeds in 60% to 85% of the cases. It leads, however, to less readable code.

Branch Prediction
When a branch is encountered, a prediction is made and the predicted path is followed: the instructions on the predicted path are fetched. The fetched instructions can also be executed, which is called speculative execution; the results produced by these executions must be marked as tentative. When the branch outcome is decided: if the prediction was correct, the special tags on the tentative results are removed; if not, the tentative results are discarded and execution continues on the other path. Branch prediction can be based on static or dynamic information.
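The compiler's delay-slot task described above can be sketched as a small search. This is an illustrative toy, not a real compiler pass: the instruction encoding (name, destination, sources) and the register names are hypothetical, and a real scheduler must also check dependences with instructions after the branch:

```python
# Sketch of delay-slot filling: find the latest instruction before the
# branch whose destination the branch condition does not depend on, and
# move it into the delay slot.

def fill_delay_slot(instrs, branch_sources):
    """instrs: list of (name, dest, srcs) tuples in program order,
    all preceding the branch. Returns (remaining instrs, slot instr),
    or (instrs, None) if the slot must hold a NOP."""
    for i in range(len(instrs) - 1, -1, -1):
        name, dest, srcs = instrs[i]
        if dest not in branch_sources:     # branch outcome unaffected
            return instrs[:i] + instrs[i + 1:], instrs[i]
    return instrs, None

# ADD X writes X; the (hypothetical) branch condition depends only on R2,
# so ADD X is safe to move into the delay slot.
instrs = [("ADD X", "X", {"X", "R1"})]
remaining, slot = fill_delay_slot(instrs, branch_sources={"R2"})
print(slot[0])  # the instruction chosen for the delay slot
```

When no independent instruction exists, the slot is filled with a NOP, which is why the success rate quoted in the lecture stays below 100%.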

Static Branch Prediction
Predict always taken: assume that the jump will happen; always fetch the target instruction.
Predict never taken: assume that the jump will not happen; always fetch the next instruction.
Predict by operation codes: some instructions are more likely to result in a jump than others, e.g., BNZ (Branch if the result is Not Zero) makes a jump more likely, while BEZ (Branch if the result Equals Zero) makes it less likely. This can reach up to 75% success.
Predict by relative position: backward-pointing branches are predicted taken (usually a loop back); forward-pointing branches are predicted not taken (often a loop exit).
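The "predict by relative position" rule is simple enough to state in one line of code. A minimal sketch, with illustrative addresses:

```python
# Static prediction by branch direction: a branch whose target address is
# below its own address points backwards (usually a loop-closing branch)
# and is predicted taken; a forward branch is predicted not taken.

def predict_taken(branch_addr, target_addr):
    return target_addr < branch_addr   # backward branch: likely a loop

print(predict_taken(0x100, 0x0F0))  # True: backward, loop-closing
print(predict_taken(0x100, 0x120))  # False: forward, likely a loop exit
```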

Dynamic Branch Prediction
Based on branch history: store information regarding branches in a branch-history table so as to predict the branch outcome more accurately. Each table entry records the branch instruction address, the prediction state, and the predicted branch target address. A simple policy: assume that the branch will do what it did last time.
One-bit predictor: two states, Not Taken (0) and Taken (1); a taken branch sets the bit, a not-taken branch clears it. For a loop-closing branch this makes two errors per loop execution: one at the loop exit and one when the loop is entered again.

Bimodal Prediction
Use a 2-bit saturating counter to predict the most common direction, where the first (most significant) bit gives the prediction. Branches evaluated as taken (T) increment the counter towards Strongly Taken (11); branches evaluated as not taken (N) decrement the counter towards Strongly Not Taken (00). The four states are Strongly Not Taken (00), Not Taken (01), Taken (10) and Strongly Taken (11). The scheme tolerates a branch going in an unusual direction one time: a loop-closing branch is mispredicted once rather than twice.
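The 2-bit scheme above can be sketched in a few lines. The state encoding (0 to 3, high bit as the prediction) follows the usual textbook convention; the initial state and the branch pattern below are chosen for illustration:

```python
# Sketch of a bimodal (2-bit saturating counter) branch predictor.
# States 0..3 = strongly not taken, not taken, taken, strongly taken;
# the high bit of the counter is the prediction.

class BimodalPredictor:
    def __init__(self, state=3):
        self.state = state          # start at strongly taken (assumption)

    def predict(self):
        return self.state >= 2      # high bit set -> predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 11
        else:
            self.state = max(0, self.state - 1)   # saturate at 00

# A loop-closing branch: taken 3 times, then not taken once (loop exit),
# the whole loop executed 10 times.
p = BimodalPredictor()
mispredicts = 0
for _ in range(10):
    for actual in (True, True, True, False):
        if p.predict() != actual:
            mispredicts += 1
        p.update(actual)
print(mispredicts)  # 10: exactly one misprediction per loop execution
```

The single not-taken outcome at each loop exit only moves the counter from Strongly Taken to Taken, so the next loop entry is still predicted correctly; a one-bit predictor on the same pattern would mispredict twice per loop execution.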

Summary
Instruction execution can be substantially accelerated by instruction pipelining. A pipeline is organized as a succession of N stages; ideally, N instructions can be active inside the pipeline at the same time. Keeping a pipeline at its maximal rate is prevented by pipeline hazards. Structural hazards are due to resource conflicts. Data hazards are caused by data dependencies between instructions. Control hazards are produced as a consequence of branch instructions. Branches can dramatically affect pipeline performance, so it is very important to reduce the penalties they produce. (Dynamic) branch prediction is an efficient way to address this problem.