Pipeline Hazards

Midterm #2 on 11/29. 5th and final problem set on 11/22. 9th and final lab on 12/1. https://goo.gl/forms/hkuvwlhvuyzvdat42


ARM 3-stage pipeline

The pipeline has Fetch, Decode, and Execute stages. Instructions are decoded in both the Fetch and Decode stages. Register ports are read in the Decode stage and written at the end of the Execute stage. PC+4, register reads for stores (StrData), and the BX source (BXreg) are delayed for use in later stages.

[Figure: 3-stage ARM datapath showing the PC, instruction memory, reset/interrupt logic, instruction and function decoders, the 4-port register file, the shifter/ALU with N,Z,C,V flags and PSR, the data memory, and the Fetch and Decode pipeline registers.]

Simple instruction flow

Consider the following instruction sequence:

    sub r0,r1,r2
    add r3,r1,#2
    and r1,r1,#1
    cmp r0,r4

An instruction becomes available at the end of the Fetch stage, its operands at the end of the Decode stage, and its destination register and PSR are updated at the end of the Execute stage.

[Figure: pipeline diagram showing the four instructions advancing through the Fetch, Decode, and Execute stages over clock cycles i through i+5.]

Pipeline control hazards

Pipelining HAZARDS are situations where the next instruction cannot execute in the next clock cycle. There are two forms of hazards, CONTROL and STRUCTURAL. Consider the instruction sequence shown:

    loop: add r0,r0,r0
          cmp r0,#64
          ble loop
          and r1,r0,#7
          sub r1,r0,r1

When the branch instruction reaches the Execute stage, the next 2 instructions have already been fetched!

[Figure: pipeline diagram over cycles i through i+5; when ble loop is in Execute, and r1,r0,#7 and the instruction after it already occupy the Decode and Fetch stages.]
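The fetch-ahead behavior can be sketched in a few lines of Python (a simulation sketch for this lecture, not a model of the ARM hardware): issue one instruction per cycle into a 3-stage pipe and report what is already in flight when the branch reaches Execute.

```python
def in_flight_at_execute(program, branch):
    """Advance a 3-stage pipeline one instruction per cycle and return the
    two instructions already fetched when `branch` reaches Execute."""
    fetch = decode = execute = None
    for cycle in range(len(program) + 2):
        nxt = program[cycle] if cycle < len(program) else None
        # Execute <- Decode <- Fetch <- next instruction from memory
        execute, decode, fetch = decode, fetch, nxt
        if execute == branch:
            return decode, fetch  # both are wrong-path if the branch is taken

loop = ["add r0,r0,r0", "cmp r0,#64", "ble loop",
        "and r1,r0,#7", "sub r1,r0,r1"]
print(in_flight_at_execute(loop, "ble loop"))
# -> ('and r1,r0,#7', 'sub r1,r0,r1')
```

By the cycle in which `ble loop` executes, both of its successors are already in the pipe, which is exactly the control hazard on this slide.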

Branch fixes

Problem: two instructions following a branch are fetched before the branch decision (taken or not taken) is made.

Solutions:
1. Program around it. Define the ISA such that the branch does not take effect until the instructions in the DELAY SLOTS complete. This is how MIPS pipelines work. It leads to ODD-looking code in tight (short) loops. Of course, you could always put NOPs in the delay slots.
2. Detect the branch decision as early as possible, and ANNUL the instructions in the delay slots. This is what ARM does.

Early detect and annul

It helps that the ALU is not used by branch instructions. We can detect branch instructions (B, BL, or BX) in the Decode stage, and the branch condition is decided no later than by the instruction currently in the Execute stage. Thus, we can make the branch decision in the Decode stage. We then annul the following instruction by disabling its WERF and PSR updates, making it a NOP!

If we detect the branch in the Decode stage, then the PSR state from the instruction in the Execute stage can be combined with it to change the next PC.

[Figure: pipeline diagram for the loop; when ble loop resolves in Decode, the and r1,r0,#7 behind it is annulled to a NOP and fetching resumes at add r0,r0,r0.]
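The annul rule itself is a one-line predicate. Below is a hypothetical behavioral sketch (the signal names WERF and PSR follow the slides; the function and its arguments are mine): when a taken branch sits in Decode, the write-enable controls of the instruction behind it are forced off, which is all it takes to turn that instruction into a NOP.

```python
def annulled_controls(decode_is_branch, branch_taken):
    """Write-enable controls for the instruction following a branch:
    if the branch in Decode is taken, disable register-file (WERF) and
    PSR writes so the annulled instruction has no architectural effect."""
    annul = decode_is_branch and branch_taken
    return {"WERF": not annul, "PSR_write": not annul}
```

A not-taken branch leaves both enables on, so the fall-through instruction completes normally and no cycle is lost.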

The cost of taken branches

When an ARM branch is taken, the branch instruction effectively takes 2 cycles, rather than 1 when it is not taken. In a MIPS-like instruction set, one can often fill the delay slots with useful instructions, but they are executed whether or not the branch is taken. The ARM approach is easier to understand, and since it does not EXPOSE the pipeline, it allows a different number of pipeline stages to be implemented in future designs while preserving code compatibility. Lastly, on ARM many conditional branches can be eliminated entirely using conditional execution, which pipelines beautifully!

Structural pipeline hazards

There's another problem with our code fragment! The destination register of an instruction is written at the end of the Execute stage, yet the following instruction might use this result as a source operand:

    loop: add r0,r0,r0
          cmp r0,#64
          ble loop
          and r1,r0,#7
          sub r1,r0,r1

The CMP instruction needs the contents of R0 before it is actually written at the end of cycle i+2.

[Figure: pipeline diagram showing add r0,r0,r0 in Execute during cycle i+2 while cmp r0,#64 reads r0 in Decode during that same cycle.]
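The timing collision can be checked with simple stage arithmetic. This is a sketch under the slide's assumptions (one instruction issued per cycle, write committing at the end of Execute, read occurring during Decode):

```python
STAGE_OFFSET = {"Fetch": 0, "Decode": 1, "Execute": 2}

def stage_cycle(issue_cycle, stage):
    """Cycle during which an instruction issued at `issue_cycle`
    occupies `stage`, with one instruction entering the pipe per clock."""
    return issue_cycle + STAGE_OFFSET[stage]

add_writes_r0 = stage_cycle(0, "Execute")  # add commits r0 at the END of this cycle
cmp_reads_r0 = stage_cycle(1, "Decode")    # cmp reads r0 DURING this cycle
hazard = cmp_reads_r0 <= add_writes_r0     # read happens before the write lands
print(hazard)  # -> True
```

Both events fall in cycle i+2, so without extra hardware cmp reads the stale value of r0.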

Data hazards

Problem: a register source is needed from a later stage of the pipeline before it is written.

Solutions:
1. Program around it. One could document the weird semantics: you can't reference the destination register of an instruction in the immediately following instruction. This would make assembly language even harder to understand, and would expose the pipeline, once again making future improvements difficult to implement while maintaining code compatibility.
2. Hardware bypass multiplexers.

Source bypassing

The idea is to route the value that will be saved in the destination register directly into the pipeline registers that hold the ALU operands. Comparators check each source register number in Decode (Rn, Rm, Rs) against the destination register (Rd) of the instruction in Execute, and when they match, a bypass MUX selects the ALU result instead of the stale register-file read. We also need bypass MUXes on the StrReg and BXreg pipeline registers. The new value for R0 is computed just prior to the rising clock edge between cycles i+2 and i+3, so we can take the output of the ALU directly.

[Figure: datapath detail showing the comparators (Rn/Rm/Rs in Decode == Rd in Execute) steering 2-to-1 bypass MUXes in front of the A, B, and shift operand pipeline registers.]
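The mux selection is one comparison per operand port. Here is a behavioral sketch (function and argument names are mine, mirroring the comparators in the figure):

```python
def bypass_operand(src_reg, exe_dest_reg, exe_werf, regfile_value, alu_result):
    """2-to-1 bypass mux: if the instruction in Execute will write the
    register being read in Decode, forward the ALU result instead of the
    stale value read from the register file."""
    if exe_werf and src_reg == exe_dest_reg:
        return alu_result
    return regfile_value
```

The same comparison, qualified by WERF so that annulled instructions and non-writing instructions never forward, drives each of the Rn, Rm, and Rs bypass MUXes independently.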

Load/store stalls

Load and store memory accesses are the actual bottleneck of the ARM pipeline. Also, recall that instructions and load/store data come from the same memory. Thus, we need to stall instruction fetching to allow for loads and stores:

    loop: ldr r0,[r1,#4]
          add r0,r0,#4
          str r0,[r1,#4]
          sub r0,r0,#4
          and r2,r2,#f

[Figure: pipeline diagram over cycles i through i+6; fetching pauses for one cycle while ldr performs its memory read, and again while str performs its memory store.]
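Counting cycles for the fragment above can be sketched directly (an assumption-laden sketch: exactly one extra cycle per load or store, no other stalls):

```python
def issue_cycles(instructions):
    """Total issue slots: 1 per instruction, plus 1 stall cycle for each
    load/store while it occupies the single memory port."""
    return sum(2 if insn.split()[0] in ("ldr", "str") else 1
               for insn in instructions)

frag = ["ldr r0,[r1,#4]", "add r0,r0,#4", "str r0,[r1,#4]",
        "sub r0,r0,#4", "and r2,r2,#f"]
print(issue_cycles(frag))  # -> 7 (5 instructions + 2 stalls)
```

This matches the diagram: five instructions occupy seven fetch slots because ldr and str each steal one cycle from instruction fetch.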

Load/store stall implementation

Disable loading of the pipeline registers for one clock when a load or store instruction reaches the Execute stage:
1. Add enable lines to the PC and the pipeline registers on the control path.
2. Add a simple 2-state state machine that stalls the pipeline for one cycle to allow for the load/store memory access.

[Figure: datapath detail showing a NoStall signal, produced by a one-bit state machine driven by the load/store opcode bits (I27, I26), gating the enable (EN) inputs of the PC and the pipeline registers.]
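The 2-state machine is tiny. Below is a behavioral sketch of the slide's flip-flop and NoStall gating (state and signal names are mine, chosen to mirror the figure):

```python
def next_state(state, ldst_in_execute):
    """RUN -> STALL for exactly one cycle when a load/store reaches
    Execute and needs the memory port; otherwise stay in / return to RUN."""
    if state == "RUN" and ldst_in_execute:
        return "STALL"
    return "RUN"

def enables(state):
    """NoStall gates the PC and pipeline-register enables: while stalled,
    nothing advances, freeing the memory port for the data access."""
    nostall = state == "RUN"
    return {"PC_en": nostall, "pipe_reg_en": nostall}
```

Because STALL unconditionally returns to RUN, each load or store freezes the pipe for exactly one cycle, which is all its memory access needs.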

Where does this leave us?

Overall, we can now nearly triple the clock rate. Instructions have a throughput of one per clock, with the following caveats:
1. Taken branches take 2 cycles.
2. Loads and stores take 2 cycles.

3x speedup: a 100 MHz clock becomes 300 MHz. You can pipeline an ARM CPU even more; there exist ARM implementations with 7, 8, and 9 pipeline stages. But the overhead of bypass paths and stall cases increases.

Reality vs specmanship

Assume approximately 10% of the instructions executed are branches, of which 80% are taken, and 12.5% of the instructions executed are loads or stores. What sort of real speedup do we expect? Per 100 instructions:

    Time before  = 100 clocks * 10 ns/clock = 1000 ns

    Clocks after = (10)(0.8)*2 + 12.5*2 + 79.5*1 = 120.5
    Time after   = 120.5 clocks * 3.333 ns/clock = 401.67 ns

    Speedup = Time before / Time after = 1000 / 401.67 = 2.49x
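The arithmetic above can be recomputed in a few lines to check it:

```python
# Per 100 instructions: 10% branches (80% taken, costing 2 cycles each),
# 12.5% loads/stores (2 cycles each), the remaining 79.5 take 1 cycle.
clocks_after = (10 * 0.8) * 2 + 12.5 * 2 + 79.5 * 1  # = 120.5
time_before = 100 * 10.0                  # ns: 100 clocks at 10 ns (100 MHz)
time_after = clocks_after * (1000 / 300)  # ns: 3.333 ns/clock at 300 MHz
speedup = time_before / time_after
print(round(speedup, 2))  # -> 2.49
```

So while the clock triples, the stall and branch penalties pull the realized speedup down to about 2.5x.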

Next time

It appears memory access time is our real bottleneck. What tricks can be applied to improve CPU performance in this case?
- Interleaving
- Block transfers
- Caching