Lecture 15: Pipelining. Spring 2018 Jason Tang


Topics

- Overview of pipelining
- Pipeline performance
- Pipeline hazards

Sequential Laundry

[Figure: timeline from 6 PM to midnight showing four loads A-D done back-to-back, each taking 30 + 40 + 20 minutes]

A clothes washer takes 30 minutes, the dryer takes 40 minutes, and folding takes 20 minutes. Sequential laundry would thus take 6 hours for 4 loads.

Pipelined Laundry

[Figure: timeline from 6 PM to midnight showing loads A-D overlapped; after the first 30-minute wash, a load finishes drying every 40 minutes]

Pipelining means starting work as soon as possible. Pipelined laundry would thus take 3.5 hours for 4 loads.

Pipelining

- Does not improve latency of a single task, but improves throughput of the entire workload
- Pipeline rate limited by the slowest pipeline stage
- Multiple tasks operate simultaneously using different resources
- Speedup correlates to the number of pipe stages
- Actual speedup reduced by: unbalanced lengths of pipe stages, time to fill the pipeline, time to drain the pipeline, and stalling due to dependencies
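The 6-hour versus 3.5-hour laundry numbers can be checked with a short sketch (stage times in minutes, as on the previous slides; the closed-form pipelined expression holds for these numbers, where the 40-minute dryer is the bottleneck stage):

```python
wash, dry, fold = 30, 40, 20   # minutes per stage
loads = 4

# Sequential: each load runs start to finish before the next begins
sequential = loads * (wash + dry + fold)

# Pipelined: after the first wash, one load finishes drying every
# 40 minutes (the slowest stage), then the last load is folded
pipelined = wash + loads * dry + fold

print(sequential, pipelined)  # 360 210
```

360 minutes is 6 hours and 210 minutes is 3.5 hours, matching the slides.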

Multi-Cycle Instruction Execution

- IF: Instruction Fetch
- ID: Instruction Decode / Register File Read
- EX: Execute / Address Calculation
- MEM: Memory Access
- WB: Write Back

[Figure: multi-cycle datapath with PC and +4 adder, Instruction Memory, Register File (A Sel, B Sel, W Sel), sign extender, ALU, and Data Memory]

Stages of Instruction Execution

[Figure: a load instruction occupying Fetch, Decode, Execute, Memory, and Write Back across cycles 1-5]

As mentioned earlier, load instructions take the longest to process. All instructions follow at most these five steps:

- Fetch: fetch instruction from Instruction Memory at PC
- Decode: fetch registers and decode instruction
- Execute: calculate results
- Memory: read/write data from Data Memory
- Write Back: write data back to register file

Instruction Pipelining

[Figure: five overlapped instructions, each offset by one stage, flowing through Fetch, Decode, Execute, Memory, and Write Back]

- Start handling the next instruction while the current instruction is in progress
- Pipelining is feasible when different devices are used at different stages of instruction execution
- Ideal pipelined time between instructions = non-pipelined instruction time / number of stages
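The formula above is the ideal case of perfectly balanced stages; with unbalanced stages, the clock is set by the slowest stage. A quick check, using the stage latencies from the Pipeline Performance slide:

```python
stages_ps = [200, 100, 200, 200, 100]   # IF, ID, EX, MEM, WB latencies

non_pipelined = sum(stages_ps)          # single-cycle instruction time
ideal = non_pipelined / len(stages_ps)  # perfectly balanced stages
actual = max(stages_ps)                 # clock limited by slowest stage

print(non_pipelined, ideal, actual)     # 800 160.0 200
```

So even ignoring hazards, unbalanced stages alone cost 200 ps per instruction instead of the ideal 160 ps.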

Datapath Comparison

Example program flow: a load instruction followed by a store instruction.

[Figure: three timing diagrams. Single-cycle: load and store each occupy one long clock cycle, with the shorter store wasting the slack left by the longer load. Multi-cycle: the load's five stages complete over cycles 1-5 before the store's stages begin. Pipelined: the store's Fetch begins in cycle 2, one stage behind the load]

Pipeline Performance

Stage latencies: Fetch 200 ps, Decode 100 ps, Execute 200 ps, Memory 200 ps, Write Back 100 ps

- For a single-cycle datapath, how long would it take to execute a load, a store, and then an R-type?
- For a multi-cycle datapath, how long would it take to execute a load, a store, and then an R-type?
- For a pipelined datapath, how long would it take to execute a load, a store, and then an R-type (ignoring all hazards)?
- How long would it take to execute 1000 consecutive loads on the single-cycle, multi-cycle, and pipelined datapaths?
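For the 1000-load case, one way to work the numbers, under the usual textbook assumptions (single-cycle clock = sum of the stage latencies; multi-cycle and pipelined clock = slowest stage; a load uses all five stages; no hazards):

```python
stages_ps = [200, 100, 200, 200, 100]   # IF, ID, EX, MEM, WB
n = 1000                                # consecutive loads

single_cycle = n * sum(stages_ps)                  # 800 ps per instruction
multi_cycle = n * len(stages_ps) * max(stages_ps)  # 5 cycles of 200 ps each
pipelined = (len(stages_ps) + n - 1) * max(stages_ps)  # fill, then 1/cycle

print(single_cycle, multi_cycle, pipelined)  # 800000 1000000 200800
```

The pipelined datapath approaches one load per 200 ps once the pipeline is full, giving nearly a 4x speedup over single-cycle and 5x over multi-cycle.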

Designing Instruction Sets for Pipelining

How bits are represented within an instruction affects pipeline performance.

- Simplifying instruction fetch: RISC architectures [generally] have same-sized instructions, while CISC architectures have variable-length instructions
- Simplifying memory access: ARMv8-A has limited load and store instructions, while x86/64 allows memory to be used as operands to ALU instructions

Pipeline Hazards

A hazard is a situation that prevents the following instruction from executing on the next clock cycle:

- Structural hazard: attempt to use a resource in two different ways at the same time (example: an all-in-one washer/dryer)
- Data hazard: attempt to use an item before it is ready (example: one sock still in the washer when ready to fold the sock pair)
- Control hazard: attempt to make a decision before a condition is evaluated (example: choosing laundry detergent based upon the previous load)

Structural Hazard

[Figure: five back-to-back ldur X1-X5, [X9, #100] instructions in the pipeline; the fourth load's Fetch would need the same memory that the first load's Memory stage is using]

- A combined instruction/data memory can cause conflicting accesses
- Can be resolved by adding an idle cycle before fetching the fourth load

Memory Architectures

- von Neumann (also known as Princeton) architecture: a single combined memory bus for both data and instructions. Simpler to build and allows self-modifying code, but leads to the von Neumann bottleneck
- Harvard architecture: separate memory buses for data and instructions. Allows parallel access to data and instructions and allows different memory technologies to be used, but is much more complicated to build and prevents self-modifying code
- Modified Harvard architecture: split caches, but unified main memory

http://ithare.com/modified-harvard-architecture-clarifying-confusion/

Data Hazard

[Figure: add X1, X2, X3 followed by a dependent sub X4, X1, X3; and ldur X1, [X2, #0] followed by dependent orr X8, X1, X9 and eor X10, X1, X11]

- A data hazard is a dependence of one instruction on an earlier one still in the pipeline
- Can be resolved by code reordering, forwarding, or stalling

Code Reordering

Clever compilers (specifically, code generators) can reorder the generated assembly to avoid data hazards. Example: a = b + e; c = b + f;

Literal, unoptimized:

ldur X1, [X0, #0]  // load b
ldur X2, [X0, #8]  // load e
add X3, X1, X2     // b + e
ldur X4, [X0, #12] // load f
add X5, X1, X4     // b + f

Reordered, optimized:

ldur X1, [X0, #0]  // load b
ldur X2, [X0, #8]  // load e
ldur X4, [X0, #12] // load f
add X3, X1, X2     // b + e
add X5, X1, X4     // b + f

Forwarding

[Figure: add X1, X2, X3 forwarding its X1 result directly to the following sub X4, X1, X3]

- Add hardware to retrieve the missing data from an internal pipeline buffer instead of from the programmer-visible registers or memory
- Only works forward in time: a result can be forwarded only to a stage that needs it at the same time or later
- Does not work for a load immediately followed by an instruction that uses its result (a load-use data hazard)

Stalling

[Figure: ldur X1, [X2, #0] followed by orr X8, X1, X9 and eor X10, X1, X11; bubbles are inserted so the orr receives X1 once the load's result is available]

- When code reordering and forwarding are insufficient, intentionally stall the pipeline by inserting bubbles
- There are many ways to detect when stalling is needed and how many bubbles to insert
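The forwarding and stalling rules for adjacent instructions can be sketched as follows. This is a simplification assuming full forwarding hardware, so only a load-use hazard forces a bubble; the (op, dest, sources) tuples are a hypothetical instruction encoding, not from the slides:

```python
def resolve(producer, consumer):
    """Classify the hazard between two adjacent instructions,
    assuming full forwarding: only a load immediately followed by
    a consumer of its result (load-use) needs a one-cycle bubble."""
    p_op, p_dest, _ = producer
    _, _, c_srcs = consumer
    if p_dest not in c_srcs:
        return "no hazard"
    return "stall one cycle, then forward" if p_op == "ldur" else "forward"

print(resolve(("add", "X1", ("X2", "X3")), ("sub", "X4", ("X1", "X3"))))
# forward
print(resolve(("ldur", "X1", ("X2",)), ("orr", "X8", ("X1", "X9"))))
# stall one cycle, then forward
```

A real hazard-detection unit compares register numbers across pipeline registers in the same spirit, just in hardware and for non-adjacent instructions too.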

Control Hazard

[Figure: add X1, X2, X3 followed by cbz X1, 3; stall bubbles are inserted before orr X7, X8, X9 while the branch resolves]

- Upon branching, the PC of the next instruction is unknown until after decode (for unconditional branches) or after execution (for conditional branches)
- One solution is to always induce stall(s) when a branch instruction is detected, until the branch is resolved
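The cost of always stalling on branches can be estimated with the usual back-of-the-envelope CPI model; the 20% branch frequency and 2-cycle penalty below are hypothetical numbers for illustration, not from the slides:

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles):
    # Each branch adds stall_cycles of bubbles on top of the base CPI
    return base_cpi + branch_fraction * stall_cycles

# With an ideal CPI of 1, 20% branches, and 2 stall cycles per branch,
# the effective CPI is about 1.4: a 40% slowdown from branches alone
cpi = effective_cpi(1.0, 0.20, 2)
```

This is why the next slide's branch prediction matters: predicting correctly avoids paying the stall cycles on most branches.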

Branch Prediction

[Figure: after cbz X1, 3 resolves as taken, the already-fetched ldur X1, [X2, #0] and sub X4, X1, X3 are flushed to bubbles, and execution restarts at orr X7, X8, X9]

- In the simplest case, assume that the branch will never be taken
- If the branch is taken, flush the pipeline and restart with the correct instruction