Ti5317000 Parallel Computing: PIPELINING. Michał Roziecki, Tomáš Cipr, 2005-2006


Introduction to pipelining - What is pipelining?
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast.

Introduction to pipelining - How it works
The basic action of any microprocessor as it moves through the instruction stream can be broken down into a series of four simple steps that each instruction goes through in order to be executed:
- Fetch the next instruction from the address stored in the program counter.
- Store that instruction in the instruction register, decode it, and increment the address in the program counter.
- Execute the instruction currently in the instruction register. If the instruction is not a branch but an arithmetic instruction, send it to the proper ALU.
- Write the result of that instruction from the ALU back into the destination register.
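To make the overlap concrete, here is a minimal sketch (not from the original slides) that prints which of the four stages each instruction occupies in each clock cycle, assuming an idealized pipeline where one instruction enters per cycle and every stage takes exactly one cycle:

```python
# Minimal sketch of an idealized 4-stage pipeline schedule
# (Fetch, Decode, Execute, Write): one instruction enters per cycle,
# every stage takes exactly one cycle.
STAGES = ["F", "D", "E", "W"]

def pipeline_diagram(num_instructions):
    total_cycles = num_instructions + len(STAGES) - 1
    print("        " + " ".join(f"c{c + 1:<2}" for c in range(total_cycles)))
    for i in range(num_instructions):
        row = ["   "] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage}  "   # instruction i occupies stage s during cycle i + s + 1
        print(f"inst {i + 1}: " + " ".join(row))

pipeline_diagram(4)
```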

Introduction to pipelining - How it works
The total execution time for each individual instruction is not changed by pipelining: it still takes an instruction 4 ns to make it all the way through the processor. Pipelining doesn't speed up instruction execution time, but it does speed up program execution time by increasing the number of instructions finished per unit time.
Fig.: A four-stage pipeline

Introduction to pipelining - RISC vs. CISC
RISC:
- Small number of instructions
- Simple instructions
- Low cycles per instruction, large code sizes
- Single-clock, reduced instructions only
CISC:
- Large number of instructions
- Complex instructions
- Small code sizes, high cycles per instruction
- Includes multi-clock complex instructions

Introduction to pipelining - Pipelining vs. single-cycle processors
A single-cycle processor takes 16 nanoseconds to execute four instructions, while a pipelined processor takes only 7 nanoseconds.
Figures: single-cycle processor vs. pipelined processor

Introduction to pipelining - Counting example
Suppose we execute 100 instructions.
- Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
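A quick arithmetic check of these numbers, and of the 16 ns vs. 7 ns comparison above, using the usual total-time formulas; the helper names are just for illustration:

```python
# Quick check of the counting example above (values taken from the slides).
def single_cycle_time(cycle_ns, instructions, cpi=1.0):
    """Total time on an unpipelined, single-cycle machine."""
    return cycle_ns * cpi * instructions

def pipelined_time(cycle_ns, instructions, fill_cycles, cpi=1.0):
    """Total time on an ideal pipeline: one completion per cycle after the fill."""
    return cycle_ns * (cpi * instructions + fill_cycles)

# 100 instructions: 45 ns single-cycle clock vs. 10 ns pipelined clock, 4-cycle fill/drain.
print(single_cycle_time(45, 100), pipelined_time(10, 100, 4))             # 4500.0 1040.0
print(round(single_cycle_time(45, 100) / pipelined_time(10, 100, 4), 2))  # speedup ~4.33

# The earlier 4-instruction comparison: 4 ns per instruction unpipelined,
# 1 ns per stage pipelined, with a 3-cycle fill before the first completion.
print(single_cycle_time(4, 4), pipelined_time(1, 4, 3))                   # 16.0 7.0
```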

Introduction to pipelining - Characteristics of pipelined processor design
- Main memory must operate in one cycle
- Instruction and data memory must appear separate
- Few buses are used
- Data is latched (stored in temporary registers) at each pipeline stage; these are called pipeline registers
- ALU operations take only 1 clock cycle
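The point that data is latched in pipeline registers can be illustrated with a small sketch (my own, assuming a 5-stage IF/ID/EX/MEM/WB structure and made-up MIPS-like instructions): each stage reads only the latch in front of it and writes only the latch behind it, so every stage can hold a different instruction in the same cycle.

```python
# Sketch of pipeline registers in an assumed 5-stage (IF ID EX MEM WB) structure.
# The MIPS-like instruction strings and stage names are illustrative assumptions.
program = ["ADD r1,r2,r3", "SUB r4,r1,r5", "LW  r6,0(r4)", "SW  r6,4(r4)"]

# One latch (pipeline register) between consecutive stages; None means a bubble.
latches = {"IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None}
pc = 0

for cycle in range(1, len(program) + 5):
    in_wb = latches["MEM/WB"]              # instruction in the WB stage this cycle
    # Each stage reads only its input latch, so all stages work in parallel;
    # the new latch values take effect together at the clock edge.
    latches = {
        "MEM/WB": latches["EX/MEM"],
        "EX/MEM": latches["ID/EX"],
        "ID/EX":  latches["IF/ID"],
        "IF/ID":  program[pc] if pc < len(program) else None,
    }
    pc = min(pc + 1, len(program))
    print(f"cycle {cycle}: completing {in_wb}; latches now {latches}")
```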

Introduction to pipelining - Pipelining history
- IBM 360/91: an early implementation of pipelining; performance increased 2.5 to 25 times
- P6 (Pentium Pro): three-way superscalar processor; included 3 pipelines
- Future?

Pipelining issues - Dependence among instructions
- Execution of some instructions can depend on the completion of other instructions in the pipeline
- One solution is to stall the pipeline
- Dependences involving registers can be detected, and data forwarded to the instruction needing it without waiting for the register write
- Dependences involving memory are harder and are sometimes addressed by restricting how memory operations may be reordered

Pipelining issues - Adapting instructions
All instructions must fit into a common pipeline stage structure. A 5-stage pipeline is typically used in RISC processors:
- Instruction fetch
- Decode and operand access
- ALU operation
- Data memory access
- Register write

Pipelining issues - Hazards
A hazard is a result of dependence; it occurs when two or more simultaneously executing (possibly out-of-order) instructions conflict. There are typically three types of hazards:
- Data hazards
- Structural hazards
- Branch hazards

Pipelining issues - Data hazards
Data hazards occur when data is modified. There are three situations in which they can occur:
- Read after Write (RAW): a location is modified and then read soon after.
- Write after Read (WAR): a location is read and then written soon after.
- Write after Write (WAW): two instructions write to the same location.

Pipelining issues - Data hazards - Classification and possible solutions
- Bubbling the pipeline: as instructions are fetched, control logic determines whether a hazard could or will occur. If so, the control logic inserts NOPs into the pipeline.
- Forwarding: the output of an instruction is fed back into earlier stage(s) of the pipeline as soon as it is available, so a dependent instruction need not wait for the register write.
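As a rough sketch of how control logic might choose between the two solutions for a register RAW hazard (a simplified model, not the slides' design; the instruction encoding and the two-bubble penalty are assumptions):

```python
# Sketch: detecting a register RAW dependence on the immediately preceding
# instruction and choosing between bubbling and forwarding.
# The (name, dest, sources) tuple format and the two-bubble penalty are
# illustrative assumptions; the exact penalty depends on the pipeline.
instructions = [
    ("ADD", "r1", ["r2", "r3"]),
    ("SUB", "r4", ["r1", "r5"]),   # reads r1 right after it is produced -> RAW
    ("AND", "r6", ["r4", "r7"]),   # reads r4 right after it is produced -> RAW
]

def schedule(instrs, forwarding):
    issued = []
    for prev, cur in zip([None] + instrs, instrs):
        raw = prev is not None and prev[1] in cur[2]
        if raw and not forwarding:
            # Bubbling: insert NOPs until the producer's result has been written back.
            issued += [("NOP", None, [])] * 2
        # With forwarding, the producer's ALU output is fed straight to the
        # consumer's ALU input, so no bubbles are needed in this simple model.
        issued.append(cur)
    return issued

print("no forwarding:", [op for op, _, _ in schedule(instructions, False)])
print("forwarding:   ", [op for op, _, _ in schedule(instructions, True)])
```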

Pipelining issues - Structural hazards A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow, it is quite possible (depending on architecture) that the computation instruction and the branch instruction will both require the ALU at the same time.

Pipelining issues - Branch hazards
Branch hazards (also known as control hazards) occur when the processor is told to branch: if a certain condition is true, it jumps from one part of the instruction stream to another, not necessarily to the next instruction in sequence. In such a case, the processor cannot tell in advance whether it should process the next instruction.

Pipelining issues - Branch prediction
The microprocessor tries to predict whether the branch instruction will jump or not, based on a record of what this branch has done previously. If the prediction turns out to be wrong, it has to flush the pipeline and discard all calculations that were based on this prediction; but if the prediction was correct, a lot of time has been saved. Different kinds of branch prediction:
- Trivial prediction
- Static prediction
- Local branch prediction
- Combined branch prediction
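As an illustration of dynamic prediction, here is a minimal sketch (my own, not the slides' scheme and not any real processor's predictor) using a table of 2-bit saturating counters indexed by the branch address: a branch must mispredict twice in a row before the prediction flips.

```python
# Sketch of a simple dynamic branch predictor: a table of 2-bit saturating
# counters indexed by (a few bits of) the branch address.
# Counter values: 0,1 -> predict not taken; 2,3 -> predict taken.
TABLE_SIZE = 1024                 # illustrative size, not a real processor's
counters = [1] * TABLE_SIZE       # start weakly "not taken"

def predict(branch_addr):
    return counters[branch_addr % TABLE_SIZE] >= 2

def update(branch_addr, taken):
    i = branch_addr % TABLE_SIZE
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

# A loop-closing branch taken 9 times and then not taken once:
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += predict(0x4000) == taken
    update(0x4000, taken)
print(f"correct predictions: {hits}/{len(outcomes)}")   # 8/10 here
```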

THE SPEEDUP FROM PIPELINING - Speedup against single-cycle processor - What affects speedup? - Theoretical vs. real-world speedup - Two views of speedup; according to: - Number of pipeline stages - Instruction throughput

Speedup and pipeline depth - Pipeline depth = number of stages - One instruction completed each clock cycle - The more finely instruction processing is sliced, the faster the clock frequency can be - Speedup is ideally equal to the pipeline depth - 4-stage pipeline ~ 4x speedup - 8-stage pipeline ~ 8x speedup

Speedup and pipeline depth (contd.) - Speedup = pipeline depth in reality? No! - Why not? - Equal duration of stages has to be preserved - Perfect splitting into equal stages is impossible - The clock cycle must suit the slowest stage - Still, the more finely the stages are sliced, the greater the speedup
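A small sketch of why real speedup falls short of the pipeline depth: the clock must accommodate the slowest stage, so unequal stage latencies waste time in the faster stages. The stage latencies below are made-up numbers for illustration.

```python
# Sketch: speedup of a pipeline whose clock is set by its slowest stage.
# Stage latencies (ns) are illustrative, not from the slides.
def pipeline_speedup(stage_latencies_ns, num_instructions):
    unpipelined = sum(stage_latencies_ns) * num_instructions
    cycle = max(stage_latencies_ns)           # clock suited to the slowest stage
    pipelined = cycle * (num_instructions + len(stage_latencies_ns) - 1)
    return unpipelined / pipelined

equal   = [1.0, 1.0, 1.0, 1.0]   # perfectly balanced 4-stage pipeline
unequal = [0.8, 1.4, 1.0, 0.8]   # same total work, badly balanced

print(round(pipeline_speedup(equal, 1000), 2))    # ~3.99, close to the depth of 4
print(round(pipeline_speedup(unequal, 1000), 2))  # ~2.85, well below 4
```

With perfectly balanced 1 ns stages the speedup approaches the depth of 4; with the same total work split unevenly, the 1.4 ns stage sets the clock and the speedup drops to about 2.85.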

Speedup: theory vs. real-world
Figure: relative speedup vs. pipeline depth

Speedup and instruction throughput - Speedup is also affected by the pipelining process itself - Instruction throughput = instructions per clock (IPC) - IPC = 1 in a single-cycle processor - IPC = 1 in an ideal pipelined processor - Issues in the real world - Pipeline filling - Pipeline stalls

Throughput and pipeline filling - The pipeline needs several clock cycles to fill up - No instructions complete while it is filling - An average IPC of 1 is therefore only an upper limit in practice - The more cycles the pipeline runs, the greater the average IPC, and so the speedup - Example, 4-stage pipeline: - After 5 cycles: IPC = 1 instruction / 5 = 0.2 - After 100 cycles: IPC = 96 / 100 = 0.96
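A quick calculation reproducing these numbers, under the slide's counting convention that a 4-stage pipeline completes its first instruction only after a 4-cycle fill:

```python
# Average IPC of an ideal 4-stage pipeline, counting a 4-cycle fill before
# the first instruction completes (the slide's convention).
def average_ipc(cycles, fill_cycles=4):
    completed = max(0, cycles - fill_cycles)
    return completed / cycles

print(average_ipc(5))       # 0.2
print(average_ipc(100))     # 0.96
print(average_ipc(10_000))  # 0.9996 -> approaches, but never reaches, 1
```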

Average instruction throughput
Figure: average instruction throughput vs. clock cycles, ideal pipeline vs. ideal 4-stage pipeline

Throughput and pipeline stalls - Can the pipeline be kept full after filling? No! - The pipeline was still idealized - A real pipeline has to deal with hazards - Pipeline stalls - Pipeline flushes - Speedup is reduced further
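To see how stalls keep throughput below the ideal curve, here is a hedged sketch: it assumes, purely for illustration, that every tenth instruction stalls the pipeline for two cycles on top of the 4-cycle fill.

```python
# Sketch: average IPC when some instructions stall the pipeline.
# "Every stall_every-th instruction stalls for stall_cycles" is an illustrative
# assumption, not a property of any particular processor.
def average_ipc_with_stalls(instructions, fill_cycles=4,
                            stall_every=10, stall_cycles=2):
    stalls = (instructions // stall_every) * stall_cycles
    cycles = fill_cycles + instructions + stalls
    return instructions / cycles

print(round(average_ipc_with_stalls(100), 3))     # 0.806
print(round(average_ipc_with_stalls(10_000), 3))  # 0.833
```

The asymptotic IPC here is 1/(1 + 0.2), about 0.83, which is why the measured curve levels off below the ideal value of 1.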

Instruction throughput with a two-cycle stall
Figure: average instruction throughput vs. clock cycles

ENHANCED PIPELINING - Is it possible to break the limit of one instruction per clock? Yes, but low-level parallelism is needed - Instruction-level parallelism - Superscalar - Very Long Instruction Word (VLIW) - Data-level parallelism

Instruction-level parallelism - Speedup gained by adding more hardware - Makes use of the program's inherent parallelism - Superscalar: dynamically distributes instructions to functional units - VLIW: long instruction words with many operations statically compiled into a single word

Data-level parallelism - Based on the SIMD concept - A single instruction is executed over short vectors - Benefits specific applications, e.g. multimedia - Processor complexity is not increased much

PIPELINING ON PENTIUM 4 - Hyper-pipelined technology - 20-stage pipeline - Clock frequency increased by 40% - Advanced branch prediction - 4 Kb branch target buffer - Successful prediction rate of 93-94%

SUMMARY - Pipelining characteristics - Fetch, Decode, Execute, Write - Single-cycle vs. pipelined processor - Pipelining issues - Hazards - The speedup from pipelining - Pipeline depth - Enhanced pipelining - Low-level parallelism