Chapter 9. Pipelining Design Techniques


9.1 General Concepts Pipelining refers to the technique in which a given task is divided into a number of subtasks that need to be performed in sequence. Each subtask is performed by a given functional unit. The units are connected in series and all of them operate simultaneously, each on a different task. Pipelining improves performance compared with the traditional sequential execution of tasks.

9.1 General Concepts Below is an illustration of the basic difference between executing four subtasks of a given instruction (in this case fetching F, decoding D, execution E, and writing the results W) using pipelining and sequential processing.

9.1 General Concepts In order to formulate some performance measures for the goodness of a pipeline in processing a series of tasks, a space-time chart (called a Gantt chart) is used. The chart shows the succession of the subtasks in the pipe with respect to time.
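The structure of such a chart can be sketched programmatically. The following is a minimal illustration, not part of the original text; the stage letters match the F, D, E, W subtasks above, and the function name is ours:

```python
# Print the space-time (Gantt) chart for m instructions flowing through
# the four pipeline stages F, D, E, W.

STAGES = ["F", "D", "E", "W"]

def gantt_chart(m):
    """Return one row per instruction, showing its stage in each clock cycle."""
    n = len(STAGES)
    total = n + m - 1                 # cycles needed to finish m tasks
    rows = []
    for i in range(m):
        row = ["."] * total
        for s in range(n):
            row[i + s] = STAGES[s]    # instruction i enters stage s at cycle i+s
        rows.append("".join(row))
    return rows

for r in gantt_chart(4):
    print(r)
```

Each printed row shifts one cycle to the right, which is exactly the staircase pattern of the Gantt chart: four instructions finish in 4 + 4 - 1 = 7 cycles rather than 16.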

9.1 General Concepts Three performance measures for the goodness of a pipeline are provided: Speed-up S(n), Throughput U(n), and Efficiency E(n). It is assumed that the unit time T = t units.

9.1 General Concepts Speedup S(n): Consider the execution of m tasks (instructions) using an n-stage (n-unit) pipeline; n + m - 1 time units are required to complete the m tasks.

Speedup S(n) = (time using sequential processing) / (time using pipeline processing) = (m · n · t) / ((n + m - 1) · t) = m · n / (n + m - 1).

As m → ∞, S(n) → n; that is, an n-fold increase in speed is theoretically possible.

Throughput U(n): the number of tasks executed per unit time: U(n) = m / ((n + m - 1) · t). As m → ∞, U(n) → 1/t.

9.1 General Concepts Efficiency E(n): the ratio of the actual speedup to the maximum speedup: E(n) = S(n) / n = m / (n + m - 1). As m → ∞, E(n) → 1.

Fundamentals of Computer Organization and Architecture
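The three measures can be computed in a few lines. This is a sketch under the assumptions above (m tasks, n stages, stage time t); the function names are ours:

```python
# Pipeline performance measures for m tasks on an n-stage pipeline, stage time t.

def speedup(n, m):
    """S(n) = m*n / (n + m - 1); approaches n as m grows."""
    return m * n / (n + m - 1)

def throughput(n, m, t=1.0):
    """U(n) = m / ((n + m - 1) * t); approaches 1/t as m grows."""
    return m / ((n + m - 1) * t)

def efficiency(n, m):
    """E(n) = S(n) / n = m / (n + m - 1); approaches 1 as m grows."""
    return speedup(n, m) / n

# With only as many tasks as stages (m = n = 4), the pipe is far from its limit:
print(speedup(4, 4))        # 16/7, well below the theoretical maximum of 4
print(speedup(4, 10_000))   # close to 4 for a long stream of tasks
```

The example confirms the limits: the theoretical n-fold speedup is approached only when the task stream is long compared with the pipeline depth.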

9.2 Instruction Pipeline A pipeline stall: A pipeline operation is said to have stalled if one unit (stage) requires more time to perform its function, thus forcing other stages to become idle. For example, extra time units needed to fetch an instruction cause the pipeline to stall. Such situations create what is known as a pipeline bubble; the conditions that cause them are known as pipeline hazards.

9.2 Instruction Pipeline Pipeline Stall due to Instruction Dependency: Instruction dependency refers to the case whereby fetching an instruction depends on the results of executing a previous instruction. Instruction dependency manifests itself in the execution of conditional branch instructions. For example, in the case of a "branch if negative" instruction, the next instruction to fetch is not known until the result of executing that branch instruction is known.

9.2 Instruction Pipeline Pipeline Stall due to Instruction Dependency:
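The cost of such a stall can be sketched with a simple cycle count. This is an illustration under stated assumptions, not the text's exact model: a 4-stage pipeline in which the instruction after a conditional branch cannot be fetched until the branch completes its execute stage, costing two bubble cycles per branch:

```python
# Total cycles for m instructions on an n-stage pipeline, where each
# conditional branch inserts a fixed number of bubble cycles (assumed 2 here).

def cycles(m, n=4, branches=(), bubbles_per_branch=2):
    """Ideal pipeline time n + m - 1, plus the stall penalty of each branch."""
    return (n + m - 1) + bubbles_per_branch * len(branches)

print(cycles(6))                 # 9 cycles: no branches, ideal flow
print(cycles(6, branches=[2]))   # 11 cycles: one branch adds 2 bubbles
```

Even a single conditional branch in a short instruction stream noticeably stretches the total time, which is why the techniques listed later in this section are worthwhile.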

9.2 Instruction Pipeline Pipeline Stall due to Data Dependency: Data dependency in a pipeline occurs when a source operand of instruction I_i depends on the results of executing a preceding instruction I_j, i > j. Write-after-write data dependency

9.2 Instruction Pipeline Pipeline Stall due to Data Dependency: Read-after-write data dependency
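The two kinds of data dependency shown above can be illustrated with a small classifier. This is a sketch of our own; representing each instruction as a (destination register, set of source registers) pair is an assumption for illustration, not the text's notation:

```python
# Classify the data dependency between an earlier instruction I_j and a
# later instruction I_i, each given as (dest_register, source_register_set).

def classify(instr_j, instr_i):
    dest_j, srcs_j = instr_j
    dest_i, srcs_i = instr_i
    kinds = []
    if dest_j in srcs_i:
        kinds.append("read-after-write")    # I_i reads what I_j writes
    if dest_j == dest_i:
        kinds.append("write-after-write")   # both write the same register
    if dest_i in srcs_j:
        kinds.append("write-after-read")    # I_i overwrites a source of I_j
    return kinds

# ADD R3, R1, R2 followed by SUB R4, R3, R5: I_i reads R3 before it is ready.
print(classify(("R3", {"R1", "R2"}), ("R4", {"R3", "R5"})))
```

A read-after-write dependency forces the later instruction to wait for the earlier one's result; a write-after-write dependency forces the writes to complete in program order.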

9.2 Instruction Pipeline Method used to prevent fetching the wrong instruction or operand: use of NOP (No Operation). In real-life situations, a mechanism is needed to guarantee fetching the appropriate instruction at the appropriate time. Insertion of NOP instructions helps carry out this task. A NOP is an instruction that has no effect on the status of the processor.

9.2 Instruction Pipeline Method used to prevent fetching the wrong instruction or operand: use of NOP (No Operation)
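NOP insertion can be sketched as a simple pass over the program. This is an illustration of our own: instructions are (destination, sources) pairs as before, only adjacent pairs are checked, and the two-cycle gap is an assumption that depends on the actual pipeline depth:

```python
# Insert NOPs after a producing instruction so that a dependent consumer
# reads its operand only after the producer has written it.

def insert_nops(program, gap=2):
    """program: list of (dest, srcs) tuples. Checks adjacent pairs only."""
    out = []
    for instr in program:
        if out:
            prev = out[-1]
            # Read-after-write on the immediately preceding instruction?
            if prev != "NOP" and prev[0] in instr[1]:
                out.extend(["NOP"] * gap)
        out.append(instr)
    return out

# ADD R3, R1, R2 then SUB R4, R3, R5: two NOPs separate producer and consumer.
prog = [("R3", {"R1", "R2"}), ("R4", {"R3", "R5"})]
print(insert_nops(prog))
```

The NOPs cost cycles but guarantee correctness without any extra hardware, which is exactly the trade-off the text describes.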

9.2 Instruction Pipeline Methods used to reduce pipeline stall due to instruction dependency:
Unconditional branch instructions:
- Reordering of instructions
- Use of dedicated hardware in the fetch unit
- Pre-computing of branches and reordering of instructions
- Instruction pre-fetching
Conditional branch instructions:
- Delayed branch
- Prediction of the next instruction to fetch

9.2 Instruction Pipeline Methods used to reduce pipeline stall due to data dependency:
- Hardware operand forwarding
- Software operand forwarding: store-fetch, fetch-fetch, and store-store
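The store-fetch case of software operand forwarding can be sketched as a peephole rewrite: a load from an address that was just stored to is replaced by a register move, so the consumer need not wait for the memory access. The three-field instruction encoding below is our own illustration, not the text's syntax:

```python
# Store-fetch software forwarding: rewrite  ST R1, M100 ; LD M100, R2
# into  ST R1, M100 ; MOV R1, R2  (copy the value from the register instead).

def forward_store_fetch(program):
    """program: list of (opcode, operand1, operand2) triples."""
    out = list(program)
    for k in range(len(out) - 1):
        op1, src_reg, addr = out[k]        # ("ST", "R1", "M100"): store R1 to M100
        op2, load_addr, dst = out[k + 1]   # ("LD", "M100", "R2"): load M100 into R2
        if op1 == "ST" and op2 == "LD" and load_addr == addr:
            out[k + 1] = ("MOV", src_reg, dst)   # MOV src, dst
    return out

prog = [("ST", "R1", "M100"), ("LD", "M100", "R2")]
print(forward_store_fetch(prog))
```

The fetch-fetch and store-store cases follow the same pattern: the compiler removes a redundant memory access when it can prove the value is already in a register or will be overwritten.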

9.3 Example Pipeline Processors ARM 1026EJ-S Processor The ARM 1026EJ-S is a pipelined processor whose pipeline consists of six stages:
- Fetch stage: instruction cache access and branch prediction for instructions that have already been fetched.
- Issue stage: initial instruction decoding.
- Decode stage: final instruction decode, register read for ALU operations, forwarding, and initial interlock resolution.
- Execute stage: data access address calculation, data processing shift, shift & saturate, ALU operations, first stage of multiplication, flag setting, condition code check, branch mispredict detection, and store data register read.
- Memory stage: data cache access, second stage of multiplication, and saturation.
- Write stage: register write and instruction retirement.

9.3 Example Pipeline Processors UltraSPARC-III Processor

9.4 Instruction-Level Parallelism

9.4 Instruction-Level Parallelism Superscalar Architectures A scalar machine is able to perform only one arithmetic operation at a time. A superscalar architecture (SPA) is able to fetch, decode, execute, and store the results of several instructions at the same time. In an SPA, instruction processing consists of the fetch, decode, issue, and commit stages. The most crucial step in processing instructions in SPAs is the dependency analysis. The complexity of this analysis grows quadratically with the number of instructions considered for simultaneous issue. This puts a limit on the degree of parallelism that can be achieved with SPAs: a degree of parallelism higher than four is generally impractical.
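The quadratic growth is easy to see: with an issue window of w instructions, every pair must be checked for dependencies. A small sketch of our own:

```python
# Number of pairwise dependency checks for an issue window of w instructions:
# every unordered pair must be examined, i.e. w*(w-1)/2 checks, which is O(w^2).

def pair_checks(w):
    return w * (w - 1) // 2

print([pair_checks(w) for w in (2, 4, 8)])  # [1, 6, 28]
```

Doubling the window from 4 to 8 more than quadruples the checking work (6 to 28 pairs), which is why wide superscalar issue becomes impractical in hardware.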

9.4 Instruction-Level Parallelism Very Long Instruction Word (VLIW) The compiler performs the dependency analysis and determines the appropriate grouping and scheduling of operations. Operations that can be performed simultaneously are grouped into a Very Long Instruction Word (VLIW). The instruction word is therefore made long enough to accommodate the maximum possible degree of parallelism. In VLIW, resource binding can be done by devoting each field of an instruction word to one and only one functional unit. However, this arrangement leads to a limit on the mix of instructions that can be issued per cycle. A more flexible approach is to allow a given instruction field to be occupied by different kinds of operations.
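The compiler's grouping step can be sketched as a greedy packer: consecutive operations that share no registers are placed in the same word, up to the word's width. This is an illustrative sketch of our own, using the same (destination, sources) encoding as earlier, not an actual VLIW scheduling algorithm:

```python
# Greedy compile-time VLIW packing: group consecutive independent operations
# (no register conflicts) into words of at most `width` slots.

def pack_vliw(ops, width=4):
    """ops: list of (dest, srcs) tuples. Returns a list of words (lists of ops)."""
    words = []
    for op in ops:
        if words:
            word = words[-1]
            conflict = any(op[0] == o[0]          # write-after-write
                           or op[0] in o[1]       # write-after-read
                           or o[0] in op[1]       # read-after-write
                           for o in word)
            if not conflict and len(word) < width:
                word.append(op)                   # independent: same word
                continue
        words.append([op])                        # dependent or full: new word
    return words

# The third operation reads R1, written by the first, so it starts a new word.
ops = [("R1", {"R5"}), ("R2", {"R6"}), ("R3", {"R1"})]
print(len(pack_vliw(ops)))  # 2 words
```

A production compiler would also reorder operations and model functional-unit types per field; this sketch only shows the core idea that the grouping decision happens at compile time.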

9.5 Arithmetic Pipeline Fixed-Point Arithmetic Pipelines

9.5 Arithmetic Pipeline Floating-Point Arithmetic Pipelines

9.5 Arithmetic Pipeline Pipelined Multiplication using Carry-Save Addition
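The carry-save step at the heart of this pipeline can be sketched in a few lines: three numbers are reduced to two (a sum word and a carry word) with no carry propagation between bit positions, which is what makes each pipeline stage fast. A minimal illustration, assuming unsigned integers:

```python
# One carry-save addition step: reduce three addends to a (sum, carry) pair
# whose total equals a + b + c, with no ripple-carry between bit positions.

def carry_save(a, b, c):
    s = a ^ b ^ c                            # per-position sum bit (XOR)
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bit, shifted left
    return s, carry

s, c = carry_save(0b1011, 0b0110, 0b0011)
print(s + c == 0b1011 + 0b0110 + 0b0011)  # True: the pair preserves the total
```

A pipelined multiplier chains such stages to reduce the many partial products to two numbers, and only the final stage performs a full carry-propagating addition.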

9.6 Summary In this chapter, the basic principles involved in designing pipeline architectures were considered. Our coverage started with a discussion of a number of metrics that can be used to assess the goodness of a pipeline. We then presented a general discussion of the main problems that need to be considered in designing a pipelined architecture. In particular, we considered two main problems: instruction dependency and data dependency.

9.6 Summary The effect of these two problems on the performance of a pipeline has been elaborated. Some possible techniques that can be used to reduce the effect of instruction and data dependency have been introduced and illustrated. Examples of recent pipeline architectures, the ARM 1026EJ-S and the UltraSPARC-III, have been presented.