Module 4c: Pipelining

References: Stallings, Computer Organization and Architecture; Morris Mano, Computer Organization and Architecture; Patterson and Hennessy, Computer Organization and Design

Pipeline The performance of a computer can be increased by increasing the performance of the CPU. One way to do this is to execute more than one task at a time; this technique is referred to as pipelining. The concept of pipelining is to allow the processing of a new task to begin even though the processing of the previous task has not ended.

The root of the single-cycle processor's problems: the cycle time has to be long enough for the slowest instruction. Solution: break the instruction into smaller steps and execute each step (instead of the entire instruction) in one cycle. The cycle time is then the time it takes to execute the longest step, so the steps should be kept to similar lengths. This is the essence of the multiple-cycle processor.

The advantages of the multiple-cycle processor: the cycle time is much shorter; different instructions take different numbers of cycles to complete (a load takes five cycles, a jump only three); and a functional unit can be used more than once per instruction.

Pipeline A single task is divided into several small independent processes. [Figure: processes T1, T2, T3 flowing through Segment 1, Segment 2, Segment 3]

Analogy: Pipelined Laundry Non-pipelined approach: 1. run one load of clothes through the washer; 2. run the load through the dryer; 3. fold the clothes (an optional step for students); 4. put the clothes away (also optional). Two loads? Start all over.

Analogy: Pipelined Laundry While the first load is drying, put the second load in the washing machine. When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.

[Figure: laundry timeline from 6 PM to 2 AM for tasks A, B, C, D. Non-pipelined: 16 units of time. Pipelined: 7 units of time.] Notice that all the processes have the same length.

Pipeline: Space-Time Diagram A space-time diagram illustrates the behaviour of a pipeline. [Figure: segments S1-S4 processing processes P1-P4 over clock cycles 1-7; S = segment, P = process]

Pipelining Lessons Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. The pipeline rate is limited by the slowest pipeline stage, so unbalanced lengths of pipe stages reduce speedup. Time to fill the pipeline and time to drain it also reduce speedup. The pipeline may need to stall for dependences.

Design Issues Since the segments are connected to each other in a sequence, the next segment cannot start execution until it has received the result from the previous segment (in this case, pipelining is not ideal). The cycle time of the segments must therefore be the same. However, the execution time of each segment is generally not the same, so for synchronization the cycle time of the pipeline is based on the longest execution time among the segments.

Pipeline Performance: Degree of Speedup Let t_n be the cycle time without pipelining and t_p the cycle time with pipelining. An ideal pipeline divides a task into k independent sequential processes: each process requires t_p time units to complete, so the task itself requires k * t_p time units. For n iterations of the task, the execution times are: with no pipelining, n * t_n time units; with pipelining, k * t_p + (n - 1) * t_p = [k + (n - 1)] * t_p time units. The degree of speedup is thus: S = (execution time without pipelining) / (execution time with pipelining) = (n * t_n) / ([k + (n - 1)] * t_p). If n is much larger than k (n >> k) and t_n = k * t_p, the maximum speedup is S_max = k. (In an ideal pipeline, each process requires the same time unit.)
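The speedup formula above can be sketched in Python (an illustrative aid; the function name and parameters are our own, not part of the lecture material):

```python
def pipeline_speedup(n, k, t_n, t_p):
    """Speedup of a k-segment pipeline over non-pipelined execution
    of n tasks: S = (n * t_n) / ((k + (n - 1)) * t_p)."""
    return (n * t_n) / ((k + (n - 1)) * t_p)

# With t_n = k * t_p and n >> k, the speedup approaches S_max = k:
s = pipeline_speedup(n=1_000_000, k=5, t_n=5, t_p=1)  # very close to 5
```

For a single task (n = 1) the formula gives S = t_n / (k * t_p), i.e. no speedup when t_n = k * t_p, which matches the lesson that pipelining helps throughput, not latency.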

The Laundromat In a laundromat the following processes occur (times in fixed time units): run one load of clothes through the washer: 8; run the load through the dryer: 4; fold the clothes: 5; put the clothes away: 3. Answer these: Determine the cycle time for the pipelined and non-pipelined process. Determine the execution time for both pipelined and non-pipelined execution. What is the maximum speedup possible? What is the real speedup value?
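One way to work the laundromat exercise (a sketch using the formulas above; the stage times come from the list, but the number of loads n = 4 is an assumption, since the exercise does not fix it):

```python
stages = [8, 4, 5, 3]   # washer, dryer, fold, put away (time units)
k = len(stages)         # number of pipeline segments: 4
n = 4                   # assumed number of loads (not given in the exercise)

t_n = sum(stages)       # non-pipelined cycle time: 20 units
t_p = max(stages)       # pipelined cycle time: 8 units (slowest stage)

non_pipelined_time = n * t_n                        # 80 units
pipelined_time = (k + (n - 1)) * t_p                # 56 units
real_speedup = non_pipelined_time / pipelined_time  # 80 / 56, about 1.43
max_speedup = k                                     # 4, per S_max = k
```

Note how the unbalanced stage lengths keep the real speedup well below the maximum of k, exactly as the pipelining lessons predict.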

Non-Ideal Pipeline Structure: Example The operands pass through all segments in a fixed sequence. Segments are separated by registers that hold the intermediate results between stages. Each segment consists of a circuit that performs a sub-operation. [Figure: control unit driving registers R1..Rm and segments S1..Sm, from Data In through Segment 1, Segment 2, ..., Segment m to Data Out]

Example 1: Pipeline (Non-Ideal Case) Given a 4-segment pipeline in which each segment has the following delay time: Segment 1: 40 ns; Segment 2: 25 ns; Segment 3: 45 ns; Segment 4: 45 ns. The delay time for the interface register is 5 ns. Calculate: i) the cycle time without pipelining and with pipelining, ii) the execution time for 100 tasks, iii) the real speedup, and iv) the maximum speedup.

Example 1: Pipeline [Figure: control unit driving Segments 1-4 from Data Input to Data Output; Seg. 1: 40 ns, Seg. 2: 25 ns, Seg. 3: 45 ns, Seg. 4: 45 ns] i) Cycle time: t_n = (40 + 25 + 45 + 45 + 5) ns = 160 ns; t_p = longest segment delay + interface delay = (45 + 5) ns = 50 ns.

Example 1: Pipeline ii) Execution time for 100 tasks = k * t_p + (n - 1) * t_p = (k + (n - 1)) * t_p = ((4 + 99) * 50) ns = (103 * 50) ns = 5150 ns. For the non-pipelined system, the total execution time for 100 tasks = n * t_n = 100 * 160 ns = 16000 ns. iii) The real speedup for 100 tasks: speedup = execution time without pipelining / execution time with pipelining = 16000 / 5150 = 3.1068. iv) Maximum speedup: S_max = k = 4.
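Example 1's arithmetic can be checked with a short script (a sketch; all numbers are taken directly from the example above):

```python
segments = [40, 25, 45, 45]   # segment delays in ns
reg = 5                       # interface-register delay in ns
n, k = 100, len(segments)     # 100 tasks, 4 segments

t_n = sum(segments) + reg     # non-pipelined cycle: 160 ns
t_p = max(segments) + reg     # pipelined cycle: longest segment + register = 50 ns

pipelined = (k + (n - 1)) * t_p          # 5150 ns for 100 tasks
non_pipelined = n * t_n                  # 16000 ns
real_speedup = non_pipelined / pipelined # about 3.107
```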

Instruction Pipeline The instruction cycle clearly shows the sequence of operations that take place in order to execute a single instruction. A good design goal for any system is to have all of its components performing useful work all of the time, i.e. high efficiency. Following the instruction cycle in a purely sequential fashion does not permit this level of efficiency. Analogy: an automobile assembly line performs all tasks concurrently, but on different (sequential) units. The result is temporal parallelism; in a processor, the result is the instruction pipeline. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.

Implementation of the Instruction Pipeline: Case 1 Divide the instruction cycle into two processes: instruction fetch (the fetch cycle) and everything else (the execution phase). While one instruction is in execution, pre-fetch the next instruction. This assumes the memory bus will be idle at some point during the execution phase, and in the ideal case reduces the time to fetch an instruction to zero. Instruction prefetching is also known as fetch overlap.

Implementation of the Instruction Pipeline: Case 1 Sequential: Fetch #1, Execute #1, Fetch #2, Execute #2. Pipelined: Fetch #2 overlaps Execute #1.

Implementation of the Instruction Pipeline: Case 1 Fetch and execute are not the same size; they take different times to operate, so this is not really neat or ideal. Execution is generally longer than fetch. Whenever the execution unit is not using memory, the control unit increments the program counter and reads consecutive instructions from memory, fetching them into a queue, which is efficient. But sometimes execution needs to branch, so all the prefetching is useless; e.g. instruction 123 is needed rather than 102, so the queue must be flushed and fetching restarted. Also, if there is a branch, fetch must wait for the target address from execute, and execute must wait until the instruction at that address is fetched.

Implementation of the Instruction Pipeline: Case 2 Complex computer instructions require other, finer phases, and having more stages can give further speedup. Example: use a 6-stage pipeline: Fetch instruction (FI): read the next instruction into a buffer. Decode instruction (DI): determine the opcode and operand. Calculate operands (CO): calculate the effective address of the operand. Fetch operands (FO): fetch the operand from memory. Execute instruction (EI). Write (store) operand (WO): store the result in memory. The various stages will be more nearly of equal duration; let us assume equal duration.

Implementation of the Instruction Pipeline: Case 2 Use multiple execution functional units to parallelize the actual execution phase of several instructions, and use branching strategies to minimize the impact of branches.

Space-Time Diagram A 6-stage pipeline can reduce the execution time of 9 instructions from 54 to 6 + (9 - 1) = 14 time units. Non-pipelined: 9 instructions x 6 stages = 54 time units.
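A space-time diagram like this one can be generated programmatically (an illustrative sketch; the function and its layout are our own). Instruction i enters stage s in cycle s + i, so the last of n instructions leaves a k-stage pipeline after k + (n - 1) cycles:

```python
def space_time(k, n):
    """Rows = pipeline stages, columns = clock cycles; each entry is the
    instruction number occupying that stage in that cycle (0 = empty)."""
    cycles = k + n - 1
    diagram = [[0] * cycles for _ in range(k)]
    for i in range(n):          # instruction i (0-based)
        for s in range(k):      # stage s (0-based)
            diagram[s][s + i] = i + 1
    return diagram

d = space_time(6, 9)            # 6 stages, 9 instructions
total_cycles = len(d[0])        # 14 cycles, versus 9 * 6 = 54 non-pipelined
```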

Comments on the Diagram The diagram assumes each instruction goes through all 6 stages, which is not true: a load instruction, for example, does not need the WO stage. It assumes there are no memory conflicts, yet FI, FO, and WO all involve memory access and cannot occur simultaneously, unless the FO or WO stage is null or the value needed is already in cache, in which case this is not a problem. It assumes no interrupt or branching happens. It assumes no data dependency, whereas the CO stage may depend on the contents of a register that could be altered by a previous instruction still in the pipeline.

6-stage CPU Instruction Pipeline

Example 2 The estimated timings for each stage of an instruction pipeline: instruction fetch: 2 ns; register read: 1 ns; ALU operation: 2 ns; data memory: 2 ns; register write: 1 ns.

[Figure: program execution order for mov ax, num1 / mov bx, num2 / mov cx, num3. Non-pipelined: each instruction passes through instruction fetch, register read, ALU, data access, and register write, and a new instruction starts every 8 ns. Pipelined: the stages overlap and a new instruction starts every 2 ns.]

Pipeline Limitations

Pipeline Limitations These difficulties are sometimes called limitations or hazards. In general there are 3 major difficulties that cause the instruction pipeline to deviate from its normal operation (i.e. affect pipeline performance): Resource conflict (resource hazard or structural hazard): caused by two segments accessing memory at the same time. Data dependency (data hazard): a conflict arises when an instruction depends on the result of a previous instruction, but that result is not yet available. Branch difficulties (control hazard): arise from branch and other instructions that change the PC value. Pipeline depth is often not included in the list, but it does have an effect on pipeline performance, so it will be discussed.

Pipeline Limitations Hazards can make the pipeline stall. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, instructions issued later than the stalled instruction are stopped, while those issued earlier must continue. No new instructions are fetched during the stall.

Pipeline Limitation: Pipeline Depth If the speedup is based on the number of stages, why not build lots of stages? Each stage uses latches at its inputs and outputs to buffer the next set of inputs, so more stages means more hardware overhead. If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead. There is also a time overhead in the propagation time through the latches, which limits the rate at which data can be clocked through the pipeline. In addition, the logic needed to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth.

6 deep, i.e. 6 stages

Pipeline Limitation: Data Dependency Pipelining must ensure that computed results are the same as if the computation were performed in strict sequential order. With multiple stages, two instructions in execution in the pipeline may have data dependencies; the pipeline must be designed to handle this. Data dependencies limit when an instruction can be input to the pipeline. Data dependency is shown in the following portion of a program: A = B + C; D = E + A; C = G x H; A = D / H. The second statement needs A, but cannot read the value of A because it is still being written by the previous instruction.

Say that this is a 5-stage pipeline: I1: ADD AX, BX ; [AX] <- [AX] + [BX] I2: SUB AX, CX ; [AX] <- [AX] - [CX] The ADD instruction does not update register AX until stage 5, but the SUB instruction needs the value at the beginning of stage 3; it cannot fetch something that is not ready. To maintain correct operation, the pipeline must stall for 2 clock cycles, resulting in inefficient pipeline usage.

Solutions to Data Dependency Hardware interlocks: an interlock is a circuit that detects instructions with data dependency and inserts the required delays to resolve conflicts. Operand forwarding: use a circuit to detect a possible conflict, then use extra hardware to keep results for future use down the pipeline, allowing the result of the ALU to be sent directly back as an ALU input for use by ALU operations in the next instruction cycle. Delayed load (NOP): the compiler detects the conflict and reorders instructions to delay the loading of conflicting data, using NOP instructions.
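The delayed-load idea can be sketched as a toy compiler pass (our own simplified model, not the lecture's: each instruction is a destination register plus its source registers, and gap, assumed here to be 2 as in the ADD/SUB example above, is how many slots must separate a write from a dependent read):

```python
NOP = (None, frozenset())   # a NOP writes nothing and reads nothing

def insert_nops(program, gap=2):
    """Insert NOPs so that no instruction reads a register written
    fewer than `gap` slots earlier. program: list of (dest, sources)."""
    out = []
    for dest, srcs in program:
        # scan backwards over already-scheduled slots for a recent write
        for back, (prev_dest, _) in enumerate(reversed(out), start=1):
            if back > gap:
                break
            if prev_dest is not None and prev_dest in srcs:
                out.extend([NOP] * (gap - back + 1))  # pad out the hazard
                break
        out.append((dest, srcs))
    return out

# I1: ADD AX, BX  then  I2: SUB AX, CX  (I2 reads AX, written by I1)
prog = [("AX", {"AX", "BX"}), ("AX", {"AX", "CX"})]
scheduled = insert_nops(prog)   # two NOPs appear between I1 and I2
```

This mirrors what a hardware interlock does in time, but done statically by the compiler instead of dynamically by a circuit.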

Solution to Data Dependency: Operand Forwarding [Figure: ALU with inputs Src1, Src2 and result RSLT; a forwarding data path allows the ALU result to be fed directly back as an ALU input, bypassing the operand store]

Solution to Data Dependency: Delayed Load (NOP) [Space-time diagram: instructions I1, NOP, I2, I3, I4 flowing through the Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result stages over clock cycles 1-7]

Pipeline Limitation: Conflict of Resources Occurs when two segments need to access memory at the same time. It can be solved by implementing modular memory. [Space-time diagram: instructions I1-I5 in a 4-stage pipeline (Fetch Instruction, Decode Instruction, Execute Instruction, Store Result); in one clock cycle the fetch of I4, the fetching of an indirect operand for I3, and the store/write of I1 all need access to memory at the same time]

5-Stage Pipeline, Ideal Case We have a conflict, with two instructions needing the same resource. Assume that memory has a single port, so data reads and writes can only happen one at a time, and assume that the source operand for I1 is in memory. A delay in any stage can cause pipeline stalls: the FI stage for I3 must idle for 1 clock cycle before beginning.

Handling Resource Conflict This scenario describes a memory conflict caused by the instruction fetch of I3 and the memory-resident operand fetch of I1.

Handling Resource Conflict A Harvard architecture (separate instruction and data memories) alleviates this issue.

Pipeline Limitation: Branching For the pipeline to have the desired operational speedup, we must feed it with long strings of instructions; there must be a steady flow of instructions. However, 15-20% of the instructions in an assembly-level stream are branches. For a conditional branch it is impossible to determine whether the branch will be taken or not until it is actually executed; of these branches, 60-70% are taken to a target address. The impact of branches is that the pipeline never really operates at its full capacity, limiting the performance improvement that is derived from the pipeline.

Example 4: Branching [Space-time diagram: a branch causes the instructions fetched after it to be flushed out of the Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result stages, leaving the pipeline idle]

Solutions for Branching Delayed branch (using NOPs). Rearranging the instructions. Implementation of an instruction queue.

Solution for Branching: Delayed Branch Using NOPs When the compiler detects a branch, it automatically inserts several NOPs so that there are no interruptions in the pipeline. Example: I4 MOV, I5 INC, I6 ADD, I7 RET KK, with NOP, NOP, NOP inserted after the RET and before I8. The number of NOPs used is 1 less than the number of segments: if k = number of segments, then NOPs to be used = k - 1 = 4 - 1 = 3.

Solution for Branching: Delayed Branch Using NOPs [Space-time diagram over clock cycles 1-10: I4, I5, I6, I7, NOP, NOP, NOP, KK flowing through the Fetch Instruction (FI), Decode Instruction (DI), Execute Instruction (EI), and Store Result stages; with 4 segments and 3 NOPs, 3 pipeline clock cycles are wasted in each segment]

Solution for Branching: Rearranging Instructions The instructions are rearranged so that the pipeline can operate efficiently. How? By bringing the branch instruction up a few notches: the number of notches = number of segments - 1 = k - 1 = 4 - 1 = 3. For example, the sequence MOV, ADD, INC, RET becomes RET, MOV, ADD, INC. [Space-time diagram: the rearranged instructions flowing through the Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result stages]

Solution for Branching: Rearranging Instructions Given I4 MOV, I5 INC, I6 ADD, I7 RET KK: bringing the branch instruction up 3 notches (number of segments - 1 = 4 - 1 = 3) gives the order I7, I4, I5, I6, followed by KK. [Space-time diagram over clock cycles 1-8: I7, I4, I5, I6, KK flowing through the FI, DI, EI, and Store Result stages]

Example: Rearranging Instructions (pipeline with branching hazard) I1 MOV AX,BX; I2 ADD BX,CX; I3 INC CX; I4 JMP LOSSY; I5 MOV DX,1; I6 INC BX; ... ; I11 LOSSY: DEC AX. [Space-time diagram over clock cycles 1-11: I1-I4 complete normally; I5 and I6 are flushed out after the JMP executes; the pipeline then idles until I11 is fetched and completed]

Example: Rearranging Instructions (after rearranging) Bringing the JMP instruction up by 3 notches transforms the sequence I1 MOV AX,BX; I2 ADD BX,CX; I3 INC CX; I4 JMP LOSSY; I5 MOV DX,1; I6 INC BX into I4 JMP LOSSY; I1 MOV AX,BX; I2 ADD BX,CX; I3 INC CX; I5 MOV DX,1; I6 INC BX, with I11 LOSSY: DEC AX as the target. [Space-time diagram over clock cycles 1-11: I4, I1, I2, I3, I11 flow through the pipeline with no flushed instructions]

Solution for Branching: Instruction Queue Prefetch the target instruction: in the case of a conditional branch, prefetch both possible next instructions into an instruction queue. When the FI segment detects a JMP, the next instruction address implied by the JMP is determined; the wrongly prefetched instruction (e.g. S4 in the queue S1, S2, JMP, S4) is deleted and the new target instruction is fetched.

Example 5: Instruction Pipeline Given a pipeline that consists of 5 segments: Fetch Instruction, Decode Instruction, Fetch Operand, Execute Instruction, and Store Result. Say n = 3: I_n: MOV AX, NUM ; AX <- NUM I_n+1: ADD BX, AX ; BX <- BX + AX I_n+2: JMP LOOP1 ; jump to I_n+10 I_n+3: ... I_n+10: LOOP1:

I3 MOV AX,NUM ; AX <- NUM I4 ADD BX,AX ; BX <- BX + AX I5 JMP LOOP1 ; jump to I13 (here we have a branching problem) I6 ... I13 LOOP1: Draw the space-time diagram for the execution of the instructions and identify the branching and data dependency hazards. [Space-time diagram over clock cycles 1-12: when I5 (JMP) is executed, all fetched instructions (I6, I7, I8) are flushed, and the pipeline is idle until the JMP completes and I13 is fetched and completed]

Here we also have a data dependency problem: I4 needs AX, but AX is still being processed by I3. [Space-time diagram: the Fetch Operand stage of I4 overlaps the stages of I3 that are still computing AX; this is where the data dependency appears in the diagram]

Solve the data dependency problem using NOP instructions: insert a NOP between I3 and I4, giving the sequence I3, NOP, I4, I5, ... [Space-time diagram over clock cycles 1-13: I3, NOP, I4, I5, then I13, flowing through the five stages]

Solve the branching problem by rearranging the instructions. The number of notches = number of segments - 1 = 5 - 1 = 4, so the sequence I3, NOP, I4, I5 becomes I5, I2, I3, NOP, I4. [Space-time diagram over clock cycles 1-13: I5, I2, I3, NOP, I4, then I13, flowing through the five stages]

Solve the branching problem using NOP instructions. The number of NOPs = number of segments - 1 = 5 - 1 = 4, so four NOPs are inserted after I5 (the JMP). [Space-time diagram over clock cycles 1-13: I3, NOP, I4, I5, NOP, NOP, NOP, NOP, I13 flowing through the five stages]

Try This Out Given a pipeline that consists of 4 segments: Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result. Use this information to answer the following questions for the program: I1 MOV AX,BX; I2 SUB BX,2; I3 ADD BX, NUMB; I4 JMP LOSSY; I5 INC BX; I6 DEC NUMB; I7 LOSSY: DEC AX. Draw the space-time diagram for the execution. Label clearly the data dependency problem, the flushed instructions, and the empty/idle slots. Solve the data dependency using NOP instructions. Solve the branching problem by rearranging the instructions. Draw a new space-time diagram.

Solution: draw the space-time diagram for the execution and label the data dependency problem, the flushed instructions, and the empty/idle slots. [Space-time diagram over clock cycles 1-13: I1-I4 flow through the four stages; I5 and I6 are flushed after the JMP executes; idle slots appear until I7 completes]

Solution continued: (A) solve the data dependency by inserting a NOP, giving I1, I2, NOP, I3, I4; (B) solve the branching problem by bringing the JMP (I4) up, giving I1, I4, I2, NOP, I3. [New space-time diagram over clock cycles 1-11: I1, I4, I2, NOP, I3, I7 flowing through the four stages]

Arithmetic Pipeline The processing of a complex arithmetic operation is subdivided into several task segments. Every parameter involved in the arithmetic operation is processed in the segments, controlled by the CPU clock. Examples: S_i = A_i * B_i + C_i; Z_i = A_i * B_i + C_i * D_i; K_i = A_i * B_i + C_i / D_i.

Example 6: Designing an Arithmetic Pipeline Design an arithmetic pipeline for the operation S = A_i * B_i + C_i where i = 1, 2, 3, ..., n. Specifications: 3-segment pipeline; each segment is allowed to process only two tasks. Components: register (R), adder (A), multiplier (M).

Example 6 S = A_i * B_i + C_i where i = 1, 2, 3, ..., n. The sequence of operations (following the standard mathematical order of operations): access the required data into the registers; the multiplication operation; the addition operation.

Example 6: S = A_i * B_i + C_i Segment definition: Segment 1: Reg1 <- A_i, Reg2 <- B_i. Segment 2 (multiply): Reg3 <- A_i * B_i, Reg4 <- C_i. Segment 3 (add): Reg5 <- Reg3 + Reg4, S <- Reg5. Notice that each segment executes only 2 subtasks.
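The 3-segment pipeline above can be simulated cycle by cycle (a sketch; the inter-segment registers follow the segment definition, and the stages are updated back-to-front so that each register holds the value latched on the previous cycle):

```python
def arithmetic_pipeline(A, B, C):
    """Compute S_i = A_i * B_i + C_i through a 3-segment pipeline."""
    results = []
    reg12 = None   # segment 1 output: (A_i, B_i, C_i) latched from input
    reg34 = None   # segment 2 output: (A_i * B_i, C_i)
    stream = list(zip(A, B, C)) + [None, None]   # 2 extra cycles to drain
    for operands in stream:
        if reg34 is not None:                    # segment 3: add
            results.append(reg34[0] + reg34[1])
        if reg12 is not None:                    # segment 2: multiply
            a, b, c = reg12
            reg34 = (a * b, c)
        else:
            reg34 = None
        reg12 = operands                         # segment 1: latch inputs
    return results

s = arithmetic_pipeline([1, 4], [2, 5], [3, 6])   # [1*2+3, 4*5+6] = [5, 26]
```

Once the pipeline is full, one result S_i emerges every clock cycle, even though each individual result still takes 3 cycles end to end.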