Limiting The Data Hazards by Combining The Forwarding with Delay Slots Operations to Improve Dynamic Branch Prediction in Superscalar Processor

S. A. Hadoud and A. M. Mosbah
Azzaituna University, Tarhuna, Libya
drsaiedhadood@yahoo.com, Ali.amary81@yahoo.com

ABSTRACT

Modern microprocessor performance has been significantly increased by the exploitation of instruction level parallelism (ILP) [1]. Continued improvement is limited by pipeline hazards caused by data dependences between instructions in sequential programs. Previous studies have proposed techniques such as delay slots and forwarding, applied separately, to avoid data dependences; examples include the architectural trade-offs in the design of MIPS-X [2] and rewriting executable files to measure program behavior [3]. In this paper we introduce a new mechanism that combines these operations to avoid the data hazards that degrade the performance of ILP.

KEYWORDS

ILP   Instruction Level Parallelism
ISA   Instruction Set Architecture
CPI   Clock cycles Per Instruction
BTB   Branch Target Buffer
BHT   Branch History Table
BHR   Branch History Register
LHR   Local History Register
GHR   Global History Register
PHT   Pattern History Table
2bc   Two-Bit Counter
EHB   Elastic History Buffer
DHLF  Dynamic History Length Fitting
RAW   Read After Write
WAR   Write After Read
WAW   Write After Write

1 INTRODUCTION

Data hazards arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of the instructions in the pipeline [1] [3] [4], causing the pipeline to stall until the result is made available. A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands [1], so that the order differs from the order seen by the sequentially executing instructions on the unpipelined machine. Consider the pipelined execution of the instructions shown in Table (1).

Table (1): Instruction execution sequence

Cycle:            1    2    3      4      5      6      7    8    9
ADD R1,R2,R3      IF   ID   EX     MEM    WB
SUB R4,R5,R1           IF   IDsub  EX     MEM    WB
AND R6,R1,R7                IF     IDand  EX     MEM    WB
OR  R8,R1,R9                       IF     IDor   EX     MEM  WB
XOR R10,R1,R11                            IF     IDxor  EX   MEM  WB

All the instructions after the ADD use the result that the ADD writes into R1. The ADD instruction writes the value of R1 in the WB stage (cycle 5), and the SUB instruction reads the value during its ID stage (IDsub, cycle 3). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it. The AND instruction is also affected by this data hazard: the write of R1 does not complete until the end of cycle 5, so the AND instruction, which reads the registers during cycle 4 (IDand), will also receive the wrong result. The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: perform register file reads in the second half of the cycle and writes in the first half. Because both WB for ADD and IDor for OR occur in cycle 5, the write to the register file by ADD is performed in the first half of the cycle and the read of the registers by OR in the second half. The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.
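The stall condition described above can be made concrete with a small script. The following is a minimal sketch, not taken from the paper, that checks which of the readers in Table (1) would see a stale value of R1 in the classic five-stage pipeline with the split-cycle register file (write in the first half of WB, read in the second half of ID); the instruction encoding and helper names are illustrative assumptions.

    # Minimal sketch: RAW-hazard check for an in-order 5-stage pipeline
    # (IF ID EX MEM WB), assuming the register file writes in the first
    # half of WB and reads in the second half of ID, as in Section 1.

    INSTRS = [  # (opcode, destination, sources) for the sequence of Table (1)
        ("ADD", "R1",  ("R2", "R3")),
        ("SUB", "R4",  ("R5", "R1")),
        ("AND", "R6",  ("R1", "R7")),
        ("OR",  "R8",  ("R1", "R9")),
        ("XOR", "R10", ("R1", "R11")),
    ]

    ID_STAGE, WB_STAGE = 2, 5      # stage positions within one instruction

    def read_cycle(i):             # ID cycle of instruction i (IF is at cycle i+1)
        return i + ID_STAGE

    def write_cycle(i):            # WB cycle of instruction i
        return i + WB_STAGE

    for i, (op_i, dst, _) in enumerate(INSTRS):
        for j in range(i + 1, len(INSTRS)):
            op_j, _, srcs = INSTRS[j]
            if dst in srcs:
                # Split-cycle register file: a read in the same cycle as the
                # write already sees the new value, so only earlier reads stall.
                stale = read_cycle(j) < write_cycle(i)
                print(f"{op_j} reads {dst} written by {op_i}: "
                      f"{'RAW hazard (stale value)' if stale else 'safe'}")

Running the sketch reports hazards for SUB and AND and "safe" for OR and XOR, matching the discussion of Table (1).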

2 FORWARDING

The problem with data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding, as shown in Table (2).

Table (2): Forwarding execution sequence

Cycle:          1    2    3      4      5      6    7
ADD R1,R2,R3    IF   ID   EX     MEM    WB
SUB R4,R5,R1         IF   IDsub  EX     MEM    WB
AND R6,R1,R7              IF     IDand  EX     MEM  WB

The key insight in forwarding is that the result is not really needed by SUB until after the ADD actually produces it. The only problem is to make it available for SUB when SUB needs it. If the result can be moved from where the ADD produces it (the EX/MEM register) to where the SUB needs it (the ALU input latch), then the need for a stall can be avoided. The ALU result from the EX/MEM register is always fed back to the ALU input latches. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file. Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and three paths to the new inputs. The paths correspond to forwarding of:

a) the ALU output at the end of EX,
b) the ALU output at the end of MEM, and
c) the memory output at the end of MEM.

Without forwarding, our example executes correctly only with stalls, as shown in Table (3).

Table (3): Execution sequence without forwarding

Cycle:          1    2    3      4      5      6      7    8    9
ADD R1,R2,R3    IF   ID   EX     MEM    WB
SUB R4,R5,R1         IF   stall  stall  IDsub  EX     MEM  WB
AND R6,R1,R7              stall  stall  IF     IDand  EX   MEM  WB

As our example shows, we need to forward results not only from the immediately previous instruction, but possibly from an instruction that started three cycles earlier. Forwarding can also be arranged from the MEM/WB latch to the ALU inputs. Using those forwarding paths, the code sequence can be executed without stalls, as shown in Table (4).

Table (4): Execution sequence using forwarding

Cycle:          1    2    3      4       5      6    7
ADD R1,R2,R3    IF   ID   EXadd  MEMadd  WB
SUB R4,R5,R1         IF   ID     EXsub   MEM    WB
AND R6,R1,R7              IF     ID      EXand  MEM  WB

The first forwarding is of the value of R1 from EXadd to EXsub. The second forwarding is also of the value of R1, from MEMadd to EXand. This code can now be executed without stalls.
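The control decision made by the forwarding hardware can be sketched as simple select logic for each ALU input. The sketch below is illustrative only, not the paper's design; the signal names (ex_mem_*, mem_wb_*) follow the usual pipeline-register naming convention and are assumptions.

    # Minimal sketch of forwarding-unit select logic for one ALU source
    # operand. Registers are encoded as small integers (R1 -> 1).

    FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB = 0, 1, 2

    def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd,
                       mem_wb_regwrite, mem_wb_rd):
        """Choose where the ALU should take operand `src_reg` from."""
        # The most recent producer wins: a result still sitting in EX/MEM
        # (one instruction ahead) takes priority over the older one in MEM/WB.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
            return FROM_EX_MEM      # forward the ALU result from the end of EX
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
            return FROM_MEM_WB      # forward the ALU/memory result from the end of MEM
        return FROM_REGFILE         # no hazard: use the value read in ID

    # Example: SUB R4,R5,R1 right after ADD R1,R2,R3, as in Table (4).
    # ADD's result for R1 is in EX/MEM when SUB reaches EX, so it is forwarded.
    assert forward_select(src_reg=1, ex_mem_regwrite=True, ex_mem_rd=1,
                          mem_wb_regwrite=False, mem_wb_rd=0) == FROM_EX_MEM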

Forwarding can be generalized to include passing a result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than only from the result of a unit back to the input of the same unit. As one more example, consider the instructions shown in Table (5).

Table (5): Execution sequence using forwarding

Cycle:          1    2    3      4       5      6      7
ADD R1,R2,R3    IF   ID   EXadd  MEMadd  WB
LW  R4,d(R1)         IF   ID     EXlw    MEMlw  WB
SW  R4,12(R1)             IF     ID      EXsw   MEMsw  WB

Stores require an operand during MEM, and forwarding of that operand is shown here. The first forwarding is of the value of R1 from EXadd to EXlw. The second forwarding is also of the value of R1, from MEMadd to EXsw. The third forwarding is of the value of R4 from MEMlw to MEMsw. Observe that the SW instruction stores the value of R4 into the memory location computed by adding the displacement 12 to the value contained in register R1. This effective address computation is done in the ALU during the EX stage of the SW instruction. The value to be stored (R4 in this case) is needed only in the MEM stage, as an input to data memory. Thus the value of R1 is forwarded to the EX stage for effective address computation and is needed earlier in time than the value of R4, which is forwarded to the input of data memory in the MEM stage. So forwarding takes place from ''left to right'' in time, but operands are not always forwarded to the EX stage; it depends on the instruction and the point in the datapath where the operand is needed. Of course, hardware support is necessary to support data forwarding.

3. DELAY SLOTS OPERATION

To avoid the data dependency at the execution of dependent instructions, we use delay slots. Delay slots can be filled with nop (no operation) instructions, but an optimizing compiler gives better pipeline performance, because the compiler will employ independent instructions to fill the delay slots. Branch delay slots are the instruction slots immediately following a branch instruction; they can be filled in three ways, as explained in the following cases (a small scheduling sketch for case (a) appears after the list):

(a) From before the branch: always helpful when possible. The ADD below does not affect the branch condition, so it can be moved into the delay slot:

    before:  ADD  R1, R2, R3        after:  BEQZ R2, L1
             BEQZ R2, L1                    ADD  R1, R2, R3   ; delay slot
             <delay slot>

If the ADD instruction were ADD R2, R1, R3, the move would not be possible, because the branch reads R2.

(b) From the target: helps when the branch is taken, but may duplicate instructions. The instruction at the branch target (a SUB writing R4 in the example) is copied into the delay slot rather than moved, because the target can also be reached along another path, and the copy is only legal if its operands (R5 and R6) are not changed before the branch. Instructions between the branch and the SUB in the fall-through path must not use R4.

(c) From fall through: helps when the branch is not taken. Instructions at the target (L1 and after) must not use R4 until it is set again.
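As referenced above, the following is a minimal sketch of how a compiler pass might fill a branch delay slot "from before the branch" (case (a)). The instruction representation, the fill_delay_slot helper, and the dependence check are simplified illustrative assumptions, not the paper's implementation or WinMIPS64 behaviour.

    # Minimal sketch of delay-slot filling "from before the branch".
    # Instructions are modelled as (opcode, destination, sources) tuples;
    # the dependence check is deliberately simplified (it only verifies
    # that the branch condition does not read the candidate's result).

    def fill_delay_slot(block):
        """`block` ends with a branch followed by a nop in the delay slot."""
        branch, slot = block[-2], block[-1]
        assert slot[0] == "nop"
        branch_sources = branch[2]
        # Scan backwards for an instruction whose result the branch does not read.
        for i in range(len(block) - 3, -1, -1):
            _, dest, _ = block[i]
            if dest not in branch_sources:
                # Safe to move: drop the nop and place the instruction after the branch.
                return block[:i] + block[i + 1:-1] + [block[i]]
        return block        # nothing movable: the nop stays

    code = [
        ("add",  "R1", ("R2", "R3")),
        ("beqz", None, ("R2",)),     # branch tests R2, not R1
        ("nop",  None, ()),          # empty delay slot
    ]
    print(fill_delay_slot(code))
    # -> the ADD is moved into the delay slot. If the ADD were
    #    ("add", "R2", ("R1", "R3")), the move would not be possible,
    #    because the branch reads R2.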

3.1 Cancelling branches

A cancelling (or annulling) branch includes the predicted direction in the instruction itself: the compiler predicts the branch direction and encodes it in the branch. If the branch is mispredicted, the instruction in the delay slot is cancelled ("annulled") and behaves as a nop. This gives the compiler greater flexibility, and more leverage, in selecting instructions to fill delay slots.

3.2 Limitations of delayed branches

The compiler may not find appropriate instructions to fill the delay slots, in which case it fills them with nops. The delay slot is also a visible architectural feature that is likely to change with new implementations: the pipeline structure is exposed to the compiler, which needs to know how many delay slots there are, and additional PC state must be kept to handle interrupts.

3.3 Compiler effectiveness for a single branch delay slot

The compiler fills about 60% of branch delay slots, and about 80% of the instructions executed in branch delay slots are useful in the computation, so about 50% (60% x 80%) of slots are usefully filled. The downside of delayed branches is that with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), the branch delay grows beyond what a single delay slot can cover.

4. COMBINING THE FORWARDING AND DELAY SLOTS OPERATIONS TO AVOID DATA HAZARDS

In this research a new technique is used: the forwarding and delay slots operations are combined, and this is achieved automatically by the compiler. In our experimental work we used WinMIPS64 to obtain the results. We have used the I/Q instruction as a model example to study the effect of these operations on data hazards.
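Before looking at the numbers, the small helper below shows how the CPI and speedup figures quoted in the next section follow from the cycle and instruction counts reported by the simulator. The counts are the ones reported below in Tables (6) and (7); the helper itself is only an illustrative sketch, not part of the experimental setup.

    # Minimal sketch: turning simulator cycle/instruction counts into the
    # CPI and speedup figures used in the results section.

    def cpi(cycles, instructions):
        return cycles / instructions

    runs = {
        "forwarding + delay slots": (40, 35),   # Table (6)
        "delay slots only":         (59, 35),   # Table (7)
    }

    for name, (cycles, instrs) in runs.items():
        print(f"{name}: CPI = {cpi(cycles, instrs):.3f}")
    # forwarding + delay slots: CPI = 1.143
    # delay slots only:         CPI = 1.686

    speedup = runs["delay slots only"][0] / runs["forwarding + delay slots"][0]
    print(f"speedup of the combined technique: {speedup:.2f}x")   # ~1.48x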

5. RESULTS

In this paper we have obtained important results that support the goal of this research. There are four types of data hazard: WAR (Write After Read), WAW (Write After Write), RAW (Read After Write), and RAR (Read After Read); RAR does not cause a problem. The following two tables show the results.

Table (6) shows the results produced from enabling both forwarding and delay slots.

Table (6): Results produced from enabling forwarding and delay slots
  Configuration:  forwarding enabled, delay slots enabled
  Execution:      40 cycles, 35 instructions, 1.143 cycles per instruction (CPI)
  Data hazards:   0 WAR, 0 WAW, 1 RAW
  Code size:      140 bytes

Table (7) shows the results produced from enabling delay slots only.

Table (7): Results produced from enabling delay slots only
  Configuration:  delay slots enabled, forwarding disabled
  Execution:      59 cycles, 35 instructions, 1.686 cycles per instruction (CPI)
  Data hazards:   0 WAR, 0 WAW, 20 RAW
  Code size:      140 bytes

From the above two tables of results, we note that when forwarding and delay slots are used together, the results are much better than when only delay slots are used. These results may change slightly for another model program, but they show that our new mechanism of combining the forwarding and delay slots operations is the best solution, as it prevents data dependency from degrading performance.

6. CONCLUSION

The forwarding and delay slots operations were used together to avoid the occurrence of data hazards caused by data dependences between the instructions of a program. This technique produced excellent results, as the occurrence of data hazards is almost completely avoided (only a single RAW hazard remains, as shown in Table (6)).

REFERENCES

[1] Mahesh Neupane, "Hazards in Pipelining."
[2] Paul Chow and Mark Horowitz, "Architectural Tradeoffs in the Design of MIPS-X."
[3] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.
[4] Arvo Toomsalu, "Microprocessor Systems Architecture."