Limiting the Data Hazards by Combining the Forwarding with Delay Slots Operations to Improve Dynamic Branch Prediction in Superscalar Processor

S.A.Hadoud and A.M.Mosbah
Azzaituna University, Tarhuna Libya
drsaiedhadood@yahoo.com Ali.amary81@yahoo.com
ISBN: 9780989130547 2014 SDIWC 180

ABSTRACT

Modern microprocessor performance has been significantly increased by the exploitation of instruction level parallelism (ILP) [1]. Continued improvement is limited by pipeline hazards, which are due to data dependencies between instructions in sequential programs. Previous studies proposed operations such as delay slots and forwarding, applied separately, to avoid data dependencies, for example Architectural Tradeoffs in the Design of MIPS-X [2] and rewriting executable files to measure program behavior [3]. In this paper we introduce a new mechanism that combines these operations to avoid the data hazards that degrade ILP performance.

KEYWORDS

ILP   Instruction Level Parallelism
ISA   Instruction Set Architecture
CPI   Clock Cycles Per Instruction
RAW   Read After Write
WAR   Write After Read
WAW   Write After Write
BTB   Branch Target Buffer
BHT   Branch History Table
BHR   Branch History Register
LHR   Local History Register
GHR   Global History Register
PHT   Pattern History Table
2bc   Two-Bit Counter
EHB   Elastic History Buffer
DHLF  Dynamic History Length Fitting

1 INTRODUCTION

Data hazards arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline [1] [3] [4], causing the pipeline to stall until the result is available. A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands [1], so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.

Consider the pipelined execution of the instructions in Table (1):

Table (1): Instructions execution sequence

Clock cycle        1    2    3      4      5      6      7    8    9
ADD R1,R2,R3       IF   ID   EX     MEM    WB
SUB R4,R5,R1            IF   IDsub  EX     MEM    WB
AND R6,R1,R7                 IF     IDand  EX     MEM    WB
OR  R8,R1,R9                        IF     IDor   EX     MEM  WB
XOR R10,R1,R11                             IF     IDxor  EX   MEM  WB

All the instructions after the ADD use the result that the ADD instruction writes to R1. The ADD instruction writes the value of R1 in the WB stage (cycle 5), and the SUB instruction reads the value during its ID stage (IDsub, cycle 3). This problem is called a data hazard.
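As a rough illustration of the timing in Table (1), the following sketch computes the cycle in which each dependent instruction reads R1 in its ID stage and compares it with the cycle in which ADD writes R1 in WB. The names `stage_cycle` and `read_vs_write` are hypothetical, and the model assumes one instruction entering IF per cycle with no stalls; it is not taken from the paper's tooling.

```python
# Stage order of the five-stage pipeline used in Table (1).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(issue_index, stage):
    """Cycle in which the instruction at 0-based position issue_index
    occupies the given stage (assumption: no stalls, one issue per cycle)."""
    return issue_index + 1 + STAGES.index(stage)

def read_vs_write(consumers, producer_index=0):
    """Map each consumer name to (ID cycle, relation to the producer's WB cycle)."""
    wb = stage_cycle(producer_index, "WB")
    report = {}
    for index, name in consumers:
        rd = stage_cycle(index, "ID")
        if rd < wb:
            relation = "before write"
        elif rd == wb:
            relation = "same cycle"
        else:
            relation = "after write"
        report[name] = (rd, relation)
    return report

# Positions of the instructions that read R1 in Table (1); ADD is position 0.
report = read_vs_write([(1, "SUB"), (2, "AND"), (3, "OR"), (4, "XOR")])
for name, (cycle, relation) in report.items():
    print(f"{name}: reads R1 in cycle {cycle} ({relation})")
```

Reads that fall before the write are the ones exposed to the hazard; the read that shares a cycle with the write depends on how the register file orders the two accesses within that cycle.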
Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it. The AND instruction is also affected by this data hazard. The write of R1 does not complete until cycle 5 (the WB stage of ADD). Thus, the AND instruction, which reads the registers during cycle 4 (IDand), will receive the wrong value.

The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: register file writes are performed in the first half of the cycle and reads in the second half. Because WB for ADD and IDor for OR both occur in cycle 5, the write to the register file by ADD is performed in the first half of the cycle and the read of the registers by OR in the second half. The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.

2 FORWARDING

The data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding, as shown in Table (2).

Table (2): Forwarding execution sequence

Clock cycle        1    2    3      4      5      6      7
ADD R1,R2,R3       IF   ID   EX     MEM    WB
SUB R4,R5,R1            IF   IDsub  EX     MEM    WB
AND R6,R1,R7                 IF     IDand  EX     MEM    WB

The key insight in forwarding is that the result is not really needed by SUB until after the ADD actually produces it. The only problem is to make it available to SUB when SUB needs it. If the result can be moved from where the ADD produces it (the EX/MEM register) to where the SUB needs it (the ALU input latch), then the need for a stall can be avoided. The ALU result from the EX/MEM register is always fed back to the ALU input latches. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source of the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
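The selection made by the forwarding control logic can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware described in the paper: the `Latch` class and `select_alu_input` function are hypothetical names, each latch is reduced to a destination register and a value, and the MEM/WB latch is included as a second forwarding source with lower priority than EX/MEM.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Latch:
    """Hypothetical pipeline latch: the destination register of the
    instruction it holds (None if it writes no register) and its result."""
    dest: Optional[str]
    value: int

def select_alu_input(src: str, regfile: Dict[str, int],
                     ex_mem: Latch, mem_wb: Latch) -> int:
    """Pick the value for one ALU source operand."""
    # The newest result wins: the instruction just ahead sits in EX/MEM.
    if ex_mem.dest == src:
        return ex_mem.value
    # Otherwise take a result that is one stage older, if it matches.
    if mem_wb.dest == src:
        return mem_wb.value
    # No pending writer: the register file value is already correct.
    return regfile[src]

# SUB R4,R5,R1 is in EX while ADD R1,R2,R3 has just left EX with result 7.
regfile = {"R1": 0, "R5": 10}          # R1 in the register file is stale
ex_mem = Latch(dest="R1", value=7)     # result of the ADD
mem_wb = Latch(dest=None, value=0)     # nothing relevant further ahead
print(select_alu_input("R5", regfile, ex_mem, mem_wb))
print(select_alu_input("R1", regfile, ex_mem, mem_wb))
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register, since the later writer holds the value that a sequential execution would see.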
Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs. The paths correspond to a forwarding of:
a) the ALU output at the end of EX,
b) the ALU output at the end of MEM, and
c) the memory output at the end of MEM.

Without forwarding, our example executes correctly only by stalling, as shown in Table (3).

Table (3): Execution sequence without forwarding

Clock cycle        1    2    3      4      5      6      7    8    9
ADD R1,R2,R3       IF   ID   EX     MEM    WB
SUB R4,R5,R1            IF   stall  stall  IDsub  EX     MEM  WB
AND R6,R1,R7                 stall  stall  IF     IDand  EX   MEM  WB

As our example shows, we need to forward results not only from the immediately previous instruction, but possibly from an instruction that started three cycles earlier. Forwarding can also be arranged from the MEM/WB latch to the ALU input. Using those forwarding paths, the code sequence can be executed without stalls, as shown in Table (4).

Table (4): Execution sequence using forwarding

Clock cycle        1    2    3      4       5      6    7
ADD R1,R2,R3       IF   ID   EXadd  MEMadd  WB
SUB R4,R5,R1            IF   ID     EXsub   MEM    WB
AND R6,R1,R7                 IF     ID      EXand  MEM  WB
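The ID-stage timing of Tables (3) and (4) can be reproduced with a small model. This is only a sketch under stated assumptions: the function name `id_cycles` is hypothetical, every instruction is a register-to-register ALU operation, the register file writes in the first half of a cycle and reads in the second half (so an ID read may share a cycle with the producer's WB), and with forwarding enabled an ALU-to-ALU dependence never stalls.

```python
def id_cycles(program, forwarding):
    """Return the cycle of the ID stage for each instruction of an in-order
    five-stage pipeline. program is a list of (dest, [sources]) tuples."""
    cycles = []
    wb_cycle = {}   # register -> cycle in which its latest producer is in WB
    prev_id = 1     # the first instruction is in IF in cycle 1
    for dest, sources in program:
        this_id = prev_id + 1            # earliest slot behind the previous instruction
        if not forwarding:
            for src in sources:
                if src in wb_cycle:
                    # Wait in place until the producer reaches WB.
                    this_id = max(this_id, wb_cycle[src])
        cycles.append(this_id)
        wb_cycle[dest] = this_id + 3     # ID -> EX -> MEM -> WB
        prev_id = this_id
    return cycles

program = [
    ("R1", ["R2", "R3"]),   # ADD R1,R2,R3
    ("R4", ["R5", "R1"]),   # SUB R4,R5,R1
    ("R6", ["R1", "R7"]),   # AND R6,R1,R7
]
without_fwd = id_cycles(program, forwarding=False)
with_fwd = id_cycles(program, forwarding=True)
print("ID cycles without forwarding:", without_fwd)
print("ID cycles with forwarding:   ", with_fwd)
print("extra cycles caused by stalls:", without_fwd[-1] - with_fwd[-1])
```

The model deliberately ignores loads, whose results become available one stage later than ALU results, so it should not be read as a general pipeline simulator.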
In Table (4), the first forwarding is for the value of R1 from EXadd to EXsub. The second forwarding is also for the value of R1, from MEMadd to EXand. This code can now be executed without stalls. Forwarding can be generalized to include passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.

One more example: consider the instructions shown in Table (5).

Table (5): Execution sequence using forwarding

Clock cycle        1    2    3      4       5      6      7
ADD R1,R2,R3       IF   ID   EXadd  MEMadd  WB
LW  R4,d(R1)            IF   ID     EXlw    MEMlw  WB
SW  R4,12(R1)                IF     ID      EXsw   MEMsw  WB

Stores require an operand during MEM, and forwarding of that operand is shown here. The first forwarding is for the value of R1 from EXadd to EXlw. The second forwarding is also for the value of R1, from MEMadd to EXsw. The third forwarding is for the value of R4 from MEMlw to MEMsw.

Observe that the SW instruction stores the value of R4 into the memory location computed by adding the displacement 12 to the value contained in register R1. This effective address computation is done in the ALU during the EX stage of the SW instruction. The value to be stored (R4 in this case) is needed only in the MEM stage, as an input to Data Memory. Thus the value of R1 is forwarded to the EX stage for the effective address computation and is needed earlier in time than the value of R4, which is forwarded to the input of Data Memory in the MEM stage. So forwarding takes place from ''left to right'' in time, but operands are not ALWAYS forwarded to the EX stage; it depends on the instruction and the point in the datapath where the operand is needed. Of course, hardware support is necessary for data forwarding.

3 DELAY SLOTS OPERATION

To avoid the data dependency during the execution of dependent instructions we use delay slots. Delay slots can be filled with nop (no operation) instructions, but an optimizing compiler improves pipeline performance, because it fills the delay slots with independent instructions.

Branch delay slots are the inline instructions that follow a branch instruction, as explained in the following example:

(a) From before the branch: always helpful when possible.

Before scheduling:        After scheduling:
ADD  R1, R2, R3           BEQZ R2, L1
BEQZ R2, L1               ADD  R1, R2, R3   (in the delay slot)
(delay slot)

If the ADD instruction were ADD R2, R1, R3, the move would not be possible, because the branch condition reads R2.

(b) From the target: helps when the branch is taken. May duplicate instructions.

[Code example: branch BEQZ R2, L2 with labels L1 and L2]

Instructions between BEQ and SUB (in the fall-through path) must not use R4. Why is the instruction at L1 duplicated? What if R5 or R6 changed?

(c) From fall-through: helps when the branch is not taken.
Instructions at the target (L1 and after) must not use R4 until it is set again.

3.1 Cancelling branch

The branch instruction indicates the direction of the prediction. If the branch is mispredicted, the instruction in the delay slot is cancelled. This gives the compiler greater flexibility to schedule instructions:
- The compiler predicts the branch direction and encodes it in the instruction itself.
- A mispredicted branch ''annuls'' the instruction in the delay slot.
- The annulled instruction behaves as a no-op, which gives the compiler more leverage in selecting instructions to fill delay slots.

3.2 Limitation of delayed branch

- The compiler may not find appropriate instructions to fill the delay slots; it then fills them with no-ops.
- The delay slot is a visible architectural feature that is likely to change with new implementations.
- The pipeline structure is exposed to the compiler, which needs to know how many delay slots there are.
- Additional PC data must be kept to handle interrupts.

3.3 Compiler effectiveness for single branch delay slot

- Fills about 60% of branch delay slots.
- About 80% of the instructions executed in branch delay slots are useful in computation.
- About 50% (60% × 80%) of slots are usefully filled.
- Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar).

4 COMBINING THE FORWARDING AND DELAY SLOTS OPERATIONS TO AVOID DATA HAZARDS

In this research a new technique is used: the forwarding and delay slots operations are combined. This is achieved automatically by the compiler. In our experimental work we used WinMIPS64 to obtain the results. We have used I/Q instruction as a model example to study the effect of these operations on data hazards.

5 RESULTS

In this paper we obtained results that bear directly on the goal of this research. There are four types of data hazard:
WAR: Write After Read
WAW: Write After Write
RAW: Read After Write
RAR: Read After Read (does not cause a problem)

The following two tables show these results. Table (6) shows the results produced by enabling both forwarding and delay slots.

Table (6): Results produced by enabling forwarding and delay slots

Configure                Execution                            Data hazard   Code size
Enable forwarding        40 cycles                            0 WAR         140 bytes
Enable delay slots       35 instructions                      0 WAW
                         1.143 cycles per instruction (CPI)   1 RAW
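The CPI figures in the results tables are the ratio of executed cycles to executed instructions. As a minimal sketch, the helper below (the name `cpi` is hypothetical) recomputes that ratio from the cycle and instruction counts reported for the two configurations: 40 cycles and 35 instructions with forwarding and delay slots enabled, and 59 cycles and 35 instructions with delay slots only.

```python
def cpi(cycles, instructions):
    """Cycles per instruction for a completed run."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions

# Counts as reported in Tables (6) and (7).
both = cpi(40, 35)         # forwarding and delay slots enabled
delay_only = cpi(59, 35)   # delay slots only

print(f"forwarding + delay slots: {both:.3f} CPI")
print(f"delay slots only:         {delay_only:.3f} CPI")
print(f"cycles saved:             {59 - 40}")
```

Because both runs execute the same 35 instructions, the difference in CPI comes entirely from the 19 cycles saved.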
Table (7) shows the results produced by enabling delay slots only.

Table (7): Results produced by enabling delay slots

Configure                Execution                            Data hazard   Code size
Enable delay slots       59 cycles                            0 WAR         140 bytes
                         35 instructions                      0 WAW
                         1.686 cycles per instruction (CPI)   20 RAW

From the above two tables of results, we note that using forwarding and delay slots together gives a much better result than using delay slots alone: the cycle count drops from 59 to 40 and the number of RAW hazards from 20 to 1. These results may change slightly with another model program, but they show that our new mechanism of combining the forwarding and delay slots operations is an effective solution, as it removes almost all of the stalls caused by data dependency.

6 CONCLUSION

The forwarding and delay slots operations were used together to avoid the data hazards that occur due to data dependency between the instructions of a program. This technique produced excellent results, as the occurrence of data hazards is almost completely avoided.

REFERENCES

[1] Mahesh Neupane. "Hazards in Pipelining".
[2] Paul Chow and Mark Horowitz. "Architectural Tradeoffs in the Design of MIPS-X".
[3] John L. Hennessy and David A. Patterson. "Computer Architecture: A Quantitative Approach". 4th Edition, 2006.
[4] Arvo Toomsalu. "Microprocessor Systems Architecture".