Byungin Moon Yonsei University
Outline
- What is pipelining?
- Performance advantage of pipelining
- Pipeline depth
- Interlocking
  - Due to resource contention
  - Due to data dependency
- Branching effects
- Interrupt effects
- Pipeline programming models
  - Time-stationary
  - Data-stationary
Definition
- A technique for increasing the performance of a processor (or other electronic system) by breaking a sequence of operations into smaller pieces and executing these pieces in parallel when possible
- Used in almost all current DSP processors
Strength
- Decreases the overall time required to complete a set of operations
Weakness
- Complicates programming
  - Execution time of a specific instruction sequence can vary from case to case
  - Certain instruction sequences must be avoided for correct program operation
- Represents a trade-off between efficiency and ease of use
Illustration of How Pipelining Increases Performance on a Hypothetical Processor
- A hypothetical processor (resembling the TI TMS320C3x) uses separate execution units to accomplish the following actions for a single instruction; each stage takes 20 ns to execute
  - Fetch an instruction word from memory
  - Decode the instruction
  - Read or write a data operand from or to memory
  - Execute the ALU or MAC portion of the instruction
- Nonpipelined
  - The four stages are executed sequentially
  - Execution time of 80 ns per instruction
  - Each stage is idle 75 % of the time
- Pipelined
  - The four stages of execution are overlapped
  - Executes a new instruction every 20 ns
  - An instruction appears to the programmer to execute in one instruction cycle
  - Instructions appear to execute sequentially
Performance Comparison (Nonpipelined vs. Pipelined)
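The 80 ns vs. 20 ns figures above can be checked with a short sketch. This is a minimal Python model of the hypothetical four-stage, 20 ns-per-stage processor; the function names are illustrative, not from any real toolchain:

```python
# Hypothetical 4-stage pipeline, 20 ns per stage (fetch, decode, access, execute).
STAGE_NS = 20
STAGES = 4

def nonpipelined_time(n_instructions):
    # Each instruction occupies all four stages sequentially: 80 ns each.
    return n_instructions * STAGES * STAGE_NS

def pipelined_time(n_instructions):
    # The first instruction takes four stages to fill the pipeline; every
    # later instruction completes one cycle (20 ns) after the previous one.
    return (STAGES + n_instructions - 1) * STAGE_NS

print(nonpipelined_time(100))  # 8000 ns
print(pipelined_time(100))     # 2060 ns
```

For long instruction streams the pipelined time approaches one instruction per 20 ns cycle, which is why an instruction appears to the programmer to execute in a single cycle.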
Pipeline Depth
- Number of pipeline stages
  - Varies from one processor to another
- A deeper pipeline
  - Allows the processor to execute faster
  - But makes the processor harder to program
- Most processors use three or four stages
  - Three-stage pipeline
    - Instruction fetch, decode, and execute
    - Operand fetch is typically done in the latter part of the decode stage
  - Four-stage pipeline
    - Instruction fetch, decode, operand fetch, and execute
  - Others: Analog Devices processors (two stages) and the TI TMS320C54x (five stages)
What is Interlocking? Resource Contention
- Pipelined processors may not perform as well as shown in the hypothetical example, mainly due to resource contention (conflict)
- Example
  - Suppose it takes two instruction cycles to write to memory (as on AT&T DSP16xx processors)
  - Instruction I2 attempts to write to memory and I3 needs to read from memory
  - The second cycle of I2's data write phase conflicts with I3's data read
- Solution to resource contention -> interlocking
- Interlocking
  - Delays the progression of the latter of the conflicting instructions through the pipeline
Example of Pipeline Resource Contention and Interlocking to Resolve Resource Contention
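The interlock above can be sketched as a toy scheduler. Assuming, as in the DSP16xx example, that a memory write occupies the memory port for two cycles while a read takes one, this hypothetical model computes when each instruction's memory stage can start (names and structure are illustrative):

```python
# Toy model of interlocking: a write holds the memory port for two cycles,
# so a following read is delayed until the port is free.
def schedule_memory_stage(ops):
    """ops: list of 'write', 'read', or None per instruction, in program order.
    Returns the cycle at which each instruction's memory stage starts."""
    start_cycles = []
    port_free_at = 0              # first cycle the memory port is available
    for i, op in enumerate(ops):
        cycle = i                 # without contention, instr i reaches memory at cycle i
        if op is not None:
            cycle = max(cycle, port_free_at)   # interlock: wait for the port
            port_free_at = cycle + (2 if op == 'write' else 1)
        start_cycles.append(cycle)
    return start_cycles

# I1: no memory access, I2: two-cycle write, I3: read.
# I3 is held back one cycle by the interlock.
print(schedule_memory_stage([None, 'write', 'read']))  # [0, 1, 3]
```

If I3 did not access data memory (op `None`), the schedule would be [0, 1, 2] and no interlock would occur, matching the observation on the next slide that interlocks depend on the surrounding instructions.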
Complicated Programming on Interlocking Pipelined Processors
- There are a number of interlocking sources
  - For example, in processors supporting instructions with long immediate data
    - An instruction with long immediate data requires an additional program memory fetch to get the immediate data
    - This long immediate data fetch conflicts with the fetch of the next instruction -> resulting in interlocking
- It is not easy to spot interlocks by reading the program code
  - Whether the pipeline interlocks depends on the instructions that surround a given instruction
  - For example, if instruction I3 in the previous example did not need to read from data memory, there would be no conflict, and no interlock would occur
Data Hazard: Another Interlocking Source
- Example from the Motorola DSP5600x
  - Makes little use of interlocking
  - Uses a three-stage pipeline
    - Fetch
    - Decode: addresses used in data accesses are formed
    - Execute: ALU operation, data accesses, register loads
- Example code (R0 contains the hexadecimal value 5678 before execution)
      MOVE #$1234, R0
      MOVE X:(R0), X0
- Seemingly
  - The above instructions move the value stored at X memory address 1234 into register X0
- Actually
  - The above instructions move the value stored at X memory address 5678
  - This is because of a pipeline hazard resulting from data dependency
A Motorola DSP5600x Pipeline Hazard
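The hazard can be made concrete with a toy three-stage model. The key point is that the MOVE to X0 forms its address in the decode stage during the same cycle in which the MOVE-immediate to R0 is still in the execute stage, so the address is formed from the old R0 value. This sketch models that one-cycle write latency; the operation names are invented for illustration:

```python
# Toy model of the DSP5600x hazard: address registers are read in decode,
# but a register write only commits at the end of the execute stage,
# one cycle later, so the very next instruction sees the old value.
def run(program, r0_initial=0x5678):
    committed_r0 = r0_initial   # value visible to the decode stage
    pending = None              # write still travelling through execute
    addresses = []
    for op, arg in program:
        if op == 'read':                  # address formed in decode
            addresses.append(committed_r0)  # sees the OLD value of R0
        if pending is not None:           # previous write commits now,
            committed_r0 = pending        # too late for this instruction
            pending = None
        if op == 'write':                 # e.g. MOVE #$1234, R0
            pending = arg
    return addresses

# MOVE #$1234, R0 ; MOVE X:(R0), X0  with R0 = $5678 beforehand:
print([hex(a) for a in run([('write', 0x1234), ('read', None)])])  # ['0x5678']
```

Adding a second read after the pair would see 0x1234: the hazard affects only the instruction immediately following the register write, which is why a single intervening instruction (or a NOP) avoids it.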
Data Hazard: Another Interlocking Source
- Interlocking to protect the programmer from the hazard
  - Used by the TI TMS320C3x, TMS320C4x, and TMS320C5x processors
  - The TMS320C3x detects writes to any of its address registers and holds up the progression through the pipeline of other instructions that use any address register until the write has completed
- Trade-off made by heavily interlocked processors
  - Saves the programmer from worrying about whether certain instruction sequences will produce correct output
  - Allows the programmer to write slower-than-optimal code, without even realizing it
Interlocking to Solve the Pipeline Hazard (from the TI TMS320C3x)
- The LDI (load immediate) instruction loads a value into an address register
- MPYF (floating-point multiply) uses register-indirect addressing to fetch one of its operands
Branching Effects
- Control dependency from branches
  - When a branch instruction reaches the decode stage in the pipeline and the processor realizes that it must begin executing at a new address, the next sequential instruction word has already been fetched and is in the pipeline
  - After the processor recognizes a branch instruction, it does not know where the next instruction is located until the branch is resolved
- One solution: multicycle branch
  - Discard, or flush, the unwanted instruction
  - And cease fetching new instructions until the branch is resolved
  - Results in some wasted cycles
  - Some processors use tricks to execute the branch late in the decode phase, saving one instruction cycle
  - Almost all DSP processors use multicycle branches
Branching Effects
- Alternative to the multicycle branch: the delayed branch
  - Several instructions following the branch are executed normally
        BRD NEW_ADDR
        INST2    ; INST2 to INST4
        INST3    ; are executed before
        INST4    ; the branch occurs
  - Instructions that will be executed before the branch takes effect must be located in memory after the branch
  - The branch appears to be delayed in its effect by several instruction cycles
  - Used by the TMS320C3x, TMS320C4x, TMS320C5x, ADSP-2100x, DSP32C, DSP32xx, and ZR3800x
- Trade-offs of multicycle and delayed branches
  - Ease of programming vs. efficiency, as with interlocking
  - In the worst case, the programmer can always place NOP instructions after a delayed branch
- Branch effects occur whenever there is a change in program flow
  - Subroutine call instructions, subroutine return instructions, and return-from-interrupt instructions
Multicycle Branch vs. Delayed Branch
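The cycle-count difference between the two schemes can be sketched with some simple arithmetic. This toy model assumes a three-cycle branch penalty (an illustrative number, not that of any specific processor) and counts the cycles needed to execute a branch plus n useful instructions:

```python
# Illustrative branch-cost model: 3 pipeline slots after the branch
# before the target instruction can issue.
BRANCH_PENALTY = 3

def multicycle_cycles(n_useful):
    # The fetched instructions after the branch are flushed;
    # all penalty slots are wasted.
    return 1 + BRANCH_PENALTY + n_useful

def delayed_cycles(n_useful, filled_slots):
    # filled_slots instructions from before the branch are moved into the
    # delay slots and do useful work there; unfilled slots hold NOPs.
    filled_slots = min(filled_slots, BRANCH_PENALTY, n_useful)
    return 1 + BRANCH_PENALTY + (n_useful - filled_slots)

print(multicycle_cycles(10))   # 14
print(delayed_cycles(10, 3))   # 11 -> all delay slots do useful work
print(delayed_cycles(10, 0))   # 14 -> only NOPs fit: same cost as multicycle
```

This makes the trade-off explicit: a delayed branch only wins when the programmer (or compiler) can move useful instructions into the delay slots, and degenerates to the multicycle case when the slots must be filled with NOPs.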
Interrupt Effects
- Interrupts have effects on the pipeline similar to those of branches
  - Interrupts typically involve a change in the flow of control to branch to the interrupt service routine
- The pipeline often increases the processor's interrupt response time, much as it slows down branch execution
- When an interrupt occurs
  - Almost all processors allow instructions at the decode stage or further in the pipeline to finish executing, because these instructions may be partially executed
  - What occurs past this point varies from processor to processor
Example from the TI TMS320C5x
- One cycle after the interrupt is recognized, the processor inserts an INTR instruction into the pipeline
  - INTR is a special branch instruction that causes the processor to begin execution at the appropriate interrupt vector
- Causes a four-instruction delay before the first word of the interrupt vector executes
Normal Interrupts on the Motorola DSP5600x
- The DSP5600x does not use an INTR instruction
  - It simply begins fetching from the vector location after the interrupt is recognized
- At most two words are fetched starting at this address
- If one of the two words is a subroutine call, the processor flushes the previously fetched instruction and then branches to the long interrupt service routine
Fast Interrupts on the Motorola DSP5600x
- The same as normal interrupts, except
  - Neither of the two words starting at the interrupt vector is a subroutine call
  - The processor executes the two words and continues executing from the original program
Pipeline Programming Models
- Two major assembly code formats for pipelined processors
- Time-stationary
  - The processor's instructions specify the actions to be performed by the execution units during a single instruction cycle (example from the AT&T DSP16xx)
        a0=a0+p p=x*y x=*r0++ y=*pt++
  - Each portion of the instruction operates on separate operands
  - Related to operand-unrelated parallel moves
  - More flexible
- Data-stationary
  - Specifies the operations that are to be performed, but not the exact timing of when the actions are executed (example from the AT&T DSP32xx)
        a1 = a1 + (*r5++ = *r4++) * *r3++
  - Related to operand-related parallel moves
  - Easier to read
Two Basic Control Schemes for Pipelined Data Paths
- Data-stationary
  - Passes a control function code along with the data
  - Allows simple and straightforward design of both the state sequencer and the data path control circuits for each stage
  - Requires more layout area
- Time-stationary
  - Provides the control signals for the entire pipeline from a single source external to the pipeline
  - The central controller governs the entire state of the machine at each time unit
  - More complex design
    - Must remember the current pipe state and provide appropriate control signals for each pipe stage
Two Basic Control Schemes for Pipelined Data Paths: Data-stationary and Time-stationary