3 INSTRUCTION FLOW


Overview

The TigerSHARC is a pipelined RISC-like machine, in which instructions are read from memory into the instruction alignment buffer in quad words. Instruction lines (consisting of one to four instructions) are read from memory, decoded, and executed, a process taking eight cycles. To keep instruction throughput high, execution is pipelined, with a throughput of one instruction line every internal clock cycle.

The full flow cannot be analyzed as a single eight-stage pipeline, but rather as two sequential pipelines. The first is a three-stage fetch pipeline, and the second is a five-stage execution pipeline. The two pipelines are distinct in the following ways:

The fetch pipeline is quad word oriented (a quad word every cycle), while the instruction pipeline is instruction line oriented.

When the execution pipeline stalls, the fetch pipeline can continue because of the Instruction Alignment Buffer (IAB), which sits between them.

Figure 3-1. TigerSHARC Two Pipelines (fetch pipeline stages Fetch 1 through Fetch 3, the Instruction Alignment Buffer (IAB), and instruction pipeline stages Decode, Integer, Access, EX1, and EX2)

The first pipe stages are common to all instructions and are memory-access driven: Fetch1, Fetch2, and Fetch3, or in short F1, F2, and F3. The remaining pipe stages are instruction driven. The execution differs between the IALU, compute block, and sequencer (branch unit). The instruction-driven pipe stages are Decode, Integer, Operand Access, Execute1, and Execute2, or in short D, I, A, EX1, and EX2. The first three pipe stages are referred to as the fetch pipe and the last five as the instruction pipe.

The instructions in a single line are executed pseudo-simultaneously. When two instructions in the same line use the same register, one as operand and the other as result, the operand is determined as the value of the register prior to the execution of this line. For example:

Initial values: R0 = 2, R1 = 3, R2 = 3, R3 = 8
Instruction line: R2 = R0 + R1; R6 = R2 * R3 (I);;

R2 is modified by the first instruction, and the result is 5. Still, the second instruction sees the input in R2 as 3, and the result written to R6 is 24.

This rule is not guaranteed for store instructions. For example, suppose that with the same initial values we had:

Instruction line: R2 = R0 + R1; [address] = R2;;

The results are unpredictable and, furthermore, there is no indication of this event.

The pipeline creates complications because of the overlap between the execution time of instructions of different lines. For example, take a sequence of two instruction lines where the second uses the result of the first instruction line as an input operand. Because of the pipeline length, the result may not be ready when the second instruction fetches its operands. In such a case a stall is issued between the first and second instruction line. Since this may cause performance loss, the programmer or compiler should strive to create as few of these cases as possible. These combinations are legal, however, and the result will be correct. This type of problem is discussed in detail in "Stall" later in this chapter.
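For example, the following hedged sketch (registers arbitrary) shows a dependent pair of instruction lines and a reordering that hides the latency, assuming the one-cycle compute block bypass described in "Stall":

R2 = R0 + R1;;        /* result written at EX2 */
R6 = R2 * R3 (I);;    /* uses R2 from the previous line; one stall cycle inserted */

R2 = R0 + R1;;
R8 = R4 + R5;;        /* independent line fills the gap */
R6 = R2 * R3 (I);;    /* no stall; the bypass delivers R2 */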

All results are written into the target registers and status flags at pipe stage EX2. There are two exceptions to this rule:

1. External memory access, in which the delay is determined by the system.

2. MAC instructions, which write into the MR registers and sticky flags one cycle after EX2. This is important for retaining coherency in case of a pipeline break.

There are some special bypasses for the latter instructions' inputs, with the purpose of shortening or eliminating dependencies.

The following pipeline diagrams show the progression of instruction lines, where an instruction line may consist of one, two, three, or four instructions. Hence at any instant in time, there may be up to four instructions simultaneously executing in different units of the processor.

Fetch Pipe

The fetch cycles form the first pipeline and are tied to the memory accesses. Progress in this pipeline is memory-driven, not instruction-driven. The fetch unit fills the instruction alignment buffer (IAB) whenever the IAB holds less than three quad words. Since the execution units pull instructions at a throughput lower than or equal to the fetch throughput of four words every cycle, it is possible that the fetch unit fills the IAB faster than the execution units pull instructions out of it. The IAB can be filled with up to five quad words of instructions.

The instruction flow in this pipeline is very simple, as illustrated in Figure 3-2. At every cycle another quad word can be fetched from internal memory. After three cycles of latency, the instructions are available for the execution pipeline. When the fetch is from external memory, the flow is similar although much slower. The fetch throughput is one instruction every two SCLK cycles, and the latency depends on the system design (external memory pipeline depth, wait cycles, and so on).

Figure 3-2. Fetch and Switch (each quad word advances through Fetch1, Fetch2, and Fetch3 on successive CCLK cycles and is then available for the execution pipeline)

Instruction Alignment Buffer

The instruction alignment buffer (IAB) acts as a buffer between the fetch pipeline and the execution pipeline. The IAB is actually a five quad word FIFO, as shown in Figure 3-3. Whenever fetch data is read from memory (a full quad word), it is written into the next entry in the IAB. Whenever there is at least one full instruction line in the IAB, the sequencer can pull it for execution. In Figure 3-2, the instructions available for the execution pipeline are actually the instructions in the IAB.

The IAB ensures execution of an entire instruction line without inserting additional stall cycles or forcing quad word alignment on instruction lines. In this scheme, no memory is unused. For example, the instruction lines 1a; 2a; 3a;;, 4b; 5b;;, 6c; 7c; 8c; 9c;;, 10d; 11d;;, and 12e;; pack into memory with no padding:

0x10080:  1a  2a  3a  4b
0x10084:  5b  6c  7c  8c
0x10088:  9c 10d 11d 12e

The IAB is a FIFO that receives a quad word fetched from memory and outputs instruction lines into the execution pipeline.

Figure 3-3. Instruction Alignment Buffer (a 4 x 32-bit internal bus feeds a 3-entry FIFO of 128-bit entries and a 2-entry alignment buffer; an 8-to-1 alignment mux routes 32-bit instructions through per-unit 4-to-1 muxes to the sequencer, KALU, JALU, CBX1, CBX2, CBY1, and CBY2)

Branch Target Buffer (BTB)

The branch target buffer (BTB) is used to reduce the performance loss that results from branching in a deeply pipelined processor. The BTB is a 32-set, 4-way set-associative cache (a total of 128 entries) that stores branch target addresses and has a Least Recently Used (LRU) replacement policy. The BTB structure, as described in Figure 3-4, is active while the BTBEN bit in SQCTL is set.

Figure 3-4. BTB Organization (each set holds four entries; each entry holds an LRU field, a TARGET field, and a TAG field)

Every branch instruction whose prediction is "taken" may be written into the BTB. The PC of the instruction line is written into the BTB tag, and the target address is written into the BTB target field. If the jump is computed (by register), the target indicates which register to refer to.

The BTB examines the flow of addresses during pipeline stage Fetch 1. When the BTB recognizes the address of an instruction that caused a jump on a previous pass of the program code (a BTB hit), the BTB substitutes the corresponding destination address (from the target field) as the fetch address for the following instruction. As a result, when a branch is currently cached and correctly predicted, the performance loss due to branching is reduced from either six or three stall cycles to zero.

Only internal memory branches are cached in the BTB. The width of the cached target addresses is 22 bits.

The BTB stores only one tag entry per aligned quad word of program instructions and, consequently, only one branch may be predicted per aligned quad word. If a programmer requires that more than one adjacent branch be predicted, then one to three NOP instructions must be inserted between the branches to ensure that both branches do not fall into the same aligned quad word.

To avoid the possibility of placing more than one instruction containing a predicted branch within the same quad word boundary in memory and causing unexpected BTB behavior, this combination of instructions and placement causes an assembler warning. The assembler warns that it has detected two predicted jumps within instruction lines whose line endings are within four words of each other. Further, the assembler states that, depending on section alignment, this combination of predicted branch instructions and the instructions' placement in memory may violate the constraint that they cannot end in the same quad word.

It is useful to examine how different placements of words in memory result in different contents in the BTB. For example, the code in Listing 3-1 contains a predicted branch:

Listing 3-1. Predicted Branches, Aligned Quad Words, and the BTB

nop; nop; nop; nop;;
jump HERE; nop;;
nop; nop; nop; nop;;

In memory, each instruction occupies an address, and sets of four locations make up a quad word, as shown in Figure 3-5. The quad word address is the address of the first instruction in the quad word (for example, 0x0, 0x4, 0x8, and so on).

Figure 3-5. Instructions in Memory (instructions at addresses 0x0 through 0x3 form the first quad word; instructions at 0x4 through 0x7 form the second)

Depending on how the code in Listing 3-1 aligns in memory, quad word address 0x4 could contain:

nop; nop;; jump HERE; nop;;    /* quad word starts at 0x4; end of the instruction line containing the jump */

If so, the BTB entry for the branch would contain:

Tag = 0x4, Target Address = HERE

But the code in Listing 3-1 could align in memory differently. For example, this code could align such that quad word addresses 0x4 (first line) and 0x8 (second line) contain:

nop; nop; nop;; jump HERE;    /* quad word starts at 0x4 */
nop;; nop; nop; nop;          /* quad word starts at 0x8; also the end of the instruction line containing the jump */

If so, the BTB entry for the branch would contain:

Tag = 0x8, Target Address = HERE

If prediction is enabled, at the F1 stage of the pipeline the current PC is compared to the BTB tag values. If there is a match, the DSP modifies the PC to reflect the branch target address stored in the BTB, and the sequencer continues to fetch subsequent quad words at the modified PC. If there is no match, the DSP does not modify the PC, and the sequencer continues to fetch subsequent quad words at the unmodified PC.

When the same instruction reaches the Decode stage of the pipeline, the instruction is identified as a branch instruction. If there was a BTB match, no exceptional action is taken. The PC has already been modified, and the sequencer has already fetched from the branch target address. If there is no BTB match, the sequencer aborts the two instructions fetched prior to reaching the Decode stage (two stall cycles), and the DSP modifies the PC to reflect the branch target address and begins fetching quad words at the modified PC. The sequencer updates the BTB with the branch target address such that the next time the branch instruction is encountered, it is likely that there will be a BTB match.

The BTB contents vary with the instruction placement in memory, because:

The sequencer fetches instructions a full quad word at a time.

An instruction line may occupy less than a full quad word, occupy a full quad word, or span two quad words.

An instruction line may start at a location other than a quad word aligned address.

Because the BTB can store only a single branch target address for each aligned quad word of instruction code, it is important to examine coding techniques that work with this BTB feature.

The following code example produces unpredictable results in the hardware, because this code (depending on memory placement) may attempt to force the BTB to store multiple branch target addresses for a single aligned quad word:

jump FIRST_JUMP; LC1 = yr16;;
jump SECOND_JUMP; R29 = R27;;
/* Illegal. The line endings of the instruction lines containing jumps are within four instructions of each other. */

The situation can be remedied by using NOP instructions to force the branch instructions to exhibit at least four words of separation, as follows:

jump FIRST_JUMP; LC1 = yr16;;
jump SECOND_JUMP; R29 = R27; nop; nop;;
/* Adding NOPs as above shifts the line ending of the second instruction line. */

While adding these NOP instructions increases the size of the code, they do not affect the performance of the code.

Another way to control the relationship between the alignment of code within quad words and the BTB contents is to use the .align_code 4 assembler directive. This directive forces the immediately subsequent code to be quad word aligned, as follows:

jump FIRST_JUMP; LC1 = yr16;;
.align_code 4;    /* Forcing quad alignment shifts the line ending of the next instruction line. */
jump SECOND_JUMP; R29 = R27;;

If the BTB hit is a computed jump, the RETI or CJMP register is used (according to the instruction) as the target address. In this case, any change in this register's value before the jump takes place causes the TigerSHARC to abort the fetched instructions and repeat the flow as if there were no hit.

Whenever program overlays are used to swap program segments into and out of internal memory, the BTB must be cleared using the BTBINV instruction in order to invalidate its contents.

The BTBLK bit in SQCTL is used for program sections that require branches to be permanently buffered. While the BTBLK bit is set, the BTB puts every new entry into the BTB in LOCKED status. When this happens, the BTB entry is not replaced until the whole BTB is flushed, in order to keep performance-critical jumps in the BTB.

The BTB contents can be accessed directly for debug and diagnostic purposes only, and the BTB must be disabled prior to access by clearing the BTBEN bit in SQCTL. The BTB register groups are 0x30 to 0x37. You must access BTB contents only for testing. If you attempt this for functional work, you are responsible for preventing multi-hit and coherency problems.
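For example, a hedged sketch of an overlay swap (the copy routine and labels are hypothetical; only BTBINV is taken from this section):

call load_overlay;;    /* hypothetical routine that copies a new code segment into internal memory */
BTBINV;;               /* invalidate the BTB so stale targets from the old overlay are not used */
jump overlay_entry;;   /* branch into the newly loaded code */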

Decode

The decode cycle is the first stage in the instruction pipeline. In this cycle, the next full instruction line is extracted from the instruction alignment buffer and the different instructions are distributed to the execution units. The units are:

JALU or KALU: integer instructions, load/store, and register transfers.

Compute block X or Y or both: two instructions (the switching within the CB is done by the RF).

Sequencer: branch and condition instructions, and others.

The Instruction Alignment Buffer (IAB) also calculates the program counter of a sequential line and some of the non-sequential instructions. The switch does not perform any decoding.

IALU Pipeline

IALU instructions include address or data calculation and, optionally, memory access. Figure 3-6 shows the instruction flow in the IALU pipeline. The IALU instruction is decoded and the calculation is executed at the Decode stage. If the IALU instruction includes a memory access, the bus is requested at stage Integer. In this case, the memory access begins at pipe stage Access, as long as the bus is available for the IALU. The result of the address calculation is ready at the Integer stage.

Since the execution of the IALU instruction may be aborted (either because of a condition or because the execution is sometimes speculative), the operand is returned to the destination register only at the end of EX2. The result is passed through the pipeline, where it may be extracted by a new instruction should it be required as a source operand.

Dependency between IALU calculations normally does not cause any delay, but there are some exceptions. The data that is loaded, however, is only ready in the register at pipe stage EX2.

Figure 3-6. IALU Pipeline (instruction decode, bus request, internal memory access, and data transfer; the available results are the IALU calculation and the data transfer)
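As a hedged illustration of these timings (registers arbitrary, syntax as used elsewhere in this chapter): an address computed by one IALU line can normally be used by the next line without delay, while loaded data is available only at EX2.

J0 = J1 + J2;;        /* address calculation; result available at stage Integer */
XR0 = [J0 += J4];;    /* next line uses J0 with no stall; the loaded data is ready only at EX2 */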

Compute Block Pipe

The compute block pipe is relatively simple. At the decode cycle, the compute block gets the instruction and transfers it to the execution unit (ALU, multiplier, or shifter). At stage Integer, the instruction is decoded in the execution unit and dependencies are checked. At stage Access, the source registers are selected in the register file. At the execution stages EX1 and EX2, the results and flag updates are calculated by the appropriate compute block. The execution is always two cycles, and the result is written into the target register on the rising edge after pipe stage EX2. See Figure 3-7.

Figure 3-7. Compute Block Pipeline (instruction decode, compute decode, register file access, and two execution cycles; results are available after EX2)

Branch Unit Pipe

The branch unit is the most critical pipeline. It affects, and is affected by, all the other pipelines. Each branch flow differs from the others and is determined by the following criteria:

Jump prediction (see "Control Flow Instructions" in Chapter 2)

BTB hit or miss (see "BTB Direct Access Registers" in Appendix A)

Condition: the pipe stage at which it is resolved (pipe stage Integer or EX2)

The prediction is set by the programmer or compiler. The prediction is normally "true", or branch taken. When the programmer uses the (NP) option in a control flow instruction, the prediction is branch not taken. For more information, see "Jump/Call options". In general, the prediction indicates whether the default assumption is that the branch will or will not be taken. Take, for example, a loop that is executed n times, where the branch is taken n-1 times, and always more than once. Predicting the branch taken has two consequences:

1. The branch goes into the BTB.

2. At stage Decode, the TigerSHARC identifies the instruction as a jump instruction and continues fetching from the target of the jump, regardless of the condition.
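For example, a hedged sketch of choosing predictions (labels hypothetical; the (NP) option is described above):

if jeq, jump LOOP_TOP;;        /* default prediction taken: good for a loop back edge taken n-1 times */
if aeq, jump RARE_CASE (NP);;  /* (NP) predicts not taken: good for a path that is rarely taken */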

If a branch instruction is a BTB hit, the TigerSHARC fetches the target of the branch in sequence, after fetching the branch. In this case there is no overhead for a correct prediction. For a detailed description of BTB behavior, see "Branch Target Buffer (BTB)" earlier in this chapter.

The various condition codes are resolved at different stages. IALU conditions are resolved at stage Integer of the instruction that updates the condition flags. Compute block flags are updated at pipe stage EX2. The other flags (BM, FLG0-3, and TRUE) are asynchronous to the pipeline because they are created by external events. These are treated similarly to the IALU conditions and are resolved at pipe stage Integer, except for the condition BM, which is resolved at pipe stage EX2.

Different situations produce different flows and, as a result, different performance results. The parameters for the branch cost are:

Prediction: branch is taken or not taken

Branch condition: on IALU or compute block

BTB: hit or miss

Actual branch result: taken or not

Theoretically, therefore, there are 16 combinations. The following combinations, however, are ignored:

If the prediction is not taken, the BTB cannot give a hit.

If the prediction is not taken and the branch is not taken, the flow is as if no branch exists.

If the prediction is taken and the branch is taken, the flow is identical for IALU and compute block conditions.
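A hedged sketch of the two condition classes (registers arbitrary, condition codes as used in the figures below):

J0 = J1 - J2;;         /* IALU operation: flags resolved early, at stage Integer */
if jeq, jump DONE;;    /* misprediction detected early, so the penalty is smaller */

R0 = R1 - R2;;         /* compute block operation: flags resolved late, at EX2 */
if aeq, jump DONE;;    /* misprediction detected late, so the penalty is larger */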

The different flows are shown in Figure 3-8 through Figure 3-15. Each diagram shows the flow of one combination and its cost. The cost of a branch can be summarized as follows:

Prediction not taken, branch not taken: no cost

BTB hit, branch taken: no cost

BTB miss, prediction taken, branch taken: two cycles

Prediction taken, branch not taken (either BTB hit or miss), or prediction not taken, branch taken:
  IALU condition: three cycles
  Compute block condition: six cycles

Note that if the prediction is not taken, there cannot be a BTB hit, since a taken prediction is a condition for adding an entry to the BTB.

One cycle should be added to the above branch costs if one of the following applies:

The jump is taken and the target instruction line crosses a quad word boundary.

The branch was predicted taken and was not taken, and the sequential instruction line crosses a quad word boundary.

The branch cost in the Figure 3-8 example is zero (whether the condition is on the IALU or the compute block is irrelevant).

Figure 3-8. Prediction Taken, Branch Taken (the target is fetched immediately after the branch; no cycles are lost)

The branch cost in the Figure 3-9 example is two cycles (whether the condition is on the IALU or the compute block is irrelevant).

Figure 3-9. Prediction Taken, Branch Taken With BTB Miss (the two instructions fetched after the branch are aborted)

The branch cost in the Figure 3-10 example is six cycles.

Figure 3-10. Predicted Not Taken, Branch Taken on Compute Block (the condition is resolved at EX2; the instructions fetched after the branch are aborted)

The branch cost in the Figure 3-11 example is three cycles.

Figure 3-11. Predicted Not Taken, Branch Taken on IALU (the condition is resolved at stage Integer; the instructions fetched after the branch are aborted)

The branch cost in the Figure 3-12 example is six cycles.

Figure 3-12. Prediction Taken, Branch Not Taken on Compute Block With BTB Hit

The branch cost in the Figure 3-13 example is three cycles.

Figure 3-13. Prediction Taken, Branch Not Taken on IALU With BTB Hit

The branch cost in the Figure 3-14 example is six cycles.

Figure 3-14. Prediction Taken, Branch Not Taken on Compute Block With BTB Miss

The branch cost in the Figure 3-15 example is three cycles.

Figure 3-15. Prediction Taken, Branch Not Taken on IALU With BTB Miss

Stall

The TigerSHARC supports any sequence of instruction lines, as long as each separate line is legal. The pipelined instruction execution causes overlap between the execution of different lines. Two problems may arise from this:

1. Dependency

2. Resource conflict

A dependency condition is caused by any instruction that uses as an input the result of a previous instruction, if the previous instruction's data is not ready when the current instruction needs the operand.

Resource conflicts occur only in the internal memory. The following instructions cause bus request conflicts:

Load/store: requests an internal bus according to the internal memory block. If the address is external, the virtual bus is used.

Immediate load, move reg to reg, and add or sub with the CJMP option: all request the virtual bus.

If two such instructions use the same internal bus, or if another resource (DMA or BIU) requests the same bus on the same cycle that the IALU requests it, the bus might not be granted to the IALU. This in turn can delay execution.

This section details the different cases of stalls. A stall is any delay caused by one of the two conditions described above. Although the information in this manual is detailed, there may be some cases that are not defined here or conditions that are not always apparent to the system designer. Exact behavior can only be reproduced through the TigerSHARC simulator.

Bus Request

The execution of the following instructions uses the internal bus:

1. Ureg = [Jm + Jn/imm], Ureg = [Km + Kn/imm] (all data types and options)

2. [Jm + Jn/imm] = Ureg, [Km + Kn/imm] = Ureg (all data types and options)

3. Ureg = Ureg (even if both are in the same register file)

4. Ureg = immediate

5. Js = Jm +/- Jn (cj)

The first two instruction types select a bus according to the memory block:

Address 0x000000-0x00FFFF: bus #0
Address 0x080000-0x08FFFF: bus #1
Address 0x100000-0x10FFFF: bus #2
Address 0x1C0000-0xFFFFFFFF: external address

The other three instructions use the virtual bus. Concrete forms of these instructions are sketched below. The arbitration between the masters on the bus is detailed in the Bus Arbitration Protocol section of the TigerSHARC Hardware Specification.

The IALU always requests the bus at pipe stage Integer. If it doesn't receive the bus, the execution of the bus transaction is delayed until the bus is granted. The rest of the line, however, including the other IALU operations (e.g., post-modify of the address), continues. This prevents deadlock in the case of two memory accesses in the same cycle to the same bus (a different implementation would cause deadlock).
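A hedged sketch of the five bus-using forms (registers and values arbitrary):

XR0 = [J0 + 8];;         /* load: internal bus selected by memory block */
[K0 + K1] = YR2;;        /* store: same bus selection rule */
XR3 = J7;;               /* register-to-register transfer: virtual bus */
J8 = 0x1234;;            /* immediate load: virtual bus */
J9 = J10 + J11 (cj);;    /* add with CJMP option: virtual bus */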

Subsequent instruction lines are stalled until this line can continue executing the transaction (or transactions, if more than one of the line's transactions is in execution).

Figure 3-16 illustrates an example of a bus conflict on a load instruction. The instruction itself is not delayed by the bus conflict, but the transaction is delayed. The transaction update is performed two cycles after the completion of the instruction line (two being the number of cycles for which the bus was requested for the transaction and not granted). The next instruction lines are also delayed by two cycles.
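A hedged sketch of the conflict (addresses hypothetical): two loads in one line whose addresses fall in the same internal memory block compete for one internal bus, while loads to different blocks proceed in parallel.

XR0 = [J0 += J4]; YR1 = [K0 += K4];;    /* if J0 and K0 point into the same block, one transaction waits for the bus; */
                                        /* if they point into different blocks, each access uses its own bus */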

Figure 3-16. Load Instruction: Effect of a Resource Conflict on the Bus (the load R5 = [J0 += J4] requests the bus at stage Integer; the grant arrives two cycles late, delaying the data transfer and stalling subsequent lines)

Compute Block Instruction Dependency

This is the most common dependency in applications and occurs on compute block operations on the compute block register file. The compute block accesses the register file for operand fetch at pipe stage Access, uses the operand at EX1, and writes the result back at EX2. The delay is therefore basically two cycles; however, a bypass transfers the result (which is written at the end of pipe stage EX2) directly into the compute unit that is using it at the beginning of pipe stage EX1. As a result, one stall cycle is inserted in the dependent operation.

There is also a one-cycle stall when the MR register is loaded immediately after a MAC. Example:

MR2 += R3 * R2;;
MR2 = R4;;

Figure 3-17 illustrates compute block dependency: two sequential instruction lines, the second dependent on the first through R0. The second instruction is stalled at pipe stage Integer because of this dependency. As a result, the following instructions are delayed as well.
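A hedged sketch of hiding this one-cycle stall by scheduling an independent line between producer and consumer (registers arbitrary):

R0 = R1 + R2;;     /* producer: R0 written at EX2 */
R8 = R9 + R10;;    /* independent line fills the bypass gap */
R5 = R0 * R4;;     /* consumer: no stall, the bypass delivers R0 */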

Figure 3-17. Compute Block Dependency (R0 = R1 + R2 followed by R5 = R0 * R4; the second line stalls one cycle at stage Integer)

Load to Compute Block Instruction Dependency

Data in load instructions is transferred at pipe stage EX2, exactly as in compute block operations. In the case of a dependency between a load instruction and a compute operation that uses the loaded data, the behavior is similar to that of compute block dependency (see Figure 3-17). Take, for example, the following sequence:

XR0 = [memory access];;
XR5 = R0 * R4;;

This causes a one-cycle delay when the load instruction comes from internal memory and the bus was granted to the IALU that executes the transaction. If the load is from external memory, or the bus request was delayed, the second instruction is executed two cycles after the completion of the load, that is, after the data is returned.

Load Data to IALU Instruction Dependency

The dependency between load instructions and IALU instructions is more problematic than the previous cases, because data is loaded at pipe stage EX2 but is used at stage Decode. To bridge this gap, four stall cycles are inserted before the instruction that uses the loaded data, as shown in Figure 3-18.
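A hedged sketch (registers arbitrary): separating a pointer load from its IALU use with independent instruction lines can cover the four-cycle gap.

J0 = [J2 + J3];;     /* load: data ready at EX2 */
R8 = R9 + R10;;      /* four independent lines */
R11 = R12 + R13;;    /* cover the load-to-IALU */
R14 = R15 + R16;;    /* latency */
R17 = R18 + R19;;
J5 = J0 + J3;;       /* uses the loaded J0 without stalling */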

Figure 3-18. Dependency Between Load and IALU Operations (J0 = [J2 + J3] followed by J5 = J0 + J3; four stall cycles are inserted before the dependent line)

Execution Instruction to Store

The combination of any execution instruction followed by a store instruction is dependency free, because the data is transferred by the store at pipe stage EX2. The only exception to this rule is a store of data that has been loaded from external memory. For example:

XR0 = [external address];;
[J0 += 0] = XR0;;

In a case like this, there is a stall until XR0 is actually loaded.

IALU Dependency Conditional

Normally, IALU instructions are executed in a single cycle at pipe stage Integer. The result is pipelined and written into the result register at pipe stage EX2. If the following instruction uses the result of this instruction (either the result itself or a condition), the sequential instruction extracts the result from the pipeline. In one exceptional instance the bypass cannot be used, as shown in Figure 3-19. This occurs when the first instruction is conditional, so that the bypass usage is conditional and the condition value is not yet known. The result of the first instruction in the example cannot be extracted from the bypass and must be taken from the J0 register after the completion of the execution, after pipe stage EX2. In this case, three stall cycles are inserted if the condition is a compute block condition, and one cycle for other types of condition.

Figure 3-19. IALU Conditional Dependency (if az; do, J0 = J2 + J3;; followed by J5 = J0 + J3;; the second line waits for J0 to be committed after EX2)

Enhanced Communication Instructions Dependency

All enhanced communication instructions are executed in the compute pipeline. As with other compute instructions, all enhanced communication instructions have a dependency check. Every use of a result of the previous line causes a stall of one cycle. In some special cases the stall is eliminated by special forwarding logic, in order to improve the performance of certain algorithms. The forwarding logic can function (and the stall can be eliminated) only when the first instruction is not predicated (for example, "if <cond>; do, <...>;;"). The exception cases are:

1. A load of the TR or THR register and any instruction that uses it on the next line.

2. Although the THR register is a hidden operand and/or result of the ACS and DESPREAD instructions, there is no dependency on it.

3. The ACS instruction, which can use the previous result of an ACS instruction as TRmd with no stall. For example, the following sequence causes no stall:

TR3:0 = ACS (...);;
TR7:4 = ACS (TR1:0, TR5:4, R8);;

Or the sequence:

TR3:0 = ACS (...);;
TR7:4 = ACS (TR3:2, TR5:4, R8);;

However, there are a few cases that cause stalls. The first case is when the dependency is on TRN. For example:

TR3:0 = ACS (...);;
TR7:4 = ACS (TR11:10, TR1:0, R8);;

The second case is when two different formats are used in the two instructions. For example:

XTR3:0 = ACS (TR5:4, TR7:6, R1);;
XSTR11:8 = ACS (TR15:14, TR13:12, R2);;

ACS of short operands has an identical flow.

4. A data transfer from an enhanced communication register to a compute register file has no dependency; the data transfer is executed at EX2. The enhanced communication register load can be executed in parallel with other enhanced communication instructions. Its code is similar to the code of a shifter instruction, while the code of the other instructions is similar to the code of ALU instructions.

No exceptions are caused by the acceleration instructions.

Interrupt Flow

This section describes the flow of asynchronous events causing interrupts and exceptions. The different interrupt types are described in detail in the Interrupts chapter of the TigerSHARC DSP Hardware Specification. Interrupts in some applications are performance critical, and the TigerSHARC executes them (in most cases) in the same pipeline in the optimal flow. The next sections describe the different interrupt flows.

Regular Interrupt Flow

The simple case of a hardware interrupt is shown in Figure 3-20. When an interrupt is identified by the core (when the interrupt bit in the ILAT register is set) or when the interrupt becomes enabled (the interrupt bit in the IMASK register is set), the TigerSHARC starts fetching from the interrupt routine address.

The execution of the instructions of the regular flow continues, except for the last instruction before the interrupt (Inst 2 in Figure 3-20). The return address saved in RETI would be the address of instruction 2.

Figure 3-20. Interrupt Regular Flow (the line after the interrupted one is aborted, and fetching restarts at the interrupt routine)

Interrupt in Speculative Flow

When a branch instruction is fetched, the TigerSHARC cannot decide immediately whether the branch is to be taken. Before the final decision is made (3 to 6 cycles), the TigerSHARC continues to fetch instructions (and possibly begins, but does not end, their execution) according to a prediction of the condition result. Before the final decision, the instructions that have been fetched are executed speculatively; that is, if the prediction is found incorrect, the execution of these instructions is aborted, and the correct instructions are fetched and executed instead. This part of the program is called the speculative flow.

When an interrupt occurs during a speculative flow and the speculation is found incorrect, the speculative part is aborted while the interrupt instructions that follow are not aborted. This is illustrated in Figure 3-21. When the interrupt is inserted into the flow, instructions 3 and 4 are in the pipeline speculatively. When the jump instruction is finalized (EX2), if the speculation is found wrong, instructions 3 and 4 are aborted (similar to the flow described in Figure 3-21). The instructions that belong to the interrupt flow, however, are not part of the speculative flow, and they are not aborted. The return address in this case is the correct target of the jump instruction. Similar flows happen in all cases of aborted speculative flows when interrupt routine instructions are already in the pipeline.

Figure 3-21. Interrupt Processing During Speculative Flows (Inst 1: XR0 = R1 - R1;; Inst 2: if nxaeq, jump 100;; the speculative instructions 3 and 4 are aborted, while the interrupt routine instructions proceed)

Interrupt Disabled During Execution

Sometimes the programmer needs a certain part of the code to be executed free of interrupts. In this case, disabling all hardware interrupts by clearing bit [60] of IMASK is effective immediately (contrary to clearing a specific interrupt enable). Be aware that there is a performance cost to using this feature. If the interrupt is already in the pipeline when IMASK[60] is cleared, it continues execution until reaching EX1, and only then is it aborted and the flow returned to normal.

An example of this flow is shown in Figure 3-22. The interrupt is identified by the TigerSHARC on the second cycle (when the second instruction is fetched). The instruction that clears IMASK[60] is only completed five cycles after the interrupt occurs. When the first interrupt routine instruction reaches EX1, IMASK[60] is checked again and, if it is cleared, the whole interrupt flow is aborted and the TigerSHARC returns to its original flow.
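A hedged sketch of such a critical section. The IMASKH register name, the AND syntax, and the bit position (IMASK bit [60] assumed to map to bit 28 of the high half) are assumptions, not taken from this chapter; only the effect of clearing IMASK[60] is described above.

J0 = IMASKH;;        /* read the high half of IMASK (assumed ureg name) */
J1 = 0xEFFFFFFF;;    /* all ones except the assumed global-enable bit */
J2 = J0 AND J1;;     /* clear the global hardware interrupt enable */
IMASKH = J2;;        /* write back; disabling takes effect immediately */
/* ... interrupt-free critical code ... */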

Exception Flow

An exception is normally caused by executing a specific instruction line. The exception routine's first instruction is the next instruction executed after the instruction that caused the exception. To make this happen, when the instruction line that caused the exception reaches EX2, all the instructions in the pipeline are aborted, and the TigerSHARC starts fetching from the exception routine. This flow is similar to the flow of unpredicted, taken jumps conditioned by an EX2 condition (see Figure 3-11).

Figure 3-22. Interrupt Disabled While in Pipeline (interrupt routine instructions 1 through 6 are aborted once the cleared global interrupt enable bit is detected at EX1, and the original flow resumes)




More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev.

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev. Exam Review 2 1 ROB: head/tail PC log. reg prev. phys. store? except? ready? A R3 X3 no none yes old tail B R1 X1 no none yes tail C R1 X6 no none yes D R4 X4 no none yes E --- --- yes none yes F --- ---

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43 The CPU Pipeline 3 This chapter describes the basic operation of the CPU pipeline, which includes descriptions of the delay instructions (instructions that follow a branch or load instruction in the pipeline),

More information

CSEE 3827: Fundamentals of Computer Systems

CSEE 3827: Fundamentals of Computer Systems CSEE 3827: Fundamentals of Computer Systems Lecture 21 and 22 April 22 and 27, 2009 martha@cs.columbia.edu Amdahl s Law Be aware when optimizing... T = improved Taffected improvement factor + T unaffected

More information

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations UNIT III PIPELINING Basic concepts Data hazards Instruction hazards Influence on instruction sets Data path and control considerations Performance considerations Exception handling Basic Concepts It is

More information

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1 CMCS 611-101 Advanced Computer Architecture Lecture 9 Pipeline Implementation Challenges October 5, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer Architecture

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy

More information

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST Chapter 3. Pipelining EE511 In-Cheol Park, KAIST Terminology Pipeline stage Throughput Pipeline register Ideal speedup Assume The stages are perfectly balanced No overhead on pipeline registers Speedup

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations Principles of pipelining Pipelining Simple pipelining Structural Hazards Data Hazards Control Hazards Interrupts Multicycle operations Pipeline clocking ECE D52 Lecture Notes: Chapter 3 1 Sequential Execution

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 4: Pipelining Last Time in Lecture 3 icrocoding, an effective technique to manage control unit complexity, invented in era when logic

More information

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-06 1 Processor and L1 Cache Interface

More information

DSP VLSI Design. Pipelining. Byungin Moon. Yonsei University

DSP VLSI Design. Pipelining. Byungin Moon. Yonsei University Byungin Moon Yonsei University Outline What is pipelining? Performance advantage of pipelining Pipeline depth Interlocking Due to resource contention Due to data dependency Branching Effects Interrupt

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,

More information

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections ) Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6) 1 Predication A branch within a loop can be problematic to schedule Control

More information

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction

More information

The Processor: Improving the performance - Control Hazards

The Processor: Improving the performance - Control Hazards The Processor: Improving the performance - Control Hazards Wednesday 14 October 15 Many slides adapted from: and Design, Patterson & Hennessy 5th Edition, 2014, MK and from Prof. Mary Jane Irwin, PSU Summary

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

(Basic) Processor Pipeline

(Basic) Processor Pipeline (Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information