3 INSTRUCTION FLOW


Overview

The TigerSHARC is a pipelined RISC-like machine, in which instructions are read from memory into the instruction alignment buffer in quad words. Instruction lines (consisting of one to four instructions) are read from memory, decoded, and executed, a process taking eight cycles. To keep instruction throughput high, execution is pipelined, with a throughput of one instruction line every internal clock cycle.

The full flow cannot be analyzed as a single eight-stage pipeline, but rather as two sequential pipelines. The first is a three-stage fetch pipeline, and the second is a five-stage execution pipeline. The two pipelines are distinct in the following ways:

The fetch pipeline is quad word oriented (a quad word every cycle), while the instruction pipeline is instruction line oriented.

When the execution pipeline stalls, the fetch pipeline can continue because of the Instruction Alignment Buffer (IAB), which sits between them.

Figure 3-1. TigerSHARC Two Pipelines (fetch pipeline stages Fetch 1 through Fetch 3, the Instruction Alignment Buffer (IAB), and instruction pipeline stages Decode, Integer, Access, EX1, and EX2)

The first pipe stages are common to all instructions and are memory-access driven: Fetch1, Fetch2, and Fetch3, or in short F1, F2, and F3. The remaining pipe stages are instruction driven. The execution differs between the IALU, compute block, and sequencer (branch unit). The instruction-driven pipe stages are Decode, Integer, Operand Access, Execute1, and Execute2, or in short D, I, A, EX1, and EX2. The first three pipe stages are referred to as the fetch pipe and the last five as the instruction pipe.

The instructions in a single line are executed pseudo-simultaneously. When two instructions in the same line use the same register, one as operand and the other as result, the operand is determined as the value of the register prior to the execution of this line. For example:

Initial values: R0 = 2, R1 = 3, R2 = 3, R3 = 8
Instruction line: R2 = R0 + R1; R6 = R2 * R3 (I);;

R2 is modified by the first instruction, and the result is 5. Still, the second instruction sees the input in R2 as 3, and the result written to R6 is 24.

This rule is not guaranteed for store instructions. For example, suppose that with the same initial values we had:

Instruction line: R2 = R0 + R1; [address] = R2;;

The results are unpredictable and, furthermore, there is no indication of this event.

The pipeline creates complications because of the overlap between the execution time of instructions of different lines. For example, take a sequence of two instruction lines where the second uses the result of the first instruction line as an input operand. Because of the pipeline length, the result may not be ready when the second instruction fetches its operands. In such a case a stall is issued between the first and second instruction line. Since this may cause performance loss, the programmer or compiler should strive to create as few of these cases as possible. These combinations are legal, however, and the result will be correct. This type of problem is discussed in detail in "Stall" later in this chapter.
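For example, the following hedged sketch (registers arbitrary) shows a dependent pair of instruction lines and a reordering that hides the latency, assuming the one-cycle compute block bypass described in "Stall":

R2 = R0 + R1;;        /* result written at EX2 */
R6 = R2 * R3 (I);;    /* uses R2 from the previous line; one stall cycle inserted */

R2 = R0 + R1;;
R8 = R4 + R5;;        /* independent line fills the gap */
R6 = R2 * R3 (I);;    /* no stall; the bypass delivers R2 */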

All results are written into the target registers and status flags at pipe stage EX2. There are two exceptions to this rule:

1. External memory access, in which the delay is determined by the system.

2. MAC instructions, which write into the MR registers and sticky flags one cycle after EX2. This is important for retaining coherency in case of a pipeline break.

There are some special bypasses for the latter instructions' inputs, with the purpose of shortening or eliminating dependencies.

The following pipeline diagrams show the progression of instruction lines, where an instruction line may consist of one, two, three, or four instructions. Hence at any instant in time, there may be up to four instructions simultaneously executing in different units of the processor.

Fetch Pipe

The fetch cycles form the first pipeline and are tied to the memory accesses. Progress in this pipeline is memory-driven, not instruction-driven. The fetch unit fills the instruction alignment buffer (IAB) whenever the IAB holds less than three quad words. Since the execution units pull instructions at a throughput lower than or equal to the fetch throughput of four words every cycle, it is possible that the fetch unit fills the IAB faster than the execution units pull instructions out of it. The IAB can be filled with up to five quad words of instructions.

The instruction flow in this pipeline is very simple, as illustrated in Figure 3-2. At every cycle another quad word can be fetched from internal memory. After three cycles of latency, the instructions are available for the execution pipeline. When the fetch is from external memory, the flow is similar although much slower. The fetch throughput is one instruction every two SCLK cycles, and the latency depends on the system design (external memory pipeline depth, wait cycles, and so on).

Figure 3-2. Fetch and Switch (each quad word advances through Fetch1, Fetch2, and Fetch3 on successive CCLK cycles and is then available for the execution pipeline)

Instruction Alignment Buffer

The instruction alignment buffer (IAB) acts as a buffer between the fetch pipeline and the execution pipeline. The IAB is actually a five quad word FIFO, as shown in Figure 3-3. Whenever fetch data is read from memory (a full quad word), it is written into the next entry in the IAB. Whenever there is at least one full instruction line in the IAB, the sequencer can pull it for execution. In Figure 3-2, the instructions available for the execution pipeline are actually the instructions in the IAB.

The IAB ensures execution of an entire instruction line without inserting additional stall cycles or forcing quad word alignment on instruction lines. In this scheme, no memory is unused. For example, the instruction lines 1a; 2a; 3a;;, 4b; 5b;;, 6c; 7c; 8c; 9c;;, 10d; 11d;;, and 12e;; pack into memory with no padding:

0x10080:  1a  2a  3a  4b
0x10084:  5b  6c  7c  8c
0x10088:  9c 10d 11d 12e

The IAB is a FIFO that receives a quad word fetched from memory and outputs instruction lines into the execution pipeline.

Figure 3-3. Instruction Alignment Buffer (a 4 x 32-bit internal bus feeds a 3-entry FIFO of 128-bit entries and a 2-entry alignment buffer; an 8-to-1 alignment mux routes 32-bit instructions through per-unit 4-to-1 muxes to the sequencer, KALU, JALU, CBX1, CBX2, CBY1, and CBY2)

Branch Target Buffer (BTB)

The branch target buffer (BTB) is used to reduce the performance loss that results from branching in a deeply pipelined processor. The BTB is a 32-set, 4-way set-associative cache (a total of 128 entries) that stores branch target addresses and has a Least Recently Used (LRU) replacement policy. The BTB structure, as described in Figure 3-4, is active while the BTBEN bit in SQCTL is set.

Figure 3-4. BTB Organization (each set holds four entries; each entry holds an LRU field, a TARGET field, and a TAG field)

Every branch instruction whose prediction is "taken" may be written into the BTB. The PC of the instruction line is written into the BTB tag, and the target address is written into the BTB target field. If the jump is computed (by register), the target indicates which register to refer to.

The BTB examines the flow of addresses during pipeline stage Fetch 1. When the BTB recognizes the address of an instruction that caused a jump on a previous pass of the program code (a BTB hit), the BTB substitutes the corresponding destination address (from the target field) as the fetch address for the following instruction. As a result, when a branch is currently cached and correctly predicted, the performance loss due to branching is reduced from either six or three stall cycles to zero.

Only internal memory branches are cached in the BTB. The width of the cached target addresses is 22 bits.

The BTB stores only one tag entry per aligned quad word of program instructions and, consequently, only one branch may be predicted per aligned quad word. If a programmer requires that more than one adjacent branch be predicted, then one to three NOP instructions must be inserted between the branches to ensure that both branches do not fall into the same aligned quad word.

To avoid the possibility of placing more than one instruction containing a predicted branch within the same quad word boundary in memory and causing unexpected BTB behavior, this combination of instructions and placement causes an assembler warning. The assembler warns that it has detected two predicted jumps within instruction lines whose line endings are within four words of each other. Further, the assembler states that, depending on section alignment, this combination of predicted branch instructions and the instructions' placement in memory may violate the constraint that they cannot end in the same quad word.

It is useful to examine how different placements of words in memory result in different contents in the BTB. For example, the code in Listing 3-1 contains a predicted branch:

Listing 3-1. Predicted Branches, Aligned Quad Words, and the BTB

nop; nop; nop; nop;;
jump HERE; nop;;
nop; nop; nop; nop;;

In memory, each instruction occupies an address, and sets of four locations make up a quad word, as shown in Figure 3-5. The quad word address is the address of the first instruction in the quad word (for example, 0x0, 0x4, 0x8, and so on).

Figure 3-5. Instructions in Memory (instructions at addresses 0x0 through 0x3 form the first quad word; instructions at 0x4 through 0x7 form the second)

Depending on how the code in Listing 3-1 aligns in memory, quad word address 0x4 could contain:

nop; nop;; jump HERE; nop;;    /* quad word starts at 0x4; end of the instruction line containing the jump */

If so, the BTB entry for the branch would contain:

Tag = 0x4, Target Address = HERE

But the code in Listing 3-1 could align in memory differently. For example, this code could align such that quad word addresses 0x4 (first line) and 0x8 (second line) contain:

nop; nop; nop;; jump HERE;    /* quad word starts at 0x4 */
nop;; nop; nop; nop;          /* quad word starts at 0x8; also the end of the instruction line containing the jump */

If so, the BTB entry for the branch would contain:

Tag = 0x8, Target Address = HERE

If prediction is enabled, at the F1 stage of the pipeline the current PC is compared to the BTB tag values. If there is a match, the DSP modifies the PC to reflect the branch target address stored in the BTB, and the sequencer continues to fetch subsequent quad words at the modified PC. If there is no match, the DSP does not modify the PC, and the sequencer continues to fetch subsequent quad words at the unmodified PC.

When the same instruction reaches the Decode stage of the pipeline, the instruction is identified as a branch instruction. If there was a BTB match, no exceptional action is taken. The PC has already been modified, and the sequencer has already fetched from the branch target address. If there is no BTB match, the sequencer aborts the two instructions fetched prior to reaching the Decode stage (two stall cycles), and the DSP modifies the PC to reflect the branch target address and begins fetching quad words at the modified PC. The sequencer updates the BTB with the branch target address such that the next time the branch instruction is encountered, it is likely that there will be a BTB match.

The BTB contents vary with the instruction placement in memory, because:

The sequencer fetches instructions a full quad word at a time.

An instruction line may occupy less than a full quad word, occupy a full quad word, or span two quad words.

An instruction line may start at a location other than a quad word aligned address.

Because the BTB can store only a single branch target address for each aligned quad word of instruction code, it is important to examine coding techniques that work with this BTB feature.

The following code example produces unpredictable results in the hardware, because this code (depending on memory placement) may attempt to force the BTB to store multiple branch target addresses for a single aligned quad word:

jump FIRST_JUMP; LC1 = yr16;;
jump SECOND_JUMP; R29 = R27;;
/* Illegal. The line endings of the instruction lines containing jumps are within four instructions of each other. */

The situation can be remedied by using NOP instructions to force the branch instructions to exhibit at least four words of separation, as follows:

jump FIRST_JUMP; LC1 = yr16;;
jump SECOND_JUMP; R29 = R27; nop; nop;;
/* Adding NOPs as above shifts the line ending of the second instruction line. */

While adding these NOP instructions increases the size of the code, they do not affect the performance of the code.

Another way to control the relationship between the alignment of code within quad words and the BTB contents is to use the .align_code 4 assembler directive. This directive forces the immediately subsequent code to be quad word aligned, as follows:

jump FIRST_JUMP; LC1 = yr16;;
.align_code 4;    /* Forcing quad alignment shifts the line ending of the next instruction line. */
jump SECOND_JUMP; R29 = R27;;

If the BTB hit is a computed jump, the RETI or CJMP register is used (according to the instruction) as the target address. In this case, any change in this register's value before the jump takes place causes the TigerSHARC to abort the fetched instructions and repeat the flow as if there were no hit.

Whenever program overlays are used to swap program segments into and out of internal memory, the BTB must be cleared using the BTBINV instruction in order to invalidate its contents.

The BTBLK bit in SQCTL is used for program sections that require branches to be permanently buffered. While the BTBLK bit is set, the BTB puts every new entry into the BTB in LOCKED status. When this happens, the BTB entry is not replaced until the whole BTB is flushed, in order to keep performance-critical jumps in the BTB.

The BTB contents can be accessed directly for debug and diagnostic purposes only, and the BTB must be disabled prior to access by clearing the BTBEN bit in SQCTL. The BTB register groups are 0x30 to 0x37. You must access BTB contents only for testing. If you attempt this for functional work, you are responsible for preventing multi-hit and coherency problems.
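For example, a hedged sketch of an overlay swap (the copy routine and labels are hypothetical; only BTBINV is taken from this section):

call load_overlay;;    /* hypothetical routine that copies a new code segment into internal memory */
BTBINV;;               /* invalidate the BTB so stale targets from the old overlay are not used */
jump overlay_entry;;   /* branch into the newly loaded code */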

Decode

The decode cycle is the first stage in the instruction pipeline. In this cycle, the next full instruction line is extracted from the instruction alignment buffer and the different instructions are distributed to the execution units. The units are:

JALU or KALU: integer instructions, load/store, and register transfers.

Compute block X or Y or both: two instructions (the switching within the CB is done by the RF).

Sequencer: branch and condition instructions, and others.

The Instruction Alignment Buffer (IAB) also calculates the program counter of a sequential line and some of the non-sequential instructions. The switch does not perform any decoding.

IALU Pipeline

IALU instructions include address or data calculation and, optionally, memory access. Figure 3-6 shows the instruction flow in the IALU pipeline. The IALU instruction is decoded and the calculation is executed at the Decode stage. If the IALU instruction includes a memory access, the bus is requested at stage Integer. In this case, the memory access begins at pipe stage Access, as long as the bus is available for the IALU. The result of the address calculation is ready at the Integer stage.

Since the execution of the IALU instruction may be aborted (either because of a condition or because the execution is sometimes speculative), the operand is returned to the destination register only at the end of EX2. The result is passed through the pipeline, where it may be extracted by a new instruction should it be required as a source operand.

Dependency between IALU calculations normally does not cause any delay, but there are some exceptions. The data that is loaded, however, is only ready in the register at pipe stage EX2.

Figure 3-6. IALU Pipeline (instruction decode, bus request, internal memory access, and data transfer; the available results are the IALU calculation and the data transfer)
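As a hedged illustration of these timings (registers arbitrary, syntax as used elsewhere in this chapter): an address computed by one IALU line can normally be used by the next line without delay, while loaded data is available only at EX2.

J0 = J1 + J2;;        /* address calculation; result available at stage Integer */
XR0 = [J0 += J4];;    /* next line uses J0 with no stall; the loaded data is ready only at EX2 */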

Compute Block Pipe

The compute block pipe is relatively simple. At the decode cycle, the compute block gets the instruction and transfers it to the execution unit (ALU, multiplier, or shifter). At stage Integer, the instruction is decoded in the execution unit and dependencies are checked. At stage Access, the source registers are selected in the register file. At the execution stages EX1 and EX2, the results and flag updates are calculated by the appropriate compute block. The execution is always two cycles, and the result is written into the target register on the rising edge after pipe stage EX2. See Figure 3-7.

Figure 3-7. Compute Block Pipeline (instruction decode, compute decode, register file access, and two execution cycles; results are available after EX2)

Branch Unit Pipe

The branch unit is the most critical pipeline. It affects, and is affected by, all the other pipelines. Each branch flow differs from the others and is determined by the following criteria:

Jump prediction (see "Control Flow Instructions" in Chapter 2)

BTB hit or miss (see "BTB Direct Access Registers" in Appendix A)

Condition: the pipe stage at which it is resolved (pipe stage Integer or EX2)

The prediction is set by the programmer or compiler. The prediction is normally "true", or branch taken. When the programmer uses the (NP) option in a control flow instruction, the prediction is branch not taken. For more information, see "Jump/Call options". In general, the prediction indicates whether the default assumption is that the branch will or will not be taken. Take, for example, a loop that is executed n times, where the branch is taken n-1 times, and always more than once. Predicting the branch taken has two consequences:

1. The branch goes into the BTB.

2. At stage Decode, the TigerSHARC identifies the instruction as a jump instruction and continues fetching from the target of the jump, regardless of the condition.
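For example, a hedged sketch of choosing predictions (labels hypothetical; the (NP) option is described above):

if jeq, jump LOOP_TOP;;        /* default prediction taken: good for a loop back edge taken n-1 times */
if aeq, jump RARE_CASE (NP);;  /* (NP) predicts not taken: good for a path that is rarely taken */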

If a branch instruction is a BTB hit, the TigerSHARC fetches the target of the branch in sequence, after fetching the branch. In this case there is no overhead for a correct prediction. For a detailed description of BTB behavior, see "Branch Target Buffer (BTB)" earlier in this chapter.

The various condition codes are resolved at different stages. IALU conditions are resolved at stage Integer of the instruction that updates the condition flags. Compute block flags are updated at pipe stage EX2. The other flags (BM, FLG0-3, and TRUE) are asynchronous to the pipeline because they are created by external events. These are treated similarly to the IALU conditions and are resolved at pipe stage Integer, except for the condition BM, which is resolved at pipe stage EX2.

Different situations produce different flows and, as a result, different performance results. The parameters for the branch cost are:

Prediction: branch is taken or not taken

Branch condition: on IALU or compute block

BTB: hit or miss

Actual branch result: taken or not

Theoretically, therefore, there are 16 combinations. The following combinations, however, are ignored:

If the prediction is not taken, the BTB cannot give a hit.

If the prediction is not taken and the branch is not taken, the flow is as if no branch exists.

If the prediction is taken and the branch is taken, the flow is identical for IALU and compute block conditions.
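A hedged sketch of the two condition classes (registers arbitrary, condition codes as used in the figures below):

J0 = J1 - J2;;         /* IALU operation: flags resolved early, at stage Integer */
if jeq, jump DONE;;    /* misprediction detected early, so the penalty is smaller */

R0 = R1 - R2;;         /* compute block operation: flags resolved late, at EX2 */
if aeq, jump DONE;;    /* misprediction detected late, so the penalty is larger */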

The different flows are shown in Figure 3-8 through Figure 3-15. Each diagram shows the flow of one combination and its cost. The cost of a branch can be summarized as follows:

Prediction not taken, branch not taken: no cost

BTB hit, branch taken: no cost

BTB miss, prediction taken, branch taken: two cycles

Prediction taken, branch not taken (either BTB hit or miss), or prediction not taken, branch taken:
  IALU condition: three cycles
  Compute block condition: six cycles

Note that if the prediction is not taken, there cannot be a BTB hit, since a taken prediction is a condition for adding an entry to the BTB.

One cycle should be added to the above branch costs if one of the following applies:

The jump is taken and the target instruction line crosses a quad word boundary.

The branch was predicted taken and was not taken, and the sequential instruction line crosses a quad word boundary.

The branch cost in the Figure 3-8 example is zero (whether the condition is on the IALU or the compute block is irrelevant).

Figure 3-8. Prediction Taken, Branch Taken (the target is fetched immediately after the branch; no cycles are lost)

The branch cost in the Figure 3-9 example is two cycles (whether the condition is on the IALU or the compute block is irrelevant).

Figure 3-9. Prediction Taken, Branch Taken With BTB Miss (the two instructions fetched after the branch are aborted)

The branch cost in the Figure 3-10 example is six cycles.

Figure 3-10. Predicted Not Taken, Branch Taken on Compute Block (the condition is resolved at EX2; the instructions fetched after the branch are aborted)

The branch cost in the Figure 3-11 example is three cycles.

Figure 3-11. Predicted Not Taken, Branch Taken on IALU (the condition is resolved at stage Integer; the instructions fetched after the branch are aborted)

The branch cost in the Figure 3-12 example is six cycles.

Figure 3-12. Prediction Taken, Branch Not Taken on Compute Block With BTB Hit

The branch cost in the Figure 3-13 example is three cycles.

Figure 3-13. Prediction Taken, Branch Not Taken on IALU With BTB Hit

The branch cost in the Figure 3-14 example is six cycles.

Figure 3-14. Prediction Taken, Branch Not Taken on Compute Block With BTB Miss

The branch cost in the Figure 3-15 example is three cycles.

Figure 3-15. Prediction Taken, Branch Not Taken on IALU With BTB Miss

Stall

The TigerSHARC supports any sequence of instruction lines, as long as each separate line is legal. The pipelined instruction execution causes overlap between the execution of different lines. Two problems may arise from this:

1. Dependency

2. Resource conflict

A dependency condition is caused by any instruction that uses as an input the result of a previous instruction, if the previous instruction's data is not ready when the current instruction needs the operand.

Resource conflicts occur only in the internal memory. The following instructions cause bus request conflicts:

Load/store: requests an internal bus according to the internal memory block. If the address is external, the virtual bus is used.

Immediate load, move reg to reg, and add or sub with the CJMP option: all request the virtual bus.

If two such instructions use the same internal bus, or if another resource (DMA or BIU) requests the same bus on the same cycle that the IALU requests it, the bus might not be granted to the IALU. This in turn can delay execution.

This section details the different cases of stalls. A stall is any delay caused by one of the two conditions described above. Although the information in this manual is detailed, there may be some cases that are not defined here or conditions that are not always apparent to the system designer. Exact behavior can only be reproduced through the TigerSHARC simulator.

Bus Request

The execution of the following instructions uses the internal bus:

1. Ureg = [Jm + Jn/imm], Ureg = [Km + Kn/imm] (all data types and options)

2. [Jm + Jn/imm] = Ureg, [Km + Kn/imm] = Ureg (all data types and options)

3. Ureg = Ureg (even if both are in the same register file)

4. Ureg = immediate

5. Js = Jm +/- Jn (cj)

The first two instruction types select a bus according to the memory block:

Address 0x000000-0x00FFFF: bus #0
Address 0x080000-0x08FFFF: bus #1
Address 0x100000-0x10FFFF: bus #2
Address 0x1C0000-0xFFFFFFFF: external address

The other three instructions use the virtual bus. Concrete forms of these instructions are sketched below. The arbitration between the masters on the bus is detailed in the Bus Arbitration Protocol section of the TigerSHARC Hardware Specification.

The IALU always requests the bus at pipe stage Integer. If it doesn't receive the bus, the execution of the bus transaction is delayed until the bus is granted. The rest of the line, however, including the other IALU operations (e.g., post-modify of the address), continues. This prevents deadlock in the case of two memory accesses in the same cycle to the same bus (a different implementation would cause deadlock).
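A hedged sketch of the five bus-using forms (registers and values arbitrary):

XR0 = [J0 + 8];;         /* load: internal bus selected by memory block */
[K0 + K1] = YR2;;        /* store: same bus selection rule */
XR3 = J7;;               /* register-to-register transfer: virtual bus */
J8 = 0x1234;;            /* immediate load: virtual bus */
J9 = J10 + J11 (cj);;    /* add with CJMP option: virtual bus */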

Subsequent instruction lines are stalled until this line can continue executing the transaction (or transactions, if more than one of the line's transactions is in execution).

Figure 3-16 illustrates an example of a bus conflict on a load instruction. The instruction itself is not delayed by the bus conflict, but the transaction is delayed. The transaction update is performed two cycles after the completion of the instruction line (two being the number of cycles for which the bus was requested for the transaction and not granted). The next instruction lines are also delayed by two cycles.
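A hedged sketch of the conflict (addresses hypothetical): two loads in one line whose addresses fall in the same internal memory block compete for one internal bus, while loads to different blocks proceed in parallel.

XR0 = [J0 += J4]; YR1 = [K0 += K4];;    /* if J0 and K0 point into the same block, one transaction waits for the bus; */
                                        /* if they point into different blocks, each access uses its own bus */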

Figure 3-16. Load Instruction: Effect of a Resource Conflict on the Bus (the load R5 = [J0 += J4] requests the bus at stage Integer; the grant arrives two cycles late, delaying the data transfer and stalling subsequent lines)

Compute Block Instruction Dependency

This is the most common dependency in applications and occurs on compute block operations on the compute block register file. The compute block accesses the register file for operand fetch at pipe stage Access, uses the operand at EX1, and writes the result back at EX2. The delay is therefore basically two cycles; however, a bypass transfers the result (which is written at the end of pipe stage EX2) directly into the compute unit that is using it at the beginning of pipe stage EX1. As a result, one stall cycle is inserted in the dependent operation.

There is also a one-cycle stall when the MR register is loaded immediately after a MAC. Example:

MR2 += R3 * R2;;
MR2 = R4;;

Figure 3-17 illustrates compute block dependency: two sequential instruction lines, the second dependent on the first through R0. The second instruction is stalled at pipe stage Integer because of this dependency. As a result, the following instructions are delayed as well.
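A hedged sketch of hiding this one-cycle stall by scheduling an independent line between producer and consumer (registers arbitrary):

R0 = R1 + R2;;     /* producer: R0 written at EX2 */
R8 = R9 + R10;;    /* independent line fills the bypass gap */
R5 = R0 * R4;;     /* consumer: no stall, the bypass delivers R0 */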

Figure 3-17. Compute Block Dependency (R0 = R1 + R2 followed by R5 = R0 * R4; the second line stalls one cycle at stage Integer)

Load to Compute Block Instruction Dependency

Data in load instructions is transferred at pipe stage EX2, exactly as in compute block operations. In the case of a dependency between a load instruction and a compute operation that uses the loaded data, the behavior is similar to that of compute block dependency (see Figure 3-17). Take, for example, the following sequence:

XR0 = [memory access];;
XR5 = R0 * R4;;

This causes a one-cycle delay when the load instruction comes from internal memory and the bus was granted to the IALU that executes the transaction. If the load is from external memory, or the bus request was delayed, the second instruction is executed two cycles after the completion of the load, that is, after the data is returned.

Load Data to IALU Instruction Dependency

The dependency between load instructions and IALU instructions is more problematic than the previous cases, because data is loaded at pipe stage EX2 but is used at stage Decode. To bridge this gap, four stall cycles are inserted before the instruction that uses the loaded data, as shown in Figure 3-18.
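A hedged sketch (registers arbitrary): separating a pointer load from its IALU use with independent instruction lines can cover the four-cycle gap.

J0 = [J2 + J3];;     /* load: data ready at EX2 */
R8 = R9 + R10;;      /* four independent lines */
R11 = R12 + R13;;    /* cover the load-to-IALU */
R14 = R15 + R16;;    /* latency */
R17 = R18 + R19;;
J5 = J0 + J3;;       /* uses the loaded J0 without stalling */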

Figure 3-18. Dependency Between Load and IALU Operations (J0 = [J2 + J3] followed by J5 = J0 + J3; four stall cycles are inserted before the dependent line)

Execution Instruction to Store

The combination of any execution instruction followed by a store instruction is dependency free, because the data is transferred by the store at pipe stage EX2. The only exception to this rule is a store of data that has been loaded from external memory. For example:

XR0 = [external address];;
[J0 += 0] = XR0;;

In a case like this, there is a stall until XR0 is actually loaded.

IALU Dependency Conditional

Normally, IALU instructions are executed in a single cycle at pipe stage Integer. The result is pipelined and written into the result register at pipe stage EX2. If the following instruction uses the result of this instruction (either the result itself or a condition), the sequential instruction extracts the result from the pipeline. In one exceptional instance the bypass cannot be used, as shown in Figure 3-19. This occurs when the first instruction is conditional, so that the bypass usage is conditional and the condition value is not yet known. The result of the first instruction in the example cannot be extracted from the bypass and must be taken from the J0 register after the completion of the execution, after pipe stage EX2. In this case, three stall cycles are inserted if the condition is a compute block condition, and one cycle for other types of condition.

Figure 3-19. IALU Conditional Dependency (if az; do, J0 = J2 + J3;; followed by J5 = J0 + J3;; the second line waits for J0 to be committed after EX2)

Enhanced Communication Instructions Dependency

All enhanced communication instructions are executed in the compute pipeline. As with other compute instructions, all enhanced communication instructions have a dependency check. Every use of a result of the previous line causes a stall of one cycle. In some special cases the stall is eliminated by special forwarding logic, in order to improve the performance of certain algorithms. The forwarding logic can function (and the stall can be eliminated) only when the first instruction is not predicated (for example, "if <cond>; do, <...>;;"). The exception cases are:

1. A load of the TR or THR register and any instruction that uses it on the next line.

2. Although the THR register is a hidden operand and/or result of the ACS and DESPREAD instructions, there is no dependency on it.

3. The ACS instruction, which can use the previous result of an ACS instruction as TRmd with no stall. For example, the following sequence causes no stall:

TR3:0 = ACS (...);;
TR7:4 = ACS (TR1:0, TR5:4, R8);;

Or the sequence:

TR3:0 = ACS (...);;
TR7:4 = ACS (TR3:2, TR5:4, R8);;

However, there are a few cases that cause stalls. The first case is when the dependency is on TRN. For example:

TR3:0 = ACS (...);;
TR7:4 = ACS (TR11:10, TR1:0, R8);;

The second case is when two different formats are used in the two instructions. For example:

XTR3:0 = ACS (TR5:4, TR7:6, R1);;
XSTR11:8 = ACS (TR15:14, TR13:12, R2);;

ACS of short operands has an identical flow.

4. A data transfer from an enhanced communication register to a compute register file has no dependency; the data transfer is executed at EX2. The enhanced communication register load can be executed in parallel with other enhanced communication instructions. Its code is similar to the code of a shifter instruction, while the code of the other instructions is similar to the code of ALU instructions.

No exceptions are caused by the acceleration instructions.

Interrupt Flow

This section describes the flow of asynchronous events causing interrupts and exceptions. The different interrupt types are described in detail in the Interrupts chapter of the TigerSHARC DSP Hardware Specification. Interrupts in some applications are performance critical, and the TigerSHARC executes them (in most cases) in the same pipeline in the optimal flow. The next sections describe the different interrupt flows.

Regular Interrupt Flow

The simple case of a hardware interrupt is shown in Figure 3-20. When an interrupt is identified by the core (when the interrupt bit in the ILAT register is set) or when the interrupt becomes enabled (the interrupt bit in the IMASK register is set), the TigerSHARC starts fetching from the interrupt routine address.

The execution of the instructions of the regular flow continues, except for the last instruction before the interrupt (Inst 2 in Figure 3-20). The return address saved in RETI would be the address of instruction 2.

Figure 3-20. Interrupt Regular Flow (the line after the interrupted one is aborted, and fetching restarts at the interrupt routine)

Interrupt in Speculative Flow

When a branch instruction is fetched, the TigerSHARC cannot decide immediately whether the branch is to be taken. Before the final decision is made (3 to 6 cycles), the TigerSHARC continues to fetch instructions (and possibly begins, but does not end, their execution) according to a prediction of the condition result. Before the final decision, the instructions that have been fetched are executed speculatively; that is, if the prediction is found incorrect, the execution of these instructions is aborted, and the correct instructions are fetched and executed instead. This part of the program is called the speculative flow.

When an interrupt occurs during a speculative flow and the speculation is found incorrect, the speculative part is aborted while the interrupt instructions that follow are not aborted. This is illustrated in Figure 3-21. When the interrupt is inserted into the flow, instructions 3 and 4 are in the pipeline speculatively. When the jump instruction is finalized (EX2), if the speculation is found wrong, instructions 3 and 4 are aborted (similar to the flow described in Figure 3-21). The instructions that belong to the interrupt flow, however, are not part of the speculative flow, and they are not aborted. The return address in this case is the correct target of the jump instruction. Similar flows happen in all cases of aborted speculative flows when interrupt routine instructions are already in the pipeline.

Figure 3-21. Interrupt Processing During Speculative Flows (Inst 1: XR0 = R1 - R1;; Inst 2: if nxaeq, jump 100;; the speculative instructions 3 and 4 are aborted, while the interrupt routine instructions proceed)

Interrupt Disabled During Execution

Sometimes the programmer needs a certain part of the code to be executed free of interrupts. In this case, disabling all hardware interrupts by clearing bit [60] of IMASK is effective immediately (contrary to clearing a specific interrupt enable). Be aware that there is a performance cost to using this feature. If the interrupt is already in the pipeline when IMASK[60] is cleared, it continues execution until reaching EX1, and only then is it aborted and the flow returned to normal.

An example of this flow is shown in Figure 3-22. The interrupt is identified by the TigerSHARC on the second cycle (when the second instruction is fetched). The instruction that clears IMASK[60] is only completed five cycles after the interrupt occurs. When the first interrupt routine instruction reaches EX1, IMASK[60] is checked again and, if it is cleared, the whole interrupt flow is aborted and the TigerSHARC returns to its original flow.
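A hedged sketch of such a critical section. The IMASKH register name, the AND syntax, and the bit position (IMASK bit [60] assumed to map to bit 28 of the high half) are assumptions, not taken from this chapter; only the effect of clearing IMASK[60] is described above.

J0 = IMASKH;;        /* read the high half of IMASK (assumed ureg name) */
J1 = 0xEFFFFFFF;;    /* all ones except the assumed global-enable bit */
J2 = J0 AND J1;;     /* clear the global hardware interrupt enable */
IMASKH = J2;;        /* write back; disabling takes effect immediately */
/* ... interrupt-free critical code ... */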

Exception Flow

An exception is normally caused by executing a specific instruction line. The exception routine's first instruction is the next instruction executed after the instruction that caused the exception. To make this happen, when the instruction line that caused the exception reaches EX2, all the instructions in the pipeline are aborted, and the TigerSHARC starts fetching from the exception routine. This flow is similar to the flow of unpredicted, taken jumps conditioned by an EX2 condition (see Figure 3-11).

Figure 3-22. Interrupt Disabled While in Pipeline (interrupt routine instructions 1 through 6 are aborted once the cleared global interrupt enable bit is detected at EX1, and the original flow resumes)




More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev.

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev. Exam Review 2 1 ROB: head/tail PC log. reg prev. phys. store? except? ready? A R3 X3 no none yes old tail B R1 X1 no none yes tail C R1 X6 no none yes D R4 X4 no none yes E --- --- yes none yes F --- ---

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43 The CPU Pipeline 3 This chapter describes the basic operation of the CPU pipeline, which includes descriptions of the delay instructions (instructions that follow a branch or load instruction in the pipeline),

More information

CSEE 3827: Fundamentals of Computer Systems

CSEE 3827: Fundamentals of Computer Systems CSEE 3827: Fundamentals of Computer Systems Lecture 21 and 22 April 22 and 27, 2009 martha@cs.columbia.edu Amdahl s Law Be aware when optimizing... T = improved Taffected improvement factor + T unaffected

More information

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations UNIT III PIPELINING Basic concepts Data hazards Instruction hazards Influence on instruction sets Data path and control considerations Performance considerations Exception handling Basic Concepts It is

More information

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1 CMCS 611-101 Advanced Computer Architecture Lecture 9 Pipeline Implementation Challenges October 5, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer Architecture

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy

More information

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST Chapter 3. Pipelining EE511 In-Cheol Park, KAIST Terminology Pipeline stage Throughput Pipeline register Ideal speedup Assume The stages are perfectly balanced No overhead on pipeline registers Speedup

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations Principles of pipelining Pipelining Simple pipelining Structural Hazards Data Hazards Control Hazards Interrupts Multicycle operations Pipeline clocking ECE D52 Lecture Notes: Chapter 3 1 Sequential Execution

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 4: Pipelining Last Time in Lecture 3 icrocoding, an effective technique to manage control unit complexity, invented in era when logic

More information

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-06 1 Processor and L1 Cache Interface

More information

DSP VLSI Design. Pipelining. Byungin Moon. Yonsei University

DSP VLSI Design. Pipelining. Byungin Moon. Yonsei University Byungin Moon Yonsei University Outline What is pipelining? Performance advantage of pipelining Pipeline depth Interlocking Due to resource contention Due to data dependency Branching Effects Interrupt

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,

More information

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections ) Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6) 1 Predication A branch within a loop can be problematic to schedule Control

More information

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction

More information

The Processor: Improving the performance - Control Hazards

The Processor: Improving the performance - Control Hazards The Processor: Improving the performance - Control Hazards Wednesday 14 October 15 Many slides adapted from: and Design, Patterson & Hennessy 5th Edition, 2014, MK and from Prof. Mary Jane Irwin, PSU Summary

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

(Basic) Processor Pipeline

(Basic) Processor Pipeline (Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information