Midterm I SOLUTIONS March 21st, 2007 CS252 Graduate Computer Architecture


University of California, Berkeley
College of Engineering
Computer Science Division EECS
Spring 2007
John Kubiatowicz

Midterm I SOLUTIONS
March 21st, 2007
CS252 Graduate Computer Architecture

Your Name:
SID Number:

Problem    Possible    Score
Total      100

[ This page left for π ]

Question #1: Short Answer [16 pts]

Problem 1a[2pts]: What is simultaneous multithreading and why is it useful?

Simultaneous multithreading is a technique that adds multiple threads to a multi-issue, out-of-order processor. Since the instructions of these threads can be interleaved in an arbitrary fashion (and are thus running simultaneously), it is called simultaneous multithreading. This technique is useful because it can utilize otherwise idle issue slots in a multi-issue processor.

Problem 1b[2pts]: What is a data flow architecture? How would it work?

A data flow architecture is one which attempts to exploit the maximum parallelism available in an algorithm. It provides a hardware execution of the dataflow graph. Such architectures typically work by placing operations (instructions) in some sort of physical store such that they are triggered immediately when their operands are available. Completion of a given operation generates data which flows to the operand inputs of dependent operations, thus triggering their execution, etc.

Problem 1c[3pts]: What technological forces have caused Intel, AMD, Sun, and others to start putting multiple processors on a chip?

Power consumption, limited instruction-level parallelism available in typical applications, memory access time (the "memory wall"), and other issues have caused attempts to improve single-thread performance to stall. The cost of improving single-thread performance reached a point where it wasn't worth the resulting gain. Chip manufacturers decided to back off from improving individual single-thread performance and instead start making multicore chips.

Problem 1d[2pts]: Name two components of a modern superscalar architecture whose delay scales quadratically with the issue-width.

Many things scale quadratically with issue width. For instance (a rough count for the forwarding network is sketched below):
1. The delay in the forwarding network scales quadratically with issue-width.
2. The instruction-issue (wakeup) logic scales quadratically with issue-width.
3. Register rename logic scales quadratically with issue-width.
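As a rough illustrative count (assuming S execute stages whose results may still need forwarding), the quadratic growth of the forwarding network can be seen by counting bypass paths: each of the W instructions entering execute must be able to select either of its operands from any in-flight result,

$$\text{bypass paths} \;\approx\; \underbrace{W}_{\text{consumers/cycle}} \times \underbrace{2}_{\text{operands}} \times \underbrace{S\,W}_{\text{in-flight producers}} \;=\; 2\,S\,W^{2},$$

so both the operand-multiplexer fan-in and the result-bus wiring grow quadratically with the issue width W.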

Problem 1e[2pts]: Most branches in a program are highly biased, i.e. they can be predicted by a simple one-level predictor. What can the compiler do to improve the number of branches that are in this category?

By factoring the code so that it duplicates branches that depend on other branches (called "node splitting"). By copying these branches into multiple instances, one for each path through the code, the new branches can become highly biased.

Problem 1f[3pts]: What is the difference between implicit and explicit register renaming? How are they implemented?

Implicit register renaming is what the original Tomasulo algorithm depended on to eliminate WAW and WAR hazards. When issuing an instruction, it replaced registers by either (1) a value or (2) the name of a slot in a reservation station. Thus, the renaming was implicit, since it merely replaced the register names as part of the scheduling algorithm. Explicit renaming, on the other hand, renames user-visible register names with physical register names before the instruction issue stage. The explicit renaming technique relies on a mapping-table/free-list mechanism to allocate registers.

Problem 1g[2pts]: Why are vector processors more power efficient than superscalar processors when executing applications with a lot of data-level parallelism? Explain.

If an application has lots of data-level parallelism, then this can be expressed directly with vector instructions. The vector processor can then spend power executing the actual data operations in parallel, with very little instruction overhead. Trying to execute the same algorithm with a superscalar processor wastes a lot of power extracting the parallelism: branch prediction, large instruction windows, complex issue logic, etc. are all required just to get multiple iterations executing in parallel.
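To make the mapping-table/free-list mechanism of explicit renaming concrete, here is a minimal Python sketch (register counts and names are arbitrary; freeing of physical registers at commit is omitted):

from collections import deque

NUM_ARCH, NUM_PHYS = 8, 16

map_table = {f"r{i}": f"p{i}" for i in range(NUM_ARCH)}        # arch reg -> phys reg
free_list = deque(f"p{i}" for i in range(NUM_ARCH, NUM_PHYS))  # unused physical regs

def rename(dest, src1, src2):
    """Rename one instruction before the issue stage."""
    ps1, ps2 = map_table[src1], map_table[src2]   # sources read the current mappings
    pd = free_list.popleft()                      # destination gets a fresh physical reg
    map_table[dest] = pd                          # later readers of dest will see pd
    return pd, ps1, ps2

# add r1,r2,r3 ; add r3,r1,r4 ; add r3,r2,r2 -- the two writes to r3 receive
# different physical registers, so the WAW hazard disappears by construction.
for instr in [("r1", "r2", "r3"), ("r3", "r1", "r4"), ("r3", "r2", "r2")]:
    print(instr, "->", rename(*instr))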

Problem #2: Superpipelining [21 pts]

Suppose that we have a single-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access) and a single write-back stage. Assume that it has the following execution latencies (i.e. the number of stages that it takes to compute a value): multf (4 cycles), addf (3 cycles), divf (6 cycles), integer ops (1 cycle). Assume full bypassing and two cycles to perform memory accesses, i.e. loads and stores take a total of 3 cycles to execute (including address computation). Finally, branch conditions are computed by the first execution stage (integer execution unit).

Problem 2a[10pts]: Assume that this pipeline consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations. You should do the following on your diagram:
1. Draw each stage of the pipeline as a box and name each of the stages. Stages may have multiple functions: i.e. an execute stage + memory op. You will have a total of 9 stages.
2. Describe what is computed in each stage (e.g. EX1: Integer Ops, Address Compute, First stage of ...).
3. Show all of the bypass paths (as arrows between stages). Your goal is to design a pipeline which never stalls unless a value is not ready. Label each of these arrows with the types of instructions that will forward their results along these paths (i.e. use M for multf, D for divf, A for addf, I for integer operations, Ld for loads). [Hint: be careful to optimize for information feeding into store instructions!]

Answer -- stage functions (the original diagram shows the nine boxes F, D, EX1-EX6, W with bypass arrows labeled I,D,Ld,A,M / I,Ld,A,M / I,Ld,A,M / I,Ld,A / I / I):
F: Fetch next instruction
D: Decode instruction
EX1: Integer ops; address compute (for LD/ST); first stage of addf, multf, divf
EX2: First memory stage of LD/ST; second stage of addf, multf, divf
EX3: Last memory stage of LD/ST; last stage of addf; third stage of multf, divf
EX4: Last stage of multf; fourth stage of divf
EX5: Fifth stage of divf
EX6: Sixth stage of divf
W: Writeback stage
Notes on feedback paths: each result feeds from the beginning of the first stage after its completion back to the beginning of EX1; for data feeding stores, results also feed to the beginning of EX2.

Problem 2b[3pts]: How many extra instructions are required between each of these instruction combinations to avoid stalls (i.e. assume that the second instruction uses a value from the first). Be careful!

Between a divf and a store: 4 insts
Between a load and a multf: 2 insts
Between two integer instructions: 0 insts
Between a multf and an addf: 3 insts
Between an addf and a divf: 2 insts
Between an integer op and a store: 0 insts

Problem 2c[2pts]: How many branch delay slots does this machine have? Explain.

This machine has 2 branch delay slots. This can be seen simply by considering the fact that the result of a branch comparison must somehow affect (forward in time!) a fetch:

F  D  E1 E2 E3 E4 E5 E6 W
   F  D  E1 E2 E3 E4 E5 E6 W
      F  D  E1 E2 E3 E4 E5 E6 W
         F  D  E1 E2 E3 E4 E5 E6 W

The branch condition is known at the end of the branch's E1 stage, so the earliest fetch it can redirect is the one occurring three cycles after the branch's own fetch. As a result, there must be 2 intervening instructions between the branch and the first instruction fetched from the branch target.

Problem 2d[2pts]: Could branch prediction increase the performance of this pipeline? Why or why not?

Yes. Without some form of prediction, we must expose the 2 delay slots (part c) to the compiler. It is hard to do a good job of filling two delay slots with useful instructions. Thus, we could remove the delay slots from the ISA and use branch prediction to predict the instructions immediately after a branch. It is possible that a good branch prediction scheme could do a better job of finding two good instructions for these delay slots than the static compiler. Note, however, that this is very application dependent.

Problem 2e[2pts]: In the 5-stage pipeline that we discussed in class, a load into a register followed by an immediate store of that register to memory would not require any stalls, i.e. the following sequence could run without stalls:

lw r4, 0(r2)
sw r4, 0(r3)

Explain why this was true for the 5-stage pipeline.

This is true because we could forward the load data from the end of the memory stage to the beginning of the memory stage for the store.

Problem 2f[2pts]: Is this still true for your superpipelined processor? Explain.

No. As we can see from (2a), load results become available after EX3 and are forwarded to the beginning of EX1, or to the beginning of EX2 for store data. This means that there must be one instruction between a load and a subsequent store.
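The table in 2b can be cross-checked with a small Python helper (the encoding below is an assumption consistent with 2a: a result becomes forwardable the cycle after the producer's last execute stage, ordinary consumers need their operands at EX1, and stores need their store data only at EX2):

EXEC_STAGES = {"int": 1, "load": 3, "addf": 3, "multf": 4, "divf": 6}
USE_STAGE   = {"default": 1, "store_data": 2}

def insts_between(producer, consumer_use="default"):
    # instructions needed between producer and consumer to avoid a stall
    return max(0, EXEC_STAGES[producer] - USE_STAGE[consumer_use])

print(insts_between("divf", "store_data"))   # 4  (divf feeding a store)
print(insts_between("load"))                 # 2  (load feeding a multf)
print(insts_between("int"))                  # 0  (two integer instructions)
print(insts_between("multf"))                # 3  (multf feeding an addf)
print(insts_between("addf"))                 # 2  (addf feeding a divf)
print(insts_between("int", "store_data"))    # 0  (integer op feeding a store)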

Problem 3: Tomasulo Architecture [20 pts]

Problem 3a[5pts]: Consider a Tomasulo architecture with a reorder buffer. This architecture replaces the normal 5 stages of execution with 5 stages: Fetch, Issue, Execute, Writeback, and Commit. Explain what happens to an instruction in each of them (be as complete as you can):

a) Fetch: Fetch instructions from memory. Usually contains branch-prediction logic.
b) Issue: Get the instruction from the operation queue. If a reservation station is free and a reorder buffer slot is free, send the instruction to the reservation station and reorder buffer after renaming registers (to either values or tags).
c) Execute: When both operands are ready, execute; if not ready, watch the CDB for the result.
d) Writeback: Finish execution. Write the result on the Common Data Bus, using the reorder buffer number as the tag for the result. The result will be written to the reorder buffer slot. Mark the reservation station as free.
e) Commit: Update the actual register file with results from the head of the reorder buffer. Only completed instructions at the head of the reorder buffer will be committed. After commit, the reorder buffer slot is freed.

Problem 3b[3pts]: Name each of the three types of data hazards and explain how the Tomasulo architecture removes them:

RAW: Read-after-Write hazards. These are real data dependencies. Tomasulo removes them by tracking the dependencies during the issue stage, such that data-dependent instructions receive the tag of their source instructions. Results are broadcast on the CDB to instructions that need them, thereby removing the RAW hazard.

WAR: Write-after-Read hazards. Write-after-read hazards are removed because values from the register file are copied to the reservation stations on issue. Thus, there is no possibility for later instructions to overwrite operands read by earlier instructions.

WAW: Write-after-Write hazards. The register renaming mechanism prevents WAW hazards because the register file gets a program-order sequence of tags written to it. If two instructions that write to the same register are in the pipeline simultaneously, then only the tag of the later one is kept in the register file; hence there is no chance for an earlier value to persist over a later value.

Problem 3c[3pts]: Name three structural hazards that this architecture exhibits. Explain your answer.

Structural hazards on the reorder buffer, reservation stations, and the CDB. It cannot issue an instruction if there is no slot available in the reorder buffer and/or no slot available in an appropriate reservation station. Further, two instructions cannot do writeback at the same time if there is only one CDB.
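A toy Python sketch of the value-or-tag bookkeeping behind stages (b)-(e) above (an illustration under simplified assumptions: no functional-unit timing, a single CDB, and made-up structure names):

regs = {"r2": 10, "r3": 20, "r4": 5}      # register file entries hold a value or a tag
rob, stations = [], []                    # reorder buffer and reservation stations

def issue(op, dest, src1, src2):
    tag = f"ROB{len(rob)}"
    rob.append({"tag": tag, "dest": dest, "value": None, "done": False})
    # operands are renamed to values (if ready) or to the producing instruction's tag
    stations.append({"op": op, "tag": tag, "src1": regs[src1], "src2": regs[src2]})
    regs[dest] = tag                      # later readers will wait on this tag

def writeback(tag, value):
    for rs in stations:                   # CDB broadcast: waiting stations grab the value
        if rs["src1"] == tag: rs["src1"] = value
        if rs["src2"] == tag: rs["src2"] = value
    for entry in rob:                     # the result also lands in the ROB slot
        if entry["tag"] == tag:
            entry["value"], entry["done"] = value, True

def commit():
    if rob and rob[0]["done"]:            # in-order commit from the ROB head
        entry = rob.pop(0)
        if regs[entry["dest"]] == entry["tag"]:    # skip if a later write renamed dest
            regs[entry["dest"]] = entry["value"]

issue("add", "r1", "r2", "r3")            # r1 <- r2 + r3
issue("add", "r3", "r1", "r4")            # r3 <- r1 + r4 : src1 carries the tag ROB0
writeback("ROB0", 30)
commit()
print(regs)                               # r1 committed as 30; r3 still holds tag ROB1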

Problem 3d[2pts]: Assume that you have a long chain of dependent instructions, such as the following:

add $r1, $r2, $r3
add $r3, $r1, $r4
add $r7, $r3, $r5

Also assume that the integer execution unit takes one cycle for adds. What CPI would you achieve for this sequence with the basic Tomasulo architecture, assuming that each of the stages from (3a) are non-overlapped and take a complete cycle?

This would achieve a CPI of 2. Reason: assuming sufficient reservation space, each instruction would execute (one cycle), then broadcast its result (one cycle). Assuming that the writeback takes a complete cycle, there would be no overlap of the execute/writeback of subsequent instructions.

Problem 3e[2pts]: Assume that associative matching on the CDB is a slow enough operation that it takes much of a cycle. How can you still get a throughput of one instruction per cycle for long dependent chains of operations such as given in (3d)? Only well-thought-out answers will get credit.

Separate the CDB into a two-cycle operation: the first cycle for matching, the second cycle for data transmission. Then, during the cycle in which an execution was occurring, the functional unit could send out the promise of the value to be transmitted on the following cycle. This would permit the reservation station to recognize when it could execute on the following cycle, thereby setting up the operation to occur. The CDB data could be sent directly to the appropriate input from the functional unit on the next cycle. Note: you have to assume that the time to transmit the value is short enough to permit both transmission and execution in a single cycle.

Problem 3f[2pts]: The Tomasulo algorithm has one interesting bug in it. Consider the situation where one instruction uses a value from another one. Suppose the dependent instruction is issued on the same cycle that the one it depends on is in writeback:

add $r1, $r2, $r3    <- the result is being broadcast
add $r4, $r1, $r1    <- this one is being issued

What is the problem? Can you fix it easily?

The problem is that, at the beginning of the cycle, the issue logic looks in the register file and decides that the value isn't ready, thereby sending the dependent instruction to the reservation station with a tag. Meanwhile, by the end of the cycle, the register file is overwritten with an actual value. Now, there is a tag in the reservation station and a value in the register file, and the new instruction never executes. Fix: do the writeback in the first half of the cycle and look up the register file in the second half of the cycle (like the 5-stage pipeline).

Problem 3g[3pts]: Which changes would you have to make to the basic Tomasulo architecture (with reorder buffer) to enable it to average a CPI = 0.33?

Need register-rename logic that can look at three instructions at the same time. Need a register file that can handle 6 reads and 3 writes at the same time. Need three CDBs and appropriate logic to arbitrate for them. Need a reorder buffer that can accept three new instructions per cycle, take three writebacks per cycle, and commit three instructions per cycle. Also, need enough actual reservation stations and execution units so that you can have an average of three instructions running per cycle.
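A tiny Python illustration of the CPI = 2 result in (3d), under the stated assumption that Execute and Writeback are separate, non-overlapped, full-cycle stages:

exec_cycle = 0                           # the first add executes in cycle 0
for i in range(1, 5):                    # four more dependent adds
    writeback = exec_cycle + 1           # result broadcast one cycle after execute
    exec_cycle = writeback + 1           # dependent add executes the cycle after that
    print(f"dependent add #{i} executes in cycle {exec_cycle}")
# One execution completes every 2 cycles, giving a steady-state CPI of 2.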

Problem #4: Fixing the loops [21 pts]

For this problem, assume that we have a superpipelined architecture like that in problem (2) with the following use latencies (these are not the right answers for problem #2b!):

Between a multf and an addf: 3 insts
Between a load and a multf: 2 insts
Between an addf and a divf: 1 inst
Between a divf and a store: 6 insts
Between an int op and a store: 0 insts
Number of branch delay slots: 1

Consider the following loop which performs a restricted rotation and projection operation. In this code, F0 and F1 contain sin(θ) and cos(θ) for rotation. The array based at register r1 contains pairs of single-precision (32-bit) values which represent x,y coordinates. The array based at register r2 receives a projected coordinate along the observer's horizontal direction:

project: ldf   F3,0(r1)
         multf F10,F3,F0       ; 2 stall cycles
         ldf   F4,4(r1)
         multf F11,F4,F1       ; 2 stall cycles
         addf  F12,F10,F11     ; 3 stall cycles
         divf  F13,F12,F2      ; 1 stall cycle
         stf   0(r2),F13       ; 6 stall cycles
         addi  r1,r1,#8
         addi  r2,r2,#4
         subi  r3,r3,#1
         bneq  r3, r0, project
         nop
                               Total: 14 stall cycles

Problem 4a[2pts]: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall.

This takes a total of 12 + 14 = 26 cycles/iteration (the stalls are marked above; each occurs before the labeled instruction).

Problem 4b[4pts]: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software pipeline it. How many cycles do you get per iteration of the loop now?

project: ldf   F3,0(r1)
         ldf   F4,4(r1)
         addi  r1,r1,#8
         multf F10,F3,F0
         multf F11,F4,F1
         addi  r2,r2,#4
         subi  r3,r3,#1
         addf  F12,F10,F11     ; 1 stall cycle
         divf  F13,F12,F2      ; 1 stall cycle
         bneq  r3, r0, project
         stf   -4(r2),F13      ; 5 stall cycles

7 stall cycles, 11 instructions: 18 cycles/iteration.
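The stall counts above can be checked with a short Python scheduling sketch (not the exam's required method; the instruction classes below and the zero use latency assumed for integer-to-integer and integer-to-branch dependences are assumptions):

USE_LATENCY = {                 # (producer class, consumer class) -> insts required between
    ("ldf", "multf"): 2,
    ("multf", "addf"): 3,
    ("addf", "divf"): 1,
    ("divf", "stf"): 6,
    ("int", "stf"): 0,
}

def schedule(instrs):
    """instrs: list of (text, class, [indices of producers]); returns issue cycles."""
    issue = []
    for text, cls, deps in instrs:
        cycle = issue[-1] + 1 if issue else 0
        for d in deps:
            gap = USE_LATENCY.get((instrs[d][1], cls), 0)
            cycle = max(cycle, issue[d] + gap + 1)
        issue.append(cycle)
    return issue

# Original loop body of problem 4a (branch delay slot filled with the nop).
loop = [
    ("ldf F3,0(r1)",       "ldf",   []),
    ("multf F10,F3,F0",    "multf", [0]),
    ("ldf F4,4(r1)",       "ldf",   []),
    ("multf F11,F4,F1",    "multf", [2]),
    ("addf F12,F10,F11",   "addf",  [1, 3]),
    ("divf F13,F12,F2",    "divf",  [4]),
    ("stf 0(r2),F13",      "stf",   [5]),
    ("addi r1,r1,#8",      "int",   []),
    ("addi r2,r2,#4",      "int",   []),
    ("subi r3,r3,#1",      "int",   []),
    ("bneq r3,r0,project", "int",   [9]),
    ("nop",                "int",   []),
]

cycles = schedule(loop)
total = cycles[-1] + 1
print(total, "cycles/iteration;", total - len(loop), "stall cycles")   # 26; 14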

Problem 4c[6pts]: Unroll the loop once and schedule it to run with as few cycles as possible per iteration of the original loop. How many cycles do you get per iteration now?

project: ldf   F3,0(r1)
         ldf   F4,4(r1)
         ldf   F5,8(r1)
         multf F10,F3,F0
         multf F11,F4,F1
         ldf   F6,12(r1)
         multf F12,F5,F0
         multf F13,F6,F1       ; 1 stall cycle
         addf  F14,F10,F11
         divf  F15,F14,F2      ; 1 stall cycle
         addf  F16,F12,F13
         divf  F17,F16,F2      ; 1 stall cycle
         addi  r1,r1,#16
         addi  r2,r2,#8
         subi  r3,r3,#2
         stf   -8(r2),F15
         bneq  r3, r0, project
         stf   -4(r2),F17      ; 1 stall cycle

So: total number of stalls = 4. Number of instructions = (7 x 2) + 4 = 18. Thus, we have 22/2 = 11 cycles/iteration. Note that the 3rd stall cycle adds an extra cycle between the first divide and the first store; hence there is no stall before the first store.

Problem 4e[3pts]: Your loop in (4c) will not run without stalls. Without going to the trouble to unroll further, what is the minimum number of times that you would have to unroll this loop to avoid stalls? How many cycles would you get per iteration then?

With 4 or more iterations, we can put all the loads together, all the multiplies together, etc. without any stalls until the stores. Then the tail of the schedule is D^n I^3 S^(n-1) B S (where D = divf, I = int op, S = store, and B = branch). To avoid a stall, we look between the first D and the first S, where there are (n-1)+3 = n+2 instructions; we need n+2 >= 6, so n = 4 iterations. So, we want 4 iterations. Cycles/iteration = [ (7 x 4) + 4 ] / 4 = 8 cycles/iteration.

Problem 4f[6pts]: Rewrite your code to utilize vector instructions and to run as fast as possible. Assume that the value in r3 is the vector length. Make sure to comment each instruction to say what it is doing. Assuming full chaining, one instruction/cycle issue, and delays for instructions/memory that are the same as the non-vector processor, how long does this code take to execute (you can use the original value of r3 in your expression)?

project: SVL   r3        ; Set vector length to r3
         LVF   V0,r1,8   ; Load single-float V0 from addr r1, stride 8
         addi  r1,r1,4   ; Increment base addr by 4
         LVF   V1,r1,8   ; Load single-float V1 from addr r1, stride 8
         MULTF V2,V0,F0  ; Multiply V0 by constant F0 -> V2
         MULTF V3,V1,F1  ; Multiply V1 by constant F1 -> V3
         ADDF  V4,V2,V3  ; Add vectors V2 and V3 -> V4
         DIVF  V5,V4,F2  ; Divide V4 by constant F2 -> V5
         SVF   V5,r2,4   ; Store single-float V5 at addr r2, stride 4

After the execution gets going, data flows through the functional units at full speed (assuming for now 1 lane). After the first stall (before the second MULTF), issue gets ahead of execution and doesn't impact time. So: 6 cycles up to that MULTF, then 4 cycles for MULTF, 2 cycles ADDF, 8 cycles DIVF, and 2 cycles SVF until the first value. Remaining values complete 1 per cycle. Time = 6 + 4 + 2 + 8 + 2 + (r3 - 1) = 21 + r3 cycles.
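Spelling the 4f timing out as a single expression (same numbers as above; in general, with full chaining the execution time is the startup latency through the chained units plus one cycle per remaining element):

$$T \;=\; \underbrace{6}_{\text{issue up to 2nd MULTF}} + \underbrace{4}_{\text{MULTF}} + \underbrace{2}_{\text{ADDF}} + \underbrace{8}_{\text{DIVF}} + \underbrace{2}_{\text{SVF}} + \underbrace{(r3 - 1)}_{\text{remaining elements}} \;=\; 21 + r3 .$$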

Problem 4g: [Extra Credit: 5pts] Assume that you have a Tomasulo architecture with functional units of the same execution latency (number of cycles) as our deeply pipelined processor (be careful to adjust the use latencies to get the number of execution cycles!). Assume that it issues one instruction per cycle and has an unpipelined divider with a small number of reservation stations. Suppose the other functional units are duplicated with many reservation stations and that there are many CDBs. What is the minimum number of divide reservation stations needed to achieve one instruction per cycle with the optimized code of (4b)? Show your work. [Hint: assume that the maximum issue rate is sustained and look at the scheduling of a single iteration.]

Answer: The best way to understand this is to actually look at the timing of the issue slots. First, we take the use latencies from the beginning of this problem to extract the execution latencies (number of execution stages) for the different operations:

Load: 3 cycles, Add: 2 cycles, Multiply: 4 cycles, Divide: 8 cycles (careful here!)

Next, we show the timing of two iterations, assuming that the WB (broadcast) of one operation and the scheduling of the next can occur in the same cycle. (The table lists, for each instruction of the 4b loop across the iterations -- ldf, ldf, addi, multf, multf, addi, subi, addf, divf, bne, stf, and so on -- its issue cycle, start-of-execution cycle, end-of-execution cycle, and writeback cycle.)

Looking at this table, we only need 2 reservation stations: one that is running, and one waiting.

Problem 5: Prediction [24 pts]

In this question, you will examine several different schemes for branch prediction, using the following code sequence for a MIPS-like ISA with no branch delay slot:

        addi r2, r0, #45      ; initialize r2 to 101101, binary
        addi r3, r0, #6       ; initialize r3 to 6, decimal
        addi r4, r0, #10000   ; initialize r4 to a big number
top:    andi r1, r2, #1       ; extract the low-order bit of r2
PC1-->  bnez r1, skip1        ; branch if the bit is set
        xor  r0, r0, r0       ; dummy instruction
skip1:  srli r2, r2, #1       ; shift the pattern in r2
        subi r3, r3, #1       ; decrement r3
PC2-->  bnez r3, skip2
        addi r2, r0, #45      ; reinitialize pattern
        addi r3, r0, #6
skip2:  subi r4, r4, #1       ; decrement loop counter
PC3-->  bnez r4, top

This sequence contains 3 branches, labeled by PC1, PC2, and PC3.

Problem 5a[2pts]: Sketch a basic PAg predictor that might be used for prediction. Assume that we will be tracking 3 bits of history.

(Diagram: selected instruction-address bits index a per-address branch history register table (PABHR); the selected 3-bit history then indexes a single global pattern history table (GPHT) of 2-bit counters, which supplies the prediction.)

Problem 5b[2pts]: What is the minimum range of instruction address bits required to address the branch history table for your PAg predictor in order to avoid aliasing between PC1, PC2, and PC3? How many entries does this correspond to?

Notice that the branches at PC1, PC2, and PC3 are separated by 4 instructions each. Assuming that instructions are 4 bytes (2 address bits), this means that we want instruction bits 5:4, since PC1, PC2, and PC3 are distinguished in these bits. Hence, our PABHR needs 4 entries. Note that you will still have some aliasing through the pattern history table regardless.
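A quick numeric check of the 5b answer (the byte addresses below are illustrative assumptions: 4-byte instructions with the branch at PC1 placed at address 0x04, so PC2 and PC3 fall 16 bytes apart at 0x14 and 0x24):

for name, pc in [("PC1", 0x04), ("PC2", 0x14), ("PC3", 0x24)]:
    print(name, format((pc >> 4) & 0b11, "02b"))   # bits 5:4 -> 00, 01, 10 (all distinct)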

Problem 5c[6pts]: The following are the steady-state taken/not-taken patterns for each of the three branches (T indicates taken, N indicates not taken):

PC1: TTNTNT TTNTNT...
PC2: NTNTNT NTNTNT...
PC3: TTTTTN TTTTTN...

Using the PAg predictor of 5a and assuming no aliasing (i.e. a correct answer to 5b), what is the steady-state prediction success rate (that is, the ratio of correctly predicted branches to total branches) for each branch? Assume that all 2-bit predictors are initialized to zero. Hint: Draw a table representing the values (T or N) fed to each entry of the pattern history table. After you get a repeating-pattern stream for each predictor, you should be able to know how each 2-bit counter will predict.

The trick here is to take groups of three branch outcomes together and look at the follow-on branch. For instance, we take the first three outcomes of each branch, assuming that we always take PC1, PC2, PC3 in that order as our pattern (since I didn't tell you otherwise). We see: TTN -> T for PC1, NTN -> T for PC2, and TTT -> T for PC3. For each of these results, a T is added to the corresponding line of the table below, labeled with a subscript indicating which branch it belongs to. (In the original table, a first rectangle marks this first set of follow-ons; a second rectangle marks the second set of branch values: TNT -> N, TNT -> N, TTT -> T.)

Pattern   Follow-on instances    Prediction
NNN       -                      ?
NNT       -                      ?
NTN       T2 T1 T2 T2            T
NTT       T1 T3                  T
TNN       -                      ?
TNT       N1 N2 T1 N2 T3 N2      N
TTN       T1 T3                  T
TTT       T3 T3 N3 N1            N N T N (see below)

Note that we've continued putting in instances until there is a repeating pattern. The only ambiguous prediction is for TTT; for the others, any initialization of the counters will be washed out after enough iterations. For TTT, we have utilized the initialization condition here: TTT starts at 00. First prediction: N; the T takes us to 01. Next prediction: N; the T takes us to 10. Next prediction: T; the N takes us to 01. Next prediction: N; the final N takes us back to 00, and the pattern repeats. Note that only the last TTT in the pattern is a correct prediction!

Now, we just work our way through the branches and get our success rate:
PC1: 5/6 = 83.3%, PC2: 6/6 = 100%, PC3: 2/6 = 33.3%
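For experimentation, a small PAg simulation sketch in Python (an illustration, not the graded derivation: per-branch 3-bit histories, one shared 8-entry table of 2-bit counters initialized to zero, and the three branches processed in program order; its steady-state rates can be compared against the hand analysis above):

PATTERNS = {"PC1": "TTNTNT", "PC2": "NTNTNT", "PC3": "TTTTTN"}
HIST_BITS = 3

pht = [0] * (1 << HIST_BITS)              # shared 2-bit counters, all start at 00
hist = {pc: 0 for pc in PATTERNS}         # per-address branch history registers
correct = {pc: 0 for pc in PATTERNS}
total = {pc: 0 for pc in PATTERNS}

WARMUP, MEASURE = 600, 600                # count accuracy only after a warm-up period
for i in range(WARMUP + MEASURE):
    for pc, pattern in PATTERNS.items():  # program order: PC1, PC2, PC3
        outcome = pattern[i % len(pattern)] == "T"
        idx = hist[pc]
        prediction = pht[idx] >= 2        # counter MSB supplies the prediction
        if i >= WARMUP:
            total[pc] += 1
            correct[pc] += (prediction == outcome)
        pht[idx] = min(3, pht[idx] + 1) if outcome else max(0, pht[idx] - 1)
        hist[pc] = ((hist[pc] << 1) | outcome) & ((1 << HIST_BITS) - 1)

for pc in PATTERNS:
    print(pc, f"{correct[pc] / total[pc]:.1%}")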

Problem 5d[2pts]: Can you make a simple argument why a version of PAg with 6 bits of history will have 100% prediction accuracy for this set of branch patterns?

Very simple answer: since every pattern repeats after 6 bits, we know that each 6-bit history pattern has only one possible outcome: the branch outcome at the beginning of the pattern. For instance, for a pattern B1 B2 B3 B4 B5 B6 the outcome is B1. Consequently, there is no ambiguity in the outcome, i.e. the particular 2-bit counter associated with B1 B2 B3 B4 B5 B6 will easily converge to predict B1. (There can't be a different outcome, because this would be a different 6-bit pattern and thus a different 2-bit counter.)

Problem 5e[4pts]: Draw the following global predictors: GAg, GShare, GAs. What is the reason for using a GShare or GAs predictor instead of a GAg predictor?

(Diagrams: GAg uses a single global branch history register (GBHR) to index one global pattern history table (GPHT); GAs uses the GBHR together with branch-address bits to select among per-address pattern history tables (PAPHT); GShare XORs the GBHR with branch-address bits to index a single GPHT.)

The reason to use a GShare or GAs predictor instead of a GAg predictor is to lessen the effects of aliasing in the PHT.
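A sketch of the index formation for the three schemes (the bit widths and the choice of address bits are arbitrary illustrations):

HIST_BITS, ADDR_BITS = 4, 2
MASK = (1 << HIST_BITS) - 1

def gag_index(ghr, pc):
    return ghr & MASK                                   # history only

def gas_index(ghr, pc):
    addr = (pc >> 4) & ((1 << ADDR_BITS) - 1)           # e.g. address bits 5:4, as in 5b
    return (addr << HIST_BITS) | (ghr & MASK)           # selects a per-address PHT

def gshare_index(ghr, pc):
    return (ghr ^ (pc >> 2)) & MASK                     # hash the history with the address

ghr = 0b1010
for pc in (0x04, 0x14, 0x24):        # three branches seeing the same global history
    print(hex(pc), gag_index(ghr, pc), gas_index(ghr, pc), gshare_index(ghr, pc))
# GAg maps all three to the same counter; GAs and GShare give them different counters,
# which is exactly the aliasing reduction mentioned above.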

Problem 5f[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period: Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?

Answer: a strided predictor of some sort will predict this properly. Here is an extremely simple version that only remembers one previous value. It will predict properly after seeing one value (one to initialize the "previous" register).

(Diagram: the incoming value and a register holding the previous value feed a subtractor that produces the stride; an adder then adds the stride to the incoming value to form the prediction.)

Problem 5g[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period: Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?

This requires a context predictor. To get uniqueness, we need a 2nd-order predictor, since we need the following predictions: [1, 3] -> 3, [3, 3] -> 7, [3, 7] -> 10, [7, 10] -> 1, [10, 1] -> 3. We will need to see 7 values before we can predict correctly.

(Diagram: a 2-entry history of the most recent values selects a row of a frequency table; the highest-frequency element in that row is output as the prediction, and on each new input the entry for the current value in the selected row is incremented.)
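Minimal Python sketches of the two predictor types (illustrative only: the context predictor is a simplified last-value variant of the frequency-table design described above, and the example value stream is inferred from the prediction pairs [1,3]->3, [3,3]->7, [3,7]->10, [7,10]->1, [10,1]->3 listed in the answer):

class StridePredictor:                    # for 5f-style constant-stride sequences
    def __init__(self):
        self.prev, self.stride = None, 0
    def predict(self):
        return None if self.prev is None else self.prev + self.stride
    def update(self, value):
        if self.prev is not None:
            self.stride = value - self.prev
        self.prev = value

class ContextPredictor:                   # for 5g: maps (v[n-2], v[n-1]) -> v[n]
    def __init__(self):
        self.table, self.hist = {}, []
    def predict(self):
        return self.table.get(tuple(self.hist[-2:]))
    def update(self, value):
        if len(self.hist) >= 2:
            self.table[tuple(self.hist[-2:])] = value
        self.hist.append(value)

cp = ContextPredictor()
for v in [1, 3, 3, 7, 10, 1, 3, 3, 7, 10, 1, 3]:
    print(cp.predict(), "->", v)          # predictions become correct once every
    cp.update(v)                          # pair has been observed (after 7 values)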

[ This page intentionally left blank! ]
