High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur


High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

Hello and welcome to today's lecture on Dynamic Instruction Scheduling with Branch Prediction. In the last two lectures, we discussed various techniques of branch prediction, and earlier we discussed dynamic instruction scheduling. Today we shall see how these two can be combined to achieve higher performance in a processor.

(Refer Slide Time: 01:27)

Before that, let us have a quick recap of one very important concept, dataflow architecture, which underlies dynamic instruction scheduling. Dataflow architectures were proposed with the basic idea that the hardware is a direct encoding of the compiler's dataflow graph. For example, suppose you have to perform the computation y = (a + b) / x and x = a * (a + b) + b, where the inputs are a and b and the outputs are y and x. In a dataflow architecture, a dataflow graph is created for this computation: a and b are the inputs, y and x are the outputs, and the nodes are the various operations, the addition, the multiplication, and the division.

Now, what is the basic idea of a dataflow architecture? Data flows along the arcs in the form of tokens; the two circles shown on the slide are tokens. When tokens have arrived on all the inputs of a compute box, for instance the adder, the box fires and produces a new token at its output. If there is a split, as for a + b here, copies of the token are produced: the adder fires, and the resulting token becomes available on both outgoing arcs. Once a token is present on the a arc and on the a + b arc, the multiply box fires and produces a token. A token is then present on that arc, and a token is already present on the b arc, so the add box fires. Similarly, with tokens available on both of its input arcs, the divide box fires and finally produces the token for y. This is how the flow of a computation is controlled with the help of tokens in dataflow machines.

Something very similar is done in dynamic instruction scheduling, particularly in Tomasulo's approach: this dataflow graph is effectively built and executed by the hardware while the instructions are executing.
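To make the token-firing rule concrete, here is a minimal sketch in Python; it is not from the lecture, and the arc and node names are my own, but it evaluates exactly this graph: a node fires only when a token is present on every one of its input arcs.

```python
# Minimal dataflow-style evaluation of x = a*(a+b) + b and y = (a+b)/x.
# A node fires only when a token (value) is available on every input arc.

def run_dataflow(a, b):
    # Tokens held on arcs, keyed by arc name; the inputs start with tokens.
    tokens = {"a": a, "b": b}

    # Each node: (output arc, operation, input arcs). Names are illustrative.
    nodes = [
        ("sum",  lambda s, t: s + t, ("a", "b")),     # a + b (its token is
                                                      # copied to both consumers)
        ("prod", lambda s, t: s * t, ("a", "sum")),   # a * (a + b)
        ("x",    lambda s, t: s + t, ("prod", "b")),  # a*(a+b) + b
        ("y",    lambda s, t: s / t, ("sum", "x")),   # (a + b) / x
    ]

    fired = set()
    while len(fired) < len(nodes):
        for out, op, ins in nodes:
            # Firing rule: all input tokens present and node not yet fired.
            if out not in fired and all(i in tokens for i in ins):
                tokens[out] = op(*(tokens[i] for i in ins))
                fired.add(out)
    return tokens["x"], tokens["y"]

x, y = run_dataflow(2.0, 3.0)   # sum=5, prod=10, x=13, y=5/13
print(x, y)
```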

(Refer Slide Time: 05:07)

That means this dataflow graph is created within the hardware, with the help of the reservation stations and the other functional units. So, this is one very important concept. We have already discussed dynamic instruction scheduling: the hardware rearranges the instruction execution to reduce stalls, while maintaining the dataflow and the exception behavior. As I mentioned, the same concept that I briefly explained for dataflow architectures has been borrowed here. Of course, it also maintains the exception behavior, which is equally important, and it has many advantages.

As we have already discussed, it allows handling situations not known at compile time, so it is better than compile-time instruction scheduling, and it simplifies the compiler. It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve: if some instruction is delayed, other instructions can proceed out of order. In early processors, when cache memory was not present, a memory instruction would take a long time; even when a cache is present, an access may take longer because of a cache miss.

This type of problem is resolved by executing other code while waiting for the miss to resolve; that is, the differing delays of different instructions are tolerated. It also allows code that was compiled with one pipeline in mind to run effectively on a different pipeline, as elaborated earlier. In my next lecture, I shall discuss another very important technique, hardware speculation. You will see that hardware speculation goes hand in hand with dynamic instruction scheduling, and dynamic instruction scheduling along with hardware speculation can lead to a further performance advantage. And, as I have already mentioned, Tomasulo's scheme builds the dataflow graph on the fly: the dataflow graph created here is created on the fly in Tomasulo's scheme.

(Refer Slide Time: 07:52)

We have already discussed the basic structure required to implement Tomasulo's scheme. You have an instruction queue and the floating-point registers; the key innovation is the availability of reservation stations, as we have discussed, along with load buffers and store buffers.

(Refer Slide Time: 08:20)

The main features are these. First, register renaming, which is achieved with the help of the reservation stations, which buffer the operands of instructions waiting to issue, and by the issue logic that we have already discussed; this avoids WAR and WAW hazards without stalling. Second, distributed hazard detection and execution control using the reservation stations, which buffer the instructions and their operands: with the help of this buffering, results can be passed directly to the functional units from the reservation stations rather than going through the registers. This bypassing is done over the common data bus, because the result is broadcast on the common data bus and received simultaneously by all the functional units waiting for that operand. These are the key contributions of Tomasulo's approach. Another point is that integer instructions can go past branches, which allows floating-point operations to proceed beyond the basic block in the FP queue; this can be done only with the help of branch prediction, as we shall see.
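As a concrete picture of the buffering just described, here is a minimal Python sketch of the per-station state in Tomasulo's scheme; the field names Busy, Op, Vj, Vk, Qj, Qk and A follow the usual textbook description, while the class itself is only an illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                   # tag broadcast on the common data bus, e.g. "Mult1"
    busy: bool = False          # station holds an instruction not yet completed
    op: Optional[str] = None    # operation to perform, e.g. "MUL.D"
    vj: Optional[float] = None  # value of first operand, if already available
    vk: Optional[float] = None  # value of second operand, if already available
    qj: Optional[str] = None    # tag of station that will produce Vj (None = ready)
    qk: Optional[str] = None    # tag of station that will produce Vk (None = ready)
    a: Optional[int] = None     # effective address (load/store stations only)

    def ready(self) -> bool:
        # Execution may begin once both operands are values, i.e. the
        # instruction is no longer waiting on any producing station.
        return self.busy and self.qj is None and self.qk is None
```

Renaming is implicit in these fields: an operand slot holds either a value (Vj/Vk) or the tag (Qj/Qk) of the station that will produce it, never an architectural register number.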

(Refer Slide Time: 09:50)

Let us illustrate this with the help of Tomasulo's loop example. Earlier we saw how dynamic instruction scheduling can be done over a basic block, and how the size of the basic block can be increased by loop unrolling and so on; now you will see that loop unrolling is done dynamically with the help of branch prediction. These are the assumptions we shall make: the multiply takes 4 clock cycles; the first load takes 8 clock cycles because of a cache miss, and the second load takes 4 clock cycles. As you know (I shall discuss cache memory in detail later), whenever there is a cache miss while trying to read a particular word, the entire block comprising several words is transferred to the cache, and as a result a subsequent read of another word from that block is a hit. So the first access is a miss and subsequent ones are hits; later we shall discuss in detail how the cache really works and the terminology of cache miss and cache hit. For the time being, let us simply assume the first load takes 8 clock cycles because of a cache miss and the second load takes 4.

We shall also show the clocks for the two instructions used for the housekeeping of the loop computation: R1, which acts as a pointer to the elements of an array, is decremented step by step by a subtract-immediate, and the branch decides whether the loop proceeds by checking whether R1 has reached 0. We shall see that the integer instructions proceed ahead of the floating-point ones, but we shall primarily focus on the floating-point operations.
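For reference, here is the loop under discussion as I reconstruct it from the lecture's description (the exact mnemonics on the slide may differ), encoded as Python data along with the stated latency assumptions:

```python
# The loop from the example, reconstructed from the lecture's description:
#   Loop: L.D   F0, 0(R1)     ; load an array element
#         MUL.D F4, F0, F2    ; multiply by the scalar held in F2
#         S.D   F4, 0(R1)     ; store the result back
#         SUBI  R1, R1, 8     ; R1 walks down through the array (80, 72, 64, ...)
#         BNEZ  R1, Loop      ; repeat until R1 reaches 0
loop_body = ["L.D F0,0(R1)", "MUL.D F4,F0,F2", "S.D F4,0(R1)",
             "SUBI R1,R1,8", "BNEZ R1,Loop"]

# Latency assumptions stated in the lecture (in clock cycles):
latency = {
    "L.D(first)": 8,    # cache miss on the very first load
    "L.D(later)": 4,    # subsequent loads hit in the cache
    "MUL.D": 4,
}
```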

(Refer Slide Time: 14:06)

Hazards arise due to out-of-order execution. If a branch is predicted as taken, multiple executions of the loop can proceed at once using the reservation stations. Normally execution is restricted to a basic block, but if the branch is predicted taken, then obviously the instructions of the following iterations can be carried out as well. That is the basic idea here. And a load and a store can execute out of order, provided they access different memory locations.

This is another thing we have to take care of. We have already discussed the read-after-write hazards, which arise out of true dependences and are taken care of by Tomasulo's algorithm; we have seen how that is done. The other two hazards, write after read and write after write, arise out of name dependences. If they involve registers, they are taken care of by register renaming, which we have already discussed, and in Tomasulo's algorithm this renaming is done dynamically without actually using additional architectural registers. The question naturally arises: if these hazards arise involving memory, how are they taken care of?

You may have a series of load and store instructions in a program, and the sequence in which they appear is, as you know, the program order. Suppose you have one load, say L.D F0, 0(R1), and somewhere later another memory instruction, say at 8(R1), with possibly several other loads and stores before and after it; in between, the value of R1 may have changed, so the two effective addresses may in fact refer to the same memory location. If a load and a store, or two stores, involve the same location and you change their order, this leads to a hazard: reordering a store ahead of an earlier load to that location gives a write-after-read hazard through memory, and reordering two stores gives a write-after-write hazard.

How is this type of hazard overcome in Tomasulo's approach? It is done in this way: whenever a load instruction is encountered, its effective address is calculated and becomes available in the A field, the address field of the Tomasulo data structure kept with the load and store buffers. Knowing the effective address of the load, the hardware then checks whether any earlier store with the same effective address is still active, that is, has not yet been carried out; if so, this load has to be delayed, and it is not entered into the load buffer. The situation is similar for a store instruction: its effective address, again kept in the A field, has to be compared with the effective addresses of the preceding load and store instructions that have not yet been carried out but are still active in the buffers, and on a match the store is delayed. This is how loads and stores are taken care of.

(Refer Slide Time: 21:00)

So, if a load and a store access the same address, out-of-order execution leads to RAW, WAR, or WAW hazards, particularly the WAR and WAW hazards that you have just seen. To detect such hazards, the processor must compute the effective addresses of the load and store instructions in program order. Tomasulo's scheme thus combines two different techniques: the renaming of the architectural registers to a larger set of registers, together with buffering of source operands from the register file, which corresponds to the register hazards; and for memory, we have to do it this way, by comparing effective addresses.
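A minimal Python sketch of that effective-address check (the helper name and buffer representation are my own; the rule is the standard textbook one): a load must wait if an earlier, still-active store has the same effective address, and a store must wait if any earlier, still-active load or store does.

```python
def must_delay(kind, eff_addr, earlier_active):
    """Decide whether a new memory instruction may enter its buffer.

    kind           -- "load" or "store" for the instruction being issued
    eff_addr       -- its computed effective address (the A field)
    earlier_active -- (kind, A-field) pairs for earlier loads/stores that
                      have not yet completed, in program order
    """
    for prev_kind, prev_addr in earlier_active:
        if prev_addr != eff_addr:
            continue                  # different locations: safe to reorder
        if kind == "load" and prev_kind == "store":
            return True               # RAW through memory
        if kind == "store":
            return True               # WAR (earlier load) or WAW (earlier store)
    return False

# Example: a store to address 72 must wait for an earlier load from 72,
# but a load from 64 may go past a pending store to 72.
print(must_delay("store", 72, [("load", 72)]))   # True
print(must_delay("load", 64, [("store", 72)]))   # False
```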

(Refer Slide Time: 21:43)

Now let us consider the loop example I have already mentioned. Here you can see two iterations of the loop present in the instruction window, and the code to be executed is shown. Initially, register R1 holds the value 80; that is, 80 is the memory address from which the first element is to be loaded. With this starting point, let us see how the execution proceeds. The various structures, the load buffers and store buffers, are shown here; the instruction loop is given, along with the iteration count, this being the first iteration and this the second, and the values of the registers used for addressing and iteration control.

(Refer Slide Time: 23:02)

Now let us go to clock cycle 1. The first load instruction is issued: you can see the load unit is busy, its effective address, 80, is available, and the destination register from the load instruction is recorded; the pointer shows that this load is the instruction being issued.

(Refer Slide Time: 23:32)

In the second clock cycle, the second instruction, the double-precision multiply, is issued and the necessary data structure is updated. The entry shows that this instruction has been issued: the station is busy, the operation is recorded, and the fields show where its operands will come from. One operand will be taken from register F2, so its value is copied in from the register file; the other operand, F0, will come from Load1, so that tag is recorded in the corresponding field.

(Refer Slide Time: 24:28)

In the third cycle, the store instruction of the first iteration is issued and the corresponding data structure is again updated. The store unit is now busy, its effective address is 80, and the functional unit from which its value will come is recorded: F4 will come from Multiplier1, as shown here. There is no other change. Notice that implicit renaming sets up the dataflow graph: implicit renaming means we do not really perform any explicit renaming of these registers; when we go to the second iteration, it will be done implicitly, because the data will be taken from the reservation stations, as we shall see. You can see the chain: F0 comes from the load and feeds the multiply, and the multiply's result in turn feeds the store, so both links of the dataflow graph are set up here.
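The implicit renaming just described is captured in the register result status: at issue time, an operand field receives either the register's current value or the tag of the station that will produce it. A minimal Python sketch of the state after cycle 3, with illustrative tags following the textbook convention:

```python
# Register result status after cycle 3: an entry holds the tag of the
# station that will write the register, or None if the register file is current.
reg_status = {"F0": "Load1",   # the first load will produce F0
              "F2": None,      # scalar operand, value already in register file
              "F4": "Mult1"}   # the first multiply will produce F4

def capture_operand(reg, reg_file, reg_status):
    """At issue time, return (V, Q): the value if available, else the
    producer's tag. This is the whole renaming step; the waiting
    instruction never reads the architectural register afterwards."""
    tag = reg_status[reg]
    if tag is None:
        return reg_file[reg], None     # operand value captured immediately
    return None, tag                   # wait for this station's broadcast

reg_file = {"F0": 0.0, "F2": 2.5, "F4": 0.0}
print(capture_operand("F2", reg_file, reg_status))  # (2.5, None)
print(capture_operand("F0", reg_file, reg_status))  # (None, 'Load1')
```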

(Refer Slide Time: 26:21)

Next the SUBI instruction is dispatched, but it does not go to the floating-point queue: the slide shows the floating-point operation queue, and this subtract-immediate instruction is executed outside it, and its execution changes the value of R1.

(Refer Slide Time: 26:48)

That is what has happened here: the value of register R1 has been changed, 8 has been subtracted from it and it is now 72. This instruction has executed, the branch will execute next, and we move to the second iteration.

(Refer Slide Time: 27:14)

We have now reached the second iteration. You can see that two cycles, 4 and 5, are missing before the second iteration starts: they were needed for calculating the effective address for the second iteration and for evaluating the branch condition. With the condition resolved, the load instruction is issued in the sixth clock cycle. Accordingly, the load unit becomes busy again, but now the effective address is 72, not 80, as shown here. So two load instructions have been issued, although neither has completed its operation.

Now notice what is happening: earlier the entry for F0 said Load1, meaning this floating-point register was to be loaded by the Load1 operation. It has now been changed to Load2: before the first load could write its result, the tag was overwritten, so F0 never sees the value loaded by Load1. That value is not lost; it is available in the reservation stations, but it is never written into the register, because before the first load reached result writing, another load targeting the same register was issued, and that is why the entry was modified. So the register will not see the first load's result, and execution proceeds to the next cycle.

(Refer Slide Time: 29:28)

In the 7th cycle, the second multiply is issued; this is possible because you have got two multipliers. The data structure is updated accordingly: one operand comes from register F2, whose value is written into the Vk field, and the other operand will be available only after the corresponding load completes, so the tags Load1 and Load2 show where the F0 operands of the two multiplies will come from, with the values eventually arriving in Vj. Once issued, the computation is completely detached from the register file, because it proceeds without involving the registers. You can now see that the first and the second iterations are overlapped: the instructions of both iterations have been issued, although their completion has not yet taken place.

(Refer Slide Time: 30:54)

Now the store instruction of the second iteration is also issued, because a store unit is available. The corresponding address field is set accordingly: its effective address is 72. You can see there is no hazard so far.

(Refer Slide Time: 31:18)

The first load now continues and completes its operation. We know this first load takes 8 cycles, that was our assumption, because it suffers a cache miss: it was issued in cycle 1, so in the 9th cycle the load operation is completed, and the value becomes available in the corresponding reservation station, as we shall see. The load is completing, and obviously Multiplier1 is waiting for this data to become available. Meanwhile the integer unit is again dispatching the housekeeping instructions, this time for the second iteration, and this continues.

(Refer Slide Time: 32:33)

Now, as I was saying, the load has completed and the corresponding value is transferred directly into Vj; Vj is a register available as part of the reservation station. Both operand values are now available, so in the next cycle the first multiply starts computing: the 4 written here indicates that the computation has started and will require 4 cycles. In the meantime, as you can see, the machine is already prepared for the third iteration, because R1 has been modified for it: the effective address was 80 for the first iteration, 72 for the second, and it is 64 for the third. Load2 is also completing now: it was started in cycle 6 and takes 4 cycles, so in the 10th cycle, 6 plus 4, this load completes, and the next instruction is dispatched.
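The transfer of a completed value "directly into Vj" happens over the common data bus: the completing station broadcasts a (tag, value) pair, and every reservation station whose Qj or Qk matches the tag captures the value at once. A minimal sketch, reusing the ReservationStation class from the earlier example:

```python
def cdb_broadcast(tag, value, stations):
    """One common-data-bus broadcast: every station waiting on `tag`
    captures `value` simultaneously; no register-file read is involved."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None

mult1 = ReservationStation("Mult1", busy=True, op="MUL.D",
                           vk=2.5, qj="Load1")    # waiting on the first load
cdb_broadcast("Load1", 10.0, [mult1])
print(mult1.ready())   # True: both operands are values, the multiply may start
```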

19 will take 4 cycles. So, in the 10 th, 6 plus 4, 10 cycles, this load will completed and it will dispatching this instruction. (Refer Slide Time: 33:55) And you can see, this load is completing, so result is been written here, so when result writing will take place in the 11 th cycle and here, the operand is the available that means, result is available means, it is providing to this multiplier 2. This load will provide multiplier 2 and it will directly go to the reservation station buffers and it will be available here. Now, here you can see in the meantime, that F 0 has been updated to load 3 that means, again this result, which is coming from the second load, will not be written into the register. Because, already other instruction can be issued, which will provide the operand to the register F 0. So, this will continue in this way, so next load in sequence is coming, for third iteration has started now, so third iteration has started next load in sequence.

(Refer Slide Time: 35:02)

Now the question arises: can the next instruction be issued? The third-iteration load can be issued, although it is not shown here, because we have not updated the instruction window; it remains the same. But this multiplication operation of the third iteration, can it be issued? It cannot, and the reason is that both multipliers are now busy. Since you have got only two multipliers and both are busy, this instruction cannot be issued until one of the two multipliers becomes free. As a consequence, although the third iteration has been reached, its multiply cannot be issued, because of a structural hazard.
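A minimal sketch of the issue check that produces this stall, reusing the ReservationStation class from above: an instruction issues only if a free station of the required kind exists, and since Tomasulo's scheme issues strictly in order, a stalled instruction also holds up everything behind it, as the next slide shows for the store.

```python
def try_issue(op, stations):
    """In-order issue: return a free reservation station of the kind `op`
    needs, or None, in which case issue stalls (a structural hazard) and,
    because issue is in order, all later instructions stall behind it."""
    kind = {"MUL.D": "mult", "L.D": "load", "S.D": "store"}[op]
    for rs in stations[kind]:
        if not rs.busy:
            return rs
    return None

stations = {"mult": [ReservationStation("Mult1", busy=True),
                     ReservationStation("Mult2", busy=True)],
            "load": [], "store": []}
print(try_issue("MUL.D", stations))   # None: both multipliers busy, issue stalls
```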

(Refer Slide Time: 36:07)

Now, the third store cannot be issued either. Why not? The reason is that, as you have seen, Tomasulo's approach performs in-order issue: if the multiply cannot be issued, no subsequent instruction can be issued. Since the multiplication operation could not be issued here, the third store cannot be issued either.

(Refer Slide Time: 36:55)

Now the first multiplication operation is completing. Its result will go to F4, and the store-data instruction is waiting for that data, so in the next cycle the store can use it.

(Refer Slide Time: 37:23)

You can see it has provided the result, and that result will be used by the store-data instruction to store it to memory, starting in the next cycle. In the meantime, the second multiplication will also complete. The two complete just one cycle apart: the first had its operand by cycle 10 and the second by cycle 11, and the multiplication requires 4 cycles after the operands arrive, which is why one completes in cycle 15 and the other in cycle 16, as you shall see.

(Refer Slide Time: 38:29)

So the second multiply completes in cycle 16, and now the store of the data can proceed.

(Refer Slide Time: 38:38)

The store performs its completion; it requires 4 cycles.

(Refer Slide Time: 38:47)

14 plus 4 gives 18, so the first store completes and writes its result in the 18th cycle, and in the 19th cycle the second multiply writes its result and the second store completes.

(Refer Slide Time: 39:08)

So in this loop example, computation has been performed across the loop boundary. Let us look at what happened. We find that issue is in order: no instruction was issued out of order. However, completion is out of order, because execution was out of order. As a consequence, you can see that one instruction completes in cycle 18 while an instruction after it completed back in cycle 10; again, one completes in cycle 15 and another in cycle 19. Completion does not take place in program order: this one completes earlier than that one. So out-of-order execution is taking place, and out-of-order completion is taking place as well. This is how it happens.

(Refer Slide Time: 40:26)

Now the question arises: why can Tomasulo's scheme overlap iterations of loops? How is it possible? Reason number 1: multiple iterations use different physical destinations for their registers, so it does register renaming. Because of register renaming, different physical locations are used, and essentially you may consider this dynamic loop unrolling. Earlier we discussed in detail loop unrolling carried out with the help of the compiler, and there you had to explicitly use the registers available in the processor for the purpose of unrolling. Here, as you can see, even without extra architectural registers, by which I mean registers visible in the processor, loop unrolling is done dynamically using the register buffers available in the reservation stations; that is why it can do this. Reason number 2: the reservation stations permit instruction issue to advance past the integer control-flow operations, as we have already seen. They also buffer the old values of registers, totally avoiding the WAR stall that we saw in the scoreboard; starting from scoreboarding, we have discussed in detail how the write-after-read stall is avoided, and that is achieved here as well. As a consequence, we are able to overlap iterations of loops in Tomasulo's scheme.

(Refer Slide Time: 42:18)

These are the three major advantages provided by Tomasulo's scheme. First, distribution of the hazard-detection logic: there are several reservation stations, which detect the hazards, and if multiple instructions wait on a single result, it can be passed to all of them simultaneously by a broadcast on the common data bus. If a centralized register file were used, the units would have to read their results from the registers; this is avoided, because we are not going through a centralized register file but using the register buffers available in the reservation stations. Second, elimination of the stalls for WAW and WAR hazards; I have already elaborated on how these stalls are overcome, by dynamic register renaming and by the effective-address comparison for memory. Third, superscalar execution becomes possible. This leads to another possibility: so far we have assumed that the number of instructions issued per cycle is only one, but if you have enough instructions available that can be executed in parallel, you can go for a superscalar architecture.

In other words, Tomasulo's basic scheme can be extended to a superscalar architecture; however, in that case it may be necessary to have more than one common data bus, and you may have to duplicate the reservation-station connections corresponding to the additional common data buses. But it is possible to add superscalar execution.

(Refer Slide Time: 44:26)

So, this is the summary. The reservation stations provide renaming to a larger set of registers plus buffering of the source operands. This prevents the registers from becoming a bottleneck: the restriction imposed by the physical availability of registers in the processor is overcome. As I have repeated many times, Tomasulo's algorithm avoids the WAR and WAW stalls of the scoreboard. It allows loop unrolling in hardware, through the dynamic register renaming possible in this scheme, and, quite clearly, it is not limited to basic blocks, because the integer unit gets ahead and execution goes beyond branches. It also helps tolerate cache misses, as I have already explained. And these are the three lasting contributions of Tomasulo's approach: number 1, dynamic instruction scheduling; number 2, register renaming; and number 3, load and store disambiguation. Load and store disambiguation means that the WAW- and WAR-type hazards arising out of loads and stores are overcome; this disambiguation of loads and stores was done in Tomasulo's algorithm. As a consequence, there are many descendants of the 360/91: as we know, Tomasulo's approach was originally conceived and developed for the IBM 360/91 to improve the performance of its floating-point operations, but subsequently these techniques have been incorporated in a good number of modern processors, such as the Pentium II, PowerPC 604, MIPS R10000, HP PA-8000, and the Alpha.

(Refer Slide Time: 46:28)

Of course, there are some drawbacks. The common data bus connects to multiple functional units, and because of the high capacitance it will be slower; moreover, the number of functional units that can complete per cycle is limited to one per common data bus. So the common data bus is a major concern in Tomasulo's approach, because all the functional units feed into it: as more and more functional units are attached, the capacitance increases, and only one of them can put its result on the bus at a time. A result can go to multiple functional units, but only one functional unit can produce a result on the common data bus in a given cycle. This is a severe restriction, so the alternative approach is to have multiple common data buses, with more functional-unit logic for parallel result transfers. This can be done, but it will definitely increase the complexity, so that too is a drawback of Tomasulo's scheme.
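To see why the number of common data buses matters, here is a tiny, purely illustrative Python sketch of CDB arbitration, reusing cdb_broadcast from the earlier example: each bus carries at most one (tag, value) broadcast per cycle, so with k buses at most k instructions can complete per cycle, regardless of how many functional units finish.

```python
def complete_cycle(finished, num_cdbs, stations):
    """`finished` is a list of (tag, value) results ready this cycle.
    Only `num_cdbs` of them can broadcast; the rest must wait a cycle,
    which is the completion bottleneck discussed here."""
    granted, deferred = finished[:num_cdbs], finished[num_cdbs:]
    for tag, value in granted:
        cdb_broadcast(tag, value, stations)   # reuse the broadcast sketch above
    return deferred

# With one CDB, two simultaneous completions serialize over two cycles:
leftover = complete_cycle([("Mult1", 25.0), ("Load2", 7.0)], 1, [])
print(leftover)     # [('Load2', 7.0)] must wait for the next cycle
```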

So, the hardware complexity is relatively high, because you require reservation stations holding a large number of registers, and if you go for multiple common data buses, the complexity of the hardware increases again. Another important aspect is imprecise exceptions, whose effective handling is a major performance bottleneck.

(Refer Slide Time: 48:17)

Let me briefly discuss the interrupts and exceptions that can arise in a processor. As you know, interrupts are essentially generated externally; each processor is provided with dedicated inputs for them.

(Refer Slide Time: 48:36)

Suppose you have bought a CPU: usually it is provided with two interrupt inputs, known as the non-maskable interrupt and the maskable interrupt. They are available under a variety of names, but whatever they may be called, one is non-maskable and the other maskable. The non-maskable interrupt is commonly used for emergency situations such as power failure, and it is restricted to that. The maskable interrupt input is commonly used for interrupts coming from I/O devices, and this has led to interrupt-driven I/O operation: whenever an I/O device is ready to provide data to the processor, it generates an interrupt, and the processor then reads that data. That is the basic concept of interrupt-driven I/O. Interrupts may also come from the operating system.

(Refer Slide Time: 50:18)

The other type consists of exceptions. Exceptions are internal: they are generated as the instructions are executed. For example, an illegal opcode: the processor tries to execute an opcode that does not match any valid operation code of the processor, so it generates an exception. Similarly, divide by zero, overflow, underflow, and page faults are situations that can arise dynamically as instructions are executed, and they lead to exceptions.

Whenever these happen, the OS needs to intervene to handle the exception: control is usually transferred to the operating system, because the CPU cannot handle these things by itself.

(Refer Slide Time: 51:31)

But imprecise exceptions have to be taken care of. What do we mean by an imprecise exception? An exception is called imprecise when the processor state at the moment the exception is raised does not look exactly the same as it would if the instructions had been executed in order.

(Refer Slide Time: 51:52)

That means this arises out of out-of-order execution: since we are performing out-of-order execution, the exceptions generated by the instructions should appear exactly as if those instructions had been executed in order, and this has to be taken care of.

(Refer Slide Time: 52:29)

This is made precise as follows: in an out-of-order execution model, an imprecise exception is said to occur if, when an exception is raised by an instruction, some instructions before it may not yet be complete while some instructions after it are already complete. For example, a floating-point instruction's exception could be detected after an integer instruction that is much later in program order has already completed. This type of imprecise exception has to be taken care of.
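That definition fits in a few lines of Python (purely illustrative): given which instructions in program order have completed, an exception raised by instruction i is precise only if everything before i has completed and nothing after i has.

```python
def exception_is_precise(i, completed):
    """`completed[j]` is True if instruction j (in program order) has
    committed its result. An exception raised by instruction i is precise
    iff all earlier instructions are done and no later one is."""
    return all(completed[:i]) and not any(completed[i + 1:])

# The FP multiply at index 1 raises an exception, but the integer ADD at
# index 2 has already completed out of order -> imprecise.
print(exception_is_precise(1, [True, False, True]))   # False
print(exception_is_precise(1, [True, False, False]))  # True
```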

So, we have discussed Tomasulo's approach in the context of branch prediction. Along with branch prediction, Tomasulo's approach can help you go beyond a loop iteration: you can execute instructions beyond loop boundaries, with multiple iterations executing one after another, provided the branch is predicted taken; that we have demonstrated with the help of an example. However, certain things, such as imprecise exceptions, have to be taken care of, while the various kinds of hazards are taken care of automatically, as we have discussed in detail. In my next lecture, I shall discuss another very important concept, and that is not just branch prediction, but speculative execution. We shall see what speculative execution is and how Tomasulo's scheme can be extended for speculative execution; that we shall discuss in my next lecture. Thank you.


More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) Instruction Level Parallelism (ILP) Pipelining supports a limited sense of ILP e.g. overlapped instructions, out of order completion and issue, bypass logic, etc. Remember Pipeline CPI = Ideal Pipeline

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

COSC 6385 Computer Architecture - Pipelining (II)

COSC 6385 Computer Architecture - Pipelining (II) COSC 6385 Computer Architecture - Pipelining (II) Edgar Gabriel Spring 2018 Performance evaluation of pipelines (I) General Speedup Formula: Time Speedup Time IC IC ClockCycle ClockClycle CPI CPI For a

More information

Chapter 06: Instruction Pipelining and Parallel Processing

Chapter 06: Instruction Pipelining and Parallel Processing Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction

More information

Static vs. Dynamic Scheduling

Static vs. Dynamic Scheduling Static vs. Dynamic Scheduling Dynamic Scheduling Fast Requires complex hardware More power consumption May result in a slower clock Static Scheduling Done in S/W (compiler) Maybe not as fast Simpler processor

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012 Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

Chapter. Out of order Execution

Chapter. Out of order Execution Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

Scoreboard information (3 tables) Four stages of scoreboard control

Scoreboard information (3 tables) Four stages of scoreboard control Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

RECAP. B649 Parallel Architectures and Programming

RECAP. B649 Parallel Architectures and Programming RECAP B649 Parallel Architectures and Programming RECAP 2 Recap ILP Exploiting ILP Dynamic scheduling Thread-level Parallelism Memory Hierarchy Other topics through student presentations Virtual Machines

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

Advanced Pipelining and Instruction- Level Parallelism 4

Advanced Pipelining and Instruction- Level Parallelism 4 4 Advanced Pipelining and Instruction- Level Parallelism 4 Who s first? America. Who s second? Sir, there is no second. Dialog between two observers of the sailing race later named The America s Cup and

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

COSC4201. Prof. Mokhtar Aboelaze York University

COSC4201. Prof. Mokhtar Aboelaze York University COSC4201 Chapter 3 Multi Cycle Operations Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RTI) 1 Multicycle Operations More than one function unit, each

More information

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at

More information

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4 PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI

More information

(Refer Slide Time: 00:02:04)

(Refer Slide Time: 00:02:04) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture - 27 Pipelined Processor Design: Handling Control Hazards We have been

More information

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.

More information

Floating Point/Multicycle Pipelining in DLX

Floating Point/Multicycle Pipelining in DLX Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations in one or two cycles is impractical since it requires: A much longer CPU clock cycle, and/or

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information