TAMPERE UNIVERSITY OF TECHNOLOGY Institute of Digital and Computer Systems. Exercise 2: DLX I - Architecture


TAMPERE UNIVERSITY OF TECHNOLOGY
Institute of Digital and Computer Systems
TKT-3206 Computer Architecture I
Exercise 2: DLX I - Architecture

Group:
Name:
Stud. num.:

General info about the exercise

The purpose of this exercise is to explain how a pipelined processor works and which factors affect its performance. The exercise uses a simulator of a pipelined DLX processor, developed at Lund University, Sweden.

The deliverable for this exercise is this document, filled in. Give clear and brief, yet sufficiently accurate answers to all the questions. Write your answers in legible handwriting, using a pencil so that possible errors can be corrected. Unclear, unintelligible, or nondescriptive answers yield a boomerang or possibly re-doing the whole exercise from the beginning.

DLX-simulator

In this exercise, a data path model of a DLX processor is studied. A comprehensive model that effectively solves hazards is developed in three steps. The first model, Datapath Model 1 (DP-1), lacks support for the bypassing technique. The second model, Datapath Model 2 (DP-2), supports the bypassing technique, but uses the ALU for calculating the jump addresses. The third model, DLX, features a separate adder for the jump address calculation. The simulator supports only integer instructions. The DLX CONTROL exercise will study the implementation of control logic for the DLX processor model.

The simulator supports symbolic machine language written for the DLX processor. Assembler code for the simulator can be written using a regular text editor. The extension of the code file must be '.s', and the last instruction of the program must be trap 0.

Installation

Go to your home directory in Birdland. Copy the simulator and the program code files to yourself (first make sure that you have at least 1.1 MB of quota left) using the command:

cp -r ~rhu/public_html/tktekn/dlx/.

Now you have all the files needed for this exercise and the DLX-CONTROL exercise in the subdirectory DLX. Go to this subdirectory (cd DLX).

Starting the simulator

The simulator is started with the command:

sparc_pipe

(or ./sparc_pipe if the current directory is not in the search path of your environment settings), and the simulator window opens.

The window shows the five stages of the processor, which are (from the left): IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory Access) and WB (Write Back to register file). The processor is pipelined by inserting D-flip-flops (registers) between the stages. On each clock cycle, the intermediate results of each stage are saved into these registers, which are symbolized in the simulator by the boxes on top of the dashed lines. The functional units that perform the calculation reside between the dashed lines. The calculation proceeds from left to right, with the exception of the writebacks. The bold lines represent 32-bit buses which connect the functional units and registers. The thin lines represent one-bit control signals. A diagram that uses the above-described grouping of clock-synchronized memory elements on vertical lines is called a Werner diagram.

The instruction executed in each stage is shown in the box below the stage. The contents of the register file can be examined by clicking the left mouse button on top of the Register File box in the simulator. The contents of the register file cannot be altered manually.

The functions of the simulator

In the top left-hand corner of the window are two pull-down menus: File and Views. Assembler code can be loaded with the Load command of the File menu, and the simulator is exited with the Quit command. The data path model to be used (DP_1, DP_2 or DLX) is chosen from the Views menu. Each data path model contains three options that are used in examining hazards: Pipelining/No Pipelining, Delayed Load/No Delayed Load and Delayed Branch/No Delayed Branch. These options can be changed by clicking the boxes at the top of the simulator window.

Before executing a program, the Program Counter (PC) and the Register File must be reset. This is done by clicking the Reset box. Note that Reset does not reset the data memory; it can be reset e.g. by (re)loading the program code.

The program can be executed one clock cycle at a time by clicking the box Clock. Clicking Run will execute the whole program. Elapsed Cycles shows the number of clock cycles passed since the first instruction entered the IF stage or, if Run is clicked, the total number of clock cycles from the start until the last instruction reaches the WB stage.

Clock cycles Per Instruction (CPI) is calculated with the following formula:

$$\mathrm{CPI} = \frac{\text{NumberOfClockCycles}}{\text{NumberOfInstructions(non-NOP)}} = \frac{\text{NumberOfClockCycles}}{\text{NumberOfClockCycles} - \text{Hazards} - \text{FillingThePipeline}}$$

Hazards is the number of clock cycles needed for solving the hazards. This sum consists of the NOP instructions that reach the WB stage after the first instruction has reached it, and of the clock cycles during which the pipeline is stalled due to hazards.

Sub-operations of the DLX instructions

Examine the following program:

    LW   R3,18(R0)
    ADD  R1,R3,R3
    SW   18(R0),R1

What does M[18+R0] contain after the program execution, if M[18+R0] is before the execution?

Next, we will study which sub-operations an un-pipelined data path performs in each of its stages for the instructions of the above code. By running the program without pipelining, we can concentrate on the execution of a single instruction at a time and examine what is happening in the different pipeline stages.

The program code exempel1.s is loaded into the simulator by selecting File->Load. Type the name of the code file and click OK. Make sure that the data path settings are: Datapath Model 1, No Pipelining, No Delayed Load and No Delayed Branch.

Give a clock pulse by clicking the box Clock. What instruction is read into the processor? What does that particular instruction do?

Give a clock pulse. Enter data into table 1 (the abbreviations of the sub-operations are listed in table 2). What is being done in the ID stage?

Give a clock pulse. Where does the ALU get its operands? What is the result of the ALU operation? What was calculated in the ALU?

Give a clock pulse. What value does DMAR get? What is the DMAR bus connected to? What sub-operation was performed in the EX stage?

What does the register R3 contain (click the Register File)?

Give a clock pulse. What does the register R3 now contain and why?

The instruction LW R3,18(R0) has now been executed. Mark the performed sub-operations into table 1 using the abbreviations listed in table 2. On the last row, mark which functional unit in the Werner diagram performs the operation in each stage.

Table 1: The sub-operations of the example code.

Class                 IF     ID     EX     MEM    WB
LOAD instruction
ALU instruction
STORE instruction
BRANCH instruction
Functional unit

Table 2: Sub-operation abbreviations for table 1.

Sub-operation                    Abbreviation
Instruction Fetch                IF
Register Read                    RR
Arithmetic/logical Operation     AO
Operand address Calculation      OC
Jump address Calculation         JC
Memory Read                      MR
Memory Write                     MW
Register Write                   RW
Load Program Counter             LPC
no sub-operation                 -

Give a clock pulse. The instruction ADD R1,R3,R3 is fetched now. Answer the following questions by giving clock pulses.

What is done in the EX stage?

Which sub-operation was performed in the MEM stage?

Examine the contents of registers R1 and R3 before you forward the instruction into the WB stage. What do R1 and R3 contain?

R1 =        R3 =

Give a clock pulse. What does the register R1 now contain and why?

R1 =

Fill into table 1 the sub-operations that were now performed.

The next instruction is SW 18(R0),R1. Execute the instruction by giving enough clock pulses and answer the following questions.

What does the instruction do?

Which registers were read in the ID stage?

What was done in the EX stage?

Which sub-operation was performed in the MEM stage?

A special bus was used in transferring the contents of register R1. Why?

Which sub-operation is performed in the WB stage?

Fill in the sub-operations of the instruction into table 1.

When determining the stage in which a given sub-operation is performed, it does not matter where the corresponding functional unit resides. The stage is defined by the location of the instruction at the moment the sub-operation is executed. For example, the register write is performed in the WB stage even though the register file is in the ID stage. This is an important property when the sub-operations of a BRANCH instruction are examined.

Finally, we study the sub-operations of the instruction BEQZ R0,14.

What is done in the ID stage?

What is done in the EX stage? What is the output of the ALU?

What controls the multiplexer MX1? In which situations does the control signal of MX1 change its state?

The value of the PC changes as you move the instruction into the WB stage. What is the new PC value? Why does the PC value change into this particular number?

Now fill in the missing fields of table 1.

We can see from table 1 that for some of the instructions, a sub-operation is performed in each stage, while some of the instructions require a no sub-operation in some of their stages. For which instruction classes and in which stages are these no sub-operations performed?

So far, each instruction has been executed separately. An important figure regarding the efficiency of a code is how many clock cycles on average are needed to execute one instruction. This figure is called Clock cycles Per Instruction (CPI). What is the CPI of the executed program and why?

Parallelism and pipelining

For pipelining to be possible, it must be possible to divide every instruction into sub-operations that are performed in order. However, the instructions do not have to utilize every sub-operation. Furthermore, the processor must have enough functional units and buses so that each pipeline stage can execute a sub-operation on every clock cycle. If this condition is not met, the pipeline must be stalled until the busy functional unit or bus becomes available again. This situation is referred to as a structural hazard.

Mark into table 1 which pipeline stages use the following functional units: data memory, instruction memory, ALU, and register file.

Why can two pipeline stages simultaneously use the register file without causing a structural hazard? What are these stages?

What architectural properties must the memory have in order to avoid a structural hazard?

From now on, we assume that the memory has been constructed so that there are no structural hazards. Examine the following program:

    ADDI R1,R0,#1    ;R1 <- 1
    ADDI R2,R0,#2    ;R2 <- 2
    ADDI R3,R0,#4    ;R3 <- 4
    ADD  R1,R1,R1    ;R1 <- R1+R1
    ADD  R2,R2,R2    ;R2 <- R2+R2
    ADD  R3,R3,R3    ;R3 <- R3+R3
    ADD  R1,R1,R1    ;R1 <- R1+R1
    ADD  R2,R2,R2    ;R2 <- R2+R2
    ADD  R3,R3,R3    ;R3 <- R3+R3

Determine by examining the code what the registers R1, R2 and R3 contain after the execution. Give the answer in hexadecimal.

R1 =        R2 =        R3 =

The program is in file exempel2.s. Load it into the simulator and run it (Run). How many clock cycles does the execution take when only one instruction is performed at a time? (DP_1, No Pipelining)

Now, enable the pipelining (Pipelining); the processor resets. Give clock pulses until the first instruction reaches the MEM stage. How many clock cycles does it take?

After this, one instruction is finished on every clock cycle. We can say that the pipeline is now full. Give a clock pulse. Which register is updated and which registers are read?

How many instructions are performed simultaneously in the pipeline?

Give clock pulses until the last instruction is in the WB stage. What are the contents of registers R1, R2 and R3?

R1 =        R2 =        R3 =

How many clock cycles did the execution take?

If the clock cycles needed for filling the pipeline are not counted, how much faster is the execution when pipelining is used? Write down also the formula you used to calculate your result.

From this we can conclude that pipelining significantly increases the speed of the calculation if there are no structural hazards. In practice, however, there will be some problems that reduce this speed-up. Next, we will study what exactly these problems are.
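The relation between instruction count, pipeline depth and cycle count can be sketched in a few lines of Python. This is only an idealized model, not the simulator's own bookkeeping: it assumes that an un-pipelined instruction occupies all five stages for one clock cycle each and that a pipelined run needs (stages - 1) extra cycles to fill the pipeline. The function names are illustrative, and the cpi function follows the CPI formula given earlier in this document.

```python
def unpipelined_cycles(n_instructions: int, stages: int = 5) -> int:
    """Cycles when each instruction passes through every stage alone (assumed model)."""
    return n_instructions * stages


def pipelined_cycles(n_instructions: int, stages: int = 5, hazard_cycles: int = 0) -> int:
    """Cycles for an ideal pipeline: fill cycles + one cycle per instruction + hazards."""
    fill_cycles = stages - 1
    return fill_cycles + n_instructions + hazard_cycles


def cpi(total_cycles: int, hazard_cycles: int, fill_cycles: int) -> float:
    """CPI = cycles / (cycles - hazards - fill), as in the worksheet formula."""
    instructions = total_cycles - hazard_cycles - fill_cycles
    return total_cycles / instructions


if __name__ == "__main__":
    n = 100  # hypothetical instruction count, not one of the exercise programs
    slow = unpipelined_cycles(n)
    fast = pipelined_cycles(n)
    print("un-pipelined cycles:", slow)
    print("pipelined cycles:   ", fast)
    print("CPI (pipelined):    ", cpi(fast, hazard_cycles=0, fill_cycles=4))
    print("speed-up without fill cycles:", slow / n)
```

The cycle counts reported by the simulator may differ slightly from this model, depending on exactly when it starts and stops counting.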

Data hazards

Next, we will determine what kind of special mechanisms are required to guarantee correct program execution in spite of data dependencies between the instructions.

Bypassing

Examine the following program:

    ADDI R1,R0,#2    ;R1 <- 2
    ADD  R2,R1,R1    ;R2 <- R1+R1
    ADD  R3,R1,R1    ;R3 <- R1+R1
    ADD  R4,R1,R1    ;R4 <- R1+R1

What should the registers R1, R2, R3 and R4 contain after the program execution?

R1 =        R2 =        R3 =        R4 =

The program is in file exempel3.s. Load and run the program. (DP_1, Pipelining) What do the registers R1, R2, R3 and R4 contain after the execution?

R1 =        R2 =        R3 =        R4 =

Apparently, something went wrong. To understand what it was, run the program again. Reset the processor and give clock pulses until instruction ADD R2,R1,R1 reaches the ID stage. What is the value of register R1?

R1 =

Where in the data path is the correct value of register R1?

Give clock pulses until instruction ADD R3,R1,R1 reaches the ID stage. What is the value of register R1 when it is being read?

R1 =

Where in the data path is the correct value of register R1?

Give clock pulses until instruction ADD R4,R1,R1 reaches the ID stage. What is the value of register R1? Is it correct?

R1 =

Now you have (hopefully) understood that the register file does not always contain the correct values. Thus, it must be possible to bypass the result of an instruction to the next two instructions if they use this result as their operand.
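The statement that a result must be bypassed to the next two instructions can be turned into a small dependency check. The Python sketch below assumes a five-stage pipeline in which a register written in WB can be read correctly by an instruction that is in ID during the same cycle, so only consumers one or two instructions behind the producer see a stale value. The tuple representation of instructions and the function name are assumptions made for this illustration only.

```python
from typing import List, Optional, Tuple

# (text, destination register or None, source registers) -- hypothetical encoding
Instr = Tuple[str, Optional[str], Tuple[str, ...]]


def stale_reads(program: List[Instr], window: int = 2) -> List[Tuple[int, int, str]]:
    """Return (consumer index, producer index, register) for every read that
    happens before the producer has written the register file."""
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        for distance in range(1, window + 1):
            i = j - distance
            if i < 0:
                break
            dest = program[i][1]
            # R0 is excluded here, assuming it is hard-wired to zero as in DLX.
            if dest is not None and dest != "R0" and dest in sources:
                hazards.append((j, i, dest))
    return hazards


# The program of exempel3.s in the encoding above.
exempel3: List[Instr] = [
    ("ADDI R1,R0,#2", "R1", ("R0",)),
    ("ADD R2,R1,R1", "R2", ("R1", "R1")),
    ("ADD R3,R1,R1", "R3", ("R1", "R1")),
    ("ADD R4,R1,R1", "R4", ("R1", "R1")),
]

if __name__ == "__main__":
    for consumer, producer, reg in stale_reads(exempel3):
        print(f"{exempel3[consumer][0]} reads {reg} before "
              f"{exempel3[producer][0]} has written it")
```

Every pair reported by this check is a place where a data path without bypassing would read an outdated register value.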

From which point(s) of the pipeline must this bypassing be possible?

The next data path model, Datapath Model 2 (DP_2), supports the bypassing technique. Choose data path model DP_2 from the Views menu (the processor is reset). Give clock pulses until instruction ADD R2,R1,R1 reaches the ID stage. Examine the state of multiplexers MX2 and MX3. Where does the value of R1 come from?

Give clock pulses until instruction ADD R3,R1,R1 reaches the ID stage. Examine the state of multiplexers MX2 and MX3 again. Where does the value of R1 come from now?

Give clock pulses until the program has been completely executed and make sure that the registers have the correct values.

Delayed Load

Examine the following program:

    LW   R1,18(R0)   ;R1 <- M[18]
    ADD  R2,R1,R1    ;R2 <- R1+R1
    ADD  R3,R1,R1    ;R3 <- R1+R1
    ADD  R4,R1,R1    ;R4 <- R1+R1

What should the registers R1, R2, R3 and R4 contain after the program execution if M[18] = before the execution?

R1 =        R2 =        R3 =        R4 =

The program is in file exempel4.s. Load and run the program (DP_2, Pipelining). What do the registers R1, R2, R3 and R4 contain after the execution?

R1 =        R2 =        R3 =        R4 =

Apparently, something went wrong. To understand what it was, run the program again. Reset the processor and give clock pulses until instruction ADD R2,R1,R1 reaches the ID stage. In which stage is instruction LW R1,18(R0)?

Give a clock pulse. Where in the data path is the correct value of register R1?

In which stage is instruction ADD R2,R1,R1?

Why does instruction ADD R3,R1,R1 get the correct register value?

The execution of one instruction failed. Which instruction was it, and why did exactly this particular instruction fail?

This problem can be solved by using Delayed Load. Whenever the result of a LOAD instruction is used by the next instruction, the pipeline is stalled until the bypassing technique can be used to forward the loaded value to the next instruction(s). Activate Delayed Load by clicking the field No Delayed Load. (DP_2, Pipelining, Delayed Load, Delayed Branch)

Give clock pulses until instruction ADD R2,R1,R1 reaches the ID stage. In which stage is instruction LW R1,18(R0)?

Give a clock pulse. In which stage is instruction ADD R2,R1,R1 now?

The pipeline is stalled (STALL) so that the instructions in the ID and IF stages no longer progress within the pipeline. From where is the value of register R1 now obtained?

Finish the execution with the Run command and make sure that the values of the registers are now correct. How many clock cycles did the execution take?

How many clock cycles would the execution take if there was no data dependency between the first LOAD instruction and the first ADD instruction?

Delayed Load and bypassing together guarantee that all data hazards are eliminated. However, one clock cycle is lost every time there is a data dependency between a LOAD instruction and the instruction following it.
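The rule stated above, one lost clock cycle for every LOAD whose result is used by the instruction immediately following it, is easy to check mechanically. The following Python sketch counts such load-use pairs in a straight-line instruction sequence. The tuple encoding (opcode, destination, sources) and the function name are assumptions of this sketch, and the count ignores how often a loop body is repeated.

```python
from typing import List, Optional, Tuple

# (opcode, destination register or None, source registers) -- hypothetical encoding
Instr = Tuple[str, Optional[str], Tuple[str, ...]]


def load_use_stalls(program: List[Instr]) -> int:
    """Count the LW instructions whose result is read by the very next instruction."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LW" and dest is not None and dest in current[2]:
            stalls += 1
    return stalls


# exempel4.s: the first ADD directly follows the load it depends on.
exempel4: List[Instr] = [
    ("LW", "R1", ("R0",)),
    ("ADD", "R2", ("R1", "R1")),
    ("ADD", "R3", ("R1", "R1")),
    ("ADD", "R4", ("R1", "R1")),
]

# Same instructions with an independent instruction placed after the load
# (a made-up reordering, used only to show that the stall disappears).
reordered: List[Instr] = [
    ("LW", "R1", ("R0",)),
    ("ADD", "R5", ("R0", "R0")),
    ("ADD", "R2", ("R1", "R1")),
]

if __name__ == "__main__":
    print("stalls in exempel4: ", load_use_stalls(exempel4))
    print("stalls in reordered:", load_use_stalls(reordered))
```

A compiler or programmer can use the same check to decide where an independent instruction should be moved between a load and its first use.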

Control hazards

Examine the following program (the memory addresses of the instructions are on the left):

    0000        ADDI R1,R0,#9    ;R1 <- 9
    0004        ADD  R2,R0,R0    ;R2 <- 0
    0008        ADD  R3,R0,R0    ;R3 <- 0
    000C        ADD  R4,R0,R0    ;R4 <- 0
    0010        ADD  R5,R0,R0    ;R5 <- 0
    0014 LOOP:  ADDI R2,R2,#1    ;R2 <- R2+1
    0018        SUBI R1,R1,#1    ;R1 <- R1-1
    001C        BNEZ R1,LOOP     ;If R1 <> 0 then BRANCH to LOOP
    0020        ADD  R3,R3,R2    ;R3 <- R3+R2
    0024        ADD  R4,R4,R2    ;R4 <- R4+R2
    0028        ADD  R5,R5,R2    ;R5 <- R5+R2

What should the registers R1, R2, R3, R4 and R5 contain after the program execution?

R1 =        R2 =        R3 =        R4 =        R5 =

What should the registers R3, R4 and R5 contain just before instruction ADD R3,R3,R2 is executed?

R3 =        R4 =        R5 =

How many instructions are executed?

The program is in file exempel5.s. Load and run the program. (DP_2, Pipelining, Delayed Load, Delayed Branch) What do the registers R1, R2, R3, R4 and R5 contain after the execution?

R1 =        R2 =        R3 =        R4 =        R5 =

Reset the processor and give clock pulses until instruction BNEZ R1,LOOP reaches the ID stage for the first time.

Give a clock pulse. Which instruction is fetched?

Give a clock pulse. Which instruction is now fetched? What is the value of the program counter?

PC =

Give a clock pulse. What is the value of the program counter and which instruction is now fetched?

PC =

Give a clock pulse. What is the value of register R3 and why is it wrong?

R3 =

After the branch instruction has reached the ID stage, how many clock cycles does it take until the address determined by the branch instruction is loaded into the program counter?

The processor does not function correctly. The problem is that the branch address is not loaded into the program counter immediately, so some instructions following the branch instruction are executed regardless of whether the branch condition evaluates to true or false. The simple solution is to stop the pipeline whenever a branch instruction is about to proceed from the ID stage. The execution continues only after the jump address has been calculated and the jump condition evaluated, or, if the branch is taken, after the program counter has been updated to point to the new instruction. This is achieved by using the option No Delayed Branch, which we will study next.

Change the processor settings to: DP_2, Pipelining, Delayed Load, No Delayed Branch. Give clock pulses until instruction BNEZ R1,LOOP reaches the ID stage for the first time.

Give a clock pulse. Which instruction is fetched?

The fetching of instructions has apparently been halted. After how many clock cycles is the fetching of new instructions resumed?

Finish the execution of the program and make sure that the values of the registers are correct. How many clock cycles does the execution take?

Why does the number of used clock cycles differ greatly from the number of executed instructions?

Now the processor works correctly with branch instructions, but the performance is greatly degraded due to the large branch overhead. In practice, processors spend a large portion of the total execution time within program loops, and since the loops are predominantly short, a big loop overhead is simply not acceptable.

Advancing the jump address calculation

The problem with Datapath Model 2 is that the jump address and branch condition are calculated in the EX stage, so the results are available only during the MEM stage. If the calculation could be advanced, the loop overhead would be smaller.
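How strongly the loop overhead depends on the moment at which the branch is resolved can be estimated with a one-line model. The Python sketch below treats the number of clock cycles lost per branch as a free parameter; the penalty values in the example are illustrative assumptions, not measurements from the simulator, and the function name is hypothetical. Fill cycles are ignored, so the result only approximates the CPI of a loop that runs for many iterations.

```python
def loop_cpi(body_instructions: int, branch_penalty: int) -> float:
    """Approximate CPI of a long-running loop.

    body_instructions: instructions per iteration, including the branch.
    branch_penalty: clock cycles lost every time the branch is executed
                    (an assumed parameter of this model).
    """
    if body_instructions <= 0:
        raise ValueError("a loop body needs at least one instruction")
    return (body_instructions + branch_penalty) / body_instructions


if __name__ == "__main__":
    # A three-instruction loop body, like the ADDI/SUBI/BNEZ loop of exempel5.s.
    for penalty in (3, 2, 1, 0):
        print(f"penalty {penalty} cycle(s): CPI is roughly {loop_cpi(3, penalty):.2f}")
```

The shorter the loop body, the larger the share of the branch penalty in the CPI, which is why resolving the branch earlier in the pipeline pays off most for short loops.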

The earliest stage in which the processor recognizes an instruction to be a branch instruction is the ID stage. Thus, the jump address calculation and jump condition evaluation can be done already in the ID stage, as seen in the datapath model DLX. Choose datapath model DLX (DLX, Pipelining, Delayed Load, No Delayed Branch). How does the DLX model differ from the DP_2 model?

Load the same program as earlier (exempel5.s). Give clock pulses until instruction BNEZ R1,LOOP reaches the ID stage for the first time.

Give a clock pulse. Which instruction is now fetched and why?

Finish the program execution. How many clock cycles did the execution take?

How much faster was the execution compared to DP_2?

What is the CPI value?

Now the CPI value is close to one, but not yet quite one. What is the reason for this?

Next, we will examine how the programmer and compiler can further reduce the CPI figure.

Co-operation between the compiler and pipeline

Examine the following program:

            ADDI R2,R0,#200  ;R2 <- 200
    LOOP:   LW   R3,28(R2)   ;R3 <- M[28+R2]
            ADD  R1,R3,R1    ;R1 <- R3+R1
            SUBI R2,R2,#4    ;R2 <- R2-4
            BNEZ R2,LOOP     ;if R2<>0 then BRANCH to LOOP

The program is in file exempel6.s. Load and run the program.

What is the CPI value?

As the number of instructions in a program increases, the effect of the clock cycles needed for filling the pipeline becomes insignificant. Thus, the CPI value can be approximated by dividing the number of clock cycles it takes to execute one loop iteration by the number of instructions in the loop. Calculate an approximation for the CPI figure. In addition to the result, indicate also the formula you used to obtain it.

Static instruction scheduling

Delayed load

What does the delayed load technique imply? Consider the issue from the viewpoints of the programmer/compiler and the hardware.

Examine the following program:

            ADDI R2,R0,#200  ;R2 <- 200
    LOOP:   LW   R3,28(R2)   ;R3 <- M[28+R2]
            SUBI R2,R2,#4    ;R2 <- R2-4
            ADD  R1,R3,R1    ;R1 <- R3+R1
            BNEZ R2,LOOP     ;if R2<>0 then BRANCH to LOOP

How does this program differ from the previous one (exempel6.s)?

The program is in file exempel7.s. Load and run the program. (DLX, Pipelining, Delayed Load, No Delayed Branch). What is the CPI value?

Compare the CPI figure with the share of branch instructions in the total number of instructions. Why is the CPI value now better than with the previous program?

How much faster was the execution compared to the previous code (exempel6.s)?

Delayed Branch

What does the delayed branch technique imply? Consider the issue from the viewpoints of the programmer/compiler and the hardware.

Examine the following program:

            ADDI R2,R0,#200  ;R2 <- 200
    LOOP:   LW   R3,28(R2)   ;R3 <- M[28+R2]
            SUBI R2,R2,#4    ;R2 <- R2-4
            BNEZ R2,LOOP     ;if R2<>0 then BRANCH to LOOP
            ADD  R1,R3,R1    ;R1 <- R3+R1

How does this program differ from the previous one (exempel7.s)?

Activate the option Delayed Branch. The program is in file exempel8.s. Load and run the program. (DLX, Pipelining, Delayed Load, Delayed Branch). What is the CPI value now?

The compiler cannot always place a useful instruction after a branch instruction. In these cases, an instruction that does not affect the calculation must be used. Such an instruction is NOP (No Operation). An example of these situations is shown in the following program:

            ADDI R2,R0,#200  ;R2 <- 200
            ADD  R1,R0,R0    ;R1 <- 0
    LOOP:   ADD  R1,R1,R2    ;R1 <- R1+R2
            SUBI R2,R2,#1    ;R2 <- R2-1
            BNEZ R2,LOOP     ;if R2<>0 then BRANCH to LOOP

Here, the compiler cannot schedule any of the instructions to be after the branch instruction. Thus, a NOP instruction must be added.

            ADDI R2,R0,#200  ;R2 <- 200
            ADD  R1,R0,R0    ;R1 <- 0
    LOOP:   ADD  R1,R1,R2    ;R1 <- R1+R2
            SUBI R2,R2,#1    ;R2 <- R2-1
            BNEZ R2,LOOP     ;if R2<>0 then BRANCH to LOOP
            NOP

Calculate the CPI figure for the above program. In addition to the result, indicate also the formula you used to obtain it.

The program is in file exempel9.s. Load and run the program. (DLX, Pipelining, Delayed Load, Delayed Branch). Is the resulting CPI figure the same as what you calculated?

The NOP instruction can be eliminated by using more sophisticated methods. The following program performs the same function as the previous code, but without the NOP instruction.

            ADDI R2,R0,#200  ;R2 <- 200
            ADD  R1,R0,R0    ;R1 <- 0
            ADD  R1,R1,R2    ;R1 <- R1+R2
    LOOP:   SUBI R2,R2,#1    ;R2 <- R2-1
            BNEZ R2,LOOP     ;if R2<>0 then BRANCH to LOOP
            ADD  R1,R1,R2    ;R1 <- R1+R2
            SUB  R1,R1,R2    ;R1 <- R1-R2

Two instructions had to be added into the code for the program to yield correct results. Mark these instructions into the above code.

The CPI figure is now reduced to one. The penalty of this modification is the increase of the number of instructions by two. Nevertheless, as loops are usually executed more than two times, the benefits far outweigh the drawbacks. However, not all compilers are able to do this kind of sophisticated analysis.
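That the rewritten loop computes the same result as the NOP version can be checked by transcribing both programs into Python. The sketch below assumes the usual delayed-branch semantics, in which the instruction in the branch delay slot is executed on every pass through the branch, whether or not the branch is taken. The function names are illustrative and only the register values are modelled, not the timing.

```python
def loop_with_nop() -> int:
    """Transcription of the NOP version; returns the final value of R1."""
    r2 = 200            # ADDI R2,R0,#200
    r1 = 0              # ADD  R1,R0,R0
    while True:
        r1 = r1 + r2    # LOOP: ADD R1,R1,R2
        r2 = r2 - 1     # SUBI R2,R2,#1
        taken = r2 != 0  # BNEZ R2,LOOP
        # delay slot: NOP, nothing happens here
        if not taken:
            break
    return r1


def loop_without_nop() -> int:
    """Transcription of the version without NOP; returns the final value of R1."""
    r2 = 200            # ADDI R2,R0,#200
    r1 = 0              # ADD  R1,R0,R0
    r1 = r1 + r2        # ADD  R1,R1,R2
    while True:
        r2 = r2 - 1     # LOOP: SUBI R2,R2,#1
        taken = r2 != 0  # BNEZ R2,LOOP
        r1 = r1 + r2    # delay slot: ADD R1,R1,R2 (assumed to run on every pass)
        if not taken:
            break
    r1 = r1 - r2        # SUB  R1,R1,R2
    return r1


if __name__ == "__main__":
    print("R1 with NOP:   ", loop_with_nop())
    print("R1 without NOP:", loop_without_nop())
```

Both functions return the sum of the integers from 1 to 200, so the transformation preserves the result while the loop body shrinks by one instruction slot.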

The program is in file exempel10.s. Load and run the program. (DLX, Pipelining, Delayed Load, Delayed Branch). What is the CPI figure now?

Compare the CPI value and the execution time to the previous program (exempel9.s).

Examine the following program:

            ADDI  R1,R0,#16     ;R1 <- 16
            ADDUI R2,R0,#4777   ;R2 <- 4777
            ADDUI R3,R0,#1326   ;R3 <- 1326
            ADDI  R5,R0,#0      ;R5 <- 0
            ADDUI R6,R0,#32768  ;R6 <- 32768
    FRANK:  AND   R4,R2,R6      ;R4 <- R2 AND R6
            SLLI  R2,R2,#1      ;R2 <- R2<<1
            SLLI  R5,R5,#1      ;R5 <- R5<<1
            SUBI  R1,R1,#1      ;R1 <- R1-1
            BNEZ  R4,ZED        ;if R4<>0 then BRANCH to ZED
            BNEZ  R1,FRANK      ;if R1<>0 then BRANCH to FRANK
            J     END           ;BRANCH to END
    ZED:    ADD   R5,R5,R3      ;R5 <- R5+R3
            BNEZ  R1,FRANK      ;if R1<>0 then BRANCH to FRANK
    END:    TRAP  0             ;end program

What does this program do? If your answer does not easily fit on the following line, it is wrong.

The execution time of the program depends on the initial value of R2 (4777 in the example code). With which R2 values is the execution time maximized and minimized?

Max: R2 =        (hexadecimal) =        (decimal)
Min: R2 =        (hexadecimal) =        (decimal)

The program is in file zorbas1.s. Load and run the program. (DLX, Pipelining, Delayed Load, No Delayed Branch). Write the results given by the simulator on the next line.

Elapsed Cycles =        CPI =        Hazards =        R5 =        (hexadecimal)

Modify the program so that all control hazards are eliminated. The execution time and CPI figure must also be improved. Use Delayed Branch. You can re-order, add, and remove instructions. The NOP instruction or its look-alikes, such as writing into a register that is never read, must not be used. Correct functioning of the program can be tested by comparing the final value of R5 to the value produced by the example code (zorbas1.s). The final values of the other registers are largely irrelevant.

Write the simulation results of the modified code on the next line.

Elapsed Cycles =        CPI =        Hazards =        R5 =        (hexadecimal)

Write the code on the next lines. The instructions preceding the loops do not have to be written, unless they have been modified.


More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

CS 251, Winter 2018, Assignment % of course mark

CS 251, Winter 2018, Assignment % of course mark CS 251, Winter 2018, Assignment 5.0.4 3% of course mark Due Wednesday, March 21st, 4:30PM Lates accepted until 10:00am March 22nd with a 15% penalty 1. (10 points) The code sequence below executes on a

More information

Pipeline Architecture RISC

Pipeline Architecture RISC Pipeline Architecture RISC Independent tasks with independent hardware serial No repetitions during the process pipelined Pipelined vs Serial Processing Instruction Machine Cycle Every instruction must

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction

More information

Pipelining. Maurizio Palesi

Pipelining. Maurizio Palesi * Pipelining * Adapted from David A. Patterson s CS252 lecture slides, http://www.cs.berkeley/~pattrsn/252s98/index.html Copyright 1998 UCB 1 References John L. Hennessy and David A. Patterson, Computer

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Thoai Nam Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy & David a Patterson,

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4 PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs. Exam 2 April 12, 2012 You have 80 minutes to complete the exam. Please write your answers clearly and legibly on this exam paper. GRADE: Name. Class ID. 1. (22 pts) Circle the selected answer for T/F and

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Instr. execution impl. view

Instr. execution impl. view Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Basic Instruction Timings. Pipelining 1. How long would it take to execute the following sequence of instructions?

Basic Instruction Timings. Pipelining 1. How long would it take to execute the following sequence of instructions? Basic Instruction Timings Pipelining 1 Making some assumptions regarding the operation times for some of the basic hardware units in our datapath, we have the following timings: Instruction class Instruction

More information

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.SP96 1 Review: Evaluating Branch Alternatives Two part solution: Determine

More information

6.004 Tutorial Problems L22 Branch Prediction

6.004 Tutorial Problems L22 Branch Prediction 6.004 Tutorial Problems L22 Branch Prediction Branch target buffer (BTB): Direct-mapped cache (can also be set-associative) that stores the target address of jumps and taken branches. The BTB is searched

More information

Please state clearly any assumptions you make in solving the following problems.

Please state clearly any assumptions you make in solving the following problems. Computer Architecture Homework 3 2012-2013 Please state clearly any assumptions you make in solving the following problems. 1 Processors Write a short report on at least five processors from at least three

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

CS3350B Computer Architecture Quiz 3 March 15, 2018

CS3350B Computer Architecture Quiz 3 March 15, 2018 CS3350B Computer Architecture Quiz 3 March 15, 2018 Student ID number: Student Last Name: Question 1.1 1.2 1.3 2.1 2.2 2.3 Total Marks The quiz consists of two exercises. The expected duration is 30 minutes.

More information

Pipelining. CSC Friday, November 6, 2015

Pipelining. CSC Friday, November 6, 2015 Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution Important guidelines: Always state your assumptions and clearly explain your answers. Please upload your solution document

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Speeding Up DLX Computer Architecture Hadassah College Spring 2018 Speeding Up DLX Dr. Martin Land

Speeding Up DLX Computer Architecture Hadassah College Spring 2018 Speeding Up DLX Dr. Martin Land Speeding Up DLX 1 DLX Execution Stages Version 1 Clock Cycle 1 I 1 enters Instruction Fetch (IF) Clock Cycle2 I 1 moves to Instruction Decode (ID) Instruction Fetch (IF) holds state fixed Clock Cycle3

More information

Pipelining is Hazardous!

Pipelining is Hazardous! Pipelining is Hazardous! Hazards are situations where pipelining does not work as elegantly as we would like Three kinds Structural hazards -- we have run out of a hardware resource. Data hazards -- an

More information

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

More information

Final Exam Fall 2007

Final Exam Fall 2007 ICS 233 - Computer Architecture & Assembly Language Final Exam Fall 2007 Wednesday, January 23, 2007 7:30 am 10:00 am Computer Engineering Department College of Computer Sciences & Engineering King Fahd

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

Floating Point/Multicycle Pipelining in DLX

Floating Point/Multicycle Pipelining in DLX Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations in one or two cycles is impractical since it requires: A much longer CPU clock cycle, and/or

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Computer System Architecture Quiz #1 March 8th, 2019

Computer System Architecture Quiz #1 March 8th, 2019 Computer System Architecture 6.823 Quiz #1 March 8th, 2019 Name: This is a closed book, closed notes exam. 80 Minutes 14 Pages (+2 Scratch) Notes: Not all questions are of equal difficulty, so look over

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

6.823 Computer System Architecture Datapath for DLX Problem Set #2

6.823 Computer System Architecture Datapath for DLX Problem Set #2 6.823 Computer System Architecture Datapath for DLX Problem Set #2 Spring 2002 Students are allowed to collaborate in groups of up to 3 people. A group hands in only one copy of the solution to a problem

More information

Computer Architectures. DLX ISA: Pipelined Implementation

Computer Architectures. DLX ISA: Pipelined Implementation Computer Architectures L ISA: Pipelined Implementation 1 The Pipelining Principle Pipelining is nowadays the main basic technique deployed to speed-up a CP. The key idea for pipelining is general, and

More information

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. The Processor Pipeline Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. Pipeline A Basic MIPS Implementation Memory-reference instructions Load Word (lw) and Store Word (sw) ALU instructions

More information

Vertieferlabor Mikroelektronik (85-324) & Embedded Processor Lab (85-546) Task 5

Vertieferlabor Mikroelektronik (85-324) & Embedded Processor Lab (85-546) Task 5 FB Elektrotechnik und Informationstechnik Prof. Dr.-Ing. Norbert Wehn Dozent: Uwe Wasenmüller Raum 12-213, wa@eit.uni-kl.de Task 5 Introduction Subject of the this task is the extension of the fundamental

More information

LECTURE 10. Pipelining: Advanced ILP

LECTURE 10. Pipelining: Advanced ILP LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction

More information

Static, multiple-issue (superscaler) pipelines

Static, multiple-issue (superscaler) pipelines Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Advanced Pipelining and Instruction- Level Parallelism 4

Advanced Pipelining and Instruction- Level Parallelism 4 4 Advanced Pipelining and Instruction- Level Parallelism 4 Who s first? America. Who s second? Sir, there is no second. Dialog between two observers of the sailing race later named The America s Cup and

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction Review: Evaluating Branch Alternatives Lecture 3: Introduction to Advanced Pipelining Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier Pipeline speedup

More information

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction

More information

The Evolution of Microprocessors. Per Stenström

The Evolution of Microprocessors. Per Stenström The Evolution of Microprocessors Per Stenström Processor (Core) Processor (Core) Processor (Core) L1 Cache L1 Cache L1 Cache L2 Cache Microprocessor Chip Memory Evolution of Microprocessors Multicycle

More information

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds? Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide

More information

Laboratory 05. Single-Cycle MIPS CPU Design smaller: 16-bits version One clock cycle per instruction

Laboratory 05. Single-Cycle MIPS CPU Design smaller: 16-bits version One clock cycle per instruction Laboratory 05 Single-Cycle MIPS CPU Design smaller: 16-bits version One clock cycle per instruction 1. Objectives Study, design, implement and test Instruction Fetch Unit for the 16-bit Single-Cycle MIPS

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

Lecture Topics. Announcements. Today: Data and Control Hazards (P&H ) Next: continued. Exam #1 returned. Milestone #5 (due 2/27)

Lecture Topics. Announcements. Today: Data and Control Hazards (P&H ) Next: continued. Exam #1 returned. Milestone #5 (due 2/27) Lecture Topics Today: Data and Control Hazards (P&H 4.7-4.8) Next: continued 1 Announcements Exam #1 returned Milestone #5 (due 2/27) Milestone #6 (due 3/13) 2 1 Review: Pipelined Implementations Pipelining

More information

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches

More information

Instruction Pipelining

Instruction Pipelining Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages

More information

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST * ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST * SAMPLE 1 Section: Simple pipeline for integer operations For all following questions we assume that: a) Pipeline contains 5 stages: IF, ID, EX,

More information

Instruction Pipelining

Instruction Pipelining Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages

More information

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Introduction to Pipelining Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L15-1 Performance Measures Two metrics of interest when designing a system: 1. Latency: The delay

More information

Computer Architecture Practical 1 Pipelining

Computer Architecture Practical 1 Pipelining Computer Architecture Issued: Monday 28 January 2008 Due: Friday 15 February 2008 at 4.30pm (at the ITO) This is the first of two practicals for the Computer Architecture module of CS3. Together the practicals

More information

CS3350B Computer Architecture Winter 2015

CS3350B Computer Architecture Winter 2015 CS3350B Computer Architecture Winter 2015 Lecture 5.5: Single-Cycle CPU Datapath Design Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and Design, Patterson

More information

Hakim Weatherspoon CS 3410 Computer Science Cornell University

Hakim Weatherspoon CS 3410 Computer Science Cornell University Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer. memory inst register

More information

ECE 486/586. Computer Architecture. Lecture # 12

ECE 486/586. Computer Architecture. Lecture # 12 ECE 486/586 Computer Architecture Lecture # 12 Spring 2015 Portland State University Lecture Topics Pipelining Control Hazards Delayed branch Branch stall impact Implementing the pipeline Detecting hazards

More information

Computer Architecture EE 4720 Practice Final Examination

Computer Architecture EE 4720 Practice Final Examination Name Computer Architecture EE 4720 Practice Final Examination 10 May 1997, 02:00 04:00 CDT :-) Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (100 pts) Good Luck! Problem 1: Systems A and B are

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations Chapter 4 The Processor Part I Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations

More information

HY425 Lecture 05: Branch Prediction

HY425 Lecture 05: Branch Prediction HY425 Lecture 05: Branch Prediction Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS October 19, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 05: Branch Prediction 1 / 45 Exploiting ILP in hardware

More information

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2018 Static Instruction Scheduling 1 Techniques to reduce stalls CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

Appendix C: Pipelining: Basic and Intermediate Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Design for a simplified DLX (SDLX) processor Rajat Moona

Design for a simplified DLX (SDLX) processor Rajat Moona Design for a simplified DLX (SDLX) processor Rajat Moona moona@iitk.ac.in In this handout we shall see the design of a simplified DLX (SDLX) processor. We shall assume that the readers are familiar with

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

Unresolved data hazards. CS2504, Spring'2007 Dimitris Nikolopoulos

Unresolved data hazards. CS2504, Spring'2007 Dimitris Nikolopoulos Unresolved data hazards 81 Unresolved data hazards Arithmetic instructions following a load, and reading the register updated by the load: if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or

More information

5008: Computer Architecture HW#2

5008: Computer Architecture HW#2 5008: Computer Architecture HW#2 1. We will now support for register-memory ALU operations to the classic five-stage RISC pipeline. To offset this increase in complexity, all memory addressing will be

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10) Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10) 1 Hazards Structural hazards: different instructions in different stages (or the same stage)

More information