
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Computation Structures
Fall 2018

Practice Quiz #3B

Name
Athena login name
Score

1 /10    2 /16    3 /18    4 /15    5 /20    6 /9    7 /12

Recitation section
o WF 11, (Silvina)
o WF 1, (Andy)
o WF 12, (Silvina)
o None (pick up quiz in 32-G846)

Please enter your name, Athena login name, and recitation section above. Enter your answers in the spaces provided below. You can use the extra white space and the backs of the pages for scratch work.

Problem 1. Potpourri (10 points)

For each of the following questions, circle the correct one of the two choices.

(A) (2 points) In a system with virtual memory:
  a. User processes need to use system calls to communicate with peripherals that are using MMIO addresses outside of the process's address space.
  b. User processes can directly load from and store to MMIO addresses because MMIO addresses are never virtual addresses. Virtual memory is deactivated for them.

(B) (2 points) In the processors that we studied in class and in part II (and part III) of Lab 7, virtual-to-physical address translation happens for:
  a. Every memory operation emitted in user mode (instruction loads, data loads, and data stores get their address translated).
  b. Only data memory operations emitted in user mode (data loads and data stores get their address translated).

(C) (2 points) After an interrupt in a process at pc X returns to the original running process, the control flow is restored to:
  a. Address X
  b. Address X + 4

(D) (2 points) An exception always returns control to the process that caused the exception.
  a. True
  b. False (e.g., the sleep syscall in Lab 7 returns to another process)

(E) (2 points) System calls (invoked using ecall in RISC-V):
  a. Behave like jal, so the user process must save all caller-saved registers before invoking a system call.
  b. Behave differently from jal, so the user process does not need to save all caller-saved registers before invoking a system call. (All registers are treated as callee-saved registers in system calls.)

Problem 2. Virtual Memory (16 points)

A standard RISC-V CPU is connected to a memory management unit (MMU) that uses a page table to translate 32-bit virtual addresses to 28-bit physical addresses using a page size of 2^16 bytes.

(A) (6 points) Including the D (dirty) and R (resident) control bits, please give the number of entries in the page map and the number of bits required for each entry in the page map.

  VA: VPN: 16 bits, page offset: 16 bits
  PA: PPN: 12 bits, page offset: 16 bits

  Number of entries in the page map: 2^16
  Number of bits required for each page map entry: 14 (12-bit PPN plus the D and R bits)

(B) (10 points) The following program fragment is executed and a record is made of the inputs and outputs of the MMU. The record is shown in the table on the right.

    lw   x12, 0(x10)
    addi x10, x10, 4
    slli x12, x12, 5
    sw   x11, 0(x12)

  Access type    Virtual address    Physical address
  Inst. fetch    0x13FFF8           0x3FFFF8
  Data read      0x                 x
  Inst. fetch    0x13FFFC           0x3FFFFC
  Inst. fetch    0x                 x
  Inst. fetch    0x                 x
  Data write     0x                 x

Using information from the program and the table above, please deduce the contents of as many entries as possible in the page table. Please make an entry in the table below for each page table entry we learn about, giving the VPN, the D and R control bits, and the PPN, showing the state of the page table after the execution of the program fragment. If you can't deduce the value of a field, please leave the field blank. Assume that pages holding instructions are read-only.

  VPN D R PPN
  0x x3F 0x98 1 0x89 0x x42 0x x
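The field widths in part (A) and the first row of the MMU record can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the quiz: the helper names `split` and `translate` and the dictionary-based page table are illustrative assumptions, and only the one mapping visible in the record (VA 0x13FFF8 to PA 0x3FFFF8) is used.

```python
# Parameters from Problem 2. Helper names below are hypothetical.
VA_BITS = 32
PA_BITS = 28
PAGE_OFFSET_BITS = 16  # page size is 2**16 bytes

VPN_BITS = VA_BITS - PAGE_OFFSET_BITS  # bits that index the page map
PPN_BITS = PA_BITS - PAGE_OFFSET_BITS  # physical page number width
NUM_ENTRIES = 2 ** VPN_BITS            # one page map entry per virtual page
ENTRY_BITS = PPN_BITS + 2              # PPN plus the D and R control bits


def split(addr, offset_bits=PAGE_OFFSET_BITS):
    """Return (page number, page offset) for an address."""
    return addr >> offset_bits, addr & ((1 << offset_bits) - 1)


def translate(va, page_table):
    """Translate a virtual address using a {VPN: PPN} dictionary."""
    vpn, offset = split(va)
    return (page_table[vpn] << PAGE_OFFSET_BITS) | offset


# Learn one mapping from the first instruction fetch in the record.
vpn, _ = split(0x13FFF8)
ppn, _ = split(0x3FFFF8)
page_table = {vpn: ppn}

print(VPN_BITS, PPN_BITS, NUM_ENTRIES, ENTRY_BITS)
print(hex(vpn), hex(ppn))
print(hex(translate(0x13FFFC, page_table)))
```

The last line translates the second instruction fetch with the mapping learned from the first one, which is the same reasoning the problem asks for when filling in the page table by hand.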

Problem 3. Pipelining Combinational Circuits (18 points)

For each of the questions below, please create a valid K-stage pipeline of the given circuit. Each component in the circuit is annotated with its propagation delay. Show your pipelining contours and place large black circles (●) on the signal arrows to indicate the placement of pipeline registers. Give the latency and throughput of each design, assuming ideal registers (t_PD = 0, t_SETUP = 0). Remember that our convention is to place a pipeline register on each output.

(A) (3 points) Show the maximum-throughput 1-stage pipeline.

  t_CLK = 11 ns
  Latency (ns): 11
  Throughput (ns^-1): 1/11

(B) (5 points) Show the maximum-throughput 2-stage pipeline using a minimal number of registers.

  t_CLK = 7 ns; latency = 2 * 7 = 14
  Latency (ns): 14
  Throughput (ns^-1): 1/7

(C) (5 points) Show the maximum-throughput pipeline using a minimal number of registers.

  t_CLK = 4 ns; latency = 3 * 4 = 12
  Latency (ns): 12
  Throughput (ns^-1): 1/4

(D) (5 points) You manage to reimplement the slowest combinational component in the previous circuit (the one with a propagation delay of 4 ns) using two components with propagation delays of 2 ns, as shown at right. Show the maximum-throughput pipeline using a minimal number of registers.

  t_CLK = 3 ns; latency = 4 * 3 = 12
  Latency (ns): 12
  Throughput (ns^-1): 1/3
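The latency and throughput answers in Problem 3 all follow from two relations for a K-stage pipeline with ideal registers: latency = K * t_CLK and throughput = 1 / t_CLK. A minimal sketch follows, using the stage counts and clock periods quoted in the solutions; the function name `pipeline_metrics` is an illustrative assumption, and the circuits themselves are not modeled.

```python
from fractions import Fraction


def pipeline_metrics(stages, t_clk_ns):
    """Return (latency in ns, throughput in 1/ns) for a K-stage pipeline.

    Assumes ideal registers (zero propagation delay and setup time), so the
    clock period equals the slowest stage's combinational delay.
    """
    latency = stages * t_clk_ns
    throughput = Fraction(1, t_clk_ns)
    return latency, throughput


# (stages, t_CLK in ns) as quoted in the solutions for parts (A) to (D).
designs = {"A": (1, 11), "B": (2, 7), "C": (3, 4), "D": (4, 3)}

for part, (stages, t_clk) in designs.items():
    latency, throughput = pipeline_metrics(stages, t_clk)
    print(f"({part}) latency = {latency} ns, throughput = {throughput} per ns")
```

Parts (C) and (D) have the same latency even though (D) has one more stage, because splitting the 4 ns component lets the clock period drop from 4 ns to 3 ns.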

Problem 4. Pipelined Processors (15 points)

Consider the execution of the following code sequence on a 5-stage pipelined RISC-V processor, which is fully bypassed, predicts all branches not-taken, and kills instructions following taken branches. Assume that branch taken/not-taken decisions are made in the EXE stage. Also, assume that the results of load operations are not available until the WB stage.

The loop sums the first 100 elements of the integer array at address 0x1000 and stores the result at address 0x2000. Assume execution halts at instruction unimp.

        lui  x11, 1          // x11 = 0x1000 (array)
        lui  x15, 2          // x15 = 0x2000 (result)
        addi x12, x0, 0      // x12 holds sum
        addi x14, x11, 400   // address of array element 101
    A:  lw   x13, 0(x11)     // load next array element
        addi x11, x11, 4     // addr of next array element
        add  x12, x12, x13   // add element value to sum
        bne  x11, x14, A     // loop until element 101
        sw   x12, 0(x15)     // store result
        xor  x2, x3, x4
        unimp

To help answer the following questions, fill in the pipeline diagram below showing execution of the loop, assuming that the loop was previously executed and will be repeated again after this iteration. Extra copies of this table are provided at the end of the quiz.

  Cycle   1     2     3     4     5     6     7     8     9     10
  IF      lw    addi  add   bne   bne   sw    xor   lw    addi  add
  DEC           lw    addi  add   add   bne   sw    NOP   lw    addi
  EXE                 lw    addi  NOP   add   bne   NOP   NOP   lw
  MEM                       lw    addi  NOP   add   bne   NOP   NOP
  WB                              lw    addi  NOP   add   bne   NOP

Bypass paths
  Cycle 5: lw to add
  Cycle 6: addi to bne
  Cycle 7: add to sw

(A) (3 points) Are there points in the execution of the sequence when data is bypassed from the EXE stage back to the DEC stage? If so, give the instruction(s) in the DEC stage at each such point; otherwise, enter NONE.

  Instruction(s) in DEC, or NONE: NONE

(B) (3 points) Are there points in the execution of the sequence when data is bypassed from the WB stage back to the DEC stage? If so, give the instruction(s) in the DEC stage at each such point; otherwise, enter NONE.

  Instruction(s) in DEC, or NONE: add, bne

(C) (3 points) Are there points during the execution of the sequence when the pipeline is stalled? If so, give the instruction(s) in the DEC stage at each such point; otherwise, enter NONE.

  Instruction(s) in DEC, or NONE: add

(D) (6 points) Fill in the contents of the WB stage below. You may have blank spaces indicating the time it takes for the lw instruction to reach the WB stage.

  cycle
  WB stage    lw    addi  NOP   add   bne   NOP

Problem 5. Processor Pipeline Performance (20 points)

You are designing a four-stage RISC-V processor (IF, DEC, EXE, WB) with a BTB for next-address prediction and a scoreboard for stalling on data hazards. Currently you are trying to decide whether to include bypassing through the register file from the write-back stage to the decode stage. As part of this evaluation, you construct two processors:

  Processor A: No bypassing from WB to DEC.
  Processor B: Bypassing from WB to DEC through the register file.

(A) (2 points) Question not covered on this quiz.

You are using the following loop of an important program to evaluate the performance of the processor:

    L1: lw   t0, 0(a0)
        add  a1, a1, t0
        addi a0, a0, 4
        blt  a0, a2, L1

For the following questions, assume this loop has been running for a long time.

(B) (4 points) How many cycles per loop iteration does the decode stage stall due to read-after-write hazards in the following cases?

  Processor A decode stall cycles per iteration: 4
  Processor B decode stall cycles per iteration: 2

(C) (2 points) How many cycles does this loop take to execute in the following cases?

  Processor A cycles per iteration: 8
  Processor B cycles per iteration: 6

Processor A:

  IF    lw    add   addi  addi  addi  blt   lw    lw    lw    add   addi
  DEC         lw    add   add   add   addi  blt   blt   blt   lw    add
  EXE               lw    NOP   NOP   add   addi  NOP   NOP   blt   lw
  WB                      lw    NOP   NOP   add   addi  NOP   NOP   blt

Processor B:

  IF    lw    add   addi  addi  blt   lw    lw    add
  DEC         lw    add   add   addi  blt   blt   lw    add   add
  EXE               lw    NOP   add   addi  NOP   blt   lw    NOP   add
  WB                      lw    NOP   add   addi  NOP   blt   lw    NOP


Processor A has the following propagation delays for each of the pipeline stages:

  IF:  2 ns
  DEC: 3.0 ns
  EX:  3.5 ns
  WB:  1.0 ns

The logic for the bypassing path of processor B can be viewed as taking the output from the DEC and WB stages of processor A and adding additional bypass logic (BYP), as shown in the picture below. Assume the BYP logic has a propagation delay of 1 ns.

(D) (4 points) What is the minimum clock period for each processor?

  Clock period for processor A: 3.5 ns
  Clock period for processor B: 4 ns

(E) (4 points) For the loop shown above, what is the average number of cycles per instruction for the two processors?

  Average cycles per instruction for processor A: 8/4 = 2
  Average cycles per instruction for processor B: 6/4 = 3/2

(F) (4 points) For the loop shown above, what is the average number of instructions per second for the two processors?

  Average number of instructions per second for processor A: 1/(7 ns)
  Average number of instructions per second for processor B: 1/(6 ns)

  Instr/sec = 1 / ((cycles/instr) * (sec/cycle))
  A: 1/(2 * 3.5 ns) = 1/(7 ns)
  B: 1/((3/2) * 4 ns) = 1/(6 ns)
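The arithmetic in parts (D) to (F) can be reproduced with a short script. This is a sketch under one stated assumption: since the picture is not reproduced here, the BYP delay is modeled as being added in series to the DEC stage, which is the reading that yields the 4 ns period given in the solution. The names `clock_period` and `ns_per_instruction` are illustrative.

```python
from fractions import Fraction

# Stage delays for processor A, in ns, as listed in the problem.
STAGE_DELAYS_NS = {
    "IF": Fraction(2),
    "DEC": Fraction(3),
    "EX": Fraction(7, 2),
    "WB": Fraction(1),
}
BYP_NS = Fraction(1)


def clock_period(delays):
    """The minimum clock period is set by the slowest stage."""
    return max(delays.values())


def ns_per_instruction(cycles_per_iter, instrs_per_iter, t_clk_ns):
    """Average time per instruction = CPI * clock period."""
    return Fraction(cycles_per_iter, instrs_per_iter) * t_clk_ns


delays_a = dict(STAGE_DELAYS_NS)
delays_b = dict(STAGE_DELAYS_NS)
delays_b["DEC"] += BYP_NS  # assumption: bypass mux sits in series with DEC

t_a = clock_period(delays_a)
t_b = clock_period(delays_b)

# Cycles per iteration (8 and 6) come from part (C); the loop has 4 instructions.
ns_a = ns_per_instruction(8, 4, t_a)
ns_b = ns_per_instruction(6, 4, t_b)

print(f"A: t_CLK = {float(t_a)} ns, {ns_a} ns per instruction")
print(f"B: t_CLK = {float(t_b)} ns, {ns_b} ns per instruction")
```

Processor B has the longer clock period but still completes instructions faster on this loop, because removing two stall cycles per iteration outweighs the 0.5 ns increase in cycle time.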

Problem 6. Synchronization (9 points)

G. Nome has designed four separate concurrent processes, each of which prints a single character: A, C, G or T. Her customers place orders for sequences that satisfy certain constraints, and Ms. Nome adds semaphores as appropriate to ensure the printed sequence meets the specified criteria.

For each of the customer orders below, add the appropriate semaphores so the running processes will produce sequences that make the customers happy. Don't forget to specify the semaphores' initial values! To receive full credit, don't impose any unnecessary constraints. Assume the processes start running immediately and that there are no constraints on the order in which statements in different processes are executed except those imposed by your semaphores. Processes will run indefinitely although they may, of course, end up stuck in a WAIT().

(A) (4 points) "I'd like the sequence CAT."

  semc = 1; sema = 0; semt = 0; semg = 0;

  Process #1       Process #2       Process #3       Process #4
  A:               C:               G:               T:
    wait(sema)       wait(semc)       wait(semg)       wait(semt)
    print("A")       print("C")       print("G")       print("T")
    signal(semt)     signal(sema)
    goto A           goto C           goto G           goto T

(B) (5 points) "My sequences have to be exactly 4 characters long."

  semc = 4;

  Process #1       Process #2       Process #3       Process #4
  A:               C:               G:               T:
    wait(semc)       wait(semc)       wait(semc)       wait(semc)
    print("A")       print("C")       print("G")       print("T")
    goto A           goto C           goto G           goto T

Problem 7. Branch Prediction in Complex Pipeline (12 points)

Ben Bitdiddle has decided his high-performance RISC-V processor should have 8 pipeline stages, shown below.

  IF1     Instruction fetch, first cycle
  IF2     Instruction fetch, second cycle

  D       Instruction decode, calculate branch target address
  RF      Read/bypass register operands, make branch decision
  ALU     Perform ALU operation on operands
  MEM1    LD/ST memory access, first cycle
  MEM2    LD/ST memory access, second cycle
  WB      Write result to register file at end of cycle

Unless directed otherwise, the IF1 stage speculates that the next instruction comes from PC+4. The determination that an instruction is a branch instruction (e.g., beq, bne) is made in the D stage. The calculation of the branch target address is also made in the D stage. The actual branch decision (taken/not taken) is made in the ALU stage.

(A) (4 points) With the 8-stage pipeline, what is the number of NOPs introduced into the pipeline when a branch instruction changes the PC to the branch target address, i.e., it's a taken branch? When a branch instruction is not a taken branch? The number of NOPs introduced is called the branch penalty.

  Branch penalty for taken branches (# of NOPs introduced): 4
  Branch penalty for not-taken branches (# of NOPs introduced): 0

To reduce the penalty for taken branches, Ben plans to use a direct-mapped Branch Target Buffer (BTB) in the IF1 stage together with a Branch History Table (BHT) in the D stage. Detailed descriptions of the BTB and BHT are provided below. Recall that in the ALU stage, it becomes known whether the branch was actually taken or not.

1. The BTB holds (entrypc, targetpc) pairs for jumps and branches predicted to be taken. Assume that the targetpc predicted by the BTB is always correct for this question. (Yet the direction still might be wrong.)

2. The BTB is accessed every cycle. If there is a match with the current PC, the PC is redirected to the targetpc predicted by the BTB (unless the PC is redirected by an older instruction); if not, it is set to PC+4.

3. In the D stage (instruction decode, calculate branch target address), a conditional branch instruction (beq/bne) looks up the BHT, but an unconditional jump (jal, jalr) does not. If a branch is predicted to be taken, stages IF1 and IF2 are flushed and the PC is redirected to the calculated branch target address.

(B) (8 points) Fill out the following table of the number of pipeline bubbles (inserted NOPs) for conditional branches.

Fill in table of branch penalties

  BTB hit?    BHT predicted taken?    Actually taken?    Pipeline bubbles
  Yes         Yes                     Yes                0
  Yes         Yes                     No                 4
  Yes         No                      Yes                Cannot occur
  Yes         No                      No                 Cannot occur
  No          Yes                     Yes                2
  No          Yes                     No                 4
  No          No                      Yes                4
  No          No                      No                 0
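The rows of the table can be derived from three facts stated in the problem: a BTB hit redirects fetch in IF1 at no cost, a BHT taken prediction in D flushes IF1 and IF2 (2 bubbles), and a wrong direction discovered in ALU flushes the four stages behind it (IF1, IF2, D, RF). The sketch below encodes that reasoning; the function name `bubbles` and the constants are illustrative assumptions, and the "cannot occur" cases are taken from the solution table rather than derived.

```python
from itertools import product

MISPREDICT_PENALTY = 4    # IF1, IF2, D, RF are behind a branch in ALU
BHT_REDIRECT_PENALTY = 2  # IF1 and IF2 are flushed when D redirects the PC


def bubbles(btb_hit, bht_taken, actually_taken):
    """Return the number of bubbles, or None for combinations that cannot occur."""
    if btb_hit and not bht_taken:
        # Per the solution table: the BTB only holds branches predicted taken.
        return None
    predicted_taken = btb_hit or bht_taken
    if predicted_taken != actually_taken:
        # Wrong direction is discovered in ALU; everything younger is killed.
        return MISPREDICT_PENALTY
    if actually_taken and not btb_hit:
        # Correctly predicted taken, but the redirect only happened in D.
        return BHT_REDIRECT_PENALTY
    return 0


for btb_hit, bht_taken, taken in product([True, False], repeat=3):
    result = bubbles(btb_hit, bht_taken, taken)
    label = "cannot occur" if result is None else result
    print(f"BTB hit={btb_hit!s:5} BHT taken={bht_taken!s:5} taken={taken!s:5} -> {label}")
```

In this model the no/no/no case costs 0 bubbles, which is consistent with the not-taken penalty of 0 from part (A).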

Blank pipeline diagrams for Problem 3

  Cycle
  IF      lw
  DEC
  EXE
  MEM
  WB

  Cycle
  IF      lw
  DEC
  EXE
  MEM
  WB

END OF PRACTICE QUIZ 3!


6.004 Tutorial Problems L22 Branch Prediction Branch target buffer (BTB): Direct-mapped cache (can also be set-associative) that stores the target address of jumps and taken branches. The BTB is searched