ETH, Design of Digital Circuits, SS17 Review Session Questions I - SOLUTIONS

Instructors: Prof. Onur Mutlu, Prof. Srdjan Capkun
TAs: Jeremie Kim, Minesh Patel, Hasan Hassan, Arash Tavakkol, Der-Yeuan Yu, Francois Serre, Victoria Caparros Cabezas, David Sommer, Mridula Singh, Sinisa Matetic, Aritra Dhar, Marco Guarnieri

1 Potpourri

(a) Full Pipeline

Keeping a processor pipeline full with useful instructions is critical for achieving high performance. What are the three fundamental reasons why a processor pipeline cannot always be kept full?

Reason 1. Data/control flow dependences
Reason 2. Multi-cycle operations
Reason 3. Resource contention

(b) Out-of-Order vs. Dataflow

When does an instruction execute in a dataflow processor?
When all inputs to the instruction are ready.

When does the fetch of an instruction happen in an out-of-order execution processor?
When the program counter points to that instruction.

What structure holds the instructions until they are ready to execute in an out-of-order processor?
Reservation stations.

(c) Tomasulo's Algorithm

Here is the state of the reservation stations in a processor during a particular cycle ( denotes an unknown value):

ADD Reservation Station
Tag   V  Tag  Data   V  Tag  Data
A     0  D           1       27
B        E
C     0  B           0  A

MUL Reservation Station
Tag   V  Tag  Data   V  Tag  Data
D     0  B           0  C
E        B

What is wrong with this picture?
Cyclical dependences between instructions, which lead to deadlock. (Between tags B and E, and also between tags A, D, and C.)

2 Out-of-Order Execution

In this problem, we will give you the state of the Register Alias Table (RAT) and Reservation Stations (RS) for a Tomasulo-like out-of-order execution engine. Your job is to determine the original sequence of five instructions in program order.

The out-of-order machine in this problem behaves as follows:

- The frontend of the machine has a one-cycle fetch stage and a one-cycle decode stage. The machine can fetch one instruction per cycle, and can decode one instruction per cycle.
- The machine dispatches one instruction per cycle into the reservation stations, in program order. Dispatch occurs during the decode stage.
- An instruction always allocates the first reservation station that is available (in top-to-bottom order) at the required functional unit.
- When a value is captured (at a reservation station) or written back (to a register) in this machine, the old tag that was previously at that location is not cleared; only the valid bit is set.
- When an instruction in a reservation station finishes executing, the reservation station is cleared.
- Both the adder and multiplier are fully pipelined. Add instructions take 2 cycles. Multiply instructions take 4 cycles.
- When an instruction completes execution, it broadcasts its result, and dependent instructions can begin execution in the next cycle if they have all operands available.
- When multiple instructions are ready to execute at a functional unit, the oldest ready instruction is chosen.

Initially, the machine is empty. Five instructions are then fetched, decoded, and dispatched into reservation stations, before any instruction executes. Then, one instruction completes execution. Here is the state of the machine at this point, after the single instruction completes:

(a) Give the five instructions that have been dispatched into the machine, in program order. The source registers for the first instruction can be specified in either order. Give instructions in the following format: opcode destination source1, source2.

MUL R3 R1, R7
MUL R4 R1, R2
ADD R2 R3, R4
ADD R6 R0, R5
MUL R6 R2, R6

(b) Now assume that the machine flushes all instructions out of the pipeline and restarts execution from the first instruction in the sequence above. Show the full pipeline timing diagram below for the sequence of five instructions that you determined above, from the fetch of the first instruction to the writeback of the last instruction. Assume that the machine stops fetching instructions after the fifth instruction.

As we saw in class, use F for fetch, D for decode, E1, E2, E3, and E4 to signify the first, second, third, and fourth cycles of execution for an instruction (as required by the type of instruction), and W to signify writeback. You may or may not need all columns shown.

Cycle:
MUL R3 R1, R7    F D E1 E2 E3 E4 W
MUL R4 R1, R2    F D E1 E2 E3 E4 W
ADD R2 R3, R4    F D E1 E2 W
ADD R6 R0, R5    F D E1 E2 W
MUL R6 R2, R6    F D E1 E2 E3 E4 W

Finally, show the state of the RAT and reservation stations after 8 cycles in the blank figures below.

3 The GPU Strikes Back!

We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of a program.

The following code segment is run on a GPU. Each thread executes a single iteration of the shown loop. Assume that the data values of the arrays A, B, and C are already in vector registers, so there are no loads and stores in this program. (Hint: Notice that there are 5 instructions in each thread.) A warp in the GPU consists of 32 threads, and there are 32 SIMD lanes in the GPU.

    for (i = 0; i < (512*1024); i++) {
        if (B[i] < 0) {
            A[i] = A[i] * C[i];
            B[i] = A[i] + B[i];
            C[i] = C[i] + 1;
            B[i] = B[i] + 1;
        }
    }

(a) How many warps does it take to execute this program?

Warps = (Number of threads) / (Number of threads per warp)
Number of threads = 2^19 (i.e., one thread per loop iteration).
Number of threads per warp = 32 = 2^5 (given).
Warps = 2^19 / 2^5 = 2^14

(b) When we measure the SIMD utilization for this program with one input set, we find that it is 9/40. What can you say about arrays A, B, and C? Be precise.

A: Nothing.
B: 1 in every 32 of B's elements is negative.
C: Nothing.

(c) Is it possible for this program to yield a SIMD utilization of 20% (circle one)?

NO

If YES, what should be true about arrays A, B, and C for the SIMD utilization to be 20%? Be precise.

A:
B:
C:

If NO, explain why not.

The smallest SIMD utilization possible is the same as in part (b), 36/160, but this is greater than 20%.

(d) Is it possible for this program to yield a SIMD utilization of 100% (circle one)?

YES

If YES, what should be true about arrays A, B, and C for the SIMD utilization to be 100%? Be precise.

A: Nothing.
B: Either: (1) all of B's elements are less than 0, or (2) all of B's elements are greater than or equal to 0.
C: Nothing.

If NO, explain why not.
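
The numbers in parts (a) through (d) can be reproduced with a short calculation. This is a sketch under two assumptions that are ours, not the exam's: every warp has the same number of threads taking the branch, and the 5 instructions per thread are the branch itself (executed by all 32 lanes) plus the 4 statements in the if-body (executed only by lanes whose B[i] is negative). The function names are illustrative.

```python
from fractions import Fraction

WARP_SIZE = 32
INSTRS_PER_THREAD = 5   # 1 branch + 4 statements inside the if-body

def warp_count(num_threads, warp_size=WARP_SIZE):
    """Number of warps when each thread runs one loop iteration."""
    return num_threads // warp_size

def simd_utilization(active, warp_size=WARP_SIZE):
    """Utilization when `active` threads per warp take the branch."""
    if active == 0:
        # Only the branch instruction issues, and it keeps all lanes busy.
        return Fraction(1)
    busy_lanes = warp_size + (INSTRS_PER_THREAD - 1) * active
    total_lanes = INSTRS_PER_THREAD * warp_size
    return Fraction(busy_lanes, total_lanes)

print(warp_count(512 * 1024))                          # part (a)
print(simd_utilization(1))                             # part (b)
print(min(simd_utilization(k) for k in range(33)))     # part (c): lower bound
print(simd_utilization(32), simd_utilization(0))       # part (d)
```

Under this model the minimum is reached with exactly one active thread per warp, which is why 20% is out of reach.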

4 Microcoded Machines

Microcoding is a powerful technique to perform complex computations on a simple datapath. In this problem, you will sketch out a microcode-controlled single-bus implementation of the MIPS R2000 ISA. In a single-bus processor, the primary interconnect between state elements and functional units is a shared bus. In each cycle, only one unit can drive data on the bus, while all other units may listen and selectively latch the data.

Below is the starting point of a single-bus microarchitecture. Assume that there is a microcode control ROM that can drive all of the control signals shown based on a current state and specify a next state.

As an example, below is a possible microcode sequence for the three-register ADD instruction. Assume that other states which are not shown handle instruction fetch and PC update. Thus, at the initial state shown here, the instruction register already contains the instruction.

State nextstate lepc enpc leir enir lea leb ALU OP enalu select field leidx RF R RF W enrf lemar MEM R MEM W enMem Comment
ADD 0 ADD No Op 0 21 (rs) Latch rs into IDX
ADD 1 ADD No Op Read RF[rs] into A
ADD 2 ADD No Op 0 16 (rt) Latch rt into IDX
ADD 3 ADD No Op Read RF[rt] into B
ADD 4 ADD No Op 0 11 (rd) Latch rd into IDX
ADD 5 Fetch ADD A+B into RF[rd]

For this homework problem, you will implement a new instruction, ADDM, in microcode. ADDM performs an addition where one operand is loaded from memory and the other operand comes from a register, and the result is stored back into a register. It is an R-type instruction with the following semantics:

RF[rd] = M[RF[rs]] + RF[rt]

In other words, the instruction loads the word in memory at the address specified by register rs, then adds the loaded value to the value in register rt, and stores the result in register rd.

Write out a microcode sequence like the one in the table above for the ADDM instruction. You can assume that when accessing memory (with the MEM R signal asserted), the microcode sequencer will stall until the memory provides the data.

Microcode for ADDM instruction

State nextstate lepc enpc leir enir lea leb ALU OP enalu select field leidx RF R RF W enrf lemar MEM R MEM W enMem Comment
ADDM 0 ADDM No Op 0 21 (rs) Latch rs into IDX
ADDM 1 ADDM No Op Read RF[rs] into MAR
ADDM 2 ADDM No Op Latch M[RF[rs]] into A
ADDM 3 ADDM No Op 0 16 (rt) Latch rt into IDX
ADDM 4 ADDM No Op Latch RF[rt] into B
ADDM 5 ADDM No Op 0 11 (rd) Latch rd into IDX
ADDM 6 Fetch ADD A+B into RF[rd]
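
To see that the seven ADDM states implement RF[rd] = M[RF[rs]] + RF[rt], the sequence can be traced at the register-transfer level. The sketch below is only an illustration: it models the register file and memory as Python dictionaries, ignores the individual control signals, and uses a hypothetical function name `addm`.

```python
def addm(rf, mem, rs, rt, rd):
    """Trace the ADDM micro-sequence on a toy register file and memory."""
    idx = rs              # ADDM 0: latch rs into IDX
    mar = rf[idx]         # ADDM 1: read RF[rs] into MAR
    a = mem[mar]          # ADDM 2: latch M[RF[rs]] into A (sequencer stalls on memory)
    idx = rt              # ADDM 3: latch rt into IDX
    b = rf[idx]           # ADDM 4: latch RF[rt] into B
    idx = rd              # ADDM 5: latch rd into IDX
    rf[idx] = a + b       # ADDM 6: A+B into RF[rd], then go to Fetch
    return rf

# Toy state: r1 holds an address, r2 holds the register operand.
registers = {1: 100, 2: 7, 3: 0}
memory = {100: 35}
addm(registers, memory, rs=1, rt=2, rd=3)
print(registers[3])
```

Compared with the ADD sequence, the only new work is the detour through MAR and memory to fill A.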

5 Out-of-Order Execution

Out-of-order processors make efficient use of their functional units by executing instructions according to the flow of data between them.

(a) In class, we learned about a graphical way of showing the data dependencies (edges) of instructions (vertices) using a data flow graph. For the instruction stream below, draw the corresponding data flow graph.

ADD r3 <- r1, r2
MUL r4 <- r1, r3
ADD r5 <- r8, r9
DIV r6 <- r1, r3
ADD r7 <- r6, r5

Solution:

[Figure: data flow graph with values r1, r2, r3, r8, r9, r4, r6, r5]

(b) Of course, while a data flow graph is a helpful way of visualizing what goes on in an out-of-order processor, real machines execute instructions using multiple pipeline stages. Take the code from above and find the number of cycles it would take to execute on a processor with in-order fetch and out-of-order dispatch. Assume the following:

- It takes one cycle to decode an instruction and another cycle to issue the decoded instruction into a reservation station. These two stages can be pipelined.
- There are separate functional units for ADD, MUL, and DIV, and each functional unit has its own reservation station.
- Tags are broadcast in the same cycle their corresponding operation finishes executing.
- A single tag broadcast bus exists, and if more than one reservation station entry to the same execution unit is ready to be dispatched, the older reservation station entry will use the bus and the younger ones will have to stall.
- Similarly, if more than one instruction contends for writing to the ROB or architectural register file during a single cycle, the oldest one receives access and the younger ones must stall.
- ADDs take 3 cycles and are pipelined, MULs take 5 cycles and are pipelined, and DIVs take 7 cycles and are pipelined.

How many cycles will the program take to execute on an in-order fetch and out-of-order dispatch machine, under the assumptions above?

18 cycles:

ADD    F D S E E E R W
MUL    F D S - - E E E E E R W
ADD    F D S E E E R W
DIV    F D S E E E E E E E R W
ADD    F D S E E E R W

(c) How many cycles will the program take to execute on an in-order fetch and in-order dispatch machine, under the same assumptions as above?

20 cycles:

ADD    F D S E E E R W
MUL    F D S - - E E E E E R W
ADD    F D - - S E E E R - - W
DIV    F - - D S E E E E E E E R W
ADD    F D S E E E R W
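
The data flow graph of part (a) and the execution-latency critical path behind part (b) can be derived from the instruction stream. This is a minimal sketch that tracks only true (read-after-write) dependences and uses the ADD/MUL/DIV latencies given above; the tuple encoding of instructions and the function names are assumptions made for illustration.

```python
LATENCY = {"ADD": 3, "MUL": 5, "DIV": 7}

# (opcode, destination, source1, source2), in program order
program = [
    ("ADD", "r3", "r1", "r2"),
    ("MUL", "r4", "r1", "r3"),
    ("ADD", "r5", "r8", "r9"),
    ("DIV", "r6", "r1", "r3"),
    ("ADD", "r7", "r6", "r5"),
]

def dataflow_edges(program):
    """Return (producer index, consumer index) pairs for true dependences."""
    last_writer = {}
    edges = set()
    for i, (_op, dst, *srcs) in enumerate(program):
        for src in srcs:
            if src in last_writer:
                edges.add((last_writer[src], i))
        last_writer[dst] = i
    return edges

def critical_path(program):
    """Longest chain of execution latencies through the data flow graph."""
    edges = dataflow_edges(program)
    finish = []
    for i, (op, *_rest) in enumerate(program):
        start = max((finish[p] for p, c in edges if c == i), default=0)
        finish.append(start + LATENCY[op])
    return max(finish)

print(sorted(dataflow_edges(program)))
print(critical_path(program))
```

The longest chain is ADD, DIV, ADD, that is 3 + 7 + 3 execute cycles. Adding the F, D, and S stages of the first ADD and the R and W stages of the last ADD accounts for the 18 cycles of the out-of-order schedule.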

6 Vector Processing

You are studying a program that runs on a vector computer with the following latencies for various instructions:

- VLD and VST: 50 cycles for each vector element; fully interleaved and pipelined.
- VADD: 4 cycles for each vector element (fully pipelined).
- VMUL: 16 cycles for each vector element (fully pipelined).
- VDIV: 32 cycles for each vector element (fully pipelined).
- VRSHF (right shift): 1 cycle for each vector element (fully pipelined).

Assume that:

- The machine has an in-order pipeline.
- The machine supports chaining between vector functional units.
- In order to support 1-cycle memory access after the first element in a vector, the machine interleaves vector elements across memory banks. All vectors are stored in memory with the first element mapped to bank 0, the second element mapped to bank 1, etc.
- Each memory bank has an 8KB row buffer.
- Vector elements are 64 bits in size.
- Each memory bank has two ports (so that two loads/stores can be active simultaneously), and there are two load/store functional units available.

(a) What is the minimum power-of-two number of banks required in order for memory accesses to never stall? (Assume a vector stride of 1.)

64 banks, because memory latency is 50 cycles and the next power of two is 64.

There is another solution if one interprets "never stall" to mean that a single load will never stall rather than the memory accesses in the program below: in that case, 32 banks suffice since each bank has two ports. For those who answered this way on the test, we gave full credit.

(b) The machine (with as many banks as you found in part (a)) executes the following program (assume that the vector stride is set to 1):

VLD V1 <- A
VLD V2 <- B
VADD V3 <- V1, V2
VMUL V4 <- V3, V1
VRSHF V5 <- V4, 2

It takes 111 cycles to execute this program. What is the vector length?

40 elements

VLD    |----50----|---(VLEN-1)----|
VLD    |1|----50----|
VADD                |-4-|
VMUL                    |-16-|
VRSHF                        |1|---(VLEN-1)---|

50 + 1 + 4 + 16 + 1 + (VLEN-1) = 71 + VLEN = 111 -> VLEN = 40

If the machine did not support chaining (but could still pipeline independent operations), how many cycles would be required to execute the same program? Show your work.

228 cycles

VLD    |----50----|---(VLEN-1)---|
VLD    |1|----50----|---(VLEN-1)---|
VADD                               |-4-|---(VLEN-1)---|
VMUL                                                  |-16-|---(VLEN-1)---|
VRSHF                                                                     |1|--(VLEN-1)--|

50 + 1 + 4 + 16 + 1 + 4*(VLEN-1) = 68 + 4*VLEN = 228

(c) The architect of this machine decides that she needs to cut costs in the machine's memory system. She reduces the number of banks by a factor of 2 from the number of banks you found in part (a) above. Because loads and stores might stall due to bank contention, an arbiter is added to each bank so that pending loads from the oldest instruction are serviced first. How many cycles does the program take to execute on the machine with this reduced-cost memory system (but with chaining)?

129 cycles

VLD (A): [0] bank 0 (takes port 0) ... [31] bank 31, [32] bank 0 (takes port 0) ... [39] bank 7
VLD (B): [0] bank 0 (takes port 1) ... [31] bank 31, [32] bank 0 (takes port 1) ... [39] bank 7
VADD, VMUL, VRSHF: tracking last elements

(B[39]: 1 + 50 + 50 + 7) + 4 + 16 + 1 = 129 cycles

Now, the architect reduces cost further by reducing the number of memory banks (to a lower power of 2). The program executes in 279 cycles. How many banks are in the system?

8 banks

VLD (A): [0] ... [8] ... [16] ... [24] ... [32] ... [39]
VLD (B): ... [39]
VADD, VMUL, VRSHF: tracking last elements

5*50 + 1 + 7 + 4 + 16 + 1 = 279 cycles
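
The cycle counts in this problem all follow from a few additions, which the sketch below collects in one place. It encodes our reading of the solution rather than a general vector-machine model: the second VLD issues one cycle after the first, the two loads use the two ports of each bank, and with fewer banks than the 50-cycle load latency each group of `banks` consecutive elements waits for bank 0 to become free. The function names are hypothetical.

```python
VLD, VADD, VMUL, VRSHF = 50, 4, 16, 1   # per-element latencies from the problem

def cycles_chained(vlen):
    """Chaining, no bank conflicts: later units start on the first element."""
    return VLD + 1 + VADD + VMUL + VRSHF + (vlen - 1)

def cycles_unchained(vlen):
    """No chaining: each dependent instruction waits for the whole vector."""
    return VLD + 1 + VADD + VMUL + VRSHF + 4 * (vlen - 1)

def cycles_banked(vlen, banks):
    """Chaining with `banks` two-ported banks, tracking the last element of B."""
    if banks >= VLD:
        return cycles_chained(vlen)      # a bank is free again before it is revisited
    rounds, bank_of_last = divmod(vlen - 1, banks)
    start_of_last = 1 + rounds * VLD + bank_of_last
    return start_of_last + VLD + VADD + VMUL + VRSHF

print(cycles_chained(40), cycles_unchained(40))
print(cycles_banked(40, 32), cycles_banked(40, 8))
```

Solving `cycles_chained(vlen) == 111` for `vlen` gives the vector length of part (b), and trying powers of two in `cycles_banked` is one way to find the bank count that produces 279 cycles.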

7 Tracing the Cache

Assume you have three toy CPUs: 6808-D, 6808-T, and 6808-F. All three CPUs feature one level of cache. The cache size is 128 bytes, the cache block size is 32 bytes, and the cache uses LRU replacement. The only difference between the three CPUs is the associativity of the cache:

- 6808-D uses a direct-mapped cache.
- 6808-T uses a two-way associative cache.
- 6808-F uses a fully associative cache.

You run the SPECMem3000 program to evaluate the CPUs. This benchmark program tests only memory read performance by issuing read requests to the cache. Assume that the cache is empty before you run the benchmark. The cache accesses generated by the program are as follows, in order of access from left to right:

A, B, A, H, B, G, H, H, A, E, H, D, H, G, C, C, G, C, A, B, H, D, E, C, C, B, A, D, E, F

Each letter represents a unique cache block. All 8 cache blocks are contiguous in memory. However, the ordering of the letters does not necessarily correspond to the ordering of the cache blocks in memory.

For 6808-D, you observe the following cache misses in order of generation:

A, B, A, H, B, G, A, E, D, H, C, G, C, B, D, A, F

(a) By using the above trace, please identify which cache blocks are in the same set for the 6808-D processor. Please be clear.

A and B
C and G
H and D
E and F

(b) Please write down the sequence of cache misses for the 6808-F processor in their order of generation. (Hint: You might want to write down the cache state after each request.)

By simulating the cache and using the requests (x marks a miss, . marks a hit, - marks an empty way):

Req : A B A H B G H H A E H D H G C C G C A B H D E C C B A D E F
M?  : x x . x . x . . . x . x . x x . . . x x x x x x . x x x x x
MRU : A B A H B G H H A E H D H G C C G C A B H D E C C B A D E F
    : - A B A H B G G H A E H D H G G C G C A B H D E E C B A D E
    : - - - B A H B B G H A E E D H H H H G C A B H D D E C B A D
LRU : - - - - - A A A B G G A A E D D D D H G C A B H H D E C B A

(c) For 6808-T, you observed the following five cache misses in order of generation:

A, B, H, G, E

But, unfortunately, your evaluation setup broke before you could observe all cache misses for the 6808-T. Using the given information, which cache blocks are in the same set for the 6808-T processor?

We first simulate the cache up to the point it breaks, using the requests:

Req : A B A H B G H H A E
M?  : x x . x . x . . . x
MRU0: A B A A B B B B A E
LRU0: - A B B A A A A B A
MRU1: - - - H H G H H H H
LRU1: - - - - - H G G G G

If H were in the same set as A and B, then the B right after H would have missed. Similarly, if G were in the same set as A and B, the A right before E would have missed. Using this information and the sets calculated in part (a), the sets are, respectively:

A, B, E, and F
H, G, C, and D

(d) Please write down the sequence of cache misses for the 6808-T processor in their order of generation.

Req : A B A H B G H H A E H D H G C C G C A B H D E C C B A D E F
M?  : x x . x . x . . . x . x . x x . . . . x x x x x . . x . x x
MRU0: A B A A B B B B A E E E E E E E E E A B B B E E E B A A E F
LRU0: - A B B A A A A B A A A A A A A A A E A A A B B B E B B A E
MRU1: - - - H H G H H H H H D H G C C G C C C H D D C C C C D D D
LRU1: - - - - - H G G G G G H D H G G C G G G C H H D D D D C C C

(e) What is the cache miss rate for each processor?

6808-D: 17/30
6808-T: 16/30
6808-F: 19/30
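
The three miss counts can be cross-checked with a small LRU simulator. This is a sketch, not part of the original solution: the block-to-set mappings are the ones derived in parts (a) and (c), and the names `simulate`, `direct`, `two_way`, and `fully` are our own.

```python
TRACE = "A B A H B G H H A E H D H G C C G C A B H D E C C B A D E F".split()

def simulate(trace, set_of, ways):
    """Run an LRU cache over the trace and return the list of missed blocks.

    set_of maps each block to its set index; ways is the associativity.
    """
    sets = {}                      # set index -> blocks, least recently used first
    misses = []
    for block in trace:
        lines = sets.setdefault(set_of[block], [])
        if block in lines:
            lines.remove(block)    # hit: block moves to the MRU position below
        else:
            misses.append(block)
            if len(lines) == ways:
                lines.pop(0)       # evict the least recently used block
        lines.append(block)
    return misses

# Mappings taken from the answers to parts (a) and (c).
direct = {b: i for i, group in enumerate(["AB", "CG", "HD", "EF"]) for b in group}
two_way = {b: i for i, group in enumerate(["ABEF", "HGCD"]) for b in group}
fully = {b: 0 for b in "ABCDEFGH"}

for name, mapping, ways in [("6808-D", direct, 1), ("6808-T", two_way, 2), ("6808-F", fully, 4)]:
    misses = simulate(TRACE, mapping, ways)
    print(name, f"{len(misses)}/{len(TRACE)}", " ".join(misses))
```

The direct-mapped run reproduces the miss sequence given in the problem statement, which confirms the set assignment of part (a).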

8 Programming a Systolic Array

Figure 1 shows a systolic array processing element. Each processing element takes in two inputs, M and N, and outputs P and Q. Each processing element also contains an accumulator R that can be read from and written to. The initial value of the accumulator is 0.

Figure 2 shows a systolic array composed of 9 processing elements. The smaller boxes are the inputs to the systolic array and the larger boxes are the processing elements. You will program this systolic array to perform the following calculation:

[c00 c01 c02]   [a00 a01 a02]   [b00 b01 b02]
[c10 c11 c12] = [a10 a11 a12] x [b10 b11 b12]
[c20 c21 c22]   [a20 a21 a22]   [b20 b21 b22]

In each time cycle, each processing element will take in its two inputs, perform any necessary actions, and write on its outputs. The time cycle labels on the input boxes determine which time cycle the inputs will be fed into their corresponding processing elements. Any processing element input that is not driven will default to 0, and any processing element that has no output arrow will have its output ignored.

After all the calculations finish, each processing element's accumulator will hold one element of the final result matrix, arranged in the correct order.

(a) Please describe the operations that each individual processing element performs, using mathematical equations and the variables M, N, P, Q, and R.

[Figure 1: A systolic array processing element, with inputs M and N, accumulator R, and outputs P and Q]

P = M
Q = N
R = R + M x N

(b) Please fill in all 30 input boxes in Figure 2 so that the systolic array computes the correct matrix multiplication result described on the previous page. (Hint: Use a_ij and b_ij.)

[Figure 2: A systolic array]

Inputs b, one column of boxes per array column:

TIME   column 0   column 1   column 2
4      0          0          b22
3      0          b21        b12
2      b20        b11        b02
1      b10        b01        0
0      b00        0          0

Inputs a, one row of boxes per array row:

TIME   row 0   row 1   row 2
4      0       0       a22
3      0       a12     a21
2      a02     a11     a20
1      a01     a10     0
0      a00     0       0
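
The input schedule can be validated by simulating the array cycle by cycle. The sketch below assumes an orientation that the extracted figure no longer shows explicitly: M enters each row from the left carrying the a values and is passed right as P, while N enters each column from the top carrying the b values and is passed down as Q. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an n x n systolic array computing a x b in its accumulators."""
    n = len(a)
    r = [[0] * n for _ in range(n)]   # accumulators R, initially 0
    p = [[0] * n for _ in range(n)]   # P outputs from the previous cycle
    q = [[0] * n for _ in range(n)]   # Q outputs from the previous cycle
    for t in range(3 * n - 2):        # the last product is formed at time 3n - 3
        new_p = [[0] * n for _ in range(n)]
        new_q = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:            # left edge: row i receives a[i][t - i]
                    k = t - i
                    m = a[i][k] if 0 <= k < n else 0
                else:
                    m = p[i][j - 1]
                if i == 0:            # top edge: column j receives b[t - j][j]
                    k = t - j
                    nn = b[k][j] if 0 <= k < n else 0
                else:
                    nn = q[i - 1][j]
                r[i][j] += m * nn     # R = R + M x N
                new_p[i][j] = m       # P = M
                new_q[i][j] = nn      # Q = N
        p, q = new_p, new_q
    return r

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
for row in systolic_matmul(a, b):
    print(row)
```

The skew in the schedule (row i and column j each delayed by their index) is what makes a[i][k] and b[k][j] meet in processing element (i, j) at time i + j + k.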
