CPE 631 Advanced Computer Systems Architecture: Homework #3,#4



Issued: 04/09/2007
Due: 04/18/2007

Name:

Q1 | Q2 | Q3 | Q4 | Q5 | Total

Question #1: (20 points)

1.1 (20 points) Consider the following code sequence executing on a CPU using Tomasulo's algorithm for a single-issue processor. The loop implements Y = a*X/Y for a vector of length 100. Initially R1=0 and F0 contains a.

    foo: L.D    F2,0(R1)     ; load X(i)
         MUL.D  F4,F2,F0     ; multiply a*x(i)
         L.D    F6,800(R1)   ; load Y(i)
         DIV.D  F6,F4,F6     ; a*x(i)/y(i)
         S.D    F6,800(R1)   ; store Y(i)
         DADDUI R1,R1,#8     ; increment R1
         DSGTUI R3,R1,800    ; test if done
         BEQZ   R3,foo       ; loop if not done

Assume the following:
- The EX stage does the effective address calculation for loads and stores.
- Loads take 1 clock cycle.
- Results are communicated via a single CDB.
- The issue (IS) and write back (WB) stages take 1 clock cycle.
- There are separate integer units for effective address calculation, for ALU operations, and for branch condition evaluation.
- Functional units are not pipelined.
- There are 5 load buffer slots.
- There are 5 store buffer slots.

FU Type | Cycles in EX | Number of FUs | Number of reservation stations
Integer (eff. address) | | |
Integer (ALU oper.) | | |
Integer (branch eval.) | | |
FP adder | | |
FP multiplier | | |
FP divider | | |
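As a reference for what the loop computes, independent of how it is scheduled, the following minimal Python sketch mirrors the stated intent Y(i) = a*X(i)/Y(i). The function name `scaled_divide` and the use of Python lists are illustrative assumptions. The assembly addresses X at offset 0 and Y at offset 800 bytes with 8-byte elements, which the sketch abstracts into two separate lists.

```python
def scaled_divide(a, x, y):
    """Reference semantics of the loop: y[i] = a * x[i] / y[i] for every element.

    Hypothetical helper for illustration only; it says nothing about timing.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        product = a * x[i]      # MUL.D F4,F2,F0
        y[i] = product / y[i]   # DIV.D F6,F4,F6 followed by S.D
    return y
```

The only true data dependences inside one iteration are load X -> multiply -> divide -> store and load Y -> divide, which is what the timing table has to respect.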

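Fill the following table assuming Tomasulo's algorithm with speculation for the first 3 iterations of the loop.

Iter. | Instruction | Issue (at clock) | EX (start-stop) | Read Access (at clock) | Write on CDB (at clock) | Commits (at clock) | Comment
1 | L.D F2,0(R1) | | | | | |
1 | MUL.D F4,F2,F0 | | | | | |
1 | L.D F6,800(R1) | | | | | |
1 | DIV.D F6,F4,F6 | | | | | |
1 | S.D F6,800(R1) | | | | | |
1 | DADDUI R1,R1,#8 | | | | | |
1 | DSGTUI R3,R1,800 | | | | | |
1 | BEQZ R3,foo | | | | | |
2 | L.D F2,0(R1) | | | | | |
2 | MUL.D F4,F2,F0 | | | | | |
2 | L.D F6,800(R1) | | | | | |
2 | DIV.D F6,F4,F6 | | | | | |
2 | S.D F6,800(R1) | | | | | |
2 | DADDUI R1,R1,#8 | | | | | |
2 | DSGTUI R3,R1,800 | | | | | |
2 | BEQZ R3,foo | | | | | |
3 | L.D F2,0(R1) | | | | | |
3 | MUL.D F4,F2,F0 | | | | | |
3 | L.D F6,800(R1) | | | | | |
3 | DIV.D F6,F4,F6 | | | | | |
3 | S.D F6,800(R1) | | | | | |
3 | DADDUI R1,R1,#8 | | | | | |
3 | DSGTUI R3,R1,800 | | | | | |
3 | BEQZ R3,foo | | | | | |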

Question #2: (20 points)

Consider the following code sequence executing on a double-issue CPU using Tomasulo's algorithm with speculation.

    Loop: LD.D   F4, 0(R1)       ; load X(i)
          DIV.D  F0, F4, F2      ; X(i)/a
          LD.D   F8, 0x1000(R1)  ; load Y(i)
          ADD.D  F6, F0, F8      ; X(i)/a + Y(i)
          S.D    F6, 0(R1)       ; store new X(i)
          DADDIU R1, R1, #4      ; increment pointer
          BNE    R1, R3, Loop    ; end of loop

Assume that execution of the DIV.D instruction requires 8 clock cycles while ADD.D requires 3 clock cycles. Assume that there is one integer unit (for integer execution and address calculation) and one unit for branch condition evaluation. Assume a separate pipelined functional unit for each FP operation. Assume that only one instruction may commit per clock cycle and that only one CDB is available.

Fill the following table, entering the clock cycle when each instruction issues, executes (from-to), performs its read access, writes the result back into the register file, and commits.

Iter. | Instruction | Issue | Execute | Read access | WriteBack | Commit | Comment
1 | LD.D | 1 | | | | |
1 | DIV.D | 1 | | | | |
1 | LD.D | | | | | |
1 | ADD.D | | | | | |
1 | S.D | | | | | |
1 | DADDIU | | | | | |
1 | BNE | | | | | |
2 | LD.D | | | | | |
2 | DIV.D | | | | | |
2 | LD.D | | | | | |
2 | ADD.D | | | | | |
2 | S.D | | | | | |
2 | DADDIU | | | | | |
2 | BNE | | | | | |
3 | LD.D | | | | | |
3 | DIV.D | | | | | |
3 | LD.D | | | | | |
3 | ADD.D | | | | | |
3 | S.D | | | | | |
3 | DADDIU | | | | | |
3 | BNE | | | | | |

Question #3: (Branch Prediction) (20 points)

3.1. (5 points) What is a static basic block? What is a dynamic basic block?

3.2 (5 points) Increasing the size of the branch predictor buffer makes it less likely that two branches share a single cell in the branch predictor. However, sharing a cell can also have positive effects.

(a) Write a code sequence with two branches and list a sequence of actions on each (T = taken, NT = not taken) so that the misprediction rate decreases if they share the same cell. Assume a simple 1-bit predictor.

(b) Write a code sequence with two branches and list a sequence of actions on each (T = taken, NT = not taken) so that the misprediction rate increases if they share the same cell. Assume a simple 1-bit predictor.
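The effect described in 3.2 can be explored with a small simulation. The sketch below is a minimal model, not a definitive answer: it assumes a 1-bit predictor that remembers only the last outcome seen in a cell, an initial prediction of not taken, and two branches A and B that execute strictly alternately. The function name `count_mispredictions` and both traces are hypothetical illustrations.

```python
def count_mispredictions(trace, shared):
    """Count mispredictions of a 1-bit predictor over a branch trace.

    trace  -- list of (branch_name, taken) pairs in execution order
    shared -- True if all branches alias to one predictor cell,
              False if every branch has a private cell
    Assumption: every cell initially predicts "not taken".
    """
    cells = {}
    misses = 0
    for name, taken in trace:
        key = "shared" if shared else name
        predicted = cells.get(key, False)
        if predicted != taken:
            misses += 1
        cells[key] = taken  # 1-bit predictor: remember only the last outcome
    return misses


T, NT = True, False

# Trace (a): A alternates T,NT,... and B's outcome always equals A's next one,
# so B's last outcome in the shared cell predicts A correctly.
helpful = [("A", T), ("B", NT), ("A", NT), ("B", T)] * 2

# Trace (b): A is always taken, B is never taken, so they overwrite each other.
harmful = [("A", T), ("B", NT)] * 4
```

Under these assumptions, trace (a) mispredicts 7 times with private cells and 5 times with a shared cell, while trace (b) mispredicts once with private cells and 8 times with a shared cell.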

3.3. (5 points) Consider two approaches for reducing the branch penalty of a deeply pipelined processor with the following parameters:
- CPI(ideal) = 1
- Frequency of conditional branches = 15%
- Probability(branches are taken) = 60%

A#1. Static branch prediction. Assume that all conditional branches are predicted statically as not taken. The branch penalty when branches are taken is 5 clock cycles.

A#2. Dynamic branch prediction with BTB. How much faster is the processor with BTB if we know the following?
- BTB hit rate = 85%
- Accuracy of prediction in BTB = 90%
- Penalty when miss in BTB = 5 clock cycles
- Penalty when mispredicted taken branch in BTB = 6 clock cycles
- Penalty when mispredicted not taken branch in BTB = 2 clock cycles

3.4 (5 points) Discuss the level of speculativeness (the number of conditional branches that may be executed speculatively in succession). Is this important in modern processors (explain your answer)? What are the trade-offs?
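The CPI comparison in 3.3 reduces to CPI = CPI(ideal) + branch frequency x average penalty per branch. The sketch below is one possible reading of the parameters, not the official solution. Its assumptions are: a BTB miss costs its penalty only when the branch turns out to be taken (switchable through the hypothetical flag `miss_costs_only_if_taken`), and mispredicted BTB hits split between taken and not-taken branches according to the 60% taken probability. All function and parameter names are illustrative.

```python
def cpi_static_not_taken(cpi_ideal, branch_freq, p_taken, taken_penalty):
    """CPI when every branch is predicted not taken; only taken branches stall."""
    return cpi_ideal + branch_freq * p_taken * taken_penalty


def cpi_btb(cpi_ideal, branch_freq, p_taken, hit_rate, accuracy,
            miss_penalty, mispred_taken_penalty, mispred_not_taken_penalty,
            miss_costs_only_if_taken=True):
    """CPI with a branch target buffer, under the assumptions stated above."""
    # BTB miss: assumed to stall only for taken branches unless the flag is False.
    miss_weight = p_taken if miss_costs_only_if_taken else 1.0
    miss_term = (1.0 - hit_rate) * miss_weight * miss_penalty
    # BTB hit with a wrong prediction: weight the two penalties by taken probability.
    avg_mispred_penalty = (p_taken * mispred_taken_penalty
                           + (1.0 - p_taken) * mispred_not_taken_penalty)
    mispred_term = hit_rate * (1.0 - accuracy) * avg_mispred_penalty
    return cpi_ideal + branch_freq * (miss_term + mispred_term)


static_cpi = cpi_static_not_taken(1.0, 0.15, 0.60, 5)
btb_cpi = cpi_btb(1.0, 0.15, 0.60, 0.85, 0.90, 5, 6, 2)
speedup = static_cpi / btb_cpi  # how much faster the BTB design is
```

With these assumptions the static scheme gives a CPI of 1.45 and the BTB scheme a CPI of 1.1236. A different reading of the penalties changes the BTB figure, which is why the interpretation is exposed as a parameter.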

Question #4 (Multiprocessors) (60 points)

4.1. (5 points) Compare write-invalidate and write-update protocols for cache coherence (discuss advantages and disadvantages).

4.2. (5 points) What are the four states of the MESI protocol? Give a short explanation of what each of these states means.

4.3. (5 points) What is false sharing? How can we cope with it?

4.4. (5 points) Why is Test-and-Test&Set generally a better lock implementation than Test&Set? Under what conditions would Test&Set be a better lock than Test-and-Test&Set?
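The structural difference between the two locks in 4.4 can be sketched in Python. This is an emulation under stated assumptions, not real hardware behavior: the class `AtomicFlag` is hypothetical, and its internal mutex only stands in for the atomicity of a hardware test&set instruction. The counter `tas_calls` approximates how many atomic read-modify-write operations (the ones that generate coherence traffic) each lock issues.

```python
import threading
import time


class AtomicFlag:
    """Emulated lock word with an atomic test&set (hypothetical helper)."""

    def __init__(self):
        self._value = False
        self._guard = threading.Lock()  # stands in for hardware atomicity
        self.tas_calls = 0              # number of atomic read-modify-writes

    def test_and_set(self):
        with self._guard:
            self.tas_calls += 1
            old = self._value
            self._value = True
            return old

    def read(self):
        return self._value              # plain read, served from the local cache

    def clear(self):
        with self._guard:
            self._value = False


class TASLock:
    """Test&Set: every spin iteration is an atomic write attempt."""

    def __init__(self):
        self.flag = AtomicFlag()

    def acquire(self):
        while self.flag.test_and_set():
            time.sleep(0)               # yield so the emulation makes progress

    def release(self):
        self.flag.clear()


class TTASLock(TASLock):
    """Test-and-Test&Set: spin on a read, attempt test&set only when free."""

    def acquire(self):
        while True:
            while self.flag.read():
                time.sleep(0)           # spin on the (cached) value
            if not self.flag.test_and_set():
                return


def run_counter(lock, n_threads=4, iters=200):
    """Increment a shared counter under the lock; returns the final count."""
    total = [0]

    def worker():
        for _ in range(iters):
            lock.acquire()
            total[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Both locks protect the counter correctly. The difference lies in how many `test_and_set` calls contended waiters make. In the uncontended case the extra read in `TTASLock.acquire` is pure overhead, which hints at when plain Test&Set can be preferable.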

4.5. (20 points) Consider a three-processor (P1, P2, P3) bus-based shared memory multiprocessor with write-back direct-mapped first level cache memories and a MESI snooping cache coherence protocol. Show the state of a cache line in each processor's cache and main memory after each of the following actions. Assume that memory blocks A and B map to the same cache line. Initially, A = 5, B = 7.

Actions:
(1) P1: Write A, 4
(2) P3: Write B, 8
(3) P2: Read A
(4) P3: Read A
(5) P3: Write A, 3
(6) P2: Read A
(7) P2: Read B

Note: BRd = Bus Read, BWr = Bus Write, BRdX = Bus Read Exclusive, BInv = Bus Invalidate. For P1, P2, P3 use A:S notation to indicate that the cache has block A in Shared state.

Reference | P1 Bl:S:V | P2 Bl:S:V | P3 Bl:S:V | Bus Transactions {BRd BWr BRdX BInv} | Memory Bl:V
Initial state | I | I | I | - | A:5 B:7
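A table of this kind can be cross-checked with a small simulator. The sketch below models one common textbook variant of MESI and rests on several assumptions: each cache has a single line shared by blocks A and B, a dirty victim is written back with BWr before the new request, a Modified owner updates memory when it responds to a snooped BRd or BRdX, and a read miss loads the block as E only if no other cache holds it. The course's protocol may differ in these details. The names `simulate_mesi` and `ACTIONS` are illustrative.

```python
def simulate_mesi(actions, memory, n_procs=3):
    """Replay (proc, op, block, value) actions; op is "R" or "W".

    Returns one (bus_transactions, cache_snapshot, memory_snapshot) per action.
    """
    caches = [None] * n_procs   # None = Invalid, else [block, state, value]
    memory = dict(memory)
    history = []
    for proc, op, block, value in actions:
        bus = []
        line = caches[proc]
        hit = line is not None and line[0] == block
        if line is not None and not hit:
            if line[1] == "M":                  # dirty victim: write it back
                memory[line[0]] = line[2]
                bus.append("BWr")
            caches[proc] = None                 # clean victims are dropped silently
        holders = [i for i in range(n_procs)
                   if i != proc and caches[i] is not None and caches[i][0] == block]
        if op == "R":
            if not hit:
                bus.append("BRd")
                for i in holders:
                    if caches[i][1] == "M":     # owner flushes, memory is updated
                        memory[block] = caches[i][2]
                    caches[i][1] = "S"
                caches[proc] = [block, "S" if holders else "E", memory[block]]
        else:
            if not hit:
                bus.append("BRdX")
            elif line[1] == "S":
                bus.append("BInv")              # upgrade from Shared
            for i in holders:
                if caches[i][1] == "M":
                    memory[block] = caches[i][2]
                caches[i] = None                # invalidate all other copies
            caches[proc] = [block, "M", value]
        snapshot = ["I" if c is None else f"{c[0]}:{c[1]}:{c[2]}" for c in caches]
        history.append((bus, snapshot, dict(memory)))
    return history


# The seven actions from the question; processors are numbered from 0 (P1 = 0).
ACTIONS = [(0, "W", "A", 4), (2, "W", "B", 8), (1, "R", "A", None),
           (2, "R", "A", None), (2, "W", "A", 3), (1, "R", "A", None),
           (1, "R", "B", None)]
history = simulate_mesi(ACTIONS, {"A": 5, "B": 7})
```

Each entry of `history` corresponds to one row of the table: the bus transactions issued, the three cache lines in Bl:S:V notation, and the memory contents.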

4.6. (20 points) Consider a three-processor (P1, P2, P3) shared memory multiprocessor with distributed main memory, which uses the directory-based MSI write-invalidate protocol with write-back direct-mapped first level cache memories. Show the state of a cache line in each processor's cache and directories after each of the following actions. Assume that memory blocks A and B map to the same cache line. Processor P1 is home for block A (initial value is 5), and processor P2 is home for block B (initial value is 7).

Actions:
(1) P1: Write A, 4
(2) P3: Write B, 8
(3) P2: Read A
(4) P3: Read A
(5) P3: Write A, 3
(6) P2: Read A
(7) P2: Read B

Legend: St (State), Bl (Block) = {A, B}, V (Value), IN (Interconnection Network), Act (Action = Message Type), Src (Source), Dst (Destination), DirP1 (Directory of processor P1), ProcList (List of Sharers). Each action on the IN should take one row.

States in the directory: U (uncached), S (shared), E (exclusive).
Cache States: M (Modified), S (Shared), I (Invalid).

Interconnection Network Messages:
- RM, Read Miss (Local Cache -> Home Directory)
- WM, Write Miss (Local Cache -> Home Directory)
- I, Invalidate (Home Directory -> Remote Caches)
- F, Fetch (Home Directory -> Remote Caches)
- FI, FetchInvalidate (Home Directory -> Remote Caches)
- DVR, DataValueReply (Home Directory -> Local Cache)
- DWB, DataWriteBack (Remote Cache -> Home Directory)

Step | P1 Bl:St:V | P2 Bl:St:V | P3 Bl:St:V | IN Act | IN Src | IN Dest | DirP1 Bl:St:V | DirP1 ProcList P1P2P3 | DirP2 Bl:S:V | DirP2 ProcList P1P2P3 | DirP3 Bl:S:V | DirP3 ProcList P1P2P3
initial | | | | | | | A:U:5 | 000 | B:U: | | |
(1) | | | | | | | | | | | |

Question #5 (Vector Processing) (40 points)

Show how a vectorizing compiler would translate the following loop into vector operations. For simplicity, assume that LEN is equal to the length of the system's vector registers. Use Vn to indicate vector registers, and Rn to indicate scalar registers.

    for (i = 0; i < LEN; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + 7;
    }

5.1. (10 points) Illustrate execution assuming one memory access pipe in the VMIPS vector processor. What is the execution time?

5.2. (10 points) Illustrate execution assuming 3 memory access pipes (2 for load and 1 for store) in the VMIPS processor.

Consider the following DAXPY-like loop where k is a parameter to the procedure containing the loop:

          do 10 i=1, 128
          do 10 j=1, 128
             Y(k,j) = a*x(i,j) + Y(k,j)
             Z(j) = X(i,j)*Z(j)
    10    continue

5.3 (10 points) Analyze the original Fortran code. Write an optimal vector code sequence for the VMIPS processor. Identify convoys. VMIPS register banks have 64 registers. Use chaining assuming: 1 load pipe, 1 store pipe, 2 multiplier pipes, and 2 adder pipes.

5.4 (10 points) Estimate the performance of your code from part 5.3. Startup penalties are the following: 6 clock cycles for vector add, 7 for vector multiply, 12 for vector load, 12 for vector store. Assume that the overhead for handling an outer loop is 15 clock cycles.
