Course on Advanced Computer Architectures



Surname (Cognome) / Name (Nome) / POLIMI ID Number / Signature (Firma)
SOLUTION
Politecnico di Milano, June 22nd, 2018
Course on Advanced Computer Architectures
Prof. D. Sciuto, Prof. C. Silvano

EX1 (5 points), EX2 (5 points), EX3 (6 points), Q1 (5 points), Q2 (5 points), SIX QUESTIONS (6 points), TOTAL (32 points)

EXERCISE 1: TOMASULO (5 points)

Please consider the program in the table, to be executed on a CPU with dynamic scheduling based on the TOMASULO algorithm with:
- 2 RESERVATION STATIONS (RS1, RS2) + 2 LOAD/STORE units (LDU1, LDU2) with latency 4
- 2 RESERVATION STATIONS (RS3, RS4) + 2 ALU/BR FUs (ALU1, ALU2) with latency 2

1. Please complete the TOMASULO TABLE by assuming all cache HITS.

INSTRUCTION | ISSUE | START EXEC | WRITE RESULT | Hazards Type | RSi | UNIT
lw $f1,0($r1) | | | | | RS1 | LDU1
lw $f2,0($r1) | | | | | RS2 | LDU2
fadd $f2,$f2,$f | | | | (RAW $f1) RAW $f2 | RS3 | ALU1
sw $f2,0($r1) | | | | STRUCT RS1 + RAW $f2 | RS1 | LDU1
lw $f3,4($r1) | | | | | RS2 | LDU2
lw $f4,4($r1) | | | | STRUCT RS2 | RS2 | LDU2
fadd $f4,$f4,$f | | | | RAW $f4 | RS3 | ALU1
sw $f4,4($r1) | | | | RAW $f4 | RS1 | LDU1
addi $r1,$r1, | | | | | RS4 | ALU2

2. Express the formula, then compute the following metrics:
CPI = (# clock cycles / IC) = 27 / 9 = 3
IPC = 1 / CPI = 1 / 3 = 0.33

EXERCISE 2: CACHE PERFORMANCE EVALUATION (5 points)

Let us consider a computer with a memory hierarchy with L1 and L2 caches (including Instructions and Data). Suppose that in 1000 memory references there are 30 misses in L1 and 20 misses in L2.

1. What are the various miss rates?
Miss Rate L1 = 30 / 1000 = 3%
Miss Rate L2 = 20 / 30 = 66.7% (local miss rate: L2 misses over the references that reach L2)
Global Miss Rate for Last Level Cache: Miss Rate L1L2 = Miss Rate L1 x Miss Rate L2 = 20 / 1000 = 2%

Assume: Hit Time L1 = 1 clock cycle; Hit Time L2 = 12 clock cycles; Miss Penalty L2 = 120 clock cycles.

2. Write the formula of the AVERAGE MEMORY ACCESS TIME (AMAT) when there are L1 and L2 caches:
AMAT = Hit Time L1 + Miss Rate L1 x (Hit Time L2 + Miss Rate L2 x Miss Penalty L2)
     = Hit Time L1 + Miss Rate L1 x Hit Time L2 + Miss Rate L1L2 x Miss Penalty L2

3. How much is the AMAT in this case?
AMAT = 1 cycle + 0.03 x 12 cycles + 0.02 x 120 cycles = 1 + 0.36 + 2.4 = 3.76 clock cycles

4. Given Memory Accesses Per Instruction (MAPI) = 1.5 and CPIexec = 2, compute the CPU time considering the memory hierarchy and 1000 instructions with CLK = 1 ns:
CPUtime = IC x (CPIexec + MAPI x MR_L1 x HT_L2 + MAPI x MR_L1L2 x MP_L2) x CLK
        = 10^3 x (2 + 1.5 x 0.03 x 12 + 1.5 x 0.02 x 120) x 1 ns
        = 10^3 x (2 + 0.54 + 3.6) x 1 ns = 6.14 microsec
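The arithmetic of Exercise 2 can be checked with a short script. This is only a sketch: the numbers come from the exercise text, while the variable names (`mr_l1`, `mr_l2`, `mr_global`, `amat`, `cpu_time_ns`) are chosen here for illustration and are not part of the original solution.

```python
# Inputs taken from the exercise text.
refs = 1000            # memory references
l1_misses = 30
l2_misses = 20
hit_l1 = 1             # clock cycles
hit_l2 = 12            # clock cycles
miss_penalty_l2 = 120  # clock cycles

# Local and global miss rates.
mr_l1 = l1_misses / refs        # L1 misses over all references
mr_l2 = l2_misses / l1_misses   # L2 misses over references that reach L2
mr_global = mr_l1 * mr_l2       # equals l2_misses / refs

# Average memory access time with two cache levels.
amat = hit_l1 + mr_l1 * (hit_l2 + mr_l2 * miss_penalty_l2)

# CPU time including memory stalls.
ic = 1000       # instruction count
cpi_exec = 2
mapi = 1.5      # memory accesses per instruction
clk_ns = 1      # clock period in ns
cpu_time_ns = ic * (cpi_exec
                    + mapi * mr_l1 * hit_l2
                    + mapi * mr_global * miss_penalty_l2) * clk_ns

print(f"Miss rate L1     = {mr_l1:.1%}")
print(f"Miss rate L2     = {mr_l2:.1%}")
print(f"Global miss rate = {mr_global:.1%}")
print(f"AMAT             = {amat:.2f} clock cycles")
print(f"CPU time         = {cpu_time_ns / 1000:.2f} microsec")
```

The stall term uses only `hit_l2` and `miss_penalty_l2` because the L1 hit time is assumed to be already included in `cpi_exec`, as in the CPU time formula of the exercise.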

EXERCISE 3: CACHE COHERENCY (6 points)

1. To implement cache coherency, one of the protocols is the MESI coherence protocol, implemented for a write-invalidate, write-back cache. Please complete the following table with Yes/No/Maybe and briefly explain the protocol.

State | Cache up to date? | Memory up to date? | Others have a copy? | Cache can respond to others' reads?
Modified | Yes | No | No | Yes (required)
Exclusive | Yes | Yes | No | Yes
Shared | Yes | Yes | Yes | No
Invalid | No | Maybe | Maybe | No

2. Consider the following access pattern on a two-processor system with a direct-mapped, write-back cache with one cache block and a two-cache-block memory. Assume the MESI protocol is used, with write-back caches, write-allocate, and invalidation of other caches on write (instead of updating the value in the other caches).

Cycle | After Operation | P0 cache block state | P1 cache block state | Memory at block 0 up to date? | Memory at block 1 up to date?
0 | P0: read block 1 | Exclusive (1) | Invalid | Yes | Yes
1 | P0: write block 0 | Modified (0) | Invalid | No | Yes
2 | P1: read block 0 | Shared (0) | Shared (0) | Yes | Yes
3 | P1: write block 0 | Invalid | Modified (0) | No | Yes
4 | P0: read block 1 | Exclusive (1) | Modified (0) | No | Yes
5 | P1: read block 1 | Shared (1) | Shared (1) | Yes | Yes
6 | P0: write block 1 | Modified (1) | Invalid | Yes | No
7 | P1: write block 1 | Invalid | Modified (1) | Yes | No
8 | P0: read block 0 | Exclusive (0) | Modified (1) | Yes | No
9 | P1: write block 1 | Exclusive (0) | Modified (1) | Yes | No
10 | P1: read block 1 | Exclusive (0) | Modified (1) | Yes | No
11 | P0: read block 1 | Shared (1) | Shared (1) | Yes | Yes
12 | P1: write block 1 | Invalid | Modified (1) | Yes | No
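The access pattern above can be replayed with a small simulator. The following is a minimal sketch, not the course's reference implementation: it assumes two processors, one cache block each, a write-back on eviction of a Modified block, and a write-back by the owner when another processor reads a Modified block. The names `simulate`, `fmt` and `OPS` are hypothetical.

```python
NAMES = {"M": "Modified", "E": "Exclusive", "S": "Shared"}

# (processor, operation, block) for cycles 0..12 of the exercise.
OPS = [(0, "read", 1), (0, "write", 0), (1, "read", 0), (1, "write", 0),
       (0, "read", 1), (1, "read", 1), (0, "write", 1), (1, "write", 1),
       (0, "read", 0), (1, "write", 1), (1, "read", 1), (0, "read", 1),
       (1, "write", 1)]

def fmt(cache):
    """Render a [state, block] pair the way the exercise table does."""
    state, block = cache
    return "Invalid" if state == "I" else f"{NAMES[state]} ({block})"

def simulate(ops):
    """Replay the accesses; return (P0 state, P1 state, mem0 ok, mem1 ok) per access."""
    caches = [["I", None], ["I", None]]  # [state, block] for P0 and P1
    mem_ok = [True, True]                # is memory up to date for block 0 / 1?
    rows = []
    for p, op, blk in ops:
        me, other = caches[p], caches[1 - p]
        hit = me[0] != "I" and me[1] == blk
        if not hit:
            if me[0] == "M":
                mem_ok[me[1]] = True     # write back the dirty block being evicted
            me[0], me[1] = "I", None
        other_has = other[0] != "I" and other[1] == blk
        if op == "read":
            if not hit:
                if other_has:
                    if other[0] == "M":
                        mem_ok[blk] = True  # owner writes back before sharing
                    other[0] = "S"
                    me[0], me[1] = "S", blk
                else:
                    me[0], me[1] = "E", blk
        else:  # write: invalidate the other copy, then own the block as Modified
            if other_has:
                other[0], other[1] = "I", None
            me[0], me[1] = "M", blk
            mem_ok[blk] = False          # memory is stale until the next write-back
        rows.append((fmt(caches[0]), fmt(caches[1]), mem_ok[0], mem_ok[1]))
    return rows

rows = simulate(OPS)
for cycle, ((p, op, blk), row) in enumerate(zip(OPS, rows)):
    p0, p1, m0, m1 = row
    print(cycle, f"P{p}: {op} block {blk}", p0, p1,
          "Yes" if m0 else "No", "Yes" if m1 else "No", sep=" | ")
```

On a write that hits a Modified copy in the other cache, the sketch skips the intermediate write-back flag, because the block becomes stale in memory again as soon as the write completes.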

QUESTION 1: VLIW ARCHITECTURES (5 points)

1. Describe the architectures of Very Long Instruction Word processors.

2. What are possible issues associated with VLIW architectures? (For each of the answers, motivate why it is TRUE or FALSE.)
Answer 1: Compilers can detect parallelism only in local portions of the code. F
Answer 2: Larger code size. T
Answer 3: Increased hw complexity. F
Answer 4: Less code portability. T

QUESTION 2: DYNAMIC SCHEDULING (5 points)

An ideal dynamic-scheduled CPU should:
- Look arbitrarily far ahead to find the set of instructions to issue;
- Predict all branches perfectly;
- Rename all register uses (to avoid WAR and WAW hazards);
- Determine whether there are data dependencies among instructions in the issue packet, and rename if necessary;
- Determine if memory dependencies exist among issuing instructions, and handle them;
- Provide enough replicated functional units to allow all ready instructions to issue.

Explain why real processors cannot implement each one of these points.

SIX QUESTIONS: (6 points)

Answer with True or False to the following statements, considering that some questions may have multiple True answers.

Q1) Which of the following static branch prediction methods is preferred in case of loops in the MIPS processor?
Answer 1: Branch always taken
Answer 2: Branch always not-taken
Answer 3: Delayed branch with delay slot scheduled from target
Answer 4: Delayed branch with delay slot scheduled from fall-through
Answer 5: Backward Taken Forward Not Taken

Q2) Consider the following Assembly code:

      SUB R3, R6, R7
LOOP: BEQ R1, R2, DONE
      ADD R5, R5, R4
      ADDI R1, R1, 1
      J LOOP
DONE:

Knowing that R2 is initialized at 100 and R1 is initialized at 0, how many BHT accesses are performed?
Answer 1: 101
Answer 2: 200
Answer 3: 402
Answer 4: 302

Feedback:
Answer 1 is TRUE under the assumption that only branch instructions access the BHT. The BEQ accesses the BHT once per iteration (100 times), plus one additional time at the end of the loop, when the condition is satisfied (total 101).
Answer 3 is TRUE under the assumption that the execution of each instruction requires an access to the BHT during the IF phase, so it is sufficient to count the number of instructions executed. The first instruction is outside the loop and is executed just once; the other four are executed once per iteration, plus one more execution of the BEQ instruction at the end, where the condition is satisfied. Since the loop lasts for 100 iterations, the total number of BHT accesses is 1 + 4 x 100 + 1 = 402.
For more details see Computer Architecture: A Quantitative Approach, fourth edition, Hennessy and Patterson, Chapter 2, section 3, Reducing Branch Cost with Prediction, page 80. Or slides ACA, Dynamic Branch Prediction pages
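Both counts in the Q2 feedback can be reproduced by stepping through the loop. The sketch below assumes the instruction order SUB, BEQ, ADD, ADDI, J described in the feedback (one instruction before the loop, four inside it); the function name `count_bht_accesses` is hypothetical.

```python
def count_bht_accesses(r1=0, r2=100):
    """Return (accesses if only BEQ uses the BHT, accesses if every fetch does)."""
    branch_only = 0
    every_fetch = 0

    every_fetch += 1              # SUB R3, R6, R7: executed once, before LOOP
    while True:
        every_fetch += 1          # BEQ R1, R2, DONE is fetched
        branch_only += 1          # and it is the only conditional branch
        if r1 == r2:              # branch taken: exit to DONE
            break
        every_fetch += 1          # ADD R5, R5, R4
        r1 += 1
        every_fetch += 1          # ADDI R1, R1, 1
        every_fetch += 1          # J LOOP
    return branch_only, every_fetch

branch_only, every_fetch = count_bht_accesses()
print(branch_only, every_fetch)   # 101 402
```

Here the unconditional `J LOOP` is counted only under the second assumption, which matches the feedback: 101 counts the BEQ alone, and 402 counts every fetched instruction.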

Q3) Which of the following cache improvements are effective in reducing the miss penalty?
Answer 1: Adopting higher associativity in designing the cache to reduce the impact of collisions when the correct block must be loaded in the cache.
Answer 2: Avoiding address translation during indexing of the cache to minimize the management time of miss penalties.
Answer 3: Combining a small and fast level-1 cache with a slower but larger level-2 cache to capture many accesses that would go to the main memory.
Answer 4: Giving priority to read misses over writes, exploiting techniques to serve the reads before the writes have been completed.

Q4) Which of the following statements are true in a SIMD architecture?
Answer 1: SIMD architecture can exploit significant data parallelism.
Answer 2: Each instruction can operate on multiple loaded data in a single operation.
Answer 3: Multiple program counters allow multiple instructions execution.
Answer 4: Given N PEs (Processing Elements), a 2^N Array controller is required to handle them.

Q5) Considering hardware parallelism techniques, we can say that Vector Architectures and Graphic Processor Units:
Answer 1: Exploit data-level parallelism by applying a single instruction to a collection of data in parallel.
Answer 2: Exploit either data-level parallelism or task-level parallelism in a tightly coupled HW model that allows interaction among threads.
Answer 3: Exploit parallelism among largely decoupled tasks specified by the programmer or the OS.
Answer 4: Exploit data-level parallelism at modest level through compiler techniques such as pipelining and at medium levels using speculation.

Q6) In vector architectures, is it correct to say that, with Vector Chaining, the system must wait for the last element of the result to be written before starting a dependent instruction? (Answer True or False)
Answer: False

Course on Advanced Computer Architectures

Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1