
ADVANCED COMPUTER ARCHITECTURES: 088949 Prof. C. SILVANO Written exam 11 July 2011

SURNAME NAME ID EMAIL SIGNATURE

EX1 (3) EX2 (3) EX3 (3) EX4 (5) EX5 (5) EX6 (4) EX7 (5) EX8 (3+2) TOTAL (33)

EXERCISE 1 PIPELINE BASIC (3 points)

Given the following loop expressed in a high-level language:

    do {
        BASEC[i] = BASEA[i] + BASEB[i] + BASEC[i];
        i++;
    } while (i != N);

The program has been compiled into MIPS assembly code assuming that registers $t6 and $t7 have been initialized with the values 0 and N respectively. The symbols BASEA, BASEB and BASEC are 16-bit constants. The processor clock frequency is 1 GHz. Consider the loop executed by a 5-stage pipelined MIPS processor without any optimization in the pipeline.

1. Identify the RAW (Read After Write) data hazards (marked in RED on the original paper) and the control hazards (marked in BLUE).
2. Identify the number of stalls to be inserted before each instruction (i.e., between the IF and ID stages of each instruction) necessary to solve the hazards.

The solution table below is reconstructed from the transcript; the per-cycle placement follows from the stated stall counts. The 3 control stalls charged to the first lw are the branch penalty of the previous iteration's bne, which is resolved only in its MEM stage.

Stalls | Instruction           | Pipeline schedule                                     | Hazard type
3      | DO: lw $t2,BASEA($t6) | IF(C1) ID(C2) EX(C3) M(C4) WB(C5)                     | CONTROL
-      | lw $t3,BASEB($t6)     | IF(C2) ID(C3) EX(C4) M(C5) WB(C6)                     |
-      | lw $t4,BASEC($t6)     | IF(C3) ID(C4) EX(C5) M(C6) WB(C7)                     |
2      | add $t3,$t2,$t3       | IF(C4) stall(C5-C6) ID(C7) EX(C8) M(C9) WB(C10)       | RAW $t2, RAW $t3
3      | add $t4,$t4,$t3       | IF(C7) stall(C8-C10) ID(C11) EX(C12) M(C13) WB(C14)   | RAW $t3, RAW $t4
3      | sw $t4,BASEC($t6)     | IF(C11) stall(C12-C14) ID(C15) EX(C16) M(C17) WB(C18) | RAW $t4
-      | addi $t6,$t6,4        | IF(C15) ID(C16) EX(C17) M(C18) WB(C19)                |
3      | bne $t6,$t7,DO        | IF(C16) stall(C17-C19) ID(C20) EX(C21) M(C22) WB(C23) | RAW $t6

Express the formula, then calculate the following metrics:

Instruction Count (IC): 8
Number of stalls per iteration: 14
CPI per iteration: CPI = #cycles / IC = (IC + #stalls + 4) / IC = (8 + 14 + 4) / 8 = 3.25
Throughput (in MIPS) per iteration: MIPS = f_clock / (CPI * 10^6) = 10^9 / (3.25 * 10^6) = 10^3 / 3.25 ≈ 308
Asymptotic CPI (N → ∞ iterations): CPI_AS = (IC + #stalls) / IC = (8 + 14) / 8 = 2.75
Asymptotic throughput (in MIPS): MIPS_AS = f_clock / (CPI_AS * 10^6) = 10^9 / (2.75 * 10^6) = 10^3 / 2.75 ≈ 364
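As a quick arithmetic check (ours, not part of the original exam), here is a minimal Python sketch that reproduces these metrics from the instruction count, stall count and clock frequency; the helper name pipeline_metrics and its drain parameter are our own.

    def pipeline_metrics(ic, stalls, f_hz=1e9, drain=4):
        """CPI and MIPS figures for one loop iteration of a stalled pipeline.

        ic     -- instructions per iteration
        stalls -- stall cycles per iteration
        drain  -- extra cycles to drain the 5-stage pipeline
                  (counted in the single-iteration CPI only)
        """
        cpi = (ic + stalls + drain) / ic   # CPI over one full iteration
        cpi_as = (ic + stalls) / ic        # asymptotic CPI (N -> infinity)
        mips = f_hz / (cpi * 1e6)          # throughput in MIPS
        mips_as = f_hz / (cpi_as * 1e6)
        return cpi, mips, cpi_as, mips_as

    # Exercise 1: IC = 8, 14 stalls, 1 GHz clock
    print(pipeline_metrics(8, 14))         # (3.25, ~307.7, 2.75, ~363.6)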

EXERCISE 2 PIPELINE OPTIMIZATIONS (3 points)

Assume the following optimizations in the pipeline:
- the Register File can be written and read at the same address in the same clock cycle;
- forwarding;
- computation of the PC and of the TARGET ADDRESS for branch & jump instructions anticipated in the ID stage.

1. Identify the RAW (Read After Write) data hazards and the control hazards.
2. Identify the number of stalls to be inserted before each instruction (i.e., between the IF and ID stages of each instruction) necessary to solve the hazards.
3. Identify in the last column the forwarding path used.

The single control stall charged to the first lw is the branch penalty of the previous iteration's bne, now resolved in the ID stage.

Stalls | Instruction           | Pipeline schedule                               | Hazard type      | Forwarding path
1      | DO: lw $t2,BASEA($t6) | IF(C1) ID(C2) EX(C3) M(C4) WB(C5)               | CONTROL          |
-      | lw $t3,BASEB($t6)     | IF(C2) ID(C3) EX(C4) M(C5) WB(C6)               |                  |
-      | lw $t4,BASEC($t6)     | IF(C3) ID(C4) EX(C5) M(C6) WB(C7)               |                  |
-      | add $t3,$t2,$t3       | IF(C4) ID(C5) EX(C6) M(C7) WB(C8)               | RAW $t2, RAW $t3 | MEM-EX
-      | add $t4,$t4,$t3       | IF(C5) ID(C6) EX(C7) M(C8) WB(C9)               | RAW $t3, RAW $t4 | MEM-EX
-      | sw $t4,BASEC($t6)     | IF(C6) ID(C7) EX(C8) M(C9) WB(C10)              | RAW $t4          | MEM-MEM
-      | addi $t6,$t6,4        | IF(C7) ID(C8) EX(C9) M(C10) WB(C11)             |                  |
1      | bne $t6,$t7,DO        | IF(C8) stall(C9) ID(C10) EX(C11) M(C12) WB(C13) | RAW $t6          |

Express the formula, then calculate the following metrics:

Instruction Count (IC): 8
Number of stalls per iteration: 2
Asymptotic CPI (N → ∞ iterations): CPI_AS = (IC + #stalls) / IC = (8 + 2) / 8 = 1.25
Asymptotic throughput (in MIPS): MIPS_AS = f_clock / (CPI_AS * 10^6) = 10^9 / (1.25 * 10^6) = 10^3 / 1.25 = 800

Calculate the speedup with respect to the previous case (EX. 1): Speedup = CPI_AS1 / CPI_AS2 = 2.75 / 1.25 = 2.2
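Reusing the pipeline_metrics sketch from Exercise 1, the same arithmetic gives the Exercise 2 figures and the speedup:

    # Exercise 2: forwarding and early branch resolution leave only 2 stalls
    cpi2, mips2, cpi2_as, mips2_as = pipeline_metrics(8, 2)
    print(cpi2_as, mips2_as)      # 1.25 800.0

    # Speedup over Exercise 1 (ratio of asymptotic CPIs, same IC and clock)
    print(2.75 / cpi2_as)         # 2.2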

EXERCISE 3 PIPELINE WITH DATA CACHE MISSES (3 points)

Assume that, in the previously scheduled and optimized program, each DATA READ access in the MEM phase generates a DATA CACHE MISS requiring 2 stalls to access the memory. Draw the pipeline scheme by inserting the stalls due to the memory accesses and to the data and control hazards still remaining in the code.

In the reconstructed schedule below, each lw occupies the M stage for 3 cycles because of the 2-cycle read-miss penalty; the sw writes to the data cache without a read miss, so its M stage takes a single cycle. The leading stall of the first lw is the control stall of the previous iteration's bne.

Instruction           | Pipeline schedule
DO: lw $t2,BASEA($t6) | stall(C1) IF(C2) ID(C3) EX(C4) M(C5-C7) WB(C8)
lw $t3,BASEB($t6)     | IF(C3) ID(C4) EX(C5) stall(C6-C7) M(C8-C10) WB(C11)
lw $t4,BASEC($t6)     | IF(C4) ID(C5) stall(C6-C7) EX(C8) stall(C9-C10) M(C11-C13) WB(C14)
add $t3,$t2,$t3       | IF(C5) stall(C6-C7) ID(C8) stall(C9-C10) EX(C11) stall(C12-C13) M(C14) WB(C15)
add $t4,$t4,$t3       | stall(C6-C7) IF(C8) stall(C9-C10) ID(C11) stall(C12-C13) EX(C14) M(C15) WB(C16)
sw $t4,BASEC($t6)     | stall(C9-C10) IF(C11) stall(C12-C13) ID(C14) EX(C15) M(C16) WB(C17)
addi $t6,$t6,4        | stall(C12-C13) IF(C14) ID(C15) EX(C16) M(C17) WB(C18)
bne $t6,$t7,DO        | IF(C15) stall(C16) ID(C17) EX(C18) M(C19) WB(C20)

Express the formula, then calculate the following metrics:

Instruction Count (IC): 8
Number of stalls per iteration: 8
Asymptotic CPI (N → ∞ iterations): CPI_AS = (IC + #stalls) / IC = (8 + 8) / 8 = 2
Asymptotic throughput (in MIPS): MIPS_AS = f_clock / (CPI_AS * 10^6) = 10^9 / (2 * 10^6) = 500

Calculate the performance lost with respect to the previous case (EX. 2): CPI_AS3 / CPI_AS2 = 2 / 1.25 = 1.6
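And once more with the same helper, for the miss-penalty case:

    # Exercise 3: the 2-cycle read-miss penalties raise the stall count to 8
    _, _, cpi3_as, mips3_as = pipeline_metrics(8, 8)
    print(cpi3_as, mips3_as)      # 2.0 500.0
    print(cpi3_as / 1.25)         # 1.6x slowdown vs. Exercise 2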

EXERCISE 4: SCOREBOARD (5 points)

Assume the program is executed by a CPU with dynamic scheduling based on a SCOREBOARD with:
- 2 LOAD/STORE units (LDU1, LDU2) with latency 5;
- 2 ALU/BR/J units (ALU1, ALU2) with latency 2;
- structural hazards checked in the ISSUE phase;
- RAW hazards checked and RF READ performed in the READ OPERANDS phase;
- WAR and WAW hazards checked and RF WRITE performed in the WRITE BACK phase;
- forwarding;
- static branch prediction BTFNT (BACKWARD TAKEN, FORWARD NOT TAKEN) with a Branch Target Buffer.

Complete the SCOREBOARD table assuming all cache accesses HIT and considering ONE iteration:

Instruction           | Pred. T/NT | Issue | Read Operands | Execution Complete | Write Back | Hazard type / Forwarding           | Unit
DO: lw $t2,BASEA($t6) | -          | 1     | 2             | 7                  | 8          |                                    | LDU1
lw $t3,BASEB($t6)     | -          | 2     | 3             | 8                  | 9          |                                    | LDU2
lw $t4,BASEC($t6)     | -          | 9     | 10            | 15                 | 16         | STRUCT LDU1                        | LDU1
add $t3,$t2,$t3       | -          | 10    | 11            | 13                 | 14         |                                    | ALU1
add $t4,$t4,$t3       | -          | 11    | 16            | 18                 | 19         | RAW $t3, RAW $t4; Forw $t3, $t4    | ALU2
sw $t4,BASEC($t6)     | -          | 12    | 19            | 24                 | 25 (*)     | RAW $t4; Forw $t4                  | LDU2
addi $t6,$t6,4        | -          | 15    | 17            | 19                 | 20         | STRUCT ALU1 + RF READ (WAR $t6 OK) | ALU1
bne $t6,$t7,DO        | T          | 20    | 21            | 23                 | 24         | STRUCT ALU2                        | ALU2

(*) RF WRITE does not occur because the sw writes to the data cache.

Express the formula, then calculate the following metrics:

CPI = #cycles / IC = 25 / 8 = 3.125
IPC = 1 / CPI = 0.32

To avoid structural hazards, 4 LOAD/STORE units and 4 ALU/BR/J units are needed.
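A sanity check on the unit counts (ours, not part of the exam): one iteration contains four memory instructions and four ALU/branch/jump instructions, so four units of each class would let the whole iteration issue without structural hazards:

    # Tally the instruction mix of one loop iteration
    loop = ["lw", "lw", "lw", "add", "add", "sw", "addi", "bne"]
    mem_ops = sum(op in ("lw", "sw") for op in loop)   # load/store instructions
    alu_ops = len(loop) - mem_ops                      # ALU/branch/jump instructions
    print(mem_ops, alu_ops)                            # 4 4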

EXERCISE 5 TOMASULO (5 points)

Assume the original program is executed on a CPU with dynamic scheduling based on the TOMASULO algorithm with:
- 2 RESERVATION STATIONS (RS1, RS2) + 2 LOAD/STORE units (LDU1, LDU2) with latency 5;
- 2 RESERVATION STATIONS (RS3, RS4) + 2 ALU/BR FUs (ALU1, ALU2) with latency 2;
- structural hazards on the RSs checked in the ISSUE phase;
- RAW hazards and structural hazards on the FUs checked in the START EXECUTE phase;
- WRITE RESULT to the RSs and to the RF;
- static branch prediction BTFNT (BACKWARD TAKEN, FORWARD NOT TAKEN) with a Branch Target Buffer.

Complete the TOMASULO table assuming all cache accesses HIT and considering ONE iteration:

Instruction           | Issue | Start Exec | Write Result | Hazard type       | RS  | Unit
DO: lw $t2,BASEA($t6) | 1     | 2          | 7            |                   | RS1 | LDU1
lw $t3,BASEB($t6)     | 2     | 3          | 8            |                   | RS2 | LDU2
lw $t4,BASEC($t6)     | 8     | 9          | 14           | STRUCT RS1        | RS1 | LDU1
add $t3,$t2,$t3       | 9     | 10         | 12           |                   | RS3 | ALU1
add $t4,$t4,$t3       | 10    | 15         | 17           | RAW $t3 + RAW $t4 | RS4 | ALU2
sw $t4,BASEC($t6)     | 11    | 18         | 23           | RAW $t4           | RS2 | LDU2
addi $t6,$t6,4        | 13    | 14         | 16           | STRUCT RS3        | RS3 | ALU1
bne $t6,$t7,DO        | 17    | 18         | 20           | STRUCT RS3        | RS3 | ALU1

Express the formula, then calculate the following metrics:

CPI = #cycles / IC = 23 / 8 = 2.875
IPC = 1 / CPI ≈ 0.35

Calculate the speedup with respect to the Scoreboard version (EX. 4): Speedup = (Exec. time Scoreboard) / (Exec. time Tomasulo) = 25 / 23 ≈ 1.09

EXERCISE 6: PERFORMANCE OF THE MEMORY HIERARCHY (4 points)

Write the formula for the average memory access time:

Write the formula for the average memory access time for L1 and L2 caches:

Provide the definitions of Local Miss Rate and Global Miss Rate:
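The transcript leaves these answers blank. For reference, the standard textbook formulas (as in Hennessy & Patterson; not the official answer key) can be written as:

    \[
    \text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
    \]
    \[
    \text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1}
      \times \left( \text{Hit time}_{L2} + \text{Miss rate}_{L2}^{\mathrm{local}}
      \times \text{Miss penalty}_{L2} \right)
    \]

The local miss rate of a cache is its misses divided by the accesses that reach that cache (for L2: L2 misses / L1 misses); the global miss rate is its misses divided by all CPU memory accesses (for L2: Miss rate_L1 × local Miss rate_L2).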

EXERCISE 7: STATIC SCHEDULING (5 points)

Describe the main concepts of static scheduling to manage ILP (Instruction-Level Parallelism):

Draw and briefly describe the architecture of a 2-issue VLIW processor:

Explain the main advantages of VLIW architectures:

What are the main disadvantages of VLIW architectures?

EXERCISE 8: MEMORY HIERARCHY (3 points)

Explain the main advantages of introducing a SECOND-LEVEL CACHE (L2 CACHE):

OPTIONAL PART (2 points)

Explain the concepts and benefits of introducing a VICTIM CACHE: