Homework 2 (r1.0) Due: Mar 27, 2018, 11:55pm


Second Semester, Homework 2 (r1.0) Due: Mar 27, 2018, 11:55pm

Instruction: Submit your answers electronically through Moodle.

There are 3 major parts in this homework. Part A includes questions that aim to help you understand the lecture materials. They resemble the kind of questions you will encounter in quizzes and the final exam. This part is an individual portion and you should complete it by yourself. Parts B and C are group portions. You may work in groups of up to 3 for these parts. Part B asks you to examine cache performance in real-world processors. You should try to run it on as many different processors as you can find and analyze their cache behaviors. Part C contains an open-ended project, meaning there are no right or wrong answers.

The following table summarizes the 3 parts.

    Part  Type               Indv/Grp         Grading
    A     Basic problem set  Individual       Graded on correctness
    B     Hands-on           Group of 2 to 3  Graded on effort
    C     Mini-project       Group of 2 to 3  Graded on effort

In all cases, you are encouraged to discuss the homework problems offline or online using Piazza. However, you should not ask for or give out solutions directly, as that defeats the purpose of having homework exercises. Giving out answers or copying answers directly will likely constitute an act of plagiarism.

Part A: Problem Set

A.1 Cache Access

Consider the following sequence of memory accesses to the main memory in a 32-bit processor:

    Address (hex): ABCD1234 ABCD11D4 ABCD122C 1BCD1234 ABCD1220 ABCD15D8 1BCD11A4 AB0D C45A3C ABCD15C4 77C45A20 ABCD163C ABCD11A4 77C45A38 AB0D1228
    Type: W W W W W W

A.1.1 Assume the following data cache organization:

    Capacity: 1 KiB
    Line size: 8 words
    Organization: direct mapped
    Policy: write back, write allocate

Trace through the above memory accesses and answer the following:

(i) For each access, is it a hit or a miss?
(ii) Show the final content in the cache, including the tag. For the sake of simplicity, assume the content of a memory address is the same as its address, i.e., mem[X] = X.

A.1.2 Repeat A.1.1 but with a different cache:

    Capacity: 1 KiB
    Line size: 8 words
    Organization: 2-way set associative
    Policy: true LRU, write back, write allocate
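Tracing accesses by hand starts with splitting each address into tag, index, and offset fields. The following is a minimal sketch of that split, assuming 4-byte words (so an 8-word line is 32 bytes) and power-of-two sizes. The function name `split_address` and the example address 0x12345678 are illustrative only and are not part of the access sequence above.

```python
def split_address(addr, capacity_bytes, line_bytes, ways=1):
    """Split a byte address into (tag, index, offset) for a set-associative cache.

    Assumes capacity, line size, and associativity are all powers of two.
    """
    num_sets = capacity_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size in bytes)
    index_bits = num_sets.bit_length() - 1     # log2(number of sets)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 1 KiB cache with 32-byte lines: 32 sets when direct mapped, 16 sets when 2-way.
direct = split_address(0x12345678, 1024, 32, ways=1)
two_way = split_address(0x12345678, 1024, 32, ways=2)
print([hex(field) for field in direct])
print([hex(field) for field in two_way])
```

Doubling the associativity at fixed capacity halves the number of sets, so one bit moves from the index field into the tag.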

A.1.3 Repeat A.1.1 but with a different cache:

    Capacity: 1 KiB
    Line size: 8 words
    Organization: direct mapped
    Policy: write through, no write allocate

A.1.4 Given the above sequence of memory accesses, assume you can change the cache line size and cache organization. What is the minimum cache capacity needed to achieve the minimum number of misses? What are the minimum number of misses, the line size, and the resulting cache organization? Explain your answer.

A.2 Cache Performance

You are evaluating the performance of the cache subsystem of a processor. The initial design of the processor has the following cache:

- Separate instruction and data caches
- Cache hit time is 1 cycle
- Cache miss penalty is 300 cycles

A.2.1 Focus on the instruction cache in this part. The instruction cache has a miss rate of 5 %. What is the average memory access time (AMAT) of the instruction cache?

A.2.2 The data cache has a miss rate of 15 %. What is the AMAT of the data cache?

A.2.3 After profiling a program B, you find the following percentages of dynamic instructions:

    ALU     Jump/Branch   Load/Store
    40 %    20 %          40 %

Assume the CPI of all branch/jump instructions is 3 and the CPI of ALU instructions is 1. Also assume for this part that the instruction cache is perfect, i.e., its miss rate is 0 %. What is the overall CPI of program B?

A.2.4 Now, consider the realistic instruction cache with a 5 % miss rate as mentioned in A.2.1. Assume that when the I-cache misses, the entire processor pipeline is stalled while the memory subsystem fetches data from main memory. Since there is only 1 external DRAM, if both the I-cache and the D-cache miss in the same cycle, the 2 memory accesses have to take place sequentially. With these assumptions, what is the average CPI of program B, taking into account the possibility of both I-cache and D-cache misses?

A.2.5 Upon investigation, you realize that 90 % of the load/store instructions (i.e., 90 % × 40 % of the original dynamic instructions) are accessing memory data that can be recalculated within the CPU. Specifically, each of these load/store instructions can be replaced by 100 ALU instructions plus 20 Jump/Branch instructions. If all of these load/store instructions are converted into ALU and Jump/Branch instructions, would it improve the performance of B? If so, how much performance improvement can be achieved? If not, is there a maximum number of ALU + Jump/Branch instructions that can be used to replace these recalculatable load/store instructions to make the program faster?
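The quantities in A.2 all follow from two standard relations: AMAT = hit time + miss rate × miss penalty, and overall CPI as the mix-weighted average of per-class CPIs. The sketch below shows both with made-up numbers (2 % miss rate, 100-cycle penalty, a 50/25/25 instruction mix) that deliberately differ from the problem's. Treating the load/store CPI as equal to the data cache AMAT is a modelling assumption of this sketch, not something the problem statement prescribes.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


def weighted_cpi(mix):
    """Overall CPI from a list of (fraction_of_instructions, cpi) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


# Illustrative values only: 1-cycle hit, 2 % miss rate, 100-cycle miss penalty.
example_amat = amat(hit_time=1, miss_rate=0.02, miss_penalty=100)

# Illustrative mix: 50 % ALU (CPI 1), 25 % branch (CPI 2), 25 % load/store.
# Assumption: each load/store costs one data cache AMAT.
example_cpi = weighted_cpi([(0.50, 1), (0.25, 2), (0.25, example_amat)])

print(example_amat, example_cpi)
```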

A.3 Instruction Cache (adapted from final exam 2015)

You are designing the instruction cache for a new 32-bit processor. Because of other hardware constraints, your cache design must meet the following criteria:

- Virtual address space is 32 bits wide.
- Word size is 32 bits.
- Cache must be indexed and tagged with virtual addresses.
- Line size must be 8 words.
- The number of bits to index the cache must be 8 bits.
- Page size is 4 KiB.
- Translation from virtual to physical address is controlled by the OS and the page mapping is pseudo-random.
- Process ID is 4 bits wide.
- All compiled programs start running from address 0x
- The OS performs a context switch every 256 instructions.

A.3.1 In order to differentiate data from different processes, you have decided to concatenate the process ID to the virtual address tag in the cache. Assuming you are using a direct-mapped cache, what is the minimum width of the combined tag? ____ bits

A.3.2 Assume you are using a direct-mapped cache organization. What is the maximum possible capacity you can use in this processor given the above constraints? At this maximum capacity, what is the total number of bits required for tag storage in the cache?

A.3.3 Your machine starts with an empty instruction cache. The following program is run as Process 1 (with process ID = 1):

    0x          _start: addi a1, a1, 1        # instruction 0
    0x                  addi a3, zero,
    0x000401FC          bne a1, a3, _start    # instruction 127: NOT taken on 4th time

Process 1 executes for 256 instructions and is then stopped. There is no branch instruction in the first 128 instructions except for the last bne. The bne instruction takes the branch the first 3 times it is encountered. The branch condition is false the 4th time it is executed. After Process 1 has been stopped by the OS on a context switch (after it has executed 256 instructions), how many hits and misses have occurred in the instruction cache? Briefly explain your answer.

A.3.4 Following up from the previous part, after Process 1 is stopped, the OS switches in Process 2. Process 2 is the exact same program as Process 1 except that it is started by a different user. Therefore, it executes the exact same code as in the previous part, but with process ID = 2. Process 2 is again stopped after 256 instructions. How many hits and misses will have occurred in the instruction cache due to running Process 2? Briefly explain your answer.

A.3.5 Since there are only 2 processes running, the switching between Process 1 and Process 2 continues: Process 1 is again switched in after Process 2. It continues its execution for another 256 instructions. By now, Process 1 has completed 512 instructions since it started. Process 2 is then switched in place of Process 1 and runs for 256 instructions. When the two processes run in the processor for the second time, does the instruction cache hit rate change compared to your answers in A.3.3 and A.3.4? If the hit rate has changed, explain how it is different. If the hit rate remains the same, explain why that is the case.

A.3.6 In order to improve the performance of the instruction cache, you now have the following proposals:

    Proposal  Description
    A         Change to 2-way set associative organization while keeping the same capacity. Virtually tagged with Process ID.
    B         Change to 4-way set associative organization while keeping the same capacity. Virtually tagged with Process ID.
    C         Change to a physically tagged cache; direct mapped; flush the cache on context switch.
    D         Change to a physically tagged cache; direct mapped; DO NOT flush the cache on context switch; the OS ensures processes are never mapped to the same physical location.
    E         Change to a physically tagged cache; direct mapped; DO NOT flush the cache on context switch; also, the OS maps the instructions to the same physical page. That is, for example, address 0x of both Process 1 and Process 2 will be mapped to the same physical location 0xA

For each of these cases, explain whether the proposal will result in better performance given the 2 processes above. Explain why or why not.

A.4 Branch Predictor & Branch Target Buffer

As the chief architect of a new processor, you are considering the use of a branch predictor (BP) together with a branch history table (BHT) and a branch target buffer (BTB). You will use the following particularly important benchmark program conv to evaluate your processor. Comments are pseudo code describing the function of the instruction. Each line begins with its address in hex on the left. The code is in RISC-V assembly language, and the macros %hi and %lo return the corresponding part of the base address of their parameter (imgout and imgin in this case).

    A00  conv:  lui  s4,%hi(imgout)
    A04         addi s4,s4,%lo(imgout)
    A08         li   s2,0
    A0C         li   s3,256
    A10  .L10:  mv   s1,s4
    A14         li   s0,0
    A18  .L11:  mv   a1,s0
    A1C         mv   a0,s2
    A20         jal  ra, k
    A24         sw   a0,0(s1)
    A28         addi s0,s0,1
    A2C         addi s1,s1,4
    A30         bne  s0,s3,.L11
    A34         addi s2,s2,1
    A38         addi s4,s4,1024
    A3C         bne  s2,s0,.L10
    A40         li   a0,0
    A44         exit                  # program ends here
    A48  k:     mv   a5,a0
    A4C         beqz a0,.L3
    A50         li   a4,255
    A54         beq  a0,a4,.L3
    A58         beqz a1,.L3
    A5C         beq  a1,a4,.L3
    A60         slli a5,a0,8
    A64         add  a5,a5,a1
    A68         lui  a4,%hi(imgin)
    A6C         addi a4,a4,%lo(imgin)
    A70         slli a5,a5,2
    A74         add  a5,a5,a4
    A78         lw   a0,0(a5)
    A7C         slli a0,a0,1
    A80         jr   ra
    A84  .L3:   li   a0,0
    A88         jr   ra

The following shows a C code equivalent of conv. The function conv performs a 2D convolution on the input image imgin. The convolution kernel, implemented in k, simply returns twice the value of the input image at position (r, c), or 0 on the image border. A more realistic kernel will be used in Part C of this homework.

    // 2D convolution with simple kernel
    #define N 256

    int imgin[N][N];
    int imgout[N][N];

    int k(int r, int c) {
        // Border pixels produce 0
        if (r == 0 || r == (N-1) || c == 0 || c == (N-1)) {
            return 0;
        }
        return imgin[r][c] * 2;
    }

    int f() {
        int c = 0;
        int r = 0;
        for (r = 0; r < N; r++) {
            for (c = 0; c < N; c++) {
                imgout[r][c] = k(r,c);
            }
        }
        return 0;
    }

A.4.1 In the following table, trace the first 15 outcomes of the branch/jump instructions. Time flows from left to right. Each column represents the outcome of one branch/jump event. The first few columns have been filled in for you as an example. Mark Y if a branch is taken, N if a branch is not taken. Mark Y for a jump as well.

    0xA20 (jal)
    0xA30 (bne)
    0xA3C (bne)
    0xA4C (beqz)
    0xA54 (beq)
    0xA58 (beqz)
    0xA5C (beq)
    0xA80 (jr)
    0xA88 (jr)

A.4.2 Based on your result from A.4.1, assume you implement a branch predictor that predicts branches are always taken. What is the branch misprediction rate (the fraction of wrong predictions) for the branch instructions?

A.4.3 If the 2-bit predictor shown in class were used for each branch location, and assuming it starts with predict branch not taken, what is the misprediction rate for the bne instruction at 0xA3C?

A.4.4 If you refer back to the C code, you will see that the value N corresponds to the size of the input/output image. If N increases by 128 times, how would that change the misprediction rates of the always-predict-taken and the 2-bit predictors compared to your answers above? Which predictor is better?
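To check a hand trace of a 2-bit predictor, a small simulation can help. The sketch below assumes the common 2-bit saturating counter (states 0 and 1 predict not taken, states 2 and 3 predict taken), starting in the strongest not-taken state. The state machine shown in class may differ, for example in its starting state or transitions, so adjust accordingly. The outcome sequence is a made-up loop branch, not a trace of conv.

```python
def count_mispredictions(outcomes, state=0):
    """Count mispredictions of a 2-bit saturating counter.

    outcomes: iterable of booleans, True when the branch is taken.
    state: initial counter value, 0 (strongly not taken) to 3 (strongly taken).
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating at both ends.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


# Hypothetical loop branch: taken 7 times, then falls through once.
loop_outcomes = [True] * 7 + [False]
print(count_mispredictions(loop_outcomes), "of", len(loop_outcomes))
```

With this starting state, the predictor needs two taken outcomes to warm up and mispredicts once more when the loop exits.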

A.4.5 Assume you have a perfect BTB, which includes one entry for each branch/jump instruction. Fill in the target address for each entry after executing conv.

    Instruction Address   Target Address

A.4.6 Due to hardware constraints, you now have only 4 entries in the BTB. As a result, you have decided to use the lower 2 bits of the word address (i.e., bit 3 and bit 2 of the instruction address) to index this BTB. Show the final content of the 4 entries in the BTB.

    Index   Instruction Address   Target Address
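The indexing scheme in A.4.6 can be sketched as follows. The sketch assumes that every recorded event is a taken branch or jump and that each one simply overwrites whatever is stored at its index. The names `btb_index` and `fill_btb` and the example addresses are hypothetical and are not taken from conv.

```python
def btb_index(pc, entries=4):
    """Index a BTB with the low bits of the word address (pc >> 2)."""
    return (pc >> 2) & (entries - 1)


def fill_btb(events, entries=4):
    """Replay (pc, target) events in order; a later event overwrites its slot."""
    btb = {}
    for pc, target in events:
        btb[btb_index(pc, entries)] = (pc, target)
    return btb


# Hypothetical events: 0x104 and 0x114 share index 1, so the later one wins.
events = [(0x104, 0x200), (0x10C, 0x100), (0x114, 0x300)]
for index, (pc, target) in sorted(fill_btb(events).items()):
    print(index, hex(pc), hex(target))
```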

Part B: Hands-on Exercise

B.1 Micro-Benchmarking Cache Performance

In this exercise, you will perform micro-benchmarking of the cache system of real machines. The core of the benchmarking program is the following loop:

    for (stride = STRIDE_MIN; stride <= STRIDE_MAX; stride = stride << 1) {
        for (i = 0; i < asize; i += stride) {
            array[i] = array[i];
        }
    }

By carefully examining the time it takes to access elements of an array with different strides, it is possible to deduce information about the cache system such as its cache size, associativity, block size, etc.

B.1.1 Obtain the File

On tux-1, the file is included in the archive for homework 2 (see Part C). To test on another machine, you can also download the file from:

B.1.2 Compilation

On a Linux machine, such as tux-1.eee.hku.hk, compile the program as follows:

    gcc -o mbench -O0 mbench.c -lrt

On an OSX machine (you will need Xcode), you can compile the program as follows:

    gcc -o mbench -O0 mbench.c

In both cases, note that the switch in the middle of the line is minus, capital O, zero.

B.1.3 Execute the compiled program by issuing the command:

    > ./mbench

By default, the results will be printed to the screen. If you want to save the results to a file while seeing the output at the same time, you can pipe the output to the tee command as follows:

    > ./mbench | tee output.log

With the above command, the output will be printed to the screen while being saved to the file output.log.

B.1.4 Output Comma Separated Values

By default, the output of mbench is in human-readable form. To make plotting the results easier, run the program with the -c switch as follows:

    > ./mbench -c

B.1.5 Plotting Results

To analyze the micro-benchmark, it is easiest to plot the results. You can plot the CSV file using gnuplot, Excel or Matlab. Your results should look similar to the ones below:

    [Figure: example plots. (a) Full Range; (b) Small Arrays Only]

Your plots should have the stride size on the x-axis and the access time on the y-axis. Each series in the plot represents the results of one array size. Your x-axis should use a base-2 log scale, as the stride size increases exponentially in powers of 2. Hint: it may be useful to produce separate plots for very large and very small array sizes so you can clearly observe the pattern in cache access time.

B.1.6 Submission

Submit your plot(s) for tux-1.eee.hku.hk. If you have access to other Linux or OSX machines, you may compile mbench there and submit results from different processors as well. The results are a lot clearer on simple/older processors than on modern multi-core processors. If you have access to an embedded system such as a Raspberry Pi, you will be able to obtain much better results.

B.1.7 Analysis

From your plots, there is a lot you can learn about the memory hierarchy. By analyzing the plots, try to deduce the following:

- How many levels of cache does your processor have?
- For each level of cache, what is its block size, capacity, and associativity?
- What is the hit time of each level of cache?
- What is the page fault time?

You may not be able to deduce all the information. In fact, modern processors have many advanced features that may obscure your analysis above. If you have access to a machine with an older processor, the effect will be easier to see. See the hints below.

B.1.8 Hints on analyzing the results

Consider a simple example to begin understanding the effect of stride size and array size on the number of cache hits/misses. Recall that all elements of your array are accessed in strides, and the process is repeated many times. As a result, the number of cache hits/misses is not going to be dominated by compulsory misses. Now, consider a cache with a capacity of 16 words, then ask yourself the following questions:

- If the array size is smaller than 16 words, how many hits/misses will you get?
- Consider an array with 32 words. Start with a direct-mapped cache with a 1-word block size. How many hits/misses do you get when you increase the stride size from 1 word to 16 words?
- Now, if the cache is a 2-way set associative cache, how many hits/misses do you get when you increase the stride size from 1 word to 16 words? What if the cache is 4-way set associative? Can you observe a pattern in the change in the number of hits/misses when the associativity increases?
- Now repeat the above exercise with block sizes of 2 words and 4 words. When does a change in block size have an effect on the number of hits/misses you experience?

With the above observations, go back and analyze your plot results. Do you observe a similar change in hit/miss time? From these results, you can deduce certain features of the cache.

B.1.9 Submission

Submit your analysis of the results for at least 2 systems, including tux-1.eee.hku.hk. If you have tried the same exercise on additional processors, you may submit their analysis too. Compare your results with information you can find about the processor from online resources and note any differences.
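The thought experiment in the B.1.8 hints can also be explored with a toy model. The sketch below is a minimal word-addressed cache simulator with LRU replacement. It replays the strided loop several times and reports the misses of the last pass only, so that compulsory misses are excluded. The function name `steady_state_misses` and all parameter names are hypothetical, and a real processor (prefetchers, multiple cache levels, TLBs) will behave less cleanly than this model.

```python
from collections import OrderedDict


def steady_state_misses(array_words, stride, capacity_words=16,
                        ways=1, block_words=1, passes=3):
    """Misses in the last pass of a repeated strided sweep over an array."""
    num_sets = capacity_words // (block_words * ways)
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    misses = 0
    for _ in range(passes):
        misses = 0
        for i in range(0, array_words, stride):
            block = i // block_words
            lines = sets[block % num_sets]
            if block in lines:
                lines.move_to_end(block)       # hit: mark most recently used
            else:
                misses += 1
                if len(lines) >= ways:
                    lines.popitem(last=False)  # evict least recently used
                lines[block] = True
    return misses


for stride in (1, 2, 4, 8, 16):
    print(stride,
          steady_state_misses(32, stride, ways=1),
          steady_state_misses(32, stride, ways=2))
```

Running the loop for different `ways` and `block_words` values gives a quick way to see where the miss count changes as the stride grows.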

Part C: Open-ended Project

C.1 Adapting Cache Characteristics

In this exercise, you will investigate and try to improve the cache performance of a set of benchmarks. To evaluate the benchmark performance, you will be using a RISC-V ISA simulator called spike. The spike simulator simulates the behavior of the RISC-V ISA with limited hardware implementation details. It features a built-in cache simulator that captures every memory access generated by the processor and collects statistics accordingly.

C.1.1 The ISA simulator is already installed on tux-1.eee.hku.hk. If you prefer to run the simulator on your own machine, you need to get the latest RISC-V toolchain source code from:

To obtain the files for homework 2, perform the following on tux-1.eee.hku.hk:

    tux-1$ cd ~
    tux-1$ tar xzvf ~elec3441/elec3441hw2.tar.gz
    tux-1$ cd elec3441hw2
    tux-1$ export HW2ROOT=$PWD

In the extracted archive you will find different benchmark programs, each located in its own directory.

C.1.2 Compiling Benchmark Programs

The ISA simulator can execute any valid RISC-V program. If you examine the source code in each subdirectory, you will notice that the benchmarks are no different from any other normal program. Feel free to write your own benchmark if you are curious. You must set up your environment correctly to make use of the RISC-V toolchain. On tux-1, you can use the following command:

    tux-1$ . ~elec3441/elec3441hw2.bashrc

To compile the provided benchmarks, you may either run make in each directory, or make use of the top-level makefile that is provided for you:

    tux-1$ cd ${HW2ROOT}/benchmarks
    tux-1$ make

C.1.3 Running Simulator

You may now execute your compiled program using the spike ISA simulator. Execute the target binary with an L1 instruction cache as follows:

    tux-1$ cd ${HW2ROOT}/benchmarks/kmean
    tux-1$ spike --ic=128:2:64 pk kmean
    Veirification passed!
    I$ Bytes Read:
    I$ Bytes Written: 0
    I$ Read Accesses:
    I$ Write Accesses: 0
    I$ Read Misses: 489
    I$ Write Misses: 0
    I$ Writebacks: 0
    I$ Miss Rate: 0.001%
    tux-1$

In the above command, the last argument kmean specifies the RISC-V binary that you are simulating using spike. The argument pk before that stands for proxy kernel, and it tells spike to use the native Linux kernel to handle any system calls. Finally, the argument --ic=128:2:64 tells spike to simulate an instruction cache that has:

- 128 entries (sets)
- 2-way set associativity
- 64-byte blocks

If you multiply the three parameters, you get the capacity of the instruction cache as 128 × 2 × 64 B = 16 KiB. You can also specify the use of a data cache with the argument --dc=<s>:<w>:<b> and a unified L2 cache with the argument --l2=<s>:<w>:<b>. For example:

    tux-1$ spike --ic=128:2:64 pk...
    tux-1$ spike --ic=128:2:64 --dc=128:2:64 pk...
    tux-1$ spike --ic=128:2:64 --dc=128:2:64 --l2=1024:4:64 pk...

C.1.4 Cache Evaluation

Now, run all the benchmark programs with the following 2 memory hierarchies, one without and the other with an L2 cache:

    --ic=128:2:64 --dc=128:2:64
    --ic=128:2:64 --dc=128:2:64 --l2=1024:4:64

Collect statistics about the memory hierarchies from the output and answer the following questions.

1. For each benchmark, what is the miss rate for L1 I$, L1 D$ and L2$? Which benchmark has the best and which one has the worst cache performance? Hint: Consider automating the process, as you will probably need to regenerate a lot of similar statistics in the rest of this homework.

2. What is the cache access time for the 3 caches according to Table C.1?

3. Based on your above answer, what is the cycle time of the pipeline? Assume that on a cache hit, data should be returned in 1 cycle. Also assume that the critical path of all non-memory pipeline stages is 600 ps. In other words, your cycle time is limited by the cache if the cache access time is longer than 600 ps. Otherwise, cache access is not the bottleneck.

4. Calculate the average CPI for the benchmarks without L2 cache. You can use the following formula to calculate the average CPI, where MP = Miss Penalty and CT = Cycle Time. Assume the backsides of the L1 caches are connected to a single DRAM with 100 ns access time. Use CPI_base = 1.2.

\[ \mathrm{CPI} = \mathrm{CPI}_{base} + \frac{\#\text{L1\_I\$ misses} + \#\text{L1\_D\$ misses}}{\#\text{instructions}} \times \frac{MP}{CT} \]

Note that the number of instructions is the I$ Read Accesses number in the output of spike.

5. What is the AMAT (in ns) of the L2$ for all benchmarks? Use the following formula to calculate AMAT (HT = Hit Time, MR = Miss Rate, MP = Miss Penalty). Assume the backside of the L2$ is connected to a DRAM with 100 ns access time. Note that the L2 AMAT is the time calculated after the L1 access. So HT is the time to access the L2 cache after L1, MR is the local L2 miss rate, etc.

\[ \mathrm{AMAT}_{L2} = HT_{L2} + MR_{L2} \times MP_{L2} \]

6. Calculate the average CPI for the benchmarks with L2 cache. Use the following formula to calculate CPI, assuming the L2$ is running asynchronously in its own clock domain. Use the same base CPI as above.

\[ \mathrm{CPI} = \mathrm{CPI}_{base} + \frac{\#\text{L1\_I\$ misses} + \#\text{L1\_D\$ misses}}{\#\text{instructions}} \times \frac{\mathrm{AMAT}_{L2}}{CT} \]

7. Based on your answers above, does the L2$ help with performance?

C.1.5 Optimal L1 Data Cache Configuration

Find the optimal L1 D$ configuration that maximizes the performance of the benchmarks. You can assume the L1 I$ is perfect and does not affect processor performance. There is no L2$. Pick one configuration from the following design space:

- Total capacity: up to 1 MiB
- Associativity: 1, 2, 4, 8, 16, 32, 64
- Cache line size: 32, 64

Note the following:

- Your cache organization may affect your processor cycle time. Your architecture requires that the L1 cache returns data in 1 cycle.
- Use the geometric mean of the normalized performance of the benchmark programs as the overall performance metric.

Hint: You should write a program or use a spreadsheet program like Excel to help you find the optimal L1 configuration. Remember, you need to compute the geometric mean of the normalized performance of the benchmark programs.

C.1.6 Optional: Improved Benchmarks

Now, given your optimal L1 cache obtained from above, try to implement a version of the slowest benchmark program that is faster than the provided implementation. You may make any modifications to the data structures or code, as long as the same calculation is still performed at runtime, i.e., you cannot simply hard code the answer at compile time. Evaluate the improvements you make using the geometric mean of the speedups that you achieve over the baseline that you determined in the previous question.

C.1.7 Submission

Submit your answers to C.1.4, C.1.5, and the optional part in C.1.6. Make your answers concise, but make sure you have data to support your analysis.
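Since C.1.4 and C.1.5 involve repeating the same arithmetic for many benchmarks and configurations, a few helper functions are worth having. The sketch below covers the capacity implied by a spike `<s>:<w>:<b>` triple, the CPI formula with a miss penalty and cycle time, and a geometric mean. All function names and all miss counts, instruction counts, and speedups in the example are invented for illustration and are not spike results.

```python
import math


def cache_capacity_bytes(sets, ways, block_bytes):
    """Capacity implied by a spike <sets>:<ways>:<block bytes> triple."""
    return sets * ways * block_bytes


def cpi_with_misses(cpi_base, l1i_misses, l1d_misses, instructions,
                    penalty_ns, cycle_time_ns):
    """CPI = CPI_base + (L1 misses per instruction) * (penalty / cycle time).

    penalty_ns is the DRAM access time without an L2, or AMAT_L2 with one.
    """
    misses_per_instruction = (l1i_misses + l1d_misses) / instructions
    return cpi_base + misses_per_instruction * (penalty_ns / cycle_time_ns)


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


print(cache_capacity_bytes(128, 2, 64))  # bytes for --ic=128:2:64

# Made-up counts: 500 I$ misses and 1500 D$ misses over 1,000,000 instructions,
# 100 ns miss penalty, 1 ns cycle time.
print(cpi_with_misses(1.2, 500, 1500, 1_000_000, 100.0, 1.0))

# Made-up normalized performance numbers for two benchmarks.
print(geometric_mean([1.0, 4.0]))
```

Wrapping these in a loop over the design space of C.1.5 is one way to follow the hint about writing a program instead of computing each configuration by hand.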

Table C.1: Cache access time (in ns) for various cache configurations in 45 nm technology. Data obtained from CACTI.

(a) cache line size = 32 B

    assoc \ size   8KB   16KB   32KB   64KB   128KB   256KB   512KB   1MB
    N/A N/A N/A N/A N/A N/A

(b) cache line size = 64 B

    assoc \ size   8KB   16KB   32KB   64KB   128KB   256KB   512KB   1MB
    N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
