Homework 3 (r1.1) Due: Part (A) -- Apr 29, 2016, 11:55pm Part (B) -- Apr 29, 2016, 11:55pm Part (C) -- Apr 29, 2016, 11:55pm


Second Semester, 2015-16

Instruction: Submit your answers electronically through Moodle. There are 3 major parts in this homework.

Part A includes questions that aim to help you understand the lecture materials. They resemble the kind of questions you will encounter in quizzes and the final exam. Your answers to this part will be graded on effort.

Part B consists of hands-on exercises that require you to design and evaluate processor systems using various software and hardware tools, including Chisel and the RISC-V compilation tool chain. They are designed to help you understand real-world processor design and the use of various tools along the way. This part of the homework will be graded on correctness.

Part C contains open-ended mini-project ideas. They are open-ended by nature, meaning there are no right or wrong answers. You must choose to attempt one of the several available topics. You may work individually or in groups of up to 3 for this part. If you work in a group, each member must submit an independent report on the project.

The following table summarizes the 3 parts:

    Part  Type               Indv/Grp                     Grading
    A     Basic problem set  Individual                   Graded on effort
    B     Hands-on           Individual or group of 2-3   Graded on correctness
    C     Mini-project       Individual or group of 2-3   Graded on effort

In all cases, you are encouraged to discuss the homework problems offline or online using Piazza. However, you should not ask for or give out solutions directly, as that defeats the purpose of having homework exercises. Giving out answers or copying answers directly will likely constitute an act of plagiarism.

Part A: Problem Set

A.1 Cache Access

Consider the following sequence of memory accesses to the main memory in a 16-bit processor:

    Address (hex)  Type
    A000           R
    B000           R
    A380           R
    A004           W
    580C           W
    A108           R
    5800           R
    A10C           W
    A39C           W
    A3AC           R
    A1AC           R
    A006           R
    5804           R

A.1.1 Assume the following data cache organization:

- Capacity: 4 KiB
- Line size: 8 words
- Organization: direct-mapped
- Policy: write back, write allocate

Trace through the above memory accesses and answer the following:

(i) For each access, is it a hit or a miss?
(ii) Show the final content of the cache, including the tags.

For the sake of simplicity, assume the content of a memory address is the same as its address, i.e., mem[x] = x.

A.1.2 Repeat A.1.1 but with a different cache:

- Capacity: 4 KiB
- Line size: 8 words
- Organization: 2-way set associative
- Policy: LRU, write back, write allocate

A.1.3 Repeat A.1.1 but with a different cache:

- Capacity: 4 KiB
- Line size: 8 words
- Organization: direct-mapped

r1.1 Page 2 of 17

- Policy: write through, no write allocate

A.1.4 Repeat A.1.1 but with a different cache:

- Capacity: 4 KiB
- Line size: 8 words
- Organization: 2-way set associative
- Policy: LRU, write through, no write allocate

A.2 Cache Performance

You are evaluating the performance of the cache subsystem of a processor. The initial design of the processor has the following cache:

- Separate instruction and data caches
- Cache hit time is 1 cycle
- Cache miss penalty is 300 cycles

A.2.1 Focusing on the instruction cache in this part: the instruction cache has a miss rate of 5 %. What is the average memory access time (AMAT) of the instruction cache?

A.2.2 The data cache has a miss rate of 20 %. What is the AMAT of the data cache?

A.2.3 After profiling a program B, you find the following percentages of dynamic instructions:

    ALU    Jump/Branch    Load/Store
    50 %   15 %           35 %

Assume the CPI of all branch/jump instructions is 2 and the CPI of ALU instructions is 1. Also assume for this part that the instruction cache is perfect, i.e., its miss rate is 0 %. What is the overall CPI of program B?

A.2.4 Based on the above D-cache calculation, which of the following changes will improve the CPI of program B the most? Explain your answer.

(i) Change the D-cache into a fully associative cache
(ii) Increase the clock speed of the processor
(iii) Rewrite B to increase data reuse in the cache

A.2.5 Now, consider the realistic instruction cache with the 5 % miss rate mentioned in A.2.1. Assume that when the I-cache misses, the entire processor pipeline is stalled while the memory subsystem fetches data from main memory. Since there is only 1 external DRAM, if both the I-cache and the D-cache miss in the same cycle, the 2 memory accesses have to take place sequentially. With these assumptions, what is the average CPI of program B, taking into account the possibility of both I-cache and D-cache misses?

A.2.6 As an attempt to improve the CPI of the program, you are considering the addition of a new dual-port DRAM that allows 2 concurrent memory accesses.
With the new dual-port memory, if both the I-cache and the D-cache miss in the same cycle, both of them may fetch from this memory at the same time. With the dual-port DRAM, however, the miss penalty increases to 310 cycles. Assume 2 % of the I-cache misses overlap with D-cache misses. What is the overall CPI of program B? Is the overall performance improved?
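The AMAT and CPI questions above all reduce to two formulas: AMAT = hit time + miss rate x miss penalty, and overall CPI = base CPI + average memory stall cycles per instruction. A minimal sketch of both, using hypothetical numbers rather than the ones asked for in the questions:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

def overall_cpi(base_cpi, mem_ref_frac, miss_rate, miss_penalty):
    """Base CPI plus the average stall cycles contributed per instruction
    by memory references that miss in the cache."""
    return base_cpi + mem_ref_frac * miss_rate * miss_penalty

# Hypothetical example: 1-cycle hit, 10 % miss rate, 100-cycle penalty.
print(amat(1, 0.10, 100))                  # 11.0
# Hypothetical example: base CPI 1.5, half the instructions access memory.
print(overall_cpi(1.5, 0.5, 0.10, 100))    # 6.5
```

The same two helpers, with the parameters from the questions substituted in, cover A.2.1 through A.2.6; the dual-port variant of A.2.6 only changes how overlapping misses contribute to the stall term.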

A.3 Instruction Cache

Adapted from the 2015 final exam. You are designing the instruction cache for a new 32-bit processor. Because of other hardware constraints, your cache design must meet the following criteria:

- Virtual address space is 32-bit wide.
- Word size is 32 bits.
- The cache must be indexed and tagged with virtual addresses.
- Line size must be 4 words.
- The number of bits used to index the cache must be 8.
- Page size is 8 KiB.
- Translation from virtual to physical addresses is controlled by the OS, and the page mapping is pseudo-random.
- Process IDs are 4-bit wide.
- All compiled programs start running from address 0x
- The OS performs a context switch every 256 instructions.

A.3.1 In order to differentiate data from different processes, you have decided to concatenate the process ID to the virtual address tag in the cache. Assuming you are using a direct-mapped cache, what is the minimum width of the combined tag, in bits?

A.3.2 Assuming you are using a direct-mapped cache with the maximum possible size given the above constraints, what is the total number of bits required for tag storage in the cache?
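Questions like A.3.1 and A.3.2 come down to splitting the address into tag, index, and offset fields. A small helper for that split, shown with hypothetical parameters rather than the ones specified above (line size is assumed to be a power of two):

```python
def cache_fields(addr_bits, index_bits, line_words, word_bytes=4):
    """Return (tag_bits, index_bits, offset_bits) for a virtually
    indexed and tagged cache. The offset covers byte-in-line addressing."""
    offset_bits = (line_words * word_bytes).bit_length() - 1  # log2 of line bytes
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# Hypothetical example: 16-bit addresses, 4 index bits, 2-word lines.
print(cache_fields(16, 4, 2))  # (9, 4, 3)
```

Concatenating a process ID, as in A.3.1, simply widens the stored tag by the ID width; the index and offset fields are unaffected.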

A.3.3 Your machine starts with an empty instruction cache. The following program is run as Process 1 (with process ID = 1):

    0x          _start: addi a1, a1, 1      # instruction 0
    0x                  addi a3, zero,
    0x000401FC          bne a1, a3, _start  # instruction 127: NOT taken on 4th time

Process 1 begins execution for 256 instructions, then it is stopped. There is no branch instruction among the first 128 instructions except for the final bne. The bne instruction takes the branch the first 3 times it is encountered; the branch condition is false the 4th time it is executed. After Process 1 has stopped (after it has executed 256 instructions), how many hits and misses have occurred in the instruction cache? Briefly explain your answer.

A.3.4 Following up from the previous part: after Process 1 is stopped, the OS switches in Process 2. Process 2 is the exact same program as Process 1, except that it is started by a different user. Therefore, it executes the exact same code as in the above part, but with process ID = 2. Process 2 is again stopped after 256 instructions. How many hits and misses will have occurred in the instruction cache due to running Process 2? Briefly explain your answer.

A.3.5 Since there are only 2 processes running, the switching between Process 1 and Process 2 continues:

- Process 1 is switched in again after Process 2. It continues its execution for another 256 instructions. By now, Process 1 has completed 512 instructions since it started.
- Process 2 is switched in in place of Process 1 and runs for 256 instructions.

When the two processes run on the processor for the second time, is the instruction hit rate changed when compared to your answers in A.3.3 and A.3.4? If the hit rate has changed, explain how it is different. If the hit rate remains the same, explain why that is the case.

A.3.6 As an attempt to improve the performance of the instruction cache, you are given the option to change the instruction cache organization to 2-way set associative while keeping the cache capacity constant. However, as the tag capacity is limited, you can no longer store the process IDs in the cache. As a result, you need to flush the cache on every context switch. Considering the 2-process scenario above, how would the change to a 2-way set associative cache with cache flushing affect the overall instruction hit rate after both Process 1 and Process 2 have finished running 512 instructions? That is: the processor has just finished executing Process 1 (256

instructions), Process 2 (256 instructions), Process 1 (256 instructions), Process 2 (256 instructions).

A.3.7 Your teammate suggests that you should keep the process ID in the cache tag to avoid flushing the cache during context switches. However, since you are limited in hardware resources, your teammate suggests that the capacity of the cache should be reduced by half as a tradeoff. Briefly discuss whether this scheme, using a 2-way set associative cache with reduced cache capacity and no flushing, may improve performance over the original direct-mapped cache.

A.4 Streaming Cache Performance

You are investigating the cache performance of your processor regarding the following code segment:

    // int i, n, a;
    // int y[], x[];
    for (i = 0; i < n; i++) {
        y[i] = a*x[i] + y[i];
    }
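Hit/miss traces like those asked for in A.1, and for the loop above, can be cross-checked mechanically with a small cache model. The sketch below implements a direct-mapped, write-back, write-allocate cache with hypothetical parameters (a 64-byte cache with 8-byte lines, not any of the configurations in the questions):

```python
class DirectMappedCache:
    """Tiny direct-mapped, write-back, write-allocate cache model that
    classifies each access as a read/write hit or miss."""

    def __init__(self, capacity_bytes, line_bytes):
        self.line_bytes = line_bytes
        self.num_lines = capacity_bytes // line_bytes
        self.tags = [None] * self.num_lines  # one tag per line; None = invalid

    def access(self, addr, is_write):
        line_no = addr // self.line_bytes
        index = line_no % self.num_lines
        tag = line_no // self.num_lines
        hit = self.tags[index] == tag
        if not hit:                  # write allocate: fill the line on any miss
            self.tags[index] = tag
        return ("W" if is_write else "R") + ("H" if hit else "M")

# Hypothetical 64-byte cache with 8-byte lines, tracing a few accesses.
cache = DirectMappedCache(64, 8)
trace = [(0x00, False), (0x04, False), (0x40, True), (0x00, False)]
print([cache.access(a, w) for a, w in trace])  # ['RM', 'RH', 'WM', 'RM']
```

Extending the model with a dirty bit (to count write-backs) or a second way with LRU replacement follows the same pattern, which is enough to check the A.1.2 through A.1.4 variants as well.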

A.4.1 Cache Hit or Miss

Assume that a, i, and n are stored in registers. Array x is stored in memory starting at address 0xA while array y is stored at memory address 0xB The processor data cache is initially empty. Below are the details of the data cache:

- Capacity: 1 MiB
- Organization: 2-way set associative
- Line size: 4 words
- Replacement policy: true LRU
- Write policy: write through, no write allocate

A write buffer is available, such that the processor can resume running immediately after writing data to the buffer.

Let n = 2^20. Trace through the above code, then show and explain the sequence of cache hits and misses that will occur. Use the notation RH for read hit, WH for write hit, RM for read miss, and WM for write miss.

A.4.2 Based on your result above, what is the overall data cache miss rate of the above code when executed on the main CPU? Consider BOTH read and write accesses. Assume the write and read miss penalties are both 300 cycles, and the hit time is 1 cycle. What is the average memory access time (AMAT) for this code?

A.4.3 Now the above C code is compiled into the following RISC-V instructions:

    # a0 = 2^20
    # a1 is the base address of x[]
    # a2 is the base address of y[]
    # the constant a is stored in register a3
    00: loop: addi a0, a0, -1
    04:       lw   t1, 0(a1)
    08:       lw   t2, 0(a2)
    0C:       mult t0, t1, a3
    10:       add  t0, t0, t2
    14:       sw   t0, 0(a2)
    18:       addi a1, a1, 4
    1C:       addi a2, a2, 4
    20:       bne  a0, zero, loop

Let n = 2^20. Assume add, addi and bne take 1 cycle; mult takes 4 cycles; and the performance of lw and sw depends on the cache performance. What is the total run time of the above code in cycles? You may leave the variable n in your answer.

A.4.4 Write Back Cache

If the write policy of the cache is changed to write back, how would the following be affected?

(i) Read miss penalty
(ii) Overall performance of the above code

Explain your answer by tracing through the execution of the above code, highlighting any differences in the required cache content handling with the use of a write back cache.
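A run-time estimate like the one asked for in A.4.3 is usually organized as (cycles per iteration) x (number of iterations), with the lw/sw cycles expressed as hit time plus a miss-rate-weighted penalty. A sketch of that structure, with hypothetical cycle counts and miss rates (deliberately not the values from the question):

```python
def loop_runtime(n, alu_ops=5, alu_cpi=1, mult_cpi=4,
                 mem_ops=3, hit_time=1, miss_rate=0.25, miss_penalty=100):
    """Total cycles for n iterations of a loop containing alu_ops
    single-cycle-class instructions, one multiply, and mem_ops
    loads/stores whose average cost is hit_time + miss_rate * miss_penalty."""
    mem_time = hit_time + miss_rate * miss_penalty
    per_iter = alu_ops * alu_cpi + mult_cpi + mem_ops * mem_time
    return n * per_iter

# Hypothetical example: 1000 iterations.
print(loop_runtime(1000))  # 87000.0
```

The defaults loosely mirror the shape of the loop above (3 addi + add + bne, one mult, two lw and one sw); substituting the miss behaviour derived in A.4.1 and A.4.2 gives the answer the question asks for.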

A.5 Page Table & TLB

In this exercise you will experiment with the interaction between the TLB and the page table in a VM system. Assume the following system configuration:

- 32-bit architecture
- 8 KiB page size
- 4-entry, direct-mapped TLB

Initially, the TLB and the page table contain the following entries:

    TLB:        Valid  VPN (dec)  PPN (dec)
    Page table: Index  Valid      PPN or Disk
                              disk
                              disk
                              disk

A.5.1 The following sequence of memory accesses is issued:

    0x0000F234, 0x0000A008, 0x , 0x0000F098, 0x0000A09C, 0x0000F00C,
    0x , 0x C, 0x , 0x

Assume an invalid page table entry indicates that the page has not been allocated to a user process, and that all entries in the page table not shown above are invalid.

For each memory access, indicate whether it generates a TLB hit or miss, and whether it generates a page fault. Also, show the final state of the TLB and the page table after the above accesses.

A.5.2 What are some of the advantages and disadvantages of a larger page size?

A.5.3 As TLB access is on the processor's critical path, you are experimenting with whether a direct-mapped TLB could improve performance. Assume the capacity of the TLB remains unchanged, and consider the impact on TLB performance for the instructions of a program P. The instructions of P begin at memory location 0x , and during run time it allocates and accesses only 1 data page starting at address 0xD In each of the following scenarios, explain how well the TLB would perform with a direct-mapped organization when compared to the original fully associative organization:

(i) Only 1 copy of P is running in the system.
(ii) 2 users are both running P at the same time. The OS flushes the TLB when swapping between processes.
(iii) 2 users are both running P at the same time. The OS does not flush the TLB. Instead, the process ID is prepended to the virtual page number, i.e., the VPN tag becomes [process id VPN].
(iv) 2 users are both running P at the same time. The OS does not flush the TLB. Instead, the process ID is appended to the virtual page number, i.e., the VPN tag becomes [VPN process id].
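With the 8 KiB pages specified above, the low 13 bits of a virtual address are the page offset and the remaining bits are the VPN; a direct-mapped TLB then selects an entry with the low bits of the VPN. A minimal sketch of the address arithmetic (the page size and the 4-entry TLB are from the question; the traced address is one of the examples in the access list):

```python
PAGE_OFFSET_BITS = 13   # 8 KiB pages: 2^13 bytes per page
TLB_ENTRIES = 4         # 4-entry, direct-mapped TLB

def vpn_and_offset(addr):
    """Split a virtual address into (VPN, page offset)."""
    return addr >> PAGE_OFFSET_BITS, addr & ((1 << PAGE_OFFSET_BITS) - 1)

def tlb_slot(vpn):
    """Index of the direct-mapped TLB entry this VPN maps to."""
    return vpn % TLB_ENTRIES

print(vpn_and_offset(0x0000F234))  # (7, 4660); the offset 4660 is 0x1234
print(tlb_slot(7))                 # 3
```

For the A.5.3 scenarios, note that prepending versus appending the process ID changes which bits of the combined tag land in the slot-index position, which is exactly what makes cases (iii) and (iv) behave differently in a direct-mapped TLB.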

Part B: Hands-on Exercise

B.1 Micro-Benchmarking Cache Performance

In this exercise, you will perform micro-benchmarking of the cache system of real machines. The core of the benchmarking program is the following loop:

    for (stride = STRIDE_MIN; stride <= STRIDE_MAX; stride = stride << 1) {
        for (i = 0; i < asize; i += stride) {
            array[i] = array[i];
        }
    }

By carefully examining the time it takes to access elements of an array with different strides, it is possible to deduce information about the cache system, such as its cache size, associativity, block size, etc.

B.1.1 Obtain the File

Download the file from:

B.1.2 Compilation

On a Linux machine, such as tux-1.eee.hku.hk, compile the program as follows:

    gcc -o mbench -O0 mbench.c -lrt

If you want to, on an OS X machine, you can compile the program as follows:

    gcc -o mbench -O0 mbench.c

In both cases, note that the switch in the middle of the line is minus capital-oh zero.

B.1.3 Execute the compiled program by issuing the command:

    > ./mbench

By default, the results will be printed to the screen. If you want to save the results to a file while seeing the output at the same time, you can pipe the output to the tee command as follows:

    > ./mbench | tee output.log

With the above command, the output will be printed to the screen while also being saved to the file output.log.

B.1.4 Output Comma Separated Values

By default, the output of mbench is in human-readable form. To make plotting the results easier, run the program with the -c switch as follows:

    > ./mbench -c

B.1.5 Plotting Results

To analyze the micro-benchmark, it is easiest to plot the results. You can plot the CSV file using gnuplot, Excel or Matlab. Your results should look similar to the ones below:

    [Figure: (a) Full Range  (b) Small Arrays Only]

Your plots should have the stride size on the x-axis and the access time on the y-axis. Each series of the plot represents the results for one array size. Your x-axis should have a base-2 log scale, as your stride size increases exponentially in powers of 2. Hint: it may be useful to produce separate plots for very large and very small array sizes so you can clearly observe the pattern in cache access time.

B.1.6 Submission

Submit your plot(s) for tux-1.eee.hku.hk. If you have access to other Linux or OS X machines, you may compile mbench there and submit results from different processors as well.

B.1.7 Analysis

From your plots, there is a lot you can learn about the memory hierarchy. By analyzing the plots, try to deduce the following:

- How many levels of cache does your processor have?
- For each level of cache, what is its block size, capacity, and associativity?
- What is the hit time of each level of cache?
- What is the page fault time?

You may not be able to deduce all the information. In fact, modern processors have many advanced features that may obscure your analysis. If you have access to a machine with an older processor, the effects will be easier to see. See the hints below.
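The measurement loop at the heart of mbench can also be prototyped in an interpreted language before digging into the C version. A rough sketch (the array sizes and strides are illustrative, and Python timings are far noisier than the compiled benchmark, so treat this only as a way to understand the loop's structure):

```python
import time

def time_strided_access(asize, stride, repeats=10):
    """Average seconds per pass over an array, touching every stride-th
    element with a read followed by a write, as in the C loop."""
    array = list(range(asize))
    start = time.perf_counter()
    for _ in range(repeats):
        for i in range(0, asize, stride):
            array[i] = array[i]
    return (time.perf_counter() - start) / repeats

# Illustrative sweep: doubling strides over a small array.
for stride in (1, 2, 4, 8):
    t = time_strided_access(1 << 14, stride)
    print(f"stride={stride:2d}  time={t:.6f}s")
```

The real benchmark normalizes by the number of accesses per pass; without that normalization, larger strides trivially take less time simply because fewer elements are touched.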

B.1.8 Hints on Analyzing the Results

Consider a simple example to begin understanding the effect of stride size and array size on the number of cache hits/misses. Recall that all elements of your array are accessed in strides, and the process is repeated many times. As a result, the cache misses you observe are not compulsory misses. Now, consider a cache with a 16-word capacity, and ask yourself the following questions:

- If the array size is smaller than 16 words, how many hits/misses will you get?
- Consider an array of 32 words. Starting with a direct-mapped cache with a 1-word block size, how many hits/misses do you get as you increase the stride size from 1 word to 16 words?
- Now, if the cache is 2-way set associative, how many hits/misses do you get as you increase the stride size from 1 word to 16 words? What if the cache is 4-way set associative? Can you observe a pattern in the change in the number of hits/misses as the associativity increases?
- Now repeat the above exercise with block sizes of 2 words and 4 words. When does a change in block size have an effect on the number of hits/misses you experience?

With the above observations, go back and analyze your plot results. Do you observe similar changes in hit/miss time? From these results, you can deduce certain features of the cache.

B.1.9 Submission

Submit your analysis of the results for at least tux-1.eee.hku.hk. If you have tried the same exercise on different processors, you may submit additional analyses. Feel free to compare your results with information you can find about the processor from online resources.
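The thought experiment in the hints above can be played out with a tiny set-associative model. The sketch below counts misses for the hint's hypothetical 16-word cache while sweeping the stride; it uses word-granularity addresses and LRU replacement, matching the setup of the questions:

```python
from collections import OrderedDict

def count_misses(asize, stride, capacity_words=16, ways=1,
                 block_words=1, passes=4):
    """Count misses for repeated strided passes over an array, using an
    LRU set-associative cache model with word addresses."""
    num_sets = capacity_words // (ways * block_words)
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU of block tags
    misses = 0
    for _ in range(passes):
        for addr in range(0, asize, stride):
            block = addr // block_words
            s, tag = block % num_sets, block // num_sets
            if tag in sets[s]:
                sets[s].move_to_end(tag)         # refresh LRU position
            else:
                misses += 1
                sets[s][tag] = True
                if len(sets[s]) > ways:
                    sets[s].popitem(last=False)  # evict least recently used
    return misses

# 32-word array on a 16-word direct-mapped cache: every access conflicts.
print(count_misses(32, 1, ways=1))   # 128  (4 passes x 32 accesses, all miss)
# Stride 16 touches only words 0 and 16, which map to the same set.
print(count_misses(32, 16, ways=1))  # 8
# 2-way associativity lets both conflicting words co-reside: 2 cold misses.
print(count_misses(32, 16, ways=2))  # 2
```

Sweeping stride, ways, and block_words with this model reproduces the patterns the hints ask you to look for in your measured plots.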

Part C: Open-ended Project

C.1 Breadth First Search

In homework 1, you experimented with different ways to implement the breadth-first search (BFS) algorithm to maximize performance on your single-cycle processor. In this homework, you will continue with the same idea, but with a focus on memory hierarchy performance. To evaluate your BFS performance, you will be using a RISC-V ISA simulator called spike. The spike simulator simulates the behavior of the RISC-V ISA with limited hardware implementation details. It features a built-in cache simulator that captures every memory access generated by the processor and collects statistics accordingly.

C.1.1 The ISA simulator is already installed on tux-1.eee.hku.hk. If you prefer to run the simulator on your own machine, you need to get the latest RISC-V toolchain source code from:

To obtain the files for homework 3, perform the following on tux-1.eee.hku.hk:

    tux-1$ cd ~
    tux-1$ tar xzvf ~elec3441/elec3441hw3.tar.gz
    tux-1$ cd hw3
    tux-1$ export HW3ROOT=$PWD

In the downloaded file you will find different benchmark programs, each located in its own directory.

C.1.2 Compiling Benchmark Programs

The ISA simulator may execute any valid RISC-V program. If you examine the source code in each subdirectory, you will notice that the benchmarks are no different from any other normal program. Feel free to write your own benchmark if you are curious.

You must set up your environment correctly to make use of the RISC-V toolchain. On tux-1, you can use the following command:

    tux-1$ . ~elec3441/elec3441.bashrc

To compile the provided benchmarks, you may either run a make command in each directory, or make use of the top-level makefile that is provided for you:

    tux-1$ cd ${HW3ROOT}/benchmarks
    tux-1$ make

C.1.3 Running the Simulator

You may now execute your compiled program using the spike ISA simulator. Execute the target binary with an L1 instruction cache as follows:

    tux-1$ cd ${HW3ROOT}/benchmarks/kmean
    tux-1$ spike --ic=128:2:64 pk kmean
    Verification passed!
    I$ Bytes Read:
    I$ Bytes Written:   0
    I$ Read Accesses:
    I$ Write Accesses:  0
    I$ Read Misses:     163
    I$ Write Misses:    0
    I$ Writebacks:      0
    I$ Miss Rate:       0.000%
    tux-1$

In the above command, the last argument, kmean, specifies the RISC-V binary that you are simulating using spike. The argument pk before it stands for proxy kernel, and it tells spike to use the native Linux kernel to handle any system calls. Finally, the argument --ic=128:2:64 tells spike to simulate an instruction cache that is:

- 128 entries
- 2-way set associative
- 64-byte blocks

If you multiply the three parameters together, you get the capacity of the instruction cache: 128 x 2 x 64 B = 16 KiB. You can also specify the use of a data cache with the argument --dc=<s>:<w>:<b> and a unified L2 cache with the argument --l2=<s>:<w>:<b>. For example:

    tux-1$ spike --ic=128:2:64 pk ...
    tux-1$ spike --ic=128:2:64 --dc=128:2:64 pk ...
    tux-1$ spike --ic=128:2:64 --dc=128:2:64 --l2=1024:4:64 pk ...

C.1.4 Warmup Exercise

Now, run all the benchmark programs with the following 2 memory hierarchies, one without and one with an L2 cache:

    --ic=128:2:64 --dc=128:2:64
    --ic=128:2:64 --dc=128:2:64 --l2=1024:4:64

You can collect statistics about the memory hierarchies from the output. Make sure you understand what the miss rates for the L1 I$, L1 D$ and L2$ are in each case. Which benchmark has the best, and which one has the worst, cache performance? Hint: consider automating the process, as you will probably need to regenerate a lot of similar statistics for the next part.
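Following the hint above, the sweep over benchmarks and cache configurations can be scripted. A sketch, assuming spike, pk, and the compiled benchmarks are reachable as in the transcript (the benchmark list and helper names here are illustrative, not part of the provided files):

```python
import subprocess

BENCHMARKS = ["kmean"]   # illustrative; list your benchmark binaries here
CONFIGS = [
    ["--ic=128:2:64", "--dc=128:2:64"],
    ["--ic=128:2:64", "--dc=128:2:64", "--l2=1024:4:64"],
]

def build_cmd(config, binary):
    """Assemble the spike command line for one benchmark/config pair."""
    return ["spike"] + config + ["pk", binary]

def run_all():
    """Run every benchmark under every cache configuration and show the
    tail of each run's output, where the cache statistics appear."""
    for binary in BENCHMARKS:
        for config in CONFIGS:
            result = subprocess.run(build_cmd(config, binary),
                                    capture_output=True, text=True)
            print(binary, config, "->", result.stdout.splitlines()[-3:])

print(build_cmd(CONFIGS[1], "kmean"))
# ['spike', '--ic=128:2:64', '--dc=128:2:64', '--l2=1024:4:64', 'pk', 'kmean']
```

Parsing the "Miss Rate" lines out of result.stdout into a CSV makes it easy to compare benchmarks side by side for C.1.5.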
C.1.5 Basic BFS Implementation Evaluation

Now, consider a processor with the following memory hierarchy:

                     L1 I$     L1 D$     L2$
    Capacity         64 KiB    64 KiB    4 MiB
    Associativity
    Line Size        64 B      64 B      128 B

Targeting this processor, evaluate the performance of the 2 versions of BFS that were experimented with in homework 1: one version utilizes a linked-list data structure while the other one utilizes

an adjacency list data structure. Evaluate the programs with the graph of input size Which version results in better cache performance?

C.1.6 Improved BFS

Now, given your understanding from the above, try to implement a faster version of BFS than the 2 provided samples. You may use either data structure, or implement your own version of BFS if you like. You may also tune the data/instruction caches to values that benefit your code the most.

C.1.7 Submission

Submit your implementation of BFS, together with a report on how you optimized the code for the targeted cache system.
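For reference, the textbook adjacency-list BFS that both provided samples elaborate on looks like the sketch below (in Python for brevity; the homework versions are compiled RISC-V programs). How the frontier queue and the adjacency arrays are laid out in memory is exactly what determines the cache behavior you are asked to measure and improve:

```python
from collections import deque

def bfs(adj, source):
    """Standard BFS over an adjacency-list graph; returns hop distances
    from source, with -1 for unreachable vertices."""
    dist = [-1] * len(adj)
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:          # sequential scan of u's neighbor array
            if dist[v] == -1:     # data-dependent access into dist[]
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

# Small example graph with edges 0-1, 0-2, 1-3.
print(bfs([[1, 2], [0, 3], [0], [1]], 0))  # [0, 1, 1, 2]
```

The neighbor scans are streaming (cache-friendly) accesses, while the dist[] lookups are effectively random; locality-oriented BFS optimizations for C.1.6 typically target the latter.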


More information

CSCE 513 Computer Architecture, Fall 2018, Assignment #2, due 10/08/2018, 11:55PM

CSCE 513 Computer Architecture, Fall 2018, Assignment #2, due 10/08/2018, 11:55PM CSCE 513 Computer Architecture, Fall 2018, Assignment #2, due 10/08/2018, 11:55PM Covered topics: 1) pipeline, hazards, and instruction scheduling. 2) pipeline implementation. 3) Cache Organization and

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination Name Computer Architecture EE 4720 Final Examination Primary: 6 December 1999, Alternate: 7 December 1999, 10:00 12:00 CST 15:00 17:00 CST Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (25 pts)

More information

CS 251, Winter 2019, Assignment % of course mark

CS 251, Winter 2019, Assignment % of course mark CS 251, Winter 2019, Assignment 5.1.1 3% of course mark Due Wednesday, March 27th, 5:30PM Lates accepted until 1:00pm March 28th with a 15% penalty 1. (10 points) The code sequence below executes on a

More information

CS 537: Introduction to Operating Systems Fall 2015: Midterm Exam #1

CS 537: Introduction to Operating Systems Fall 2015: Midterm Exam #1 CS 537: Introduction to Operating Systems Fall 2015: Midterm Exam #1 This exam is closed book, closed notes. All cell phones must be turned off. No calculators may be used. You have two hours to complete

More information

Final Exam Fall 2007

Final Exam Fall 2007 ICS 233 - Computer Architecture & Assembly Language Final Exam Fall 2007 Wednesday, January 23, 2007 7:30 am 10:00 am Computer Engineering Department College of Computer Sciences & Engineering King Fahd

More information

Cache Architectures Design of Digital Circuits 217 Srdjan Capkun Onur Mutlu http://www.syssec.ethz.ch/education/digitaltechnik_17 Adapted from Digital Design and Computer Architecture, David Money Harris

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

Virtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. November 15, MIT Fall 2018 L20-1

Virtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. November 15, MIT Fall 2018 L20-1 Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L20-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:

More information

SE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan

SE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc Reality Check Question 1: Are real caches built to work on virtual addresses or physical addresses? Question 2: What about

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

CSE 141 Computer Architecture Spring Lectures 17 Virtual Memory. Announcements Office Hour

CSE 141 Computer Architecture Spring Lectures 17 Virtual Memory. Announcements Office Hour CSE 4 Computer Architecture Spring 25 Lectures 7 Virtual Memory Pramod V. Argade May 25, 25 Announcements Office Hour Monday, June 6th: 6:3-8 PM, AP&M 528 Instead of regular Monday office hour 5-6 PM Reading

More information

ECE 463/521: Spring 2005 Project 1: Data-Cache System Design Due: Wednesday, February 23, 2005, 11:00 PM

ECE 463/521: Spring 2005 Project 1: Data-Cache System Design Due: Wednesday, February 23, 2005, 11:00 PM ECE 463/521: Spring 2005 Project 1: Data-Cache System Design Due: Wednesday, February 23, 2005, 11:00 PM Project rules 1. All students are encouraged to work in teams of two, using pair programming. Pair

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1>

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1> Chapter 8 Digital Design and Computer Architecture: ARM Edition Sarah L. Harris and David Money Harris Digital Design and Computer Architecture: ARM Edition 215 Chapter 8 Chapter 8 :: Topics Introduction

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Final Exam Fall 2008

Final Exam Fall 2008 COE 308 Computer Architecture Final Exam Fall 2008 page 1 of 8 Saturday, February 7, 2009 7:30 10:00 AM Computer Engineering Department College of Computer Sciences & Engineering King Fahd University of

More information

Pipelined processors and Hazards

Pipelined processors and Hazards Pipelined processors and Hazards Two options Processor HLL Compiler ALU LU Output Program Control unit 1. Either the control unit can be smart, i,e. it can delay instruction phases to avoid hazards. Processor

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 24: Cache Performance Analysis Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Last time: Associative caches How do we

More information

Memory Hierarchies &

Memory Hierarchies & Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

ECE 411 Exam 1. This exam has 5 problems. Make sure you have a complete exam before you begin.

ECE 411 Exam 1. This exam has 5 problems. Make sure you have a complete exam before you begin. This exam has 5 problems. Make sure you have a complete exam before you begin. Write your name on every page in case pages become separated during grading. You will have three hours to complete this exam.

More information

Memory. Principle of Locality. It is impossible to have memory that is both. We create an illusion for the programmer. Employ memory hierarchy

Memory. Principle of Locality. It is impossible to have memory that is both. We create an illusion for the programmer. Employ memory hierarchy Datorarkitektur och operativsystem Lecture 7 Memory It is impossible to have memory that is both Unlimited (large in capacity) And fast 5.1 Intr roduction We create an illusion for the programmer Before

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

HY225 Lecture 12: DRAM and Virtual Memory

HY225 Lecture 12: DRAM and Virtual Memory HY225 Lecture 12: DRAM and irtual Memory Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS May 16, 2011 Dimitrios S. Nikolopoulos Lecture 12: DRAM and irtual Memory 1 / 36 DRAM Fundamentals Random-access

More information

Pipelining Exercises, Continued

Pipelining Exercises, Continued Pipelining Exercises, Continued. Spot all data dependencies (including ones that do not lead to stalls). Draw arrows from the stages where data is made available, directed to where it is needed. Circle

More information

ECE 411 Exam 1 Practice Problems

ECE 411 Exam 1 Practice Problems ECE 411 Exam 1 Practice Problems Topics Single-Cycle vs Multi-Cycle ISA Tradeoffs Performance Memory Hierarchy Caches (including interactions with VM) 1.) Suppose a single cycle design uses a clock period

More information

Question 1: (20 points) For this question, refer to the following pipeline architecture.

Question 1: (20 points) For this question, refer to the following pipeline architecture. This is the Mid Term exam given in Fall 2018. Note that Question 2(a) was a homework problem this term (was not a homework problem in Fall 2018). Also, Questions 6, 7 and half of 5 are from Chapter 5,

More information

CS 61C: Great Ideas in Computer Architecture. Cache Performance, Set Associative Caches

CS 61C: Great Ideas in Computer Architecture. Cache Performance, Set Associative Caches CS 61C: Great Ideas in Computer Architecture Cache Performance, Set Associative Caches Instructor: Justin Hsia 7/09/2012 Summer 2012 Lecture #12 1 Great Idea #3: Principle of Locality/ Memory Hierarchy

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

Computer Architecture: Optional Homework Set

Computer Architecture: Optional Homework Set Computer Architecture: Optional Homework Set Black Board due date: Hard Copy due date: Monday April 27 th, at Midnight. Tuesday April 28 th, during Class. Exercise 1: (50 Points) Patterson and Hennessy

More information

Winter 2009 FINAL EXAMINATION Location: Engineering A Block, Room 201 Saturday, April 25 noon to 3:00pm

Winter 2009 FINAL EXAMINATION Location: Engineering A Block, Room 201 Saturday, April 25 noon to 3:00pm University of Calgary Department of Electrical and Computer Engineering ENCM 369: Computer Organization Lecture Instructors: S. A. Norman (L01), N. R. Bartley (L02) Winter 2009 FINAL EXAMINATION Location:

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 27: Midterm2 review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Midterm 2 Review Midterm will cover Section 1.6: Processor

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

CSE Computer Architecture I Fall 2011 Homework 07 Memory Hierarchies Assigned: November 8, 2011, Due: November 22, 2011, Total Points: 100

CSE Computer Architecture I Fall 2011 Homework 07 Memory Hierarchies Assigned: November 8, 2011, Due: November 22, 2011, Total Points: 100 CSE 30321 Computer Architecture I Fall 2011 Homework 07 Memory Hierarchies Assigned: November 8, 2011, Due: November 22, 2011, Total Points: 100 Problem 1: (30 points) Background: One possible organization

More information

ELE 375 Final Exam Fall, 2000 Prof. Martonosi

ELE 375 Final Exam Fall, 2000 Prof. Martonosi ELE 375 Final Exam Fall, 2000 Prof. Martonosi Question Score 1 /10 2 /20 3 /15 4 /15 5 /10 6 /20 7 /20 8 /25 9 /30 10 /30 11 /30 12 /15 13 /10 Total / 250 Please write your answers clearly in the space

More information

6.004 Tutorial Problems L20 Virtual Memory

6.004 Tutorial Problems L20 Virtual Memory 6.004 Tutorial Problems L20 Virtual Memory Page Table (v + p) bits in virtual address (m + p) bits in physical address 2 v number of virtual pages 2 m number of physical pages 2 p bytes per physical page

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

The University of Michigan - Department of EECS EECS 370 Introduction to Computer Architecture Midterm Exam 2 solutions April 5, 2011

The University of Michigan - Department of EECS EECS 370 Introduction to Computer Architecture Midterm Exam 2 solutions April 5, 2011 1. Performance Principles [5 pts] The University of Michigan - Department of EECS EECS 370 Introduction to Computer Architecture Midterm Exam 2 solutions April 5, 2011 For each of the following comparisons,

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Faculty of Science FINAL EXAMINATION

Faculty of Science FINAL EXAMINATION Faculty of Science FINAL EXAMINATION COMPUTER SCIENCE COMP 273 INTRODUCTION TO COMPUTER SYSTEMS Examiner: Prof. Michael Langer April 18, 2012 Associate Examiner: Mr. Joseph Vybihal 2 P.M. 5 P.M. STUDENT

More information

University of Western Ontario, Computer Science Department CS3350B, Computer Architecture Quiz 1 (30 minutes) January 21, 2015

University of Western Ontario, Computer Science Department CS3350B, Computer Architecture Quiz 1 (30 minutes) January 21, 2015 University of Western Ontario, Computer Science Department CS3350B, Computer Architecture Quiz (30 minutes) January 2, 205 Student ID number: Student Last Name: Exercise. [ 20 marks] To capture the fact

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

CS/CoE 1541 Mid Term Exam (Fall 2018).

CS/CoE 1541 Mid Term Exam (Fall 2018). CS/CoE 1541 Mid Term Exam (Fall 2018). Name: Question 1: (6+3+3+4+4=20 points) For this question, refer to the following pipeline architecture. a) Consider the execution of the following code (5 instructions)

More information

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false. CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2018 Lecture 24

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2018 Lecture 24 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2018 Lecture 24 LAST TIME Extended virtual memory concept to be a cache of memory stored on disk DRAM becomes L4 cache of data stored on L5 disk Extend page

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995 Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Fall 1995 Review: Who Cares About the Memory Hierarchy? Processor Only Thus Far in Course:

More information

Cache Structure. Replacement policies Overhead Implementation Handling writes Cache simulations. Comp 411. L15-Cache Structure 1

Cache Structure. Replacement policies Overhead Implementation Handling writes Cache simulations. Comp 411. L15-Cache Structure 1 Cache Structure Replacement policies Overhead Implementation Handling writes Cache simulations L15-Cache Structure 1 Tag A CPU Data Mem[A] Basic Caching Algorithm ON REFERENCE TO Mem[X]: Look for X among

More information

COSC3330 Computer Architecture Lecture 20. Virtual Memory

COSC3330 Computer Architecture Lecture 20. Virtual Memory COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use

More information

Computer System Architecture Midterm Examination Spring 2002

Computer System Architecture Midterm Examination Spring 2002 Computer System Architecture 6.823 Midterm Examination Spring 2002 Name: This is an open book, open notes exam. 110 Minutes 1 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

Inside out of your computer memories (III) Hung-Wei Tseng

Inside out of your computer memories (III) Hung-Wei Tseng Inside out of your computer memories (III) Hung-Wei Tseng Why memory hierarchy? CPU main memory lw $t2, 0($a0) add $t3, $t2, $a1 addi $a0, $a0, 4 subi $a1, $a1, 1 bne $a1, LOOP lw $t2, 0($a0) add $t3,

More information

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

ADMIN. SI232 Set #18: Caching Finale and Virtual Reality (Chapter 7) Down the home stretch. Split Caches. Final Exam Monday May 1 (first exam day)

ADMIN. SI232 Set #18: Caching Finale and Virtual Reality (Chapter 7) Down the home stretch. Split Caches. Final Exam Monday May 1 (first exam day) ADMIN SI232 Set #8: Caching Finale and Virtual Reality (Chapter 7) Ethics Discussion & Reading Quiz Wed April 2 Reading posted online Reading finish Chapter 7 Sections 7.4 (skip 53-536), 7.5, 7.7, 7.8

More information

ECE550 PRACTICE Final

ECE550 PRACTICE Final ECE550 PRACTICE Final This is a full length practice midterm exam. If you want to take it at exam pace, give yourself 175 minutes to take the entire test. Just like the real exam, each question has a point

More information

198:231 Intro to Computer Organization. 198:231 Introduction to Computer Organization Lecture 14

198:231 Intro to Computer Organization. 198:231 Introduction to Computer Organization Lecture 14 98:23 Intro to Computer Organization Lecture 4 Virtual Memory 98:23 Introduction to Computer Organization Lecture 4 Instructor: Nicole Hynes nicole.hynes@rutgers.edu Credits: Several slides courtesy of

More information

The University of Alabama in Huntsville Electrical & Computer Engineering Department CPE Test II November 14, 2000

The University of Alabama in Huntsville Electrical & Computer Engineering Department CPE Test II November 14, 2000 The University of Alabama in Huntsville Electrical & Computer Engineering Department CPE 513 01 Test II November 14, 2000 Name: 1. (5 points) For an eight-stage pipeline, how many cycles does it take to

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

CSE 351. Virtual Memory

CSE 351. Virtual Memory CSE 351 Virtual Memory Virtual Memory Very powerful layer of indirection on top of physical memory addressing We never actually use physical addresses when writing programs Every address, pointer, etc

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with

More information

6.004 Tutorial Problems L14 Cache Implementation

6.004 Tutorial Problems L14 Cache Implementation 6.004 Tutorial Problems L14 Cache Implementation Cache Miss Types Compulsory Miss: Starting with an empty cache, a cache line is first referenced (invalid) Capacity Miss: The cache is not big enough to

More information

University of Western Ontario, Computer Science Department CS3350B, Computer Architecture Quiz 1 (30 minutes) January 21, 2015

University of Western Ontario, Computer Science Department CS3350B, Computer Architecture Quiz 1 (30 minutes) January 21, 2015 University of Western Ontario, Computer Science Department CS3350B, Computer Architecture Quiz (30 minutes) January 2, 205 Student ID number: Student Last Name: Exercise. In the following list of performance

More information