ELEC3441: Computer Architecture, Second Semester, Homework 3 (r1.1) SOLUTION




Part A: Problem Set

A.1 Cache Access

Consider the following sequence of memory accesses to the main memory in a 16-bit processor:

Address (hex)   Type
A000            R
B000            R
A380            R
A004            W
580C            W
A108            R
5800            R
A10C            W
A39C            W
A3AC            R
A1AC            R
A006            R
5804            R

A.1.1 Assume the following data cache organization:

Capacity: 4 KiB
Line size: 8 words
Organization: direct mapped
Policy: write back, write allocate

Trace through the above memory accesses and answer the following:

(i) For each access, is it a hit or a miss?
(ii) Show the final content of the cache, including the tag. For the sake of simplicity, assume the content of a memory address is the same as its address, i.e., mem[X] = X.

Word size = 16 bits = 2 B. Since the line size is 8 words, 3 bits select the word within a line and 1 more bit selects the byte within a word, making the offset 4 bits. For a direct-mapped cache, 4 KiB capacity / 16 B per line = 2^8 = 256 lines. Therefore, 8 bits are used for the index, and the remaining 4 bits are for the tag.

A.1.2 Repeat A.1.1 but with a different cache:

Capacity: 4 KiB
Line size: 8 words
Organization: 2-way set associative
Policy: LRU, write back, write allocate
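The address breakdown in A.1.1 can be checked mechanically. The following is a minimal sketch, not the official solution table: the function name `simulate` and its parameters are illustrative, and it models only tags and LRU order (no data, no dirty bits), which is enough to classify each access as a hit or a miss.

```python
from collections import OrderedDict

CAPACITY_BYTES = 4 * 1024   # 4 KiB
LINE_BYTES = 8 * 2          # 8 words of 2 bytes each

TRACE = [
    ("A000", "R"), ("B000", "R"), ("A380", "R"), ("A004", "W"),
    ("580C", "W"), ("A108", "R"), ("5800", "R"), ("A10C", "W"),
    ("A39C", "W"), ("A3AC", "R"), ("A1AC", "R"), ("A006", "R"),
    ("5804", "R"),
]

def simulate(trace, ways=1, write_allocate=True):
    """Return one 'H' or 'M' per access for an LRU set-associative cache."""
    num_sets = CAPACITY_BYTES // (LINE_BYTES * ways)
    # One OrderedDict per set: resident tags, least recently used first.
    sets = [OrderedDict() for _ in range(num_sets)]
    outcome = []
    for addr_hex, kind in trace:
        line_number = int(addr_hex, 16) // LINE_BYTES   # drop the 4 offset bits
        index, tag = line_number % num_sets, line_number // num_sets
        resident = sets[index]
        if tag in resident:
            resident.move_to_end(tag)                   # refresh LRU position
            outcome.append("H")
            continue
        outcome.append("M")
        if kind == "W" and not write_allocate:
            continue                                    # write miss bypasses the cache
        if len(resident) == ways:
            resident.popitem(last=False)                # evict the least recently used tag
        resident[tag] = True
    return outcome

if __name__ == "__main__":
    print("direct mapped:", " ".join(simulate(TRACE)))
    print("2-way LRU:    ", " ".join(simulate(TRACE, ways=2)))
```

The hit/miss pattern depends on the organization and the allocation policy, not on write back versus write through, so the same function with `write_allocate=False` can be used to explore the no-write-allocate variants. Under these assumptions the direct-mapped run gives 4 hits in 13 accesses and the 2-way run gives 5.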

A.1.3 Repeat A.1.1 but with a different cache:

Capacity: 4 KiB
Line size: 8 words
Organization: direct mapped
Policy: write through, no write allocate

A.1.4 Repeat A.1.1 but with a different cache:

Capacity: 4 KiB
Line size: 8 words
Organization: 2-way set associative
Policy: LRU, write through, no write allocate

A.2 Cache Performance

A.2.1

A.2.2

AMAT = hit time + miss rate × miss penalty = … + …% × 300 = 15

AMAT = hit time + miss rate × miss penalty = … + …% × 300 = 61

A.2.3 Since the I-cache always hits, the average CPI of B is:

CPI_hit = 50% × … + …% × … + …% × AMAT_D = 50% × … + …% × … + …% × 61 = …

A.2.4

A.2.5

CPI = CPI_hit + 5% × 300 = … = …
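The AMAT formula used in A.2.2 is easy to evaluate directly. The sketch below is an illustration under a stated assumption: the hit time and miss rate of the data cache are not legible in this copy, so they are inferred here by solving the two data-cache results that do survive (AMAT = 61 with a 300-cycle penalty, and AMAT = 63 with the 310-cycle penalty in A.2.6) as two linear equations. The inferred values are not quoted from the problem statement.

```python
from fractions import Fraction

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Two surviving data-cache results: (miss penalty, AMAT).
penalty_a, amat_a = 300, 61
penalty_b, amat_b = 310, 63

# Assumption: only the miss penalty differs between the two cases, so
#   hit_time + miss_rate * 300 = 61
#   hit_time + miss_rate * 310 = 63
miss_rate = Fraction(amat_b - amat_a, penalty_b - penalty_a)
hit_time = amat_a - miss_rate * penalty_a

if __name__ == "__main__":
    print("inferred miss rate:", miss_rate)   # a Fraction, exact arithmetic
    print("inferred hit time:", hit_time)
    print("check:", amat(hit_time, miss_rate, 300), amat(hit_time, miss_rate, 310))
```

Exact fractions are used instead of floats so the check reproduces the integer AMAT values without rounding noise.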

A.2.6 The new memory incurs a larger miss penalty. Therefore, the new data cache AMAT becomes:

AMAT = hit time + miss rate × miss penalty = … + …% × 310 = 63

Now, with a perfect I-cache, CPI becomes:

CPI_hit = 50% × … + …% × … + …% × AMAT_D = 50% × … + …% × … + …% × 63 = …

Because of the dual-port memory, an I-cache miss that coincides with a D-cache miss no longer incurs an additional penalty on top of the penalty already incurred by the D-cache miss. In terms of performance, these instructions behave as if they were I-cache hits. Therefore,

CPI = CPI_hit + 5% × (1 - 2%) × 300 = … = …

So the new DRAM actually results in lower performance. A simple way to understand the reason: the larger miss penalty of the new DRAM degrades overall CPI in all cases, regardless of whether there is an overlap between I-cache and D-cache misses. Since the chance of an I-cache miss and a D-cache miss happening at the same time is low, the overall performance still degrades.

A.3 Instruction Cache

Adapted from final exam 2015

You are designing the instruction cache for a new 32-bit processor. Because of other hardware constraints, your cache design must meet the following criteria:

Virtual address space is 32 bits wide.
Word size is 32 bits.
Cache must be indexed and tagged with virtual addresses.
Line size must be 4 words.
The number of bits to index the cache must be 8 bits.
Page size is 8 KiB.
Translation from virtual to physical address is controlled by the OS and the page mapping is pseudo-random.
Process ID is 4 bits wide.
All compiled programs start running from address 0x…
The OS performs a context switch every 256 instructions.

A.3.1 In order to differentiate data from different processes, you have decided to concatenate the process ID to the virtual address tag in the cache. Assuming you are using a direct-mapped cache, what is the minimum width of the combined tag?

24 bits
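The 24-bit answer to A.3.1 follows from the listed constraints. A short sketch of the arithmetic, with constant names chosen here for illustration:

```python
ADDRESS_BITS = 32   # virtual address width
WORD_BYTES = 4      # 32-bit words
LINE_WORDS = 4      # line size in words
INDEX_BITS = 8      # required number of index bits
PID_BITS = 4        # process ID width

line_bytes = LINE_WORDS * WORD_BYTES
offset_bits = line_bytes.bit_length() - 1          # log2 of a power of two
address_tag_bits = ADDRESS_BITS - INDEX_BITS - offset_bits
combined_tag_bits = address_tag_bits + PID_BITS    # process ID concatenated to the tag

if __name__ == "__main__":
    print("offset bits:", offset_bits)
    print("address tag bits:", address_tag_bits)
    print("combined tag bits:", combined_tag_bits)
```

A 16-byte line takes 4 offset bits, leaving 32 - 8 - 4 = 20 address tag bits, and the 4-bit process ID brings the combined tag to 24 bits.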

A.3.2 Assume you are using a direct-mapped cache with the maximum possible size given the above constraints. What is the total number of bits required for tag storage in the cache?

A.3.3 Your machine starts with an empty instruction cache. The following program is run as Process 1 (with process ID = 1):

    0x…         _start: addi a1, a1, 1       # instruction 0
    0x…                 addi a3, zero, …
    …
    0x000401FC          bne a1, a3, _start   # instruction 127: NOT taken on 4th time

Process 1 begins execution and runs for 256 instructions. Then it is stopped. There is no branch instruction in the first 128 instructions except for the last bne. The bne instruction takes the branch the first 3 times it is encountered. The branch condition is false the 4th time it is executed.

After Process 1 has stopped (after it has executed 256 instructions), how many hits and misses have occurred in the instruction cache? Briefly explain your answer.

Start with a cold cache, no branches, and no capacity misses: the sequence M H H H repeats 128/4 = 32 times during the first 128 instructions. Afterward, since there are no conflict misses, all instructions are cached. Therefore, the second pass through the loop gives another 128 H. Total: Miss = 32; Hit = 3 × 32 + 128 = 224.

A.3.4 Following up from the previous part, after Process 1 is stopped, the OS switches in Process 2. Process 2 is the exact same program as Process 1 except it is started by a different user. Therefore, it executes the exact same code as in the above part, except with process ID = 2. Process 2 is again stopped after 256 instructions. How many hits and misses will have occurred in the instruction cache due to running Process 2? Briefly explain your answer.
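The A.3.3 count of 32 misses and 224 hits can be reproduced with a small direct-mapped cache model. This is a sketch under one assumption that is not legible in this copy: the program start address is taken to be 0x00040000, which is what the surviving address of instruction 127 (0x000401FC) implies for 4-byte instructions. The helper name `run_quantum` is illustrative.

```python
LINE_BYTES = 16          # 4 words x 4 bytes
NUM_LINES = 256          # 8 index bits, direct mapped
INSTR_BYTES = 4
LOOP_LENGTH = 128        # instructions 0..127, ending in the bne

BNE_ADDRESS = 0x000401FC
# Assumption: start address inferred from the bne being instruction 127.
START_ADDRESS = BNE_ADDRESS - (LOOP_LENGTH - 1) * INSTR_BYTES

def run_quantum(cache, pid, instructions=256):
    """Fetch `instructions` loop instructions; return (misses, hits).

    `cache` maps a line index to the (pid, tag) pair currently stored there.
    """
    misses = hits = 0
    for i in range(instructions):
        address = START_ADDRESS + (i % LOOP_LENGTH) * INSTR_BYTES
        line_number = address // LINE_BYTES
        index, tag = line_number % NUM_LINES, line_number // NUM_LINES
        if cache.get(index) == (pid, tag):
            hits += 1
        else:
            misses += 1
            cache[index] = (pid, tag)     # fill the line, replacing any old content
    return misses, hits

if __name__ == "__main__":
    cold_cache = {}
    print(run_quantum(cold_cache, pid=1))
```

Starting from an empty dictionary models the cold cache. The 128-instruction loop occupies 32 consecutive lines, so only the first pass misses.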

The result is exactly the same as for Process 1. Process 2 uses the same cache locations as Process 1 and overwrites them as if it had started from a cold cache.

A.3.5 Since there are only 2 processes running, the process switching between Process 1 and Process 2 continues:

Process 1 is again switched in after Process 2. It continues its execution for another 256 instructions. By now, Process 1 has completed 512 instructions since it started.
Process 2 is switched in place of Process 1 and runs for 256 instructions.

When the two processes run in the processor for the second time, is the instruction hit rate changed when compared to your answers in A.3.3 and A.3.4? If the hit rate has changed, explain how it is different. If the hit rate remains the same, explain why that is the case.

The same as above. Because of the ping-pong effect, when P1 resumes it effectively sees an empty cache, because P2 has evicted all of its lines. The same holds for P2.

A.3.6 As an attempt to improve the performance of the instruction cache, you are given the option to change the instruction cache organization to 2-way set associative while keeping the cache capacity constant. However, as the tag capacity is limited, you can no longer store the process IDs in the cache. As a result, you need to flush the cache on every context switch.

Considering the 2-process scenario above, how would the change to a 2-way set associative cache with cache flushing affect the overall instruction hit rate after both Process 1 and Process 2 have finished running 512 instructions? That is: the processor has just finished executing Process 1 (256 instructions), Process 2 (256 instructions), Process 1 (256 instructions), Process 2 (256 instructions).

Because of flushing, every time a process comes back, the cache is cold. So the answer is the same as above.

A.3.7 Your teammate suggests that you should keep the process ID in the cache tag to avoid flushing the cache during a context switch. However, since you are limited in hardware resources, your teammate suggests that the capacity of the cache should be reduced by half as a tradeoff. Briefly discuss whether this scheme, using a 2-way set associative cache with reduced cache capacity and no flushing, may improve performance over the original direct-mapped cache.

Yes, it works. In this particular case, the code from the two processes is stored in the two ways of each set, so it is preserved across context switches.

A.4 Streaming Cache Performance

You are investigating the cache performance of your processor regarding the following code segment:

    // int i, n, a;
    // int y[], x[];
    for (i = 0; i < n; i++) {
        y[i] = a*x[i] + y[i];
    }

A.4.1 Cache Hit or Miss

Assume that a, i, and n are stored in registers. Array x is stored in memory starting at address 0xA…, while array y is stored starting at address 0xB…. The processor data cache is initially empty. Below are the details of the data cache:

Capacity: 1 MiB
Organization: 2-way set associative
Line size: 4 words
Replacement policy: true LRU
Write policy: write through, no write allocate

A write buffer is available such that the processor can resume running immediately after writing the data to the buffer.

Let n = 2^20. Trace through the above code, then show and explain the sequence of cache hits and misses that will occur. Use the notation RH for read hit, WH for write hit, RM for read miss, and WM for write miss.

The first access (read) of each line is always a miss, while subsequent accesses to the same line are hits because of spatial locality. Specifically, reading x[0] generates a miss (M) and reading y[0] generates a miss (M). Since the cache is 2-way set associative, x[0] and y[0] are stored in the same set but in different ways. Therefore, the write to y[0] is a hit (H). Then (R) x[1], (R) y[1], (W) y[1], (R) x[2], (R) y[2], (W) y[2], (R) x[3], (R) y[3], (W) y[3] are all hits (9 hits total).

This pattern repeats until the end of the cache, which is after 32k lines. (2^20 bytes in 2 ways, so each way contains 2^19 bytes = 2^17 words = 2^15 lines.) At that point, x[128k] maps to set 0 again. The conflict, combined with the LRU policy, evicts x[0]. As a result, x[128k] is a miss and is stored over the block of x[0]. Similarly, y[128k] is a miss and is stored over y[0].

Therefore, the overall sequence is:

RM RM WH RH RH WH RH RH WH RH RH WH, then repeat.

A.4.2 Based on your result above, what is the overall data cache miss rate of the above code when executed in the main CPU? Consider BOTH read and write accesses. Assume the write and read miss penalties are both 300 cycles. Hit time is 1 cycle. What is the average memory access time (AMAT) for this code?

Overall miss rate is 2/12 = 1/6. Therefore, AMAT = 1 + (1/6) × 300 = 51 cycles.

A.4.3 Now the above C code is compiled into the following RISC-V instructions:

    # a0 = 2^20
    # a1 is base address of x[]
    # a2 is base address of y[]
    # constant a is stored in register a3
    00: loop: addi a0, a0, -1
    04:       lw   t1, 0(a1)
    08:       lw   t2, 0(a2)
    0C:       mult t0, t1, a3
    10:       add  t0, t0, t2
    14:       sw   t0, 0(a2)
    18:       addi a1, a1, 4
    1C:       addi a2, a2, 4
    20:       bne  a0, zero, loop

Let n = … Assume add, addi, and bne take 1 cycle each; mult takes 4 cycles; and the performance of lw and sw depends on the cache performance. What is the total run time of the above code in cycles? You may leave the variable n in your answer.

(…) = … = 161n

A.4.4 Write Back Cache

If the write policy of the cache is changed to write back, how would the following be affected?

(i) Read miss penalty
(ii) Overall performance of the above code

Explain your answer by tracing through the execution of the above code, highlighting any difference in the required handling of cache content when a write back cache is used.
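The access pattern behind A.4.1 and A.4.2 (two reads and one write per loop iteration, matching the lw, lw, sw above) can be traced with a small model of the 2-way, write-through, no-write-allocate cache. This is a sketch with hypothetical base addresses: the array addresses are truncated in this copy, so 0xA0000000 and 0xB0000000 are placeholders chosen only so that both arrays start in the same set. A smaller n is used for the demonstration run.

```python
from collections import OrderedDict
from fractions import Fraction

WORD_BYTES = 4
LINE_BYTES = 4 * WORD_BYTES
WAYS = 2
NUM_SETS = (1 << 20) // (LINE_BYTES * WAYS)   # 1 MiB capacity -> 2^15 sets

X_BASE = 0xA0000000   # hypothetical: real address is truncated in the source
Y_BASE = 0xB0000000   # hypothetical: real address is truncated in the source

def trace(n):
    """Return the list of RH/RM/WH/WM events for n loop iterations."""
    sets = {}   # set index -> OrderedDict of tags, least recently used first

    def access(address, is_write):
        line_number = address // LINE_BYTES
        resident = sets.setdefault(line_number % NUM_SETS, OrderedDict())
        tag = line_number // NUM_SETS
        if tag in resident:
            resident.move_to_end(tag)             # refresh LRU position
            return "WH" if is_write else "RH"
        if not is_write:                          # no write allocate: only reads fill
            if len(resident) == WAYS:
                resident.popitem(last=False)      # evict the LRU line
            resident[tag] = True
        return "WM" if is_write else "RM"

    events = []
    for i in range(n):
        events.append(access(X_BASE + WORD_BYTES * i, False))   # lw x[i]
        events.append(access(Y_BASE + WORD_BYTES * i, False))   # lw y[i]
        events.append(access(Y_BASE + WORD_BYTES * i, True))    # sw y[i]
    return events

events = trace(1 << 12)
miss_rate = Fraction(sum(e.endswith("M") for e in events), len(events))
amat_cycles = 1 + miss_rate * 300   # hit time 1 cycle, miss penalty 300 cycles

if __name__ == "__main__":
    print(" ".join(events[:12]))
    print("miss rate:", miss_rate, "AMAT:", amat_cycles)
```

Because the code streams through both arrays and never revisits a line after moving on, the 12-event pattern is the same before and after the cache wraps around, so the miss rate does not depend on n as long as n is a multiple of 4.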

Because of the change in write policy, more cache lines need to be written back on read misses. In particular, when the accesses wrap around the cache in the above code, the current values of y[i-128k] are dirty and need to be written back. But since there is a write buffer, the additional time to write back before reading new data is minimal. As a result, there won't be a major difference in performance.

A.5 Page Table & TLB

A.5.1

A.5.2

Some advantages of a larger page size:

As the page size is larger, the number of pages is smaller, and thus the page table size can be reduced.
If the number of TLB entries is unchanged, the PPNs stored in the TLB cover a larger memory space, and the TLB hit rate could be higher.

Some disadvantages of a larger page size:

Whenever there is a page fault, a whole page needs to be loaded into memory. If the data in that page is not accessed much before the page is swapped out, memory bandwidth is wasted.
If the physical memory allocated to a program is limited, the number of pages stored in memory is smaller. When the program accesses data across different pages, it may cause frequent page faults.

A.5.3 As TLB access is in the processor critical path, you are experimenting to see whether a direct-mapped TLB could improve performance. Assume the capacity of the TLB remains unchanged, and consider the impact on TLB performance regarding the instructions of a program P. The instructions of P begin at memory location 0x…; during run time, it allocates and accesses only 1 data page starting at address 0xD…. In the following scenarios, explain how well the TLB would perform with a direct-mapped organization when compared to the original fully associative organization.

(i) Only 1 copy of P is running in the system.
(ii) 2 users are both running P at the same time. The OS flushes the TLB when swapping between processes.
(iii) 2 users are both running P at the same time. The OS does not flush the TLB. Instead, the process ID is prepended to the virtual page number, i.e., the VPN tag becomes [process ID, VPN].
(iv) 2 users are both running P at the same time. The OS does not flush the TLB. Instead, the process ID is appended to the virtual page number, i.e., the VPN tag becomes [VPN, process ID].

(i) Instruction page map to index 2, data page map to index 0. In a fully associative TLB, they would be mapped to different locations, and both page translations would stay in the TLB throughout the lifetime of the process. However, in a direct-mapped TLB, the 2 collide at index 0, causing multiple TLB misses during run time.

(ii) As the TLB is flushed when swapping processes, the scenario for each process is the same as above, as if each starts from a cold TLB.

(iii) With the process ID prepended, in a fully associative organization all 4 pages will be held in the TLB. With a direct-mapped organization, the pages from both processes, both instruction and data pages, will collide.

(iv) With the process ID appended, it is possible that the pages from different processes are mapped to different locations in a direct-mapped TLB. However, the instruction page and the data page from the same process will still collide with each other.
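The difference between (iii) and (iv) comes down to which bits end up in the low-order positions that a direct-mapped TLB uses as its index. The sketch below illustrates this with entirely hypothetical parameters, since the TLB size, the process ID width, and the full page addresses are not legible in this copy: a 4-entry TLB, a 4-bit process ID, a 20-bit VPN, and two VPNs whose low bits are equal. It also assumes the index is taken from the low-order bits of the combined tag.

```python
TLB_ENTRIES = 4        # hypothetical TLB size
PID_BITS = 4           # hypothetical process ID width
VPN_BITS = 20          # hypothetical VPN width

INSTR_VPN = 0x00040    # hypothetical instruction page number
DATA_VPN = 0xD0000     # hypothetical data page number

def index_pid_prepended(pid, vpn):
    """Combined tag is [process ID, VPN]: the low bits come from the VPN."""
    return ((pid << VPN_BITS) | vpn) % TLB_ENTRIES

def index_pid_appended(pid, vpn):
    """Combined tag is [VPN, process ID]: the low bits come from the process ID."""
    return ((vpn << PID_BITS) | pid) % TLB_ENTRIES

if __name__ == "__main__":
    for pid in (1, 2):
        for name, vpn in (("instr", INSTR_VPN), ("data", DATA_VPN)):
            print(pid, name,
                  "prepended ->", index_pid_prepended(pid, vpn),
                  "appended ->", index_pid_appended(pid, vpn))
```

With these placeholder values, prepending the process ID sends all four pages to the same entry, while appending it separates the two processes but still maps each process's instruction and data pages to one entry, which is the behavior described in (iii) and (iv).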


Second Semester, 2015-16. Homework 3 (r1.1). Due: Part (A) -- Apr 29, 2016, 11:55pm; Part (B) -- Apr 29, 2016, 11:55pm; Part (C) -- Apr 29, 2016, 11:55pm. Instruction: Submit your answers electronically through