Homework 3 (r1.2)
Due: Part (A) -- Apr 28, 2017, 11:55pm; Part (B) -- Apr 28, 2017, 11:55pm; Part (C) -- Apr 28, 2017, 11:55pm
Second Semester, 2016-17

Instruction: Submit your answers electronically through Moodle.

There are 3 major parts in this homework.

Part A includes questions that aim to help you with understanding the lecture materials. They resemble the kind of questions you will encounter in quizzes and the final exam. Your answers to this part will be graded on your effort.

Part B consists of hands-on exercises that require you to design and evaluate processor systems using various software and hardware tools, including Chisel and the RISC-V compilation tool chain. They are designed to help you understand real-world processor design and the use of various tools to help you along the way. This part of the homework will be graded on correctness.

Part C contains open-ended mini-project ideas. They are open-ended by nature, meaning there are no right or wrong answers. You must choose to attempt one of the several available topics. You may work individually or in groups of up to 3 for this part. If you work in groups, each of you must submit an independent report on the project.

The following table summarizes the 3 parts:

    Part  Type               Indv/Grp                      Grading
    A     Basic problem set  Individual                    Graded on effort
    B     Hands-on           Individual or group of 2 to 3 Graded on correctness
    C     Mini-project       Individual or group of 2 to 3 Graded on effort

In all cases, you are encouraged to discuss the homework problems offline or online using Piazza. However, you should not ask for or give out solutions directly, as that defeats the purpose of having homework exercises. Giving out answers or copying answers directly will likely constitute an act of plagiarism.
Part A: Problem Set

A.1 Column-Row

In class, we discussed how matrices may be stored in memory with row-major or column-major orientation. In a row-major organization, matrices are stored row-by-row in memory, while in a column-major organization, matrices are stored column-by-column. Standard C compilers organize matrices as row-major, while Matlab organizes matrices as column-major. Consider the following C code:

    #define N 128
    int a[N][N];    // a[0][0] located at 0xA
    int i, j, sum;  // i, j, sum in registers

    sum = 0;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = sum + a[i][j];
        }
    }

You are running this code on a processor with the following tiny data cache:

- Direct mapped
- 8-word block size
- 32 entries

A.1.1 What is the capacity of the cache?

A.1.2 When the code is run with an initially empty cache, list out the sequence of hits and misses that are generated. What type of misses are they? What is the miss rate?

A.1.3 Suppose now that the i and j loops are exchanged as in the following code:

    for (j = 0; j < N; j++) {
        for (i = 0; i < N; i++) {
            sum = sum + a[i][j];
        }
    }

Repeat your work in A.1.2. How does it affect your miss rate?

A.1.4 If instead the matrix is stored in column-major order, which loop order may produce better cache performance?
A.1.5 If the cache is changed to a 4-way set associative cache while keeping the same capacity, how would it change the miss rate in the above matrix accesses?

A.1.6 Now consider a different loop that copies the top half of the matrix to the bottom half:

    for (i = 0; i < N/2; i++) {
        for (j = 0; j < N; j++) {
            a[i + N/2][j] = a[i][j];
        }
    }

When executed with the original direct mapped cache, what is the sequence of hits and misses that gets generated? What is the miss rate? Assume the matrix is stored as row-major in memory. Also assume a write-back and write-allocate policy.

A.1.7 Repeat A.1.6 if a 2-way set associative cache is used. Assume a write-through and no-write-allocate policy.
A.2 Page Table & TLB

In this exercise you will experiment with the interaction between the TLB and the page table in a VM system. Assume the following system configuration:

- 16-bit virtual and physical addresses
- 256 B page size
- 4-entry, fully associative TLB, true LRU replacement policy

Initially, the TLB and page table contain the following entries. An invalid entry is marked as empty. If a page is located on hard disk, it is marked as disk.

TLB:

    Tag    PPN
    0xAC   0x07
    0x04   0xCA
    0x28   0xD0
    empty

Page Table:

    Loc    PPN or Disk
    0xFF   disk
    0xFE   empty
    0xFD   empty
    0xFC   empty
    0xFB   0xAB
    0xFA   empty
    ...    ...
    0x08   empty
    0x07   empty
    0x06   disk
    0x05   empty
    0x04   0xCA
    0x03   0x18
    0x02   disk
    0x01   empty
    0x00   0xFF

A.2.1 The following sequence of memory accesses is issued: 0xFF02, 0x00FF, 0xAC98, 0x0801, 0xFC98. Assume there is only 1 process in the system. Answer the following:

1. What is the final state of the TLB and page table after the above accesses?
2. For each memory access, is it a hit in the TLB, a hit in the page table, or a page fault?

A.2.2 Given the above VM setup, what is its TLB reach?

A.2.3 What are some of the advantages and disadvantages of a larger page size?

A.2.4 Assume now that there are 2 processes running in the system. The 2 processes generate the following list of memory references:
Process 1:
0x0700, 0x0704, 0xFA00, 0xFA04, 0x0700, 0x0704, 0xFA00, 0xFA04, 0x0708, 0xFE00, 0xFE04, 0x0708, 0x070C

Process 2:
0xFFF0, 0xFFEC, 0xFFE8, 0x070C, 0x0710, 0x0714, 0x0718, 0xFE00, 0xFE04, 0xFE0C, 0xFE04, 0xFE00, 0xFE04

Given the above accesses, how many TLB misses will be generated for Process 1 and Process 2 respectively? How many page faults are generated?

A.2.5 Describe what kind of hardware/software changes you can make to reduce the number of TLB misses in such a scenario.
A.3 Vector Co-Processor

Adapted from the 2016 final exam.

As an attempt to improve the performance of a simple 32-bit scalar processor, you are considering adding a vector co-processor to your system as shown below:

[Figure: the CPU and its data cache connect to the main memory (DRAM); the vector co-processor is attached directly to the main memory.]

As shown in the figure, the additional co-processor, called MMV, is a memory-memory vector unit that is attached directly to the system main memory. When instructed by the main processor, it fetches data from the main memory, operates on the data, and stores the results back to the main memory. Your task is to evaluate the effectiveness of this MMV unit on the following very important program:

    // int i, a;
    #define N 256
    int x[N], y[N];

    Preprocess(x);
    // compute kernel
    for (i = 0; i < N; i++) {
        y[i] = x[i] + y[i];
    }

Mathematically, the loop marked as compute kernel performs the operation y = x + y, where x and y are vectors with N elements.

A.3.1 Scalar Performance

The D-cache of the processor has the following properties:

- Cache miss takes 150 cycles
- Cache hit takes 1 cycle
- Capacity: 4 MiB
- Organization: 2-way set associative
- Cache line is 4 words
- Policy: write back, write allocate
The for loop marked as compute kernel in the above C program is compiled as follows:

    # a0 = 256
    # a1 is base address of x[]: 0xA
    # a2 is base address of y[]: 0xA0C
    00: loop: addi a0, a0, -1
    04:       lw   t1, 0(a1)
    08:       lw   t2, 0(a2)
    10:       add  t2, t1, t2
    14:       sw   t2, 0(a2)
    18:       addi a1, a1, 4
    1C:       addi a2, a2, 4
    20:       bne  a0, zero, loop

Assume the CPI of all instructions is 1, while the CPI of memory operations depends on cache performance. Assuming the cache is initially empty, how many cycles does it take to complete the above loop?
A.3.2 The vector co-processor MMV is able to perform similar vector additions by directly reading input from the main memory and storing the results to the main memory. Taking into account DRAM timing, on average it can perform 16 additions in 100 cycles, which includes the cycles needed to check the loop boundary. Now, assuming all the data are in main memory, how many cycles does MMV take to complete the loop?

A.3.3 When compared to the scalar processor, what is the speedup offered by MMV, considering the compute kernel loop only?

A.3.4 One limitation of MMV is that it operates directly on the main memory. Assume both x[] and y[] are dirty in the cache and thus require flushing before MMV can take over to accelerate the loop. Now, assume flushing each cache line requires 150 cycles. How long does it take to flush the entire vectors x[] and y[] from the cache to main memory if they are all dirty?

A.3.5 Taking into account the time to flush x[] and y[] to memory, what is now the speedup of using MMV?

A.3.6 In practice, since it is usually difficult for the OS to know in which cache line a variable resides, it often needs to flush the entire cache instead. Recall that the cache is 4 MiB; how long does it take to flush the entire cache, assuming 25% of the cache lines are dirty? Also, what is the resulting speedup of using MMV?
A.3.7 Your project partner suggests that the need to flush dirty lines from the cache can be eliminated by implementing a write-through policy. Does the use of a write-through policy eliminate the need to flush dirty lines from the cache? Does it resolve all data coherence problems between MMV and the scalar processor? Explain your answers.
A.4 Page Table Size

You are considering a system with the following parameters:

- 64-bit architecture
- 1 MiB page size
- Each page table entry is 8 bytes

A.4.1 Assume a single linear page table, with all entries pre-allocated when a process is launched. What is the size of the page table for 1 process?

A.4.2 Obviously, it is unrealistic to allocate all available space for a process. In practice, most processes reference only a few clusters of addresses, such as the addresses within the data heap, the stack, and the addresses of the program instructions. Assuming each process accesses 4 disjoint clusters of addresses, each consisting of 256 MiB of data, what is the theoretical minimum amount of storage needed for page table entries (PTEs)?

A.4.3 To get a storage requirement closer to the above theoretical minimum, a 2-level page table system is being explored. The first-level page table is preallocated when a process is launched, and contains pointers to second-level page tables. A second-level page table is allocated on demand as the 4 clusters of memory regions are allocated. The following breakdown of the address between the 2 levels is used:

    +-------------+--------------+-------------+
    | First Level | Second Level | Page Offset |
    +-------------+--------------+-------------+

If the 4 clusters of memory are located in a contiguous 1 GiB virtual address space starting from address 0, how many second-level page tables are allocated for this process? With that, what is the total size of the page tables in both levels?

A.4.4 If the 4 regions are located at 0x , 0x , 0x and 0xC instead, will it affect your answer from the previous part?

A.4.5 If a 3-level page table is used instead, can it reduce the total memory that is needed to store all page table entries, assuming the addresses are distributed as in A.4.4?

A.4.6 Assuming the addresses are distributed as in A.4.4, will having even more levels of page table be beneficial? Is there a limit on how many levels of page table may be used? What are the tradeoffs?
Part B: Hands-on Exercise

In this exercise, you will learn techniques to optimize an application for the CPU and for the GPU. The core of the matrix-matrix multiplication program is the following loop:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                c[i*n+j] += a[i*n+k] * b[k*n+j];
            }
        }
    }

B.1 Optimizing Matrix-Matrix Multiplication for CPU

Obtain the homework files:

    tux-1$ tar xzf ~elec3441/elec3441hw3.tar.gz
    tux-1$ cd hw3
    tux-1$ export HW3ROOT=$PWD

B.1.1 Loop Interchange

There are no data or control dependencies between loop iterations, so it is possible to reorder operations arbitrarily. Depending on the order of the loops, one of the three elements a[i*n+k], b[k*n+j], c[i*n+j] stays constant during the whole inner loop. Compare the following three versions:

IJK:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            x = 0;
            for (k = 0; k < N; k++) {
                x += a[i*n+k] * b[k*n+j];
            }
            c[i*n+j] += x;
        }
    }

IKJ:

    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            x = a[i*n+k];
            for (j = 0; j < N; j++) {
                c[i*n+j] += x * b[k*n+j];
            }
        }
    }

JKI:

    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            x = b[k*n+j];
            for (i = 0; i < N; i++) {
                c[i*n+j] += a[i*n+k] * x;
            }
        }
    }

B.1.2 Analyze the cache behavior of each version. Assume the matrices are very large, so that even one row will not fit in the cache. Matrices are stored in row-major order. One cache line fits four matrix elements. How many loads and stores does each iteration of the innermost loop need? How many cache misses per iteration will it produce for matrices A, B and C?

B.1.3 What will be the behavior for loop orders JIK, KIJ and KJI?

B.1.4 The file test_mmm_inter.c measures the time needed for the three above versions of matrix multiplication on various matrix sizes. Compile and run this code:

    tux-1$ cd ${HW3ROOT}/mmm_cpu
    tux-1$ gcc -O3 -o test_mmm_inter test_mmm_inter.c -lrt
    tux-1$ ./test_mmm_inter

It will output the time taken in nanoseconds for the loop orders ijk, kij and jki over a range of matrix sizes. Plot the time taken per iteration (time/n^3) for each of the loop orders. Does the result match your analysis in B.1.2?

B.1.5 Can you make a guess at the L1 and L2 cache sizes of the machine based on this graph? You can change the values of BASE, ITER and DELTA to get a closer look at smaller matrix sizes.

B.1.6 Blocking

Another way to apply the observation that the different iterations of the loop body can be executed in any order is cache blocking. After loading a small part of the matrices into the cache, as many operations using this data as possible should be executed before loading new data, in order to obtain the best performance. Examine the function mmm_iijjkk_blocked in test_mmm_block.c. In the following diagram, each numbered block is of size block_size x block_size. Highlight the blocks of A, B and C that will be accessed for the values ii=1 and jj=2.
What inequality between the block size and the cache size needs to be fulfilled for this method to be efficient?

B.1.7 Compile and run test_mmm_block.c:

    tux-1$ cd ${HW3ROOT}/mmm_cpu
    tux-1$ gcc -O3 -o test_mmm_block test_mmm_block.c -lrt
    tux-1$ ./test_mmm_block

Plot the results. What is the maximum efficient block size? What can you deduce about the cache size?

B.1.8 Modify test_mmm_inter.c to obtain results for the same matrix size. How does the performance of the blocked code compare to that of the different loop orders?

B.1.9 Interchange the loops in the blocked matrix multiply code. What is the best performance you can obtain by combining the two techniques?

B.1.10 Submission

Submit the following:

- The modified files test_mmm_inter.c and test_mmm_block.c.
- The plots generated and your answers to the questions in B.1.2 to B.1.9.

B.2 Optimizing Matrix Multiplication for GPU

In this section you will learn how to use the CUDA toolkit to program GPUs for general-purpose computing. To make the CUDA tools available in your shell, use the following command:

    tux-1$ source ~elec3441/elec3441hw3.bashrc

Because the graphics card on tux-1 is quite old, documentation for the installed CUDA version 6.5 is no longer available online. You can find the documentation in the folder /usr/local/cuda/doc. It is recommended to read the CUDA C Programming Guide, most importantly chapters 2, 4 and 5. Note that the GPU in tux-1 offers Compute Capability 1.1, which affects many features you will see throughout the manual.

B.2.1 Structure of a CUDA program

CUDA is a SIMT (Single Instruction Multiple Thread) model. A CUDA kernel is a function that is executed simultaneously in many threads. It is launched using the following syntax:
    kernel_fn<<< grid_dim3, block_dim3 >>>(arguments);

The parameters grid_dim3 and block_dim3 specify the number and layout of the parallel threads to be launched. Each parameter is of type dim3, specifying 3 dimensions. Threads are organized in up-to-3-dimensional blocks. All the blocks of a kernel are arranged in a grid. Recent GPUs allow a 3D grid, but for Compute Capability 1.1 only 2 dimensions are allowed. During execution, each thread has access to its block index and thread index to identify itself and the data it shall work on. The main task in GPU programming is to efficiently organize memory accesses among the threads.

Examine the files matrixmul.cu and matrixmul_naive.cuh in ${HW3ROOT}/mmm_gpu. The host (CPU) code is in matrixmul.cu, and matrixmul_naive.cuh contains the kernel that will be executed on the GPU. Answer the following questions:

B.2.2 What are the block and grid dimensions for a 2048x2048 matrix?

B.2.3 What data elements in A, B and C will an individual thread touch?

B.2.4 What are we comparing the performance of the CUDA code to?

B.2.5 Why do we include the CUDA data management functions (cudaMemcpy etc.) when we measure the time for matrix multiplication on the GPU?

B.2.6 Compile the code and run it for matrix sizes from 16x16 to 2048x2048:

    tux-1$ make
    tux-1$ ./matrixmul -length=16

Plot the CPU and GPU performances. At what size does using the GPU become more efficient than the CPU?

B.2.7 GPU Memory Hierarchy

The first step in improving the performance of a CUDA kernel is to adapt memory accesses to the GPU's memory hierarchy. Unlike in CPUs, the memory hierarchy of the GPU is mostly exposed to the programmer, and you will have to manually copy data from the more distant to the closer levels and back again. The following types of memory are available on a GPU:

Registers: Registers are the fastest memory, accessible without any latency on each clock cycle, just as on a regular CPU. A thread's registers cannot be shared with other threads.

Shared Memory: Shared memory is comparable to L1 cache memory on a regular CPU. It resides close to the multiprocessor and has very short access times. Shared memory is shared among all the threads of a given block. The section on shared memory in the CUDA C Best Practices Guide has more on shared memory optimization considerations.

Global Memory: Global memory resides on the device, but off-chip from the multiprocessors, so that access times to global memory can be 100 times greater than to shared memory. All threads in the kernel have access to all data in global memory.
Local Memory: Thread-specific memory stored where global memory is stored. Variables are stored in a thread's local memory if the compiler decides that there are not enough registers to hold the thread's data. This memory is slow, even though it's called local.

Constant Memory: 64 KB of constant memory resides off-chip from the multiprocessors, and is read-only. The host code writes to the device's constant memory before launching the kernel, and the kernel may then read this memory. Constant memory access is cached: each multiprocessor can cache up to 8 KB of constant memory, so that subsequent reads from constant memory can be very fast. All threads have access to constant memory.

Texture Memory: Specialized memory for surface texture mapping, not discussed in this module.

The equivalent of cache blocking on the GPU is to subdivide the matrices into tiles. Each thread block is in charge of one tile of the result matrix C. It loads the necessary tiles of A and B into shared memory, computes the result values in registers, and finally writes the results back to global memory.

B.2.8 Examine the code in matrixmul_tiling.cuh. What is the role of the for loop? For a 128x128 matrix, how many iterations will it have? Modify matrixmul.cu to make use of the tiled code:

    --#include "matrixmul_naive.cuh"
    ++#include "matrixmul_tiling.cuh"

Plot the performance for matrix sizes from 16x16 to 2048x2048 and compare it to the previous version. You can comment out the line #define COMPARE_CPU at the top of matrixmul.cu to avoid re-generating the CPU performance numbers.

B.2.9 Coalescing

Similarly to the effect of cache lines on a CPU, memory accesses to consecutive memory addresses by threads with consecutive indexes are more efficient than accesses to scattered addresses. This is called coalescing. Read appendix G.3 of the CUDA C Programming Guide for a detailed explanation.

B.2.10 Show that the accesses to matrix A already fulfil the conditions listed in section G.3.2. Why can the accesses to matrix B not be coalesced?

B.2.11 matrixmul_coalescing.cuh contains a version that allows coalesced accesses to B. How has this been achieved?

B.2.12 Plot the performance of the coalesced version for matrix sizes from 16x16 to 2048x2048 and compare it to the previous versions.

B.2.13 Shared Memory Bank Conflicts

You will find that, instead of the expected performance improvement from coalesced memory accesses, the last version actually performs more poorly. The problem is that although we have improved the global memory access pattern, we have introduced a problem with the access pattern to shared memory, namely bank conflicts. Section G.3.3 of the Appendix to the CUDA C Programming Guide explains how shared memory is structured in banks. To avoid bank conflicts, each group of 16 consecutive thread IDs needs to access 32-bit words in 16 different banks, i.e. with a different word alignment. The easiest way to achieve this is to access consecutive words, which is what has been implemented in matrixmul_nobankconflict.cuh. Explain the relevant change.
B.2.14 Why does the access AS[ty][k] not generate bank conflicts?

B.2.15 Plot the performance of this version for matrix sizes from 16x16 to 2048x2048 and compare it to the previous versions.

B.2.16 Optional: further optimizations

If you are looking for inspiration for other improvements you could make in part C, the page explains more techniques you can try.

B.2.17 Submission

Submit the following:

- A plot comparing the performance of all the different versions of CUDA matrix multiply examined in this section.
- Your answers to the questions.
Part C: Open-ended Project

C.1 Convolution

In this part you will apply the techniques you have learned in part B to a different algorithm: a convolution filter. You can find the code in the folder ${HW3ROOT}/convolutionSeparable.

A convolution filter is applied to an image to achieve some effect, for example a blur. It is defined by a convolution kernel, which is a matrix of small size (e.g. 3x3). Each output pixel is calculated based on the input pixel at the same location as well as its neighbors. For each input pixel location, an area of the same size as the kernel, centered at that location, is considered. Each input pixel value in this area is multiplied by the corresponding kernel value, and these products are summed up to produce the output pixel value.

The convolution kernel you are optimizing in this part has a further property: it is separable. This means that the kernel matrix is the product of two vectors; in our particular example it is the product of a column vector and its own transpose. You may use this feature to your advantage when optimizing the code.

C.1.1 Submission

Submit your implementation of the convolution filter for CPU and GPU together with a report on how you optimized the code for each target.
More informationCOSC 3406: COMPUTER ORGANIZATION
COSC 3406: COMPUTER ORGANIZATION Home-Work 5 Due Date: Friday, December 8 by 2.00 pm Instructions for submitting: Type your answers and send it by email or take a printout or handwritten (legible) on paper,
More informationCS152 Computer Architecture and Engineering Virtual Memory and Address Translation Assigned March 3 Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended to
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationVirtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. November 15, MIT Fall 2018 L20-1
Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L20-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More information198:231 Intro to Computer Organization. 198:231 Introduction to Computer Organization Lecture 14
98:23 Intro to Computer Organization Lecture 4 Virtual Memory 98:23 Introduction to Computer Organization Lecture 4 Instructor: Nicole Hynes nicole.hynes@rutgers.edu Credits: Several slides courtesy of
More informationVirtual Memory. CS 3410 Computer System Organization & Programming
Virtual Memory CS 3410 Computer System Organization & Programming These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer. Where are we now and
More informationEN1640: Design of Computing Systems Topic 06: Memory System
EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring
More informationPractice Exercises 449
Practice Exercises 449 Kernel processes typically require memory to be allocated using pages that are physically contiguous. The buddy system allocates memory to kernel processes in units sized according
More informationVirtual Memory 3. Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. P & H Chapter 5.4
Virtual Memory 3 Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 5.4 Project3 available now Administrivia Design Doc due next week, Monday, April 16 th Schedule
More informationECE 3056: Architecture, Concurrency, and Energy of Computation. Sample Problem Set: Memory Systems
ECE 356: Architecture, Concurrency, and Energy of Computation Sample Problem Set: Memory Systems TLB 1. Consider a processor system with 256 kbytes of memory, 64 Kbyte pages, and a 1 Mbyte virtual address
More informationand data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed
5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationVirtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. April 12, 2018 L16-1
Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L16-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:
More informationCache Memories. Andrew Case. Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron
Cache Memories Andrew Case Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron 1 Topics Cache memory organiza3on and opera3on Performance impact of caches 2 Cache Memories Cache memories are
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle
More informationECE Sample Final Examination
ECE 3056 Sample Final Examination 1 Overview The following applies to all problems unless otherwise explicitly stated. Consider a 2 GHz MIPS processor with a canonical 5-stage pipeline and 32 general-purpose
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationChapter 8. Virtual Memory
Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:
More informationThis Unit: Main Memory. Virtual Memory. Virtual Memory. Other Uses of Virtual Memory
This Unit: Virtual Application OS Compiler Firmware I/O Digital Circuits Gates & Transistors hierarchy review DRAM technology A few more transistors Organization: two level addressing Building a memory
More informationLearning Outcomes. An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory.
Virtual Memory 1 Learning Outcomes An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory. 2 Memory Management Unit (or TLB) The position and function
More informationDenison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud
Cache Memories CS-281: Introduction to Computer Systems Instructor: Thomas C. Bressoud 1 Random-Access Memory (RAM) Key features RAM is traditionally packaged as a chip. Basic storage unit is normally
More informationLearning Outcomes. An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory.
Virtual Memory Learning Outcomes An understanding of page-based virtual memory in depth. Including the R000 s support for virtual memory. Memory Management Unit (or TLB) The position and function of the
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationComputer Systems II. Memory Management" Subdividing memory to accommodate many processes. A program is loaded in main memory to be executed
Computer Systems II Memory Management" Memory Management" Subdividing memory to accommodate many processes A program is loaded in main memory to be executed Memory needs to be allocated efficiently to
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationWhat is Cache Memory? EE 352 Unit 11. Motivation for Cache Memory. Memory Hierarchy. Cache Definitions Cache Address Mapping Cache Performance
What is EE 352 Unit 11 Definitions Address Mapping Performance memory is a small, fast memory used to hold of data that the processor will likely need to access in the near future sits between the processor
More informationVirtual Memory, Address Translation
Memory Hierarchy Virtual Memory, Address Translation Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing,
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationEITF20: Computer Architecture Part 5.1.1: Virtual Memory
EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache
More informationChangelog. Virtual Memory (2) exercise: 64-bit system. exercise: 64-bit system
Changelog Virtual Memory (2) Changes made in this version not seen in first lecture: 21 November 2017: 1-level example: added final answer of memory value, not just location 21 November 2017: two-level
More informationCache memories The course that gives CMU its Zip! Cache Memories Oct 11, General organization of a cache memory
5-23 The course that gies CMU its Zip! Cache Memories Oct, 2 Topics Generic cache memory organization Direct mapped caches Set associatie caches Impact of caches on performance Cache memories Cache memories
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationCS 3733 Operating Systems:
CS 3733 Operating Systems: Topics: Memory Management (SGG, Chapter 08) Instructor: Dr Dakai Zhu Department of Computer Science @ UTSA 1 Reminders Assignment 2: extended to Monday (March 5th) midnight:
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationCS 33. Caches. CS33 Intro to Computer Systems XVIII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.
CS 33 Caches CS33 Intro to Computer Systems XVIII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Cache Performance Metrics Miss rate fraction of memory references not found in cache (misses
More informationDirect Mapped Cache Hardware. Direct Mapped Cache. Direct Mapped Cache Performance. Direct Mapped Cache Performance. Miss Rate = 3/15 = 20%
Direct Mapped Cache Direct Mapped Cache Hardware........................ mem[xff...fc] mem[xff...f8] mem[xff...f4] mem[xff...f] mem[xff...ec] mem[xff...e8] mem[xff...e4] mem[xff...e] 27 8-entry x (+27+)-bit
More informationCSE 153 Design of Operating Systems
CSE 53 Design of Operating Systems Winter 28 Lecture 6: Paging/Virtual Memory () Some slides modified from originals by Dave O hallaron Today Address spaces VM as a tool for caching VM as a tool for memory
More informationWinter 2009 FINAL EXAMINATION Location: Engineering A Block, Room 201 Saturday, April 25 noon to 3:00pm
University of Calgary Department of Electrical and Computer Engineering ENCM 369: Computer Organization Lecture Instructors: S. A. Norman (L01), N. R. Bartley (L02) Winter 2009 FINAL EXAMINATION Location:
More informationlecture 18 cache 2 TLB miss TLB - TLB (hit and miss) - instruction or data cache - cache (hit and miss)
lecture 18 2 virtual physical virtual physical - TLB ( and ) - instruction or data - ( and ) Wed. March 16, 2016 Last lecture I discussed the TLB and how virtual es are translated to physical es. I only
More informationVirtual Memory Oct. 29, 2002
5-23 The course that gives CMU its Zip! Virtual Memory Oct. 29, 22 Topics Motivations for VM Address translation Accelerating translation with TLBs class9.ppt Motivations for Virtual Memory Use Physical
More informationCS 61C: Great Ideas in Computer Architecture. Virtual Memory III. Instructor: Dan Garcia
CS 61C: Great Ideas in Computer Architecture Virtual Memory III Instructor: Dan Garcia 1 Agenda Review of Last Lecture Goals of Virtual Memory Page Tables TranslaFon Lookaside Buffer (TLB) Administrivia
More informationExam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence
Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationStructure of Computer Systems
222 Structure of Computer Systems Figure 4.64 shows how a page directory can be used to map linear addresses to 4-MB pages. The entries in the page directory point to page tables, and the entries in a
More informationSolutions for Chapter 7 Exercises
olutions for Chapter 7 Exercises 1 olutions for Chapter 7 Exercises 7.1 There are several reasons why you may not want to build large memories out of RAM. RAMs require more transistors to build than DRAMs
More informationregisters data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.
Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3
More informationCS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches
CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates
More informationCOSC3330 Computer Architecture Lecture 20. Virtual Memory
COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use
More informationCache Performance II 1
Cache Performance II 1 cache operation (associative) 111001 index offset valid tag valid tag data data 1 10 1 00 00 11 AA BB tag 1 11 1 01 B4 B5 33 44 = data (B5) AND = AND OR is hit? (1) 2 cache operation
More informationVirtual Memory. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]
Virtual Memory CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Click any letter let me know you re here today. Instead of a DJ Clicker Question today,
More informationCarnegie Mellon. Cache Memories. Computer Architecture. Instructor: Norbert Lu1enberger. based on the book by Randy Bryant and Dave O Hallaron
Cache Memories Computer Architecture Instructor: Norbert Lu1enberger based on the book by Randy Bryant and Dave O Hallaron 1 Today Cache memory organiza7on and opera7on Performance impact of caches The
More informationVirtual Memory Review. Page faults. Paging system summary (so far)
Lecture 22 (Wed 11/19/2008) Virtual Memory Review Lab #4 Software Simulation Due Fri Nov 21 at 5pm HW #3 Cache Simulator & code optimization Due Mon Nov 24 at 5pm More Virtual Memory 1 2 Paging system
More informationCS162 Operating Systems and Systems Programming Lecture 14. Caching and Demand Paging
CS162 Operating Systems and Systems Programming Lecture 14 Caching and Demand Paging October 17, 2007 Prof. John Kubiatowicz http://inst.eecs.berkeley.edu/~cs162 Review: Hierarchy of a Modern Computer
More informationVirtual Memory, Address Translation
Memory Hierarchy Virtual Memory, Address Translation Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing,
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More informationTopics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems
Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Instructor: Dr. Turgay Korkmaz Department Computer Science The University of Texas at San Antonio Office: NPB
More informationMemory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky
Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More information