CS/CoE 1541 Final Exam (Fall 2017)

Name: ________________________

This is the cumulative final exam given in the Fall of 2017.

Question 1 (12 points): was on Chapter 4.
Question 2 (13 points): was on Chapter 4.

For Exam 2, you need to review Questions 3, 4 and 5a from this exam, in addition to Questions 6 and 7 from last year's mid-term that was posted earlier. Questions 5b, 6, 7 and 8 of this exam relate to material that we have not covered yet; you should get back to them when you review for Exam III.

Question 3: (10 points)

Consider a computer system that supports a 16MB virtual address space (byte addressable) with 1KB pages, 256KB of physical memory, and an 8-entry direct-mapped TLB.

a) Assume that the TLB configuration is as follows:

   Valid   Tag             Page address
   1       0000 0011 000   0010 0000
   1       0000 0011 000   0000 0100
   1       1111 0011 000   0000 1110
   1       0011 0001 101   1111 0001
   1       0011 1110 100   1010 0011
   1       1010 1010 110   0101 0001
   0       0000 0011 000   0101 0010
   1       1111 0011 000   1100 0101

For each of the following virtual addresses, specify the virtual page number and find the physical page number. If any address causes a TLB miss, indicate so (do not find the physical address in such a case).

   Virtual address (byte address)    Virtual page number    Physical page number
   0000 0011 0000 0100 0001 1001
   1111 0011 0001 1111 1100 0100
   0011 0001 1011 1110 1001 1101

b) A TLB miss may or may not result in a page fault. When does a TLB miss result in a page fault?

c) What would be the main advantage and main disadvantage of replacing the direct-mapped TLB with a fully associative TLB?

Advantage:

Disadvantage:
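Reference note (not part of the original exam): the parameters above imply a 24-bit virtual address split into a 10-bit page offset and a 14-bit virtual page number, with the low 3 bits of the page number indexing the 8-entry direct-mapped TLB and the remaining 11 bits forming the tag. A minimal C sketch of that breakdown (the address value and variable names are my own, chosen only for illustration):

   #include <stdio.h>

   int main(void) {
       unsigned int va     = 0xABCDEF;   /* a hypothetical 24-bit virtual address        */
       unsigned int offset = va & 0x3FF; /* low 10 bits: offset within a 1KB page        */
       unsigned int vpn    = va >> 10;   /* remaining 14 bits: virtual page number       */
       unsigned int index  = vpn & 0x7;  /* low 3 bits of the VPN index the 8-entry TLB  */
       unsigned int tag    = vpn >> 3;   /* upper 11 bits of the VPN form the TLB tag    */

       printf("VPN = 0x%X, TLB index = %u, TLB tag = 0x%X, page offset = 0x%X\n",
              vpn, index, tag, offset);
       return 0;
   }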

Question 4 (15 points):

Consider a 2GHz processor with separate instruction and data caches. We are focusing on improving the data cache performance, assuming that the instruction cache achieves a 100% hit rate while our data cache achieves only a 90% hit rate. The processor is one in which, on average, an instruction takes 3.5 cycles to execute if the cache hit rate is 100% (that is, CPI = 3.5).

(a) If the memory access latency is 40 ns, what is the data cache miss penalty (in cycles)?

(b) Assuming that the cache miss penalty is x cycles (the number computed in part a), and that 25% of all instructions are memory instructions, write the formula that you would use to compute the CPI (write the formula in terms of x; do not evaluate it).

(c) To improve the overall performance, we decided to introduce a second level cache (L2). Its access latency is five cycles. The L2 cache hit rate is measured to be 50% (i.e., the probability of finding a block in L2 when it misses in L1). Write the formula that you would use to compute the CPI in this case (keep the formula in terms of x; do not evaluate it).

(d) If the memory address (byte addressable) consists of 32 bits, b31, ..., b0, and the data cache is a 64KB, 4-way set-associative cache with a 64B block size, specify the bits that should be used to index the cache and those that should be used as the tag.

   Bits for index are: b__, ..., b__
   Bits for tag are:   b__, ..., b__
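Reference note (not part of the original exam, and not the official solution): one common textbook way to write the quantities asked for in parts (a) to (c), with the stated parameters plugged in as C expressions. The part (c) accounting (the L2 latency is paid on every L1 miss, and only the L2 misses still pay the full memory penalty x) is an assumption about the intended model:

   #include <stdio.h>

   int main(void) {
       double clock_ghz  = 2.0;    /* 2 GHz processor                              */
       double mem_ns     = 40.0;   /* memory access latency in nanoseconds         */
       double base_cpi   = 3.5;    /* CPI when every access hits in the cache      */
       double mem_frac   = 0.25;   /* fraction of instructions that access memory  */
       double l1_miss    = 0.10;   /* data cache miss rate (90% hit rate)          */
       double l2_latency = 5.0;    /* L2 access latency in cycles                  */
       double l2_hit     = 0.50;   /* L2 hit rate, given an L1 miss                */

       double x = mem_ns * clock_ghz;   /* part (a): ns times cycles-per-ns gives the miss penalty in cycles */
       double cpi_b = base_cpi + mem_frac * l1_miss * x;                                   /* part (b) */
       double cpi_c = base_cpi + mem_frac * l1_miss * (l2_latency + (1.0 - l2_hit) * x);   /* part (c), under the stated assumption */

       printf("x = %.0f cycles, CPI(b) = %.3f, CPI(c) = %.3f\n", x, cpi_b, cpi_c);
       return 0;
   }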

Question 5: (10 points)

(a) Consider a quad processor system where each processor (P0, P1, P2 and P3) has a private direct-mapped cache and all cores share a single address space (shared memory system). Assume that memory address A maps to some cache block and that this block is initially invalid in all the caches. Specify the state (I = invalid, S = shared, or E = exclusive/modified) of the block in the cache of each processor after each of the following memory operations:

                 In P0    In P1    In P2    In P3
   Initially       I        I        I        I
   P1 reads A
   P2 reads A
   P3 writes A
   P0 writes A
   P1 reads A
   P0 writes A

(b) Let the distance between two nodes be defined as the number of links between the nodes. What is the diameter and what is the bisection width of each of the networks shown below? (The two network figures are not reproduced in this transcription.)

   Network 1: Diameter = ____    Bisection width = ____
   Network 2: Diameter = ____    Bisection width = ____
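Reference note (not part of the original exam): a toy model of the invalidation-based behavior that part (a) assumes, written as a short C program. The protocol here (a read fetches a shared copy and downgrades a remote exclusive copy; a write invalidates every other copy and leaves the writer exclusive) and the sample operation sequence are my own simplifications, not course code:

   #include <stdio.h>

   enum state { I, S, E };                          /* invalid, shared, exclusive/modified */
   static const char *name[] = { "I", "S", "E" };
   static enum state cache[4] = { I, I, I, I };     /* state of block A in P0..P3 */

   static void read_A(int p) {
       for (int q = 0; q < 4; q++)                  /* a remote exclusive copy is downgraded */
           if (q != p && cache[q] == E) cache[q] = S;
       if (cache[p] == I) cache[p] = S;             /* the reader obtains a shared copy */
   }

   static void write_A(int p) {
       for (int q = 0; q < 4; q++) cache[q] = I;    /* invalidate every other copy */
       cache[p] = E;                                /* the writer holds it exclusively */
   }

   static void show(const char *op) {
       printf("%-12s", op);
       for (int q = 0; q < 4; q++) printf("  %s", name[cache[q]]);
       printf("\n");
   }

   int main(void) {
       show("Initially");
       read_A(1);  show("P1 reads A");              /* a hypothetical sample sequence,   */
       write_A(2); show("P2 writes A");             /* not the sequence from part (a)    */
       read_A(0);  show("P0 reads A");
       return 0;
   }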

Question 6: (10 points)

(a) Consider the following skeleton for a CUDA program:

   __global__ void my_kernel (int *a, int *b) {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* note that blockDim.x = 2 */
       a[idx+1] = threadIdx.x;
       b[idx] = blockIdx.x;
   }

   void main() {
       my_kernel<<< 4, 2 >>> (a, b);   /* four blocks, each containing 2 threads */
   }

Assuming that arrays a[] and b[] are allocated in the global memory and are initialized to -1, what will be the values of their elements after my_kernel finishes execution?

   a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]  a[8]
   b[0]  b[1]  b[2]  b[3]  b[4]  b[5]  b[6]  b[7]  b[8]

(b) For each of the following statements, circle the correct answer.

- The __syncthreads() CUDA barrier synchronizes all the threads of a:
  (i) kernel    (ii) thread block    (iii) warp
- In a CUDA kernel, it is more efficient to use a block size of:
  (i) 64 threads    (ii) 16 threads    (iii) 8 threads
- In a GPU, the shared memory is shared among all the threads of a:
  (i) kernel    (ii) thread block    (iii) warp
- cudaMalloc() is used to dynamically allocate space in:
  (i) shared memory    (ii) global memory    (iii) CPU memory
- All the threads in a thread block execute on the same:
  (i) SM (streaming multiprocessor)    (ii) SP (streaming processor)    (iii) device
- The number of registers in an SM determines the maximum number of ______ allowed on the SM:
  (i) warps    (ii) thread blocks    (iii) threads
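Reference note (not part of the original exam): the skeleton omits how a[] and b[] are set up. Below is a minimal sketch of host code that allocates them in global memory with cudaMalloc, initializes them to -1, launches the kernel exactly as in the skeleton, and copies the results back. The array size of 9 and every identifier outside the exam's skeleton are my own assumptions:

   #include <cstdio>
   #include <cuda_runtime.h>

   __global__ void my_kernel(int *a, int *b) {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* blockDim.x = 2 */
       a[idx + 1] = threadIdx.x;
       b[idx] = blockIdx.x;
   }

   int main() {
       const int n = 9;                    /* 9 elements, to match the answer table above */
       int h_a[n], h_b[n];
       for (int i = 0; i < n; i++) h_a[i] = h_b[i] = -1;   /* initialize to -1 */

       int *d_a, *d_b;
       cudaMalloc((void **)&d_a, n * sizeof(int));         /* global-memory allocations */
       cudaMalloc((void **)&d_b, n * sizeof(int));
       cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);

       my_kernel<<<4, 2>>>(d_a, d_b);                      /* four blocks of 2 threads */

       cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
       cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);
       for (int i = 0; i < n; i++) printf("a[%d]=%d  b[%d]=%d\n", i, h_a[i], i, h_b[i]);

       cudaFree(d_a); cudaFree(d_b);
       return 0;
   }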

Question 7: (20 points)

Consider the following Pthreads program which computes the sum of N numbers, using P threads. The sum will be computed in sum[0]:

   #include <pthread.h>

   #define P 8        /* P is a power of 2 */
   #define N 1024     /* N is a power of 2 */

   void *compute_sum (void *);

   struct arg_to_thread { int id; };

   float A[N];        /* assume that A[] is initialized to some values */
   float sum[P];

   int main (int argc, char *argv[])
   {
       int i;
       pthread_t p_threads[P];
       pthread_attr_t attr;
       struct arg_to_thread my_arg[P];

       pthread_attr_init(&attr);
       for (i = 0; i < P; i++) {
           my_arg[i].id = i;
           pthread_create(&p_threads[i], &attr, compute_sum, (void *) &my_arg[i]);
       }
       for (i = 0; i < P; i++)
           pthread_join(p_threads[i], NULL);
       return 0;
   }

   void *compute_sum (void *arg)
   {
       struct arg_to_thread *local_arg;
       int i, half, idx;

       local_arg = arg;
       idx = (*local_arg).id;                    /* idx is the id given to the thread */

       for (i = ____ ; i < ____ ; i++)           /* compute a partial sum */    /* line 1 */
           sum[____] = sum[____] + A[i];                                        /* line 2 */
       half = P/2;                                                              /* line 3 */
       for (i = ____ ; ____ ; i++) {             /* compute the global sum */   /* line 4 */
           if ( ____ ) {                                                        /* line 5 */
               sum[____] = sum[____] + sum[____];                               /* line 6 */
           }
           half = ____;                                                         /* line 7 */
       }                                                                        /* line 8 */
       return NULL;
   }

(a) Ignoring the need for synchronization, complete the lines labeled /*line 1*/ to /*line 7*/ in the function compute_sum such that the sum of the 1024 numbers is computed in sum[0]. The sum is computed by forking 8 threads, each of which computes the sum of 128 numbers (in parallel). The 8 partial sums are then added together using a tree reduction algorithm.

(b) Indicate after which line(s) you should add barrier synchronization and explain why these barriers are needed.

(c) What is the speedup and efficiency obtained from the parallel execution when N = 1024 and P = 8?

   Serial execution time = ____ steps
   Execution time using 8 processors = ____ steps
   Speedup = ____
   Efficiency = ____

(d) Assuming that you can use as many processors as you want, what is the maximum speedup that can be obtained to solve the problem for N = 1024?

(e) Amdahl's law indicates that if f is the fraction of the task that has to execute serially, then the maximum speedup that can be obtained is 1/f. Why can't we apply Amdahl's law to obtain the answer to part (d)?
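Reference note (not part of the original exam): part (b) asks where barrier synchronization is needed, but the program above does not show a barrier primitive. For reference, this is how a POSIX barrier (pthread_barrier_t) is declared and used with Pthreads; the program below is my own minimal sketch, not code from the course:

   #include <pthread.h>
   #include <stdio.h>

   #define T 4                               /* number of participating threads (example) */
   static pthread_barrier_t barrier;

   static void *worker(void *arg) {
       long id = (long) arg;
       printf("thread %ld: before barrier\n", id);
       pthread_barrier_wait(&barrier);       /* no thread passes until all T have arrived */
       printf("thread %ld: after barrier\n", id);
       return NULL;
   }

   int main(void) {
       pthread_t tid[T];
       pthread_barrier_init(&barrier, NULL, T);   /* barrier for T threads */
       for (long i = 0; i < T; i++)
           pthread_create(&tid[i], NULL, worker, (void *) i);
       for (int i = 0; i < T; i++)
           pthread_join(tid[i], NULL);
       pthread_barrier_destroy(&barrier);
       return 0;
   }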

Question 8 (10 points):

Consider a superscalar architecture with two execution units, one for load/store instructions and one for other instructions. The following two tables indicate the order of execution of two threads, A and B, and the latencies mandated by dependences between instructions. For example, A2 and A3 should execute after A1, there should be at least four cycles between the execution of A3 and A5, and at least three cycles between A4 and A5. In other words, the tables indicate the schedule when the instructions of each thread are executed with no multithreading. Note that an instruction that executes on one unit (for example A1 on the load/store unit) cannot execute on the other. When two instructions are listed for the same cycle, they occupy the two different units.

   Thread A:
   t:     A1 (load/store unit)
   t+1:   A3, A2
   t+2:   A4
   t+3 to t+5:  (no instructions issued)
   t+6:   A5
   t+7:   A6, A7

   Thread B:
   t:     B1, B2
   t+1:   B3
   t+2:   B4, B5
   t+3:   B6
   t+4:   B7
   t+5:   B8
   t+6, t+7:  (no instructions issued)

Show the execution schedule for the two threads on the two units assuming:
(a) Coarse-grain multithreading
(b) Fine-grain multithreading
(c) Simultaneous multithreading (with priority given to thread A)

(Three blank answer tables, each with rows t through t+14 and a column for each of the two units, were provided for the answers to parts a, b and c.)