CSE-160 (Winter 2017, Kesden) Practice Midterm Exam


Full PID:

1. Threads, Concurrency

Consider the code below:

    volatile int count = 0; // volatile just keeps count in mem vs register

    void *count_up(void *arg) {
        for (int i = 0; i < 100000; i++)
            count++;
        return NULL;
    }

    int main() {
        pthread_t tid1, tid2;

        pthread_create(&tid1, NULL, count_up, NULL);
        pthread_create(&tid2, NULL, count_up, NULL);
        pthread_join(tid1, NULL);
        pthread_join(tid2, NULL);

        printf("count: %d\n", count);
        return 0;
    }

The above code is incorrect. It has a data race.

A. What could be the symptom(s)?
B. What is the critical resource?
C. Protect the critical section by adding a simple mutex, including declaration, initialization, etc.
D. Protect the critical section by adding a lock_guard, including declaration, initialization, etc.
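As a point of comparison for part C, here is a minimal sketch of a mutex-protected counter using POSIX threads. The names `shared_count`, `count_lock`, `locked_increment`, and `run_locked_counter` are hypothetical and chosen for illustration; this is one possible shape under those assumptions, not the expected exam answer.

```c
#include <pthread.h>

#define INCREMENTS_PER_THREAD 100000

/* Hypothetical names: a shared counter and the mutex that guards it. */
static int shared_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: every increment happens while holding count_lock. */
static void *locked_increment(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&count_lock);
        shared_count++;              /* the critical section */
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

/* Runs two incrementing threads and returns the final count. */
int run_locked_counter(void) {
    pthread_t tid1, tid2;

    shared_count = 0;
    pthread_create(&tid1, NULL, locked_increment, NULL);
    pthread_create(&tid2, NULL, locked_increment, NULL);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    return shared_count;
}
```

Because the read-modify-write of the counter is serialized by the mutex, no increment can be lost. Locking inside the loop keeps the critical section small, at the cost of one lock/unlock pair per increment.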

2. Synchronization Primitives

A. Under what circumstances should one use a barrier instead of a mutex?
B. Please write a short code segment that illustrates your answer to part (A) above.
C. Under what circumstances should one use a condition variable instead of just a mutex?
D. Please write a short code segment that illustrates your answer to part (C) above.
E. Why must the condition variable's wait() operation accept a mutex? What does it protect?
F. Why must the condition variable's signal() operation accept a mutex? What purpose does it serve?
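For reference when thinking about parts C through F, the following is a minimal sketch of the usual condition-variable pattern in POSIX threads. All names (`ready_lock`, `ready_cond`, `data_ready`, `shared_value`, `producer`, `wait_for_value`) are hypothetical. Note that in the pthreads API, `pthread_cond_wait` takes both the condition variable and the mutex, while `pthread_cond_signal` takes only the condition variable; the sketch signals while holding the mutex, which is a common convention rather than an API requirement.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state: a flag (the predicate) and the data it guards. */
static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static bool data_ready = false;
static int shared_value = 0;

/* Producer: publishes a value, sets the predicate, then signals. */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&ready_lock);
    shared_value = 42;
    data_ready = true;
    pthread_cond_signal(&ready_cond);
    pthread_mutex_unlock(&ready_lock);
    return NULL;
}

/* Consumer: sleeps until the predicate holds, then reads the value. */
int wait_for_value(void) {
    pthread_t tid;
    int result;

    data_ready = false;              /* reset before the producer exists */
    pthread_create(&tid, NULL, producer, NULL);

    pthread_mutex_lock(&ready_lock);
    while (!data_ready)              /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cond, &ready_lock);
    result = shared_value;
    pthread_mutex_unlock(&ready_lock);

    pthread_join(tid, NULL);
    return result;
}
```

The wait call atomically releases the mutex and blocks, then reacquires the mutex before returning, so the check of `data_ready` and the decision to sleep cannot be interleaved with the producer's update.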

3. Parallel Speedup

A. Assuming that a program's code is 75% parallelizable and 25% necessarily-serial, what is the maximum speedup that can be achieved by adding threads on a 4 core system? Show your work.
B. Assuming a program is well designed and well written and has the following running times, what percentage of it is parallelizable? Show your work.

    1-thread/1-core:    16 seconds
    2-threads/2-cores:  12 seconds
    4-threads/4-cores:  10 seconds

C. What is the maximum speedup that can be achieved in a program for which 25% of the code is parallelizable? Show your work.
D. If parallelizing an algorithm results in a super-linear speedup, what does this suggest?
E. Consider Gustafson's observation. How can increasing the amount of data allow us to defeat Amdahl's Law?
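The questions above rest on Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n is the number of workers. A minimal sketch of that formula follows; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and the usage note deliberately uses numbers that do not appear in the exam.

```c
/* Amdahl's Law: speedup with `workers` workers when `parallel_fraction`
 * of the single-worker running time can be parallelized. */
double amdahl_speedup(double parallel_fraction, int workers) {
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / workers);
}

/* Upper bound on speedup as the number of workers grows without limit:
 * only the serial fraction remains. Assumes parallel_fraction < 1. */
double amdahl_limit(double parallel_fraction) {
    return 1.0 / (1.0 - parallel_fraction);
}
```

For example, with half of the work parallelizable, `amdahl_speedup(0.5, 2)` is 1 / (0.5 + 0.25), about 1.33, and `amdahl_limit(0.5)` is 2: no number of workers can do better than doubling the speed.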

4. Working Sets and Locality

A. For each type of cache miss, please define it and explain how it can be mitigated, if possible.
   a. Cold/Compulsory
   b. Conflict
   c. Capacity
B. Write a simple for-loop that exhibits good spatial locality, but not good temporal locality.
C. Does the following for-loop exhibit good spatial locality, temporal locality, neither, or both? Why?

    // ints a and b are declared and initialized elsewhere
    // int[16] array is declared and initialized elsewhere
    for (int index=0; index < 100; index++)
        array[index] += (a + b);

D. Assuming that the values shown below are ints, and that an int is 4 bytes, what is the size of the working set for the loop above? Explain.

5. Memory Hierarchy

Assume the following memory access times:

    Registers:  1 cycle,    0.5ns
    L1 Cache:   4 cycles,   2ns
    L2 Cache:   8 cycles,   4ns
    Main Mem:   160 cycles, 80ns

Consider a system where 1 in 50 variable accesses require fetching from memory into registers, a 95% hit rate at L1 and a 99% hit rate at L2, and in which memory cache accesses are not performed in parallel.

A. What is the effective memory access time of this system? (Just set up the equation, no need to evaluate. It can be in cycles or seconds)

6. OpenMP

A. You are reading code parallelized with OpenMP #pragmas. Please explain the relationship between the scope in which a variable is declared upon whether or not it is shared. Include both loop and non-loop cases.
B. Under what circumstances is it safe to remove a nowait on the first of two back-to-back loops?
C. Why might a nowait be inappropriate for the last of two loops?
D. Consider the clause, schedule(dynamic, 2) operating upon a loop with 4 threads and 16 iterations. Which threads will perform each iteration?
E. Consider the clause, schedule(static, 1) operating upon a loop with 4 threads and 16 iterations. What are the potential advantages and disadvantages of this configuration as compared to the one described in (D) above?

7. Caching #1 (Credit: CMU)

Consider the following matrix transpose function:

    typedef int array[2][2];

    void transpose(array dst, array src) {
        int i, j;
        for (j = 0; j < 2; j++) {
            for (i = 0; i < 2; i++) {
                dst[i][j] = src[j][i];
            }
        }
    }

Running on a hypothetical machine with the following properties:

- sizeof(int) == 4.
- The src array starts at address 0 and the dst array starts at address 16 (decimal).
- There is a single L1 data cache that is direct mapped and write-allocate, with a block size of 8 bytes.
- Accesses to the src and dst arrays are the only sources of read and write accesses to the cache, respectively.

Suppose the cache has a total size of 16 data bytes (i.e., the block size times the number of sets is 16 bytes) and that the cache is initially empty.

A. How many bits are used for each of the

    Index:
    Offset:

B. For each row and col, indicate whether each access to src[row][col] and dst[row][col] is a hit (h) or a miss (m). For example, reading src[0][0] is a miss and writing dst[0][0] is also a miss.

            src array                 dst array
            col 0   col 1             col 0   col 1
    row 0   m                 row 0   m
    row 1                     row 1

C. Repeat part B for a cache with a total size of 32 data bytes.

            src array                 dst array
            col 0   col 1             col 0   col 1
    row 0   m                 row 0   m
    row 1                     row 1
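Both caching questions depend on splitting an address into tag, set index, and block offset. The following is a minimal sketch of that split for a direct-mapped cache; the struct `cache_address` and the function `split_address` are hypothetical names, and the parameters in the usage note are illustrative values that are not taken from the exam.

```c
#include <stdint.h>

/* Hypothetical container for the three fields of a cache address. */
typedef struct {
    uint32_t tag;
    uint32_t set_index;
    uint32_t block_offset;
} cache_address;

/* Splits addr for a direct-mapped cache with num_sets sets of
 * block_size bytes each. Both sizes are assumed to be nonzero powers
 * of two, so each field corresponds to a contiguous group of bits. */
cache_address split_address(uint32_t addr, uint32_t block_size,
                            uint32_t num_sets) {
    cache_address parts;
    parts.block_offset = addr % block_size;              /* low bits    */
    parts.set_index = (addr / block_size) % num_sets;    /* middle bits */
    parts.tag = addr / (block_size * num_sets);          /* high bits   */
    return parts;
}
```

For example, with 4-byte blocks and 8 sets, address 0x5A (90 decimal) splits into block offset 2, set index 6, and tag 2. Two addresses collide in a direct-mapped cache exactly when their set indices match and their tags differ.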

8. Caching #2 (Credit: 15 CMU)

Consider a computer with an 8-bit address space and a direct-mapped 64-byte data cache with byte cache blocks.

A. The boxes below represent the bit-format of an address. In each box, indicate which field that bit represents (it is possible that a field does not exist) by labeling them as follows:

    B: Block Offset
    S: Set Index
    T: Cache Tag

B. The table below shows a trace of load addresses accessed in the data cache. Assume the cache is initially empty. For each row in the table, please complete the two rightmost columns, indicating (i) the set number (in decimal notation) for that particular load, and (ii) whether that load hits (H) or misses (M) in the cache (circle either H or M accordingly).

    Load No.   Hex Address   Binary Address   Set Number? (in Decimal)   Hit or Miss? (Circle one)
    1                                                                    H   M
    2          b                                                         H   M
    3                                                                    H   M
    4          f                                                         H   M
    5          b                                                         H   M
    6                                                                    H   M
    7          d                                                         H   M
    8          b                                                         H   M
    9                                                                    H   M
    10                                                                   H   M

8. Caching #2, cont.

C. For the trace of load addresses shown in Part B, below is a list of possible final states for the cache, showing the hex value of the tag for each cache block in each set. Assume that initially all cache blocks are invalid (represented by X).

    (a) (b) (c) (d) (e) (f) (g)
    0 X 2 1 X X X 1 X 2 X 0 X 3 X 1 X 2 3 X 0 X X 1 X X 0 X 1 X X X 2 1 X X 4 1 X 2 X 1 X 2 3

Which of the choices above is the correct final state of the cache?

9. Snooping Caches

A. Consider a configuration with a per-core L1 cache and a shared L2 cache, configured such that the L1 caches are write-allocate snooping caches. Should the L1 caches be write-back, or write-through? Explain.
B. Consider the caching configuration described in part (A) above. What are the relative advantages and disadvantages of configuring the L1 caches as a write-update cache vs a write-invalidate cache?

10. False Sharing

A. Consider the code segments below running on two threads, tid=0 and tid=1. Which is most likely to result in false sharing? Explain.

   A.
    for (int index=tid; index<array_length; index += 2)
        array[index] = 2 * array[index];

   B.
    for (int index=tid*array_length/2; index<array_length/(2-tid); index++)
        array[index] = 2 * array[index];
