Exam Parallel Computer Systems

Exam Parallel Computer Systems
Academic Year 2014-2015
Friday, January 9, 2015, 8:30-12:00
Prof. L. Eeckhout

Some remarks:
- Fill in your name on every page.
- Write down the answers in the boxes; answers are typically short and should fit in these boxes.
- You can use course notes, books, exercises, etc. You must, however, turn off your laptop and cell/smart phones.
- You may want to complete the exam out of order.

Good luck!

Q1 [4pts] The MESIF cache coherence protocol is a protocol developed by Intel and is defined as follows. The M, E, S, and I states are the same as in the MESI protocol. The Forward (F) state is a specialized form of the S state and indicates that a cache should act as the designated responder for any requests for the given line. The protocol ensures that, if any cache holds a line in the S state, at most one (other) cache holds it in the F state.

In a system of caches employing the MESI protocol, a cache line request received by multiple caches holding the line in the S state is serviced inefficiently: it may either be satisfied from (slow) main memory, or all the sharing caches may respond, bombarding the requestor with redundant responses. In a system of caches employing the MESIF protocol, only the cache holding the line in the F state responds to a cache line request. This allows the requestor to receive a copy at cache-to-cache speeds, while keeping the number of multicast packets as low as the network topology allows.

Because a cache may unilaterally discard (invalidate) a line in the S or F state, it is possible that no cache has a copy in the F state, even though copies in the S state exist. In this case, a request for the line is satisfied (less efficiently, but still correctly) from main memory. To minimize the chance of the F line being discarded due to lack of interest, the most recent requestor of a line is assigned the F state; when a cache in the F state responds, it gives up the F state to the new cache.

Complete the tables below and fill in all possible transitions, and the reasons for these transitions, for the M, E, I, and F states. Follow the example below for the S state.

From  To  Reason
S     M   Local write to a cache line in S requires exclusive ownership to write
S     I   Cache line gets evicted, or is invalidated upon a remote write

From  To  Reason
M

From  To  Reason
E

From  To  Reason
I

From  To  Reason
F
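For intuition when filling in the tables above, here is a minimal, simplified C sketch of how such transitions might be encoded. The event set, and the choice that a responder in M, E, or F drops to S on a remote read, are illustrative assumptions distilled from the protocol description above, not Intel's actual implementation:

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, FORWARD } mesif_t;
typedef enum { LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE, EVICT } event_t;

mesif_t next_state(mesif_t s, event_t e) {
    switch (e) {
    case LOCAL_WRITE:
        return MODIFIED;          /* writing requires/obtains exclusive ownership */
    case REMOTE_READ:             /* another cache requests the line */
        if (s == MODIFIED || s == EXCLUSIVE || s == FORWARD)
            return SHARED;        /* the responder hands the F state to the requestor */
        return s;
    case REMOTE_WRITE:
        return INVALID;           /* invalidation on another core's write */
    case EVICT:
        return INVALID;           /* S and F lines may be dropped silently */
    }
    return s;
}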

Q2 [1pt] Consider the multi-threaded program below. Variables A, B, and C are shared memory variables and are initialized to zero.

T1      T2      T3                T4
A=1;    A=2;    while(B!=1){;}    while(B!=1){;}
B=1;    C=1;    while(C!=1){;}    while(C!=1){;}
                tmp1=A;           tmp2=A;

This program is executed on a quad-core processor with a MESI cache coherence protocol and a TSO memory consistency model. Is it possible for threads T3 and T4 to read different values for variable A? Motivate your answer.
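A runnable pthreads rendering of the same litmus test may help when reasoning about it. This is a sketch under the assumption that plain volatile accesses approximate the exam's unsynchronized loads and stores (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

volatile int A = 0, B = 0, C = 0;
volatile int tmp1, tmp2;

void *t1(void *arg) { (void)arg; A = 1; B = 1; return NULL; }
void *t2(void *arg) { (void)arg; A = 2; C = 1; return NULL; }
void *t3(void *arg) {
    (void)arg;
    while (B != 1) ;   /* spin until T1's write to B is visible */
    while (C != 1) ;   /* spin until T2's write to C is visible */
    tmp1 = A;
    return NULL;
}
void *t4(void *arg) {
    (void)arg;
    while (B != 1) ;
    while (C != 1) ;
    tmp2 = A;
    return NULL;
}

int main(void) {
    pthread_t th[4];
    void *(*body[4])(void *) = { t1, t2, t3, t4 };
    for (int i = 0; i < 4; i++) pthread_create(&th[i], NULL, body[i], NULL);
    for (int i = 0; i < 4; i++) pthread_join(th[i], NULL);
    printf("tmp1=%d tmp2=%d\n", tmp1, tmp2);
    return 0;
}

The question above is precisely whether any run of this program can ever print two different values read from A.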

Q3 [2pts] Consider the multi-threaded program below. A and B are shared memory variables and are initialized to zero.

T1        T2
A=2;      A=1;
B=A;      B=A;
B++;

What are the possible outcomes for A and B on an SC system? Provide a possible interleaving for every possible outcome.

Q4 [2pts] Consider a computer program that we compile and run on two processors. Processor A is an out-of-order core running at 1 GHz, consuming 40 W, with a chip area of 40 mm². Processor B is an in-order core running at 2 GHz, consuming 8 W, with a chip area of 10 mm². We use a different compiler for each processor core. When run on processor A, the program executes 10B instructions, taking 5B cycles in total; on processor B, the program executes 9B instructions, taking 9B cycles in total.

(a) Which processor delivers the best performance (shortest execution time)? Motivate your answer.

(b) For which processor is the temperature likely to be the highest? Motivate your answer.

(c) Increasing the clock frequency of the out-of-order core to 2 GHz will improve performance by a factor ___ 2. Fill in the blank with <, =, or >. Motivate your answer.

(d) What is the impact on dynamic power consumption? Do you expect a linear (2×), quadratic (4×), or cubic (8×) increase when doubling the clock frequency? Motivate your answer.
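As a reminder of the arithmetic this question relies on (an illustrative sketch, not part of the exam): execution time equals cycle count divided by clock frequency (equivalently, instruction count × CPI / frequency), and dynamic power scales roughly as P ∝ C·V²·f.

#include <stdio.h>

int main(void) {
    /* Execution time = total cycles / clock frequency */
    double time_a = 5e9 / 1e9;          /* Processor A: 5B cycles at 1 GHz */
    double time_b = 9e9 / 2e9;          /* Processor B: 9B cycles at 2 GHz */
    /* CPI = total cycles / instruction count */
    double cpi_a = 5.0 / 10.0;          /* out-of-order core */
    double cpi_b = 9.0 / 9.0;           /* in-order core */
    printf("A: %.1f s (CPI %.1f); B: %.1f s (CPI %.1f)\n",
           time_a, cpi_a, time_b, cpi_b);
    return 0;
}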

Q5 [1pt] Consider an FGMT (fine-grained multithreading) in-order processor with a RISC ISA with 32 architectural registers. The FGMT processor has four hardware thread contexts and 96 physical registers. Is this a meaningful design?

Q6 [2pts] Consider the following global branch sequence for branches A, B, C, D, and E. The second row in the table below denotes whether each branch is taken (T) or not-taken (N).

Branch:   A B E A C E A C D E A B E A C E A C D E A B
Outcome:  T T T N T T N N N T T T T N T T N N N T T T

How many history bits are needed for a PAp branch predictor to achieve perfect accuracy (no mispredictions)? Assume there is no aliasing among branches in any of the tables. Carefully motivate your answer.
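To recall the structure of a PAp predictor (a minimal C sketch with hypothetical sizes; HBITS is exactly the quantity the question asks you to determine): each static branch has a private history register, which indexes that branch's own pattern table of 2-bit saturating counters.

#include <stdint.h>

#define NBRANCH 32                 /* hypothetical number of tracked branches */
#define HBITS   4                  /* history bits per branch (placeholder)   */
#define NPAT    (1 << HBITS)

static uint8_t history[NBRANCH];          /* per-address history registers */
static uint8_t counters[NBRANCH][NPAT];   /* per-address pattern tables    */

int predict(int branch) {
    /* predict taken if the 2-bit counter is in state 2 or 3 */
    return counters[branch][history[branch]] >= 2;
}

void update(int branch, int taken) {
    uint8_t *c = &counters[branch][history[branch]];
    if (taken  && *c < 3) (*c)++;         /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;         /* saturate at 0 */
    /* shift the outcome into this branch's private history */
    history[branch] = (uint8_t)(((history[branch] << 1) | taken) & (NPAT - 1));
}

Replaying the given trace through predict/update for increasing HBITS is one way to check a paper answer.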

Q7 [4pts] Consider the following instruction sequence from the inner loop of the leslie3d SPEC CPU2006 benchmark. The first instruction (1) is a long-latency load (last-level cache miss) that writes into register xmm0. Instructions (2), (4), and (5) compute a new value for register rax to index memory via a second long-latency load (6).

(1) mov (r9+rax*8),xmm0
(2) mov esi,rax
(3) add xmm0,xmm0
(4) mul r8,rax
(5) add rdx,rax
(6) mul (r9+rax*8),xmm1

(a) Draw the data flow graph for the above code sequence.

(b) Fill in the issue times for all instructions in the table below when executed on a dual-issue stall-on-use in-order processor. Assume that all instructions take one cycle to execute; a memory access takes 100 cycles. Also compute the total execution time.

     Issue time
(1)  1
(2)
(3)
(4)
(5)
(6)

Total execution time =

(c) Fill in the issue times for all instructions in the table below when executed on a dual-issue out-of-order processor. Assume that all instructions take one cycle to execute; a memory access takes 100 cycles. Also compute the total execution time.

     Issue time
(1)  1
(2)
(3)
(4)
(5)
(6)

Total execution time =

(d) Is the out-of-order processor able to exploit MLP? Why, or why not?

Q8 [2pts] Please read the excerpt below from a scientific publication:

The number of active threads in a processor varies over time, and is often (much) smaller than the number of available hardware thread contexts. This observation has been made across different application domains. Desktop applications exhibit a limited amount of thread-level parallelism, with typically only 2 to 3 active threads. Datacenter servers are often underutilized and seldom operate near their maximum utilization; most of the time they operate between 10 and 50 percent of their maximum utilization level. Even parallel, multithreaded applications do not utilize all cores all the time: threads may be waiting because of synchronization primitives (locks, barriers, etc.) and may yield the processor to avoid active spinning. Finally, in a multi-tasking environment, jobs come and go, and hence the amount of available thread-level parallelism varies over time.

Workloads with dynamically varying active thread counts imply that multi-core chip designs should balance core count and per-core performance. A few high-performance cores are beneficial at low active thread counts, while a sea of energy-efficient cores is preferred at high active thread counts. The key question is which processor architecture is best able to deal with dynamically varying degrees of thread-level parallelism (TLP).

(a) How can a heterogeneous multi-core (consisting of a number of big, out-of-order cores along with a larger number of small, in-order cores) exploit variable TLP?

(b) How can a (homogeneous) multi-core with SMT out-of-order cores exploit variable TLP?

(c) Which processor architecture (heterogeneous multi-core vs. multi-core with SMT cores) performs best under variable-TLP workload conditions? For example, consider a heterogeneous multi-core with 2 big cores and 6 small cores versus a homogeneous multi-core with 4 big cores and 2 SMT threads per core (assuming equal area and power).

Q9 [2pts] Consider the DAXPY code example: Y(i) = A * X(i) + Y(i).

      LD    F0,a
      ADDI  R4,Rx,#512
Loop: LD    F2,0(Rx)
      MULTD F2,F0,F2
      LD    F4,0(Ry)
      ADDD  F4,F2,F4
      SD    F4,0(Ry)
      ADDI  Rx,Rx,#8
      ADDI  Ry,Ry,#8
      SUB   R20,R4,Rx
      BNZ   R20,Loop

(a) Under which conditions for arrays X and Y can an out-of-order processor benefit from load forwarding for this particular code sequence?

(b) Under which conditions for arrays X and Y can an out-of-order processor benefit from load bypassing for this particular code sequence? Explain how load bypassing enables exploiting more MLP.
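For reference, the C equivalent of the loop above (an illustrative sketch; the iteration count of 64 follows from the #512-byte bound with 8-byte double elements):

void daxpy(double a, const double *X, double *Y) {
    /* 512 bytes / 8 bytes per element = 64 iterations */
    for (int i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
}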
