CSC/ECE 506: Architecture of Parallel Computers
Sample Final Examination with Answers

This was a 180-minute open-book test. You were to answer five of the six questions. Each question was worth 20 points. If you answered all six questions, your five highest scores counted.

Question 1. (a) (10 points) Suppose node 5 (i.e., node 0101 in binary) in a hypercube (or n-cube) of dimension 4 (i.e., 16 nodes) is going to broadcast a message to all other nodes in the minimum possible time, given that a node can only send a message to one other node at a time. Show which nodes transmit the message to which other nodes at each step in the process. Use the notation i → j to indicate that node number i transmits to node number j, and use decimal notation (not binary) for node numbers. For example, at some step you might write: 4 → 12, 2 → 3, 7 → 6. Note that 6 steps are not necessarily needed.
Answer: Several answers are possible; the best idea is to broadcast in dimension order. By increasing dimension, this would be:
Step 1: 5 → 4.
Step 2: 5 → 7; 4 → 6.
Step 3: 5 → 1; 4 → 0; 7 → 3; 6 → 2.
Step 4: 5 → 13; 4 → 12; 7 → 15; 6 → 14; 1 → 9; 0 → 8; 3 → 11; 2 → 10.

(b) (6 points) Suppose that one of the links used in step 1 of your answer to (a) failed. Could you still perform the broadcast in the same number of steps? If so, redo part (a), showing how the broadcast can be completed in the same number of steps despite the failed link. If not, explain fully why not.
Answer: No. Since one node can send to only one other node per time step, in order to complete the broadcast in four time steps, the number of nodes that have the message must double at each time step. Consequently, each node that has the message must send it to another node in each time step. Therefore, the first node (5 in the example) must be able to send to four different nodes on the four time steps. That can't happen if the first node is connected to only three of its neighbors.
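The dimension-order broadcast in the answer above can be generated mechanically: at step k, every node that already holds the message forwards it across dimension k, so the informed set doubles each step. The following short Python sketch (an illustration added to this answer key, not part of the original exam) produces the schedule for root 5 in a 4-cube and checks that all 16 nodes are reached in 4 steps:

```python
# Dimension-order broadcast in a d-dimensional hypercube, from any root node.
# At step k (k = 0..d-1), every informed node sends to its neighbor across
# dimension k, i.e., the node whose number differs in bit k.

def hypercube_broadcast(root, dim):
    """Return, for each step, the list of (sender, receiver) pairs."""
    informed = {root}
    steps = []
    for k in range(dim):
        transfers = [(node, node ^ (1 << k)) for node in sorted(informed)]
        steps.append(transfers)
        informed |= {dst for _, dst in transfers}
    return steps

steps = hypercube_broadcast(5, 4)
for i, step in enumerate(steps, 1):
    print(f"Step {i}: " + "; ".join(f"{a} -> {b}" for a, b in step))
```

Because the informed set doubles at every step, ceil(log2 N) steps suffice for any root, which is why four steps are the minimum here.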
(c) (4 points) What is the smallest number of links that would have to fail to prevent a broadcast from being performed at all (regardless of how long it would take)? Why?
Answer: Four links. A broadcast can be performed unless one node is completely isolated. Since there are four links per node, four links would have to fail to isolate a node completely.

Question 2. (2/3 point per blank) Consider a system that has 3 processors with private write-back caches and uses the SSCI protocol, where each line holds a single integer value (i.e., a block size of 4 bytes).

Memory contents:
  Variable   Value
  X          4
  Y          5
A blank entry indicates that we don't know what's in a particular cache line. Initially, the cache for each processor is empty. Fill in the blank spaces. (Each row shows the state of the line for the variable just accessed.)

Operation    |        P1         |        P2         |        P3         |    Memory
             | Value St Prv Nxt  | Value St Prv Nxt  | Value St Prv Nxt  | Value St  Head
P1: read X   |   4   E   0   0   |                   |                   |   4   EM   1
P1: X = 6    |   6   M   0   0   |                   |                   |   4   EM   1
P2: read X   |   6   S   2   0   |   6   S   0   1   |                   |   6   S    2
P3: Y = 7    |                   |                   |   7   M   0   0   |   5   EM   3
P2: read Y   |                   |   7   S   0   3   |   7   S   2   0   |   7   S    2
P3: Y = 8    |                   |   7   I   0   3   |   8   M   0   0   |   7   EM   3
P2: X = 9    |   6   I   2   0   |   9   M   0   0   |                   |   6   EM   2
P3: X = 10   |   6   I   2   0   |   9   I   0   0   |  10   M   0   0   |  10   EM   3
P3: X = 2    |   6   I   2   0   |   9   I   0   0   |   2   M   0   0   |  10   EM   3

Question 3. (2 points for part (a); 3 points for all other parts) This question concerns the organization of a memory-based cache-directory scheme. Note: In doing the calculations below, be careful not to confuse bytes with bits! (1 byte = 2^3 bits.)
Suppose we have a multiprocessor with 1024 nodes, each of which has 4 GB (= 2^32 bytes) of main memory. Suppose that a cache line contains 32 bytes.

(a) How many cache blocks are there per node?
Answer: 2^32 / 2^5 = 2^27, or 128 "megablocks."

(b) For parts (b) to (e), assume that the directory is kept in main memory. If we could use the full bit-vector approach for organizing the cache directory, what fraction of main memory would have to be devoted to cache directories?
Answer: For each block, we would need 1024 bits to record which of the 1024 nodes contained it. This means 2^27 blocks × 2^10 bits = 2^37 bits = 2^34 bytes. Whoops: cache directories would consume four times as much memory as we have!

(c) Assume that the average block is cached in 2 nodes. If we used pointers instead of the full bit-vector approach, what fraction of memory would be devoted to the pointers? (Answer the fraction of memory that would be devoted to pointers alone, not including other data structures that would be needed to keep track of the sharers.)
Answer: Each pointer is 10 bits long. Each of the 2^27 blocks requires two pointers, on average. Therefore, there will be 2 × 10 × 2^27 = 1.25 × 2^31 bits = 1.25 × 2^28 bytes devoted to pointers, out of 2^32 bytes altogether. Thus, 5/4 × 1/16 = 5/64 of main memory will be devoted to pointers.

(d) Is the assumption of part (c) realistic? Explain.
Answer: No, it is not realistic. If the average block were cached at 2 nodes, then there would have to be twice as much cache memory as main memory in the system! Therefore, the answer of part (c) is a gross overestimate of the amount of memory that would be devoted to pointers.

(e) Suppose that multicore nodes are used to save memory in the full bit-vector approach. Under the assumptions of part (a), how many cores would be needed per node so that less than 1% of main memory is devoted to the directory?
Answer: Each 32-byte block holds 32 × 8 = 2^8 bits of data, and its bit vector holds one bit per node. With 1024 single-core nodes, that is 2^10 / 2^8 = 4 times as much directory as data. Each time the number of cores per node doubles, the number of nodes, and hence the width of each bit vector, is halved, so the fraction is 4/c for c cores per node. To get 4/c < 1/100, we need c > 400, so the answer is 512 cores per node.

(f) Another approach is to use a cache-based scheme, like SSCI does. Suppose that the pointers point to individual processors that are caching a block (not to multicore nodes). In a 1024-processor system with 32-byte lines, what fraction of cache bits are devoted to pointers? (Again, consider only pointers and information bits, not state bits, etc.)
Answer: Each cache line contains 2 × 10 = 20 bits of pointers (a forward and a backward pointer, as in Question 2) and 32 × 8 = 256 information bits. So pointers occupy 20/256 ≈ 0.078 of the cache. And that's with 32-byte lines, which are about ¼ the size of the average cache line.

(g) Which approach, multicore nodes or cache-based directories, is more scalable and why?
Answer: The cache-based approach, because a constant fraction of the cache is used, regardless of the number of processors. With the multicore approach, as the number of processors rises, either more memory is devoted to bit vectors, or the number of cores per chip must keep increasing.

Question 4. (a) (3 points) In the MSI coherence protocol, when a block transitions from state M to state S, a flush is induced. Suppose the block were not flushed to main memory at this time.
Would the protocol still work? Would it still be as efficient?
Answer: It wouldn't necessarily be incorrect. However, blocks could now be in state S without main memory being up to date. This means that if another processor read the block, it would have to be supplied by a cache-to-cache transfer (this typically implies an owner cache, something the MSI protocol does not provide). And when blocks in state S were purged from the cache, they would have to be written back to main memory if dirty. A block, once dirtied, would have to be written back at least once, and possibly several times (if it ended up being shared by several caches). So there might be many more writes if the flush were not performed when it is. On the other hand, if a block transitions between states S and M several times, several flushes would be avoided. So it depends on the relative frequency of these two conditions (dirty blocks being shared vs. multiple M→S→M transitions).

(b) (3 points) In the MESI protocol, when a block transitions from state E to state I, the diagram shows it being flushed. Where is it flushed to? Why is this necessary?
Answer: If cache-to-cache sharing is in use, it is flushed out on the bus so that it can be picked up by another processor (e.g., if another processor is trying to write it). If cache-to-cache sharing is not in use, main memory can supply the data. (Note that this is indicated as Flush′ in the diagram instead of Flush.)

(c) (3 points) In the Dragon protocol, a flush is used only when a bus read comes in for a block in M or Sm state. Where is data flushed to in this case and why?
Answer: It is flushed onto the bus so that another processor may pick it up. The processor picking it up will hold the block in state Sc. If it were not flushed, the requesting processor would have nowhere to get up-to-date data from, since main memory may be out of date.

The rest of the question relates to this situation: Assume that in a four-processor shared-memory multiprocessor with private, write-back caches, the following sequence of reads and writes takes place. Assume that location 100 is not in any cache at the start of the sequence.
1. P0 writes location 100.
2. P1 reads location 100.
3. P2 reads location 100.
4. P3 writes location 100.
5. P3 writes location 100.
Legend: Px means processor x's cache. M means main memory.

(d) (4 points) After line 2 is executed, what will be the state of memory location 100 in each of the caches?
(e) (3 points) After line 4 is executed, what will be the state of memory location 100 in each of the caches?
Answer:
        Dragon                 Directory-based (FBV)
        P0   P1   P2   P3      P0   P1   P2   P3   M
(d)     Sm   Sc   --   --      S    S    --   --   S
(e)     Sc   Sc   Sc   Sm      I    I    I    M    EM

(f) (2 points) After line 4 is executed, will main memory have an up-to-date copy of location 100?
Answer: No (Dragon); No (directory-based).

(g) (2 points) How many total writes to memory are performed by each protocol on this sequence of 5 instructions?
Answer: 1 (Dragon); 1 (directory-based).

Question 5. Suppose two processors execute the following code. Assume all variables are initially 0.

P1               P2
1a  A = 1        2a  B = 1
1b  d = A        2b  e = B
1c  f = B        2c  print d, f
1d  print e

(a) (8 points) Which of the eight combinations of values for (d, e, f) can be printed under sequential consistency?
Answer: (0, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), and (1, 1, 1).
The sequence 1a, 1b, 1c, 1d, 2a, 2b, 2c → (1, 0, 0).
The sequence 1a, 1b, 2a, 1c, 1d, 2b, 2c → (1, 0, 1).
The sequence 1a, 1b, 2a, 2b, 1c, 1d, 2c → (1, 1, 1).
The sequence 2a, 2b, 2c, 1a, 1b, 1c, 1d → (0, 1, 0).
The sequence 2a, 1a, 1b, 2b, 2c, 1c, 1d → (1, 1, 0).
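The five interleavings listed above can also be checked exhaustively: under sequential consistency, every execution is some interleaving of the two program orders. The sketch below (an illustration added to this answer key, not part of the original exam) enumerates all C(7,4) = 35 interleavings of P1's four operations with P2's three and collects the printed (d, e, f) triples:

```python
# Enumerate every sequentially consistent interleaving of P1 and P2 and
# record the values of d, e, f at the moment each print statement runs.
from itertools import combinations

P1 = ["A=1", "d=A", "f=B", "print e"]
P2 = ["B=1", "e=B", "print d,f"]

def run(order):
    v = {"A": 0, "B": 0, "d": 0, "e": 0, "f": 0}  # all variables start at 0
    printed = {}
    for op in order:
        if op == "A=1":      v["A"] = 1
        elif op == "B=1":    v["B"] = 1
        elif op == "d=A":    v["d"] = v["A"]
        elif op == "e=B":    v["e"] = v["B"]
        elif op == "f=B":    v["f"] = v["B"]
        elif op == "print e":  printed["e"] = v["e"]
        else:                  printed["d"], printed["f"] = v["d"], v["f"]
    return (printed["d"], printed["e"], printed["f"])

results = set()
n = len(P1) + len(P2)
for slots in combinations(range(n), len(P1)):   # positions taken by P1's ops
    order = [None] * n
    for pos, op in zip(slots, P1):
        order[pos] = op
    rest = iter(P2)                              # P2 fills the gaps, in order
    order = [op if op is not None else next(rest) for op in order]
    results.add(run(order))
print(sorted(results))
```

The enumeration yields exactly the five triples listed in the answer, confirming that the remaining three combinations are impossible under SC.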
(b) (4 points) Choose two of the combinations that are impossible and explain why they are not possible under sequential consistency.
Answer: (0, 0, 0) is not possible because one process will inevitably finish before the other, and will have set its variable(s) to 1; the other process will therefore print 1 as the value of those variables. (0, 0, 1) and (0, 1, 1) are not possible under SC because if f is printed as 1, then 1c, and hence 1b, which sets d to 1, must already have executed, so d must print as 1.

(c) (2 points) Under processor consistency, which if any additional combinations of values can be printed? How?
Answer: (0, 0, 0) is possible if writes propagate very slowly, i.e., if a process sees its own writes considerably before it sees writes done by other processes. (0, 0, 1) and (0, 1, 1) are still not possible, for the same reason as under SC; that is, the write to d is seen everywhere before the write to f, as both writes come from the same processor.

(d) (2 points) Under weak ordering, which if any additional combinations of values can be printed? How?
Answer: (0, 0, 1) and (0, 1, 1) are now possible, because writes from P1 aren't necessarily seen in order by P2.

(e) (4 points) Pick one of these models (PC or WO) that can print combinations not possible with SC. Tell where you would insert fence (SYNC) operations to make it conform to SC.
Answer: Pick either model, and insert SYNCs before both print statements. This will ensure that all of the writes complete before any printing is done.

Question 6. (3 points each, except 5 points for (e)) Given the code for the shared-memory version of the Ocean simulation (Lecture 7), what could go wrong if:
(a) we remove the BARRIER on line 25d?
(b) we remove the BARRIER on line 25f?
(c) we add a BARRIER right before line 19?
(d) we add a new LOCKDEC at line 2b and a matching LOCK/UNLOCK around the assignment to A[i,j] on line 20?
(e) we remove the BARRIER at line 16a?
(f) Can we safely remove the BARRIER on line 25d? Why or why not?
Answers:
(a) In this situation we would be modifying the done variable before all threads had finished executing. Since that variable determines whether we continue iterating over the grid, if it gets set to true because one thread finishes early (while diff is still very small), we stop iterating too soon.
(b) If one thread were to reach this point while other threads were still executing, it would re-enter the loop and set diff = 0. Threads still in the previous iteration would then see diff as zero, set done = true, and cause the simulation to exit prematurely.
(c) This would not affect the results of the program, but it would cause significant performance degradation due to the overhead of introducing a BARRIER on every iteration of the inner loop.
(d) This would effectively render the parallelization useless, forcing execution into a state resembling serial execution. The LOCK on the inner loop would mean that only one thread could update its grid location at a time, which is really no different from a serial program.
(e) This could give the program unpredictable results: the shared diff variable could be arbitrarily reset to 0 at almost any point during execution, until the last thread passed line 16. The program could then fail to converge upon the correct result.
(f) The temptation would be to add an else { done = 0; } to the following if statement. If done were a shared variable, this could work. In this case, however, it would be incorrect, because the done variable is private to each thread, so some threads could exit too soon.
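The Ocean code itself is not reproduced in this answer key, but the synchronization skeleton the answers refer to can be sketched. The following is a hypothetical Python stand-in, not the lecture's code: the line references in the comments map onto the structure discussed above (reset of the shared diff, locked accumulation, convergence test), and the per-thread "errors" are made-up numbers replacing the grid computation.

```python
# Sketch of the Ocean solver's synchronization pattern: a shared 'diff'
# accumulator, a private 'done' flag per thread, a lock around the shared
# update, and three barriers separating the reset, the accumulation, and
# the convergence test.
import threading

NPROCS = 4
TOL = 0.01
diff = 0.0                        # shared across all threads
diff_lock = threading.Lock()
bar = threading.Barrier(NPROCS)
iters = [0] * NPROCS              # sweeps completed by each thread

# Made-up per-thread errors for three sweeps (stand-in for the grid math).
errs = [[1.0, 0.1, 0.001], [0.9, 0.2, 0.002],
        [1.1, 0.05, 0.003], [0.8, 0.15, 0.004]]

def solve(tid):
    global diff
    done = False                  # private to each thread (cf. answer (f))
    sweep = 0
    while not done:
        if tid == 0:
            diff = 0.0            # reset, protected by the barrier below
        bar.wait()                # cf. 16a: no one accumulates before the reset
        with diff_lock:           # cf. LOCK/UNLOCK around the shared update
            diff += errs[tid][sweep]
        bar.wait()                # cf. 25d: all contributions in before testing
        done = diff / NPROCS < TOL   # stand-in for diff/(n*n) < TOL
        bar.wait()                # cf. 25f: no one resets diff while others test
        sweep += 1
        iters[tid] = sweep

threads = [threading.Thread(target=solve, args=(t,)) for t in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(iters)                      # every thread performs the same sweep count
```

With all three barriers in place, every thread reads the same diff value and makes the same termination decision; removing the first, second, or third bar.wait() recreates the failures described in answers (e), (a), and (b), respectively.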