CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers


This was a 180-minute open-book test. You were to answer five of the six questions. Each question was worth 20 points. If you answered all six questions, your five highest scores counted.

Question 1. (a) (10 points) Suppose node 5 (i.e., node 0101₂) in a hypercube (or n-cube) of dimension 4 (i.e., 16 nodes) is going to broadcast a message to all other nodes in the minimum possible time, given that a node can only send a message to one other node at a time. Show which nodes transmit the message to which other nodes at each step in the process. Use the notation i → j to indicate that node number i transmits to node number j, and use decimal notation (not binary) for node numbers. For example, at some step you might write: 4 → 12, 2 → 3, 7 → 6. Note that six steps are not necessarily needed.

Answer: Several answers are possible; the best idea is to broadcast in dimension order. By increasing dimension, this would be:
Step 1: 5 → 4.
Step 2: 5 → 7; 4 → 6.
Step 3: 5 → 1; 4 → 0; 7 → 3; 6 → 2.
Step 4: 5 → 13; 4 → 12; 7 → 15; 6 → 14; 1 → 9; 0 → 8; 3 → 11; 2 → 10.

(b) (6 points) Suppose that one of the links used in Step 1 of your answer to (a) failed. Could you still perform the broadcast in the same number of steps? If so, redo part (a), showing how the broadcast can be completed in the same number of steps despite the failed link. If not, explain fully why not.

Answer: No. Since one node can send to only one other node per time step, in order to complete the broadcast in four time steps, the number of nodes that have the message must double at each time step. Consequently, each node that has the message must be able to send to another node in each time step. Therefore, the first node (5 in the example) must be able to send to four different nodes on the four time steps. That can't happen if the first node is connected to only three of its neighbors.

(c) (4 points) What is the smallest number of links that would have to fail to prevent a broadcast from being performed at all (regardless of how long it would take)? Why?

Answer: Four links. A broadcast can be performed unless one node is completely isolated. Since there are four links per node, four links would have to fail to isolate a node completely.
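The dimension-order broadcast is easy to express in code. The following sketch (mine, not part of the exam) generates the schedule for an arbitrary source node in an n-dimensional hypercube; the function name and output format are illustrative only.

    # Minimal sketch of dimension-order broadcast in an n-dimensional hypercube.
    # In step d+1, every node that already has the message sends it across
    # dimension d (i.e., to the neighbor whose node number differs in bit d).
    def broadcast_schedule(source, n):
        have = [source]                      # nodes that hold the message
        for d in range(n):                   # one hypercube dimension per step
            sends = [(i, i ^ (1 << d)) for i in have]
            print(f"Step {d + 1}: " + "; ".join(f"{i} -> {j}" for i, j in sends))
            have += [j for _, j in sends]    # holders double at every step

    broadcast_schedule(5, 4)   # reproduces the schedule in the answer above

Because the set of message holders doubles each step, 2^t nodes have the message after t steps, which is why log₂ 16 = 4 steps is the minimum for 16 nodes, and why losing one of the source's links (part (b)) makes a 4-step broadcast impossible.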

Question 2. (2/3 point per blank) Consider a system that has 3 processors with private write-back caches and uses the SSCI protocol, where each line holds a single integer value (i.e., a block size of 4 bytes).

Memory contents:

    Variable   Value
    X          4
    Y          5

A dash (–) indicates that we don't know what's in a particular cache line, or that the cache does not hold the line. Initially, the cache for each processor is empty. Fill in the blank spaces. (Each row shows the cache and directory entries for the line accessed on that row.)

    Operation    P1: Val St Prv Nxt   P2: Val St Prv Nxt   P3: Val St Prv Nxt   Memory: Val St Head
    P1: read X   4  E  0  0           –                    –                    4   EM  1
    P1: X = 6    6  M  0  0           –                    –                    4   EM  1
    P2: read X   6  S  2  0           6  S  0  1           –                    6   S   2
    P3: Y = 7    –                    –                    7  M  0  0           5   EM  3
    P2: read Y   –                    7  S  0  3           7  S  2  0           7   S   2
    P3: Y = 8    –                    7  I  0  3           8  M  0  0           7   EM  3
    P2: X = 9    6  I  2  0           9  M  0  0           –                    6   EM  2
    P3: X = 10   6  I  2  0           9  I  0  0           10 M  0  0           10  EM  3
    P3: X = 2    6  I  2  0           9  I  0  0           2  M  0  0           10  EM  3

Question 3. (2 points for part (a); 3 points for all other parts) This question concerns the organization of a memory-based cache-directory scheme. Note: In doing the calculations below, be careful not to confuse bytes with bits! (1 byte = 2^3 bits)

Suppose we have a multiprocessor with 1024 nodes, each of which has 4 GB (= 2^32 bytes) of main memory. Suppose that a cache line contains 32 bytes.

(a) How many cache blocks are there per node?

Answer: 2^32/2^5 = 2^27, or 128 "megablocks."

(b) For parts (b) to (e), assume that the directory is kept in main memory. If we could use the full bit-vector approach for organizing the cache directory, what fraction of main memory would have to be devoted to cache directories?

Answer: For each block, we would need to use 1024 bits to record which of the 1024 processors contained it. This means 2^27 blocks × 2^10 bits = 2^37 bits = 2^34 bytes. Whoops! Cache directories would consume four times as much memory as we have!

(c) Assume that the average block is cached in 2 nodes. If we used pointers instead of the full bit-vector approach, what fraction of memory would be devoted to the pointers? (Answer the fraction of memory that would be devoted to pointers alone, not including other data structures that would be needed to keep track of the sharers.)

Answer: Each pointer is 10 bits long. Each of the 2^27 blocks requires two pointers, on average. Therefore, there will be 20 × 2^27 = 1.25 × 2^31 bits devoted to pointers, which is 5 × 2^26 bytes, out of 2^32 bytes altogether. Thus, 5/64 of main memory will be devoted to pointers.
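These storage-overhead figures are easy to check mechanically. Here is a small sketch (my own, assuming the corrected figures above) that recomputes parts (a)-(c) and the core count asked for in part (e) below.

    # Recomputes the directory-storage fractions from Question 3.
    mem_bytes  = 2**32          # 4 GB of main memory per node
    line_bytes = 32             # cache line size
    nodes      = 1024           # processors, so a pointer is 10 bits wide

    blocks = mem_bytes // line_bytes                 # part (a): 2**27 blocks
    print(f"(a) blocks per node: 2**{blocks.bit_length() - 1}")

    bitvec_bytes = blocks * nodes // 8               # part (b): 1 bit/node/block
    print(f"(b) full bit-vector: {bitvec_bytes / mem_bytes:.0f}x main memory")

    ptr_bits  = (nodes - 1).bit_length()             # 10-bit pointers
    ptr_bytes = blocks * 2 * ptr_bits / 8            # part (c): 2 pointers/block
    print(f"(c) pointers: {ptr_bytes / mem_bytes} of main memory")   # 5/64

    cores = 1                                        # part (e): one bit per node
    while blocks * (nodes // cores) / 8 / mem_bytes >= 0.01:
        cores *= 2
    print(f"(e) cores per node for <1% overhead: {cores}")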

(d) Is the assumption of part (c) realistic? Explain.

Answer: No, it is not realistic. If the average block were cached at 2 nodes, then there would have to be twice as much cache memory as main memory in the system! Therefore, the answer of part (c) is a gross overestimate of the amount of memory that would be devoted to pointers.

(e) Suppose that multicore nodes are used to save memory in the full bit-vector approach. Under the assumptions of part (a), how many cores would be needed per node so that less than 1% of main memory is devoted to the directory?

Answer: The easiest way is to start with the answer to part (b). With one core per node, the directory takes four times as much memory as the node has. Each time the number of cores per node doubles, the bit vector shrinks by half, so the fraction of memory needed goes down by one half: 2 for dual-core, 1 for quad-core, 1/2 for 8-core, and in general 4/C for C cores per node. The first power of two for which 4/C < 1/100 is C = 512 (4/512 = 1/128 ≈ 0.8%, whereas 256 cores would still need 1/64 ≈ 1.6%). So the answer is 512 cores per node.

(f) Another approach is to use a cache-based scheme, like SSCI does. Suppose that the pointers point to individual processors that are caching a block (not to multicore nodes). In a 1024-processor system with 32-byte lines, what fraction of cache bits are devoted to pointers? (Again, consider only pointers and information bits, not state bits, etc.)

Answer: Each cache line contains 2 × 10 = 20 bits of pointers and 32 × 8 = 256 information bits. So pointers occupy 20/256 ≈ 0.078 of the cache. And that's with 32-byte lines, which are about ¼ the size of the average cache line.

(g) Which approach, multicore nodes or cache-based directories, is more scalable and why?

Answer: The cache-based approach, because a constant fraction of the cache is used, regardless of the number of processors. With the multicore approach, as the number of processors rises, either more memory is devoted to bit vectors, or the number of cores per chip needs to keep increasing.

Question 4. (a) (3 points) In the MSI coherence protocol, when a block transitions from state M to state S, a flush is induced. Suppose the block were not flushed to main memory at this time. Would the protocol still work? Would it still be as efficient?

Answer: It wouldn't necessarily be incorrect. However, now blocks could be in state S without main memory being up to date. This means that if another processor read the block, it would have to be supplied by a cache-to-cache transfer (this typically implies an owner cache, something that the MSI protocol does not provide). And when blocks in state S were purged from the cache, they would have to be written back to main memory if dirty. A block, once dirtied, would have to be written back at least once, and possibly several times (if it ended up being shared by several caches). So there might be many more writes if the flush were not performed when it is. On the other hand, if a block transitions between states S and M several times, several flushes would be avoided. So it depends on the relative frequency of these two conditions (dirty blocks being shared vs. multiple M → S → M transitions).

(b) (3 points) In the MESI protocol, when a block transitions from state E to state I, the diagram shows it being flushed. Where is it flushed to? Why is this necessary?

Answer: If cache-to-cache sharing is in use, it is flushed out on the bus so that it can be picked up by another processor (e.g., if another processor is trying to write it). If cache-to-cache sharing is not in use, main memory can supply the data. (Note that this is indicated as Flush′ in the diagram, rather than Flush.)
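As a compact reference for parts (a) and (b), the snooper-side transitions under discussion can be tabulated. This is my own summary in the Culler & Singh style (Flush′ = supply the block on the bus only if cache-to-cache sharing is in use), listing only the rows relevant here; it is not a transcription of the course's diagrams.

    # Sketch of the snooper-side transitions discussed in parts (a) and (b).
    # Flush  = put the block on the bus and update main memory.
    # Flush' = supply the block only if cache-to-cache sharing is in use.
    MSI_SNOOP = {
        ('M', 'BusRd'):  ('S', 'Flush'),    # the M -> S flush of part (a)
        ('M', 'BusRdX'): ('I', 'Flush'),
        ('S', 'BusRdX'): ('I', None),
    }
    MESI_SNOOP = {
        ('M', 'BusRd'):  ('S', 'Flush'),
        ('M', 'BusRdX'): ('I', 'Flush'),
        ('E', 'BusRd'):  ('S', "Flush'"),
        ('E', 'BusRdX'): ('I', "Flush'"),   # the E -> I case of part (b)
        ('S', 'BusRdX'): ('I', None),
    }

    state, action = MESI_SNOOP[('E', 'BusRdX')]
    print(state, action)                    # -> I Flush'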
(c) (3 points) In the Dragon protocol, a flush is used only when a bus read comes in for a block in M or Sm state. Where is the data flushed to in this case, and why?

Answer: It is flushed onto the bus so that another processor may pick it up. The processor picking it up will hold the block in state Sc. If it were not flushed, the requesting processor would have nowhere to get up-to-date data from, since main memory may be out of date.

The rest of the question relates to this situation. Assume that in a four-processor shared-memory multiprocessor with private, write-back caches, the following sequence of reads and writes takes place. Assume that location 100 is not in any cache at the start of the sequence.

1. P0 writes location 100.
2. P1 reads location 100.
3. P2 reads location 100.
4. P3 writes location 100.
5. P3 writes location 100.

Legend: Px means processor x's cache. M means main memory (for the directory-based protocol, the directory state).

(d) (4 points) After line 2 is executed, what will be the state of memory location 100 in each of the caches?

(e) (3 points) After line 4 is executed, what will be the state of memory location 100 in each of the caches?

Answer (a dash means the cache does not hold the block):

            Dragon                    Directory-based (FBV)
            P0   P1   P2   P3         M    P0   P1   P2   P3
    (d)     Sm   Sc   –    –          S    S    S    –    –
    (e)     Sc   Sc   Sc   Sm         EM   I    I    I    M

(f) (2 points) After line 4 is executed, will main memory have an up-to-date copy of location 100?

Answer: No for Dragon, and no for the directory-based protocol.

(g) (2 points) How many total writes to memory are performed by each protocol on this sequence of 5 instructions?

Answer: One for Dragon, and one for the directory-based protocol.

Question 5. Suppose two processors execute the following code (all variables initially 0):

    P1               P2
    1a  A = 1        2a  B = 1
    1b  d = A        2b  e = B
    1c  f = B        2c  print d, f
    1d  print e

(a) (8 points) Which of the eight combinations of values for (d, e, f) can be printed under sequential consistency?

Answer: (0, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), and (1, 1, 1).
The sequence 1a, 1b, 1c, 1d, 2a, 2b, 2c → (1, 0, 0).
The sequence 1a, 1b, 2a, 1c, 1d, 2b, 2c → (1, 0, 1).
The sequence 1a, 1b, 2a, 2b, 1c, 1d, 2c → (1, 1, 1).
The sequence 2a, 2b, 2c, 1a, 1b, 1c, 1d → (0, 1, 0).
The sequence 2a, 1a, 1b, 2b, 2c, 1c, 1d → (1, 1, 0).
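The SC result in part (a) can be confirmed by brute force: enumerate every interleaving that preserves each processor's program order and record what gets printed. A small sketch (mine, not the exam's):

    # Enumerates all SC interleavings of P1 and P2 and collects (d, e, f).
    from itertools import combinations

    P1 = ['1a', '1b', '1c', '1d']
    P2 = ['2a', '2b', '2c']

    def run(order):
        v = dict(A=0, B=0, d=0, e=0, f=0)
        printed = {}
        for op in order:
            if   op == '1a': v['A'] = 1
            elif op == '1b': v['d'] = v['A']
            elif op == '1c': v['f'] = v['B']
            elif op == '1d': printed['e'] = v['e']
            elif op == '2a': v['B'] = 1
            elif op == '2b': v['e'] = v['B']
            elif op == '2c': printed['d'], printed['f'] = v['d'], v['f']
        return (printed['d'], printed['e'], printed['f'])

    results = set()
    n = len(P1) + len(P2)
    for slots in combinations(range(n), len(P1)):   # positions for P1's ops
        order = [None] * n
        for pos, op in zip(slots, P1):
            order[pos] = op
        rest = iter(P2)                             # P2 fills the other slots
        order = [op if op is not None else next(rest) for op in order]
        results.add(run(order))

    print(sorted(results))
    # -> [(0, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]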

(b) (4 points) Choose two of the combinations that are impossible and explain why they are not possible under sequential consistency.

Answer: (0, 0, 0) is not possible because one process will inevitably finish before the other, and will have set its variable(s) to 1. Therefore, the other process will print 1 as the value of those variables. (0, 0, 1) and (0, 1, 1) are not possible under SC because if f is printed as 1, then d must already have been set to 1: statement 1b precedes 1c in program order, and 1c must precede 2c for f to be 1.

(c) (2 points) Under processor consistency, which if any additional combinations of values can be printed? How?

Answer: (0, 0, 0) is possible if writes propagate very slowly, i.e., if a process sees its own writes considerably before it sees writes done by other processes. (0, 0, 1) and (0, 1, 1) are still not possible, for the same reason as under SC; that is, the write to d is seen everywhere before the write to f, as both writes come from the same processor.

(e) (2 points) Under weak ordering, which if any additional combinations of values can be printed? How?

Answer: (0, 0, 1) and (0, 1, 1) are now possible, because writes from P1 aren't necessarily seen in order by P2.

(f) (4 points) Pick one of these models (PC or WO) that can print combinations not possible with SC. Tell where you would insert fence (SYNC) operations to make it conform to SC.

Answer: Pick either of those models, and insert SYNCs before both print statements. This will ensure that all of the writes complete before any printing is done.

Question 6. (3 points each, except 5 points for (e)) Given the code for the shared-memory version of the Ocean simulation (Lecture 7), what could go wrong if:

(a) we remove the BARRIER on line 25d?
(b) we remove the BARRIER on line 25f?
(c) we add a BARRIER right before line 19?
(d) we add a new LOCKDEC at line 2b and a matching LOCK/UNLOCK around the assignment to A[i,j] on line 20?
(e) we remove the BARRIER at line 16a?
(f) Can we safely remove the BARRIER on line 25d? Why or why not?

Answers:

(a) In this situation we would be modifying the done variable before all threads have finished executing. Since we use that variable to decide whether to continue iterating over the grid, if it gets set to true because one thread finishes early (when diff is still very small), then we stop iterating too soon.

(b) If one thread were to reach this point (where we removed the barrier) while other threads were still executing, it would re-enter the loop and set diff = 0. Other threads still in the previous iteration would then see diff as zero and set done = true, which would cause the simulation to exit prematurely.

(c) This would not affect the results of the program, but it would cause significant performance degradation due to the overhead of introducing a BARRIER on every iteration of the inner loop.

(d) This would effectively render the parallelization useless, as it would force execution into a state resembling serial execution. The LOCK on the inner loop would mean that only one thread could update its grid location at a time, which is really no different from a serial program.

(e) This could give the program unpredictable results: the shared diff variable could be arbitrarily reset to 0 at almost any point during execution, until the last thread passes line 16. This would result in the program failing to converge upon the correct result.

(f) The temptation would be to add an else { done = 0; } to the following if statement. If done were a shared variable, this could work. However, in this case it would be incorrect, because the done variable is private to each thread, meaning that some threads would exit too soon.
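For readers without the Lecture 7 listing at hand, the answers above assume a solver loop of roughly the following shape. This is my own runnable reconstruction, following the well-known Ocean/equation-solver kernel from Culler & Singh; the barrier comments indicate which line of the lecture's numbering each one corresponds to, and all names and sizes here are illustrative.

    # Schematic reconstruction of the Ocean solver loop discussed above.
    import threading

    NT, N, TOL = 4, 16, 1e-2
    A = [[((i * j) % 7) / 7.0 for j in range(N + 2)] for i in range(N + 2)]
    diff = 0.0                        # shared accumulator for this sweep
    lock = threading.Lock()
    barrier = threading.Barrier(NT)

    def solve(pid):
        global diff
        done = False                  # PRIVATE per thread: the point of (f)
        while not done:
            if pid == 0:
                diff = 0.0            # safe only because of the next barrier
            barrier.wait()            # ~line 16a: see answer (e)
            mydiff = 0.0
            for i in range(1 + pid, N + 1, NT):      # this thread's rows
                for j in range(1, N + 1):            # ~lines 19-20: no LOCK (d)
                    old = A[i][j]
                    A[i][j] = 0.2 * (A[i][j] + A[i][j - 1] + A[i][j + 1]
                                     + A[i - 1][j] + A[i + 1][j])
                    mydiff += abs(A[i][j] - old)
            with lock:
                diff += mydiff        # one short critical section per sweep
            barrier.wait()            # ~line 25d: see answers (a) and (f)
            if diff / (N * N) < TOL:  # every thread sees the same final diff
                done = True
            barrier.wait()            # ~line 25f: see answer (b)

    threads = [threading.Thread(target=solve, args=(p,)) for p in range(NT)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("converged, diff =", diff)

With this structure in view, the three barriers play exactly the roles the answers describe: the first protects the reset of diff, the second guarantees every thread's contribution is in diff before anyone tests convergence, and the third keeps a fast thread from re-zeroing diff while slower threads are still testing.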