CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers

Size: px
Start display at page:

Download "CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers"

Transcription

1 CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers This was a 180-minute open-book test. You were to answer five of the six questions. Each question was worth 20 points. If you answered all six questions, your five highest scores counted. Question 1. (a) (10 points) Suppose node 5 (i.e., node 01012) in a hypercube (or n cube) of dimension 4 (i.e., 16 nodes) is going to broadcast a message to all other nodes in the minimum possible time, given that a node can only send a message to one other node at a time. Show which nodes transmit the message to which other nodes at each step in the process. Use the notation i j to indicate that node number i transmits to node number j, and use decimal notation (not binary) for node numbers. For example, at some step you might write: 4 12,2 3,7 6. Note that 6 steps are not necessarily needed. Answer: Several answers are possible; the best idea is to broadcast in dimension order. By increasing dimension, this would be Step 1: 5 4. Step 2: 5 7; 4 6. Step 3: 5 1; 4 0; 7 3; 6 2. Step 4: 5 13; 4 12; 7 15; 6 14; 1 9; 0 8; 3 11; (b) (6 points) Suppose that one of the links used in step 1 of your answer to (a) failed. Could you still perform the broadcast in the same number of steps? If so, redo part (a), showing how the broadcast can be completed in the same number of steps despite the failed link. If not, explain fully why not. Answer: No. Since one node can send to only one other node per time step, in order to complete the broadcast in four time steps, the number of nodes that have the message must double at each time step. Consequently, each node that has the message must be able to send to another node in each time step. Therefore, the first node (5 in the example) must be able to send to four different nodes on the four time steps. That can t happen if the first node is connected to only three of its neighbors. (c) (4 points) What is the smallest number of links that would have to fail to prevent a broadcast from being performed at all (regardless of how long it would take)? Why? Answer: Four links. A broadcast can be performed unless one node is completely isolated. Since there are four links per node, four links would have to fail to isolate a node completely. Question 2. (2/3 point per blank) Consider a system that has 3 processors with private writeback caches and uses the SSCI protocol, where each line holds a single integer value (e.g., block size of 4 bytes). Memory contents Variable Value X 4 Y 5

2 indicates that we don t know what s in a particular cache line. Initially, the cache for each processor is empty. Fill in the blank spaces. Operation P1 P2 P3 Memory Value State Prev Next Value State Prev Next Value State Prev Next Value State Head P1: read X 4 E EM 1 P1: X = 6 6 M EM 1 P2: read X 6 S S S 2 P3: Y = 7 7 M EM 3 P2: read Y 7 S S S 2 P3: Y = 8 7 I M EM 3 P2: X = 9 6 I M EM 2 P3: X = 10 6 I I M EM 3 P3: X= 2 6 I I M EM 3 Question 3. (2 points for part (a); 3 points for all other parts) This question concerns the organization of a memory-based cache-directory scheme. Note: In doing the calculations below, be careful not to confuse bytes with bits! (1 byte = 2 3 bits) Suppose we have a multiprocessor with 1024 nodes, each of which has 4 GB (= 2 32 bytes) of main memory. Suppose that a cache line contains 32 bytes. (a) How many cache blocks are there per node? Answer: 2 31 /2 5 = 2 26, or 64 "megablocks." (b) For parts (b) to (e), assume that the directory is kept in main memory. If we could use the full bit-vector approach for organizing the cache directory, what fraction of main memory would have to be devoted to cache directories? Answer: For each block, we would need to use 1024 bits to record which of the 1024 processors contained it. This means 2 26 blocks 2 10 bits = 2 36 bits = 2 33 bytes. Woops cache directories would consume twice as much memory as we have! (c) Assume that the average block is cached in 2 nodes. If we used pointers instead of the full bit-vector approach, what fraction of memory would be devoted to the pointers? (Answer the fraction of memory that would be devoted to pointers alone, not including other data structures that would be needed to keep track of the sharers.) Answer: Each pointer is 10 bits long. Each of the 2 26 blocks requires two pointers, on average. Therefore, there will be = bits devoted to pointers, out of 2 32 bytes altogether. Thus, 5/4 1/8 = 5/32 of main memory will be devoted to pointers. (d) Is the assumption of part (c) realistic? Explain.

3 Answer: No, it is not realistic. If the average block were cached at 2 nodes, then there would have to be twice as much cache memory as main memory in the system! Therefore, the answer of part (c) is a gross overestimate of the amount of memory that would be devoted to pointers. (e) Suppose that multicore nodes are used to save memory in the full bit-vector approach. Under the assumptions of part (a), how many cores would be needed per node so that less than 1% of main memory is devoted to the directory? Answer: Easiest way is to start with the answer to part (c). With one core per node, 5/32 of main memory is required. Each time the number of cores per node doubles, the fraction of memory needed goes down by one half: 5/64 for dual-core, 5/128 for quad core, 5/256 for 8-core, and 5/512 for 16-core. So the answer is 16 cores per node. (f) Another approach is to use a cache-based scheme, like SSCI does. Suppose that the pointers point to individual processors that are caching a block (not to multicore nodes). In a processor system with 32-byte lines, what fraction of cache bits are devoted to pointers? (Again, consider only pointers and information bits, not state bits, etc.) Answer: Each cache line contains 2 10 = 20 bits of pointers and 32 8 = 256 information bits. So pointers occupy 20/ of the cache. And that s with 32-byte lines, which are about ¼ the size of the average cache line. (g) Which approach, multicore nodes or cache-based directories, is more scalable and why? Answer: The cache-based approach, because a constant fraction of the cache is used, regardless of the number of processors. With the multicore approach, as the number of processors rises, either more memory is devoted to bit-vectors, or the number of cores per chip needs to keep increasing. Question 4. (a) (3 points) In the MSI coherence prototocol, when a block transitions from state M to state S, a flush is induced. Suppose the block were not flushed to main memory at this time. Would the protocol still work? Would it still be as efficient? Answer: It wouldn t necessarily be incorrect. However, now blocks could be in state S without main memory being up to date. This means that if another processor read the block, it would have to be supplied by a cache-to-cache transfer (this typically implies an owner cache, something that the MSI protocol does not provide). And when blocks in state S were purged from the cache, they would have to be written back to main memory if dirty. A block, once dirtied, would have to be written back at least once, and possibly several times (if it ended up being shared by several caches). So there might be many more writes if the flush were not performed when it is. On the other hand, if a block transitions between states S and M several times, several flushes would be avoided. So it depends on the relative frequency of these two conditions (dirty blocks being shared vs. multiple M S M transitions). (b) (3 points) In the MESI protocol, when a block transitions from state E to state I, the diagram shows it being flushed. Where is it flushed to? Why is this necessary? Answer: If cache-to-cache sharing is in use, it is flushed out on the bus so that it can be picked up by another processor (e.g., if another processor is trying to write it). If cache-to-cache sharing is not in use, main memory can supply the data. (Note that this is indicated Flush in the diagram instead of Flush. (c) (3 points) In the Dragon protocol, a flush is used only when a bus read comes in for a block in M or Sm state. Where is data flushed to in this case and why?

4 Answer: It is flushed onto the bus so that another processor may pick it up. The processor picking it up will hold the block in state Sc. If it were not flushed, the requesting processor would have nowhere to get up-to-date data from, since main memory may be out of date. The rest of the question relates to this situation: Assume that in a four-process shared-memory multiprocessor with private, write-back caches that the following sequence of reads and writes take place. Assume that location 100 is not in any cache at the start of the sequence. 1. P0 writes location P1 reads location P2 reads location P3 writes location P3 writes location 100. Legend: Px means Processor x s cache. M means main memory. (d) (4 points) After line 2 is executed, what will be the state of memory location 100 in each of the caches? (e) (3 points) After line 4 is executed, what will be the state of memory location 100 in each of the caches? Dragon Directory-based (FBV) P0 P1 P2 P3 M P0 P1 P2 P3 Sm Sm S S S Sc Sc Sc Sm EM I I I M (f) (2 points) After line 4 is executed, will main memory have an up-to-date copy of location 100? No No (g) (2 points) How many total writes to memory are performed by each protocol on this sequence of 5 instructions? 1 1 Question 5. Suppose two processors execute the following code: P 1 P 2 1a A = 1 1b d = A 1c f = B 1d print e 2a B = 1 2b e = B 2c print d, f (a) (8 points) Which of the eight combinations of values for (a, b, c) can be printed under sequential consistency? Answer: (0, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), and (1, 1, 1). The sequence 1a, 1b, 1c, 1d, 2a, 2b, 2c (1, 0, 0). The sequence 1a, 1b, 2a, 1c, 1d, 2b, 2c (1, 0, 1). The sequence 1a, 1b, 2a, 2b, 1c, 1d, 2c (1, 1, 1). The sequence 2a, 2b, 2c, 1a, 1b, 1c, 1d (0, 1, 0). The sequence 2a, 1a, 1b, 2b, 2c, 1c, 1d (1, 1, 0).

5 (b) (4 points) Choose two of the combinations that are impossible and explain why they are not possible under sequential consistency. Answer: (0, 0, 0) is not possible because one process will inevitably finish before the other, and will have set its variable(s) to 1. Therefore, the other process will print 1 as the value of these variables. (0, 0, 1) and (0, 1, 1) are not possible under SC because if c is 1, then a must have previously been set to 1. (c) (2 points) Under processor consistency, which if any additional combinations of values can be printed? How? Answer: (0, 0, 0) is possible if writes propagate very slowly, i.e., if a process sees its own writes considerably before it sees writes done by other processes. (0, 0, 1) and (0, 1, 1) are not possible for the same reason as under SC; that is, the write to a is seen everywhere before the write to c, as both writes come from the same processor. (e) (2 points) Under weak ordering, which if any additional combinations of values can be printed? How? Answer: (0, 0, 1) and (0, 1, 1) are now possible, because writes from P 1 aren t necessarily seen in order by P 2. (f) (4 points) Pick one of these models (PC or WO) that can print combinations not possible with SC. Tell where you would insert fence (SYNC) operations to make it conform to SC. Answer: Pick any one of those models, and insert SYNCs before both print statements. This will assure that all of the writes complete before any printing is done. Question 6. (3 points each, except 5 points for (e)) Given the code for shared-memory version of the Ocean simulation (Lecture 7), what could go wrong if: (a) we remove the BARRIER on line 25d? (b) we remove the BARRIER on line 25f? (c) we add a BARRIER right before lline 19? (d) we add a new LOCKDEC at line 2b and a matching LOCK/UNLOCK around the assignment to A[i,j] on line 20? (e) we remove the BARRIER at line 16a? (f) Can we safely remove the BARRIER on line 25d? Why or why not? Answers: (a) In this situation we will be modifying the done variable before all threads have finished executing. Since we are using that variable to determine if we should continue iterating over the grid, if this gets set to true because one thread finishes early (when 'diff' is still very small) then we stop iterating too soon. (b) If one thread were to reach this point (where we remove the barrier) while other threads were still executing, it would re-enter the loop and set diff = 0 which would mean that other threads still in the previous iteration would see diff as zero and would then set done = true which would cause the simulation to exit prematurely. (c) This would not affect the results of the program, but would cause significant performance

6 degradation due to the overhead of introducing a BARRIER on every iteration of the inner loop. (d) This would effectively render the parallelization useless as it would force the execution to a state resembling serial execution. The LOCK on the inner loop would mean that only one thread could update it's grid location at one time, which is really no different from a serial program. (e) This could result in the program having unpredictable results, as the diff variable, which is shared, could be arbitrarily reset to 0 at almost any point during execution until the last thread passes line 16. This would result in the program failing to converge upon the correct result. (f) The temptation would be to add an else {done=0;} to the following if statement. If done were a shared variable this could work. However in this case, this would be incorrect because the done variable is private to each thread, meaning that some threads would exit too soon.

Lecture 24: Board Notes: Cache Coherency

Lecture 24: Board Notes: Cache Coherency Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains: The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

CS 61C: Great Ideas in Computer Architecture. Amdahl s Law, Thread Level Parallelism

CS 61C: Great Ideas in Computer Architecture. Amdahl s Law, Thread Level Parallelism CS 61C: Great Ideas in Computer Architecture Amdahl s Law, Thread Level Parallelism Instructor: Alan Christopher 07/17/2014 Summer 2014 -- Lecture #15 1 Review of Last Lecture Flynn Taxonomy of Parallel

More information

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Module 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors

Module 9: Introduction to Shared Memory Multiprocessors Lecture 16: Multiprocessor Organizations and Cache Coherence Shared Memory Multiprocessors Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went

More information

Snooping coherence protocols (cont.)

Snooping coherence protocols (cont.) Snooping coherence protocols (cont.) A four-state update protocol [ 5.3.3] When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are often invalidated, and then have

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In small multiprocessors, sequential consistency can be implemented relatively easily. However, this is not true for large multiprocessors. Why? This is not the

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

The Cache-Coherence Problem

The Cache-Coherence Problem The -Coherence Problem Lecture 12 (Chapter 6) 1 Outline Bus-based multiprocessors The cache-coherence problem Peterson s algorithm Coherence vs. consistency Shared vs. Distributed Memory What is the difference

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins 4 Chip Multiprocessors (I) Robert Mullins Overview Coherent memory systems Introduction to cache coherency protocols Advanced cache coherency protocols, memory systems and synchronization covered in the

More information

Shared Memory Architectures. Approaches to Building Parallel Machines

Shared Memory Architectures. Approaches to Building Parallel Machines Shared Memory Architectures Arvind Krishnamurthy Fall 2004 Approaches to Building Parallel Machines P 1 Switch/Bus P n Scale (Interleaved) First-level $ P 1 P n $ $ (Interleaved) Main memory Shared Cache

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Shared Memory Multiprocessors

Shared Memory Multiprocessors Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O

More information

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it

More information

Cache Coherence Tutorial

Cache Coherence Tutorial Cache Coherence Tutorial The cache coherence protocol described in the book is not really all that difficult and yet a lot of people seem to have troubles when it comes to using it or answering an assignment

More information

Scalable Multiprocessors

Scalable Multiprocessors Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system

More information

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,

More information

Bus-Based Coherent Multiprocessors

Bus-Based Coherent Multiprocessors Bus-Based Coherent Multiprocessors Lecture 13 (Chapter 7) 1 Outline Bus-based coherence Memory consistency Sequential consistency Invalidation vs. update coherence protocols Several Configurations for

More information

Multiprocessor Systems

Multiprocessor Systems White Paper: Virtex-II Series R WP162 (v1.1) April 10, 2003 Multiprocessor Systems By: Jeremy Kowalczyk With the availability of the Virtex-II Pro devices containing more than one Power PC processor and

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252

More information

CSC/ECE 506: Computer Architecture and Multiprocessing Program 3: Simulating DSM Coherence Due: Tuesday, Nov 22, 2016

CSC/ECE 506: Computer Architecture and Multiprocessing Program 3: Simulating DSM Coherence Due: Tuesday, Nov 22, 2016 CSC/ECE 506: Computer Architecture and Multiprocessing Program 3: Simulating DSM Coherence Due: Tuesday, Nov 22, 2016 1. Overall Problem Description In this project, you will add new features to a trace-driven

More information

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro Lecture 7: M Wrap-Up, ache coherence Topics: handling M errors and writes, cache coherence intro 1 Optimizations for Writes (Energy, Lifetime) Read a line before writing and only write the modified bits

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Snoopy Cache Coherence rof. Michel A. Kinsy Consistency in SMs CU-1 CU-2 A 100 Cache-1 A 100 Cache-2 CU- bus A 100 Consistency in SMs CU-1 CU-2 A 200 Cache-1

More information

Suggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!

Suggested Readings! What makes a memory system coherent?! Lecture 27 Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality! 1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and

More information

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in

More information

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University 18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution

More information

Lecture 25: Multiprocessors

Lecture 25: Multiprocessors Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed

More information

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Coherence Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L5- Coherence Avoids Stale Data Multicores have multiple private caches for performance Need to provide the illusion

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

CS433 Final Exam. Prof Josep Torrellas. December 12, Time: 2 hours

CS433 Final Exam. Prof Josep Torrellas. December 12, Time: 2 hours CS433 Final Exam Prof Josep Torrellas December 12, 2006 Time: 2 hours Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 6 Questions. Please budget your time. 3. Calculators

More information

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh Multicore Workshop Cache Coherency Mark Bull David Henty EPCC, University of Edinburgh Symmetric MultiProcessing 2 Each processor in an SMP has equal access to all parts of memory same latency and bandwidth

More information

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM 1 Handling Reads When the home receives a read request, it looks up memory (speculative read) and directory in

More information

Portland State University ECE 588/688. Cache Coherence Protocols

Portland State University ECE 588/688. Cache Coherence Protocols Portland State University ECE 588/688 Cache Coherence Protocols Copyright by Alaa Alameldeen 2018 Conditions for Cache Coherence Program Order. A read by processor P to location A that follows a write

More information

Lecture 8: Directory-Based Cache Coherence. Topics: scalable multiprocessor organizations, directory protocol design issues

Lecture 8: Directory-Based Cache Coherence. Topics: scalable multiprocessor organizations, directory protocol design issues Lecture 8: Directory-Based Cache Coherence Topics: scalable multiprocessor organizations, directory protocol design issues 1 Scalable Multiprocessors P1 P2 Pn C1 C2 Cn 1 CA1 2 CA2 n CAn Scalable interconnection

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures

Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures Approaches to Building arallel achines Switch/Bus n Scale Shared ory Architectures (nterleaved) First-level (nterleaved) ain memory n Arvind Krishnamurthy Fall 2004 (nterleaved) ain memory Shared Cache

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Memory Hierarchy in a Multiprocessor

Memory Hierarchy in a Multiprocessor EEC 581 Computer Architecture Multiprocessor and Coherence Department of Electrical Engineering and Computer Science Cleveland State University Hierarchy in a Multiprocessor Shared cache Fully-connected

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) 1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory

More information

Cache Coherence in Bus-Based Shared Memory Multiprocessors

Cache Coherence in Bus-Based Shared Memory Multiprocessors Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)

More information

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Coherence Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L25-1 Coherence Avoids Stale Data Multicores have multiple private caches for performance Need to provide the illusion

More information

Lecture 26: Multiprocessors. Today s topics: Directory-based coherence Synchronization Consistency Shared memory vs message-passing

Lecture 26: Multiprocessors. Today s topics: Directory-based coherence Synchronization Consistency Shared memory vs message-passing Lecture 26: Multiprocessors Today s topics: Directory-based coherence Synchronization Consistency Shared memory vs message-passing 1 Cache Coherence Protocols Directory-based: A single location (directory)

More information

Coherence and Consistency

Coherence and Consistency Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.

More information

Switch Gear to Memory Consistency

Switch Gear to Memory Consistency Outline Memory consistency equential consistency Invalidation vs. update coherence protocols MI protocol tate diagrams imulation Gehringer, based on slides by Yan olihin 1 witch Gear to Memory Consistency

More information

Computer Architecture Memory hierarchies and caches

Computer Architecture Memory hierarchies and caches Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches

More information

Snooping coherence protocols (cont.)

Snooping coherence protocols (cont.) Snooping coherence protocols (cont.) A four-state update protocol [ 5.3.3] When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are often invalidated, and then have

More information

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem Cache Coherence Bryan Mills, PhD Slides provided by Rami Melhem Cache coherence Programmers have no control over caches and when they get updated. x = 2; /* initially */ y0 eventually ends up = 2 y1 eventually

More information

Processor Architecture

Processor Architecture Processor Architecture Shared Memory Multiprocessors M. Schölzel The Coherence Problem s may contain local copies of the same memory address without proper coordination they work independently on their

More information

Lecture-22 (Cache Coherence Protocols) CS422-Spring

Lecture-22 (Cache Coherence Protocols) CS422-Spring Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK Single Core Core 0 Private L1 Cache Bus (Packet Scheduling) Private L2 DRAM CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2 Multicore

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014

Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014 Cache Coherence Introduction to High Performance Computing Systems (CS1645) Esteban Meneses Spring, 2014 Supercomputer Galore Starting around 1983, the number of companies building supercomputers exploded:

More information

Potential violations of Serializability: Example 1

Potential violations of Serializability: Example 1 CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System

More information

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico

More information

Cache Coherence in Scalable Machines

Cache Coherence in Scalable Machines ache oherence in Scalable Machines SE 661 arallel and Vector Architectures rof. Muhamed Mudawar omputer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor

More information

Simulating ocean currents

Simulating ocean currents Simulating ocean currents We will study a parallel application that simulates ocean currents. Goal: Simulate the motion of water currents in the ocean. Important to climate modeling. Motion depends on

More information

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization Lecture 25: Multiprocessors Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 Snooping-Based Protocols Three states for a block: invalid,

More information

Scalable Cache Coherence

Scalable Cache Coherence arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy

More information

Advanced OpenMP. Lecture 3: Cache Coherency

Advanced OpenMP. Lecture 3: Cache Coherency Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable

More information

Using Relaxed Consistency Models

Using Relaxed Consistency Models Using Relaxed Consistency Models CS&G discuss relaxed consistency models from two standpoints. The system specification, which tells how a consistency model works and what guarantees of ordering it provides.

More information

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based) Lecture 8: Snooping and Directory Protocols Topics: split-transaction implementation details, directory implementations (memory- and cache-based) 1 Split Transaction Bus So far, we have assumed that a

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence CS252 Spring 2017 Graduate Computer Architecture Lecture 12: Cache Coherence Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time in Lecture 11 Memory Systems DRAM

More information

East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 3 for Fall Semester, 2005

East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 3 for Fall Semester, 2005 Points missed: Student's Name: Total score: /100 points East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 3 for Fall Semester, 2005 Section

More information

Incoherent each cache copy behaves as an individual copy, instead of as the same memory location.

Incoherent each cache copy behaves as an individual copy, instead of as the same memory location. Cache Coherence This lesson discusses the problems and solutions for coherence. Different coherence protocols are discussed, including: MSI, MOSI, MOESI, and Directory. Each has advantages and disadvantages

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

Relaxed Memory Consistency

Relaxed Memory Consistency Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information