CS315A Midterm Solutions


K. Olukotun, Spring 05/06, Handout #14, CS315a

CS315A Midterm Solutions

Open book, open notes, calculator okay; NO computer. (Total time = 120 minutes)

Name (please print): Solutions

I agree to abide by the terms of the Honor Code while taking this exam. Signature:

Points: 1. (30)  2. (20)  3. (20)  4. (30)  Total: (100)

Make sure you state all your assumptions.

SCPD Students: Please attach a routing slip to the midterm.

Question 1. Short Answer (30 pts)

1a. (6 pts) Give an example of a simple parallel for loop that would execute faster with interleaved static scheduling than by giving each task a single contiguous chunk of iterations. Your example should not have nested loops.

Use a loop where iteration i+p depends on iteration i, where p is the number of processors:

Iterations: 1 2 3 4 5 6 ... (imagine dependency arrows from iteration i to iteration i+p)

If p = 3 and blocking took chunks of three, the algorithm would be serialized, with each processor waiting for the others to complete. But perfect load balancing can be achieved if interleaving is used. (A code sketch follows 1c below.)

3 points for an answer based on a variable amount of work per iteration: you may still have serious load-balancing problems without dynamic scheduling (notice the problem says "static").

1b. (6 pts) Consider a parallel make program which compiles and links a single library: it computes the dependency graph, performs the compilation of each needed object file on a different processor, and finally links in one sequential step. What kind of data partitioning is this? (output, input, output & input, etc.)

Input. There is only one output (the library) and many inputs, each of which needs to be processed individually.

1c. (6 pts) How would a shared-memory parallel system optimized for TPC-C differ from one optimized for SPLASH-2 applications?

TPC-C: little ILP, so simple pipelines; FP not important; large L1 caches; large L2 caches to hold shared metadata; optimize the latency of dirty misses.

SPLASH-2: lots of ILP and FP, so complex pipelines; small L1 caches suffice to capture the important working sets; L2 size not that important; little sharing, so communication misses are not that important.
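To make 1a concrete, here is a minimal sketch (not from the exam; the array name, loop bound, and thread count are illustrative) in C with OpenMP. With schedule(static, 1), thread t executes iterations t, t+P, t+2P, ... in order, so each dependency chain stays on a single thread; a contiguous-chunk schedule would make every chunk wait on its predecessor (or, without added synchronization, simply race).

    #include <omp.h>

    #define N 1024
    #define P 4            /* number of processors/threads (assumed) */

    /* Each iteration i reads the value produced by iteration i - P.
     * Under interleaved (cyclic) static scheduling, iterations i and
     * i - P run on the same thread, in order, so no thread waits. */
    void chained_update(double a[N]) {
        #pragma omp parallel for num_threads(P) schedule(static, 1)
        for (int i = P; i < N; i++)
            a[i] = a[i] + a[i - P];
    }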

1d. (6 pts) Why does the coherence miss rate of Barnes, unlike Ocean, decrease with increasing cache size? (Hint: coherence misses are measured as transitions into the M state with the MSI protocol described in class.)

The reduction in coherence miss rate happens because these are really capacity misses to non-shared data that masquerade as coherence misses: the MSI protocol counts S->M transitions as coherence misses even when the data is not shared. The real coherence miss rate shows up at cache sizes of 128-256 KB. Ocean does not do much writing of local data, so it does not exhibit this effect. Be careful how you justify your answers; some people said erroneous things.

1e. (6 pts) The Sun Enterprise 5000 does not use an upgrade transaction. However, it does avoid transferring data on a BusRdX transaction if it turns out that the issuing cache is the only one that has the data. The idea is a simple extension of the way that memory is waved off on a BusRd when one of the caches has the block in M. Using our three wired-OR signals (shared, dirty, inhibit) for reporting the status of relevant blocks, explain how to get the desired behavior. (Hint: the issuing cache snoops its own tags during the transaction.)

A BusRdX is issued by the processor on a write. The line can be in the I or S state. If the line is in the I state, the cache gets the data from memory or from another cache as usual. If the line is in the S state, the issuing processor asserts inhibit and checks whether any other processor raises the shared line. If not, it raises the dirty line to hold off memory and then lowers the inhibit line, so no data is transferred.
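A rough model of that wave-off sequence, as a hedged sketch (the struct and function names are assumptions, not exam material):

    #include <stdbool.h>

    typedef struct { bool shared, dirty, inhibit; } WiredOrLines;

    /* Called by the issuing cache while its own BusRdX is on the bus,
     * after snooping its own tags.  Returns true if memory was waved
     * off, i.e. the issuing cache already holds the only valid copy. */
    bool busrdx_wave_off(WiredOrLines *bus, bool line_in_S) {
        if (!line_in_S)             /* line in I: take data as usual    */
            return false;
        bus->inhibit = true;        /* hold everyone off while checking */
        if (!bus->shared) {         /* no other cache raised "shared"   */
            bus->dirty = true;      /* wave off memory: no data needed  */
            bus->inhibit = false;
            return true;
        }
        bus->inhibit = false;       /* another copy exists: transfer    */
        return false;
    }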

Question 2: Cache Coherence Protocols (20 pts)

Add the transient states to the Dragon transition diagram so that the protocol works even though requesting the bus and acquiring the bus is not an atomic operation.

[Figure: Dragon state-transition diagram extended with transient states. Legible labels from the original diagram: PrRdMiss/BusReq, PrWrMiss/BusReq, PrWr/BusReq, BusRd(S), BusRd(!S), BusUpd(S), BusUpd(!S), BusUpd/Update, BusUpd/--, and a transient state X reached via a "goto X" edge. Each processor-initiated action that needs the bus first issues BusReq and waits in a transient state; while waiting, the line continues to snoop BusRd and BusUpd transactions, which can change the stable state the pending request finally lands in.]
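As a minimal sketch of what such transient states might look like in an implementation (all names below are assumptions, not the exam's), each stalled bus request gets its own state so the snooper can still react to traffic that arrives before the bus grant:

    /* Hedged sketch: Dragon line states plus transient "bus requested
     * but not yet granted" states.  A line in a transient state keeps
     * snooping; an observed BusRd or BusUpd can redirect where the
     * pending request ends up once the bus is finally acquired. */
    typedef enum {
        /* stable Dragon states */
        INVALID, EXCLUSIVE, SHARED_CLEAN, SHARED_MODIFIED, MODIFIED,
        /* transient states: BusReq issued, bus not yet acquired */
        WAIT_BUSRD,    /* PrRdMiss/PrWrMiss waiting to place a BusRd  */
        WAIT_BUSUPD    /* PrWr in a shared state waiting for a BusUpd */
    } DragonLineState;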

Question 3: Cache Coherence Analysis (20 pts)

3a. (10 pts) Compute the traffic per CPU using the transition frequencies shown in the figure below. All frequencies are the number of transitions per 1000 data references. Assume the following:

- 1 GIPS processor
- 30% of instructions are loads and stores
- 6 bytes for an upgrade/invalidate
- 70 bytes for a cache block read or write

[Figure: MSI state-transition diagram (states M, S, I) annotated with per-1000-reference transition frequencies; the same frequencies appear in the table below.]

There are no upgrades or invalidates in MSI: all bus transactions read or write an entire line.

Transition        Transitions per 1000 refs   Bytes on bus
I->S              2                           70
I->M              0.1                         70
S->S              430                         0
S->M              4                           70
S->I              2                           0
M->M              550                         0
M->S              2                           70
M->I              0.3                         70
Replace/BusWB     3.2                         70
PrWr/BusRdX       4                           70
PrRd/BusRd        3                           70

Bytes per reference: 70 * (3.2 + 4 + 0.1 + 2 + 4 + 3 + 0.3 + 2)/1000 = 1.302
References per second: 0.3 * 1e9
Bytes per second: 0.3 * 1e9 * 1.302 = 390.6 x 1e6 B/s (372.5 MB/s, with 1 MB = 2^20 bytes)
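As a throwaway sanity check on the 3a arithmetic (not part of the exam), the same computation in C:

    #include <stdio.h>

    int main(void) {
        /* per-1000-reference frequencies of the transitions that move
         * a 70-byte line: Replace/BusWB, PrWr/BusRdX, I->M, I->S,
         * S->M, PrRd/BusRd, M->I, M->S */
        double line_moves = 3.2 + 4 + 0.1 + 2 + 4 + 3 + 0.3 + 2;  /* 18.6  */
        double bytes_per_ref = 70.0 * line_moves / 1000.0;        /* 1.302 */
        double refs_per_sec  = 0.30 * 1e9;  /* 1 GIPS, 30% loads/stores */
        printf("%.3f bytes/ref -> %.1f x 1e6 B/s\n",
               bytes_per_ref, bytes_per_ref * refs_per_sec / 1e6); /* 390.6 */
        return 0;
    }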

3b. (5 pts) If the S->M transition, which causes a bus read exclusive, is replaced with a bus invalidate, what is the bus traffic per processor?

Transition        Transitions per 1000 refs   Bytes on bus
I->S              2                           70
I->M              0.1                         70
S->S              430                         0
S->M              4                           6
S->I              2                           0
M->M              550                         0
M->S              2                           70
M->I              0.3                         70
Replace/BusWB     3.2                         70
PrWr/BusRdX       4                           70
PrRd/BusRd        3                           70

Bytes per reference: 70 * (3.2 + 4 + 0.1 + 2 + 3 + 0.3 + 2)/1000 + 6 * 4/1000 = 1.046
References per second: 0.3 * 1e9
Bytes per second: 0.3 * 1e9 * 1.046 = 313.8 x 1e6 B/s (299.3 MB/s, with 1 MB = 2^20 bytes)

3c. (5 pts) If the MSI protocol of Question 3b is replaced with a MESI protocol, and the transition frequency between the E and M states is 2 per 1000 data references, what is the bus traffic per processor?

The S->M transitions of the MSI protocol contain all of the S->M and E->M transitions of the MESI protocol. So, since E->M = 2 in MESI, S->M must equal 4 - 2 = 2 (the old S->M value minus the E->M value).

Transition        Transitions per 1000 refs   Bytes on bus
I->S              2                           70
I->M              0.1                         70
S->S              430                         0
S->M              2                           6
S->I              2                           0
E->M              2                           0
I->E              ??                          70
E->S              ??                          0
M->M              550                         0
M->S              2                           70
M->I              0.3                         70
Replace/BusWB     3.2                         70
PrWr/BusRdX       4                           70
PrRd/BusRd        3                           70

The number of I->E transitions is a subset of the I->S transitions. We can't find the exact number, but it doesn't matter, since I->E and I->S put the same number of bytes on the bus.

Bytes per reference: 70 * (3.2 + 4 + 0.1 + 2 + 3 + 0.3 + 2)/1000 + 6 * 2/1000 = 1.034
References per second: 0.3 * 1e9
Bytes per second: 0.3 * 1e9 * 1.034 = 310.2 x 1e6 B/s (295.8 MB/s, with 1 MB = 2^20 bytes)
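The same sketch extended to 3b and 3c (again illustrative only): the only change between the cases is how many S->M upgrades pay 70 bytes, 6 bytes, or nothing.

    #include <stdio.h>

    int main(void) {
        double refs = 0.30 * 1e9;                    /* references/second */
        /* 70-byte transitions common to 3b and 3c (S->M excluded) */
        double base = 70.0 * (3.2 + 4 + 0.1 + 2 + 3 + 0.3 + 2) / 1000.0;

        double b3b = base + 6.0 * 4 / 1000.0; /* 4 upgrades at 6 B: 1.046 */
        double b3c = base + 6.0 * 2 / 1000.0; /* 2 upgrades; E->M is
                                                 a silent transition: 1.034 */

        printf("3b: %.1f x 1e6 B/s\n", b3b * refs / 1e6);   /* 313.8 */
        printf("3c: %.1f x 1e6 B/s\n", b3c * refs / 1e6);   /* 310.2 */
        return 0;
    }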

Question 4: Parallel Programming (30 pts)

An incarnation of the 8-puzzle problem is displayed below: there are eight numbered tiles, and the solution is a set of steps transforming the initial state to the goal state. A step consists of sliding one of the tiles into the empty slot. One simple way to find the solution with the fewest steps is to perform a breadth-first search on the tree of successors from the initial state.

[Figure: the initial state, the goal state, and the successor states reachable from the initial state in one slide.]

4a. (25 pts) Write a parallel algorithm for p processors that finds the first goal state in the successor tree. Compute the speedup for your algorithm using the timing information below. Assume the solution is at depth d and each state has an average of s successors. Do not worry about repeated states. Assume any reasonable shared-memory parallel programming model; in other words, you don't have to show accurate calls to pthreads/OpenMP. Just use pseudocode so we know when threads begin/end, along with any synchronization (locks, barriers, etc.). For timing analysis, assume perfect caches. Use the following functions with the corresponding timing information.

Function                                                    Time
BARRIER()                                                   t_B *
LOCK(Lock l) / UNLOCK(Lock l)                               t_L
void enqueue(queue q, State/States s)                       t_E **
State dequeue(queue q)                                      t_D **
States successors(State s)  (returns all successors of s)   t_S
Comparing two states (use ==)                               t_C

*  Time that the last-arriving processor is delayed.
** Queue access is not atomic; may require locks.

Make sure that when a thread finds the solution, it terminates the search across the whole system.

This code ended up being a little harder than I originally thought, because I didn't think through the "find the first solution" requirement completely. To answer the problem correctly, you need to prevent one processor from getting ahead of the others. Here is a correct solution; notice the extension of enqueue/dequeue to carry an additional int. You could do it in other ways, as people demonstrated. Basically, all processors must reach a barrier once they dequeue a state in the next level of the tree: this ensures that each level gets completely searched before moving on to the next. There is some additional code you would need to deal with the first few iterations, since one barrier would not be enough. 15 pts for code, 10 pts for analysis.

    void findSolution(State init, State goal) {
        done = false;
        enqueue(q, successors(init), 0);
        parallel {
            int lastLevel = 0;
            int thisLevel;
            while (!done) {
                lock(ql);
                (next, thisLevel) = dequeue(q);
                enqueue(q, successors(next), thisLevel + 1);
                unlock(ql);
                if (thisLevel != lastLevel) {
                    BARRIER();        // wait until everyone reaches thisLevel
                    lastLevel = thisLevel;
                }
                if (next == goal) {
                    done = true;
                    break;
                }
            }
        }
    }

Analysis: There are O(s^d) nodes which must be examined before the solution is found.

Non-parallelizable part: t_S + t_E

Sequential runtime: t_S + t_E + O(s^d) * (t_D + t_E + t_S + t_C)

That is just the non-parallelizable part plus, for each iteration of a simple loop, dequeuing the next state, enqueuing its successors, and comparing it with the goal.

Parallel runtime: t_S + t_E + O(s^d)/p * (2*t_L + t_D + t_E + t_S + t_C) + d*t_B

For each parallel iteration there is a lock/unlock pair, a dequeue, an enqueue of the successors, and a compare; then there is one barrier per level of the tree until the solution is found. The speedup is just the sequential runtime over the parallel runtime.
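Written out explicitly, that ratio is:

    S(p) = \frac{t_S + t_E + O(s^d)\,(t_D + t_E + t_S + t_C)}
                {t_S + t_E + \frac{O(s^d)}{p}\,(2t_L + t_D + t_E + t_S + t_C) + d\,t_B}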

4.b (5 pts) Short answer: In the 8-puzzle program, what would constitute blocking/chunking? (Hint: think about how you choose what work to do next.)

Blocking would be dequeuing more than one state per iteration, thus amortizing the cost of dequeuing states and enqueuing their successors.