EECS 470. Lecture 17 Multiprocessors I. Fall 2018 Jon Beaumont


Lecture 17 Multiprocessors I Fall 2018 Jon Beaumont www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.
Slide 1

Announcements
Milestone 3 due Friday (11/30): you should be passing many non-trivial benchmarks by now.
Send a 1-page status update as usual; indicate if you would like to meet.
Slide 2

Roadmap
Goal: speed up programs. First 2 months: reduce instruction latency and the number of instructions via instruction-level parallelism, and reduce average memory latency via caching, while maintaining precise state, virtual memory, and power efficiency.
Now: parallelize, exploiting thread-level parallelism with multiprocessors, with programmability as a central concern.
Slide 3

Multiprocessors Slide 4

Spectrum of Parallelism
Bit-level -> pipelining -> ILP -> multithreading -> multiprocessing -> distributed (roughly the EECS 370 -> EECS 470 -> EECS 570 -> EECS 591 territory)
Why multiprocessing? Desire for performance; the techniques from 370/470 are difficult to scale further.
Slide 5

Thread-Level Parallelism

    struct acct_t { int bal; };
    shared struct acct_t accts[max_acct];
    int id, amt;
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;
        spew_cash();
    }

    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash

Thread-level parallelism (TLP): a collection of asynchronous tasks, not started and stopped together, with data shared loosely and dynamically. Example: a database/web server (each query is a thread).
accts is shared, so it can't be register allocated even if it were scalar; id and amt are private variables, register allocated to r1 and r2.
Slide 6
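To make the sharing concrete, here is a minimal pthreads sketch (the harness, thread count, and the make_withdrawals name are illustrative additions, not from the slides). Because accts is shared and the check-then-debit spans several instructions, two threads can both pass the balance check before either stores the new balance:

    #include <pthread.h>
    #include <stdio.h>

    #define MAX_ACCT 4
    #define NTHREADS 2

    struct acct_t { int bal; };
    struct acct_t accts[MAX_ACCT];      /* shared by all threads */

    void *make_withdrawals(void *arg) {
        int id = 0, amt = 10;           /* private; register-allocatable */
        if (accts[id].bal >= amt) {     /* load, compare ...              */
            accts[id].bal -= amt;       /* ... subtract, store: a racy    */
            /* spew_cash(); */          /* read-modify-write sequence     */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        accts[0].bal = 15;              /* only enough for one withdrawal */
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, make_withdrawals, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("final balance: %d\n", accts[0].bal);  /* 5, or -5 if both passed the check */
        return 0;
    }

Compile with -pthread; the bad interleaving is rare but legal, which is exactly why TLP needs explicit synchronization.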

Thread-Level Parallelism
TLP is coarser than instruction-level parallelism: it is usually exploited over thousands of instructions, not tens to hundreds.
There is more independence between threads than within a thread: fewer data dependences in TLP than in ILP, and fewer guarantees about how instructions execute relative to one another across threads than within one.
Instructions in a thread are guaranteed to execute (as if they were done) sequentially; ordering across threads is guaranteed only if the programmer explicitly indicates so.
Slide 7

Shared-Memory Multiprocessors
Multiple execution contexts share a single address space: multiple programs (MIMD), or more frequently multiple copies of one program (SPMD).
Communication is implicit (automatic), via loads and stores.
Theoretical foundation: the PRAM model.
[Diagram: P1, P2, P3, P4 all connected to one memory system]
Slide 8

Why Shared Memory?
Pluses: to applications it looks like a multitasking uniprocessor; the OS needs only evolutionary extensions; communication happens without the OS; software can worry about correctness first, then performance.
Minuses: proper synchronization is complex; communication is implicit, so it is harder to optimize; hardware designers must implement it.
Result: traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever, and the first with multi-billion-dollar markets.
Slide 9

Shared Memory
[Diagram: each node pairs a CPU and cache with a memory (Mem) and a router (R)]
Router logic figures out where transactions get sent.
Slide 10

Paired vs. Separate Processor/Memory?
Separate processor/memory: uniform memory access (UMA), equal latency to all memory. + Simple software: it doesn't matter where you put data. - Lower peak performance. Bus-based UMAs are common: symmetric multiprocessors (SMPs).
Paired processor/memory: non-uniform memory access (NUMA), faster access to local memory. - More complex software: where you put data matters. + Higher peak performance, assuming proper data placement.
Slide 11

Shared vs. Point-to-Point Networks
Shared network, e.g. a bus: + low latency; - low bandwidth, doesn't scale beyond ~16 processors; + the shared medium simplifies cache coherence protocols (later).
Point-to-point network, e.g. a mesh or ring: - longer latency, may need multiple hops to communicate; + higher bandwidth, scales to 1000s of processors; - cache coherence protocols are complex.
Slide 12

Organizing Point-To-Point Networks
Network topology: the organization of the network; trades off performance (connectivity, latency, bandwidth) against cost.
Router chips: networks that require separate router chips are indirect; networks that integrate the router into the processor/memory package are direct. + Fewer components: "glueless MP".
Point-to-point network examples: an indirect tree, or a direct mesh or ring.
Slide 13

Implementation #1: Snooping Bus MP
Two basic implementations. The first: bus-based systems, typically small, 2-8 (maybe 16) processors, typically with processors split from memories (UMA); sometimes multiple processors sit on a single chip (CMP). These are symmetric multiprocessors (SMPs): common, I use one every day.
Slide 14

Implementation #2: Scalable MP
General point-to-point network-based systems, typically built from processor/memory/router blocks (NUMA). Glueless MP: no need for additional "glue" chips.
Can be arbitrarily large: 1000s of processors, as in massively parallel processors (MPPs). In reality only the government (DoD) has MPPs; companies have much smaller systems of 32-64 processors, "scalable multiprocessors".
Slide 15

Issues for Shared Memory Systems
Two in particular: cache coherence, and the memory consistency model. The two are closely related, with different solutions for SMPs and MPPs.
Slide 16

Cache Coherence: The Problem
[Diagram: P1 and P2 with write-back caches on a shared bus to main memory; A = 0 everywhere initially]
t1: P1 stores A=1, updating only its own (write-back) cache: P1 now caches A=1 while P2's cache and main memory still hold A=0.
t2: P2 loads A. Which value should it see?
Need to do something to keep P2's cache coherent.
Slide 17

Approaches to Cache Coherence
Software-based solutions. Mechanisms: mark cache blocks/memory pages as cacheable or non-cacheable, and add Flush and Invalidate instructions. (When is each of these needed?) Could be done by the compiler or the run-time system, but difficult to get perfect (e.g., what about memory aliasing?). See the sketch below.
Hardware solutions are far more common. Simple schemes rely on broadcast over a bus.
Slide 19
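As a sketch of the software approach, suppose the toolchain exposed hypothetical cache_flush and cache_invalidate intrinsics (illustrative names, not a real API; real ISAs provide analogous cache-management instructions). A producer flushes after writing and a consumer invalidates before reading:

    /* Hypothetical intrinsics -- illustrative, not a real API. */
    extern void cache_flush(void *addr);       /* write block back to memory */
    extern void cache_invalidate(void *addr);  /* discard the local copy     */

    int shared_buf;  /* shared data; hardware does NOT keep caches coherent */

    void producer(void) {
        shared_buf = 42;
        cache_flush(&shared_buf);        /* push the dirty block to memory   */
    }

    void consumer(void) {
        cache_invalidate(&shared_buf);   /* force the next load to miss      */
        int v = shared_buf;              /* re-fetches the value from memory */
        (void)v;
    }

The hard part is that the compiler or runtime must do this for every alias of shared_buf, which is one reason hardware coherence is far more common.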

Simple Write-Through Scheme: Valid-Invalid Coherence
[Diagram: P1 and P2 with write-through, no-write-allocate caches on a shared bus]
t1: P1 stores A=1; its cached copy goes from [V]:0 to [V]:1. t2: the write goes on the bus (BusWr A=1), updating main memory from 0 to 1. t3: P2 snoops the BusWr and invalidates its copy, [V]:0 -> [I].
Valid-Invalid coherence allows multiple readers, but every write must go through to the bus (write-through, no-write-allocate cache). All caches must monitor (aka "snoop") all bus traffic: a simple state machine for each cache frame.
Slide 20

Valid-Invalid Snooping Protocol
Actions: from the processor, Ld and St; from the bus, BusRd and BusWr. Write-through, no-write-allocate cache; 1 bit of storage overhead per cache frame.
Transitions:
Invalid: Load / BusRd -> Valid; Store / BusWr (stays Invalid: no-write-allocate)
Valid: Load / -- (hit); Store / BusWr (write through, stays Valid); snooped BusWr -> Invalid
Slide 21
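A compact C sketch of this one-bit state machine (the encoding and function names are mine, not the lecture's): one handler for processor ops, one for snooped bus ops.

    typedef enum { INVALID, VALID } vi_state_t;
    typedef enum { LD, ST } proc_op_t;
    typedef enum { BUS_RD, BUS_WR } bus_op_t;

    /* Processor side: write-through, no-write-allocate. */
    vi_state_t vi_proc(vi_state_t s, proc_op_t op) {
        if (op == LD)
            return VALID;  /* a miss issues BusRd and fills; a hit stays Valid */
        /* ST: issue BusWr (write through); no-write-allocate, state unchanged */
        return s;
    }

    /* Snoop side: any other cache's write invalidates our copy. */
    vi_state_t vi_snoop(vi_state_t s, bus_op_t op) {
        return (op == BUS_WR) ? INVALID : s;
    }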

Supporting Write-Back Caches
Write-back caches drastically reduce bus write bandwidth.
Key idea: add a notion of ownership to Valid-Invalid. Mutual exclusion: when an owner has the only replica of a cache block, it may update it freely. Sharing: multiple readers are OK, but they may not write without gaining ownership.
Need to find which cache (if any) is an owner on read misses, and need to eventually update memory so writes are not lost.
Slide 22

Modified-Shared-Invalid (MSI) Protocol
Three states tracked per block at each cache:
Invalid: the cache does not have a copy.
Shared: the cache has a read-only copy; clean (clean == memory is up to date).
Modified: the cache has the only copy; writable; dirty (dirty == memory is out of date).
Three processor actions: Load, Store, Evict.
Five bus messages: BusRd, BusRdX, BusInv, BusWB, BusReply (some of these could be combined).
Slide 23

Modified-Shared-Invalid (MSI) Protocol: Example Walkthrough
(A = 0 in memory; both caches start Invalid.)
1: P1 Load A. 2: BusRd A goes on the bus. 3: Memory sends BusReply A=0. P1's copy goes [I] -> [S]:0, exercising Invalid --Load / BusRd--> Shared.
Slide 24

1: P2 Load A. 2: BusRd A. 3: BusReply A=0. P2's copy goes [I] -> [S]:0 while P1 stays [S]:0, exercising Shared: Load / -- and snooped BusRd / [BusReply].
Slide 25

P2 evicts A: a clean Shared copy is dropped silently, [S]:0 -> [I], exercising Shared --Evict / --> Invalid.
Slide 26

1: P2 Store A. 2: BusRdX A. 3: BusReply A=0. P2 gains ownership and writes: [I] -> [M], value 0 -> 1. P1 snoops the BusRdX and goes [S]:0 -> [I], exercising Invalid --Store / BusRdX--> Modified and Shared --snooped BusRdX / [BusReply]--> Invalid.
Slide 27

1: P1 Load A. 2: BusRd A. 3: P2, the owner, sends BusReply A=1 and downgrades [M]:1 -> [S]:1. 4: Memory snarfs the reply, updating A from 0 to 1. P1 goes [I] -> [S]:1, exercising Modified --snooped BusRd / BusReply--> Shared.
Slide 28

1: P1 Store A. 2: BusInv A (aka an upgrade: P1 already has the data, it only needs ownership). P1 goes [S]:1 -> [M]:2; P2 snoops the BusInv and goes [S]:1 -> [I], exercising Shared --Store / BusInv--> Modified.
Slide 29

1: P2 Store A. 2: BusRdX A. 3: P1, the owner, sends BusReply and goes [M]:2 -> [I]. P2 goes [I] -> [M]:3, exercising Modified --snooped BusRdX / BusReply--> Invalid.
Slide 30

1: P2 evicts A. 2: BusWB A writes the dirty block back: memory's copy of A goes from 1 to 3, and P2 goes [M]:3 -> [I], exercising Modified --Evict / BusWB--> Invalid.
Slide 31

MSI Protocol Summary
Cache actions: Load, Store, Evict. Bus actions: BusRd, BusRdX, BusInv, BusWB, BusReply.
Transitions:
Invalid: Load / BusRd -> Shared; Store / BusRdX -> Modified
Shared: Load / -- ; Store / BusInv -> Modified; Evict / -- -> Invalid; snooped BusRd / [BusReply]; snooped BusRdX or BusInv / [BusReply] -> Invalid
Modified: Load, Store / -- ; Evict / BusWB -> Invalid; snooped BusRd / BusReply -> Shared; snooped BusRdX / BusReply -> Invalid
Slide 32
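Collecting the walkthrough into code, here is a hedged C sketch of the per-block MSI controller (my encoding; a real controller also moves the data and arbitrates for the bus):

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum { LOAD, STORE, EVICT } cpu_op_t;
    typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_INV, BUS_WB } bus_msg_t;

    /* Processor-side transitions; *send is the message to put on the bus. */
    msi_state_t msi_cpu(msi_state_t s, cpu_op_t op, bus_msg_t *send) {
        *send = BUS_NONE;
        switch (op) {
        case LOAD:
            if (s == MSI_I) { *send = BUS_RD; return MSI_S; }   /* I -> S   */
            return s;                                  /* S/M: cache hit    */
        case STORE:
            if (s == MSI_I) { *send = BUS_RDX; return MSI_M; }  /* I -> M   */
            if (s == MSI_S) { *send = BUS_INV; return MSI_M; }  /* upgrade  */
            return MSI_M;                              /* M: hit            */
        case EVICT:
            if (s == MSI_M) *send = BUS_WB;            /* dirty: write back */
            return MSI_I;                              /* S: silent drop    */
        }
        return s;
    }

    /* Snoop-side transitions; *reply is set when this cache owns the data. */
    msi_state_t msi_snoop(msi_state_t s, bus_msg_t msg, int *reply) {
        *reply = (s == MSI_M);               /* only the owner must respond */
        if (s == MSI_I) return MSI_I;
        if (msg == BUS_RD) return MSI_S;     /* M -> S (memory snarfs); S stays */
        if (msg == BUS_RDX || msg == BUS_INV) return MSI_I;
        return s;                            /* BUS_WB etc.: no change      */
    }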

Coherence vs. Consistency
Intuition says loads should return the latest value, but what is "latest"? Coherence concerns only one memory location; consistency concerns the apparent ordering of all locations.
A memory system is coherent if we can serialize all operations to a given location such that operations performed by any processor appear in program order (program order = the order defined by the program text or assembly code), and the value returned by a read is the value written by the last store to that location.
Slide 33

Why Coherence != Consistency

    /* initial A = B = flag = 0 */
    P1:                  P2:
    A = 1;               while (flag == 0); /* spin */
    B = 1;               print A;
    flag = 1;            print B;

Intuition says the printed A = B = 1. Coherence doesn't say anything about this. Why? Your uniprocessor ordering mechanisms (the load/store queue) won't help either. Why?
Slide 34
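A runnable C11 rendering of this example (a sketch; the thread names are mine). Making flag an atomic with the default seq_cst ordering gives the intuitive outcome; with a plain int flag, the compiler and the memory system are both free to reorder, and printing A = 0 is legal:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    int A = 0, B = 0;
    atomic_int flag = 0;    /* seq_cst by default: orders A and B around it */

    void *p1(void *arg) {
        A = 1;
        B = 1;
        atomic_store(&flag, 1);   /* publishes everything written before it */
        return NULL;
    }

    void *p2(void *arg) {
        while (atomic_load(&flag) == 0)
            ;                          /* spin */
        printf("A=%d B=%d\n", A, B);   /* guaranteed 1 1 with the atomic flag */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }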

Sequential Consistency (SC)
Processors issue memory ops in program order. Conceptually, a switch connects one processor at a time to memory and is randomly reset after each memory op, providing a single sequential order among all operations.
[Diagram: P1, P2, P3 feeding through a switch into one memory]
Slide 35

Sufficient Conditions for SC
"A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." -Lamport, 1979
Every processor issues memory ops in program order, and memory ops happen (start and end) atomically: a processor must wait for a store to complete before issuing its next memory op, and after a load, the issuing processor waits for the load to complete before issuing its next op.
Easily implemented with a shared bus.
Slide 36

Dekker's Algorithm
Mutually exclusive access to a critical region. Works as advertised under sequential consistency.

    /* initial A = B = 0 */
    P1:                              P2:
    A = 1;                           B = 1;
    if (B != 0) goto retry;          if (A != 0) goto retry;
    /* enter critical section */     /* enter critical section */

Slide 37
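A hedged C11 sketch of this simplified entry protocol (like the slide, it omits full Dekker's turn variable, so it can livelock, but it can never admit both threads). seq_cst atomics model the SC machine the slide assumes:

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;   /* seq_cst accesses model an SC machine */

    void p1_enter(void) {
    retry:
        atomic_store(&A, 1);
        if (atomic_load(&B) != 0) {   /* someone else is trying too        */
            atomic_store(&A, 0);      /* back off and retry (full Dekker   */
            goto retry;               /* adds a turn variable to avoid     */
        }                             /* livelock)                         */
        /* critical section */
    }

    void p2_enter(void) {             /* symmetric, with A and B swapped   */
    retry:
        atomic_store(&B, 1);
        if (atomic_load(&A) != 0) {
            atomic_store(&B, 0);
            goto retry;
        }
        /* critical section */
    }

Under SC, if both threads reached the critical section, each would have read 0 after the other's store of 1, which no single interleaving of the two programs allows.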

Problems with the SC Memory Model
Difficult to implement efficiently in hardware: straightforward implementations allow no concurrency among memory accesses and strictly order memory accesses at each node, essentially precluding out-of-order CPUs.
Unnecessarily restrictive: most parallel programs won't notice out-of-order accesses.
Conflicts with latency-hiding techniques.
Slide 38

E.g., Add a Store Buffer
Allow reads to bypass incomplete writes; reads search the store buffer for matching values. This hides all the latency of store misses in uniprocessors, but...
Slide 39

Dekker's Algorithm w/ Store Buffer
[Diagram: P1 and P2, each with a store buffer, on a shared bus; A = 0 and B = 0 in memory]
P1 runs A = 1; if (B != 0) goto retry; and P2 runs B = 1; if (A != 0) goto retry. With store buffers, each write sits buffered while the following read goes to the bus: P1 reads B (t1) before its write of A reaches memory (t3), and P2 reads A (t2) before its write of B does (t4). Both reads return 0, both processors enter the critical section, and mutual exclusion is broken.
Slide 40
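The standard repair is a full fence between the flag write and the flag read, forcing the buffered store to drain first. A hedged C11 sketch of P1's side (P2 mirrors it; on x86 the fence typically compiles to an mfence or a locked instruction):

    #include <stdatomic.h>

    extern atomic_int A, B;   /* the flags from the sketch above */

    void p1_enter_fenced(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  /* drain the store buffer:
                                                       the write of A becomes
                                                       visible before B is read */
        if (atomic_load_explicit(&B, memory_order_relaxed) != 0) {
            /* back off and retry as before */
        }
        /* critical section */
    }

With both threads fenced this way, at least one of them must see the other's flag, restoring mutual exclusion.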

Fixing SC Performance
Option 1: change the memory model (weak/relaxed consistency). The programmer specifies when order matters; other accesses happen concurrently/out of order. + Simple hardware can yield high performance. - The programmer must reason under counter-intuitive rules.
Option 2: speculatively ignore the ordering rules (in-window speculation & InvisiFence). Order matters only if re-orderings are observed, so ignore the rules and hope no one notices; this works because data races are rare. + The performance of relaxed consistency with the simple programming model. - More sophisticated hardware, and speculation can lead to pathological behavior.
One of the most esoteric (but important) topics in multiprocessors. More in EECS 570.
Slide 41

Multiprocessors Are Here To Stay
Moore's law is making the multiprocessor a commodity part: with 500M transistors on a chip, what do you do with all of them? There is not enough ILP to justify a huge uniprocessor, and really big caches hit diminishing returns: t_hit increases while the %_miss improvement shrinks.
Chip multiprocessors (CMPs): multiple full processors on a single chip. Example: IBM POWER4, two 1GHz processors, 1MB of L2, L3 tags.
Multiprocessors are a huge part of computer architecture: the multiprocessor equivalent of H+P is 50% bigger than H+P, and there is another entire course on multiprocessor architecture, hopefully offered every other year or so (not this year).
Slide 53

Multiprocessing & Power Consumption
Multiprocessing can be very power efficient. Recall dynamic voltage and frequency scaling: performance vs. power is NOT linear. Example: scaling Intel's Xscale from 1 GHz down to 200 MHz reduces the energy used by 30x.
Impact of parallel execution: what if we used 5 Xscales at 200 MHz? Similar performance to one 1 GHz Xscale, but 1/6th the energy: 5 cores * 1/30th = 1/6th. This assumes parallel speedup (a difficult task): remember Amdahl's law.
Slide 54
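The arithmetic behind the claim, as a hedged sketch (the 30x factor is the slide's Xscale measurement; exact numbers depend on how far voltage scales down with frequency):

    P_{dyn} \propto C V^2 f, \qquad E = P \cdot t
    \frac{E_{200\,\mathrm{MHz}}}{E_{1\,\mathrm{GHz}}} \approx \frac{1}{30} \quad \text{(measured on Xscale)}
    E_{5 \times 200\,\mathrm{MHz}} \approx 5 \cdot \tfrac{1}{30}\, E_{1\,\mathrm{GHz}} = \tfrac{1}{6}\, E_{1\,\mathrm{GHz}}
    \mathrm{perf}_{5 \times 200\,\mathrm{MHz}} \approx 5 \cdot \tfrac{200}{1000} = 1\times \quad \text{(assuming perfect parallel speedup)}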

Shared Memory Summary
Shared-memory multiprocessors: + simple software, easy data sharing, handles both DLP and TLP; - complex hardware, must provide the illusion of a global address space.
Two basic implementations:
Symmetric (UMA) multiprocessors (SMPs): underlying communication network is a bus (ordered). + Low latency, simple protocols that rely on global order. - Low bandwidth, poor scalability.
Scalable (NUMA) multiprocessors (MPPs): underlying communication network is point-to-point (unordered). + Scalable bandwidth. - Higher latency, complex protocols.
Slide 55

Shared Memory Summary
Two aspects to the global memory space illusion.
Coherence: a consistent view of individual cache lines. Absolute coherence is not needed; relative coherence is OK. VI and MSI protocols, plus the cache-to-cache transfer optimization. Implementation? SMP: snooping; MPP: directories.
Consistency: a consistent global view of all memory locations. Programmers intuitively expect sequential consistency (SC), a global interleaving of the individual processors' access streams. Not naturally provided by coherence; it needs extra stuff. Relaxed ordering: consistency only at synchronization points.
Slide 56