CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)

Copyright 2001 Mark D. Hill, University of Wisconsin-Madison. Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Outline
- Motivation
- Coherence
- Coherence Tradeoffs
- Memory Consistency
- Synchronization

What is (Hardware) Shared Memory?
- Take multiple (micro-)processors
- Implement a memory system with a single global physical address space (usually)
- Minimize memory latency (co-location & caches)
- Maximize memory bandwidth (parallelism & caches)

Some Memory System Options
[Figure: four organizations: (a) shared cache: processors reach an interleaved first-level cache through a switch; (b) bus-based shared memory: per-processor caches on a shared bus with memory and I/O devices; (c) dancehall: caches and memories on opposite sides of an interconnection network; (d) distributed memory: each node pairs a processor and cache with local memory on the network]

Why Shared Memory?
Pluses:
- For throughput applications, looks like a multitasking uniprocessor
- For the OS, only evolutionary extensions required
- Easy to do communication without the OS
- Software can worry about correctness first, then performance
Minuses:
- Proper synchronization is complex
- Communication is implicit, so harder to optimize
- Hardware designers must implement it
Result:
- Symmetric Multiprocessors (SMPs) are the most successful parallel machines ever
- And the first with multi-billion-dollar markets

In More Detail
- Efficient naming: virtual-to-physical translation using TLBs; ability to name relevant portions of objects
- Ease and efficiency of caching: caching is natural and well understood; can be done in HW automatically
- Communication overhead is low since protection is built into the memory system; easy for HW to packetize requests/replies
- Integration of latency tolerance: demand-driven (consistency models, prefetching, multithreading); can extend to push data to PEs and use bulk transfer

Symmetric Multiprocessors (SMP)
- Multiple (micro-)processors
- Each has a cache (today, a cache hierarchy)
- Connected by a logical bus (totally-ordered broadcast)
- Implement a Snooping Cache Coherence Protocol:
  - Broadcast all cache misses on the bus
  - All caches snoop the bus and may act
  - Memory responds otherwise

Cache Coherence Problem (Step 1)
- P1 executes ld r2, x: the block holding x is fetched from main memory into P1's cache

Cache Coherence Problem (Step 2)
- P2 also executes ld r2, x: a second copy of x is fetched from main memory into P2's cache

Cache Coherence Problem (Step 3)
- P2 executes add r1, r2, r4 and then st x, r1: P2's cached copy of x is updated, but P1's cached copy is now stale
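The three-step problem above can be modeled in a few lines of Python: two private caches with no coherence mechanism (the class, names, and the initial value of x are illustrative, not part of the slides).

```python
# Two private caches with no coherence mechanism: a store by one
# processor leaves a stale copy in the other.
memory = {"x": 5}

class Cache:
    def __init__(self):
        self.data = {}
    def load(self, addr):
        if addr not in self.data:      # miss: fetch from main memory
            self.data[addr] = memory[addr]
        return self.data[addr]         # hit: use the cached copy
    def store(self, addr, value):      # write-back, never propagated
        self.data[addr] = value

p1, p2 = Cache(), Cache()
p1.load("x")                           # step 1: P1 caches x == 5
p2.load("x")                           # step 2: P2 caches x == 5
p2.store("x", 6)                       # step 3: P2 writes x
print(p1.load("x"))                    # prints 5: P1's copy is stale
```

Each cache copy behaves as an individual copy instead of as the same memory location, which is exactly the incoherence the snooping protocols below are designed to prevent.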

Snoopy Cache-Coherence Protocols
- The bus provides a serialization point (more on this later)
- Each cache controller snoops all bus transactions; on a relevant transaction (one for a block it contains), it takes action to ensure coherence: invalidate, update, or supply the value, depending on the state of the block and the protocol
- Simultaneous operation of independent controllers

Snoopy Design Choices
- The controller updates the state of blocks in response to processor events (ld/st) and snoop events (observed bus transactions), and generates bus transactions
- Often have duplicate cache tags so the snooper and the processor can look up state concurrently
- A snoopy protocol is defined by: a set of states, a state-transition diagram, and actions
- Basic choices: write-through vs. write-back; invalidate vs. update

The Simple Invalidate Snooping Protocol
- Write-through, no-write-allocate cache
- Two states: Valid and Invalid
- Actions: PrRd, PrWr, BusRd, BusWr
- Transitions:
  - Invalid: PrRd / BusRd (go to Valid); PrWr / BusWr (stay Invalid, since writes do not allocate)
  - Valid: PrRd / --; PrWr / BusWr; a snooped BusWr invalidates the copy (go to Invalid)

A 3-State Write-Back Invalidation Protocol
- 2-state protocol: + simple hardware and protocol; - bandwidth (every write goes on the bus!)
- 3-state protocol (MSI):
  - Modified: one cache has the valid/latest copy; memory is stale
  - Shared: one or more caches have a valid copy
  - Invalid
- Must invalidate all other copies before entering the Modified state; this requires a bus transaction (to order and invalidate)

MSI Processor and Bus Actions
- Processor: PrRd, PrWr, writeback on replacement of a modified block
- Bus:
  - Bus Read (BusRd): read without intent to modify; data could come from memory or another cache
  - Bus Read-Exclusive (BusRdX): read with intent to modify; must invalidate all other caches' copies
  - Writeback (BusWB): the cache controller puts the block's contents on the bus and memory is updated
- Definition: a cache-to-cache transfer occurs when another cache satisfies a BusRd or BusRdX request
- Let's draw it!

MSI State Diagram
- M: PrRd / --; PrWr / --; snooped BusRd / BusWB (go to S); snooped BusRdX / BusWB (go to I)
- S: PrRd / --; snooped BusRd / --; PrWr / BusRdX (go to M); snooped BusRdX / -- (go to I)
- I: PrRd / BusRd (go to S); PrWr / BusRdX (go to M)

An Example

Proc Action   | P1 state | P2 state | P3 state | Bus Action | Data from
1. P1 read u  | S        | --       | --       | BusRd      | Memory
2. P3 read u  | S        | --       | S        | BusRd      | Memory
3. P3 write u | I        | --       | M        | BusRdX     | Memory or not
4. P1 read u  | S        | --       | S        | BusRd      | P3's cache
5. P2 read u  | S        | S        | S        | BusRd      | Memory

- Single-writer, multiple-reader protocol
- Why Modified to Shared (on step 4)?
- What if the block is not in any cache?
- A read followed by a write produces 2 bus transactions!

4-State (MESI) Invalidation Protocol
- Often called the Illinois protocol
- Modified (dirty)
- Exclusive (clean, unshared): the only copy, not dirty
- Shared
- Invalid
- Requires a shared signal to detect whether other caches have a copy of the block
- Uses a cache Flush for cache-to-cache transfers (only one cache can do it, though)
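A minimal, table-driven sketch of the MSI transitions (Python; the class and method names are illustrative) reproduces the five-step example above, including P3's Modified-to-Shared downgrade on step 4.

```python
# Each cache tracks one block's MSI state; a bus transaction is
# broadcast to every other cache's snooper before the requester
# changes its own state.  This is a sketch, not a full protocol
# (no data movement, no writeback on replacement).
M, S, I = "M", "S", "I"

class MSICache:
    def __init__(self, name):
        self.name, self.state = name, I

    def snoop(self, bus_op):
        # Snooper side: react to another cache's bus transaction.
        if bus_op == "BusRd" and self.state == M:
            self.state = S          # supply data (flush), downgrade to S
        elif bus_op == "BusRdX" and self.state != I:
            self.state = I          # invalidate our copy

    def read(self, others):
        if self.state == I:
            for c in others:
                c.snoop("BusRd")
            self.state = S
            return "BusRd"
        return None                 # hit in S or M: no bus transaction

    def write(self, others):
        if self.state != M:
            for c in others:        # also used for S->M here (no BusUpgr)
                c.snoop("BusRdX")
            self.state = M
            return "BusRdX"
        return None                 # hit in M: no bus transaction

p1, p2, p3 = (MSICache(n) for n in ("P1", "P2", "P3"))
caches = [p1, p2, p3]
others = lambda me: [c for c in caches if c is not me]

trace = [
    p1.read(others(p1)),   # 1. P1 read u  -> BusRd
    p3.read(others(p3)),   # 2. P3 read u  -> BusRd
    p3.write(others(p3)),  # 3. P3 write u -> BusRdX, P1 invalidated
    p1.read(others(p1)),   # 4. P1 read u  -> BusRd, P3 goes M -> S
    p2.read(others(p2)),   # 5. P2 read u  -> BusRd
]
print(trace)                         # matches the Bus Action column
print(p1.state, p2.state, p3.state)  # S S S, as in row 5
```

Note the simplification flagged in the comments: this sketch issues BusRdX even for the S-to-M transition, which is the BusRdX-instead-of-BusUpgr choice discussed later in the deck.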

4-State (MESI) Invalidation Protocol
[Figure: MESI state diagram (copyright 1999 Morgan Kaufmann Publishers, Inc.)]

More Generally: MOESI [Sweazey & Smith, ISCA 1986]
- M - Modified (dirty)
- O - Owned (dirty but shared). Why? It lets a cache supply dirty data to other readers without first updating memory
- E - Exclusive (clean, unshared): the only copy, not dirty
- S - Shared
- I - Invalid
- Variants: MSI, MESI, MOSI, MOESI
- The five states can be arranged along axes of validity, exclusiveness, and ownership

4-State Write-back Update Protocol
- Dragon (Xerox PARC)
- States:
  - Exclusive (E): one copy, clean; memory is up-to-date
  - Shared-Clean (SC): could be two or more copies; memory state unknown
  - Shared-Modified (SM): could be two or more copies; memory stale
  - Modified (M)
- Adds a Bus Update transaction
- Adds a cache controller Update operation
- Must obtain the bus before updating the local copy
- What does the state diagram look like? Let's look at the actions first

Dragon Actions
- Processor: PrRd, PrWr, PrRdMiss, PrWrMiss; Update in response to BusUpd
- Bus transactions: BusRd, BusUpd, BusWB

4-State Dragon Update Protocol
[Figure: Dragon state diagram (copyright 1999 Morgan Kaufmann Publishers, Inc.)]

Tradeoffs in Protocol Design
- Which new states and transitions
- Which bus transactions
- Cache block size
- Workload dependence
- Compute bandwidth and miss rates from state transitions

Computing Bandwidth
- Why bandwidth?
- How do I compute it? Monitor state transitions: they tell me the bus transactions, and I know how many bytes each bus transaction requires

MESI State Transitions and Bandwidth (6 bytes of address/command, plus 64 bytes when a data block moves)

FROM/TO | NP         | I          | E          | S          | M
NP      | --         | --         | BusRd 6+64 | BusRd 6+64 | BusRdX 6+64
I       | --         | --         | BusRd 6+64 | BusRd 6+64 | BusRdX 6+64
E       | --         | --         | --         | --         | --
S       | --         | --         | NA         | --         | BusUpgr 6
M       | BusWB 6+64 | BusWB 6+64 | NA         | BusWB 6+64 | --

(NP = not present)

Bandwidth of MSI vs. MESI
- Assume a 200 MIPS/MFLOPS processor; use it with measured state-transition counts to obtain transitions/sec
- Compute state transitions/sec, then bus transactions/sec, then bytes/sec
- What is the BW savings of MESI over MSI?
- The difference between the protocols is the Exclusive state: where MESI's E->M transition is silent, MSI must issue a BusUpgr
- The result is a very small benefit! There are few E->M transitions, and a BusUpgr is only 6 bytes on the bus

Bandwidth of MSI vs. MESI (1MB Cache)
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]
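The transitions/sec to bytes/sec recipe above can be sketched in a few lines of Python. The per-transaction byte costs follow the MESI table (6 bytes of address/command, 64 bytes when a data block moves); the transition counts below are invented purely for illustration.

```python
# Bus traffic from measured state-transition counts.
COST = {"BusRd": 6 + 64, "BusRdX": 6 + 64, "BusWB": 6 + 64, "BusUpgr": 6}

BUS_OP = {                     # which bus transaction each transition causes
    ("I", "S"): "BusRd",       # read miss
    ("I", "M"): "BusRdX",      # write miss
    ("S", "M"): "BusUpgr",     # upgrade
    ("M", "S"): "BusWB",       # downgrade of a dirty block
}
transitions_per_sec = {        # hypothetical workload measurements
    ("I", "S"): 1000,
    ("I", "M"): 200,
    ("S", "M"): 300,
    ("M", "S"): 150,
}

bytes_per_sec = sum(COST[BUS_OP[t]] * n
                    for t, n in transitions_per_sec.items())
print(bytes_per_sec)  # 1000*70 + 200*70 + 300*6 + 150*70 = 96300
```

Comparing protocols is then just a matter of re-running the sum with each protocol's transition counts and bus-operation mapping, which is why the MESI-over-MSI savings shrink to almost nothing when E->M transitions are rare.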

Bandwidth of MSI vs. MESI (1MB Cache)
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]

Bandwidth of MSI vs. MESI (64KB Cache)
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]

MSI BusUpgr vs. BusRdX
- The MSI S->M transition issues a BusUpgr
- The block could be invalidated while waiting for the BusUpgr response; detecting this adds complexity
- Instead, just issue BusRdX: take MESI and put BusRdX on the E->M and S->M transitions
- The result is a 10% to 20% improvement (application dependent)

Cache Block Size
- The block size is the unit of transfer and of coherence
  - It doesn't have to be: coherence could be kept at a smaller granularity [Goodman]
- Uniprocessor misses: the 3 C's (Compulsory, Capacity, Conflict)
- Shared memory adds a fourth type, the Coherence miss:
  - True sharing miss: fetches data written by another processor
  - False sharing miss: results from independent data in the same coherence block
- Increasing block size: usually fewer 3C misses but more bandwidth; usually more false sharing misses
- P.S. on increasing cache size: usually fewer capacity/conflict misses (and compulsory misses don't matter much); no effect on true/false coherence misses (so they may come to dominate)
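The false sharing effect can be made concrete with a toy model (Python; the block size, addresses, and iteration count are all illustrative): P1 writes one word while P2 reads a different, independent word, and P2's miss count depends only on whether the two words land in the same coherence block.

```python
# Toy two-processor model: count P2's coherence misses when P1's
# writes do / do not share a block with P2's reads.
def coherence_misses(block_size, addr_p1, addr_p2, iters=10):
    """P1 writes addr_p1 and P2 reads addr_p2, alternately."""
    blk = lambda a: a // block_size
    p2_has_block = False
    misses = 0
    for _ in range(iters):
        # P1's write invalidates P2's copy iff both words map to
        # the same coherence block.
        if blk(addr_p1) == blk(addr_p2):
            p2_has_block = False
        if not p2_has_block:          # P2 read miss
            misses += 1
            p2_has_block = True
    return misses

print(coherence_misses(64, 0, 8))    # same 64B block: a miss every iteration
print(coherence_misses(64, 0, 128))  # different blocks: only the cold miss
```

Every miss in the first call is a false sharing miss, since P2 never touches the word P1 writes; padding the data so the two words fall in different blocks (the second call) removes them entirely.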

Cache Block Size: Miss Rate
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]

Cache Block Size: Miss Rate
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]

Cache Block Size: Traffic
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]

Invalidate vs. Update
- Pattern 1 (one write before reads):
    for i = 1 to k
      P1(write, x);
      P2--PN-1(read, x);
    end for i
- Pattern 2 (many writes before reads):
    for i = 1 to k
      for j = 1 to m
        P1(write, x);
      end for j
      P2(read, x);
    end for i

Invalidate vs. Update, cont.
- Pattern 1 (one write before reads), with N = 16, M = 10, K = 10:
  - Update:
    - Iteration 1: N regular cache misses (70 bytes each)
    - Each remaining iteration: one update (14 bytes: 6 control, 8 data)
    - Total update traffic = 16*70 + 9*14 = 1,246 bytes (the book assumes 10 updates instead of 9)
  - Invalidate:
    - Iteration 1: N regular cache misses (70 bytes each)
    - Each remaining iteration: P1 generates an upgrade (6 bytes) and the 15 other processors each take a read miss (70 bytes)
    - Total invalidate traffic = 16*70 + 9*6 + 15*9*70 = 10,624 bytes
- Pattern 2 (many writes before reads): Update = 1,400 bytes; Invalidate = 824 bytes

Invalidate vs. Update, cont.
- What about real workloads?
  - Update can generate too much traffic; it must be limited (e.g., competitive snooping)
- Current assessment:
  - Update is very hard to implement correctly (cf. the consistency discussion coming next)
  - Rarely done
- Future assessment:
  - May be the same as the current one, or
  - Chip multiprocessors may revive update protocols: more intra-chip bandwidth, and perhaps easier to have predictable timing paths?
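The Pattern 1 and Pattern 2 traffic arithmetic above can be checked in a few lines of Python, using the byte costs from the slide (70-byte miss, 14-byte update, 6-byte upgrade) and N = 16, K = 10, M = 10.

```python
# Reproducing the slide's invalidate-vs-update traffic totals.
N, K, M = 16, 10, 10
MISS, UPDATE, UPGRADE = 70, 14, 6

# Pattern 1: one write by P1, then N-1 reads, per outer iteration.
update_1 = N * MISS + (K - 1) * UPDATE      # 16*70 + 9*14
inval_1 = (N * MISS                         # iteration 1: 16 cold misses
           + (K - 1) * UPGRADE              # P1's upgrade, 9 iterations
           + (N - 1) * (K - 1) * MISS)      # 15 readers re-miss, 9 iterations

# Pattern 2: M writes by P1, then one read by P2, per outer iteration.
update_2 = K * M * UPDATE                   # every write is an update
inval_2 = (2 * MISS                         # two cold misses (P1 and P2)
           + (K - 1) * (UPGRADE + MISS))    # per later iteration: one
                                            # upgrade + one P2 read miss

print(update_1, inval_1, update_2, inval_2)  # 1246 10624 1400 824
```

Running it confirms the crossover the patterns are meant to show: update wins by nearly 10x when each write is read by many processors, and invalidate wins when many writes precede each read.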

Invalidate vs. Update: Miss Rate
[Figure: copyright 1999 Morgan Kaufmann Publishers, Inc.]

Qualitative Sharing Patterns [Weber & Gupta, ASPLOS-III]
- Read-Only
- Migratory objects: manipulated by one processor at a time, often protected by a lock; usually a write causes only a single invalidation
- Synchronization objects: often, more processors imply more invalidations
- Mostly Read: more processors imply more invalidations, but writes are rare
- Frequently Read/Written: more processors imply more invalidations