A More Sophisticated Snooping-Based Multi-Processor

Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014

Tunes: "The Projects" by Handsome Boy Modeling School (So... How's your Girl?) "There's a lot about the projects I do adore." - Dan the Automator

Final projects / expectations Project proposals are due Friday April 4 (but you are welcome to submit early to get feedback) - 5-6 weeks of effort: expecting at least 2 class assignments' worth of work The parallelism competition is on Friday May 10th during the final exam slot. - ~20-25 finalist projects will present to judges during this time - Other projects will be presented to staff later in the day - Final writeups are due at the end of the day on May 10th (no late days) Your grade is independent of the parallelism competition results - It is based on your work, your writeup, and your presentation You are encouraged to design your own project - Contact staff as early as possible about equipment needs - There is a list of project ideas on the web site to help - http://15418.courses.cs.cmu.edu/spring2014/article/12

Photo competition winner: 75% utilization My work aims to capture the essential human struggle that pits ensuring good load balance against the overheads incurred by using resources to orchestrate a computation. - Rokhini

Last time we implemented a very simple cache-coherent multiprocessor using an atomic bus as an interconnect. (Figure: two processors, each with its own cache, connected by an atomic bus to memory.)

Key issues: We addressed the issue of contention for access to cache tags by duplicating tags. (Figure: a cache with duplicated tags and state, one copy used by the processor-side controller and one by the snoop controller facing the bus; the bus carries address, data, and shared/dirty/snoop-valid lines.) We described how snoop results can be collectively reported to a cache via the shared, dirty, and snoop-valid lines on the bus.

Key issues: We addressed correctness issues that arise from the use of a write-back buffer by checking both the cache tags and the write-back buffer when snooping (and also by adding the ability to cancel pending bus transfer requests).

Key issues: We talked about ensuring write serialization: the processor is held up by the cache until the "I want exclusive access" transaction appears on the bus (this is when the write "commits"). We talked about how coherence protocol state transitions are not really atomic operations in a real machine (even though the bus itself is atomic), leading to possible race conditions.

We discussed deadlock, livelock, and starvation. Situation 1: P1 has a modified copy of cache line X. P1 is waiting for the bus to issue a BusRdX on cache line Y. A BusRd for X appears on the bus while P1 is waiting. FETCH DEADLOCK! To avoid deadlock, P1 must be able to service incoming transactions while waiting to issue its own requests.
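
A minimal C sketch of the controller structure this rule implies (all function names here are hypothetical placeholders, not from the lecture or any real API): while waiting for arbitration to grant its own request, the controller keeps servicing incoming snooped transactions instead of blocking.

    #include <stdbool.h>

    /* Placeholder hooks: in a real controller these would talk to the bus.
     * All names are illustrative, not a real API. */
    static bool bus_try_acquire(void)               { return true;  }  /* stub */
    static bool bus_snoop_pending(void)             { return false; }  /* stub */
    static void service_snooped_transaction(void)   { }                /* stub */
    static void place_request_on_bus(unsigned addr) { (void)addr; }    /* stub */

    /* To avoid fetch deadlock, keep servicing incoming snooped transactions
     * (e.g., a BusRd for a line this cache holds modified) while waiting
     * for the arbiter to grant this cache's own request. */
    void issue_when_granted(unsigned addr)
    {
        while (!bus_try_acquire()) {
            if (bus_snoop_pending())
                service_snooped_transaction();
        }
        place_request_on_bus(addr);
    }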

Required conditions for deadlock: 1. Mutual exclusion: only one processor can hold a given resource at a time. 2. Hold and wait: a processor must hold the resource while waiting for the other resources needed to complete an operation. 3. No preemption: processors don't give up resources until the operation they wish to perform is complete. 4. Circular wait: waiting processors have mutual dependencies (a cycle exists in the resource dependency graph). (Figure: workers A and B connected by two work queues, both full.)

Livelock. Situation 2: Two processors simultaneously write to cache line X (in the S state). P1 acquires bus access ("wins" the bus) and issues BusRdX. P2 invalidates its copy of the line in response to P1's BusRdX. Before P1 performs the write (updates the block), P2 acquires the bus and issues BusRdX. P1 invalidates in response to P2's BusRdX. LIVELOCK! To avoid livelock, a write that obtains exclusive ownership must be allowed to complete before exclusive ownership is relinquished.

Today s topic Today we will build the system around non-atomic bus transactions.

What you should know: What is the major performance issue with atomic bus transactions that motivates moving to a more complex non-atomic system? What are the main components of a split-transaction bus, and how are transactions split into requests and responses? How might deadlock and livelock occur in both atomic-bus and non-atomic-bus-based systems (and what are possible solutions for avoiding them)? What is the role of queues in a parallel system (today is yet another example)?

Transaction on an atomic bus: 1. Client is granted bus access (result of arbitration). 2. Client places command on bus (may also place data on bus). 3. Response to command by another bus client is placed on the bus. 4. Next client obtains bus access (arbitration). Problem: the bus is idle while the response is pending (decreases effective bus bandwidth). This is bad, because the interconnect is a limited, shared resource in a multi-processor system, so it is important to use it as efficiently as possible.
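
To make the cost concrete, here is a toy C calculation using assumed latencies (a 4-cycle request, a 50-cycle memory access, and a 4-cycle data response; these numbers are illustrative, not from the lecture). Under those assumptions an atomic bus carries useful traffic for only about 14% of a read miss's cycles; a split-transaction bus could fill the idle cycles with other transactions.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, illustrative latencies (not figures from the lecture). */
        int request_cycles  = 4;    /* arbitration + command/address          */
        int memory_latency  = 50;   /* DRAM access; the atomic bus sits idle  */
        int response_cycles = 4;    /* e.g., a 128-byte line on a 256-bit bus */

        int total = request_cycles + memory_latency + response_cycles;
        int busy  = request_cycles + response_cycles;

        printf("atomic bus busy %d of %d cycles (~%.0f%% utilization)\n",
               busy, total, 100.0 * busy / total);
        return 0;
    }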

Split-transaction bus: Bus transactions are split into two transactions: a request and a response. Other transactions can intervene. Consider a read miss to A by P1 and a bus upgrade of B by P2 (figure: P1 and P2, each with a cache, attached to a split-transaction bus along with memory). Timeline: P1 gains access to the bus; P1 sends the BusRd command [memory starts fetching data]; P2 gains access to the bus; P2 sends the BusUpg command; memory gains access to the bus; memory places A on the bus.

New issues arise due to split transactions 1. How to match requests with responses? 2. How to handle conflicting requests on bus? Consider: - P1 has outstanding request for line A - Before response to P1 occurs, P2 makes request for line A 3. Flow control: how many requests can be outstanding at a time, and what should be done when buffers fill up? 4. When are snoop results reported? During the request? or during the response?

A basic design Up to eight outstanding requests at a time (system wide) Responses need not occur in the same order as requests - But request order establishes the total order for the system Flow control via negative acknowledgements (NACKs) - When a buffer is full, client can NACK a transaction, causing a retry

Initiating a request: Can think of a split-transaction bus as two separate buses: a request bus (command + address) and a response bus (256-bit data plus a 3-bit response tag). The transaction tag is an index into the request table; assume a copy of this table is maintained by each bus client (e.g., each cache), with each entry recording the requestor (e.g., P0), the address (e.g., 0xbeef), and state. Step 1: Requestor asks for request bus access. Step 2: Bus arbiter grants access and assigns the transaction a tag. Step 3: Requestor places command + address on the request bus.
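
A rough C sketch of the per-client request table just described: eight entries (matching the eight outstanding requests allowed system-wide), indexed by the 3-bit tag assigned at arbitration time. Field names and helper functions are illustrative, not part of the actual design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_TAGS 8                  /* 3-bit tag => up to 8 outstanding requests */

    typedef struct {
        bool     valid;                 /* entry allocated? */
        int      requestor;             /* e.g., P0 */
        uint64_t addr;                  /* line address, e.g., 0xbeef */
        int      cmd;                   /* BusRd, BusRdX, BusUpg, ... */
    } request_entry_t;

    typedef struct {
        request_entry_t entry[NUM_TAGS];
    } request_table_t;                  /* every bus client keeps its own copy */

    /* Record a request under the tag the arbiter assigned to it. */
    void request_table_insert(request_table_t *t, int tag,
                              int requestor, uint64_t addr, int cmd)
    {
        t->entry[tag] = (request_entry_t){ true, requestor, addr, cmd };
    }

    /* Free the entry once the matching response (same tag) completes. */
    void request_table_free(request_table_t *t, int tag)
    {
        t->entry[tag].valid = false;
    }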

Read miss: cycle-by-cycle bus behavior (phase 1). (Timing diagram: request bus (addr/cmd) activity over the cycles ARB, RSLV, ADDR, DCD, ACK: address request, grant, address, address ack.) Request arbitration (ARB): cache controllers present their request for the address bus (many caches may be doing so in the same cycle). Request resolution (RSLV): the address bus arbiter grants access to one of the requestors; a request table entry is allocated for the request (see previous slide), and special arbitration lines indicate the tag assigned to the request. ADDR: the bus winner places the command/address on the bus. DCD: caches perform the snoop: look up tags, update cache state, etc. ACK: caches acknowledge that their snoop result is ready (or signal that they could not complete the snoop in time, e.g., by raising an inhibit wire). The memory operation commits here, with NO BUS TRAFFIC.

Read miss: cycle-by-cycle bus behavior (phase 2). (Timing diagram: a second ARB/RSLV/ADDR/DCD/ACK group; response bus (data arbitration) activity: data request, grant, tag check.) Data response arbitration: a responder presents its intent to respond to the request with tag T (many caches, or memory, may be doing so in the same cycle). The data bus arbiter grants one responder bus access. The original requestor then signals its readiness to receive the response (or lack thereof: the requestor may be busy at this time).

Read miss: cycle-by-cycle bus behavior (phase 3). (Timing diagram: data driven on the response bus for four consecutive cycles.) The responder places the response data on the data bus, and caches present their snoop result for the request along with the data. The request table entry is then freed. Here we assume 128-byte cache lines, so the transfer takes 4 cycles on the 256-bit bus.
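
As a compact reference, the request-bus cycle names appearing in the timing diagrams above can be summarized as a C enum. The phase names come from the slides; the comments paraphrase the walkthrough, and the exact event-to-cycle mapping is my reading of the diagrams rather than something stated explicitly.

    /* Request-bus phases from the timing diagrams above (one per clock). */
    typedef enum {
        PHASE_ARB,   /* request arbitration: caches present requests            */
        PHASE_RSLV,  /* resolution: arbiter picks a winner; tag + table entry    */
        PHASE_ADDR,  /* winner drives command + address on the request bus       */
        PHASE_DCD,   /* caches snoop: look up tags, update cache state           */
        PHASE_ACK    /* snoop results acknowledged; memory operation commits     */
    } request_phase_t;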

Pipelined transactions. (Timing diagram: two memory transactions overlapped: while transaction 1 arbitrates for and uses the response bus, transaction 2 proceeds through the request-bus phases.) Note: write backs and BusUpg transactions do not have a response component (write backs acquire access to both the request/address bus and the data bus as part of the request phase).

Pipelined transactions. (Timing diagram: four memory transactions in flight at once, overlapped across the request bus and the response bus.)

Key issues to resolve Conflicting requests - Avoid conflicting requests by disallowing them - Each cache has a copy of the request table - Simple policy: caches do not make requests that conflict with requests in the request table Flow control: - Caches/memory have buffers for receiving data off the bus - If the buffer fills, client NACKs relevant requests or responses (NACK = negative acknowledgement) - Triggers a later retry
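
A small C sketch of the NACK-based flow control described above, under the assumption of a fixed-depth receive buffer (the depth and all names are illustrative): if the buffer is full the transaction is refused, and the sender retries later.

    #include <stdbool.h>

    #define BUF_SLOTS 4                  /* assumed buffer depth */

    typedef struct { int tag; } incoming_msg_t;

    typedef struct {
        incoming_msg_t slot[BUF_SLOTS];
        int            count;
    } in_buffer_t;

    /* Returns true if the message was accepted, false to signal a NACK. */
    bool try_accept(in_buffer_t *buf, incoming_msg_t msg)
    {
        if (buf->count == BUF_SLOTS)
            return false;                /* buffer full: NACK, sender retries later */
        buf->slot[buf->count++] = msg;
        return true;                     /* accepted */
    }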

Situation 1: P1 read miss to X while a write transaction involving X is outstanding on the bus. (Figure: P1 and P2, each with a cache, on a split-transaction bus with memory; P1 wants to read X, and P1's request table already contains an entry with requestor P2, address X, op BusRdX.) If there is a conflicting outstanding request (as determined by checking the request table), the cache must hold its request until the conflict clears.

Situation 2: P1 read miss to X while a read transaction involving X is outstanding on the bus. (Figure: P1's request table contains an entry with requestor P2, address X, op BusRd, share.) If the outstanding request is a read, there is no conflict: there is no need to make a new bus request, just listen for the response to the outstanding one.
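
The two situations above amount to a three-way decision that a cache can make by consulting its copy of the request table before issuing a read miss. Here is a hypothetical C sketch of that check (types and names are illustrative, building on the request-table sketch earlier).

    #include <stdint.h>

    enum cmd    { BUS_RD, BUS_RDX, BUS_UPG };
    enum action { ISSUE_REQUEST, HOLD_REQUEST, WAIT_FOR_EXISTING_RESPONSE };

    typedef struct { int valid; uint64_t addr; enum cmd cmd; } entry_t;

    /* Decide what to do about a read miss to line x, given the n-entry
     * request table (the client's copy of outstanding bus requests). */
    enum action read_miss_action(const entry_t *table, int n, uint64_t x)
    {
        for (int i = 0; i < n; i++) {
            if (!table[i].valid || table[i].addr != x)
                continue;
            if (table[i].cmd == BUS_RD)
                return WAIT_FOR_EXISTING_RESPONSE;  /* Situation 2: just snoop the response */
            return HOLD_REQUEST;                    /* Situation 1: conflicting write outstanding */
        }
        return ISSUE_REQUEST;                       /* no conflict: arbitrate and issue BusRd */
    }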

Multi-level cache hierarchies

Why do we have queues? To accommodate variable (unpredictable) rates of production and consumption. As long as A and B, on average, produce and consume at the same rate, both workers can run at full rate. (Figure: worker A feeding worker B through a queue, with two timelines of work items 1-6: with no queue, stalls exist; with a queue of size 2, A and B never stall. The numbers under the second timeline track the size of the queue each time A completes a piece of work or B begins one.)
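
A minimal bounded-queue sketch in C (a fixed-size ring buffer, capacity 2 to match the example): the producer can run ahead of the consumer by up to the queue's capacity before anyone stalls, which is exactly the rate-smoothing role described above.

    #include <stdbool.h>

    #define CAPACITY 2

    typedef struct {
        int item[CAPACITY];
        int head, tail, count;
    } queue_t;

    bool enqueue(queue_t *q, int v)          /* producer (worker A) */
    {
        if (q->count == CAPACITY) return false;  /* queue full: A would stall */
        q->item[q->tail] = v;
        q->tail = (q->tail + 1) % CAPACITY;
        q->count++;
        return true;
    }

    bool dequeue(queue_t *q, int *v)         /* consumer (worker B) */
    {
        if (q->count == 0) return false;         /* queue empty: B would stall */
        *v = q->item[q->head];
        q->head = (q->head + 1) % CAPACITY;
        q->count--;
        return true;
    }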

Consider the fetch deadlock problem: assume one outstanding memory request per processor. The cache must be able to service requests while waiting on the response to its own request (hierarchies increase response delay).

Deadlock due to full queues. Assume buffers are sized so that the maximum queue size is one message. (Figure: the L1 cache connects to the processor above and to the L2 cache below through an L1→L2 queue and an L2→L1 queue; L2 connects to the bus.) An outgoing read request (initiated by this processor) occupies one queue while an incoming read request (due to another cache) occupies the other. Both requests generate responses that require space in the other queue: a circular dependency, hence deadlock. (The incoming read request reaching L1 will only occur if L1 is write-back.)

Multi-level cache hierarchies: Assume one outstanding memory request per processor. Consider the fetch deadlock problem: the cache must be able to service requests while waiting on the response to its own request (hierarchies increase response delay). Sizing all buffers to accommodate the maximum number of outstanding requests on the bus is one solution to avoiding deadlock, but an expensive one!

Avoiding buffer deadlock with separate request/response queues. (Figure: the L1 and L2 caches are connected by four queues: an L1→L2 request queue, an L1→L2 response queue, an L2→L1 request queue, and an L2→L1 response queue; the processor connects to L1 and the bus to L2.) The system classifies all transactions as requests or responses. Key insight: responses can be completed without generating further transactions! Requests INCREASE queue length, but responses REDUCE queue length. While stalled attempting to send a request, a cache must be able to service responses. Responses will make progress (they generate no new work, so there is no circular dependence), eventually freeing up resources for requests.
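
A hypothetical C sketch of the key rule above: while L1 is stalled because its outgoing request queue is full, it keeps draining its incoming response queue. The queue type and helpers are stubs for illustration; the point is the control structure, not the data structures.

    #include <stdbool.h>

    typedef struct msg msg_t;               /* opaque message type (illustrative) */

    typedef struct {
        /* ... a bounded queue of msg_t, as sketched earlier ... */
        int count, capacity;
    } msg_queue_t;

    static bool queue_full(const msg_queue_t *q)       { return q->count == q->capacity; }
    static bool queue_empty(const msg_queue_t *q)      { return q->count == 0; }
    static void queue_push(msg_queue_t *q, msg_t *m)   { (void)m; q->count++; }  /* stub */
    static msg_t *queue_pop(msg_queue_t *q)            { q->count--; return 0; } /* stub */
    static void l1_apply_response(msg_t *m)            { (void)m; }              /* stub */

    /* L1 wants to send a request downward; it must not block responses. */
    void l1_send_request(msg_queue_t *req_out, msg_queue_t *resp_in, msg_t *req)
    {
        while (queue_full(req_out)) {
            if (!queue_empty(resp_in))
                l1_apply_response(queue_pop(resp_in));  /* drain responses while stalled */
        }
        queue_push(req_out, req);
    }

Because responses never generate new transactions, the drain loop is guaranteed to make forward progress once responses arrive, which is what breaks the circular dependency described on the previous slide.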

Putting it all together. Class exercise: describe everything that might occur during the execution of this statement: int x = 10; // assume this is a write to memory (value not stored in a register)

Class exercise: describe everything that might occur during the execution of this statement: int x = 10; - Virtual address to physical address conversion (TLB lookup) - TLB miss - TLB update (might involve the OS) - OS may need to swap in a page to get the appropriate page table (load from disk to a physical address) - Cache lookup - Line not in cache (need to generate BusRdX) - Arbitrate for the bus - Win the bus, place address and command on the bus - Another cache or memory decides it must respond (assume memory) - Memory request sent to the memory controller - Memory controller is itself a scheduler - Memory checks the active row in the row buffer (may need to activate a new row) - Values read from the row buffer - Memory arbitrates for the data bus - Memory wins the bus - Memory puts the data on the bus - Cache grabs the data, updates the cache line and tags, moves the line into the Exclusive state - Processor is notified the data exists - Instruction proceeds