
Parallel Computer Architecture, Spring 2018
Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Nikos Bellas
Computer and Communications Engineering Department
University of Thessaly

Uniform Shared Memory Multiprocessors (UMA)

[Figure: processors with private caches connected to a single shared memory over a CPU-Memory bus; not scalable]

Example: Assume a UMA system with 16 processors, each with an L1-L2 cache hierarchy and an L2 global miss rate of 1%. The L2 cache block size is 64 bytes and the processors are clocked at 4 GHz. Assume the processors can sustain IPC = 2 and each application has on average 30% load/store instructions. What is the maximum bandwidth required by all processors?

With 30% load/store instructions there are 0.3 data references per instruction, i.e. 0.3*2 = 0.6 data references per cycle for IPC = 2. This amounts to 0.3*2*4*16 = 38.4 data references per ns for all 16 processors. These references generate 38.4 * 0.01 = 0.384 misses/ns, which require 0.384 * 64 = 24.576 bytes/ns of bandwidth to memory, i.e. 24.576 GB/sec.
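The same arithmetic can be written out as a short C program (a minimal sketch; the constants simply restate the assumptions of the example above):

    #include <stdio.h>

    int main(void) {
        const double processors   = 16;
        const double clock_ghz    = 4.0;   /* cycles per ns */
        const double ipc          = 2.0;
        const double ldst_frac    = 0.30;  /* fraction of load/store instructions */
        const double l2_miss_rate = 0.01;  /* global L2 miss rate */
        const double block_bytes  = 64.0;

        /* data references per ns, summed over all processors */
        double refs_per_ns   = ldst_frac * ipc * clock_ghz * processors;   /* 38.4  */
        double misses_per_ns = refs_per_ns * l2_miss_rate;                 /* 0.384 */
        double bytes_per_ns  = misses_per_ns * block_bytes;                /* 24.576 bytes/ns = GB/s */

        printf("Required memory bandwidth: %.3f GB/s\n", bytes_per_ns);
        return 0;
    }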

Distributed Shared Memory (DSM or NUMA)

[Figure: multiple nodes, each with a processor and caches, local memory, a directory, and I/O, connected through an interconnection network]

There is one FSM for each cache block (very similar to the snooping FSM) and one FSM for each memory block in the directory.

Distributed Shared Memory (DSM or NUMA)

Each processor has its own physical memory and potentially I/O facilities such as DMA; there is no centralized (symmetric) memory. However, the memories are shared and can be accessed with load/store instructions by any processor.

Every memory module has associated directory information (one directory per node). A directory keeps information only for the blocks of its own memory module, i.e. the sharing status for each block is held in a single known location. It keeps track of the copies of cached blocks and their states. On a miss, the requester finds the directory entry, looks it up, and communicates ONLY with the nodes that have copies, if necessary. There is no broadcasting, only point-to-point network transactions.
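As a concrete illustration, a directory entry can be stored as a state field plus a bit vector of sharers, as described above. The following C sketch is only an illustration under assumptions made here (the names, the 64-node limit, and the flat bit vector are choices for this sketch, not something prescribed by the slides):

    #include <stdint.h>

    /* Directory states used in these slides */
    typedef enum {
        DIR_UNCACHED,   /* no cache has the block; memory holds the only copy   */
        DIR_SHARED,     /* one or more caches hold read-only copies; memory up-to-date */
        DIR_EXCLUSIVE   /* exactly one cache (the owner) holds a dirty copy; memory stale */
    } dir_state_t;

    /* One entry per memory block of the local memory module.
     * Assumes at most 64 nodes so the sharer set fits in one 64-bit word. */
    typedef struct {
        dir_state_t state;
        uint64_t    sharers;   /* bit i set  <=>  node i has a copy */
    } dir_entry_t;

    static inline void add_sharer(dir_entry_t *e, int node)     { e->sharers |=  (1ULL << node); }
    static inline void clear_sharers(dir_entry_t *e)            { e->sharers = 0; }
    static inline int  is_sharer(const dir_entry_t *e, int node){ return (int)((e->sharers >> node) & 1ULL); }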

Directory Protocol

No broadcast: the interconnect is no longer a single arbitration point, so all messages have explicit responses.

Typically three agents are involved:
- Local node (processor + caches): where a request originates
- Home node (directory controller): where the memory location of an address resides
- Remote node (processor + caches): has a copy of the block, whether exclusive or shared

Directory Protocol

Two different FSMs:
- One for each block in the cache
- One in the directory for each memory block of the memory module

The cache-block FSM is similar to the snoopy protocol. The directory FSM also has three states:
- Uncached: no processor has the block; not valid in any cache, only in memory
- Shared: one or more processors have the memory block; memory is up-to-date
- Exclusive: one processor (the owner) has the data; memory is out-of-date

Directory Protocol

In addition to the sharing state, the directory must track which processors have the data when in the Shared state (usually a bit vector: bit i is 1 if processor i has a copy).

Keep it simple:
- Writes to non-exclusive data cause a write miss
- The processor blocks until the access completes
- Assume messages are received and acted upon in the order sent

CPU-Cache State Machine

State machine for each cache block. Requests come from the local CPU and from the home directory (shown in red and blue on the original slide). The home directory is where the memory location resides; it can be at another node.

Invalid state:
- CPU read miss: send Read Miss message to home directory; go to Shared
- CPU write miss: send Write Miss message to home directory; go to Exclusive

Shared (read-only) state:
- CPU read hit: stay in Shared
- CPU read miss: send Read Miss message to home directory; stay in Shared
- CPU write hit: send Write Invalidate message to home directory; go to Exclusive
- CPU write miss: send Write Miss message to home directory; go to Exclusive
- Invalidate (from home directory): go to Invalid

Exclusive (read/write) state:
- CPU read hit, CPU write hit: stay in Exclusive
- CPU read miss: send Data Write Back message and Read Miss to home directory; go to Shared
- CPU write miss: send Data Write Back message and Write Miss to home directory; stay in Exclusive
- Fetch (from home directory): send Data Write Back message to home directory; go to Shared
- Fetch/Invalidate (from home directory): send Data Write Back message to home directory; go to Invalid
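A compact way to think of the cache-side controller is as a transition function over (state, event). The C sketch below is an illustrative skeleton of the transitions listed above, not the full protocol; the names and the send_to_home helper are assumptions made here:

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } cache_state_t;

    typedef enum {
        CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS,
        DIR_INVALIDATE, DIR_FETCH, DIR_FETCH_INVALIDATE
    } cache_event_t;

    /* Hypothetical helper: issue a network message to the home directory. */
    static void send_to_home(const char *msg) { printf("-> home directory: %s\n", msg); }

    cache_state_t cache_fsm(cache_state_t s, cache_event_t ev) {
        switch (s) {
        case INVALID:
            if (ev == CPU_READ_MISS)  { send_to_home("ReadMiss");  return SHARED; }
            if (ev == CPU_WRITE_MISS) { send_to_home("WriteMiss"); return EXCLUSIVE; }
            break;
        case SHARED:
            if (ev == CPU_READ_HIT)   return SHARED;
            if (ev == CPU_READ_MISS)  { send_to_home("ReadMiss");        return SHARED; }
            if (ev == CPU_WRITE_HIT)  { send_to_home("WriteInvalidate"); return EXCLUSIVE; }
            if (ev == CPU_WRITE_MISS) { send_to_home("WriteMiss");       return EXCLUSIVE; }
            if (ev == DIR_INVALIDATE) return INVALID;
            break;
        case EXCLUSIVE:
            if (ev == CPU_READ_HIT || ev == CPU_WRITE_HIT) return EXCLUSIVE;
            if (ev == CPU_READ_MISS)  { send_to_home("DataWriteBack + ReadMiss");  return SHARED; }
            if (ev == CPU_WRITE_MISS) { send_to_home("DataWriteBack + WriteMiss"); return EXCLUSIVE; }
            if (ev == DIR_FETCH)            { send_to_home("DataWriteBack"); return SHARED; }
            if (ev == DIR_FETCH_INVALIDATE) { send_to_home("DataWriteBack"); return INVALID; }
            break;
        }
        return s; /* unhandled events leave the state unchanged */
    }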

Directory State Machine of the Home Node

State machine in the directory for each memory block; requests come from the cache controllers. The block is in the Uncached state if the data exists only in memory.

Uncached state (Sharers = {}):
- Read Miss from P: send Data Value Reply; Sharers = {P}; go to Shared
- Write Miss from P: send Data Value Reply; Sharers = {P}; go to Exclusive

Shared (read-only) state:
- Read Miss from P: send Data Value Reply; Sharers += {P}
- Write Miss from P: send Invalidate to all Sharers; send Data Value Reply; Sharers = {P}; go to Exclusive

Exclusive (read/write) state:
- Read Miss from P: send Fetch to the remote owner; send Data Value Reply; Sharers += {P}; go to Shared
- Write Miss from P: send Fetch/Invalidate to the remote owner; send Data Value Reply; Sharers = {P}
- Data Write Back (e.g. due to block replacement at the owner): write back the block; Sharers = {}; go to Uncached
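Continuing the earlier dir_entry_t sketch, a home-node handler for the two miss messages could look roughly as follows (again an illustrative skeleton under the same assumptions; the message helpers are hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;
    typedef struct { dir_state_t state; uint64_t sharers; } dir_entry_t;

    /* Hypothetical point-to-point message helper. */
    static void send(int node, const char *msg) { printf("-> node %d: %s\n", node, msg); }

    void handle_read_miss(dir_entry_t *e, int p) {
        if (e->state == DIR_EXCLUSIVE) {
            int owner = __builtin_ctzll(e->sharers);   /* single bit set: the owner (GCC/Clang builtin) */
            send(owner, "Fetch");                      /* owner writes the block back */
        }
        send(p, "DataValueReply");
        e->sharers |= (1ULL << p);
        e->state = DIR_SHARED;
    }

    void handle_write_miss(dir_entry_t *e, int p) {
        if (e->state == DIR_SHARED) {
            for (int n = 0; n < 64; n++)               /* invalidate every other sharer */
                if (((e->sharers >> n) & 1ULL) && n != p) send(n, "Invalidate");
        } else if (e->state == DIR_EXCLUSIVE) {
            int owner = __builtin_ctzll(e->sharers);
            send(owner, "Fetch/Invalidate");
        }
        send(p, "DataValueReply");
        e->sharers = (1ULL << p);                      /* the requester becomes the only copy */
        e->state = DIR_EXCLUSIVE;
    }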

Example Directory Protocol: P1 Read Miss

P1 executes load R3, X and misses in its cache. P1 sends a Read Miss message for X to the home node; the directory controller replies with the data value. The directory entry for X is Shared with Sharers = {Pa, P1}, and the block is in the Shared state in P1's cache.

Example Directory Protocol: P2 Read Miss

P2 executes load R5, X and misses. P2 sends a Read Miss message to the home node; the directory replies with the data value. The directory entry stays Shared, now with Sharers = {Pa, P1, P2}; the block is Shared in both P1's and P2's caches.

Example Directory Protocol: P1 Writes to a Shared Block (Write Hit)

P1 executes store R1, X and hits on a Shared block. P1 sends a write-hit (write invalidate) message to the home node. The directory sends Invalidate messages for location X to the other sharers, collects their invalidate acknowledgments, and replies with an ACK to P1. The directory entry becomes Exclusive with Sharers = {P1}; the block becomes Modified/Exclusive in P1's cache and Invalid in P2's cache.

Example Directory Protocol: P2 Writes to an Exclusive Block (Write Miss)

P2 executes store R2, X and misses. P2 sends a Write Miss message for location X to the home node. Since the block is Exclusive at P1, the directory sends a Fetch/Invalidate to P1; P1 writes the block back (Write_back X) and invalidates its copy. The directory then replies with the data to P2. The directory entry remains Exclusive, now with Sharers = {P2}; the block is Modified/Exclusive in P2's cache and Invalid in P1's cache.

Discussion

All these mechanisms are hardware based, BUT the programmer should still be concerned about performance: a frequently accessed shared data structure can easily thrash such a system. Data placement near the access points is still critical.

Synchronization

Synchronization

Question: How are all these hardware enhancements used in software to implement synchronization primitives?

The key requirement for implementing synchronization is a set of hardware primitives that can atomically read and modify a memory location. These are typically part of the ISA. Higher-level constructs (synchronization libraries) are built on top of these primitives: locks, barriers, etc.

Synchronization

The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system):

- Exclusive use of a resource: the operating system has to ensure that only one process uses a resource at a given time.
- Barriers: in parallel programming, a process may want to wait until several events have occurred (e.g. processes forked from P1 must all reach the join point).
- Producer-Consumer: a consumer process must wait until the producer process has produced data.
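As one example of such a higher-level construct, a simple centralized barrier can be built from an atomic counter. The sketch below uses C11 atomics and is only illustrative (a sense-reversing barrier; not taken from the slides):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  count;   /* how many threads have arrived in this episode */
        atomic_bool sense;   /* global sense, flipped by the last arriver */
        int         n;       /* number of participating threads */
    } barrier_t;

    void barrier_init(barrier_t *b, int n) {
        atomic_init(&b->count, 0);
        atomic_init(&b->sense, false);
        b->n = n;
    }

    /* Each thread keeps its own local_sense (initially false) and passes it by address. */
    void barrier_wait(barrier_t *b, bool *local_sense) {
        *local_sense = !*local_sense;                       /* the sense this episode completes with */
        if (atomic_fetch_add(&b->count, 1) == b->n - 1) {   /* last thread to arrive */
            atomic_store(&b->count, 0);
            atomic_store(&b->sense, *local_sense);          /* release everyone */
        } else {
            while (atomic_load(&b->sense) != *local_sense)  /* spin until released */
                ;
        }
    }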

Locks or Semaphores (E. W. Dijkstra, 1965)

A semaphore s is a non-negative integer with the following operations:
- P(s): if s > 0, decrement s by 1 and enter the critical section; otherwise wait
- V(s): increment s by 1 and wake up one of the waiting processes

P's and V's must be executed atomically, i.e. without interruptions or interleaved accesses to s by other processors. The initial value of s determines the maximum number of processes allowed in the critical section at the same time.
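A busy-waiting counting semaphore in the spirit of P/V can be sketched with a C11 compare-and-swap loop (illustrative only; a real implementation would block and wake waiters instead of spinning):

    #include <stdatomic.h>

    typedef struct { atomic_int s; } semaphore_t;

    void sem_init_value(semaphore_t *sem, int initial) { atomic_init(&sem->s, initial); }

    /* P(s): wait until s > 0, then atomically decrement it. */
    void P(semaphore_t *sem) {
        for (;;) {
            int v = atomic_load(&sem->s);
            if (v > 0 && atomic_compare_exchange_weak(&sem->s, &v, v - 1))
                return;     /* we claimed a unit; enter the critical section */
            /* otherwise spin (a real semaphore would put the process to sleep) */
        }
    }

    /* V(s): atomically increment s (and, in a real system, wake one waiter). */
    void V(semaphore_t *sem) {
        atomic_fetch_add(&sem->s, 1);
    }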

Use of Semaphores as Locks

Example: m is a memory location, R is a register.

    AtomicExchange (m), R:   Rt = M[m]; M[m] = R; R = Rt;

m is a shared location called the Lock. When Lock == 0 we can enter the critical region; when Lock == 1 we cannot. To enter a critical section, a processor must capture the lock by changing it from 0 to 1. Unlocking does not need to be atomic.

    Lock (m), R:              /* R is local */
      lock: R = 1;
            AtomicExchange (m), R;
            BNEZ R, lock;     /* retry while the lock was already taken */
            <critical section>
            R = 0;
            M[m] = R;         /* unlock */
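The same test-and-set loop written with C11 atomics (a minimal sketch; atomic_exchange plays the role of the AtomicExchange instruction above):

    #include <stdatomic.h>

    typedef struct { atomic_int lock; } spinlock_t;   /* 0 = free, 1 = taken */

    void spin_lock(spinlock_t *l) {
        /* Keep exchanging 1 into the lock; the returned old value says whether it was free. */
        while (atomic_exchange(&l->lock, 1) != 0)
            ;   /* spin: someone else holds the lock */
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store(&l->lock, 0);   /* unlocking needs no atomic read-modify-write */
    }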

Implementation of Atomic Exchange: Load Linked & Store Conditional Instructions

Example: m is a memory location, R is a register. If (m) was changed by another processor, or a context switch happened between the execution of the LL and the SC, the SC fails.

    LL R, (m):  R = M[m];
    SC R, (m):  if M[m] has changed since the last LL then R = 0;
                else if the processor context switched since the last LL then R = 0;
                else M[m] = R; R = 1;

Note: LL and SC go in pairs.

    Lock (m), R:                /* initialize R with 1 */
      try: R3 = R;
           LL   R2, (m);        /* (m) <--> R */
           SC   R3, (m);
           BEQZ R3, try;        /* was the exchange atomic? */
           BNEZ R2, try;        /* if yes, check for lock availability */
           <critical section>
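Portable C has no direct LL/SC pair, but atomic_compare_exchange_weak is the usual analog: on LL/SC machines compilers typically lower it to an LL/SC loop, and a "weak" CAS is allowed to fail spuriously just as an SC can. A hedged sketch of the same lock-acquire pattern:

    #include <stdatomic.h>

    typedef struct { atomic_int lock; } spinlock_t;   /* 0 = free, 1 = taken */

    void spin_lock_cas(spinlock_t *l) {
        for (;;) {
            int expected = 0;
            /* Weak CAS: may fail spuriously (like a failed SC), hence the retry loop. */
            if (atomic_compare_exchange_weak(&l->lock, &expected, 1))
                return;         /* we changed 0 -> 1: lock acquired */
            /* expected now holds the observed value; if it was 1, the lock is taken: retry */
        }
    }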

Load Linked & Store Conditional Instructions

    LL R1, (m)      SC R2, (m)

LL and SC implementation in a multiprocessor: keep track of the address m in a link register Rl.

- The LL instruction loads the data at (m) into the local cache of CPU A and records the address in the link register (Rl <- m).
- If an interrupt occurs, or if the cache block that corresponds to memory location m is invalidated (by a write from another CPU B to m), the link register is cleared.
- The SC instruction checks whether the address m matches the value in the link register; if not, the SC fails.

Synchronization and Caches

[Figure: Processors 1, 2 and 3, each with a cache, on a shared CPU-Memory bus]

Cache-coherence protocols will cause the lock to ping-pong between caches. Ping-ponging can be reduced by first reading the lock location (non-atomically) and executing the exchange only if it is found to be zero. For example, using the LL/SC instructions:

    swap(m), R {
      try: LL   R1, (m)
           SC   R, (m)
           BEQZ R, try
           R = R1
    }
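The "read first, then exchange" idea (test-and-test-and-set) looks like this in C11 atomics (an illustrative sketch, assuming the same 0 = free / 1 = taken convention):

    #include <stdatomic.h>

    typedef struct { atomic_int lock; } spinlock_t;

    void spin_lock_tts(spinlock_t *l) {
        for (;;) {
            /* Test: spin on plain loads so waiters hit in their own caches
             * and generate no bus/network traffic while the lock is held. */
            while (atomic_load(&l->lock) != 0)
                ;
            /* Test-and-set: only now attempt the atomic exchange. */
            if (atomic_exchange(&l->lock, 1) == 0)
                return;   /* acquired */
            /* someone else won the race; go back to local spinning */
        }
    }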

Example: three processors spinning on a lock (cache-line state shown as (P1, P2, P3), using the swap routine above).

P1 holds the lock; P2 and P3 spin, testing whether the lock is 0, so the lock line is shared. Assume the status of the line that contains the lock is (I, S, S).

P1 releases the lock by setting it to 0; P2 and P3 receive an invalidate and the lock becomes Exclusive to P1. The status of the line that contains the lock becomes (M, I, I).

P2 and P3 race to read the lock (read miss). The status of the line that contains the lock becomes (S, S, S).

P3 is faster to test for 0 and to write 1 to the lock (write hit), invalidating the other copies in P1 and P2; the lock becomes Exclusive to P3. The status of the line that contains the lock becomes (I, I, M).

P3 enters the critical region. P2 attempts to read the lock again (read miss) and gets the value 1; its exchange nevertheless sets the lock to 1 (write hit), so the lock becomes Exclusive to P2. The status of the line that contains the lock becomes (I, M, I).

P2 keeps spinning, testing whether the lock is 0. The lock remains Exclusive to P2, and the status of the line that contains the lock remains (I, M, I).

P3 releases the lock by setting it to 0; P2 receives an invalidate and the lock becomes Exclusive to P3. The status of the line that contains the lock becomes (I, I, M).

P2 reads the lock ((I, S, S)), finds it equal to 0, and sets it to 1 ((I, M, I)). P2 enters the critical region.