
Multiprocessors

Key questions:
- How do parallel processors share data?
- How do parallel processors coordinate?
- How many processors?

Two main approaches to sharing data (contrasted in the code sketch below):
- Shared memory (single address space), which requires synchronization; memory access may be uniform (UMA) or nonuniform (NUMA)
- Message passing, as in clusters (processors connected via a LAN)

Two types of connection:
- Single bus
- Network

Category       Choice                Number of processors
Sharing data   Shared memory, UMA    2-64
               Shared memory, NUMA   8-256
               Message passing       8-256
Connection     Network               8-256
               Bus                   2-32
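To make the two sharing models concrete, here is a minimal C sketch (not from the original slides) assuming a POSIX system: the shared-memory half communicates through a mutex-protected global variable, and the message-passing half sends the same value through a pipe between two processes.

    /* Minimal sketch contrasting the two data-sharing models above.
       Assumes POSIX threads and pipes; not part of the original lecture. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Shared memory: both threads address the same variable directly;
       the mutex provides the synchronization mentioned above. */
    static int shared_value;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_value = 42;                       /* communicate by writing shared data */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        /* Shared memory (single address space). */
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);
        printf("shared memory:   value = %d\n", shared_value);

        /* Message passing: an explicit channel instead of shared variables. */
        int fd[2], received;
        pipe(fd);
        if (fork() == 0) {                       /* child acts as the sending processor */
            int msg = 42;
            write(fd[1], &msg, sizeof msg);
            _exit(0);
        }
        read(fd[0], &received, sizeof received); /* parent acts as the receiver */
        printf("message passing: value = %d\n", received);
        return 0;
    }

Compile with cc -pthread. Both halves print 42, but only the first relies on a single address space; the second exchanges data only through an explicit send and receive.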

Programming multiprocessors

Programming a multiprocessor is difficult:
- Communication problems
- It requires knowledge of the hardware
- All parts of the program should be parallelized

Multiprocessors connected by a single bus

[Figure: several processors, each with its own cache, connected to memory and I/O over a single bus]

- Each processor is smaller than a multichip processor
- Using caches can reduce the bus traffic
- Mechanisms exist to keep the caches and memory consistent

Parallel program: summing 100,000 numbers on 10 processors

Shared data: A[100000], sum[10]
Private data: i, half (Pn is the number of this processor, 0-9)

    sum[Pn] = 0;
    for (i = 10000*Pn; i < 10000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];            /* each processor sums its slice */

    half = 10;                               /* 10 processors */
    repeat
        sync();                              /* wait for partial sum completion - barrier synchronization */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];   /* fold in the odd element when half is odd */
        half = half / 2;                     /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);                       /* the final sum ends up in sum[0] */

A runnable threaded version is sketched below.
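The pseudocode runs as-is if ten POSIX threads stand in for the ten processors and pthread_barrier_wait plays the role of sync(). The sketch below makes exactly that substitution; the names NPROC and worker, and initializing A to all ones so the expected result is 100000, are our additions, not the slide's.

    /* Runnable version of the parallel-sum pseudocode: POSIX threads stand in
       for the 10 processors, and a pthread barrier plays the role of sync(). */
    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 10
    #define N     100000

    static double A[N];
    static double sum[NPROC];
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        int Pn = (int)(long)arg;                 /* processor number, 0..9 */

        sum[Pn] = 0;                             /* phase 1: sum the assigned slice */
        for (int i = 10000 * Pn; i < 10000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        int half = NPROC;                        /* phase 2: tree reduction */
        do {
            pthread_barrier_wait(&barrier);      /* the sync() barrier */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];         /* fold in the odd element */
            half = half / 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = 1.0;  /* so the answer should be 100000 */

        pthread_t t[NPROC];
        pthread_barrier_init(&barrier, NULL, NPROC);
        for (long p = 0; p < NPROC; p++)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (int p = 0; p < NPROC; p++)
            pthread_join(t[p], NULL);
        pthread_barrier_destroy(&barrier);

        printf("total = %.0f\n", sum[0]);        /* the final sum is in sum[0] */
        return 0;
    }

Every thread executes the same number of reduction iterations (half goes 10, 5, 2, 1), so all ten reach the barrier each round, and the barrier separates the disjoint reads and writes of sum[] between rounds.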

Multiprocessor cache coherency

[Figure: each processor has a snoop tag and a cache (tag and data); all are connected to memory and I/O over a single bus]

Snooping (monitoring) protocols locate all caches that share a block that is about to be written. Then:
- Write-invalidate: the writing processor causes all copies in other caches to be invalidated before changing its local copy. Similar in spirit to write-back.
- Write-update (broadcast): the writing processor sends the new data (just the word) over the bus, and all caches holding the block update their copies. Similar in spirit to write-through.

The block size matters: an update can broadcast only a word rather than a whole block, and larger blocks raise the chance of false sharing (see the sketch below).
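False sharing can be demonstrated with a small experiment, sketched below under the assumption of 64-byte cache blocks; the struct and function names are ours. Two threads repeatedly increment different counters: in the first run the counters sit in the same block, so every write invalidates the other core's copy, while in the second run padding puts them in separate blocks. Timing both runs on a multicore machine should show the cost of the extra coherence traffic.

    /* False-sharing demonstration: two threads write different words that share
       (or do not share) a cache block. Assumes 64-byte blocks; the volatile
       qualifier keeps the compiler from collapsing the loops. cc -O2 -pthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    struct same_block { volatile long a; volatile long b; };                /* a and b share a block */
    struct padded     { volatile long a; char pad[64]; volatile long b; };  /* padding separates them */

    static struct same_block s;
    static struct padded     p;

    static void *bump_sa(void *x) { (void)x; for (long i = 0; i < ITERS; i++) s.a++; return NULL; }
    static void *bump_sb(void *x) { (void)x; for (long i = 0; i < ITERS; i++) s.b++; return NULL; }
    static void *bump_pa(void *x) { (void)x; for (long i = 0; i < ITERS; i++) p.a++; return NULL; }
    static void *bump_pb(void *x) { (void)x; for (long i = 0; i < ITERS; i++) p.b++; return NULL; }

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void run(void *(*fa)(void *), void *(*fb)(void *), const char *label) {
        pthread_t t1, t2;
        double t0 = now();
        pthread_create(&t1, NULL, fa, NULL);
        pthread_create(&t2, NULL, fb, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%-28s %.2f s\n", label, now() - t0);
    }

    int main(void) {
        run(bump_sa, bump_sb, "same block (false sharing):");
        run(bump_pa, bump_pb, "padded (no false sharing):");
        return 0;
    }

Neither run has a data race, since each thread touches only its own counter; any slowdown in the first run comes purely from the block bouncing between the two caches.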

Write-invalidate cache coherency protocol based on a write-back policy

Each cache block is in one of the following states:
- Read Only (clean): the block has not been written and may be shared
- Read/Write (dirty): the block has been written and may not be shared
- Invalid: the block does not hold valid data

[Figure: two state-transition diagrams over the states Invalid (not valid cache block), Read Only (clean), and Read/Write (dirty).
a. Cache state transitions using signals from the processor: a read miss takes an Invalid block to Read Only; a write (hit or miss) takes the block to Read/Write, sending an invalidate on a write hit; a read miss to a Read/Write block writes the dirty block back to memory and leaves the block Read Only.
b. Cache state transitions using signals from the bus: an invalidate or a write miss from another processor (seen on the bus) takes a Read Only block to Invalid; a read miss or a write miss from another processor for a Read/Write block forces the old block to be written back and the block becomes Invalid.]

A small code sketch of this state machine follows.
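One way to read the two diagrams is as a single finite-state machine per cache block, sketched below in C. The state and event names follow the slide; the function names and output flags are our additions, and bus arbitration, block addresses, and actual data movement are deliberately left out.

    /* Three-state write-invalidate protocol for one cache block, modeled as a
       finite-state machine. A sketch only: no bus, no addresses, no data. */
    #include <stdio.h>
    #include <stdbool.h>

    typedef enum { INVALID, READ_ONLY, READ_WRITE } BlockState;

    /* (a) Transitions caused by the local processor. */
    static BlockState on_processor(BlockState s, bool is_write, bool is_miss,
                                   bool *send_invalidate, bool *write_back) {
        *send_invalidate = false;
        *write_back = false;
        if (s == READ_WRITE && is_miss)
            *write_back = true;                   /* evicting a dirty block: write it back */
        if (is_write) {
            if (s == READ_ONLY && !is_miss)
                *send_invalidate = true;          /* "send invalidate if hit" */
            return READ_WRITE;                    /* the writer now holds a dirty copy */
        }
        if (s == READ_WRITE && !is_miss)
            return READ_WRITE;                    /* read hit on our own dirty copy */
        return READ_ONLY;                         /* read miss (or clean hit): clean copy */
    }

    /* (b) Transitions caused by traffic snooped on the bus. In this simplified
       protocol, any remote miss for a block we hold dirty makes us write it
       back and give up our copy. */
    static BlockState on_bus(BlockState s, bool other_is_write, bool *write_back) {
        *write_back = (s == READ_WRITE);          /* write back the old (dirty) block */
        if (s == READ_WRITE) return INVALID;      /* remote read or write miss for dirty block */
        if (other_is_write) return INVALID;       /* invalidate or remote write miss */
        return s;                                 /* remote read miss on a clean block: keep it */
    }

    int main(void) {
        bool inv, wb;
        BlockState s = INVALID;
        s = on_processor(s, false, true,  &inv, &wb);  /* local read miss  -> Read Only         */
        s = on_processor(s, true,  false, &inv, &wb);  /* local write hit  -> Read/Write        */
        s = on_bus(s, false, &wb);                     /* remote read miss -> write back, Invalid */
        printf("final state = %d (0=Invalid, 1=Read Only, 2=Read/Write)\n", (int)s);
        return 0;
    }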

Synchronization using coherency

- Using locks (semaphores)
- Atomic swap operation

Spin lock using an atomic swap:
1. Load the lock variable.
2. Unlocked (= 0)? If not, spin: go back to step 1.
3. Try to lock the variable using swap: read the lock variable, then set it to the locked value (1).
4. Did the swap read 0 (succeed)? If not, go back to step 1.
5. Begin update of shared data ... finish update of shared data.
6. Unlock: set the lock variable to 0.

FIGURE 9.3.6 Cache coherence steps and bus traffic for three processors, P0, P1, and P2. This figure assumes write-invalidate coherency. P0 starts with the lock (step 1), then exits and unlocks it (step 2). P1 and P2 race to see which reads the unlocked value during the swap (steps 3-5). P2 wins and enters the critical section (steps 6 and 7), while P1 spins and waits (steps 7 and 8). The critical section is the name for the code between the lock and the unlock. When P2 exits the critical section, it sets the lock to 0, which invalidates the copy in P1's cache, restarting the process.

Step | P0 | P1 | P2 | Bus activity | Memory
1 | Has lock | Spins, testing if lock = 0 | Spins, testing if lock = 0 | None |
2 | Sets lock to 0; sends invalidate over bus | Spins, testing if lock = 0 | Spins, testing if lock = 0 | Write-invalidate of lock variable sent from P0 |
3 | | Cache miss | Cache miss | Bus services P2's cache miss |
4 | Responds to P2's cache miss; sends lock = 0 | (Waits for cache miss) | (Waits for cache miss) | Response to P2's cache miss | Update memory with block from P0
5 | | (Waits for cache miss) | Tests lock = 0; succeeds | Bus services P1's cache miss |
6 | | Tests lock = 0; succeeds | Attempts swap; needs write permission | Response to P1's cache miss | Responds to P1's cache miss; sends lock variable
7 | | Attempts swap; needs write permission | Sends invalidate to gain write permission | Bus services P2's invalidate |
8 | | Cache miss | Swap; reads lock = 0 and sets it to 1 | Bus services P1's cache miss |
9 | | Swap; reads lock = 1 and sets it to 1; goes back to spin | Responds to P1's cache miss; sends lock = 1 | Response to P1's cache miss |
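The flowchart is a spin lock built on an atomic swap. A minimal C11 sketch of the same idea follows, assuming <stdatomic.h>: atomic_load plays "load lock variable" and atomic_exchange plays the swap; the function names acquire and release are ours. The plain-read spin before the swap (test-and-test-and-set) keeps waiting processors hitting in their own caches, which matches the "spins, testing if lock = 0" behavior traced in the table above.

    /* Spin lock built on an atomic swap, following the flowchart above.
       Requires C11 atomics; compile with cc -pthread. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int lock_var = 0;      /* 0 = unlocked, 1 = locked */
    static long shared_counter = 0;      /* the shared data guarded by the lock */

    static void acquire(void) {
        for (;;) {
            while (atomic_load(&lock_var) != 0)
                ;                        /* spin on a read: stays in the local cache */
            if (atomic_exchange(&lock_var, 1) == 0)
                return;                  /* swap read 0: the lock is ours */
        }
    }

    static void release(void) {
        atomic_store(&lock_var, 0);      /* unlock: set lock variable to 0 */
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            acquire();                   /* begin update of shared data */
            shared_counter++;
            release();                   /* finish update, then unlock */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[3];                  /* three contenders, like P0, P1, P2 */
        for (int i = 0; i < 3; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expected 3000000)\n", shared_counter);
        return 0;
    }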