COMP 633 Parallel Computing. CC-NUMA (1): CC-NUMA implementation


COMP 633 - Parallel Computing
Lecture 10, September 27, 2018
CC-NUMA (1): CC-NUMA implementation
Reading for next time: memory consistency models tutorial (sections 1-6, pp 1-17)

Topics
Objective of the next few lectures: examine some implementation issues in shared-memory multiprocessors
  cache coherence
  memory consistency
  synchronization mechanisms
Why?
  Correctness: memory consistency can be the source of very subtle bugs
  Performance: cache coherence and synchronization mechanisms can have profound performance implications

Cache-coherent shared memory multiprocessor
Implementations
  shared bus: the bus may be a slotted ring
  scalable interconnect: fixed per-processor bandwidth
  [Figure: processors P1..Pp with private caches C1..Cp and memory modules M1..Mk, connected by a shared bus or a scalable interconnect]
Effect of a CPU write on the local cache
  write-through policy: the value is written to the cache and to memory
  write-back policy: the value is written to the cache only; memory is updated when the cache line is evicted
Effect of a CPU write on a remote cache
  update: the remote value is modified
  invalidate: the remote value is marked invalid

Bus-based shared-memory protocols
Snooping caches
  Ci caches memory operations from Pi
  Ci monitors all activity on the bus due to Ch (h != i)
Update protocol with write-through cache
  Between processor Pi and cache Ci:
    read-hit from Pi: resolved from Ci
    read-miss from Pi: resolved from memory and inserted into Ci
    write (hit or miss) from Pi: updates Ci and memory [write-through]
  Between cache Ci and cache Ch:
    if Ci writes a memory location cached at Ch, then Ch is updated with the new value
  Consequence: every write uses the bus, so the protocol doesn't scale (see the sketch below)
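A minimal sketch of the update protocol just described, not the lecture's code: it models a single shared line and simulates the bus snoop as a loop over the other caches, so every write touches memory and every remote copy.

```c
/* Write-through update protocol, one shared line, NCACHES snooping caches. */
#include <stdbool.h>
#include <stdio.h>

#define NCACHES 4

typedef struct {
    bool valid;   /* does this cache hold the line?  */
    int  value;   /* cached value of the line        */
} cache_t;

static cache_t cache[NCACHES];
static int memory;                     /* the single shared line in memory */

/* Write: write-through to memory, and update every remote copy via the bus. */
static void bus_write(int writer, int value) {
    cache[writer].valid = true;
    cache[writer].value = value;       /* local cache updated               */
    memory = value;                    /* write-through to memory           */
    for (int h = 0; h < NCACHES; h++)
        if (h != writer && cache[h].valid)
            cache[h].value = value;    /* snooping caches update their copy */
}

/* Read: a hit is served from the cache, a miss is filled from memory. */
static int cpu_read(int reader) {
    if (!cache[reader].valid) {
        cache[reader].valid = true;
        cache[reader].value = memory;  /* read miss: fill from memory       */
    }
    return cache[reader].value;
}

int main(void) {
    bus_write(0, 42);
    printf("P1 reads %d\n", cpu_read(1));  /* 42: miss filled from memory        */
    bus_write(2, 7);
    printf("P1 reads %d\n", cpu_read(1));  /* 7: copy was updated by the snoop   */
    return 0;
}
```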

Bus-based shared-memory protocols
Invalidation protocol with write-back cache
Cache blocks can be in one of three states:
  INVALID: the block does not contain valid data
  SHARED: the block is a current copy of the memory data; other copies may exist in other caches
  EXCLUSIVE: the block holds the only copy of the correct data; memory may be incorrect, and no other cache holds this block
Handling exclusively-held blocks
  Processor events: this cache is the block owner, so reads and writes are local
  Snooping events: on detecting a read-miss or write-miss from another processor to an exclusive block,
    write the block back to memory
    change its state to SHARED (on an external read-miss) or INVALID (on an external write-miss)

Invalidation protocol: example
[Figure: a sequence of reads and writes of location x by processors P1, P2, and P3, showing after each operation (R x, W x) the value cached at each processor and the state of each copy (Shared, Exclusive, or Invalid)]

Implementation: FSM per cache line
[Figure: state diagram over the Invalid, Shared, and Exclusive states, showing actions in response to CPU events (e.g. a CPU read in Invalid places a read-miss on the bus; an eviction returns the line to Invalid) and in response to snooped bus events (e.g. a write-miss for this block moves Shared to Invalid)]
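A minimal sketch of that per-line state machine, assuming the three-state protocol described on the previous slide; the event names and the bus/memory helpers are illustrative stubs, not the lecture's code.

```c
/* Per-cache-line FSM for the Invalid / Shared / Exclusive invalidation protocol. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;
typedef enum { CPU_READ, CPU_WRITE, EVICTION,           /* processor events   */
               BUS_READ_MISS, BUS_WRITE_MISS } event_t; /* snooped bus events */

/* Placeholder bus and memory actions. */
static void bus_read_miss(void)  { puts("  place read-miss on bus");     }
static void bus_write_miss(void) { puts("  place write-miss on bus");    }
static void writeback(void)      { puts("  write block back to memory"); }

static state_t next_state(state_t s, event_t e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  { bus_read_miss();  return SHARED;    }
        if (e == CPU_WRITE) { bus_write_miss(); return EXCLUSIVE; }
        return INVALID;
    case SHARED:
        if (e == CPU_READ)       return SHARED;                /* read hit, local  */
        if (e == CPU_WRITE)      { bus_write_miss(); return EXCLUSIVE; }
        if (e == EVICTION)       return INVALID;
        if (e == BUS_WRITE_MISS) return INVALID;               /* another writer   */
        return SHARED;
    case EXCLUSIVE:
        if (e == CPU_READ || e == CPU_WRITE) return EXCLUSIVE; /* owner: local     */
        if (e == EVICTION)       { writeback(); return INVALID; }
        if (e == BUS_READ_MISS)  { writeback(); return SHARED;  }  /* external read  */
        if (e == BUS_WRITE_MISS) { writeback(); return INVALID; }  /* external write */
        return EXCLUSIVE;
    }
    return s;
}

int main(void) {
    state_t s = INVALID;
    s = next_state(s, CPU_READ);       /* Invalid -> Shared                     */
    s = next_state(s, CPU_WRITE);      /* Shared -> Exclusive                   */
    s = next_state(s, BUS_READ_MISS);  /* Exclusive -> Shared on external read  */
    printf("final state: %d\n", s);
    return 0;
}
```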

Scalable shared memory: directory-based protocols
The Stanford DASH multiprocessor
  Processing clusters are connected via a scalable network
  Global memory is distributed equally among the clusters
  Caching is performed using an ownership protocol
  Each memory block has a home processing cluster
  At each cluster, a directory tracks the location and state of each cached block whose home is on that cluster
  [Figure: processing clusters, each containing processors with caches, a slice of memory, and a directory, connected by a scalable network]

Directories
[Figure: a per-cluster directory with one entry per home memory block (blocks 0, 1, 2, ..., 1M); each entry holds a 16-bit cluster bitmap (clusters 0..15) and a block state]
Directories track the location and state of all cache blocks
  16 MB cluster memories and 16-byte cache blocks give 1M blocks per cluster; at roughly two bytes of bitmap plus state per block, that is 2+ MB of storage overhead per directory
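A sketch of the directory layout described above; the C types and names are assumptions for illustration, not DASH's actual data structures. A packed hardware directory would hold only about 18 bits per entry (16-bit bitmap plus a few state bits), which is where the 2+ MB figure comes from; the struct below is larger for clarity.

```c
/* Per-cluster directory: one entry per 16-byte block of the 16 MB home memory. */
#include <stdint.h>

#define NUM_CLUSTERS   16
#define BLOCK_SIZE     16                          /* bytes per cache block      */
#define CLUSTER_MEM    (16u * 1024 * 1024)         /* 16 MB home memory          */
#define NUM_BLOCKS     (CLUSTER_MEM / BLOCK_SIZE)  /* = 1M directory entries     */

typedef enum { UNCACHED, CLEAN, DIRTY } dir_state_t;

typedef struct {
    uint16_t    sharers;  /* bit i set => cluster i holds a copy of the block    */
    dir_state_t state;    /* UNCACHED, CLEAN (shared), or DIRTY (single owner)   */
} dir_entry_t;

/* One entry per home block: about a million entries per cluster directory. */
static dir_entry_t directory[NUM_BLOCKS];
```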

Cache coherence in DASH
[Figure: the per-cluster directory from the previous slide]
Caching is based on an ownership model with UNCACHED, CLEAN, and DIRTY states
  The home cluster is the owner of all its UNCACHED and CLEAN blocks
  Any one cache can own the only copy of a DIRTY block

Cache coherence in DASH: read miss
Check the local cluster's caches first...
  If found and CLEAN, then copy it
  If found and DIRTY, then make it CLEAN and copy it
If not found, consult the desired block's home directory
  If CLEAN or UNCACHED, then the block is sent to the requestor
  If DIRTY, then the request is forwarded to the cluster where the block is cached; the remote cluster makes the block CLEAN and sends a copy to the requestor
To make a block CLEAN: send a copy to the owning cluster and mark it CLEAN
(A sketch of the home directory's read-miss handling follows.)
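A simplified sketch of the home-directory side of a read miss, following the cases above; the cluster ids, directory layout, and send/forward helpers are illustrative assumptions, not DASH's protocol code.

```c
/* Home directory handling of a read miss from a remote cluster. */
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, CLEAN, DIRTY } dir_state_t;
typedef struct { uint16_t sharers; dir_state_t state; int owner; } dir_entry_t;

static void send_block(int to) { printf("send block to cluster %d\n", to); }
static void forward_request(int owner, int requestor)
    { printf("forward read to dirty owner %d for requestor %d\n", owner, requestor); }

static void home_read_miss(dir_entry_t *d, int requestor) {
    switch (d->state) {
    case UNCACHED:
    case CLEAN:
        /* Home memory is up to date: reply with the block and record the sharer. */
        send_block(requestor);
        d->sharers |= (uint16_t)(1u << requestor);
        d->state = CLEAN;
        break;
    case DIRTY:
        /* Only the owning cluster has current data: forward the request.
         * The owner makes the block CLEAN (writing it back) and sends a copy
         * to the requestor; both end up recorded as sharers. */
        forward_request(d->owner, requestor);
        d->sharers |= (uint16_t)((1u << requestor) | (1u << d->owner));
        d->state = CLEAN;
        break;
    }
}

int main(void) {
    dir_entry_t d = { 0, UNCACHED, -1 };
    home_read_miss(&d, 3);            /* uncached: home replies directly          */
    d.state = DIRTY; d.owner = 5;
    home_read_miss(&d, 3);            /* dirty: request forwarded to owner 5      */
    return 0;
}
```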

Cache coherence in DASH: writes
The writing processor must first become the block's owner
If the block is cached at the requesting processor and the block is...
  DIRTY: the write can proceed
  CLEAN: the home directory must invalidate all other copies
If the block is not cached locally but is cached elsewhere on the cluster,
  a local block transfer is performed
  the home directory is updated if the state was CLEAN

Cache coherence in DASH: writes (continued)
If the block is not cached on the local cluster, then the block's home directory is contacted. If the block is...
  UNCACHED: the block is marked DIRTY and sent to the requestor
  CLEAN: the block is marked DIRTY and messages are sent to the caching clusters to invalidate their copies
  DIRTY: the request is forwarded to the caching cluster; there the block is invalidated and forwarded to the requestor
(A sketch of the home directory's write-miss handling follows.)
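A matching simplified sketch of the home-directory side of a write miss from a remote cluster; again the helpers and layout are illustrative assumptions, not DASH's protocol code.

```c
/* Home directory handling of a write miss from a remote cluster. */
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, CLEAN, DIRTY } dir_state_t;
typedef struct { uint16_t sharers; dir_state_t state; int owner; } dir_entry_t;

static void send_block(int to)      { printf("send block to cluster %d\n", to); }
static void invalidate(int cluster) { printf("invalidate copy at cluster %d\n", cluster); }
static void forward_write(int owner, int requestor)
    { printf("forward write to dirty owner %d for requestor %d\n", owner, requestor); }

static void home_write_miss(dir_entry_t *d, int requestor) {
    switch (d->state) {
    case UNCACHED:
        send_block(requestor);                    /* block sent to requestor       */
        break;
    case CLEAN:
        for (int c = 0; c < 16; c++)              /* invalidate all shared copies  */
            if ((d->sharers >> c) & 1u)
                invalidate(c);
        send_block(requestor);
        break;
    case DIRTY:
        forward_write(d->owner, requestor);       /* owner invalidates its copy and
                                                     forwards the block            */
        break;
    }
    /* In every case the requestor ends up as the sole dirty owner. */
    d->state   = DIRTY;
    d->owner   = requestor;
    d->sharers = (uint16_t)(1u << requestor);
}

int main(void) {
    dir_entry_t d = { 0x0024, CLEAN, -1 };  /* currently shared by clusters 2 and 5 */
    home_write_miss(&d, 3);
    return 0;
}
```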

Intel cache coherence
Basically a directory-based protocol like DASH, with 2 or 4 clusters
Each package (socket) is a cluster, with p cores on a slotted ring

Intel physical organization
  up to 4 sockets
  up to 22 cores per socket
  up to 44 thread contexts (2 per core)
[Figure: hierarchy of machine -> sockets 0..3 -> cores -> thread contexts]

Mapping OpenMP threads to hardware
Mapping threads to maximize data locality
  KMP_AFFINITY=granularity=fine,compact
[Figure: on a machine with 2 sockets, 2 cores per socket, and 2 thread contexts per core, OpenMP thread ids fill consecutive thread contexts: 0 1 2 3 on socket 0 and 4 5 6 7 on socket 1]

Mapping OpenMP threads to hardware
Mapping threads to maximize bandwidth, without data locality
  KMP_AFFINITY=granularity=fine,scatter
[Figure: OpenMP thread ids are scattered across sockets and cores; the eight thread contexts hold threads 0 4 2 6 (socket 0) and 1 5 3 7 (socket 1)]

Mapping OpenMP threads to hardware
Mapping threads to maximize data locality and equal thread progress
  KMP_AFFINITY=granularity=fine,compact,1,0
  OMP_NUM_THREADS=4
[Figure: the thread contexts hold OpenMP thread ids 0 4 1 5 (socket 0) and 2 6 3 7 (socket 1); with 4 threads, each core runs one thread on its first context]

Mapping OpenMP threads to hardware
Mapping threads to maximize bandwidth and equal thread progress
  KMP_AFFINITY=granularity=fine,scatter
  OMP_NUM_THREADS=4
[Figure: the thread contexts hold OpenMP thread ids 0 4 2 6 (socket 0) and 1 5 3 7 (socket 1); with 4 threads, one thread runs on each core, split across both sockets]
(A small program to verify these mappings follows.)
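A small sketch to observe where the threads actually land under each of the affinity settings above. It assumes Linux (sched_getcpu) and a compiler that honors KMP_AFFINITY, such as Intel's; with gcc, OMP_PROC_BIND and OMP_PLACES play the corresponding role.

```c
/* Prints which logical CPU each OpenMP thread is running on.
 * Build: icc -qopenmp affinity.c      (or: gcc -fopenmp affinity.c)
 * Run:   KMP_AFFINITY=granularity=fine,compact OMP_NUM_THREADS=8 ./a.out  */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu(), Linux-specific */
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* Each thread reports its OpenMP id and its current hardware context. */
        printf("OpenMP thread %2d -> logical CPU %2d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```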

Coherence and consistency
Coherence: the behavior of a single memory location viewed from a single processor; a read returns the most recently written value
Consistency: the behavior of multiple memory locations read and written by multiple processors, viewed from one or more of the processors; a read may not return the most recent value
  What are the permitted orderings among reads and writes of several memory locations? (A small illustration follows.)
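A classic two-location illustration of the ordering question, sketched here as a deliberately unsynchronized C/OpenMP fragment (an assumption for illustration, not from the lecture). Whether the reader, having seen flag == 1, is guaranteed to also see data == 42 is exactly what the consistency model specifies: sequential consistency guarantees it, weaker models may not. volatile only keeps the compiler from caching the values in registers; it imposes no hardware ordering, and a correct program would use atomics or OpenMP flush/atomic constructs.

```c
/* Two threads communicate through two locations, data and flag. */
#include <stdio.h>
#include <omp.h>

volatile int data = 0, flag = 0;

int main(void) {
    /* Assumes at least two OpenMP threads are available. */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                      /* W data                         */
            flag = 1;                       /* W flag                         */
        } else {
            while (flag == 0) { /* spin */ }    /* R flag: wait until set     */
            printf("data = %d\n", data);        /* R data: 42 under sequential
                                                   consistency; possibly the old
                                                   value under weaker models  */
        }
    }
    return 0;
}
```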