Distributed Memory and Cache Consistency (some slides courtesy of Alvin Lebeck)


Software DSM 101

Software-based distributed shared memory (DSM) provides an illusion of shared memory on a cluster:
- remote-fork the same program on each node
- data resides in a common virtual address space
- library/kernel collude to make the shared VAS appear consistent

The Great War: shared memory vs. message passing. For the full story, take CPS 221.

[Figure: cluster nodes connected by a switched interconnect]

Page-Based DSM (Shared Virtual Memory)

The virtual address space is shared: each page of the shared virtual address space is backed by a physical frame in some node's local DRAM.

[Figure: one shared virtual address space mapped onto physical DRAM in two nodes]

Inside Page-Based DSM (SVM)

The page-based approach uses a write-ownership token protocol on virtual memory pages [Kai Li, Ivy, 1986; Paul Leach, Apollo, 1982]. The system maintains a per-node, per-page access mode in {shared, exclusive, no-access}, which determines the local accesses allowed. Modes are enforced with VM page protection:

    mode       | load | store
    shared     | yes  | no
    exclusive  | yes  | yes
    no-access  | no   | no
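
As a concrete illustration, here is a minimal sketch (not from the slides) of how these modes might map onto VM page protection with mprotect(2); the page_mode enum and set_page_mode helper are hypothetical names:

```c
#include <sys/mman.h>
#include <stdio.h>

/* Per-page access modes from the table above. */
typedef enum { MODE_NO_ACCESS, MODE_SHARED, MODE_EXCLUSIVE } page_mode;

/* Map a DSM access mode onto VM page protection bits:
 *   shared    -> read-only  (loads allowed, stores trap)
 *   exclusive -> read/write (loads and stores allowed)
 *   no-access -> none       (any access traps)
 */
static int set_page_mode(void *page, size_t page_size, page_mode mode)
{
    int prot;
    switch (mode) {
    case MODE_SHARED:    prot = PROT_READ;              break;
    case MODE_EXCLUSIVE: prot = PROT_READ | PROT_WRITE; break;
    default:             prot = PROT_NONE;              break;
    }
    return mprotect(page, page_size, prot);
}

int main(void)
{
    size_t psz = 4096;  /* assume 4 KB pages for this sketch */
    void *page = mmap(NULL, psz, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    /* Grant shared (read-only) access; a store here would now trap. */
    if (set_page_mode(page, psz, MODE_SHARED) != 0) perror("mprotect");
    return 0;
}
```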

Write-Ownership Protocol

A write-ownership protocol guarantees that nodes observe sequential consistency of memory accesses:
- Any node with any access has the latest copy of the page: on any transition from no-access, fetch the current copy of the page.
- A node with exclusive access holds the only copy: at most one node may hold a page in exclusive mode. On transition into exclusive, invalidate all remote copies and set their mode to no-access.
- Multiple nodes may hold a page in shared mode, permitting concurrent reads: every holder has the same data. On transition into shared mode, revoke exclusive access from the remote copy (if any) and set its mode to shared as well.

Paged DSM/SVM Example

P1 reads virtual address x:
- page fault
- allocate a physical frame for page(x)
- request page(x) from home(x)
- set page(x) readable
- resume

P1 then writes virtual address x:
- protection fault
- request exclusive ownership of page(x)
- set page(x) writeable
- resume
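
A hedged sketch of this fault-handling path, assuming a SIGSEGV handler and hypothetical protocol stubs fetch_page_from_home() and acquire_exclusive_ownership() standing in for the messaging to home(x):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 1024

/* Local view of each page's DSM mode (see the access-mode table). */
typedef enum { MODE_NO_ACCESS, MODE_SHARED, MODE_EXCLUSIVE } page_mode;
static page_mode mode_of[NUM_PAGES];
static void *dsm_base;  /* base address of the shared region */

/* Stubs for the protocol messaging; a real system talks to home(x). */
static void fetch_page_from_home(void *page)        { (void)page; /* ... */ }
static void acquire_exclusive_ownership(void *page) { (void)page; /* ... */ }

/* Fault handler implementing the slide's two cases. */
static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    void *page = (void *)addr;
    size_t idx = (addr - (uintptr_t)dsm_base) / PAGE_SIZE;

    if (mode_of[idx] == MODE_NO_ACCESS) {
        /* Page fault on a read (or first touch): fetch page(x) from
         * home(x), set it readable, and resume. */
        fetch_page_from_home(page);
        mprotect(page, PAGE_SIZE, PROT_READ);
        mode_of[idx] = MODE_SHARED;
    } else {
        /* Protection fault on a write: request exclusive ownership of
         * page(x), set it writeable, and resume. */
        acquire_exclusive_ownership(page);
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        mode_of[idx] = MODE_EXCLUSIVE;
    }
    /* Returning from the handler retries the faulting instruction. */
}

void install_dsm_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```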

The Sequential Consistency Memory Model

Processors issue memory operations in program order; a conceptual switch, randomly set after each memory operation, ensures some serial order among all operations. Easily implemented with a shared bus.

[Figure: P1, P2, P3 feeding through a single switch into Memory]
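
To make the model concrete, here is a small sketch (mine, not from the slides) of the classic store-buffering test using C11 seq_cst atomics; under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible, because some single interleaving must order all four operations:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering litmus test. Under sequential consistency, r1 and
 * r2 cannot both be 0: whichever store comes last in the serial order
 * is seen by the other thread's load. */
atomic_int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);  /* never prints r1=0 r2=0 */
    return 0;
}
```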

Motivation for Weaker Orderings

1. Sequential consistency is sufficient (but not necessary) for shared-memory parallel computations to execute correctly.
2. Sequential consistency is slow for paged DSM systems: processors cannot observe memory bus traffic in other nodes, and even if they could, there is no shared bus to serialize accesses. The protection granularity (pages) is too coarse.
3. Basic problem: the need for exclusive access to cache lines (pages) leads to false sharing, causing a ping-pong effect when there are multiple writers to the same page.
4. Solution: allow multiple writers to a page if their writes are nonconflicting.

Weak Ordering

Careful access ordering only matters when data is shared, and shared data should be synchronized:
- Classify memory operations as data or synchronization.
- Data operations may be reordered between synchronization operations.
- Synchronization forces a consistent view at all synchronization points: on a visible synchronization operation, the system can flush the write buffer and obtain ACKs for all previous memory operations.
- A synchronization operation cannot complete until all previous operations complete (e.g., all invalidations are ACKed).

Weak Ordering Example

[Figure: streams of Read/Write operations separated by Synch points; data operations may reorder freely between synchronization operations]

Initially x = y = 0. Without synchronization, A may observe y > x:

    A: if (y > x) panic("ouch");
    B: loop { x = x + 1; y = y + 1; }

With synchronization, A never panics, because B's updates to x and y become visible atomically at release:

    A: acquire(); if (y > x) panic("ouch"); release();
    B: loop() { acquire(); x = x + 1; y = y + 1; release(); }
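
A runnable rendering of this example using POSIX threads (structure and names are mine): with the lock held around both the writes and the reads, A can never observe y > x.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* B increments x then y; A panics if it ever observes y > x. The
 * lock's acquire/release semantics make that observation impossible. */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long x = 0, y = 0;

static void *thread_b(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    /* acquire() */
        x = x + 1;
        y = y + 1;
        pthread_mutex_unlock(&m);  /* release() */
    }
    return NULL;
}

static void *thread_a(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    /* acquire() */
        if (y > x) {
            fprintf(stderr, "ouch\n");  /* never reached */
            abort();
        }
        pthread_mutex_unlock(&m);  /* release() */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no panic");
    return 0;
}
```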

Multiple Writer Protocol

Suppose x and y are on the same page: P1 writes x, P2 writes y. We don't want the delays associated with the constraint of exclusive access, so:
- Allow each processor to modify its local copy of a page between synchronization points.
- Make things consistent at the synchronization point.

Treadmarks 101

Goal: implement the laziest software DSM system.
- Eliminate false sharing with a multiple-writer protocol.
- Capture page updates at a fine grain by diffing; propagate just the modified bytes (deltas). This allows merging of concurrent nonconflicting updates.
- Propagate updates only when needed, i.e., when the program uses shared locks to force consistency. Assume the program is fully synchronized.

This is lazy release consistency (LRC): A need not be aware of B's updates except when needed to preserve potential causality with respect to shared synchronization accesses.

Lazy Release Consistency

Piggyback write notices with acquire operations:
- guarantees updates are visible on acquire
- lazier than Munin, which propagates updates on release
- the implementation propagates invalidations rather than updates

[Figure: timelines for P0, P1, P2 with x and y on the same page; each lock grant carries invalidations, and the updates to x flow to an acquirer only when it references the page]

Ordering of Events in Treadmarks

[Figure: timelines for A, B, and C, each with Acquire(x)/Release(x) events; the lock transfers order the intervals they define]

LRC is not linearizable: there is no fixed global ordering of events. There is a serializable partial order on synchronization events and the intervals they define.

Vector Timestamps in Treadmarks

To maintain the partial order on intervals, each node maintains a current vector timestamp (CVT):
- Intervals on each node are numbered 0, 1, 2, ...
- CVT is a vector of length N, the number of nodes.
- CVT[i] is the number of the last preceding interval on node i.

Vector timestamps are updated on lock acquire: the CVT is passed with the lock acquire request, compared with the holder's CVT, and the pairwise maximum CVT is returned with the lock.
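
A minimal sketch of the CVT merge performed on lock acquire (the array representation and function name are illustrative):

```c
#define N 16  /* number of nodes; illustrative constant */

/* Current vector timestamp: v[i] is the number of the last
 * preceding interval on node i. */
typedef struct { unsigned v[N]; } cvt_t;

/* On lock acquire, the holder merges the requester's CVT with its
 * own and returns the pairwise maximum with the lock token. */
static void cvt_merge(cvt_t *dst, const cvt_t *a, const cvt_t *b)
{
    for (int i = 0; i < N; i++)
        dst->v[i] = (a->v[i] > b->v[i]) ? a->v[i] : b->v[i];
}
```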

LRC Protocol

[Figure: A generates write notices and releases lock x; B acquires it, receiving with the lock token the write notices for CVT_A - CVT_B; a later reference to a page with a pending write notice triggers a delta request and a delta list; B then releases to C, forwarding write notices]

What does A send to B when passing the lock token?
- a list of intervals known to A but not to B (CVT_A - CVT_B)
- for each interval in the list:
  - the origin node i and interval number n
  - i's vector clock CVT_i during that interval n on node i
  - a list of pages dirtied by i during that interval n

These dirty-page notifications are called write notices.
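
A hedged sketch of the records carried with the lock grant, as listed above (struct layout and names are illustrative):

```c
#include <stddef.h>

#define N 16  /* number of nodes; illustrative constant */

typedef struct { unsigned v[N]; } cvt_t;

/* One interval record sent with the lock token: the origin node, the
 * interval number, the origin's vector clock during that interval,
 * and the pages it dirtied (the write notices). */
typedef struct interval_record {
    int      origin_node;        /* node i */
    unsigned interval_number;    /* interval n on node i */
    cvt_t    clock;              /* CVT_i during interval n */
    size_t   num_dirty_pages;
    unsigned long *dirty_pages;  /* page numbers dirtied in interval n */
    struct interval_record *next;
} interval_record;

/* Message passed with the lock grant: all intervals known to the
 * holder but not to the requester (CVT_A - CVT_B). */
typedef struct {
    size_t num_intervals;
    interval_record *intervals;
} lock_grant_msg;
```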

Write Notices

LRC requires that each node be aware of any updates to a shared page made during a preceding interval. Updates are tracked as sets of write notices:
- A write notice is a record that a page was dirtied during an interval.
- Write notices propagate with locks. When relinquishing a lock token, the holder returns all write notices for intervals added to the caller's CVT.
- Page protections are used to collect and process write notices: the first store to each page is trapped, creating a write notice, and pages named in received write notices are invalidated on acquire.

Capturing Updates (Write Collection)

To permit multiple writers to a page, updates are captured as deltas, made by diffing the page. Delta records include only the bytes modified during the interval(s) in question.
- On first write, make a copy of the page (a twin); mark the page dirty and write-enable it.
- Send write notices for all dirty pages.
- To create deltas, diff the page with its twin; record the deltas, mark the page clean, and disable writes.
- Cache write notices by {node, interval, page}; cache local deltas with the associated write notice.
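
A minimal sketch of twin and delta creation (names and the emit callback are mine; a real implementation would use a more compact encoding of modified runs):

```c
#include <string.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* On first write to a clean page: save a twin, then write-enable
 * the page (the write-enabling is done by the caller via mprotect). */
void *make_twin(const void *page)
{
    void *twin = malloc(PAGE_SIZE);
    if (twin) memcpy(twin, page, PAGE_SIZE);
    return twin;
}

/* Diff the page against its twin and call emit() for each run of
 * modified bytes; only modified bytes are propagated. Afterward the
 * caller records the deltas, marks the page clean, discards the twin,
 * and disables writes again. */
void make_delta(const unsigned char *page, const unsigned char *twin,
                void (*emit)(unsigned off, unsigned len,
                             const unsigned char *bytes))
{
    unsigned i = 0;
    while (i < PAGE_SIZE) {
        if (page[i] != twin[i]) {
            unsigned start = i;
            while (i < PAGE_SIZE && page[i] != twin[i]) i++;
            emit(start, i - start, page + start);  /* modified bytes only */
        } else {
            i++;
        }
    }
}
```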

Lazy Interval/Diff Creation

1. Don't create intervals on every acquire/release; do it only if there is communication with another node.
2. Delay generation of deltas (diffs) until somebody asks. When passing a lock token, send write notices for modified pages, but leave them write-enabled; diff and mark clean only if somebody asks for deltas. Deltas may then include updates from later intervals (e.g., made under the scope of other locks).
3. Deltas must also be generated when a write notice arrives: the node must distinguish its local updates from updates made by peers.
4. Periodic garbage collection is needed.

Treadmarks Page State Transitions

[Figure: state diagram with states no-access, read-only (clean), and write-enabled (dirty)]

- no-access → read-only (clean): on load; fetch deltas.
- no-access → write-enabled (dirty): on store; fetch deltas, twin the page, and cache a write notice.
- read-only (clean) → write-enabled (dirty): on store; twin the page and cache a write notice.
- read-only (clean) → no-access: on write notice received.
- write-enabled (dirty) → read-only (clean): on delta request received; diff and discard the twin.
- write-enabled (dirty) → no-access: on write notice received; diff and discard the twin.
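
A hedged sketch of these transitions as a small state machine (event names and action stubs are mine):

```c
#include <stdio.h>

/* Page states from the diagram above. */
typedef enum { NO_ACCESS, READ_ONLY_CLEAN, WRITE_ENABLED_DIRTY } page_state;

/* Events driving the transitions. */
typedef enum { EV_LOAD, EV_STORE, EV_WRITE_NOTICE, EV_DELTA_REQUEST } page_event;

/* Hypothetical action stubs; a real system performs protocol I/O here. */
static void fetch_deltas(void)          { puts("fetch deltas"); }
static void twin_and_notice(void)       { puts("twin page, cache write notice"); }
static void diff_and_discard_twin(void) { puts("diff page, discard twin"); }

/* One step of the page state machine, as reconstructed above. */
static page_state step(page_state s, page_event e)
{
    switch (s) {
    case NO_ACCESS:
        if (e == EV_LOAD)  { fetch_deltas(); return READ_ONLY_CLEAN; }
        if (e == EV_STORE) { fetch_deltas(); twin_and_notice();
                             return WRITE_ENABLED_DIRTY; }
        break;
    case READ_ONLY_CLEAN:
        if (e == EV_STORE)        { twin_and_notice(); return WRITE_ENABLED_DIRTY; }
        if (e == EV_WRITE_NOTICE) { return NO_ACCESS; }
        break;
    case WRITE_ENABLED_DIRTY:
        if (e == EV_WRITE_NOTICE)  { diff_and_discard_twin(); return NO_ACCESS; }
        if (e == EV_DELTA_REQUEST) { diff_and_discard_twin(); return READ_ONLY_CLEAN; }
        break;
    }
    return s;  /* no transition for this (state, event) pair */
}
```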

Ordering Conflicting Updates

Write notices must include the origin node and CVT; CVTs are compared to order the updates. Variables i and j are on the same page, under control of locks x and y.

[Figure: timelines for nodes A, B, C, and D. In interval B2, B writes j = 0 and i = 0. A acquires lock y and writes j = 1 in interval A1; C acquires lock x and writes i = 1 in interval C1. D then acquires both locks and asks: i == ?, j == ? The vector timestamps give B2 < A1 and B2 < C1, while A1 and C1 are concurrent.]

Ordering Conflicting Updates (2)

D receives write notices for the same page from A and C, covering their updates to the page; B's write notice for the page arrives with A's. If D then touches the page, it must fetch updates (deltas) from three different nodes (A, B, C), since it has a write notice from each of them.
- The deltas sent by A and B will both include values for j.
- The deltas sent by B and C will both include values for i.

D must decide whose update to j happened first (B's or A's) and whose update to i happened first (B's or C's); in other words, D must decide in which order to apply the three deltas to its copy of the page. D must apply these updates in vector timestamp order, so every write notice (and delta) must be tagged with a vector timestamp.
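
A sketch of applying deltas in vector timestamp order (names are mine): vt_before implements happens-before on CVTs, and apply_in_vt_order applies deltas in a topological order of that partial order; concurrent deltas are nonconflicting by assumption, so their relative order does not matter.

```c
#include <stdbool.h>
#include <stddef.h>

#define N 16  /* number of nodes; illustrative constant */

typedef struct { unsigned v[N]; } cvt_t;

/* Happens-before on vector timestamps: a < b iff a <= b componentwise
 * and a != b. Concurrent pairs compare false in both directions. */
static bool vt_before(const cvt_t *a, const cvt_t *b)
{
    bool strictly_less = false;
    for (int i = 0; i < N; i++) {
        if (a->v[i] > b->v[i]) return false;
        if (a->v[i] < b->v[i]) strictly_less = true;
    }
    return strictly_less;
}

typedef struct { cvt_t ts; /* plus the delta's byte runs */ } delta_t;

/* Apply deltas in vector timestamp order: repeatedly pick a delta not
 * preceded by any remaining delta (a topological order of the partial
 * order). O(n^2), fine for the handful of deltas per page. */
void apply_in_vt_order(delta_t *d, size_t n, void (*apply)(const delta_t *))
{
    bool done[n];  /* C99 VLA; n is small */
    for (size_t i = 0; i < n; i++) done[i] = false;
    for (size_t applied = 0; applied < n; applied++) {
        for (size_t i = 0; i < n; i++) {
            if (done[i]) continue;
            bool minimal = true;
            for (size_t j = 0; j < n; j++)
                if (!done[j] && j != i && vt_before(&d[j].ts, &d[i].ts)) {
                    minimal = false;
                    break;
                }
            if (minimal) { apply(&d[i]); done[i] = true; break; }
        }
    }
}
```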

Page-Based DSM: Pros and Cons

Good things:
- low cost; can use commodity parts
- flexible protocol (software)
- allocate/replicate in main memory

Bad things:
- access control granularity: false sharing, and complex protocols to deal with it
- page fault overhead