Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16


Shared memory: Caches, Cache coherence and Memory consistency models
Diego Fabregat-Traver and Prof. Paolo Bientinesi
HPAC, RWTH Aachen, fabregat@aices.rwth-aachen.de
WS15/16

References
- Computer Organization and Design. David A. Patterson, John L. Hennessy. Chapter 5.
- Computer Architecture: A Quantitative Approach. John L. Hennessy, David A. Patterson. Appendix 4.
- Only if interested in much more detail on cache coherence and memory consistency: A Primer on Memory Consistency and Cache Coherence. Daniel J. Sorin, Mark D. Hill, David A. Wood.
Diego Fabregat Shared memory 2 / 34

Outline
1. Recap: Processor and Memory hierarchy
2. Cache memories
3. Shared-memory: cache coherence
4. Shared-memory: memory consistency models

The basic picture
Programmers' dream: an unlimited amount of fast memory.
[Diagram: CPU connected to the memory system]

Speed divergence
Gap in performance between CPU and main memory.
[Figure: diverging processor and memory performance curves, from Computer Organization and Design, Patterson & Hennessy]


Memory hierarchy
Illusion of large and fast memory.
Underlying idea: the principle of locality.
- Temporal locality: if an item is referenced, it will tend to be referenced again soon (e.g., instructions and data in a loop).
- Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon (e.g., arrays).
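Both kinds of locality appear in even the simplest loops. The C sketch below is illustrative (the function and its names are not from the slides): the loop variables exhibit temporal locality, the array traversal exhibits spatial locality.

```c
#include <stddef.h>

/* Sums an array. The loop variables `i` and `sum` show temporal
 * locality (reused on every iteration), while the sequential accesses
 * a[0], a[1], ... show spatial locality: once the block containing
 * a[i] is in cache, the next few elements are cache hits. */
double sum_array(const double *a, size_t n) {
    double sum = 0.0;                 /* temporal locality: reused each pass */
    for (size_t i = 0; i < n; i++)    /* loop instructions are re-fetched too */
        sum += a[i];                  /* spatial locality: consecutive addresses */
    return sum;
}
```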

Outline
1. Recap: Processor and Memory hierarchy
2. Cache memories
3. Shared-memory: cache coherence
4. Shared-memory: memory consistency models

Cache memories
- The cache hierarchy may consist of multiple levels.
- Data is copied between two adjacent levels.
- Cache block (or line): the unit of data transfer.
- A cache consists of multiple such blocks.

Cache memories (II)
- Cache hit: the data requested by the processor is present in some block of the upper level of cache.
- Cache miss: the data requested by the processor is not present in any block of the upper level of cache.
On a cache miss, the lower levels of the hierarchy are accessed to retrieve the block containing the requested data. The data is then supplied from the upper level.


Line placement
How do we know whether a line is in the cache? Where do we look for it?
- Direct-mapped: each block can go in exactly one place in the cache.
- Fully associative: each block can go anywhere in the cache.
- Set associative: each block can go anywhere within a restricted set of entries in the cache.


Line placement: Direct-mapped
Mapping: (Block address) modulo (Number of blocks in the cache)
How do we know whether cache block 001 contains MEM[00001], MEM[01001], MEM[10001], ...?

Line placement: Direct-mapped
- Index: selects the cache block.
- Tag: compared against the upper bits of the memory address to identify the block.
- Valid bit: indicates whether the block contains valid data.
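The tag/index/offset split can be sketched in C. The geometry below (64-byte blocks, 256 blocks, 32-bit addresses) is an assumed example, not one from the slides:

```c
#include <stdint.h>

/* Hypothetical direct-mapped cache geometry: 64-byte blocks and 256
 * blocks, so a 32-bit address splits as  tag | index | offset. */
#define OFFSET_BITS 6   /* log2(64 bytes per block)   */
#define INDEX_BITS  8   /* log2(256 blocks in cache)  */

/* Byte position within the block. */
static uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}

/* Which cache block to look in: (block address) mod (number of blocks). */
static uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

/* Remaining upper bits, stored with the block and compared on access. */
static uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

The index answers "where do we look?"; the stored tag plus the valid bit answer "is it the line we want?".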

Line placement
- Direct-mapped: simple scheme, a single tag comparison, high number of misses.
- Fully associative: complex tag comparison (all tags), low number of misses.
- n-way set associative: a compromise; n tag comparisons, fewer misses than direct-mapped.

Line replacement policies
- Random: all lines are candidates to be evicted (replaced).
- Least recently used (LRU): replace the block that has been unused for the longest time. Fine for 2-way, maybe 4-way; too hard beyond that.
- Approximated LRU: hierarchical LRU. Example for 4-way: one bit identifies the least recently used pair of blocks; a second bit identifies the LRU block within that pair.

Writing data
Write-through
- On a hit, the processor could simply update only the block in the cache. Inconsistency problem: cache and memory hold different values.
- Simplest solution: write-through, i.e., update both cache and memory.
- On a miss: fetch the block into the cache, update it, and write to memory.
- Problem: low performance. A write may take at least 100 clock cycles, slowing down the processor. Can be alleviated with write buffers.
Write-back
- On a hit, update only the block in the cache. Keep track of whether each block is dirty.
- When a dirty block is replaced, write it to the lower level of the hierarchy.
- On a miss: typically, also fetch the block into the cache and update it.
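The write-back policy can be sketched as a toy model of a single cache block. Everything here (the one-block cache, the `memory` array standing in for the lower level, the write-back counter) is an illustrative assumption, not code from the slides:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one write-back cache block: writes hit the cache and
 * set the dirty bit; the lower level is updated only on eviction. */
typedef struct { bool valid, dirty; uint32_t tag, data; } block_t;

static uint32_t memory[1024];   /* stand-in for the lower level      */
static int writebacks = 0;      /* counts actual writes to memory    */

static void cache_write(block_t *b, uint32_t tag, uint32_t value) {
    b->valid = true;
    b->tag   = tag;
    b->data  = value;
    b->dirty = true;            /* defer the memory update */
}

static void cache_evict(block_t *b) {
    if (b->valid && b->dirty) { /* write back only if modified */
        memory[b->tag] = b->data;
        writebacks++;
    }
    b->valid = b->dirty = false;
}
```

Two writes to the same block cost a single memory update at eviction time, which is exactly the traffic saving over write-through.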

Outline
1. Recap: Processor and Memory hierarchy
2. Cache memories
3. Shared-memory: cache coherence
4. Shared-memory: memory consistency models

Shared-memory parallel architectures

Shared-memory parallel architectures
Not every memory is shared!
[Diagram: CPU 0 and CPU 1, each with a private L1 cache, both connected to a shared L2 cache / main memory]


Cache coherence
Each processor views memory through its own cache.
Cache coherence problem: two different processors can have two different values for the same location.
Coherence: which value will be returned by a read?

Coherence definition
Idea: cache coherence aims at making the caches of a shared-memory system as functionally invisible as the caches of a single-core system.
Properties:
1. Processor P writes to location X, then P reads from X (with no other processor writing to X in between). The read must return the value written by P. Preserves program order.
2. A read by processor P1 from X after a write by processor P2 to X returns the value written by P2 if no other writes to X occur between the two accesses. Defines the notion of a coherent view of memory.
3. Two writes to the same location by any two processors are seen in the same order by all processors. Prevents two processors from ending up with two different values of X. Write serialization.

Cache coherence protocols
Snooping protocols
- Decentralized approach.
- Each cache line holding a copy of a memory location has a label indicating the sharing state of the line.
- All caches are connected via a broadcast medium (e.g., a bus).
- Cache controllers snoop (listen) on the medium to determine whether they hold a copy of a line that is requested.
Directory-based protocols
- The state of a block of memory is kept in one (centralized) location.
- More scalable.

Snooping protocol
Cache controllers (C.C.) are connected via a bus, which they monitor (snoop).
[Diagram: CPU 0 to CPU 3, each with a cache and cache controller, attached to a shared bus to main memory]

Snooping protocol
- Three states: exclusive (read/write), shared (read-only), invalid.
- On a write, all other cached copies are invalidated.
- Guarantees exclusive access.

Snooping protocol
Finite-state machine controller.

Note 1: True and false sharing
- True sharing: a cache miss occurs because a block was invalidated by another processor writing the same word. Necessary to keep coherence.
- False sharing: a cache miss occurs because a block was invalidated by another processor writing a different word. Would not occur if the block size were one word.
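False sharing is usually avoided by padding, so that independently updated variables land on different cache lines. The sketch below assumes a 64-byte line size (common, but machine-dependent); the struct names are illustrative:

```c
#include <stddef.h>

/* Two counters updated by two different threads. In the first layout
 * they almost certainly share one cache line, so each write by one
 * thread invalidates the other's copy (false sharing). */
struct counters_shared {
    long a;   /* written by thread 0 */
    long b;   /* written by thread 1, same line as a */
};

/* Padding pushes b onto the next (assumed 64-byte) cache line, so the
 * two threads no longer invalidate each other's blocks. */
struct counters_padded {
    long a;                         /* written by thread 0          */
    char pad[64 - sizeof(long)];    /* fill the rest of a's line    */
    long b;                         /* written by thread 1, own line */
};
```

The cost is memory: one (mostly empty) line per hot variable, traded for the invalidation traffic the slide describes.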

Note 2: Data races and synchronization
Beware: cache coherence does not prevent race conditions.
CPU 0: t1 = load(x); t1 = t1 + 1; x = store(t1)
CPU 1: t2 = load(x); t2 = t2 + 1; x = store(t2)
(Both loads may read the old value of x, so one increment is lost.)
We need critical sections to ensure mutual exclusion.
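The standard fix is to wrap the load-increment-store sequence in a critical section. A minimal sketch using a POSIX mutex (variable names and iteration count are assumptions, not slide material):

```c
#include <pthread.h>

/* Shared counter and the lock protecting it. */
static long x = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments x many times. Without the mutex, two threads
 * could both load the same value of x and one update would be lost. */
static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        x = x + 1;                    /* load, add, store: now indivisible */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}
```

Coherence only guarantees that the stores to x are seen in some single order by everyone; mutual exclusion is what makes the read-modify-write atomic.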

Outline
1. Recap: Processor and Memory hierarchy
2. Cache memories
3. Shared-memory: cache coherence
4. Shared-memory: memory consistency models

Memory Consistency Models
Coherence guarantees that (i) a write will eventually be seen by other processors, and (ii) all processors see writes to the same location in the same order.
A consistency model defines the allowed orderings of reads and writes to different memory locations.
The hardware guarantees a certain consistency model, and the programmer attempts to write correct programs under those assumptions.

Memory Consistency Models: Sequential consistency
- Within a thread of execution, ordering is preserved.
- Instructions from different threads may interleave arbitrarily.
Example:
CPU 0: instr a, instr b, instr c, instr d
CPU 1: instr A, instr B, instr C, instr D
Valid executions: abcdABCD, aAbBcCdD, abAcBCdD, ...



Memory Consistency Models: Sequential consistency
Why is preserving the ordering within a thread so important? Consider the following code:

CPU 0                    CPU 1
S1: data = NEW;          L1: r1 = flag;
S2: flag = SET;          B1: if (r1 != SET) goto L1;
                         L2: r2 = data;

Assume: initially, data = OLD and flag = UNSET.
As a result, should r2 always be set to NEW? What if the execution order is S2, L1, L2, S1?

Memory Consistency Models: Sequential consistency
- Very intuitive; easy for programmers.
- Problems: limited room for optimization, slower execution.

Memory Consistency Models: Relaxed consistency models
- We, as programmers, care about ordering only in a few cases (e.g., synchronization).
- Relax the ordering constraints in general; use fences (barriers) to complete all previous memory accesses before moving on (as sequential consistency would).
- Programmers are responsible for setting these fences or synchronization points.
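The flag/data example from the sequential-consistency slides can be sketched with C11 atomics, where the release store and acquire load play the role of the fences described: if the consumer observes flag == SET, it is guaranteed to also observe data == NEW. The variable names and encodings (NEW = 1, SET = 1) are assumptions for illustration:

```c
#include <stdatomic.h>

static int data_v = 0;            /* OLD = 0, NEW = 1   */
static atomic_int flag_v = 0;     /* UNSET = 0, SET = 1 */

/* S1, S2: the release store orders the data write before the flag write. */
static void producer(void) {
    data_v = 1;                                               /* S1 */
    atomic_store_explicit(&flag_v, 1, memory_order_release);  /* S2 */
}

/* L1/B1, L2: the acquire load orders the flag read before the data read. */
static int consumer(void) {
    while (atomic_load_explicit(&flag_v, memory_order_acquire) != 1)
        ;                         /* spin until the flag is set */
    return data_v;                /* guaranteed to see NEW */
}
```

Without the release/acquire pair (e.g., with memory_order_relaxed), a relaxed hardware model is free to make the consumer read the old data, which is exactly the S2, L1, L2, S1 hazard above.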

Summary
- Memory hierarchy: the illusion of large and fast memory.
- Cache coherence: the behavior of the memory system when a single memory location is accessed by multiple threads.
- Coherence protocols: snooping, directory-based.
- Memory consistency model: defines the allowed orderings of reads and writes to different memory locations.
- Most multicore processors implement some kind of relaxed model.
- Most programs rely on predefined synchronization primitives.