Shared Symmetric Memory Systems


Computer Architecture
J. Daniel García Sánchez (coordinator), David Expósito Singh, Francisco Javier García Blas
ARCOS Group, Computer Science and Engineering Department, University Carlos III of Madrid
http://www.arcos.inf.uc3m.es

Outline
1 Introduction to multiprocessor architectures
2 Centralized shared memory architectures
3 Cache coherence alternatives
4 Snooping protocols
5 Performance in SMPs
6 Conclusion

Introduction to multiprocessor architectures

Increasing importance of multiprocessors
- Silicon and energy efficiency decrease as more ILP is exploited: the cost of silicon and energy grows faster than performance.
- Increasing interest in high-performance servers: cloud computing, software as a service, ...
- Growth of data-intensive applications: huge amounts of data on the Internet, big data analytics.

TLP: Thread-level parallelism
- TLP implies the existence of multiple program counters and assumes the MIMD model.
- Generalized use of TLP outside scientific computing is relatively recent.
- New application domains: embedded applications, desktop, high-end servers.

Multiprocessors
A multiprocessor is a computer consisting of tightly coupled processors whose:
- Coordination and use are typically controlled by a single operating system.
- Memory is shared through a single shared address space.
Software models:
- Parallel processing: a coupled set of cooperating threads.
- Request processing: independent processes originated by user requests.
- Multiprogramming: independent execution of multiple applications.

Most common approach:
- From 2 to tens of processors.
- Shared memory: implies a shared address space, but not necessarily a single physical memory.
Alternatives:
- CMP (Chip Multi-Processor), or multi-core.
- Multiple chips, each of which may (or may not) be multi-core.
- Multicomputer: weakly coupled processors that do not share memory; used in large-scale scientific computing.

Maximizing exploitation of multiprocessors:
- With n processors, at least n processes or threads are needed.
- Thread identification:
  - Explicitly identified by the programmer.
  - Created by the operating system from requests.
  - Loop iterations parallelized by the compiler (e.g. OpenMP; see the sketch below).
- Identification is performed at a high level by the programmer or system software, and each thread must have enough instructions to execute to be worthwhile.
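As an illustration of threads generated from loop iterations, here is a minimal OpenMP sketch in C. The array names and the iteration count are made up for the example; scheduling across the available processors is left to the runtime.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];

        /* Each iteration is independent, so the OpenMP runtime can split
           the loop across as many threads as there are processors. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];
        }

        printf("done with up to %d threads\n", omp_get_max_threads());
        return 0;
    }

Compiled with OpenMP support (e.g. -fopenmp), each loop chunk becomes work for one thread, matching the "loop iterations generated by a parallel compiler" case above.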

Multiprocessors and shared memory
SMP: Symmetric Multi-Processor
- Centralized shared memory: all processors share a single centralized memory with equal access time.
- All multi-cores are SMPs.
- UMA: Uniform Memory Access. Memory latency is uniform.
DSM: Distributed Shared Memory
- Memory is distributed across the processors; needed when the number of processors is high.
- NUMA: Non-Uniform Memory Access. Memory latency depends on data location.
- Communication happens through accesses to global variables.

SMP: Symmetric Multi-Processor
[Figure: processors P1 to P4, each with a private cache, sharing a common cache and the main memory.]

DSM: Distributed Shared Memory
[Figure: processors P1 to P4, each with its own memory and I/O, connected through an interconnection network.]

Centralized shared memory architectures

SMP and memory hierarchy
Why use a centralized memory?
- Large multi-level caches decrease the bandwidth demand on main memory.
Evolution:
1. Single-core processors with memory on a shared bus.
2. Memory connected through a separate, memory-only bus.

Cache memory
Kinds of data in cache memory:
- Private data: used by a single processor.
- Shared data: used by multiple processors.
Problem with shared data:
- A datum may be replicated in multiple caches, which decreases contention: each processor accesses its local copy.
- But if two processors modify their copies... what happens to cache coherence?

Cache coherence

Thread 1 (on P1):
    lw   $t0, dirx
    addi $t0, $t0, 1
    sw   $t0, dirx

Thread 2 (on P2):
    lw   $t0, dirx

The value stored at dirx is initially 1. Write-through caches are assumed.

Thread | Instruction       | P1 cache    | P2 cache    | Main memory
T1     | Initially         | Not present | Not present | 1
T1     | lw $t0, dirx      | 1           | Not present | 1
T1     | addi $t0, $t0, 1  | 1           | Not present | 1
T2     | lw $t0, dirx      | 1           | 1           | 1
T1     | sw $t0, dirx      | 2           | 1           | 2

After the store by T1, the write-through cache updates main memory, but P2 still holds the stale value 1 in its cache.
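The same scenario can be written in C with POSIX threads. This is only a sketch with illustrative names: on coherent hardware the reader observes either 1 or 2, but on a machine without cache coherence the reading core could keep returning its stale cached copy of x.

    #include <pthread.h>
    #include <stdio.h>

    int x = 1;                      /* the shared location "dirx" */

    void *writer(void *arg) {       /* role of Thread 1 */
        x = x + 1;                  /* load, add 1, store (unsynchronized on purpose) */
        return NULL;
    }

    void *reader(void *arg) {       /* role of Thread 2 */
        printf("reader sees x = %d\n", x);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }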

Cache incoherence
Why does incoherence happen? State duality:
- Global state: main memory.
- Local state: private caches.
A memory system is coherent if any read from a location returns the most recent value that has been written to that location.
Two aspects:
- Coherence: which value does a read return?
- Consistency: when does a read see the written value?

Conditions for coherence
Program order preservation:
- A read from processor P on location X, after a write from processor P on location X and with no intermediate writes on X by any other processor Q, always returns the value written by P.
Coherent view of memory:
- A read from processor P on location X, after a write from another processor Q on location X, returns the written value if both operations are sufficiently separated in time and there are no intermediate writes on X.
Write serialization:
- Two writes to the same memory location by two different processors are seen in the same order by all processors.

Memory consistency
Defines at which point in time a reading processor will see a written value.
Coherence and consistency are complementary:
- Coherence: behavior of reads and writes to a single memory location.
- Consistency: behavior of reads and writes with respect to accesses to other memory locations.
There are different memory consistency models. A specific lecture will be devoted to this problem.

Cache coherence alternatives

Coherent multiprocessors
A coherent multiprocessor offers:
- Migration of shared data: a datum may be moved to a local cache and used there transparently, decreasing remote access latency and the bandwidth demand on shared memory.
- Replication of shared data that is read simultaneously: a copy is made in each local cache, decreasing access latency and read contention.
Both properties are critical for performance.
Solution: a hardware protocol that maintains cache coherence.

Kinds of cache coherence protocols
Directory based:
- Sharing state is kept in a directory.
- SMP: centralized directory in memory or in the LLC (last-level cache).
- DSM: to avoid bottlenecks a distributed directory is used (more complex).
Snooping:
- Each cache keeps the sharing state of every block it stores.
- Caches are reachable through a broadcast medium (a bus).
- All caches monitor (snoop) the broadcast medium to determine whether they hold a copy of the requested block.

Snooping protocols

Coherence maintenance
Write invalidation:
- Guarantees that a processor has exclusive access to a block before writing to it.
- Invalidates any other copies that other processors might have.
Write update (write broadcast):
- Broadcasts every write to all caches so that they update their copies.
- Uses more bandwidth.
The most common strategy is invalidation.

Memory bus use
Invalidation:
- The writing processor acquires the bus and broadcasts the address to be invalidated.
- All processors snoop the bus; each one checks whether it caches the broadcast address and, if so, invalidates it.
- There cannot be two simultaneous writes: exclusive use of the bus serializes writes.
Cache misses:
- Write-through: memory always contains the last value written.
- Write-back: if a processor has a modified copy, it supplies the block to serve the other processor's miss.

Implementation
Invalidation:
- Takes advantage of the valid bit (V) associated with each block.
Writes:
- Need to know whether other caches hold copies; if there are no other copies, no broadcast is needed.
- A shared bit (S) is added to each block.
- On a write: a bus invalidation is generated and the block transitions from the shared state to the exclusive state; further writes need no new invalidations.
- On a cache miss in another processor: the block transitions from the exclusive state back to the shared state.
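A minimal sketch of this per-block bookkeeping, assuming one valid bit and one shared bit per block; the structure, the commented-out bus_invalidate call, and the function names are illustrative, not taken from any particular design.

    #include <stdbool.h>

    /* Per-block metadata kept by each cache (illustrative). */
    struct cache_block {
        unsigned tag;
        bool valid;    /* V: block holds usable data           */
        bool shared;   /* S: other caches may also hold a copy */
    };

    /* Local write: broadcast an invalidation only if the block
       may be present elsewhere (shared bit set).              */
    void local_write(struct cache_block *b) {
        if (b->shared) {
            /* bus_invalidate(b->tag);  hypothetical bus operation */
            b->shared = false;   /* now exclusive: later writes are silent */
        }
        /* the write into the data array would happen here */
    }

    /* Snooped read miss from another processor: the block
       becomes shared again.                                   */
    void snoop_read_miss(struct cache_block *b) {
        b->shared = true;
    }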

Basic protocol
Based on a state machine kept for each cache block:
- State changes are triggered by processor requests and by bus requests.
- Actions: state transitions and actions placed on the bus.
A simple approach uses three states:
- M: the block has been modified.
- S: the block is shared.
- I: the block has been invalidated.

Actions generated by the processor

Request    | State  | Action      | Description
Read hit   | S -> S | Hit         | Read data from the local cache.
Read hit   | M -> M | Hit         | Read data from the local cache.
Read miss  | I -> S | Miss        | Broadcast a read miss on the bus.
Read miss  | S -> S | Replacement | Address conflict miss: broadcast a read miss on the bus.
Read miss  | M -> S | Replacement | Address conflict miss: write back the block and broadcast a read miss.
Write hit  | M -> M | Hit         | Write data in the local cache.
Write hit  | S -> M | Coherence   | Broadcast an invalidation on the bus.
Write miss | I -> M | Miss        | Broadcast a write miss on the bus.
Write miss | S -> M | Replacement | Address conflict miss: broadcast a write miss on the bus.
Write miss | M -> M | Replacement | Address conflict miss: write back the block and broadcast a write miss.

MSI protocol: processor actions
[State diagram: the I, S and M states with the processor-triggered transitions and bus messages listed in the table above.]
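A compact way to read the table and the diagram is as a transition function. The C sketch below encodes only the hit/miss transitions of the accessed block; the enum and function names are illustrative, and replacement on address conflicts as well as the actual data movement are omitted.

    /* MSI cache-block states and the bus messages a processor may emit. */
    enum msi_state { INVALID, SHARED, MODIFIED };
    enum bus_msg   { BUS_NONE, BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE };

    /* Processor-side transition: given the current state of the block and
       whether the access is a read or a write, return the next state and
       the message that must be broadcast on the bus (if any).            */
    enum msi_state processor_access(enum msi_state s, int is_write,
                                    enum bus_msg *msg) {
        *msg = BUS_NONE;
        if (!is_write) {                            /* read                 */
            if (s == INVALID) *msg = BUS_READ_MISS; /* read miss            */
            return (s == MODIFIED) ? MODIFIED : SHARED;
        }
        if (s == INVALID)     *msg = BUS_WRITE_MISS;  /* write miss         */
        else if (s == SHARED) *msg = BUS_INVALIDATE;  /* write hit in S     */
        return MODIFIED;                              /* writes end in M    */
    }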

Actions generated by the bus

Request    | State  | Action    | Description
Read miss  | S -> S |           | Shared memory serves the miss.
Read miss  | M -> S | Coherence | Attempt to share data: place the block on the bus.
Invalidate | S -> I | Coherence | Attempt to write a shared block: invalidate the copy.
Write miss | S -> I | Coherence | Attempt to write a shared block: invalidate the copy.
Write miss | M -> I | Coherence | Attempt to write a block that is exclusive elsewhere: write back the cache block.

MSI protocol: bus actions
[State diagram: the I, S and M states with the bus-triggered transitions listed in the table above; when a modified block is requested, the holder writes the block back and the memory access is aborted.]
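The snooping side can be sketched the same way, continuing the previous sketch (same enums): each cache reacts to requests that other processors place on the bus according to the table above. Names remain illustrative, and the write-back of data is only indicated by a flag.

    /* Snooping-side transition: how a cache reacts to a request that
       another processor has placed on the bus for a block it holds.
       *writeback is set when the modified copy must be flushed/supplied. */
    enum msi_state snoop_bus(enum msi_state s, enum bus_msg msg,
                             int *writeback) {
        *writeback = 0;
        switch (msg) {
        case BUS_READ_MISS:                      /* another CPU reads      */
            if (s == MODIFIED) *writeback = 1;   /* supply the dirty block */
            return (s == INVALID) ? INVALID : SHARED;
        case BUS_WRITE_MISS:                     /* another CPU writes     */
            if (s == MODIFIED) *writeback = 1;
            return INVALID;
        case BUS_INVALIDATE:                     /* another CPU upgrades   */
            return INVALID;
        default:
            return s;
        }
    }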

MSI protocol complexities
- The protocol assumes that operations are atomic. Example: it is assumed that a miss can be detected, the bus acquired, and the response received as a single uninterrupted action.
- If operations are not atomic, deadlocks or data races become possible.
- Solution: the processor sending an invalidation keeps ownership of the bus until the invalidation has reached the rest of the processors.

Extensions to MSI
MESI:
- Adds an exclusive state (E) signaling that a block lives in a single cache but has not been modified.
- Writing an E block does not generate invalidations.
MESIF:
- Adds a forward state (F), an alternative to S that designates which node must answer each request.
- Used by the Intel Core i7.
MOESI:
- Adds an owned state (O) signaling that the copy in memory is out of date.
- Avoids writes to memory.
- Used by the AMD Opteron.

Performance in SMPs

Performance
The use of cache coherence policies has an impact on the miss rate. Coherence misses appear:
- True sharing misses: a processor writes to a shared block and invalidates the other copies; a different processor then reads the same shared data and misses.
- False sharing misses: a processor writes to a shared block and invalidates it; a different processor then reads a different word from the same block and misses, even though no data is actually shared.
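False sharing is easy to produce in C. In the sketch below (illustrative names; a 64-byte block size is assumed, which is common but not universal), the two counters in "together" fall in the same cache block, so each thread's writes keep invalidating the other thread's copy even though they never touch each other's data.

    #include <pthread.h>
    #include <stdio.h>

    /* Two counters in the same cache block: false sharing. */
    struct { long a; long b; } together;

    /* Padding each counter to its own block removes the false sharing. */
    struct { long a; char pad[56]; long b; } apart;

    void *inc_a(void *arg) {
        for (long i = 0; i < 10000000; i++) together.a++;
        return NULL;
    }
    void *inc_b(void *arg) {
        for (long i = 0; i < 10000000; i++) together.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, inc_a, NULL);
        pthread_create(&t2, NULL, inc_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", together.a, together.b);
        return 0;
    }

Switching the two threads to update apart.a and apart.b instead typically removes the coherence traffic, since each counter then lives in its own block.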

Conclusion

Summary
- A multiprocessor is a computer with multiple tightly coupled processors that share coordination, use, and memory.
- Multiprocessors are classified into SMP (Symmetric Multi-Processor) and DSM (Distributed Shared Memory).
- Two aspects to consider in the memory hierarchy: coherence and consistency.
- Two alternatives for cache coherence: directory-based and snooping protocols.
- Snooping protocols do not require a centralized structure, but they generate more bus traffic.

References
Computer Architecture: A Quantitative Approach, 5th ed. Hennessy and Patterson. Sections 5.1, 5.2, 5.3.
Recommended exercises: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6.
