Multiprocessors. Loosely coupled [multi-computer]: each CPU has its own memory, I/O facilities and OS; CPUs DO NOT share physical memory

Loosely coupled [multi-computer]: each CPU has its own memory, I/O facilities and OS; CPUs DO NOT share physical memory. Example: the IITAC Cluster [in the Lloyd building], a distributed system: 346 x IBM e326 compute nodes, each with 2 x 2.4GHz 64-bit AMD Opteron 250 CPUs, 4GB RAM, an 80GB SATA scratch disk, 2 x Broadcom BCM5704 Gbit Ethernet and a PCI-X 10Gb InfiniBand connection; interconnected by 2 x Voltaire 288-port InfiniBand switches plus a 512-port Gbit switch; running Debian [Linux]; theoretical peak performance 3.4 TFlops [Linpack 2.7 TFlops]; ranked the 345th most powerful supercomputer [in 2006].

Tightly coupled [multiprocessor]: CPUs physically share memory and I/O; cache coherency can be handled by the CPUs; inter-processor communication is via shared memory; symmetric multiprocessing is possible [i.e. a single shared copy of the operating system]; critical sections are protected by locks/semaphores; processes can easily migrate between CPUs.

Multiprocessor Cache Coherency: what's the problem? We must guarantee that a CPU always reads the most up-to-date value of a memory location. 1) CPU A reads location X; a copy of X is stored in cache A. 2) CPU B reads location X; a copy of X is stored in cache B. 3) CPU A writes to location X; X in cache A and main memory is updated [write-through]. 4) CPU B reads X, BUT gets an out-of-date [stale] copy from its cache B.
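A minimal C sketch (my own illustration, not part of the lecture; the cache structure and names are assumptions) of the four steps above, with two private write-through caches and no coherency mechanism, so CPU B ends up reading a stale copy of X:

```c
#include <stdio.h>

static int memory_X = 0;                    /* main memory copy of X      */

typedef struct { int value; int valid; } CacheLine;

static int cache_read(CacheLine *c) {
    if (!c->valid) {                        /* miss: fetch from memory    */
        c->value = memory_X;
        c->valid = 1;
    }
    return c->value;                        /* hit: value may be stale    */
}

static void cache_write_through(CacheLine *c, int v) {
    c->value = v;                           /* update the local cache     */
    memory_X = v;                           /* write through to memory    */
    /* no invalidate is observed by the other cache                       */
}

int main(void) {
    CacheLine cacheA = {0, 0}, cacheB = {0, 0};

    cache_read(&cacheA);                    /* 1) CPU A reads X           */
    cache_read(&cacheB);                    /* 2) CPU B reads X           */
    cache_write_through(&cacheA, 42);       /* 3) CPU A writes X = 42     */

    /* 4) CPU B reads X again: hit in cache B, stale value 0              */
    printf("memory = %d, CPU B sees %d\n", memory_X, cache_read(&cacheB));
    return 0;
}
```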

Multiprocessor Cache Coherency: solutions are based on bus watching or snoopy caches [yet again]. Snoopy caches watch all bus transactions and update their internal state according to a pre-determined cache coherency protocol. Cache coherency protocols are given names, e.g. Dragon, Firefly, MESI, Berkeley, Illinois, Write-Once and Write-Through [the Write-Through, Write-Once, Firefly and MESI protocols are discussed in what follows].

Write-Through Protocol key ideas: uses write-through caches. When a cache observes a bus write to a memory location it has a copy of, it simply invalidates the cache line [a write-invalidate protocol]. The next time the CPU accesses the same memory location/cache line, the data will be fetched from memory [hence getting the most up-to-date value].

Write-Through State Transition Diagram: each cache line is in either the VALID or INVALID state [if an address is not in the cache, it is captured by the INVALID state]. In the diagram, PW is a processor write, which is written through to memory. Bus traffic is proportional to the number of writes [writes are roughly 20% of memory accesses]. The protocol is straightforward to implement and effective for small-scale parallelism [< 6 CPUs].
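A minimal sketch, assuming simple per-line state logic and made-up names (including a write-allocate policy the slides do not specify), of how the two-state Write-Through protocol reacts to processor accesses and snooped bus writes:

```c
/* Each cache line is VALID or INVALID; a snooped bus write to a
 * matching line invalidates it (write-invalidate). */
typedef enum { INVALID, VALID } WTState;

typedef struct { unsigned tag; WTState state; } WTLine;

/* Local processor read: fetch from memory on a miss. */
void wt_processor_read(WTLine *line, unsigned tag) {
    if (line->state == INVALID || line->tag != tag)
        line->tag = tag;          /* bus read: line fetched from memory */
    line->state = VALID;
}

/* Local processor write: update the line and write through to memory
 * (write-allocate assumed here for simplicity). */
void wt_processor_write(WTLine *line, unsigned tag) {
    line->tag = tag;              /* bus write: memory is updated       */
    line->state = VALID;
}

/* Snooped bus write from another cache: invalidate our copy. */
void wt_snoop_bus_write(WTLine *line, unsigned tag) {
    if (line->state == VALID && line->tag == tag)
        line->state = INVALID;    /* write-invalidate                   */
}
```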

Write-Once Protocol: uses write-back caches [to reduce bus traffic] BUT still uses a write-invalidate protocol. Each cache line can be in one of 4 states:
INVALID
VALID: cache line in one or more caches, cache copies consistent with memory
RESERVED: cache line in one cache ONLY, cache copy consistent with memory
DIRTY: cache line in one cache ONLY and it is the ONLY up-to-date copy [memory copy out of date / stale]

Write-Once State Transition Diagram

Write-Once Protocol... key ideas: write-through to VALID cache lines; write back (local write) to RESERVED/DIRTY cache lines [as we know the cache line is in one cache ONLY]. When a memory location is read initially, it enters the cache in the VALID state. A cache line in the VALID state changes to the RESERVED state when written for the first time [hence the name write-once]. A write which causes a transition to the RESERVED state is written through to memory so that all other caches can observe the transaction and invalidate their copies if present [the cache line is now in one cache ONLY].

Write-Once Protocol... subsequent writes to a RESERVED cache line use write-deferred cycles and the cache line is marked DIRTY, as it is now the only up-to-date copy in the system [the memory copy is out of date]. Caches must monitor the bus for any reads or writes to their RESERVED and DIRTY cache lines. If a cache observes a read from one of its RESERVED cache lines, it simply changes its state to VALID. If a cache observes a read from one of its DIRTY cache lines, the cache must intervene [intervention cycle] and supply the data onto the bus to the requesting cache; the data is also written to memory and the state of both cache lines is changed to VALID. NB: behaviour on an observed write [DIRTY => INVALID].
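A minimal sketch of the Write-Once per-line state machine described above (the names are mine; the handling of a write to an INVALID line is an assumption, since the slides only cover writes to lines already present):

```c
typedef enum { WO_INVALID, WO_VALID, WO_RESERVED, WO_DIRTY } WOState;

/* Local processor read. */
WOState wo_processor_read(WOState s) {
    if (s == WO_INVALID) return WO_VALID;   /* bus read, fetch from memory */
    return s;                               /* hit: no state change        */
}

/* Local processor write. */
WOState wo_processor_write(WOState s) {
    switch (s) {
    case WO_INVALID:                        /* assumed: fetch, then treat as
                                               the first write              */
    case WO_VALID:    return WO_RESERVED;   /* first write: write-through,
                                               other copies invalidated     */
    case WO_RESERVED: return WO_DIRTY;      /* deferred (local) write       */
    default:          return WO_DIRTY;      /* further local writes         */
    }
}

/* Another cache reads this line on the bus; *intervene is set when
 * we must supply the data and update memory (we hold the DIRTY copy). */
WOState wo_snoop_read(WOState s, int *intervene) {
    *intervene = (s == WO_DIRTY);
    if (s == WO_RESERVED || s == WO_DIRTY) return WO_VALID;
    return s;
}

/* Another cache writes this line on the bus: write-invalidate. */
WOState wo_snoop_write(WOState s) {
    (void)s;
    return WO_INVALID;
}
```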

Write-Once Protocol... try the Vivio cache coherency animation

Firefly Protocol: used in the experimental Firefly DEC workstation [DEC = Digital Equipment Corporation]. Each cache line can be in one of 4 states:
~Shared & ~Dirty [Exclusive & Clean]
~Shared & Dirty [Exclusive & Dirty]
Shared & ~Dirty [Shared & Clean]
Shared & Dirty [Shared & Dirty]
NB: there is NO INVALID state.

Firefly Protocol key ideas: a cache knows if its cache lines are shared with other caches [a line may not actually be shared, but that is OK]. When a cache line is read into the cache, the other caches assert a common SHARED bus line if they contain the same cache line. Writes to exclusive cache lines are write-deferred. Writes to shared cache lines are write-through, and the other caches which contain the same cache line are updated together with memory [a write-update protocol]. When a cache line ceases to be shared, one extra write-through cycle is needed to discover that the cache line is no longer shared [the SHARED signal will not be asserted]. Sharing may be illusory, e.g. (i) a processor may no longer be using a shared cache line, (ii) process migration between CPUs [a process sharing with an old copy of itself].
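A minimal sketch, under assumed names and simplified state keeping, of the Firefly write decision: a write to a shared line is written through (updating memory and any sharing caches) and re-samples the SHARED bus line, while a write to an exclusive line is deferred locally:

```c
/* No INVALID state: a line is always valid once loaded. */
typedef struct { int shared; int dirty; } FireflyLine;

/* Processor write to a line already present in the cache.
 * shared_line_asserted models the common SHARED bus wire sampled
 * during the write-through bus cycle (passed as a parameter here). */
void firefly_processor_write(FireflyLine *line, int shared_line_asserted) {
    if (line->shared) {
        /* write-through [write-update]: memory and any sharing caches
         * are updated; the SHARED wire tells us whether the line is
         * still shared, at the cost of this extra bus cycle.          */
        line->shared = shared_line_asserted;
        line->dirty  = 0;             /* memory is now up to date      */
    } else {
        /* exclusive: deferred local write, memory copy becomes stale  */
        line->dirty = 1;
    }
}
```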

Firefly State Transition Diagram

Firefly Protocol... try the Vivio cache coherency animation

Firefly Protocol: as there is NO INVALID state, at reset the cache is placed in a miss mode and a bootstrap program fills the cache with a sequence of addresses, making it consistent with memory [VALID state]. During normal operation a location can be displaced if the cache line is needed to satisfy a miss, BUT the protocol never needs to invalidate a cache line. How is the Shared & Dirty state entered? There is asymmetrical behaviour with regard to updating memory [this avoids changing a memory cycle from a read cycle to a write cycle mid-cycle]. Try the Vivio animation.

MESI Cache Coherency Protocol: used by Intel CPUs. Very similar to Write-Once, BUT uses a shared bus signal [like Firefly] so a cache line can enter the cache and be placed directly in the Shared or Exclusive state; a write-invalidate protocol. MESI states compared with Write-Once states:
Modified = dirty
Exclusive = reserved
Shared = valid
Invalid = invalid
What is the difference between the MESI and Write-Once protocols? A cache line may enter the cache and be placed directly in the Exclusive [reserved] state, so the "write-once" write-through cycles are no longer necessary if the cache line is Exclusive. Try the Vivio animation.
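A minimal sketch of per-line MESI transitions (the names are mine; treating a write miss as a read-for-ownership that also invalidates other copies is an assumption not spelled out on the slide):

```c
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } MESIState;

/* Local processor read; shared_line is the SHARED bus signal sampled
 * on the bus read that services a miss. */
MESIState mesi_processor_read(MESIState s, int shared_line) {
    if (s == MESI_I)                        /* miss: bus read               */
        return shared_line ? MESI_S : MESI_E;
    return s;                               /* hit: no state change         */
}

/* Local processor write; *bus_invalidate is set when other copies must
 * be invalidated over the bus (Exclusive/Modified lines upgrade silently,
 * with no write-through cycle). */
MESIState mesi_processor_write(MESIState s, int *bus_invalidate) {
    *bus_invalidate = (s == MESI_S || s == MESI_I);
    return MESI_M;                          /* now the only up-to-date copy */
}

/* Another cache reads this line on the bus; *supply_data is set when we
 * must intervene and supply the dirty data. */
MESIState mesi_snoop_read(MESIState s, int *supply_data) {
    *supply_data = (s == MESI_M);
    if (s == MESI_M || s == MESI_E) return MESI_S;
    return s;
}

/* Another cache writes/invalidates this line on the bus. */
MESIState mesi_snoop_write(MESIState s) {
    (void)s;
    return MESI_I;
}
```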

MESI State Transition Diagram

MESI Protocol... try the Vivio cache coherency animation

What's next? [Diagram: physical memory view of a system of several multi-core CPUs; each CPU has per-core caches and its own local memory divided into 4K pages, and the CPUs are connected by HyperTransport (AMD) or QuickPath (Intel) point-to-point links.]

What's next? A network of multiprocessors: the shared bus is removed, as it is a bottleneck. Physical memory is interleaved between the CPUs; given a physical address, it is straightforward to determine to which CPU the memory is attached. High-speed point-to-point links connect the CPUs [Intel QuickPath and AMD HyperTransport]. If a CPU accesses non-local memory, the request is sent via the high-speed point-to-point links to the correct CPU, which accesses its local memory and returns the data. The system can be viewed as a large shared memory multiprocessor; the point-to-point protocol supports cache coherency [Intel MESIF and AMD MOESI].
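A minimal sketch, assuming round-robin 4K-page interleaving across four CPUs (the granularity and CPU count are illustrative, not taken from the slides), of how a physical address determines the home CPU and whether an access is local or travels over the point-to-point links:

```c
#include <stdint.h>

#define NUM_CPUS   4u
#define PAGE_SHIFT 12u                      /* 4K pages (assumed)          */

/* Which CPU's local memory holds this physical address? */
static unsigned home_cpu(uint64_t paddr) {
    return (unsigned)((paddr >> PAGE_SHIFT) % NUM_CPUS);
}

/* A memory access is local if this CPU is the home CPU; otherwise the
 * request travels over the point-to-point link (QuickPath/HyperTransport)
 * to the home CPU, which accesses its local memory and returns the data. */
static int is_local_access(unsigned this_cpu, uint64_t paddr) {
    return home_cpu(paddr) == this_cpu;
}
```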

Summary: you are now able to explain the cache coherency problem; describe the operation of the Write-Through, Write-Once, Firefly and MESI cache coherency protocols; predict the cache behaviour and bus traffic for a sequence of CPU "memory" accesses in a multiprocessor with a cache-coherent bus; and evaluate the cost of updating shared data.