Lecture 24: Virtual Memory, Multiprocessors

Today's topics:
- Virtual memory
- Multiprocessors, cache coherence

Virtual Memory
- Processes deal with virtual memory: they have the illusion that a very large address space is available to them
- There is only a limited amount of physical memory that is shared by all processes: a process places part of its virtual memory in this physical memory and the rest is stored on disk (called swap space)
- Thanks to locality, disk access is likely to be uncommon
- The hardware ensures that one process cannot access the memory of a different process

Virtual Memory (figure)

Address Translation
- The virtual and physical address spaces are broken up into pages; here the page size is 8 KB
- Virtual address = virtual page number + 13-bit page offset
- The virtual page number is translated to a physical page number; the page offset is unchanged
- Physical address = physical page number + 13-bit page offset
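
The split is just bit arithmetic, as in this minimal C sketch. The 8 KB page size is from the slide; the flat 64-entry page table and the mapping to physical page 7 are hypothetical stand-ins (real page tables are hierarchical and live in memory, as the next slides note).

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 13                      /* 8 KB pages */
#define PAGE_SIZE        (1u << PAGE_OFFSET_BITS)

int main(void) {
    uint64_t vaddr  = 0x2ABCD;                   /* an arbitrary virtual address */
    uint64_t vpn    = vaddr >> PAGE_OFFSET_BITS; /* virtual page number */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* page offset: unchanged by translation */

    uint64_t page_table[64] = {0};               /* hypothetical tiny flat page table */
    page_table[vpn] = 7;                         /* pretend this v.page maps to phys. page 7 */

    uint64_t paddr = (page_table[vpn] << PAGE_OFFSET_BITS) | offset;
    printf("vpn=%llu offset=%llu -> paddr=0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset,
           (unsigned long long)paddr);
    return 0;
}
```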

Memory Hierarchy Properties
- A virtual memory page can be placed anywhere in physical memory (fully associative)
- Replacement is usually LRU (since the miss penalty is huge, we can invest some effort to minimize misses)
- A page table (indexed by virtual page number) is used for translating virtual to physical page numbers
- The page table is itself in memory

TLB
- Since the number of pages is very high, the page table capacity is too large to fit on chip
- A translation lookaside buffer (TLB) caches the virtual-to-physical page number translation for recent accesses
- A TLB miss requires us to access the page table, which may not even be found in the cache: two expensive memory look-ups to access one word of data!
- A large page size can increase the coverage of the TLB and reduce the capacity of the page table, but also increases memory waste
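
A TLB behaves like a small cache over the page table. Below is a minimal sketch; the 64-entry, direct-mapped organization and the field layout are illustrative assumptions, not a real design.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct { uint64_t vpn, ppn; bool valid; } TLBEntry;
static TLBEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the caller must walk the page
 * table in memory -- the expensive look-ups described above. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn) {
    TLBEntry *e = &tlb[vpn % TLB_ENTRIES];       /* direct-mapped index */
    if (e->valid && e->vpn == vpn) { *ppn = e->ppn; return true; }
    return false;
}

/* Install a translation after a page-table walk. */
void tlb_fill(uint64_t vpn, uint64_t ppn) {
    TLBEntry *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn = vpn; e->ppn = ppn; e->valid = true;
}

int main(void) {
    uint64_t ppn;
    tlb_fill(21, 7);
    return tlb_lookup(21, &ppn) ? 0 : 1;         /* hit: translation was cached */
}
```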

TLB and Cache
- Is the cache indexed with the virtual or physical address?
  - To index with a physical address, we will have to first look up the TLB, then the cache: longer access time
  - Multiple virtual addresses can map to the same physical address: we must ensure that these different virtual addresses map to the same location in the cache, else there will be two different copies of the same physical memory word
- Does the tag array store virtual or physical addresses?
  - Since multiple virtual addresses can map to the same physical address, a virtual tag comparison can flag a miss even if the correct physical memory word is present

Cache and TLB Pipeline
(figure: a virtually indexed, physically tagged cache — the virtual page number goes through the TLB to produce the physical page number, while the virtual index bits from the offset access the tag and data arrays in parallel; the physical tag comparison then determines hit or miss)
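
One consequence of indexing with virtual bits while tagging with physical bits: the index and block-offset bits must fall entirely within the page offset, otherwise the index could change under translation (the same-location requirement from the previous slide). A small worked calculation, using the slide's 8 KB pages and a hypothetical 4-way cache with 64 B blocks:

```c
#include <stdio.h>

int main(void) {
    unsigned page_size  = 8 * 1024;  /* 8 KB pages -> 13 offset bits (from the slides) */
    unsigned block_size = 64;        /* hypothetical block size */
    unsigned assoc      = 4;         /* hypothetical associativity */

    /* Largest VIPT cache whose index bits come entirely from the page
     * offset: each way can be at most one page, so
     * cache_size / associativity <= page_size. */
    unsigned max_cache_size = page_size * assoc;
    printf("max VIPT cache: %u KB (%u sets of %u-byte blocks)\n",
           max_cache_size / 1024,
           max_cache_size / (assoc * block_size), block_size);
    return 0;
}
```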

Bad Events
Consider the longest latency possible for a load instruction:
- TLB miss: must look up the page table to find the translation for v.page P
- Calculate the virtual memory address for the page table entry that has the translation for page P; let's say this is v.page Q
- TLB miss for v.page Q: will require navigation of a hierarchical page table (let's ignore this case for now and assume we have succeeded in finding the physical memory location R for page Q)
- Access memory location R (find this either in L1, L2, or memory)
- We now have the translation for v.page P: put this into the TLB
- We now have a TLB hit and know the physical page number; this allows us to do the tag comparison and check the L1 cache for a hit
- If there's a miss in L1, check L2; if that misses, check in memory
- At any point, if the page table entry claims that the page is on disk, flag a page fault; the OS then copies the page from disk to memory and the hardware resumes what it was doing before the page fault. Phew!
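
The control flow above can be summarized in code. This is only a sketch, not a real MMU: tlb_lookup, page_table_walk, and the OS handler are hypothetical stubs (the TLB here always misses, to force the slow path), and the nested TLB miss for v.page Q is ignored, as in the slide.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 13                         /* 8 KB pages */

typedef struct { uint64_t ppn; bool on_disk; } PTE;

static bool tlb_lookup(uint64_t vpn, uint64_t *ppn) {   /* stub: always a miss */
    (void)vpn; (void)ppn; return false;
}
static PTE page_table_walk(uint64_t vpn) {              /* stub: v.page -> phys. page 42 */
    (void)vpn; PTE e = { 42, false }; return e;
}
static void tlb_fill(uint64_t vpn, uint64_t ppn) { (void)vpn; (void)ppn; }
static void os_page_fault_handler(uint64_t vpn) {       /* OS copies the page from disk */
    printf("page fault on v.page %llu\n", (unsigned long long)vpn);
}

uint64_t load(uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_OFFSET_BITS, ppn = 0;
    if (!tlb_lookup(vpn, &ppn)) {                   /* TLB miss for v.page P */
        PTE e = page_table_walk(vpn);               /* access the PTE (in L1, L2, or memory) */
        if (e.on_disk) {                            /* page fault: OS brings the page in, retry */
            os_page_fault_handler(vpn);
            e = page_table_walk(vpn);
        }
        ppn = e.ppn;
        tlb_fill(vpn, ppn);                         /* next access to this page will hit */
    }
    /* With the physical page number we can do the tag comparison and
     * check L1, then L2, then memory (not modeled here). */
    return (ppn << PAGE_OFFSET_BITS) | (vaddr & ((1u << PAGE_OFFSET_BITS) - 1));
}

int main(void) {
    printf("physical address: 0x%llx\n", (unsigned long long)load(0x2ABCD));
    return 0;
}
```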

Multiprocessor Taxonomy
- SISD: single instruction and single data stream: uniprocessor
- MISD: no commercial multiprocessor; imagine data going through a pipeline of execution engines
- SIMD: vector architectures; lower flexibility
- MIMD: most multiprocessors today; easy to construct with off-the-shelf computers, most flexibility

Memory Organization - I
- Centralized shared-memory multiprocessor or symmetric shared-memory multiprocessor (SMP)
- Multiple processors connected to a single centralized memory; since all processors see the same memory organization: uniform memory access (UMA)
- Shared-memory because all processors can access the entire memory address space
- Can centralized memory emerge as a bandwidth bottleneck? Not if you have large caches and employ fewer than a dozen processors

SMPs or Centralized Shared-Memory
(figure: several processors, each with its own caches, sharing a single main memory and I/O system)

Memory Organization - II
- For higher scalability, memory is distributed among processors: distributed memory multiprocessors
- If one processor can directly address the memory local to another processor, the address space is shared: distributed shared-memory (DSM) multiprocessor
- If memories are strictly local, we need messages to communicate data: cluster of computers or multicomputers
- Non-uniform memory access (NUMA), since local memory has lower latency than remote memory

Distributed Memory Multiprocessors
(figure: each node pairs a processor and its caches with local memory and I/O; the nodes communicate over an interconnection network)

SMPs
- Centralized main memory and many caches: many copies of the same data
- A system is cache coherent if a read returns the most recently written value for that word

  Value of X in:
  Time  Event                  Cache-A  Cache-B  Memory
  0     -                      -        -        1
  1     CPU-A reads X          1        -        1
  2     CPU-B reads X          1        1        1
  3     CPU-A stores 0 in X    0        1        0

- At time 3, Cache-B still holds the stale value 1: this system is not coherent

Cache Coherence
A memory system is coherent if:
- P writes to X; no other processor writes to X; P reads X and receives the value previously written by P
- P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
- Two writes to the same location by two processors are seen in the same order by all processors: write serialization
- The memory consistency model defines how much time may elapse before the effect of one processor's write is seen by the others

Cache Coherence Protocols
- Directory-based: a single location (directory) keeps track of the sharing status of a block of memory
- Snooping: every cache block is accompanied by the sharing status of that block; all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
  - Write-invalidate: a processor gains exclusive access to a block before writing, by invalidating all other copies
  - Write-update: when a processor writes, it updates other shared copies of that block

Snooping-Based Protocols
- Three states for a block: invalid, shared, modified
- A write is placed on the bus and sharers invalidate themselves
- The protocols are referred to as MSI, MESI, etc.
(figure: processors with caches on a shared bus to main memory and the I/O system)

Example
(figure: P1 and P2, each with its own cache, on a shared bus to main memory)
- P1 reads X: not found in cache-1; request sent on bus; memory responds; X is placed in cache-1 in shared state
- P2 reads X: not found in cache-2; request sent on bus; everyone snoops this request; cache-1 does nothing because this is just a read request; memory responds; X is placed in cache-2 in shared state
- P1 writes X: cache-1 has the data in shared state (shared only provides read permission); an upgrade request is sent on the bus; cache-2 snoops and invalidates its copy of X; cache-1 moves its state to modified
- P2 reads X: cache-2 has the data in invalid state; request sent on bus; cache-1 snoops and realizes it has the only valid copy, so it downgrades itself to shared state and responds with the data; X is placed in cache-2 in shared state; memory is also updated
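
The four steps trace the MSI transitions for a single block. The sketch below replays them; the state names and bus events follow the slides, while the decomposition into snoop/access functions is an illustrative assumption (a real controller also issues the bus transactions, which are elided here).

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } MSIState;
typedef enum { BUS_READ, BUS_WRITE_OR_UPGRADE } BusEvent;

/* What a cache does when it snoops another processor's request. */
MSIState snoop(MSIState s, BusEvent e) {
    if (e == BUS_READ && s == MODIFIED) return SHARED;  /* supply data, downgrade */
    if (e == BUS_WRITE_OR_UPGRADE)      return INVALID; /* another writer: invalidate */
    return s;                                           /* reads don't disturb shared copies */
}

/* What a cache does for its own processor's access (the bus request
 * that precedes the transition is implied, not modeled). */
MSIState access(MSIState s, int is_write) {
    if (is_write) return MODIFIED;                      /* after a write/upgrade miss on the bus */
    return (s == INVALID) ? SHARED : s;                 /* read miss fetches in shared state */
}

int main(void) {
    /* Replay the example: P1 reads X, P2 reads X, P1 writes X, P2 reads X. */
    MSIState c1 = INVALID, c2 = INVALID;
    c1 = access(c1, 0);                                 /* P1 Rd X: cache-1 -> S */
    c2 = access(c2, 0); c1 = snoop(c1, BUS_READ);       /* P2 Rd X: cache-2 -> S */
    c1 = access(c1, 1); c2 = snoop(c2, BUS_WRITE_OR_UPGRADE); /* P1 Wr X: c1 -> M, c2 -> I */
    c2 = access(c2, 0); c1 = snoop(c1, BUS_READ);       /* P2 Rd X: c1 -> S (responds), c2 -> S */
    printf("cache-1=%d cache-2=%d (0=I, 1=S, 2=M)\n", c1, c2);
    return 0;
}
```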

Example
(the last four columns show the state of X in each cache)

  Request   Cache Hit/Miss  Request on bus  Who responds                          Cache 1  Cache 2  Cache 3  Cache 4
  -         -               -               -                                     Inv      Inv      Inv      Inv
  P1: Rd X  Miss            Rd X            Memory                                S        Inv      Inv      Inv
  P2: Rd X  Miss            Rd X            Memory                                S        S        Inv      Inv
  P2: Wr X  Perms miss      Upgrade X       No response; other caches invalidate  Inv      M        Inv      Inv
  P3: Wr X  Write miss      Wr X            P2 responds                           Inv      Inv      M        Inv
  P3: Rd X  Read hit        -               -                                     Inv      Inv      M        Inv
  P4: Rd X  Read miss       Rd X            P3 responds; memory written back      Inv      Inv      S        S
