CSC526: Parallel Processing Fall 2016


CSC526: Parallel Processing Fall 2016
WEEK 5: Caches in Multiprocessor Systems
* Addressing
* Cache Performance
* Writing Policy
* Cache Coherence (CC) Problem
* Snoopy Bus Protocols
PART 1: HARDWARE
Dr. Soha S. Zaghloul

INTRODUCTION
Most multiprocessor systems use private caches associated with the different processors, as depicted in the following figure:
[Figure: processors P1, P2, P3, ..., Pn, each with a private cache C1, C2, C3, ..., Cn, connected through an interconnection network (bus, crossbar, etc.) to shared-memory modules M1, M2, M3, ..., Mn and to I/O channels serving disks D1, D2, D3, ..., Dn]

ADDRESSING (1)
Caches may be addressed in one of two ways:
Physical addressing: data in the cache are accessed using their physical addresses.
Virtual addressing: data in the cache are accessed using their virtual addresses.

ADDRESSING (2) PHYSICAL (1) UNIFIED CACHE
The following figure depicts the organization of a physically addressed unified cache:
[Figure: the CPU issues a VA to the MMU; the MMU produces the PA used to access the cache, which exchanges data/instructions with main memory]
The Memory Management Unit (MMU) translates a virtual address into the corresponding physical address.
A unified cache contains both data and instructions.
A cache hit occurs when the required address is found in the cache; otherwise, we have a cache miss.
After a cache miss, a whole block is loaded from main memory into the cache.
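The hit/miss behaviour with whole-block fill described above can be sketched as a toy direct-mapped cache in Python (an illustrative model, not from the slides; the block size, line count, and function names are assumptions):

```python
# Minimal sketch of a direct-mapped, physically addressed cache lookup.
BLOCK_SIZE = 4      # words per block (assumed)
NUM_LINES = 8       # cache lines (assumed)

# Each line holds (tag, block_data) or None when empty.
lines = [None] * NUM_LINES

def access(pa, memory):
    """Return (word at physical address pa, hit?); load a whole block on a miss."""
    block_no = pa // BLOCK_SIZE
    index = block_no % NUM_LINES
    tag = block_no // NUM_LINES
    line = lines[index]
    if line is not None and line[0] == tag:
        hit = True
    else:
        # Cache miss: load the whole block from main memory into the cache.
        start = block_no * BLOCK_SIZE
        lines[index] = (tag, memory[start:start + BLOCK_SIZE])
        hit = False
    return lines[index][1][pa % BLOCK_SIZE], hit

memory = list(range(100))       # toy main memory
print(access(5, memory))        # miss: block 1 is loaded
print(access(6, memory))        # hit: same block is already cached
```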

ADDRESSING (3) PHYSICAL (2) SPLIT CACHE
The following figure depicts the organization of a physically addressed split multi-level cache:
[Figure: the CPU issues a VA to the MMU, which produces the PA; Level-1 is split into a D-cache and an I-cache, backed by a Level-2 D-cache, which exchanges data with main memory]
The Level-2 cache has a higher capacity than the Level-1 cache; for example, 256 KB and 64 KB respectively.
At any point in time, the contents of the Level-1 cache are a subset of the Level-2 cache.
Usually, the Level-1 cache is placed on-chip (i.e., on the same chip as the processor).

ADDRESSING (4) VIRTUAL (1) UNIFIED CACHE
The following figure depicts the organization of a virtually addressed unified cache:
[Figure: the CPU issues the VA to both the cache and the MMU in parallel; the MMU produces the PA used to access main memory]
Cache access and MMU address translation are performed in parallel. However, the PA is not used unless a memory access is needed.

ADDRESSING (5) VIRTUAL (2) SPLIT CACHE
The following figure depicts the organization of a virtually addressed split cache:
[Figure: the CPU issues instruction VAs to the I-cache and data VAs to the D-cache; the MMU translates VAs to PAs for main-memory access]

ADDRESSING (6) PHYSICAL VS. VIRTUAL
The following points highlight the pros and cons of both addressing modes:
Physical addressing
Pros: No need to perform cache flushing, since physical addresses are unique. No aliasing problems (aliasing occurs when two VAs are mapped to the same PA).
Cons: Cache access is slowed down until the MMU translates the VA into a PA.
Virtual addressing
Pros: Faster access to the cache, since MMU translation is performed in parallel with the cache access.
Cons: The aliasing problem: multiple processes may have the same range of VAs. This may be solved by flushing the entire cache; however, this may result in poor performance.
The drawback of physical addressing may be alleviated if the MMU and the cache are integrated on the same chip as the CPU.
Most system designs use physical addressing for (1) its simplicity, and (2) its requiring less intervention from the OS compared to virtual addressing.

CACHE PERFORMANCE (1)
The performance of a cache is measured by its hit ratio:
Hit Ratio (HR) = (number of cache hits) / (total number of cache accesses)
Miss Ratio (MR) = (number of cache misses) / (total number of cache accesses)
Miss Ratio = 1 - Hit Ratio
For a multi-level cache, the access time (T) of each level should be considered. The average access time for the L1 cache is:
T_caches = HR_L1 * T_L1 + MR_L1 * (T_L1 + T_L2)
To calculate the overall memory-system performance, the access time of the main memory (T_M) should also be considered:
T_overall = HR_L1 * T_L1 + HR_L2 * (T_L1 + T_L2) + MR_L2 * (T_L1 + T_L2 + T_M)

CACHE PERFORMANCE (2) NUMERICAL EXAMPLE
[Figure: CPU → Level-1 D-cache (access time = 0.01 μs) → Level-2 D-cache (access time = 0.1 μs)]
Assume HR_L1 = 0.95. What is the L1-cache performance?
T = 0.95 * 0.01 + 0.05 * (0.01 + 0.1) = 0.015 μs
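The formula and the numerical example above can be checked with a short Python helper (the function name is illustrative):

```python
# Average L1 access time for a two-level cache, following the slides' formula:
#   T = HR_L1 * T_L1 + MR_L1 * (T_L1 + T_L2)
def l1_average_access_time(hr_l1, t_l1, t_l2):
    mr_l1 = 1 - hr_l1                       # Miss Ratio = 1 - Hit Ratio
    return hr_l1 * t_l1 + mr_l1 * (t_l1 + t_l2)

# Values from the numerical example: T_L1 = 0.01 us, T_L2 = 0.1 us, HR_L1 = 0.95.
t = l1_average_access_time(0.95, 0.01, 0.1)
print(round(t, 4))   # 0.015 (microseconds)
```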

WRITING POLICIES (1) PROBLEM DEFINITION (1) SCENARIO (1)
[Figure: the CPU updates cached word W0 of block Bi from 150 to 300; the copy of W0 in main memory still holds 150, so the cache and memory are inconsistent]

WRITING POLICIES (2) PROBLEM DEFINITION (2) SCENARIO (2)
[Figure: an I/O module updates word W0 of block Bi in main memory from 150 to 300; the cached copy of W0 still holds 150, so the cache and memory are inconsistent]

WRITING POLICIES (3) SOLUTION (1)
The aim of a writing policy is to keep the data consistent between cache and memory. Two main writing policies are followed in cache design:
Write-through
Write-back

WRITING POLICIES (4) SOLUTION (2) WRITE-THROUGH
[Figure: the CPU writes 300 to cached word W0; with write-through, the copy of W0 in main memory is updated to 300 as well]
Every time a word is updated in the cache, the update is written through (reflected) to the main memory.
This technique is simple. However, it increases the memory traffic.

WRITING POLICIES (5) SOLUTION (3) WRITE-BACK
[Figure: the CPU writes 300 to cached word W0; the main-memory copy keeps the old value 150 until the block is replaced]
When a cache line is updated, a status bit (the update bit) is set to 1.
When the cache line is to be replaced, it is copied back to the main memory if its update bit is equal to 1.
This technique minimizes memory accesses (traffic). However, some memory locations become invalid until the write-back occurs.
In addition, write-back imposes that the I/O module accesses the memory through the cache.
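The traffic difference between the two policies can be illustrated with a toy Python model (all class and attribute names here are assumptions, not from the slides): write-through issues one memory write per store, while write-back issues at most one per eviction of a dirty line.

```python
class ToyCache:
    """One-word toy cache that counts memory writes under each policy."""
    def __init__(self, policy):
        self.policy = policy          # "write-through" or "write-back"
        self.update_bit = 0           # the slides' status bit
        self.value = None
        self.memory_writes = 0

    def write(self, value):
        self.value = value
        if self.policy == "write-through":
            self.memory_writes += 1   # every update goes straight to memory
        else:
            self.update_bit = 1       # write-back: just mark the line dirty

    def evict(self):
        if self.policy == "write-back" and self.update_bit:
            self.memory_writes += 1   # copy back only if the update bit is set
            self.update_bit = 0

wt, wb = ToyCache("write-through"), ToyCache("write-back")
for v in range(100):                  # 100 stores to the same word
    wt.write(v)
    wb.write(v)
wb.evict()
print(wt.memory_writes, wb.memory_writes)   # 100 vs 1
```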

CACHE COHERENCE PROBLEM (1)
In a multiprocessor system, data inconsistency may occur between a cache and main memory, or amongst the local caches of different processors.
Multiple caches may hold different copies of the same memory block, since multiple processors operate asynchronously and independently. This situation is known as the cache coherence problem.
The cache coherence problem may be caused by:
Data sharing
Process migration
I/O that bypasses the caches (DMA)

CACHE COHERENCE PROBLEM (2) DATA SHARING
Consider the following scenario:
[Figure: processors P1 and P2 with private caches over a shared main memory, shown before the update, after a write-through update, and after a write-back update]
A data item is shared between both processors. Before the update, the three copies (the two caches and memory) are consistent.
P1 updates the item. Assuming a write-through policy, the update is immediately reflected onto the main memory. However, the copy in P2's cache is inconsistent.
P1 updates the item. Assuming a write-back policy, the update is not immediately reflected onto the main memory. The copy in P2's cache is also inconsistent.

CACHE COHERENCE PROBLEM (3) PROCESS MIGRATION
Consider the following scenario:
[Figure: processors P1 and P2 with private caches over a shared main memory, shown before migration, after a write-through update, and after a write-back update]
A data item is used by a process on P1, and the process is then migrated to P2.
P2 updates the item after migration. Assuming a write-through policy, the update is immediately reflected onto the main memory. However, the copy in P1's cache is inconsistent.
P2 updates the item after migration. Assuming a write-back policy, the update is not immediately reflected onto the main memory. The copy in P1's cache is also inconsistent.

CACHE COHERENCE PROBLEM (4) I/O
Consider the following scenario:
[Figure: processors P1 and P2 with private caches over a shared main memory and an I/O processor, shown for input that bypasses the caches and for output that bypasses the caches]
When the I/O bypasses the caches, a cache coherence problem may occur:
When the I/O processor loads a new value into the main memory, bypassing the caches, the copies in the processors' private caches become obsolete.
Conversely, suppose P1 updates a cached item while write-back caches are used, so the update is not immediately reflected onto the memory. When the memory then outputs that item directly to the I/O, bypassing the cache, it outputs an obsolete value.

CACHE COHERENCE PROBLEM (5) SOLUTION
Two main approaches are commonly used to solve the cache coherence problem:
Snoopy bus protocols
Directory-based protocols

SNOOPY BUS PROTOCOLS (1) INTRODUCTION (1)
A bus is a convenient interconnection network (I/N) topology for ensuring cache coherence, since it allows all interconnected processors in the system to observe ongoing memory transactions.
If a bus transaction threatens the consistent state of a local cache, the cache controller can take appropriate actions to invalidate the local copy.
Two practices are implemented to maintain cache coherence:
Write-invalidate policy: when a local cache block is updated, all blocks with the same address in remote caches are invalidated.
Write-update policy: when a local cache block is updated, the new data block is broadcast to all caches containing a copy of the same block.
Snoopy protocols achieve data consistency among the caches and shared memory through a bus-watching mechanism. The following figure illustrates the policies mentioned above.
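The two policies can be sketched as a toy bus simulation in Python (an illustrative model; the block name "X", the state letters, and the function name are assumptions):

```python
# Toy model of the two snoopy policies on a shared bus.
# Each cache maps a block name to (value, state); "V" = valid, "I" = invalid.
def local_write(caches, writer, block, value, policy):
    caches[writer][block] = (value, "V")
    for i, cache in enumerate(caches):
        if i != writer and block in cache:
            if policy == "write-invalidate":
                cache[block] = (cache[block][0], "I")   # snooped: invalidate the copy
            else:                                        # write-update
                cache[block] = (value, "V")              # snooped: take the new value

caches = [{"X": (150, "V")}, {"X": (150, "V")}, {"X": (150, "V")}]
local_write(caches, 0, "X", 300, "write-invalidate")
print([c["X"] for c in caches])   # remote copies are now invalid

caches = [{"X": (150, "V")}, {"X": (150, "V")}, {"X": (150, "V")}]
local_write(caches, 0, "X", 300, "write-update")
print([c["X"] for c in caches])   # all copies hold the new value
```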

SNOOPY BUS PROTOCOLS (2) INTRODUCTION (2)
[Figure: three processors P1, P2, P3 with private caches on a shared bus over main memory, shown in the initial state, after a write-invalidate, and after a write-update]
Write-invalidate: the memory copy is updated, and all other cached copies are invalidated (I). Invalidated blocks are called dirty, meaning that they should not be used.
Write-update: the new block contents are broadcast via the bus to all caches holding a copy, and hence updated. With write-through caches, the memory copy is also updated; with write-back caches, the memory is updated later, upon block replacement.

SNOOPY BUS PROTOCOLS (3) STATE DIAGRAM
A state diagram is used to depict all transactions of the write-invalidate protocol as implemented in both write-through and write-back caches:
The states in the diagram represent those of a cache block.
Two processors are denoted: a local processor (i) and a remote processor (j).
Six operations may take place in such an environment, namely:
R(i): read the cache block by the local processor
R(j): read the cache block by the remote processor
W(i): write (modify) the cache block by the local processor
W(j): write (modify) the cache block by the remote processor
Z(i): replace the cache block in the local processor
Z(j): replace the cache block in the remote processor

SNOOPY BUS PROTOCOLS (4) WRITE-THROUGH CACHES (1)
A block belonging to a write-through cache has one of two states: Valid (V) or Invalid (I). A cache block in the invalid state is either dirty or unavailable in the processor's cache.
[State diagram: in state V, the events R(i), R(j), W(i), and Z(j) loop back to V, while W(j) and Z(i) lead to state I]
Let us first consider the Valid state:
Local read R(i): does not affect the status of the local cache block.
Remote read R(j): does not affect the status of the local cache block.
Local write (modification) W(i): does not affect the status of the local cache block.
Remote write (modification) W(j): causes the local copy of the cache block to become dirty → Invalid.
Local replace Z(i): the cache block is no longer available in the local processor → Invalid.
Remote replace Z(j): does not affect the status of the local cache block.

SNOOPY BUS PROTOCOLS (4) WRITE-THROUGH CACHES (2)
[State diagram: in state I, the events R(j), W(j), Z(i), and Z(j) loop back to I, while R(i) and W(i) lead to state V]
Let us now consider the Invalid state:
Local read R(i): cache miss; the block is fetched from memory and becomes valid → Valid.
Remote read R(j): does not affect the status of the local cache block.
Local write (modification) W(i): refreshes the local cache block → Valid.
Remote write (modification) W(j): does not affect the status of the local cache block.
Local replace Z(i): the cache block is still unavailable → Invalid.
Remote replace Z(j): does not affect the status of the local cache block.
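The valid/invalid transitions enumerated on the write-through slides can be collected into a Python transition table (the table and helper names are illustrative; the transitions themselves follow the slides):

```python
# Write-invalidate protocol on write-through caches:
# states V (valid) and I (invalid); events R/W/Z by local (i) or remote (j).
TRANSITIONS = {
    ("V", "R(i)"): "V", ("V", "R(j)"): "V",
    ("V", "W(i)"): "V", ("V", "W(j)"): "I",   # remote write dirties the copy
    ("V", "Z(i)"): "I", ("V", "Z(j)"): "V",   # local replace removes the block
    ("I", "R(i)"): "V", ("I", "R(j)"): "I",   # local read fetches the block
    ("I", "W(i)"): "V", ("I", "W(j)"): "I",   # local write refreshes the block
    ("I", "Z(i)"): "I", ("I", "Z(j)"): "I",
}

def run(state, events):
    """Apply a sequence of events to a cache-block state."""
    for e in events:
        state = TRANSITIONS[(state, e)]
    return state

# A local read of an invalid block fetches it; a remote write invalidates it again.
print(run("I", ["R(i)", "W(j)"]))   # I
```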

SNOOPY BUS PROTOCOLS (5) WRITE-BACK CACHES (1)
The state diagram for write-back caches has three states, namely Invalid (I), Read-Only (RO), and Read-Write (RW).
The Invalid state designates that the cache block is either dirty or unavailable in the local cache.
RO state: many caches can contain RO copies of a block.
RW state: only one processor in the whole system may have a cache block in the RW state; the processor that performs a write holds the block in the RW state.

SNOOPY BUS PROTOCOLS (6) WRITE-BACK CACHES (2)
[State diagram: in state I, the events R(j), W(j), Z(i), and Z(j) loop back to I, while R(i) leads to RO and W(i) leads to RW]
Let us first consider the Invalid state:
Local read R(i): refreshes the local cache with an RO copy → RO.
Remote read R(j): does not affect the status of the local cache block.
Local write (modification) W(i): refreshes the local cache block with an RW copy → RW.
Remote write (modification) W(j): does not affect the status of the local cache block.
Local replace Z(i): the cache block is still unavailable → Invalid.
Remote replace Z(j): does not affect the status of the local cache block.

SNOOPY BUS PROTOCOLS (7) WRITE-BACK CACHES (3)
[State diagram: in state RO, the events R(i), R(j), and Z(j) loop back to RO, while W(i) leads to RW, and W(j) and Z(i) lead to I]
Let us now consider the RO state:
Local read R(i): does not change the state of the local cache block.
Remote read R(j): does not affect the status of the local cache block.
Local write (modification) W(i): the last processor to write the cache block holds it exclusively → RW.
Remote write (modification) W(j): makes the local cache block dirty → Invalid.
Local replace Z(i): the cache block is no longer available locally → Invalid.
Remote replace Z(j): does not affect the status of the local cache block.

SNOOPY BUS PROTOCOLS (8) WRITE-BACK CACHES (4)
[State diagram: in state RW, the events R(i), W(i), and Z(j) loop back to RW, while R(j) leads to RO, and W(j) and Z(i) lead to I]
Finally, let us consider the RW state:
Local read R(i): does not change the state of the local cache block.
Remote read R(j): memory is updated (written back), and the local cache block becomes a read-only copy → RO.
Local write (modification) W(i): does not change the state of the local cache block.
Remote write (modification) W(j): makes the local cache block dirty → Invalid.
Local replace Z(i): the cache block becomes unavailable → Invalid.
Remote replace Z(j): does not affect the status of the local cache block.
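The three-state transitions listed on the write-back slides can likewise be collected into a Python transition table (helper names are illustrative; the transitions follow the slides):

```python
# Write-invalidate protocol on write-back caches:
# states I (invalid), RO (read-only), RW (read-write).
TRANSITIONS_WB = {
    ("I", "R(i)"): "RO",  ("I", "W(i)"): "RW",
    ("I", "R(j)"): "I",   ("I", "W(j)"): "I",
    ("I", "Z(i)"): "I",   ("I", "Z(j)"): "I",
    ("RO", "R(i)"): "RO", ("RO", "R(j)"): "RO",
    ("RO", "W(i)"): "RW", ("RO", "W(j)"): "I",
    ("RO", "Z(i)"): "I",  ("RO", "Z(j)"): "RO",
    ("RW", "R(i)"): "RW", ("RW", "R(j)"): "RO",  # remote read forces a write-back
    ("RW", "W(i)"): "RW", ("RW", "W(j)"): "I",
    ("RW", "Z(i)"): "I",  ("RW", "Z(j)"): "RW",
}

def run_wb(state, events):
    """Apply a sequence of events to a cache-block state."""
    for e in events:
        state = TRANSITIONS_WB[(state, e)]
    return state

# A local write gains exclusive ownership; a remote read demotes it to read-only.
print(run_wb("I", ["W(i)", "R(j)"]))   # RO
```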

FURTHER READINGS
Cache/memory addressing
Mapping functions
Replacement policies
Directory-based protocol