CSC526: Parallel Processing, Fall 2016
WEEK 5: Caches in Multiprocessor Systems (PART 1: HARDWARE)
* Addressing
* Cache Performance
* Writing Policy
* Cache Coherence (CC) Problem
* Snoopy Bus Protocols
Dr. Soha S. Zaghloul
INTRODUCTION
Most multiprocessor systems use private caches, one per processor, as depicted in the following figure:
[Figure: processors P1..Pn, each with a private cache C1..Cn, connected through an interconnection network (bus, crossbar, etc.) to main memory modules M1..Mn and to I/O channels with disks D1..Dn]
ADDRESSING (1)
Caches may be addressed in one of two ways:
Physical addressing: data in the cache are accessed using their physical addresses.
Virtual addressing: data in the cache are accessed using their virtual addresses.
ADDRESSING (2) PHYSICAL (1) UNIFIED CACHE
The following figure depicts the organization of a physically addressed unified cache:
[Figure: the CPU sends a VA to the MMU; the MMU sends the PA to the cache; the cache exchanges data/instructions (D/I) with the CPU and, on a miss, with main memory]
The Memory Management Unit (MMU) translates a virtual address (VA) into the corresponding physical address (PA).
A unified cache contains both data and instructions.
A cache hit occurs when the required address is found in the cache; otherwise, we have a cache miss.
After a cache miss, a whole block is loaded from main memory into the cache.
ADDRESSING (3) PHYSICAL (2) SPLIT CACHE
The following figure depicts the organization of a physically addressed split multi-level data cache:
[Figure: the CPU sends VAs to the MMU; the MMU issues PAs to the Level-1 I-Cache and D-Cache; the Level-1 D-Cache is backed by a Level-2 D-Cache, which connects to main memory]
The Level-2 cache has a higher capacity than the Level-1 cache; for example, 256 KB and 64 KB respectively.
At any point in time, the Level-1 cache is a subset of the Level-2 cache.
Usually, the Level-1 cache is placed on-chip (i.e., on the same chip as the processor).
ADDRESSING (4) VIRTUAL (1) UNIFIED CACHE
The following figure depicts the organization of a virtually addressed unified cache:
[Figure: the CPU sends the VA to both the cache and the MMU; the MMU produces the PA used for main memory access]
Cache access and MMU address translation are performed in parallel. However, the PA is not used unless a memory access is needed.
ADDRESSING (5) VIRTUAL (2) SPLIT CACHE
The following figure depicts the organization of a virtually addressed split cache:
[Figure: the CPU sends VAs directly to separate I-Cache and D-Cache; the MMU translates the VAs into PAs for main memory access]
ADDRESSING (6) PHYSICAL VS. VIRTUAL
The following points highlight the pros and cons of both addressing modes:
Physical addressing
Pros: no need to perform cache flushing, since PAs are unique; no aliasing problem (where two VAs are mapped to the same PA).
Cons: cache access is slowed down until the MMU translates the VA into a PA.
Virtual addressing
Pros: faster access to the cache, since MMU translation is performed in parallel with cache access.
Cons: the aliasing problem, since multiple processes may use the same range of VAs. This may be solved by flushing the entire cache; however, this may result in poor performance.
The drawback of PA may be alleviated if the MMU and the cache are integrated on the same chip as the CPU.
Most system designs use PA because (1) it is simple, and (2) it requires less intervention from the OS compared to VA.
CACHE PERFORMANCE (1)
The performance of a cache is measured by its hit ratio:
Hit Ratio (HR) = (number of cache hits) / (total number of cache accesses)
Miss Ratio (MR) = (number of cache misses) / (total number of cache accesses)
Miss Ratio = 1 - Hit Ratio
For a multi-level cache, the access time (T) of each level should be considered. The average access time for the L1 cache is:
T_caches = HR_L1 * T_L1 + MR_L1 * (T_L1 + T_L2)
To calculate the overall memory-system performance, the access time of the main memory (T_M) should also be considered:
T_overall = HR_L1 * T_L1 + MR_L1 * HR_L2 * (T_L1 + T_L2) + MR_L1 * MR_L2 * (T_L1 + T_L2 + T_M)
CACHE PERFORMANCE (2) NUMERICAL EXAMPLE
Assume a CPU with a Level-1 D-Cache (access time = 0.01 μs) backed by a Level-2 D-Cache (access time = 0.1 μs), and HR_L1 = 0.95. What is the L1-cache performance?
T = 0.95 * 0.01 + 0.05 * (0.01 + 0.1) = 0.015 μs
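The formulas above can be checked with a short script. This is a minimal sketch (the function names `l1_avg_time` and `overall_time` are illustrative, not from the slides) that evaluates the L1 average access time and the overall memory-system time:

```python
def l1_avg_time(hr_l1, t_l1, t_l2):
    """Average L1 access time: hits cost T_L1; misses cost T_L1 + T_L2."""
    mr_l1 = 1 - hr_l1                      # Miss Ratio = 1 - Hit Ratio
    return hr_l1 * t_l1 + mr_l1 * (t_l1 + t_l2)

def overall_time(hr_l1, hr_l2, t_l1, t_l2, t_m):
    """Overall time including main memory: an L2 miss also pays T_M."""
    mr_l1, mr_l2 = 1 - hr_l1, 1 - hr_l2
    return (hr_l1 * t_l1
            + mr_l1 * hr_l2 * (t_l1 + t_l2)
            + mr_l1 * mr_l2 * (t_l1 + t_l2 + t_m))

# Values from the numerical example: T_L1 = 0.01 us, T_L2 = 0.1 us, HR_L1 = 0.95
print(round(l1_avg_time(0.95, 0.01, 0.1), 4))  # 0.015
```

Plugging in the example's values reproduces the 0.015 μs result computed on the slide.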
WRITING POLICIES (1) PROBLEM DEFINITION (1) SCENARIO (1)
[Figure: the CPU updates word W0 of cached block B0 from 150 to 300; the cached copy now holds 300 while the corresponding word in main memory still holds 150, so the cache and memory copies of the block are inconsistent]
WRITING POLICIES (2) PROBLEM DEFINITION (2) SCENARIO (2)
[Figure: the I/O module updates word W0 of block B0 in main memory from 150 to 300, bypassing the cache; the cached copy still holds 150, so it becomes stale]
WRITING POLICIES (3) SOLUTION (1)
The aim of a writing policy is to keep the data consistent between the cache and memory. Two main writing policies are followed in cache design:
Write-through
Write-back
WRITING POLICIES (4) SOLUTION (2) WRITE-THROUGH
[Figure: the CPU updates W0 from 150 to 300 in the cache, and the update is immediately propagated to the same word in main memory]
Every time a word is updated in the cache, it is written through (reflected) in the main memory.
This technique is simple. However, it increases the memory traffic.
WRITING POLICIES (5) SOLUTION (3) WRITE-BACK
[Figure: the CPU updates W0 from 150 to 300 in the cache only; main memory is updated later, when the block is replaced]
When a cache line is updated, a status bit (the update bit) is set to 1.
When the cache line is to be replaced, it is copied to the main memory only if its update bit is equal to 1.
This technique minimizes memory accesses (traffic).
However, some memory locations become invalid (stale) until the write-back occurs.
In addition, write-back imposes that the I/O module accesses the memory through the cache.
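The update-bit mechanism can be sketched in a few lines of Python. This is a toy model with hypothetical names that counts only the write traffic to main memory:

```python
class WriteBackCache:
    """Toy write-back cache: memory is written only on dirty-block eviction."""
    def __init__(self):
        self.lines = {}          # block address -> (data, update/dirty bit)
        self.memory_writes = 0   # traffic to main memory

    def write(self, addr, value):
        # Update the cached copy only, and set the update bit to 1.
        self.lines[addr] = (value, True)

    def evict(self, addr):
        # On replacement, copy the block back only if its update bit is set.
        data, dirty = self.lines.pop(addr)
        if dirty:
            self.memory_writes += 1
        return data

cache = WriteBackCache()
cache.write(0, 150)
cache.write(0, 300)   # a write-through cache would have hit memory twice by now
cache.evict(0)
print(cache.memory_writes)  # 1
```

Two writes to the same block cost a single memory access under write-back, versus two under write-through, which illustrates the traffic reduction the slide describes.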
CACHE COHERENCE PROBLEM (1)
In a multiprocessor system, data inconsistency may occur between a cache and main memory, or among the local caches of different processors.
Multiple caches may hold different copies of the same memory block, since multiple processors operate asynchronously and independently. Such a situation is known as the cache coherence problem.
The cache coherence problem may be caused by:
Data sharing
Process migration
I/O that bypasses caches (e.g., DMA)
CACHE COHERENCE PROBLEM (2) DATA SHARING
Consider the following scenario (the shared datum is denoted X, and its updated value X'):
[Figure: processors P1 and P2, each with a private cache holding a copy of X, with X also stored in main memory]
X is a datum shared between both processors. Before the update, the three copies of X are consistent.
P1 updates X to X'. Assuming a write-through policy, the update is immediately reflected in the main memory. However, the copy in P2's cache is inconsistent.
P1 updates X to X'. Assuming a write-back policy, the update is not immediately reflected in the main memory. The copy in P2's cache is also inconsistent.
CACHE COHERENCE PROBLEM (3) PROCESS MIGRATION
Consider the following scenario (the datum is denoted X, and its updated value X'):
[Figure: processors P1 and P2 with private caches; the process using X migrates from P1 to P2]
X is a datum used by P1, whose process is then migrated to P2.
P2 updates X to X' after migration. Assuming a write-through policy, the update is immediately reflected in the main memory. However, the copy in P1's cache is inconsistent.
P2 updates X to X' after migration. Assuming a write-back policy, the update is not immediately reflected in the main memory. The copy in P1's cache is also inconsistent.
CACHE COHERENCE PROBLEM (4) I/O
Consider the following scenario (the datum is denoted X, and its updated value X'):
[Figure: processors P1 and P2 with private caches; an I/O processor reads from and writes to main memory directly, bypassing the caches]
When the I/O bypasses the cache, a cache coherence problem may occur:
Input: when the I/O processor loads a new value of X into the main memory, bypassing the cache, the copies of X in the processors' private caches become obsolete.
Output: suppose P1 updates X to X' and write-back caches are used, so the update is not immediately reflected in the memory. When the memory outputs the value of X directly to the I/O, bypassing the cache, it outputs an obsolete value.
CACHE COHERENCE PROBLEM (5) SOLUTION
Two main approaches are commonly used to solve the cache coherence problem:
Snoopy bus protocols
Directory-based protocols
SNOOPY BUS PROTOCOLS (1) INTRODUCTION (1)
A bus is a convenient interconnection network (I/N) topology for ensuring cache coherence: it allows all interconnected processors in the system to observe ongoing memory transactions. If a bus transaction threatens the consistent state of a local cache, the cache controller can take appropriate actions to invalidate the local copy.
Two practices are implemented to maintain cache coherence:
Write-invalidate policy: when a local cache block is updated, all blocks with the same address in remote caches are invalidated.
Write-update policy: when a local cache block is updated, the new data block is broadcast to all caches containing a copy of the same block.
Snoopy protocols achieve data consistency among the caches and shared memory through a bus-watching mechanism. The following figure illustrates the policies mentioned above.
SNOOPY BUS PROTOCOLS (2) INTRODUCTION (2)
[Figure: initial state, with processors P1-P3 holding consistent cached copies of a block; the outcome after a write by P1 is shown under each policy]
Write-invalidate: the memory copy is updated, and all other cached copies of the block are invalidated (I). Invalidated blocks are called dirty, meaning that they should not be used.
Write-update: the new block contents are broadcast via the bus to all caches and hence updated. With write-through caches, the memory copy is also updated; with write-back caches, the memory is updated later, upon block replacement.
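The two policies can be contrasted with a toy model (the function names are illustrative, not from the slides), where each cache is a dict from block address to value and an absent key means the block is invalid:

```python
def write_invalidate(caches, writer, addr, value):
    """The writer keeps the new value; all remote copies of the block are invalidated."""
    for i, cache in enumerate(caches):
        if i == writer:
            cache[addr] = value
        else:
            cache.pop(addr, None)   # remote copies become invalid (dirty)

def write_update(caches, writer, addr, value):
    """The new block contents are broadcast to every cache holding a copy."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i != writer and addr in cache:
            cache[addr] = value     # remote copies are refreshed

caches = [{'X': 1}, {'X': 1}, {}]
write_invalidate(caches, 0, 'X', 2)
print(caches)  # [{'X': 2}, {}, {}]
```

Invalidation consumes less bus bandwidth per write (only an address is broadcast), while update keeps sharers' copies usable; that trade-off is the essence of the two policies.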
SNOOPY BUS PROTOCOLS (3) STATE DIAGRAM
A state diagram is used to depict all transitions of the write-invalidate protocol as implemented in both write-through and write-back caches:
The states in the diagram represent those of the cache block.
Two processors are denoted: a local processor (i) and a remote processor (j).
Six operations may take place in such an environment; namely:
Read the cache block in the local processor: R(i)
Read the cache block in the remote processor: R(j)
Write (modify) the cache block in the local processor: W(i)
Write (modify) the cache block in the remote processor: W(j)
Replace the cache block in the local processor: Z(i)
Replace the cache block in the remote processor: Z(j)
SNOOPY BUS PROTOCOLS (4) WRITE-THROUGH CACHES (1)
A block belonging to a write-through cache has one of two states: Valid (V) or Invalid (I). A cache block in the Invalid state is either dirty or unavailable in the processor's cache.
Let us first consider the Valid state:
Local Read R(i): does not affect the status of the local cache block.
Remote Read R(j): does not affect the status of the local cache block.
Local Write (modification) W(i): does not affect the status of the local cache block.
Remote Write (modification) W(j): causes the local copy of the cache block to become dirty → Invalid.
Local Replace Z(i): the cache block is no longer available in the local processor → Invalid.
Remote Replace Z(j): does not affect the status of the local cache block.
SNOOPY BUS PROTOCOLS (4) WRITE-THROUGH CACHES (2)
Let us now consider the Invalid state:
Local Read R(i): cache miss; the block is fetched from memory → Valid.
Remote Read R(j): does not affect the status of the local cache block.
Local Write (modification) W(i): refreshes the local cache block → Valid.
Remote Write (modification) W(j): does not affect the status of the local cache block.
Local Replace Z(i): the cache block is still unavailable → Invalid.
Remote Replace Z(j): does not affect the status of the local cache block.
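The two-state write-through diagram can be written out as a transition table, which makes it easy to check mechanically. A sketch (the table encoding is mine; the transitions follow the slides):

```python
# States of a local cache block: 'V' (Valid) or 'I' (Invalid).
# Events: R/W/Z performed by the local (i) or remote (j) processor.
WRITE_THROUGH_FSM = {
    ('V', 'R(i)'): 'V', ('V', 'R(j)'): 'V',
    ('V', 'W(i)'): 'V', ('V', 'W(j)'): 'I',   # remote write invalidates the copy
    ('V', 'Z(i)'): 'I', ('V', 'Z(j)'): 'V',
    ('I', 'R(i)'): 'V', ('I', 'R(j)'): 'I',   # local read miss fetches the block
    ('I', 'W(i)'): 'V', ('I', 'W(j)'): 'I',
    ('I', 'Z(i)'): 'I', ('I', 'Z(j)'): 'I',
}

def run(state, events):
    """Apply a sequence of events to a local cache block; return the final state."""
    for event in events:
        state = WRITE_THROUGH_FSM[(state, event)]
    return state

print(run('V', ['R(i)', 'W(j)', 'R(i)']))  # valid, invalidated, then refetched: V
```

Tracing a sequence such as a remote write followed by a local read reproduces the invalidate-then-refetch behavior described above.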
SNOOPY BUS PROTOCOLS (5) WRITE-BACK CACHES (1)
The state diagram for write-back caches has three states; namely, Invalid (INV), Read-Only (RO), and Read-Write (RW).
The Invalid state designates that the cache block is either dirty or unavailable in the local cache.
RO state: many caches can contain RO copies of a block.
RW state: only one processor in the whole system may have a cache block in the RW state; the processor that performs a write holds the block in the RW state.
SNOOPY BUS PROTOCOLS (6) WRITE-BACK CACHES (2)
Let us first consider the Invalid state:
Local Read R(i): refreshes the local cache with an RO copy → RO.
Remote Read R(j): does not affect the status of the local cache block.
Local Write (modification) W(i): refreshes the local cache block with a RW copy → RW.
Remote Write (modification) W(j): does not affect the status of the local cache block.
Local Replace Z(i): the cache block is still unavailable → Invalid.
Remote Replace Z(j): does not affect the status of the local cache block.
SNOOPY BUS PROTOCOLS (7) WRITE-BACK CACHES (3)
Let us now consider the RO state:
Local Read R(i): does not change the state of the local cache block.
Remote Read R(j): does not affect the status of the local cache block.
Local Write (modification) W(i): the local processor becomes the last to write the cache block → RW.
Remote Write (modification) W(j): this makes the local cache block dirty → Invalid.
Local Replace Z(i): the cache block becomes dirty → Invalid.
Remote Replace Z(j): does not affect the status of the local cache block.
SNOOPY BUS PROTOCOLS (8) WRITE-BACK CACHES (4)
Finally, let us consider the RW state:
Local Read R(i): does not change the state of the local cache block.
Remote Read R(j): the memory is updated (write-back), and the local cache block drops to a read-only copy → RO.
Local Write (modification) W(i): does not change the state of the local cache block.
Remote Write (modification) W(j): this makes the local cache block dirty → Invalid.
Local Replace Z(i): the cache block becomes unavailable → Invalid.
Remote Replace Z(j): does not affect the status of the local cache block.
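The three-state write-back diagram admits the same tabular treatment (again a sketch; the encoding is mine, the transitions follow the slides):

```python
# States of a local cache block: 'INV', 'RO', 'RW'; events as before.
WRITE_BACK_FSM = {
    ('INV', 'R(i)'): 'RO',  ('INV', 'W(i)'): 'RW',   # fetch as RO / claim as RW
    ('INV', 'R(j)'): 'INV', ('INV', 'W(j)'): 'INV',
    ('INV', 'Z(i)'): 'INV', ('INV', 'Z(j)'): 'INV',
    ('RO',  'R(i)'): 'RO',  ('RO',  'W(i)'): 'RW',   # local write claims ownership
    ('RO',  'R(j)'): 'RO',  ('RO',  'W(j)'): 'INV',  # remote write invalidates
    ('RO',  'Z(i)'): 'INV', ('RO',  'Z(j)'): 'RO',
    ('RW',  'R(i)'): 'RW',  ('RW',  'W(i)'): 'RW',
    ('RW',  'R(j)'): 'RO',  ('RW',  'W(j)'): 'INV',  # remote read forces write-back
    ('RW',  'Z(i)'): 'INV', ('RW',  'Z(j)'): 'RW',
}

def run_wb(state, events):
    """Apply a sequence of events to a local cache block; return the final state."""
    for event in events:
        state = WRITE_BACK_FSM[(state, event)]
    return state

# Only one RW owner at a time: a remote read demotes the local copy to RO,
# and a subsequent remote write invalidates it.
print(run_wb('INV', ['W(i)', 'R(j)', 'W(j)']))  # RW -> RO -> INV
```

Note how a remote read of an RW block forces a write-back and a demotion to RO, so that at any moment at most one cache holds the block in the RW state.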
FURTHER READINGS
Cache/memory addressing
Mapping functions
Replacement policies
Directory-based protocol