Caches Design of Parallel and High-Performance Computing Recitation Session


1 Caches
Design of Parallel and High-Performance Computing, Recitation Session
S. Di Girolamo
Slide credits: Pavan Balaji, Torsten Hoefler

2 Idea
It is feasible to have a small amount of fast memory and/or a large amount of slow memory.
We want:
  the size advantage of DRAM
  the speed advantage of SRAM
Speed increases as we get closer to the processor; size increases as we get farther away from it.
The CPU looks in the cache for the data it seeks from main memory. If the data is not there, it retrieves it from main memory.
If the cache is able to service "most" CPU requests, then we effectively get the speed advantage of the cache.
All addresses in the cache are also in main memory.
[Figure: CPU, cache, main memory]

3 Typical Memory Hierarchy
Toward L0: smaller, faster, costlier per byte. Toward L5: larger, slower, cheaper per byte.
L0: registers. CPU registers hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM). The L1 cache holds cache lines retrieved from the L2 cache.
L2: on-chip L2 cache (SRAM). The L2 cache holds cache lines retrieved from main memory.
L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: remote secondary storage (tapes, distributed file systems, Web servers).

4 Direct Mapping
[Figure: direct-mapped lookup. Labels: memory address split into cache tag and cache index; valid bit; stored tag; tag comparison (=); hit; data to CPU.]

5 Why do we need the validity bit?
Assume the computer has just been turned on and every location in the cache is zero. What can go wrong?
The processor emits a 32-bit address to the cache.
[Figure: address fields Tag | Index | Byte Offset (00); each cache line holds Index, Tag, Contents.]
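The question above can be explored with a small direct-mapped lookup in C. This is only a sketch: the 2-bit byte offset comes from the slide, but the 10 index bits, the type `cache_line`, and the functions `addr_index`, `addr_tag` and `cache_access` are assumptions made for illustration. The point it shows is that a zero-filled cache stores tag 0 in every line, so without the `valid` check the first access to any address whose tag is 0 would be reported as a hit on contents that were never loaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry (only the 2-bit offset is on the slide):
 * 4-byte blocks, 1024 lines, 20-bit tag. */
#define OFFSET_BITS 2u
#define INDEX_BITS  10u
#define NUM_LINES   (1u << INDEX_BITS)

typedef struct {
    bool     valid;  /* false until the line is filled for the first time */
    uint32_t tag;    /* upper address bits of the cached block */
} cache_line;

/* Zero-initialised, like a cache right after power-on. */
cache_line cache[NUM_LINES];

/* Middle bits of the address select the cache line. */
uint32_t addr_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
}

/* Remaining upper bits identify which block occupies the line. */
uint32_t addr_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

/* Returns true on a hit. On a miss the line is filled and marked valid. */
bool cache_access(uint32_t addr) {
    cache_line *line = &cache[addr_index(addr)];
    if (line->valid && line->tag == addr_tag(addr)) {
        return true;
    }
    line->valid = true;
    line->tag = addr_tag(addr);
    return false;
}
```

With this layout, address 0x00000010 has tag 0 and index 4; its first access misses only because `valid` is still false.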


6 Fully Associative
A block can go anywhere.
To check if a block is cached, all the tags must be compared!
Increased HW complexity, hence increased cost.
Diminishing returns: the number of conflicts decreases when increasing the cache size.

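7 Set Associative
[Figure: four-way set-associative lookup. Address fields as labelled on the slide: Tag, Index, 10, Byte Offset (2 bits). Each of the four ways holds a tag and a valid bit (V); four comparators (=) produce the hit signal; a 4-to-1 multiplexor selects the data.]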

8 Extremes of set associativity
For a cache with 8 blocks:
Direct mapped (1-way): 1 way, 8 sets
Two-way set associative: 2 ways, 4 sets
Four-way set associative: 4 ways, 2 sets
Fully associative (8-way): 8 ways, 1 set
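The way and set counts above follow from one relation: sets = blocks / ways. A minimal C sketch of that arithmetic is given here; the helper names `num_sets` and `set_of_block` are hypothetical, and mapping a block to a set by taking the block address modulo the number of sets is the usual convention, assumed rather than stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets in a cache with `num_blocks` blocks and `ways` ways. */
size_t num_sets(size_t num_blocks, size_t ways) {
    assert(ways > 0 && num_blocks % ways == 0);
    return num_blocks / ways;
}

/* Set that a block maps to, assuming modulo placement. */
size_t set_of_block(size_t block_addr, size_t num_blocks, size_t ways) {
    return block_addr % num_sets(num_blocks, ways);
}
```

For 8 blocks, `num_sets` gives 8, 4, 2 and 1 sets for 1, 2, 4 and 8 ways. In the fully associative case every block maps to set 0, which is another way of saying that a block can go anywhere.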

9 Quiz
How do caches enable the exploitation of spatial locality?
How can we make sure that a newly allocated memory region starts at a cache-line boundary?

10 How to allocate block-aligned memory?
Synopsis
    #include <stdlib.h>
    int posix_memalign(void **memptr, size_t alignment, size_t size);
Description
The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *). If size is 0, then posix_memalign() returns either NULL, or a unique pointer value that can later be successfully passed to free(3).
Return value
posix_memalign() returns zero on success, or one of the error values listed in the next section on failure. Note that errno is not set.
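As an illustration of the call described above, here is a minimal C sketch that allocates a buffer starting on a cache-line boundary. The 64-byte line size is an assumption (common, but not universal), and the wrappers `alloc_line_aligned` and `is_line_aligned` are hypothetical helpers, not part of any library.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line size in bytes. It is a power of two and a
 * multiple of sizeof(void *), as posix_memalign() requires. */
#define CACHE_LINE_SIZE 64u

/* Returns `size` bytes starting on a cache-line boundary, or NULL on
 * failure. The buffer is released with free(). */
void *alloc_line_aligned(size_t size) {
    void *ptr = NULL;
    /* The error code is the return value; errno is not set. */
    if (posix_memalign(&ptr, CACHE_LINE_SIZE, size) != 0) {
        return NULL;
    }
    return ptr;
}

/* Non-zero if `ptr` is a multiple of the assumed line size. */
int is_line_aligned(const void *ptr) {
    return ((uintptr_t)ptr % CACHE_LINE_SIZE) == 0;
}
```

A caller would write something like `double *a = alloc_line_aligned(n * sizeof *a);`, check the result for NULL, and later call `free(a)`.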

11 The 3 Cs
Compulsory misses: cold start, we don't have a choice.
  Can we reduce the number of compulsory misses? Increasing the block size can reduce the number of distinct blocks that are requested.
Capacity misses: the cache is much smaller than the total addressable memory.
Conflict misses: a collision within a set. The requested block was thrown out when some other block wanted to occupy the same position.
  Can be reduced by increasing associativity (each block has more options).

12 Exercise 1: Blackboard

13 Replacement
How do we choose a victim?
Verbs: victimize, evict, replace, cast out.
Many policies are possible:
  FIFO (first-in-first-out)
  LRU (least recently used), pseudo-LRU
  LFU (least frequently used)
  NMRU (not most recently used)
  NRU (not recently used)
  Pseudo-random (yes, really!)
  Optimal
  Etc.
Optimal [Belady, IBM Systems Journal, 1966]:
  Evict the block with the longest reuse distance, i.e. the block whose next reference is farthest in the future.
  Requires knowledge of the future!
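The optimal policy can only be run offline, on a recorded reference trace. The C sketch below shows the victim choice under that assumption; the function names, the use of `int` block identifiers, and the trace-as-array representation are all hypothetical choices made for illustration.

```c
#include <stddef.h>

/* Position of the first reference to `block` in trace[from..len),
 * or len if the block is never referenced again. */
size_t next_use(const int *trace, size_t len, size_t from, int block) {
    for (size_t i = from; i < len; i++) {
        if (trace[i] == block) {
            return i;
        }
    }
    return len;
}

/* Given the blocks currently held in a set and the position `now` of
 * the missing reference, returns the way whose block is referenced
 * farthest in the future (ties go to the lowest way). */
size_t optimal_victim(const int *set, size_t ways,
                      const int *trace, size_t len, size_t now) {
    size_t victim = 0;
    size_t farthest = 0;
    for (size_t w = 0; w < ways; w++) {
        size_t d = next_use(trace, len, now + 1, set[w]);
        if (d > farthest) {
            farthest = d;
            victim = w;
        }
    }
    return victim;
}
```

For the trace 1, 2, 3, 4, 1, 2, 3 and a set holding blocks 1, 2, 3, the miss on block 4 evicts block 3, because 3 is the one reused last.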


14 Least-Recently Used
For a=2 (a is the associativity), LRU is equivalent to NMRU.
  A single bit per set indicates LRU/MRU.
  Set/clear on each access.
For a>2, LRU is difficult/expensive.
  Timestamps? How many bits? Must find the min timestamp on each eviction.
  Sorted list? Re-sort on every access?
Practical alternative: pseudo-LRU.
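The a=2 case is small enough to sketch in C. The `set2` struct and `set2_access` function are hypothetical names; the single `lru` field plays the role of the per-set bit and is updated on every access, so evicting the LRU way is the same as evicting the way that is not most recently used.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    uint8_t  lru;   /* index (0 or 1) of the least recently used way */
} set2;

/* Accesses `tag` in a 2-way set. Returns true on a hit. On a miss the
 * LRU way is replaced. Either way, the other way becomes the LRU one. */
bool set2_access(set2 *s, uint32_t tag) {
    for (int w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = (uint8_t)(1 - w);
            return true;
        }
    }
    int victim = s->lru;
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->lru = (uint8_t)(1 - victim);
    return false;
}
```

Starting from an empty set, the tag sequence 10, 20, 10, 30 ends with tags 10 and 30 cached: tag 20 was the least recently used when 30 arrived.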

15 Practical Pseudo-LRU In Action
[Figure: pseudo-LRU example over the blocks J, F, C, B, X, Y, A, Z. 011: PLRU block B is here. 110: MRU block is here.]
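The slide does not spell out the algorithm, so the following C sketch assumes the common tree-based pseudo-LRU variant for an 8-way set: 7 bits arranged as a binary tree, and a 3-bit path from the root that names a way, which would match the 3-bit labels 011 and 110 in the figure. The bit convention (0 means the pseudo-LRU way is in the left subtree) and the names `plru_state`, `plru_touch` and `plru_victim` are assumptions.

```c
/* Bits 0..6 hold the tree nodes in heap order (bit 0 is the root,
 * the children of node n are nodes 2n+1 and 2n+2).
 * Bit value 0: pseudo-LRU way is in the left subtree; 1: right. */
typedef unsigned plru_state;

/* Records an access to `way` (0..7): every node on the path from the
 * root to that way is made to point away from it. */
plru_state plru_touch(plru_state st, unsigned way) {
    unsigned node = 0;
    for (int level = 2; level >= 0; level--) {
        unsigned went_right = (way >> level) & 1u;
        if (went_right) {
            st &= ~(1u << node);   /* point to the left subtree */
        } else {
            st |= (1u << node);    /* point to the right subtree */
        }
        node = 2 * node + 1 + went_right;
    }
    return st;
}

/* Follows the node bits from the root to find the way to evict. */
unsigned plru_victim(plru_state st) {
    unsigned node = 0;
    unsigned way = 0;
    for (int level = 0; level < 3; level++) {
        unsigned go_right = (st >> node) & 1u;
        way = (way << 1) | go_right;
        node = 2 * node + 1 + go_right;
    }
    return way;
}
```

Only three bits change per access and no timestamps are compared, which is what makes the scheme practical. The price is that the victim is not always the true LRU way: after touching ways 0 to 7 in order and then ways 0 and 4 again, this sketch picks way 2, whereas exact LRU would pick way 1.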

16 Exercise 2: Blackboard
