Cache Performance (H&P 5.3; 5.5; 5.6)


1 Cache Performance (H&P 5.3; 5.5; 5.6)

Memory system and processor performance:

CPU performance eqn.:
CPU time = IC x CPI x Clock time
CPI = CPI ld/st x (IC ld/st / IC) + CPI others x (IC others / IC)
CPI ld/st = Pipeline time + Average memory access time

Memory performance eqn.:
Avg. mem. time = Hit time + Miss rate x Miss penalty

Improving memory hierarchy performance:
- Decrease hit time
- Decrease miss rate
- Decrease miss penalty
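To make the decomposition concrete, a small worked example in C (the instruction mix and latencies below are invented for illustration, not taken from the slides):

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers: 30% loads/stores, 1-cycle pipeline CPI,
           AMAT of 1.5 cycles for memory instructions, CPI of 1.0 otherwise. */
        double frac_ldst = 0.30, frac_other = 0.70;
        double cpi_ldst  = 1.0 + 1.5;   /* pipeline time + avg. memory access time */
        double cpi_other = 1.0;

        double cpi = cpi_ldst * frac_ldst + cpi_other * frac_other;
        printf("CPI = %.2f\n", cpi);    /* 0.3 x 2.5 + 0.7 x 1.0 = 1.45 */
        return 0;
    }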

2 Reducing Cache Miss Rates

Cache miss classification: the three Cs
- Compulsory misses (or cold misses): the block is accessed for the first time
- Capacity misses: the block is not in the cache because it was evicted when the cache was full
- Conflict misses: the block is not in the cache because it was evicted when its cache set was full

3 Cache Misses vs. Cache Size

[Figure (H&P): miss rate vs. cache size (4KB to 512KB), broken down into conflict, capacity, and cold misses, for a direct-mapped and a 2-way set-associative cache]

- Miss rates are very small in practice
- Miss rates decrease significantly with cache size
- Miss rates decrease with set-associativity because of the reduction in conflict misses

4 Reducing Cold Miss Rates

Technique 1: large block size
- Principle of locality: other data in the block are likely to be used soon
- Reduces the cold miss rate
- May increase conflict and capacity miss rates for the same cache size (fewer blocks in the cache)
- Increases the miss penalty because more data has to be brought in each time
- Uses more memory bandwidth

5 Cache Misses vs. Block Size

[Figure (H&P): miss rate vs. block size (16B to 256B) for several cache sizes (4KB to 256KB)]

- Small caches are very sensitive to block size
- In all cases very large blocks (> 128B) have a worse miss rate

6 Reducing Cold Miss Rates

Technique 2: prefetching
- Idea: bring data or instructions that are likely to be used soon into the cache (or a special buffer) ahead of time
- Reduces cold misses
- Uses more memory bandwidth
- May increase conflict and capacity miss rates (unless a prefetch buffer is used)
- Does not increase the miss penalty (the prefetch is handled after the main cache access is completed)

7 Prefetching

Hardware prefetching: hardware automatically prefetches cache blocks on a cache miss
- No need for extra prefetch instructions in the program
- Effective for regular accesses, such as instruction fetches
- E.g., next-block prefetching, stride prefetching

8 Prefetching

Software prefetching: the compiler inserts instructions at the proper places in the code to prefetch
- Requires new ISA instructions for prefetching (non-binding prefetch)
- Adds instructions to compute the prefetch addresses and to perform the prefetch itself (prefetch overhead)
- E.g., data prefetching in loops, linked-list prefetching

9 Software Prefetching

E.g., prefetching in loops: brings the next required block two iterations ahead of time (assuming each element of x is 4 bytes long and the block has 64 bytes, i.e., 16 elements per block).

Before:
    for (i=0; i<=999; i++) {
        x[i] = x[i] + s;
    }

After:
    for (i=0; i<=999; i++) {
        if (i%16 == 14)
            prefetch(x[i+16]);
        x[i] = x[i] + s;
    }

E.g., linked-list prefetching: brings the next object in the list.

Before:
    while (student) {
        student->mark = rand();
        student = student->next;
    }

After:
    while (student) {
        prefetch(student->next);
        student->mark = rand();
        student = student->next;
    }
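On real hardware, compilers expose non-binding prefetches through intrinsics. A minimal sketch of the loop example above using GCC/Clang's __builtin_prefetch (the intrinsic is real; the 16-element stride assumes the same 4-byte elements and 64-byte blocks as the slide):

    #define N 1000
    #define ELEMS_PER_BLOCK 16   /* 64-byte line / 4-byte element */

    void add_scalar(float *x, float s)
    {
        for (int i = 0; i < N; i++) {
            /* Two iterations before crossing into the next cache line,
               issue a non-binding prefetch for it. Like the slide's code,
               this may prefetch past the end of x; prefetches do not fault. */
            if (i % ELEMS_PER_BLOCK == ELEMS_PER_BLOCK - 2)
                __builtin_prefetch(&x[i + ELEMS_PER_BLOCK], 0 /* read */, 3);
            x[i] = x[i] + s;
        }
    }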

10 Reducing Conflict Miss Rates

Technique 3: high-associativity caches
- More options for block placement, hence fewer conflicts
- Reduces the conflict miss rate
- May increase hit time because the tag match takes longer
- May increase the miss penalty because the replacement policy is more involved

11 Cache Misses vs. Associativity

[Figure: miss rate vs. associativity (1-way, 2-way, 4-way, fully associative) for several cache sizes (16KB, 64KB, 512KB)]

- Small caches are very sensitive to associativity
- In all cases more associativity decreases the miss rate, but there is little difference between 4-way and fully associative

12 Reducing Conflict Miss Rates

Technique 4: compiler optimizations

E.g., merging arrays: improves spatial locality if the fields are used together for the same index

Before:
    int val[size];
    int key[size];

After:
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[size];

E.g., loop fusion: improves temporal locality

Before:
    for (i=0; i<1000; i++)
        A[i] = A[i]+1;
    for (i=0; i<1000; i++)
        B[i] = B[i]+A[i];

After:
    for (i=0; i<1000; i++) {
        A[i] = A[i]+1;
        B[i] = B[i]+A[i];
    }

13 Reducing Conflict Miss Rates

E.g., blocking: change row-major and column-major array access patterns to a block pattern to improve spatial and temporal locality.

Matrix multiplication x = y*z:

    for (i=0; i<5; i++)
        for (j=0; j<5; j++) {
            r = 0;
            for (k=0; k<5; k++)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }

[Figure: elements of x, y, and z touched in the first iterations (i=0,j=0; i=0,j=1; i=1,j=0; in each case k runs over the full range)]

y: poor temporal locality; z: poor spatial and temporal locality

14 Reducing Conflict Miss Rates

Loop blocking or tiling (block factor 2; x must be zero-initialized, since partial products are accumulated across kk blocks):

    for (jj = 0; jj < 5; jj = jj+2)
        for (kk = 0; kk < 5; kk = kk+2)
            for (i = 0; i < 5; i++)
                for (j = jj; j < min(jj+2, 5); j++) {
                    r = 0;
                    for (k = kk; k < min(kk+2, 5); k++)
                        r = r + y[i][k]*z[k][j];
                    x[i][j] = x[i][j] + r;
                }

[Figure: elements of x, y, and z touched in the first blocked iterations (jj=0, kk=0)]

Better temporal locality

15 Cache Performance II

Memory system and processor performance:

CPU performance eqn.:
CPU time = IC x CPI x Clock time

Memory performance eqn.:
Avg. mem. time = Hit time + Miss rate x Miss penalty

Improving memory hierarchy performance:
- Decrease hit time
- Decrease miss rate
- Decrease miss penalty

16 Reducing Cache Miss Penalty

Technique 1: victim caches (can also be considered a miss-rate reduction)
- A very small cache used to capture lines evicted from the cache
- On a cache miss the data may still be found quickly in the victim cache (cache hit time < victim cache hit time < miss penalty)
- The replacement policy is much more involved

[Diagram: the CPU's memory address goes to the L1 cache and the victim cache in parallel; both hold tag/data arrays in front of main memory]

17 Reducing Cache Miss Penalty

Technique 2: giving priority to reads over writes
- The value of a read (load instruction) is likely to be needed soon, while the processor does not wait on a write
- Idea: place write misses in a write buffer, and let read misses overtake the buffered writes
- A read to the same memory address as a pending write in the buffer becomes a hit in the buffer:

    sw r3, 512(r0)    ; 1. write miss goes into the write buffer
    lw r2, 512(r0)    ; 2. read hits in the write buffer and gets the value of the previous write

[Diagram: write buffer holding the entry (address 512, value R[r3])]
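The forwarding check amounts to an associative search of the buffer by address. A toy sketch in C (the data structure and names are illustrative, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr; uint32_t data; bool valid; };
    static struct wb_entry write_buffer[WB_ENTRIES];

    /* On a read miss, search the write buffer before going to memory;
       a matching pending write supplies the data directly. */
    bool wb_forward(uint32_t addr, uint32_t *data)
    {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                *data = write_buffer[i].data;   /* read overtakes the write */
                return true;
            }
        }
        return false;                           /* go to main memory */
    }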

18 Reducing Cache Miss Penalty

Technique 3: early restart and critical word first
- On a read miss the processor needs just the loaded word (or byte) very soon, but normally it has to wait until the whole block has been brought into the cache
- Early restart: as soon as the requested word arrives in the cache, send it to the processor, then continue reading the rest of the block into the cache

[Diagram: lw r2, 3(r0); the block containing address 0x0003 is filled in order, and the requested word is forwarded to the CPU as soon as it arrives]

19 Reducing Cache Miss Penalty

Technique 3: early restart and critical word first (cont.)
- Critical word first: fetch the requested word from memory first, send it to the processor as soon as possible, then continue reading the rest of the block into the cache

[Diagram: lw r2, 3(r0); memory returns the word at address 0x0003 first, followed by the remainder of the cache block]

20 Reducing Cache Miss Penalty

Technique 4: non-blocking (or lockup-free) caches
- Dynamic scheduling (Tomasulo's): ALU instructions can overtake a cache-miss instruction
- Non-blocking caches: other memory instructions can also overtake a cache-miss instruction
- The cache can service multiple hits while waiting on a miss: hit under miss
- More aggressive: the cache can service hits while waiting on multiple misses: miss under miss, or hit under multiple misses
- The cache and memory must be able to service multiple requests concurrently
- Must keep track of multiple outstanding memory operations (see the sketch below): increased hardware complexity
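Outstanding misses are conventionally tracked in miss status holding registers (MSHRs). A minimal sketch of the bookkeeping (the structure and field names are illustrative; real MSHRs also merge secondary misses to the same block):

    #include <stdint.h>
    #include <stdbool.h>

    #define MSHRS 8   /* max outstanding misses */

    struct mshr {
        bool     valid;       /* entry tracks an in-flight miss      */
        uint32_t block_addr;  /* which block is being fetched        */
        uint8_t  dest_reg;    /* where to deliver the data on return */
    };
    static struct mshr mshr_file[MSHRS];

    /* Allocate an MSHR for a new miss; if none is free, the cache
       must stall this request. */
    int mshr_allocate(uint32_t block_addr, uint8_t dest_reg)
    {
        for (int i = 0; i < MSHRS; i++) {
            if (!mshr_file[i].valid) {
                mshr_file[i] = (struct mshr){ true, block_addr, dest_reg };
                return i;     /* miss proceeds; cache keeps servicing hits     */
            }
        }
        return -1;            /* structural stall: too many outstanding misses */
    }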

21 Non-blocking Caches

[Figure (H&P): performance of non-blocking caches as the number of outstanding memory operations grows]

- Significant improvement already from a small number of outstanding memory operations
- Some applications benefit from larger numbers

22 Reducing Cache Miss Penalty

Technique 5: second-level caches (L2)
- The gap between main memory and L1 cache speeds is increasing
- L2 makes main memory appear faster if it captures most of the L1 cache misses
- The L1 miss penalty becomes the L2 hit access time on an L2 hit
- The L1 miss penalty is higher on an L2 miss
- L2 considerations:
  - Misses will be more frequent (L2 only sees accesses that missed in L1)
  - Higher associativity is possible
  - On-chip (512KB - 1MB) or off-chip (1MB - 4MB), with correspondingly longer access times

23 Second Level Caches

Memory subsystem performance:

Avg. mem. time = Hit time L1 + Miss rate L1 x Miss penalty L1
Miss penalty L1 = Hit time L2 + Miss rate L2 x Miss penalty L2
Avg. mem. time = Hit time L1 + Miss rate L1 x (Hit time L2 + Miss rate L2 x Miss penalty L2)

Miss rates:
- Local: the number of misses divided by the number of requests to that cache
  - E.g., Miss rate L1 and Miss rate L2 in the equations above
  - Usually not so small for lower-level caches
- Global: the number of misses divided by the total number of requests from the CPU
  - E.g., L2 global miss rate = Miss rate L1 x Miss rate L2
  - Represents the aggregate effectiveness of the caches combined
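A quick numeric sanity check of the two-level equation (the latency and miss-rate values below are made-up illustrative numbers, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative values: 1-cycle L1 hit, 10-cycle L2 hit,
           100-cycle memory access; 5% L1 and 40% local L2 miss rates. */
        double hit_l1 = 1.0, hit_l2 = 10.0, mem = 100.0;
        double miss_l1 = 0.05, miss_l2_local = 0.40;

        double miss_penalty_l1 = hit_l2 + miss_l2_local * mem;
        double amat = hit_l1 + miss_l1 * miss_penalty_l1;
        double global_l2 = miss_l1 * miss_l2_local;  /* fraction of all CPU accesses */

        printf("L1 miss penalty = %.2f cycles\n", miss_penalty_l1); /* 50.00 */
        printf("AMAT            = %.2f cycles\n", amat);            /* 3.50  */
        printf("L2 global miss  = %.2f%%\n", 100.0 * global_l2);    /* 2.00% */
        return 0;
    }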

24 Cache Misses vs. L2 Size

[Figure (H&P): L2 global miss rate and L2 local miss rate vs. L2 size (4KB to 4MB)]

- L2 caches must be much bigger than L1
- Local miss rates for L2 are larger than for L1 and are not a good measure of overall performance

25 Reducing Cache Hit Time

Technique 1: small and simple caches
- Small caches can be placed on-chip; signals take a long time to go off-chip
- Low-associativity caches have few tags to compare against the requested address
- Direct-mapped caches have only one tag to compare, and the comparison can be done in parallel with the fetch of the data

26 Reducing Cache Hit Time

Technique 2: virtual address caches
- Programs use virtual addresses for data, while main memory uses physical addresses, so addresses from the processor must be translated at some point
- Option 1: physical address caches perform address translation before the cache access
  - Hit time is increased to accommodate the translation

[Diagram: the CPU issues virtual address 0x0003; address translation produces physical address 0x2103, which is then used to access the L1 cache and, on a miss, main memory]

27 Reducing Cache Hit Time

Technique 2: virtual address caches (cont.)
- Option 2: virtual address caches perform address translation after the cache access, only on a miss
  - Hit time does not include the translation

[Diagram: the CPU issues virtual address 0x0003, which indexes the L1 cache directly; on a miss it is translated to physical address 0x2103 before accessing main memory]

28 Reducing Cache Hit Time

Problems of virtual address caches:
- Different programs may use the same virtual addresses for different physical addresses
  - Cache contents must be flushed on every context switch, increasing the miss rate
  - Alternatively, the cache tag must be extended with a process identifier (PID)
- User programs and the OS may use different virtual addresses for the same data: the aliasing problem
  - The same data structure may end up with two copies in the cache

29 Virtual Memory

- Each process would like to see its own, full, address space
- Clearly impossible to provide full physical memory for all processes
- Processes may define a large address space but use only a small part of it at any one time
- Processes would like their memory to be protected from access and modification by other processes
- The operating system needs to be protected from applications
- Each process has its own virtual address space, divided into fixed-size pages
- Virtual pages that are in use get mapped to pages of physical memory; virtual pages not recently used may be stored on disk
- Extends the memory hierarchy out to the swap partition of a disk

30 Virtual and Physical Memory Example

- 4K page size
- Process 1 has pages A, B, C and D; page B is held on disk
- Process 2 has pages X, Y and Z; page Z is held on disk
- Process 1 cannot access pages X, Y, Z; process 2 cannot access pages A, B, C, D
- The O/S can access any page (full privileges)

[Diagram: the virtual address spaces of processes 1 and 2 mapped onto physical memory, with pages B and Z swapped out to the swap disk]

31 Sharing Memory Using Virtual Aliases

- Process 1 and process 2 want to share a page of memory
- Process 1 maps virtual page A to physical page P; process 2 maps virtual page Z to physical page P
- Permissions can vary between the sharing processes
- The O/S can still access any page (full privileges)
- Note: process 1 can also map the same physical page at multiple virtual addresses, i.e., a page aliased within one process

[Diagram: both processes' virtual pages mapping to the same shared physical page P; page Q aliased at two virtual addresses within one process]

32 Typical Virtual Memory Parameters

    parameter        L1 cache          virtual memory
    block/page       16 - 128 bytes    4KB - 64KB
    hit time         1 - 3 cycles      50 - 150 cycles
    miss penalty     8 - 150 cycles    1M - 10M cycles
      access time    6 - 130 cycles    800K - 8M cycles
      transfer time  2 - 20 cycles     200K - 2M cycles
    miss rate        0.1 - 10%         0.00001 - 0.001%
    size             256KB - 1MB       64MB - 16GB

(H&P)

- A virtual memory miss is called a page fault
- Page size is usually fixed, but some systems use variable-size segments

33 Virtual Memory Policies

Block replacement: choosing a page frame to reuse
- Minimize misses (page faults): LRU policy
- Minimize write-backs to disk: give priority to non-modified pages

Write strategy: policy adopted on a write
- Write-through would mean writing the block back to disk whenever the page is updated in main memory: not practical
- A write-back policy is always used (with a dirty/modified bit in the page table)
- Some systems use one dirty bit per block in the page to minimize data written back to disk

Inclusivity:
- Inclusive would mean keeping a copy of all in-use pages on disk as well: too expensive
- Memory and disk are non-inclusive in all systems

34 Virtual Memory Policies

Block placement: location of a page in memory
- More freedom means lower miss rates but higher hit and miss penalties
- Memory access time is already high and the memory miss penalty (disk access time) is huge, so low miss rates matter most
- Full associativity: a virtual page can be located in any page frame
- Important to reduce the time to find a page in memory (hit time)
- To place new pages in memory, the OS maintains a list of free frames
- Block placement may be constrained by the use of translated virtual address bits when indexing the cache (see later)

35 Virtual Memory Policies

Block identification: finding the correct page frame
- Assigning tags to memory page frames and comparing tags is impractical
- The OS maintains a table that maps all virtual pages to page frames: the page table
  - The table is updated with a new mapping every time a virtual page is allocated a page frame
  - The table is accessed on a memory request to translate the virtual to the physical address: inefficient
  - The number of entries in the table is the number of virtual pages: very large (e.g., with 4KB pages, 2^20 = 1M entries for a 32-bit address space and 2^52 entries for a 64-bit address space)
- The page frame number is used to generate the physical address during address translation
- One page table per process
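The entry counts follow directly from (address-space bits minus page-offset bits); a quick check in C:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        unsigned page_bits = 12;                        /* 4KB pages       */
        uint64_t entries32 = 1ULL << (32 - page_bits);  /* 2^20 = 1M       */
        uint64_t entries64 = 1ULL << (64 - page_bits);  /* 2^52 ~ 4.5e15   */

        printf("32-bit VA: %llu entries\n", (unsigned long long)entries32);
        printf("64-bit VA: %llu entries\n", (unsigned long long)entries64);
        return 0;
    }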

36 Page Tables and Address Translation

- The page table contains a translation for all virtual pages
- One page table for each process, and one for the system; a page table address register points to the current table
- Each entry holds a physical page number, a valid bit, and permissions (read, write, execute)
- A bit indicates whether the page is on disk, in which case the physical page number field instead gives its location within the swap file
- Translation: the virtual address is split into a virtual page number and a page offset; the virtual page number indexes the page table, and the physical page number from the entry is combined with the page offset to form the physical address
- The page table can be very large, so it is often itself stored in the virtual memory of the operating system, and large parts may be swapped out
- The CPU needs a cache of recently used page table entries (PTEs)
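A minimal single-level translation sketch following the slide's scheme (a flat table indexed by VPN; the struct and field names are illustrative, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS 12                 /* 4KB pages                 */
    #define VPN_BITS  20                 /* 32-bit virtual addresses  */

    struct pte {
        uint32_t ppn;      /* physical page number (or swap slot if !valid) */
        bool     valid;    /* page resident in memory?                      */
        bool     r, w, x;  /* access permissions                            */
    };
    static struct pte page_table[1u << VPN_BITS];

    /* Translate a virtual address; returns false on a page fault. */
    bool translate(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        struct pte e    = page_table[vpn];

        if (!e.valid)
            return false;                        /* page fault: OS loads page  */
        *paddr = (e.ppn << PAGE_BITS) | offset;  /* combine PPN and page offset */
        return true;
    }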

37 Translation Look-aside Buffers

- Typically a small, fully-associative cache of page table entries (PTEs)
- The tag is given by the VPN for that PTE; the PPN is taken from the PTE
- A valid bit is required; a D (dirty) bit indicates whether the page has been modified
- R, W, X bits indicate read, write and execute permission; permissions are checked on every memory access
- The physical address is formed from the PPN and the page offset
- TLB exceptions: TLB miss (no matching entry), privilege violation
- Often separate TLBs for instruction and data references

[Diagram: the VPN is compared against all TLB tags in parallel; on a hit the PPN is combined with the page offset to form the physical address]
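A sketch of the fully-associative lookup with permission checking (the entry format follows the slide; the sizes and names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64
    #define PAGE_BITS   12

    struct tlb_entry {
        uint32_t vpn, ppn;
        bool valid, dirty, r, w, x;
    };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit with sufficient permission; a miss or
       a privilege violation raises an exception in real hardware. */
    bool tlb_lookup(uint32_t vaddr, bool is_write, uint32_t *paddr)
    {
        uint32_t vpn = vaddr >> PAGE_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++) {      /* parallel in hardware */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                if (is_write && !tlb[i].w)
                    return false;                    /* privilege violation  */
                if (is_write)
                    tlb[i].dirty = true;             /* page now modified    */
                *paddr = (tlb[i].ppn << PAGE_BITS)
                       | (vaddr & ((1u << PAGE_BITS) - 1));
                return true;
            }
        }
        return false;                                /* TLB miss             */
    }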

38 Problems with Virtual Aliases and Caches

Virtually-tagged data cache problems:
- Page aliases appear to be at different addresses
- Two copies could exist in the same data cache
- Writing to copy 1 would not be reflected in copy 2; reading copy 2 would get stale data
- Does not provide a coherent view of memory

Solution: use physical address tags
- Aliases have the same physical address, therefore the same tag
- Only one copy exists in each cache

Implications for CPU-cache interactions:
- Must translate addresses before the cache tag check
- May still be able to index the cache using non-translated low-order address bits under certain circumstances

39 VI-PT: Translating in Parallel with L1-$ Access

- If translation takes place before the L1-$ access, hit time will increase
- The TLB and L1-$ are often arranged to allow parallel TLB and L1-$ access
- Requires that the L1-$ index can be obtained from the non-translated bits of the virtual address; this places a limit of one page on the capacity of one way of the cache

[Diagram: the page offset supplies the cache index and block offset while the TLB translates the VPN; the tag comparison then uses the translated physical page number. Example: 4KB direct-mapped cache with 32-byte lines]

IMPORTANT: if the cache index extends beyond bit 11, into the translated part of the address, then translation must take place before the cache can be indexed
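For the 4KB direct-mapped, 32-byte-line example, the index and offset fall entirely within the 12 untranslated page-offset bits; a sketch of the bit slicing:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS  5    /* 32-byte lines        */
    #define INDEX_BITS 7    /* 4KB / 32B = 128 sets */
    #define PAGE_BITS  12   /* 4KB pages            */

    int main(void)
    {
        uint32_t vaddr = 0x00001ABCu;

        /* Index and block offset come from bits [11:0], the page offset,
           so the cache can be indexed before (or during) translation. */
        uint32_t offset = vaddr & ((1u << LINE_BITS) - 1);
        uint32_t index  = (vaddr >> LINE_BITS) & ((1u << INDEX_BITS) - 1);

        /* The tag must come from the *physical* address, i.e. the PPN
           delivered by the TLB in parallel with the cache read. */
        printf("index=%u offset=%u (both within the %d page-offset bits)\n",
               index, offset, PAGE_BITS);
        return 0;
    }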

40 Coping with Large VI-PT Caches

- Rely on the page allocator in the O/S to allocate pages such that the translation of the index bits is always an identity relation
- Hence, if virtual address A translates to physical address P, the page allocator must guarantee that: V[11] == P[11]
- Any translated bit used to index the cache must be identical in both the virtual and physical addresses

[Diagram: cache addressing (tag | index | offset) aligned against virtual addressing (virtual page number | page offset), showing the index bits that overlap the virtual page number]
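This constraint is commonly called page coloring: the allocator only maps a virtual page to a physical frame of the same "color", i.e., with matching values in the overlapping index bits. A hedged sketch of the allocator's check (the bit position is an assumption for illustration: a single index bit, bit 12, spilling into the translated region):

    #include <stdint.h>
    #include <stdbool.h>

    /* Index bits that fall inside the translated part of the address.
       Assumption for illustration: one overlapping bit, bit 12. */
    #define COLOR_MASK (1u << 12)

    /* The O/S page allocator may only pair a virtual page with a
       physical frame whose overlapping index bits match. */
    bool color_ok(uint32_t vaddr, uint32_t paddr)
    {
        return (vaddr & COLOR_MASK) == (paddr & COLOR_MASK);
    }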

41 Putting It Together: TLBs in the Pipeline

- Two TLBs, one for instructions and one for data, located in the IF and MEM stages respectively
- Each may generate TLB exceptions (effectively interrupts)
- TLB exceptions must be restartable (kill the instruction, load the TLB entry, restart the instruction)
- The tag check now involves the translated address, and can be delayed to the next stage

[Diagram: the five-stage pipeline (IF, DEC, EX, MEM, WB) with the instruction TLB between the PC and the L1 I-cache, and the data TLB between the ALU and the L1 D-cache; each cache's tag check happens one stage after the cache access]

42 When to Perform Address Translation?

VI-VT: virtually indexed, virtually tagged
- L1-$ indexed with the virtual address, before translation; the tag contains the virtual address
- Con: cannot distinguish virtual aliases (synonyms) in the cache
- Pro: only performs a TLB lookup on an L1-$ miss

VI-PT: virtually indexed, physically tagged
- L1-$ indexed with the virtual address, or often just the un-translated bits
- Translation must take place before the tag can be checked
- Con: translation must take place on every L1-$ access
- Pro: no aliases in the cache; works with cache-coherent shared memory

PI-PT: physically indexed, physically tagged
- Translation first; then cache access
- Con: translation occurs in sequence with the L1-$ access: high latency

PI-VT: physically indexed, virtually tagged
- Not interesting

43 Cache Performance Techniques

    technique                 miss rate   miss penalty   hit time
    large block size              +            -
    high associativity            +            -             -
    victim cache                  +            +
    hardware prefetch             +
    compiler prefetch             +
    compiler optimizations        +
    prioritisation of reads                    +
    critical word first                        +
    nonblocking caches                         +
    L2 caches                                  +
    small and simple caches       -                          +
    virtual caches                                           +

    (+ = improves, - = may hurt, per the techniques discussed above; all add some hardware or software complexity)
