CS/ECE 250 Computer Architecture


Computer Architecture Caches and Memory Hierarchies Benjamin Lee Duke University Some slides derived from work by Amir Roth (Penn), Alvin Lebeck (Duke), Dan Sorin (Duke) 2013 Alvin R. Lebeck from Roth and Sorin 1

Administrivia
HW #4 due Sunday, March 30 at 11:55pm.
Midterm #2 Tuesday, Apr 1 in class. Covers logic design, datapath and control (HW #3 & #4). Does not include the memory hierarchy. No assembly programming.
Two more homeworks: HW #5 due Friday, Apr 11 (memory hierarchy); HW #6 due Wednesday, Apr 23 (exceptions, I/O, pipelining).
Reading: Chapter 5 in Patterson and Hennessy.
2013 Alvin R. Lebeck from Roth and Sorin 2

Full-Associativity (1024 Entries)
How to implement full (or at least high) associativity? Doing it this way is terribly inefficient: 1K tag matches are unavoidable, but 1K data reads + a 1K-to-1 mux?
[figure: 1024-entry fully associative lookup, one tag comparator per entry feeding the data mux]
2013 Alvin R. Lebeck from Roth and Sorin 3

Full-Associativity with CAMs
CAM: content-addressable memory. An array of words with built-in comparators; matchlines instead of bitlines; the output is a one-hot (unary) encoding of the match.
Fully associative cache? Tags as CAM, data as RAM.
[figure: 1024-entry CAM tag array selecting one entry of the data RAM]
2013 Alvin R. Lebeck from Roth and Sorin 4
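
To make the CAM idea concrete, here is a small software model (not the hardware, and not from the slides): it performs the same tag matching a CAM does, except that the loop below runs sequentially where a real CAM compares every entry in parallel. The names (fa_cache_t, fa_lookup, ENTRIES) are illustrative.

#include <stdint.h>

#define ENTRIES 1024

typedef struct {
    uint32_t tag[ENTRIES];   /* CAM portion: tags, compared against the search key */
    uint32_t data[ENTRIES];  /* RAM portion: only the matching entry is read */
    uint8_t  valid[ENTRIES];
} fa_cache_t;

/* Returns 1 and sets *out on a hit, 0 on a miss. */
int fa_lookup(const fa_cache_t *c, uint32_t tag, uint32_t *out) {
    for (int i = 0; i < ENTRIES; i++) {        /* hardware does all comparisons at once */
        if (c->valid[i] && c->tag[i] == tag) {
            *out = c->data[i];                 /* one-hot match selects one data word */
            return 1;
        }
    }
    return 0;
}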

Analyzing Cache Misses: 3C Model
Divide cache misses into three categories.
Compulsory: miss because the cache has not previously seen the address. Easy to identify.
Capacity: miss caused because the cache is too small. If N is the number of blocks in the cache, consecutive accesses to the block are separated by at least N other distinct blocks.
Conflict: miss caused because cache associativity is too low. All other misses.
2013 Alvin R. Lebeck from Roth and Sorin 5
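
The three categories can be tied to concrete access patterns. The sketch below is hypothetical: it assumes, purely for illustration, a 32KB direct-mapped cache with 64B blocks, and the buffer sizes and strides are chosen to trigger each kind of miss.

#include <stdlib.h>

#define KB 1024

void three_c_examples(void) {
    char *big = calloc(1, 1024 * KB);                  /* 1MB buffer */
    if (!big) return;

    /* Compulsory: the first pass touches every block for the first time. */
    for (int i = 0; i < 1024 * KB; i += 64) big[i] += 1;

    /* Capacity: 1MB >> 32KB, so by the time we re-scan, earlier blocks have
       been evicted even by a fully associative cache of the same size. */
    for (int i = 0; i < 1024 * KB; i += 64) big[i] += 1;

    /* Conflict: two small regions whose addresses differ by exactly 32KB map to
       the same sets of a 32KB direct-mapped cache and keep evicting each other,
       even though together they would easily fit with more associativity. */
    char *a = big, *b = big + 32 * KB;
    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < 4 * KB; i += 64) { a[i] += 1; b[i] += 1; }

    free(big);
}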

ABCs of Caches
Associativity (increase): + decreases conflict misses; - increases t_hit.
Block size (increase): + decreases compulsory misses; - increases conflict misses; ± increases or decreases capacity misses; negligible effect on t_hit.
Capacity (increase): + decreases capacity misses; - increases t_hit.
2013 Alvin R. Lebeck from Roth and Sorin 6

Two Possible Optimizations Victim buffer: for conflict misses Prefetching: for capacity/compulsory misses 2013 Alvin R. Lebeck from Roth and Sorin 7

Victim Buffer
Conflict misses: insufficient associativity. High associativity is expensive, but also rarely needed (e.g., 3 blocks mapped to a 2-way set and accessed sequentially).
Victim buffer (VB): a small fully associative cache (e.g., 4 entries) that sits on the I$/D$ fill path. The VB is small → very fast. Blocks kicked out of the I$/D$ are placed in the VB. On a miss, check the VB; if the VB hits, return the block to the I$/D$.
Acts like 4 extra ways shared among all sets. + Only a few sets will need it at any given time. + Very effective in practice.
[figure: VB sits between the I$/D$ and the L2]
2013 Alvin R. Lebeck from Roth and Sorin 8
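
As an illustration of why a tiny victim buffer recovers most conflict misses, here is a minimal simulation sketch. The parameters (a 4-set direct-mapped cache, a 4-entry FIFO victim buffer, and the particular access stream) are made up for the example and are not from the slides.

#include <stdio.h>
#include <string.h>

#define NSETS 4          /* deliberately tiny direct-mapped cache */
#define VB_ENTRIES 4

static long cache[NSETS];     /* tag stored per set, -1 = invalid */
static long vb[VB_ENTRIES];   /* victim buffer holds full block addresses, -1 = invalid */
static int  vb_next = 0;      /* FIFO replacement pointer */

/* Returns 1 on a hit (in the cache or the VB), 0 on a miss. */
int access_block(long block_addr) {
    int set = (int)(block_addr % NSETS);
    long tag = block_addr / NSETS;

    if (cache[set] == tag) return 1;               /* cache hit */

    for (int i = 0; i < VB_ENTRIES; i++) {
        if (vb[i] == block_addr) {                 /* VB hit: swap block back into cache */
            vb[i] = (cache[set] >= 0) ? cache[set] * NSETS + set : -1;
            cache[set] = tag;
            return 1;
        }
    }

    if (cache[set] >= 0) {                         /* miss: evicted block goes to the VB */
        vb[vb_next] = cache[set] * NSETS + set;
        vb_next = (vb_next + 1) % VB_ENTRIES;
    }
    cache[set] = tag;
    return 0;
}

int main(void) {
    memset(cache, -1, sizeof cache);
    memset(vb, -1, sizeof vb);
    /* three blocks that all map to set 0, accessed round-robin: a plain
       direct-mapped cache would miss on every access after the first pass,
       while the victim buffer turns most of these into hits */
    long stream[] = {0, 4, 8, 0, 4, 8, 0, 4, 8};
    int hits = 0;
    for (int i = 0; i < 9; i++) hits += access_block(stream[i]);
    printf("hits = %d of 9\n", hits);
    return 0;
}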

Prefetching
Prefetching: put blocks in the cache proactively/speculatively. Anticipate upcoming miss addresses accurately; prediction can be in software or hardware.
Simple example: next-block prefetching. Miss on address X → anticipate a miss on X+blocksize. Works for instructions (sequential execution) and for data (arrays).
Timeliness: initiate prefetches sufficiently in advance. Accuracy: prefetch useful data, do not evict useful data.
[figure: prefetch logic sits between the I$/D$ and the L2]
2013 Alvin R. Lebeck from Roth and Sorin 9
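
Hardware next-block prefetchers need no program changes, but the same idea can be sketched in software. The fragment below assumes GCC or Clang, whose __builtin_prefetch intrinsic hints the cache to start fetching a block ahead of use; the prefetch distance of 8 elements is an arbitrary tuning assumption.

/* Sum an array while prefetching a few cache blocks ahead. */
void sum_with_prefetch(const double *a, int n, double *out) {
    const int DIST = 8;                 /* prefetch distance (assumed, needs tuning) */
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal reuse */);
        s += a[i];
    }
    *out = s;
}

Whether this helps depends on how well the hardware prefetcher already handles the pattern; for simple sequential streams it often already does.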

Cache Writes So far we have looked at reading from cache (loads) What about writing into cache (stores)? Several new issues arise during cache writes Tag/data access Write-through vs. write-back Write-allocate vs. write-not-allocate 2013 Alvin R. Lebeck from Roth and Sorin 10

Tag/Data Access
Reads: read tag and data in parallel. Tag mismatch → data is garbage (OK).
Writes: read tag and write data in parallel? Tag mismatch → data is lost. For a set-associative cache, which way is written?
Writes are therefore pipelined as a 2-cycle process. Cycle 1: match tag. Cycle 2: write to the matching way.
[figure: tag/data arrays indexed by address bits [10:2], tag compare against bits [31:11]]
2013 Alvin R. Lebeck from Roth and Sorin 11

Tag/Data Access
Cycle 1: check tag. Hit? Write data next cycle. Miss? We'll get to this in a few slides.
[figure: same tag/data array datapath]
2013 Alvin R. Lebeck from Roth and Sorin 12

Tag/Data Access
Cycle 2: write data.
[figure: same tag/data array datapath]
2013 Alvin R. Lebeck from Roth and Sorin 13

Write-Through vs. Write-Back
When to propagate a new value to (lower level) memory?
Write-through: immediately. + Conceptually simpler. + Uniform read miss latency. - Requires additional bus bandwidth.
Write-back: when the block is evicted and replaced. - Requires an additional dirty bit per block. + Minimal bus bandwidth: only write back dirty blocks. - Non-uniform read miss latency: a clean miss is one transaction (fill), a dirty miss is two transactions (writeback and fill).
2013 Alvin R. Lebeck from Roth and Sorin 14

Write-Allocate vs. Write-Non-Allocate
What to do on a write miss?
Write-allocate: read the block from the lower level, then write the value into it. + Decreases read misses. - Requires additional bandwidth. Typically used with write-back.
Write-non-allocate: just write to the next level. - Potentially more read misses. + Uses less bandwidth. Typically used with write-through.
2013 Alvin R. Lebeck from Roth and Sorin 15
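
A rough software sketch of how these policies show up as cache-line bookkeeping, assuming write-back with write-allocate; the line_t layout and the next_level_* helper names are invented for the example, and a write-through design would instead forward every store to the next level and need no dirty bit.

#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64

typedef struct {
    uint32_t tag;
    int      valid;
    int      dirty;                 /* extra state only a write-back cache needs */
    uint8_t  data[BLOCK_BYTES];
} line_t;

/* stand-ins for the next level of the hierarchy (assumed helpers) */
static void next_level_write(uint32_t tag, const uint8_t *data) { (void)tag; (void)data; }
static void next_level_read(uint32_t tag, uint8_t *data) { (void)tag; memset(data, 0, BLOCK_BYTES); }

void store_byte(line_t *line, uint32_t tag, int offset, uint8_t value) {
    if (!(line->valid && line->tag == tag)) {            /* write miss: allocate */
        if (line->valid && line->dirty)
            next_level_write(line->tag, line->data);     /* write back the dirty victim */
        next_level_read(tag, line->data);                /* fill the block */
        line->tag = tag; line->valid = 1; line->dirty = 0;
    }
    line->data[offset] = value;
    line->dirty = 1;      /* write-through would call next_level_write here instead */
}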

Write Buffer
A write buffer sits between the cache and memory.
Write-through cache? Helps with store misses. + Write to the buffer to avoid waiting for memory; store misses become store hits.
Write-back cache? Helps with dirty misses. + Allows you to do the read first: 1. Write the dirty block to the buffer. 2. Read the new block from memory into the cache. 3. Write the buffer contents to memory.
[figure: write buffer between the cache and the next level, with steps 1-3 labeled]
2013 Alvin R. Lebeck from Roth and Sorin 16

Typical Processor Cache Hierarchy
First-level caches: optimized for t_hit and parallel access. Insns and data in separate caches (I$, D$). Capacity: 8-64KB, block size: 16-64B, associativity: 1-4. Other: write-through or write-back. t_hit: 1-4 cycles.
Second-level cache (L2): optimized for %miss. Insns and data in one cache for better utilization. Capacity: 128KB-1MB, block size: 64-256B, associativity: 4-16. Other: write-back. t_hit: 10-20 cycles.
Third-level caches (L3): also optimized for %miss. Capacity: 1-8MB. t_hit: 30 cycles.
2013 Alvin R. Lebeck from Roth and Sorin 17

Performance Calculation Example
Parameters: reference (address) stream is 20% stores, 80% loads. L1 D$: t_hit = 1ns, %miss = 5%, write-through + write-buffer. L2: t_hit = 10ns, %miss = 20%, write-back, 50% dirty blocks. Main memory: t_hit = 50ns, %miss = 0%.
What is t_avg(L1 D$) without an L2? Write-through and write-buffer mean all stores hit, so t_miss(L1 D$) = t_hit(M).
t_avg(L1 D$) = t_hit(L1 D$) + %loads * %miss(L1 D$) * t_hit(M) = 1ns + (0.8 * 0.05 * 50ns) = 3ns
What is t_avg(D$) with an L2? Write-back means dirty misses incur double cost (writeback + fill), so t_miss(L1 D$) = t_avg(L2).
t_avg(L2) = t_hit(L2) + (1 + %dirty) * %miss(L2) * t_hit(M) = 10ns + (1.5 * 0.2 * 50ns) = 25ns
t_avg(L1 D$) = t_hit(L1 D$) + %loads * %miss(L1 D$) * t_avg(L2) = 1ns + (0.8 * 0.05 * 25ns) = 2ns
2013 Alvin R. Lebeck from Roth and Sorin 18
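
The slide's arithmetic is easy to double-check mechanically; the short program below just re-evaluates the two formulas with the slide's parameters (C is used only as a calculator here).

#include <stdio.h>

int main(void) {
    double t_hit_L1 = 1.0, miss_L1 = 0.05, loads = 0.80;   /* ns and rates from the slide */
    double t_hit_L2 = 10.0, miss_L2 = 0.20, dirty = 0.50;
    double t_mem = 50.0;

    /* without L2: write-through + write buffer, so only load misses stall */
    double t_avg_no_L2 = t_hit_L1 + loads * miss_L1 * t_mem;

    /* with L2: write-back, so a dirty miss costs a writeback plus a fill */
    double t_avg_L2 = t_hit_L2 + (1.0 + dirty) * miss_L2 * t_mem;
    double t_avg_with_L2 = t_hit_L1 + loads * miss_L1 * t_avg_L2;

    printf("t_avg without L2 = %.1f ns\n", t_avg_no_L2);    /* 3.0 ns */
    printf("t_avg(L2)        = %.1f ns\n", t_avg_L2);       /* 25.0 ns */
    printf("t_avg with L2    = %.1f ns\n", t_avg_with_L2);  /* 2.0 ns */
    return 0;
}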

Cache Organization Summary
Average access time of a memory component: t_avg = t_hit + %miss * t_miss. Hard to get low t_hit and low %miss in one structure → hierarchy.
Memory hierarchy: cache (SRAM) → memory (DRAM) → swap (disk). Smaller, faster, more expensive → bigger, slower, cheaper.
SRAM: analog technology for implementing big storage arrays. Cross-coupled inverters + bitlines + wordlines. Delay ~ #bits * #ports.
2013 Alvin R. Lebeck from Roth and Sorin 19

Summary, cont'd
Cache ABCs: capacity, associativity, block size. 3C miss model: compulsory, capacity, conflict.
Some optimizations: victim buffer for conflict misses; prefetching for capacity and compulsory misses.
Write issues: pipelined tag/data access; write-back vs. write-through; write-allocate vs. write-no-allocate; write buffer.
Next: Your Programs and Caches.
2013 Alvin R. Lebeck from Roth and Sorin 20

Cache Performance
Memory stall cycles = number of cycles we stall waiting for memory operations.
Execution time = (Core execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Example: assume every instruction takes 1 cycle, miss penalty = 20 cycles, miss rate = 10%, 1000 total instructions, 300 memory accesses. Memory stall cycles? CPU clocks?
CPS 104

Cache Performance
Memory stall cycles = 300 * 0.10 * 20 = 600
Core cycles = 1000 + 600 = 1600, i.e., 60% slower because of cache misses!
Change the miss penalty to 100 cycles: core cycles = 1000 + 3000 = 4000 cycles
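
The same numbers plugged into the execution-time formula, again using C purely as a calculator:

#include <stdio.h>

int main(void) {
    long insns = 1000, accesses = 300, penalty = 20;
    double miss_rate = 0.10;

    double stall  = accesses * miss_rate * penalty;          /* 600 cycles */
    double cycles = insns * 1.0 /* CPI = 1 */ + stall;       /* 1600 cycles */
    printf("stall = %.0f, total = %.0f cycles\n", stall, cycles);

    penalty = 100;                                           /* slower memory */
    printf("total = %.0f cycles\n", insns + accesses * miss_rate * penalty);  /* 4000 */
    return 0;
}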

Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache. CPS 104

Reducing Misses (The 3 Cs)
Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses.
CPS 104

Cache Performance Your program and caches Can you affect performance? Think about 3Cs CPS 104

Mapping Arrays to Memory
Row-major: consecutive memory locations fill the rows of a 5x5 array in order: 0 1 2 3 4 / 5 6 7 8 9 / 10 11 12 13 14 / 15 16 17 18 19 / 20 21 22 23 24.
Column-major: the same array elements sit at memory locations 0 5 10 15 20 / 1 6 11 16 21 / 2 7 12 17 22 / 3 8 13 18 23 / 4 9 14 19 24.
Only part of a row maps into the cache at a time.
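
The difference between the two layouts is just the address arithmetic for element (i, j); the two helper functions below (generic C, not from the slides) show why walking j is a unit-stride scan in one layout and a large-stride scan in the other.

#include <stddef.h>

/* element (i, j) of an R x C matrix of doubles stored at base */
double *rm_elem(double *base, size_t i, size_t j, size_t R, size_t C) {
    (void)R;
    return base + i * C + j;   /* row-major: incrementing j gives stride-1 accesses */
}

double *cm_elem(double *base, size_t i, size_t j, size_t R, size_t C) {
    (void)C;
    return base + j * R + i;   /* column-major: incrementing j jumps R elements each time */
}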

Array Mapping and Cache Behavior
Elements are spread out in memory because of column-major mapping. The fixed mapping into the cache causes self-interference in the cache.
[figure: memory layout and its mapping into the cache]

Data Cache Performance
Instruction sequencing:
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine two independent loops that have the same looping structure and overlapping variables.
Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down entire columns or rows.
Data layout:
Merging arrays: improve spatial locality by using a single array of compound elements vs. two separate arrays.
Nonlinear array layout: map two-dimensional arrays to the linear address space.
Pointer-based data structures: node allocation.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Matrix x is stored in row-major format (i.e., each row is sequential in memory). Interchange produces sequential accesses instead of 100-word strides through memory.
CPS 104

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Baseline incurs two misses when accessing matrices a and c. Fusion incurs only one miss when accessing matrices a and c.
CPS 104

Naïve Matrix Multiply

/* Before */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C[i][j] = C[i][j] + A[i][k]*B[k][j];

Misses depend on n and cache size.
CPS 104

Naïve Matrix Multiply {implements C = C + A*B}

for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Naïve Matrix Multiply
Number of slow memory references on unblocked matrix multiply:
m = n^3 (read each column of B n times)
  + n^2 (read each row of A once)
  + 2n^2 (read and write each element of C once)
  = n^3 + 3n^2
[figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Blocking (Tiling) Example

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      c[i][j] = c[i][j] + a[i][k]*b[k][j];

The two inner loops read all NxN elements of b[][], read N elements of one row of a[][] repeatedly, and write N elements of one row of c[][].
Capacity misses depend on N and cache size: if all 3 NxN matrices fit in the cache, there are no capacity misses; otherwise...
The idea is to compute on a BxB submatrix that fits in the cache.
CPS 104

Blocked (Tiled) Matrix Multiply

/* After */
for (ii = 0; ii < n; ii += B)
  for (jj = 0; jj < n; jj += B)
    for (kk = 0; kk < n; kk += B)
      for (i = ii; i < MIN(ii+B, n); i++)
        for (j = jj; j < MIN(jj+B, n); j++)
          for (k = kk; k < MIN(kk+B, n); k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];

B is called the blocking factor or tile size.
CPS 104

Blocked (Tiled) Matrix Multiply
Consider A, B, C to be N by N matrices of b by b sub-blocks, where b = n/N is called the block size.

for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) on blocks]

Blocked (Tiled) Matrix Multiply (animation): block C(1,1) is updated by accumulating A(1,1)*B(1,1), then A(1,2)*B(2,1), then A(1,3)*B(3,1); the algorithm then moves on to C(1,2), accumulating A(1,1)*B(1,2), A(1,2)*B(2,2), A(1,3)*B(3,2), and so on.

Blocked (Tiled) Matrix Multiply
Let m be the amount of memory traffic between slow and fast memory. The matrix has nxn elements and NxN blocks, each of size bxb, where b = n/N.
m = N*n^2 (read each block of B N^3 times: N^3 * (n/N) * (n/N))
  + N*n^2 (read each block of A N^3 times)
  + 2n^2 (read and write each block of C once)
  = (2N + 2) * n^2
  = 2(n/b + 1) * n^2
  ≈ 2n^3/b + 2n^2
Compare to naïve matrix multiply: n^3 + 3n^2. So we can improve performance by increasing the block size b.
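
Plugging a sample matrix size into the two formulas shows how quickly blocking cuts slow-memory traffic; the matrix dimension and tile sizes below are arbitrary assumptions for illustration.

#include <stdio.h>

int main(void) {
    double n = 1024.0;                      /* matrix dimension (assumed) */
    double naive = n*n*n + 3.0*n*n;         /* unblocked: n^3 + 3n^2 words */
    for (double b = 8; b <= 64; b *= 2) {   /* candidate tile sizes */
        double blocked = 2.0*n*n*n/b + 2.0*n*n;
        printf("b = %2.0f: blocked/naive traffic = %.3f\n", b, blocked / naive);
    }
    return 0;
}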

Reducing Conflict Misses by Blocking
[figure: conflict miss rate vs. blocking factor (0-150) for a direct-mapped cache and a fully associative cache]
Conflict misses in caches that are not fully associative vs. blocking size: Lam et al. [1991] found that a blocking factor of 24 had one-fifth the misses of a blocking factor of 48, even though both fit in the cache.
CPS 104

Data Layout Optimizations
Changes in program control affect the order in which memory is accessed; changes in data layout affect how data structures map to memory locations.

Merging Arrays Example

/* Before */
int val[size];
int key[size];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[size];

Reducing conflicts between val & key.
CPS 104

Layout and Cache Behavior
Tile elements are spread out in memory because of column-major mapping. The fixed mapping into the cache causes self-interference in the cache. Each cache block holds two elements.
[figure: memory layout and its mapping into the cache]

Making Tiles Contiguous
With a recursive layout, the elements of a quadrant are contiguous and the elements of a tile are contiguous, so there is no self-interference in the cache.
[figure: recursive memory layout and its mapping into the cache]

Pointer-based Data Structures
Linked lists, binary trees: group linked elements close together in memory. This needs a relatively static traversal pattern, or it could be done during garbage collection/compaction.
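
One common way to group linked elements close together is to allocate nodes out of a contiguous pool rather than with individual malloc calls. The sketch below (hypothetical types node_pool_t, pool_init, pool_alloc, not from the slides) shows the idea; neighbouring list nodes then share cache blocks during traversal.

#include <stdlib.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

typedef struct {
    node_t *slab;      /* one contiguous block holds every node */
    size_t  used, cap;
} node_pool_t;

int pool_init(node_pool_t *p, size_t cap) {
    p->slab = malloc(cap * sizeof(node_t));
    p->used = 0; p->cap = cap;
    return p->slab != NULL;
}

node_t *pool_alloc(node_pool_t *p) {
    if (p->used == p->cap) return NULL;      /* sketch: no growth logic */
    return &p->slab[p->used++];              /* neighbours share cache blocks */
}

node_t *build_list(node_pool_t *p, const int *vals, size_t n) {
    node_t *head = NULL, *tail = NULL;
    for (size_t i = 0; i < n; i++) {
        node_t *nd = pool_alloc(p);
        if (!nd) break;
        nd->value = vals[i];
        nd->next = NULL;
        if (tail) tail->next = nd; else head = nd;
        tail = nd;
    }
    return head;       /* traversal now touches consecutive cache blocks */
}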

Summary of Program Optimizations to Reduce Cache Misses
[figure: bar chart of performance improvement (roughly 1x to 3x) on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress, broken down by merged arrays, loop interchange, loop fusion, and blocking]
CPS 104

Reducing I-Cache Misses by Compiler Optimizations
Instructions: reorder procedures in memory to reduce misses, using profiling to look at conflicts. McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks.
CPS 104

Summary
Cost-effective memory hierarchy: works by exploiting temporal and spatial locality. Associativity, block size, capacity (ABCs of caches).
Know how a cache works: break the address into tag, index, and block offset. Know how to draw a block diagram of a cache.
Know CPU cycles/time and memory stall cycles. Know programs and cache performance.
CPS 104