Memory Hierarchies 2009 DAT105
1 Memory Hierarchies
- Cache performance issues (5.1)
- Virtual memory (C.4)
- Cache performance improvement techniques (5.2): hit-time, miss-rate, and miss-penalty improvement techniques
2 Today's Topics
- Background: repetition of lecture 2: cache memories (C.1); a brief look at virtual memory (C.4)
- Improvement techniques: how cache performance affects processor performance [ ]; a set of techniques that each address these main factors [C.3, 5.2]
3 Why is Caching Important?
- 1980: no cache in microprocessors
- Today: multiple levels of cache on a microprocessor chip
- The processor/memory speed gap is increasing
4 Levels of the Memory Hierarchy (faster and smaller at the top, larger and slower at the bottom)
- Registers: 100s of bytes, 0.25 ns; holds instruction operands (1-8 bytes); managed by the compiler/program
- Cache memory: ~100 KB, 0.5-5 ns; holds blocks; managed by the cache controller
- Main memory: 1-10 GBytes, 50 (2.5) ns; holds pages (4-16 Kbytes); managed by the OS
- Magnetic disks: 100s of GBytes, 1-10 ms; holds files (Mbytes)
- Tape: sec-min access times
5 Cache Memory Implementation
- The address (bits 31..0) is split into Tag, Index, and Offset fields; the offset selects the word and byte within the block (block = 4 words)
- The index selects a cache entry; a hit requires the Valid bit set and the stored tag equal to the address tag, and the selected data word is delivered
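The field extraction above can be sketched in C. The block size (16 bytes = 4 words) follows the slide; the 1024-set direct-mapped cache size is an illustrative assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the tag/index/offset split for a direct-mapped cache.
 * 16-byte blocks (4 words) as on the slide; 1024 sets is assumed. */
enum { BLOCK_BYTES = 16, NUM_SETS = 1024 };
enum { OFFSET_BITS = 4, INDEX_BITS = 10 };   /* log2 of the above */

static uint32_t offset_of(uint32_t addr) { return addr & (BLOCK_BYTES - 1); }
static uint32_t index_of(uint32_t addr)  { return (addr >> OFFSET_BITS) & (NUM_SETS - 1); }
static uint32_t tag_of(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

Two addresses whose index bits agree but whose tags differ compete for the same cache entry, which is exactly the conflict situation of the next slide.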
6 Direct Mapped Caches
- Many memory blocks map to (compete for) the same cache location
- Tag check and data read (and bus transmission) can proceed in parallel! (improves hit time)
7 Two-way Set Associative Cache
- The address is split into Tag, Index/Set, and Offset (word, byte) fields
- The index selects one entry in each of the two ways; both valid bits and tags are checked in parallel, and a hit in either way selects the data word
8 Virtual Memory
- Main memory is shared and uses a physical address space
- Each program (A, B) uses its own virtual address space, from address 0 to Addr Max
9 Virtual Memory (VM)
- Virtual addresses get translated to physical addresses
- Different translations for different programs: the pages of programs A and B end up interleaved in main memory
10 Virtual Address Translation
- The block is called a page; typically 4-16 KB
- Virtual (program) address = virtual page number + page offset
- The page table (PT), stored in main memory, maps the virtual page number to a physical page location; each entry carries V, D, R status bits and r/w/x protection bits
- Page fault: if the page is not in memory
- Physical (main memory) address = physical page number + page offset
11 The Page Table (PT)
- Virtual-to-physical address translation
- PT_Basereg: CPU register pointing to the current PT (quick process switches)
- Includes protection bits per page (r, w, x)
- Includes status bits per page:
  - V = Valid; 1 = page is in memory, 0 = page is on disk
  - D = Dirty; 1 = page has been modified
  - R = Reference bit(s); used by the LRU replacement algorithm
12 The Page Table (PT)
- But each memory reference is now doubled! 1st: access the PT; 2nd: access the data
- Solution: use a TLB (Translation Lookaside Buffer): a small cache holding the most recent virtual-to-physical address translations
- Often separate ITLB & DTLB
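A minimal sketch of such a TLB, assuming 4 KB pages and a direct-mapped, 16-entry TLB (both sizes are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { PAGE_BITS = 12, TLB_ENTRIES = 16 };   /* 4 KB pages, 16 entries (assumed) */

typedef struct { bool valid; uint32_t vpn, ppn; } TlbEntry;
static TlbEntry tlb[TLB_ENTRIES];

/* On a hit the physical address is formed without touching the page
 * table; on a miss the caller would walk the PT and refill the TLB. */
static bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;
    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;
    }
    return false;
}

static void tlb_refill(uint32_t vpn, uint32_t ppn) {
    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = true; e->vpn = vpn; e->ppn = ppn;
}

/* Demo: pretend the page-table walk mapped virtual page 0x3 to
 * physical page 0x42, then translate an address on that page. */
static uint32_t demo_translation(void) {
    uint32_t pa = 0;
    tlb_refill(0x3, 0x42);
    return tlb_translate(0x3ABC, &pa) ? pa : 0;
}
```

The page offset passes through unchanged; only the page number is translated.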
13 VM Accesses Memory
- Virtual address = page number + offset; the page number is split into a TLB tag and index
- TLB hit: the physical page number comes straight from the TLB entry (V, D, R, TAG, physical page no.)
- TLB miss: walk the page table (V, D, R, physical page no. / disk position); if the page is on disk, a page fault brings it in from disk
14 Example: the FastMATH processor (physically addressed cache)
15 System Hardware Overview (A Very Schematic Example)
- CPU: fetch queue, register file, ROB, branch predictor, reservation stations (integer & logic, memory access, floating point), common data bus; pipeline: fetch instruction -> get operands & issue -> write result
- Memory side: L1 ICache, L1 DCache, L2 cache, MMU with TLB, memory bus, bus control, main memory
- System board: system bus with I/O controllers for network, disk, graphics, boot/timer etc., and other I/O devices
16 Today's Topics
- Improvement techniques: how cache performance affects processor performance [ ]; a set of techniques that each address these main factors [C.3, 5.2]
17 Cache Performance Metrics
- Average memory access time = Hit time + Miss rate x Miss penalty
- Miss rate: fraction of memory accesses not found in a cache/memory level; sometimes useful to consider Hit rate = 1 - Miss rate
- Miss penalty: time to bring in a block, including time to replace a block; measured in ns or number of clock cycles
  - access time: average time to access the data
  - transfer time: time to transfer the block to the higher level
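The formula can be checked with a small helper; the numbers in the usage note are illustrative, not from the slides:

```c
#include <assert.h>

/* Average memory access time = hit time + miss rate * miss penalty.
 * All times in the same unit, e.g. clock cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time, 5% miss rate and 20-cycle miss penalty give an AMAT of 2 cycles.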
18 Cache Performance
- Impact of caches on execution time:
  Exec time = IC x (CPI_execution + Misses/instruction x Miss penalty) x Tc
  where Misses/instruction x Miss penalty is the CPI increase caused by cache misses
- Three ways to increase performance:
  1. Reduce hit time
  2. Reduce #misses/instruction: reduce #misses, or reduce the number of memory references per instruction
  3. Reduce miss penalty
19 Overview - improvements
(The slide is a table marking, for each technique, which factor it improves: hit time, complexity, bandwidth, miss penalty, or miss rate.)
- Avoid virtual address translation
- Small and simple caches
- Way prediction (used in Pentium 4)
- Trace caches (used in Pentium 4)
- Pipelined cache access
- Nonblocking caches
- Banked caches
- Multi-level caches
- Prioritize read misses over writes (with write through)
- Prioritize demanded data
- Merging write buffer
- Larger block size
- Bigger caches
- Higher associativity
- Compiler techniques (SW challenge!)
- Hardware controlled prefetching (mostly instructions)
- Compiler controlled prefetching (nonblocking cache needed)
Several of these are trivial and/or widely used.
20 Hit-Time Improvement Techniques
Average memory access time = Hit time + Miss rate x Miss penalty
Hit-time improvement techniques next
21 Simple Often Means Fast
- Smaller and simpler is faster: use a multilevel cache with a simple and fast 1st-level cache
- A direct-mapped cache allows tag check and data transmission to proceed in parallel
22 Impact of Address Translation on Hit Time
- Physically indexed cache: the virtual address (VA) is first translated, then the physical address (PA) indexes the cache; on a miss, main memory supplies the data
- 1. Virtually indexed cache: let the virtual address index the cache in parallel with translation; the translated physical address is used for the tag check
23 Impact of Address Translation on Hit Time
- 2. Use virtual addresses both to index the cache and for the tag check; only do the TLB access (translation to a physical address) on a cache miss
- Advantage: removes the TLB from the critical path
24 Other Hit-Time Improvement Techniques
- Trace caches: cache predicted sequences of instructions; used in for example Pentium 4 (see page 131 and ...)
- Way prediction: predict which way of a set-associative cache holds the data; faster if the prediction is correct, an extra cycle if wrong; accuracy may be over 85%! Used in Pentium 4
25 Cache Bandwidth Improvements
- The number of memory operations that can be started per clock cycle is important to keep the processor running
- Hiding the miss penalty
- Improving hit time
26 Nonblocking Caches to Increase Cache Bandwidth
- Permits other cache operations to proceed while a miss is handled (exploit parallelism!)
- The cache has to bookkeep all pending miss requests
- The presence of true data dependences limits performance (as always)
27 Other Cache Bandwidth Improvements
- Pipelined cache access: each cache access takes several pipelined cycles, for example:
  1. Data and tag read
  2. Tag check, word select
  3. Block select and state update
28 Other Cache Bandwidth Improvements
- Multibanked caches: divide the cache blocks into banks that can be accessed simultaneously
- While one bank is accessed (possibly for several cycles), the next access can proceed if it goes to another bank
- The bank is selected based on the block address
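The bank selection can be sketched as sequential interleaving, a common scheme; 4 banks and 64-byte blocks are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_BANKS = 4, BLOCK_SIZE = 64 };   /* illustrative sizes */

/* With sequential interleaving, consecutive blocks land in different
 * banks, so streaming accesses can proceed in parallel. */
static unsigned bank_of(uint32_t addr) {
    return (unsigned)((addr / BLOCK_SIZE) % NUM_BANKS);
}
```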
29 Miss-Penalty Improvement Techniques
Average memory access time = Hit time + Miss rate x Miss penalty
Miss-penalty improvement techniques next
30 Using Cache Hierarchies to Reduce Miss Penalty
- Considerations: the 1st-level cache can be made faster and smaller => on-chip; the 2nd-level cache can be made larger to reduce capacity misses; more cache levels: P - L1 cache - L2 cache - main memory
- Performance of multi-level cache hierarchies:
  Access time = Hit time L1 + Miss rate L1 x Miss penalty L1
  Miss penalty L1 = Hit time L2 + Miss rate L2 x Miss penalty L2
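Substituting the second equation into the first gives a small helper; the numbers in the test are illustrative:

```c
#include <assert.h>

/* Two-level version of the slide's formula: the L1 miss penalty is
 * itself an AMAT into L2. Times in clock cycles. */
static double amat2(double hit_l1, double miss_rate_l1,
                    double hit_l2, double miss_rate_l2, double penalty_l2) {
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * penalty_l2);
}
```

With a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 local miss rate and 100-cycle memory penalty, the average access time is 2.5 cycles.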
31 Prioritize Read Misses over Writes to Reduce Miss Penalty
- Write buffer (see Appendix C): holds writes to memory
- Let memory reads go first; perform writes when there are no reads
- Need to check for RAW hazards against the write buffer
32 Prioritizing Demanded Data to Reduce Miss Penalty
- Sub-block placement: a valid bit per sub-block of a cache block, so only the demanded sub-block must be fetched
- Early restart: restart the processor as soon as the requested word has arrived
- Critical word first: fetch the requested word first
- Increases performance for large block sizes
33 Optimizing Write Accesses to Reduce Miss Penalty
- Merging write buffer: writes to the same block are merged into one buffer entry (the slide contrasts a buffer without merging and a buffer with merging)
34 Miss-Rate Improvement Techniques
Average memory access time = Hit time + Miss rate x Miss penalty
Miss-rate improvement techniques next
35 Classification of Cache Misses
The three-C model (Hill and Smith, 1987):
- Compulsory (or cold) miss: the first reference to a block is always a miss
- Capacity miss: the cache is not large enough to hold all the data or code that is accessed
- Conflict miss: two memory blocks are mapped to the same cache set by a direct-mapped or set-associative address mapping, even if there is still unused space in the cache
36 Basic Techniques to Reduce Misses
- Larger block size: uses spatial locality; reduces compulsory misses; may increase conflict misses; may increase miss penalty
- Bigger caches: reduces capacity misses; may increase hit time
- Higher associativity: reduces conflict misses; may increase hit time
37 Victim Cache: One Miss-Reduction Technique
- Improves the hit rate of direct-mapped caches
- Add a small buffer to hold data discarded from the cache
- Jouppi: a 4-entry victim cache removed 20%-95% of conflict misses for a 4 Kbyte direct-mapped data cache
- Used in Alpha and HP machines
- NB! Victim caches are discussed only briefly on page 301 of the book.
38 Prefetching
- Software prefetching - load data before it is needed: prefetching into registers, prefetching into caches; both require lockup-free (nonblocking) caches
- Hardware prefetching: if there is a miss for block X, also fetch blocks X+1, X+2, ..., X+d; d = 1 => one-block lookahead. Used in Alpha processors.
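A sketch of compiler-controlled prefetching using the GCC/Clang intrinsic __builtin_prefetch; the lookahead distance of 16 elements is an illustrative assumption that would be tuned per machine:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while prefetching 16 elements ahead. The prefetch is
 * only a hint: it may be dropped, and correctness never depends on it. */
static long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
        s += a[i];
    }
    return s;
}
```

Because the prefetched line must be outstanding while the loop continues, this is one of the places where a nonblocking (lockup-free) cache is required.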
39 Compiler Optimizations to Eliminate Misses
Increase locality!
1. Merging arrays - increases spatial locality when key and val are accessed together:
Before:
int key[SIZE];
int val[SIZE];
After:
struct merge { int key; int val; };
struct merge newarr[SIZE];
40 Compiler Optimizations to Eliminate Misses
2. Loop interchange - increases spatial locality by traversing A in the order it is laid out in memory (row by row: A[1,1] A[1,2] ... A[1,4] A[2,1] ...):
Before:
for (col = 0; col < N; col++)
  for (row = 0; row < N; row++)
    A[row][col] = ...;
After:
for (row = 0; row < N; row++)
  for (col = 0; col < N; col++)
    A[row][col] = ...;
41 Compiler Optimizations to Eliminate Misses
3. Blocking. Example: matrix multiply X = Y * Z (the slide shows 6x6 matrices y11..y66, z11..z66, x11..x66).
Before:
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }
43 After blocking - increases spatial & temporal locality:
for (jj = 0; jj < N; jj += B)
  for (kk = 0; kk < N; kk += B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj + B, N); j++) {
        r = 0;
        for (k = kk; k < min(kk + B, N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }
Number of memory accesses (worst case):
Before blocking: 2N^3 + N^2
After blocking: 2N^3/B + N^2
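A self-contained, runnable version of the blocked loop nest, with a small N so the result can be checked directly (N = 8 and B = 4 are illustrative):

```c
#include <assert.h>
#include <string.h>

enum { N = 8, B = 4 };   /* illustrative matrix and block sizes */

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* x = y * z, computed B x B block by block as on the slide. */
static void matmul_blocked(double y[N][N], double z[N][N], double x[N][N]) {
    memset(x, 0, N * N * sizeof(double));
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < MIN(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < MIN(kk + B, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }
}

/* Check: multiplying by the identity must reproduce z exactly. */
static int blocked_matmul_ok(void) {
    double y[N][N] = {{0}}, z[N][N], x[N][N];
    for (int i = 0; i < N; i++) y[i][i] = 1.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) z[i][j] = i * N + j;
    matmul_blocked(y, z, x);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (x[i][j] != z[i][j]) return 0;
    return 1;
}
```

B is chosen so that a B x B tile of y, z and x fits in the cache at the same time, which is where the 2N^3/B access count comes from.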
44 Summary - improvement techniques
(The slide repeats the overview table from slide 19, marking for each technique which factor it improves: hit time, complexity, bandwidth, miss penalty, or miss rate.)
Avoid virtual address translation; small and simple caches; way prediction; trace caches; pipelined cache access; nonblocking caches; banked caches; multi-level caches; prioritize read misses over writes; prioritize demanded data; merging write buffer; larger block size; bigger caches; higher associativity; compiler techniques; hardware controlled prefetching; compiler controlled prefetching.
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationCS3350B Computer Architecture
CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &
More informationLecture 11 Reducing Cache Misses. Computer Architectures S
Lecture 11 Reducing Cache Misses Computer Architectures 521480S Reducing Misses Classifying Misses: 3 Cs Compulsory The first access to a block is not in the cache, so the block must be brought into the
More informationLecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"
Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3
More informationReducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses
Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW
More informationChapter Seven. SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors)
Chapter Seven emories: Review SRA: value is stored on a pair of inverting gates very fast but takes up more space than DRA (4 to transistors) DRA: value is stored as a charge on capacitor (must be refreshed)
More informationHandout 4 Memory Hierarchy
Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced
More informationTextbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:
Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, 2003 Textbook web site: www.vrtechnology.org 1 Textbook web site: www.vrtechnology.org Laboratory Hardware 2 Topics 14:332:331
More informationLECTURE 11. Memory Hierarchy
LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed
More informationCS152 Computer Architecture and Engineering Lecture 18: Virtual Memory
CS152 Computer Architecture and Engineering Lecture 18: Virtual Memory March 22, 1995 Dave Patterson (patterson@cs) and Shing Kong (shingkong@engsuncom) Slides available on http://httpcsberkeleyedu/~patterson
More informationThe Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)
The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationName: 1. Caches a) The average memory access time (AMAT) can be modeled using the following formula: AMAT = Hit time + Miss rate * Miss penalty
1. Caches a) The average memory access time (AMAT) can be modeled using the following formula: ( 3 Pts) AMAT Hit time + Miss rate * Miss penalty Name and explain (briefly) one technique for each of the
More informationCISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review
CISC 662 Graduate Computer Architecture Lecture 6 - Cache and virtual memory review Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #24 Cache II 27-8-6 Scott Beamer, Instructor New Flow Based Routers CS61C L24 Cache II (1) www.anagran.com Caching Terminology When we try
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to
More informationCSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]
CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationCOSC 6385 Computer Architecture - Memory Hierarchies (I)
COSC 6385 Computer Architecture - Memory Hierarchies (I) Edgar Gabriel Spring 2018 Some slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05
More informationAdvanced Computer Architecture
ECE 563 Advanced Computer Architecture Fall 2009 Lecture 3: Memory Hierarchy Review: Caches 563 L03.1 Fall 2010 Since 1980, CPU has outpaced DRAM... Four-issue 2GHz superscalar accessing 100ns DRAM could
More informationMemory Hierarchy. Advanced Optimizations. Slides contents from:
Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.
More informationCPU issues address (and data for write) Memory returns data (or acknowledgment for write)
The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (I)
COSC 6385 Computer Architecture - Hierarchies (I) Fall 2007 Slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05 Recap
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationLecture notes for CS Chapter 2, part 1 10/23/18
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationEN1640: Design of Computing Systems Topic 06: Memory System
EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring
More information