Chapter 4. Cache Memory. Yonsei University


Contents
Computer Memory System Overview; Cache Memory Principles; Elements of Cache Design; Pentium 4 and PowerPC Cache.

Key Characteristics
Location: processor; internal (main); external (secondary). Capacity: word size; number of words. Unit of transfer: word; block. Access method: sequential; direct; random; associative. Performance: access time; cycle time; transfer rate. Physical type: semiconductor; magnetic; optical; magneto-optical. Physical characteristics: volatile/nonvolatile; erasable/nonerasable. Organization.

Characteristics of Memory Systems
Location: processor; internal (main); external (secondary).

Capacity
Internal memory capacity is typically expressed in bytes or words; external memory capacity is typically expressed in bytes.

Unit of Transfer
For internal memory, the unit of transfer is the number of bits read out of or written into the memory module at a time; it is often equal to the word length. Word: the natural unit of organization of memory; the word size is typically equal to the number of bits used to represent a number and to the instruction length. Addressable unit: in many systems the addressable unit is the word, though some systems allow addressing at the byte level.

Methods of Accessing
Sequential access: start at the beginning and read through in order; access time depends on the location of the data and the previous location (e.g., tape). Direct access: individual blocks have a unique address; access is by jumping to the vicinity plus a sequential search; access time depends on the location and the previous location (e.g., disk).

Methods of Accessing
Random access: individual addresses identify locations exactly; access time is independent of location or previous access (e.g., RAM). Associative access: data is located by comparing desired contents with the contents of a portion of the store; access time is independent of location or previous access (e.g., cache).

Performance
Access time: for random-access memory, the time it takes to perform a read or write operation; for non-random-access memory, the time it takes to position the read-write mechanism at the desired location. Memory cycle time: applies to random-access memory; consists of the access time plus any additional time required before the next access can begin.

Performance
Transfer rate: the rate at which data can be transferred into or out of a memory unit. For random-access memory it is 1/(cycle time). For non-random-access memory:
T_N = T_A + N / R
where T_N = average time to read or write N bits, T_A = average access time, N = number of bits, and R = transfer rate in bits per second (bps).
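
As a quick numeric check, here is a minimal Python sketch of the formula above; the 5 ms access time, 4096-bit block, and 1 Mbps rate are illustrative values, not figures from the text.

```python
# T_N = T_A + N / R : average time to read or write N bits from a
# non-random-access device (illustrative values, not from the text).

def average_transfer_time(t_access_s, n_bits, rate_bps):
    """Access time plus the time to stream n_bits at rate_bps."""
    return t_access_s + n_bits / rate_bps

# Example: 5 ms average access time, 4096-bit block, 1 Mbps transfer rate.
t_n = average_transfer_time(5e-3, 4096, 1_000_000)
print(f"T_N = {t_n * 1e3:.3f} ms")   # 9.096 ms: 5 ms access + ~4.1 ms transfer
```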

Physical Types
Semiconductor: RAM. Magnetic: disk and tape. Optical: CD and DVD. Others: bubble memory, holographic storage.

Physical Characteristics
Decay, volatility, erasability, power consumption.

Organization
The physical arrangement of bits to form words; not always the obvious arrangement (e.g., interleaved memory).

The Bottom Line
How much? (capacity) How fast? (time is money) How expensive? (cost per bit)

The Memory Hierarchy
Relationships: the faster the access time, the greater the cost per bit; the greater the capacity, the smaller the cost per bit; the greater the capacity, the slower the access time. As one goes down the hierarchy, the following occur: decreasing cost per bit, increasing capacity, increasing access time, and decreasing frequency of access of the memory by the processor.

The Memory Hierarchy (figure)

The Memory Hierarchy
Registers: in the CPU. Internal or main memory: may include one or more levels of cache; RAM. External memory: backing store.

Performance of a Two-Level Memory (figure)

Hierarchy List
Registers, L1 cache, L2 cache, main memory, disk cache, disk, optical, tape.

Memory Hierarchy
Locality of reference: during the course of execution of a program, memory references tend to cluster.

Memory Hierarchy
Additional levels can be effectively added to the hierarchy in software. A portion of main memory can be used as a buffer to hold data that is to be read out to disk; such a technique is sometimes referred to as a disk cache.

So You Want Fast?
It is possible to build a computer that uses only static RAM (see later). It would be very fast and would need no cache (how would you cache the cache?), but it would cost a very large amount.

Cache & Main Memory
A cache is a small amount of fast memory that sits between normal main memory and the CPU; it may be located on the CPU chip or in a module.

Principle
The CPU requests the contents of a memory location, and the cache is checked for this data. If present, the word is delivered from the cache to the CPU (fast). If not present, the required block of main memory is read into the cache and the word is delivered to the processor; because of the phenomenon of locality of reference, future references are likely to hit other words in the same block. The cache includes tags to identify which block of main memory is in each cache slot.

Principle
For mapping purposes, main memory is considered to consist of a number of fixed-length blocks of K words each, giving M = 2^n / K blocks for an n-bit address. The cache consists of C lines of K words each, where the number of lines is considerably less than the number of main-memory blocks. Each line includes a tag that identifies which particular block is currently being stored.
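
A one-line check of M = 2^n / K in Python, using the 24-bit address and 4-byte blocks of the running example introduced below (assumed here for illustration):

```python
n, K = 24, 4            # address bits and words (bytes) per block
M = 2**n // K           # number of main-memory blocks
print(M)                # 4194304, i.e., 4M blocks
```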

Cache/Main-Memory Structure (figure)

Principle
On a cache hit, the data and address buffers are disabled and communication is only between the processor and the cache, with no system bus traffic. On a cache miss, the desired address is loaded onto the system bus and the data are returned through the data buffer to both the cache and the processor.

Cache Read Operation (flowchart figure)

Elements of Cache Design
Cache size. Mapping function: direct, associative, set-associative. Replacement algorithm: least recently used (LRU), first in first out (FIFO), least frequently used (LFU), random. Write policy: write through, write back, write once. Line size. Number of caches: single or two level, unified or split.

Cache Size
The cache should be small enough that the overall average cost per bit is close to that of main memory alone, and large enough that the overall average access time is close to that of the cache alone. The larger the cache, the larger the number of gates involved in addressing it, so large caches tend to be slower than small ones. Cache size is also limited by available chip and board area. It is impossible to arrive at a single optimum size, because cache performance is very sensitive to the nature of the workload.

Factors for Cache Size
Cost: more cache is expensive. Speed: more cache is faster (up to a point), but checking the cache for data takes time.

Typical Cache Organization (figure)

Mapping Function
A mapping function is needed to determine which main-memory block currently occupies a cache line. Running example for the three techniques: a 64-KByte cache with 4-byte blocks, i.e., 16K (2^14) lines of 4 bytes each, and 16 MBytes of main memory addressed by a 24-bit address (2^24 = 16M).

Direct Mapping
Each block of main memory maps into only one possible cache line:
i = j modulo m
where i = cache line number, j = main memory block number, and m = number of lines in the cache.

Direct Mapping Cache Organization (figure)

Direct Mapping Cache Line Table
Cache line 0 holds blocks 0, m, 2m, ..., 2^s - m
Cache line 1 holds blocks 1, m+1, 2m+1, ..., 2^s - m + 1
...
Cache line m-1 holds blocks m-1, 2m-1, 3m-1, ..., 2^s - 1
The least significant w bits of an address identify a unique word or byte within a block, and the most significant s bits specify one memory block. The s bits are split into a line field of r bits and a tag of s-r bits (the most significant portion); the line field identifies one of the m = 2^r lines of the cache.

Direct Mapping Example (figure)

Direct Mapping Address Structure
Tag (s-r) = 8 bits | Line or slot (r) = 14 bits | Word (w) = 2 bits
The 24-bit address comprises a 2-bit word identifier (4-byte block) and a 22-bit block identifier, split into an 8-bit tag (= 22 - 14) and a 14-bit slot or line field. No two blocks that map to the same line have the same tag field, so the cache is checked by finding the line and comparing the tag.

Direct Mapping Summary
Address length = (s + w) bits. Number of addressable units = 2^(s+w) words or bytes. Block size = line size = 2^w words or bytes. Number of blocks in main memory = 2^(s+w) / 2^w = 2^s. Number of lines in cache = m = 2^r. Size of tag = (s - r) bits.
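
A minimal Python sketch of this address split, using the example's 8/14/2 field widths (the sample address is illustrative):

```python
# Direct-mapped split: 24-bit address = 8-bit tag | 14-bit line | 2-bit word.

TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split_direct(addr):
    word = addr & ((1 << WORD_BITS) - 1)                 # byte within the block
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)  # line = block mod 2**14
    tag  = addr >> (WORD_BITS + LINE_BITS)               # stored alongside the line
    return tag, line, word

tag, line, word = split_direct(0x16339C)
print(f"tag={tag:#x} line={line:#x} word={word}")        # tag=0x16 line=0xce7 word=0
```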

Direct Mapping
Advantages: simple and inexpensive to implement. Disadvantage: there is a fixed cache location for any given block, so if a program repeatedly accesses two blocks that map to the same line, the cache miss rate is very high.

Associative Mapping
A main-memory block can be loaded into any line of the cache, and the tag field uniquely identifies a block of main memory. To determine whether a block is in the cache, the cache control logic must simultaneously examine every line's tag for a match, which makes cache searching expensive.

Fully Associative Cache Organization (figure)

Associative Mapping Example (figure)

Associative Mapping Address Structure
Tag = 22 bits | Word = 2 bits
A 22-bit tag is stored with each 32-bit block of data; the tag field of an address is compared with every tag entry in the cache to check for a hit. The least significant 2 bits of the address identify which byte is required from the 32-bit data block. Example: address FFFFFC has tag 3FFFFF (the address shifted right by the 2 word bits) and might be stored, with data 24682468, in any line, e.g., line 3FFF.

Associative Mapping Summary
Address length = (s + w) bits. Number of addressable units = 2^(s+w) words or bytes. Block size = line size = 2^w words or bytes. Number of blocks in main memory = 2^(s+w) / 2^w = 2^s. Number of lines in cache = undetermined by the address format. Size of tag = s bits.
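
The same kind of sketch for the fully associative split, using the FFFFFC example above; here the tag is simply the block address:

```python
addr = 0xFFFFFC
word = addr & 0b11                   # which byte within the 4-byte block
tag  = addr >> 2                     # 22-bit tag stored with the line
print(f"tag={tag:#x} word={word}")   # tag=0x3fffff word=0
```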

Associative Mapping
A replacement algorithm, designed to maximize the hit ratio, chooses which block to discard when a new block is read into the cache. Advantage: flexibility in replacing blocks. Disadvantage: complex circuitry is required to examine the tags of all cache lines in parallel.

Set Associative Mapping
A compromise that reduces the disadvantages of the direct and associative approaches. The cache is divided into v sets of k lines each:
m = v * k
i = j modulo v
where i = cache set number, j = main memory block number, and m = number of lines in the cache. This is called k-way set-associative mapping. For v = m, k = 1, the set-associative technique reduces to direct mapping; for v = 1, k = m, it reduces to associative mapping.

Set Associative Mapping
The cache is divided into a number of sets, each containing a number of lines. A given block maps to any line in one particular set (e.g., block B can be in any line of set i). With 2 lines per set (2-way set-associative mapping), a given block can be in one of 2 lines in exactly one set.

k-Way Set Associative Cache Organization (figure)

Set Associative Mapping Address Structure
Tag = 9 bits | Set = 13 bits | Word = 2 bits
Use the set field to determine which cache set to look in, then compare the tag field of each line in that set to check for a hit. Example: address FFFFFC (tag 1FF, data 12345678) and address 00FFFC (tag 001, data 11223344) both have set number 1FFF and so share a set.

Set Associative Mapping Example
With a 13-bit set number, a block's set is its main-memory block number modulo 2^13, so addresses 000000, 008000, ..., FF8000 all map to the same set.

Set Associative Summary
Address length = (s + w) bits. Number of addressable units = 2^(s+w) words or bytes. Block size = line size = 2^w words or bytes. Number of blocks in main memory = 2^s. Number of lines per set = k. Number of sets = v = 2^d. Number of lines in cache = kv = k * 2^d. Size of tag = (s - d) bits.
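
A sketch of the 9/13/2 split; it confirms the earlier claim that addresses 000000, 008000, ..., FF8000 differ only in the tag and share a set:

```python
# 2-way set-associative split: 24-bit address = 9-bit tag | 13-bit set | 2-bit word.

SET_BITS, WORD_BITS = 13, 2

def split_set_assoc(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    set_no = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)   # set = block mod 2**13
    tag = addr >> (WORD_BITS + SET_BITS)
    return tag, set_no, word

for addr in (0x000000, 0x008000, 0xFF8000):
    print(split_set_assoc(addr))    # (0, 0, 0), (1, 0, 0), (511, 0, 0)
```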

Two-Way Set Associative Mapping Example (figure)

Replacement Algorithms
When a new block is brought into the cache, one of the existing blocks must be replaced. With direct mapping there is only one possible line for any particular block, so no choice is possible: the block in that line must be replaced.

Replacement Algorithms
The associative and set-associative techniques need a replacement algorithm; to achieve high speed, it must be implemented in hardware. Least recently used (LRU) is the most effective: replace the block in the set that has been in the cache longest with no reference to it. For a two-way set-associative cache this is easily implemented: when a line is referenced, its USE bit is set to 1 and the USE bit of the other line in the set is set to 0.

Replacement Algorithms
First in first out (FIFO): replace the block that has been in the cache longest. Least frequently used (LFU): replace the block that has experienced the fewest references. Random: pick a line at random, not based on usage.
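
A minimal LRU sketch for one k-way set (a software model only; real hardware uses USE bits or counters rather than an ordered structure):

```python
from collections import OrderedDict

class LRUSet:
    """One k-way set; tags are kept in recency order, oldest first."""
    def __init__(self, k):
        self.k = k
        self.lines = OrderedDict()          # tag -> cached block (placeholder)

    def access(self, tag):
        if tag in self.lines:               # hit: mark most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.k:       # miss with full set: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = object()          # fetch the block
        return False

s = LRUSet(k=2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])   # [False, False, True, False, False]
```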

Write Policy
A cache block must not be overwritten on replacement unless main memory is up to date; if the block has been updated in the cache, main memory must be updated before the block is replaced. Two situations complicate this. Multiple CPUs may have individual caches: when multiple processors are attached to the same bus and each has its own local cache, a word altered in one cache invalidates the corresponding word in the other caches. And I/O may address main memory directly, bypassing the caches.

Write Through
All writes go to main memory as well as to the cache, so multiple CPUs can monitor main-memory traffic to keep their local caches up to date. Disadvantage: it generates substantial memory traffic and may create a bottleneck.

Write Back
Write back minimizes memory writes: updates are made only in the cache, and an UPDATE bit for the cache slot is set when an update occurs. When a block is replaced, it is written back to main memory if and only if its UPDATE bit is set. Disadvantages: other caches get out of sync, and I/O must access main memory through the cache, which requires complex circuitry and creates a potential bottleneck. (Typically about 15% of memory references are writes.)
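
A sketch of the write-back bookkeeping described above: an UPDATE (dirty) bit per line, and a memory write only when a dirty line is evicted. Names and structure are illustrative.

```python
class Line:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False   # dirty = UPDATE bit

def write_word(line, data):
    line.data = data
    line.dirty = True            # defer the main-memory write

def evict(line, memory):
    if line.dirty:               # write back if and only if the UPDATE bit is set
        memory[line.tag] = line.data
    # clean lines are simply discarded

memory = {}
line = Line(tag=0x16, data="old")
write_word(line, "new")
evict(line, memory)
print(memory)                    # {22: 'new'} -- only dirty data reaches memory
```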

Cache Coherency
If the data in one cache are altered, this invalidates not only the corresponding word in main memory but also the same word in other caches. Even if a write-through policy is used, the other caches may contain invalid data. A system that prevents this problem is said to maintain cache coherency. Possible approaches: bus watching with write through, hardware transparency, and noncachable memory.

Bus Watching With Write Through
Each cache controller monitors the address lines to detect write operations to memory by other bus masters. If another master writes to a location in shared memory that also resides in its cache, the cache controller invalidates that cache entry. This strategy depends on the use of a write-through policy by all cache controllers.
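
A toy model of the idea (not any specific bus protocol): every controller watches bus writes and drops its own copy of the written block.

```python
class SnoopingCache:
    def __init__(self):
        self.lines = {}                     # block address -> data

    def snoop_write(self, addr):
        self.lines.pop(addr, None)          # invalidate a now-stale copy

cache_a, cache_b = SnoopingCache(), SnoopingCache()
cache_a.lines[0x100] = cache_b.lines[0x100] = "shared"
# cache_b's CPU writes through to main memory; cache_a snoops the bus write:
cache_a.snoop_write(0x100)
print(cache_a.lines)                        # {} -- the stale entry is gone
```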

Hardware Transparency
Additional hardware is used to ensure that all updates to main memory via one cache are reflected in all caches: if one processor modifies a word in its cache, the update is written to main memory, and any matching words in other caches are similarly updated.

Noncachable Memory
Only a portion of main memory is shared by more than one processor, and this portion is designated as noncachable. All accesses to shared memory are cache misses, because the shared memory is never copied into the cache. The noncachable memory can be identified using chip-select logic or high-address bits.

Line Size
When a block of data is retrieved and placed in the cache, the desired word and some number of adjacent words are retrieved. As block size increases, two specific effects come into play. Larger blocks reduce the number of blocks that fit into the cache; because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched. And as a block becomes larger, each additional word is farther from the requested word and therefore less likely to be needed in the near future.

Line Size
The relationship between block size and hit ratio is complex, depending on the locality characteristics of a particular program, and no definitive optimum value has been found. A size of from 2 to 8 words seems reasonably close to optimum.

Number of Caches
On-chip cache: it is possible to have the cache on the same chip as the processor; this reduces the processor's external bus activity, speeding up execution times and increasing overall system performance. Most contemporary designs include both on-chip and external caches in a two-level arrangement: the internal cache is designated level 1 (L1) and the external cache level 2 (L2).

Number of Caches
Reason for including an L2 cache: if there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, it must access DRAM or ROM across the bus, giving poor performance because of the slow bus speed and slow memory access time. If an L2 SRAM cache is used, the missing information can be quickly retrieved. The benefit of the L2 cache depends on the hit rates in both the L1 and L2 caches.

Number of Caches
The cache can be split in two: one dedicated to instructions and one dedicated to data. Potential advantages of a unified cache: for a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically, and only one cache needs to be designed and implemented.

Number of Caches
The trend is toward split caches, particularly in superscalar machines (Pentium II, PowerPC) that emphasize parallel instruction execution and prefetching of predicted future instructions. The key advantage of the split design is that it eliminates contention for the cache between the instruction fetch unit and the execution unit, which is important in any design that relies on the pipelining of instructions.

Pentium 4 Cache
80386: no on-chip cache. 80486: 8 KBytes on chip, using 16-byte lines and a four-way set-associative organization. Pentium (all versions): two on-chip L1 caches, one for data and one for instructions. Pentium 4 L1 caches: 8 KBytes, 64-byte lines, four-way set associative. Pentium 4 L2 cache, feeding both L1 caches: 256 KBytes, 128-byte lines, 8-way set associative.

Pentium 4 Block Diagram, Simplified (figure)

Pentium 4 Core Processor
Fetch/decode unit: fetches instructions from the L2 cache, decodes them into micro-ops, and stores the micro-ops in the L1 cache. Out-of-order execution logic: schedules micro-ops based on data dependences and resource availability, and may execute them speculatively. Execution units: execute micro-ops, taking data from the L1 cache and leaving results in registers. Memory subsystem: the L2 cache and the system bus.

Pentium 4 Design Reasoning
The processor decodes instructions into RISC-like micro-ops before the L1 cache. The micro-ops are fixed length, which suits superscalar pipelining and scheduling; Pentium instructions are long and complex, so performance is improved by separating decoding from scheduling and pipelining (more on this in Chapter 14). The data cache is write back and can be configured to write through. The L1 cache is controlled by two bits in a control register: CD (cache disable) and NW (not write through). Two instructions are provided to invalidate (flush) the cache, and to write back and then invalidate.

Structure of the Pentium II Data Cache (figure)
LRU replacement algorithm; write-back policy.

Data Cache Consistency
To provide cache consistency, the data cache supports the MESI (modified/exclusive/shared/invalid) protocol. The data cache includes two status bits per tag, so each line can be in one of four states: modified, exclusive, shared, or invalid.

MESI Cache Line States
Modified (M): cache line valid; the memory copy is out of date; no copies exist in other caches; a write to this line does not go to the bus.
Exclusive (E): cache line valid; the memory copy is valid; no copies exist in other caches; a write to this line does not go to the bus.
Shared (S): cache line valid; the memory copy is valid; copies may exist in other caches; a write to this line goes to the bus and updates the cache.
Invalid (I): cache line not valid; copies may exist in other caches; a write to this line goes directly to the bus.

Cache Control
The internal cache is controlled by two bits: CD (cache disable) and NW (not write through). Two Pentium II instructions are used to control the cache: INVD invalidates (flushes) the internal cache memory and signals the external cache to invalidate itself; WBINVD writes back and invalidates the internal cache, then writes back and invalidates the external cache.

Pentium II Cache Operation Modes
CD = 0, NW = 0: cache fills enabled, write throughs enabled, invalidates enabled.
CD = 1, NW = 0: cache fills disabled, write throughs enabled, invalidates enabled.
CD = 1, NW = 1: cache fills disabled, write throughs disabled, invalidates disabled.

PowerPC Internal Caches
PowerPC 601: one 32-KByte cache, 32 bytes/line, 8-way set associative.
PowerPC 603: two 8-KByte caches, 32 bytes/line, 2-way set associative.
PowerPC 604: two 16-KByte caches, 32 bytes/line, 4-way set associative.
PowerPC 620: two 32-KByte caches, 64 bytes/line, 8-way set associative.
PowerPC G3: two 32-KByte caches, 64 bytes/line, 8-way set associative.
PowerPC G4: two 32-KByte caches, 32 bytes/line, 8-way set associative.

PowerPC G3 Block Diagram (figure)

PowerPC Cache Organization
The L1 caches are eight-way set associative and use a version of the MESI cache coherency protocol. The L2 cache is a two-way set-associative cache with 256 KBytes, 512 KBytes, or 1 MByte of memory.

Characteristics of Two-Level Memories (Appendix 4A)
Main memory cache: typical access time ratio 5/1; memory management implemented by special hardware; typical block size 4 to 128 bytes; the processor has direct access to the second level.
Virtual memory (paging): typical access time ratio 1000/1; managed by a combination of hardware and system software; typical block size 64 to 4096 bytes; the processor has indirect access to the second level.
Disk cache: typical access time ratio 1000/1; managed by system software; typical block size 64 to 4096 bytes; the processor has indirect access to the second level.

Relative Dynamic Frequency of HLL Operations (Appendix 4A)
Study:     [HUCK83]    [KNUT71]   [PATT82]   [PATT82]   [TANE78]
Language:  Pascal      FORTRAN    Pascal     C          SAL
Workload:  Scientific  Student    System     System     System
Assign     74          67         45         38         42
Loop       4           3          5          3          4
Call       1           3          15         12         12
IF         20          11         29         43         36
GoTo       2           9          -          3          -
Other      -           7          6          1          6

Locality
In the call/return diagram, each call is represented by the line moving down and to the right, and each return by the line moving up and to the right. A window with a depth of 5 is defined; only a sequence of calls and returns with a net movement of 6 in either direction causes the window to move.

The Call/Return Behavior of Programs (figure)

Locality
Spatial locality: the tendency of execution to involve a number of memory locations that are clustered. Temporal locality: the tendency of a processor to access memory locations that have been used recently.

Locality of Reference for Web Pages (figure)

Operation of Two-Level Memory
Ts = H * T1 + (1 - H) * (T1 + T2) = T1 + (1 - H) * T2
where Ts = average (system) access time, T1 = access time of M1 (e.g., cache, disk cache), T2 = access time of M2 (e.g., main memory, disk), and H = hit ratio (fraction of time a reference is found in M1).
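
A numeric check of the formula, with assumed values (a 10 ns cache, 100 ns main memory, and a 95% hit ratio):

```python
def avg_access_time(t1, t2, hit_ratio):
    """Ts = T1 + (1 - H) * T2 for a two-level memory."""
    return t1 + (1 - hit_ratio) * t2

print(avg_access_time(10, 100, 0.95))   # 15.0 ns
```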

Performance
Cs = (C1 * S1 + C2 * S2) / (S1 + S2)
where Cs = average cost per bit for the combined two-level memory, C1 = average cost per bit of upper-level memory M1, C2 = average cost per bit of lower-level memory M2, S1 = size of M1, and S2 = size of M2. We would like Cs ≈ C2; given C1 >> C2, this requires S1 << S2.

Memory Cost vs. Memory Size (figure)

Performance
Consider the quantity T1/Ts, referred to as the access efficiency: it measures how close the average access time Ts is to the M1 access time T1. In terms of the hit ratio:
T1/Ts = 1 / (1 + (1 - H) * (T2/T1))
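
A short sketch of the efficiency curve for an assumed T2/T1 = 10 (e.g., 100 ns memory behind a 10 ns cache):

```python
def access_efficiency(hit_ratio, t2_over_t1):
    """T1/Ts = 1 / (1 + (1 - H) * T2/T1)."""
    return 1.0 / (1.0 + (1.0 - hit_ratio) * t2_over_t1)

for h in (0.5, 0.9, 0.99):
    print(h, round(access_efficiency(h, 10), 3))   # 0.167, 0.5, 0.909
```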

Access Efficiency vs. Hit Ratio, for various values of T2/T1 (figure)

Hit Ratio vs. Memory Size (figure)

Performance
If there is strong locality, it is possible to achieve high hit ratios even with a relatively small upper-level memory. Small cache sizes yield a hit ratio above 0.75 regardless of the size of main memory; a cache in the range of 1K to 128K words is generally adequate, whereas main memory is now typically in the multiple-megabyte range. And if only a relatively small upper-level memory is needed to achieve good performance, the average cost per bit of the two levels of memory will approach that of the cheaper memory.