Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor)


1 Chapter 7-1 Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授 V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) 臺大電機吳安宇教授 - 計算機結構 1

2 Outline 7.1 Introduction 7.2 The Basics of Caches 7.3 Measuring and Improving Cache Performance 7.4 Virtual Memory 7.5 A Common Framework for Memory Hierarchies

3 Five Classic Components of a Computer

4 Principle of Locality Programs access a relatively small portion of their address space at any instant of time (like books in a library): Temporal locality: if an item (instruction/data) is referenced, it will tend to be referenced again soon. Spatial locality: if an item (instruction/data) is referenced, items whose addresses are close by will tend to be referenced soon. Example: Temporal locality: in programs, instructions and data within loops are likely to be accessed repeatedly. Spatial locality: instructions are normally accessed sequentially, and elements of an array are accessed sequentially. We take advantage of the principle of locality by implementing the memory of a computer as a memory hierarchy.
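The loop behavior described above can be illustrated with a small sketch (illustrative Python, not from the slides):

```python
# Illustrative sketch of locality in a simple summation loop.
data = list(range(1024))

total = 0
for i in range(len(data)):   # 'total' and 'i' are reused every iteration: temporal locality
    total += data[i]         # data[0], data[1], ... are touched in order: spatial locality

print(total)
```

Both kinds of locality are what let a cache hold the loop's working set close to the processor.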

5 Memory Hierarchy A memory hierarchy consists of multiple levels of memory with different speeds and sizes. Guideline: build memory as a hierarchy of levels, with the fastest memory close to the processor and the slower, less expensive memory below that. Goal: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory. Three major technologies are used to construct a memory hierarchy:
SRAM: typical access time 0.5-5 ns; $4,000-$10,000 per GB in 2004.
DRAM: typical access time 50-70 ns; $100-$200 per GB in 2004.
Magnetic disk: typical access time 5,000,000-20,000,000 ns; $0.50-$2 per GB in 2004.

6 Basic Structure of a Memory Hierarchy

7 Basic Structure of a Memory Hierarchy Register File

8 Memory Hierarchy The memory system is organized as a hierarchy: a level closer to the processor is a subset of any level further away, and all the data are stored at the lowest level. Data are copied between only two adjacent levels at a time. The minimum unit of information transferred between two levels of the hierarchy is called a block.

9 Memory Hierarchy Pyramid

10 Terminologies If the data requested by the processor appears in the upper level, it's called a hit. If the data isn't found in the upper level, it's a miss; the lower level in the hierarchy is then accessed to retrieve the block containing the requested data. Hit ratio (hit rate): the fraction of memory accesses found in the upper level; it is often used as a measure of the performance of the memory hierarchy. Miss rate (= 1 - hit rate): the fraction of memory accesses not found in the upper level.

11 Terminologies Hit time: the time to access the upper level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss. Miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver the block to the processor. Note: in general, miss penalty > hit time. Because all programs spend much of their time accessing memory, the memory system is necessarily a major factor in determining performance.
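Combining these terms, the average access time of a two-level hierarchy can be sketched as follows (the numeric values here are assumptions for illustration, not from the slides):

```python
# Illustrative sketch: average access time = hit time + miss rate * miss penalty.
# The values below are assumed for illustration only.
hit_time = 1        # cycles to access the upper level (includes hit/miss check)
miss_rate = 0.05    # fraction of accesses not found in the upper level
miss_penalty = 100  # cycles to replace the block and deliver it to the processor

avg_access_time = hit_time + miss_rate * miss_penalty
print(avg_access_time)   # 6.0 cycles on average
```

Even a 5% miss rate dominates the average here, which is why miss penalty matters so much.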

12 Outline 7.1 Introduction 7.2 The Basics of Caches 7.3 Measuring and Improving Cache Performance 7.4 Virtual Memory 7.5 A Common Framework for Memory Hierarchies

13 The Basics of Caches Cache: a safe place for hiding or storing things. Example: before the request, the cache contains a collection of recent references X1, X2, ..., Xn-1, and the processor requests a word Xn that is not in the cache. This request results in a miss, and the word Xn is brought from memory into the cache. (Figure: the cache contents before the reference to Xn, and after, with Xn added.)

14 Tags of the Cache How do we know whether a requested word is in the cache or not? Add a set of tags to the cache.

15 Direct-mapped Cache

16 Valid Bit of a Cache Add a valid bit to indicate whether an entry contains a valid address. Replacement policy: recently accessed words replace less recently referenced words (exploiting temporal locality).

17 Accessing a Cache Example: the action for each reference

18 Action for each cache reference

19 Action for each cache reference

20 Address of Cache The address is divided into a tag field, an index that selects the cache block, and a byte offset. Each cache entry contains: 1. a valid bit; 2. a tag field, which is compared with the tag field of the address; 3. a data field holding the cached data. For a 1K-entry direct-mapped cache with one-word blocks and 32-bit byte addresses, the tag is 32 - 10 - 2 = 20 bits, so the size of the cache is 1K x (1 valid bit + 20 tag bits + 32 data bits) = 53 Kbits.
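The address breakdown can be sketched as follows, assuming the 1K-entry direct-mapped cache with one-word (4-byte) blocks described above (the sample address is an arbitrary illustration):

```python
# Sketch: split a 32-bit byte address into tag / index / byte offset
# for a 1K-entry direct-mapped cache with one-word (4-byte) blocks.
INDEX_BITS = 10    # 1K entries
OFFSET_BITS = 2    # 4 bytes per word

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x1234))   # (1, 141, 0): tag 1, cache entry 141, byte 0
```

The index picks the entry; the stored tag must equal the address tag (and the valid bit must be on) for a hit.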

21 Address of Cache (cont'd) Size of cache: 16K x (1 valid bit + tag bits + data bits) bits

22 Cache of Intrinsity FastMATH processor 16KB caches: 256 blocks with 16 words per block (spatial locality)

23 Total Number of Bits in a Cache Assumptions: n bits are used for the index, m bits for the word within the block, and 2 bits for the byte part of the address. With a 32-bit byte address, a direct-mapped cache of 2^n blocks with 2^m-word (2^(m+2)-byte) blocks requires a tag field of 32 - (n + m + 2) bits. Since the block size is 2^m words (2^(m+5) bits) and the address size is 32 bits, the total number of bits in the direct-mapped cache is 2^n x (block size + tag size + valid field size) = 2^n x (2^m x 32 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m).

24 Total Number of Bits in a Cache (Question) How many total bits are required for a direct-mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address? (Answer) 16KB of cache = 4K words. Each block = 4 words, so the cache has 2^10 blocks. Block size: 32 x 4 = 128 bits. Tag size: 32 - 10 - 2 - 2 = 18 bits. Valid field size: 1 bit. Total number of bits: 2^10 x (128 + 18 + 1) = 2^10 x 147 = 147 Kbits.
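The formula and the worked example can be checked with a short sketch:

```python
# Sketch of the bit-count formula above: n index bits, m bits for the word
# within the block, 32-bit byte addresses.
def cache_total_bits(n, m):
    block_bits = (2 ** m) * 32       # data bits per block
    tag_bits = 32 - n - m - 2        # tag bits per block
    return (2 ** n) * (block_bits + tag_bits + 1)   # +1 for the valid bit

# 16KB of data with 4-word blocks -> n = 10, m = 2
print(cache_total_bits(10, 2) // 1024)   # 147 (Kbits)
```

Note that the overhead (tag + valid bits) is 19/147, about 13% of the cache storage in this configuration.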

25 Computing a Cache Block Address Example: mapping a byte address to a multi-word cache block. (Question) Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to? (Answer) Block address: floor(1200/16) = 75. Cache block number: 75 modulo 64 = 11. In fact, this block maps all byte addresses between 1200 and 1215.
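The mapping can be sketched as:

```python
# Sketch of the mapping above: 64 blocks, 16-byte blocks.
BLOCK_SIZE = 16
NUM_BLOCKS = 64

def cache_block(byte_addr):
    block_address = byte_addr // BLOCK_SIZE    # which memory block
    return block_address % NUM_BLOCKS          # which cache block it maps to

print(cache_block(1200))   # 11
```

Every byte address in 1200..1215 lands in the same memory block, hence the same cache block.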

26 Miss Rate vs. Block Size In general, the miss rate falls as we increase the block size (taking advantage of spatial locality). The miss rate may go up if the block size becomes very large relative to the cache size (the cache holds fewer blocks, so blocks compete for space). Miss penalty: the time required to fetch the data from the next lower level of memory and load it into the cache.

27 Handling a Cache Miss The control unit must detect a miss and process it by fetching the requested data from memory (or a lower-level cache). If the cache reports a hit, the CPU continues to use the data as if nothing happened. If an instruction fetch results in a miss, the contents of the IR are not valid, and the next action (reading the registers) will be useless (though harmless). To perform the actions needed for a cache miss on an instruction read, we instruct the lower-level memory to perform a read, wait for the memory to respond (since the access will take multiple cycles), and then write the words into the cache.

28 Steps for an Instruction Cache Miss 1. Send the original PC value (current PC - 4) to the memory. 2. Instruct main memory to perform a read and wait for the memory to complete its access. 3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on. 4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache (a hit).

29 Steps for a Cache Read 1. Send the address to the appropriate cache. The address comes either from the PC (for an instruction) or from the ALU (for data). 2. If the cache signals a hit, the requested word is available on the data lines. Since there are 16 words in the desired block, we need to select the right one: a block-offset field controls a multiplexer, which selects the requested word from the 16 words in the indexed block. 3. If the cache signals a miss, we send the address to main memory. When the memory returns with the data, we write it into the cache and then read it to fulfill the request. Approximate instruction and data miss rates for the Intrinsity FastMATH processor on SPEC2000 benchmarks (the effective miss rate weights each by the frequency of the corresponding accesses): instruction miss rate 0.4%; data miss rate 11.4%; effective combined miss rate 3.2%.

30 Cache Consistency when Writing Data Inconsistency: suppose on a store (sw) instruction we wrote the data into only the data cache, without changing main memory. After the write into the cache, memory would have a different value from that in the cache. Write through: a scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two. Write buffer: a queue that holds data while the data are waiting to be written to memory. Write back: a scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.

31 Cache Write Miss With multi-word blocks: Read misses: similar to the single-word-block case; a miss simply brings back the entire block. Write misses: because the block contains more than a single word, we cannot just write the tag and data. If the tag of the address and the tag in the cache entry are equal, we have a write hit and can continue; if the tags are unequal, we have a write miss and must fetch the block from memory.

32 Different Organizations of Memory The primary method of achieving higher memory bandwidth is to increase the physical or logical width of the memory system. (Figure: one-word-wide, wide, and interleaved memory organizations.)

33 Memory Organization To understand the impact of different memory organizations, we define a set of hypothetical memory access times. Assume: 1 memory bus clock cycle to send the address; 15 memory bus clock cycles for each DRAM access initiated; 1 memory bus clock cycle to send a word of data. Three methods: (1) For a cache block of four words and a one-word-wide bank of DRAMs: miss penalty = 1 + 4 x 15 + 4 x 1 = 65 clock cycles. The number of bytes transferred per bus clock cycle for a single miss = 4 x 4 / 65 = 0.25 bytes/cycle (effective bandwidth per miss).

34 Memory Organization (2) Parallel access: with a main memory bus width of two words, miss penalty = 1 + 2 x 15 + 2 x 1 = 33 clock cycles (bandwidth 0.48 bytes/cycle). With a main memory bus width of four words, miss penalty = 1 + 1 x 15 + 1 x 1 = 17 clock cycles (0.94 bytes/cycle). Cost: wider bus (wires) plus increased access time (due to the multiplexer and control logic). (3) Interleaving: instead of making the entire path between the memory and the cache wider, the memory chips can be organized in banks to read or write multiple words in one access time, rather than reading or writing a single word each time. Each bank can be one word wide, so the width of the bus and the cache need not change, but sending an address to several banks permits them all to read simultaneously. With four banks, miss penalty = 1 + 15 + 4 x 1 = 20 clock cycles; effective bandwidth per miss = 4 x 4 / 20 = 0.8 bytes/cycle.
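The miss-penalty calculations above can be checked with a short sketch:

```python
# Sketch of the miss-penalty arithmetic above (cycles: 1 to send the address,
# 15 per DRAM access initiated, 1 per word transferred; 4-word blocks).
one_word   = 1 + 4 * 15 + 4 * 1   # one-word-wide memory: 4 accesses, 4 transfers
two_word   = 1 + 2 * 15 + 2 * 1   # two-word-wide bus: 2 accesses, 2 transfers
four_word  = 1 + 1 * 15 + 1 * 1   # four-word-wide bus: 1 access, 1 transfer
interleave = 1 + 15 + 4 * 1       # 4 one-word banks, accesses overlapped

print(one_word, two_word, four_word, interleave)        # 65 33 17 20
print(round(16 / one_word, 2), round(16 / interleave, 2))  # 0.25 vs 0.8 bytes/cycle
```

Interleaving gets most of the wide-bus benefit without widening the bus or the cache.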

35 Outline 7.1 Introduction 7.2 The Basics of Caches 7.3 Measuring and Improving Cache Performance 7.4 Virtual Memory 7.5 A Common Framework for Memory Hierarchies

36 Measuring Cache Performance CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Clock cycle time. Memory-stall clock cycles come primarily from cache misses: Memory-stall clock cycles = Read-stall cycles + Write-stall cycles. Read-stall cycles = (reads/program) x read miss rate x read miss penalty. Write-stall cycles = ((writes/program) x write miss rate x write miss penalty) + write buffer stalls. For a write-back scheme, there are potential additional stalls arising from the need to write a cache block back to memory when the block is replaced. For a write-through scheme, a write miss requires that we fetch the block before continuing the write, and write buffer stalls occur when the write buffer is full at the time of a write.

37 Measuring Cache Performance In most write-through schemes, we assume that the read and write miss penalties are the same and that write buffer stalls are negligible. Then: (1) Memory-stall clock cycles = (Memory accesses/program) x Miss rate x Miss penalty. (2) Memory-stall clock cycles = (Instructions/program) x (Misses/instruction) x Miss penalty.

38 Calculating Cache Performance (Question) How much faster would a processor run with a perfect cache that never missed? (Answer) Assumptions: instruction cache miss rate = 2%; data cache miss rate = 4%; CPI = 2 without any memory stalls; miss penalty = 100 cycles for all misses. Use the instruction frequencies for SPECint2000 from Chapter 3, Fig. 3.26 on page 228 (loads and stores are 36% of instructions). Instruction count = I.

39 Calculating Cache Performance Instruction miss cycles = I x 2% x 100 = 2.00 I. Data miss cycles = I x 36% x 4% x 100 = 1.44 I. Total number of memory-stall cycles = 2.00 I + 1.44 I = 3.44 I. The CPI with memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is (CPU time with stalls) / (CPU time with perfect cache) = (I x CPI_stall x clock cycle) / (I x CPI_perfect x clock cycle) = CPI_stall / CPI_perfect = 5.44 / 2 = 2.72.
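The calculation can be checked with a short sketch (per-instruction quantities):

```python
# Sketch of the worked example above.
cpi_base = 2.0
i_miss_stalls = 0.02 * 100          # instruction-miss stall cycles per instruction
d_miss_stalls = 0.36 * 0.04 * 100   # data-miss stall cycles per instruction (36% loads/stores)
stalls = i_miss_stalls + d_miss_stalls   # 3.44

cpi_stall = cpi_base + stalls            # 5.44
print(round(cpi_stall / cpi_base, 2))    # 2.72: slowdown vs. a perfect cache
```

More than half the cycles here are memory stalls, not useful execution.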

40 Calculating Cache Performance Example: suppose we speed up the computer in the previous example by reducing the CPI from 2 to 1. The system with cache misses then has a CPI of 1 + 3.44 = 4.44, so the system with the perfect cache is 4.44 / 1 = 4.44 times faster. The amount of execution time spent on memory stalls rises from 3.44 / 5.44 = 63% to 3.44 / 4.44 = 77%.

41 Example: Cache Performance with Increased Clock Rate (Question) Suppose we double the clock rate of the computer in the previous example, while main memory speed is unchanged. How much faster will the computer be with the faster clock, assuming the same miss rate as the previous example? (Answer) The new miss penalty = 200 clock cycles. Total miss cycles per instruction = (2% x 200) + 36% x (4% x 200) = 6.88. CPI = 2 + 6.88 = 8.88. Relative performance = (performance with fast clock) / (performance with slow clock) = (IC x CPI_slow x clock cycle) / (IC x CPI_fast x (clock cycle / 2)) = 5.44 / (8.88 / 2) = 1.23. The computer with the faster clock is about 1.2 times faster, rather than 2 times faster, which it would have been if we ignored cache misses.
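This calculation can also be checked with a short sketch:

```python
# Sketch: doubling the clock doubles the miss penalty in cycles,
# since main memory speed is unchanged.
miss_cycles = 0.02 * 200 + 0.36 * (0.04 * 200)   # 6.88 stall cycles per instruction
cpi_fast = 2 + miss_cycles                        # 8.88
cpi_slow = 5.44                                   # from the previous example

# The fast clock cycle is half as long, so compare CPI_slow to CPI_fast / 2.
speedup = cpi_slow / (cpi_fast / 2)
print(round(speedup, 2))   # about 1.23, far short of the ideal 2x
```

The faster the clock, the larger the fraction of time lost to the (fixed-latency) memory.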

42 Reducing Cache Misses by More Flexible Placement of Blocks (1) Direct-mapped cache: a block can go in exactly one place in the cache. (2) Fully associative cache: a cache structure in which a block can be placed in any location in the cache. (3) Set-associative cache: a cache that has a fixed number of locations (at least two) where each block can be placed. In a direct-mapped cache, the position of a memory block is given by (block number) modulo (number of cache blocks). In a set-associative cache, the set containing a memory block is given by (block number) modulo (number of sets in the cache).

43 Associativity in Caches Direct-mapped cache

44 Associativity in Caches Set-associative cache; fully associative cache. A fully associative cache with m entries is simply an m-way set-associative cache: it has one set with m blocks, and an entry can reside in any block within that set.

45 Associative structures

46 Example: Misses and Associativity in Caches Assume there are three small caches, each consisting of four one-word blocks: one fully associative, one two-way set associative, and one direct-mapped. Find the number of misses for each organization given the following sequence of block addresses: 0, 8, 0, 6, 8.

47 Misses in a Direct-Mapped Cache Sequence of block addresses: 0, 8, 0, 6, 8. Mapping: block 0 -> cache block (0 modulo 4) = 0; block 6 -> (6 modulo 4) = 2; block 8 -> (8 modulo 4) = 0.
Access 0: miss; cache holds Memory[0].
Access 8: miss; Memory[8] replaces Memory[0].
Access 0: miss; Memory[0] replaces Memory[8].
Access 6: miss; cache holds Memory[0], Memory[6].
Access 8: miss; Memory[8] replaces Memory[0]; cache holds Memory[8], Memory[6].
The direct-mapped cache generates 5 misses for the five accesses.

48 Misses and Associativity in Caches (2) Two-way set-associative cache. Mapping: block 0 -> set (0 modulo 2) = 0; block 6 -> set 0; block 8 -> set 0.
Access 0: miss; set 0 holds Memory[0].
Access 8: miss; set 0 holds Memory[0], Memory[8].
Access 0: hit.
Access 6: miss; Memory[6] replaces Memory[8] (the least recently used block); set 0 holds Memory[0], Memory[6].
Access 8: miss; Memory[8] replaces Memory[0]; set 0 holds Memory[8], Memory[6].
The two-way set-associative cache has 4 misses.

49 Misses and Associativity in Caches (3) Fully associative cache: any memory block can be stored in any cache block.
Access 0: miss; cache holds Memory[0].
Access 8: miss; cache holds Memory[0], Memory[8].
Access 0: hit.
Access 6: miss; cache holds Memory[0], Memory[8], Memory[6].
Access 8: hit.
The fully associative cache has only 3 misses: the best of the three organizations.
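All three cases can be checked with a small LRU cache simulator (an illustrative sketch, not from the slides):

```python
# Sketch: count misses for an LRU cache of `num_blocks` blocks with
# associativity `ways`, given a sequence of block addresses.
def count_misses(refs, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set: block list, LRU first
    misses = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == ways:             # set full: evict the LRU block
                s.pop(0)
        s.append(block)
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(refs, 4, 1))   # 5: direct-mapped
print(count_misses(refs, 4, 2))   # 4: two-way set associative
print(count_misses(refs, 4, 4))   # 3: fully associative
```

Direct-mapped is a 1-way set-associative cache and fully associative is 4-way here, so one routine covers all three organizations.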

50 Four-way set-associative cache

51 Size of Tags vs. Set Associativity (Question) Assume a cache of 4K blocks, a four-word block size, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct-mapped, two-way and four-way set associative, and fully associative. (Answer) Direct-mapped: the bits for index and tag = 32 - 4 = 28 (4 = block offset plus byte offset). The number of sets = the number of blocks = 4K. The bits for the index = log2(4K) = 12. The total number of tag bits = (28 - 12) x 4K = 64 Kbits.

52 Size of Tags vs. Set Associativity Two-way set associative: the number of sets = (number of blocks) / 2 = 2K; the total number of tag bits = (28 - 11) x 2 x 2K = 68 Kbits. Four-way set associative: the number of sets = (number of blocks) / 4 = 1K; the total number of tag bits = (28 - 10) x 4 x 1K = 72 Kbits. Fully associative: the number of sets = 1; the total number of tag bits = 28 x 4K x 1 = 112 Kbits. Least recently used (LRU): a replacement scheme in which the block replaced is the one that has been unused for the longest time.
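The four tag-storage figures can be checked with a short sketch:

```python
import math

# Sketch of the tag-storage calculation above: 4K blocks, four-word blocks,
# 32-bit addresses, so 28 bits remain for tag + index.
TAG_PLUS_INDEX = 28
NUM_BLOCKS = 4 * 1024

def tag_kbits(ways):
    sets = NUM_BLOCKS // ways
    index_bits = int(math.log2(sets))
    tag_bits = TAG_PLUS_INDEX - index_bits
    return tag_bits * ways * sets // 1024    # total tag storage, in Kbits

for ways in (1, 2, 4, NUM_BLOCKS):           # fully associative = 4K-way
    print(ways, tag_kbits(ways))             # 64, 68, 72, 112 Kbits
```

Each doubling of associativity halves the index, lengthening every tag by one bit: tag storage grows with associativity.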

53 Multilevel Cache Multilevel cache: a memory hierarchy with multiple levels of caches, rather than just a cache and main memory. Example: suppose we have a processor with a base CPI of 1.0 (assuming all references hit in the primary cache) and a clock rate of 5 GHz. Assume a main memory access time of 100 ns, including all the miss handling, and a miss rate per instruction at the primary cache of 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?

54 Multilevel Cache (cont'd) For the processor with one level of cache: the miss penalty to main memory = 100 ns / 0.2 ns (one cycle at 5 GHz) = 500 clock cycles. Total CPI = base CPI + memory-stall cycles per instruction = 1.0 + 2% x 500 = 11.0. For the processor with two levels of cache: the miss penalty for an access to the second-level cache = 5 / 0.2 = 25 clock cycles. Total CPI = base CPI + primary stalls per instruction + secondary stalls per instruction = 1.0 + 2% x 25 + 0.5% x 500 = 4.0. The processor with the secondary cache is faster by 11.0 / 4.0 = 2.8.
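The two-level example can be checked with a short sketch:

```python
# Sketch of the two-level-cache example above.
cycles_per_ns = 5                        # 5 GHz clock -> 5 cycles per ns
main_penalty = 100 * cycles_per_ns       # 500 cycles to reach main memory
l2_penalty = 5 * cycles_per_ns           # 25 cycles to reach the L2 cache

cpi_one_level = 1.0 + 0.02 * main_penalty                        # 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty   # 4.0
print(round(cpi_one_level / cpi_two_level, 2))   # speedup of about 2.75
```

The L2 cache converts most 500-cycle main-memory penalties into 25-cycle L2 hits, which is where the speedup comes from.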


Chapter Seven. SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) Chapter Seven emories: Review SRA: value is stored on a pair of inverting gates very fast but takes up more space than DRA (4 to transistors) DRA: value is stored as a charge on capacitor (must be refreshed)

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site: Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, 2003 Textbook web site: www.vrtechnology.org 1 Textbook web site: www.vrtechnology.org Laboratory Hardware 2 Topics 14:332:331

More information

Module Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies.

Module Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies. M6 Memory Hierarchy Module Outline CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies. Events on a Cache Miss Events on a Cache Miss Stall the pipeline.

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Memory Hierarchy: Caches, Virtual Memory

Memory Hierarchy: Caches, Virtual Memory Memory Hierarchy: Caches, Virtual Memory Readings: 5.1-5.4, 5.8 Big memories are slow Computer Fast memories are small Processor Memory Devices Control Input Datapath Output Need to get fast, big memories

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University Caches Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns

More information

Memory Hierarchy Design (Appendix B and Chapter 2)

Memory Hierarchy Design (Appendix B and Chapter 2) CS359: Computer Architecture Memory Hierarchy Design (Appendix B and Chapter 2) Yanyan Shen Department of Computer Science and Engineering 1 Four Memory Hierarchy Questions Q1 (block placement): where

More information

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy ENG338 Computer Organization and Architecture Part II Winter 217 S. Areibi School of Engineering University of Guelph Hierarchy Topics Hierarchy Locality Motivation Principles Elements of Design: Addresses

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Basic Memory Hierarchy Principles. Appendix C (Not all will be covered by the lecture; studying the textbook is recommended!)

Basic Memory Hierarchy Principles. Appendix C (Not all will be covered by the lecture; studying the textbook is recommended!) Basic Memory Hierarchy Principles Appendix C (Not all will be covered by the lecture; studying the textbook is recommended!) Cache memory idea Use a small faster memory, a cache memory, to store recently

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work?

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work? EEC 17 Computer Architecture Fall 25 Introduction Review Review: The Hierarchy Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

Why memory hierarchy

Why memory hierarchy Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast memory fast memory expensive, slow memory cheap cache: small, fast memory near CPU large, slow memory (main memory,

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

ECE468 Computer Organization and Architecture. Memory Hierarchy

ECE468 Computer Organization and Architecture. Memory Hierarchy ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:

More information

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

Memory Hierarchy: The motivation

Memory Hierarchy: The motivation Memory Hierarchy: The motivation The gap between CPU performance and main memory has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

Cache Memory and Performance

Cache Memory and Performance Cache Memory and Performance Cache Performance 1 Many of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Tuesday 27 October 15 Many slides adapted from: and Design, Patterson & Hennessy 5th Edition, 2014, MK and from Prof. Mary Jane Irwin, PSU Summary Previous Class Memory hierarchy

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Computer Organization Architectures for Embedded Computing Tuesday 28 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,

More information

COSC3330 Computer Architecture Lecture 19. Cache

COSC3330 Computer Architecture Lecture 19. Cache COSC3330 Computer Architecture Lecture 19 Cache Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Cache Topics 3 Cache Hardware Cost How many total bits are required

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Chapter 5. Memory Technology

Chapter 5. Memory Technology Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Memory Hierarchy: Motivation

Memory Hierarchy: Motivation Memory Hierarchy: Motivation The gap between CPU performance and main memory speed has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds.

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds. Performance 980 98 982 983 984 985 986 987 988 989 990 99 992 993 994 995 996 997 998 999 2000 7/4/20 CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches Instructor: Michael Greenbaum

More information

ECE ECE4680

ECE ECE4680 ECE468. -4-7 The otivation for s System ECE468 Computer Organization and Architecture DRA Hierarchy System otivation Large memories (DRA) are slow Small memories (SRA) are fast ake the average access time

More information

Lecture 12: Memory hierarchy & caches

Lecture 12: Memory hierarchy & caches Lecture 12: Memory hierarchy & caches A modern memory subsystem combines fast small memory, slower larger memories This lecture looks at why and how Focus today mostly on electronic memories. Next lecture

More information

Memory Hierarchy. Reading. Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 (2) Lecture notes from MKP, H. H. Lee and S.

Memory Hierarchy. Reading. Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 (2) Lecture notes from MKP, H. H. Lee and S. Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 Reading (2) 1 SRAM: Value is stored on a pair of inerting gates Very fast but

More information

Chapter 7: Large and Fast: Exploiting Memory Hierarchy

Chapter 7: Large and Fast: Exploiting Memory Hierarchy Chapter 7: Large and Fast: Exploiting Memory Hierarchy Basic Memory Requirements Users/Programmers Demand: Large computer memory ery Fast access memory Technology Limitations Large Computer memory relatively

More information

The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs.

The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. The Hierarchical Memory System The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory Hierarchy:

More information

CSE 2021: Computer Organization

CSE 2021: Computer Organization CSE 2021: Computer Organization Lecture-12a Caches-1 The basics of caches Shakil M. Khan Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB

More information

Memory Hierarchy. Caching Chapter 7. Locality. Program Characteristics. What does that mean?!? Exploiting Spatial & Temporal Locality

Memory Hierarchy. Caching Chapter 7. Locality. Program Characteristics. What does that mean?!? Exploiting Spatial & Temporal Locality Caching Chapter 7 Basics (7.,7.2) Cache Writes (7.2 - p 483-485) configurations (7.2 p 487-49) Performance (7.3) Associative caches (7.3 p 496-54) Multilevel caches (7.3 p 55-5) Tech SRAM (logic) SRAM

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information

Memory Technologies. Technology Trends

Memory Technologies. Technology Trends . 5 Technologies Random access technologies Random good access time same for all locations DRAM Dynamic Random Access High density, low power, cheap, but slow Dynamic need to be refreshed regularly SRAM

More information

ECE7995 (6) Improving Cache Performance. [Adapted from Mary Jane Irwin s slides (PSU)]

ECE7995 (6) Improving Cache Performance. [Adapted from Mary Jane Irwin s slides (PSU)] ECE7995 (6) Improving Cache Performance [Adapted from Mary Jane Irwin s slides (PSU)] Measuring Cache Performance Assuming cache hit costs are included as part of the normal CPU execution cycle, then CPU

More information

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

CSE 2021: Computer Organization

CSE 2021: Computer Organization CSE 2021: Computer Organization Lecture-12 Caches-1 The basics of caches Shakil M. Khan Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

EECS 322 Computer Architecture Superpipline and the Cache

EECS 322 Computer Architecture Superpipline and the Cache EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 6: Memory Organization Part I

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 6: Memory Organization Part I ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 6: Memory Organization Part I Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Large and Fast: Exploiting Memory Hierarchy

Large and Fast: Exploiting Memory Hierarchy 5 Ideally one would desire an indefinitely large memory capacity such that any particular word would be immediately available. We are forced to recognize the possibility of constructing a hierarchy of

More information

CMPT 300 Introduction to Operating Systems

CMPT 300 Introduction to Operating Systems CMPT 300 Introduction to Operating Systems Cache 0 Acknowledgement: some slides are taken from CS61C course material at UC Berkeley Agenda Memory Hierarchy Direct Mapped Caches Cache Performance Set Associative

More information

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics ,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics The objectives of this module are to discuss about the need for a hierarchical memory system and also

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 24: Cache Performance Analysis Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Last time: Associative caches How do we

More information