
The levels of a memory hierarchy. From the CPU outwards: CPU registers (about 500 bytes, 0.25 ns access), cache (about 1 MB, 1 ns), main memory reached over the memory bus (about 4 GB, 20 ns), and external memory reached over the I/O bus (about 500 GB, 5 ms).

Some useful definitions. When the CPU finds a requested data item in the cache, it is called a cache hit. When the CPU does not find a data item in the cache, it is called a cache miss. A fixed-size collection of data containing the requested word, called a block, is retrieved from main memory and placed into the cache. Temporal locality tells us that we are likely to need this word again in the near future, so it is useful to place it in the cache. Because of spatial locality, there is a high probability that the other data in the block will be needed soon.

Cache Performance Review. The equation for CPU execution time can be rewritten as follows:

CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time

where memory stall cycles is the number of cycles during which the CPU is stalled waiting for a memory access:

Memory stall cycles = Number of misses * Miss penalty = IC * (Misses / Instruction) * Miss penalty = IC * (Memory accesses / Instruction) * Miss rate * Miss penalty

The miss rate is the fraction of cache accesses that result in a miss (it can be different for reads and writes; we use some kind of average miss rate).
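As an illustration of these formulas (not from the slides), here is a minimal C sketch that computes the memory stall cycles and the resulting slowdown from an instruction count, miss rate, and miss penalty; the variable names and sample numbers are illustrative assumptions chosen to match the example that follows.

#include <stdio.h>

int main(void) {
    /* Assumed illustrative inputs, matching the example that follows. */
    double ic = 1e9;                 /* instruction count */
    double cpi_exec = 1.0;           /* CPI when every access hits in the cache */
    double accesses_per_instr = 1.5; /* loads and stores are 50% of instructions */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;      /* clock cycles */
    double clock_cycle = 1.0;        /* arbitrary time unit */

    /* Memory stall cycles = IC * (memory accesses / instruction) * miss rate * miss penalty */
    double stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty;

    /* CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time */
    double cpu_time = (ic * cpi_exec + stall_cycles) * clock_cycle;
    double perfect  = ic * cpi_exec * clock_cycle;

    printf("stall cycles: %.3g, slowdown vs. perfect cache: %.2fx\n",
           stall_cycles, cpu_time / perfect);   /* prints 1.75x */
    return 0;
}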

An example I. Question: Consider a computer with CPI = 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

Answer: Let's first compute the performance of the computer that always hits:

CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time = (IC * CPI + 0) * Clock cycle time = IC * 1.0 * Clock cycle time

Now compute the memory stall cycles for the computer with a real cache:

Memory stall cycles = IC * (Memory accesses / Instruction) * Miss rate * Miss penalty = IC * (1 + 0.5) * 0.02 * 25 = IC * 0.75

An example II. Then

CPU execution time with cache = (IC * 1.0 + IC * 0.75) * Clock cycle time = 1.75 * IC * Clock cycle time

The performance ratio is the inverse of the execution times:

CPU execution time with cache / CPU execution time = (1.75 * IC * Clock cycle time) / (1.0 * IC * Clock cycle time) = 1.75

The computer with no cache misses is 1.75 times faster.

Misses per instruction. Instead of the miss rate you can use misses per instruction; the two measures are equivalent:

Misses / Instruction = (Miss rate * Memory accesses) / Instruction count = Miss rate * (Memory accesses / Instruction)

For our example, misses per instruction = 0.02 * 1.5 = 0.030. Misses per instruction are also reported as misses per 1000 instructions (in our example we have 30 misses per 1000 instructions).

Where can a block be placed in a cache? If each block has only one place it can appear in the cache, the cache is said to be direct mapped; the mapping is usually (block address) MOD (number of blocks in cache). If a block can be placed anywhere in the cache, the cache is said to be fully associative. If a block can be placed in a restricted set of places in the cache, the cache is set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection; that is, (block address) MOD (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.

How is a block found if it is in the cache? Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired information is checked to see if it matches the block address from the CPU. The address is divided into the block address, made up of the TAG and INDEX fields, and the BLOCK OFFSET. The tag is used to check all blocks in the set, the index is used to select the set, and the block offset is the address within the block. Fully associative caches have no index field. There must also be a way to know that a cache block does not hold valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.
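As a sketch (not part of the slides), the following C fragment shows how a cache controller might split a byte address into tag, index, and block-offset fields for a set-associative cache; the cache geometry used here is an assumed example.

#include <stdint.h>
#include <stdio.h>

/* Assumed cache geometry: 64-byte blocks, 128 sets (e.g. 32 KB, 4-way). */
#define BLOCK_SIZE 64u
#define NUM_SETS   128u

int main(void) {
    uint32_t addr = 0x12345678u;

    uint32_t block_offset = addr % BLOCK_SIZE;       /* byte within the block */
    uint32_t block_addr   = addr / BLOCK_SIZE;       /* block address */
    uint32_t index        = block_addr % NUM_SETS;   /* set selection by bit selection */
    uint32_t tag          = block_addr / NUM_SETS;   /* compared against the stored tags */

    printf("tag=%u index=%u offset=%u\n", tag, index, block_offset);
    return 0;
}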

Which block should be replaced on a cache miss? Generally three different strategies are used: Random - to spread allocation uniformly, candidate blocks are randomly selected (sometimes a pseudorandom strategy is used to get reproducible behavior). Least-recently used (LRU) - to reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. Relying on the past to predict the future, the block replaced is the one that has been unused for the longest time. First in, first out (FIFO) - because LRU can be complicated to compute, FIFO approximates it by replacing the oldest block rather than the least-recently used one.
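A minimal sketch (assumed, not from the slides) of LRU replacement within one set, using per-block timestamps; real hardware tracks recency with far fewer bits, but the selection logic is the same.

#include <stdint.h>
#include <stdio.h>

#define WAYS 4  /* assumed 4-way set associative */

struct block {
    int      valid;
    uint32_t tag;
    uint64_t last_used;  /* logical time of last access; smallest value = LRU victim */
};

/* Return the way to replace: an invalid block if one exists, else the LRU block. */
static int choose_victim(const struct block set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                                 /* free slot: no replacement needed */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                               /* older access time: better LRU candidate */
    }
    return victim;
}

int main(void) {
    struct block set[WAYS] = {
        {1, 0x10, 7}, {1, 0x20, 3}, {1, 0x30, 9}, {1, 0x40, 5}
    };
    printf("evict way %d\n", choose_victim(set));     /* way 1: oldest access time */
    return 0;
}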

What happens on a write? Writes make up about 21% of data cache traffic. Reads pose no problem: the block can be read from the cache at the same time that the tag is read and compared. Writes are harder: first, a block cannot be modified until the tag has been checked to see whether the address is a hit; second, the processor specifies the size of the write, so only the corresponding portion of the block may be changed. In addition, the problem of cache coherence has to be solved. Two different strategies are used: Write through - the information is written both to the block in the cache and to the block in the lower-level memory. Write back - the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.

What happens on a write? (cont.) Multiprocessors and I/O pull in both directions: write back is attractive for processor caches because it reduces memory traffic, while write through keeps the cache consistent with the lower levels of the memory hierarchy. When the CPU must wait for writes to complete during write through, the CPU is said to write stall (this can be reduced by introducing write buffers). Since the data are not needed on a write, there are two options on a write miss: Write allocate - the block is allocated on a write miss, followed by the write hit actions above. No-write allocate - this apparently unusual alternative leaves the cache unaffected by write misses; the block is modified only in the lower-level memory.
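To make the four policy combinations concrete, here is a hedged C sketch (assumed structure, not from the slides) of the decision logic on a write, parameterized by write through vs. write back and write allocate vs. no-write allocate; the cache/memory helpers are trivial stand-ins for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* Trivial stand-ins for real cache/memory operations (assumed, illustration only). */
static bool cache_lookup(unsigned addr)           { (void)addr; return false; }
static void cache_fill(unsigned addr)             { printf("fill block of %#x\n", addr); }
static void cache_write(unsigned addr, int data)  { printf("cache write %#x=%d\n", addr, data); }
static void cache_mark_dirty(unsigned addr)       { printf("mark %#x dirty\n", addr); }
static void memory_write(unsigned addr, int data) { printf("memory write %#x=%d\n", addr, data); }

/* Decision logic on a CPU write, covering the four policy combinations. */
static void handle_write(unsigned addr, int data, bool write_back, bool write_allocate) {
    if (!cache_lookup(addr)) {              /* write miss */
        if (!write_allocate) {              /* no-write allocate: cache unaffected */
            memory_write(addr, data);
            return;
        }
        cache_fill(addr);                   /* write allocate: fetch block, then act as on a hit */
    }
    cache_write(addr, data);                /* write hit (or newly allocated block) */
    if (write_back)
        cache_mark_dirty(addr);             /* defer the memory update until replacement */
    else
        memory_write(addr, data);           /* write through: update the lower level now */
}

int main(void) {
    handle_write(0x100, 42, /*write_back=*/true,  /*write_allocate=*/true);
    handle_write(0x200, 7,  /*write_back=*/false, /*write_allocate=*/false);
    return 0;
}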

Cache performance - the miss rate approach. The miss rate is independent of the speed of the hardware; however, like instruction count, it can be misleading. A better measure of memory hierarchy performance is the average memory access time:

Average memory access time = Hit time + Miss rate * Miss penalty

where the hit time is the time to hit in the cache. The components of average access time can be measured either in absolute time or in a number of clock cycles. It is still an indirect measure of performance.

An example - question. Which has the lower miss rate: a 16 KB instruction cache together with a 16 KB data cache, or a 32 KB unified cache? (Misses per 1000 instructions are 43.3 for the unified 32 KB cache, and 3.82 and 40.9 for the 16 KB instruction and data caches respectively; the percentage of instruction references is about 74%.) Assume that 36% of the instructions are data transfers, a hit takes 1 clock cycle, and the miss penalty is 100 clock cycles. A load or store hit takes 1 extra clock cycle on the unified cache, since there is only one cache port to satisfy two simultaneous requests. What is the average memory access time in each case? Additionally, assume write-through caches with a write buffer and ignore stalls due to the write buffer.

An example - solution. Let's convert misses per 1000 instructions into miss rates, using the following equation:

Miss rate = (Misses per 1000 instructions / 1000) / (Memory accesses / Instruction)

We get a miss rate of (3.82 / 1000) / 1.00 = 0.004 for the 16 KB instruction cache, (40.9 / 1000) / 0.36 = 0.114 for the 16 KB data cache, and (43.3 / 1000) / (1 + 0.36) = 0.0318 for the unified cache.

The overall miss rate for the split caches = 74% * 0.004 + 26% * 0.114 = 0.0324 (computed with the unrounded miss rates).

So a 32 KB unified cache has a slightly lower effective miss rate than two 16 KB caches.

An example - solution (cont.) The average memory access time formula can be divided into instruction and data accesses:

Average memory access time = % instructions * (Hit time + Instruction miss rate * Miss penalty) + % data * (Hit time + Data miss rate * Miss penalty)

Average memory access time for the unified cache = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318 * 100) = 4.44

Average memory access time for the split caches = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24

Thus the split caches, despite the higher miss rate, have a better average memory access time.

Impact of caches on performance I. Question: Consider an in-order execution computer. Assume the cache miss penalty is 100 clock cycles and all instructions normally take 1.0 clock cycles. Assume the average miss rate is 2%, there is an average of 1.5 memory references per instruction, and the average number of cache misses per 1000 instructions is 30. What is the impact on performance when the behavior of the cache is included? Calculate the impact using both misses per instruction and the miss rate.

Impact of caches on performance II.

CPU time = IC * (CPI_execution + Memory stall clock cycles / Instruction) * Clock cycle time

The performance including cache misses, computed with misses per instruction, is

CPU time with cache = IC * (1.0 + (30/1000) * 100) * Clock cycle time = IC * 4.00 * Clock cycle time

The performance computed with the miss rate is

CPU time = IC * (CPI_execution + Miss rate * (Memory accesses / Instruction) * Miss penalty) * Clock cycle time

CPU time with cache = IC * (1.0 + 1.5 * 2% * 100) * Clock cycle time = IC * 4.00 * Clock cycle time

The CPU time increases fourfold, with CPI rising from 1.00 for a perfect cache to 4.00 for a cache that can miss.

Cache misses' impact on performance. Cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock: The lower the CPI_execution, the higher the relative impact of a fixed number of cache miss clock cycles. When calculating CPI, the cache miss penalty is measured in CPU clock cycles per miss. Therefore, even if the memory hierarchies of two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence a higher memory portion of CPI.

Reducing cache miss penalty. Technology trends have improved the speed of processors faster than that of DRAMs, making the relative cost of miss penalties increase over time. One way to reduce the miss penalty is to add another level of cache between the original cache and main memory; in a sense this makes the cache both faster and larger. The first-level cache can be small enough to match the clock cycle time of the fast CPU, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory.

Two-level caches. Let's define the average memory access time for a two-level cache (L1 and L2 refer to the first and second levels of cache, respectively):

Average memory access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1

where Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2. Then

Average memory access time = Hit time_L1 + Miss rate_L1 * (Hit time_L2 + Miss rate_L2 * Miss penalty_L2)

The local miss rate is simply the number of misses in a cache divided by the total number of memory accesses to this cache. The global miss rate is the number of misses in the cache divided by the total number of memory accesses generated by the CPU.

Average memory stalls per instruction = Misses per instruction_L1 * Hit time_L2 + Misses per instruction_L2 * Miss penalty_L2
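A small C sketch (illustrative, not from the slides) that evaluates these two formulas; the numbers in main() are the ones used in the example that follows.

#include <stdio.h>

/* Average memory access time for a two-level cache. */
static double amat_two_level(double hit_l1, double miss_rate_l1,
                             double hit_l2, double miss_rate_l2_local,
                             double miss_penalty_l2) {
    double miss_penalty_l1 = hit_l2 + miss_rate_l2_local * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Average memory stall cycles per instruction. */
static double stalls_per_instr(double misses_per_instr_l1, double hit_l2,
                               double misses_per_instr_l2, double miss_penalty_l2) {
    return misses_per_instr_l1 * hit_l2 + misses_per_instr_l2 * miss_penalty_l2;
}

int main(void) {
    /* Values from the example below: 40 L1 misses and 20 L2 misses per 1000 references,
       hit times of 1 and 10 cycles, L2 miss penalty of 100 cycles, 1.5 references/instruction. */
    printf("AMAT   = %.1f cycles\n", amat_two_level(1.0, 0.04, 10.0, 0.50, 100.0));      /* 3.4 */
    printf("stalls = %.1f cycles/instr\n", stalls_per_instr(0.060, 10.0, 0.030, 100.0)); /* 3.6 */
    return 0;
}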

An example. Question: Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates? Assume the L2 miss penalty is 100 clock cycles, the L2 hit time is 10 clock cycles, the L1 hit time is 1 clock cycle, and there are 1.5 memory references per instruction. What is the average memory access time and what are the average stall cycles per instruction? Ignore the impact of writes.

Answer: The miss rate for the first-level cache is 40/1000 (4%). The local miss rate for the second-level cache is 20/40 (50%). The global miss rate of the second-level cache is 20/1000 (2%). Thus

Average memory access time = Hit time_L1 + Miss rate_L1 * (Hit time_L2 + Miss rate_L2 * Miss penalty_L2) = 1 + 4% * (10 + 50% * 100) = 1 + 4% * 60 = 3.4 clock cycles.

An example (cont.) How many misses do we get per instruction? We multiply the misses per 1000 memory references by 1.5 to get the number of misses per 1000 instructions: for L1 we get 40 * 1.5 = 60 misses and for L2 we get 20 * 1.5 = 30 misses. Assuming that misses are equally distributed between instructions and data:

Average memory stalls per instruction = Misses per instruction_L1 * Hit time_L2 + Misses per instruction_L2 * Miss penalty_L2 = (60/1000) * 10 + (30/1000) * 100 = 3.6 clock cycles

The next example. Question: What is the impact of second-level cache associativity on its miss penalty, given: hit time for a direct-mapped L2 = 10 clock cycles; two-way set associativity increases the hit time by 0.1 clock cycles, to 10.1 clock cycles; local miss rate for the direct-mapped L2 = 25%; local miss rate for the two-way set associative L2 = 20%; L2 miss penalty = 100 clock cycles.

Answer: For a direct-mapped second-level cache, the first-level cache miss penalty is

Miss penalty_1-way L2 = 10 + 25% * 100 = 35.0 clock cycles.

Adding the cost of associativity, the miss penalty for the two-way L2 cache would be 10.1 + 20% * 100 = 30.2 clock cycles. In practice the second-level hit time must be an integral number of clock cycles, so the improvement is either

Miss penalty_2-way L2 = 10 + 20% * 100 = 30.0 clock cycles, or
Miss penalty_2-way L2 = 11 + 20% * 100 = 31.0 clock cycles.

Three categories of misses. Compulsory - the very first access to a block cannot be in the cache, so the block must be brought into the cache (also called cold-start misses). Capacity - if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved. Conflict - if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set.

Reducing miss rate - the classical approach. The simplest way to reduce the miss rate is to increase the block size.

Question: Assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. That is, it can supply 16 bytes in 82 clock cycles, 32 bytes in 84 clock cycles, and so on. Calculate the average memory access time for different cache and block sizes.

Answer: Average memory access time = Hit time + Miss rate * Miss penalty. If we assume the hit time is 1 clock cycle independent of block size, then for a 16-byte block in a 4 KB cache we get 1 + (8.57% * 82) = 8.02 clock cycles, and for a 32-byte block in a 256 KB cache we get 1 + (0.7% * 84) = 1.58 clock cycles. Based on similar calculations we can choose the block size with the smallest average memory access time for each cache size (for example, 32 bytes for a 4 KB cache or 64 bytes for larger caches).
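The following C sketch (assumed, not from the slides) shows how such a table can be generated: the miss penalty follows the 80-cycle-overhead, 16-bytes-per-2-cycles memory described above, while the miss rates for each cache and block size would have to come from measurement (the two values filled in are the ones quoted above).

#include <stdio.h>

/* Miss penalty for a given block size: 80 cycles of overhead,
   then 16 bytes delivered every 2 clock cycles. */
static double miss_penalty(int block_bytes) {
    return 80.0 + 2.0 * (block_bytes / 16.0);
}

static double amat(double hit_time, double miss_rate, int block_bytes) {
    return hit_time + miss_rate * miss_penalty(block_bytes);
}

int main(void) {
    /* Measured miss rates would go here; these two are the values from the example. */
    printf("4 KB cache, 16-byte blocks:   %.2f cycles\n", amat(1.0, 0.0857, 16)); /* ~8.0 */
    printf("256 KB cache, 32-byte blocks: %.2f cycles\n", amat(1.0, 0.0070, 32)); /* ~1.6 */
    return 0;
}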

Reducing miss rate - compiler optimizations.

Loop interchange:

for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

becomes

for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];

The interchanged loops access the array in the order it is laid out in memory (row-major in C), improving spatial locality.

Blocking - for example, matrix multiplication (see the sketch below).
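Since the slide only names blocking, here is a hedged C sketch of a blocked (tiled) matrix multiplication; the matrix size N and tile size B are assumptions, with B chosen so that the tiles being reused fit in the cache.

#define N 512   /* matrix dimension (assumed) */
#define B 32    /* tile size, chosen so the working tiles fit in the cache (assumed) */

/* C = C + A * B, processed tile by tile so that each tile stays resident in the cache. */
void matmul_blocked(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double sum = c[i][j];
                    for (int k = kk; k < kk + B; k++)
                        sum += a[i][k] * b[k][j];   /* reuses the cached B x B tiles */
                    c[i][j] = sum;
                }
}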

Organization of main memory to improve performance. Performance measures of main memory emphasize both latency and bandwidth (the number of bytes read or written per unit of time). Assume the performance of the basic memory organization is: 4 clock cycles to send the address, 56 clock cycles for the access time per word, 4 clock cycles to send a word of data. Given a cache block of 4 words, where a word is 8 bytes, the miss penalty is 4 * (4 + 56 + 4) = 256 clock cycles, with a memory bandwidth of 1/8 byte per clock cycle.

Organization of main memory to improve performance - wider main memory. First-level caches are often organized with a physical width of 1 word because most CPU accesses are that size. Doubling the width of the cache and the memory therefore doubles the memory bandwidth. With a memory two words wide, the miss penalty in our example would drop from 256 clock cycles to 128: we need half the memory accesses, and the bandwidth rises to 1/4 byte per clock cycle.

Organization of main memory to improve performance - simple interleaved memory. Question: What can interleaving and wider memory buy? Consider the following description of a computer and its cache performance: block size = 1 word, memory bus width = 1 word, miss rate = 3%, memory accesses per instruction = 1.2, cache miss penalty = 64 cycles, average cycles per instruction = 2. If we change the block size to 2 words, the miss rate falls to 2%, and a 4-word block has a miss rate of 1.2%. What is the improvement in performance of interleaving two ways and four ways versus doubling the width of the memory and the bus, assuming the access times from the previous example?

Organization of main memory to improve performance - answer. The CPI for the base computer using 1-word blocks is 2 + (1.2 * 3% * 64) = 4.30.

Increasing the block size to 2 words gives the following options:
64-bit bus and memory, no interleaving: 2 + (1.2 * 2% * 2 * 64) = 5.07
64-bit bus and memory, interleaving: 2 + (1.2 * 2% * (4 + 56 + 8)) = 3.63
128-bit bus and memory, no interleaving: 2 + (1.2 * 2% * 1 * 64) = 3.54

Thus, doubling the block size slows down the straightforward implementation. If we increase the block size to four words we obtain:
64-bit bus and memory, no interleaving: 2 + (1.2 * 1.2% * 4 * 64) = 5.69
64-bit bus and memory, interleaving: 2 + (1.2 * 1.2% * (4 + 56 + 16)) = 3.09
128-bit bus and memory, no interleaving: 2 + (1.2 * 1.2% * 2 * 64) = 3.84

Again the larger blocks hurt performance in the simple case, although the interleaved 64-bit memory is now the fastest.
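A compact C sketch (illustrative, not from the slides) that reproduces these CPI numbers from the access-time model of 4 cycles for the address, 56 for the access, and 4 per word transferred over the bus; the miss penalties passed in main() are the ones used above.

#include <stdio.h>

/* CPI = base CPI + (memory accesses / instruction) * miss rate * miss penalty */
static double cpi(double miss_rate, double miss_penalty) {
    return 2.0 + 1.2 * miss_rate * miss_penalty;
}

int main(void) {
    /* Miss penalty models (cycles): without interleaving, each word costs 4 + 56 + 4 = 64
       per bus-width access; with interleaving, the accesses overlap: 4 + 56 + 4 * words. */
    printf("1-word block, 64-bit bus:          %.2f\n", cpi(0.030, 64));              /* 4.30 */
    printf("2-word block, 64-bit, plain:       %.2f\n", cpi(0.020, 2 * 64));          /* 5.07 */
    printf("2-word block, 64-bit, interleaved: %.2f\n", cpi(0.020, 4 + 56 + 4 * 2));  /* 3.63 */
    printf("2-word block, 128-bit bus:         %.2f\n", cpi(0.020, 1 * 64));          /* 3.54 */
    printf("4-word block, 64-bit, interleaved: %.2f\n", cpi(0.012, 4 + 56 + 4 * 4));  /* 3.09 */
    return 0;
}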

Virtual memory. Virtual memory divides physical memory into blocks and allocates them to different processes. Two levels of the memory hierarchy are managed by virtual memory: DRAM main memory and magnetic disks. Virtual memory provides a shared, protected memory space, automatically manages the memory hierarchy and simplifies loading the program for execution: the program can be placed anywhere in physical memory or on disk simply by changing the mapping between them (relocation). There are two classes of virtual memory: paging, with fixed-size blocks (a power of 2), and segmentation, with variable-size blocks.

Some useful definitions. Page or segment is used for a memory block. Page fault or address fault is used for a miss. The CPU issues virtual addresses that are translated by a combination of hardware and software into physical addresses, which access main memory. This process is called memory mapping or address translation.

Paging versus segmentation.
Words per address: page - one; segment - two (segment and offset).
Programmer visible? page - invisible to the application programmer; segment - may be visible to the application programmer.
Replacing a block: page - trivial (all blocks are the same size); segment - hard (must find a contiguous, variable-size, unused portion of main memory).
Memory use inefficiency: page - internal fragmentation (unused portion of a page); segment - external fragmentation (unused pieces of main memory).
Efficient disk traffic: page - yes (adjust page size to balance access time and transfer time); segment - not always (small segments may transfer just a few bytes).

Where can a block be placed in main memory? The miss penalty for virtual memory involves access to a rotating magnetic storage device and is therefore very high (1,000,000 to 10,000,000 clock cycles). So, given the choice between a lower miss rate and a simpler placement algorithm, the lower miss rate is chosen because of the enormous miss penalty. Accordingly, operating systems allow blocks to be placed anywhere in main memory (fully associative).

How is a block found if it is in main memory? (Figure: the virtual address is split into a virtual page/segment number and an offset; the page or segment table maps the virtual number to the corresponding page or segment in main memory, and the offset selects the location within it.)

Which block should be replaced on a virtual memory miss? Almost all operating systems try to replace the least-recently used (LRU) block because, if the past predicts the future, that is the block least likely to be needed. For this purpose a use bit (reference bit), which is set whenever a page is accessed, is employed. The operating system periodically clears the use bits and later records them, so it can determine which pages were used during a particular time period.

What happens on a write? The level below main memory consists of rotating magnetic disks that take millions of clock cycles to access. So, operating systems avoid writing through main memory to disk on every store by the CPU; the write strategy is always write back. A dirty bit allows blocks to be written to disk only if they have been altered since being read from the disk.

Translation lookaside buffer (TLB). (Figure: each TLB entry holds an address space number (ASN), protection bits, a valid bit, a tag derived from the virtual page number, and the corresponding physical address; the virtual address, consisting of the virtual page number and the page offset, is compared against the entries, and a 128:1 multiplexor selects the matching physical address.)
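To show what such a lookup does, here is a hedged C sketch of a fully associative TLB (the structure, entry count, and page size are assumptions, not taken from the figure): each entry is compared against the virtual page number, and on a hit the physical page is concatenated with the page offset.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 128   /* assumed, matching the 128:1 selection in the figure */
#define PAGE_BITS   13    /* assumed 8 KB pages */

struct tlb_entry {
    bool     valid;
    uint32_t asn;         /* address space number of the owning process */
    uint64_t tag;         /* virtual page number */
    uint64_t phys_page;   /* physical page number */
};

/* Translate a virtual address; return true on a TLB hit. */
bool tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES],
                uint64_t vaddr, uint32_t asn, uint64_t *paddr) {
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {       /* hardware compares all entries in parallel */
        if (tlb[i].valid && tlb[i].asn == asn && tlb[i].tag == vpn) {
            *paddr = (tlb[i].phys_page << PAGE_BITS) | offset;
            return true;                          /* hit: physical address produced */
        }
    }
    return false;                                 /* miss: walk the page table or take a fault */
}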