COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University

1 COSC4201 Chapter 5: Memory Hierarchy Design. Prof. Mokhtar Aboelaze, York University

2 Memory Hierarchy
The gap between CPU performance and main memory performance has been widening, so memory access instructions become a bottleneck for higher-performance CPUs.
The memory hierarchy is organized into several levels, with the smaller, faster, and more expensive levels closer to the CPU: registers, then the primary cache (L1), then secondary cache levels (L2, L3), then main memory, then mass storage (virtual memory).
Each level of the hierarchy holds a subset of the level below: data found in a level is also found in the level below, but at lower speed. Each level maps addresses from a larger memory onto a smaller one.
This organization is greatly aided by the principle of locality, both temporal and spatial: programs tend to reuse data and instructions that they have used recently, or those stored nearby, which gives each program a working set.

3 Caches
Different processors have different requirements:
- Desktop: typically runs one application at a time; concerned with average memory latency.
- Servers: run many applications for hundreds of users; concerned about latency and throughput; need protection.
- Embedded processors: almost no OS, or a simple one; worst-case behavior is important; small memory, no disk; often running the same program.
(2002 York University)

4 Processor-Memory Gap
[Figure: CPU and DRAM performance over time]
- Microprocessor performance improves about 60%/yr; DRAM about 7%/yr.
- The processor-memory performance gap grows about 50% per year.
- Cost of cache: 30-60% of chip area, 70-80% of transistors.

5 Principle of Locality
Programs usually spend 90% of their time in 10% of the code (the 90/10 rule).
Two types of locality:
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items whose addresses are close will tend to be referenced soon.

6 Levels of Memory Hierarchy
[Figure: levels of the memory hierarchy]

7 Definitions
- Block: the smallest piece of information transferred between two levels (line, page, ...)
- Hit, hit rate
- Miss, miss rate
- Miss penalty

8 Performance
- Miss rate (MR): the fraction of memory accesses that are not found in the cache.
- Average memory access time = MR * T_miss + (1 - MR) * T_hit
- T_hit and T_miss: access time for a hit or a miss.
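The average memory access time formula on this slide can be written as a one-line function. This is a minimal sketch; the name `amat` and its parameters are illustrative and not from the slides.

```c
/* Average memory access time, using the slide's formula.
 * miss_rate is a fraction in [0, 1]; t_hit and t_miss are the access
 * times (in cycles or ns) for a hit and for a miss. */
double amat(double miss_rate, double t_hit, double t_miss)
{
    return miss_rate * t_miss + (1.0 - miss_rate) * t_hit;
}
```

If T_miss is taken as T_hit plus the miss penalty, this is algebraically the same as hit time + miss rate * miss penalty, the form used in the later examples. For instance, `amat(0.02, 1.0, 51.0)` models a 2% miss rate, a 1-cycle hit, and a 50-cycle penalty.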

9 4 Questions
- Q1: Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, direct mapped.
- Q2: How is a block found if it is in the upper level? (Block identification) Tag/block.
- Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU.
- Q4: What happens on a write? (Write strategy) Write back or write through (with write buffer).

10 Cache Placement
Placement strategies, that is, the mapping of a main memory data block onto cache block frame addresses, divide caches into three organizations:
1. Direct mapped cache: a block can be placed in one location only, given by:
   (Block address) MOD (Number of blocks in cache)
2. Fully associative cache: a block can be placed anywhere in the cache.
3. Set associative cache: a block can be placed in a restricted set of places, or cache block frames. A set is a group of block frames in the cache. A block is first mapped onto a set and can then be placed anywhere within that set. The set is chosen by:
   (Block address) MOD (Number of sets in cache)
   If there are n blocks in a set, the cache placement is called n-way set associative.
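The three placement rules above reduce to a single modulo operation once the number of sets is known. The sketch below is illustrative only: `set_index` and its parameters are hypothetical names, and it assumes the associativity is nonzero and divides the number of blocks.

```c
#include <stddef.h>

/* Set index for a block under the slide's rule:
 * (block address) MOD (number of sets in cache).
 * num_blocks is the number of block frames in the cache and
 * assoc is the number of frames per set (the n in "n-way").
 * assoc == 1 gives direct mapped; assoc == num_blocks gives
 * fully associative (a single set, so the result is always 0). */
size_t set_index(size_t block_addr, size_t num_blocks, size_t assoc)
{
    size_t num_sets = num_blocks / assoc;
    return block_addr % num_sets;
}
```

For an 8-block cache, block address 12 maps to frame 4 when direct mapped (`set_index(12, 8, 1)`), to set 0 when 2-way set associative, and to the single set 0 when fully associative.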

11 Direct Mapped Cache
[Figure: the address (showing bit positions) is split into tag, index, and byte offset; the index selects an entry holding a valid bit, tag, and data; outputs are Hit and Data. Each block = one word.]

12 Direct Mapped Cache
[Figure: direct mapped cache with 4K entries, each block = four words (128 bits of data) with a valid bit and a 16-bit tag. The address (showing bit positions) is split into tag field, index field, block offset (word select), and byte offset; a mux selects the 32-bit word; outputs are Hit and Data.]

13 Set Associativity
[Figure: the tag and data entries of a cache organized as one-way set associative (direct mapped), two-way set associative, four-way set associative, and eight-way set associative (fully associative).]

15 Set Associative Cache
Each block frame in the cache has an address tag. The tags of every cache block that might contain the required data are checked in parallel. A valid bit is added to the tag to indicate whether the entry contains a valid address.
The address from the CPU to the cache is divided into:
- A block address, further divided into:
  - An index field to choose a block set in the cache (no index field when fully associative).
  - A tag field to search and match addresses in the selected set.
- A block offset to select the data from the block.

16 Set Associative Cache
Physical address generated by the CPU: | Tag | Index | Block offset |, where the tag and index together form the block address.
- Block offset size = log2(block size)
- Index size = log2(total number of blocks / associativity)
- Tag size = address size - index size - offset size
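The three size formulas above can be checked with a short routine. This is a sketch under stated assumptions: the names `ilog2`, `cache_fields`, and `field_widths` are hypothetical, and block size and set count are assumed to be powers of two.

```c
/* Integer log2; x is assumed to be a power of two. */
static unsigned ilog2(unsigned long x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

struct cache_fields {
    unsigned offset_bits;
    unsigned index_bits;
    unsigned tag_bits;
};

/* Field widths from the slide's formulas:
 *   offset = log2(block size)
 *   index  = log2(total number of blocks / associativity)
 *   tag    = address size - index size - offset size */
struct cache_fields field_widths(unsigned addr_bits,
                                 unsigned long block_bytes,
                                 unsigned long num_blocks,
                                 unsigned long assoc)
{
    struct cache_fields f;
    f.offset_bits = ilog2(block_bytes);
    f.index_bits = ilog2(num_blocks / assoc);
    f.tag_bits = addr_bits - f.index_bits - f.offset_bits;
    return f;
}
```

As a usage example, a direct mapped cache with 4K blocks of four words (16 bytes), assuming a 32-bit address, gives `field_widths(32, 16, 4096, 1)`: a 4-bit offset, a 12-bit index, and a 16-bit tag.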

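17 4-way Set Associative Cache
[Figure: the address is split into a tag field and an index field; the index selects one set of four entries (V, Tag, Data); the four tags are compared in parallel and a 4-to-1 multiplexor selects the data; outputs are Hit and Data.]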

18 Alpha AXP Data Cache
[Figure not reproduced]

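19 Alpha AXP Data Cache
[Figure: the CPU address is split into a tag <29>, an index <9>, and a block offset <6>; each entry holds a valid bit <1>, a tag <29>, and data; two tag comparators (=?) feed a 2:1 MUX that drives data out; memory and a victim cache sit below the cache. Labels 1-4 mark the steps of an access.]
The CPU produces a 48-bit address, which is translated to a 44-bit physical address (the Alpha has a 64-bit virtual address).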

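21 Example
Which has a lower miss rate: a 16KB instruction cache plus a 16KB data cache, or a combined 32KB cache? The miss rates are 0.64% (instruction), 6.47% (data), and 1.99% (combined). Assume a hit takes 1 cycle and a miss costs 50 cycles; 75% of memory references are instruction fetches.
Miss rate of split cache = 0.75 * 0.64% + 0.25 * 6.47% = 2.1%, slightly worse than 1.99% for the combined cache.
But what about average memory access time?
Split cache: 75% * (1 + 0.64% * 50) + 25% * (1 + 6.47% * 50) = 2.05 cycles.
Combined cache: 75% * (1 + 1.99% * 50) + 25% * (1 + 1 + 1.99% * 50) = 2.24 cycles. The extra cycle on data accesses comes from the structural hazard: a combined cache with a single port cannot serve an instruction fetch and a data access in the same cycle.
So the split cache has the lower average memory access time despite its higher miss rate.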
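The arithmetic in this example can be reproduced with a small helper. This is a sketch, not part of the original slides: `amat_mix` and its parameters are hypothetical, and it assumes that the structural hazard in the combined cache costs exactly one extra cycle on every data access.

```c
/* Average memory access time in cycles for a mix of instruction
 * fetches and data accesses.
 * inst_frac is the fraction of references that are instruction fetches.
 * extra_data_cycles models the structural hazard of a single-ported
 * combined cache (assumed to add a fixed stall to each data access);
 * pass 0 for a split cache. */
double amat_mix(double inst_frac,
                double inst_miss_rate, double data_miss_rate,
                double hit_cycles, double miss_penalty,
                double extra_data_cycles)
{
    double inst = hit_cycles + inst_miss_rate * miss_penalty;
    double data = hit_cycles + extra_data_cycles
                + data_miss_rate * miss_penalty;
    return inst_frac * inst + (1.0 - inst_frac) * data;
}
```

With the slide's numbers, `amat_mix(0.75, 0.0064, 0.0647, 1.0, 50.0, 0.0)` models the split cache and `amat_mix(0.75, 0.0199, 0.0199, 1.0, 50.0, 1.0)` models the combined cache.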

22 Example
Miss penalty of 50 cycles, CPI of 2 for a perfect cache, 1.33 memory references per instruction, and a miss rate of 2%.
CPU time = IC * (2 + 1.33 * 2% * 50) * T_clock = IC * 3.33 * T_clock
Without a cache, it would be IC * (2 + 50 * 1.33) * T_clock = IC * 68.5 * T_clock

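23 Example
Compare two different organizations. CPI for a perfect cache = 2.0, 1.0 ns clock cycle, 1.5 memory references per instruction.
- Direct mapped: miss rate = 1.4%
- 2-way set associative: miss rate = 1.0%, but the clock cycle must be extended by 25% (the MUX is on the critical path)
In both cases the miss penalty is 75 ns.
Average memory access time:
- Direct mapped = 1.0 + 0.014 * 75 = 2.05 ns
- 2-way = 1.0 * 1.25 + 0.010 * 75 = 2.00 ns
CPU time = IC * (CPI_exec + CPI_misses) * Tc
         = IC * (CPI_exec + Misses/Inst * MP) * Tc
         = IC * [CPI_exec * Tc + (Mem refs/Inst) * MR * MP * Tc], where MP * Tc = 75 ns here
- Direct mapped = IC * (2 * 1.0 + 1.5 * 0.014 * 75) = 3.58 * IC
- 2-way = IC * (2 * 1.0 * 1.25 + 1.5 * 0.010 * 75) = 3.63 * IC
The 2-way cache has the lower average memory access time, but the direct mapped cache gives the lower CPU time.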
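The CPU time comparison in this example can be sketched as a function of the five inputs. The name `cpu_time_per_inst_ns` is hypothetical, and the miss penalty is passed directly in nanoseconds, as the slide does with its 75 ns figure.

```c
/* CPU time per instruction in ns:
 *   CPI_exec * Tc + (memory refs per instruction) * miss rate * miss penalty
 * with the miss penalty given directly in ns. Multiply by the
 * instruction count (IC) to get total CPU time. */
double cpu_time_per_inst_ns(double cpi_exec, double cycle_ns,
                            double refs_per_inst, double miss_rate,
                            double miss_penalty_ns)
{
    return cpi_exec * cycle_ns
         + refs_per_inst * miss_rate * miss_penalty_ns;
}
```

The direct mapped case is `cpu_time_per_inst_ns(2.0, 1.0, 1.5, 0.014, 75.0)`; the 2-way case stretches the clock to 1.25 ns and uses a miss rate of 0.010.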

24 Improving Cache Performance
1. Reducing the miss rate
2. Reducing the miss penalty
3. Reducing the time to hit

25 Reducing the Cache Miss
Classifying misses: the 3 Cs
- Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (These misses occur even in an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved.
- Conflict: if the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.

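26 Classifying Misses
[Figure: miss rate versus cache size (KB), split into compulsory, capacity, and conflict components. The conflict component is shown per associativity: 8-way (the misses added going from fully associative to 8-way), 4-way (going from 8-way to 4-way), 2-way, and 1-way.]
Misses in a 1-way set associative cache of size X = misses in a 2-way set associative cache of size X/2.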

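27 Reducing Cache Miss: Larger Block Size
A larger block size reduces misses up to a point, then starts to increase them. Larger blocks take advantage of spatial locality, but they decrease the number of different blocks in the cache and increase the miss penalty, since it takes longer to load a block from memory into the cache.
Miss rate by cache size (columns), one row per block size. The block size labels and the 4K values did not survive extraction.
  Block size | 4K     | 16K   | 64K   | 256K
  (lost)     | (lost) | 3.94% | 2.04% | 1.09%
  (lost)     | (lost) | 2.87% | 1.35% | 0.70%
  (lost)     | (lost) | 2.64% | 1.06% | 0.51%
  (lost)     | (lost) | 2.77% | 1.02% | 0.49%
  (lost)     | (lost) | 3.29% | 1.15% | 0.49%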

28 Reducing Cache Miss: Larger Block Size
Using the data from the previous slide, suppose the memory takes 80 cycles of overhead and then delivers 16 bytes every 2 cycles. Compare block sizes of 64 and 128 bytes for a 64K cache.
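One way to work this comparison is to compute the miss penalty for each block size and plug it into the average memory access time. The sketch below rests on assumptions not stated on the slide: a 1-cycle hit time, and miss rates of 1.06% (64-byte blocks) and 1.02% (128-byte blocks), taken from the 64K column of the previous slide's table on the assumption that its third and fourth rows correspond to those block sizes. The function names are hypothetical.

```c
/* Miss penalty in cycles for a memory with a fixed overhead that then
 * delivers chunk_bytes every chunk_cycles. */
unsigned miss_penalty_cycles(unsigned block_bytes,
                             unsigned overhead_cycles,
                             unsigned chunk_bytes,
                             unsigned chunk_cycles)
{
    /* Number of transfers needed, rounded up. */
    unsigned chunks = (block_bytes + chunk_bytes - 1) / chunk_bytes;
    return overhead_cycles + chunks * chunk_cycles;
}

/* Average memory access time in cycles: hit time + miss rate * penalty. */
double amat_cycles(double hit_cycles, double miss_rate, unsigned penalty)
{
    return hit_cycles + miss_rate * penalty;
}
```

Under those assumptions, a 64-byte block has a penalty of 80 + 4 * 2 = 88 cycles and an access time of about 1.93 cycles, while a 128-byte block has a penalty of 96 cycles and an access time of about 1.98 cycles, so the 64-byte block comes out slightly ahead even though its miss rate is higher.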

29 Reducing Cache Miss: Larger Caches
- Longer hit time and higher cost
- More popular in second-level and off-chip caches

30 Reducing Cache Miss: Higher Associativity
- In practice, 8-way set associative is as good as fully associative.
- The 2:1 cache rule of thumb states that a direct mapped cache of size N has the same miss rate as a 2-way set associative cache of size N/2.
- One problem with associative caches is that the clock cycle may have to increase (at the least, a MUX is needed to choose which block of the set to use).
- In practice, going from direct mapped to 2-way set associative costs about a 10% increase in clock time for TTL or ECL, and 2% for custom CMOS.

31 Reducing Cache Miss: Victim Caches
- Add a small fully associative cache between the cache and the memory.
- This victim cache contains only blocks that were discarded from the original cache.
- The victim cache is checked on a miss; if the data is found there, it is swapped with the data in the cache.
- A small victim cache of 1-5 blocks is sufficient (a 4-block victim cache removed 20% to 95% of conflict misses).

32 Reducing Cache Miss: Pseudo-Associative Cache
- This cache behaves like a direct mapped cache.
- On a miss, before going to main memory, check another cache entry.
- There are two hit times, one fast and one slow.
- Way prediction could also be used: a predictor bit is added to each block to choose which way to try first. On a miss, try the other way.
- This could be used to save power by supplying power to the predicted block only.

33 Reducing Cache Miss: Prefetching
- Can be done by the compiler (inserting prefetch instructions in the code) or by the hardware.
- Prefetching can go directly into the cache, or into a buffer that can be accessed much faster than main memory.
- Timing is important: if we prefetch too close to where the data is needed, it might be too late and we may have to stall anyway.

34 Reducing Cache Miss: Compiler Optimization
- By rearranging data we can reduce cache misses.
- Merging arrays:
    Before:
      int A[N];
      int B[N];
    After:
      struct merge {
          int a;
          int b;
      };
      struct merge merged_arrays[N];
  That works if we reference the same indices of both arrays at the same time.
- Loop interchange: reference by row instead of by column if the array is stored in row-major fashion.
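Loop interchange can be illustrated with a pair of functions that do the same work in different traversal orders. This is a minimal sketch; the array dimensions, the function names, and the doubling operation are all illustrative choices, not taken from the slides.

```c
#define ROWS 64    /* illustrative dimensions */
#define COLS 256

/* Before interchange: the inner loop walks down a column. In a
 * row-major C array, consecutive accesses are COLS ints apart, so each
 * access tends to touch a different cache block. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row, so consecutive
 * accesses are adjacent in memory and use every word of each block
 * that is brought into the cache. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both functions produce the same result; only the order of memory accesses differs, which is what makes the transformation safe for a compiler to apply here.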

35 Reducing Cache Miss: Compiler Optimization (Loop Fusion)
Before:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = 1 / b[i][j] * c[i][j];
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            d[i][j] = a[i][j] * c[i][j];
After:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            a[i][j] = 1 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] * c[i][j];   /* a and c are still in the cache */
        }

36 Reducing Cache Miss: Compiler Optimization (Blocking)
Blocking reduces misses by exploiting temporal locality.
Suppose we are multiplying two matrices, starting with the first row times the first column. We load the first column, then the second column (by which point the first may be gone from the cache), and so on to the last. For the next row we need the first column again (cache miss).
By blocking, we can fully use the data we brought into the cache before it is replaced. See the textbook for an example.
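A blocked matrix multiply in the usual textbook style looks roughly like the sketch below. The matrix size `N`, the blocking factor `B`, and the function name are illustrative assumptions; `B` is assumed to divide `N`, and in practice it would be chosen so that the sub-blocks in use fit in the cache.

```c
#define N 64   /* matrix dimension (illustrative) */
#define B 16   /* blocking factor, assumed to divide N */

/* Computes x += y * z one B-by-B sub-block of z at a time, so the
 * sub-block is reused for every row of y before it is replaced.
 * The caller must zero x first to get x = y * z. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate partial dot product */
                }
}
```

The two outer loops pick a sub-block of z; the inner three loops then touch only B columns and B rows of it, instead of streaming through whole columns between reuses.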

37 Reducing Cache Miss Penalty: Giving Priority to Reads over Writes
- With a write-through cache, don't wait for the write to complete; send the data to a write buffer and continue reading.
- This might cause problems (a RAW hazard) if we want to read data that is still in the buffer and has not been written to memory yet.
- One solution is to wait until the buffer is empty before reading (wastes time).
- Alternatively, we can check the write buffer on our way to memory (if we find the data there, we don't go to memory).
- The same thing can happen with a write-back cache.

38 Reducing Cache Miss Penalty: Sub-block Placement
- To decrease the area occupied by tags, we want bigger blocks.
- Bigger blocks increase the miss penalty.
- One solution is to add a valid bit for each sub-block, and to transfer data between the memory and the cache at the sub-block level.
- If the tag matches, some parts of the block might be valid and other parts might not.

39 Reducing Cache Miss Penalty: Early Restart and Critical Word First
- Early restart: the CPU resumes as soon as the required word is in the cache.
- Critical word first: even better, we can request the required word from memory first.
- Both reduce the miss penalty, since we don't have to wait for the entire block to be written to the cache.

40 Reducing Cache Miss Penalty: Merging Write Buffer
- Applies to caches that use a write buffer.
- If another block is to be written, we first check the write buffer.
- If the same address exists there, we combine the two writes into one.
- If the buffer is full, the CPU must wait.

41 Reducing Cache Miss Penalty: Non-blocking Caches
- The cache continues to supply data during the miss penalty (hit under miss).
- This can be extended to hit under multiple misses.

42 Reducing Cache Miss Penalty: Second Level Cache
In this case:
  Average memory access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1
  Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2
- Local miss rate: the number of misses in a cache divided by the total number of accesses to that cache.
- Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the CPU.

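43 Reducing Cache Miss Penalty: Second Level Cache
Example: suppose that in 1000 memory references there were 40 misses in the first-level cache and 20 misses in the second-level cache.
- L1 miss rate (local and global) = 40/1000 = 4%
- L2 local miss rate = 20/40 = 50%
- L2 global miss rate = 20/1000 = 2%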
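The local and global miss rate definitions, and the two-level access time formula from the previous slide, can be sketched as follows. The struct and function names are hypothetical, and the code assumes nonzero reference and L1 miss counts. Any timing values passed to `amat_two_level` are assumptions, since the slide's example gives only miss counts.

```c
struct miss_rates {
    double l1;         /* L1 misses / CPU references */
    double l2_local;   /* L2 misses / L2 accesses (= L1 misses) */
    double l2_global;  /* L2 misses / CPU references */
};

/* Miss rates from raw counts. cpu_refs and l1_misses must be nonzero. */
struct miss_rates two_level_miss_rates(unsigned long cpu_refs,
                                       unsigned long l1_misses,
                                       unsigned long l2_misses)
{
    struct miss_rates r;
    r.l1 = (double)l1_misses / (double)cpu_refs;
    /* The L2 cache is accessed only on L1 misses. */
    r.l2_local = (double)l2_misses / (double)l1_misses;
    r.l2_global = (double)l2_misses / (double)cpu_refs;
    return r;
}

/* Average access time = Hit time L1 + Miss rate L1 * Miss penalty L1,
 * where Miss penalty L1 = Hit time L2 + local Miss rate L2 * Miss penalty L2. */
double amat_two_level(double hit_l1, double hit_l2, double penalty_l2,
                      struct miss_rates r)
{
    double penalty_l1 = hit_l2 + r.l2_local * penalty_l2;
    return hit_l1 + r.l1 * penalty_l1;
}
```

With 1000 references, 40 L1 misses, and 20 L2 misses, `two_level_miss_rates(1000, 40, 20)` gives 4%, 50%, and 2%. Assuming, purely for illustration, a 1-cycle L1 hit, a 10-cycle L2 hit, and a 100-cycle L2 miss penalty, the average access time is 1 + 0.04 * (10 + 0.5 * 100) = 3.4 cycles.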

44 Reducing Hit Time
- Small and simple cache
- Avoiding address translation: virtual caches (use the virtual address)
  - Every time the process is switched, the same virtual address refers to a different physical address, requiring the cache to be flushed (issues: protection, context switching, and aliases).
  - If a process ID (PID) is used, only the lines of that process need to be flushed.
- Pipelined writes
  - The tag must be checked before writing.
  - One technique is to pipeline writes: check the tag in one cycle, then in the next cycle write the data while checking the tag of the next write.

45 Reducing Hit Time: Pipelined Cache Access
- The Pentium can access the cache in 1 cycle, the Pentium III in 2 cycles, and the Pentium 4 in 4 cycles.
- Pipelining can increase the bandwidth for instructions, but it cannot reduce the load latency.

46 Reducing Hit Time: Trace Caches
- The Pentium 4 uses a trace cache.
- The idea is to find a sequence of instructions to load into a cache block.
