ECE468 Computer Organization and Architecture
Memory Hierarchy and Caches

The Motivation for a Memory Hierarchy
- Large memories (DRAM) are slow; small memories (SRAM) are fast.
- Make the average access time small by servicing most accesses from a small, fast memory.
- Reduce the bandwidth required of the large memory.

An Expanded View of the Memory System
- The processor (datapath + control) sits at the top of a hierarchy of memories.
- Going down the hierarchy: speed goes from fastest to slowest, size from smallest to biggest, and cost per bit from highest to lowest.

Levels of the Memory Hierarchy
- Registers: 100s of bytes, <10s of ns. Instruction operands (1-8 bytes) are staged to and from the cache by the program/compiler.
- Cache: KBytes, 10-100 ns, roughly $.01-.001/bit. Blocks (8-128 bytes) are staged to and from main memory by the cache controller.
- Main Memory: MBytes, 100 ns-1 us. Pages (512 B-4 KB) are staged to and from disk by the OS.
- Disk: GBytes, ms access, 10^-3 to 10^-4 cents/bit. Files (MBytes) are staged to and from tape by the user/operator.
- Tape: "infinite" capacity, sec-min access, about 10^-6 cents/bit.
- The upper levels are faster; the lower levels are larger.

The Principle of Locality
- A program accesses a relatively small portion of the address space at any instant of time. (Figure: probability of reference plotted over the address space 0 to 2^n - 1 is strongly peaked.) Example: 90% of execution time is spent in 10% of the code.
- Two different types of locality:
  - Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  - Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

Memory Hierarchy: Principles of Operation
- At any given time, data is copied only between adjacent levels.
- Upper level: the one closer to the processor. Smaller, faster, and uses more expensive technology.
- Lower level: the one further from the processor. Bigger, slower, and uses less expensive technology.
- Block: the minimum unit of information that can either be present or not present in the two-level hierarchy (e.g., Blk X in the upper level, Blk Y in the lower level).
Memory Hierarchy: Terminology
- Hit: the data appears in some block in the upper level (e.g., Block X).
  - Hit Rate: the fraction of memory accesses found in the upper level.
  - Hit Time: the time to access the upper level, which consists of RAM access time + the time to determine hit/miss.
- Miss: the data must be retrieved from a block in the lower level (Block Y).
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor.
- Hit Time << Miss Penalty.

Basic Terminology: Typical Values
- Block (line) size: 4 - 128 bytes
- Hit time: 1 - 4 cycles
- Miss penalty: 8 - 32 cycles (and increasing), of which access time is 6 - 10 cycles and transfer time 2 - 10 cycles
- Miss rate: 1% - 20%
- Cache size: 1 KB - 256 KB

How Does a Cache Work?
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon, so keep the most recently accessed data items closer to the processor.
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon, so move blocks consisting of contiguous words into the cache.

The Simplest Cache: Direct Mapped
- Example: a 4-byte direct mapped cache for a memory with addresses 0 through F.
- Cache Index 0 can be occupied by data from memory location 0, 4, 8, ... etc.: in general, any memory location whose two LSBs are 0s. Address<1:0> => cache index.
- Cache index = (Block Address) modulo (# cache blocks).
- Two questions: How can we tell which one is in the cache? Which one should we place in the cache?

Example: Consider an eight-word direct mapped cache.
Show the contents of the cache as it responds to a series of requests (decimal word addresses): 22, 26, 22, 26, 16, 3, 16, 18.
- Result: miss, miss, hit, hit, miss, miss, hit, miss.
- Each address maps to index = address modulo 8; e.g., 22 maps to index 6 and 26 to index 2, and the final access to 18 misses because it conflicts with 26 at index 2.

Cache Organization: Tag and Index
- Assume a 32-bit memory (byte) address and a 2**N byte direct mapped cache:
  - Cache Index: the lower N bits of the memory address.
  - Cache Tag: the upper (32 - N) bits of the memory address.
  - Example: Cache Tag 0x50, Cache Index 0x03.
- A Valid Bit is stored as part of the cache state, along with the tag and the data bytes (Byte 0 through Byte 2**N - 1).
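The eight-word direct mapped example above can be checked with a few lines of code. This is a minimal illustrative sketch (not from the slides): one tag per block, index = address modulo 8, tag = the remaining high-order bits.

```python
def simulate_direct_mapped(addresses, num_blocks=8):
    """Return the hit/miss outcome for each word address."""
    cache = [None] * num_blocks          # one tag (or None, if invalid) per block
    outcomes = []
    for addr in addresses:
        index = addr % num_blocks        # low-order bits select the block
        tag = addr // num_blocks         # high-order bits identify the word
        if cache[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            cache[index] = tag           # replace whatever was there
    return outcomes

# The request sequence from the example.
outcomes = simulate_direct_mapped([22, 26, 22, 26, 16, 3, 16, 18])
print(outcomes)
```

The final miss (address 18) is a conflict miss: 18 and 26 share index 2 even though the cache still has empty blocks.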
Cache Access Example
- At start up, all Valid bits are 0, so the first accesses miss.
- On a miss: load the data from memory, write the Cache Tag, and set the Valid bit.
- Subsequent accesses to the same locations hit.
- Sad fact of life: a lot of misses at start up. These are Compulsory Misses (cold start misses).

Definition of a Block
- Block: the cache data that has its own cache tag.
- Our previous extreme example (the 4-byte direct mapped cache) had Block Size = 1 byte:
  - It takes advantage of Temporal Locality: if a byte is referenced, it will tend to be referenced again soon.
  - It does not take advantage of Spatial Locality: if a byte is referenced, its adjacent bytes will be referenced soon.
- In order to take advantage of Spatial Locality: increase the block size.
- Question: can we say "the larger, the better" for the block size?

Example: 1 KB Direct Mapped Cache with 32 B Blocks
- For a 2**N byte cache with 2**M byte blocks:
  - The uppermost (32 - N) bits are always the Cache Tag.
  - The lowest M bits are the Byte Select (Block Size = 2**M).
- Here: address bits <31:10> are the Cache Tag (stored as part of the cache state, e.g., 0x50), bits <9:5> the Cache Index (e.g., 0x01), and bits <4:0> the Byte Select (e.g., 0x00).

Mapping an Address to a Multiword Block
- Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?
- Block address = floor(1200 / 16) = 75; cache block = 75 modulo 64 = 11.

Bits in a Cache
- How many total bits are required for a direct mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
- 64 KB = 2**14 words = 2**14 blocks. Each block holds 32 data bits, a (32 - 14 - 2)-bit tag, and 1 valid bit:
  2**14 x (32 + (32 - 14 - 2) + 1) = 2**14 x 49 = 784 Kbits.

Block Size Tradeoff
- In general, a larger block size takes advantage of spatial locality, BUT:
  - A larger block size means a larger miss penalty: it takes longer to fill up the block.
  - If the block size is too big relative to the cache size, the miss rate goes up: having fewer blocks compromises temporal locality.
- Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate.
- As block size grows, the miss penalty rises steadily; the miss rate first falls (exploiting spatial locality) and then rises again (too few blocks); the average access time therefore has a minimum between the two extremes.
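The two worked calculations above (the multiword-block mapping and the total bit count) can be reproduced directly. A brief sketch, using only numbers stated in the examples:

```python
# Which cache block does byte address 1200 map to, with 16 B blocks and 64 blocks?
byte_address = 1200
block_size = 16
num_blocks = 64
block_address = byte_address // block_size   # 1200 / 16 = 75
cache_block = block_address % num_blocks     # 75 modulo 64 = 11
print(cache_block)

# Total bits in a 64 KB direct mapped cache of one-word (4 B) blocks, 32-bit address.
num_words = (64 * 1024) // 4                 # 2**14 blocks
index_bits = 14
tag_bits = 32 - index_bits - 2               # 2 low bits are the byte offset
bits_per_block = 32 + tag_bits + 1           # data + tag + valid = 49
total_kbits = num_words * bits_per_block // 1024
print(total_kbits)
```

Note that the 784 Kbits of total storage is noticeably more than the 512 Kbits of data: tags and valid bits are real overhead.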
Another Extreme Example
- Cache Size = 4 bytes, Block Size = 4 bytes: only ONE entry in the cache.
- True: if an item is accessed, it is likely to be accessed again soon.
- But it is unlikely that it will be accessed again immediately! The next access will likely be a miss again:
  - The cache continually loads data but discards (forces out) it before it is used again.
  - This "Ping Pong Effect" is the worst nightmare of a cache designer.
- Conflict Misses are misses caused by different memory locations mapping to the same cache index.
  - Solution 1: make the cache size bigger.
  - Solution 2: provide multiple entries for the same cache index.

How Do You Design a Cache?
- The set of operations that must be supported:
  - read: Data <= Mem[Physical Address]
  - write: Mem[Physical Address] <= Data
- Start from the black box (Physical Address and Read/Write in, Data in/out), determine the internal register transfers, then design the datapath and the cache controller.
- Inside, the datapath has Tag and Data storage, muxes, comparators, etc.; the controller exchanges control points and status signals (R/W, Active, wait) with the datapath.

Impact on Cycle Time
- Hit Time is directly tied to the clock rate; it increases with cache size and with associativity.
- On an instruction-cache or data-cache miss, the pipeline (PC, IR, and the stage registers IRex, IRm, IRwb) must stall.
- Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
- Execution Time = IC x CT x (ideal CPI + memory stall cycles per instruction)

Improving Cache Performance: 3 general options
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
- Memory stall cycles = read stall cycles + write stall cycles, where:
  - Read stall cycles = #reads x read miss rate x read miss penalty
  - Write stall cycles = #writes x write miss rate x write miss penalty
  - Combining reads and writes: Memory stall cycles = #accesses x miss rate x miss penalty

Examples
Q1: Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed.
(It is known that the frequency of all loads and stores in gcc is 36%.)
Answer:
- Instruction miss cycles = I x 2% x 40 = 0.80I
- Data miss cycles = I x 36% x 4% x 40 ≈ 0.56I
- The CPI with memory stalls is 2 + 1.36 = 3.36.
- The machine with the perfect cache is therefore faster by 3.36 / 2 = 1.68.
Q2: Suppose we speed up the machine by reducing the base CPI from 2 to 1. (The memory stall cycles are unchanged, so they become a larger fraction of execution time.)
Q3: Suppose we double the clock rate. (The miss penalty, measured in cycles, doubles, so memory stalls again loom larger.)

And Yet Another Extreme Example: Fully Associative
- A fully associative cache: forget about the Cache Index.
  - Compare the Cache Tags of ALL cache entries in parallel.
  - Example: with Block Size = 32 B and a 32-bit address, the tag is 27 bits (32 minus the 5 Byte Select bits), so an N-entry cache needs N 27-bit comparators.
- By definition: Conflict Miss = 0 for a fully associative cache.
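Q1 can be checked with a short calculation. A sketch, using the example's numbers; note that the slide rounds the data-stall term (36% x 4% x 40 = 0.576) to 0.56, so its totals of 3.36 and 1.68 are slightly below the exact values computed here.

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_ref_frac, miss_penalty):
    """CPI including memory stall cycles, per the gcc example above."""
    i_stalls = i_miss_rate * miss_penalty                  # per instruction
    d_stalls = mem_ref_frac * d_miss_rate * miss_penalty   # per instruction
    return base_cpi + i_stalls + d_stalls

cpi = cpi_with_stalls(2, 0.02, 0.04, 0.36, 40)
speedup_of_perfect_cache = cpi / 2     # perfect cache leaves only the base CPI
print(cpi, speedup_of_perfect_cache)
```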
A Two-way Set Associative Cache
- N-way set associative: N entries for each Cache Index, i.e., N direct mapped caches operating in parallel.
- Example: a two-way set associative cache.
  - The Cache Index selects a "set" from the cache.
  - The two tags in the set are compared in parallel.
  - Data is selected based on the tag comparison result.

Disadvantage of a Set Associative Cache
- N-way set associative versus direct mapped:
  - N comparators vs. 1.
  - Extra MUX delay for the data.
  - Data comes AFTER Hit/Miss is decided.
- In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss: it is possible to assume a hit and continue, recovering later on a miss.

Example
There are three small caches, each consisting of four one-word blocks. One cache is fully associative, a second is two-way set associative, and the third is direct mapped. Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8.
- Direct mapped: 5 misses (0 and 8 keep evicting each other from block 0).
- Two-way set associative (LRU): 4 misses.
- Fully associative (LRU): 3 misses (the second 0 and the final 8 hit).

A Summary on Sources of Cache Misses
- Compulsory (cold start, first reference): the first access to a block.
  - Cold fact of life: not a whole lot you can do about it.
- Conflict (collision): multiple memory locations mapped to the same cache location.
  - Solution 1: increase the cache size. Solution 2: increase associativity.
- Capacity: the cache cannot contain all the blocks accessed by the program.
  - Solution: increase the cache size.
- Invalidation: another process (e.g., I/O, other processors) updates memory.

Sources of Cache Misses: Quiz and Answer
For direct mapped vs. N-way set associative vs. fully associative caches:
- Cache Size: Big / Medium / Small
- Compulsory Miss: High (but who cares, see note) / Medium / Low
- Conflict Miss: High / Medium / Zero
- Capacity Miss: Low / Medium / High
- Invalidation Miss: Same / Same / Same
Note: if you are going to run billions of instructions, Compulsory Misses are insignificant.
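The three-cache example above can be simulated with one parameterized LRU model, since a direct mapped cache is 1-way and a fully associative cache is num_blocks-way. A minimal sketch (not from the slides):

```python
def count_misses(block_addresses, num_blocks, ways):
    """Misses for an LRU cache: ways=1 is direct mapped, ways=num_blocks is fully associative."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # per set: resident tags, LRU first
    misses = 0
    for block in block_addresses:
        index = block % num_sets
        tag = block // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)               # hit: refresh its LRU position below
        else:
            misses += 1
            if len(tags) == ways:          # set full: evict the least recently used
                tags.pop(0)
        tags.append(tag)                   # most recently used goes at the end
    return misses

seq = [0, 8, 0, 6, 8]
results = [count_misses(seq, 4, w) for w in (1, 2, 4)]
print(results)   # [direct mapped, two-way, fully associative]
```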
The Need to Make a Decision!
- Direct mapped cache: each memory location can be mapped to only 1 cache location.
  - No need to make any decision :-) The current item replaces the previous item in that cache location.
- N-way set associative cache: each memory location has a choice of N cache locations.
- Fully associative cache: each memory location can be placed in ANY cache location.
- On a cache miss in an N-way set associative or fully associative cache:
  - Bring in the new block from memory.
  - Throw out a cache block to make room for the new block.
  - ====> We need to make a decision on which block to throw out!

Cache Block Replacement Policy
- Random Replacement: hardware randomly selects a cache block and throws it out.
- Least Recently Used (LRU): hardware keeps track of the access history and replaces the entry that has not been used for the longest time.
- Example of a simple "pseudo" LRU implementation:
  - Assume 64 fully associative entries (Entry 0 through Entry 63).
  - A hardware replacement pointer points to one cache entry.
  - Whenever an access is made to the entry the pointer points to: move the pointer to the next entry.
  - Otherwise: do not move the pointer.

Cache Write Policy: Write Through versus Write Back
- A cache read is much easier to handle than a cache write: an instruction cache is much easier to design than a data cache.
- Cache write: how do we keep the data in the cache and memory consistent?
- Two options (decision time again :-):
  - Write Back: write to the cache only. Write the cache block back to memory when that cache block is being replaced on a cache miss.
    - Needs a "dirty" bit for each cache block.
    - Greatly reduces the memory bandwidth requirement.
    - Control can be complex.
  - Write Through: write to the cache and memory at the same time.
    - What!!! How can this be? Isn't memory too slow for this?
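The pointer-based pseudo-LRU scheme described above can be sketched as a toy model. Two assumptions not spelled out on the slide: the pointed-to entry is the victim on a miss, and the pointer also advances past a newly filled entry (so a just-loaded block is not the next victim).

```python
class PointerPseudoLRU:
    """Toy model of the replacement-pointer scheme: the pointer tends to rest
    on an entry that has not been touched recently, making it a cheap victim."""

    def __init__(self, num_entries=64):
        self.entries = [None] * num_entries   # tags currently resident
        self.pointer = 0                      # replacement pointer

    def access(self, tag):
        if tag in self.entries:               # hit
            if self.entries.index(tag) == self.pointer:
                # Accessed the pointed-to entry: move the pointer along.
                self.pointer = (self.pointer + 1) % len(self.entries)
            return "hit"
        # Miss: replace the entry the pointer designates, then advance
        # (assumed behavior, so the fresh block is not evicted next).
        self.entries[self.pointer] = tag
        self.pointer = (self.pointer + 1) % len(self.entries)
        return "miss"

cache = PointerPseudoLRU(num_entries=4)
trace = [cache.access(t) for t in ["A", "B", "A", "C", "B"]]
print(trace)
```

This is far cheaper than true LRU (one pointer instead of per-entry age state), at the cost of only approximating "least recently used."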
Write Buffer for Write Through
- A Write Buffer is needed between the cache and memory (DRAM):
  - The processor writes data into the cache and the write buffer.
  - The memory controller writes the contents of the buffer to memory.
- The write buffer is just a FIFO; a typical number of entries is 4.
- It works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
- The memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, which leads to write buffer saturation.

Write Buffer Saturation
- If store frequency (w.r.t. time) > 1 / DRAM write cycle persists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
  - The store buffer will overflow no matter how big you make it.
  - Effectively, the CPU cycle time is limited from below by the DRAM write cycle time.
- Solutions for write buffer saturation:
  - Use a write back cache.
  - Install a second-level (L2) cache between the write buffer and DRAM.

Write Allocate versus Not Allocate
- Assume a 16-bit write to memory location 0x00 causes a cache miss (32 B blocks).
- Do we read in the rest of the block (Bytes 2, 3, ..., 31)?
  - Yes: Write Allocate.
  - No: Write Not Allocate.
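The write-buffer saturation condition above can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (one write retired per fixed DRAM write cycle, stalls counted per store that finds the FIFO full); the cycle counts used are made up for illustration.

```python
from collections import deque

def simulate_write_buffer(store_cycles, dram_write_cycle, capacity=4):
    """Count stores that find the FIFO full, forcing the processor to stall.

    store_cycles: sorted cycle numbers at which the processor issues a store.
    dram_write_cycle: cycles the controller needs to retire one buffered write.
    """
    buffer = deque()
    next_retire = None      # cycle at which the current DRAM write completes
    stalls = 0
    for t in store_cycles:
        # Retire any buffered writes the memory controller finished by cycle t.
        while buffer and next_retire is not None and next_retire <= t:
            buffer.popleft()
            next_retire = (next_retire + dram_write_cycle) if buffer else None
        if len(buffer) == capacity:
            stalls += 1                      # buffer saturated: this store stalls
        else:
            if not buffer:
                next_retire = t + dram_write_cycle
            buffer.append(t)
    return stalls

# Stores every 2 cycles against a 10-cycle DRAM write: saturation.
fast = simulate_write_buffer(list(range(0, 40, 2)), dram_write_cycle=10)
# Stores every 20 cycles against the same DRAM: the buffer never fills.
slow = simulate_write_buffer(list(range(0, 400, 20)), dram_write_cycle=10)
print(fast, slow)
```

The contrast matches the slide's condition: the buffer only helps when the average store rate stays below 1 / DRAM write cycle.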
Recap: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
- Memory stall cycles = read stall cycles + write stall cycles
  = #accesses x miss rate x miss penalty (combining reads and writes).

Reducing the Miss Penalty: Transfer Time
- Solution 1: high-bandwidth DRAM. Examples (at increasing cost): Page Mode DRAM, SDRAM, CDRAM, RAMbus.
- Solution 2: a wide path between memory and cache (a wide bus plus a mux in front of the CPU).
- Solution 3: memory interleaving.

Cycle Time versus Access Time
- DRAM (read/write) Cycle Time >> DRAM (read/write) Access Time, roughly 2:1. Why?
- DRAM (read/write) Cycle Time: how frequently you can initiate an access.
  - Analogy: a little kid can only ask his father for money on Saturday.
- DRAM (read/write) Access Time: how quickly you get what you want once you initiate an access.
  - Analogy: as soon as he asks, his father will give him the money.
- DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday?

Increasing Bandwidth: Interleaving
- Access pattern without interleaving: the CPU must wait a full DRAM cycle between accesses (start the access for D1, wait until D1 is available, only then start the access for D2).
- Access pattern with 4-way interleaving: accesses to Banks 0, 1, 2, and 3 are overlapped; by the time Bank 3 has been started, Bank 0 can be accessed again.

Main Memory Performance
- Timing model: 1 cycle to send the address, 6 cycles of access time, 1 cycle to send a word of data. The cache block is 4 words.
- Simple M.P. = 4 x (1 + 6 + 1) = 32 cycles.
- Wide M.P. = 1 + 6 + 1 = 8 cycles.
- Interleaved M.P. = 1 + 6 + 4 x 1 = 11 cycles.

Independent Memory Banks
- How many banks? Number of banks >= number of clocks to access a word in a bank, so that sequential accesses do not return to the original bank before it has the next word ready.
- Increasing DRAM density => fewer chips => harder to have many banks.
- Growth in bits/chip of DRAM: 50% - 60% per year.
- Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) outpaces the growth in MB/$ of DRAM (25% - 30%/yr).
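The three miss-penalty figures above follow directly from the timing model. A brief sketch that parameterizes it:

```python
def miss_penalties(addr_cycles=1, access_cycles=6, xfer_cycles=1, words_per_block=4):
    """Miss penalties under the slide's timing model: 1 cycle to send the
    address, 6 cycles of DRAM access, 1 cycle to transfer each word."""
    # Simple organization: each word pays the full address/access/transfer cost.
    simple = words_per_block * (addr_cycles + access_cycles + xfer_cycles)
    # Wide memory and bus: the whole block moves in one transfer.
    wide = addr_cycles + access_cycles + xfer_cycles
    # Interleaved banks: accesses overlap, only the transfers serialize.
    interleaved = addr_cycles + access_cycles + words_per_block * xfer_cycles
    return simple, wide, interleaved

print(miss_penalties())   # (32, 8, 11)
```

Interleaving gets most of the benefit of a wide memory while keeping a narrow (cheap) bus, which is the point of Solution 3.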
SPARCstation 20's Memory System
- A Memory Controller connects the Memory Bus (SIMM Bus, 128-bit-wide datapath) holding Memory Modules 0 through 7 to the Processor Bus (MBus, 64-bit wide) holding the Processor Module.
- The Processor Module (MBus module) contains the processor with its internal instruction and data caches and register file, plus an external cache: 1 MB, direct mapped, write back, write allocate.

SPARCstation 20's External Cache
- Size and organization: 1 MB, direct mapped.
- Block size: 128 B; sub-block size: 32 B.
- Write policy: write back, write allocate.

SPARCstation 20's Internal Instruction Cache
- Size and organization: 20 KB, 5-way set associative.
- Block size: 64 B; sub-block size: 32 B.
- Write policy: does not apply (instructions are read-only).
- Note: the sub-block size is the same as the external (L2) cache's.

SPARCstation 20's Internal Data Cache
- Size and organization: 16 KB, 4-way set associative.
- Block size: 64 B; sub-block size: 32 B.
- Write policy: write through, write not allocate.
- The sub-block size is the same as the external (L2) cache's.

Two Interesting Questions
- Why did they use an N-way set associative cache internally?
  - Answer: an N-way set associative cache is like having N direct mapped caches in parallel. They wanted each of those N direct mapped caches to be 4 KB: the same as the virtual page size.
- How many levels of cache does the SPARCstation 20 have?
  - Answer: three levels. (1) The internal I & D caches, (2) the external cache, and (3) ...

SPARCstation 20's Memory Module
- Supports a wide range of sizes:
  - Smallest: 4 MB, using 16 2-Mbit DRAM chips and 8 KB of Page Mode SRAM.
  - Biggest: 64 MB, using 32 16-Mbit chips and 16 KB of Page Mode SRAM.
- Each DRAM chip (512 rows x 512 columns, 256K x 8 = 2 Mbit) feeds a 512 x 8 Page Mode SRAM; the SRAMs drive the 128-bit Memory Bus (bits<127:0>). This Page Mode SRAM is the third level of "cache" referred to above.
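The set-associativity answer above is easy to check arithmetically. A short sketch using the cache and page sizes stated on the slides; the observation in the final comment (index bits fitting in the page offset) is the standard reason a way size equal to the page size matters:

```python
# Each direct-mapped "way" of the SuperSPARC internal caches equals the
# 4 KB virtual page size.
icache_way = (20 * 1024) // 5    # 20 KB, 5-way set associative
dcache_way = (16 * 1024) // 4    # 16 KB, 4-way set associative
page_size = 4 * 1024
print(icache_way == page_size, dcache_way == page_size)   # True True
# With way size == page size, the cache index comes entirely from the page
# offset, so the lookup can start in parallel with address translation.
```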
DRAM Performance
- A 60 ns (tRAC) DRAM can perform a row access only every 110 ns (tRC).
- It can perform column accesses (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
  - In practice, external address delays and turning around buses make it 40 to 50 ns.
- These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead.
  - Driving parallel DRAMs, an external memory controller, bus turnaround, the SIMM module, and pins mean that 180 ns to 250 ns of latency from processor to memory is good for a "60 ns" (tRAC) DRAM.

Summary
- The Principle of Locality: temporal locality vs. spatial locality.
- Four questions for any cache:
  - Where to place a block in the cache.
  - How to locate a block in the cache.
  - Replacement: which block to evict.
  - Write policy: write through vs. write back, and how to handle write misses.
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life. Example: cold start misses.
  - Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping pong effect!
  - Capacity misses: increase cache size.