CS 61C: Great Ideas in Computer Architecture




Lecture 4: Cache Performance (Cache III)
Instructors: Mike Franklin, Dan Garcia
http://inst.eecs.berkeley.edu/~cs61c/fa

In the News: Major Expansion of World's Data Centers
The next great expansion of the world's digital infrastructure is under way in developing markets like those of China, Brazil and Argentina. Growth is expected to match levels from the economy's boom years despite the global slowdown.
http://www.nytimes.com//9//technology/worlds-data-centers-expected-to-grow-survey-says.html

Agenda
Review: Direct Mapped Caches
Measuring Performance
Multi-Level Caches
Associative Caches
Cache Wrap Up

Review: Multiword Block Direct Mapped Cache
Four words/block, cache size = 1K words (256 blocks) (4KB total data).
[Figure: multiword-block direct-mapped cache, showing the byte offset and block offset address fields and the valid bit]

Measuring Performance
Computers use a clock to determine when events take place within hardware.
Clock cycles: discrete time intervals, a.k.a. clocks, cycles, clock periods, clock ticks.
Clock rate or clock frequency: clock cycles per second (inverse of clock cycle time).
3 GigaHertz clock rate => clock cycle time = 1/(3 x 10^9) seconds, so clock cycle time = 333 picoseconds (ps).

CPU Performance Factors
But a program executes instructions:
CPU Time = Seconds / Program = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
1st term is called Instruction Count.
2nd term is abbreviated CPI, for average Clock cycles Per Instruction.
3rd term is Clock Cycle Time (1 / Clock rate).
Why CPI = 1? Why CPI > 1? Can CPI be < 1?
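The clock-rate conversion and the three-factor CPU time equation can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the instruction count and CPI in the usage lines are made-up inputs, not values from the lecture.

```python
def clock_cycle_time_ps(clock_rate_hz: float) -> float:
    """Clock cycle time in picoseconds: the inverse of the clock rate."""
    return 1e12 / clock_rate_hz


def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (1 / clock rate)."""
    return instruction_count * cpi * (1.0 / clock_rate_hz)


# A 3 GHz clock gives a cycle time of roughly 333 ps.
cycle_ps = clock_cycle_time_ps(3e9)

# Hypothetical program: 1e9 instructions at an average CPI of 1.5 on a 3 GHz clock.
example_time = cpu_time_seconds(1e9, 1.5, 3e9)
```

Holding the instruction count and clock rate fixed, CPU time scales linearly with CPI, which is why the later slides focus on how cache misses inflate CPI.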

Average Memory Access Time (AMAT)
Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses.
AMAT = Time for a hit + Miss rate × Miss penalty
What is the AMAT for a processor with a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache access time of 1 clock cycle?
1 + 0.02 × 50 = 2 clock cycles, or 2 × 200 = 400 psecs.
Potential impact of a much larger cache on AMAT?
1) Lower miss rate.
2) Longer access time (hit time): smaller is faster.
An increase in hit time will likely add another stage to the pipeline.
At some point, the increase in hit time for a larger cache may overcome the improvement in hit rate, yielding a decrease in performance.

Measuring Cache Performance: Effect on CPI
Assuming cache hit costs are included as part of the normal CPU execution cycle, then
CPU time = IC × CPI × CC = IC × (CPI_ideal + Average memory-stall cycles) × CC
The sum in parentheses is CPI_stall.
A simple model for memory-stall cycles:
Memory-stall cycles = accesses/instruction × miss rate × miss penalty
This ignores extra costs of write misses, which were described previously.

Impacts of Cache Performance
Relative $ penalty increases as processor performance improves (faster clock rate and/or lower CPI).
Memory speed is unlikely to improve as fast as processor cycle time.
When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss.
The lower the CPI_ideal, the more pronounced the impact of stalls.
Processor with a CPI_ideal of 2, a 100 cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:
Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
So CPI_stall = 2 + 3.44 = 5.44, more than twice the CPI_ideal!
What if the CPI_ideal is reduced? What if the D$ miss rate goes up?
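The AMAT formula and the memory-stall model can be sketched in Python to try the "what if" questions. The numeric inputs are assumptions: several digits were lost in the slide text, and the values used here (200 ps clock, 50-cycle penalty, 0.02 miss rate; CPI_ideal of 2, 100-cycle penalty, 2% I$ and 4% D$ miss rates) are a reconstruction that matches the surviving results of 3.44 and 5.44. Function names are illustrative.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def memory_stall_cycles(accesses_per_instr: float, miss_rate: float, miss_penalty: float) -> float:
    """Simple stall model: accesses/instruction x miss rate x miss penalty."""
    return accesses_per_instr * miss_rate * miss_penalty


# AMAT example (assumed values): 1-cycle hit, 0.02 miss rate, 50-cycle penalty, 200 ps clock.
amat_cycles = amat(1, 0.02, 50)
amat_ps = amat_cycles * 200


def cpi_with_stalls(cpi_ideal, miss_penalty, ls_fraction, icache_miss, dcache_miss):
    """CPI_stall = CPI_ideal + I$ stalls (one fetch per instruction) + D$ stalls."""
    i_stalls = memory_stall_cycles(1.0, icache_miss, miss_penalty)
    d_stalls = memory_stall_cycles(ls_fraction, dcache_miss, miss_penalty)
    return cpi_ideal + i_stalls + d_stalls


baseline = cpi_with_stalls(2, 100, 0.36, 0.02, 0.04)
# "What if" variations: a lower CPI_ideal, or a D$ miss rate one point higher.
lower_ideal = cpi_with_stalls(1, 100, 0.36, 0.02, 0.04)
worse_dcache = cpi_with_stalls(2, 100, 0.36, 0.02, 0.05)
```

With a lower CPI_ideal the stall term is unchanged, so stalls account for a larger share of the total CPI, which is the point the slide makes.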
Multiple Cache Levels
[Figure: a CPU memory access goes to the L1$; on an L1$ miss it goes to the L2$; on an L2$ miss it goes to main memory; the data then follows the path back to the CPU]

Multiple Cache Levels
With advancing technology, there is more room on the die for bigger L1 caches and for a second-level cache, normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache.
New AMAT calculation:
AMAT = L1 Hit Time + L1 Miss Rate × L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate × L2 Miss Penalty
and so forth (the final miss penalty is the main memory access time).

New AMAT Example
1 cycle L1 hit time, 2% L1 miss rate, 5 cycle L2 hit time, 5% L2 miss rate, 100 cycle main memory access time.
No L2 cache: AMAT = 1 + 0.02 × 100 = 3
With L2 cache: AMAT = 1 + 0.02 × (5 + 0.05 × 100) = 1.2!
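The nested miss-penalty definition generalizes to any number of cache levels by working from main memory inward. A minimal sketch, assuming the example values of a 1-cycle L1 hit, 2% L1 miss rate, 5-cycle L2 hit, 5% L2 miss rate and 100-cycle main memory (digits reconstructed from the surviving results of 3 and the "5" terms); `multilevel_amat` is an illustrative name.

```python
def multilevel_amat(levels, memory_time):
    """AMAT for a cache hierarchy.

    levels: list of (hit_time, miss_rate) tuples ordered from L1 outward.
    memory_time: main memory access time, the final miss penalty.
    """
    penalty = memory_time
    # Fold from the outermost cache back to L1: each level's AMAT
    # becomes the miss penalty of the level in front of it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


no_l2 = multilevel_amat([(1, 0.02)], 100)
with_l2 = multilevel_amat([(1, 0.02), (5, 0.05)], 100)
```

Adding a third tuple to the list models a unified L3 cache without changing the function.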


Nehalem Die Photo
[Figure: Nehalem die photo, 13.6 mm (0.54 inch) by 18.9 mm (0.75 inch), with labeled regions: Memory Controller, Misc I/O, QPI, Memory Queue, four Cores, Shared L3 Cache]

Core Area Breakdown
[Figure: core floorplan with labeled regions: Execution Units, Load Store Queue, L1 Inst cache & Inst Fetch, L1 Data cache, L2 Cache & Interrupt Servicing, L3 Cache, Memory Controller]
32KB I$ per core, 32KB D$ per core, 512KB L2$ per core, share one 8-MB L3$.

Sources of Cache Misses: The 3Cs
Compulsory (cold start or process migration, 1st reference): first access to a block is impossible to avoid; small effect for long-running programs.
Solution: increase block size (increases miss penalty; very large blocks could increase miss rate).
Capacity: cache cannot contain all blocks accessed by the program.
Solution: increase cache size (may increase access time).
Conflict (collision): multiple memory locations mapped to the same cache location.
Solution 1: increase cache size.
Solution 2: increase associativity (may increase access time).

Reducing Cache Misses
Allow more flexible block placement in the cache.
Direct mapped $: a memory block maps to exactly one cache block.
Fully associative $: allow a memory block to be mapped to any cache block.
Compromise: divide the $ into sets, each of which consists of n "ways" (n-way set associative) in which to place a memory block.
A memory block maps to a unique set determined by the index field and is placed in any of the n ways of that set.
Calculation: (block address) modulo (# sets in the cache)

Alternative Block Placement Schemes
DM placement: memory block 12 in an 8-block cache: only one cache block where memory block 12 can be found, (12 modulo 8) = 4.
SA placement: four sets x 2 ways (8 cache blocks): memory block 12 is in set (12 mod 4) = 0, in either element of the set.
FA placement: memory block 12 can appear in any of the cache blocks.

Example: 4-Word Direct-Mapped $, Worst-Case Reference String
Consider the sequence of memory accesses 0, 4, 0, 4, 0, 4, 0, 4.
Start with an empty cache: all blocks initially marked as not valid.
Every access misses: 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, with Mem(0) and Mem(4) replacing each other in the same cache block.
8 requests, 8 misses.
Ping-pong effect due to conflict misses: two memory locations that map into the same cache block.
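The placement rule and the ping-pong behaviour can be reproduced with a small direct-mapped cache model. This is a sketch under stated assumptions: blocks are one word, addresses are block (word) addresses, the reference string is taken to be 0, 4, 0, 4, 0, 4, 0, 4, and each cache entry stores the whole block address in place of a tag. The function name is illustrative.

```python
def simulate_direct_mapped(block_addresses, num_blocks):
    """Return (misses, trace) for a direct-mapped cache of num_blocks one-word blocks."""
    cache = [None] * num_blocks  # None means the entry is not valid
    misses = 0
    trace = []
    for addr in block_addresses:
        index = addr % num_blocks  # (block address) modulo (# blocks in the cache)
        if cache[index] == addr:
            trace.append("hit")
        else:
            trace.append("miss")
            misses += 1
            cache[index] = addr  # replace whatever was mapped here
    return misses, trace


refs = [0, 4, 0, 4, 0, 4, 0, 4]
dm_misses, dm_trace = simulate_direct_mapped(refs, num_blocks=4)

# Placement of memory block 12 in an 8-block cache:
dm_block = 12 % 8  # direct mapped: the single candidate block
sa_set = 12 % 4    # four sets x 2 ways: the candidate set
```

Blocks 0 and 4 both map to index 0 of the 4-block cache, so each access evicts the block the next access needs.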

Example: 2-Way Set Associative $ (4 words = 2 sets x 2 ways per set)
[Figure: cache with columns Set, Way, V, Tag, Data next to a main memory of sixteen one-word blocks with addresses of the form bbbbxx]
One-word blocks: the two low-order bits define the byte in the word (32b words).
Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.
Q2: How do we find it? Use the next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache).

Example: 4-Word 2-Way SA $, Same Reference String
Consider the sequence of memory accesses 0, 4, 0, 4, 0, 4, 0, 4.
Start with an empty cache: all blocks initially marked as not valid.
0 miss, 4 miss, then 0 hit, 4 hit for every remaining access; Mem(0) and Mem(4) stay in the two ways of the same set.
8 requests, 2 misses.
Solves the ping-pong effect in a direct-mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist!

Example: Eight-Block Cache with Different Organizations
Total size of $ in blocks is equal to number of sets x associativity.
For a fixed $ size, increasing associativity decreases the number of sets while increasing the number of elements per set.
With eight blocks, an 8-way set-associative $ is the same as a fully associative $.

Four-Way Set-Associative Cache
2^8 = 256 sets, each with four ways (each with one block).
[Figure: address with byte offset field; four ways, each with a valid bit V; a 4x1 select chooses the data from the matching way]

Range of Set-Associative Caches
For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
[Figure: address fields Tag (used for tag compare), Index (selects the set), Block offset (selects the word in the block), Byte offset]
Decreasing associativity: direct mapped (only one way); smaller tags, only a single comparator.
Increasing associativity: fully associative (only one set); the tag is all the bits except block and byte offset.

Costs of Set-Associative Caches
An N-way set-associative cache costs N comparators for tag comparisons.
Must choose the appropriate set (multiplexer) before data is available.
When a miss occurs, which way's block is selected for replacement?
Random replacement: hardware randomly selects a cache item and throws it out.
Least Recently Used (LRU): the one that has been unused the longest.
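An N-way set-associative cache with LRU replacement can be modeled in a few lines, along with the index/tag trade-off for a fixed-size cache. This is a sketch, not the hardware design: it assumes one-word blocks, block addresses, the reference string 0, 4, 0, 4, 0, 4, 0, 4, and uses `OrderedDict` ordering as a stand-in for the LRU state. Function names are illustrative.

```python
from collections import OrderedDict


def simulate_set_associative(block_addresses, num_sets, ways):
    """Count misses for an LRU set-associative cache of num_sets x ways blocks."""
    sets = [OrderedDict() for _ in range(num_sets)]  # per set: oldest entry first
    misses = 0
    for addr in block_addresses:
        lru = sets[addr % num_sets]  # (block address) modulo (# sets)
        if addr in lru:
            lru.move_to_end(addr)  # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)  # evict the least recently used block
            lru[addr] = True
    return misses


def index_and_tag_bits(address_bits, num_blocks, ways, offset_bits):
    """Field widths for a fixed-size cache; num_blocks / ways must be a power of two."""
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1  # log2 of a power of two
    tag_bits = address_bits - index_bits - offset_bits
    return index_bits, tag_bits


refs = [0, 4, 0, 4, 0, 4, 0, 4]
two_way_misses = simulate_set_associative(refs, num_sets=2, ways=2)
direct_mapped_misses = simulate_set_associative(refs, num_sets=4, ways=1)

# 6-bit addresses, 4 one-word blocks, 2 byte-offset bits: doubling the ways
# removes one index bit and adds one tag bit.
fields_2way = index_and_tag_bits(6, 4, 2, 2)
fields_4way = index_and_tag_bits(6, 4, 4, 2)
```

Setting `ways=1` reduces the model to a direct-mapped cache, and `num_sets=1` makes it fully associative, matching the two ends of the range described above.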


LRU Cache Block Replacement
Least Recently Used: hardware keeps track of access history.
Replace the entry that has not been used for the longest time.
For a 2-way set-associative cache, need one bit for LRU replacement.
On access, set the access bit for the used block and clear the other one.

Example of a Simple "Pseudo" LRU Implementation
Assume 64 fully associative entries in a set.
A hardware replacement pointer points to one cache entry.
Whenever an access is made to the entry the pointer points to: move the pointer to the next entry.
Otherwise: do not move the pointer.
[Figure: replacement pointer pointing into a column of entries, Entry 0, Entry 1, ..., Entry 63]

Benefits of Set-Associative Caches
Largest gains are in going from direct mapped to 2-way set associative (reduction in miss rate).

The Cache Design Space
Several interacting dimensions: cache size, block size, write-through vs. write-back, write allocation, associativity, replacement policy.
The optimal choice is a compromise.
It depends on access characteristics: workload and use (I-cache, D-cache).
It depends on technology / cost.
Simplicity often wins.
[Figure: design-space sketch with axes Cache Size, Associativity and Block Size, and a Good/Bad curve trading Factor A against Factor B from Less to More]

Summary
AMAT to measure cache performance.
Cache can have a major impact on CPI.
Multi-level caches: reduce cache miss penalty.
Optimize the first level to be fast!
Optimize the 2nd and 3rd levels to minimize the memory access penalty.
Set-associativity: reduce cache miss rate.
A memory block maps into more than 1 cache block.
N-way: n possible places in the cache to hold a memory block.
Lots and lots of cache parameters!
Write-back vs. write-through, write allocation, block size, cache size, associativity, etc.
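The replacement-pointer scheme can be sketched as a tiny class. Two points are assumptions not stated on the slide: the pointer wraps around from the last entry to the first, and the pointed-to entry is the one chosen as the victim on a miss. The class and method names are illustrative.

```python
class PointerPseudoLRU:
    """Pseudo-LRU: one replacement pointer instead of a full access history."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.pointer = 0  # entry that would be replaced next

    def on_access(self, entry):
        """Advance the pointer only when the pointed-to entry itself is accessed."""
        if entry == self.pointer:
            # Assumption: wrap around after the last entry.
            self.pointer = (self.pointer + 1) % self.num_entries

    def victim(self):
        """Entry to replace on a miss (assumed to be the pointed-to entry)."""
        return self.pointer


plru = PointerPseudoLRU(num_entries=4)
plru.on_access(2)  # pointer is at 0, so it does not move
victim_after_other_access = plru.victim()
plru.on_access(0)  # the pointed-to entry was just used, so move on
victim_after_pointed_access = plru.victim()
```

The pointer never rests on an entry that was just accessed, which approximates LRU with a single log2(entries)-bit register rather than per-entry history.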
