CS 61C: Great Ideas in Computer Architecture: Direct-Mapped Caches

CS 61C: Great Ideas in Computer Architecture: Direct-Mapped Caches
9/27/12
Instructors: Krste Asanovic, Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/fa12

New-School Machine Structures (It's a bit more complicated!)
- Parallel Requests: assigned to a computer, e.g., Search "Katz" (Warehouse Scale Computer)
- Parallel Threads: assigned to a core, e.g., Lookup, Ads
- Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
- Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
- Hardware descriptions: all gates @ one time
- Programming languages harness parallelism and achieve high performance
- Levels of hardware: Warehouse Scale Computer, Computer, Smart Phone, Core, Input/Output, Instruction Unit(s), Functional Unit(s), Logic Gates
- Today's Lecture: Memory

Big Idea: Memory Hierarchy
- Levels in the memory hierarchy: Level 1 (inner) through Level n (outer)
- Increasing distance from the processor, decreasing speed
- Size of memory grows at each level
- As we move to outer levels, the latency goes up and the price per bit goes down. Why?

Library Analogy
- Writing a report based on books on reserve, e.g., works of J.D. Salinger
- Go to the library to get a reserved book and place it on a desk in the library
- If you need more, check them out and keep them on the desk
- But don't return earlier books, since you might need them
- You hope this collection of ~10 books on the desk is enough to write the report, despite 10 being only 0.00001% of the books in UC Berkeley libraries

Principle of Locality
- Programs access a small portion of the address space at any instant of time
- What program structures lead to locality in instruction accesses?

Cache Philosophy
- Programmer-invisible hardware mechanism that gives the illusion of the speed of the fastest memory with the size of the largest memory
- Works fine even if the programmer has no idea what a cache is
- However, performance-oriented programmers today sometimes reverse engineer the cache design to design data structures that match the cache; we'll do that in Project 3 (see the sketch below)
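To make the cache philosophy concrete, here is a small C sketch (not from the slides) of the kind of data-structure-vs-cache interaction Project 3 explores: the same array sum computed in row-major and column-major order. The array size N and the function names are illustrative assumptions; only the row-major version visits consecutive addresses, so it benefits from spatial locality.

    #include <stdio.h>

    #define N 1024

    /* Hypothetical demo: same sum, two traversal orders.
     * Row-major touches consecutive addresses (good spatial locality);
     * column-major strides across rows and reuses cache blocks poorly. */
    static int a[N][N];

    long sum_row_major(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];      /* consecutive words in memory */
        return sum;
    }

    long sum_col_major(void) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];      /* stride of N*sizeof(int) bytes */
        return sum;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }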

Memory Access without Cache
- Load word instruction: lw $t0, 0($t1); $t1 contains 1022ten, Memory[1022] = 99
1. Processor issues address 1022ten to Memory
2. Memory reads the word at address 1022ten (99)
3. Memory sends 99 to the Processor
4. Processor loads 99 into register $t0

Memory Access with Cache
- Load word instruction: lw $t0, 0($t1); $t1 contains 1022ten, Memory[1022] = 99
- With a cache (similar to a hash):
1. Processor issues address 1022ten to the Cache
2. Cache checks to see if it has a copy of the data at address 1022ten
   2a. If it finds a match (Hit): cache reads 99 and sends it to the Processor
   2b. No match (Miss): cache sends address 1022 to Memory
       I. Memory reads 99 at address 1022ten
       II. Memory sends 99 to the Cache
       III. Cache replaces a word with the new 99
       IV. Cache sends 99 to the Processor
3. Processor loads 99 into register $t0

Cache Tags
- Need a way to tell whether the cache has a copy of a memory location, so it can decide on a hit or a miss
- On a cache miss, put the memory address of the block in the "tag address" of the cache block
- 1022 is placed in the tag, next to the data from memory (99); the other tag/data pairs in the figure are left over from earlier instructions

Anatomy of a 16-Byte Cache, 4-Byte Blocks
- Operations: 1. Cache Hit  2. Cache Miss  3. Refill cache from memory
- The cache needs Address Tags to decide whether the Processor Address is a Hit or a Miss
- Compares all 4 tags

Cache Replacement
- Suppose the processor now requests a location whose address doesn't match any cache block tag, so the cache must evict one resident block to make room
- Which block to evict?
- Replace the victim with the new memory block at the requested address

Cache Block Must be Aligned in Memory
- Word blocks are aligned, so the binary address of all words in the cache always ends in 00two
- How to take advantage of this to save hardware and energy?
- Don't need to compare the last 2 bits of the byte address (the comparator can be narrower)
- Don't need to store the last 2 bits of the byte address in the Tag (the Tag can be narrower)
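A minimal C sketch of the 16-byte, 4-byte-block cache described above, assuming a fully associative lookup that compares all 4 tags and neither stores nor compares the two low-order byte-offset bits. The struct names, the example address 1020, and the refill-into-block-0 choice are illustrative assumptions, not the lecture's design.

    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 4    /* 16-byte cache, 4-byte blocks */

    struct block { int valid; uint32_t tag; uint32_t data; };
    static struct block cache[NBLOCKS];

    /* Returns 1 on hit, 0 on miss. The stored tag is the byte address with
     * the low 2 offset bits stripped, as on the slide. */
    int lookup(uint32_t addr, uint32_t *data_out) {
        uint32_t tag = addr >> 2;               /* drop byte-offset bits */
        for (int i = 0; i < NBLOCKS; i++) {     /* compare all 4 tags */
            if (cache[i].valid && cache[i].tag == tag) {
                *data_out = cache[i].data;      /* hit */
                return 1;
            }
        }
        return 0;                               /* miss: refill from memory */
    }

    int main(void) {
        uint32_t v;
        uint32_t addr = 1020;                   /* hypothetical word-aligned address */
        if (!lookup(addr, &v))
            cache[0] = (struct block){ 1, addr >> 2, 99 };  /* refill on miss */
        if (lookup(addr, &v))
            printf("hit, data = %u\n", (unsigned)v);
        return 0;
    }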

Anatomy of a 32-Byte Cache, 8-Byte Blocks
- Blocks must be aligned in pairs, otherwise you could get the same word twice in the cache
- Tags only hold even-numbered word addresses; the last 3 bits of the address are always 000two
- Tags and comparators can be narrower
- Can get a hit for either word in the block

Big Idea: Locality
- Temporal Locality (locality in time): go back to the same book on the desktop multiple times; if a memory location is referenced, it will tend to be referenced again soon
- Spatial Locality (locality in space): when you go to the book shelf, pick up multiple books on J.D. Salinger, since the library stores related books together; if a memory location is referenced, locations with nearby addresses will tend to be referenced soon

Principle of Locality
- Programs access a small portion of the address space at any instant of time
- What program structures lead to temporal and spatial locality in instruction accesses? In data accesses?

Common Cache Optimizations
- Reduce tag overhead by having larger blocks, e.g., 2 words, 4 words, 8 words
- Separate caches for instructions and data: double the bandwidth, don't interfere with each other
- Bigger caches (but the access time could grow beyond one clock cycle if too big)
- Divide the cache into multiple sets and only search inside one set => saves comparators and energy
- If there are as many sets as blocks, then only 1 comparator is needed (aka "Direct-Mapped" cache), but this may increase the miss rate

Hardware Cost of Cache
- Need to compare every tag to the processor address; comparators are expensive
- Optimization: 2 sets => ½ the comparators; 1 address bit selects which set (Set 0 or Set 1)

Processor Address Fields Used by the Cache Controller (see the field-extraction sketch below)
- Block Offset: byte address within the block
- Set Index: selects which set
- Tag: remaining portion of the processor address
- Size of Index = log2(number of sets)
- Size of Tag = address size - Size of Index - log2(number of bytes/block)
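The field sizes above can be computed directly. The following C sketch assumes an example geometry (64 sets, 16-byte blocks) that is not from the slides, and extracts the Tag, Set Index, and Block Offset from a 32-bit address.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed example geometry, not from the slides. */
    #define BLOCK_BYTES 16u     /* => offset = log2(16) = 4 bits */
    #define NUM_SETS    64u     /* => index  = log2(64) = 6 bits */

    static unsigned log2u(unsigned x) {   /* x must be a power of two */
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned offset_bits = log2u(BLOCK_BYTES);
        unsigned index_bits  = log2u(NUM_SETS);
        unsigned tag_bits    = 32 - index_bits - offset_bits;  /* remaining bits */

        uint32_t addr   = 0x1234ABCD;                     /* arbitrary example */
        uint32_t offset = addr & (BLOCK_BYTES - 1);
        uint32_t index  = (addr >> offset_bits) & (NUM_SETS - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        printf("tag=%u bits, index=%u bits, offset=%u bits\n",
               tag_bits, index_bits, offset_bits);
        printf("addr 0x%08X -> tag 0x%X, set %u, offset %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }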

What Is the Limit to the Number of Sets?
- Can save more comparators if there are more than 2 sets
- Limit: as many sets as blocks => only needs one comparator!
- Called the "Direct-Mapped" design

One More Detail: Valid Bit
- When a new program starts, the cache does not hold valid information for this program
- Need an indicator of whether this tag entry is valid for this program
- Add a valid bit to the cache tag entry: 0 => cache miss, even if by chance address = tag; 1 => cache hit, if processor address = tag

Direct-Mapped Cache Example
- One-word blocks, cache size = 1K words (or 4KB)
- The 32-bit address is split into a Tag (upper bits), a 10-bit Index that selects one of the 1024 entries, and a 2-bit byte offset
- The Valid bit ensures something useful is in the cache for this index
- A comparator compares the stored Tag with the upper part of the address to see if it is a Hit; on a Hit, read the data from the cache instead of memory
- What kind of locality are we taking advantage of?

Administrivia
- Lab #5: MIPS Assembly
- HW #4 (of six), due Sunday
- Project 2a: MIPS Emulator, due Sunday
- Midterm, two weeks from yesterday

Agenda
- Memory Hierarchy Analogy
- Memory Hierarchy Overview
- Administrivia
- Caches: Fully Associative, N-Way Set Associative, Direct Mapped
- Cache Performance
- Multilevel Caches

Cache Terms (used in the sketch below)
- Hit rate: fraction of accesses that hit in the cache
- Miss rate: 1 - Hit rate
- Miss penalty: time to replace a block from the lower level of the memory hierarchy into the cache
- Hit time: time to access the cache (including the tag comparison)
- Abbreviation: $ = cache (a Berkeley innovation!)
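Tying the direct-mapped example to the terms above, here is a minimal C sketch of a 1K-entry, one-word-block, direct-mapped cache with a valid bit per entry, counting hits to form a hit rate. The function name and the access pattern in main are illustrative assumptions, not part of the lecture.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define NUM_BLOCKS 1024u   /* 1K one-word blocks => 10-bit index */

    struct line { int valid; uint32_t tag; uint32_t data; };
    static struct line cache[NUM_BLOCKS];
    static long hits, accesses;

    uint32_t access_word(uint32_t addr, uint32_t memory_value) {
        uint32_t index = (addr >> 2) & (NUM_BLOCKS - 1);  /* address bits 11..2 */
        uint32_t tag   = addr >> 12;                      /* address bits 31..12 */
        accesses++;
        if (cache[index].valid && cache[index].tag == tag) {
            hits++;                                       /* hit: use cached word */
            return cache[index].data;
        }
        cache[index].valid = 1;                           /* miss: refill this line */
        cache[index].tag   = tag;
        cache[index].data  = memory_value;
        return memory_value;
    }

    int main(void) {
        memset(cache, 0, sizeof cache);                   /* all valid bits start at 0 */
        access_word(1020, 99);                            /* cold miss */
        access_word(1020, 99);                            /* temporal locality: hit */
        printf("hit rate = %.2f\n", (double)hits / accesses);
        return 0;
    }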

Mapping a 6-bit Memory Address
- Bits 5-4: which memory block is in the cache block (the tag); bits 3-2: block within the cache (the index); bits 1-0: byte offset within the block
- In this example, the block size is 4 bytes / 1 word (it could be multi-word); memory and cache blocks are the same size, the unit of transfer between memory and cache
- # memory blocks >> # cache blocks: 16 memory blocks / 16 words / 64 bytes / 6 bits to address all bytes; 4 cache blocks, 4 bytes (1 word) per block; 4 memory blocks map to each cache block
- Byte within block: low-order two bits, ignore! (nothing smaller than a block)
- Memory block to cache block, aka index: middle two bits
- Which memory block is in a given cache block, aka tag: top two bits

Caching: A Simple First Example
- Cache: 4 entries (index 00, 01, 10, 11), each with a Valid bit, a Tag, and Data
- Main Memory: one-word blocks with addresses 0000xx through 1111xx; the two low-order bits (xx) define the byte in the block (32-bit words)
- Q: Is the memory block in the cache? Compare the cache tag to the high-order 2 memory address bits to tell whether the memory block is in the cache (provided the valid bit is set)
- Q: Where in the cache is the memory block? Use the next 2 low-order memory address bits, the index, to determine which cache block (i.e., modulo the number of blocks in the cache)

Multiword-Block Direct-Mapped Cache
- Four words/block, cache size = 1K words
- The address is split into a Tag, an 8-bit Index (entries 0 through 255), a 2-bit word offset within the block, and a 2-bit byte offset
- What kind of locality are we taking advantage of?

Names for Each Organization
- Fully Associative: block can go anywhere; the first design in this lecture; note: no Index field, but one comparator per block
- Direct Mapped: block goes in exactly one place; note: only 1 comparator; number of sets = number of blocks
- N-way Set Associative: N places for a block; number of sets = number of blocks / N; Fully Associative: N = number of blocks; Direct Mapped: N = 1

Range of Set-Associative Caches (see the sketch below)
- For a fixed-size cache, each factor-of-2 increase in associativity doubles the number of blocks per set (i.e., the number of "ways") and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
- More associativity = more ways
- Note: IBM persists in calling sets "ways" and ways "sets". They're wrong.
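A short C sketch of the set-associativity trade-off just described: for a fixed-size cache it prints how the number of sets, index bits, and tag bits change as the number of ways doubles. The 4 KB cache size and 16-byte block size are assumed example parameters, not from the slides.

    #include <stdio.h>

    #define CACHE_BYTES (4 * 1024)
    #define BLOCK_BYTES 16
    #define ADDR_BITS   32

    static unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void) {
        unsigned blocks = CACHE_BYTES / BLOCK_BYTES;        /* 256 blocks total */
        unsigned offset_bits = log2u(BLOCK_BYTES);
        for (unsigned ways = 1; ways <= blocks; ways *= 2) {
            unsigned sets = blocks / ways;                  /* doubling ways halves sets */
            unsigned index_bits = log2u(sets);
            unsigned tag_bits = ADDR_BITS - index_bits - offset_bits;
            printf("%4u-way: %4u sets, index %2u bits, tag %2u bits\n",
                   ways, sets, index_bits, tag_bits);
        }
        return 0;
    }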

Peer Question: For S sets, N ways, B blocks, which statements hold?
A) The cache has B tags
B) The cache needs N comparators
C) B = N x S
D) Size of Index = log2(S)
Answer choices: A only; A and B only; A, B, and C only; ...

Typical Memory Hierarchy
- On-chip components: Control, Datapath, RegFile, Instruction Cache, Data Cache
- Second-Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk or Flash)
- Speed (cycles): ½'s, 1's, 10's, 100's, 1,000,000's
- Size (bytes): 100's, 10K's, M's, G's, T's
- Cost/bit: highest at the top, lowest at the bottom
- The principle of locality plus the memory hierarchy present the programmer with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology

Review So Far
- Principle of Locality, for libraries and for computer memory
- Hierarchy of memories (speed/size/cost per bit) to exploit locality
- Cache: copy of data from a lower level in the memory hierarchy
- Direct Mapped: find a block in the cache using the Index field, and a Valid bit for a Hit
- Larger caches reduce miss rate via temporal and spatial locality, but can increase hit time
- Multilevel caches help with the miss penalty
- AMAT helps balance hit time, miss rate, and miss penalty
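The review mentions AMAT (average memory access time), which balances the three cache terms as AMAT = hit time + miss rate x miss penalty. A tiny C sketch with assumed illustration numbers (not from this lecture):

    #include <stdio.h>

    int main(void) {
        /* Assumed example values, not from the slides. */
        double hit_time     = 1.0;     /* cycles to access the cache */
        double miss_rate    = 0.05;    /* 5% of accesses miss */
        double miss_penalty = 100.0;   /* cycles to refill from main memory */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6 cycles */
        return 0;
    }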