10/16/17. Outline. Outline. Typical Memory Hierarchy. Adding Cache to Computer. Key Cache Concepts

CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Caches Part 2
Instructors: Krste Asanović & Randy H. Katz
10/16/17, Fall 2017, Lecture #15

Outline
- Cache Organization and Principles
- Write Back vs Write Through
- Cache Performance
- Cache Design Tradeoffs

Typical Memory Hierarchy
- On-chip components: control, datapath, register file, instruction and data caches
- Beyond those: second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), secondary memory (disk or flash)
- Speed (cycles): from about ½ cycle at the register file to many cycles at secondary memory
- Size (bytes): grows at each level, from bytes through KB, MB, and GB to TB
- Cost per bit: highest at the register file, lowest at secondary memory
- Principle of locality + memory hierarchy presents the programmer with roughly as much memory as is available in the cheapest technology, at roughly the speed offered by the fastest technology

Adding Cache to Computer
[Figure: processor (control; datapath with PC, registers, and ALU), cache, and memory (program and data bytes), with input and output devices. Signals between processor and memory: Enable?, Read/Write, Address, Write Data, Read Data. Also labeled: Processor-Memory Interface, I/O-Memory Interfaces.]
- Processor organized around words and bytes
- Memory (including cache) organized around blocks, which are typically multiple words

Key Cache Concepts
- Principle of Locality: Temporal Locality and Spatial Locality
- Hierarchy of memories (speed/size/cost per bit) to exploit locality
- Cache: copy of data in lower level of memory hierarchy
- Direct Mapped: find block in cache using Tag field and Valid bit for Hit
- Cache design organization choices: Fully Associative, Set-Associative, Direct-Mapped

Cache Organizations
- Fully Associative: block placed anywhere in cache
  - First design last lecture
  - Note: no Index field, but one comparator per block
- Direct Mapped: block goes in only one place in cache
  - Note: only one comparator
  - Number of sets = number of blocks
- N-way Set Associative: N places for a block in cache
  - Number of sets = number of blocks / N
  - N comparators
  - Fully Associative: N = number of blocks
  - Direct Mapped: N = 1

Memory Block vs Word Addressing
[Figure: the same memory addressed by byte, by word, and by 8-byte block. The low-order address bits give the byte offset in the block; the remaining bits give the block number.]

Block Number Aliasing
[Figure: table of block numbers alongside block number mod 8 and block number mod a smaller cache size.]

Processor Address Fields Used by Cache Controller
- Block Offset: byte address within block
- Set Index: selects which set
- Tag: remaining portion of processor address
- Processor address layout: Tag | Set Index | Block offset
- Size of Index = log2(number of sets)
- Size of Tag = address size - size of Index - log2(number of bytes/block)

What Limits Number of Sets?
- For a given total number of blocks, we save comparators if we have more than two sets
- Limit: as many sets as cache blocks => only one block per set, which needs only one comparator
- Called the Direct-Mapped design
- Address layout: Tag | Index | Block offset

Direct Mapped Cache Example: Mapping a 6-bit Memory Address
- Address fields: Tag (memory block within cache block) | Index (block within cache) | Byte offset (byte within block)
- In the example, block size is 4 bytes (1 word)
- Memory and cache blocks are always the same size, the unit of transfer between memory and cache
- # memory blocks >> # cache blocks
  - 16 memory blocks = 16 words = 64 bytes => 6 bits to address all bytes
  - 4 cache blocks, 4 bytes (1 word) per block
  - 4 memory blocks map to each cache block
- Memory block to cache block, aka index: middle two bits
- Which memory block is in a given cache block, aka tag: top two bits
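The field sizes above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: the function name `split_address` and its parameters are hypothetical, and it assumes the number of sets and the block size are both powers of two.

```python
def split_address(addr, num_sets, block_bytes):
    """Split a byte address into (tag, set index, block offset).

    Assumes num_sets and block_bytes are powers of two.
    """
    offset_bits = (block_bytes - 1).bit_length()  # log2(bytes per block)
    index_bits = (num_sets - 1).bit_length()      # log2(number of sets)
    offset = addr & (block_bytes - 1)             # low-order bits
    index = (addr >> offset_bits) & (num_sets - 1)  # middle bits
    tag = addr >> (offset_bits + index_bits)      # remaining high-order bits
    return tag, index, offset


# A 6-bit address with 4 sets and 4-byte blocks: top two bits are the tag,
# middle two bits the index, low two bits the byte offset.
print(split_address(0b101101, num_sets=4, block_bytes=4))  # (2, 3, 1)
```

With `num_sets=1` (fully associative) the index field is zero bits wide, so everything above the block offset becomes the tag.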

One More Detail: Valid Bit
- When a new program starts, the cache does not have valid information for that program
- Need an indicator of whether this tag entry is valid for this program
- Add a "valid bit" to the cache tag entry
  - 0 => cache miss, even if by chance address = tag
  - 1 => cache hit, if processor address = tag

Cache Organization: Simple First Example
[Figure: a direct-mapped cache with Index, Valid, Tag, and Data columns next to a main memory of one-word blocks whose addresses end in xx.]
- One-word blocks; the two low-order bits (xx) define the byte in the block
- Q: Is the memory block in cache? Compare the cache tag to the high-order memory address bits to tell if the memory block is in the cache (provided the valid bit is set)
- Q: Where in the cache is the memory block? Use the next low-order memory address bits, the index, to determine which cache block (i.e., modulo the number of blocks in the cache)

Example: Alternatives in an 8 Block Cache
- Direct Mapped: 8 blocks, 1 way, 1 tag comparator, 8 sets
- Fully Associative: 8 blocks, 8 ways, 8 tag comparators, 1 set
- 2-Way Set Associative: 8 blocks, 2 ways, 2 tag comparators, 4 sets
- 4-Way Set Associative: 8 blocks, 4 ways, 4 tag comparators, 2 sets
[Figure: the four layouts drawn as sets and ways.]

Direct-Mapped Cache
- One-word blocks, with the cache size given in K words (or KB)
- Valid bit ensures something useful is in the cache for this index
- Compare Tag with upper part of Address to see if a Hit
- Read data from cache instead of memory if a Hit
[Figure: address split into Tag, Index, and byte offset; the Index selects a row of Valid, Tag, and Data; a comparator produces Hit.]

Peer Instruction
For a cache with constant total capacity, if we increase the number of ways by a factor of two, which statement is false:
A: The number of sets could be doubled
B: The tag width could decrease
C: The block size could stay the same
D: The block size could be halved

Break!
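To make the role of the valid bit concrete, here is a minimal sketch of a direct-mapped lookup on block numbers. The class name `DirectMappedCache` and its fields are hypothetical, and the model tracks only tags and valid bits, not data.

```python
class DirectMappedCache:
    """Tag and valid-bit bookkeeping for a direct-mapped cache (no data)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks  # all entries start invalid
        self.tags = [0] * num_blocks       # stale tags until first fill

    def access(self, block_number):
        """Return True on a hit; on a miss, fill the entry and return False."""
        index = block_number % self.num_blocks   # which cache block
        tag = block_number // self.num_blocks    # which memory block is there
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(num_blocks=4)
# Block 0 has tag 0, which equals the stale initial tag, yet it still
# misses because the valid bit is clear.
print(cache.access(0))  # False
print(cache.access(0))  # True
```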


Handling Stores with Write-Through
- Store instructions write to memory, changing values
- Need to make sure cache and memory have the same values on writes: two policies
- 1) Write-Through Policy: write cache and write through the cache to memory
  - Every write eventually gets to memory
  - Too slow, so include a Write Buffer to allow the processor to continue once the data is in the buffer
  - Buffer updates memory in parallel to processor

Write-Through Cache
- Write both values: in cache and in memory
- Write buffer stops CPU from stalling if memory cannot keep up
- Write buffer may have multiple entries to absorb bursts of writes
- What if a store misses in cache?
[Figure: processor, cache (Tag and Data columns), write buffer (Addr and Data), and memory.]

Handling Stores with Write-Back
- 2) Write-Back Policy: write only to cache, then write the cache block back to memory when the block is evicted from cache
  - Writes collected in cache, only a single write to memory per block
  - Include a bit recording whether the block was written, and write back only if that bit is set
  - Called the "Dirty" bit (writing makes it "dirty")

Write-Back Cache
- Store/cache hit: write data in cache only and set dirty bit
  - Memory has stale value
- Store/cache miss: read data from memory, then update and set dirty bit
  - "Write-allocate" policy
- Load/cache hit: use value from cache
- On any miss: write back the evicted block, only if dirty; update cache with the new block and clear dirty bit
[Figure: processor, cache with Dirty bits (D), Tag, and Data columns, and memory.]

Write-Through vs Write-Back
Write-Through:
- Simpler control logic
- More predictable timing simplifies processor control logic
- Easier to make reliable, since memory always has a copy of the data (big idea: Redundancy!)
Write-Back:
- More complex control logic
- More variable timing (0, 1, or 2 memory accesses per cache access)
- Usually reduces write traffic
- Harder to make reliable, since sometimes the cache has the only copy of the data
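The write-traffic difference between the two policies can be illustrated with a small counting model. This is a sketch under stated assumptions, not the lecture's hardware: the function name `memory_writes` is hypothetical, the cache is direct mapped, stores are given as block numbers, and the write-back case flushes remaining dirty blocks at the end so that both policies leave memory up to date.

```python
def memory_writes(stores, policy, num_blocks=4):
    """Count writes that reach memory for a sequence of stores.

    stores: block numbers being stored to.
    policy: "write-through" or "write-back" (write-allocate on a miss).
    """
    resident = [None] * num_blocks   # block number held by each cache block
    dirty = [False] * num_blocks
    writes = 0
    for block in stores:
        index = block % num_blocks
        if policy == "write-through":
            resident[index] = block
            writes += 1                  # every store goes to memory
        else:
            if resident[index] != block:
                if dirty[index]:
                    writes += 1          # write back the evicted dirty block
                resident[index] = block  # write-allocate the new block
            dirty[index] = True          # store only updates the cache
    if policy == "write-back":
        writes += sum(dirty)             # assumed final flush of dirty blocks
    return writes


stores = [0, 0, 0, 4]   # three stores to block 0, then one to a conflicting block
print(memory_writes(stores, "write-through"))  # 4
print(memory_writes(stores, "write-back"))     # 2
```

Repeated stores to the same block cost one memory write under write-back, which is the "single write to memory per block" effect described above.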

Administrivia
- Midterm is weeks away, in October, in class (morning)
- Synchronous digital design and the processor design project are included, along with pipelines and caches
- ONE double-sided crib sheet
- Review session on a Saturday in October (location TBA)
- Open drop-in seats for tutoring sessions in Soda on Monday, Thursday, and Friday
- Guerrilla session tonight until 9 pm in Cory
- Project party tomorrow until 9 pm in Cory
- If you would like to change your partnership for the project, contact your lab TA
- We will send out a Google form to track all project partnerships

Cache (Performance) Terms
- Hit rate: fraction of accesses that hit in the cache
- Miss rate: 1 - Hit rate
- Miss penalty: time to replace a block from a lower level in the memory hierarchy to cache
- Hit time: time to access cache memory (including tag comparison)
- Abbreviation: "$" = cache (a Berkeley innovation!)

Average Memory Access Time (AMAT)
- AMAT is the average time to access memory considering both hits and misses in the cache
- AMAT = Time for a hit + Miss rate x Miss penalty

Peer Instruction
AMAT = Time for a hit + Miss rate x Miss penalty
Given a clock period (in psec), a miss penalty (in clock cycles), a miss rate (in misses per instruction), and a cache hit time (in clock cycles), what is AMAT? Answer choices A through D are times in psec.
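As a worked instance of the AMAT formula, the sketch below plugs in illustrative numbers. The values (1-cycle hit time, miss rate of 0.125, 40-cycle miss penalty, 250 psec clock) are assumptions chosen for easy arithmetic; they are not the values from the peer-instruction question.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical parameters, in clock cycles.
cycles = amat(hit_time=1, miss_rate=0.125, miss_penalty=40)
clock_psec = 250  # assumed clock period

print(cycles)               # 6.0 cycles
print(cycles * clock_psec)  # 1500.0 psec
```

Halving the miss rate in this example brings AMAT down to 3.5 cycles, which shows how heavily the miss term dominates when the penalty is large.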

Ping Pong Cache Example: Direct-Mapped Cache w/ Single-Word Blocks, Worst-Case Reference String
- Consider a main memory address reference string of eight word numbers
- Start with an empty cache, all blocks initially marked as not valid
- Every reference misses: 8 requests, 8 misses
- Ping-pong effect due to conflict misses: two memory locations that map into the same cache block

Example: 2-Way Set Associative $ (4 words = 2 sets x 2 ways per set)
[Figure: cache with Way, Set, V, Tag, and Data columns next to a main memory of one-word blocks whose addresses end in xx.]
- One-word blocks; the two low-order bits define the byte in the word
- Q: Is it there? Compare all the cache tags in the set to the high-order memory address bits to tell if the memory block is in the cache
- Q: How do we find it? Use the next low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)

Ping Pong Cache Example: 4-Word 2-Way SA $, Same Reference String
- Consider the same main memory address reference string
- Start with an empty cache, all blocks initially marked as not valid
- The first two references miss and the rest hit: 8 requests, 2 misses
- Solves the ping-pong effect seen in the direct-mapped cache, since two memory locations that map into the same cache set can now co-exist

Four-Way Set-Associative Cache
- Sets each with four ways (each with one block)
[Figure: address split into Tag, Index, and byte offset; the Index selects one entry (V, Tag, Data) in each of four ways; four comparators drive a select that produces Hit and Data.]
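The ping-pong effect can be reproduced with a short simulation. This is a sketch with several assumptions: the function name `count_misses` is hypothetical, blocks are one word, replacement is LRU, and the reference string `[0, 4, 0, 4, 0, 4, 0, 4]` is an assumed stand-in for the slide's string, chosen because words 0 and 4 collide in a 4-block direct-mapped cache.

```python
def count_misses(refs, num_sets, ways):
    """Count misses in an LRU set-associative cache of one-word blocks."""
    # Each set is a list of tags ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for word in refs:
        index = word % num_sets
        tag = word // num_sets
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)       # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)        # evict the least recently used tag
        lines.append(tag)
    return misses


refs = [0, 4, 0, 4, 0, 4, 0, 4]     # assumed worst-case string

# Direct mapped: 4 sets x 1 way. Words 0 and 4 evict each other every time.
print(count_misses(refs, num_sets=4, ways=1))  # 8

# 2-way set associative: 2 sets x 2 ways. Both words fit in set 0.
print(count_misses(refs, num_sets=2, ways=2))  # 2
```

Both configurations hold four words in total; only the organization differs.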

Break!

Range of Set-Associative Caches
- For a fixed-size cache and fixed block size, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets
  - It decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
- Address fields: Tag (used for tag compare) | Index (selects the set) | Word offset (selects the word in the block) | Byte offset
- Decreasing associativity: fewer ways, more sets
  - Direct mapped (only one way): smaller tags, only a single comparator
- Increasing associativity: more ways, fewer sets
  - Fully associative (only one set): tag is all the bits except block and byte offset

Total Cache Capacity = Associativity * # of sets * block_size
- Bytes = blocks/set * sets * Bytes/block
- C = N * S * B
- address_size = tag_size + index_size + offset_size = tag_size + log2(S) + log2(B)
- Double the Associativity: Number of sets? tag_size? index_size? # comparators?
- Double the Sets: Associativity? tag_size? index_size? # comparators?

Your Turn
For a cache with a given number of blocks, each block four bytes in size:
- The capacity of the cache is: ___ bytes
- For each of three set-associative organizations (the last being 8-way): there are ___ sets, each of ___ blocks, and ___ places a block from memory could be placed
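The relationships C = N * S * B and address_size = tag_size + log2(S) + log2(B) can be tabulated with a short helper. This is a sketch: the function name `cache_geometry`, the 32-bit address width, and the 256-byte example cache are assumptions for illustration, not the parameters of the exercise above.

```python
def cache_geometry(capacity_bytes, block_bytes, ways, addr_bits=32):
    """Derive set count and field widths from C, B, and N (powers of two)."""
    num_sets = capacity_bytes // (ways * block_bytes)   # S = C / (N * B)
    offset_bits = (block_bytes - 1).bit_length()        # log2(B)
    index_bits = (num_sets - 1).bit_length()            # log2(S)
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": num_sets, "index_bits": index_bits,
            "tag_bits": tag_bits, "comparators": ways}


# Hypothetical 256-byte cache with 4-byte blocks (64 blocks in total).
for ways in (1, 2, 4, 64):
    print(ways, cache_geometry(256, 4, ways))
```

Each doubling of `ways` halves `sets`, removes one index bit, and adds one tag bit, while the comparator count tracks the number of ways.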

Peer Instruction
For S sets, N ways, B blocks, which statements hold?
(i) The cache has B tags
(ii) The cache needs N comparators
(iii) B = N x S
(iv) Size of Index = log2(S)
A: (i) only
B: (i) and (ii) only
C: (i), (ii), (iii) only
D: All four statements are true

Costs of Set-Associative Caches
- An N-way set-associative cache costs:
  - N comparators (delay and area)
  - MUX delay (set selection) before data is available
  - Data available only after set selection (and Hit/Miss decision). In a DM $, the block is available before the Hit/Miss decision
  - In a set-associative cache, it is not possible to just assume a hit, continue, and recover later if it was a miss
- When a miss occurs, which way's block is selected for replacement?
  - Least Recently Used (LRU): the one that has been unused the longest (principle of temporal locality)
  - Must track when each way's block was used relative to the other blocks in the set
  - For a 2-way SA $, one bit per set: set to 1 when a block is referenced, and reset the other way's bit (i.e., "last used")

Cache Replacement Policies
- Random Replacement
  - Hardware randomly selects a cache entry to evict
- Least-Recently Used
  - Hardware keeps track of access history
  - Replace the entry that has not been used for the longest time
  - For a 2-way set-associative cache, need one bit for LRU replacement
- Example of a Simple "Pseudo" LRU Implementation
  - Assume fully associative entries
  - Hardware replacement pointer points to one cache entry
  - Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
  - Otherwise: do not move the pointer
  - (Example of a "not-most-recently used" replacement policy)
[Figure: replacement pointer pointing into a column of cache entries.]

Benefits of Set-Associative Caches
- Choice of DM $ versus SA $ depends on the cost of a miss versus the cost of implementation
- Largest gains are in going from direct mapped to 2-way, which gives a substantial reduction in miss rate
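The replacement-pointer scheme described above can be sketched as follows. The class name `PointerReplacementCache` is hypothetical, the model tracks tags only, and it assumes that filling an entry on a miss counts as an access to that entry (so the pointer then advances).

```python
class PointerReplacementCache:
    """Fully associative cache with a 'not-most-recently used' pointer."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries  # tags; None means empty
        self.pointer = 0                     # next victim on a miss

    def access(self, tag):
        """Return True on a hit; on a miss, replace the pointed-to entry."""
        if tag in self.entries:
            slot = self.entries.index(tag)
            hit = True
        else:
            slot = self.pointer              # victim is the pointed-to entry
            self.entries[slot] = tag
            hit = False
        # Moving the pointer off an entry that was just accessed guarantees
        # the most recently used entry is never the next victim.
        if slot == self.pointer:
            self.pointer = (self.pointer + 1) % len(self.entries)
        return hit


cache = PointerReplacementCache(num_entries=2)
for tag in ["a", "b", "a", "c"]:
    cache.access(tag)
print(cache.entries)  # ['a', 'c']
```

In this run the access to "a" just before the miss on "c" pushes the pointer onto "b", so "b" is evicted and the most recently used entry survives.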

Chip Photos
[Figure: chip photos.]

And in Conclusion
- Name of the Game: Reduce AMAT
  - Reduce Hit Time
  - Reduce Miss Rate
  - Reduce Miss Penalty
- Balance cache parameters (capacity, associativity, block size)


CS 61C: Great Ideas in Computer Architecture (Machine Structures), Caches Part 2. Instructors: Krste Asanović & Randy H. Katz. http://inst.eecs.berkeley.edu/~cs61c/ 10/16/17, Fall 2017, Lecture #15.