Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip


1 Reducing Hit Times
Hit time has a critical influence on cycle time or CPI
Keep L1 small and simple
- Small is always faster and can be put on chip.
- An interesting compromise is to keep the tags on chip and the block data off chip; this provides a quick tag check and also a larger cache capacity.
- Simple (by using direct-mapped caches) means a shorter hit time.
- Note that direct-mapped caches may have a higher miss rate than associative caches, but the concurrent data read and tag check makes the hit time (clock cycle) shorter.
Avoid VM address -> PM address translation
- Use virtually addressed caches, accessed by VM addresses.
- Address translation proceeds in parallel with the cache search.
- If translation indicates that the VM address is not mapped to a PM address, void the cache search result.
- If translation indicates a protection violation, raise an exception.
- If there is a cache miss, bring the cache block in using the PM address.
Chapter 5 page 35

2 Problems with Virtual Caches
Task switches
- A task switch causes the same VM address to refer to different PM addresses, so the cache must be flushed, creating a large task-switch overhead.
- To avoid flushing, we need OS-generated tags to identify processes: the cache tag field includes a PID, which may increase hit time.
VM alias problems
- OS and user code can have different VM addresses that map to the same PM address, which results in 2 copies in the cache.
- This needs a hardware anti-aliasing mechanism to guarantee a single copy, or a software "page coloring" solution; e.g., SUN's UNIX guarantees that the last 18 bits of VM and PM addresses are the same.
Common compromise (DEC Alpha and HP 7200)
- Virtually addressed, physically tagged.

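3 Page Coloring in SUN's UNIX
Objective: VM and PM addresses match in the last 18 bits, so any two different VMAs that map to the same PMA also share their last 18 bits
- A direct-mapped cache no larger than 256 KB (2^18 bytes) is indexed entirely by those bits, so aliases land on the same cache block and the cache can never hold duplicate copies of a PM block.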
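A minimal Python sketch of why matching low-order bits removes aliases from a small direct-mapped cache. The 18-bit figure comes from the slide; the 32-byte block size, the example addresses, and the helper names are assumptions made only for illustration.

```python
COLOR_BITS = 18                       # low bits shared by VM and PM addresses (from the slide)
COLOR_MASK = (1 << COLOR_BITS) - 1

def same_color(vaddr, paddr):
    """True if the virtual and physical address agree in the low 18 bits."""
    return (vaddr ^ paddr) & COLOR_MASK == 0

def cache_block_index(addr, cache_bytes, block_bytes=32):
    """Block slot selected in a direct-mapped cache (block size is an assumed value)."""
    return (addr % cache_bytes) // block_bytes

# Hypothetical addresses: two virtual aliases of one physical location.
paddr = 0x00123440
alias_a = 0x40023440
alias_b = 0x7FF63440

KB = 1024
small = [cache_block_index(a, 256 * KB) for a in (paddr, alias_a, alias_b)]
large = [cache_block_index(a, 512 * KB) for a in (alias_a, alias_b)]
print(small)   # all three entries are equal for a 256 KB cache
print(large)   # the two aliases can differ once the cache exceeds 256 KB
```

With a 256 KB cache both aliases select the same block slot, so only one copy can be resident; with a 512 KB cache the index uses a 19th bit that page coloring does not control.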

4 Virtually-Addressed, Physically-Tagged Caches
Use part of the page offset within the VM address to index the cache
- The page offset of a VM address consists of an "index" and a "block offset".
- The index field is used to select a set in the cache.
- The block offset is used to select a word within the cache block.
- Cache blocks still contain "physical tags", as in a physical cache.
Advantage: VMA -> PMA translation occurs simultaneously with the cache access
- Once the VM address is translated, compare the tag field of the translated PM address with the tags in the cache set.
- For direct-mapped, only one tag needs to be compared.
- For n-way set associative, all n tags in the set are compared.
See Figure 5.27
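The field split described above can be sketched in Python as follows. The cache and page sizes in the example are assumed values, and the function names are hypothetical; the point is only that the index and block offset must both fit inside the page offset for the cache lookup to start before translation finishes.

```python
def log2_exact(n):
    """Return log2(n), requiring n to be a positive power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1

def vipt_fields(cache_bytes, block_bytes, ways, page_bytes):
    """Return (offset_bits, index_bits, page_offset_bits, fits_in_page_offset)."""
    offset_bits = log2_exact(block_bytes)
    index_bits = log2_exact(cache_bytes // (block_bytes * ways))
    page_offset_bits = log2_exact(page_bytes)
    fits = offset_bits + index_bits <= page_offset_bits
    return offset_bits, index_bits, page_offset_bits, fits

def split_page_offset(vaddr, offset_bits, index_bits):
    """Extract (set index, block offset) from the untranslated low bits."""
    block_offset = vaddr & ((1 << offset_bits) - 1)
    index = (vaddr >> offset_bits) & ((1 << index_bits) - 1)
    return index, block_offset

# Assumed example: 8 KB direct-mapped cache, 32 B blocks, 8 KB pages.
fields = vipt_fields(8 * 1024, 32, 1, 8 * 1024)
print(fields)
print(split_page_offset(0x1234, fields[0], fields[1]))
```

Raising associativity is one way to grow such a cache without adding index bits, since more ways means fewer sets for the same capacity.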

5 Faster Write Hits
Write hits are slower since the tag check must be performed before the write can proceed
Hence pipeline the writes (e.g. DEC Alpha)
- Tags and data are split and addressed independently.
- The tag check of the (i+1)th write request occurs in the same cycle as the data write of the ith request, in a pipelined way.
Result
- It looks like a write happens on every cycle.
- Cycle time can stay short since the real write is spread over several cycles.
- Mostly works since the CPU is not dependent on data from a write.

6 Cache Improvement Summary
Table 1:
Technique | Miss Rate | Miss Penalty | Hit Time | HW Complexity | Comments
Larger Block Size | win | lose | | easy | trivial engineering effort
Higher Associativity | win | | lose | 1 | associative match isn't free
Victim Caches | win | | | 2 | e.g. HP 7200
Pseudo-Associative | win | | | 2 | used in L2 of MIPS R10000
HW Prefetch of I and D | win | | | 2 | D fetch hard to do
Compiler-controlled prefetch | win | | | 3 | needs non-blocking cache
Compiler cache scheduling (blocking, merging, loop interchange, ...) | win | | | 0 | too bad it's hard to do for all applications - loop focus for now
Prioritizing Read Misses | | win | | 1 | write buffer - simple
Subblock placement | | win | | 1 | good at reducing tag overhead
Early restart + critical word first | | win | | 2 | MIPS R10K, IBM 620
Nonblocking Caches | | win | | 3 | MIPS R10K, AXP
Second Level Caches | | big win | | 2 | big additional cost
Small and Simple Caches | lose | | win | 0 | trivial so it's widely used
Avoid VM->PM translation effects | | | win | 2 | Alpha AXP 21064, PA
Pipelining Writes | | | win | 1 | AXP 21064, PA-8000, UltraSPARC

7 Main Memory
At the low level of the memory hierarchy; 3 important issues
Capacity
- Bell's law: 1 MB per MIPS needed.
- The real key here is to avoid those costly page faults.
Latency
- How long does it take to get the data back?
- By addressing big chunks, like an entire cache block, this can be amortized.
- Critical to cache performance when the miss goes to main memory.
Bandwidth
- Affects the time it takes to transfer the block.
- A key issue when DMA service from an I/O device is considered.
- Also a key issue with the very large block sizes in lower-level caches.

8 Memory Technology
SRAMs and DRAMs are different
- DRAM: 1 transistor/bit; SRAM: 4-6 transistors/bit.
- DRAM capacity is 4-8 times that of SRAM at the same feature size.
- SRAM speed is 8-16 times that of DRAM, but its cost is higher by about the same factor.
Main memory today means DRAM
- Multiplexed address lines: row and then column access.
- 2-dimensional address: a row goes to a buffer and the subsequent column address selects part of that row.
- Refresh needed every few milliseconds.
Where are we today (Figure 5.35)
- 64 Mbit chips.
- RAS access time between 50 and 65 ns.
- CAS access time 10 ns.
- Cycle time 90 ns (separation between subsequent accesses).

9 Consider Example
- 4 cycles to send the address
- 24 cycles to access a word in the memory unit
- 4 cycles to transmit the data
Hence if main memory is organized by word, 4 + 24 + 4 = 32 cycles are spent for every word
Given a cache block size of 4 words, the miss penalty is 32 * 4 = 128 cycles
Clearly we need a better organizational model

10 Memory Organization Improvements
Wider memory
- Make the width of main memory match the cache block size.
- Easy to do: 4 words are needed on a miss, so just quadruple the memory bandwidth; following the numbers in the last slide, what is the miss penalty now?
- The problem is the cost of the wider bus between cache and main memory.
- Moreover, since the CPU accesses one word at a time, a multiplexer is needed between cache and CPU, which may adversely affect the cache hit time.
Interleaved memory
- Bus bandwidth is the same, but we make it work more often.
- Organize the memory in banks so they read simultaneously and then each delivers one word to the bus in turn; what is the miss penalty now?
Both are optimized for sequential memory accesses, i.e. both capitalize on the spatial locality principle, just like caches do
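One way to work the "miss penalty now?" questions is the small Python sketch below. It reuses the example timings from the previous slide (4 cycles to send the address, 24 to access, 4 to transfer) and a 4-word block; the function names are hypothetical, and the interleaved model assumes at least one bank per word in the block, so that all accesses overlap and only the transfers are serialized.

```python
import math

SEND, ACCESS, TRANSFER = 4, 24, 4   # cycles, taken from the example slide

def penalty_word_wide(block_words):
    """One-word-wide memory: every word pays send + access + transfer."""
    return block_words * (SEND + ACCESS + TRANSFER)

def penalty_wide(block_words, width_words):
    """Memory and bus widened to width_words: fewer full round trips."""
    trips = math.ceil(block_words / width_words)
    return trips * (SEND + ACCESS + TRANSFER)

def penalty_interleaved(block_words, banks):
    """Banks are accessed in parallel; words then cross the bus one at a time."""
    if banks < block_words:
        raise ValueError("this simple model assumes banks >= block_words")
    return SEND + ACCESS + block_words * TRANSFER

print(penalty_word_wide(4))        # 128
print(penalty_wide(4, 4))          # 32
print(penalty_interleaved(4, 4))   # 44
```

Under these assumptions interleaving recovers most of the benefit of the wide organization while keeping the narrow bus.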

11 Reducing Bank Access Conflicts
We would like to deliver one word from a bank per cycle
- Therefore # of banks >= access time per word in cycles, to avoid any access gap.
- Ex: # banks = 8 and access time per word = 10 cycles; what is the access gap?
Interleaved memory is ideal for sequential access
- Bad if data references go to the same bank, e.g., accessing array words #0, #128, #256, etc. in a 128-bank interleaved memory.
Addressing in power-of-2 interleaved memory
- bank # = (address) mod (# of banks)
- offset within a bank = (address) / (# of banks)
Addressing in prime-number interleaved memory
- bank # = (address) mod (# of banks)
- offset within a bank = (address) mod (# of words in a bank)
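The two addressing schemes can be compared with a short Python sketch. The function names, the choice of 127 banks, and the bank depth of 1024 words are assumptions for illustration. The prime-number mapping relies on the bank count and the words-per-bank being coprime, and the access-gap helper uses a simple model in which a bank starts its next access as soon as it has delivered a word.

```python
import math

def pow2_map(addr, banks):
    """Power-of-2 interleaving: (bank #, offset within bank)."""
    return addr % banks, addr // banks

def prime_map(addr, banks, words_per_bank):
    """Prime-number interleaving: both fields are simple mod operations."""
    if math.gcd(banks, words_per_bank) != 1:
        raise ValueError("banks and words_per_bank must be coprime")
    return addr % banks, addr % words_per_bank

def access_gap(banks, access_cycles):
    """Idle bus cycles per round of banks in the simple model described above."""
    return max(0, access_cycles - banks)

stride = [0, 128, 256, 384]
print([pow2_map(a, 128)[0] for a in stride])         # every access hits bank 0
print([prime_map(a, 127, 1024)[0] for a in stride])  # accesses spread over banks
print(access_gap(8, 10))
```

With 128 banks a stride of 128 words serializes on a single bank, while 127 banks send the same references to consecutive banks.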

12 Addressing in Power-of-2 Interleaved Memory See Figure 5.32

13 Reducing Bank Access Conflicts Power-of-2 vs. prime-number interleaved memories See Figure 5.34

14 Simple Calculation Problems
A machine has 16 MB of main memory organized with 32-way interleaving, and a 64 KB cache. The cache block size is 512 B and the cache is always presented with a PM address.
Q1: How many bits are in the tag field?
(1) for a 16-way set associative cache? (a) 8 (b) 10 (c) 12 (d) 14 (e) 16
(2) for a direct-mapped cache? (a) 8 (b) 10 (c) 12 (d) 14 (e) 16
Q2: For a direct-mapped cache, what is the physical address of a byte at offset 63 of memory bank 3 (all numbers start from 0)? (a) (b) (c) (d)
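Q1 can be checked with a few lines of Python. This is a sketch that assumes byte addressing and a physical address exactly wide enough for the 16 MB memory; the function name is hypothetical.

```python
def tag_bits(mem_bytes, cache_bytes, block_bytes, ways):
    """Tag width = address bits - set index bits - block offset bits."""
    addr_bits = (mem_bytes - 1).bit_length()       # 16 MB -> 24 bits
    offset_bits = (block_bytes - 1).bit_length()   # 512 B -> 9 bits
    sets = cache_bytes // (block_bytes * ways)
    index_bits = (sets - 1).bit_length()           # number of sets -> index width
    return addr_bits - index_bits - offset_bits

MB, KB = 1024 * 1024, 1024
print(tag_bits(16 * MB, 64 * KB, 512, ways=16))   # 12
print(tag_bits(16 * MB, 64 * KB, 512, ways=1))    # 8
```

The cache holds 128 blocks, so the direct-mapped case needs 7 index bits and the 16-way case (8 sets) needs 3, leaving 8 and 12 tag bits respectively.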

15 Virtual Memory
Permits applications to use an address space larger than the main memory
Helps with multiple process management
- Each process gets its own portion of memory.
- Access protection can be imposed on a per-process basis.
- All VM is mapped dynamically onto one shared PM.
- Mapping also facilitates relocation.
Applications and CPU run in virtual space
- Mapping onto physical space is invisible to the application.
VM management applies between the main memory and secondary (disk) levels of the hierarchy
- A miss becomes a page fault or address fault.
- A block becomes a page or segment.

16 Typical Performance Parameters
Table 1:
Parameter | Value
Page size | 4 KB - 64 KB
L1 cache hit time | 1-2 clock cycles
Virtual hit (e.g. in main memory) | clock cycles
Miss penalty (all the way to disk) | 700K - 6M clock cycles
Disk access time | 500K - 4M clock cycles
Page transfer time | 200K - 2M clock cycles
Page fault rate | 0.00001% - 0.001%
Main memory size | 4 MB - 4 GB
It's a lot like what happens in a cache
- But all the numbers are much, much larger.
- With the exception of the miss rate.

17 Cache vs. VM Differences
Replacement
- A cache miss is handled by hardware.
- A page fault is often handled by the OS: since the page fault penalty is very large, more sophisticated strategies can be used to reduce miss penalty.
Addresses
- VM space is determined by the address space of the CPU.
- Cache size is independent of the CPU address space.
Lower level memory
- For caches, the main memory is not shared by something else.
- For VM, most of the disk contains the file system.
- The file system is addressed differently, usually in I/O space.
- The VM lower level is usually called the swap space.

18 2 VM Styles - Paged or Segmented
Pages are fixed-size blocks
Segment sizes vary from 1 byte to 2**32 bytes
Table 1:
Aspect | Page | Segment
Words per address | One - contains page and offset | Two - possibly large max size, hence need two words to address segment and offset
Programmer visible | No | Sometimes yes
Replacement | Trivial - due to fixed size | Hard - need to find contiguous space ==> GC necessary or wasted memory
Memory inefficiency | Internal fragmentation - wasted part of a page | External fragmentation - due to variable size blocks
Disk efficiency | Yes - adjust page size to balance access and transfer time | Not always - segment size varies

19 The 4 Questions for VM
Block placement
- Choice between lower miss rates with complex placement, or vice versa.
- Miss penalty is huge, so choose low miss rate ==> place anywhere.
- Similar to the fully associative cache model.
Block addressing - both styles use an additional data structure
- Pages use a page table: virtual page number ==> physical page number, then concatenate the offset; a tag bit indicates presence in main memory.
- Segments use a segment table: segment # ==> starting physical address of segment, then add the offset.
- The segment table needs an entry for every possible segment.
- Lots of little segments mean a large segment table, which is always a possibility.

20 More on Page Tables
Normal page tables
- Size: number of entries = number of virtual pages.
- For a 32-bit virtual address, 4 KB pages, and 4 bytes per entry, size of page table = (4 GB / 4 KB) * 4 B = 4 MB.
- Too large for today's machines: must go for a smaller VM space, bigger pages, or an inverted page table.
Inverted page table
- Why allocate an entry for each VM page? Instead, allocate an entry for each PM block (page).
- Example: PM size = 64 MB, so the size of the inverted page table = (64 MB / 4 KB) * 4 B = 64 KB.
- Hash the virtual page number into an entry of the page table.
- Then compare the virtual page number with the tag stored in the hashed entry to make sure it is a match.
- If it misses, go to the full page table stored on disk. This implies 2 disk accesses in the worst case; however, we trade an increased worst-case penalty for a decrease in capacity misses, since there is now more room for real pages rather than page table pages.
- Check the valid bit of the page table entry: if valid, the page is in PM; otherwise it is a page fault.
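The two size calculations above can be reproduced with a short Python sketch. The function names are hypothetical, and the 4-byte entry size is simply carried over from the slide for both table styles.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3

def linear_page_table_bytes(virtual_space_bytes, page_bytes, entry_bytes=4):
    """One entry per virtual page."""
    return (virtual_space_bytes // page_bytes) * entry_bytes

def inverted_page_table_bytes(physical_mem_bytes, page_bytes, entry_bytes=4):
    """One entry per physical page frame."""
    return (physical_mem_bytes // page_bytes) * entry_bytes

normal = linear_page_table_bytes(4 * GB, 4 * KB)
inverted = inverted_page_table_bytes(64 * MB, 4 * KB)
print(normal // MB, "MB")     # 4 MB
print(inverted // KB, "KB")   # 64 KB
```

The inverted table scales with installed memory rather than with the virtual address space, which is why it stays small here.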

21 Back to the 4 Qs for VM
Block replacement
- LRU is the best, so use it to minimize the huge miss penalty.
- However, as with caches, true LRU is very expensive, so the page table contains a use tag.
- On access the use tag is set.
- The OS checks the tags periodically, records what it sees in an internal data structure, then clears all the use tags.
- On a miss the OS decides which page is least used based on the records in its data structure.
- Note: it is worth a few OS cycles to avoid the huge miss penalty.
Write strategy
- Always write back.
- Due to the access time of the disk, write through is impractical.
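The use-tag scheme can be sketched in Python as below. The slide does not say what the OS's internal data structure looks like; this example assumes a per-page history byte updated by an "aging" shift, which is one common choice and not necessarily what any particular OS does. The class and method names are hypothetical.

```python
class UseBitTracker:
    """Approximate LRU from periodically sampled use bits (aging, assumed)."""

    def __init__(self, num_pages, history_bits=8):
        self.use = [False] * num_pages     # set on access, cleared by the OS
        self.history = [0] * num_pages     # the OS's internal record per page
        self.top_bit = 1 << (history_bits - 1)

    def access(self, page):
        """What the hardware does on a reference: set the use tag."""
        self.use[page] = True

    def periodic_scan(self):
        """What the OS does periodically: record the tags, then clear them."""
        for page, used in enumerate(self.use):
            self.history[page] >>= 1       # older samples lose weight
            if used:
                self.history[page] |= self.top_bit
            self.use[page] = False

    def choose_victim(self):
        """On a page fault: pick the page with the smallest history value."""
        return min(range(len(self.history)), key=self.history.__getitem__)

tracker = UseBitTracker(num_pages=3)
for page in (0, 1, 2):
    tracker.access(page)
tracker.periodic_scan()
for page in (1, 2):
    tracker.access(page)
tracker.periodic_scan()
print(tracker.choose_victim())   # page 0 was not touched in the latest interval
```

Recent intervals occupy the high bits of each history value, so a page untouched recently compares lower than one that was just referenced.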

22 Page Size?
An architectural choice
Large pages are good:
- Reduces page table size.
- Amortizes the long disk access.
- If spatial locality is good then the hit rate will improve.
Large pages are bad:
- More internal fragmentation: if everything is random, then on average the last page of each program segment is only half full, and half of bigger is still bigger.
- If there are 3 segments per process (code, data, and stack), then 1.5 pages are wasted for each process.
- Process start-up takes longer, since at least 1 page is necessary and the transfer penalty is higher.
And vice versa, of course

23 Address Translation
Page tables are large, and in some systems they are themselves paged
- So double page faults are possible, which is bad for performance.
If locality applies then cache the translations
- This cache is called a TLB (a special cache).
- A TLB entry consists of a tag (virtual page #), PM block #, valid bit, use bit, protection field, dirty bit, etc.
- TLB and page table must be consistent with each other.
TLB is in the CPU critical path
- The TLB must be checked before the cache access can hit.
- The result is that the cache hit time may get stretched a bit.
- Virtually-indexed, physically-tagged caches will help.

24 More on TLBs
The TLB must be on chip, otherwise it is worthless
- Small TLBs are worthless anyway.
- Large TLBs are expensive (fully associative).
Typical TLBs
- Block size: same as a page table entry, 1 or 2 words.
- Hit time: 1 cycle.
- Miss penalty: 10 to 30 cycles.
- Miss rate: 0.1% to 2%.
- TLB size: 32 B to 8 KB.
They're expensive but necessary ==> the price of CPUs is going up!

25 e.g. AXP TLB
[Figure: AXP TLB. The virtual address is split into a 30-bit page frame number and a 13-bit page offset. Each of the 32 TLB entries holds V, R (2 bits) and W (2 bits) protection fields, a 30-bit VPN tag, and a 21-bit physical frame number. A 32:1 mux selects the hit location to form the physical address. Numbered markers indicate steps that could be pipelined.]

26 A Simple Calculation Problem (5.8(b), Text)
A machine with
- base CPI = 1.5; 20% of instructions are data transfer instructions
- a write-back "virtual" cache with 32 B per cache block; miss rate = 2.2%; 50% of cache blocks are dirty
- memory latency = 40 cycles; transfer rate = 4 B/cycle
- TLB does not slow down a cache hit; TLB miss rate = 0.2% and TLB miss penalty = 20 cycles
What is the CPI?
CPI is affected by three sources of stalls: (1) instruction fetch stalls, (2) data reference stalls, and (3) TLB access stalls when there are cache misses
So CPI = 1.5 + [(1+20%)*2.2%*72] + [(1+20%)*2.2%*(1+50%)*0.2%*20]
- (1+20%) is the number of memory references per instruction (one instruction fetch plus 0.2 data references).
- 72 is the average miss penalty in cycles: moving one block takes 40 + 32/4 = 48 cycles, and 50% of misses also write back a dirty block, so 48*(1+50%) = 72.
- The 3rd term above accounts for the TLB stalls per instruction.
- 2.2%*(1+50%) is the number of TLB accesses per memory reference, because on a miss, 50% of the time the TLB needs to do another VM->PM translation to flush a victim cache block.
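The calculation can be evaluated with the Python sketch below. It assumes the total is the base CPI of 1.5 plus the two bracketed stall terms, and that the 72-cycle miss penalty is (40 + 32/4) cycles per block scaled by (1 + 50%) for dirty write-backs; the function and parameter names are hypothetical.

```python
def cpi(base_cpi=1.5, data_refs_per_instr=0.20, miss_rate=0.022,
        dirty_fraction=0.50, mem_latency=40, block_bytes=32,
        bytes_per_cycle=4, tlb_miss_rate=0.002, tlb_miss_penalty=20):
    # Memory references per instruction: 1 fetch + 0.2 data references.
    refs = 1 + data_refs_per_instr
    # Cycles to move one block: latency plus transfer time (40 + 8 = 48).
    block_cycles = mem_latency + block_bytes / bytes_per_cycle
    # Half of the misses also write back a dirty victim (48 * 1.5 = 72).
    miss_penalty = block_cycles * (1 + dirty_fraction)
    cache_stalls = refs * miss_rate * miss_penalty
    # A virtual cache consults the TLB only on misses and victim write-backs.
    tlb_accesses = refs * miss_rate * (1 + dirty_fraction)
    tlb_stalls = tlb_accesses * tlb_miss_rate * tlb_miss_penalty
    return base_cpi + cache_stalls + tlb_stalls

result = cpi()
print(round(result, 4))
```

Under these assumptions the cache stalls dominate (about 1.9 cycles per instruction), while the TLB term contributes less than 0.002.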
