EEC 581 Computer Architecture
Memory Hierarchy Design (II)
Department of Electrical Engineering and Computer Science, Cleveland State University

Topics to be covered
- Cache miss penalty reduction techniques
  - Victim cache
  - Assist cache
  - Non-blocking cache
- Data prefetch mechanisms
- Virtual memory
3Cs Absolute Miss Rate (SPEC92)
- Compulsory misses are a tiny fraction of the overall misses
- Capacity misses decrease as cache size increases
- Conflict misses decrease as associativity increases
[Figure: absolute miss rate (0 to 0.14 misses per reference) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with the conflict, capacity, and compulsory components stacked]

2:1 Cache Rule
- Miss rate of a direct-mapped cache of size X is approximately the miss rate of a 2-way set-associative cache of size X/2
[Figure: the same miss-rate breakdown vs. cache size, illustrating the 2:1 rule]
3Cs Relative Miss Rate
[Figure: miss-rate components as percentages (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity: conflict on top, then capacity, then compulsory]
- Caveat: fixed block size

Victim Caching [Jouppi 90]
[Figure: victim cache organization. The processor accesses L1 with a small victim cache (VC) alongside it, backed by memory]
- Victim cache (VC): a small, fully associative structure
- Especially effective with direct-mapped L1 caches
- Whenever a line is displaced from the L1 cache, it is loaded into the VC
- The processor checks both L1 and the VC simultaneously
- Data is swapped between the VC and L1 if L1 misses and the VC hits
- When data has to be evicted from the VC, it is written back to memory
[Figure: percentage of conflict misses removed by a victim cache, shown separately for the D-cache and the I-cache]

Assist Cache [Chan et al. 96]
[Figure: assist cache organization. The processor accesses L1 with an assist cache (AC) alongside it, backed by memory]
- The on-chip assist cache avoids thrashing in the main (off-chip) L1 cache; both run at full speed
- 64-entry x 32-byte, fully associative CAM
- Data enters the assist cache on a miss (FIFO replacement policy in the assist cache)
- On eviction, data is conditionally moved to L1 or sent back to memory
- Lines brought in by spatial-locality-hint instructions are flushed back to memory to reduce pollution
Multi-lateral Cache Architecture
[Figure: a processor core connected to two cache units A and B, both connected to memory: a fully connected multi-lateral cache architecture]
- Most cache architectures can be generalized into this form

Cache Architecture Taxonomy
[Figure: taxonomy diagrams showing how different designs instantiate units A and B between processor and memory: single-level cache, two-level cache, victim cache, assist cache, and the NTS and PCS caches]
Non-blocking (Lockup-Free) Cache [Kroft 81]
- Prevents the pipeline from stalling on cache misses: the cache continues to serve hits to other lines while servicing a miss on one or more lines
- Uses Miss Status Holding Registers (MSHRs)
  - Track cache misses; one entry is allocated per cache miss (called a fill buffer in the Intel P6 family)
  - Each new cache miss checks against the MSHRs
  - The pipeline stalls on a cache miss only when the MSHRs are full
- Choose the number of MSHR entries carefully to match the sustainable bus bandwidth

Bus Utilization (MSHR = 2)
[Figure: bus timeline for misses m1 through m5 with 2 MSHRs, showing the initiation interval, lead-off latency, 4-chunk data transfers, stalls due to insufficient MSHRs, and bus idle time]
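The stall behavior in the bus-utilization figures can be sketched in a few lines of Python. This is a simplified model (not the actual Kroft design): misses arrive at fixed cycles, each occupies an MSHR for a fixed service time, and the pipeline stalls only when all MSHRs are busy. The function name and latencies are illustrative assumptions.

```python
def stalls(miss_times, num_mshr, service_time):
    """Total stall cycles for a stream of miss issue times (simplified model)."""
    busy = []                                 # completion times of outstanding misses
    stall_cycles = 0
    for t in miss_times:
        t += stall_cycles                     # later misses are delayed by earlier stalls
        busy = [c for c in busy if c > t]     # retire misses that have completed
        if len(busy) == num_mshr:             # all MSHRs occupied: pipeline stalls
            wait = min(busy) - t
            stall_cycles += wait
            t += wait
            busy = [c for c in busy if c > t]
        busy.append(t + service_time)         # allocate an MSHR for this miss
    return stall_cycles

misses = [0, 1, 2, 3]                         # four back-to-back misses
stalls(misses, num_mshr=2, service_time=10)   # stalls: only 2 misses can be outstanding
stalls(misses, num_mshr=4, service_time=10)   # no stalls with 4 MSHRs
```

This mirrors the two figures: with MSHR = 2 the third miss must wait for an entry to free up, while with MSHR = 4 all four misses overlap and the bus stays busy.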
Bus Utilization (MSHR = 4)
[Figure: bus timeline with 4 MSHRs, showing fewer stalls and higher memory-bus utilization than with 2 MSHRs]

Sample question
- What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)
- Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)
- What are the two main functions of a ReOrder Buffer (ROB)?
Sample question (answers)
- What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)
  - Tomasulo's algorithm performs register renaming.
- Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)
  - Because of the long latency of the FPU, and because the FPU has only 4 registers.
- What are the two main functions of a ReOrder Buffer (ROB)?
  - To support (i) precise exceptions and (ii) branch misprediction recovery.

Sample question
- What is the main responsibility of the Load Store Queue?
- Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is shown in the leftmost table below. Note that physical registers are allocated in circular numbering order (i.e., T0, T1, ..., T15, then back to T0). Assuming the renaming logic can rename 4 instructions per clock cycle, fill in the RAT content one cycle later for the following instruction sequence. (The destination register of arithmetic instructions is on the left-hand side.)
- What are the two main functions of a ReOrder Buffer (ROB)?
[Table: RAT before and after one cycle; contents not reproduced]
Sample question (answers)
- What is the main responsibility of the Load Store Queue?
  - To perform memory address disambiguation and maintain memory ordering.
- (RAT renaming exercise as stated on the previous slide; the filled-in table is not reproduced here.)

Sample question
- Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?
- A cache that allows multiple cache misses to be outstanding to main memory at the same time, without stalling the pipeline: what is it called?
- While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.
Sample question (answers)
- Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?
  - Multiporting, or duplicating the structure.
- A cache that allows multiple cache misses to be outstanding to main memory at the same time, without stalling the pipeline: what is it called?
  - A non-blocking (or lockup-free) cache.
- While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.
  - Miss status holding registers (MSHRs).

Sample question
Consider a processor with separate instruction and data caches (and no L2 cache). We focus on improving the data cache performance, since the instruction cache achieves a 100% hit rate with various optimizations. The data cache is 4 KB, direct-mapped, and has a single-cycle access latency. The processor supports a 64-bit virtual address space, 8 KB pages, and no more than 16 GB of physical memory. The cache block size is 32 bytes. The data cache is virtually indexed and physically tagged. Assume the data TLB hit rate is 100%. The measured miss rate of the data cache is 10%, and the miss penalty is 20 cycles.
- Compute the average memory access latency (in cycles) for data accesses.
- To improve the overall memory access latency, we introduce a victim cache: fully associative, eight entries, one-cycle access latency. To save power and energy, the victim cache is accessed only after a miss is detected in the data cache. The victim cache hit rate is measured to be 50% (i.e., the probability of finding data in the victim cache given that the data cache does not have it). Miss handling starts only after a miss is detected in the victim cache. Compute the average memory access latency for data accesses.
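The two parts of the question above can be worked through with the standard AMAT formula. This is a sketch of one natural reading of the problem (serial victim-cache access adds one cycle to every L1 miss); the function name is mine, not from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Part 1: no victim cache.
base = amat(1, 0.10, 20)                 # 1 + 0.10 * 20 = 3.0 cycles

# Part 2: serial victim cache. An L1 miss always pays 1 extra cycle for the
# VC lookup; only the 50% of L1 misses that also miss in the VC pay the
# 20-cycle memory penalty. The effective L1 miss penalty is 1 + 0.5 * 20.
with_vc = amat(1, 0.10, 1 + 0.5 * 20)    # 1 + 0.10 * 11 = 2.1 cycles
```

So the eight-entry victim cache cuts the average data-access latency from 3.0 to 2.1 cycles under these measured rates.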
Prefetch (Data/Instruction)
- Predict what data will be needed in the future
- Pollution vs. latency reduction
  - If you correctly predict the data that will be required, you reduce latency
  - If you mispredict, you bring in unwanted data and pollute the cache
- To determine the effectiveness, ask:
  - When to initiate a prefetch? (timeliness)
  - Which lines to prefetch?
  - How big a line to prefetch? (note that the cache mechanism itself already performs a form of prefetching)
  - What to replace?
- Software (data) prefetching vs. hardware prefetching

Software-controlled Prefetching
- Use instructions
  - Existing instructions: Alpha's load to r31 (hardwired to 0)
  - Specialized instructions and hints:
    - Intel SSE: prefetchnta, prefetcht0/t1/t2
    - MIPS32: PREF
    - PowerPC: dcbt (data cache block touch), dcbtst (data cache block touch for store)
- Prefetch instructions are inserted by the compiler or by hand
Alpha
The Alpha architecture supports data prefetch via load instructions with a destination of register R31 or F31:
- LDBU, LDF, LDG, LDL, LDT, LDWU: normal cache-line prefetch
- LDS: prefetch with modify intent; sets the dirty and modified bits
- LDQ: prefetch, evict next; no temporal locality
The Alpha architecture also defines the following instructions:
- FETCH: prefetch data
- FETCH_M: prefetch data, modify intent

PowerPC
- dcbt: Data Cache Block Touch
- dcbtst: Data Cache Block Touch for Store

Intel SSE
The SSE prefetch instruction has the following variants:
- prefetcht0: temporal data; prefetch data into all cache levels
- prefetcht1: temporal with respect to the first-level cache; prefetch data into all cache levels except the 0th level
- prefetcht2: temporal with respect to the second-level cache; prefetch data into all cache levels except the 0th and 1st levels
- prefetchnta: non-temporal with respect to all cache levels; prefetch data into a non-temporal cache structure, with minimal cache pollution
Software-controlled Prefetching

    for (i = 0; i < N; i++) {
        prefetch(&a[i+1]);
        prefetch(&b[i+1]);
        sop = sop + a[i]*b[i];
    }

    /* unroll loop 4 times */
    for (i = 0; i < N-4; i += 4) {
        prefetch(&a[i+4]);
        prefetch(&b[i+4]);
        sop = sop + a[i]*b[i];
        sop = sop + a[i+1]*b[i+1];
        sop = sop + a[i+2]*b[i+2];
        sop = sop + a[i+3]*b[i+3];
    }
    sop = sop + a[N-4]*b[N-4];
    sop = sop + a[N-3]*b[N-3];
    sop = sop + a[N-2]*b[N-2];
    sop = sop + a[N-1]*b[N-1];

For prefetching to hide the miss, the prefetch latency must be <= the computation time between the prefetch and the use.

Hardware-based Prefetching
- Sequential prefetching
  - Prefetch on miss
  - Tagged prefetch
- Both techniques are based on One Block Lookahead (OBL): prefetch line (L+1) when line L is accessed, based on some criterion
Sequential Prefetching
- Prefetch on miss
  - Initiate a prefetch of (L+1) whenever an access to L results in a miss
  - The Alpha 21064 does this for instructions (prefetched instructions are stored in a separate structure called a stream buffer)
- Tagged prefetch
  - Idea: whenever there is a first use of a line (demand-fetched, or a previously prefetched line), prefetch the next one
  - One additional tag bit per cache line
  - Tag the prefetched, not-yet-used line (tag = 1)
  - Tag bit = 0: the line was demand-fetched, or a prefetched line has been referenced for the first time
  - Prefetch (L+1) only if the tag bit of L is 1

[Figure: prefetch-on-miss vs. tagged prefetch when accessing contiguous lines. Prefetch-on-miss issues a new prefetch only on a miss, so every other line misses; tagged prefetch issues a new prefetch on the first reference to each prefetched line (tag 1 -> 0), staying ahead of the access stream after the initial miss]
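The difference the figure illustrates can be reproduced with a small Python simulation of both OBL schemes on a sequential access stream. This is an illustrative sketch (no capacity or replacement modeling); function and variable names are mine.

```python
def simulate(accesses, tagged):
    """Count misses for one-block-lookahead prefetching on a line-access stream."""
    cache = set()        # line numbers currently present
    tag = set()          # prefetched, not-yet-referenced lines (tagged scheme only)
    misses = 0
    for line in accesses:
        if line in cache:
            if tagged and line in tag:   # first use of a prefetched line
                tag.discard(line)        # tag bit 1 -> 0 ...
                cache.add(line + 1)      # ... triggers the next prefetch
                tag.add(line + 1)
        else:
            misses += 1                  # demand fetch on a miss
            cache.add(line)
            cache.add(line + 1)          # both schemes prefetch (L+1) on a miss
            if tagged:
                tag.add(line + 1)
    return misses

seq = list(range(10))                    # access lines 0..9 in order
simulate(seq, tagged=False)              # prefetch-on-miss: misses every other line
simulate(seq, tagged=True)               # tagged: only the very first access misses
```

Prefetch-on-miss never re-prefetches after a hit, so on a purely sequential stream it misses on half the lines, while tagged prefetch misses only once.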
Virtual Memory
- Virtual memory: the separation of logical memory from physical memory
  - Only part of a program needs to be in memory for execution, so the logical address space can be much larger than the physical address space
  - Allows address spaces to be shared by several processes (or threads)
  - Allows for more efficient process creation
- Virtual memory can be implemented via:
  - Demand paging
  - Demand segmentation
- Main memory acts like a cache for the hard disk!
Virtual Address
- The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management
  - Virtual address: generated by the CPU
  - Physical address: seen by the memory
- Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; virtual and physical addresses differ in execution-time address-binding schemes

Advantages of Virtual Memory
- Translation
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Only the most important part of the program (the "working set") must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can grow later
- Protection
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read-only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection from malicious programs => far more viruses under Microsoft Windows
- Sharing
  - The same physical page can be mapped to multiple users ("shared memory")
Use of Virtual Memory
[Figure: two process address spaces (code, static data, heap, stack, shared libraries) for Process A and Process B; their shared-library regions map to the same shared physical page]

Virtual vs. Physical Address Space
[Figure: a 4 GB virtual address space with pages A, B, C, D at 0, 4k, 8k, 12k; page C maps to a frame in main memory, while pages A, B, and D reside on disk]
Paging
- Divide physical memory into fixed-size blocks (e.g., 4 KB) called frames
- Divide logical memory into blocks of the same size (4 KB) called pages
- To run a program of size n pages, find n free frames and load the program
- Set up a page table to map page addresses to frame addresses (the operating system sets up the page table)

Page Table and Address Translation
[Figure: the virtual address splits into a virtual page number (VPN) and a page offset; the VPN indexes the page table to obtain the physical page number (PPN), which is concatenated with the offset to form the physical address into main memory]
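The translation in the figure above can be sketched in a few lines. This is a toy model: the page-table contents are invented for illustration (in a real system the OS fills them in), and a missing VPN simply raises an exception where a real MMU would take a page fault.

```python
PAGE_SIZE = 4096                       # 4 KB pages, as in the slide

# Hypothetical single-level page table: VPN -> PPN.
page_table = {0x2AA70: 0x120D}

def translate(va, page_table):
    """Split VA into (VPN, offset), look up the PPN, and rebuild the PA."""
    vpn, offset = divmod(va, PAGE_SIZE)
    ppn = page_table[vpn]              # KeyError here models a page fault
    return ppn * PAGE_SIZE + offset

va = 0x2AA70 * PAGE_SIZE + 0x123       # some address inside page 0x2AA70
translate(va, page_table)              # maps to frame 0x120D, same offset
```

Note that the offset passes through unchanged: translation only replaces the page number, which is why page and frame sizes must match.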
Page Table Structure Examples
- One-to-one mapping: how much space does it take?
  - Large pages: internal fragmentation (similar to having large line sizes in caches)
  - Small pages: page table size issues
- Multi-level paging
- Inverted page table
- Example: 64-bit address space, 4 KB pages (12 bits), 512 MB (29 bits) of RAM
  - Number of pages = 2^64 / 2^12 = 2^52 (the page table has that many entries)
  - Each entry is ~4 bytes, so the page table is 2^54 bytes = 16 petabytes!
  - The page table cannot fit in the 512 MB RAM!

Multi-level (Hierarchical) Page Table
- Divide the virtual address into multiple levels: P1 | P2 | page offset
- Level 1 is a page directory (pointer array) stored in main memory; P1 indexes it to find a level-2 page table
- The level-2 page table stores the PPN; P2 indexes it
- The PPN is concatenated with the page offset to form the physical address
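The two-level walk described above can be sketched as follows. The bit widths are an assumption for illustration (a 32-bit address with 10-bit P1 and P2 fields, as in classic x86 paging), not the 64-bit example from the slide; dictionaries stand in for in-memory tables, which also shows the space saving: only the level-2 tables that are actually used get allocated.

```python
def walk(va, directory):
    """Two-level page-table walk: directory[P1] -> table, table[P2] -> PPN."""
    p1 = (va >> 22) & 0x3FF            # top 10 bits index the level-1 directory
    p2 = (va >> 12) & 0x3FF            # next 10 bits index the level-2 table
    offset = va & 0xFFF                # low 12 bits pass through unchanged
    table = directory[p1]              # pointer to one level-2 page table
    ppn = table[p2]                    # physical page number
    return (ppn << 12) | offset

# Only one level-2 table exists; the other 1023 directory slots cost nothing.
directory = {1: {2: 0xABC}}
walk((1 << 22) | (2 << 12) | 0x34, directory)   # -> 0xABC034
```

A flat table for the same 32-bit space would need 2^20 entries up front; the hierarchy allocates level-2 tables on demand.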
Inverted Page Table
- One entry for each real page of memory, shared by all active processes
- An entry consists of the virtual address of the page stored in that real memory location, together with process ID information
- Decreases the memory needed to store each page table, but increases the time needed to search the table when a page reference occurs

Linear Inverted Page Table
- Contains one entry per physical page, in a linear array
- The array must be traversed sequentially to find a match, which can be time consuming
[Figure: a lookup with PID = 8, VPN = 0x2AA70 scans the (PID, VPN) array linearly and matches at index 0x120D; that index is the PPN, which is concatenated with the offset to form the physical address]
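A linear inverted page table can be sketched directly from the figure: the array index is the frame number, and the payload is who owns it. The table contents below follow the slide's example values but are otherwise illustrative; the `(0, 0)` filler entries just pad unrelated frames.

```python
# One (pid, vpn) entry per physical frame; index = PPN.
ipt = [(1, 0x74094), (12, 0xFEA00), (1, 0x00023)] + [(0, 0)] * 0x120A + [(8, 0x2AA70)]

def lookup(pid, vpn, ipt):
    """Sequential scan: O(number of physical frames) in the worst case."""
    for ppn, entry in enumerate(ipt):
        if entry == (pid, vpn):
            return ppn                 # the matching index IS the frame number
    return None                        # no match: page fault

lookup(8, 0x2AA70, ipt)                # matches at index 0x120D, as in the figure
```

The table size scales with physical memory (one entry per frame), not with the 2^52-page virtual space, which is the whole point; the cost is the scan, which the hashed variant on the next slide addresses.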
Hashed Inverted Page Table
- Use a hash table to limit the search to a smaller number of page-table entries
[Figure: PID = 8 and VPN = 0x2AA70 hash into a hash anchor table, which points to the head of a chain of inverted-page-table entries; following the Next pointers (e.g., 0x0012 -> 0x120D) finds the matching entry at index 0x120D]

Fast Address Translation
- How often does address translation occur? Where is the page table kept?
- Keep translations in hardware: use a Translation Lookaside Buffer (TLB)
  - Instruction-TLB and data-TLB
  - Essentially a cache (tag array = VPN, data array = PPN)
  - Small (32 to 256 entries are typical)
  - Typically fully associative (implemented as a content-addressable memory, CAM) or highly associative, to minimize conflicts
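The hash-anchor-plus-chain structure can be sketched as below. Everything here is illustrative: the toy hash function, the chain layout, and the entry values (chosen to echo the slide's 0x120D example) are assumptions, not the real PA-RISC/PowerPC format.

```python
def hash_fn(pid, vpn):
    return (pid ^ vpn) % 8             # toy hash; real designs use wider hashes

# Entries indexed by PPN; 'nxt' links chain together entries that collide.
entries = {
    0x0012: dict(pid=14, vpn=0x2409A, nxt=0x120D),
    0x120D: dict(pid=8,  vpn=0x2AA70, nxt=None),
}
anchor = {hash_fn(8, 0x2AA70): 0x0012}  # hash anchor table: hash -> chain head

def lookup(pid, vpn):
    """Hash once, then walk only the (short) collision chain."""
    idx = anchor.get(hash_fn(pid, vpn))
    while idx is not None:
        e = entries[idx]
        if (e['pid'], e['vpn']) == (pid, vpn):
            return idx                 # entry index is the PPN
        idx = e['nxt']
    return None                        # chain exhausted: page fault

lookup(8, 0x2AA70)                     # walks 0x0012 -> 0x120D and matches
```

Instead of scanning every frame, the lookup visits only the handful of entries whose (PID, VPN) hashed to the same anchor slot.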
Example: Alpha 21264 Data TLB
[Figure: each 21264 data TLB entry holds an 8-bit address space number (ASN), protection bits, a 35-bit virtual tag, and a 31-bit PPN; a 128:1 mux selects the matching entry, and the 35-bit VPN plus the 13-bit page offset are translated into a 44-bit physical address]
TLB and Caches: Several Design Alternatives
- VIVT: virtually indexed, virtually tagged cache
- VIPT: virtually indexed, physically tagged cache
- PIVT: physically indexed, virtually tagged cache
  - Not generally useful; the MIPS R6000 is the only design that used it
- PIPT: physically indexed, physically tagged cache
Virtually-Indexed Virtually-Tagged (VIVT)
[Figure: the processor core sends the VA directly to the VIVT cache; on a hit, the cache line returns immediately, and the TLB is consulted only on a miss, on the way to main memory]
- Fast cache access
- Address translation is required only on a miss (when going to memory)
- Issues?

VIVT Cache Issues: Aliasing
- Homonym: the same VA maps to different PAs
  - Occurs when there is a context switch
  - Solutions: include the process ID (PID) in the cache, or flush the cache on context switches
- Synonym (also a problem in VIPT): different VAs map to the same PA
  - Occurs when data is shared by multiple processes
  - Cache lines get duplicated in a VIPT cache, and in a VIVT cache with PIDs
  - Data becomes inconsistent due to the duplicated locations
  - Solutions:
    - Can write-through solve the problem?
    - Flush the cache on context switches
    - If (index + offset) fits within the page offset, can the problem be solved? (discussed later under VIPT)
Physically-Indexed Physically-Tagged (PIPT)
[Figure: the processor core sends the VA to the TLB; the resulting PA accesses the PIPT cache, which goes to main memory on a miss, and the cache line returns to the core]
- Slower: the address is always translated before the cache access
- Simpler for data coherence
Virtually-Indexed Physically-Tagged (VIPT)
[Figure: the TLB translation and the VIPT cache index proceed in parallel on the VA; the PA from the TLB is compared against the cache tag, and main memory is accessed on a miss]
- Gains the benefits of both VIVT and PIPT
- Parallel access to the TLB and the VIPT cache
- No homonym problem
- How about synonyms?

Dealing with Synonyms in a VIPT Cache
[Figure: Process A uses VPN A and Process B uses VPN B to point to the same location within a page; since VPN A != VPN B, the same data can land at two different indices in the tag and data arrays]
- How to eliminate the duplication? Make Index A == Index B
Synonym in a VIPT Cache
- The address splits into cache tag | set index | line offset; call "a" the part of the set index that overlaps the VPN (the index bits above the page offset)
- If two VPNs do not differ in a, there is no synonym problem, since they will be indexed to the same set of the VIPT cache
- This implies the number of sets cannot be too big
  - Max number of sets = page size / cache line size
  - Example: 4 KB page, 32 B line: max 128 sets
- A more complicated solution is used in the MIPS R10000

R10000's Solution to Synonyms
- 32 KB, 2-way, virtually indexed L1; the index extends 2 bits beyond the 12-bit page offset, so a = VPN[1:0]
- Direct-mapped, physically indexed L2; L2 is inclusive of L1
- VPN[1:0] is stored as part of the L2 cache tag
- Given two virtual addresses VA1 and VA2 that differ in VPN[1:0] and both map to the same physical address PA:
  - Suppose VA1 is accessed first, so blocks are allocated in L1 and L2
  - What happens when VA2 is referenced?
    1. VA2 indexes to a different block in L1 and misses
    2. VA2 translates to PA and goes to the same block as VA1 in L2
    3. The tag comparison fails (since VA1[1:0] != VA2[1:0])
    4. It is treated just like an L2 conflict miss: VA1's entry in L1 is evicted (written back if dirty) due to the inclusion policy
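The set-count limit above, and why the R10000's L1 violates it, can be checked with a little arithmetic. This is a sketch of the constraint only; function names are mine.

```python
def max_sets(page_size, line_size):
    """Largest alias-free set count: every index bit must lie in the page offset."""
    return page_size // line_size

def max_vipt_size(page_size, line_size, ways):
    """Largest VIPT cache with no synonym problem, for a given associativity."""
    return max_sets(page_size, line_size) * line_size * ways

max_sets(4096, 32)               # 128 sets, matching the slide's example
max_vipt_size(4096, 32, 2)       # 8 KB: the alias-free limit for a 2-way cache
```

A 32 KB 2-way L1 with 32 B lines is four times this 8 KB limit, so two of its index bits come from the VPN; those are exactly the a = VPN[1:0] bits the R10000 stores in the L2 tag to detect synonyms.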
Dealing with Synonyms in the MIPS R10000
[Figure: VA1 (page offset | index | a1) has already allocated a line; VA2 (page offset | index | a2) misses in the L1 VIPT cache, and its physical index hits the same block in the inclusive L2 PIPT cache, but the stored a1 != a2, so VA1's copy in L1 is invalidated]
[Figure: after the miss is handled, the data returns with a2 recorded in the L2 tag; only one copy is present in L1]