Time. Who Cares About the Memory Hierarchy? Performance. Where Have We Been?


Alaina Cummings
5 years ago

CS5 / EE5365 cache.

Where Have We Been?
- Multi-Cycle Control
- Finite State Machines
- Microsequencing (Microprogramming)
- Exceptions
- Pipelining Datapath
- Making use of multi-cycle datapath
- Pipelining Control
What's Next?

Who Cares About the Memory Hierarchy?
- Processor-DRAM Gap (latency)
[Figure: performance vs. time for the CPU (µProc, following Moore's Law) and DRAM, showing the Processor-Memory Performance Gap growing every year]

Performance Growth is Really Super-Exponential
[Figure: from "Robot: Mere Machine to Transcendent Mind" by Hans Moravec]

The Big Picture: Where are We Now?
- The Five Classic Components of a Computer: Control, Datapath, Memory, Input, Output
- Today's Topics: Recap last lecture, Review, Advanced, Virtual Memory, Protection, TLB

Memory Hierarchy of a Modern Computer System
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
[Figure: Processor (Datapath, Control, Registers), On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk), with speed (ns) and size (bytes) for each level]

Recap: Static RAM Cell (6-Transistor SRAM Cell)
[Figure: cell with a word (row select) line and a pair of complementary bit lines; one device is replaced with a pullup to save area]
- Write:
  1. Drive the bit lines (bit and its complement)
  2. Select row
- Read:
  1. Precharge bit and its complement to Vdd
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects the difference between bit and its complement

Recap: 1-Transistor Memory Cell (DRAM)
[Figure: cell with a row select line and a single bit line]
- Write:
  1. Drive bit line
  2. Select row
- Read:
  1. Precharge bit line to Vdd
  2. Select row
  3. Cell and bit line share charges
     - Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     - Can detect changes of ~ million electrons
  5. Write: restore the value
- Refresh:
  1. Just do a dummy read to every cell.

DRAMs over Time
[Table: by DRAM generation: first-generation sample date, memory size, die size (mm²), memory area (mm²), memory cell area (µm²); from Kazuhiro Sakashita, Mitsubishi]

Preview: Two Different Types of Locality
- Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
- SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.

We Exploit Locality by Providing a Window on the World
- Programs Tend to Execute Instructions for Awhile
[Figure: "Programs Look Like This": tight loops repeated many times with an occasional non-local jump and a cache refill from main memory; "Not Like This": a refill on nearly every step]
- Once the cache is full of a tight group of instructions, it can scream!

The Art of Memory System Design
[Figure: workload or benchmark programs run on the processor, which issues a reference stream <op,addr>, <op,addr>, <op,addr>, ... to the cache ($) and then to memory (MEM); op is one of i-fetch, read, write]
- Optimize the memory system organization to minimize the average memory access time for typical workloads.
- (This is very different from embedded or real-time designs, which optimize for the worst case.)

Direct-Mapped Cache
- Mapping: the address is taken modulo the number of blocks in the cache

So, What Info Does the Cache Need to Hold?
- The Data (or Instructions)
- Which of the many addresses it is filled with
- Whether or not it is valid (e.g. first access)
How are we going to organize it?

Direct Mapped Cache Organization for MIPS
[Figure: address split into upper bits and a byte offset]
- What kind of locality are we taking advantage of?

Direct Mapped Cache Organization for MIPS
[Figure: address (showing bit positions) split into Tag, Index, and Byte offset; the Index selects an entry holding Valid, Tag, and Data; a tag match on a valid entry signals Hit]
- What kind of locality are we taking advantage of?
- How do we take advantage of the other locality?

Taking Advantage of Spatial Locality
- Load multiple words at once: Block Read
[Figure: address (showing bit positions) split into Tag, Index, Block offset, and Byte offset; 4K entries, each with V, Tag, and a multi-word Data block; a MUX selects the word using the Block offset]
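The organization above (a valid bit and a tag per entry, with the entry chosen by the block address modulo the number of blocks) can be sketched in a few lines. This is only an illustrative model, not the hardware described in the slides: the class name, the tiny cache size, and the address trace are all assumptions, and the data array is left out because only hits and misses are tracked.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only valid bits and tags."""

    def __init__(self, num_blocks, block_bytes):
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.valid = [False] * num_blocks
        self.tags = [0] * num_blocks

    def access(self, addr):
        """Return True on a hit; on a miss, fill the entry and return False."""
        block_addr = addr // self.block_bytes      # drop the byte offset
        index = block_addr % self.num_blocks       # modulo the number of blocks
        tag = block_addr // self.num_blocks        # remaining upper bits
        if self.valid[index] and self.tags[index] == tag:
            return True
        self.valid[index] = True                   # miss: replace whatever was there
        self.tags[index] = tag
        return False


# Hypothetical cache: 8 blocks of 4 bytes each.
cache = DirectMappedCache(num_blocks=8, block_bytes=4)
trace = [0, 4, 0, 32, 0]                           # byte addresses (assumed)
results = [cache.access(a) for a in trace]
```

Addresses 0 and 32 land on the same index in this configuration, so the second access to 0 hits but the third one misses after 32 has displaced it.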

Example: Direct Mapped Cache
- For a 2 ** N byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2 ** M)
[Figure: address with the Cache Tag in the upper bits (stored as part of the cache state), the Cache Index below it ending at bit 9, and the Byte Select in the low bits ending at bit 4; each cache entry holds a Valid Bit, a Cache Tag, and a block of Cache Data bytes]

Terminology: Hits vs. Misses
- Read hits
  - This is what we want! Data is already in the cache.
- Read misses
  - Stall the CPU, fetch block from memory, deliver to cache, restart
- Write hits
  - Can replace data in cache and memory (write-through)
  - Or write the data only into the cache (write back to memory later)
- Write misses
  - Read the entire block into the cache, then write the word

Example: Robot Control Code
[Figure: reference trace over tmp[], kp[], kv[], pos[], vel[], and torq[], shown next to the cache's Tag and Data contents (count, flag, i, Tmp[], Pos[], Vel[], Kp[], Kv[], Torq[])]
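The tag / index / byte-select rule for a 2 ** N byte direct-mapped cache with 2 ** M byte blocks can be written out directly. The sketch below is an illustration under assumptions: the function name is hypothetical, and the values N = 10, M = 5 and the sample address are chosen for the example rather than taken from the slides.

```python
def split_address(addr, n_bits, m_bits):
    """Split an address for a direct-mapped cache of 2**n_bits bytes
    with 2**m_bits-byte blocks. Returns (tag, index, byte_select)."""
    byte_select = addr & ((1 << m_bits) - 1)                   # lowest M bits
    index = (addr >> m_bits) & ((1 << (n_bits - m_bits)) - 1)  # next N - M bits
    tag = addr >> n_bits                                       # everything above bit N - 1
    return tag, index, byte_select


# Assumed parameters: N = 10 (1 KB cache), M = 5 (32-byte blocks).
tag, index, byte_select = split_address(0x00014020, n_bits=10, m_bits=5)
```

With these assumed parameters the index is 5 bits wide, so the cache has 32 entries, and the tag is whatever remains above bit 9 of the address.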

Where Have We Been?
- Pipelining Datapath
- Making use of multi-cycle datapath
- Pipelining Control
- Direct-Mapped Cache
- Block Reads
What's Next?
- More Set-Associative

Block Size Tradeoff
- In general, a larger block size takes advantage of spatial locality, BUT:
  - A larger block size means a larger miss penalty
    - It takes longer to fill up the block
  - If the block size is too big relative to the cache size, the miss rate will go up
    - Too few cache blocks
- In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: three plots against Block Size. Miss Penalty rises with block size. Miss Rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality). Average Access Time falls and then rises (increased miss penalty and miss rate).]

Block Size Effect is Real
- Increasing the block size tends to decrease the miss rate, up to a point; then it starts increasing.
[Figure: miss rate (%) vs. block size (bytes), one curve per cache size]
[Table: instruction miss rate, data miss rate, and effective combined miss rate for gcc and spice at two block sizes (in words)]
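The block size tradeoff can be made concrete by plugging numbers into the average access time formula. The sketch below uses the slide's form of the formula; the two design points (hit time, miss rate, and miss penalty in cycles) are invented purely for illustration.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate.

    miss_penalty is the time an access takes when it misses.
    """
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate


# Hypothetical design points; all numbers are assumptions, in cycles.
small_blocks = average_access_time(hit_time=1, miss_rate=0.10, miss_penalty=20)
large_blocks = average_access_time(hit_time=1, miss_rate=0.06, miss_penalty=40)
```

In this made-up case the larger block lowers the miss rate, yet the doubled miss penalty still makes the average access time worse, which is the effect the tradeoff slide warns about.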

Extreme Example: single big line
[Figure: one cache entry with a Valid Bit, a Cache Tag, and Cache Data bytes]
- Cache Size = 4 bytes, Block Size = 4 bytes
  - Only ONE entry in the cache
- If an item is accessed, it is likely that it will be accessed again soon
  - But it is unlikely that it will be accessed again immediately!!!
  - The next access will likely be a miss again
    - Continually loading data into the cache but discarding (forcing out) them before they are used again
    - Worst nightmare of a cache designer: Ping Pong Effect
- Conflict Misses are misses caused by different memory locations mapped to the same cache index
  - Solution 1: make the cache size bigger
  - Solution 2: multiple entries for the same Cache Index

Another Extreme Example: Fully Associative
- Fully Associative Cache
  - Forget about the Cache Index
  - Compare the Cache Tags of all cache entries in parallel
  - Example: we need N tag comparators, one for each of the N cache entries
- By definition: Conflict Miss = 0 for a fully associative cache
[Figure: address split into Cache Tag and Byte Select; the tag is compared in parallel against the Cache Tag of every entry, each of which also holds a Valid Bit and Cache Data bytes]

Where Have We Been?
- Pipelining Control
- Direct-Mapped Cache
- Block Reads
- More Set-Associative
What's Next?
- More Write Policies
- Virtual Memory

Decreasing miss ratio with associativity
[Figure: the same cache organized four ways. One-way set associative (direct mapped): one Tag/Data pair per block. Two-way set associative: two Tag/Data pairs per set. Four-way set associative: four Tag/Data pairs per set. Eight-way set associative (fully associative): eight Tag/Data pairs in a single set.]
- Compared to direct mapped, give a series of references that:
  - results in a lower miss ratio using a two-way set associative cache
  - results in a higher miss ratio using a two-way set associative cache
  assuming we use the least recently used replacement strategy

A Two-way Set Associative Cache
- N-way set associative: N entries for each Cache Index
  - N direct mapped caches operate in parallel
- Example: two-way set associative cache
  - The Cache Index selects a set from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag result
[Figure: two banks of Valid, Cache Tag, and Cache Data; the address tag is compared against both banks, the compare results drive the select inputs of a Mux that picks the Cache Block, and they are ORed to form Hit]
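The exercise above (find reference series where two-way set associativity with LRU does better, and worse, than direct mapped) can be explored with a small simulator. This is a sketch under assumptions: both caches hold 8 blocks, references are block addresses, and the two traces are invented examples rather than the course's answers.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set-associative cache with LRU replacement (tags only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: keys are tags, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_addr):
        """Return True on a hit, False on a miss (filling the block)."""
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)      # evict the least recently used tag
        lines[tag] = True
        return False


def count_misses(num_sets, ways, trace):
    cache = SetAssociativeCache(num_sets, ways)
    return sum(not cache.access(block) for block in trace)


# Both configurations hold 8 blocks in total (assumed size).
ping_pong = [0, 8, 0, 8, 0, 8]     # two blocks that collide when direct mapped
three_way = [0, 4, 8, 0, 4, 8]     # three blocks that share one 2-way set

better = (count_misses(8, 1, ping_pong), count_misses(4, 2, ping_pong))
worse = (count_misses(8, 1, three_way), count_misses(4, 2, three_way))
```

Each tuple is (direct-mapped misses, two-way misses). The first trace favors the two-way cache because both conflicting blocks fit in one set; the second shows LRU cycling three blocks through two ways and missing every time.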

Disadvantage of Set Associative Cache
- N-way Set Associative Cache versus Direct Mapped Cache:
  - N comparators vs. 1
  - Extra MUX delay for the data
  - Data comes AFTER Hit/Miss decision and set selection
- In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  - Possible to assume a hit and continue. Recover later if miss.
[Figure: two banks of Valid, Cache Tag, and Cache Data; the compare results drive the Mux select inputs and are ORed to form Hit]

Book's Figure for 4-Way Set Associative
[Figure: the address Index selects one set of four V/Tag/Data entries; four tag comparisons feed a multiplexor that produces Hit and Data]

Performance
[Figure: miss rate (%) vs. associativity (one-way, two-way, four-way, eight-way), one curve per cache size]

A Summary on Sources of Cache Misses
- Compulsory (cold start or process migration, first reference): first access to a block
  - Cold fact of life: not a whole lot you can do about it
  - Note: if you are going to run billions of instructions, Compulsory Misses are insignificant
- Conflict (collision): multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Capacity: the cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- Invalidation: another process (e.g., I/O) updates memory

Sources of Cache Misses Quiz (for constant complexity)
[Table to fill in. Columns: Direct Mapped, N-way Set Associative, Fully Associative. Rows: Cache Size (Small, Medium, Big?), Compulsory Miss, Conflict Miss, Capacity Miss, Invalidation Miss. Choices: Zero, Low, Medium, High, Same.]

How Do you Design a Cache?
- Set of operations that must be supported:
  - read: data <= Mem[Physical Address]
  - write: Mem[Physical Address] <= Data
- Determine the internal register transfers
- Design the Datapath
- Design the Cache Controller
[Figure: memory "Black Box" with Physical Address, Read/Write, and Data. Inside it has Tag-Data Storage, Muxes, Comparators, ... The Cache DataPath has Address, Data In, and Data Out; the Cache Controller drives its control points and exchanges R/W, Active, and wait signals.]

Impact on Cycle Time
- Cache Hit Time:
  - directly tied to clock rate
  - increases with cache size
  - increases with associativity
- Average Memory Access time = Hit Time + Miss Rate x Miss Penalty
- Time = IC x CT x (ideal CPI + memory stalls)
- Example: direct map allows miss signal after data
[Figure: pipeline with PC, I-Cache, IR, IRex, A, B, IRm, D-Cache, IRwb, R, T, with miss and invalid signals]

Improving Cache Performance: 3 general options
1. Reduce the miss rate
2. Reduce the time to hit in the cache
3. Reduce the miss penalty

Decreasing miss penalty with multilevel caches
- Add a second level cache:
  - often the primary cache is on the same chip as the processor
  - use SRAMs to add another cache above primary memory (DRAM)
  - the miss penalty goes down if the data is in the 2nd level cache
- Example: start from a machine with a given CPI, clock rate, first-level miss rate, and DRAM access time; adding a 2nd level cache with a much shorter access time decreases the miss rate to main memory.
- Using multilevel caches:
  - try to optimize the hit time on the 1st level cache
  - try to optimize the miss rate on the 2nd level cache
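The effect of a second level cache on CPI can be worked through numerically. The figures in the slide's example did not survive in this transcript, so every number below (base CPI, clock rate, miss rates per instruction, and access times) is an assumption picked only to show the arithmetic.

```python
def stall_cycles(access_ns, clock_mhz):
    """Convert an access time in nanoseconds to clock cycles."""
    return access_ns * clock_mhz / 1000.0


# All values are assumptions for illustration, not the slide's numbers.
BASE_CPI = 1.0            # CPI with a perfect memory system
CLOCK_MHZ = 500.0         # 2 ns cycle time
L1_MISS_RATE = 0.05       # misses per instruction in the 1st level cache
DRAM_NS = 200.0           # main memory access time
L2_NS = 20.0              # 2nd level cache access time
GLOBAL_MISS_RATE = 0.02   # misses per instruction that still reach DRAM

# Without L2, every L1 miss pays the full DRAM penalty.
cpi_l1_only = BASE_CPI + L1_MISS_RATE * stall_cycles(DRAM_NS, CLOCK_MHZ)

# With L2, every L1 miss pays the L2 access; only the remaining misses go to DRAM.
cpi_with_l2 = (BASE_CPI
               + L1_MISS_RATE * stall_cycles(L2_NS, CLOCK_MHZ)
               + GLOBAL_MISS_RATE * stall_cycles(DRAM_NS, CLOCK_MHZ))

speedup = cpi_l1_only / cpi_with_l2
```

Under these assumptions the miss penalty term dominates the CPI without a second level cache, which is why the slide suggests tuning the 1st level for hit time and the 2nd level for miss rate.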

Remember Our Memory Hierarchy Slide?
- By taking advantage of the principle of locality, a Memory Hierarchy can decrease the miss penalty
[Figure: Processor (Datapath, Control, Registers), On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk), with speed (ns) and size (bytes) for each level]

Hardware Issues for Decreasing Miss Penalty
- Make reading multiple words easier by using banks of memory
[Figure: three organizations, each with CPU, Cache, Bus, and Memory: a. one-word-wide memory organization; b. wide memory organization (with a multiplexor between the cache and the CPU); c. interleaved memory organization (four memory banks on one bus)]
- It can get a lot more complicated...

4 Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q3: Which block should be replaced on a miss?
- Easy for Direct Mapped
- Set Associative or Fully Associative:
  - Random
  - LRU (Least Recently Used)
[Table: miss rates for LRU vs. Random replacement at three associativities and three cache sizes]

Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - Is the block clean or dirty?
- Pros and cons of each?
  - WT: read misses cannot result in writes
  - WB: repeated writes to a block do not each cause a write to memory
- WT is always combined with write buffers so that we don't wait for the lower level memory

Write Buffer for Write Through
[Figure: Processor, Cache, Write Buffer, DRAM]
- A Write Buffer is needed between the Cache and Memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes the contents of the buffer to memory
- The write buffer is just a FIFO:
  - Typical number of entries: 4
  - Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
  - Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  - Write buffer saturation
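The difference in memory traffic between the two write policies can be counted with a short model. This is a simplified sketch, not a full cache: it assumes a direct-mapped, write-allocate cache, looks only at stores, and counts the final flush of dirty blocks as write-backs. The function name and the store trace are invented.

```python
def memory_writes(stores, num_blocks, policy):
    """Count writes that reach lower-level memory for a list of stored-to block addresses."""
    if policy == "write-through":
        return len(stores)              # every store also goes to memory
    # Write-back: memory is written only when a dirty block is replaced.
    dirty = {}                          # cache index -> dirty block held there
    writes = 0
    for block in stores:
        index = block % num_blocks
        if index in dirty and dirty[index] != block:
            writes += 1                 # dirty block evicted, written back
        dirty[index] = block
    return writes + len(dirty)          # flush whatever is still dirty at the end


stores = [3, 3, 3, 3, 5, 5]             # assumed trace: repeated stores to two blocks
wt_writes = memory_writes(stores, num_blocks=4, policy="write-through")
wb_writes = memory_writes(stores, num_blocks=4, policy="write-back")
```

Repeated stores to the same block collapse into a single write-back, which is the "WB" advantage listed above; write-through pays for each store and therefore leans on the write buffer.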

Write Buffer Saturation
[Figure: Processor, Cache, Write Buffer, DRAM]
- Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  - If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
    - The store buffer will overflow no matter how big you make it
    - The CPU Cycle Time <= DRAM Write Cycle Time
- Solutions for write buffer saturation:
  - Use a write back cache
  - Install a second level (L2) cache
[Figure: Processor, Cache, Write Buffer, L2 Cache, DRAM]

Recall: Levels of the Memory Hierarchy
[Table: capacity, access time, and cost for each level: CPU Registers, Cache, Main Memory, Disk, Tape/CD/DVD]
- Staging transfer units between levels:
  - Registers <-> Cache: Instr. Operands, managed by prog./compiler
  - Cache <-> Memory: Blocks, managed by cache cntl
  - Memory <-> Disk: Pages, managed by OS
  - Disk <-> Tape/CD/DVD: Files, managed by user/operator
- Upper level: faster. Lower level: larger.

Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
[Figure: virtual addresses pass through address translation to physical addresses or disk addresses]
- Advantages:
  - illusion of having more physical memory
  - program relocation
  - protection

Basic Issues in Virtual Memory System Design
- size of information blocks that are transferred from secondary to main storage (M)
- when a block of information is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy
- which region of M is to hold the new block --> placement policy
- a missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy
[Figure: reg, cache, mem, disk; memory is divided into frames and the disk holds pages]

Paging Organization
- virtual and physical address space partitioned into blocks of equal size: pages and page frames
- Pages: virtual memory blocks
- Page faults: the data is not in memory, retrieve it from disk
  - huge miss penalty, thus pages should be fairly large (e.g., 4KB)
  - reducing page faults is important (LRU is worth the price)
  - can handle the faults in software instead of hardware
  - using write-through is too expensive, so we use write-back
[Figure: a Virtual Address (Virtual Page Number, Page Offset) goes through Translation to a Physical Address (Physical Page Number, Page Offset)]

Where Have We Been?
- Direct-Mapped
- Set Associative
- Virtual Memory
- Page Table
- TLB
What's Next?
- Virtual Memory
- I/O

Page Tables
[Figure: the Virtual Page Number indexes the Page Table; each entry has a Valid bit and a Physical Page or Disk Address, pointing either into Physical Memory or to Disk Storage]

Page Tables
[Figure: the Page table register locates the Page table; the Virtual page number from the Virtual address selects an entry with a Valid bit and a Physical page number; if Valid is 0 then the page is not present in memory; the Physical page number and the Page offset form the Physical address]

Address Map
- V = {0, 1, ..., n - 1} virtual address space
- M = {0, 1, ..., m - 1} physical address space
- n > m
- MAP: V --> M U {0} address mapping function
  - MAP(a) = a' if data at virtual address a is present in physical address a' and a' in M
  - MAP(a) = 0 if data at virtual address a is not present in M
[Figure: the Processor issues address a in Name Space V; the Addr Trans Mechanism produces physical address a' for Main Memory, or signals a missing item fault to the fault handler, and the OS performs the transfer from Secondary Memory]
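The address map MAP(a) can be sketched as a page table lookup that either returns a physical address or signals a fault. The details here are assumptions for illustration: 4 KB pages, a Python dict standing in for the page table, and an exception standing in for the trap to the fault handler.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class PageFault(Exception):
    """Raised when the referenced page is not present in main memory."""


def translate(vaddr, page_table):
    """MAP(a): return the physical address for vaddr, or raise PageFault."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)        # virtual page number, page offset
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(vpn)                      # the OS would bring the page in
    return entry["frame"] * PAGE_SIZE + offset    # the offset is not translated


# Hypothetical page table: page 0 is resident in frame 5, page 1 is on disk.
page_table = {
    0: {"valid": True, "frame": 5},
    1: {"valid": False, "frame": None},
}

paddr = translate(0x0123, page_table)

try:
    translate(0x1000, page_table)
    faulted = False
except PageFault:
    faulted = True
```

Only the page number is mapped; the offset within the page passes through unchanged, which is why the page is both the unit of mapping and the unit of transfer.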

Paging Organization
[Figure: Physical Memory divided into frames (indexed by P.A.) and Virtual Memory divided into pages (indexed by V.A.), all of equal size; the Addr Trans MAP sits between them; the page is the unit of mapping and also the unit of transfer from virtual to physical memory]
- Address Mapping:
  - The VA is split into a page no. and a disp
  - The Page Table Base Reg plus the page no. gives the index into the page table
  - The Page Table is located in physical memory; each entry holds V, Access Rights, and PA
  - PA + disp gives the physical memory address (actually, concatenation is more likely)

Virtual Address and a Cache
[Figure: the CPU issues a VA; Translation produces a PA; the Cache is accessed with the PA and returns data on a hit; on a miss the access goes to Main Memory]
- It takes an extra memory access to translate VA to PA
- This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible
- ASIDE: Why access the cache with the PA at all? VA caches have a problem!
  - synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  - for an update: must update all cache entries with the same physical address or memory becomes inconsistent
  - determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or
  - software enforced alias boundary: same lsb of VA & PA > cache size

TLBs
- A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer or TLB
[Figure: TLB entry fields: Virtual Address, Physical Address, Dirty, Ref, Valid, Access]
- TLB access time is comparable to cache access time (much less than main memory access time)

Translation Look-Aside Buffers
- Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
- TLBs are usually small, even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
[Figure: Translation with a TLB. The CPU sends the VA to the TLB Lookup; on a TLB hit the PA goes to the Cache; on a TLB miss the full Translation is done first; a cache hit returns data, a cache miss goes to Main Memory.]

Making address translation practical: TLB
- Virtual memory => memory acts like a cache for the disk
- The page table maps virtual page numbers to physical frames
- The Translation Look-aside Buffer (TLB) is a cache of recent translations
[Figure: Virtual Address Space, Physical Memory Space, and Page Table; a virtual address (page, off) is mapped to a frame through the Translation Lookaside Buffer]

Reducing Translation Time
- Machines with TLBs go one step further to reduce # cycles/cache access
- They overlap the cache access with the TLB access
- This works because the high order bits of the VA are used to look in the TLB while the low order bits are used as the index into the cache
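A TLB behaves like any other small cache, here holding recent page-number translations. The sketch below models a fully associative TLB with LRU replacement in front of a page table; the class name, the two-entry size, and the page table contents are assumptions, and a plain dict lookup stands in for the page table walk.

```python
from collections import OrderedDict


class TLB:
    """Toy fully associative TLB with LRU replacement."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()     # vpn -> frame, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        """Return the frame for vpn, consulting the page table on a TLB miss."""
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)        # mark as most recently used
            return self.map[vpn]
        self.misses += 1
        frame = page_table[vpn]              # slow path: page table access
        if len(self.map) == self.entries:
            self.map.popitem(last=False)     # evict the least recently used entry
        self.map[vpn] = frame
        return frame


page_table = {0: 7, 1: 3, 2: 9}              # hypothetical vpn -> frame mapping
tlb = TLB(entries=2)
frames = [tlb.lookup(vpn, page_table) for vpn in [0, 1, 0, 2, 1]]
```

The translations returned are the same with or without the TLB; only the number of slow page table accesses changes, which is the point of caching recent translations.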

Overlapped Cache & TLB Access
[Figure: the page # part of the address goes to an associative TLB lookup while the index and block (disp) bits go to the Cache; the PA from the TLB is compared with the cache tag to produce Hit/Miss]
  IF cache hit AND (cache tag = PA) THEN
      deliver data to CPU
  ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN
      access memory with the PA from the TLB
  ELSE
      do standard VA translation

In the Absence of Virtual Memory
  IF cache hit AND (cache tag = PA) THEN
      deliver data to CPU
  ELSE
      access memory with the PA

Impact of Memory Hierarchy on Algorithms
- Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?
- "The Influence of Caches on the Performance of Sorting" by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997.
- Quicksort: fastest comparison based sorting algorithm when all keys fit in memory
- Radix sort: also called "linear time" sort because for keys of fixed length and fixed radix a constant number of passes over the data is sufficient, independent of the number of keys
- Measurements are for an Alphastation with a direct mapped cache
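Overlapping the cache lookup with the TLB lookup only works when the cache index and block offset come entirely from the untranslated page-offset bits. A rough check of that condition is sketched below; the function name is hypothetical and the page and cache sizes are assumed values used for illustration.

```python
def can_overlap(cache_bytes, ways, page_bytes):
    """True if the index and block-offset bits fit inside the page offset.

    The bytes covered by one way of the cache (cache_bytes / ways) determine how
    many low address bits the cache lookup needs before translation finishes.
    """
    return cache_bytes // ways <= page_bytes


PAGE = 4 * 1024   # assumed 4 KB pages

checks = {
    "4 KB direct mapped": can_overlap(4 * 1024, 1, PAGE),
    "8 KB direct mapped": can_overlap(8 * 1024, 1, PAGE),
    "8 KB two-way": can_overlap(8 * 1024, 2, PAGE),
    "8 KB direct mapped, 8 KB pages": can_overlap(8 * 1024, 1, 8 * 1024),
}
```

Raising associativity or the page size restores the overlap for the larger cache, because either change keeps the index bits inside the part of the address that translation leaves alone.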

Problems With Overlapped TLB Access
- Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation
- This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache
- Example: suppose everything is the same except that the cache is increased to 8 K bytes instead of 4 K
[Figure: the cache index now extends one bit into the virt page # field; this bit is changed by VA translation, but is needed for the cache lookup]
- Solutions: go to 8K byte page sizes; go to a set associative cache; or have software guarantee that this bit is the same in the VA and the PA

Page Fault: What happens when you miss?
- Not talking about a TLB miss
  - The TLB is HW's attempt to make page table lookup fast (on average)
- A page fault means that the page is not resident in memory
- Hardware must detect the situation
- Hardware cannot remedy the situation
- Therefore, hardware must trap to the operating system so that it can remedy the situation:
  - pick a page to discard (possibly writing it to disk)
  - load the page in from disk
  - update the page table
  - resume the program so HW will retry and succeed!
- What is in the page fault handler? See OS class.
- What can HW do to help it do a good job?

Page Replacement: Not Recently Used (1-bit LRU, Clock)
- Associated with each page is a reference flag such that:
  - ref flag = 1 if the page has been referenced in the recent past
  - ref flag = 0 otherwise
- If replacement is necessary, choose any page frame such that its reference bit is 0. This is a page that has not been referenced in the recent past.
- Or search for a page that is both not recently referenced AND not dirty.
[Figure: page table entries with dirty and used bits, and the last replaced pointer (lrp)]
- Page fault handler: if replacement is to take place, advance the last replaced pointer (lrp) to the next entry (mod table size) until one with a 0 bit is found; this is the target for replacement. As a side effect, all examined PTEs have their reference bits set to zero.
- Architecture part: support dirty and used bits in the page table => may need to update the PTE on any instruction fetch, load, or store
- How does the TLB affect this design problem? Software TLB miss?
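The clock (not recently used) scan described above is short enough to write out. The sketch below follows the stated rule: advance the last replaced pointer modulo the table size until an entry with reference bit 0 is found, clearing the reference bit of every entry examined along the way. The list-of-bits representation and the function name are assumptions for illustration, and dirty bits are ignored.

```python
def clock_replace(ref_bits, lrp):
    """Pick a victim frame with the clock algorithm.

    ref_bits: list of 0/1 reference bits, one per page frame (modified in place).
    lrp: index of the last replaced entry.
    Returns the victim index, which is also the new lrp.
    """
    n = len(ref_bits)
    while True:
        lrp = (lrp + 1) % n          # advance to the next entry (mod table size)
        if ref_bits[lrp] == 0:
            return lrp               # not referenced recently: replace this one
        ref_bits[lrp] = 0            # side effect: clear bits of examined entries


ref_bits = [1, 1, 0, 1]              # hypothetical reference bits for 4 frames
first_victim = clock_replace(ref_bits, lrp=3)
bits_after_first = list(ref_bits)
second_victim = clock_replace(ref_bits, lrp=first_victim)
```

The loop always terminates: even if every reference bit starts at 1, one full sweep clears them all and the next entry examined becomes the victim.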

Why virtual memory?
- Generality: ability to run programs larger than the size of physical memory
- Storage management: allocation/deallocation of variable sized blocks is costly and leads to (external) fragmentation
- Protection: regions of the address space can be R/O, Ex, ...
- Flexibility: portions of a program can be placed anywhere, without relocation
- Storage efficiency: retain only the most important portions of the program in memory
- Concurrent I/O: execute other processes while loading/dumping a page
- Expandability: can leave room in the virtual address space for objects to grow
- Performance: observe the impact of multiprogramming and of higher level languages

Summary #1 / 4
- The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
  - Temporal Locality: Locality in Time
  - Spatial Locality: Locality in Space
- Three Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
  - Capacity Misses: increase cache size
- Cache Design Space:
  - total size, block size, associativity
  - replacement policy
  - write-hit policy (write-through, write-back)
  - write-miss policy

Summary #2 / 4: The Cache Design Space
- Several interacting dimensions:
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs. write-back
  - write allocation
- The optimal choice is a compromise:
  - depends on access characteristics: workload, use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins
[Figure: design space sketched along Cache Size, Associativity, and Block Size axes; a Good-to-Bad curve plotted against Factor A and Factor B, from Less to More]

Summary #3 / 4: TLB, Virtual Memory
- Caches, TLBs, and Virtual Memory are all understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
- Page tables map virtual addresses to physical addresses
- TLBs are important for fast translation
- TLB misses are significant in processor performance (funny times, as most systems can't access all of the 2nd level cache without TLB misses!)

Summary #4 / 4: Memory Hierarchy
- Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs?
  - DRAM growth removed the controversy
- Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is more important than the memory hierarchy
- Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?
