6.004 Computation Structures Spring 2009



MIT OpenCourseWare 6.004 Computation Structures, Spring 2009. For information about citing these materials or our Terms of Use, visit:

The Memory Hierarchy

Lab #5 due tonight

L5 Memory Hierarchy

What we want in a memory

[Figure: BETA processor connected to MEMORY. Labels: PC, INST, MADDR, MDATA on the processor; ADDR, DOUT, ADDR, DIN/DOUT on the memory.]

             Capacity        Latency    Cost
Register     ...s of bits    ... ps     $$$$
SRAM         ...s Kbytes     ... ns     $$$
DRAM         ...s Mbytes     40 ns      $
Hard disk*   ...s Gbytes     ... ms
Want         ...s Gbytes     ... ns     cheap

* non-volatile

SRAM Memory Cell

6-T SRAM Cell: a static bistable storage element connected to the bit lines through access FETs.

[Figure: 6-T SRAM cell. Labels: word line N, word line N+1, bit, bit (complement), access FETs. Annotations: "Good, but slow", "Strong", "Slow and almost", "Strong", "Doesn't this violate our static discipline?"]

There are two bit-lines per column: one supplies the bit, the other its complement.

On a Read Cycle: a single word line is activated (driven to "1"), and the access transistors enable the selected cells, and their complements, onto the bit lines.

Writes are similar to reads, except the bit-lines are driven with the desired value of the cell. The writing has to "overpower" the original contents of the memory cell.

Multiport SRAMs (a.k.a. Register Files)

One can increase the number of SRAM ports by adding access transistors. By carefully sizing the inverter pair, so that one is strong and the other weak, we can assure that our WRITE bus will only fight with the weaker one, and the READs are driven by the stronger one, thus minimizing both access and write times.

[Figure: multiport SRAM cell. Labels: write, read, read; wd, rd, rd; PU and PD sizing ratios for each inverter. Annotation: "This transistor isolates the storage node so that it won't flip unintentionally."]

1-T Dynamic RAM

Six transistors/cell may not sound like much, but they can add up quickly. What is the fewest number of transistors that can be used to store a bit?

1-T DRAM Cell

[Figure: 1-T DRAM cell. Labels: word line, bit, access FET, explicit storage capacitor, V_REF. Cross-section labels: TiN top electrode (V_REF), Ta2O5 dielectric, W bottom electrode, poly word line, access FET.]

C in storage capacitor determined by $C = \varepsilon A / d$:
better dielectric ($\varepsilon$)
more area ($A$)
thinner film ($d$)

Can't we get rid of the explicit cap?

Tricks for increasing throughput

Multiplexed Address (row first, then column)

[Figure: DRAM array. Labels: Row Address Decoder, word lines, Row, bit lines, memory cell (one bit), Column Multiplexer/Shifter, N, M, D, Clock, out; timing marks t1, t2, t3, t4.]

... but, alas, not latency.

The first thing that should pop into your mind when asked to speed up throughput: PIPELINING

Synchronous DRAM (SDRAM)
Double-clocked Synchronous DRAM (DDRAM)

Hard Disk Drives

[Figure: disk drive geometry. Labels: Shaft, Cylinder, Track, Sector; Zoned-bit recording.]

Typical high-end drive:
Average latency = 4 ms
Average seek time = 9 ms
Transfer rate = M bytes/sec
Capacity = -5G byte
Cost = ~$/Gbyte

Quantity vs Quality

Our memory system can be BIG and SLOW... or SMALL and FAST.

We've explored a range of circuit-design trade-offs.

[Figure: plot of $/MB versus Access Time, with points for SRAM, DRAM, DISK, and TAPE.]

Is there an ARCHITECTURAL solution to this DILEMMA?

Best of Both Worlds

What we WANT: BIG, FAST memory!

We'd like to have a memory system that PERFORMS like GBytes of SRAM, but COSTS like GBytes of slow memory.

SURPRISE: We can (nearly) get our wish!

KEY: Use a hierarchy of memory technologies:

[Figure: SRAM, MAIN MEM]

Key IDEA

Keep the most often-used data in a small, fast SRAM (often local to the chip).
Refer to Main Memory only rarely, for remaining data.
The reason this strategy works: LOCALITY

Locality of Reference: Reference to location $X$ at time $t$ implies that reference to location $X + \Delta X$ at time $t + \Delta t$ becomes more probable as $\Delta X$ and $\Delta t$ approach zero.

Memory Reference Patterns

[Figure: plot of address versus time for a running program, with bands of references labeled stack, data, and program; a window of width $\Delta t$ is marked on the time axis.]

$S$ is the set of locations accessed during $\Delta t$.
Working set: a set $S$ which changes slowly with respect to access time.
Working set size, $|S|$

Exploiting the Memory Hierarchy

Approach 1 (Cray, others): Expose Hierarchy
Registers, Main Memory, Disk each available as storage alternatives. Tell programmers: "Use them cleverly"

Approach 2: Hide Hierarchy
Programming model: SINGLE kind of memory, single address space.
Machine AUTOMATICALLY assigns locations to fast or slow memory, depending on usage patterns.

[Figure: "X?" request passing through Small Static CACHE, Dynamic RAM MAIN MEMORY, and HARD DISK SWAP SPACE.]
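The working-set definition above can be made concrete with a short Python sketch. It measures |S| over a sliding window of recent references. The window length (counted in accesses rather than seconds), the function name, and both address traces are assumptions made up for illustration, not data from the lecture.

```python
def working_set_sizes(trace, window):
    """For each time t, return |S|, where S is the set of distinct
    addresses referenced in the last `window` accesses ending at t."""
    sizes = []
    for t in range(len(trace)):
        start = max(0, t - window + 1)
        sizes.append(len(set(trace[start:t + 1])))
    return sizes


# Hypothetical trace: a loop of 3 instructions that walks through an array.
loop_trace = [100, 101, 102, 500,
              100, 101, 102, 501,
              100, 101, 102, 502]

# Hypothetical trace with no reuse at all.
scattered_trace = list(range(0, 1200, 100))

loop_sizes = working_set_sizes(loop_trace, window=8)
scattered_sizes = working_set_sizes(scattered_trace, window=8)
print(loop_sizes)
print(scattered_sizes)
```

The looping trace settles at a working set smaller than the window because it keeps returning to the same instructions, while the trace with no reuse fills the whole window with distinct addresses. A small, slowly changing working set is what lets a small fast memory serve most references.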

The Cache Idea: Program-Transparent Memory Hierarchy

[Figure: "CACHE" in front of DYNAMIC RAM "MAIN MEMORY"; all references (1.0) go to the cache and a fraction $(1.0 - \alpha)$ continue to main memory.]

Cache contains TEMPORARY COPIES of selected main memory locations... e.g. Mem[...] = 37

GOALS:

1) Improve the average access time

$\alpha$: HIT RATIO: Fraction of refs found in CACHE.
$(1 - \alpha)$: MISS RATIO: Remaining references.

$$t_{ave} = \alpha t_c + (1 - \alpha)(t_c + t_m) = t_c + (1 - \alpha) t_m$$

2) Transparency (compatibility, programming ease)

Challenge: make the hit ratio as high as possible.

How High of a Hit Ratio?

Suppose we can easily build an on-chip static memory with a 4 ns access time, but the fastest dynamic memories that we can buy for main memory have an average access time of 40 ns. How high of a hit rate do we need to sustain an average access time of 5 ns?

$$\alpha = 1 - \frac{t_{ave} - t_c}{t_m} = 1 - \frac{5 - 4}{40} = 97.5\%$$

The Cache Principle

[Figure: Find "Bitdiddle, Ben". 5-Minute Access Time versus 5-Second Access Time.]

ALGORITHM: Look nearby for the requested information first; if it's not there, check secondary storage.

Basic Cache Algorithm

[Figure: cache with Tag and Data columns (A: Mem[A], B: Mem[B]); a fraction $(1 - \alpha)$ of references go on to MAIN MEMORY.]

ON REFERENCE TO Mem[X]: Look for X among cache tags...

HIT: X = TAG(i), for some cache line i
READ: return DATA(i)
WRITE: change DATA(i); Start Write to Mem(X)

MISS: X not found in TAG of any cache line
REPLACEMENT SELECTION: Select some line k to hold Mem[X] (Allocation)
READ: Read Mem[X]; Set TAG(k)=X, DATA(k)=Mem[X]
WRITE: Start Write to Mem(X); Set TAG(k)=X, DATA(k)= new Mem[X]

QUESTION: How do we search the cache?
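The basic cache algorithm above (look for X among the tags, use the stored data on a hit, pick a line and fill it on a miss) can be sketched in Python. This is a minimal model under stated assumptions, not the lecture's implementation: the class name `SimpleCache` is hypothetical, the full address serves as the tag, writes to main memory complete immediately instead of being merely "started", and the replacement choice is oldest-first, which the text leaves open.

```python
from collections import OrderedDict


class SimpleCache:
    """Tag/data cache in front of a main-memory dict (write-through)."""

    def __init__(self, main_memory, num_lines):
        self.mem = main_memory          # backing store: address -> value
        self.num_lines = num_lines
        self.lines = OrderedDict()      # TAG (address) -> DATA, oldest first
        self.hits = 0
        self.misses = 0

    def _allocate(self, addr, value):
        # REPLACEMENT SELECTION: evict the oldest line when the cache is full.
        # (Oldest-first is an assumption; the text does not fix a policy.)
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)
        self.lines[addr] = value

    def read(self, addr):
        if addr in self.lines:          # HIT: return DATA(i)
            self.hits += 1
            return self.lines[addr]
        self.misses += 1                # MISS: read Mem[X], then fill a line
        value = self.mem[addr]
        self._allocate(addr, value)
        return value

    def write(self, addr, value):
        self.mem[addr] = value          # write to Mem[X] on hit and on miss
        if addr in self.lines:          # HIT: change DATA(i)
            self.hits += 1
            self.lines[addr] = value
        else:                           # MISS: allocate a line for the new value
            self.misses += 1
            self._allocate(addr, value)


memory = {addr: addr * 10 for addr in range(16)}   # hypothetical contents
cache = SimpleCache(memory, num_lines=2)
first = cache.read(3)     # miss: fetched from memory
second = cache.read(3)    # hit: served from the cache
cache.write(3, 99)        # hit: updates the cache line and memory
cache.read(4)             # miss
cache.read(5)             # miss: evicts the line holding address 3
third = cache.read(3)     # miss: refetched, sees the written value
print(first, second, third, cache.hits, cache.misses)
```

The dictionary lookup stands in for "look for X among cache tags". How hardware performs that search is the question the slides turn to next.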

Associativity: Parallel Lookup

[Figure: Find "Bitdiddle, Ben". Several clerks answer at once: "Nope, Smith", "Nope, Jones", "Nope, Bitwit", "HERE IT IS!"]

Fully-Associative Cache

The extreme in associativity: All comparisons made in parallel.
Any data item could be located in any cache location.

[Figure: fully-associative cache. Labels: Incoming Address, TAG, =? (one comparator per line), Out.]

Direct-Mapped Cache (non-associative)

[Figure: Find "Bitdiddle, Ben". The clerk looks only in the drawer labeled B.]

NO Parallelism: Look in JUST ONE place, determined by parameters of incoming request (address bits)... can use ordinary RAM as table.

Need: Address Mapping Function! Maps incoming BIG address to small CACHE address... tells us which single cache location to use.

Direct Mapped: just use a subset of incoming address bits!

Collision when several addresses map to same cache line.

The Problem with Collisions

[Figure: Find "Bitwit", Find "Bituminous", Find "Bitdiddle". Reply: "Nope, I've got BITWIT under B".]

PROBLEM: Contention among B's... each competes for same cache line!
- CAN'T cache both "Bitdiddle" & "Bitwit"
... Suppose B's tend to come at once?

BETTER IDEA: File by LAST letter!
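As a rough illustration of the address mapping function, the Python sketch below splits an address into a tag and a cache-line index in two ways: by low-order bits ("file by last letter") and by high-order bits ("file by first letter"). The 16-bit address width, the 4-line cache, the function names, and the sample addresses are all assumptions chosen for the example.

```python
def split_low(addr, index_bits):
    """Index = low-order address bits, tag = the remaining upper bits."""
    index = addr & ((1 << index_bits) - 1)
    tag = addr >> index_bits
    return tag, index


def split_high(addr, index_bits, addr_bits):
    """Index = high-order address bits, tag = the remaining lower bits."""
    shift = addr_bits - index_bits
    index = addr >> shift
    tag = addr & ((1 << shift) - 1)
    return tag, index


ADDR_BITS = 16      # assumed address width
INDEX_BITS = 2      # 2 index bits select one of 4 cache lines
neighbors = [0x1000, 0x1001, 0x1002, 0x1003]   # hypothetical nearby addresses

low_lines = [split_low(a, INDEX_BITS)[1] for a in neighbors]
high_lines = [split_high(a, INDEX_BITS, ADDR_BITS)[1] for a in neighbors]
print(low_lines)    # neighboring addresses spread across the lines
print(high_lines)   # neighboring addresses all compete for one line
```

Because locality means nearby addresses tend to be referenced together, indexing by high-order bits sends exactly the addresses most likely to coexist to the same line. Indexing by low-order bits spreads them out.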

Optimizing for Locality: selecting on statistically independent bits

[Figure: Find "Bitdiddle": "Here's BITDIDDLE, under E". Find "Bitwit": "Here's BITWIT, under T".]

LESSON: Choose CACHE LINE from independent parts of request to MINIMIZE CONFLICT given locality patterns...

IN CACHE: Select line by LOW ORDER address bits!

Does this ELIMINATE contention?

Direct Mapped Cache

Low-cost extreme: Single comparator.
Use ordinary (fast) static RAM for cache tags & data:

[Figure: direct-mapped cache. The Incoming Address is split into T upper-address bits and a K-bit Cache Index; the index addresses a $2^K \times (T + D)$-bit static RAM holding Tag and Data; a single comparator (=?) checks the stored Tag against the T upper-address bits; Out is a D-bit data word.]

QUESTION: Why not use HIGH-order bits as Cache Index?

DISADVANTAGE: COLLISIONS

Contention, Death, and Taxes...

[Figure: Find "Bitdiddle": "Nope, I've got BITTWIDDLE under E; I'll replace it." Find "Bittwiddle": "Nope, I've got BITDIDDLE under E; I'll replace it."]

LESSON: In a non-associative cache, SOME pairs of addresses must compete for cache lines... if working set includes such pairs, we get THRASHING and poor performance.

Direct-Mapped Cache Contention

Assume 4-line direct-mapped cache, 1 word/line. Consider tight loop, at steady state (assume WORD, not BYTE, addressing):

Loop A: Pgm at 4, data at 37: Works GREAT here...
Loop B: Pgm at 4, data at 48: ... but not here!

[Table: Memory Address, Cache Line, Hit/Miss for each loop.]

We need some associativity, but not full associativity... Next lecture!
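The contention effect can be reproduced with a small Python simulation of a 4-line direct-mapped cache with one word per line and word addressing. The addresses below are hypothetical, picked so that one loop has no line conflicts and the other has nothing but conflicts. They are not the values from the slide's table, and the function names are likewise made up for this sketch.

```python
def simulate_direct_mapped(trace, num_lines, warmup):
    """Return "HIT"/"MISS" for each access after the first `warmup` accesses.

    One word per line, word addressing; line index = address mod num_lines.
    """
    tags = [None] * num_lines       # address currently held by each line
    outcomes = []
    for n, addr in enumerate(trace):
        line = addr % num_lines
        hit = tags[line] == addr
        if not hit:
            tags[line] = addr       # replace whatever the line held before
        if n >= warmup:
            outcomes.append("HIT" if hit else "MISS")
    return outcomes


def loop_trace(pgm_base, data_base, length, iterations):
    """Interleave instruction fetches and data accesses, then repeat."""
    one_pass = []
    for i in range(length):
        one_pass += [pgm_base + i, data_base + i]
    return one_pass * iterations


# Hypothetical addresses: code uses lines 0-1, data uses lines 2-3.
friendly = loop_trace(pgm_base=1024, data_base=38, length=2, iterations=10)
# Same code, but the data addresses map to the same lines as the code.
hostile = loop_trace(pgm_base=1024, data_base=2048, length=2, iterations=10)

friendly_result = simulate_direct_mapped(friendly, num_lines=4, warmup=4)
hostile_result = simulate_direct_mapped(hostile, num_lines=4, warmup=4)
print(friendly_result.count("HIT"), friendly_result.count("MISS"))
print(hostile_result.count("HIT"), hostile_result.count("MISS"))
```

Once the first pass has warmed the cache, the conflict-free loop hits on every access. The conflicting loop misses on every access even though half the cache sits unused, which is the thrashing behavior the lesson describes.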
