Chapter 6: The Memory Hierarchy, Part I
The slides of Part I are taken in large part from V. Heuring & H. Jordan, Computer Systems Design and Architecture, 1997.
Outline
Memory components:
  RAM memory cells and cell arrays
  Static RAM: more expensive, but less complex
  Tree and matrix decoders: needed for large RAM chips
  Dynamic RAM: less expensive, but needs refreshing
    Chip organization
    Timing
  ROM: read-only memory
Memory boards:
  Arrays of chips give more addresses and/or wider words
  2-D and 3-D chip arrays
Memory modules:
  Large systems can benefit by partitioning memory for
    separate access by system components
    fast access to multiple words
Memory Hierarchy Outline (cont.)
The memory hierarchy, from fast and expensive to slow and cheap:
  Registers, Cache, Main Memory, Disk
Consider two adjacent hierarchy levels: Cache and Main Memory
  Cache: high speed, expensive (1st level on-chip, 2nd level off-chip)
  Design types: direct mapped, associative, set associative
Virtual memory: makes the hierarchy to disk transparent
  Translate the address from the CPU's logical address to the physical address where the information is actually stored.
  Memory management: how to move information back and forth.
  Multiprogramming: what to do while we wait.
  The TLB helps in speeding up the address translation process.
Memory as a subsystem: overall performance.
Memory Technology Characteristics
Level  Memory Type    Average Access Time  Typical Size  Unit of Transfer (Block Size)
1      Cache          0.5 - 20 ns          8 KB - 32 MB  Word, 16-32 bits
2      Main Memory    40 - 200 ns          2 MB - 16 GB  Cache line, 8-16 B
3      Disk           5 - 10 ms            > 100 Gb      Page, 4-16 KB
4      Magnetic Tape  1 - 5 s              > 200 Gb      Record, 16 KB
Memory Performance Gap
[Figure: Processor-DRAM memory gap (latency). Performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU curve ("Moore's Law"): µproc 60%/yr (2X/1.5 yr). DRAM curve: 9%/yr (2X/10 yrs). Processor-memory performance gap: grows 50%/year.]
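The 50%/year gap figure can be checked from the two growth rates quoted in the figure. The following is a minimal sketch, assuming simple compound growth; the function name `gap_growth` is ours, not from the slides.

```python
def gap_growth(cpu_rate: float, dram_rate: float) -> float:
    """Yearly growth factor of the CPU-to-DRAM performance ratio.

    Both rates are fractional yearly improvements (0.60 means 60%/yr).
    """
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


# Rates quoted in the figure: CPU 60%/yr, DRAM 9%/yr.
yearly = gap_growth(0.60, 0.09)
print(f"gap grows by about {100 * (yearly - 1):.0f}% per year")
```

Under this model the ratio grows by a factor of about 1.47 per year, which the slide rounds to 50%.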
Levels of the Memory Hierarchy
Level          Capacity    Access Time            Cost
CPU Registers  100s Bytes  < 10s ns
Cache          K Bytes     10-50 ns               1 - 0.1 cents/bit
Main Memory    M Bytes     100-400 ns             $.0001 - .00001 cents/bit
Disk           G Bytes     10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit
Tape           infinite    sec-min                10^-8
Staging / transfer units between adjacent levels:
Between              Unit              Managed by      Transfer size
Registers and Cache  Instr. operands   prog./compiler  1-8 bytes
Cache and Memory     Blocks            cache cntl      8-128 bytes
Memory and Disk      Pages             OS              512-8K bytes
Disk and Tape        Files             user/operator   Mbytes
Upper levels are faster; lower levels are larger.
The CPU-Memory Interface
[Figure: CPU (register file, m-bit MAR, w-bit MDR) connected to main memory (s-bit locations at addresses 0 to 2^m - 1) by an m-bit address bus (A0 to A(m-1)), a b-bit data bus, and the control signals R/W, REQUEST, and COMPLETE.]
Sequence of events:
Read:
  1. CPU loads MAR, issues Read, and asserts REQUEST
  2. Main memory transmits words to MDR
  3. Main memory asserts COMPLETE
Write:
  1. CPU loads MAR and MDR, asserts Write, and asserts REQUEST
  2. Value in MDR is written into address in MAR
  3. Main memory asserts COMPLETE
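The read and write sequences above can be sketched as a small simulation. This is an illustrative model only: the class names `MainMemory` and `CPU` and their methods are hypothetical, and the handshake is collapsed into a single function call rather than modeled with real signal timing.

```python
class MainMemory:
    """Toy main memory with 2**m_bits word locations."""

    def __init__(self, m_bits: int):
        self.cells = [0] * (1 << m_bits)
        self.complete = False  # models the COMPLETE control signal

    def request(self, mar: int, mdr: int, write: bool) -> int:
        """Serve one REQUEST and return the value to be held in MDR."""
        self.complete = False
        if write:
            self.cells[mar] = mdr   # Write step 2: MDR value stored at MAR
        else:
            mdr = self.cells[mar]   # Read step 2: word transmitted to MDR
        self.complete = True        # step 3: memory asserts COMPLETE
        return mdr


class CPU:
    def __init__(self, memory: MainMemory):
        self.memory = memory
        self.mar = 0  # memory address register
        self.mdr = 0  # memory data register

    def read(self, address: int) -> int:
        self.mar = address          # step 1: load MAR, issue Read and REQUEST
        self.mdr = self.memory.request(self.mar, self.mdr, write=False)
        if not self.memory.complete:
            raise RuntimeError("memory never asserted COMPLETE")
        return self.mdr

    def write(self, address: int, value: int) -> None:
        self.mar, self.mdr = address, value  # step 1: load MAR and MDR
        self.memory.request(self.mar, self.mdr, write=True)
        if not self.memory.complete:
            raise RuntimeError("memory never asserted COMPLETE")


cpu = CPU(MainMemory(m_bits=4))
cpu.write(3, 0xAB)
print(hex(cpu.read(3)))
```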
The CPU-Memory Interface (cont'd.)
[Figure: the CPU-memory interface diagram again, with an m-bit address bus, b-bit data bus, w-bit MDR, s-bit memory locations, and the control signals R/W, REQUEST, and COMPLETE.]
Additional points:
  If b < w, main memory must make w/b b-bit transfers.
  Some CPUs allow reading and writing of word sizes < w.
    Example: Intel 8088: m = 20, w = 16, s = b = 8. 8- and 16-bit values can be read and written.
  If memory is sufficiently fast, or if its response is predictable, then COMPLETE may be omitted.
  Some systems use separate R and W lines, and omit REQUEST.
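The w/b rule is easy to check numerically. A minimal sketch, with a hypothetical helper name and ceiling division added so that a bus width that does not divide the word size is still handled:

```python
def bus_transfers(w: int, b: int) -> int:
    """Number of b-bit bus transfers needed to move one w-bit word."""
    if w <= 0 or b <= 0:
        raise ValueError("word and bus widths must be positive")
    return -(-w // b)  # ceiling division


# Intel 8088 values from the slide: w = 16, b = 8.
print(bus_transfers(16, 8))  # prints 2
```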
Memory Performance Parameters
Symbol            Definition         Units       Meaning
t_a               Access time        time        Time to access a memory word
t_c               Cycle time         time        Time from start of access to start of next access
k                 Block size         words       Number of words per block
ω                 Bandwidth          words/time  Word transmission rate
t_l               Latency            time        Time to access first word of a sequence of words
t_bl = t_l + k/ω  Block access time  time        Time to access an entire block of words
(Information is stored and moved in blocks at the cache and disk level.)
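As a worked example of the block access time formula t_bl = t_l + k/ω, here is a short sketch. The numeric values (60 ns latency, 4-word block, one word per 10 ns) are made up for illustration and do not come from the slides.

```python
def block_access_time(t_l: float, k: int, bandwidth: float) -> float:
    """t_bl = t_l + k / omega.

    t_l       latency to the first word (time units)
    k         block size in words
    bandwidth omega, in words per time unit
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return t_l + k / bandwidth


# Hypothetical numbers: 60 ns latency, 4-word block, 0.1 words/ns.
t_bl = block_access_time(t_l=60.0, k=4, bandwidth=0.1)
print(f"t_bl is about {t_bl:.0f} ns")
```

With these assumed values the latency contributes 60 ns and the transfer of the four words contributes 40 ns.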
Memories: Basic Technologies
SRAM: value is stored on a pair of inverting gates
  very fast, but takes up more space than DRAM (4 to 6 transistors)
  [Figure: cross-coupled gates (more later)]
DRAM: value is stored as a charge on a capacitor (must be refreshed)
  very small, but slower than SRAM (factor of 5 to 10)
  [Figure: DRAM cell with word line, pass transistor, capacitor, and bit line]
Memory Cell Structure
Regardless of the technology, all RAM memory cells must provide these four functions: Select, DataIn, DataOut, and R/W.
[Figure: memory cell with Select, DataIn, DataOut, and R/W connections]
An 8-Bit Register as a 1-D RAM Array
The entire register is selected with one Select line, and uses one R/W line.
[Figure: eight memory cells sharing Select and R/W, with data lines d0 to d7]
Data bus is bidirectional and buffered. (Why?)
A 4 x 8 2-D Memory Cell Array
A 2-4 line decoder selects one of the four 8-bit arrays.
The 2-bit address (A1, A0) drives the 2-4 decoder.
R/W is common to all cells.
Bidirectional 8-bit buffered data bus (d0 to d7).
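The decoder-plus-register arrangement can be sketched in a few lines. This is a behavioral model under our own naming (`decode_2to4`, `CellArray4x8`), not a gate-level description of the figure.

```python
def decode_2to4(a1: int, a0: int) -> list[int]:
    """One-hot select lines for a 2-bit address (A1 is the high bit)."""
    index = (a1 << 1) | a0
    return [1 if i == index else 0 for i in range(4)]


class CellArray4x8:
    """Four 8-bit rows sharing one R/W line and one 8-bit data bus."""

    def __init__(self):
        self.rows = [0] * 4

    def access(self, a1: int, a0: int, write: bool, data_in: int = 0):
        select = decode_2to4(a1, a0)
        row = select.index(1)       # exactly one select line is active
        if write:
            self.rows[row] = data_in & 0xFF  # keep 8 bits
            return None
        return self.rows[row]


array = CellArray4x8()
array.access(1, 0, write=True, data_in=0x5A)
print(hex(array.access(1, 0, write=False)))
```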
A 64 K x 1 Static RAM Chip
A roughly square array fits the IC design paradigm.
Row address: A0-A7, 8 bits into an 8-to-256 row decoder
Column address: A8-A15, 8 bits into one 256-to-1 mux and one 1-to-256 demux
Cell array: 256 x 256
Selecting rows separately from columns means only 256 x 2 = 512 circuit elements instead of 65536 circuit elements!
CS, Chip Select, allows chips in arrays to be selected individually.
Other pins: R/W, CS, and one data input-output line.
This chip requires 21 pins including power and ground, and so will fit in a 22-pin package.
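The row/column split and the 512-versus-65536 comparison can be illustrated as follows. This is a sketch using the bit assignment stated on the slide (A0-A7 for the row, A8-A15 for the column); the function and constant names are ours.

```python
ROW_BITS = 8  # A0..A7
COL_BITS = 8  # A8..A15


def split_address(addr: int) -> tuple[int, int]:
    """Split a 16-bit address into (row, column) for the 256 x 256 array."""
    row = addr & ((1 << ROW_BITS) - 1)
    col = (addr >> ROW_BITS) & ((1 << COL_BITS) - 1)
    return row, col


# Select lines needed with separate row and column selection,
# versus one select line per cell.
two_level_lines = (1 << ROW_BITS) + (1 << COL_BITS)
flat_lines = 1 << (ROW_BITS + COL_BITS)

print(split_address(0x1234), two_level_lines, flat_lines)
```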
A 16 K x 4 SRAM Chip
Row address: A0-A7, 8 bits into an 8-to-256 row decoder
Column address: A8-A13, 6 bits into four 64-to-1 muxes and four 1-to-64 demuxes
Cell arrays: four 64 x 256 arrays
There is little difference between this chip and the previous one, except that there are four 64-to-1 multiplexers instead of one 256-to-1 multiplexer.
Other pins: R/W, CS, and 4 data input-output lines.
This chip requires 24 pins including power and ground, and so will require a 24-pin package. Package size and pin count can dominate chip cost.
Matrix and Tree Decoders
2-level decoders are limited in size because of gate fan-in. Most technologies limit fan-in to ~8.
When decoders must be built with fan-in > 8, additional levels of gates are required.
Tree and matrix decoders are two ways to design decoders with large fan-in:
[Figure: 3-to-8 line tree decoder constructed from 2-input gates (a 2-4 decoder on x0, x1, followed by gates on x2; outputs m0 to m7).]
[Figure: 4-to-16 line matrix decoder constructed from 2-input gates (two 2-4 decoders, on x0, x1 and on x2, x3; outputs m0 to m15).]
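A behavioral sketch of the 4-to-16 matrix decoder follows: two 2-4 decoders feed a 4 x 4 grid of 2-input AND gates. The output numbering (m_i with i = x0 + 2*x1 + 4*x2 + 8*x3) is an assumption made for the example, since the figure itself is not recoverable here.

```python
def decoder_2to4(lo: int, hi: int) -> list[int]:
    """One-hot outputs of a 2-to-4 decoder (lo is the low-order input)."""
    index = lo | (hi << 1)
    return [int(i == index) for i in range(4)]


def matrix_decoder_4to16(x0: int, x1: int, x2: int, x3: int) -> list[int]:
    """4-to-16 decoder built from two 2-4 decoders and 16 two-input ANDs."""
    rows = decoder_2to4(x0, x1)  # selects one of 4 rows
    cols = decoder_2to4(x2, x3)  # selects one of 4 columns
    # Output m_i is the AND of one row line and one column line.
    return [rows[i % 4] & cols[i // 4] for i in range(16)]


outputs = matrix_decoder_4to16(1, 0, 1, 1)  # input value 1 + 4 + 8 = 13
print(outputs.index(1))
```

No gate in this structure needs more than two inputs, which is the point of the matrix arrangement.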
6-Transistor Static RAM Cell
[Figure: storage cell (latch with active loads, +5 V supply) connected through access switches to dual-rail data lines b_i and NOT b_i, used for reading and writing. Word line w_i controls the switches. Column select (from column address decoder), CS, and R/W control the sense/write amplifiers on data line d_i. Additional cells share the bit lines.]
Reading a value:
  1) Precharge the bit lines to a value halfway between a 0 and a 1.
  2) At the same time, assert the word line. This allows the latch to drive the bit lines to the value stored in the latch.
Sense/write amplifiers: sense and amplify data on Read, drive b_i and NOT b_i on Write.
Static RAM Read Operation
[Timing diagram: memory address, Read/Write, CS, and Data signals, with t_AA marked.]
t_AA, access time from address: the time required of the RAM array to decode the address and provide the value to the data bus.
Static RAM Write Operations
[Timing diagram: memory address, Read/Write, CS, and Data signals, with t_w marked.]
t_w, write time: the time the data must be held valid in order to decode the address and store the value in the memory cells.
Dynamic RAM Organization
[Figure: DRAM cell column. A single bit line b_i; each cell is a capacitor reached through a switch (t_c in the figure) controlled by word line w_j; additional cells share the bit line. Column select (from column address decoder), CS, and R/W control the sense/write amplifier on data line d_i.]
The capacitor stores charge for a 1, no charge for a 0.
The capacitor discharges in 4-15 ms. Refresh the capacitor by reading (sensing) the value on the bit line, amplifying it, and placing it back on the bit line, where it recharges the capacitor.
Write: place the value on the bit line and assert the word line.
Read: precharge the bit line, assert the word line, sense the value on the bit line with the sense amplifier.
Sense/write amplifiers: sense and amplify data on Read, drive the bit line on Write.
This need to refresh the storage cells of dynamic RAM chips complicates DRAM system design.
Dynamic RAM Chip Organization
Addresses are time-multiplexed on the address bus, using RAS and CAS as strobes of rows and columns.
CAS is normally used as the CS function.
[Figure: 1024 x 1024 cell array; 10 multiplexed address lines A0-A9; row latches and decoder (10 to 1024); 1024 sense/write amplifiers and column latches; 10 column address latches with 1-of-1024 muxes and demuxes; control logic driven by RAS, CAS, and R/W; data pins d_i and d_o.]
Pin counts:
  Without address multiplexing: 27 pins including power and ground.
  With address multiplexing: 17 pins including power and ground.
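The two pin counts can be reproduced with a simple tally. The breakdown used below (two data pins d_i and d_o, three control pins RAS, CAS, and R/W, two power pins) is our reading of the figure; it is an assumption that happens to match the totals on the slide.

```python
def dram_pin_count(address_bits: int, multiplexed: bool) -> int:
    """Pin tally for a 1-bit-wide DRAM chip under the assumed breakdown."""
    # With multiplexing, row and column halves share the same pins.
    address_pins = (address_bits + 1) // 2 if multiplexed else address_bits
    data_pins = 2     # d_i and d_o (assumed separate)
    control_pins = 3  # RAS, CAS, R/W
    power_pins = 2    # power and ground
    return address_pins + data_pins + control_pins + power_pins


# 1024 x 1024 cells need 20 address bits.
print(dram_pin_count(20, multiplexed=False), dram_pin_count(20, multiplexed=True))
```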
DRAM Read and Write Cycles
[Timing diagrams: typical DRAM Read operation and typical DRAM Write operation. Each shows the memory address (row address, then column address), RAS (with t_RAS and t_prechg), CAS, R/W, and Data. The Read diagram marks the access time t_A and the cycle time t_C. The Write diagram marks t_DHR (data hold from RAS) and the cycle time t_C.]
Notice that it is the bit line precharge operation that causes the difference between access time and cycle time.
DRAM Refresh and Row Access
Refresh is usually accomplished by a RAS-only cycle. The row address is placed on the address lines and RAS is asserted. This refreshes the entire row. CAS is not asserted. The absence of a CAS phase signals the chip that a row refresh is requested, and thus no data is placed on the external data lines.
Many chips use CAS-before-RAS to signal a refresh. The chip has an internal counter, and whenever CAS is asserted before RAS, it is a signal to refresh the row pointed to by the counter, and to increment the counter.
Most DRAM vendors also supply one-chip DRAM controllers that encapsulate the refresh and other functions.
Page mode, nibble mode, and static column mode allow rapid access to the entire row that has been read into the column latches.
Video RAMs (VRAMs) clock an entire row into a shift register, where it can be rapidly read out, bit by bit, for display.
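The CAS-before-RAS scheme amounts to an on-chip row counter that advances on every refresh request. A minimal behavioral sketch, with a hypothetical class name and no real timing, might look like this:

```python
class CbrRefreshCounter:
    """Models the internal row counter used for CAS-before-RAS refresh."""

    def __init__(self, rows: int):
        self.rows = rows
        self.counter = 0

    def cas_before_ras(self) -> int:
        """Refresh the row the counter points to, then increment it."""
        refreshed_row = self.counter
        self.counter = (self.counter + 1) % self.rows  # wrap after last row
        return refreshed_row


chip = CbrRefreshCounter(rows=1024)
first_rows = [chip.cas_before_ras() for _ in range(3)]
print(first_rows)  # prints [0, 1, 2]
```

The controller does not need to supply a row address in this mode; it only has to issue refresh cycles often enough that every row is visited before its capacitors discharge.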
A 2-D CMOS ROM Chip
[Figure: 2-D CMOS ROM array. Labels recoverable from the figure: +V, 00, row decoder, CS, and the bit values 1 0 1 0.]
ROM Types
ROM Type             Cost              Programmability    Time to Program  Time to Erase
Mask-programmed ROM  Very inexpensive  At factory only    Weeks            N/A
PROM                 Inexpensive       Once, by end user  Seconds          N/A
EPROM                Moderate          Many times         Seconds          20 minutes
Flash EPROM          Expensive         Many times         100 µs           1 s, large block
EEPROM               Very expensive    Many times         100 µs           10 ms, byte
Memory Boards and Modules
There is a need for memories that are larger and wider than a single chip.
Chips can be organized into boards.
  Boards may not be actual, physical boards, but may consist of structured chip arrays present on the motherboard.
A board or collection of boards makes up a memory module.
Memory modules:
  Satisfy the processor-main memory interface requirements
  May have DRAM refresh capability
  May expand the total main memory capacity
  May be interleaved to provide faster access to blocks of words
General Structure of a Memory Chip
This is a slightly different view of the memory chip than the previous one.
[Figure: memory chip with an m-bit address feeding a row decoder, a memory cell array, an I/O multiplexer, chip selects, R/W, and s-bit data lines; shown in both detailed and block-symbol form.]
Multiple chip selects ease the assembly of chips into chip arrays. The chip select is usually provided by an external AND gate.
Word Assembly from Narrow Chips
All chips have common CS, R/W, and Address lines.
[Figure: p chips side by side, sharing Select, Address, and R/W; each chip contributes s data bits to a p x s bit data word.]
p chips expand the word size from s bits to p x s bits.
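Word assembly is just concatenation of the s-bit outputs that all p chips return for the same address. A sketch, assuming the first chip in the list supplies the most significant bits (the slide does not specify the ordering):

```python
def assemble_word(chip_outputs: list[int], s: int) -> int:
    """Concatenate p chip outputs of s bits each into one p*s-bit word."""
    mask = (1 << s) - 1
    word = 0
    for value in chip_outputs:  # first chip = most significant bits (assumed)
        word = (word << s) | (value & mask)
    return word


# Four 4-bit chips forming one 16-bit word.
print(hex(assemble_word([0xA, 0xB, 0xC, 0xD], s=4)))  # prints 0xabcd
```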
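Increasing the Number of Words by a Factor of 2^k
The additional k address bits are used to select one of 2^k chips, each one of which has 2^m words.
[Figure: an (m+k)-bit address; the low m bits go to every chip, and the k high bits drive a k-to-2^k decoder whose outputs feed the CS inputs of the 2^k chips. All chips share R/W and the s-bit data lines.]
Word size remains at s bits.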
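The address split described above can be sketched as follows. It assumes, as is conventional for this arrangement, that the k extra bits are the high-order bits of the (m+k)-bit address; the function name is ours.

```python
def select_chip(address: int, m: int, k: int) -> tuple[list[int], int]:
    """Return (one-hot CS lines for 2**k chips, m-bit address within chip)."""
    chip = (address >> m) & ((1 << k) - 1)  # high k bits pick the chip
    offset = address & ((1 << m) - 1)       # low m bits go to every chip
    cs_lines = [int(i == chip) for i in range(1 << k)]
    return cs_lines, offset


# m = 4 (16 words per chip), k = 2 (4 chips), address 0b10_0011.
print(select_chip(0b100011, m=4, k=2))
```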
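Chip Using 2 Chip Selects
[Figure: matrix of chips addressed by m+q+k bits. The m low bits go to every chip; k bits drive a horizontal decoder and q bits drive a vertical decoder; each chip has two chip selects, CS1 and CS2, plus R/W and s-bit data lines. The array provides one of 2^(m+q+k) s-bit words.]
This scheme simplifies the decoding from use of a (q+k)-bit decoder to using one q-bit and one k-bit decoder.
Multiple chip select lines are used to replace the last level of gates in this matrix decoder scheme.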
3-Dimensional Dynamic RAM Array
[Figure: the high address bits (k_c + k_r) are split into k_c bits for a 2^k_c decoder, enabled by CAS, and k_r bits for 2^k_r decoders, driven by RAS. The multiplexed address (m/2 lines) goes to every chip along with RAS, CAS, and R/W. The data word is w bits wide.]
CAS is used to enable the top decoder in the decoder tree.
Use one 2-D array for each bit. Each 2-D array is on a separate board.
A Memory Module and Its Interface
The module must provide:
  Read and Write signals.
  Ready: memory is ready to accept commands.
  Address: to be sent with the Read/Write command.
  Data: sent with Write, or available upon Read when Ready is asserted.
  Module select: needed when there is more than one module.
[Figure: bus interface. The (k+m)-bit address enters an address register; k bits drive chip/board selection and m bits go to the memory boards and/or chips. Module select, Read, and Write feed the control signal generator, which outputs Ready. The w-bit data passes through a data register.]
Control signal generator: for SRAM, it just strobes data on Read and provides Ready on Read/Write. For DRAM, it also provides CAS, RAS, and R/W, multiplexes the address, generates refresh signals, and provides Ready.
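Dynamic RAM Module with Refresh Control
[Figure: the (k+m)-bit address enters an address register; k bits drive chip/board selection, and the m bits pass as two m/2-bit halves through a 2-way address multiplexer, which can also select the refresh counter output. A refresh clock and control block sends Request to, and receives Grant from, the memory timing generator. The memory timing generator takes Module select, Read, Write, and Refresh, and produces the board and chip selects, RAS, CAS, R/W, and Ready. The dynamic RAM array connects through w data lines to a data register.]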
Two Kinds of Memory Module Organizations
Memory modules are used to allow access to more than one word simultaneously.
(a) Consecutive words in consecutive modules (interleaving):
  The m-bit address bus is split into j msbs and k lsbs (j + k = m). The k lsbs drive the module select for modules 0 to 2^k - 1; the j msbs address the word within the module.
(b) Consecutive words in the same module:
  The m-bit address bus is split into k msbs and j lsbs (k + j = m). The k msbs drive the module select for modules 0 to 2^k - 1; the j lsbs address the word within the module.
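The two organizations differ only in which end of the address selects the module. A short sketch with hypothetical function names makes the difference visible for a run of consecutive addresses:

```python
def interleaved(address: int, k: int) -> tuple[int, int]:
    """Organization (a): the k lsbs select the module."""
    module = address & ((1 << k) - 1)
    word_in_module = address >> k
    return module, word_in_module


def same_module(address: int, j: int) -> tuple[int, int]:
    """Organization (b): the bits above the j lsbs select the module."""
    module = address >> j
    word_in_module = address & ((1 << j) - 1)
    return module, word_in_module


# With k = 2 and j = 4, look at where addresses 0..3 land.
print([interleaved(a, k=2)[0] for a in range(4)])  # prints [0, 1, 2, 3]
print([same_module(a, j=4)[0] for a in range(4)])  # prints [0, 0, 0, 0]
```

In (a) four consecutive words sit in four different modules and can be accessed in overlapped fashion; in (b) they all compete for the same module.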
Timing Advantage of Interleaving
If the time to transmit information over the bus, t_b, is less than the module cycle time, t_c, it is possible to time-multiplex information transmission to several modules.
Example: store one word of each cache line in a separate module.
  Main memory address: | Word | Module No. |
  This provides successive words in successive modules.
[Timing diagram: the bus carries "Read module 0" (address), then "Write module 3" (address and data), then the data return from module 0, each taking t_b. Meanwhile module 0 performs its read and module 3 performs its write, each taking t_c, overlapped in time.]
With interleaving of 2^k modules, and t_b < t_c/2^k, it is possible to get a 2^k-fold increase in memory bandwidth, provided memory requests are pipelined. DMA satisfies this requirement.
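The bandwidth condition can be explored with a deliberately simplified model: with pipelined requests, throughput is limited either by the modules (2^k words per t_c) or by the bus (one word per t_b). The model and the timing values below are our own illustration, not taken from the slides.

```python
def interleave_speedup(t_b: float, t_c: float, k: int) -> float:
    """Bandwidth gain over a single module, under the simplified model.

    t_b  time to transmit one word over the bus
    t_c  module cycle time
    k    log2 of the number of interleaved modules
    """
    modules = 1 << k
    # Full 2**k gain when t_b < t_c / 2**k; otherwise the bus is the limit.
    return min(modules, t_c / t_b)


# Hypothetical timings: t_b = 10 ns, t_c = 80 ns.
print(interleave_speedup(10.0, 80.0, k=2))  # 4 modules: t_b < 80/4
print(interleave_speedup(10.0, 80.0, k=4))  # 16 modules: bus-limited
```

With these assumed timings, four modules deliver the full fourfold gain, while sixteen modules are capped at a factor of eight by the bus.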