CPSC 352. Chapter 7: Memory. Computer Organization. Principles of Computer Architecture by M. Murdocca and V. Heuring


7-1 CPSC 352 Computer Organization

7-2 Chapter Contents: 7.1 The Memory Hierarchy, 7.2 Random Access Memory, 7.3 Chip Organization, 7.4 Commercial Memory Modules, 7.5 Read-Only Memory, 7.6 Cache Memory, 7.7 Virtual Memory, 7.8 Advanced Topics, 7.9 Case Study: Rambus Memory, 7.10 Case Study: The Intel Pentium Memory System

7-3 The Memory Hierarchy. From fast and expensive at the top to slow and inexpensive at the bottom (performance and cost per bit both increase toward the top): registers, cache, main memory, secondary storage (disks), off-line storage (tape).

7-4 Functional Behavior of a RAM Cell. [Figure: a clocked D latch (D, Q, CLK) with Select and Read control lines and a single bidirectional Data In/Out connection.]

7-5 Simplified RAM Chip Pinout. [Figure: a memory chip with address inputs A0–A(m−1), data lines D0–D(w−1), write enable (WR), and chip select (CS).]

7-6 A Four-Word Memory with Four Bits per Word in a 2D Organization. [Figure: address lines A0–A1 drive a 2-to-4 decoder that selects one of Word 0–Word 3; WR and Chip Select (CS) are common to all four words; data enters on D0–D3 and leaves on Q0–Q3.]

7-7 A Simplified Representation of the Four-Word by Four-Bit RAM. [Figure: a 4×4 RAM block with data inputs D0–D3, address inputs A0–A1, WR and CS controls, and data outputs Q0–Q3.]

7-8 2-1/2D Organization of a 64-Word by One-Bit RAM. [Figure: address bits A0–A2 feed a row decoder and A3–A5 feed a column decoder (MUX/DEMUX); the column data path is two bits wide, one bit for data and one for select; Read/Write Control and CLK gate the selected stored bit onto Data In/Out.]

7-9 Two Four-Word by Four-Bit RAMs Are Used in Creating a Four-Word by Eight-Bit RAM. [Figure: the two 4×4 RAMs share CS, WR, and address lines A0–A1; one holds data bits D0–D3/Q0–Q3 and the other D4–D7/Q4–Q7, widening the word to eight bits.]

7-10 Two Four-Word by Four-Bit RAMs Make up an Eight-Word by Four-Bit RAM. [Figure: address bit A2 drives a 1-to-2 decoder that enables one of the two 4×4 RAMs through its CS input; A0–A1, WR, D0–D3, and Q0–Q3 are shared, doubling the number of words.]

7-11 Single-In-Line Memory Module. Adapted from Texas Instruments, MOS Memory: Commercial and Military Specifications Data Book, Texas Instruments, Literature Response Center, P.O. Box 72228, Denver, Colorado. Pin nomenclature: A0–A9 address inputs, CAS column-address strobe, DQ1–DQ8 data in/data out, NC no connection, RAS row-address strobe, Vcc 5-V supply, Vss ground, W write enable. [Figure: 30-pin SIMM pinout.]

7-12 A ROM Stores Four Four-Bit Words. [Figure: address lines A0–A1 feed a 2-to-4 decoder with an Enable input; each decoded location drives its stored four-bit word onto outputs Q0–Q3.]

7-13 A Lookup Table (LUT) Implements an Eight-Bit ALU. [Figure: a ROM whose address lines carry Operand A, Operand B, and a two-bit function select; the select bits choose among Add, Subtract, Multiply, and Divide, and the result appears on outputs Q0–Q7.]
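The LUT idea can be sketched in software. This scaled-down version uses 2-bit operands so the table stays small (a full 8-bit version would need a 2^18-entry table); the address encoding and the divide-by-zero convention are illustrative assumptions, not the book's.

```python
# Sketch of the LUT-as-ALU idea, scaled down to 2-bit operands.
# The "address" concatenates function select, operand a, and operand b;
# the table maps every possible address to a precomputed result,
# exactly as a ROM would be programmed.
WIDTH = 2                                  # operand width in bits (hypothetical)
MASK = (1 << WIDTH) - 1
FUNCS = [lambda a, b: (a + b) & MASK,      # 00: add (modulo 2^WIDTH)
         lambda a, b: (a - b) & MASK,      # 01: subtract
         lambda a, b: (a * b) & MASK,      # 10: multiply
         lambda a, b: a // b if b else 0]  # 11: divide (0 on divide-by-zero)

# Build the lookup table once.
lut = {}
for f in range(4):
    for a in range(1 << WIDTH):
        for b in range(1 << WIDTH):
            address = (f << (2 * WIDTH)) | (a << WIDTH) | b
            lut[address] = FUNCS[f](a, b)

def alu(f, a, b):
    """Evaluate by table lookup only -- no arithmetic at 'run time'."""
    return lut[(f << (2 * WIDTH)) | (a << WIDTH) | b]

print(alu(0, 2, 3))  # add: (2 + 3) mod 4 = 1
print(alu(2, 3, 2))  # multiply: (3 * 2) mod 4 = 2
```

Once the table is built, every operation costs one memory access, which is the whole point of the ROM implementation.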

7-14 Placement of Cache in a Computer System. [Figure: without cache, a 400 MHz CPU accesses a 10 MHz main memory over a 66 MHz bus; with cache, a fast cache sits between the CPU and the bus.] The locality principle: a recently referenced memory location is likely to be referenced again (temporal locality), and a neighbor of a recently referenced memory location is likely to be referenced (spatial locality).

7-15 An Associative Mapping Scheme for a Cache Memory. [Figure: the cache has 2^14 slots, each holding a valid bit, a dirty bit, a 27-bit tag, and a 32-word block; any of main memory's 2^27 blocks may be placed in any slot.]

7-16 Associative Mapping Example. Consider how an access to memory location (A35F4)₁₆ is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, and the cache consists of 2^14 slots. The address is split into a 27-bit tag and a 5-bit word field. If the addressed word is in the cache, it will be found in word (4)₁₆ of a slot that has tag (5AF8)₁₆, which is made up of the 27 most significant bits of the address. If the addressed word is not in the cache, then the block corresponding to tag field (5AF8)₁₆ is brought into an available slot in the cache from the main memory, and the memory reference is then satisfied from the cache.

7-17 Replacement Policies. When there are no available slots in which to place a block, a replacement policy is invoked; it governs the choice of which slot is freed up for the new block. Replacement policies are used for associative and set-associative mapping schemes, and also for virtual memory. Common policies: least recently used (LRU), first-in/first-out (FIFO), least frequently used (LFU), random, and optimal (used for analysis only: look backward in time and reverse-engineer the best possible strategy for a particular sequence of memory references).

7-18 A Direct Mapping Scheme for Cache Memory. [Figure: each of the cache's 2^14 slots has a valid bit, a dirty bit, a 13-bit tag, and a 32-word block; main memory block j can be placed only in slot j mod 2^14, so blocks j, j + 2^14, j + 2·2^14, … compete for the same slot.]

7-19 Direct Mapping Example. For a direct-mapped cache, each main memory block can be mapped to only one slot, but each slot can receive more than one block. Consider how an access to memory location (A35F4)₁₆ is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, and the cache consists of 2^14 slots. The address is split into a 13-bit tag, a 14-bit slot field, and a 5-bit word field. If the addressed word is in the cache, it will be found in word (4)₁₆ of slot (2F8)₁₆, which will have a tag of (46)₁₆.

7-20 A Set-Associative Mapping Scheme for a Cache Memory. [Figure: the cache's 2^14 slots are grouped into 2^13 sets of two slots each; every slot carries a valid bit, a dirty bit, a 14-bit tag, and a 32-word block; a main memory block may go into either slot of exactly one set.]

7-21 Set-Associative Mapping Example. Consider how an access to memory location (A35F4)₁₆ is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, there are two blocks per set, and the cache consists of 2^14 slots. The leftmost 14 bits of the address form the tag field, followed by 13 bits for the set field, followed by five bits for the word field.
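The three mapping schemes differ only in how the 32-bit word address is interpreted. A sketch of the field splits, assuming 32-word blocks, 2^14 slots, and 2^13 two-slot sets as in the examples; the address itself is a made-up value for illustration.

```python
# Split one 32-bit word address three ways:
#   associative:      27-bit tag | 5-bit word
#   direct-mapped:    13-bit tag | 14-bit slot | 5-bit word
#   2-way set-assoc.: 14-bit tag | 13-bit set  | 5-bit word
WORD_BITS = 5    # 2^5 = 32 words per block
SLOT_BITS = 14   # 2^14 cache slots
SET_BITS = 13    # 2^13 sets of two slots each

def fields(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    block = addr >> WORD_BITS        # 27-bit block number = associative tag
    return {
        "associative": {"tag": block, "word": word},
        "direct": {"tag": block >> SLOT_BITS,
                   "slot": block & ((1 << SLOT_BITS) - 1),
                   "word": word},
        "set_associative": {"tag": block >> SET_BITS,
                            "set": block & ((1 << SET_BITS) - 1),
                            "word": word},
    }

f = fields(0x12345678)                # hypothetical address
print(hex(f["associative"]["tag"]))   # 0x91a2b3
print(hex(f["direct"]["slot"]))       # 0x22b3
```

A useful sanity check on any split is that reassembling the fields reproduces the original address (tag << 19 | slot << 5 | word for the direct-mapped case).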

7-22 Cache Read and Write Policies.
Cache read, data in the cache: forward the word to the CPU.
Cache read, data not in the cache: Load Through (forward the word as the cache line is filled), or fill the cache line first and then forward the word.
Cache write, data in the cache: Write Through (write data to both cache and main memory), or Write Back (write data to the cache only, deferring the main memory write until the block is flushed).
Cache write, data not in the cache: Write Allocate (bring the line into the cache, then update it), or Write No-Allocate (update main memory only).

7-23 Hit Ratios and Effective Access Times. The hit ratio h is the fraction of accesses that are satisfied by the cache. For a single-level cache, effective access time = h × T_cache + (1 − h) × T_miss, where T_cache is the cache access time and T_miss is the time to satisfy a miss from main memory. For a multi-level cache the same weighting is applied at each level, with the miss term of one level replaced by the effective access time of the level below it.

7-24 Direct Mapped Cache Example. Compute the hit ratio and effective access time for a program that executes from memory locations 48 to 95, and then loops 10 times from 15 to 31. The direct-mapped cache has four 16-word slots, a hit time of 80 ns, and a miss time of 2500 ns. Load-through is used, and the cache is initially empty. [Figure: main memory blocks 0–5 cover locations 0–15, 16–31, 32–47, 48–63, 64–79, and 80–95; block j maps to cache slot j mod 4.]

7-25 Table of Events for the Example Program:
Event    | Location | Time                    | Comment
1 miss   | 48       | 2500 ns                 | Memory block 3 to cache slot 3
15 hits  | 49–63    | 80 ns × 15 = 1200 ns    |
1 miss   | 64       | 2500 ns                 | Memory block 4 to cache slot 0
15 hits  | 65–79    | 80 ns × 15 = 1200 ns    |
1 miss   | 80       | 2500 ns                 | Memory block 5 to cache slot 1
15 hits  | 81–95    | 80 ns × 15 = 1200 ns    |
1 miss   | 15       | 2500 ns                 | Memory block 0 to cache slot 0
1 miss   | 16       | 2500 ns                 | Memory block 1 to cache slot 1
15 hits  | 17–31    | 80 ns × 15 = 1200 ns    |
9 hits   | 15       | 80 ns × 9 = 720 ns      | Last nine iterations of loop
144 hits | 16–31    | 80 ns × 144 = 11,520 ns | Last nine iterations of loop
Total hits = 213. Total misses = 5.

7-26 Calculation of Hit Ratio and Effective Access Time for the Example Program. There are 218 accesses in all (48 in the first pass plus 10 × 17 in the loop). Hit ratio = 213/218 ≈ 0.977. Effective access time = (213 × 80 ns + 5 × 2500 ns) / 218 = 29,540 ns / 218 ≈ 135.5 ns.
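The bookkeeping in the table can be checked with a short simulation. This sketch assumes word addresses and the hit/miss times given in the example:

```python
# Simulate the direct-mapped cache from the example:
# four slots, 16 words per block, block j maps to slot j mod 4.
def simulate(accesses, num_slots=4, block_size=16):
    slots = {}                        # slot index -> block number currently cached
    hits = misses = 0
    for addr in accesses:
        block = addr // block_size
        slot = block % num_slots
        if slots.get(slot) == block:
            hits += 1
        else:
            misses += 1
            slots[slot] = block       # fetch block into its slot
    return hits, misses

# Locations 48..95, then 10 loop passes over 15..31.
accesses = list(range(48, 96)) + 10 * list(range(15, 32))
hits, misses = simulate(accesses)
print(hits, misses)                   # 213 5

hit_ratio = hits / (hits + misses)
eat = (hits * 80 + misses * 2500) / (hits + misses)   # load-through assumed
print(round(hit_ratio, 3), round(eat, 1))             # 0.977 135.5
```

The simulation reproduces the slide's totals, including the two conflict misses when the loop first touches blocks 0 and 1, whose slots still hold blocks 4 and 5.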

7-27 Neat Little LRU Algorithm. A sequence is shown for the Neat Little LRU Algorithm for a cache with four slots. Main memory blocks are accessed in the sequence 0, 2, 3, 1, 5, 4. [Figure: a 4×4 matrix with one row and one column per cache slot; on each access the referenced slot's row is set to all 1s and then its column is cleared, so the least recently used slot is always the row containing all 0s.]
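A sketch of the matrix scheme, using a hypothetical access order (the update rule is the one described above: set the row, then clear the column):

```python
# Matrix ("Neat Little LRU") scheme for a 4-slot cache.
def touch(matrix, i):
    """Record a reference to slot i: set row i to 1s, then clear column i."""
    n = len(matrix)
    for j in range(n):
        matrix[i][j] = 1
    for r in range(n):
        matrix[r][i] = 0

def lru_slot(matrix):
    """The least recently used slot is the row with the fewest 1s (all 0s)."""
    return min(range(len(matrix)), key=lambda r: sum(matrix[r]))

n = 4
m = [[0] * n for _ in range(n)]
for slot in [0, 1, 2, 3, 0]:   # hypothetical access order
    touch(m, slot)
print(lru_slot(m))             # 1 -- slot 1 is least recently used
```

After every slot has been touched at least once, the row sums are all distinct (0 through n−1), so the matrix encodes the full recency order, not just the LRU victim.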

7-28 Overlays. A partition graph for a program with a main routine and three subroutines. [Figure: the compiled program (Main Routine plus Subroutines A, B, and C) is larger than physical memory; the main routine stays resident in partition #0, and the partition graph shows which subroutines may take turns occupying partition #1.]

7-29 Virtual Memory. Virtual memory is stored in a hard disk image. The physical memory holds a small number of virtual pages in physical page frames. [Figure: a mapping between virtual and physical memory with 1024-word pages; virtual pages 0–7 cover virtual addresses 0–1023, 1024–2047, 2048–3071, 3072–4095, 4096–5119, 5120–6143, 6144–7167, and 7168–8191, while physical page frames 0–3 cover physical addresses 0–1023, 1024–2047, 2048–3071, and 3072–4095.]

7-30 Page Table. The page table maps between virtual memory and physical memory. Each entry (one per virtual page, pages 0–7) holds a present bit, a page frame number, and a disk address. Present bit = 0: page is not in physical memory; present bit = 1: page is in physical memory.

7-31 Using the Page Table. A virtual address is translated into a physical address: the page number (the high-order bits of the virtual address) selects a page table entry, the entry supplies the page frame number, and the offset within the page is carried over unchanged into the physical address.
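A minimal sketch of this translation, assuming 1024-word pages as in the figure; the page table contents here are hypothetical.

```python
# Translate a virtual address via a page table: 8 virtual pages of
# 1024 words, 4 physical page frames. Entry = (present bit, frame).
PAGE_SIZE = 1024
page_table = {
    0: (1, 2),        # page 0 resident in frame 2 (hypothetical)
    1: (0, None),     # page 1 on disk -> access would page-fault
    2: (1, 0),        # page 2 resident in frame 0
    3: (1, 3),        # page 3 resident in frame 3
}

def translate(vaddr):
    page, offset = divmod(vaddr, PAGE_SIZE)   # split page number / offset
    present, frame = page_table.get(page, (0, None))
    if not present:
        raise LookupError(f"page fault on page {page}")
    return frame * PAGE_SIZE + offset         # offset carried over unchanged

print(translate(2100))   # page 2, offset 52 -> frame 0 -> physical address 52
```

On a real machine the fault case would trap to the operating system, which brings the page in from disk and retries the access.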

7-32 Using the Page Table (cont'd). The configuration of a page table changes as a program executes. Initially, the page table is empty; successive page faults then fill it in. [Figure: four snapshots of the page table after successive page faults; in the final configuration, four pages are in physical memory.]

7-33 Segmentation. A segmented memory allows two users to share the same word processor code, with different data spaces. [Figure: segment #0 holds the word processor code (execute only); segment #1 is the data space for one user and segment #2 the data space for the other (each read/write by its own user); the address spaces contain used, free, and unused regions.]

7-34 Fragmentation. (a) Free area of memory after initialization; (b) after fragmentation; (c) after coalescing. [Figure: memory holding the operating system, programs A, B, and C, free areas, a dead zone, and I/O space; fragmentation scatters small free areas between the programs, and coalescing merges adjacent free areas back into larger ones.]

7-35 Translation Lookaside Buffer. An example TLB holds 8 entries for a system with 32 virtual pages and 16 page frames. Each entry holds a valid bit, a virtual page number, and the corresponding physical page number.
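A sketch of a small TLB in front of the page table. The page-to-frame mapping and the FIFO eviction choice are illustrative assumptions, not details from the slide:

```python
# An 8-entry TLB for a system with 32 virtual pages and 16 frames.
# A hit returns the frame immediately; a miss walks the page table
# and installs the translation, evicting the oldest entry if full.
from collections import OrderedDict

TLB_SIZE = 8
tlb = OrderedDict()                              # virtual page -> frame
page_table = {vp: vp % 16 for vp in range(32)}   # hypothetical mapping

def lookup(vpage):
    if vpage in tlb:
        return tlb[vpage], "TLB hit"
    frame = page_table[vpage]        # TLB miss: walk the page table
    if len(tlb) >= TLB_SIZE:
        tlb.popitem(last=False)      # evict the oldest entry (FIFO)
    tlb[vpage] = frame
    return frame, "TLB miss"

print(lookup(5))   # (5, 'TLB miss')
print(lookup(5))   # (5, 'TLB hit')
```

The point of the TLB is exactly this asymmetry: the second access to the same page skips the page table walk entirely.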

7-36 3-Variable Decoder. A conventional decoder is not extensible to large sizes because each added address line doubles the number of logic gates that every address line must drive. [Figure: a 3-to-8 decoder with inputs a0–a2 and outputs d0–d7.]

7-37 Tree Decoder, 3 Variables. A tree decoder is more easily extended to large sizes because fan-in and fan-out are managed by adding deeper levels, with fan-out buffers between levels. [Figure: two equivalent tree decoders (a) and (b) for inputs a0–a2, producing outputs d0–d7.]

7-38 Tree Decoding One Level at a Time. [Figure: a decoding tree for a 16-word random access memory, resolving one address bit per level (levels 0–3).]

7-39 Content Addressable Memory Addressing. Relationships between random access memory and content addressable memory. [Figure: on the left, a RAM in which each 32-bit address selects a 32-bit value; on the right, a CAM whose entries are records made up of several fields (Field1, Field2, Field3) and are selected by matching field contents rather than by address.]
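Content-addressable matching can be sketched as a comparand-and-mask search, the access mode the next slide's CAM hardware broadcasts to every cell in parallel. The cell contents here are made-up 8-bit records:

```python
# Content-addressable lookup: a comparand and a mask select which bits
# must match, and every matching cell responds (here, by index).
cells = [0x49, 0x19, 0x74, 0x49, 0x2B]   # hypothetical CAM contents

def cam_match(comparand, mask):
    """Return indices of cells equal to the comparand on the masked bits."""
    return [i for i, v in enumerate(cells)
            if (v & mask) == (comparand & mask)]

print(cam_match(0x49, 0xFF))  # exact match        -> [0, 3]
print(cam_match(0x09, 0x0F))  # low nibble = 9     -> [0, 1, 3]
```

In hardware each cell performs its masked comparison simultaneously, so the search takes one cycle regardless of how many cells there are; this loop is only a functional model.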

7-40 Overview of CAM. [Figure: a central control unit broadcasts a comparand and a mask to an array of cells (Cell 0 through Cell 4095), each with a tag line T; information and commands flow between central control and a data gathering device.] Source: Foster, C. C., Content Addressable Parallel Processors, Van Nostrand Reinhold Company, 1976.

7-41 Addressing Subtrees for a CAM. [Figure: a tree of channels fanning out to the cells; data is four bits per channel and control is one bit per channel.]

7-42 Block Diagram of Dual-Read RAM. [Figure: a RAM with two read ports, A and B, each with its own address inputs A0–A9 and its own 8-bit data outputs over the shared word array; a common write path (Data In, WR, CS) updates the array.]

7-43 Rambus Memory Rambus technology on the Nintendo 64 motherboard (top left and bottom right) enables cost savings over the conventional Sega Saturn motherboard design (bottom left). (Photo source: Rambus, Inc.)

7-44 The Intel Pentium Memory System. [Figure: on the Pentium processor chip, the CPU is served by a level 1 instruction cache (8 KB, with its own TLB) and a level 1 data cache (8 KB, with its own TLB); an off-chip level 2 cache of up to 512 KB sits between the processor and a main memory of up to 8 GB.]