CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

Andrew Sohn
Computer Science Department
New Jersey Institute of Technology

Lecture 9: Main Memory 9-/ /6/ A. Sohn

Memory Cycle Time
[Figure: memory controller (chipset) and DIMM]

6MB Dual Inline Memory Module
[Figure: DIMM built from eight Mbx8 DRAM chips (Chip ... Chip7) plus a Parity/ECC chip; 64-bit data bus (eight chips x 8 bits each) and a -bit multiplexed address bus]

5MB with 6MB DIMMs
[Figure: several 6MB DIMMs sharing the data bus and the -bit multiplexed address bus]
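The DIMM slides above build a module out of x8 DRAM chips, and the capacity and bus-width arithmetic can be sketched in a few lines. This is a minimal sketch, not the slide's own calculation: the function names are hypothetical, and the chip depth used in the example (2M locations per chip) is an assumed value.

```python
def chip_capacity_bits(depth_m: int, width_bits: int) -> int:
    """Capacity of one DRAM chip organized as depth_m Mi locations x width_bits."""
    return depth_m * 2**20 * width_bits


def dimm_capacity_bytes(n_chips: int, depth_m: int, width_bits: int) -> int:
    """Data capacity of a DIMM made of n_chips identical data chips.

    A parity/ECC chip, if present, is not counted as data capacity.
    """
    return n_chips * chip_capacity_bits(depth_m, width_bits) // 8


def data_bus_width_bits(n_chips: int, width_bits: int) -> int:
    """Each chip drives its own slice of the data bus."""
    return n_chips * width_bits


# Assumed example: eight chips, each 2M x 8 (the depth is hypothetical).
capacity = dimm_capacity_bytes(n_chips=8, depth_m=2, width_bits=8)
print(capacity // 2**20, "MB")
print(data_bus_width_bits(n_chips=8, width_bits=8), "bit data bus")
```

Every chip supplies 8 bits of each bus word, so a full-width access touches all eight chips at the same row and column.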

Current DRAM Technology
DRAM fast page mode (burst)
  Allows repeated accesses to the row buffer without another row access time; such an access is called a page hit.
Synchronous DRAM (SDRAM)
  Synchronizes itself with the system clock. This enables the memory controller to know the exact clock cycle when the requested data will be ready.
  SDRAM chips take advantage of interleaving and burst mode functions.
  CPU clock (GHz) vs. memory clock (MHz).
Double Data Rate SDRAM (DDR SDRAM)
  Allows the memory chip to perform transactions on both the rising and falling edges of the clock cycle, so the effective data rate is twice the clock rate.
Direct Rambus DRAM (DRDRAM)
  Transfers data over a narrow 16-bit bus called a Direct Rambus Channel. Its high data rate comes from double clocking, which allows operations to occur on both the rising and falling edges of the clock cycle.
  It appears that the current market is moving against Rambus.

DRAM Cell
Consists of a transistor and a capacitor.
[Figure: DRAM cell with row access and column access lines]
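Returning to the DDR SDRAM point on the Current DRAM Technology slide: transferring on both clock edges doubles the effective data rate, and peak bandwidth follows from the bus width. The sketch below illustrates that arithmetic only; the function names and the 100 MHz / 64-bit example values are assumptions, not figures from the slide.

```python
def effective_data_rate_mts(clock_mhz: float, transfers_per_clock: int = 2) -> float:
    """Effective rate in millions of transfers per second.

    transfers_per_clock is 1 for single data rate SDRAM and 2 for DDR
    (one transfer on the rising edge, one on the falling edge).
    """
    return clock_mhz * transfers_per_clock


def peak_bandwidth_mb_per_s(clock_mhz: float, bus_width_bits: int,
                            transfers_per_clock: int = 2) -> float:
    """Peak bandwidth in MB/s (10^6 bytes per second), ignoring all overheads."""
    return effective_data_rate_mts(clock_mhz, transfers_per_clock) * bus_width_bits / 8


# Assumed example values: 100 MHz memory clock, 64-bit data bus.
print(effective_data_rate_mts(100.0))
print(peak_bandwidth_mb_per_s(100.0, 64))
print(peak_bandwidth_mb_per_s(100.0, 64, transfers_per_clock=1))
```

These are upper bounds: row activations, precharge, and refresh keep sustained bandwidth below the peak.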

DRAM Array
An array of cells, each consisting of a transistor and a capacitor.
[Figure: DRAM cell array with row access and column access lines]

Sample Kb Memory
[Figure: block diagram of a Kb memory: the address feeds a row decoder and a column decoder; the cell array is read through sense amplifiers and I/O gating]

Kb with Address Multiplexing
[Figure: the Kb memory with a multiplexed address: an address register holds the row address and the column address; a counter on the column address supports burst access; row decoder, column decoder, sense amplifiers, and I/O gating as before]

[Figure: Kb memory organized as four banks, each bank with its own row decoder; bank control, address register, column address, column decoder, sense amplifiers, and I/O gating]
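The address-multiplexing diagrams above send one cell address over shared pins as a row address followed by a column address, with a counter stepping the column during a burst. A minimal sketch of that split follows. It assumes the row is taken from the upper address bits and that the burst counter simply wraps at the end of the row; the function names and the 6-bit row / 6-bit column sizes are hypothetical.

```python
def split_address(addr: int, row_bits: int, col_bits: int) -> tuple:
    """Split a cell address into (row, column) for a multiplexed address bus.

    Assumption: the upper row_bits form the row address and the lower
    col_bits form the column address.
    """
    if not 0 <= addr < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col


def burst_columns(start_col: int, burst_len: int, col_bits: int) -> list:
    """Columns visited by a simple burst counter that wraps within the row."""
    mask = (1 << col_bits) - 1
    return [(start_col + i) & mask for i in range(burst_len)]


# Hypothetical 4096-cell array: 64 rows x 64 columns.
row, col = split_address(0b101010_000111, row_bits=6, col_bits=6)
print(row, col)
print(burst_columns(col, burst_len=4, col_bits=6))
```

Every column in the burst comes from the row already held in the sense amplifiers, so only the first access pays the row access time.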

Partitioning of Address Space
High Order Word Interleaving
  Consecutive words are stored in the same memory bank.
  High order bits are used to select a bank.
Low Order Word Interleaving
  Consecutive words are stored across different memory banks.
  Low order bits are used to select a bank.
Grouped Low Order Interleaving
  Consecutive words are stored across different banks of a group.
  High order bits are used to select a group. Low order bits are used to select the bank within a group.
Pros and Cons

High Order Word Interleaving
  load f,(r)
  mult f,f,f
  store (r),f
  addi r,r,-#8
  bne r,r,loop
[Figure: address from MSB to LSB, with the high order bits selecting one of four banks; consecutive words fall in the same bank]

Low Order Word Interleaving
  load f,(r)
  mult f,f,f
  store (r),f
  addi r,r,-#8
  bne r,r,loop
[Figure: address from MSB to LSB, with the low order bits selecting one of four banks; consecutive words fall in different banks]

Grouped Low Order Interleaving
  load f,(r)
  mult f,f,f
  store (r),f
  addi r,r,-#8
  bne r,r,loop
[Figure: address from MSB to LSB, with the high order bits selecting a group and the low order bits selecting a bank within the group; four banks]
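The three partitioning schemes differ only in which address bits select the bank, which a short sketch can make concrete. It assumes word addresses (not byte addresses) and power-of-two bank counts; the function names and the 16-word, four-bank example are hypothetical.

```python
def high_order_bank(word_addr: int, addr_bits: int, bank_bits: int) -> int:
    """High order interleaving: the top bank_bits select the bank."""
    return word_addr >> (addr_bits - bank_bits)


def low_order_bank(word_addr: int, bank_bits: int) -> int:
    """Low order interleaving: the bottom bank_bits select the bank."""
    return word_addr & ((1 << bank_bits) - 1)


def grouped_bank(word_addr: int, addr_bits: int,
                 group_bits: int, bank_bits: int) -> tuple:
    """Grouped low order interleaving: (group, bank within group).

    The top group_bits select the group; the bottom bank_bits select
    the bank inside that group.
    """
    group = word_addr >> (addr_bits - group_bits)
    bank = word_addr & ((1 << bank_bits) - 1)
    return group, bank


# Hypothetical memory: 16 words (4 address bits), 4 banks.
words = range(16)
print([high_order_bank(a, 4, 2) for a in words])
print([low_order_bank(a, 2) for a in words])
# Two groups of two banks each.
print([grouped_bank(a, 4, 1, 1) for a in words])
```

A loop that walks consecutive words, like the one on the slides, keeps hitting one bank under high order interleaving but rotates through all banks under low order interleaving, which lets the bank accesses overlap.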

Memory Latency
[Figure: memory block diagram with multiplexed address: address register, row address and column address, row decoder, column decoder, sense amplifiers, I/O gating]

Memory Latency for 6-bit Line
First word (8 bytes) - 9 cycles
  CPU address to chipset
  Memory controller at chipset to DRAM
  Row access strobe (RAS): reading and charging the row
  Column access strobe (CAS): getting the column to the DRAM output buffer
  From output buffer to CPU through chipset
Second word (8 bytes) - cycle
Third word (8 bytes) - cycle
Fourth word (8 bytes) - cycle
For CPU @ GHz, DRAM @ MHz: the total latency for reading a 6-bit L cache line
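The line-fill latency above is the cost of the first word plus the cost of each following word, converted from memory cycles to CPU cycles by the clock ratio. The sketch below shows that calculation under stated assumptions: the cycle counts are taken to be memory-clock cycles, and the example values (9 cycles for the first word, 1 for each later word, a 1000 MHz CPU, a 100 MHz DRAM) are illustrative, not the slide's figures. The function names are hypothetical.

```python
def line_fill_memory_cycles(first_word_cycles: int, next_word_cycles: int,
                            words_per_line: int) -> int:
    """Memory-clock cycles to read one cache line, word by word."""
    return first_word_cycles + (words_per_line - 1) * next_word_cycles


def to_cpu_cycles(memory_cycles: int, cpu_mhz: float, mem_mhz: float) -> float:
    """Convert memory-clock cycles to CPU-clock cycles."""
    return memory_cycles * cpu_mhz / mem_mhz


# Assumed example values (see the lead-in).
mem_cycles = line_fill_memory_cycles(first_word_cycles=9, next_word_cycles=1,
                                     words_per_line=4)
print(mem_cycles, "memory cycles")
print(to_cpu_cycles(mem_cycles, cpu_mhz=1000, mem_mhz=100), "CPU cycles")
```

The first word dominates because it pays for the address transfer, RAS, and CAS; the later words come from the already open row.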

Memory Latency
[Figure: memory controller (chipset) and DIMM]

Cache Miss Penalty
[Figure: three cache-memory organizations for a multi-word memory block: a one-word-wide memory, a wide memory with a multiplexer between cache and memory, and four interleaved one-word-wide banks]
Assuming
  5 clocks to send address to memory
  5 clocks to access memory
  5 clocks to send a word to cache

Cache Miss Penalty
5 + x5 + x5
[Figure: one-word-wide memory: cache and memory connected by a one-word bus; a block of several words]

Cache Miss Penalty
5 + x5 + 5
[Figure: wide memory: a multiplexer between the cache and a memory several words wide]

Cache Miss Penalty
5 + x5 + x5
[Figure: interleaved memory: cache connected by a one-word bus to four one-word-wide banks]

Improving Cache Performance
1. Reduce cache miss rate (number of cache misses)
2. Reduce cache miss penalty
3. Reduce the time to hit in the cache
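The three Cache Miss Penalty slides above share one structure: clocks to send the address, plus clocks to access memory, plus clocks to transfer words, with different multipliers for each organization. The sketch below writes out that structure. It assumes the wide memory is as wide as the block and the interleaved memory has at least one bank per word of the block; the function names and the example timings (1, 15, and 1 clocks, four-word block) are hypothetical, not the slide's values.

```python
def penalty_one_word_wide(t_addr: int, t_access: int, t_transfer: int,
                          words: int) -> int:
    """Each word needs its own memory access and its own bus transfer."""
    return t_addr + words * t_access + words * t_transfer


def penalty_wide(t_addr: int, t_access: int, t_transfer: int) -> int:
    """Memory and bus as wide as the block: one access, one transfer."""
    return t_addr + t_access + t_transfer


def penalty_interleaved(t_addr: int, t_access: int, t_transfer: int,
                        words: int) -> int:
    """One bank per word: accesses overlap, transfers stay sequential."""
    return t_addr + t_access + words * t_transfer


# Assumed example timings, in clocks.
t_addr, t_access, t_transfer, words = 1, 15, 1, 4
print(penalty_one_word_wide(t_addr, t_access, t_transfer, words))
print(penalty_wide(t_addr, t_access, t_transfer))
print(penalty_interleaved(t_addr, t_access, t_transfer, words))
```

Interleaving gets most of the benefit of the wide organization while keeping a one-word bus, since the slow part (the DRAM access) is paid once rather than once per word.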

Reducing Cache Miss Rate
Reducing the number of cache misses
1. Larger block size: lowers compulsory misses but increases miss penalty
2. Higher associativity: increases hit time and increases clock cycle
3. Yet another cache: victim cache?
4. Pseudo-associative caches
5. Hardware prefetching of instructions and data: two blocks are prefetched, one into the cache and the other into a buffer
6. Compiler-controlled prefetching
7. Compiler optimization

Reducing Cache Miss Penalty
1. Giving priority to read misses over writes
2. Sub-block placement for reduced miss penalty
3. Early restart and critical word first
4. Nonblocking caches to reduce stalls
5. Second-level caches

Reducing Hit Time
1. Small and simple caches
2. Avoiding address translation during indexing
3. Pipelining writes for fast write hits
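The three targets above (miss rate, miss penalty, hit time) are the three terms of the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty. The slides do not state the formula, so the sketch below is offered as background rather than as the lecture's own material; the function names and all numeric values are hypothetical.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time, in the same unit as hit_time."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit_time: float, l1_miss_rate: float,
                   l2_hit_time: float, l2_local_miss_rate: float,
                   memory_penalty: float) -> float:
    """With a second-level cache, the L1 miss penalty is the AMAT of L2."""
    l1_miss_penalty = amat(l2_hit_time, l2_local_miss_rate, memory_penalty)
    return amat(l1_hit_time, l1_miss_rate, l1_miss_penalty)


# Assumed example values, in clock cycles.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))
print(amat_two_level(l1_hit_time=1, l1_miss_rate=0.05,
                     l2_hit_time=10, l2_local_miss_rate=0.2,
                     memory_penalty=100))
```

The two-level form shows why a second-level cache counts as a miss-penalty technique: it leaves the first-level hit time and miss rate alone and shrinks only the penalty term.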

Alpha Memory Hierarchy
word = 6 bits
block = words = bytes = 6 bits
block offset = 5 bits => byte addressable
High bits are used to select a word out of words in a block
ITLB = entries (pages)
Instruction cache = 8 (index) x 5 (block size) = B
cache = 8 (index) x 5 (block size) = B
L cache = 9 (index) x 5 (block size) = B
Main memory = x 5 (block size) = B
Disk = (# of pages) x (page size) = B
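The sizes on the Alpha slide are products of the form (number of index entries) x (block size). A minimal sketch of that calculation follows, together with the matching split of an address into tag, index, and block offset. It assumes a direct-mapped cache and reads the slide's 8 (index) and 5 (block size) as bit counts, that is, 2^8 entries of 2^5 bytes; that reading is an assumption, and the function names and the sample address are hypothetical.

```python
def cache_capacity_bytes(index_bits: int, offset_bits: int) -> int:
    """Data capacity of a direct-mapped cache: 2^index blocks of 2^offset bytes."""
    return (1 << index_bits) * (1 << offset_bits)


def split_cache_address(addr: int, index_bits: int, offset_bits: int) -> tuple:
    """Split a byte address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (index_bits + offset_bits)
    return tag, index, offset


# Assumed reading of the slide: 8 index bits, 5 block-offset bits.
print(cache_capacity_bytes(index_bits=8, offset_bits=5), "bytes")
print(split_cache_address(0x12345, index_bits=8, offset_bits=5))
```

With a 5-bit offset each block holds 32 bytes, so the offset identifies a single byte within the block, which is what makes the cache byte addressable.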