COSC 6385 Computer Architecture - Memory Hierarchies (III)
Edgar Gabriel
Spring 2014

Memory Technology
- Performance metrics:
  - Latency problems are mostly handled through caches
  - Bandwidth is the main concern for main memory
- Access time: time between a read request and when the desired word arrives
- Cycle time: minimum time between unrelated requests to memory
- DRAM is mostly used for main memory, SRAM for caches
Memory Technology
- Static Random Access Memory (SRAM)
  - Requires low power to retain a bit
  - Requires 6 transistors per bit
- Dynamic Random Access Memory (DRAM)
  - One transistor per bit
  - Must be re-written after being read
  - Must be refreshed periodically (~ every 8 ms)
    - Refresh can be done for an entire row simultaneously
    - The memory system is unavailable for an entire memory cycle (access time + cycle time) during a refresh
  - Address lines are multiplexed:
    - Upper half of the address: row access strobe (RAS)
    - Lower half of the address: column access strobe (CAS)

Source: http://www.eng.utah.edu/~cs7810/pres/11-7810-12.pdf
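The RAS/CAS multiplexing above can be sketched in a few lines. This is an illustrative model only: the 24-bit address width and the even 12/12 row-column split are assumptions for the example, not values from the slides.

```python
# Sketch of DRAM address multiplexing: the same pins carry first the row
# address (latched by RAS) and then the column address (latched by CAS),
# halving the number of address pins on the chip.
# The 24-bit address and 12/12 split are illustrative assumptions.
ADDR_BITS = 24
HALF = ADDR_BITS // 2          # pins are shared, so only half are needed

def multiplex(addr):
    """Return the (row, column) pair driven on the shared address pins."""
    row = addr >> HALF                 # upper half, sent with RAS
    col = addr & ((1 << HALF) - 1)     # lower half, sent with CAS
    return row, col

row, col = multiplex(0xABC123)
assert (row, col) == (0xABC, 0x123)
```

The payoff is pin count: a device addressed by 24 bits needs only 12 address pins, at the cost of two sequential address-transfer steps.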
Memory Technology
- Amdahl's rule of thumb: memory capacity should grow linearly with processor speed
- Memory capacity and speed have not kept pace with processors:
  - Capacity has increased at ~55% per year
  - RAS cycle time has improved at only ~5% per year

Year   Chip size   Slowest DRAM RAS (ns)   Fastest DRAM RAS (ns)   CAS (ns)   Cycle time (ns)
1980   64 Kbit     180                     150                     75         250
1989   4 Mbit      100                     80                      20         165
1998   128 Mbit    70                      50                      10         100
2010   4 Gbit      36                      28                      1          37
2012   8 Gbit      30                      24                      0.5        31

DRAM optimizations
- Dual Inline Memory Module (DIMM): small board containing 4-16 DRAM chips
- Double data rate (DDR): transfer data on both the rising and falling edge of the DRAM clock signal
  - Doubles the peak data rate
- Synchronous DRAM (SDRAM)
  - Added a clock to the DRAM interface
  - Avoids the repeated synchronization with the controller clock
  - Contains a register to hold the number of bytes requested
  - Up to 8 transfers of 16 bits each can be served without sending a new address (burst mode)
  - Burst mode often supports critical-word-first transfer
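The burst-mode delivery order can be sketched as follows. The burst length of 8 matches the "up to 8 transfers" on the slide; the requested-word value in the example is arbitrary.

```python
# Sketch of burst mode with critical-word-first ordering: all words of a
# block are served from a single address, starting at the word the CPU
# actually asked for and wrapping around the block, so the processor can
# resume as soon as the first transfer arrives.
BURST_LEN = 8

def burst_order(requested_word):
    """Word indices in the order a critical-word-first burst delivers them."""
    return [(requested_word + i) % BURST_LEN for i in range(BURST_LEN)]

print(burst_order(5))   # [5, 6, 7, 0, 1, 2, 3, 4]
```

Without critical word first, the burst would always start at word 0 and the CPU might stall for up to 7 extra transfers before its word arrives.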
DRAM optimizations (II)
- Multiple accesses to the same row
- Wider interfaces
  - DDR: offered a 4-bit transfer mode
  - DDR2 and DDR3: offer a 16-bit transfer mode
- Multiple banks on each DRAM device
  - 2-8 banks in DDR3
  - Requires adding another segment to the address: bank number, row address, column address

[Figure: DDR DRAMs and DIMMs]
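The three-part address produced by banking can be sketched like this. The field widths (3 bank bits for 8 banks, 14 row bits, 10 column bits) are assumptions typical of a DDR3 device, not values from the slides.

```python
# Sketch of the three-part DRAM address used with multiple banks:
# bank number | row address | column address.
# Field widths below are illustrative assumptions (8 banks = 3 bits).
BANK_BITS, ROW_BITS, COL_BITS = 3, 14, 10

def decompose(addr):
    """Split a flat DRAM address into (bank, row, column)."""
    col = addr & ((1 << COL_BITS) - 1)
    addr >>= COL_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    addr >>= ROW_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    return bank, row, col

assert decompose((5 << 24) | (100 << 10) | 7) == (5, 100, 7)
```

Interleaving consecutive rows across banks lets one bank precharge while another is being accessed, which is the point of adding the bank field at all.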
Memory Technology
- DDR generations:
  - DDR2
    - Lower power (2.5 V -> 1.8 V)
    - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
  - DDR3
    - 1.5 V
    - 800 MHz
  - DDR4
    - 1-1.2 V
    - 1600 MHz
- GDDR5 is graphics memory based on DDR3

Memory Optimizations
- Graphics memory:
  - Achieves 2-5x the bandwidth per DRAM compared to DDR3
  - Wider interfaces (32-bit vs. 16-bit)
  - Higher clock rate
  - Possible because the chips are attached by soldering instead of mounted in socketed DIMM modules
- Reducing power in SDRAMs:
  - Lower voltage
  - Low-power mode (ignores the clock, but continues to refresh)
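Combining the clock rates above with double pumping gives the peak transfer rate. A small sketch, assuming the standard 64-bit (8-byte) DIMM data bus, which is not stated on the slides:

```python
# Peak DIMM bandwidth sketch: DDR transfers on both clock edges, so
# peak rate = clock * 2 * bus width.
# The 64-bit (8-byte) DIMM data bus is an assumed standard value.
BUS_BYTES = 8

def peak_bandwidth_mb_s(clock_mhz):
    """Peak transfer rate in MB/s for a double-data-rate DIMM."""
    return clock_mhz * 2 * BUS_BYTES

print(peak_bandwidth_mb_s(400))   # DDR2 at 400 MHz: 6400 MB/s
print(peak_bandwidth_mb_s(800))   # DDR3 at 800 MHz: 12800 MB/s
```

This is why a DDR3 module clocked at 800 MHz is marketed as "DDR3-1600": the effective transfer rate is twice the clock.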
Flash Memory
- A type of Electrically Erasable Programmable Read-Only Memory (EEPROM)
- Holds its contents without power
- Cheaper than SDRAM, more expensive than disk
- Slower than SDRAM, faster than disk
- Must be erased (in blocks) before being overwritten
- Limited number of write cycles

Memory Dependability
- Electronic circuits are susceptible to cosmic rays
- For SDRAM:
  - Soft errors: dynamic errors
    - Detected and fixed by error-correcting codes (ECC)
    - One parity bit for ~8 data bits
  - Hard errors: permanent hardware errors
    - DIMMs often contain spare rows to replace defective rows
- Chipkill: a RAID-like error-recovery technique
  - Can handle the failure of an entire chip
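The "one parity bit per 8 data bits" idea can be sketched as follows. Note the hedge: a single parity bit only *detects* a flipped bit; the ECC codes on real DIMMs (e.g. SECDED) add enough bits to also *correct* it.

```python
# Sketch of soft-error detection with one parity bit per 8 data bits.
# A lone parity bit detects any single-bit flip but cannot locate it;
# real ECC memory uses stronger codes (such as SECDED) that also correct.
def parity(byte):
    """Even parity over the 8 data bits of one byte."""
    return bin(byte).count("1") % 2

stored = 0b10110100
p = parity(stored)                     # parity bit stored alongside the data
corrupted = stored ^ (1 << 3)          # a cosmic ray flips bit 3
assert parity(corrupted) != p          # mismatch reveals the soft error
```

The storage overhead matches the slide: 1 extra bit per 8 data bits, i.e. 12.5%, which is why ECC DIMMs are 72 bits wide instead of 64.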
Intel Core i7 memory hierarchy
- 48-bit virtual address
- 36-bit physical address
- 2-level TLB caches
- 4 KB page size = 2^12 bytes

[Figure: layout of the 48-bit virtual address; bits 12-47 form the page frame]

Characteristic    Instruction TLB   Data TLB     Second-level TLB
Size              128 entries       64 entries   512 entries
Associativity     4-way             4-way        4-way
Access latency    1 cycle           1 cycle      6 cycles
Miss penalty      7 cycles          7 cycles     100s of cycles

Intel Core i7 memory hierarchy (II)

Characteristic      L1 Instruction   L1 Data    L2          L3
Size                32 KB            32 KB      256 KB      2 MB per core
Associativity       4-way            8-way      8-way       16-way
Access latency      4 cycles         4 cycles   10 cycles   35 cycles
No. of index bits   7                6          9           13*
*Assuming a 4-core processor

- L1 and L2 are separate per core, L3 is shared among all cores
- L1 caches are virtually indexed but physically tagged
- L2 and L3 are physically indexed and tagged
- Non-blocking caches
- Merging write buffer for the L1 caches
- L3 is inclusive of the L1 and L2 caches
- Cache block size: 64 bytes => 6 bits required for the block offset
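The "No. of index bits" row of the table can be rederived from the other rows: with a 64-byte block, index bits = log2(size / (associativity * block size)). A quick check:

```python
# Rederiving the index-bit counts of the Core i7 caches from their
# size, associativity, and 64-byte block size:
# number of sets = size / (associativity * block size).
import math

BLOCK = 64  # bytes per cache block

def index_bits(size_bytes, assoc):
    sets = size_bytes // (assoc * BLOCK)
    return int(math.log2(sets))

assert index_bits(32 * 1024, 4) == 7              # L1 instruction
assert index_bits(32 * 1024, 8) == 6              # L1 data
assert index_bits(256 * 1024, 8) == 9             # L2
assert index_bits(4 * 2 * 1024 * 1024, 16) == 13  # L3: 4 cores x 2 MB
```

Note how the L1 instruction and data caches have the same size but different index widths, purely because of their different associativities.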
Accessing the Instruction TLB
- Given a PC (= virtual address):
  - Send the page frame of the virtual address to the instruction TLB to retrieve the physical address
  - The TLB provides the physical address if found, and checks for access violations
  - If not found in the instruction TLB, the 2nd-level TLB is checked
  - If not found in the 2nd-level TLB, the operating system has to perform the translation
    - The full page table can be very large and might itself be swapped out to disk -> another translation step is required to load the corresponding part of the page table

Accessing the Instruction Cache
- To identify an address in the L1 instruction cache:
  - Index field of the virtual address: 7 bits + 2 bits from the block offset (the i7 always loads 16 bytes per instruction request)
  - Cache tag from the physical address: 23 bits = 36 physical address bits - 7 index bits - 6 block offset bits
- The 2nd-level cache is physically indexed and physically tagged
  - The 36-bit physical address is decomposed into: 6 offset bits, 9 index bits, and 21 tag bits

[Figure: block address split into tag, index, and block offset fields]
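The L2 address decomposition on the slide (6-bit offset, 9-bit index, 21-bit tag out of 36 physical address bits) can be written out directly; the sample address in the test is arbitrary.

```python
# Decomposing the 36-bit physical address for the physically indexed,
# physically tagged L2 cache, using the field widths from the slide:
# 6 offset bits, 9 index bits, and the remaining 21 bits as the tag.
OFFSET_BITS, INDEX_BITS, PADDR_BITS = 6, 9, 36
TAG_BITS = PADDR_BITS - INDEX_BITS - OFFSET_BITS   # = 21

def split_l2(paddr):
    """Split a physical address into (tag, index, offset)."""
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    index = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

assert TAG_BITS == 21
assert split_l2((3 << 15) | (10 << 6) | 5) == (3, 10, 5)
```

The index selects one of the 512 sets, the tag is compared against the 8 ways of that set, and the offset picks the byte inside the 64-byte block.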