Memory System Design Part II Bharadwaj Amrutur ECE Dept. IISc Bangalore.
References: Computer Architecture: A Quantitative Approach, Hennessy & Patterson
Outline: Memory hierarchy, Cache, Multi-core considerations, Main Memory, Disk, Virtual memory, Power considerations
View from the processor
[Figure: the processor presents MemOp, Address, and WriteData to the memory over a clocked (Clk) interface and receives ReadData back.]
Memory operations (MemOp) in DLX: Load, Store. Other RISC processors add: Prefetch, coprocessor Load/Store, Cache Flush, Synchronization.
Addresses are 32 or 64 bits in modern processors. The data bus is 64 bits wide; accesses can be bytes, 32 bits, or 64 bits.
The Gap
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance improves ~60%/year ("Moore's Law"); DRAM improves ~7%/year; the processor-memory performance gap grows ~50%/year.] From Kubiatowicz/UCB
Memory Hierarchy Characteristics
Level | Integration | Distance | Capacity | Latency | Bandwidth
Registers | on chip | <1 mm | 16-128 64-bit regs | 1/2 cycle | ~1000 Gb/s
L1 cache | on chip | few mm | 4KB-32KB | 1 cycle | ~400 Gb/s
L2 cache | chip/package | few cm | 1MB-8MB | 5-10 cycles | ~200 Gb/s
Main memory (DRAM) | PCB | few inches | 128MB-4GB | 40-100 cycles | ~50 Gb/s
Disk | box | many inches | 80GB-few TB | 1000s of cycles | ~1 Gb/s
Exercise: Memory Hierarchy
Find Power/Mbps/bit for each layer of the memory hierarchy. Plot Power/Mbps versus Bit as well as Bit^0.5. Which is better?
Cache concept
A small, fast storage that exploits spatial and temporal locality. Caches are found in other places too: file caches, name caches, etc.
Consider the memory as a sequence of blocks, also known as lines; a block can contain multiple bytes. The cache stores a subset of the blocks from main memory and is searched first to satisfy a memory access request: a hit returns fast, while a miss incurs a penalty.
[Figure: main memory blocks 0-15; a 4-block cache temporarily holding main memory blocks.]
Average Memory Access Time
Program execution time is given as:
CPUtime = IC × (ALUops/Instr × CPI_ALUops + MemAccess/Instr × AMAT) × CycleTime
Average Memory Access Time (AMAT) is given as:
AMAT = HitTime + MissRate × MissPenalty
HitTime and MissPenalty are in clock cycles; IC is the instruction count of the program.
To reduce AMAT, reduce HitTime, MissRate, and MissPenalty. HitTime is usually the lowest possible, 1 cycle. MissPenalty is a function of the next levels of the memory hierarchy. MissRate is a function of cache size and associativity, which also impact CycleTime: hence an optimization problem.
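The two formulas above can be sketched numerically. The parameter values below are made-up examples, not figures from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time, in clock cycles."""
    return hit_time + miss_rate * miss_penalty

def cpu_time(ic, alu_per_instr, cpi_alu, mem_per_instr, amat_cycles, cycle_time):
    """CPUtime = IC x (ALUops/Instr x CPI_ALUops + MemAccess/Instr x AMAT) x CycleTime."""
    return ic * (alu_per_instr * cpi_alu + mem_per_instr * amat_cycles) * cycle_time

# Illustrative numbers: 1-cycle hit, 5% miss rate, 40-cycle penalty.
a = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)   # 1 + 0.05*40 = 3.0 cycles
t = cpu_time(ic=1_000_000, alu_per_instr=0.6, cpi_alu=1,
             mem_per_instr=1.4, amat_cycles=a, cycle_time=1e-9)  # 4.8 ms... actually 4.8e-3 s
```

Note how the miss term dominates AMAT even at a 5% miss rate, which is why reducing MissRate and MissPenalty matters so much.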
Exercise: Write the corresponding equation for the energy consumed by a program.
Energy
Program execution energy is given as:
CPUenergy = IC × (ALUops/Instr × EPI_ALUops + MemAccess/Instr × AMAE) + CPUtime × LeakagePower
Average Memory Access Energy (AMAE) is given as:
AMAE = HitEnergy + MissRate × MissEnergy
HitEnergy and MissEnergy are in joules, and are average numbers that account for the activity factor of the data/address bits. IC is the instruction count of the program.
To reduce AMAE, reduce HitEnergy, MissRate, and MissEnergy. MissEnergy is a function of the next levels of the memory hierarchy. MissRate is a function of cache size and associativity, which also impact CPUtime: hence an optimization problem.
Cache issues
Where should a block be placed in the cache? How is a block searched for in the cache? Which block should be replaced on a cache miss? What to do on a write?
Direct Mapped: Placement
[Figure: main memory blocks 0-15 mapping onto a 4-block direct-mapped cache.]
The main memory blocks that map to each cache block are:
Cache block 0: 0, 4, 8, 12
Cache block 1: 1, 5, 9, 13
Cache block 2: 2, 6, 10, 14
Cache block 3: 3, 7, 11, 15
The formula is: BlockAddress mod CacheSize (CacheSize is in blocks)
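The placement rule above is a single modulo. A minimal sketch reproducing the mapping table for the 4-block cache:

```python
def cache_block(block_addr, cache_size_blocks):
    """Direct mapped: CacheBlock = BlockAddress mod CacheSize (in blocks)."""
    return block_addr % cache_size_blocks

# Reproduce the slide's mapping for a 4-block cache and 16 memory blocks.
mapping = {b: cache_block(b, 4) for b in range(16)}
# Blocks 0, 4, 8, 12 all land in cache block 0; 1, 5, 9, 13 in block 1; etc.
```

Because the mapping is fixed, two blocks that share a cache slot (e.g. 0 and 4) always evict each other, the source of conflict misses that associativity later addresses.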
Direct Mapped: Search
[Figure: the 32-bit address is split into Tag | CacheIndex | ByteSel. The CacheIndex drives a decoder that selects one cache line; the stored Tag is compared (=) against the address Tag and, together with the Valid bit, produces Hit/Miss; the Data field supplies the selected bytes.]
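The address split in the figure can be sketched arithmetically. The block size and cache size below are illustrative choices, not values from the slides:

```python
def split_address(addr, block_bytes=16, num_blocks=4):
    """Split an address into (Tag, CacheIndex, ByteSel) for a direct-mapped
    cache with the given (hypothetical) block size and number of blocks."""
    byte_sel = addr % block_bytes                  # which byte within the block
    index = (addr // block_bytes) % num_blocks     # which cache line to check
    tag = addr // (block_bytes * num_blocks)       # disambiguates aliasing blocks
    return tag, index, byte_sel

tag, index, byte_sel = split_address(0x1234)
# A hit requires: Valid[index] and StoredTag[index] == tag
```

In hardware the same split is free: the fields are just bit ranges of the address, which is why cache sizes and block sizes are powers of two.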
Block Placement: 2-way Associative
[Figure: main memory blocks 0-15 mapping onto a 2-set, 2-way cache (Set 0 and Set 1).]
The main memory blocks that map to each set are:
Set 0: 0, 2, 4, 6, 8, 10, 12, 14
Set 1: 1, 3, 5, 7, 9, 11, 13, 15
Within each set, the blocks can be in either of the locations.
The formula is: SetNumber = BlockAddress mod NumSets
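A minimal sketch of set-associative placement, using the slide's 2-set, 2-way configuration; the helper name `place` is invented for illustration:

```python
NUM_SETS, WAYS = 2, 2
cache = [[None] * WAYS for _ in range(NUM_SETS)]   # cache[set][way] holds a block address

def place(block_addr):
    """Place a block: the set is fixed by the modulo, the way is free."""
    s = block_addr % NUM_SETS            # SetNumber = BlockAddress mod NumSets
    for way in range(WAYS):              # within the set, any free way will do
        if cache[s][way] is None:
            cache[s][way] = block_addr
            return s, way
    return s, None                       # set full: a replacement policy must pick a victim

place(0)    # even block -> set 0, way 0
place(2)    # set 0 again, but the other way: no conflict with block 0
place(1)    # odd block -> set 1
```

Compare with the direct-mapped case: blocks 0 and 2 would have evicted each other there, but here they coexist in set 0.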
2-Way Associative: Search
[Figure: address split into Tag | CacheIndex | ByteSel; two ways, each with Valid, Tag, and Data arrays, a decoder, a tag comparator (=), and a tristate driver onto a shared data bus, producing per-way signals Hit/Miss_Set0 and Hit/Miss_Set1.]
Exercises:
a) Complete the wiring.
b) How do you generate the final Hit/Miss signal?
c) Extend the design to a fully associative cache.
d) What happens to MissRate with associativity?
e) What happens to MissRate with size?
f) What happens to cycle time with associativity and size?
Replacement
Random: randomly select the cache block to replace.
LRU (Least Recently Used): select the cache block that was accessed longest ago.
MRU (Most Recently Used): avoid replacing cache blocks that were accessed recently.
LFU (Least Frequently Used): choose the cache block that was used least often.
LRU Replacement
Example: a 3-block cache, access sequence 1, 4, 3, 1, 2. The stack (most recent on top) evolves: [1] → [4,1] → [3,4,1] → [1,3,4]; access 2 misses, so block 4, at the bottom of the stack and least recently used, is replaced: [2,1,3].
Stack data structure: push the most recently accessed cache block number at the top; replace from the bottom of the stack. How to implement in hardware?
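The stack discipline described above can be sketched directly; the class name is invented for illustration:

```python
from collections import deque

class LRUStack:
    """Stack-model LRU: front of the deque is MRU, back is LRU."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.stack = deque()

    def access(self, block):
        hit = block in self.stack
        if hit:
            self.stack.remove(block)         # pull the block to the top on a hit
        elif len(self.stack) == self.capacity:
            self.stack.pop()                 # evict from the bottom (least recent)
        self.stack.appendleft(block)         # most recent goes on top
        return hit

lru = LRUStack(3)
for b in [1, 4, 3, 1]:
    lru.access(b)
lru.access(2)            # miss: block 4 is at the bottom, so 4 is evicted
# stack is now [2, 1, 3], matching the slide's trace
```

The `remove`-from-the-middle step is exactly the operation that makes a literal hardware stack expensive, motivating the counter-based approximations on the next slide.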
LRU Hardware Implementation
Implement a stack: complicated to maintain for large associativities, especially removal from the middle and insertion at the top.
Counter based: keep a timestamp counter of a small number of bits; update each accessed block's timestamp with the current value; replace the block with the smallest timestamp; periodically clear the timestamps as a background process.
Example: a 1-bit timestamp gives an MRU approximation.
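A sketch of the 1-bit timestamp (MRU-bit) approximation, assuming at least 2 ways; the class and method names are invented for illustration:

```python
class OneBitMRU:
    """Each block has one MRU bit: set on access, clear all when every bit
    is set, and replace any block whose bit is clear. Assumes ways >= 2."""
    def __init__(self, ways):
        self.blocks = [None] * ways
        self.mru = [0] * ways

    def _touch(self, i):
        self.mru[i] = 1
        if all(self.mru):                    # all marked MRU: reset the others
            self.mru = [0] * len(self.mru)
            self.mru[i] = 1

    def access(self, block):
        if block in self.blocks:             # hit: refresh this block's bit
            self._touch(self.blocks.index(block))
            return True
        if None in self.blocks:
            victim = self.blocks.index(None)   # fill an empty way first
        else:
            victim = self.mru.index(0)         # replace any non-MRU block
        self.blocks[victim] = block
        self._touch(victim)
        return False
```

This keeps only one bit of history per block, so it protects the single most recently used block rather than tracking full recency order, which is the trade-off the slide's "background clear" scheme accepts.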
LRU/MRU Implementation
[Figure: the direct-mapped search structure extended with an MRU bit per line, alongside Valid, Tag, and Data; address split into Tag | CacheIndex | ByteSel; tag comparison (=) produces Hit/Miss.]
Write Through Write Policy
On a hit, update the cache block as well as the block in main memory. Every write incurs traffic to the main memory, and the processor has to wait for the main memory to be updated before continuing (write stall), unless a write buffer is used: stores to main memory are held in the write buffer and the processor continues operation. The buffer needs to accommodate a burst of stores. What if the store buffer gets full?
Write Back
Writes are done only to cache blocks, so multiple writes to the same block don't incur main memory traffic. On a cache block eviction, check whether the block needs to be written back to main memory. This needs an extra dirty bit per cache line.
Write Back Structure
[Figure: the cache structure extended with a Dirty bit per line, alongside MRU, Valid, Tag, and Data; address split into Tag | CacheIndex | ByteSel; tag comparison (=) produces Hit/Miss.]
Dealing with a Write Miss
Write allocate: load the block into the cache on a write miss, similar to a read miss. Typically used with a write-back policy.
No-write allocate: the block is modified at the lower level and not brought into this level. Typically used with a write-through policy.
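A toy model contrasting the two write-miss policies, counting only the direct stores sent to main memory (block fetches are not modeled); the class and return strings are invented for illustration:

```python
class TinyCache:
    """Direct-mapped tag array, tracking where each write goes."""
    def __init__(self, num_blocks, write_allocate):
        self.tags = [None] * num_blocks
        self.write_allocate = write_allocate
        self.mem_writes = 0                        # stores sent to main memory

    def write(self, block_addr):
        idx = block_addr % len(self.tags)
        if self.tags[idx] == block_addr:
            return "hit"                           # write-back style: cache only
        if self.write_allocate:
            self.tags[idx] = block_addr            # bring the block in, like a read miss
            return "miss-allocated"
        self.mem_writes += 1                       # modify in memory; don't bring block in
        return "miss-no-allocate"

wa = TinyCache(4, write_allocate=True)
nwa = TinyCache(4, write_allocate=False)
wa_results = [wa.write(5) for _ in range(3)]    # first write allocates, the rest hit
nwa_results = [nwa.write(5) for _ in range(3)]  # every write goes to main memory
```

Repeated writes to one block favor write allocate (later writes hit in cache), which is why it pairs naturally with write back, while no-write allocate pairs with write through.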
Exercise: Design the cache management unit for a write-back cache with associativity of 4 and an LRU approximation.
Multi-core Processors
[Figure: multiple cores (P0, ..., P3), each with its own L1 cache, all sharing an L2 cache.]
Cache Coherency
The same memory block can be present in multiple L1 caches. If one processor updates its local copy, how do we make all the copies the same? We need cache coherence protocols.