CPUs
Caches. Memory management. CPU performance.

Caching: The Basic Idea
- Main memory stores words (A-Z in the example).
- The cache stores a subset of the words (4 in the example), organized into lines of multiple words to exploit spatial locality.
- A word must be in the cache for the processor to access it.
[Figure: a small, fast cache holding A, B, G, H (A and B form one cache line) sits between the processor and a big, slow memory holding A-Z.]
Caches and CPUs
[Figure: the cache controller sits between the CPU and main memory, checking the cache for each address the CPU issues and going to main memory only on a miss.]
- Each main memory location is mapped onto a cache entry.
- May have caches for: instructions; data; data + instructions (unified).
- Memory access time is no longer deterministic!

Locality of Reference
- Principle of locality: programs tend to reuse data and instructions near those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.

Locality in an example:

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *v = sum;

- Data: references array elements in succession (spatial).
- Instructions: references instructions in sequence (spatial); cycles through the loop repeatedly (temporal).

Cache performance benefits
- Keeps frequently accessed locations in the fast cache.
- The cache retrieves more than one word at a time: a simple prediction of what will be used next, so sequential accesses are faster after the first access.
Terms
- Cache hit: the required location is in the cache.
- Cache miss: the required location is not in the cache.
- Types of misses:
  - Compulsory (cold): the location has never been accessed; the very first read of a[0] misses no matter how large the cache is.
  - Capacity: the working set is too large for the cache.
  - Conflict: multiple locations in the working set map to the same cache entry.
- Working set: the set of locations used by a program in a time interval.
Memory system performance
- h = cache hit rate.
- t_cache = cache access time, t_main = main memory access time.
- Average memory access time:

    t_av = h * t_cache + (1 - h) * t_main

Multiple levels of cache (CPU -> L1 cache -> L2 cache -> main memory)
- h1 = L1 cache hit rate.
- h2 = rate of accesses that miss in L1 but hit in L2.
- Average memory access time:

    t_av = h1 * t_L1 + h2 * t_L2 + (1 - h1 - h2) * t_main

- Bug in the book's formulation: defining h2 relative to all accesses is error-prone; it is better to specify h2 as the L2 hit rate measured over L1 misses, as below.

Multiple levels of cache (revised formula)
- h1 = L1 cache hit rate.
- h2 = L2 cache hit rate (measured over L1 misses).
- Average memory access time:

    t_av = h1 * t_L1 + (1 - h1) * h2 * t_L2 + (1 - h1) * (1 - h2) * t_main
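A quick numeric check of the revised formula, with illustrative values that are not from any particular CPU: h1 = 0.9, h2 = 0.9, t_L1 = 1 cycle, t_L2 = 10 cycles, t_main = 100 cycles give t_av = 0.9(1) + 0.1(0.9)(10) + 0.1(0.1)(100) = 0.9 + 0.9 + 1.0 = 2.8 cycles. A minimal C sketch of the same computation:

    #include <stdio.h>

    /* Average memory access time for a two-level cache hierarchy.
       h1: L1 hit rate; h2: L2 hit rate measured over L1 misses. */
    static double t_av(double h1, double h2,
                       double t_l1, double t_l2, double t_main)
    {
        return h1 * t_l1
             + (1.0 - h1) * h2 * t_l2
             + (1.0 - h1) * (1.0 - h2) * t_main;
    }

    int main(void)
    {
        /* Illustrative numbers only. */
        printf("t_av = %.1f cycles\n", t_av(0.9, 0.9, 1.0, 10.0, 100.0));
        return 0;
    }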
Multi-level cache access time (revised formula)
- h1 = L1 cache hit rate; t_L1 = L1 cache access time.
- h2 = L2 cache hit rate; t_L2 = L2 cache access time.
- t_main = main memory access time.
- Average memory access time:

    t_av = h1 * t_L1 + h2 * (1 - h1) * t_L2 + (1 - h2) * (1 - h1) * t_main

Computer System: Cache Concept
[Figure: caches appear throughout the system: an on-chip cache at the processor, a row cache in memory, disk caches at the disks, and net/web caches on the network side, all connected by the memory-I/O bus to the I/O controllers, disks, display, and network.]

Design Issues for Caches
Key questions:
- Where should a line be placed in the cache? (line placement)
- How is a line found in the cache? (line identification)
- Which line should be replaced on a miss? (line replacement)
- What happens on a write? (write strategy)
Constraints:
- The design must be very simple: it is realized in hardware, and all decision making happens on a nanosecond time scale.
- We want to optimize performance for typical programs: do extensive benchmarking and simulation; there are many subtle engineering tradeoffs.
Cache organizations
- Direct-mapped: each memory location maps onto exactly one cache entry.
- N-way set-associative: each memory location can go into any of the n entries of one set.
- Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).

Direct-mapped cache
[Figure: main memory locations 0x0000, 0x0004, 0x0008, ... map onto cache entries. Each entry holds a valid bit, a tag (e.g. 0xabcd), and a block of bytes. The address splits into tag, index, and offset fields: the index selects the entry, the stored tag is compared against the address tag to produce the hit signal, and the offset selects the byte within the block.]
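A minimal C sketch of a direct-mapped lookup, assuming a hypothetical cache with 32-byte blocks and 32 lines (1 KB of data); the field splits follow from those assumed sizes:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_BYTES 32u              /* assumed block size */
    #define NUM_LINES   32u              /* 1 KB of data / 32-byte blocks */

    struct line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_BYTES];
    };

    static struct line cache[NUM_LINES];

    /* Split the address into offset, index, and tag, then probe the cache. */
    static bool lookup(uint32_t addr, uint8_t *out)
    {
        uint32_t offset = addr % BLOCK_BYTES;                /* low 5 bits  */
        uint32_t index  = (addr / BLOCK_BYTES) % NUM_LINES;  /* next 5 bits */
        uint32_t tag    = addr / (BLOCK_BYTES * NUM_LINES);  /* upper bits  */

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data[offset];                /* hit */
            return true;
        }
        return false;                                        /* miss */
    }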
Direct-mapped cache locations
- Many locations map onto the same cache block.
- Conflict misses are easy to generate:
  - Array a[] uses locations 0, 1, 2, ...
  - Array b[] uses locations 1024, 1025, 1026, ...
  - The operation a[i] + b[i] generates conflict misses (worked out below).
- How might we improve the cache? What are the problems? What are the solutions?

Set-associative cache
- A set-associative cache behaves like n direct-mapped caches operating in parallel.
[Figure: banks Set 1, Set 2, ..., Set n are probed in parallel; a hit in any bank supplies the data.]
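Running the a[i] + b[i] loop against the 1 KB direct-mapped sketch above (assuming byte addresses and the array bases from the slide) shows the collision. Address 0 (a[0]) has index 0 and tag 0; address 1024 (b[0]) has index (1024 / 32) mod 32 = 0 and tag 1. The two blocks fight over line 0, so each access to a[i] evicts the block b[i] just loaded and vice versa, and every access misses even though both arrays together would fit in a 2 KB cache. Making the cache 2-way set-associative removes this particular conflict, since the two blocks can occupy the two ways of the same set.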
Indexing into a 2-Way Associative Cache
- Use the middle s bits of the physical address to select from among S = 2^s sets.
[Figure: each set holds two lines, each with a valid bit, a tag, and data bytes 0 .. B-1; the address divides into tag (t bits), set index (s bits), and offset (b bits).]

Example: direct-mapped vs. set-associative
Memory contents (3-bit addresses):

    address: 000  001  010  011  100  101  110  111
    data:    0101 1111 0000 0110 1000 0001 1010 0100

The program accesses addresses 001, 010, 011, 100, 101, and 111 in sequence. Direct-mapped cache behavior (2-bit index, 1-bit tag):

After the 001 access:

    block  tag  data
    00     -    -
    01     0    1111
    10     -    -
    11     -    -

After the 010 access:

    block  tag  data
    00     -    -
    01     0    1111
    10     0    0000
    11     -    -
Direct-mapped cache (cont'd.)

After the 011 access:

    block  tag  data
    00     -    -
    01     0    1111
    10     0    0000
    11     0    0110

After the 100 access:

    block  tag  data
    00     1    1000
    01     0    1111
    10     0    0000
    11     0    0110

After the 101 access:

    block  tag  data
    00     1    1000
    01     1    0001
    10     0    0000
    11     0    0110

After the 111 access:

    block  tag  data
    00     1    1000
    01     1    0001
    10     0    0000
    11     1    0100

2-way set-associative cache behavior
Final state of the cache (twice as big as the direct-mapped cache, keeping the 2-bit index):

    set  blk-0 tag  blk-0 data  blk-1 tag  blk-1 data
    00   1          1000        -          -
    01   0          1111        1          0001
    10   0          0000        -          -
    11   0          0110        1          0100
2-way set-associative cache behavior
Final state of the cache (same size as the direct-mapped cache, so a 1-bit set index):

    set  blk-0 tag  blk-0 data  blk-1 tag  blk-1 data
    0    01         0000        10         1000
    1    10         0001        11         0100

Example caches
- StrongARM:
  - 16 KB, 32-way, 32-byte block instruction cache;
  - 16 KB, 32-way, 32-byte block data cache (write-back).
- SHARC:
  - 32-instruction, 2-way instruction cache.
Replacement policies
- Replacement policy: the strategy for choosing which cache entry to throw out to make room for a new memory location.
- Two popular strategies: random; least-recently used (LRU).

Write strategy
- Write-through: immediately copy each write to main memory.
- Write-back: write to main memory only when the location is removed from the cache.
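For a 2-way set-associative cache, LRU needs only one bit per set to track which way was used less recently. A minimal sketch with invented structure names (data storage and write handling omitted):

    #include <stdbool.h>
    #include <stdint.h>

    struct way { bool valid; uint32_t tag; };

    struct set {
        struct way way[2];
        uint8_t    lru;      /* index of the least-recently-used way */
    };

    /* Probe one set; on a miss, fill the LRU way. Returns true on a hit. */
    static bool access_set(struct set *s, uint32_t tag)
    {
        for (int w = 0; w < 2; w++) {
            if (s->way[w].valid && s->way[w].tag == tag) {
                s->lru = (uint8_t)(1 - w);   /* the other way is now LRU */
                return true;
            }
        }
        int victim = s->lru;                 /* evict least-recently used */
        s->way[victim].valid = true;
        s->way[victim].tag   = tag;
        s->lru = (uint8_t)(1 - victim);
        return false;
    }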
Memory management units
- The memory management unit (MMU) translates addresses.
[Figure: the CPU issues a logical address; the MMU translates it into a physical address that is sent to main memory.]

Memory management tasks
- Allows programs to move in physical memory during execution.
- Allows virtual memory:
  - memory images are kept in secondary storage;
  - images are returned to main memory on demand during execution.
- Page fault: a request for a location not resident in memory.

Address translation
- Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses.
- Two basic schemes: segmented; paged.
- Segmentation and paging can be combined (as on the x86).
Segments and pages
[Figure: memory holds large, variable-sized segment 1 and segment 2, alongside small fixed-size page 1 and page 2.]

Segment address translation
- Physical address = segment base address + logical address.
- The result is range-checked against the segment's lower and upper bounds; an out-of-range access raises a range error.

Page address translation
- The logical address splits into a page number and an offset.
- Physical address = (base address of page i) concatenated with the offset.
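A minimal C sketch of the two schemes, assuming 4 KB pages and invented table structures (this is not any specific MMU's layout):

    #include <stdint.h>

    /* --- Segmented translation: base + offset, with a bounds check. --- */
    struct segment { uint32_t base, limit; };   /* limit = segment size */

    static int seg_translate(const struct segment *s,
                             uint32_t logical, uint32_t *physical)
    {
        if (logical >= s->limit)
            return -1;                    /* range error */
        *physical = s->base + logical;
        return 0;
    }

    /* --- Paged translation: concatenate page base with the offset. --- */
    #define PAGE_BITS 12u                 /* assumed 4 KB pages */

    /* page_table[page] holds the physical page (frame) number. */
    static uint32_t page_translate(const uint32_t *page_table,
                                   uint32_t logical)
    {
        uint32_t page   = logical >> PAGE_BITS;
        uint32_t offset = logical & ((1u << PAGE_BITS) - 1u);
        return (page_table[page] << PAGE_BITS) | offset;
    }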
Page table organizations
[Figure: a flat page table indexes every page descriptor directly; a tree organizes the descriptors into multiple levels.]

Caching address translations
- Large translation tables require main memory accesses.
- TLB: a cache for address translations. Typically small.

ARM memory management
- Memory region types:
  - section: 1 MB block;
  - large page: 64 KB;
  - small page: 4 KB.
- An address is marked as section-mapped or page-mapped.
- Two-level translation scheme (walked through in the sketch below and diagrammed on the next slide).
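A minimal sketch of a two-level table walk for 4 KB small pages, assuming a 4096-entry first-level table and descriptors reduced to bare table/page addresses (real ARM descriptors also carry type and permission bits this sketch omits):

    #include <stdint.h>

    /* Hypothetical two-level walk: the virtual address is split into a
       first-level index, a second-level index, and a page offset. */
    static uint32_t walk(const uint32_t *first_level, uint32_t vaddr)
    {
        uint32_t idx1   = vaddr >> 20;            /* bits 31:20, 4096 entries */
        uint32_t idx2   = (vaddr >> 12) & 0xff;   /* bits 19:12, 256 entries  */
        uint32_t offset = vaddr & 0xfff;          /* bits 11:0                */

        /* The first-level descriptor points at a second-level table. */
        const uint32_t *second_level =
            (const uint32_t *)(uintptr_t)first_level[idx1];

        /* The second-level descriptor holds the page base;
           concatenate it with the offset. */
        return (second_level[idx2] & ~0xfffu) | offset;
    }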
ARM address translation
[Figure: the translation table base register points to the first-level table; the first index selects a first-level descriptor, which points to a second-level table; the second index selects a second-level descriptor, whose page base is concatenated with the offset to form the physical address.]

Elements of CPU performance
- Cycle time.
- CPU pipeline.
- Memory system.

Pipelining
- Several instructions are executed simultaneously, each at a different stage of completion.
- Various conditions can cause pipeline bubbles that reduce utilization: branches; memory system delays; etc.
Pipeline structures
- Both ARM and SHARC have 3-stage pipelines:
  - fetch the instruction from memory;
  - decode the opcode and operands;
  - execute.

ARM pipeline execution

    time:          1       2       3       4       5
    add r0,r1,#5   fetch   decode  execute
    sub r2,r3,r6           fetch   decode  execute
    cmp r2,#3                      fetch   decode  execute

Performance measures
- Latency: the time it takes for an instruction to get through the pipeline.
- Throughput: the number of instructions executed per time period.
- Pipelining increases throughput without reducing latency: in the diagram above, each instruction still takes 3 cycles, but once the pipeline is full one instruction completes every cycle.
Pipeline stalls
- If any stage cannot be completed in the standard amount of time, the pipeline stalls.
- The bubbles introduced by a stall increase latency and reduce throughput.

ARM multi-cycle LDMIA instruction
The ldmia spends two execute cycles (one load per register), stalling the instructions behind it:

    time:             1      2       3          4          5       6
    ldmia r0,{r2,r3}  fetch  decode  ex(ld r2)  ex(ld r3)
    sub r2,r3,r6             fetch   decode     (stall)    ex sub
    cmp r2,#3                        fetch      (stall)    decode  ex cmp

Control stalls
- Branches often introduce stalls (branch penalty).
- The stall time may depend on whether the branch is taken.
- Instructions that have already started executing may have to be squashed.
- The CPU doesn't know what to fetch until the branch condition is evaluated.
ARM pipelined branch
The taken branch spends extra cycles in execute; the instruction behind it is squashed, and the target is fetched only once the branch resolves:

    time:              1      2       3       4       5      6       7
    bne foo            fetch  decode  ex bne  ex bne  ex bne
    sub r2,r3,r6              fetch   decode  (squashed)
    foo add r0,r1,r2                                  fetch  decode  ex add

Delayed branch
- To increase pipeline efficiency, a delayed branch mechanism requires that the n instructions after the branch always be executed, whether or not the branch is taken.
- SHARC supports both delayed and non-delayed branches, selected by a bit in the branch instruction; the branch delay slot is 2 instructions.

Example: ARM execution time
Determine the execution time of an FIR filter:

    for (i = 0; i < n; i++)
        f = f + c[i] * x[i];

- Only the branch in the loop test may take more than one cycle: BLT loop takes 1 cycle in the best case, 3 in the worst case.
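As a rough cost model (the 1- and 3-cycle branch figures come from the slide; the body count is an invented illustration): if one iteration's body takes t_body cycles and the BLT takes t_branch, the loop costs about n * (t_body + t_branch) cycles. With a hypothetical 8-cycle body and n = 100, that is 100 * (8 + 1) = 900 cycles in the best case and 100 * (8 + 3) = 1100 in the worst, so the branch alone accounts for the 200-cycle spread.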
Superscalar execution
- A superscalar processor can execute several instructions per cycle, using multiple pipelined datapaths.
- Programs execute faster, but it is harder to determine how much faster.

Data dependencies
- Execution time depends on the operands, not just the opcode.
- A superscalar CPU checks dependencies dynamically; here the second add cannot issue alongside the first because it needs r2:

    add r2,r0,r1
    add r3,r2,r5

[Figure: dependency graph; r0 and r1 feed r2, which together with r5 feeds r3.]

Memory system performance
- Caches introduce indeterminacy in execution time, which depends on the order of execution.
- Cache miss penalty: the added time due to a cache miss.
- There are several reasons for a miss: compulsory, conflict, capacity.