Caches. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University


1 Caches Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

2 Memory Technology
Static RAM (SRAM): 0.5 ns to 2.5 ns, $2000 to $5000 per GB
Dynamic RAM (DRAM): 50 ns to 70 ns, $20 to $75 per GB
Magnetic disk: 5 ms to 20 ms, $0.20 to $2 per GB
Ideal memory
  Access time of SRAM
  Capacity and cost/GB of disk

3 DRAM Organization (1)
Figure: Micron MT4LC16M4T8 (16M x 4 bit)

4 DRAM Organization (2)
DRAM configuration
  Asynchronous: no clock
  Large capacity: 1 to 4 Gb
    Arranged as a 2D matrix
    Minimizes wire length
    Maximizes refresh efficiency
  Narrow data interface: 1 to 16 bits (x1, x4, x8, x16)
    Cheap packages: few bus pins
    Pins are expensive
  Narrow address interface
    Multiplexed address lines: row and column address
    Signaled by RAS# and CAS#, respectively

5 DRAM Operation (1): Read operation, step 1 (figure; labeled: word line)

6 DRAM Operation (2): Read operation, step 2 (figure; labeled: word line, bit line)

7 DRAM Operation (3): Read operation, step 3 (figure; labeled: word line, bit line)

8 DRAM Operation (4): Read cycle (figure)

9 DRAM Operation (5)
Timing parameters
  tRC: minimum time from the start of one row access to the start of the next (cycle time)
  tRAC: minimum time from the RAS# line falling to valid data output (access time)
    Used to be quoted as the nominal speed of a DRAM chip
  tCAC: minimum time from the CAS# line falling to valid data output

  Model        tRC      tRAC    tCAC
  MT4LC16M4T   ... ns   50 ns   13 ns
  MT4LC16M4T   ... ns   60 ns   15 ns

10 DRAM Generations
Table: Year, Capacity (Kbit through Gbit), $/GB
Figure: Trac and Tcac by year ('80, '83, '85, '89, '92, '96, '98, '00, '04, '07)

11 Principle of Locality
Programs access a small portion of their address space at any time
Temporal locality
  Items accessed recently are likely to be accessed again soon
  e.g., instructions in a loop, induction variables
Spatial locality
  Items near those accessed recently are likely to be accessed soon
  e.g., sequential instruction access, array data
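A rough way to see spatial locality is to count how often consecutive accesses move to a different block. The sketch below is only a model, not a measurement: the array shape, the block size, and the helper name `block_switches` are all assumptions made for illustration.

```python
def block_switches(addresses, block_size):
    """Count how often consecutive accesses land in a different block."""
    blocks = [addr // block_size for addr in addresses]
    return sum(1 for prev, cur in zip(blocks, blocks[1:]) if prev != cur)


ROWS, COLS = 4, 8   # hypothetical 4 x 8 array of one-word elements
BLOCK_SIZE = 4      # hypothetical block size, in words

# Row-major storage: element (r, c) lives at word address r * COLS + c.
row_order = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_order = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

row_switches = block_switches(row_order, BLOCK_SIZE)
col_switches = block_switches(col_order, BLOCK_SIZE)
print("row-order block switches:", row_switches)
print("column-order block switches:", col_switches)
```

Walking the array in storage order stays inside a block for several accesses in a row, while the column-order walk changes block on every access in this model.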

12 Memory Hierarchy (1)
"We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
  -- A. W. Burks, H. H. Goldstine, J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, June 1946.
Taking advantage of locality
  Store everything on disk
  Copy recently accessed (and nearby) items from disk to smaller DRAM memory (main memory)
  Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory (cache memory)

13 Memory Hierarchy (2)
Toward the top: smaller, faster, and costlier (per byte)
Toward the bottom: larger, slower, and cheaper (per byte)
  L0: registers
    CPU registers hold words retrieved from L1 cache.
  L1: on-chip L1 cache (SRAM)
    L1 cache holds cache lines retrieved from the L2 cache memory.
  L2: on-chip L2 cache (SRAM)
    L2 cache holds cache lines retrieved from main memory.
  L3: main memory (DRAM)
    Main memory holds disk blocks retrieved from local disks.
  L4: local secondary storage (local magnetic disks)
    Local disks hold files retrieved from disks on remote network servers.
  L5: remote secondary storage (distributed file systems, Web servers)


14 Memory Hierarchy (3)
Terminologies
  Block (aka line): unit of copying
    May be multiple words
  Hit: if accessed data is present in the upper level, the access is satisfied by the upper level
    Hit ratio: hits/accesses
  Miss: if accessed data is absent, the block is copied from the lower level
    Time taken: miss penalty
    Miss ratio: misses/accesses = 1 - hit ratio
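The two ratios can be computed directly from a record of access outcomes. This is a minimal sketch; the function name `hit_and_miss_ratio` and the sample trace are made up for illustration.

```python
def hit_and_miss_ratio(outcomes):
    """Return (hit ratio, miss ratio) for a list of "hit"/"miss" strings."""
    accesses = len(outcomes)
    hits = sum(1 for outcome in outcomes if outcome == "hit")
    hit_ratio = hits / accesses
    miss_ratio = 1 - hit_ratio   # same as misses / accesses
    return hit_ratio, miss_ratio


# Hypothetical trace: 8 accesses, 5 of them hits.
sample_outcomes = ["miss", "miss", "hit", "hit", "miss", "hit", "hit", "hit"]
hit_ratio, miss_ratio = hit_and_miss_ratio(sample_outcomes)
print(f"hit ratio = {hit_ratio}, miss ratio = {miss_ratio}")
```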

15 Caches (1)
Cache memory
  The level of the memory hierarchy closest to the CPU
Given accesses X_1, ..., X_(n-1), X_n
  How do we know if the data is present?
  Where do we look?

16 Caches (2)
Direct mapped cache
  Location determined by address
  Direct mapped: only one choice
    (Block address) modulo (#Blocks in cache)
  #Blocks is a power of 2
  Use low-order address bits
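Because the number of blocks is a power of 2, the modulo can be taken by keeping only the low-order bits. The sketch below checks that equivalence for a few sample block addresses; the function names and the sample values are assumptions chosen for illustration.

```python
NUM_BLOCKS = 8  # must be a power of 2 for the bit-mask form to be valid


def index_by_modulo(block_address):
    """(Block address) modulo (#Blocks in cache)."""
    return block_address % NUM_BLOCKS


def index_by_mask(block_address):
    """Same result, obtained by keeping the low-order log2(NUM_BLOCKS) bits."""
    return block_address & (NUM_BLOCKS - 1)


SAMPLE_BLOCK_ADDRESSES = (22, 26, 16, 3, 18)  # arbitrary sample values
indices = {}
for block_address in SAMPLE_BLOCK_ADDRESSES:
    assert index_by_modulo(block_address) == index_by_mask(block_address)
    indices[block_address] = index_by_mask(block_address)
    print(f"{block_address:2d} = {block_address:05b} -> index {indices[block_address]:03b}")
```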


17 Caches (3)
Tags and valid bits
  How do we know which particular block is stored in a cache location?
    Store the block address as well as the data
    Actually, only the high-order bits are needed
    Called the tag
  Address layout: | Tag | Index | Offset |
  What if there is no data in a location?
    Valid bit: 1 = present, 0 = not present
    Initially 0
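The Tag / Index / Offset layout can be sketched with shifts and masks. The names `Fields`, `split_address`, and `join_fields` are hypothetical, and the field widths are passed in as parameters rather than taken from any particular machine.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    tag: int
    index: int
    offset: int


def split_address(address, index_bits, offset_bits):
    """Split an address into (tag, index, offset), low-order bits last."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return Fields(tag, index, offset)


def join_fields(fields, index_bits, offset_bits):
    """Inverse of split_address: rebuild the original address."""
    return ((fields.tag << (index_bits + offset_bits))
            | (fields.index << offset_bits)
            | fields.offset)


# Word address 10110 (binary) in a cache with a 3-bit index and no offset bits.
fields = split_address(0b10110, index_bits=3, offset_bits=0)
print(fields)
```

Only the tag needs to be stored next to the data, since the index is implied by the location and the offset selects a byte or word within the block.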

18 Cache Example (1)
8 blocks, 1 word/block, direct mapped
Initial state

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

19 Cache Example (2)

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

20 Cache Example (3)

  Word addr  Binary addr  Hit/miss  Cache block
  26         11 010       Miss      010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

21 Cache Example (4)

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Hit       110
  26         11 010       Hit       010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

22 Cache Example (5)

  Word addr  Binary addr  Hit/miss  Cache block
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  11   Mem[11010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

23 Cache Example (6)

  Word addr  Binary addr  Hit/miss  Cache block
  18         10 010       Miss      010

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N
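The whole example can be replayed with a small simulator. This is a sketch under stated assumptions, not the course's code: the class name `DirectMappedCache` is hypothetical, and the access trace (22, 26, 22, 26, 16, 3, 16, 18) is reconstructed from the Mem[...] entries and hit/miss columns in the tables above.

```python
class DirectMappedCache:
    """Direct-mapped cache with one word per block, tracking tags only."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks   # valid bits, initially 0
        self.tags = [0] * num_blocks

    def access(self, block_address):
        index = block_address % self.num_blocks
        tag = block_address // self.num_blocks
        if self.valid[index] and self.tags[index] == tag:
            return "hit"
        # Miss: fetch the block and replace whatever occupied this slot.
        self.valid[index] = True
        self.tags[index] = tag
        return "miss"


cache = DirectMappedCache(num_blocks=8)
trace = [22, 26, 22, 26, 16, 3, 16, 18]   # assumed trace, see lead-in
outcomes = [cache.access(address) for address in trace]

for address, outcome in zip(trace, outcomes):
    print(f"{address:2d} ({address:05b}): {outcome}")
```

The final access to 18 misses even though index 010 is valid, because the stored tag (11) differs from the requested tag (10); the new block replaces the old one.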

24 Address Subdivision (figure)


25 Block Size (1)
Example: larger block size
  64 blocks, 16 bytes/block
  To what block number does address 1200 map?
    Block address = floor(1200/16) = 75
    Block number = 75 modulo 64 = 11

  | Tag: 22 bits | Index: 6 bits | Offset: 4 bits |
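The field widths and the mapping of address 1200 follow from the cache geometry. The sketch below assumes 32-bit byte addresses, which the slide does not state but which is consistent with 22 + 6 + 4 bits; the constant names are illustrative.

```python
ADDRESS_BITS = 32   # assumption: 22 + 6 + 4 = 32
NUM_BLOCKS = 64
BLOCK_BYTES = 16

# For a power of 2, bit_length() - 1 is its base-2 logarithm.
offset_bits = BLOCK_BYTES.bit_length() - 1
index_bits = NUM_BLOCKS.bit_length() - 1
tag_bits = ADDRESS_BITS - index_bits - offset_bits

address = 1200
block_address = address // BLOCK_BYTES       # drop the offset bits
block_number = block_address % NUM_BLOCKS    # low-order index bits

print(f"tag/index/offset bits: {tag_bits}/{index_bits}/{offset_bits}")
print(f"address {address} -> block address {block_address} -> block number {block_number}")
```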

26 Block Size (2)
Considerations
  Larger blocks should reduce miss rate
    Due to spatial locality
  Figure: SPEC92 results

27 Block Size (3)
Considerations (cont'd)
  But in a fixed-size cache
    Larger blocks mean fewer of them
      More competition, so increased miss rate
    Larger blocks cause pollution
  Larger miss penalty
    Can override the benefit of reduced miss rate
    Early restart and critical-word-first can help

28 Main Memory & Caches
Use DRAMs for main memory
  Fixed width (e.g., 1 word)
  Connected by fixed-width clocked bus
    Bus clock is typically slower than CPU clock
Example cache block read
  1 bus cycle for address transfer
  15 bus cycles per DRAM access
  1 bus cycle per data transfer
For 4-word block, 1-word-wide DRAM
  Miss penalty = 1 + 4 x 15 + 4 x 1 = 65 bus cycles
  Bandwidth = 16 bytes / 65 cycles ≈ 0.25 B/cycle

29 Increasing Memory Bandwidth
4-word-wide memory
  Miss penalty = 1 + 15 + 1 = 17 bus cycles
  Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
4-bank interleaved memory
  Miss penalty = 1 + 15 + 4 x 1 = 20 bus cycles
  Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
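The miss penalties for the three memory organizations can be checked with a few lines of arithmetic. This sketch uses the cycle counts from the example cache block read (1 address cycle, 15 cycles per DRAM access, 1 cycle per data transfer) and assumes 4-byte words, which is what 16 bytes per 4-word block implies; the function names are illustrative.

```python
ADDRESS_CYCLES = 1        # bus cycles to send the address
DRAM_ACCESS_CYCLES = 15   # bus cycles per DRAM access
TRANSFER_CYCLES = 1       # bus cycles per data transfer
WORDS_PER_BLOCK = 4
BYTES_PER_WORD = 4        # assumption: 16-byte block / 4 words
BLOCK_BYTES = WORDS_PER_BLOCK * BYTES_PER_WORD


def one_word_wide():
    # Every word needs its own DRAM access and its own transfer.
    return (ADDRESS_CYCLES
            + WORDS_PER_BLOCK * DRAM_ACCESS_CYCLES
            + WORDS_PER_BLOCK * TRANSFER_CYCLES)


def four_word_wide():
    # One access and one transfer move the whole block.
    return ADDRESS_CYCLES + DRAM_ACCESS_CYCLES + TRANSFER_CYCLES


def four_bank_interleaved():
    # The banks are accessed in parallel; words are transferred one at a time.
    return ADDRESS_CYCLES + DRAM_ACCESS_CYCLES + WORDS_PER_BLOCK * TRANSFER_CYCLES


penalties = {
    "1-word-wide": one_word_wide(),
    "4-word-wide": four_word_wide(),
    "4-bank interleaved": four_bank_interleaved(),
}
bandwidths = {name: BLOCK_BYTES / cycles for name, cycles in penalties.items()}

for name, cycles in penalties.items():
    print(f"{name}: {cycles} bus cycles, {bandwidths[name]:.2f} B/cycle")
```

Interleaving recovers most of the bandwidth of the wide memory while keeping the bus one word wide.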

30 Advanced DRAM Organization
Bits in a DRAM are organized as a rectangular array
  DRAM accesses an entire row
  Burst mode: supply successive words from a row with reduced latency
Synchronous DRAM (SDRAM)
  Add a clock signal to DRAM interfaces
Double data rate (DDR) DRAM
  Transfer on rising and falling clock edges


Jin-Soo Kim (jinsookim@skku.edu), Computer Systems Laboratory, Sungkyunkwan University, http://csl.skku.edu