CS 3410, Spring 2014 Computer Science Cornell University. See P&H Chapter: , 5.8, 5.15


1 CS 3410, Spring 2014 Computer Science Cornell University See P&H Chapter: , 5.8, 5.15

2 Code Stored in Memory (also, data and stack). [Figure: pipelined datapath with PC, +4, instruction memory, control, register file, immediate extend, jump/branch target computation, ALU and data memory, separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers. Stack, Data, Code Stored in Memory.]

3 Main memory is very very slow. Remember: SRAM: 6-8 transistors, no refresh, fast. DRAM: 1 transistor, denser, cheaper/bit, needs refresh.

4 Main memory is very very slow. CPU clock rates ~0.33 ns to 2 ns (3 GHz to 500 MHz)
Memory technology | Access time in nanosecs (ns) | Access time in cycles
SRAM (on chip) | 0.5-2.5 ns | 1-3 cycles
SRAM (off chip) | 1.5-30 ns | 5-15 cycles
DRAM | 50-70 ns | 150-200 cycles
SSD (Flash) | 5k-50k ns | Tens of thousands
Disk | 5M-20M ns | Millions

5 Main memory is very very slow. CPU clock rates ~0.33 ns to 2 ns (3 GHz to 500 MHz)
Memory technology | Access time in nanosecs (ns) | Access time in cycles | $ per GiB in 2012 | Capacity
SRAM (on chip) | 0.5-2.5 ns | 1-3 cycles | | 256 KB
SRAM (off chip) | 1.5-30 ns | 5-15 cycles | $4k | 32 MB
DRAM | 50-70 ns | 150-200 cycles | $10-$20 | 8 GB
SSD (Flash) | 5k-50k ns | Tens of thousands | $0.75-$1 | 512 GB
Disk | 5M-20M ns | Millions | $0.05-$0.1 | 4 TB

6 Memory Pyramid: RegFile (100s bytes), < 1 cycle access; L1 Cache (several KB), 1-3 cycle access; L2 Cache (½-32 MB), 5-15 cycle access (L3 becoming more common); Memory (128 MB to few GB), 150-200 cycle access; Disk (Many GB to few TB), 1,000,000+ cycle access. These are rough numbers: mileage may vary for latest/greatest. Caches usually made of SRAM.

7 Can we create an illusion of cheap, large and fast memory? Memory Pyramid: RegFile (100s bytes); L1 Cache (several KB); L2 Cache (½-32 MB); Memory (128 MB to few GB); Disk (Many GB to few TB).

8 Can we create an illusion of cheap, large and fast memory? Memory Pyramid: RegFile (100s bytes); L1 Cache (several KB); L2 Cache (½-32 MB); Memory (128 MB to few GB); Disk (Many GB to few TB). Yes, using caches and assuming temporal and spatial locality.

12 Caches vs memory vs tertiary storage: Tradeoffs. Cache organization: Direct Mapped, Fully Associative, N-way set associative. Caching Questions: How does a cache work? How fast? How big?

13 Writing a paper on Beren and Lúthien

14 Pick a small set of books; not entire shelf. Spend time on small set of chapters.

15 Pick a small set of books; not entire shelf. Spend time on small set of chapters. Sometimes get other books as well: Norse mythology, Tolkien biography. Your desk: out of space. Replace less useful books with new ones.

16 Pick a small set of books; not entire shelf: Cache vs. main memory, Working set (the subset in use). Spend time on small set of chapters: Cache hit, Locality of access: temporal and spatial. Sometimes go to other books: Cache may not have data (cache miss). Shelf out of space: Cache eviction policy.

17 int n = 4;
int k[] = { 3, 14, 0, 10 };

int fib(int i) {                    /* Temporal Locality */
    if (i <= 2) return i;
    else return fib(i-1) + fib(i-2);
}

int main(int ac, char **av) {
    for (int i = 0; i < n; i++) {   /* Spatial Locality */
        printi(fib(k[i]));
        prints("\n");
    }
}

18 If Mem[x] was accessed recently...
... then Mem[x] is likely to be accessed soon. Exploit temporal locality: Put recently accessed Mem[x] higher in memory hierarchy since it will likely be accessed again soon.
... then Mem[x ± ε] is likely to be accessed soon. Exploit spatial locality: Put entire block containing Mem[x] and surrounding addresses higher in memory hierarchy since nearby address will likely be accessed.
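The spatial-locality rule above can be illustrated with a small C sketch. The array size and the function names `sum_row_major` and `sum_col_major` are illustrative assumptions, not part of the lecture. Both functions compute the same sum; the row-major loop touches consecutive addresses, so each fetched block is fully used, while the column-major loop strides across rows. How large the speed difference is depends on the actual cache and is not claimed here.

```c
#include <assert.h>

enum { ROWS = 64, COLS = 64 };  /* hypothetical sizes, chosen for illustration */

/* Walks memory in address order: a[i][0], a[i][1], ... are adjacent,
 * so neighbouring accesses fall into the same cache block. */
long sum_row_major(int a[ROWS][COLS]) {
    assert(a != 0);
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Same result, but consecutive accesses are COLS * sizeof(int) bytes apart,
 * so each access may land in a different cache block. */
long sum_col_major(int a[ROWS][COLS]) {
    assert(a != 0);
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Timing the two functions on a large array is one way to observe the effect of block transfers on a real machine.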

19 Memory closer to processor: small & fast, stores active data. Memory farther from processor: big & slow, stores inactive data. L1 Cache: SRAM-on-chip. L2/L3 Cache: SRAM. Memory: DRAM.

20 1% of data is accessed the most: L1 Cache (SRAM-on-chip), feeding Reg $R3 via LW $R3, Mem. 9% of data is "active": L2/L3 Cache (SRAM). 90% of data inactive (not accessed): Memory (DRAM).

21 Memory closer to processor is fast but small; usually stores subset of memory farther ("strictly inclusive"). Transfer whole blocks (cache lines): 4kb: disk ↔ RAM; 256b: RAM ↔ L2; 64b: L2 ↔ L1.

22 Processor tries to access Mem[x]. Check: is block containing Mem[x] in the cache?
Yes: cache hit. Return requested data from cache line.
No: cache miss. Read block from memory (or lower level cache), (evict an existing cache line to make room), place new block in cache, return requested data → and stall the pipeline while all of this happens.

23 Block (or line): Minimum unit of information that is present/or not in the cache. Cache hit, miss. Hit rate: The fraction of memory accesses found in a level of the memory hierarchy. Miss rate: The converse, i.e. 1 - hit rate.
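As a small worked form of these definitions, the sketch below computes both rates from hit and miss counts. The function names are illustrative assumptions.

```c
#include <assert.h>

/* Fraction of accesses that were found in this level of the hierarchy. */
double hit_rate(unsigned long hits, unsigned long misses) {
    unsigned long accesses = hits + misses;
    assert(accesses > 0);               /* undefined with no accesses */
    return (double)hits / (double)accesses;
}

/* Miss rate is the complement of the hit rate. */
double miss_rate(unsigned long hits, unsigned long misses) {
    return 1.0 - hit_rate(hits, misses);
}
```

For example, 3 hits and 1 miss give a hit rate of 0.75 and a miss rate of 0.25.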

24 What structure to use? Where to place a block (book)? How to find a block (book)? When miss, which block to replace? What happens on write?

25 A given data block can be placed in exactly one cache line → Direct Mapped; in any cache line → Fully Associative; in a small set of cache lines → Set Associative.
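The three placement policies can be viewed as one rule with a different number of ways per set. The sketch below assumes the common modulo mapping from block number to set and assumes that the lines of a set are numbered contiguously; both are illustrative choices, and `candidate_lines` and `line_range` are hypothetical names.

```c
#include <assert.h>

/* A contiguous range of cache lines in which a block may be placed. */
typedef struct {
    unsigned first_line;
    unsigned count;
} line_range;

/* ways == 1          -> direct mapped (exactly one candidate line)
 * ways == num_lines  -> fully associative (any line)
 * otherwise          -> set associative (a small set of lines) */
line_range candidate_lines(unsigned block_number, unsigned num_lines, unsigned ways) {
    assert(ways >= 1 && num_lines >= ways && num_lines % ways == 0);
    unsigned num_sets = num_lines / ways;
    unsigned set = block_number % num_sets;   /* assumed modulo mapping */
    line_range r = { set * ways, ways };
    return r;
}
```

With 8 lines, block 13 may go only in line 5 when direct mapped, in lines 2 and 3 when 2-way set associative, and in any of lines 0 to 7 when fully associative.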

26 Each block number maps to a single cache line index. Simplest hardware. Questions: How to index into cache? How to find correct word/byte? How to match it? [Figure: Memory, word addresses 0x00, 0x04, 0x08, ..., 0x44.]

27 Each block number maps to a single cache line index. Simplest hardware. Questions: How to index into cache? How to find correct word/byte? How to match it? Cache: 2 cachelines (line 0, line 1), 1 word per cacheline. Memory is byte addressable. [Figure: Memory, word addresses 0x00 to 0x44.]

28 Questions: How to index into cache? How to find correct word/byte? How to match it? 32-bit addr = tag (29 bits) | index | offset. Cache: 2 cachelines (line 0, line 1), 1 word per cacheline. Memory is byte addressable. [Figure: Memory, word addresses 0x00 to 0x44.]

29 32-bit addr = tag (29 bits) | index (1 bit) | offset (2 bits). Cache: 2 cachelines (line 0, line 1), 1 word per cacheline. Memory is byte addressable. [Figure: Memory, word addresses 0x00 to 0x48.]

30 Each block number maps to a single cache line index. Simplest hardware. 32-bit addr = tag | index | offset. Cache: 2 cachelines (line 0, line 1), 4 words per cacheline. Memory is byte addressable. [Figure: Memory, word addresses 0x00 to 0x44.]

31 Size of offset? A) 1 B) 2 C) 3 D) 4 E) 5. Size of tag?

32 32-bit addr = tag (27 bits) | index | offset (4 bits). Cache: 2 cachelines (line 0, line 1), 4 words per cacheline. Memory is byte addressable. [Figure: Memory, word addresses 0x00 to 0x44.]

33 Each block number maps to a single cache line index. Simplest hardware. 32-bit addr = tag (27 bits) | index (1 bit) | offset (4 bits). Cache: 2 cachelines (line 0, line 1), 4 words per cacheline, at word offsets 0x0, 0x4, 0x8, 0xc. [Figure: Memory, word addresses 0x00 to 0x48.]

34 32-bit addr = tag (27 bits) | index (1 bit) | offset (4 bits). Cache: 2 cachelines (line 0, line 1), 4 words per cacheline, at word offsets 0x0, 0x4, 0x8, 0xc. [Figure: Memory, word addresses 0x00 to 0x48, with each 4-word block labelled by the cache line it maps to (line 0, line 1).]

35 32-bit addr = tag | index | offset. Cache: 4 cachelines (line 0 to line 3), 2 words per cacheline, at word offsets 0x0, 0x4. [Figure: Memory, word addresses 0x00 to 0x48.]

36 32-bit addr = tag (27 bits) | index (2 bits) | offset (3 bits). Cache: 4 cachelines (line 0 to line 3), 2 words per cacheline, at word offsets 0x0, 0x4. [Figure: Memory, word addresses 0x00 to 0x48, with each 2-word block labelled line 0, line 1, line 2, line 3, then wrapping around to line 0 again.]
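The tag/index/offset split used in these direct-mapped examples can be written as a short C sketch. It assumes 32-bit byte addresses and power-of-two values for the number of cache lines and the block size; `split_address`, `addr_fields` and `log2_exact` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} addr_fields;

/* Number of bits needed to address v items; v must be a power of two. */
unsigned log2_exact(uint32_t v) {
    assert(v != 0 && (v & (v - 1)) == 0);
    unsigned bits = 0;
    while (v >>= 1)
        bits++;
    return bits;
}

/* Split a 32-bit byte address for a direct-mapped cache:
 * low bits = byte offset in the block, next bits = line index, rest = tag. */
addr_fields split_address(uint32_t addr, uint32_t num_lines, uint32_t block_bytes) {
    unsigned offset_bits = log2_exact(block_bytes);
    unsigned index_bits = log2_exact(num_lines);
    assert(offset_bits + index_bits < 32);
    addr_fields f;
    f.offset = addr & (block_bytes - 1);
    f.index = (addr >> offset_bits) & (num_lines - 1);
    f.tag = addr >> (offset_bits + index_bits);
    return f;
}
```

For the 4-line, 2-words-per-line cache (8-byte blocks), address 0x44 splits into offset 4, index 0 and tag 2.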

37 Pros: Very simple hardware

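38 [Figure: direct-mapped cache lookup. The address is split into Tag | Index | Offset. The index selects one line holding V, Tag and Block; the stored tag is compared (=) with the address tag to produce "hit?"; the offset drives word select, which outputs 32 bits of data.]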

39 Address: Tag | Index | Offset. Each line: V, Tag, Block. 2^m bytes per block, 2^n blocks; n bit index, m bit offset. Q: How big is cache (data only)? Cache of size 2^n blocks, block size of 2^m bytes. Cache Size: 2^m bytes per block x 2^n blocks = 2^(n+m) bytes

40 Address: Tag | Index | Offset. Each line: V, Tag, Block. 2^m bytes per block, 2^n blocks; n bit index, m bit offset. Q: How much SRAM is needed (data + overhead)? Cache of size 2^n blocks, block size of 2^m bytes. Tag field: 32 - (n + m) bits, Valid bit: 1. SRAM Size: 2^n x (block size + tag size + valid bit size) = 2^n x (2^m bytes x 8 bits-per-byte + (32 - n - m) + 1) bits
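The two size formulas can be checked numerically with a short C sketch. It assumes 32-bit addresses, as on the slides; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Data capacity in bytes: 2^m bytes per block x 2^n blocks = 2^(n+m). */
uint64_t cache_data_bytes(unsigned n, unsigned m) {
    assert(n + m < 32);
    return (uint64_t)1 << (n + m);
}

/* Total SRAM in bits: 2^n x (2^m x 8 data bits + (32 - n - m) tag bits + 1 valid bit). */
uint64_t cache_sram_bits(unsigned n, unsigned m) {
    assert(n + m < 32);
    uint64_t data_bits = ((uint64_t)1 << m) * 8;
    uint64_t tag_bits = 32 - n - m;
    uint64_t valid_bits = 1;
    return ((uint64_t)1 << n) * (data_bits + tag_bits + valid_bits);
}
```

For n = 2 and m = 3 (4 lines of 8 bytes) this gives 32 bytes of data and 4 x (64 + 27 + 1) = 368 bits of SRAM, so the tag and valid bits add 112 bits of overhead to 256 data bits.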

41 Using byte addresses in this example. Addr Bus = 5 bits. Processor: a sequence of LB (load byte) instructions into registers $0 to $3. Cache: 4 cache lines, 2 byte block; each line holds V, tag, data. Memory.

42 Using byte addresses in this example. Addr Bus = 5 bits. Processor: a sequence of LB (load byte) instructions into registers $0 to $3. Cache: 4 cache lines, 2 byte block; 2 bit tag field, 2 bit index field, 1 bit block offset; each line holds V, tag, data. Memory.

43-60 [Figure sequence: the LB loads are traced one at a time through the 4-line cache. Each step splits the address into tag, index and offset, checks V and tag of the indexed line, fills the line from Memory on a miss, and updates the running Misses and Hits counts.]
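The kind of trace this example walks through can be reproduced with a minimal direct-mapped cache simulator. The sketch below assumes the configuration stated for the example (5-bit byte addresses, 4 cache lines, 2-byte blocks, so a 2-bit tag, 2-bit index and 1-bit offset). The type and function names are hypothetical, and any address sequence fed to it is the caller's choice, not necessarily the one on the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_LINES = 4, BLOCK_BYTES = 2, OFFSET_BITS = 1, INDEX_BITS = 2, MEM_BYTES = 32 };

typedef struct {
    bool valid;
    uint8_t tag;
    uint8_t data[BLOCK_BYTES];
} cache_line;

typedef struct {
    cache_line lines[NUM_LINES];
    unsigned hits;
    unsigned misses;
} cache;

/* Load one byte through the cache. On a miss the whole block containing
 * addr is copied from memory into the indexed line, replacing whatever
 * block was there before. */
uint8_t cache_load_byte(cache *c, const uint8_t mem[MEM_BYTES], uint8_t addr) {
    assert(c != NULL && addr < MEM_BYTES);
    unsigned offset = addr & (BLOCK_BYTES - 1);
    unsigned index = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
    unsigned tag = addr >> (OFFSET_BITS + INDEX_BITS);
    cache_line *line = &c->lines[index];

    if (line->valid && line->tag == tag) {
        c->hits++;
    } else {
        c->misses++;
        unsigned base = addr & ~(unsigned)(BLOCK_BYTES - 1);  /* start of block */
        for (unsigned i = 0; i < BLOCK_BYTES; i++)
            line->data[i] = mem[base + i];
        line->tag = (uint8_t)tag;
        line->valid = true;
    }
    return line->data[offset];
}
```

Starting from a zero-initialised `cache`, the hypothetical load sequence 1, 5, 1, 4, 0 produces two misses (the first touch of blocks 0-1 and 4-5) followed by three hits, because every later load falls in a block that is already cached.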

62-69 Pathological example. [Figure sequence: the same processor, 4-line cache and memory, traced over a longer sequence of LB loads. As loads that share a cache line are repeated, the Misses count keeps growing with every repetition.]

70 Working set is not too big for cache. Yet, we can't make it work.
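Why a small working set can still miss repeatedly in a direct-mapped cache can be sketched in C. Two addresses conflict when they lie in different blocks that map to the same line. The function names and the example addresses are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a and b are in different blocks that map to the same
 * line of a direct-mapped cache, so loading one evicts the other. */
bool direct_mapped_conflict(uint32_t a, uint32_t b,
                            uint32_t num_lines, uint32_t block_bytes) {
    assert(num_lines > 0 && block_bytes > 0);
    uint32_t block_a = a / block_bytes;
    uint32_t block_b = b / block_bytes;
    bool same_line = (block_a % num_lines) == (block_b % num_lines);
    return same_line && block_a != block_b;
}

/* Misses when a and b are loaded alternately, `rounds` times each,
 * starting from an empty cache. */
unsigned alternating_misses(uint32_t a, uint32_t b, unsigned rounds,
                            uint32_t num_lines, uint32_t block_bytes) {
    assert(num_lines > 0 && block_bytes > 0);
    if (rounds == 0)
        return 0;
    if (a / block_bytes == b / block_bytes)
        return 1;                       /* one block: only the first load misses */
    if (direct_mapped_conflict(a, b, num_lines, block_bytes))
        return 2 * rounds;              /* each load evicts the other block */
    return 2;                           /* two lines: only the first touches miss */
}
```

With 4 lines of 2 bytes, addresses 0 and 8 both map to line 0, so alternating between them misses on every load even though only two blocks are in use and the other three lines stay empty.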
