211: Computer Architecture Summer 2016

Size: px

Start display at page:

Download "211: Computer Architecture Summer 2016"

Sheila Karin McGee
5 years ago
Views:

1 211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage

2 - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University Liu Liu 2

3 - Storage: Today s Topic - Direct - Mapping - Fully Associated - 2-way Associated - Cache Friendly Code Rutgers University Liu Liu 3

4 Memory Hierarchy (Review) Smaller, faster, and costlier (per byte) storage devices L2: L1: L0: registers on-chip L1 cache (SRAM) off-chip L2 cache (SRAM) CPU registers hold words retrieved from L1 cache. L1 cache holds cache lines retrieved from the L2 cache memory. L2 cache holds cache lines retrieved from main memory. Larger, slower, and cheaper (per byte) storage devices L4: L3: main memory (DRAM) local secondary storage (local disks) Main memory holds disk blocks retrieved from local disks. *Local disks hold files retrieved from disks on remote network servers. L5: remote secondary storage (distributed file systems, Web servers) Rutgers University Liu Liu 4

5 Types of Locality Temporal locality Recently accessed locations will likely be accessed again in near future X X X t Spatial locality Will likely access locations close to ones recently accessed in near future Rutgers University Liu Liu 5

6 Cache Miss Cold (compulsary) miss Cold miss occurs when a memory location is accessed for the 1st time Conflict miss Most caches limit blocks at level k+1 to a small subset of the block positions at level k. E.g., Block i at level k+1 must be placed in block (i mod 4) at level k Conflict misses occur when the level k cache is large enough, but multiple data items all map to the same level k block. E.g., Referencing blocks 0, 8, 0, 8, 0, 8,... would miss every time Capacity miss Occurs when the set of active data blocks (working set) is larger than the cache Rutgers University Liu Liu 6

7 Remember: Each level is a cache for previous Each level stores different sizes of information L2: L1: L0 register on-chip L1 cache (SRAM) off-chip L2 cache (SRAM) Request for A A B C L3: main memory (DRAM) L4: local secondary storage (local disks) Rutgers University Liu Liu 7

8 L1 Cache The transfer unit between the CPU register file and the cache is a 4-byte block. The transfer unit between the cache and main memory is a 4-word block (16 bytes). line 0 line 1 block 10 block 21 block 30 a b c d... p q r s... w x y z... The tiny, very fast CPU register file has room for four 4-byte words. The small fast L1 cache has room for two 4-word blocks. The big slow main memory has room for many 4-word blocks. Rutgers University Liu Liu 8

9 Finding data in cache Part of memory address applied to cache Remaining is stored as in cache If matches, hit, use data No match, miss, fetch data from memory address Rutgers University Liu Liu 9

10 Cache is an array of sets. Each set contains one or more lines. Each line holds a block of data. General Org of a Cache Memory S = 2 s sets s: index bits set 0: set 1: 1 valid bit per line valid valid valid valid t bits per line B = 2 b bytes per cache block B 1 B 1 B 1 B 1 E lines per set valid 0 1 B 1 set S-1: valid 0 1 B 1 Cache size: C = B x E x S data bytes Rutgers University Liu Liu 10

11 Accessing Direct-Mapped Caches Line matching and word selection Line matching: Find a valid line in the selected set with a matching Word selection: Then extract the word =1? (1) The valid bit must be set selected set (i): w 0 w 1 w 2 w 3 (2) The bits in the cache line must match the bits in the address =? t bits 0110 s bits b bits i 100 set index block offset (3) If (1) and (2), then cache hit, and block offset selects starting byte. Rutgers University Liu Liu 11

12 Why Use Middle Bits as Index? line Cache High-Order Bit Indexing Adjacent memory lines would map to same cache entry Poor use of spatial locality Middle-Order Bit Indexing Consecutive memory lines map to different cache lines Can hold C-byte region of address space in cache at one time High-Order Bit Indexing Middle-Order Bit Indexing Rutgers University Liu Liu 12

13 Example: Direct mapped cache Given: 32 bit address, 64KB cache, 32 byte block, byteaddressable memory Q: How many sets, how many bits for the, how many bits for the offset? set 0: valid cache block set 1: valid cache block set n-1: valid cache block Rutgers University Liu Liu 13

14 Set selection is trivial Fully Associative Caches No need to reserve bits for sets.(advane?) Accessing a line is the same as a set associative cache Advane? DisAdvane? set 0: valid valid valid valid valid valid cache block cache block cache block cache block cache block cache block Rutgers University Liu Liu 14

15 Set Associative Caches Compromise between data localization / comparison complexity Characterized by more than one line per set set 0: valid cache block valid cache block E=2 lines per set set 1: set S-1: valid cache block valid cache block valid cache block valid cache block Rutgers University Liu Liu 15

16 Accessing Set Associative Caches Set selection identical to direct-mapped cache set 0: valid valid cache block cache block Selected set set 1: valid valid cache block cache block m-1 t bits s bits b bits set index block offset 0 set S-1: valid valid cache block cache block Rutgers University Liu Liu 16

17 Accessing Set Associative Caches Line matching and word selection must compare the in each valid line in the selected set. =1? (1) The valid bit must be set selected set (i): w 0 w 1 w 2 w 3 (2) The bits in one of the cache lines must match the bits in the address m-1 =? t bits 0110 s bits b bits i 100 set index block offset 0 (3) If (1) and (2), then cache hit, and block offset selects starting byte. Rutgers University Liu Liu 17

18 Accessing Set Associative Caches Line matching and word selection must compare the in each valid line in the selected set. =1? (1) The valid bit must be set selected set (i): w 0 w 1 w 2 w 3 (2) The bits in one of the cache lines must match the bits in the address m-1 =? t bits 0110 s bits b bits i 100 set index block offset 0 (3) If (1) and (2), then cache hit, and block offset selects starting byte. Any other issues? Rutgers University Liu Liu 18

19 Replacement What if has a cache miss but all entries in the set are valid? Direct-mapped cache: no decision Discard current content and bring needed content in Set Associate cache: which line to evict? Need a replacement algorithm If we knew the future, can you think of an optimal replacement algorithm? Since we cannot predict, we need to approximate FIFO: First-In-First-Out LRU: Least Recently Used Random: Select victim from set randomly Rutgers University Liu Liu 19

20 Replacement How would you implement FIFO replacement in software? How would you implement LRU replacement in software? Rutgers University Liu Liu 20

21 Example: 2-way associative cache 32 bit address, 64KB cache, 32 byte block How many sets (lines), how many bits for the, how many bits for the offset? set 0: valid cache block valid cache block E=2 lines per set set 1: set S-1: valid cache block valid cache block valid cache block valid cache block Rutgers University Liu Liu 21

22 Example: 2-way associative cache 32 bit address, 32KB cache, 16 byte block How many sets, how many bits for the, how many bits for the offset? set 0: valid cache block valid cache block E=2 lines per set set 1: set S-1: valid cache block valid cache block valid cache block valid cache block Rutgers University Liu Liu 22

23 Writes and Cache Reading information from a cache is straight forward What about writing? What if you re writing data that is already cached (write-hit)? What if the data is not in the cache (write-miss)? Dealing with a write-hit Write-through - immediately write data back to memory Write-back - defer the write to memory for as long as possible Dealing with a write-miss write-allocate - load the block into memory and update no-write-allocate - writes directly to memory Write-through caches are typically no-write-allocate Write-back caches are typically write-allocate Rutgers University Liu Liu 23

24 Write-Back Caches What happens if we need to evict a block that has been written into? Need to write cached copy to lower-level Hence the name write-back How do we know whether a block has been written to? Add a bit called the dirty bit Set dirty bit when writing to a block When evicting a block, need to write to lower-level if dirty bit is set Rutgers University Liu Liu 24

25 Break Rutgers University Liu Liu 25

26 Multi-Level Caches Options: separate data and instruction caches, or a unified cache Processor Regs L1 d-cache L1 i-cache Unified L2 Cache Memory disk Rutgers University Liu Liu 26

27 Intel Pentium Cache Hierarchy Regs. L1 Data 1 cycle latency 16 KB 4-way assoc Write-through 32B lines L1 Instruction 16 KB, 4-way 32B lines L2 Unified 128KB--2 MB 4-way assoc Write-back Write allocate 32B lines Main Memory Up to 4GB Processor Chip Rutgers University Liu Liu 27

28 Cache Performance Metrics Miss Rate Hit Time Fraction of memory references not found in cache (misses/references) Typical numbers: 3-10% for L1 can be quite small (e.g., < 1%) for L2, depending on size, etc. Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache) Typical numbers: 1 clock cycle for L1 3-8 clock cycles for L2 Miss Penalty Additional time required because of a miss Typically cycles for main memory Rutgers University Liu Liu 28

29 Miss Types Cold misses When a location is accessed for the 1 st time Can reduce by increasing block size (leveraging spatial locality) Conflict misses More blocks that map into a single set is concurrently active than can be stored in the set Can reduce by increase associativity Capacity misses More blocks are active than can fit into the cache Working set Pre-fetching is a technique that could be used to alleviate all three types of misses Predict what will be used next and pre-fetch before actually have a miss Large block sizes is a form of prefetching If wrong, can worsen performance of cache Rutgers University Liu Liu 29

30 Writing Cache Friendly Code Repeated references to variables are good (temporal locality) Stride-1 reference patterns are good (spatial locality) Examples: cold cache, 4-byte words, 4-word cache blocks int sumarrayrows(int a[m][n]) { int i, j, sum = 0; int sumarraycols(int a[m][n]) { int i, j, sum = 0; } for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; Miss rate = Miss rate = 1/4 = 25% 100% Rutgers University Liu Liu 30

31 Matrix Multiplication Example Major Cache Effects to Consider Total cache size Exploit temporal locality and keep the working set small (e.g., by using blocking) Block size Exploit spatial locality Description: Multiply N x N matrices O(N3) total operations Accesses N reads per source element N values summed per destination» but may be able to hold in register /* ijk */ Variable sum for (i=0; i<n; i++) { held in register for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } } Rutgers University Liu Liu 31

32 Miss Rate Analysis for Matrix Multiply Assume: Line size = 32B (big enough for 4 64-bit words) Matrix dimension (N) is very large Approximate 1/N as 0.0 Cache is not even big enough to hold multiple rows Analysis Method: Look at access pattern of inner loop k j j i k i A B C Rutgers University Liu Liu 32

33 Layout of C Arrays in Memory (review) C arrays allocated in row-major order each row in contiguous memory locations Stepping through columns in one row: for (i = 0; i < N; i++) sum += a[0][i]; accesses successive elements if block size (B) > 4 bytes, exploit spatial locality compulsory miss rate = 4 bytes / B Stepping through rows in one column: for (i = 0; i < n; i++) sum += a[i][0]; accesses distant elements no spatial locality! compulsory miss rate = 1 (i.e. 100%) Rutgers University Liu Liu 33

34 Matrix Multiplication (ijk) /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } } Inner loop: (*,j) (i,j) (i,*) A B C Row-wise Columnwise Fixed Misses per Inner Loop Iteration: A B C Rutgers University Liu Liu 34

35 Matrix Multiplication (jik) /* jik */ for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum } } Inner loop: (*,j) (i,j) (i,*) A B C Misses per Inner Loop Iteration: Row-wise Columnwise Fixed A B C Rutgers University Liu Liu 35

36 Matrix Multiplication (kij) /* kij */ for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } } Inner loop: (i,k) (k,*) A B C (i,*) Fixed Row-wise Row-wise Misses per Inner Loop Iteration: A B C Rutgers University Liu Liu 36

37 Matrix Multiplication (ikj) /* ikj */ for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } } Inner loop: (i,k) (k,*) A B C (i,*) Fixed Row-wise Row-wise Misses per Inner Loop Iteration: A B C Rutgers University Liu Liu 37

38 Matrix Multiplication (jki) /* jki */ for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } } Inner loop: (*,k) (*,j) (k,j) A B C Misses per Inner Loop Iteration: 1.0 A B C Column - wise Fixed Columnwise Rutgers University Liu Liu 38

39 Matrix Multiplication (kji) /* kji */ for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } } Inner loop: (*,k) (*,j) (k,j) A B C Misses per Inner Loop Iteration: A B C Fixed Columnwise Columnwise Rutgers University Liu Liu 39

40 Summary of Matrix Multiplication ijk (& jik): 2 loads, 0 stores misses/iter = 1.25 kij (& ikj): 2 loads, 1 store misses/iter = 0.5 jki (& kji): 2 loads, 1 store misses/iter = 2.0 for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } } for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j ++) c[i][j] += r * b[k][j]; } } for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i+ +) c[i][j] += a[i][k] * r; } } Rutgers University Liu Liu 40

41 Pentium Matrix Multiply Performance Miss rates are helpful but not perfect predictors. Code scheduling matters, too Cycles/iteration kji jki kij ikj jik ijk Array size (n) Rutgers University Liu Liu

42 Improving Temporal Locality by Blocking Example: Blocked matrix multiplication block (in this context) does not mean cache block. Instead, it mean a sub-block within the matrix. Example: N = 8; sub-block size = 4 A 11 A 12 A 21 A 22 X B 11 B 12 B 21 B 22 = C 11 C 12 C 21 C 22 Key idea: Sub-blocks (i.e., A xy ) can be treated just like scalars. C 11 = A 11 B 11 + A 12 B 21 C 12 = A 11 B 12 + A 12 B 22 C 21 = A 21 B 11 + A 22 B 21 C 22 = A 21 B 12 + A 22 B 22 Rutgers University Liu Liu 42

43 Pentium Blocked Matrix Multiply Performance Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik) relatively insensitive to array size. 60 Cycles/iteration kji jki kij ikj jik ijk bijk (bsize = 25) bikj (bsize = 25) Array size (n) Rutgers University Liu Liu 43

44 Concluding Observations Programmer can optimize for cache performance How data structures are organized How data are accessed Nested loop structure Blocking is a general technique All systems favor cache friendly code Getting absolute optimum performance is very platform specific Cache sizes, line sizes, associativities, etc. Can get most of the advane with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality) Rutgers University Liu Liu 44

CISC 360. Cache Memories Nov 25, 2008

CISC 360. Cache Memories Nov 25, 2008 CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based