ECE331: Hardware Organization and Design

Meryl Hart
6 years ago

1 ECE331: Hardware Organization and Design
Lecture 25: Multilevel Caches & Data Access Strategies
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

2 Overview
- Last time: associative caches
  - How do we calculate cache hit rate?
  - What about processor performance?
- Need to consider cache replacement policy
  - How to select what to remove from an associative cache
  - Lots of research done to select the right choices
  - Research is ongoing for chips with multiple processors
- Need to know how to calculate values for the exam
ECE331: Cache Performance Analysis

3 Multilevel Caches
- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

4 Multilevel Cache Example
- Given
  - CPU base CPI = 1, clock rate = 4 GHz (cycle time = 0.25 ns)
  - Miss rate/instruction = 2%
  - Main memory access time = 100 ns
- With just primary cache
  - Miss penalty = 100 ns / 0.25 ns = 400 cycles
  - Effective CPI = 1 + 0.02 x 400 = 9
- Now add L-2 cache
  - Access time = 5 ns
  - L-2 miss rate = 0.5%
  - Global miss rate to main memory = 0.5% x 2% = 0.01%
- Primary miss with L-2 hit
  - Penalty = 5 ns / 0.25 ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 1000 cycles
- CPI = 1 + 0.02 x 20 + 0.0001 x 1000 = 1.5
- Performance ratio = 9/1.5 = 6 times better
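The arithmetic in this example can be checked with a short C sketch. The function and parameter names below are illustrative assumptions, not part of the lecture; the formulas are simply base CPI plus (miss rate x penalty) for each level that services misses.

```c
#include <assert.h>

/* Convert an access time in ns to clock cycles at a given clock rate.
   With the clock in GHz, one ns holds clock_ghz cycles. */
double penalty_cycles(double access_ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return access_ns * clock_ghz;
}

/* Effective CPI with only a primary cache:
   every primary miss pays the full main-memory penalty. */
double cpi_single_level(double base_cpi, double l1_miss_rate,
                        double mem_penalty)
{
    return base_cpi + l1_miss_rate * mem_penalty;
}

/* Effective CPI with a primary and an L-2 cache:
   every primary miss pays the L-2 penalty, and the fraction of
   instructions that also miss in L-2 (the global miss rate)
   pays the extra main-memory penalty on top. */
double cpi_two_level(double base_cpi, double l1_miss_rate,
                     double l2_penalty, double global_miss_rate,
                     double mem_penalty)
{
    return base_cpi
         + l1_miss_rate * l2_penalty
         + global_miss_rate * mem_penalty;
}
```

With the slide's numbers, `cpi_single_level(1.0, 0.02, penalty_cycles(100.0, 4.0))` evaluates to 9, and `cpi_two_level(1.0, 0.02, penalty_cycles(5.0, 4.0), 0.0001, 1000.0)` evaluates to 1.5, up to floating-point rounding.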

5 Multilevel Cache Considerations
- Primary cache
  - Focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache usually smaller than a single cache
  - L-1 block size smaller than L-2 block size
- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
  - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyze
  - Use system simulation

6 Interactions with Software
- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access
- Example for sorting routines: quicksort versus radix sort
  - Radix sort uses fewer instructions per item than quicksort when the number of items gets large
  - When clock cycles are accounted for, a different picture arises
- What happened?
  - Radix sort has more cache misses than quicksort, and hence it loses its advantage

7 Software Optimization via Blocking
- Goal: maximize accesses to data before it is replaced
- Consider DGEMM (double-precision matrix multiply, provided by the BLAS/LAPACK math libraries found on most Unix systems)
- The inner loops look like

    for (int j = 0; j < n; ++j) {
        double cij = C[i+j*n];           /* cij = C[i][j] */
        for (int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];  /* cij += A[i][k] * B[k][j] */
        C[i+j*n] = cij;                  /* C[i][j] = cij */
    }

8 Traditional Matrix Multiplication
- For A x B = C
- Each row of A is multiplied element by element with a column of B; the products are summed and placed in matrix C as a single entry
- The process is repeated until all of C is calculated (m x k times)
[Figure: access pattern for Matrix A (m x n), Matrix B (n x k), and Matrix C (m x k). Dark grey: recently used data; light grey: less recently used; white: not yet used]
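The row-times-column procedure described here can be sketched in C for rectangular matrices. This is a minimal illustration under stated assumptions: the function name `matmul_naive` is hypothetical, the matrices are stored column-major to match the indexing style of the DGEMM loops, and C is overwritten rather than accumulated into.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x k) = A (m x n) * B (n x k), all stored column-major:
   element (row r, column c) of a matrix with R rows lives at [r + c*R]. */
void matmul_naive(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {          /* row of A and C */
        for (size_t j = 0; j < k; ++j) {      /* column of B and C */
            double sum = 0.0;
            for (size_t p = 0; p < n; ++p)    /* walk the row and the column */
                sum += A[i + p * m] * B[p + j * n];
            C[i + j * m] = sum;               /* one entry of C per (i, j) */
        }
    }
}
```

The body of the two outer loops runs m x k times, once per entry of C, and each pass reads a full row of A and a full column of B. For large matrices those rows and columns may be evicted from the cache before they are reused.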


9 Blocked DGEMM Access Pattern
[Figure: access patterns for Matrix A (m x n), Matrix B (n x k), and Matrix C (m x k); labels: matrix size, Unoptimized, Blocked]
- By breaking the matrices up into smaller blocks, we place fewer demands on cache memory
- The process must revisit entries in matrices A, B, and C to complete the calculation
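A blocked version of the DGEMM loops might look like the following C sketch. It follows the general structure of the blocked DGEMM in Patterson & Hennessy, but the names `dgemm_blocked` and `do_block`, the runtime block-size parameter `bs`, and the edge handling for sizes that are not a multiple of `bs` are assumptions made for illustration.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Update one tile of C: rows si.., columns sj.., using the matching
   tile of A (rows si.., columns sk..) and of B (rows sk.., columns sj..).
   Tiles at the matrix edge are clipped to n. */
static void do_block(int n, int bs, int si, int sj, int sk,
                     const double *A, const double *B, double *C)
{
    int i_end = min_int(si + bs, n);
    int j_end = min_int(sj + bs, n);
    int k_end = min_int(sk + bs, n);
    for (int i = si; i < i_end; ++i)
        for (int j = sj; j < j_end; ++j) {
            double cij = C[i + j * n];            /* running sum for C[i][j] */
            for (int k = sk; k < k_end; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

/* C += A * B for n x n column-major matrices, one bs x bs tile at a time,
   so the three active tiles can stay in the cache while they are reused. */
void dgemm_blocked(int n, int bs,
                   const double *A, const double *B, double *C)
{
    assert(n >= 0 && bs > 0);
    for (int sj = 0; sj < n; sj += bs)
        for (int si = 0; si < n; si += bs)
            for (int sk = 0; sk < n; sk += bs)
                do_block(n, bs, si, sj, sk, A, B, C);
}
```

Each entry of C is revisited once per tile along the k dimension, which is the revisiting mentioned on the slide. A reasonable starting point for `bs` is a value for which three `bs` x `bs` tiles of doubles fit in the cache level being targeted; the best value is found by measurement.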

10 Summary
- Today: cache performance
- Need to understand cache replacement strategy
  - Least recently used
  - Random
- Determine CPI given cache misses
- Determine miss rates for cache
- Associative caches require more hardware
  - More comparison hardware
  - More difficult to replace blocks
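To make the least-recently-used policy and the miss-rate calculation concrete, here is a small C sketch that counts misses for a fully associative cache over a trace of block addresses. The function name `lru_misses`, the `MAX_WAYS` limit, and the trace format are assumptions for illustration; a real cache tracks recency in hardware rather than with timestamps.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAYS 16   /* assumed upper bound on the number of cache blocks */

/* Count misses for a fully associative cache with `ways` blocks and
   LRU replacement, given a trace of block addresses. */
size_t lru_misses(const unsigned *trace, size_t len, size_t ways)
{
    unsigned tags[MAX_WAYS];       /* block address held in each way */
    size_t last_used[MAX_WAYS];    /* trace index of the most recent access */
    size_t filled = 0, misses = 0;

    assert(ways > 0 && ways <= MAX_WAYS);

    for (size_t t = 0; t < len; ++t) {
        size_t hit = filled;                   /* "not found" sentinel */
        for (size_t w = 0; w < filled; ++w)    /* compare against every tag */
            if (tags[w] == trace[t]) { hit = w; break; }

        if (hit < filled) {                    /* hit: refresh recency only */
            last_used[hit] = t;
            continue;
        }

        ++misses;
        size_t victim;
        if (filled < ways) {
            victim = filled++;                 /* an empty way is available */
        } else {
            victim = 0;                        /* evict the least recently used */
            for (size_t w = 1; w < ways; ++w)
                if (last_used[w] < last_used[victim])
                    victim = w;
        }
        tags[victim] = trace[t];
        last_used[victim] = t;
    }
    return misses;
}
```

For the block-address trace 0, 8, 0, 6, 8 with four ways, the function counts 3 misses, a miss rate of 3/5 = 60%. With two ways the same trace produces 4 misses, because block 8 is evicted before it is reused. The tag comparison against every way is the extra comparison hardware that an associative cache needs.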
