Memory Hierarchy. Announcement. Computer system model. Reference

Size: px

Start display at page:

Download "Memory Hierarchy. Announcement. Computer system model. Reference"

Harriet Neal
5 years ago
Views:

1 Announcement Memory Hierarchy Computer Organization and Assembly Languages Yung-Yu Chuang 26//5 Grade for hw#4 is online Please DO submit homework if you haen t Please sign up a demo time on /6 or /7 at the following page Hand in your report to TA at your demo time The length of report depends on your project type. It can be html, pdf, doc, ppt with slides by CMU5-23 Reference Chapter 6 from Computer System: A Programmer s Perspectie Computer system model We assume memory is a linear array which holds both instruction and data, and CPU can access memory in a constant time. data bus registers Central Processor Unit (CPU) Memory Storage Unit I/O Deice # I/O Deice #2 ALU CU clock control bus address bus

2 SRAM s DRAM Tran. Access Needs per bit time refresh? Cost Applications SRAM 4 or 6 X No X cache memories DRAM X Yes X Main memories, frame buffers ns The CPU-Memory gap The gap widens between DRAM, disk, and CPU speeds.,,,,,,,,, Access time (cycles) year register cache - memory 5- Disk seek time DRAM access time SRAM access time CPU cycle time disk 2,, Memory hierarchies Memory system in practice Some fundamental and enduring properties of hardware and software: Fast storage technologies cost more per byte, hae less capacity, and require more power (heat!). The gap between CPU and main memory speed is widening. Well-written programs tend to exhibit good locality. They suggest an approach for organizing memory and storage systems known as a memory hierarchy. Smaller, faster, and more expensie (per byte) storage deices Larger, slower, and cheaper (per byte) storage deices L4: L3: L2: L: registers L: on-chip L cache (SRAM) off-chip L2 cache (SRAM) main memory (DRAM) local secondary storage (local disks) L5: remote secondary storage (tapes, distributed file systems, Web serers)

3 Why it works? Most programs tend to access the storage at any particular leel more frequently than the storage at the lower leel. Locality: tend to access the same set of data items oer and oer again or tend to access sets of nearby data items. Why learn it? A programmer needs to understand this because the memory hierarchy has a big impact on performance. You can optimize your program so that its data is more frequently stored in the higher leel of the hierarchy. For example, the difference of running time for matrix multiplication could up to a factor of 6 een if the same amount of arithmetic instructions are performed. Locality Principle of Locality: programs tend to reuse data and instructions near those they hae used recently, or that were recently referenced themseles. Temporal locality: recently referenced items are likely to be referenced in the near future. Spatial locality: items with nearby addresses tend to be referenced close together in time. In general, programs with good locality run faster then programs with poor locality Locality is the reason why cache and irtual memory are designed in architecture and operating system. Another example is web browser caches recently isited webpages. Locality example sum = ; for (i = ; i < n; i++) sum += a[i]; return sum; Data Reference array elements in succession (stride- reference pattern): Spatial locality Reference sum each iteration: Temporal locality Instructions Reference instructions in sequence: Spatial locality Cycle through loop repeatedly: Temporal locality

4 Locality example Being able to look at code and get a qualitatie sense of its locality is important. Does this function hae good locality? int sum_array_rows(int a[m][n]) { int i, j, sum = ; for (i = ; i < M; i++) for (j = ; j < N; j++) sum += a[i][j]; return sum; stride- reference pattern Locality example Does this function hae good locality? int sum_array_cols(int a[m][n]) { int i, j, sum = ; for (j = ; j < N; j++) for (i = ; i < M; i++) sum += a[i][j]; return sum; stride-n reference pattern Locality example Memory hierarchies typedef struct { float [3]; float a[3]; point ; point p[n]; for (i=; i<n; i++) { for (j=; j<3; j++) { p[i].[j]=; p[i].a[j]=; B for (i=; i<n; i++) { for (j=; j<3; j++) p[i].[j]=; for (j=; j<3; j++) p[i].a[j]=; A for (j=; j<3; j++) { for (i=; i<n; i++) p[i].[j]=; for (i=; i<n; i++) p[i].a[j]=; C Smaller, faster, and more expensie (per byte) storage deices Larger, slower, and cheaper (per byte) storage deices L4: L5: L3: L2: L: registers L: on-chip L cache (SRAM) off-chip L2 cache (SRAM) main memory (DRAM) local secondary storage (local disks) remote secondary storage (tapes, distributed file systems, Web serers)

5 Caches Cache: a smaller, faster storage deice that acts as a sing area for a subset of the data in a larger, slower deice. Fundamental idea of a memory hierarchy: For each k, the faster, smaller deice at leel k seres as a cache for the larger, slower deice at leel k+. Why do memory hierarchies work? Programs tend to access the data at leel k more often than they access the data at leel k+. Thus, the storage at leel k+ can be slower, and thus larger and cheaper per bit. Caching in a memory hierarchy leel k leel k Smaller, faster, more Expensie deice at leel k caches a subset of the blocks from leel k+ Data is copied between leels in block-sized transfer units Larger, slower, cheaper Storage deice at leel k+ is partitioned into blocks. General caching concepts Type of cache misses leel k leel k+ 4 2 Request * * Request * Program needs object d, which is stored in some block b. Cache hit Program finds b in the cache at leel k. E.g., block 4. Cache miss b is not at leel k, so leel k cache must fetch it from leel k+. E.g., block 2. If leel k cache is full, then some current block must be replaced (eicted). Which one is the ictim? Placement policy: where can the new block go? E.g., b mod 4 Replacement policy:which block should be eicted? E.g., LRU Cold (compulsory) miss: occurs because the cache is empty. Capacity miss: occurs when the actie cache blocks (working set) is larger than the cache. Conflict miss Most caches limit blocks at leel k+ to a small subset of the block positions at leel k, e.g. block i at leel k+ must be placed in block (i mod 4) at leel k. Conflict misses occur when the leel k cache is large enough, but multiple data objects all map to the same leel k block, e.g. Referencing blocks, 8,, 8,, 8,... would miss eery time.

6 Cache memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. CPU looks first for data in L, then in L2, then in main memory. Typical system structure: SRAM Port L2 data L cache CPU chip register file bus interface ALU system bus I/O bridge memory bus main memory General organization of a cache Cache is an array of sets. Each set contains one or more lines. Each line holds a block of data. S = 2 s sets bit per line set : set : set S-: t bits per line B = 2 b bytes per B B B B B B Cache size: C = B x E x S data bytes E lines per set Addressing caches Address A: t bits s bits b bits Addressing caches Address A: t bits s bits b bits set : B B m- <> <set index> <block offset> set : B B m- <> <set index> <block offset> set : B B set : B B set S-: B B The word at address A is in the cache if the bits in one of the <> lines in set <set index> match <>. The word contents begin at offset <block offset> bytes from the beginning of the block. set S-: B B. Locate the set based on <set index> 2. Locate the line in the set based on <> 3. Check that the line is 4. Locate the data in the line based on <block offset>

7 Direct-mapped cache Simplest kind of cache, easy to build (only compare required per access) Characterized by exactly one line per set. set : set : set S-: E= lines per set Accessing direct-mapped caches Set selection Use the set index bits to determine the set of interest. selected set set : set : set S-: Cache size: C = B x S data bytes m- t bits s bits b bits set index block offset Accessing direct-mapped caches Line matching and word selection Line matching: Find a line in the selected set with a matching Word selection: Then extract the word =? () The bit must be set selected set (i): w w w 2 w 3 Accessing direct-mapped caches Line matching and word selection Line matching: Find a line in the selected set with a matching Word selection: Then extract the word selected set (i): w w w 2 w 3 (2) The bits in the cache line must =? match the bits in the address t bits m- If () and (2), then cache hit s bits b bits i set index block offset (3) If cache hit, block offset selects starting byte. m- t bits s bits b bits i set index block offset

8 Direct-mapped cache simulation What s wrong with direct-mapped? t= s=2 b= x xx x M=6 byte addresses, B=2 bytes/block, S=4 sets, E= entry/set Address trace (reads): [ 2 ], [ 2 ], 7 [ 2 ], 8 [ 2 ], [ 2 ] data? M[8-9] M[-]? M[6-7] miss hit miss miss miss float dotprod(float x[8], y[8]) { float sum=.; for (int i=; i<8; i++) sum+= x[i]*y[i]; return sum; block size=6 bytes element address set element address set x[] y[] 32 x[] 4 y[] 36 x[2] 8 y[2] 4 x[3] 2 y[3] 44 x[4] 6 y[4] 48 x[5] 2 y[5] 52 x[6] 24 y[6] 56 x[7] 28 y[7] 6 Solution? padding float dotprod(float x[2], y[8]) { float sum=.; for (int i=; i<8; i++) sum+= x[i]*y[i]; return sum; element address set element x[] x[] x[2] x[3] x[4] x[5] x[6] x[7] y[] y[] y[2] y[3] y[4] y[5] y[6] y[7] Address set Set associatie caches Characterized by more than one line per set set : set : set S-: E=2 lines per set E-way associatie cache

9 Accessing set associatie caches Set selection identical to direct-mapped cache set : Accessing set associatie caches Line matching and word selection must compare the in each line in the selected set. =? () The bit must be set selected set set : set S-: t bits s bits b bits m- set index block offset selected set (i): (2) The bits in one of the cache lines must match the bits in the address m w w w 2 w 3 t bits =? If () and (2), then cache hit s bits b bits i set index block offset Accessing set associatie caches 2-Way associatie cache simulation Line matching and word selection Word selection is the same as in a direct mapped cache selected set (i): m w w w 2 w 3 (3) If cache hit, block offset selects starting byte. t bits s bits b bits i set index block offset t=2 s= b= xx x x M=6 byte addresses, B=2 bytes/block, S=2 sets, E=2 entry/set Address trace (reads): [ 2 ], [ 2 ], 7 [ 2 ], 8 [ 2 ], [ 2 ] data? M[-]? M[8-9] M[6-7] miss hit miss miss hit

10 Why use middle bits as index? What about writes? 4-line Cache High-order bit indexing adjacent memory lines would map to same cache entry poor use of spatial locality High-Order Bit Indexing Middle-Order Bit Indexing Multiple copies of data exist: L L2 Main Memory Disk What to do when we write? Write-through Write-back (need a dirty bit) What to do on a replacement? Depends on whether it is write through or write back Multi-leel caches Intel Pentium III cache hierarchy Options: separate data and instruction caches, or a unified cache Processor size: speed: $/Mbyte: line size: Regs 2 B 3 ns L d-cache L i-cache 8-64 KB 3 ns Unified Unified L2 L2 Cache Cache 8 B 32 B 32 B larger, slower, cheaper -4MB SRAM 6 ns $/MB Memory 28 MB DRAM 6 ns $.5/MB 8 KB disk disk 3 GB 8 ms $.5/MB Regs. L Data cycle latency 6 KB 4-way assoc Write-through 32B lines L Instruction 6 KB, 4-way 32B lines Processor Processor Chip Chip L2 L2 Unified 28KB--2 MB MB 4-way 4-way assoc assoc Write-back Write Write allocate 32B 32B lines lines Main Memory Up Up to to 4GB 4GB

11 Writing cache friendly code Repeated references to ariables are good (temporal locality) Stride- reference are good (spatial locality) Examples: cold cache, 4-byte words, 4-word cache blocks int sum_array_rows(int a[4][8]) { int i, j, sum = ; for (i = ; i < M; i++) for (j = ; j < N; j++) sum += a[i][j]; return sum; int sum_array_cols(int a[4][8]) { int i, j, sum = ; for (j = ; j < N; j++) for (i = ; i < M; i++) sum += a[i][j]; return sum; Miss rate = /4 = 25% Miss rate = % The memory mountain Read throughput: number of bytes read from memory per second (MB/s) Memory mountain Measured read throughput as a function of spatial and temporal locality. Compact way to characterize memory system performance. oid test(int elems, int stride) { int i, result = ; olatile int sink; for (i = ; i < elems; i += stride) result += data[i]; /* So compiler doesn't optimize away the loop */ sink = result; The memory mountain Throughput (MB/sec) Slopes of Spatial Locality s s3 s5 s7 Stride (words) s9 s mem s3 xe L2 s5 8m L Pentium III 55 MHz 6 KB on-chip L d-cache 6 KB on-chip L i-cache 52 KB off-chip unified L2 cache 2m 52k 28k 32k 8k Ridges of Temporal Locality Working set size (bytes) 2k Ridges of temporal locality Slice through the memory mountain (stride=) illuminates read throughputs of different caches and memory read througput (MB/s) m main memory region 4m 2m 24k 52k 256k L2 cache region 28k 64k 32k working set size (bytes) 6k 8k L cache region 4k 2k k

12 A slope of spatial locality Slice through memory mountain (size=256kb) read throughput (MB/s) shows size one access per cache line s s2 s3 s4 s5 s6 s7 s8 s9 s s s2 s3 s4 s5 s6 stride (words) Matrix multiplication example Major cache effects to consider Total cache size Exploit temporal locality and keep the working set small (e.g., use blocking) Block size /* /* ijk ijk */ */ Exploit spatial locality for (i=; i<n; i++) { Description: Multiply N x N matrices O(N 3 ) total operations Accesses N reads per source element N alues summed per destination but may be able to hold in register Variable sum held in register for (i=; i<n; i++) { for for (j=; (j=; j<n; j<n; j++) j++) { sum sum =.;.; for for (k=; (k=; k<n; k<n; k++) k++) sum sum += += a[i][k] * b[k][j]; c[i][j] = sum; sum; Miss rate analysis for matrix multiply Matrix multiplication (ijk) Assume: Line size = 32B (big enough for four 64-bit words) Matrix dimension (N) is ery large Approximate /N as. Cache is not een big enough to hold multiple rows Analysis method: Look at access pattern of inner loop i k A k j B i j C /* /* ijk ijk */ */ for for (i=; (i=; i<n; i<n; i++) i++) { for for (j=; (j=; j<n; j<n; j++) j++) { sum sum =.;.; for for (k=; (k=; k<n; k<n; k++) k++) sum sum += += a[i][k] * b[k][j]; c[i][j] = sum; sum; Misses per Inner Loop Iteration:.25.. Inner loop: (*,j) (i,j) (i,*) Row-wise Columnwise Fixed

13 Matrix multiplication (jik) Matrix multiplication (kij) /* /* jik jik */ */ for for (j=; (j=; j<n; j<n; j++) j++) { for for (i=; (i=; i<n; i<n; i++) i++) { sum sum =.;.; for for (k=; (k=; k<n; k<n; k++) k++) sum sum += += a[i][k] * b[k][j]; c[i][j] = sum sum Misses per Inner Loop Iteration:.25.. Inner loop: (*,j) (i,j) (i,*) Row-wise Columnwise Fixed /* /* kij kij */ */ for for (k=; (k=; k<n; k<n; k++) k++) { for for (i=; (i=; i<n; i<n; i++) i++) { r = a[i][k]; for for (j=; (j=; j<n; j<n; j++) j++) c[i][j] += += r * b[k][j]; Misses per Inner Loop Iteration: Inner loop: (i,k) (k,*) (i,*) Fixed Row-wise Row-wise Matrix multiplication (ikj) Matrix multiplication (jki) /* /* ikj ikj */ */ for for (i=; (i=; i<n; i<n; i++) i++) { for for (k=; (k=; k<n; k<n; k++) k++) { r = a[i][k]; for for (j=; (j=; j<n; j<n; j++) j++) c[i][j] += += r * b[k][j]; Misses per Inner Loop Iteration: Inner loop: (i,k) (k,*) (i,*) Fixed Row-wise Row-wise /* /* jki jki */ */ for for (j=; (j=; j<n; j<n; j++) j++) { for for (k=; (k=; k<n; k<n; k++) k++) { r = b[k][j]; for for (i=; (i=; i<n; i<n; i++) i++) c[i][j] += += a[i][k] * r; r; Misses per Inner Loop Iteration:... Inner loop: (*,k) (k,j) (*,j) Column - wise Fixed Columnwise

14 Matrix multiplication (kji) Summary of matrix multiplication /* /* kji kji */ */ for for (k=; (k=; k<n; k<n; k++) k++) { for for (j=; (j=; j<n; j<n; j++) j++) { r = b[k][j]; for for (i=; (i=; i<n; i<n; i++) i++) c[i][j] += += a[i][k] * r; r; Misses per Inner Loop Iteration:... Inner loop: (*,k) (k,j) (*,j) Fixed Columnwise Columnwise for (i=; i<n; i++) { for (j=; j<n; j++) { sum =.; for (k=; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; for (k=; k<n; k++) { for (i=; i<n; i++) { r = a[i][k]; for (j=; j<n; j++) c[i][j] += r * b[k][j]; for (j=; j<n; j++) { for (k=; k<n; k++) { r = b[k][j]; for (i=; i<n; i++) c[i][j] += a[i][k] * r; ijk (& jik): 2 loads, stores misses/iter =.25 kij (& ikj): 2 loads, store misses/iter =.5 jki (& kji): 2 loads, store misses/iter = 2. Pentium matrix multiply performance Miss rates are helpful but not perfect predictors. Cycles/iteration 6Code scheduling matters, too kji & jki kij & ikj jik & ijk kji jki kij ikj jik ijk Improing temporal locality by blocking Example: Blocked matrix multiplication Here, block does not mean. Instead, it mean a sub-block within the matrix. Example: N = 8; sub-block size = 4 A A 2 A 2 A 22 X B B 2 B 2 B 22 C = A B + A 2 B 2 C 2 = A B 2 + A 2 B 22 C 2 = A 2 B + A 22 B 2 C 22 = A 2 B 2 + A 22 B 22 = C C 2 C 2 C 22 Key idea: Sub-blocks (i.e., A xy ) can be treated just like scalars Array size (n)

15 Blocked matrix multiply (bijk) for (jj=; jj<n; jj+=bsize) { for (i=; i<n; i++) for (j=jj; j < min(jj+bsize,n); j++) c[i][j] =.; for (kk=; kk<n; kk+=bsize) { for (i=; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum =. for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; c[i][j] += sum; Blocked matrix multiply analysis Innermost loop pair multiplies a X bsize slier of A by a bsize X bsize block of B and accumulates into X bsize slier of C Loop oer i steps through n row sliers of A & C, using same B Innermost Loop Pair for (i=; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum =. for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; c[i][j] += sum; kk i kk jj row slier accessed Update successie bsize times block reused n elements of slier times in succession jj i Blocked matrix multiply performance Blocking (bijk and bikj) improes performance by a factor of two oer unblocked ersions (ijk and jik) Cycles/iteration relatiely insensitie to array size Array size (n) kji jki kij ikj jik ijk bijk (bsize = 25) bikj (bsize = 25) Concluding obserations Programmer can optimize for cache performance How data structures are organized How data are accessed Nested loop structure Blocking is a general technique All systems faor cache friendly code Getting absolute optimum performance is ery platform specific Cache sizes, line sizes, associatiities, etc. Can get most of the adane with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality)

Cache Memories. Cache Memories Oct. 10, Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory

Cache Memories. Cache Memories Oct. 10, Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory 5-23 The course that gies CMU its Zip! Topics Cache Memories Oct., 22! Generic cache memory organization! Direct mapped caches! Set associatie caches! Impact of caches on performance Cache Memories Cache