Lecture 14: Cache Memories. Inserting an L1 Cache Between the CPU and Main Memory. General Organization of a Cache Memory


Lecture 14: Cache Memories

Topics
  Generic cache memory organization
  Direct-mapped caches
  Set associative caches
  Impact of caches on performance

Cache Memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware. They hold frequently accessed blocks of main memory. The CPU looks first for data in L1, then in L2, then in main memory. Typical bus structure: the CPU chip holds the register file, the ALU, and the L1 cache; a cache bus connects to the L2 cache; the bus interface connects via the system bus to an I/O bridge, which connects via the memory bus to main memory.

F4, Datorarkitektur 2008

Inserting an L1 Cache Between the CPU and Main Memory

The transfer unit between the CPU register file and the cache is a 4-byte block (one word). The transfer unit between the cache and main memory is a 4-word block (16 bytes). The tiny, very fast CPU register file has room for four 4-byte words. The small, fast L1 cache has room for two 4-word blocks (lines 0 and 1). The big, slow main memory has room for many 4-word blocks (block 0: a b c d, block 1: p q r s, ..., w x y z, ...).

General Organization of a Cache Memory

A cache is an array of S = 2^s sets. Each set contains E lines. Each line holds a block of B = 2^b data bytes, plus one valid bit and a t-bit tag per line. Cache size: C = B x E x S data bytes.

Addressing Caches

An m-bit address A is divided into three fields: t tag bits, s set index bits, and b block offset bits, i.e. <tag> <set index> <block offset>. The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>. The word contents begin at offset <block offset> bytes from the beginning of the block.

Direct-Mapped Cache

The simplest kind of cache, characterized by exactly one line per set (E = 1 line per set).

Accessing Direct-Mapped Caches

Set selection: use the set index bits to determine the set of interest.

Line matching and word selection: find a line in the selected set with a matching tag, then extract the word:
(1) the valid bit must be set;
(2) the tag bits in the cache line must match the tag bits in the address;
(3) if (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte within the block (e.g., bytes 0-7 holding words w0, w1, w2, w3).

Direct-Mapped Cache Simulation

Example: M = 16 byte addresses (m = 4 bits), B = 2 bytes/block, S = 4 sets, E = 1 entry/set, so t = 1, s = 2, b = 1.

Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

  0  [0000]  miss: set 0 loads block M[0-1]
  1  [0001]  hit:  M[0-1] is already cached in set 0
  13 [1101]  miss: set 2 loads block M[12-13]
  8  [1000]  miss: set 0 replaces M[0-1] with M[8-9] (same set, different tag)
  0  [0000]  miss: set 0 replaces M[8-9] with M[0-1] again

Why Use Middle Bits as Index?

Consider a 4-line cache. High-order bit indexing: adjacent memory lines would map to the same cache entry, making poor use of spatial locality. Middle-order bit indexing: consecutive memory lines map to different cache lines, so the cache can hold a C-byte region of the address space at one time.

Set Associative Caches

Characterized by more than one line per set, e.g. E = 2 lines per set.

Accessing Set Associative Caches

Set selection is identical to a direct-mapped cache: the set index bits determine the selected set.

Accessing Set Associative Caches (cont.)

Line matching and word selection: the hardware must compare the tag in each line in the selected set:
(1) the valid bit must be set;
(2) the tag bits in one of the cache lines must match the tag bits in the address;
(3) if (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte (words w0-w3 in bytes 0-7).

Multi-Level Caches

Options: separate data and instruction caches, or a unified cache.

Hierarchy: processor registers -> L1 d-cache and L1 i-cache -> unified L2 cache -> main memory -> disk (larger, slower, and cheaper at each level down).

              Regs     L1         L2 (SRAM)   Memory (DRAM)   Disk
  size:       200 B    8-64 KB    1-4 MB      128 MB          30 GB
  speed:      3 ns     3 ns       6 ns        60 ns           8 ms
  $/Mbyte:                        $100/MB     $1.50/MB        $0.05/MB
  line size:  8 B      32 B       32 B        8 KB

Intel Pentium Cache Hierarchy

  L1 data:        16 KB, 4-way associative, write-through, 32 B lines, 1 cycle latency
  L1 instruction: 16 KB, 4-way associative, 32 B lines
  L2 unified:     128 KB-2 MB, 4-way associative, write-back, write allocate, 32 B lines
  Main memory:    up to 4 GB

Cache Performance Metrics

Miss rate: fraction of memory references not found in the cache (misses/references). Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit time: time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache). Typical numbers: 1 clock cycle for L1; 3-8 clock cycles for L2.

Miss penalty: additional time required because of a miss. Typically 25-100 cycles for main memory.

Writing Cache Friendly Code

Repeated references to variables are good (temporal locality). Stride-1 reference patterns are good (spatial locality). The examples below assume a cold cache, 4-byte words, and 4-word cache blocks.

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 1/4 = 25%

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 100%

The Memory Mountain

Read throughput (read bandwidth): the number of bytes read from memory per second (MB/s). The memory mountain: measured read throughput as a function of spatial and temporal locality; a compact way to characterize memory system performance.

Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
        return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
    }

Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    #define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
    #define MAXBYTES (1 << 23)  /*  ... up to 8 MB */
    #define MAXSTRIDE 16        /* Strides range from 1 to 16 */
    #define MAXELEMS MAXBYTES/sizeof(int)

    int data[MAXELEMS];         /* The array we'll be traversing */

    int main()
    {
        int size;               /* Working set size (in bytes) */
        int stride;             /* Stride (in array elements) */
        double Mhz;             /* Clock frequency */

        init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
        Mhz = mhz(0);               /* Estimate the clock frequency */
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run(size, stride, Mhz));
            printf("\n");
        }
        exit(0);
    }

The Memory Mountain

[Figure: read throughput (MB/s, roughly 0-1200) as a function of stride (s1-s16 words) and working set size (8 MB down to 1 KB), measured on a Pentium III Xeon, 550 MHz, with a 16 KB on-chip L1 d-cache, a 16 KB on-chip L1 i-cache, and a 512 KB off-chip unified L2 cache. Slopes of spatial locality run along the stride axis; ridges of temporal locality run along the working-set axis.]

Ridges of Temporal Locality

A slice through the memory mountain with stride = 1 illuminates the read throughputs of the different caches and of memory: a main memory region for large working sets (roughly 8 MB-2 MB), an L2 cache region in between (roughly 1 MB-64 KB), and an L1 cache region for small working sets (roughly 16 KB-1 KB).

A Slope of Spatial Locality

A slice through the memory mountain with size = 256 KB shows the effect of the cache block size: read throughput falls as the stride grows from s1 to s16, flattening once every read causes one access per cache line.

Matrix Multiplication Example

Major cache effects to consider: total cache size (exploit temporal locality and keep the working set small, e.g., by using blocking) and block size (exploit spatial locality).

Description: multiply N x N matrices; O(N^3) total operations; N reads per source element; N values summed per destination, but these may be held in a register.

    /* ijk */
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;                     /* variable sum held in register */
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Miss Rate Analysis for Matrix Multiply

Assume: line size = 32 B (big enough for 4 64-bit words); the matrix dimension N is very large (approximate 1/N as 0.0); the cache is not even big enough to hold multiple rows. Analysis method: look at the access pattern of the inner loop (row i of A, column j of B, element (i,j) of C).

Layout of C Arrays in Memory (review)

C arrays are allocated in row-major order: each row occupies contiguous memory locations.

Stepping through columns in one row:

    for (i = 0; i < N; i++)
        sum += a[0][i];

accesses successive elements; if the block size B > 4 bytes, this exploits spatial locality; compulsory miss rate = 4 bytes / B.

Stepping through rows in one column:

    for (i = 0; i < n; i++)
        sum += a[i][0];

accesses distant elements; no spatial locality; compulsory miss rate = 1 (i.e. 100%).

Matrix Multiplication (ijk)

    /* ijk */
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed.
Misses per inner loop iteration: A 0.25, B 1.0, C 0.0.

Matrix Multiplication (jik)

    /* jik */
    for (j=0; j<n; j++) {
        for (i=0; i<n; i++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Same access pattern as ijk: A row-wise, B column-wise, C fixed.
Misses per inner loop iteration: A 0.25, B 1.0, C 0.0.

Matrix Multiplication (kij)

    /* kij */
    for (k=0; k<n; k++) {
        for (i=0; i<n; i++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise.
Misses per inner loop iteration: A 0.0, B 0.25, C 0.25.

Matrix Multiplication (ikj)

    /* ikj */
    for (i=0; i<n; i++) {
        for (k=0; k<n; k++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Same access pattern as kij: A fixed, B row-wise, C row-wise.
Misses per inner loop iteration: A 0.0, B 0.25, C 0.25.

Matrix Multiplication (jki)

    /* jki */
    for (j=0; j<n; j++) {
        for (k=0; k<n; k++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise.
Misses per inner loop iteration: A 1.0, B 0.0, C 1.0.

Matrix Multiplication (kji)

    /* kji */
    for (k=0; k<n; k++) {
        for (j=0; j<n; j++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Same access pattern as jki: A column-wise, B fixed, C column-wise.
Misses per inner loop iteration: A 1.0, B 0.0, C 1.0.

Summary of Matrix Multiplication

  ijk (& jik):  2 loads, 0 stores; misses/iter = 1.25
  kij (& ikj):  2 loads, 1 store;  misses/iter = 0.5
  jki (& kji):  2 loads, 1 store;  misses/iter = 2.0

Pentium Matrix Multiply Performance

Miss rates are helpful but not perfect predictors; code scheduling matters, too.

[Figure: cycles/iteration (0-60) versus array size n (25-400) for the six versions kji, jki, kij, ikj, jik, ijk.]

Improving Temporal Locality by Blocking

Example: blocked matrix multiplication. "Block" in this context does not mean "cache block"; instead, it means a sub-block within the matrix. Example: N = 8, sub-block size = 4:

  A11 A12     B11 B12     C11 C12
  A21 A22  X  B21 B22  =  C21 C22

Key idea: sub-blocks (i.e., A_xy) can be treated just like scalars:

  C11 = A11 B11 + A12 B21     C12 = A11 B12 + A12 B22
  C21 = A21 B11 + A22 B21     C22 = A21 B12 + A22 B22

Blocked Matrix Multiply (bijk)

    for (jj=0; jj<n; jj+=bsize) {
        for (i=0; i<n; i++)
            for (j=jj; j < min(jj+bsize,n); j++)
                c[i][j] = 0.0;
        for (kk=0; kk<n; kk+=bsize) {
            for (i=0; i<n; i++) {
                for (j=jj; j < min(jj+bsize,n); j++) {
                    sum = 0.0;
                    for (k=kk; k < min(kk+bsize,n); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
            }
        }
    }

Blocked Matrix Multiply Analysis

The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates the result into a 1 x bsize sliver of C. The loop over i steps through n row slivers of A and C, using the same block of B: each row sliver is accessed bsize times, the B block is reused n times in succession, and successive elements of the C sliver are updated.

    for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
            sum = 0.0;
            for (k=kk; k < min(kk+bsize,n); k++)
                sum += a[i][k] * b[k][j];
            c[i][j] += sum;
        }
    }

Pentium Blocked Matrix Multiply Performance

Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik) and is relatively insensitive to array size.

[Figure: cycles/iteration (0-60) versus array size n (25-400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]

Concluding Observations

The programmer can optimize for cache performance: how data structures are organized; how data are accessed (nested loop structure); blocking is a general technique.

All systems favor cache-friendly code. Getting absolute optimum performance is very platform-specific (cache sizes, line sizes, associativities, etc.), but you can get most of the advantage with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality).