Sequential Optimization. Analysis Tools and Cache Optimization. Lab Course Efficient Programming of Multicore-Systems and Supercomputers


1 Sequential Optimization Analysis Tools and Cache Optimization Lab Course Efficient Programming of Multicore-Systems and Supercomputers

2 Outline Performance analysis tools/libraries Measurement strategies Examples: gprof, perf, likwid, PAPI, callgrind,... Cache optimization How do caches work? How to improve cache usage?

3 Performance Analysis Why? locate code regions where most time is spent and where optimizations are useful; check correctness of assumptions about runtime behavior; choose the best algorithm for a problem from alternatives; gain knowledge about unknown code

4 Performance Analysis (Cont'd) How? at the end of the (fully tested) implementation; on the compiler-optimized release version; with typical/representative input data. Steps of the optimization cycle: start; measure; locate the bottleneck; modify the code; check for improvement (runtime); if the improvement is satisfying, finish, otherwise measure again

5 Performance Analysis (Cont'd) bottlenecks (sequential): logical errors: functions called too often; algorithms with bad complexity or bad implementation; bad memory access behavior (bad layout, low access locality); lots of (conditional) jumps; lots of (unnecessary) data dependencies

6 Performance Measurement (Cont'd) [Diagram: the program is instrumented (measurement code plus meta-information); the instrumented program runs on hardware with event sensors, producing an event stream; the measurement output is post-processed (online) and visualized (online/offline).]

7 Measurement - Terms Trace: record of a time-stamped event stream (enter/leave of code regions, actions, ...); example: dynamic call tree; huge amount of data (linear in runtime); unneeded for sequential performance analysis

8 Measurement - Terms (Cont'd) Profiling (e.g. of time): record a summary over the execution; exclusive vs. inclusive cost/time, counters; example: the DCT compressed into a DCG (Dynamic Call Graph); amount of data linear in code size

9 Methods Precise Measurements: increment a counter (array) on each event; attribute counters to code (context, basic block sequence) or data (address, access pattern). Data reduction possibilities: selection (event type, code range, data); online processing (compression, ...). Needs instrumentation (measurement code)

10 Methods - Instrumentation Manual source instrumentation; library version with instrumentation; compiler; binary editing; runtime instrumentation; runtime injection

11 Methods (Cont'd) Statistical Measurement ("Sampling"): TBS (time based), EBS (event based). Assumption: the event distribution over the code is approximated by checking only every N-th event. Similar for iterative code: measure only every N-th iteration. Advantage: data reduction; tunable compromise between quality and overhead

12 Methods (Cont'd) Simulation: events for (non-existent) hardware models; results not influenced by the measurement; compromise between quality and slowdown: rough model = high discrepancy to reality, detailed model = best match to reality, but reality (the CPU) is often unknown; allows for architecture parameter studies

13 Hardware Support Monitor hardware: event sensors (in the CPU, on the board); event processing/collection/storing. Best: separate hardware; compromise: use the same resources after data reduction. Most CPUs nowadays include performance counters

14 Performance Counters Multiple event sensors: ALU utilisation, branch prediction, cache events (L1/L2/TLB), bus utilisation. Processing hardware: filter techniques (Itanium, Pentium-4): opcode, code/data range, thresholds (latency!). Counter registers: Nehalem/Sandy Bridge: 7 (3 fixed), Core2Duo: 5 (3 fixed), Itanium2: 4, Pentium-4: 18, Opteron: 8, Athlon: 4, Pentium-II/III/M: 2, Alpha 21164: 3

15 Performance Counters (Cont'd) Two uses: read: get a precise count of events in code regions via enter/leave instrumentation; interrupt on overflow: allows statistical sampling, the handler gets the process state and restarts the counter. Both can have overhead; results are often difficult to interpret

16 Performance Counters (Cont'd) Common misconceptions: don't take absolute values for granted! Instruction counts: including speculation? Miss counts: influenced by out-of-order execution, possibly including prefetch events; say nothing about stall time

17 Tools - Measurement Read hardware performance counters: specific: PerfCtr (x86), Pfmon (Itanium), perfex (SGI), DCPI (Alpha); portable: PAPI, PCL, perf (Linux). Statistical sampling: DCPI, Pfmon (Itanium), OProfile (Linux), perf (Linux), VTune (commercial - Intel), Prof/GProf (TBS). Instrumentation: GProf, Pixie (HP/SGI), Atom (Alpha), VTune (Intel), DynaProf (using DynInst), Valgrind/PIN (runtime instrumentation)
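Of the portable options, PAPI can be called directly from C. A minimal low-level sketch (assuming a PAPI installation; the event choice and the omitted error handling are illustrative):

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_INS);  /* retired instructions */
        PAPI_add_event(evset, PAPI_L1_DCM);   /* L1 data cache misses */

        PAPI_start(evset);
        /* ... code region to measure ... */
        PAPI_stop(evset, counts);

        printf("instructions: %lld, L1D misses: %lld\n", counts[0], counts[1]);
        return 0;
    }

Compile and link against the PAPI library (e.g. -lpapi); which preset events are available depends on the CPU.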

18 Tools Example (1) GProf (compiler-generated instrumentation): function entries increment a call counter for the (caller, callee) tuple; combined with time-based sampling. Compile with gcc -pg ... or icc -p; a run creates gmon.out; analysis with gprof ... Overhead sometimes around 100%! Available with GCC on UNIX

19 Tools Examples (2) Perf: accesses performance counters, sampling/reading: perf stat <program> / perf record <program> / perf report; on SuperMUC: old binary. Intel AmplifierXE (was VTune): profiling tool from Intel with GUI: module load amplifier_xe/2013; amplxe-gui; only does time sampling (performance counters need a kernel module...). Likwid: module load likwid; likwid-perfctr -g L3 -c 0 <program>

20 Tools Example (3) Callgrind (Linux/x86, GPL): cache simulator using Valgrind; builds up the dynamic call graph; runtime instrumentation. module load valgrind; valgrind --tool=callgrind --cache-sim=yes <program>; callgrind_annotate callgrind.out.xxx (GUI: kcachegrind)

21 The Memory Wall CPU peak performance (clock & cores): +40% / year; main memory performance: +7% / year: a growing gap. Access latency to main memory is today up to 300 cycles. Assume 2 Flops per clock tick: 600 Flops wasted while waiting for one main memory access!

22 The Memory Hierarchy motivation: technology & economy. Fast memory is expensive & small: SRAM: 6 transistors per cell; parallel access & wide access paths; needs a lot of power & die space; access time is related to size: smaller = faster. Slow memory is possible at much higher density: DRAM: 1 transistor + 1 capacitor per cell; serialized access. Discrepancy in design goals: the processor should be fast, the memory should be large; wish: a system with a fast processor & large memory; solution: a hierarchy of buffers

23 Solution: The Memory Hierarchy From the upper to the lower level, size and latency grow while bandwidth drops: registers (~300 B) - L1, a fast buffer (32 KB, on-chip) - LLC (last level cache), a slower buffer (~4 MB, on-chip) - CPU-local main memory (~4 GB, off-chip) - remote main memory attached to other CPUs (~4 GB) - even more remote memory on I/O devices, ... (~1 TB). [Table: the per-level latency and bandwidth figures are not recoverable from the transcription.]

24 Namespaces in Memory Hierarchy easiest for programming? Option 1: every memory hierarchy level has a separate namespace, with explicit copy operations to move data between levels. Option 2: one large address namespace for all levels; all but the bottom/largest level can only contain subsets; transparent use of the higher levels: they work as caches; addresses are keys into an associative data array; cost: time to find data + storage for the keys (= tags); cached entries can become invalid: coherency protocol. Best use of the smaller buffers? eviction/replacement policy. In practice a mix: registers (we need very fast access): no transparency needed, allocated by the compiler; L1 (apart from registers the smallest/fastest buffer) down to the LLC: transparent, using the same namespace as memory

25 Driving Goal: Performance How to get good performance? fast processing: high clock rate, parallelism; fast data access: processing happens on data; memory performance: latency & bandwidth. We want locality: with data stored where the processing is done, fast access & high bandwidth are easy and cheap! Depends on the algorithm: better not to move data

26 Importance of fast Memory Accesses applications very often do memory accesses; exploitation of processor caches is crucial. Example: 3 GHz clock rate, 2 sockets (NUMA): latency to the remote socket ~100 ns = 300 clock ticks; bandwidth (1 core) ~15 GB/s; compare to L1 access: latency 2-3 ticks, bandwidth ~150 GB/s. Bad memory performance can easily dominate performance (better memory performance also speeds up parallel code). Further issue: contention: if accesses are served from private levels (e.g. L1), performance scales; memory modules are shared by all cores & sockets

27 Processor Caches Why are caches effective? Principle of Locality: repeated accesses to the same memory cells (temporal locality): good to keep recently accessed data in the cache; accesses to memory cells near recent accesses (spatial locality): good to work on blocks of nearside data (cache lines). Locality depends on the application

28 Cache Utilization effective only when data is found (cache hit); otherwise fetch from the next level (cache miss); hit ratio h: probability of a cache hit. Example (1 cache level + memory): t_eff: average resulting access time (= access latency); t_cache: access time of the cache; t_mem: access time of memory: t_eff = h * t_cache + (1 - h) * t_mem. For instance (illustrative numbers), with h = 0.95, t_cache = 3 cycles and t_mem = 300 cycles: t_eff = 0.95 * 3 + 0.05 * 300 ≈ 17.9 cycles: a 5% miss ratio already dominates the average

29 Design Options how to find cached data fast? full associativity: copies can reside in every cache line; lots of comparisons needed: very expensive (power/space/time); used for very small caches, e.g. the TLB. direct mapped: one possible place for a memory block; address bits index into the cache lines; no eviction policy needed; very simple/power-efficient/fast (used in microcontrollers); high conflict potential: reduces effectiveness. m-way set-associativity: m possible places per address; tradeoff: reduced conflicts, less expensive; often: L1 4/8-way, L3 up to 24-way (less latency sensitive)

30 Full Associative Cache [Diagram: the address splits into tag, word offset within line, and byte offset in word; the tag is compared (=) in parallel against every line's tag (with valid bit V); a multiplexer puts the selected data on the data bus.] Example: 64-byte line, 32-bit architecture: 2-bit byte offset, 4-bit word offset, 26-bit tags

31 Direct Mapped Cache Index: use the lower bits! [Diagram: the address splits into tag, index, word offset within line, and byte offset in word; a decoder uses the index to select one cache line; its tag is compared (=) with the address tag and combined (&) with the valid bit V to signal a hit; a multiplexer puts the selected word on the data bus.] Example: 64-byte line, 256 lines (16 KB): 4+2-bit offset, 8-bit index, 18-bit tags
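To make the address split concrete, a small sketch that decomposes a 32-bit address for the direct-mapped example above (64-byte lines, 256 lines; the constants follow the slide, the names are mine):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS  6   /* 64-byte line: 4-bit word + 2-bit byte offset */
    #define INDEX_BITS 8   /* 256 lines */

    int main(void)
    {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << LINE_BITS) - 1);                /* bits 0..5   */
        uint32_t index  = (addr >> LINE_BITS) & ((1u << INDEX_BITS) - 1); /* bits 6..13  */
        uint32_t tag    = addr >> (LINE_BITS + INDEX_BITS);               /* bits 14..31 */

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }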

32 m-way Associativity [Diagram: the address splits into tag, index, word offset within line, and byte offset in word; the decoder's index selects a cache set with m lines; the address tag is compared (=) against all m tags of the set in parallel.] Example: 64-byte line, 384 lines (24 KB), 6-way: 64 sets, 4+2-bit offset, 6-bit index, 20-bit tags

33 Cache Hierarchy: Design Alternatives inclusive cache hierarchy: all data in L1 is also in L2 (and also in L3); automatically ensured (assuming the same line sizes) if assoc(Ln) * (#Ln caches) <= assoc(Ln+1); easier management: to invalidate from outside, only the larger level has to be checked; Intel (L2 & L3 usually without strict inclusion: why?). exclusive cache hierarchy: the same data exists only once in the hierarchy: memory fills L1; if L1 is full, data is evicted to L2 (L2 to L3); (victim cache: temporarily takes cache data on overflow); AMD. on-chip: wide paths are easy, high bandwidth; off-chip: allows larger caches, sometimes done; tags usually stay on-chip (for fast checks)

34 Cache Hierarchy: Design Alternatives Write policies: on a hit: also update the lower levels? write through: yes, directly on update; write back: mark as modified (dirty), write on eviction (reduces write bandwidth needs!). On a miss: allocate a cache line? (write-allocate policy) no: write directly to memory (high bandwidth needs!); yes: read the memory block first (more data to transfer!). Write buffers: queuing of writes to lower levels, done in the background; increases parallelism within the cache hierarchy; loads check the write buffers, allowing forwarding

35 Eviction/Replacement Policies A new line is to be allocated in a set, but the set is full: which existing line should be evicted to make room? Best strategy? OPT: we know the future! Evict the line which is accessed again farthest in the future; an upper bound to evaluate other policies against; also known as Belady's algorithm (IBM, 1960s)

36 Eviction Policies LRU: evict the least recently used line; approximates OPT if the locality of future accesses matches the locality of recent accesses. LFU: evict the least frequently used line; captures more history than only the recent accesses (as LRU does), but experiments show that LRU is better. Others: RANDOM: evict a random line; FIFO: evict in order of allocation
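A minimal sketch of LRU bookkeeping for a single 4-way set, to show what the policy has to track per access (illustrative only; real hardware typically uses cheaper approximations such as pseudo-LRU):

    #include <stdint.h>

    #define WAYS 4

    typedef struct {
        uint32_t tag[WAYS];
        int      valid[WAYS];
        int      age[WAYS];   /* 0 = most recently used, larger = older */
    } set_t;

    /* Look up a tag in the set; returns 1 on hit, 0 on miss (evicting LRU). */
    int set_access(set_t *s, uint32_t tag)
    {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (s->valid[w] && s->tag[w] == tag) {  /* hit: make w youngest */
                for (int v = 0; v < WAYS; v++)
                    if (s->age[v] < s->age[w])
                        s->age[v]++;
                s->age[w] = 0;
                return 1;
            }
            if (s->age[w] > s->age[victim])
                victim = w;                          /* remember oldest way */
        }
        for (int v = 0; v < WAYS; v++)               /* miss: everyone ages, */
            s->age[v]++;
        s->tag[victim] = tag;                        /* LRU way is replaced  */
        s->valid[victim] = 1;
        s->age[victim] = 0;
        return 0;
    }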

37 Techniques for Processor Caches Most important: non-blocking (asynchronous) cache: allows overlapping of multiple accesses; non-blocking only for hits vs. lockup-free cache: non-blocking also for misses (table with outstanding loads), needs non-blocking support from the lower levels too; pipelining of cache accesses. Multi-level (hierarchy): smaller caches are faster and use less power; if inclusive, lower caches should be significantly larger; lower caches are accessed less frequently and thus can be slower (higher latency, lower bandwidth). Hardware prefetcher: predicts future accesses

38 Hardware Prefetching: Motivation Example: sequential memory accesses in a loop (8-byte type); the cache accepts multiple outstanding loads; the out-of-order pipeline allows 64 outstanding loads; with 64-byte cache lines we are always waiting for 8 lines. Enough to utilize memory performance? Memory performance: 70 ns latency, at 3 GHz: 210 cycles; 2x DDR3-1600: ~25 GB/s. 8 cache lines every ~200 cycles: 512 bytes / 70 ns ≈ 7.3 GB/s. And this is an ideal case with no dependencies! Pointer chasing through a linked list: 1 cache line per ~200 cycles: < 1 GB/s. We need prefetching (load before use) to exploit memory

39 Hardware Prefetching: Motivation In software: by the programmer or the compiler: regular loads (waste a register) / separate prefetch instructions; these very often trigger much more prefetching than needed: one would have to estimate what already is in the cache (assume LRU...); prefetching only once per cache line complicates the code and is often not possible: one prefetch per data item; prefetches consume space in the instruction sequence: larger code, higher IF/ID bandwidth needed: slowdown; best prefetch distance? In hardware: caches observe the memory access behavior, predict future memory accesses, and trigger prefetching
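For comparison with the hardware prefetchers described next, a software prefetch with the GCC/Clang builtin could look like this; the distance of 16 elements (two cache lines ahead) is a guessed value, illustrating the "best prefetch distance?" question:

    /* Sum an array while prefetching ahead (GCC/Clang builtin;
     * arguments: address, 0 = prefetch for read, 3 = high temporal locality). */
    double sum_prefetched(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16], 0, 3);
            s += a[i];
        }
        return s;
    }

Prefetches of out-of-range addresses do not fault, which is why the loop does not clamp the prefetch index.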

40 Hardware Prefetching What is a good prefetcher? accuracy: is all data prefetched / is all prefetched data used? timeliness: early enough / not overwriting data still needed? Types: prefetch the adjacent memory block: illusion of a doubled cache line size for loads; stream prefetchers: sequential access, up/down; stride prefetchers: series with a fixed address difference (stream prefetching is a special case); address correlation prefetchers: remember arbitrary miss sequences: pointer chasing! Very effective with large tables, but found only in research papers

41 The Principle of Locality is not enough... [Figure: reasons for performance loss for SPEC2000; Beyls/Hollander, ICCS 2004]

42 Cache Optimizations use the fastest algorithm (bad complexity kills); use existing implementations (Intel MKL, ...); compiler options (-O3, -march=native); reduce aliasing problems (restrict keyword). Only optimize after profiling: useless optimization is a waste of time. Changes for increased performance usually increase complexity and reduce maintainability/portability: hide the optimization behind interfaces. The tips on the following slides work best with C / C++ / Fortran and may also work with Java / JavaScript; data layout cannot be influenced in Java (in JavaScript: use typed arrays)
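A small illustration of the restrict hint (the function and its names are mine): without restrict the compiler must assume dst and src may overlap and can neither keep values in registers nor vectorize freely:

    /* C99: the compiler may assume dst and src do not alias. */
    void scale(double *restrict dst, const double *restrict src,
               double w, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = w * src[i];
    }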

43 Cache Optimization: Reduce Accesses use large data types (done by the compiler?): vectors instead of bytes. 1 cache line = 1 access: use full cache lines: hash table entries can be 64 bytes (if the table does not fit in L1/L2); hierarchical data structures; alignment: crossing a cache line boundary costs two accesses. (Redundant) calculation instead of memory accesses. Avoid unneeded writes: check whether a variable already has the given value before writing it; writes result in higher bandwidth needs
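A tiny sketch of the "check before writing" tip (the example is mine): the test keeps cache lines that already hold the target value clean, so they never have to be written back:

    /* Set flags for a list of nodes; skip stores that would not
     * change anything, avoiding dirty cache lines. */
    void mark(char *flags, const int *nodes, int n)
    {
        for (int i = 0; i < n; i++)
            if (flags[nodes[i]] != 1)
                flags[nodes[i]] = 1;
    }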

44 Cache Optimization: Reorder Accesses if possible, do sequential accesses: exploit the full cache line, trigger the hardware prefetcher; best up to page size, but stay in L1; small sequential accesses reduce the accuracy of the prefetcher. Blocking: reuse data as much as possible: instead of multiple large sweeps over a large buffer, split them into multiple small sweeps over buffer parts. Recursion: block bisection / space-filling curves: best use of multiple cache levels without added complexity ("cache-oblivious")

45 Cache Optimization: Reorder Accesses [Figure: blocking makes arrays fit into a cache; recursion follows a Morton (Peano, Hilbert, ...) space-filling curve.]

46 Cache Optimization: Improve Layout group data with the same access frequency and access type (read vs. write): writes always trigger a read first. Avoid holes: if possible, use every byte of a fetched cache line (unused data wastes space + bandwidth). Reorder data in memory according to the traversal order of the program: may also be useful for trees and linked lists: incrementally, in small steps, while accessing the data. Go from an array of structs to a struct of arrays (AoS-to-SoA conversion; see the sketch below). To avoid power-of-2 strides (which easily produce conflict misses), use padding (artificial holes in the data structure). No need to complicate the code: hide the layout behind an interface
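A sketch of the AoS-to-SoA conversion (the types and names are mine): if a hot loop reads only x, the SoA layout makes the x values contiguous, so every fetched cache line is fully used:

    #define N 10000

    /* Array of structs: a sweep over x also drags y and z
     * through the cache. */
    struct particle_aos { double x, y, z; };

    /* Struct of arrays: each field is contiguous in memory. */
    struct particle_soa { double x[N], y[N], z[N]; };

    double sum_x(const struct particle_soa *p)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += p->x[i];   /* sequential, fully packed accesses */
        return s;
    }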

47 Cache Optimization

48 Basic efficiency guidelines Choose the best algorithm. Use efficient libraries. Find a good compiler and options, give hints: -O3, -fno-alias, the restrict C99 keyword, ... Use a suitable data layout. Restructure code

49 Use suitable data layout Sequential access is fastest! E.g. in C/C++, for a 2D matrix double a[n][m]; the loops should be nested such that for (i...) for (j...) a[i][j]... with the contiguous index j innermost. Apply loop interchange, if necessary, see below
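A sketch of both loop orders (C99 variable-length array parameters; the names are mine): C stores a[i][j] row-major, so the j-inner version walks memory sequentially while the interchanged version jumps m doubles per access:

    /* Good: innermost loop over the contiguous index j. */
    double sum_rowwise(int n, int m, double a[n][m])
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                s += a[i][j];
        return s;
    }

    /* Bad: stride of m doubles between consecutive accesses;
     * for large m, one cache line is fetched per element. */
    double sum_columnwise(int n, int m, double a[n][m])
    {
        double s = 0.0;
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++)
                s += a[i][j];
        return s;
    }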

50 Data layout optimizations: Array padding Idea: allocate arrays larger than necessary to change the relative memory distances and avoid severe cache thrashing effects. Example: replace double u[1024][1024] by double u[1024][1024+pad]. How to choose pad?
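A sketch of the padding idea (the PAD value is illustrative): a row of 1024 doubles is exactly 8 KB, a power-of-2 stride, so walking a column maps every element to the same few cache sets; a small pad shifts each row's mapping:

    #define N   1024
    #define PAD 8   /* illustrative; choose so the row stride is no longer
                       a multiple of (cache size / associativity) */

    static double u[N][N + PAD];   /* padded: column walks now hit
                                      different cache sets per row */

    double sum_column(int j)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += u[i][j];
        return s;
    }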

51 Loop optimizations Loop unrolling, loop fusion, loop split, loop blocking, etc.

52 Data access optimizations: Loop blocking Loop blocking = loop tiling: divide the data set into subsets (blocks) which are small enough to fit in the cache; perform as much work as possible on the data in the cache before moving to the next block (see the sketch below). This is not always easy to accomplish because of data dependencies
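A sketch of loop blocking for a matrix transpose (the block size B is an assumption, to be tuned so two B x B tiles of doubles fit into the target cache level):

    #define B 64   /* tile edge; tune to the cache size */

    static int min(int a, int b) { return a < b ? a : b; }

    /* Blocked transpose: both matrices are touched tile by tile,
     * so each fetched cache line is fully reused within a tile. */
    void transpose_blocked(int n, double *out, const double *in)
    {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < min(ii + B, n); i++)
                    for (int j = jj; j < min(jj + B, n); j++)
                        out[j * n + i] = in[i * n + j];
    }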

53 Data access optimizations: Interleaving of iterations Start the next iteration while the previous one is still being calculated. Aim: reuse data in the cache as often as possible. Also known as time skewing

54 Example: Stencil, 2D Jacobi iteratively solve a PDE (partial differential equation), discretized regularly in 2D, double precision: for (iter=0; iter<itermax; iter++) for (i=1; i<n-1; i++) for (j=1; j<n-1; j++) anew[i][j] = (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]) / 4; Each update averages the 4 neighbors: 4 Flops/update; bandwidth requirement to memory: 16 bytes/update. A few matrix rows fit into the cache, the whole matrix does not: one read + one write to memory per update, everything else comes from the cache. Memory-bound: 2 cores already occupy the bus

55 Example: Stencil, 2D Jacobi base parallel version: geometric 2D decomposition; synchronization needed for accessing partition boundaries (1 layer). [Diagram: cores above a shared cache and memory.] Memory-bound: 2 cores already occupy the bus; on SuperMUC: 7 GFlop/s per node (2 sockets)

56 2D Jacobi: Cache Optimizations Spatial blocking of n iterations: n layers to the neighbor. [Diagram: cores above a shared cache and memory.] Recursive splitting of blocks for one core until the blocks fit into the cache; update of the inner borders; reduced bandwidth to memory: better scalability

57 2D Jacobi: Cache Optimizations Wavefront: similar to blocking, uses the shared cache. [Diagram: cores C1-C3 sweeping the matrix in a staggered wavefront above memory.] The matrix gets streamed through the cores; within a multicore chip this may be combined with blocking; allows larger blocks and fewer border updates. On SuperMUC: > 10x improvement (!!). For details: lab course in SS14
