CACHE AWARENESS

Effect of memory latency

Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume the processor has two ALUs and can execute two instructions in each 1 ns cycle, so its peak rating is 2 GFLOPS. Since the memory latency equals 100 cycles, every time a memory request is made the processor must wait 100 cycles before it can process the data.

Consider the problem of computing the dot-product of two vectors. A dot-product performs one multiply-add on each pair of vector elements, i.e., each floating-point operation requires one data fetch. It is easy to see that the peak speed of this computation is limited to one floating-point operation every 100 ns (one multiply-add every 200 ns), that is 10 MFLOPS.
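As a concrete reference, the kernel below is a minimal C sketch of this dot-product (the names dot, a, b and n are illustrative, not from the slides): with no cache, each of the 2n operand loads stalls for the full 100 ns DRAM latency, which is what caps the rate at 10 MFLOPS.

    /* One multiply-add per pair of elements: two FLOPs, two memory fetches. */
    double dot(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];   /* a[i] and b[i] each come from DRAM if uncached */
        return sum;
    }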

Cache Memory

[Diagram: processor <-> cache <-> memory; the processor reads single words from the cache, the cache exchanges blocks/lines with memory]

A cache is a smaller and faster memory placed between the processor and the DRAM. Data needed by the processor is first fetched into the cache; all subsequent accesses to data items residing in the cache are serviced by the cache. Performance improves in the presence of high locality.

Definition (cache hit ratio): the fraction of memory references that are resolved by the cache.
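As a quick aside (a standard formula, not stated on the slide, assuming a miss costs one full memory access of latency t_mem): with hit ratio h, the average access time is

    \[ t_{avg} = h \cdot t_{cache} + (1 - h) \cdot t_{mem} \]

For example, h = 0.9, t_cache = 1 ns and t_mem = 100 ns give t_avg = 10.9 ns, so even a high hit ratio leaves the average access time far from the cache latency.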

With Cache

Consider the same processor: 1 GHz (1 ns clock), DRAM latency of 100 ns, two ALUs executing two instructions per 1 ns cycle, hence a peak rating of 2 GFLOPS. Suppose we add a 32 KB cache with a latency of 1 ns per word, and we must multiply two matrices A and B of size 32*32. N.B.: A, B and A*B all fit in the cache.

The time needed to load A and B into the cache is 32*32*2 words * 100 ns ≈ 205 µs. Multiplying two n*n matrices takes n³ multiply-adds, i.e. 2n³ floating-point operations; in our case 2*32³ ≈ 66K FLOPs, which take about 33 µs at the 2 GFLOPS peak. The total time is therefore about 205 + 33 = 238 µs, and the throughput is 66K / 238 µs ≈ 275 MFLOPS (> 10 MFLOPS) (< 2 GFLOPS).

Locality: n³ operations on n² memory locations!
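The same accounting, written as a formula (a sketch; w is the number of words per cache line, t_mem the per-word DRAM latency and r_peak the peak rate, all taken from the example above):

    \[ T_{total} = \frac{2n^2}{w}\, t_{mem} + \frac{2n^3}{r_{peak}}, \qquad \text{throughput} = \frac{2n^3}{T_{total}} \]

With n = 32, w = 1, t_mem = 100 ns and r_peak = 2 GFLOPS this gives about 204.8 µs + 32.8 µs ≈ 238 µs and ≈ 275 MFLOPS; with w = 4 the load term drops to about 51 µs, which is the bandwidth effect discussed next.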

Effect of memory bandwidth

Consider the previous example. If a cache block holds a single word, loading the two matrices into the cache takes 32*32*2 * 100 ns ≈ 205 µs. If a cache line holds four words, it takes 32*32*2/4 * 100 ns ≈ 51 µs. The total time becomes 51 + 33 = 84 µs, and the throughput rises to 66K / 84 µs ≈ 780 MFLOPS (> 275 MFLOPS) (> 10 MFLOPS) (< 2 GFLOPS).

Warning! We are assuming the data is laid out linearly in memory, so that all the words of each fetched line are actually used.

Other approaches for hiding memory latency

Multi-threading: split the problem into multiple sub-problems and run an independent thread for each of them. When a thread is stalled on a miss, another thread can execute computational tasks.

Pre-fetching: issue load operations in advance, so that data is already available when it is needed.

Drawback: both techniques increase the pressure on memory bandwidth and can pollute the cache.

Cache-to-Memory coherence

[Diagram: processor -> cache -> memory, with a write propagating from the cache to memory]

After updating/writing data in the cache, when should it be written to memory? Other devices might read the same data from memory.

Write-through policy: data is immediately written to memory; every write pays the memory write delay (~100 cycles).

Write-back policy: memory is updated only when the line is evicted; fewer memory operations, especially in case of re-use/locality.

On a write to data that is not in the cache:
- Write-allocate: first load the line into the cache, then update the cache (only).
- Write no-allocate: write (stream) directly to memory, without loading the line into the cache.

Cache-to-Cache Coherence in Symmetric Multi-Processors

[Diagram: processors P1..Pn, each with a private cache ($), connected through a shared bus to memory and I/O devices; P1 and P3 hold u = 5 in their caches, P3 then writes u = 7 while memory still contains u = 5]

After event 3 (the write by P3), the processors see different values of u.

Snooping protocols (most of them) assume a write-through policy and a shared communication channel among processors/caches. The cache controller snoops all the bus transactions; a transaction is relevant for a cache if the referenced data line (uniquely identified by its block address) is present in that cache. The possible actions to guarantee cache coherence are:
- Invalidate: the cache line is invalidated and must be re-loaded from memory before the next access (write-through guarantees correctness).
- Update: the cache line is updated with the new value.

Note: there are also strategies for non-write-through policies. Modern processors use point-to-point links among (multi-core) CPUs rather than a shared bus.

Cache Invalidate vs. Cache Update

Update
- Cons: can waste bus bandwidth unnecessarily, e.g. when a cache block is updated remotely but is no longer read by the associated processor, or when subsequent writes by the same processor cause multiple updates.
- Pros: multiple R/W copies are kept coherent after each write, which avoids misses on each subsequent read access (thus saving bus bandwidth).

Invalidate
- Pros: multiple writes by the same processor do not cause any additional bus traffic.
- Cons: an access that follows an invalidation causes a miss.

The pros and cons of the two approaches depend on the application and its read/write patterns.

False Sharing

Coherence protocols work in terms of cache blocks/lines rather than single words/bytes, so the block size plays an important role in the coherence protocol:
- with small blocks the protocol is more efficient (less data to transmit on an update or flush);
- large blocks are better for spatial locality.

What happens if multiple processors access the same cache block?

False Sharing

[Diagram: P1 and P2 each hold a copy of the same cache line containing the words 13, 14, 17, 18; P2 writes one word (18 -> 23) and the coherence protocol must update P1's copy of the whole line]

Consider two unrelated variables (e.g., variables that are logically private to distinct threads) that are allocated in the same block. Write accesses to that block by the different processors running the threads are treated as conflicts by the coherence protocol, even though the two processors access disjoint words of the same block.

It is therefore necessary to place related variables on the same block, e.g. all the variables that are logically private to a given thread. This can be achieved by compilation techniques or by the programmer.
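The C sketch below illustrates the situation (the struct, the thread functions and the 64-byte line size are assumptions for illustration, not part of the slides): two threads increment logically private counters that happen to share a cache line.

    #include <pthread.h>

    /* Two logically private counters that end up in the same cache block.
       Every increment by one thread forces the coherence protocol to
       invalidate/update the other core's copy of the line, even though the
       threads touch disjoint words (false sharing).
       Fix: separate the fields by at least one cache line, e.g. insert
       "char pad[64];" between them (assuming 64-byte lines). */
    struct counters {
        long a;   /* written only by thread A */
        long b;   /* written only by thread B */
    };

    static struct counters c;

    static void *worker_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) c.a++;
        return NULL;
    }

    static void *worker_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) c.b++;
        return NULL;
    }

    int main(void) {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, worker_a, NULL);
        pthread_create(&tb, NULL, worker_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        return 0;
    }

With the padding in place, each counter lives in its own line and the two threads no longer interfere through the coherence protocol.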

Intel Core i7 (4770)

Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB.

L1 Data cache         = 32 KB, 64 B/line
L1 Instruction cache  = 32 KB, 64 B/line
L2 cache              = 256 KB, 64 B/line
L3 cache              = 8 MB, 64 B/line

L1 Data Cache Latency = 4-5 cycles
L2 Cache Latency      = 12 cycles
L3 Cache Latency      = 36 cycles
RAM Latency           = 36 + ~200 cycles

So Far

The cache has a significant impact on the performance of modern applications. How can we study the cache access patterns of an algorithm? How can we improve algorithm design?

Two examples: sorting and matrix multiplication.

External Memory Model

We speak of cache vs. disk to make the relative costs clear. Transfers occur in blocks of size B; the cache has size M >= B, i.e. M/B entries (lines).

Model properties:
- Simple.
- Asks to minimize the I/O cost.
- Optimizes for a specific M and B.
- Read/write operations are issued explicitly, and the cache is managed explicitly (can you do this?).

[J.S. Vitter, ACM Computing Surveys, 2001]

Example: Linear Scan

Theorem: scanning N elements stored in a contiguous segment of memory costs at most N/B + 1 memory transfers (the +1 accounts for the fact that the segment need not start at a block boundary).

Cache Oblivious Model

Simple idea: design an algorithm that is optimal for any B and M.

[Diagram: ideal-cache model — a CPU performing W work, a cache of M/B lines of length B, and Q cache misses served by main memory]

Properties:
- No need to know M and B, which can be hard to know and can harm the generality of an algorithm.
- Only one cache level is modelled.
- The cache is not managed explicitly.

Cache assumptions:
- Tall-cache assumption: M = Ω(B²).
- Ideal cache model: optimal cache replacement (vs. FIFO, LRU).
- Full associativity (vs. n-way associativity), organized by the optimal replacement strategy.

[Frigo et al., FOCS '99]

Cache Oblivious Model: Generalization to multiple cache levels

[Diagram: CPU backed by a hierarchy of memory levels (M_1, B_1), (M_2, B_2), (M_3, B_3), ..., (M_N, B_N)]

Theorem (from one level to many): if algorithm A is cache-oblivious optimal, then it is optimal on any two adjacent memory levels of a complex hierarchy.
Proof sketch: if the inclusion property holds, i.e. M_i ⊆ M_{i+1}, consider M_{i+1} as the external memory; A is then optimal with respect to M_i and B_i.

Theorem (levels of different cost): let C_i be the cost of accessing memory M_i. If A is cache-oblivious optimal up to a constant factor, then A is optimal for any possible set of constant factors C_i.

Cache Oblivious Model: Generalization to multiple cache levels

The following theorems make the model feasible:

Theorem (from the optimal replacement strategy to LRU/FIFO): if A takes T transfers on a cache of size M/2 with optimal replacement, then A takes at most 2T transfers on a cache of size M with the LRU or FIFO replacement policy.

Theorem (from fully associative to 1-way associative): an LRU/FIFO cache of size M and block size B can be simulated in O(M) space such that an access to a block takes O(1) expected time.

Conclusion: a cache-oblivious algorithm can be translated to a FIFO/LRU cache with 1-way associativity paying only constant factors.

Matrix Multiplication

Compute C = A * B, with A and B being N x N matrices.

Preliminary question: how do we store the matrices? Row-major order vs. column-major order.

Cache cost/complexity

For each element C_ij we scan the i-th row of A and the j-th column of B, computing a multiply-add for each pair. Each element of C involves O(1 + N/B) transfers (assuming each scan reads contiguous memory, e.g. A stored row-major and B column-major). Since C has N² elements, the total cost is O(N² + N³/B) transfers.

Simple approaches to reduce this cost:
- If M > N, row i of A can be kept in cache and reused to compute all the elements C_ix.
- To keep the whole matrix A in cache, M should be larger than N².
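For reference, a minimal C sketch of this naive row-by-column algorithm (plain row-major arrays of doubles; the function name and layout are assumptions for illustration):

    /* Naive N x N matrix multiplication, C = A * B, row-major layout.
       The inner loop streams a row of A (good spatial locality) but walks
       a column of B with stride N, touching a different cache line for
       every element unless B is stored column-major. */
    void matmul_naive(int N, const double *A, const double *B, double *C) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
    }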

Improved Algorithm

This can be improved to the optimal complexity O(N²/B + N³/(B·√M)) by working on block matrices:

[A11 A12]   [B11 B12]   [A11*B11 + A12*B21   A11*B12 + A12*B22]
[A21 A22] * [B21 B22] = [A21*B11 + A22*B21   A21*B12 + A22*B22]

Improved Algorithm

What is the best strategy given M and B? In other words, what is the best cache-conscious algorithm, i.e. the best algorithm according to the external memory model?

Simple: use blocks of size s*s such that the three blocks involved in each product fit in cache, i.e. 3s² = M, and use a blocked memory layout.
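A cache-conscious (blocked) sketch in C under the assumptions above: s*s blocks with 3s² <= M words, row-major storage, C initialized to zero, and N a multiple of s for simplicity (a real implementation would also handle fringe blocks; the function name is illustrative):

    /* Blocked N x N matrix multiplication, C += A * B, row-major layout.
       Each (i0, j0, k0) step multiplies one s x s block of A by one s x s
       block of B into one s x s block of C; the block size is chosen so
       that the three blocks fit in the cache and are reused from there. */
    void matmul_blocked(int N, int s, const double *A, const double *B, double *C) {
        for (int i0 = 0; i0 < N; i0 += s)
            for (int j0 = 0; j0 < N; j0 += s)
                for (int k0 = 0; k0 < N; k0 += s)
                    for (int i = i0; i < i0 + s; i++)
                        for (int j = j0; j < j0 + s; j++) {
                            double sum = C[i * N + j];
                            for (int k = k0; k < k0 + s; k++)
                                sum += A[i * N + k] * B[k * N + j];
                            C[i * N + j] = sum;
                        }
    }

With s ≈ √(M/3), the three blocks of the current step fit in cache, so each block element is loaded once per step rather than once per use; this is where the √M factor in the bound comes from.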

Cache Oblivious Algorithm

Divide-and-Conquer approach: split A, B and C into four quadrants,

[A11 A12]   [B11 B12]   [A11*B11 + A12*B21   A11*B12 + A12*B22]
[A21 A22] * [B21 B22] = [A21*B11 + A22*B21   A21*B12 + A22*B22]

and compute each of the eight sub-products recursively with the same quadrant split. At some point the recursion will fit in the cache, whatever its size.
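A minimal recursive sketch in C under simplifying assumptions (N a power of two, standard row-major storage with leading dimension ld, and a small hypothetical base-case size CUTOFF; the next slides explain why a recursive Z-order layout suits this recursion even better):

    #define CUTOFF 16   /* hypothetical base case; the recursion needs no knowledge of M or B */

    /* C += A * B on n x n submatrices stored row-major with leading dimension ld.
       Each call splits the matrices into four quadrants; once a sub-problem is
       small enough to fit in cache, all of its work is served from cache. */
    static void matmul_rec(int n, int ld, const double *A, const double *B, double *C) {
        if (n <= CUTOFF) {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    double sum = C[i * ld + j];
                    for (int k = 0; k < n; k++)
                        sum += A[i * ld + k] * B[k * ld + j];
                    C[i * ld + j] = sum;
                }
            return;
        }
        int h = n / 2;
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h * ld, *B22 = B + h * ld + h;
        double       *C11 = C,          *C12 = C + h,
                     *C21 = C + h * ld, *C22 = C + h * ld + h;

        matmul_rec(h, ld, A11, B11, C11);  matmul_rec(h, ld, A12, B21, C11);
        matmul_rec(h, ld, A11, B12, C12);  matmul_rec(h, ld, A12, B22, C12);
        matmul_rec(h, ld, A21, B11, C21);  matmul_rec(h, ld, A22, B21, C21);
        matmul_rec(h, ld, A21, B12, C22);  matmul_rec(h, ld, A22, B22, C22);
    }

The top-level call is matmul_rec(N, N, A, B, C) with C zero-initialized.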

Data layout

We don't know when the recursion will fit in cache! We therefore need a recursive data layout, such that however we recursively split the matrix, at some point all the data of a sub-problem lies in (almost) consecutive memory locations that can be easily loaded into cache.

Space-filling curves: the Z-order (Morton order).

How to implement the Z-order

For any subscript pair of a 2-dimensional array, e.g. array[2, 3]:
- binary value of row 2 -> 1 0
- binary value of column 3 -> 1 1
Interleaving the bits gives 1 1 0 1, so the value is stored at location 13 (0-based).
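A small C helper along these lines (a sketch; it assumes 16-bit row/column indices, with each row bit interleaved one position above the matching column bit, matching the example above):

    #include <stdint.h>

    /* Z-order (Morton) index of element (row, col).
       For row = 2 (binary 10) and col = 3 (binary 11) it returns binary 1101 = 13. */
    uint32_t z_order_index(uint16_t row, uint16_t col) {
        uint32_t index = 0;
        for (int k = 0; k < 16; k++) {
            index |= (uint32_t)((col >> k) & 1) << (2 * k);      /* col bit -> even position */
            index |= (uint32_t)((row >> k) & 1) << (2 * k + 1);  /* row bit -> odd position  */
        }
        return index;
    }

Storing element (row, col) at offset z_order_index(row, col) yields the recursive layout: every quadrant produced by the recursive split occupies a contiguous range of positions.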

Complexity

Sketch:
- The base case of the recursion is reached when the three blocks involved fit in the cache, since further recursion steps do not cause additional misses.
- Assume the base case works on blocks of size k√M * k√M, for some constant k. Such blocks fill the cache with O(M/B) misses.
- The number of base-case block multiplications is (N/(k√M))³, resulting in (N/(k√M))³ * O(M/B) = O(N³/(B·√M)) misses.
- Summing the partial products across the various blocks generates O(N²/B) misses.

The total cost is thus O(N²/B + N³/(B·√M)).
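The same count in one line (a sketch, relying on the tall-cache assumption M = Ω(B²) from the model above):

    \[ Q(N) = O\Big(\frac{N^2}{B}\Big) + \Big(\frac{N}{k\sqrt{M}}\Big)^{3} \cdot O\Big(\frac{M}{B}\Big) = O\Big(\frac{N^2}{B} + \frac{N^3}{B\sqrt{M}}\Big) \]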

End of first part

References:
- Parallel Programming for Multicore and Cluster Systems, Sec. 2.7, Cache and Memory Hierarchy.
- Cache-Oblivious Algorithms and Data Structures, Erik D. Demaine, Sec. 1, 2, 3.1.1, 3.2.3.