How to Write Fast Numerical Code

How to Write Fast Numerical Code
Lecture: Memory hierarchy, locality, caches
Instructor: Markus Püschel
TAs: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh

Organization
Temporal and spatial locality
Memory hierarchy
Caches

Reading: Chapter 6 in Computer Systems: A Programmer's Perspective, 2nd edition, Randal E. Bryant and David R. O'Hallaron, Addison Wesley, 2010. Some of these slides are adapted from the course associated with this book.

Problem: Processor-Memory Bottleneck

Processor performance doubled about every 18 months, while bus bandwidth doubled only about every 36 months.
Core i7 Haswell: peak performance of 2 AVX three-operand (FMA) ops/cycle consumes up to 192 bytes/cycle (see the arithmetic below), but the memory bandwidth is only 16 bytes/cycle.
Solution: caches / a memory hierarchy.

Typical Memory Hierarchy

Smaller, faster, and costlier per byte toward the top; larger, slower, and cheaper per byte toward the bottom:
L0: registers (CPU registers hold words retrieved from the L1 cache)
L1: on-chip L1 cache, SRAM (holds cache lines retrieved from the L2 cache)
L2: on-chip L2 cache, SRAM (holds cache lines retrieved from main memory)
L3: main memory, DRAM (holds disk blocks retrieved from local disks)
L4: local secondary storage, local disks (hold files retrieved from disks on remote network servers)
L5: remote secondary storage (tapes, distributed file systems, Web servers)
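The 192 bytes/cycle figure follows directly from the data rates: each AVX FMA operates on 256-bit operands (4 doubles) and has three inputs, so 2 FMAs/cycle x 3 operands x 4 doubles x 8 bytes = 192 bytes/cycle, roughly an order of magnitude more than the 16 bytes/cycle the memory bus can deliver.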

Abstracted Microarchitecture: Example Core i7 Haswell (2013) and Sandy Bridge (2011)

Conventions: throughput (tp) is measured in doubles/cycle and latency (lat) in cycles; 1 double-precision floating-point (FP) number = 8 bytes; fma = fused multiply-add; CB = cache block. Numbers in parentheses are the Sandy Bridge values where they differ, e.g., "4 (2)" means 4 on Haswell and 2 on Sandy Bridge.

One core:
Execution units: FP add, FP mul, FP fma, integer ALU, load, store, SIMD logic/shuffle.
Max scalar tp (double FP): 2 fmas/cycle = 2 (1) adds/cycle and 2 (1) mults/cycle.
Max vector tp (AVX): 2 vfmas/cycle = 8 fmas/cycle = 8 (4) adds/cycle and 8 (4) mults/cycle.
Out-of-order, superscalar execution with internal registers; 16 architectural FP registers; the instruction decoder translates CISC instructions (the ISA) into RISC μops; issue of 8 (6) μops/cycle into an instruction pool with up to 192 (168) μops in flight.

Memory hierarchy: registers, L1 cache, L2 cache, L3 cache, main memory, hard disk.
L1 Dcache and L1 Icache: both 32 KB, 8-way, 64 B CB; lat 4 (4), tp 12 = 8 ld + 4 st (4).
L2 cache: 256 KB, 8-way, 64 B CB; lat 11 (12), tp 8 (4).
Shared L3 cache: 8 MB, 16-way, 64 B CB, ring interconnect on the processor die; lat ~34 (26-31), tp 4 (4).
Main memory (RAM): up to 32 GB, 2 DDR3 channels at 1600 MHz; lat ~125 (100), tp 2 (1).
Hard disk: 0.5 TB; lat in the millions of cycles, tp ~1/50 (~1/100), depending on hard disk technology and I/O interconnect.

Core i7-4770 Haswell: 4 cores, 8 threads, 3.4 GHz (3.9 GHz max turbo frequency); each core has private L1 and L2 caches ("core"), and all cores share the L3 ("uncore").

Source: Intel manual (chapter 2), http://www.7-cpu.com/cpu/haswell.html

Why Caches Work: Locality

Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently.
Temporal locality: recently referenced items are likely to be referenced again in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.

Example: Locality?

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data:
Temporal: sum is referenced in each iteration.
Spatial: array a[] is accessed consecutively.
Instructions:
Temporal: loops cycle through the same instructions.
Spatial: instructions are referenced in sequence.

Being able to assess the locality of code is a crucial skill for a performance programmer.

Locality Example #1

int sum_array_rows(double a[M][N]) {
    int i, j;
    double sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Locality Example #2

int sum_array_cols(double a[M][N]) {
    int i, j;
    double sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Locality Example #3

int sum_array_3d(double a[K][M][N]) {
    int i, j, k;
    double sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < K; k++)
                sum += a[k][i][j];
    return sum;
}

How to improve locality? (One possibility is sketched below.)

[Plot: performance in flops/cycle vs. n = M = N = K for the loop orders k-i-j and i-j-k; k-i-j is clearly faster. CPU: Intel Core i7-4980HQ @ 2.80 GHz; compiler: Apple LLVM 8.0.0 (clang-800.0.42.1); flags: -O3 -fno-vectorize.]
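One way to improve locality in Example #3, matching the faster k-i-j curve in the plot (a sketch; the function name is mine, and K, M, N are assumed to be compile-time constants as in the slide's code):

int sum_array_3d_kij(double a[K][M][N]) {
    int i, j, k;
    double sum = 0;
    /* k-i-j order: the innermost loop runs over j, the fastest-varying
       index of a[k][i][j], so consecutive iterations touch consecutive
       memory locations (spatial locality). */
    for (k = 0; k < K; k++)
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}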

Operational Intensity Again

Definition: Given a program P, assume a cold (empty) cache.
Operational intensity: I(n) = W(n) / Q(n), where W(n) is the number of flops performed for input size n and Q(n) is the number of bytes transferred between cache and memory for input size n.

Examples (asymptotic bounds on I(n)):
Vector sum y = x + y: O(1)
Matrix-vector product y = Ax: O(1)
Fast Fourier transform: O(log(n))
Matrix-matrix product C = AB + C: O(n)

Compute/Memory Bound

A function or piece of code is:
Compute bound if it has high operational intensity.
Memory bound if it has low operational intensity.

Relationship between operational intensity and locality? They are closely related, but operational intensity describes only the traffic across the boundary between the last-level cache and memory.
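As a rough sketch of how such a bound is obtained (my derivation, not spelled out on the slide): for the matrix-matrix product C = AB + C with n x n matrices of doubles, W(n) = 2n^3 flops, and with a cold cache at least the 3n^2 doubles of A, B, and C must cross the cache-memory boundary, so Q(n) >= 3 * 8 * n^2 = 24n^2 bytes and I(n) <= 2n^3 / (24n^2) = n/12, consistent with the O(n) bound above.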

Effects

FFT: I(n) = O(log(n)). MMM: I(n) = O(n).

[Plot: Discrete Fourier Transform (DFT) on 2 x Core 2 Duo, 3 GHz, single precision; Gflop/s vs. input size from 16 to 262,144.]
Up to 40-50% of peak; performance drops outside the last-level cache (LLC); most time is spent transferring data.

[Plot: Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo, 3 GHz, double precision; Gflop/s vs. matrix size from 0 to 9,000.]
Up to 80-90% of peak; performance can be maintained outside the LLC; cache miss time is compensated/hidden by computation.

Cache

Definition: computer memory with short access time used for the storage of frequently or recently used instructions or data.

A cache naturally supports temporal locality; spatial locality is supported by transferring data in blocks. Core family: one block = 64 B = 8 doubles.

General Cache Mechanics

The cache is a smaller, faster, more expensive memory that holds a subset of the blocks. Data is copied between cache and memory in block-sized transfer units. Memory is the larger, slower, cheaper storage, viewed as partitioned into blocks (blocks 0 through 15 in the slide's figure, a few of which are currently cached).

General Cache Concepts: Hit

Request: block 14. Data in block b is needed and block b is in the cache: hit!

General Cache Concepts: Miss

Request: block 12. Data in block b is needed but block b is not in the cache: miss! Block b is fetched from memory and stored in the cache.
Placement policy: determines where b goes.
Replacement policy: determines which block gets evicted (the victim).

Types of Cache Misses (The 3 C's)

Compulsory (cold) miss: occurs on the first access to a block.
Capacity miss: occurs when the working set is larger than the cache.
Conflict miss: occurs when the cache is large enough, but multiple data objects all map to the same slot.

Not a clean classification, but still useful.

Cache Structure

Draw a direct-mapped cache (E = 1, B = 4 doubles, S = 8) and show how blocks are mapped into the cache. (blackboard)

Example (S = 8, E = 1)

Assume a cold (empty) cache with B = 32 bytes = 4 doubles, with a[0][0] mapping to the start of the cache; ignore the variables sum, i, j.

int sum_array_rows(double a[16][16]) {
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(double a[16][16]) {
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

(blackboard)
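A possible paper simulation of this example (my sketch of the blackboard argument, assuming row-major layout and a block-aligned array): each row of a[16][16] occupies 16 * 8 = 128 bytes = 4 blocks, so a[i][j] lies in block 4i + j/4, which maps to set (4i + j/4) mod 8. In sum_array_rows, each block is loaded once and reused for the next three elements: one miss per four accesses (25% miss rate). In sum_array_cols, one column pass touches 16 different blocks, but the cache holds only 8, and the blocks of rows i and i+2 fall into the same set; every block is evicted before it can be reused, so every access misses (100% miss rate).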

Cache Structure

Add associativity (E = 2, B = 4 doubles, S = 8) and show how elements are mapped into the cache. (blackboard)

Example (S = 4, E = 2)

Assume a cold (empty) cache with B = 32 bytes = 4 doubles, with a[0][0] mapping to the start of the cache; ignore the variables sum, i, j.

int sum_array_rows(double a[16][16]) {
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(double a[16][16]) {
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

(blackboard)
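For the associative version (again my sketch, same assumptions, now with S = 4, E = 2): a[i][j] still lies in block 4i + j/4, which now maps to set (4i + j/4) mod 4 = (j/4) mod 4, independent of i. All 16 blocks touched by one column pass of sum_array_cols therefore land in the same set of only two lines, so with LRU replacement every access still misses, while sum_array_rows still misses only once per block (25%). Associativity alone does not rescue the column-wise traversal in this configuration.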

General Cache Organization (S, E, B)

A cache consists of S = 2^s sets with E = 2^e lines per set (E is the associativity; E = 1 means direct mapped). Each line holds a valid bit, a tag, and one cache block of B = 2^b data bytes. Cache size: S x E x B data bytes.

Cache Read

Address of a word: t tag bits | s set index bits | b block offset bits.
Locate the set using the set index.
Check whether any line in that set has a matching tag; if yes and the line is valid: hit.
Locate the requested data within the block, starting at the block offset.
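In code, the decomposition of an address into tag, set index, and block offset looks as follows (a sketch with names of my choosing, not an API from the lecture):

#include <stdint.h>

/* Split an address for a cache with S = 2^s sets and B = 2^b bytes per block. */
typedef struct { uint64_t tag; uint64_t set; uint64_t offset; } addr_parts;

addr_parts split_address(uint64_t addr, unsigned s, unsigned b) {
    addr_parts p;
    p.offset = addr & ((1ULL << b) - 1);         /* low b bits: block offset */
    p.set    = (addr >> b) & ((1ULL << s) - 1);  /* next s bits: set index   */
    p.tag    = addr >> (b + s);                  /* remaining t bits: tag    */
    return p;
}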

Terminology

Direct-mapped cache: cache with E = 1. Every block from memory has a unique location in the cache.
Fully associative cache: cache with S = 1 (i.e., maximal E). Every block from memory can be mapped to any location in the cache. In practice too expensive to build; one can view the register file as a fully associative cache.
LRU (least recently used) replacement: when selecting which block should be replaced (relevant only for E > 1), the least recently used one is chosen.

Small Example, Part 1

Cache: E = 1 (direct mapped), S = 2, B = 16 bytes (2 doubles).
Array x = x[0], ..., x[7], accessed twice:

% Matlab-style code
for j = 0:1
    for i = 0:7
        access(x[i])

Access pattern: 0123456701234567
Hit/Miss:       MHMHMHMHMHMHMHMH

Result: 8 misses, 8 hits. Spatial locality: yes. Temporal locality: no.

Small Example, Part 2

Cache: E = 1 (direct mapped), S = 2, B = 16 bytes (2 doubles).
Array x = x[0], ..., x[7], accessed twice:

% Matlab-style code
for j = 0:1
    for i = 0:2:7
        access(x[i])
    for i = 1:2:7
        access(x[i])

Access pattern: 0246135702461357
Hit/Miss:       MMMMMMMMMMMMMMMM

Result: 16 misses. Spatial locality: no. Temporal locality: no.

Small Example, Part 3

Cache: E = 1 (direct mapped), S = 2, B = 16 bytes (2 doubles).
Array x = x[0], ..., x[7], accessed twice:

% Matlab-style code
for j = 0:1
    for k = 0:1
        for i = 0:3
            access(x[i + 4*j])

Access pattern: 0123012345674567
Hit/Miss:       MHMHHHHHMHMHHHHH

Result: 4 misses, 12 hits (this is optimal; why?). Spatial locality: yes. Temporal locality: yes.
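The hit/miss strings above can also be checked mechanically. Below is a minimal direct-mapped cache simulator (my sketch, not code from the lecture) set up with the Part 1 parameters; changing the access loop reproduces Parts 2 and 3.

#include <stdio.h>

#define S 2    /* number of sets (direct mapped, E = 1) */
#define B 16   /* block size in bytes (2 doubles)       */

static long tags[S];
static int  valid[S];

/* Simulate one access; returns 1 on hit, 0 on miss (and fills the line). */
static int access_addr(long addr) {
    long block = addr / B;
    int  set   = block % S;
    long tag   = block / S;
    if (valid[set] && tags[set] == tag) return 1;
    valid[set] = 1;
    tags[set]  = tag;
    return 0;
}

int main(void) {
    /* Part 1: doubles x[0..7] (8 bytes each), accessed twice in order. */
    for (int j = 0; j < 2; j++)
        for (int i = 0; i < 8; i++)
            printf("%c", access_addr(8L * i) ? 'H' : 'M');
    printf("\n");   /* prints MHMHMHMHMHMHMHMH, as on the slide */
    return 0;
}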

Cache Performance Metrics

Miss rate: fraction of memory references not found in the cache: misses / accesses = 1 - hit rate.
Hit time: time to deliver a block in the cache to the processor. Core 2: 3 clock cycles for L1, 14 clock cycles for L2.
Miss penalty: additional time required because of a miss. Core 2: about 100 cycles for an L2 miss.

What About Writes?

What to do on a write hit?
Write-through: write immediately to memory.
Write-back: defer the write to memory until the line is replaced.
What to do on a write miss?
Write-allocate: load the block into the cache, then update the line in the cache.
No-write-allocate: write immediately to memory.

The Core family uses write-back/write-allocate; the other common combination is write-through/no-write-allocate. (The slide figure contrasts the two: with write-back/write-allocate, a write hit updates only the cached line and a write miss first loads the block and then updates it in the cache; with write-through/no-write-allocate, writes go to memory immediately in both cases.)
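Returning to the performance metrics above: they combine into an average access time of roughly hit time + miss rate x miss penalty. With the Core 2 numbers and hypothetical miss rates (not from the slide) of 5% in L1 and 20% in L2, an access costs on average about 3 + 0.05 x (14 + 0.20 x 100) = 4.7 cycles, which is why even single-digit miss rates matter.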

Example (blackboard)

z = x + y, with x, y, z vectors of length n. Assume they fit jointly in cache and the cache is cold.
Memory traffic Q(n)?
Operational intensity I(n)?

Locality Optimization: Blocking

Example: MMM (blackboard).
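One possible blackboard calculation for the vector example (my sketch, under the stated assumptions plus the Core's write-back/write-allocate policy): W(n) = n additions. With a cold cache, x and y must be read (2 x 8n bytes), write-allocate first loads each block of z (8n bytes), and the dirty z blocks are eventually written back (8n bytes), so Q(n) is roughly 32n bytes and I(n) = n / (32n) = 1/32 flops/byte: a constant, so the computation is memory bound.

For the blocking example, a sketch of tiled MMM (my illustration of the idea, not the exact blackboard code; NB is a tile size chosen so that three NB x NB tiles fit in cache, and n is assumed to be a multiple of NB):

/* Blocked (tiled) MMM, C = A*B + C, for n x n row-major matrices. */
void mmm_blocked(int n, const double *A, const double *B, double *C) {
    const int NB = 32;   /* tile size: tune to the cache */
    for (int i0 = 0; i0 < n; i0 += NB)
        for (int j0 = 0; j0 < n; j0 += NB)
            for (int k0 = 0; k0 < n; k0 += NB)
                /* multiply one NB x NB tile of A by one NB x NB tile of B */
                for (int i = i0; i < i0 + NB; i++)
                    for (int k = k0; k < k0 + NB; k++)
                        for (int j = j0; j < j0 + NB; j++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}

While a pair of tiles is resident in cache, each of its elements is reused NB times, which raises the operational intensity compared with the naive triple loop.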

The Killer: Two-Power Strided Working Sets

% t = 1, 2, 4, 8, ..., a two-power
% size of working set: n/t
for (i = 0; i < n; i += t) access(x[i])
for (i = 0; i < n; i += t) access(x[i])

(blackboard) Cache: E = 2, B = 4 doubles; C denotes the cache capacity in doubles.

t = 1: spatial locality; temporal locality if n/t ≤ C
t = 2: some spatial locality; temporal locality if n/t ≤ C/2
t = 4: no spatial locality; temporal locality if n/t ≤ C/4
t = 8: no spatial locality; temporal locality if n/t ≤ C/8
t ≥ 4S: no spatial locality; temporal locality if n/t ≤ 2

The Killer: Where Can It Occur?

Accessing two-power-sized 2D arrays (e.g., images) column-wise:
2D transforms
Stencil computations
Correlations
Various transform algorithms:
Fast Fourier transform
Wavelet transforms
Filter banks
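To connect the two slides: traversing a row-major n x m array of doubles column-wise accesses memory with stride m, so whenever the row length m is a two-power (as it typically is for images and transform sizes), this is exactly the strided pattern above with t = m, and the conflict misses follow.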

Summary

It is important to assess temporal and spatial locality in the code.
Cache structure is determined by three parameters: block size, number of sets, and associativity.
You should be able to roughly simulate a computation on paper.
Blocking improves locality.
Two-power strides are problematic (conflict misses).