CAMA: Modern processors
Memory hierarchy: Caches basics | Data access locality | Cache management
Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
Johannes Hofmann / Dietmar Fey, Department for Computer Science
University Erlangen-Nürnberg, Summer semester 2016

Memory hierarchy: Caches basics | Data access locality | Cache management

Von Neumann bottleneck reloaded: the DRAM gap
DP peak performance vs. peak main memory bandwidth for a single Intel processor (chip): the ratio is approx. 10 F/B.
- Main memory access speed is not sufficient to keep the CPU busy
- Therefore: introduce fast on-chip caches that hold copies of recently used data items

Schematic view of the modern memory hierarchy & cache logic
- The CPU/arithmetic unit issues a LOAD request to transfer a data item to a register
- The cache logic automatically checks all cache levels whether the data item is already in cache
- If the data item is in cache ("cache hit"), it is loaded to the register
- If the data item is in no cache level ("cache miss"), it is loaded from main memory and a copy is held in cache
(Bandwidth annotations from the schematic: 15-60 GB/s and 50-150 GB/s for the data paths between the levels)

Memory hierarchies: Effective bandwidths
Hardware: quantities that characterize the quality of a memory hierarchy:
- Latency (T_l): set-up time for a data transfer from source (e.g. main memory or caches) to destination (e.g. registers)
- Bandwidth (b): maximum amount of data per second that can be transferred between source and destination
Application: transfer time (T) and effective bandwidth (b_eff) depend on the data volume (V) to be transferred:
- Transfer time:        T = T_l + V/b
- Effective bandwidth:  b_eff = V/T = V / (T_l + V/b)
- Low data volume (V -> 0):        b_eff -> 0
- Large data volume (V/b >> T_l):  b_eff -> b

Latency and bandwidth in modern computer environments
[Figure: effective bandwidth b_eff = V / (T_l + V/b) as a function of the data volume V, for latencies in the ns, µs, and ms range]

Memory hierarchies: The latency problem
Main memory latency and bandwidth for modern multicore CPUs: T_l = 64 ns, b = 64 GB/s

  V        T_l     V/b        T           b_eff
  8 B      64 ns   0.125 ns   64.125 ns   0.13 GB/s
  128 B    64 ns   2 ns       66 ns       1.9 GB/s
  4096 B   64 ns   64 ns      128 ns      32 GB/s

- Data access is organized in cache lines (CL): always a full CL is transferred (V = 64 B or V = 128 B on modern architectures)
- Multiple CLs can be loaded concurrently:
  - multiple data requests by the application code
  - automatic hardware prefetching

Memory hierarchies: Cache lines
- If one data item is loaded from main memory ("cache miss"), the whole cache line it belongs to is loaded
- Cache lines are contiguous in main memory, i.e. neighboring items can then be used from cache

do i=1,n
  s = s + a(i)*a(i)
enddo

[Timeline for a cache line size of 4 words, iterations 1-8: iteration 1 incurs a cache miss (T_l + V/b), iterations 2-4 use data from the loaded cache line; iteration 5 incurs the next miss, and so on]

Memory hierarchies: (Automatic) Prefetching
Prefetching hides the memory latency of CL transfers: the data transfer is started before the cache miss would occur.

do i=1,n
  s = s + a(i)*a(i)
enddo

[Timeline, iterations 1-8: while the data of the current cache line is being used, the LD for the next cache line is already issued (prefetching), so later iterations do not stall on T_l]

Memory hierarchies: Prefetching hides memory latency
- Prefetch (PFT) instructions (limited use on modern architectures): transfer one cache line from memory to cache, then issue LDs to registers
- Most architectures (Intel/AMD x86, IBM POWER) use hardware-based automatic prefetch mechanisms: the hardware detects regular, consecutive memory access patterns ("streams") and prefetches at will
- Intel x86: adjacent cache line prefetch loads 2 (64-byte) cache lines on an L3 miss; effectively doubles the line length on loads (typically enabled in the BIOS)
- Intel x86: the hardware prefetcher prefetches a complete page (4 KB) if 2 successive CLs in this page are accessed
- For regular data access, main memory latency is not an issue!
- May generate excessive data transfers for irregular access patterns
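Where the hardware prefetchers cannot detect a pattern, most compilers also expose software prefetching. A minimal sketch using the GCC/Clang builtin (the function name and the prefetch distance of 8 iterations are assumptions, not from the slides; the distance has to be tuned):

/* Sum of squares with an explicit software prefetch hint (GCC/Clang). */
double sum_squares_pft(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i) {
        /* Hint: fetch the data ~8 iterations (one cache line) ahead.
           Arguments: address, 0 = read access, 3 = keep in all cache levels.
           The prefetch is only a hint and does not fault, even for
           addresses past the end of the array.                        */
        __builtin_prefetch(&a[i + 8], 0, 3);
        s += a[i] * a[i];
    }
    return s;
}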

Node-level architecture revisited - Memory hierarchy: Caches basics | Data access locality | Cache management

Memory hierarchies: Cache lines & spatial locality
Cache line features:
- Cache line use is optimal for contiguous access ("stride 1"): STREAMING
- Non-consecutive access reduces performance
- Access with the wrong stride (e.g. equal to the cache line size) can lead to a disastrous performance breakdown
- Typical CL sizes: 64 byte (AMD/Intel) or 128 byte (IBM)
"Spatial locality": ensure accesses to neighboring data items.

GOOD ("streaming"):
do i=1,n
  s = s + a(i)*a(i)
enddo

BAD ("strided"):
do i=1,n,2
  s = s + a(i)*a(i)
enddo

If a(1:n) is loaded from main memory, both loops have the same runtime, so the performance of the strided loop is half that of the contiguous one.

Memory hierarchies: Spatial locality & data layout
How to traverse multidimensional arrays? Example: initialize a matrix A with A(i,j) = i*j.
What is the storage order of a multidimensional data structure? It depends. For a 2-dimensional 3x3 array A of doubles (72 bytes):

FORTRAN: column by column ("column major order")
Memory layout (bytes 0..71): A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) A(3,3)

C/C++: row by row ("row major order")
Memory layout (bytes 0..71): A[0][0] A[0][1] A[0][2] A[1][0] A[1][1] A[1][2] A[2][0] A[2][1] A[2][2]
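A quick way to see the row-major layout is to print element distances; this small check (not part of the lecture material) shows that in C the rightmost index is the contiguous one:

#include <stdio.h>

int main(void) {
    double A[3][3];
    /* Row major: A[0][1] is the next double after A[0][0] (8 bytes away),
       while stepping the left index jumps a whole row (3*8 = 24 bytes). */
    printf("&A[0][1]-&A[0][0] = %td bytes\n", (char *)&A[0][1] - (char *)&A[0][0]);
    printf("&A[1][0]-&A[0][0] = %td bytes\n", (char *)&A[1][0] - (char *)&A[0][0]);
    return 0;
}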

Memory hierarchies: Spatial locality & data layout
Default layout for FORTRAN: column by column (column major order)

Contiguous access:
do i=1,n
  do j=1,n
    a(j,i)=i*j
  enddo
enddo

Stride-n access:
do j=1,n
  do i=1,n
    a(j,i)=i*j
  enddo
enddo

FORTRAN: the inner loop must run over the innermost/leftmost array index. Note that this data arrangement is the transpose of the usual matrix layout.

Memory hierarchies: Spatial locality & data layout
Default layout for C/C++: row by row (row major order)

Contiguous access:
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    a[i][j] = i*j;
  }
}

Stride-n access:
for(j=0; j<n; ++j) {
  for(i=0; i<n; ++i) {
    a[i][j] = i*j;
  }
}

In C, the inner loop must run over the outermost/rightmost array index.

Memory hierarchies: Spatial locality & data layout
3-dimensional arrays in C/C++:

Contiguous access:
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    for(k=0; k<n; ++k) {
      a[i][j][k] = i*j*k;
    }
  }
}

Stride-N*N access:
for(k=0; k<n; ++k) {
  for(j=0; j<n; ++j) {
    for(i=0; i<n; ++i) {
      a[i][j][k] = i*j*k;
    }
  }
}

C/C++: always use the rightmost index as the inner loop index if possible. Sometimes this is not possible, e.g. in a transpose, where one of the two arrays is necessarily accessed with a large stride (spatial blocking may improve the situation here, see the sketch below and later lectures):

for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    a[i][j] = b[j][i];
  }
}
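A sketch of the spatial blocking hinted at above, for the transpose loop (the block size BS is an assumption and has to be tuned so that a BS x BS tile of b fits into cache):

#include <stddef.h>

#define BS 64   /* assumed block size in elements, to be tuned for the target cache */

/* a[i][j] = b[j][i] on n x n matrices stored as 1D arrays, traversed in tiles
   so that the strided accesses to b stay within a cache-resident block.      */
void transpose_blocked(int n, double *a, const double *b) {
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS && i < n; ++i)
                for (int j = jj; j < jj + BS && j < n; ++j)
                    a[(size_t)i * n + j] = b[(size_t)j * n + i];
}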

Memory hierarchies: Temporal locality
- Phenomenon: some/many data items are accessed frequently ("temporal locality")
- If data is already in cache, reuse it from there: data reuse from cache happens at higher rates!
- Blocking techniques can often be applied

Example: dense matrix-vector multiplication with A(R,C) and x(C) (assume that the cache is large enough to hold y(1:R)):

do c = 1, C
  tmp = x(c)                    ! tmp stays in a register in the inner loop
  do r = 1, R
    y(r) = y(r) + A(r,c) * tmp
  enddo
enddo

- y(1:R) is loaded C times: temporal locality for C-1 of the accesses
- A(:,:) is accessed contiguously: spatial locality
- x(c) stays in a register

Node-level architecture revisited - Memory hierarchy: Caches basics | Data access locality | Cache management

Memory hierarchies: Cache mapping
- The cache size (~MB) is much smaller than main memory (~GB): a cache mapping strategy is required, i.e. a pairing of memory addresses with cache locations
- Where is the CL for a given memory address placed in the cache (L1 ~10^3 byte, L2 ~10^6 byte, main memory ~10^9 byte)?
- Different mapping strategies may be used throughout the memory hierarchy (i.e. per cache level)
- Extreme strategies: direct mapped (see next slide) vs. fully associative (a memory address can be mapped to any cache entry)

Memory hierarchies: Cache mapping - direct mapping
- Simplest mapping strategy: direct mapped caches
- Every memory address is mapped to exactly one cache entry
- Easy to handle/implement in hardware: e.g., if the cache size is 1 KB, the lowest 10 bits of the memory address identify the cache entry
  (example 32-bit memory address split: 011100100000 | 11110100111110 | 001111)
- The mapping substantially impacts the flexibility of replacement strategies: it reduces the potential set of evict/replace locations and may incur additional data transfers ("cache thrashing")

Memory hierarchies: Cache mapping - associative caches
- Set-associative cache: in an m-way associative cache of size m x n, each memory location i can be mapped to the m cache locations ("ways") j*n + mod(i,n), j = 0..m-1
- E.g., a 2-way set-associative cache of 256 KB with 64-byte cache lines: number of sets = 256 KB / 64 byte / 2 = 2048
  Memory address (32 bit), split into tag | set | address within cache line: 011100100000111 | 10100111110 | 001111
- Modern processors use 4-way to 48-way associative caches
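The address split used in the example can be reproduced with a few bit operations; a minimal sketch (the concrete address value is hypothetical) for the 2-way, 256 KB cache with 64-byte lines, i.e. 6 offset bits, 11 set bits, and 15 tag bits:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr   = 0x72079F0Fu;            /* hypothetical 32-bit address       */
    uint32_t offset =  addr        & 0x3Fu;   /* lowest 6 bits: byte within the CL */
    uint32_t set    = (addr >> 6)  & 0x7FFu;  /* next 11 bits: one of 2048 sets    */
    uint32_t tag    =  addr >> 17;            /* remaining 15 bits: the tag        */
    printf("tag = 0x%X, set = %u, offset = %u\n", tag, set, offset);
    return 0;
}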

Memory hierarchies: Cache mapping & cache thrashing
- If many memory locations are used that map to the same set, cache reuse can be very limited even with m-way associative caches
- If the cache (i.e. all m ways of a set) is full and new data comes in from main memory, data in the cache (a full cache line) must be invalidated or written back to main memory
- Warning: powers of 2 in the leading array dimensions of multi-dimensional arrays should be avoided (cache thrashing)!
  Example: double precision A(16384,16384); the addresses of A(1,1), A(1,2), A(1,3) differ only in the tag bits and therefore map to the same set:
  011100100000111 | 10100111110 | 001111
  011100100001000 | 10100111110 | 001111
  011100100001001 | 10100111110 | 001111
- Ensure spatial and temporal data locality for data access!
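A small sketch (the cache parameters and start address are assumptions) that makes the warning concrete: with a leading dimension of 16384 doubles, the elements A(1,1), A(1,2), ... are 2^17 bytes apart and therefore always land in the same set:

#include <stdio.h>
#include <stdint.h>

/* Assumed cache: 32 KB, 8-way, 64-byte lines -> 64 sets */
enum { LINE = 64, WAYS = 8, CACHE = 32 * 1024, SETS = CACHE / LINE / WAYS };

static unsigned set_of(uintptr_t addr) {
    return (unsigned)((addr / LINE) % SETS);   /* bits above the line offset select the set */
}

int main(void) {
    uintptr_t base = 0x10000000u;              /* assumed start address of A        */
    size_t lda = 16384;                        /* leading dimension: a power of two */
    for (int j = 0; j < 4; ++j) {
        uintptr_t a = base + (uintptr_t)j * lda * sizeof(double);  /* A(1,1), A(1,2), ... */
        printf("A(1,%d) -> set %u\n", j + 1, set_of(a));
    }
    return 0;   /* all map to the same set: only WAYS such elements fit in cache */
}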

Memory hierarchies: Cache thrashing - example
Example: 2D square lattice; at each lattice point the velocities for each of the 4 directions are stored (N=16):

real*8 vel(1:N, 1:N, 4)
s = 0.d0
do j=1,N
  do i=1,N
    s = s + vel(i,j,1) - vel(i,j,2) + vel(i,j,3) - vel(i,j,4)
  enddo
enddo

Memory hierarchies: Cache thrashing - example
Memory-to-cache mapping for vel(1:16, 1:16, 4); assume a cache of 256 byte (= 32 doubles), 2-way associative (2 ways of 16 doubles each), cache line size 32 byte:
- The four sub-arrays vel(1:16,1:16,k), k=1..4, are each 16*16*8 byte = 2048 byte long, a multiple of the way size, so for a given (i,j) the cache lines holding vel(i,j,1), vel(i,j,2), vel(i,j,3), vel(i,j,4) all map to the same set
- With only 2 ways per set, these lines evict each other in every iteration
- Each cache line must be loaded 4 times from main memory to cache!

Memory hierarchies: Cache thrashing - example
Memory-to-cache mapping for the padded array vel(1:16+2, 1:16+2, 4); same cache: 256 byte (= 32 doubles), 2-way associative (2 ways of 16 doubles each), cache line size 32 byte:
- With the padded leading dimensions, the sub-arrays are 18*18*8 byte = 2592 byte apart, which is not a multiple of the way size, so the cache lines for vel(i,j,1..4) no longer fall into the same set
- Each cache line needs to be loaded only once from memory to cache!

Memory hierarchies: Cache management details - cache misses
- LOAD miss: if a data item (e.g. a[2]) to be loaded into a register is not in cache, the full cache line holding it (e.g. a[0:7]) is loaded from main memory to cache
- STORE miss: what if a data item to be modified (e.g. a[2]=0.0) is not in cache? A cache line is the minimum data transfer unit between main memory and cache (e.g. a[0:7]), so:
  1. Load the cache line from main memory to cache ("write allocate")
  2. Modify the data item in cache
  3. Later evict / write back the full cache line to main memory (store to main memory)
- The overall data transfer volume increases by up to 2x! (NT stores: no increase)

do i=1,n
  do j=1,n
    a(j,i) = 0.0
  enddo
enddo

n^2 words are loaded from main memory to cache (write allocate) and n^2 words are evicted/written back to main memory!
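The non-temporal (NT) stores mentioned above can be requested explicitly through compiler intrinsics; a minimal sketch (assuming an AVX-capable CPU, 32-byte aligned data, and n a multiple of 4; the helper name zero_nt is mine) that initializes an array without the write-allocate transfer:

#include <immintrin.h>
#include <stdlib.h>

/* Zero n doubles with streaming stores: the cache lines are written
   straight to memory, so no write allocate is triggered.            */
static void zero_nt(double *a, long n) {
    __m256d zero = _mm256_setzero_pd();
    for (long i = 0; i < n; i += 4)
        _mm256_stream_pd(&a[i], zero);   /* requires 32-byte alignment of a      */
    _mm_sfence();                        /* order the NT stores before later loads */
}

int main(void) {
    long n = 1L << 20;
    double *a = aligned_alloc(32, (size_t)n * sizeof(double));
    if (a) { zero_nt(a, n); free(a); }
    return 0;
}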

Memory hierarchies: Data transfers
- Caches help with getting instructions and data to the CPU fast
- How does data travel from memory to the CPU and back?
- Remember: caches are organized in cache lines (e.g. 64 bytes), and only complete cache lines are transferred between memory hierarchy levels (except to/from registers)
- Cache MISS: a load or store instruction does not find the data in a cache level -> a CL transfer is required

Example: array copy A(:) = C(:) on a 2-level, inclusive cache hierarchy:
- LD C(1): load MISS in L1 and L2 -> the CL is loaded to L2, then to L1, then the data is loaded to the register; LD C(2..N_cl) are HITs
- ST A(1): MISS -> write allocate: the CL holding A(1) is first loaded into the cache; ST A(2..N_cl) are HITs; the modified CL is evicted later (delayed)
- In total: 3 CL transfers per cache line of data

Memory hierarchies: Cache management details - inclusive vs. exclusive, write back vs. write through
Inclusive (Intel processors):
- A cache line has a copy in all levels
- Reduced effective size of the outer cache levels
- Cheap eviction for unmodified cache lines
- Higher latency: cache lines have to be loaded through the hierarchy
Exclusive (AMD processors):
- Only one copy of a cache line exists in the cache hierarchy
- Full aggregate effective cache size
- Eviction is expensive (copy back)
- Lower latency: data can be loaded directly into the L1 cache
"Write back": a modified cache line is evicted to the next (lower) cache/memory level before it is overwritten by new data.
"Write through": when a cache line is updated, the copy in the next (lower) cache/memory level is updated as well.

Memory hierarchies: Cache management details - tags
- Every cache line (CL) has an associated cache tag (also located in the cache, but not directly visible to the programmer)
- The cache tag contains information about:
  - the main memory address associated with the CL (tag information + set information = base memory address of the CL)
  - the status of the CL (e.g. Invalid, Shared, Exclusive, Modified), which e.g. determines what to do with the cache entry if it needs to be replaced
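Conceptually (not how the hardware actually stores it), the per-cache-line bookkeeping can be pictured like this; the type and field names are purely illustrative:

/* MESI-style status held in the tag of each cache line */
enum cl_state { CL_INVALID, CL_SHARED, CL_EXCLUSIVE, CL_MODIFIED };

struct cache_tag {
    unsigned long tag;     /* upper address bits; tag + set index = CL base address */
    enum cl_state state;   /* decides e.g. whether eviction needs a write back      */
};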

Memory hierarchies: Typical cache configuration
Intel Xeon E5-2680 "Sandy Bridge":

  # FP registers: 16 (SIMD registers)     # GP registers: 16

  L1D   32 KB    8-way    local per core
  L2    256 KB   8-way    local per core
  L3    20 MB    20-way   shared across all cores

The same holds for the more recent Intel architectures Ivy Bridge & Haswell; the shared L3 size depends on core count, CPU variant, and CoD mode.

Intel Xeon E5 multicore processors
[Table/figure: per-generation comparison of FP instruction throughput per core, maximum data transfer per cycle between the cache levels, and peak main memory bandwidth]

Characterization of memory hierarchies
Determine the performance levels with a low-level benchmark: the vector triad

DOUBLE PRECISION, dimension(SIZE) :: A,B,C,D
DOUBLE PRECISION :: S,E,MFLOPS
! Input: N .le. SIZE
DO i=1,N
  A(i)=0.d0; B(i)=1.d0; C(i)=2.d0; D(i)=3.d0   ! initialize
ENDDO
call get_walltime(S)
DO ITER=1,NITER
  DO i=1,N
    A(i) = B(i) + C(i) * D(i)                  ! 3 loads + 1 store; 2 FLOPs
  ENDDO
  IF (A(2).lt.0) call dummy(A,B,C,D)           ! prevent loop interchange
ENDDO
call get_walltime(E)
MFLOPS = NITER * N * 2.d0 / ((E-S) * 1.d6)

Memory hierarchies: Measured performance levels
Vector triad single-core performance: A(1:N) = B(1:N) + C(1:N)*D(1:N)
Can we explain the performance based on hardware features?
[Figure: single-core vector triad performance vs. loop length N, with the L1 cache, L2 cache, and L3 cache regimes marked]

CAMA: Multicore processors
Moore's law & multicore technology | Basic compute architecture
Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
Dietmar Fey, Department for Computer Science
University Erlangen-Nürnberg, Summer semester 2016

Moore's law driving multicore technology

Introduction: Moore's law continues
- 1965: G. Moore claimed that the number of transistors on a microchip doubles every 12-24 months
- Nvidia Maxwell: 8 billion transistors; Intel Haswell EP: 5.6 billion transistors

Introduction: Moore's law - clock speeds saturate
[Figure: Intel x86 clock speed (frequency [MHz], log scale) over time, from the single-core Nocona to the 8-core Sandy Bridge, 12-core Ivy Bridge, and 18-core Haswell: clock speeds have saturated]

Introduction: Trends to consider
- The clock speed of multicore chips will not increase
- Power/energy saving mechanisms in hardware: the clock speed depends on run-time parameters, e.g. the number of cores used, the type of application executed, the environment temperature
- The transistor budget can be invested in various directions:
  - execution units (#FMAs)
  - width of the execution units (SIMD: 128 bit -> 256 bit -> 512 bit)
  - cache sizes
  - #cores (n_core): 2, 4, ..., 22
  - additional functionality, e.g. PCIe or a GPU on-chip

There is no longer a single driving force for chip performance!
Floating point (FP) peak performance of a single chip:

  P_chip = n_core * P_core,   with   P_core = n_super^FP * n_FMA * n_SIMD * f

Intel Xeon EP "Broadwell" (up to 22-core variants are available), e.g. Intel Xeon E5-2699 v4:
  f = 2.2 GHz, n_core = 22, n_super^FP = 2, n_FMA = 2, n_SIMD = 4
  P_chip = 774.4 GF/s (double precision) - comparable to the TOP1 system of 1996
But: P_chip = 8.8 GF/s for serial, non-vectorized code.

There is no single driving force for single-core performance!

  P_chip = n_core * n_super^FP * n_FMA * n_SIMD * f

  Microarchitecture   n_core   n_super^FP    n_FMA   n_SIMD       Server                f [GHz]   P_chip [GF/s]
                      Cores    [inst./cy]    FMA     [ops/inst]
                               Superscalar.  factor  SIMD factor
  Nehalem                4         2           1        2         Q1/2009  X5570         2.93        46.8
  Westmere               6         2           1        2         Q1/2010  X5650         2.66        63.6
  Sandy Bridge           8         2           1        4         Q1/2012  E5-2680       2.7        173
  Ivy Bridge            10         2           1        4         Q3/2013  E5-2660 v2    2.2        176
  Haswell               14         2           2        4         Q3/2014  E5-2695 v3    2.3        515
  Broadwell             22         2           2        4         Q1/2016  E5-2699 v4    2.2        774
  IBM POWER8            10         2           2        2         Q2/2014  S822LC        2.93       234
  Nvidia K20            13         1           2       64                                0.7       1165
  Intel Xeon Phi 5110P  60         1           2        8                                1.05      1008
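As a cross-check of one table row, the Haswell entry follows directly from the formula: P_chip = 14 cores x 2 inst./cy x 2 (FMA) x 4 (SIMD) x 2.3 GHz = 515 GF/s.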

Intel Xeon E5-2600 v3 dual-socket server (2014)
One Xeon E5-2600 v3 "Haswell EP" chip:
- Up to 18 cores running at 2.3 GHz (max. 3.6 GHz)
- Simultaneous Multithreading (SMT): reports as a 36-way chip
- Up to 40 MB cache & 40 PCIe 3.0 lanes
- 5.7 billion transistors / 22 nm, die size 662 mm^2
Standard HPC/server configuration: 2-socket server (2 x 18 cores)

Basic compute node architecture: from UMA to ccNUMA

A single chip is not enough! Basic architecture of shared-memory compute nodes
- Hardware/software layers (HT/QPI) provide a shared address space and ensure data coherency, so e.g. a single array A(1:1 000 000 000) can span the memory of both sockets
- Separate memory controllers -> scalable performance
- Single shared address space -> ease of use
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA): HT/QPI give scalable bandwidth at the price of ccNUMA: where does my data finally end up?

There is no longer a single flat memory: from UMA to ccNUMA (2-way nodes)
Yesterday: dual-socket Intel Core2 node
- Uniform Memory Architecture (UMA): "flat" memory; symmetric MPs
- But: system anisotropy
- Shared address space within the node!
Today: dual-socket Intel (Westmere) node
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
- HT/QPI provide scalable bandwidth at the expense of ccNUMA architectures: where does my data finally end up?
- On AMD it is even more complicated: ccNUMA within a chip!

Parallel computers: Shared-memory architectures - basic classification
- Shared-memory computers provide a single shared address space (memory) for all processors: all processors share the same view of the address space!
- Two basic categories of shared-memory systems:
  - Uniform Memory Access (UMA): memory is equally accessible to all processors, with the same performance (bandwidth & latency)
  - cache-coherent Non-Uniform Memory Access (ccNUMA): memory is physically distributed but appears as a single address space; performance (bandwidth & latency) differs for local and remote memory access
- Copies of the same cache line may reside in different caches; cache coherence protocols guarantee consistency at all times (for UMA & ccNUMA)
- Cache coherence protocols do not make parallel programming for shared-memory architectures any easier!

Parallel computers: Shared memory - UMA
- UMA architecture: a switch/bus arbitrates memory access
- "Flat" memory, a.k.a. Symmetric Multi-Processor (SMP)
- Data access speed (performance) does not depend on the data location
[Figure: CPUs 1-4, each with its own cache, connected via a switch/bus to a single shared memory]

Parallel shared-memory computers: ccNUMA node layout
- ccNUMA: a single address space despite physically distributed memory, realized through proprietary hardware concepts (e.g. NUMALink in SGI systems, QPI for Intel, HT for AMD)
- Advantages: the aggregate memory bandwidth is scalable; systems with more than 1024 cores are available (SGI)
- Disadvantages: cache coherence is hard to implement / expensive; performance depends on whether local or remote memory is accessed
- Examples: all modern multi-socket compute nodes

ccNUMA nodes: cache coherence & data locality
- Cache coherence is required for UMA & ccNUMA: multiple copies of the same cache line may exist in multiple caches - how to keep them coherent?
- ccNUMA adds the issue of data locality
- "Golden rule" of ccNUMA: a memory page gets mapped into the local memory of the processor that first touches it!

ccNUMA nodes: the golden rule
- All modern multi-socket servers are of ccNUMA type
- First-touch policy ("golden rule"): a memory page is mapped into the locality domain of the processor that first writes to it
- Consequences:
  - Mapping happens at initialization, not at allocation
  - Initialization and computation must use the same memory pages per thread/process
  - Affinity matters! Where are my threads/processes running?
  - It is sufficient to touch a single item to map the entire page

double precision, dimension(:), allocatable :: huge
allocate(huge(N))      ! memory not mapped yet
do i=1,N
  huge(i) = 0.d0       ! mapping happens here!
enddo

ccNUMA nodes: coding for data locality
Dense matrix-vector multiplication (dmvm):

void dmvm(int n, int m, double *lhs, double *rhs, double *mat) {
  int r, c, offset;
  #pragma omp parallel for private(offset,c) schedule(static)
  for (r = 0; r < n; ++r) {
    offset = m * r;
    for (c = 0; c < m; ++c)
      lhs[r] += mat[c + offset] * rhs[c];
  }
}

OpenMP parallelization?!

ccNUMA nodes: coding for data locality
Parallel first-touch initialization of the matrix data, using the same static schedule as the dmvm loop:

#pragma omp parallel for schedule(static) private(c)
for (r = 0; r < n; ++r)
  for (c = 0; c < m; ++c)
    mat[c + m*r] = ...;

[Figure: dmvm performance on 2 x 10 cores, comparing parallel vs. serial matrix data initialization]

Multicore processors: Summary
- Modern multicore processor chips come with an increasing number of cores
- Multiple hardware features contribute to the peak performance of a single processor chip:
    P_chip = n_core * n_super^FP * n_FMA * n_SIMD * f
- A single multicore processor chip is typically of UMA architecture; multiple chips form a ccNUMA architecture
- UMA architecture: "flat" memory
- ccNUMA architecture: concept of data locality, both in hardware & software (see golden rule)
- Golden rule for ccNUMA systems: the first-touch principle
