CAMA: Modern processors. Memory hierarchy: cache basics, data access locality, cache management


1 CAMA: Modern processors. Memory hierarchy: cache basics, data access locality, cache management. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center; Johannes Hofmann/Dietmar Fey, Department for Computer Science, University Erlangen-Nürnberg, summer semester 2016

2 Memory hierarchy: cache basics, data access locality, cache management

3 Von Neumann bottleneck reloaded: the DRAM gap. DP peak performance and peak main memory bandwidth for a single Intel processor (chip) have diverged over the years; the machine balance is approximately 10 F/B. Main memory access speed is not sufficient to keep the CPU busy, so fast on-chip caches are introduced, holding copies of recently used data items.

4 Schematic view of the modern memory hierarchy and cache logic. The CPU/arithmetic unit issues a LOAD request to transfer a data item to a register. The cache logic automatically checks all cache levels whether the data item is already in a cache. If the data item is in a cache ("cache hit"), it is loaded to the register. If the data item is in no cache level ("cache miss"), it is loaded from main memory and a copy is kept in the cache.

5 Memory hierarchies: effective bandwidths. Hardware quantities that characterize the quality of a memory hierarchy: Latency (T_l): set-up time for a data transfer from source (e.g. main memory or caches) to destination (e.g. registers). Bandwidth (b): maximum amount of data per second that can be transferred between source and destination. Application view: transfer time (T) and effective bandwidth (b_eff) depend on the data volume (V) to be transferred. Transfer time: T = T_l + V/b. Effective bandwidth: b_eff = V/T = V / (T_l + V/b). For low data volume (V -> 0): b_eff -> 0. For large data volume (V/b >> T_l): b_eff -> b.
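A minimal C sketch (not part of the slides) that evaluates b_eff for the latency and bandwidth figures used two slides below:

#include <stdio.h>

static double b_eff(double T_l, double b, double V) {
    return V / (T_l + V / b);   /* effective bandwidth in byte/s */
}

int main(void) {
    const double T_l = 64e-9;   /* 64 ns main memory latency */
    const double b   = 64e9;    /* 64 GB/s peak bandwidth    */
    const double V[] = { 8.0, 128.0, 4096.0 };
    for (int i = 0; i < 3; ++i)
        printf("V = %6.0f B -> b_eff = %5.2f GB/s\n",
               V[i], b_eff(T_l, b, V[i]) / 1e9);
    return 0;
}

It prints roughly 0.12, 1.94, and 32.00 GB/s for 8 B, 128 B, and 4096 B, matching the table on the latency-problem slide up to rounding.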

6 Latency and bandwidth in modern computer environments. (Plot: effective bandwidth b_eff = V / (T_l + V/b) as a function of data volume V, for latencies in the ns, µs, and ms range.)

7 Memory hierarchies: the latency problem. Main memory latency and bandwidth for modern multicore CPUs: T_l = 64 ns and b = 64 GB/s.

V        T_l     V/b        T           b_eff
8 B      64 ns   0.125 ns   64.125 ns   0.13 GB/s
128 B    64 ns   2 ns       66 ns       1.9 GB/s
4096 B   64 ns   64 ns      128 ns      32 GB/s

Data access is organized in cache lines (CL): always a full CL is transferred (V = 64 B or V = 128 B on modern architectures). Multiple CLs can be loaded concurrently: multiple data requests by the application code, and automatic hardware prefetching.

8 Memory hierarchies: cache lines. If one data item is loaded from main memory ("cache miss"), the whole cache line it belongs to is loaded. Cache lines are contiguous in main memory, i.e. neighboring items can then be used from cache. Example loop:

do i=1,n
  s = s + a(i)*a(i)
enddo

(Timeline: with a cache line size of 4 words, every fourth LD is a cache miss costing T_l + V/b; the three following iterations use data directly from the cache line.)

9 Memory hierarchies: (automatic) prefetching. Prefetching hides the memory latency of CL transfers: the data transfer of the next cache line is started before the cache miss occurs. Same loop as before:

do i=1,n
  s = s + a(i)*a(i)
enddo

(Timeline: as on the previous slide, but the LD of the next cache line overlaps with the "use data" phases of the current one, so the miss penalty T_l + V/b is hidden after the first line.)

10 Memory hierarchies: prefetching. Hide memory latency! Prefetch (PFT) instructions (of limited use on modern architectures) transfer one cache line from memory to cache, after which LDs to registers are issued. Most architectures (Intel/AMD x86, IBM POWER) use hardware-based automatic prefetch mechanisms: the hardware detects regular, consecutive memory access patterns ("streams") and prefetches at will. Intel x86: adjacent cache line prefetch loads 2 (64-byte) cache lines on an L3 miss, effectively doubling the line length on loads (typically enabled in the BIOS). Intel x86: the hardware prefetcher prefetches a complete page (4 KB) if 2 successive CLs in that page are accessed. For regular data access, main memory latency is therefore not an issue! For irregular access patterns, however, prefetching may generate excessive data transfers.
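A software prefetch can also be issued from source code; a hedged sketch (not from the slides) using the GCC/Clang builtin, with the prefetch distance PF_DIST as an assumed tuning parameter:

#define PF_DIST 64   /* prefetch distance in elements (illustrative choice) */

double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);  /* read access, keep in all cache levels */
        s += a[i] * a[i];
    }
    return s;
}

On hardware with working stream prefetchers this brings little or nothing; it mainly matters for access patterns the hardware cannot predict.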

11 Node-level architecture revisited. Memory hierarchy: cache basics, data access locality, cache management

12 Memory hierarchies: cache lines and spatial locality. Cache line use is optimal for contiguous access ("stride 1"): streaming. Non-consecutive access reduces performance; access with an unfortunate stride (e.g. the cache line size) can lead to a disastrous performance breakdown. Typical CL sizes: 64 byte (AMD/Intel) or 128 byte (IBM). Spatial locality: ensure accesses to neighboring data items!

GOOD ("streaming"):
do i=1,n
  s = s + a(i)*a(i)
enddo

BAD ("strided"):
do i=1,n,2
  s = s + a(i)*a(i)
enddo

If a(1:n) has to be loaded from main memory, both loops have the same runtime, i.e. the performance of the strided loop is half that of the contiguous one.
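Both variants can be expressed in one C routine (a sketch, not from the slides) by making the stride a parameter:

double sum_stride(const double *a, long n, long stride) {
    double s = 0.0;
    for (long i = 0; i < n; i += stride)   /* stride 1: streaming; stride 2: the BAD loop;
                                              stride 8: one access per 64 B line of doubles */
        s += a[i] * a[i];
    return s;
}

Out of cache, all strides transfer the same cache lines, which is why the runtimes match while the strided variants do only a fraction of the useful work per line.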

13 Memory hierarchies: spatial locality and data layout. How to traverse multidimensional arrays?! Example: initialize matrix A with A(i,j) = i*j. What is the storage order of a multidimensional data structure? It depends on the language. E.g., a 2-dimensional 3x3 array A of doubles occupies 72 contiguous bytes (addresses 0 B to 71 B):

FORTRAN stores column by column ("column major order"):
A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) A(3,3)

C/C++ stores row by row ("row major order"):
A[0][0] A[0][1] A[0][2] A[1][0] A[1][1] A[1][2] A[2][0] A[2][1] A[2][2]
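The row-major rule can be checked directly (a small C sketch, not from the slides): in the 3x3 example, a[i][j] lives at offset i*3 + j elements from the base address:

#include <assert.h>

void layout_check(void) {
    double a[3][3];
    double *base = &a[0][0];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            assert(&a[i][j] == base + (i * 3 + j));  /* row major: offset i*3 + j */
}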

14 Memory hierarchies: spatial locality and data layout. Default layout for FORTRAN: column by column (column major order).

Contiguous access:
do i=1,n
  do j=1,n
    a(j,i)=i*j
  enddo
enddo

Stride-n access:
do j=1,n
  do i=1,n
    a(j,i)=i*j
  enddo
enddo

FORTRAN: the inner loop must run over the leftmost array index. The data arrangement is then the transpose of the usual matrix layout.

15 Memory hierarchies: spatial locality and data layout. Default layout for C/C++: row by row (row major order).

Contiguous access:
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    a[i][j] = i*j;
  }
}

Stride-n access:
for(j=0; j<n; ++j) {
  for(i=0; i<n; ++i) {
    a[i][j] = i*j;
  }
}

In C: the inner loop must run over the rightmost array index.

16 Memory hierarchies: spatial locality and data layout. 3-dimensional arrays in C/C++:

Contiguous access:
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    for(k=0; k<n; ++k) {
      a[i][j][k] = i*j*k;
    }
  }
}

Stride-n*n access:
for(k=0; k<n; ++k) {
  for(j=0; j<n; ++j) {
    for(i=0; i<n; ++i) {
      a[i][j][k] = i*j*k;
    }
  }
}

C/C++: always use the rightmost index as the inner loop index if possible! Sometimes this is not possible, e.g. in a transpose, where one of the two arrays is necessarily accessed with stride n (spatial blocking may improve the situation here, cf. later and the sketch below):

for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    a[i][j] = b[j][i];
  }
}
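A hedged sketch of the spatial blocking just mentioned (the tile size B is an assumed tuning parameter): processing the transpose in B x B tiles keeps both the stride-1 and the stride-n access stream inside a tile small enough for the cache:

#define B 8   /* tile size: illustrative choice, to be tuned */

void transpose_blocked(int n, double *a, const double *b) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; ++i)
                for (int j = jj; j < jj + B && j < n; ++j)
                    a[i * n + j] = b[j * n + i];  /* a: stride 1, b: stride n, within one tile */
}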

17 Memory hierarchies: temporal locality. Phenomenon: some/many data items are accessed frequently ("temporal locality"): reuse data from cache at higher rates! If data is already in cache, reuse it from there; blocking techniques can often be applied. Example: dense matrix vector multiplication with A(R,C) and x(C) (assume that the cache is large enough to hold y(1:R)):

do c = 1, C
  tmp = x(c)   ! stays in a register during the inner loop
  do r = 1, R
    y(r) = y(r) + A(r,c) * tmp
  enddo
enddo

y(1:R) is loaded C times: temporal locality for C-1 of the accesses. A(:,:) is accessed contiguously: spatial locality.
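A C rendering of this loop nest (a sketch; it assumes A is stored column-major as a flat array, A(r,c) -> A[(c-1)*R + (r-1)], to preserve the contiguous access):

#include <stddef.h>

void mvm_colwise(int R, int C, const double *A, const double *x, double *y) {
    for (int c = 0; c < C; ++c) {
        double tmp = x[c];                       /* stays in a register          */
        for (int r = 0; r < R; ++r)
            y[r] += A[(size_t)c * R + r] * tmp;  /* contiguous (stride 1) access */
    }
}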

18 Node-level architecture revisited. Memory hierarchy: cache basics, data access locality, cache management

19 Memory hierarchies: cache mapping. The cache size (~MBs) is much smaller than main memory (~GBs), so a cache mapping strategy is required: a pairing of memory addresses with cache locations. Where is the CL for a given memory address placed in the cache (L1 cache ~10^3 byte, L2 cache ~10^6 byte, memory ~10^9 byte)? Different mapping strategies may be used throughout the memory hierarchy (i.e. per cache level). Extreme strategies: direct mapped (see next slide) vs. fully associative (a memory address can be mapped to any cache entry).

20 Memory hierarchies: cache mapping, direct mapping. Simplest mapping strategy: direct mapped caches. Every memory address is mapped to exactly one cache entry. Easy to handle/implement in hardware: e.g., if the cache size is 1 KB, the lowest 10 bits of the memory address identify the location in the cache. The mapping substantially restricts the flexibility of replacement strategies: it reduces the potential set of evict/replace locations and may incur additional data transfers ("cache thrashing")!
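In code, the mapping of the 1 KB example looks as follows (a sketch; the 64-byte line size is an assumption in line with the rest of the slides):

unsigned dm_entry(unsigned addr) {
    /* 1 KB direct-mapped cache, 64 B lines -> 16 entries; bits 0..5 address
       the byte within the line, bits 6..9 select the entry: the lowest 10
       bits fully determine the location in the cache. */
    return (addr >> 6) & 0xF;
}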

21 Memory hierarchies: cache mapping, associative caches. Set-associative cache: in an m-way associative cache of size m x n, each memory location i can be mapped to the m cache locations ("ways") j*n + mod(i,n), j = 0..m-1. E.g., a 2-way set-associative cache of size 256 KB with 64-byte cache lines: number of sets = 256 KB / 64 byte / 2 = 2048; the memory address (32 bit) splits into a tag, the set index, and the offset within the cache line. Modern processors use 4-way to 48-way associative caches.
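The corresponding address decomposition as a small C sketch (not from the slides):

void split_addr(unsigned addr, unsigned *tag, unsigned *set, unsigned *off) {
    *off = addr & 0x3F;          /* byte within the 64 B cache line (6 bits)      */
    *set = (addr >> 6) & 0x7FF;  /* one of the 2048 sets (11 bits)                */
    *tag = addr >> 17;           /* remaining 15 bits, stored alongside the line  */
}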

22 Memory hierarchies: cache mapping and cache thrashing. If many memory locations are used that map to the same set, cache reuse can be very limited even with m-way associative caches. If the cache (the m ways of a set) is full and new data comes in from main memory, data in the cache (a full cache line) must be invalidated or written back to main memory. Warning: powers of 2 in the leading array dimensions of multidimensional arrays should be avoided (cache thrashing)! E.g., with double precision A(16384,16384), the column starts A(1,1), A(1,2), A(1,3), ... are 128 KB apart and map to the same sets. Ensure spatial and temporal data locality for data accesses!

23 Memory hierarchies: cache thrashing example. Example: 2D square lattice. At each lattice point, one velocity for each of the 4 directions is stored (N=16):

real*8 vel(1:N, 1:N, 4)

s=0.d0
do j=1,N
  do i=1,N
    s = s + vel(i,j,1) - vel(i,j,2) + vel(i,j,3) - vel(i,j,4)
  enddo
enddo

24 Memory hierarchies: cache thrashing example. Memory-to-cache mapping for vel(1:16, 1:16, 4). Toy cache: 256 byte (= 32 doubles), 2-way associative with 128 byte (16 doubles) per way, cache line size 32 byte, hence 4 sets. The component blocks vel(:,:,1..4) are 16*16 doubles = 2048 byte each, a multiple of the set stride, so vel(i,j,1), vel(i,j,2), vel(i,j,3), and vel(i,j,4) all map to the same set. At i=1, j=1 the loop needs the four cache lines holding vel(1:4,1,1..4), but only 2 ways are available: the lines keep evicting each other. Each cache line must be loaded 4 times from main memory to cache!

25 Memory hierarchies: cache thrashing example. Memory-to-cache mapping for the padded array vel(1:16+2, 1:16+2, 4), same toy cache (256 byte = 32 doubles, 2-way associative with 16 doubles per way, cache line size 32 byte). The component blocks are now 18*18 doubles = 2592 byte each, which is not a multiple of the set stride, so vel(i,j,1..4) map to different cache sets and can reside in the cache simultaneously. Each cache line needs to be loaded only once from memory to cache!
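The set arithmetic behind these two slides can be reproduced in a few lines of C (a sketch, not from the slides):

#include <stdio.h>

int main(void) {
    /* toy cache: 256 B, 2-way, 32 B lines -> 4 sets; set(addr) = (addr/32) % 4 */
    for (int N = 16; N <= 18; N += 2) {
        long stride = (long)N * N * 8;   /* bytes between vel(1,1,k) and vel(1,1,k+1) */
        printf("N=%2d:", N);
        for (int k = 0; k < 4; ++k)
            printf("  component %d -> set %ld", k + 1, (k * stride / 32) % 4);
        printf("\n");
    }
    return 0;
}

For N=16, all four components map to set 0 and thrash the 2 ways; for N=18, they map to sets 0, 1, 2, 3 and coexist in the cache.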

26 Memory hierarchies: cache management details. Cache misses: LOAD miss: if a data item (e.g. a[2]) to be loaded to a register is not available in any cache, the full cache line holding the item (e.g. a[0:7]) is loaded from main memory to cache. STORE miss: what if a data item to be modified (e.g. a[2]=0.0) is not in cache? One cache line is the minimum data transfer unit between main memory and cache (e.g. a[0:7]), so: load the cache line from main memory to cache ("write allocate"), modify the data item in cache, and later evict/write back the full cache line to main memory (the actual store to main memory). The overall data transfer volume increases by up to 2x! (NT stores: no increase.)

do i=1,n
  do j=1,n
    a(j,i) = 0.0
  enddo
enddo

Here n^2 words are loaded from main memory to cache (write allocate) and n^2 words are evicted/written back to main memory!
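The NT-store remark can be illustrated with AVX intrinsics; a hedged sketch (it assumes a is 32-byte aligned and n is a multiple of 4):

#include <immintrin.h>
#include <stddef.h>

void zero_nt(double *a, size_t n) {
    __m256d z = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4)
        _mm256_stream_pd(a + i, z);  /* streaming store: no write allocate of the line */
    _mm_sfence();                    /* order NT stores before subsequent loads        */
}

Since the cache lines are never read, only the stored words travel over the memory bus and the write-allocate traffic disappears.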

27 Memory hierarchies: data transfers. Caches help with getting instructions and data to the CPU fast. How does data travel from memory to the CPU and back? Remember: caches are organized in cache lines (e.g., 64 bytes), and only complete cache lines are transferred between memory hierarchy levels (except to/from registers). Cache MISS: a load or store instruction does not find the data in a cache level, so a CL transfer is required. Example: array copy A(:)=C(:) on a 2-level inclusive cache hierarchy: LD C(1) misses in L1 and L2, so the CL is loaded to L2, then to L1, and the data item to the register; the subsequent LD C(2..N_cl) hit in cache. ST A(1) also misses and triggers a write allocate of A's cache line; the modified line is evicted (delayed) later. In total: 3 CL transfers between memory and caches per copied cache line.

28 Memory hierarchies: cache management details. Inclusive: a cache line has a copy in all levels. Reduced effective size of the outer cache levels; cheap eviction for unmodified cache lines; higher latency, since cache lines have to load through the hierarchy. (Intel processors.) Exclusive: only one cache line copy exists in the whole cache hierarchy. Full aggregate effective cache size; eviction is expensive (copy back); lower latency, since data can be loaded directly into the L1 cache. (AMD processors.) "Write back": a modified cache line is evicted to the next (lower) cache/memory level before it is overwritten by new data. "Write through": when a cache line is updated, the cache line copy in the next (lower) cache/memory level is updated as well.

29 Memory hierarchies: cache management details, tags. Every cache line (CL) has an associated cache tag (also located in the cache, but not directly visible to the programmer). The cache tag contains information about: the main memory address associated with the CL (tag information + set information = base memory address of the CL), and the status of the CL (e.g. invalid, shared, exclusive, modified), which e.g. determines what to do with the cache entry if it needs to be replaced.

30 Memory hierarchies: typical cache configuration. Intel Xeon E5 ("Sandy Bridge"):

# FP (SIMD) registers: 16
# GP registers: 16
L1D: 32 KB, 8-way associative, local per core
L2: 256 KB, 8-way associative, local per core
L3: 20 MB, 20-way associative, shared across all cores (size depends on core count, CPU variant, CoD mode)

The structure is the same for the more recent Intel architectures Ivy Bridge and Haswell.

31 Intel Xeon E5 multicore processors. (Table: FP instruction throughput per core, maximum data transfer per cycle between cache levels, and peak main memory bandwidth.)

32 Characterization of memory hierarchies. Determine the performance levels with a low-level benchmark, the vector triad:

DOUBLE PRECISION, dimension(SIZE) :: A,B,C,D
DOUBLE PRECISION :: S,E,MFLOPS
! input: N .le. SIZE
DO i=1,N
  A(i) = 0.d0; B(i) = 1.d0; C(i) = 2.d0; D(i) = 3.d0  ! initialize
ENDDO
call get_walltime(S)
DO ITER=1,NITER
  DO i=1,N
    A(i) = B(i) + C(i) * D(i)  ! 3 loads + 1 store; 2 FLOP
  ENDDO
  IF (A(2).lt.0) call dummy(A,B,C,D)  ! prevent loop interchange
ENDDO
call get_walltime(E)
MFLOPS = NITER * N * 2.d0 / ((E-S) * 1.d6)
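A C rendering of the same benchmark (a sketch: omp_get_wtime() stands in for get_walltime, and dummy() is assumed to be defined in a separate compilation unit so the compiler cannot optimize the loop away):

#include <stdlib.h>
#include <omp.h>

void dummy(double*, double*, double*, double*);   /* defined elsewhere (assumption) */

double triad_mflops(long N, long NITER) {
    double *A = malloc(N * sizeof *A), *B = malloc(N * sizeof *B);
    double *C = malloc(N * sizeof *C), *D = malloc(N * sizeof *D);
    for (long i = 0; i < N; ++i) { A[i] = 0.; B[i] = 1.; C[i] = 2.; D[i] = 3.; }
    double S = omp_get_wtime();
    for (long it = 0; it < NITER; ++it) {
        for (long i = 0; i < N; ++i)
            A[i] = B[i] + C[i] * D[i];     /* 3 loads + 1 store; 2 flops */
        if (A[2] < 0.) dummy(A, B, C, D);  /* prevent loop interchange   */
    }
    double E = omp_get_wtime();
    double mflops = NITER * N * 2.0 / ((E - S) * 1e6);
    free(A); free(B); free(C); free(D);
    return mflops;
}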

33 Memory hierarchies: measured performance levels. Vector triad single-core performance: A(1:N) = B(1:N) + C(1:N)*D(1:N). Can we explain the performance based on hardware features? (Plot: performance versus loop length N, with plateaus corresponding to the L1, L2, and L3 cache sizes and main memory.)

34 CAMA: Multicore processors. Moore's law and multicore technology; basic compute architecture. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center; Dietmar Fey, Department for Computer Science, University Erlangen-Nürnberg, summer semester 2016

35 Moore's law driving multicore technology

36 Introduction: Moore's law continues. Transistor counts keep growing, e.g. Nvidia Maxwell: 8 billion; Intel Haswell EP: 5.6 billion. 1965: G. Moore claimed that the number of transistors on a microchip doubles every 12 to 24 months.

37 Introduction: Moore's law, but clock speeds saturate. (Plot: Intel x86 clock frequency [MHz] over time, from the 1-core Nocona to the 8-core Sandy Bridge, 12-core Ivy Bridge, and 18-core Haswell: frequencies level off while core counts grow.)

38 Introduction: trends to consider. The clock speed of multicore chips will not increase. Power/energy saving mechanisms in hardware mean the clock speed depends on execution-time parameters, e.g. the number of cores used, the type of application executed, and the environment temperature. The transistor budget can be invested in various directions: execution units (#FMAs), width of execution units (SIMD: 128 bit -> 256 bit -> 512 bit), cache sizes, #cores (n_core = 2, 4, ..., 22), and additional functionality, e.g. PCIe or a GPU on chip.

39 There is no longer a single driving force for chip performance! Floating point (FP) peak performance of a single chip: P_chip = n_core * P_core, with P_core = n_super^FP * n_FMA * n_SIMD * f. Example: Intel Xeon E5 v4 ("Broadwell", up to 22-core variants are available): f = 2.2 GHz, n_core = 22, n_super^FP = 2, n_FMA = 2, n_SIMD = 4, so P_chip = 22 * 2 * 2 * 4 * 2.2 GHz = 774.4 GF/s (double precision). But: P_chip = 8.8 GF/s for serial, non-vectorized code (n_core = 1, n_SIMD = 1).

40 There is no single driving force for single-core performance! P_chip = n_core * n_super^FP * n_FMA * n_SIMD * f, with n_core (cores), n_super^FP (instructions/cycle: superscalarity), n_FMA (ops/instruction: FMA factor), n_SIMD (SIMD factor), and clock speed f [GHz]. (Table: these factors and the resulting P_chip [GF/s] for the Intel server generations Nehalem (Q1/2009), Westmere (Q1/2010), Sandy Bridge (Q1/2012), Ivy Bridge (Q3/2013), Haswell (Q3/2014), and Broadwell (Q1/2016), plus IBM POWER8 (Q2/2014, S822LC), an Nvidia K-series GPU, and the Intel Xeon Phi 5110P. Over these generations the DP SIMD factor grew from 2 (SSE) to 4 (AVX), and Haswell/Broadwell added 2-way FMA.)

41 Intel Xeon E5-2600v3 dual-socket server (2014). One Xeon E5-2600v3 "Haswell EP" chip: up to 18 cores running at 2.3 GHz (max. 3.6 GHz); simultaneous multithreading (SMT) reports it as a 36-way chip; up to 40 MB cache and 40 PCIe 3.0 lanes; 5.7 billion transistors in 22 nm; die size 662 mm^2. Standard HPC/server configuration: 2-socket server (2 x 18 cores).

42 Basic compute node architecture: from UMA to ccNUMA

43 A single chip is not enough! Basic architecture of shared-memory compute nodes: hardware/software layers (HT/QPI) provide a shared address space and ensure data coherency. Separate memory controllers give scalable performance; the single shared address space gives ease of use. (Diagram: an array A(1: ) spans the memories of both sockets.) Cache-coherent Non-Uniform Memory Architecture (ccNUMA): HT/QPI deliver scalable bandwidth at the price of ccNUMA: where does my data finally end up?

44 There is no longer a single flat memory: from UMA to ccNUMA in 2-way nodes. Yesterday: dual-socket Intel Core2 node: Uniform Memory Architecture (UMA), a "flat memory" of symmetric multiprocessors, but with system anisotropy; shared address space within the node! Today: dual-socket Intel (Westmere) node: cache-coherent Non-Uniform Memory Architecture (ccNUMA); HT/QPI provide scalable bandwidth at the expense of ccNUMA: where does my data finally end up? On AMD it is even more complicated: ccNUMA within a chip!

45 Parallel computers: shared-memory architectures, basic classification. Shared-memory computers provide a single shared address space (memory) for all processors; all processors share the same view of the address space! Two basic categories of shared-memory systems: Uniform Memory Access (UMA): memory is equally accessible to all processors with the same performance (bandwidth and latency). Cache-coherent Non-Uniform Memory Access (ccNUMA): memory is physically distributed but appears as a single address space; performance (bandwidth and latency) differs for local and remote memory access. Copies of the same cache line may reside in different caches; cache coherence protocols guarantee consistency at all times (for UMA and ccNUMA). Cache coherence protocols do not make parallel programming for shared-memory architectures any easier!

46 Parallel computers: shared memory, UMA. UMA architecture: a switch/bus arbitrates memory access; "flat memory", a.k.a. symmetric multiprocessor (SMP). Data access speed (performance) does not depend on the data location. (Diagram: CPUs 1-4, each with its own cache, connected through a switch/bus to a common memory.)

47 Parallel shared-memory computers: ccNUMA node layout. ccNUMA: a single address space despite physically distributed memory, through proprietary hardware concepts (e.g. NUMALink in SGI systems, QPI for Intel, HT for AMD). Advantages: the aggregate memory bandwidth is scalable; systems with more than 1024 cores are available (SGI). Disadvantages: cache coherence is hard to implement and expensive; performance depends on whether access goes to local or remote memory. Examples: all modern multi-socket compute nodes.

48 ccNUMA nodes: cache coherence and data locality. Cache coherence is required for UMA and ccNUMA: multiple copies of the same cache line may exist in multiple caches; how are they kept coherent? ccNUMA additionally raises the question of data locality. "Golden rule" of ccNUMA: a memory page gets mapped into the local memory of the processor that first touches it!

49 ccNUMA nodes: the golden rule. All modern multi-socket servers are of ccNUMA type. First-touch policy ("golden rule"): a memory page is mapped into the locality domain of the processor that first writes to it. Consequences: mapping happens at initialization, not at allocation; initialization and computation must use the same memory pages per thread/process; affinity matters: where are my threads/processes running? It is sufficient to touch a single item to map the entire page.

double precision, dimension(:), allocatable :: huge
allocate(huge(N))   ! memory not mapped yet
do i=1,N
  huge(i) = 0.d0    ! mapping happens here!
enddo
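The same idiom in C with OpenMP (a sketch, not from the slides): the pages are touched with the same static schedule that the later compute loops will use:

#include <stdlib.h>

double *alloc_first_touch(long n) {
    double *huge = malloc(n * sizeof *huge);  /* memory allocated, but not mapped yet */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        huge[i] = 0.0;                        /* page gets mapped here (first touch) */
    return huge;
}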

50 ccNUMA nodes: coding for data locality. Dense matrix vector multiplication (dmvm):

void dmvm(int n, int m, double *lhs, double *rhs, double *mat) {
#pragma omp parallel for schedule(static)
    for (int r = 0; r < n; ++r) {   /* r, offset, c are loop-local, hence thread-private */
        int offset = m * r;
        for (int c = 0; c < m; ++c)
            lhs[r] += mat[c + offset] * rhs[c];
    }
}

OpenMP parallelization?!

51 ccNUMA nodes: coding for data locality. Parallelize the matrix data initialization with the same static schedule as the dmvm loop:

#pragma omp parallel for schedule(static)
for (int r = 0; r < n; ++r)
    for (int c = 0; c < m; ++c)
        mat[c + m*r] = ...;   /* initialization value elided on the slide */

(Plot: dmvm performance on a 2 x 10-core node. With parallel matrix data initialization, performance scales from the first to the second locality domain; with serial initialization, all pages end up in one domain and performance saturates.)

52 Multicore processors: summary. Modern multicore processor chips come with an increasing number of cores. Multiple hardware features contribute to the peak performance of a single processor chip: P_chip = n_core * n_super^FP * n_FMA * n_SIMD * f. A single multicore processor chip is typically of UMA architecture: flat memory. Multiple chips form ccNUMA architectures: the concept of data locality matters both in hardware and software (see the golden rule). Golden rule for ccNUMA systems: the first-touch principle.

53 2015 slides that have been replaced
