CAMA: Modern processors. Memory hierarchy: Caches. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center


1 CAMA: Modern processors Memory hierarchy: Caches Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center Johannes Hofmann/Dietmar Fey, Department for Computer Science, University Erlangen-Nürnberg, Sommersemester 2015

2 Schematic view of modern memory hierarchy & cache logic The CPU/arithmetic unit issues a LOAD request to transfer a data item to a register. The cache logic automatically checks all cache levels whether the data item is already in a cache. If the data item is in a cache ("cache hit"), it is loaded to the register. If the data item is in no cache level ("cache miss"), it is loaded from main memory and a copy is held in cache.

3 Memory hierarchies: Latency problem & cache line Two quantities characterize the quality of each memory hierarchy: Latency (T_lat): time to set up the memory transfer from source (main memory or caches) to destination (registers). Bandwidth (BW): maximum amount of data which can be transferred per second between source (main memory or caches) and destination (registers). Transfer time: T = T_lat + (amount of data)/BW. Effective bandwidth: BW_eff = (amount of data)/T. For a small amount of data: BW_eff << BW. For a large amount of data: BW_eff ~ BW.

4 Memory hierarchies: Latency problem & cache line Typical values for modern microprocessors: T_lat = 100 ns, BW = 4 GB/s. For amount of data = 8 byte (one double): T = 102 ns (100 ns of which is latency!), i.e. a data transfer rate of 8 B / 102 ns = 0.08 B/ns = 0.08 GB/s. Data access is therefore organized in cache lines (CL) that are transferred as a whole, e.g. amount of data = 128 byte (16 doubles): T = 132 ns (100 ns from latency), i.e. 128 B / 132 ns = 0.97 B/ns = 0.97 GB/s. Data transfers between memory and cache, as well as between caches, always happen at CL granularity! This is still not sufficient to hide most of the memory latency, so multiple non-blocking cache line transfers are supported (automatic hardware prefetchers).
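
A minimal sketch of this transfer-time model in C; the latency and bandwidth constants are the illustrative numbers from the slide, not measured values:

#include <stdio.h>

/* Latency/bandwidth model: T = T_lat + bytes/BW, BW_eff = bytes/T */
int main(void) {
    const double t_lat_ns = 100.0;           /* assumed setup latency in ns        */
    const double bw       = 4.0;             /* assumed bandwidth: 4 GB/s = 4 B/ns */
    const double sizes[]  = { 8.0, 128.0 };  /* one double vs. one cache line      */

    for (int i = 0; i < 2; ++i) {
        double t_ns   = t_lat_ns + sizes[i] / bw;  /* total transfer time in ns    */
        double bw_eff = sizes[i] / t_ns;           /* effective B/ns, i.e. GB/s    */
        printf("%6.0f B: T = %6.1f ns, BW_eff = %.2f GB/s\n", sizes[i], t_ns, bw_eff);
    }
    return 0;
}

Running it reproduces the slide's numbers: 0.08 GB/s for a single double and 0.97 GB/s for a full 128-byte cache line.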

5 Memory hierarchies: Cache lines If one item is loaded from main memory ("cache miss"), the whole cache line it belongs to is loaded into the caches. Cache lines are contiguous in main memory, i.e. neighboring items can then be used from cache. [Diagram: iteration timeline for the loop below with a cache line size of 4 words; the first LD of each line is a cache miss that pays the full latency, the following three LDs ("use data") hit in cache.]
do i=1,n
  s = s + a(i)*a(i)
enddo

6 Memory Hierarchies: Cache line Spatial locality The cache line addresses the latency problem, not the bandwidth bottleneck. Cache line use is optimal for contiguous access ("stride 1"): STREAMING. Non-consecutive access reduces performance; access with the wrong stride (e.g. stride = cache line size) can lead to a disastrous performance breakdown. Typical CL sizes: 64 byte or 128 byte. Calculations get cache bandwidth inside the cache line, but main memory bandwidth still limits the speed of the cache line transfer. "Spatial locality": ensure accesses to neighboring data items.
GOOD ("streaming"):
do i=1,n
  s = s + a(i)*a(i)
enddo
BAD ("strided"):
do i=1,n,2
  s = s + a(i)*a(i)
enddo
If a(1:n) has to be loaded from main memory: same runtime! Performance of the strided loop is half that of the contiguous one.
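
The same contrast in C (a sketch; the array a and its length n are assumed to be set up by the caller):

#include <stddef.h>

/* Stride-1 ("streaming") reduction: every element of each loaded
   cache line is used. */
double sum_stride1(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * a[i];
    return s;
}

/* Stride-2 ("strided") reduction: only every other element is used,
   yet the same cache lines must be transferred from main memory. */
double sum_stride2(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 2)
        s += a[i] * a[i];
    return s;
}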

7 Memory Hierarchies: Cache size Temporal locality If the cache is full, old data items need to be removed when new data items come in. Cache lines "wear out": the age of a cache line is its last access time (remember the cache replacement strategies: LRU, ...). Efficient use of caches requires some locality of reference, i.e. a data item loaded to cache needs to be reused several times soon ("temporal locality") before it gets old. Assume large N:
A(1:N)=B(1:N)+Z(1:N)
C(1:N)=C(1:N)*Z(1:N)
E(1:N)=Z(1:N)+A(1:N)*C(1:N)
versus the fused loop
DO I = 1,N
  A(I) = B(I)+Z(I)
  C(I) = C(I)*Z(I)
  E(I) = Z(I)+A(I)*C(I)
ENDDO
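
The same loop fusion sketched in C (arrays a, b, c, e, z of length n are assumed): fusing the three sweeps reuses a[i], c[i] and z[i] while they are still in cache, instead of streaming each array through the cache three times.

#include <stddef.h>

/* Three separate sweeps: for large n, z[], a[] and c[] are evicted
   between loops and must be re-loaded from main memory each time. */
void separate(double *a, const double *b, double *c,
              double *e, const double *z, size_t n) {
    for (size_t i = 0; i < n; ++i) a[i] = b[i] + z[i];
    for (size_t i = 0; i < n; ++i) c[i] = c[i] * z[i];
    for (size_t i = 0; i < n; ++i) e[i] = z[i] + a[i] * c[i];
}

/* Fused sweep: each element is reused immediately (temporal locality). */
void fused(double *a, const double *b, double *c,
           double *e, const double *z, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + z[i];
        c[i] = c[i] * z[i];
        e[i] = z[i] + a[i] * c[i];
    }
}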

8 Memory Hierarchies: Data Layout & contiguous access How to traverse multidimensional arrays? Example: initialize matrix A with A(i,j) = i*j. What is the storage order of a multidimensional data structure? It depends. E.g., for a 2-dimensional 3x3 array A of doubles (72 bytes, addresses 0 to 71): FORTRAN stores column by column ("column major order"), memory layout: A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) A(3,3). C/C++ stores row by row ("row major order"), memory layout: A[0][0] A[0][1] A[0][2] A[1][0] A[1][1] A[1][2] A[2][0] A[2][1] A[2][2].
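
The two layouts correspond to two linearization formulas; a small illustrative sketch (the helper names are made up here):

#include <stddef.h>

/* Byte offset of element (i, j) in an nrows x ncols array of doubles. */

/* C/C++ row major: rows are contiguous, j is the "fast" index. */
size_t offset_row_major(size_t i, size_t j, size_t ncols) {
    return (i * ncols + j) * sizeof(double);
}

/* Fortran column major (shown 0-based): columns are contiguous,
   the first index i is the "fast" index. */
size_t offset_col_major(size_t i, size_t j, size_t nrows) {
    return (j * nrows + i) * sizeof(double);
}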

9 Memory Hierarchies: Data Layout & contiguous access Default layout for FORTRAN: column by column (column major order).
do i=1,n
  do j=1,n
    a(j,i)=i*j
  enddo
enddo
Contiguous access!
do j=1,n
  do i=1,n
    a(j,i)=i*j
  enddo
enddo
Stride-n access!
FORTRAN: the inner loop must access the innermost/leftmost array index. The data arrangement is the transpose of the usual matrix layout.

10 Memory Hierarchies: Data Layout & contiguous access Default layout for C/C++: row by row (row major order).
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    a[i][j] = i*j;
  }
}
Contiguous access!
for(j=0; j<n; ++j) {
  for(i=0; i<n; ++i) {
    a[i][j] = i*j;
  }
}
Stride-n access!
In C: the inner loop must access the outermost/rightmost array index.

11 Memory Hierarchies: Data Layout & contiguous access 3-dimensional arrays in C/C++:
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    for(k=0; k<n; ++k) {
      a[i][j][k] = i*j*k;
    }
  }
}
Contiguous access!
for(k=0; k<n; ++k) {
  for(j=0; j<n; ++j) {
    for(i=0; i<n; ++i) {
      a[i][j][k] = i*j*k;
    }
  }
}
Stride N*N access!
C/C++: always start with the rightmost index as the inner loop index if possible! Sometimes this is not possible: in the transpose below, one of the two arrays is necessarily accessed with stride n (spatial blocking may improve the situation, cf. the sketch after the code and later lectures):
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    a[i][j] = b[j][i];
  }
}
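
A minimal spatial-blocking sketch for the transpose. The block size BS is a tuning parameter (chosen so that two BS x BS tiles fit in cache), and n is assumed to be a multiple of BS to keep the sketch short:

#include <stddef.h>

#define BS 64  /* block edge length; tune to the cache size */

/* Blocked transpose: a and b are touched in BS x BS tiles, so the
   strided accesses to b stay within cache lines that were loaded
   for the current tile and are fully used before eviction. */
void transpose_blocked(size_t n, double a[n][n], double b[n][n]) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t i = ii; i < ii + BS; ++i)
                for (size_t j = jj; j < jj + BS; ++j)
                    a[i][j] = b[j][i];
}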

12 Memory Hierarchies: Cache Mapping (see D. Fey) Cache mapping: pairing of memory locations with cache locations, e.g. mapping 1 GB (10^9 byte) of main memory to 1 MB (10^6 byte) of cache. Static mapping: directly mapped caches vs. m-way set associative caches. The mapping substantially impacts the flexibility of the replacement strategy: it reduces the potential set of evict/replace locations and may incur additional data transfer ("cache thrashing")!

13 Memory Hierarchies: Cache Mapping Directly mapped (see D. Fey) Directly mapped cache: every memory location can be mapped to exactly one cache location. If the cache size is n, the i-th memory location can be stored at cache location mod(i,n). Easy to implement & fast lookup, e.g. mapping of 1 MB (20-bit memory address) to 1 KB (10-bit cache address). No penalty for stride-one access, but memory access with stride = cache size will not allow caching of more than one line of data, i.e. the effective cache size is one line!

14 Memory Hierarchies: Cache Mapping Directly Mapped (see D. Fey) [Diagram: memory locations 0, 1, ..., N-1, N, N+1, ... wrap cyclically onto a cache of N locations.] Example: directly mapped cache, each memory location can be mapped to one cache location only. E.g. size of main memory = 1 GByte, cache size = 256 KB: 4096 memory locations are mapped to the same cache location.

15 Memory Hierarchies: Cache Mapping Associative Caches Set-associative cache: in an m-way associative cache of size m x n, each memory location i can be mapped to the m cache locations j*n + mod(i,n), j=0..m-1. E.g., a 2-way set associative cache of 256 KByte: way 0 covers 0..128 KB, way 1 covers 128 KB+1..256 KB; number of sets = 256 KB / 64 byte / 2 = 2048; a 32-bit memory address decomposes into tag, set address, and address within the cache line. Ideal world: a fully associative cache, where every memory location can be mapped to any cache line, makes thrashing nearly impossible. But the higher the associativity, the larger the overhead, e.g. latencies increase.
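
A sketch of this address decomposition for the 256 KB 2-way example (illustrative C; a directly mapped cache, as on slide 13, is the special case WAYS = 1):

#include <stdint.h>

/* 2-way set associative, 256 KB total, 64-byte lines:
   sets = 256*1024 / 64 / 2 = 2048, i.e. 11 set-index bits. */
enum { LINE = 64, WAYS = 2, SIZE = 256 * 1024, SETS = SIZE / LINE / WAYS };

typedef struct { uint32_t tag, set, offset; } cache_addr;

cache_addr decompose(uint32_t address) {
    cache_addr c;
    c.offset = address % LINE;           /* byte within the cache line (6 bits) */
    c.set    = (address / LINE) % SETS;  /* which set the line must go to (11 bits) */
    c.tag    = (address / LINE) / SETS;  /* identifies the line within its set  */
    return c;
}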

16 Memory hierarchies: Cache Mapping Associative Caches (see D. Fey) [Diagram: memory locations 0, 1, ..., N-1, N, N+1, ... mapped onto a 2-way associative cache.] Example: 2-way associative cache, each memory location can be mapped to two cache locations ("ways") within the same set. E.g. size of main memory = 1 GByte, cache size = 256 KB: 8192 memory locations are mapped to two cache locations.

17 Memory hierarchies: Pitfalls & Problems If many memory locations are used that are mapped to the same set, cache reuse can be very limited, even with m-way associative caches. Warning: using powers of 2 in the leading array dimensions of multi-dimensional arrays should be avoided ("cache thrashing")! E.g. for double precision A(16384,16384), the row elements A(1,1), A(1,2), A(1,3), ... are 16384*8 bytes apart, a power of 2, so they all compete for the same cache sets. If the cache / the m ways are full and new data comes in from main memory, data in cache (a full cache line) must be invalidated or written back to main memory. Ensure spatial and temporal data locality for data access!

18 Memory hierarchies: Cache thrashing - Example Example: 2D square lattice. At each lattice point, 4 velocities (one for each of the 4 directions) are stored. N=16:
real*8 vel(1:N, 1:N, 4)
s=0.d0
do j=1,N
  do i=1,N
    s=s+vel(i,j,1)-vel(i,j,2)+vel(i,j,3)-vel(i,j,4)
  enddo
enddo

19 Memory hierarchies: Cache thrashing - Example Memory-to-cache mapping for vel(1:16, 1:16, 4). Cache: 256 byte (= 32 doubles), 2-way associative (2 ways with 16 doubles each), cache line size = 32 byte. The four components vel(1:16,1:16,1), ..., vel(1:16,1:16,4) lie 16*16*8 = 2048 bytes apart in memory, a multiple of the way size, so for i=1, j=1 the four cache lines holding vel(1:4,1,1), vel(1:4,1,2), vel(1:4,1,3) and vel(1:4,1,4) all map to the same set, which holds only 2 lines. Each cache line must be loaded 4 times from main memory to cache!

20 Memory hierarchies: Cache thrashing - Example Memory-to-cache mapping for the padded array vel(1:16+2, 1:16+2, 4). Cache: 256 byte (= 32 doubles), 2-way associative (2 ways with 16 doubles each), cache line size = 32 byte. The four components are now 18*18*8 = 2592 bytes apart, which is not a multiple of the way size, so for i=1, j=1 the four cache lines holding vel(1:4,1,1), ..., vel(1:4,1,4) fall into different sets. Each cache line needs to be loaded only once from memory to cache!
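
The same padding trick sketched in C (PAD = 2 extra elements per leading dimension break the power-of-2 spacing of the four velocity components; names are illustrative):

#define N   16
#define PAD 2   /* padded leading dimensions: 18 instead of 16 */

/* vel[dir][j][i]: with the padding, the four direction components of a
   lattice site are 18*18*8 = 2592 bytes apart and no longer map to the
   same cache set, so thrashing is avoided. */
static double vel[4][N + PAD][N + PAD];

double sum_velocities(void) {
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += vel[0][j][i] - vel[1][j][i] + vel[2][j][i] - vel[3][j][i];
    return s;
}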

21 Memory hierarchies: Cache management details Cache misses: LOAD miss: if a data item to be loaded to a register (e.g. a[2]) is not available in cache, the full cache line holding it (e.g. a[0:7]) is loaded from main memory to cache. STORE miss: what if a data item to be modified (e.g. a[2]=0.0) is not in cache? One cache line is the minimum data transfer unit between main memory and cache (e.g. a[0:7]), so the cache line is first loaded from main memory to cache ("write allocate"), the data item is modified in cache, and later the full cache line is evicted/written back to main memory (the actual store to main memory). The overall data transfer volume increases by up to 2x! (NT stores: no increase.)
do i=1,n
  do j=1,n
    a(j,i)= 0.0
  enddo
enddo
n^2 words are loaded from main memory to cache (write allocate) and n^2 words are evicted/written back to main memory!

22 Memory hierarchies: Cache management details How does data travel from memory to the CPU and back? Example: array copy A(:)=C(:). Standard stores (write allocate): LD C(1) MISS, ST A(1) MISS (triggers a write allocate of A's cache line), LD C(2..N_cl) HIT, ST A(2..N_cl) HIT, evict A's line later (delayed). That is 3 cache line transfers over the memory bus per copied line: load C(:), write-allocate A(:), evict A(:). Nontemporal (NT) stores use a special store instruction to avoid the write allocate: LD C(1) MISS, NTST A(1), LD C(2..N_cl) HIT, NTST A(2..N_cl). Only 2 cache line transfers, a 50% performance boost for COPY.
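
A sketch of such a copy with nontemporal stores via AVX intrinsics (assumptions: AVX is available, a and c are 32-byte aligned, and n is a multiple of 4; compile with AVX enabled):

#include <immintrin.h>
#include <stddef.h>

/* Array copy a(:) = c(:) with streaming (nontemporal) stores:
   _mm256_stream_pd writes around the cache, so the write-allocate
   read of a's cache lines is avoided. */
void copy_nt(double *restrict a, const double *restrict c, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m256d v = _mm256_load_pd(&c[i]);  /* load 4 doubles (aligned)   */
        _mm256_stream_pd(&a[i], v);         /* store 4 doubles, no alloc  */
    }
    _mm_sfence();  /* make the streaming stores visible before later loads */
}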

23 Memory management: Cache management details (see D. Fey) Inclusive: a cache line has a copy in all levels. Reduced effective size in the outer cache levels; cheap eviction for unmodified cache lines; higher latency, since cache lines have to be loaded through the hierarchy. All Intel processors. Exclusive: only one cache line copy in the cache hierarchy. Full aggregate effective cache size; eviction is expensive (copy back); lower latency, since data can be loaded directly into the L1 cache. All AMD processors. "Write back": a modified cache line is evicted to the next (lower) cache/memory level before it is overwritten by new data. "Write through": when a cache line is updated, the cache line copy in the next (lower) cache/memory level is updated as well.

24 Memory Hierarchies: Intel vs. AMD (current generations) Single core specs (Q1/2012; Intel Sandy Bridge: 8 cores, AMD Interlagos: 16 cores):

                         Intel Xeon Sandy Bridge   AMD Opteron Interlagos
  Peak perf.             21.6 GFlop/s              22.4 GFlop/s
  Clock freq.            2.7 GHz                   2.8 GHz
  # FP registers         16/32                     16/32
  L1D size               32 KB                     16 KB
  L1D BW                 ~130 GB/s                 ~90 GB/s
  L1D latency            4 cycles                  4 cycles
  L2 size                256 KB                    2 MB (2 cores)
  L2 BW                  ~90 GB/s                  ~90 GB/s
  L2 latency             12 cycles                 >20 cycles
  L3 size                20 MB (shared)            8 MB (8 cores)
  L3 BW                  ~300 GB/s                 ~40 GB/s
  L3 latency             ~30 cycles                48 (?) cycles
  Socket memory BW       ~36 GB/s (measured)       ~32 GB/s (measured)
  Socket memory latency  ~150 cycles               ~150 cycles
  L1 associativity       8-way                     4-way

25 Characterization of Memory Hierarchies Determine performance levels with a low-level benchmark: the vector triad.
DOUBLE PRECISION, dimension(SIZE) :: A,B,C,D
DOUBLE PRECISION :: S,E,MFLOPS
! Input: N .le. SIZE
DO i=1,N
  A(i) = 0.d0; B(i)=1.d0; C(i)=2.d0; D(i)=3.d0  ! initialize
ENDDO
call get_walltime(S)
DO ITER=1, NITER
  DO i=1, N
    A(i) = B(i) + C(i) * D(i)        ! 3 loads + 1 store; 2 FLOP
  ENDDO
  IF(A(2).lt.0) call dummy(A,B,C,D)  ! prevent loop interchange
ENDDO
call get_walltime(E)
MFLOPS = NITER * N * 2.d0 / ( (E-S) * 1.d6 )
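
get_walltime is not shown on the slide; a possible C implementation based on the POSIX clock_gettime call (the trailing underscore and the pass-by-reference argument follow a common, but compiler-dependent, Fortran-to-C calling convention):

#include <time.h>

/* Wall-clock time in seconds, callable from Fortran as get_walltime(S). */
void get_walltime_(double *t) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    *t = (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}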

26 Memory Hierarchies: Measure performance levels Vector triad single-core performance: A(1:N)=B(1:N)+C(1:N)*D(1:N). Can we explain the performance based on hardware features? [Plot: performance vs. loop length N, with distinct levels for the L1 cache, L2 cache, and L3 cache regimes.]

27 CAMA: Multicore processors There is no way back Modern multi-/manycore chips Basic compute architecture Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center Dietmar Fey, Department for Computer Science, University Erlangen-Nürnberg, Sommersemester 2013

28 Moore's law continues NVIDIA Fermi: ~3.0 billion transistors; Intel SNB EP: ~2.2 billion transistors. Gordon Moore (Intel Corp.), Electronics Magazine, April 1965: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase."

29 but the free lunch is over Moore's law used to run: smaller transistors, faster transistors, faster clock speed, higher throughput (instructions/s) "for free". [Plot: Intel x86 processor clock speed (frequency in MHz, log scale) over the years; exponential growth levels off after ~2004.] Single core: instruction-level parallelism (superscalarity), Single Instruction Multiple Data (SIMD): SSE / AVX. Investing the transistor budget instead: multi-core/threading, complex on-chip caches, new on-chip functionalities (GPU, PCIe, ...).

30 Power consumption the root of all evil (by courtesy of D. Vrsalovic, Intel) Power consumption: P ~ f * (V_core)^2, and since V_core scales roughly with f in the same process technology, P ~ f^3. Comparison at a fixed power envelope (max. W): an over-clocked (+20%) core (N transistors) delivers 1.13x performance at 1.73x power; a core at max frequency is the 1.00x baseline; a dual-core clocked 20% lower (2N transistors) delivers 1.73x performance at only 1.02x power.

31 Modern multi- and manycore chips Intel Sandy Bridge AMD Interlagos/Bulldozer NVIDIA GK110 / K20 Intel Xeon Phi Be prepared for more cores with less complexity and slower clock!

32 The x86 multicore evolution so far Intel single-/dual-/quad-/hexa-/octo-cores (single socket view): 2005: "fake" dual-core (two cores attached via the chipset). 2006: true dual-core (Woodcrest Core2 Duo, 65nm); later the quad-core Harpertown Core2 Quad, 45nm. 2008: simultaneous multithreading (SMT), two threads T0/T1 per core, with an on-chip memory interface (MI) and a link to the other socket (Nehalem EP Core i7, 45nm). 2010: 6-core chip (Westmere EP Core i7, 32nm). 2012: wider SIMD units, AVX: 256 bit (Sandy Bridge EP Core i7, 32nm).

33 There is no longer a single driving force for chip performance! Floating point (FP) performance: P = n_core * F * S * ν. Example: Intel Xeon E5 ("Sandy Bridge"; 4- and 6-core variants also available): n_core = 8 cores; F = 2 FP instructions per cycle (1 MULT and 1 ADD); S = 4 (dp) / 8 (sp) FP operations per instruction (256-bit SIMD registers, AVX); ν = 2.5 GHz clock speed. P = 160 GF/s (dp) / 320 GF/s (sp). But: P = 5.0 GF/s (dp) for serial, non-vectorized code.
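
The formula spelled out as a tiny C calculation with the slide's parameters:

#include <stdio.h>

/* Peak floating point performance: P = n_core * F * S * nu */
int main(void) {
    const double n_core = 8;    /* cores per chip                          */
    const double F      = 2;    /* FP instructions per cycle (MULT + ADD)  */
    const double S_dp   = 4;    /* FP operations per instruction (AVX, dp) */
    const double nu     = 2.5;  /* clock speed in GHz                      */
    printf("peak:   %.0f GFlop/s (dp)\n", n_core * F * S_dp * nu);  /* 160 */
    printf("serial, non-vectorized: %.1f GFlop/s (dp)\n", F * nu);  /* 5.0 */
    return 0;
}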

34 Complex socket topologies: AMD Interlagos / Bulldozer Up to 16 cores (8 modules) at 2.6 GHz. Each module: 2 lightweight cores; FPU: 4 MULT & 4 ADD per cycle (dp); 16 kB dedicated L1D cache; 2 MB shared L2 cache. 8 MB shared L3 cache per die. Peak performance (dp): 8 Flop/cycle * 2.6 GHz = 20.8 GFlop/s per module, i.e. 166.4 GFlop/s per socket. 2 x 2 DDR memory channels. ccNUMA: 2 NUMA domains per socket.

35 NVIDIA Kepler GK110 Block Diagram (GPGPU) Architecture: 7.1B transistors; 15 "big" cores (15 SMX units with 192 (sp) units each); 192 single precision ops per instruction block; > 1 TFLOP dp peak; 1.5 MB L2 cache; 3:1 sp:dp performance. NVIDIA Corp. Used with permission.

36 Intel Xeon Phi block diagram Architecture: 3B transistors; 60+ cores with a 512-bit SIMD unit each; 16 single precision ops per instruction; 1 TFLOP dp peak; 0.5 MB L2 cache per core; GDDR5 memory; 2:1 sp:dp performance; ring interconnect with 64 byte/cy.

37 Comparing accelerators Intel Xeon Phi: 60+ IA32 cores, each with a 512-bit SIMD FMA unit, i.e. 480/960 SIMD dp/sp tracks; clock speed: ~1000 MHz; transistor count: ~3 B (22nm); power consumption: ~250 W; peak performance (dp): ~1 TF/s; memory BW: ~250 GB/s (GDDR5); programming: Fortran/C/C++ + OpenMP + SIMD. NVIDIA Kepler K20: 15 SMX units, each with 192 cores, i.e. 960/2880 dp/sp cores; clock speed: ~700 MHz; transistor count: 7.1 B (28nm); power consumption: ~250 W; peak performance (dp): ~1.3 TF/s; memory BW: ~250 GB/s (GDDR5); threads to execute: 10,000+; programming: CUDA, OpenCL, (OpenACC).

38 Trading single thread performance for parallelism: GPGPUs vs. CPUs GPU vs. CPU "light speed" estimate: 1. compute bound: 2-10x; 2. memory bandwidth: 1-5x.

                       Intel Core i              Intel Xeon E5 DP node     NVIDIA K20x
                       ("Sandy Bridge")          ("Sandy Bridge")          ("Kepler")
  Cores@Clock          3.3 GHz                   2 x 2.7 GHz               0.7 GHz
  Performance+/core    52.8 GFlop/s              43.2 GFlop/s              1.4 GFlop/s
  Threads@STREAM       <4                        <16                       >8000
  Total performance+                             691 GFlop/s               4,000 GFlop/s
  Stream BW            18 GB/s                   2 x 40 GB/s               168 GB/s (ECC=1)
  Transistors / TDP    1 billion* / 95 W         2 x (2.27 billion/130 W)  7.1 billion/250 W

+ single precision; * includes on-chip GPU and PCI-Express; K20x figures refer to the complete compute device.

39 Basic compute node architecture From UMA to ccnuma

40 Single chip is not enough! Basic architecture of shared memory compute nodes: hardware/software layers (HT/QPI) establish a shared address space and ensure data coherency. Separate memory controllers scalable performance. Single shared address space ease of use. [Diagram: two sockets with locally attached memories holding a single array A(1: ).] Cache-coherent Non-Uniform Memory Architecture (ccNUMA): HT/QPI give scalable bandwidth at the price of ccNUMA: where does my data finally end up?

41 There is no longer a single flat memory: from UMA to ccNUMA (2-way nodes) Yesterday: a dual-socket Intel Core2 node: Uniform Memory Architecture (UMA), "flat" memory, symmetric MPs, but: system anisotropy. Shared address space within the node! Today: a dual-socket Intel (Westmere) node: cache-coherent Non-Uniform Memory Architecture (ccNUMA). HT/QPI provide scalable bandwidth at the expense of ccNUMA architectures: where does my data finally end up? On AMD it is even more complicated: ccNUMA within a chip!

42 Parallel computers Shared-Memory Architectures Basic Classification Shared memory computers provide a single shared address space (memory) for all processors: all processors share the same view of the address space! Two basic categories of shared memory systems: Uniform Memory Access (UMA): memory is equally accessible to all processors with the same performance (bandwidth & latency). Cache-coherent Non-Uniform Memory Access (ccNUMA): memory is physically distributed but appears as a single address space; performance (bandwidth & latency) differs for local and remote memory access. Copies of the same cache line may reside in different caches; cache coherence protocols guarantee consistency at all times (for UMA & ccNUMA). Cache coherence protocols do not alleviate parallel programming for shared-memory architectures!

43 Parallel computers: Shared-memory: UMA UMA architecture: a switch/bus arbitrates memory access; a special protocol ensures cross-CPU cache data consistency. "Flat" memory, also known as Symmetric Multi-Processor (SMP). [Diagram: CPU 1..4, each with its own cache, connected via a switch/bus to memory.]

44 Parallel shared memory computers: ccNUMA/node Layout ccNUMA: a single address space despite physically distributed memory, through proprietary hardware concepts (e.g. NUMALink in SGI systems; QPI for Intel; HT for AMD). Advantages: aggregate memory bandwidth is scalable; systems with more than 1024 cores are available (SGI). Disadvantages: cache coherence is hard to implement / expensive; performance depends on whether memory access is local or remote. Examples: all modern multi-socket compute nodes; SGI Altix/UV.

45 Basic challenges for shared memory architectures Cache coherence for UMA & ccNUMA: multiple copies of the same cache line may live in multiple caches; how to keep them coherent? ccNUMA: data locality. "Golden Rule" of ccNUMA: A memory page gets mapped into the local memory of the processor that first touches it!
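
A standard consequence of this rule, sketched with OpenMP (which the slides list as the multicore programming model; the function name is illustrative): initialize data with the same thread layout that later works on it, so each page lands in the memory local to the core that uses it.

/* First-touch placement: the parallel initialization maps each page of
   a[], b[], c[], d[] into the NUMA domain of the thread that will also
   execute the same index range in the compute loop below. */
void triad_numa_aware(double *a, double *b, double *c, double *d, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {      /* first touch decides placement */
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; d[i] = 3.0;
    }
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)        /* same static schedule: local access */
        a[i] = b[i] + c[i] * d[i];
}

A serial initialization loop would touch all pages from one thread, placing the whole data set in a single NUMA domain and serializing the memory bandwidth.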
