Master Informatics Eng.

1 Advanced Architectures, Master Informatics Eng., 2017/18, A.J.Proença. The Roofline Performance Model (most slides are borrowed). AJProença, Advanced Architectures, MiEI, UMinho, 2017/18

3 Goals of the Roofline Model

4 Performance Limiting Factors

5 Roofline Performance Model
- Basic idea: plot peak floating-point throughput as a function of arithmetic intensity. This ties together floating-point performance and memory performance for a target machine.
- Arithmetic intensity: floating-point operations per byte read.
SIMD Instruction Set Extensions for Multimedia. Copyright 2012, Elsevier Inc. All rights reserved.

7 ... but we could also consider SP or int

9 3Cs model for caches

10 Three Classes of Locality
- Temporal locality: reusing data (in either registers or cache lines) multiple times amortizes the impact of limited bandwidth. Transform loops or algorithms to maximize reuse.
- Spatial locality: data is transferred from cache to registers in words, but it is transferred to the cache in 64-128 byte lines. Using every word in a line maximizes spatial locality. Transform data structures into a structure-of-arrays (SoA) layout.
- Sequential locality: many memory address patterns access cache lines sequentially. CPUs' hardware stream prefetchers exploit this observation to speculatively load data and hide memory latency. Transform loops to generate (a few) long, unit-stride accesses.
LAWRENCE BERKELEY NATIONAL LABORATORY
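The spatial-locality advice can be checked with a little arithmetic. The sketch below is an approximate model, not a measurement: it assumes a 64-byte cache line, aligned elements, and a hypothetical record of four 8-byte doubles, and it estimates what fraction of each fetched line a loop uses when it reads only one field. The function name and parameters are illustrative.

```python
def line_utilization(useful_bytes_per_element, stride_bytes, line_bytes=64):
    """Approximate fraction of each fetched cache line that is actually used.

    useful_bytes_per_element: bytes read from each element (one double = 8)
    stride_bytes: distance in memory between consecutive elements
    line_bytes: cache line size (64 is an assumed, typical value)
    """
    if stride_bytes >= line_bytes:
        # At most one element lands in each line.
        return useful_bytes_per_element / line_bytes
    # Several elements share a line; the useful share equals useful/stride.
    return useful_bytes_per_element / stride_bytes


FIELD = 8                # one double
AOS_STRIDE = 4 * FIELD   # hypothetical array of structs {x, y, z, w}
SOA_STRIDE = FIELD       # structure of arrays: x values packed contiguously

print("AoS utilization:", line_utilization(FIELD, AOS_STRIDE))
print("SoA utilization:", line_utilization(FIELD, SOA_STRIDE))
```

Under these assumptions the array-of-structs sweep uses a quarter of every line it pulls in, while the SoA sweep uses all of it.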

11 Preliminary notes in the Roofline Model
- Goal: integrate in-core performance, memory bandwidth, and locality into a single, readily understandable performance figure.
- Graphically show the penalty associated with not including certain software optimizations.
- The Roofline model will be unique to each architecture.

12 Key elements in the Roofline Model
- x-axis: the operational intensity, i.e. operations per byte of RAM traffic, in Flops/byte (traffic between caches and memory).
- y-axis: the attainable floating-point performance, in GFlops/sec; it includes both peak processor and peak memory performance.
- Peak processor FP performance: a horizontal line computed from the processor specs.
- Peak memory performance: bounds the max FP performance of the memory system for a given operational intensity.
- For each kernel: its performance is a point on the vertical line that crosses the x-axis at the kernel's operational intensity.
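The two roofs described above combine into a single bound: attainable performance is the smaller of the compute roof and the bandwidth roof at the kernel's operational intensity. A minimal sketch, with a made-up machine (100 GFlops/sec peak, 25 GB/s memory bandwidth) and illustrative function names:

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, intensity):
    """Roofline bound: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_gflops, peak_bw_gbs * intensity)


def ridge_point(peak_gflops, peak_bw_gbs):
    """Operational intensity at which the two roofs meet (Flops/byte)."""
    return peak_gflops / peak_bw_gbs


# Hypothetical machine, chosen only to make the numbers easy to follow.
PEAK_GFLOPS = 100.0
PEAK_BW_GBS = 25.0

for ai in (0.5, 2.0, 4.0, 16.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, ai)
    print(f"intensity {ai:5.1f} Flops/byte -> at most {bound:6.1f} GFlops/sec")
```

Kernels to the left of the ridge point are bounded by memory bandwidth; kernels to the right are bounded by the processor's peak.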

13 Arithmetic Intensity
[Figure: arithmetic intensity spectrum running from O(1) through O(log(N)) to O(N), placing kernels in order of increasing intensity: SpMV and BLAS1,2; Stencils (PDEs); Lattice Methods; FFTs; Dense Linear Algebra (BLAS3); Particle Methods]
- True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality).
- Others have constant intensity.
- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.
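To illustrate the difference between constant and size-dependent intensity, the sketch below estimates AI for two textbook kernels under simplifying assumptions: double-precision data, compulsory traffic only (each operand moved to or from DRAM exactly once), and no write-allocate traffic. The function names are illustrative.

```python
def dot_product_ai(n, word_bytes=8):
    """AI of a length-n dot product: constant, independent of n."""
    flops = 2 * n                     # one multiply and one add per element pair
    dram_bytes = 2 * n * word_bytes   # read x[i] and y[i] once each
    return flops / dram_bytes


def matmul_ai(n, word_bytes=8):
    """AI of an n x n dense matrix multiply: grows linearly with n."""
    flops = 2 * n ** 3                    # n multiply-adds per output element
    dram_bytes = 3 * n * n * word_bytes   # read A and B, write C, once each
    return flops / dram_bytes


for n in (120, 1200):
    print(n, dot_product_ai(n), matmul_ai(n))
```

The dot product stays at 0.125 Flops/byte whatever the size, while the matrix multiply gains intensity as n grows, provided blocking keeps the extra capacity-miss traffic away.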

14 NUMA
- Recent multicore SMPs have integrated the memory controllers on chip.
- As a result, memory access is non-uniform (NUMA).
- That is, the bandwidth to read a given address varies dramatically among cores.
- Exploit NUMA (affinity + first touch) when you malloc/init data.
- The concept is similar to data decomposition for distributed memory.
[Figure: two dual-socket block diagrams. One shows core pairs sharing 4MB L2 caches, connected over FSBs (10.66 GB/s each) to an MCH with 4x64b controllers, which reaches 8 x 667MHz FBDIMMs over 21.33 GB/s and 10.66 GB/s links. The other shows two Opteron sockets, each with four cores (512K L2 per core), a 2MB victim cache, an SRI / xbar and 2x64b controllers at 10.66 GB/s to 667MHz DDR2 DIMMs, with the sockets linked by HyperTransport at 4GB/s in each direction]

15 Additional notes
- Memory bandwidth numbers are collected via microbenchmarks (or the STREAM benchmark).
- Computation numbers are derived from optimization manuals (pencil and paper).
- Assume complete overlap of either communication or computation.

17 Example
- Consider the Opteron 2356: dual socket (NUMA), limited HW stream prefetchers, quad-core (8 cores total), 2.3GHz, 2-way SIMD (DP), separate FPMUL and FPADD datapaths, 4-cycle FP latency.
[Figure: two Opteron sockets, each with four cores (512K L2 per core), a 2MB victim cache, an SRI / xbar and 2x64b controllers at 10.66GB/s to 667MHz DDR2 DIMMs, linked by HyperTransport at 4GB/s in each direction]
- Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like?
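The peak DP figure for this machine follows from the listed specs by pencil and paper. A minimal sketch, assuming each core can issue one SIMD multiply and one SIMD add per cycle (this is how the separate FPMUL and FPADD datapaths are read here); the function and parameter names are illustrative:

```python
def peak_dp_gflops(sockets, cores_per_socket, clock_ghz, simd_lanes, fp_pipes):
    """Peak double-precision GFLOP/s: every core, lane and FP pipe busy each cycle."""
    return sockets * cores_per_socket * clock_ghz * simd_lanes * fp_pipes


opteron_2356_peak = peak_dp_gflops(
    sockets=2,           # dual socket
    cores_per_socket=4,  # quad-core, 8 cores total
    clock_ghz=2.3,
    simd_lanes=2,        # 2-way SIMD in double precision
    fp_pipes=2,          # separate FPMUL and FPADD datapaths (assumed 1 op/cycle each)
)
print(f"peak DP: {opteron_2356_peak:.1f} GFLOP/s")
```

This horizontal line is the top of the roofline; the memory side still needs a measured bandwidth such as a STREAM result.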

18 Roofline Model: Basic Concept
[Figure: roofline for the Opteron 2356 (Barcelona), attainable GFLOP/s versus actual FLOP:Byte ratio, with the peak DP roof]
- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated.
- We will bound performance as we eliminate specific forms of in-core parallelism.

19 Roofline Model: computational ceilings
[Figure: same plot, adding the mul / add imbalance ceiling below peak DP]
- Opterons have dedicated multipliers and adders.
- If the code is dominated by adds, then attainable performance is half of peak.
- We call these ceilings.
- They act like constraints on performance.

20 Roofline Model: computational ceilings
[Figure: same plot, adding the w/out SIMD ceiling]
- Opterons have 128-bit datapaths.
- If instructions aren't SIMDized, attainable performance will be halved.

21 Roofline Model: computational ceilings
[Figure: same plot, adding the w/out ILP ceiling]
- On Opterons, floating-point instructions have a 4-cycle latency.
- If we don't express 4-way ILP, performance will drop by as much as 4x.
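The computational ceilings stack: each missing form of in-core parallelism divides the bound reached so far. The sketch below applies the factors stated on these slides (2x for mul/add imbalance, 2x without SIMD, up to 4x without ILP) to an assumed peak of 73.6 DP GFLOP/s, obtained as 2 sockets x 4 cores x 2.3 GHz x 2 SIMD lanes x 2 FP pipes. The function name is illustrative.

```python
def computational_ceilings(peak_gflops):
    """Return (label, GFLOP/s) pairs, from the peak roof down to the lowest ceiling."""
    losses = [
        ("mul / add imbalance", 2),  # only one of the two FP pipes is kept busy
        ("w/out SIMD", 2),           # scalar instead of 2-way DP SIMD
        ("w/out ILP", 4),            # 4-cycle FP latency left exposed
    ]
    ceilings = [("peak DP", peak_gflops)]
    current = peak_gflops
    for label, factor in losses:
        current = current / factor
        ceilings.append((label, current))
    return ceilings


for label, gflops in computational_ceilings(73.6):
    print(f"{label:22s} {gflops:6.1f} GFLOP/s")
```

A kernel that expresses none of these forms of parallelism is capped at one sixteenth of peak, however high its arithmetic intensity.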

22 Roofline Model: communication ceilings
[Figure: roofline for the Opteron 2356 (Barcelona), attainable GFLOP/s versus actual FLOP:Byte ratio, with the peak DP roof]
- We can perform a similar exercise, taking away parallelism from the memory subsystem.

23 Roofline Model: communication ceilings
- Explicit software prefetch instructions are required to achieve peak bandwidth.

24 Roofline Model: communication ceilings
- Opterons are NUMA.
- As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.

25 Roofline Model: computation + communication ceilings
[Figure: same plot with the computational ceilings (peak DP, mul / add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings together]
- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

27 Roofline Model: locality walls
[Figure: roofline for the Opteron 2356 (Barcelona), attainable GFLOP/s versus actual FLOP:Byte ratio, with the ceilings peak DP, mul / add imbalance, w/out SIMD and w/out ILP, and a vertical wall labelled "only compulsory miss traffic"]
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.
AI = FLOPs / Compulsory Misses

28 Roofline Model: locality walls
[Figure: same plot, adding a wall labelled "+write allocation traffic"]
AI = FLOPs / (Allocations + Compulsory Misses)

29 Roofline Model: locality walls
[Figure: same plot, adding a wall labelled "+capacity miss traffic"]
AI = FLOPs / (Capacity + Allocations + Compulsory)

30 Roofline Model: locality walls
[Figure: same plot, adding a wall labelled "+conflict miss traffic"]
AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)
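Each locality wall sits at the arithmetic intensity obtained after one more category of memory traffic is added to the denominator. A minimal sketch with invented traffic volumes for a hypothetical kernel (the numbers are placeholders, not measurements from the slides):

```python
from itertools import accumulate


def locality_walls(flops, traffic):
    """Return (label, AI) pairs, adding one traffic category at a time.

    traffic: ordered list of (label, bytes) pairs, compulsory traffic first.
    """
    running_totals = accumulate(nbytes for _, nbytes in traffic)
    return [(label, flops / total)
            for (label, _), total in zip(traffic, running_totals)]


# Hypothetical kernel: 800 MFLOPs of work and the following DRAM traffic in MB.
walls = locality_walls(
    800,
    [
        ("only compulsory miss traffic", 400),
        ("+write allocation traffic", 200),
        ("+capacity miss traffic", 100),
        ("+conflict miss traffic", 100),
    ],
)
for label, ai in walls:
    print(f"{label:30s} AI = {ai:.2f} FLOPs/byte")
```

In this made-up case the kernel's intensity falls from 2.0 to 1.0 FLOPs/byte, so in the bandwidth-bound region its attainable performance halves.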

31 Roofline Model: locality walls
[Figure: roofline for the Opteron 2356 (Barcelona) with all ceilings and locality walls]
- Optimizations remove these walls and ceilings, which act to constrain performance.

32 Optimization Categorization
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.) and reach a good (enough) floating-point balance. Techniques: reorder, unroll & jam, eliminate branches, explicit SIMD.
- Maximizing memory bandwidth: exploit NUMA, hide memory latency, satisfy Little's Law. Techniques: memory affinity, DMA lists, unit-stride streams, SW prefetch, TLB blocking.
- Minimizing memory traffic: eliminate capacity misses, conflict misses, compulsory misses and write-allocate behavior. Techniques: cache blocking, array padding, compress data, streaming stores.
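"Satisfy Little's Law" means keeping enough memory requests in flight to cover latency x bandwidth. A rough sketch, with an invented 100 ns latency, 20 GB/s bandwidth and 64-byte lines (all three values are placeholders, not figures from the slides):

```python
import math


def bytes_in_flight(latency_ns, bandwidth_gbs):
    """Little's Law: concurrency = latency * bandwidth (ns * GB/s gives bytes)."""
    return latency_ns * bandwidth_gbs


def lines_in_flight(latency_ns, bandwidth_gbs, line_bytes=64):
    """Number of outstanding cache-line requests needed to sustain the bandwidth."""
    return math.ceil(bytes_in_flight(latency_ns, bandwidth_gbs) / line_bytes)


print(bytes_in_flight(100, 20), "bytes must be in flight")
print(lines_in_flight(100, 20), "cache lines must be outstanding")
```

Prefetching (hardware or software) and long unit-stride streams are ways of generating that many independent requests.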

33 No overlap of communication and computation
- Previously, we assumed perfect overlap of communication and computation.
- What happens if there is a dependency (either inherent or caused by a lack of optimization) that serializes communication and computation?
[Figure: timelines built from two bars, Bytes / STREAM Bandwidth and Flops / Flop/s, plotted against time]
- Time is the sum of communication time and computation time.
- The result is that flop/s grows asymptotically.

34 No overlap of communication and computation
[Figure: roofline for a generic machine, attainable GFLOP/s versus actual flop:byte ratio, with Stream Bandwidth and w/out NUMA bandwidth lines and peak DP and w/out ILP ceilings]
- Consider a generic machine.
- If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular.
- However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.
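The factor of 2 can be reproduced by comparing the two timing models: with overlap, time is the larger of the communication and computation times; without it, time is their sum. A minimal sketch on a made-up machine (100 GFLOP/s peak, 25 GB/s bandwidth), with illustrative function names:

```python
def overlapped_gflops(peak_gflops, bw_gbs, ai):
    """Perfect overlap: time = max(comm, comp), giving the sharp roofline."""
    return min(peak_gflops, bw_gbs * ai)


def serialized_gflops(peak_gflops, bw_gbs, ai):
    """No overlap: time = comm + comp, giving the smoothed roofline."""
    # Per byte moved: 1/bw seconds of communication plus ai/peak seconds of compute.
    time_per_byte = 1.0 / bw_gbs + ai / peak_gflops
    return ai / time_per_byte


PEAK, BW = 100.0, 25.0
for ai in (0.5, 4.0, 32.0, 1024.0):
    sharp = overlapped_gflops(PEAK, BW, ai)
    smooth = serialized_gflops(PEAK, BW, ai)
    print(f"AI {ai:7.1f}: overlap {sharp:6.1f}, no overlap {smooth:6.1f}, "
          f"ratio {sharp / smooth:.2f}")
```

The gap is widest at the ridge point (AI = 4 on this machine), where the two times are equal and the serialized version reaches only half the bound; far from the ridge the smoothed curve approaches the sharp one.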

35 Alternate Bandwidths
- Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark).
- STREAM is NOT a good proxy for short-stanza or random cache-line access patterns, as memory latency (instead of just bandwidth) is being exposed.
- Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling).
- Similarly, if data is primarily local in the LLC, one should construct rooflines based on LLC bandwidth and flop:LLC byte ratios.
- For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe byte ratio.
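One way to combine these alternate rooflines is to evaluate one bandwidth bound per data path, each with its own flop:byte ratio, and take the tightest. The sketch below does that for an invented accelerator-style setup; every bandwidth and ratio is a placeholder and the function name is illustrative.

```python
def tightest_bound(peak_gflops, levels):
    """Return (limiter, GFLOP/s) across compute and several data paths.

    levels: dict mapping a path name to (bandwidth in GB/s, flops per byte
    moved over that path).
    """
    bounds = {name: bw * ratio for name, (bw, ratio) in levels.items()}
    bounds["compute"] = peak_gflops
    limiter = min(bounds, key=bounds.get)
    return limiter, bounds[limiter]


# Hypothetical kernel on a hypothetical machine.
levels = {
    "DRAM": (25, 8),   # 25 GB/s, 8 flops per DRAM byte
    "LLC": (100, 2),   # 100 GB/s, 2 flops per LLC byte
    "PCIe": (8, 4),    # 8 GB/s, 4 flops per PCIe byte
}
print(tightest_bound(100.0, levels))
```

With these numbers the PCIe path is the limiter at 32 GFLOP/s even though both the DRAM and LLC rooflines would allow the kernel to reach peak.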

38 Some more examples

39 Some more examples

40 Some more examples

41 Alternate Computations
- Arising from HPC kernels, it's no surprise that rooflines use DP Flop/s.
- Of course, a roofline could use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc.

The Roofline Model: A pedagogical tool for program analysis and optimization. Parallel Computing Laboratory, EECS. Samuel Williams, David Patterson, Leonid Oliker, John Shalf, Katherine