The ECM (Execution-Cache-Memory) Performance Model

1 The ECM (Execution-Cache-Memory) Performance Model
- J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop "Memory issues on Multi- and Manycore Platforms" at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, Lecture Notes in Computer Science Volume 6067, 2010, pp DOI: / _64.
- G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: /cpe.3180 (2013). Preprint: arxiv:
- H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint: arxiv:

2 Assumptions and shortcomings of the roofline model
- Assumes one of two bottlenecks:
  1. In-core execution
  2. Bandwidth of a single hierarchy level
- Latency effects are not modeled: pure data streaming is assumed
- In-core execution is sometimes hard to model
- Saturation effects in multicore chips are not explained; the ECM model gives more insight
[Figure: A(:)=B(:)+C(:)*D(:), where Roofline predicts full socket bandwidth]

3 The Execution-Cache-Memory (ECM) model

4 ECM Model
ECM = Execution-Cache-Memory
Observations:
- Single-core execution time is not the maximum of
  1. In-core execution
  2. Data transfers through a single bottleneck
- Data transfers may or may not overlap with each other or with in-core execution
- Scaling is linear until the relevant bottleneck is reached
ECM model input: same as for Roofline, plus the data transfer times in the hierarchy

5 Example: Schönauer Vector Triad in L2 cache
REPEAT[ A(:) = B(:) + C(:) * D(:) ] in double precision
Analysis for a Sandy Bridge core w/ AVX (unit of work: 1 cache line)
Machine characteristics:
- Registers-L1: 1 LD/cy plus half a store per cycle (ST/2)
- L1-L2: 32 B/cy (2 cy/CL)
- Arithmetic: 1 ADD/cy + 1 MULT/cy
Triad analysis (per CL):
- Registers-L1: 6 cy/CL (6 LD, 4 ST/2)
- L1-L2: 10 cy/CL (LD, LD, LD, WA, ST)
- Arithmetic (AVX): 2 cy/CL (2 ADD, 2 MULT), 16 F/CL
Roofline prediction: 16/10 F/cy
Measurement: 16 F / 17 cy

6 Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge core with AVX
[Figure: cache line (CL) transfers through the memory hierarchy, including the write-allocate CL transfer]

7 Testing different overlap hypotheses
Results suggest no overlap!
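The two hypotheses can be checked against the numbers of the L2 triad example (2 cy arithmetic, 6 cy L1 loads/stores, 10 cy L1-L2 transfers, 17 cy measured per cache line). The C sketch below is illustrative only; the function names are hypothetical and not part of the original slides.

```c
#include <assert.h>

/* Full-overlap hypothesis: the slowest contribution alone sets the runtime. */
double t_full_overlap(const double *t, int n)
{
    assert(n > 0);
    double tmax = t[0];
    for (int i = 1; i < n; ++i)
        if (t[i] > tmax)
            tmax = t[i];
    return tmax;
}

/* No-overlap hypothesis for the data path: the transfer times add up,
   and only the arithmetic is hidden behind that sum. */
double t_no_overlap(double t_arith, const double *t_data, int n)
{
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += t_data[i];
    return sum > t_arith ? sum : t_arith;
}
```

With the contributions {2, 6, 10} cy, full overlap yields 10 cy per cache line, whereas the no-overlap variant with data contributions {6, 10} cy yields 16 cy, which is much closer to the measured 17 cy.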

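8 Multicore scaling in the ECM model
Identify relevant bandwidth bottlenecks:
- L3 cache
- Memory interface
Scale single-thread performance until the first bottleneck is hit. For n threads:
$P(n) = \min(n P_0, I \cdot b_S)$
where $P_0$ is the single-thread performance, $I$ the computational intensity, and $b_S$ the saturated bandwidth of the bottleneck.
Example: Scalable L3 on Sandy Bridge...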
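The scaling rule can be written down in a few lines of C. This is a minimal sketch; the function name and the numbers in the usage note are hypothetical and chosen only to make the arithmetic easy to follow.

```c
#include <assert.h>

/* ECM multicore scaling: performance grows linearly with the thread
   count until the bandwidth bottleneck I * b_S is reached.
   p0:        single-thread performance (work/s)
   intensity: work per byte (I)
   b_s:       saturated bandwidth of the bottleneck (bytes/s) */
double ecm_scaled_performance(int n, double p0, double intensity, double b_s)
{
    assert(n >= 1);
    double linear = n * p0;
    double roof = intensity * b_s;
    return linear < roof ? linear : roof;
}
```

For an assumed kernel with P_0 = 2 GF/s, I = 0.125 F/B, and b_S = 40 GB/s, two threads deliver 4 GF/s, and from three threads on the performance stays at the bandwidth limit of 5 GF/s.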

9 ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:) on a Sandy Bridge socket (no-overlap assumption)
- Model: scales until saturation sets in
- Saturation point (# cores) is well predicted
- Measurement: scaling is not perfect
- Caveat: this is specific to this architecture and this benchmark!
- Check: use an overlappable kernel code

10 ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:) on a Sandy Bridge socket (full overlap assumption)
- In-core execution is dominated by the divide operation (44 cycles with AVX, 22 scalar)
- Almost perfect agreement with the ECM model
General observation:
- If the L1 cache is 100% occupied by LD, there is no overlap throughout the hierarchy
- If there is slack at the L1, there is overlap in the hierarchy

11 Example 1: A 2D Jacobi stencil in DP with SSE2 on Sandy Bridge

12 Example 1: 2D Jacobi in DP with SSE2 on SNB
4-way unrolling: 8 LUP / iteration
Instruction count:
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL

13 Example 1: 2D Jacobi in DP with SSE2 on SNB
Processor characteristics (SSE instructions per cycle):
- 2 LOAD or (1 LOAD + 1 STORE)
- 1 ADD
- 1 MUL
Code characteristics (SSE instructions per iteration):
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
Core execution: 12 cy (set by the 12 ADDs at 1 ADD/cy)
[Timeline figure: LD, 2LD, and ST slots alongside the ADD and MUL pipelines]

14 Example 1: 2D Jacobi in DP with SSE2 on SNB
Situation 1: Data set fits into L1 cache
- ECM prediction: (8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s
- Measurement: 2.2 GLUP/s
Situation 2: Data set fits into L2 cache (not into L1)
- 3 additional transfer streams from L2 to L1 (data delay; labeled t1, RFO, and t0 in the figure): 6 cy
- Prediction: (8 LUP / (12+6) cy) * 3.5 GHz = 1.6 GLUP/s
- Measurement: 1.9 GLUP/s
- Overlap?

15 Example 1: 2D Jacobi in DP with SSE2 on SNB
- Core execution: 12 cycles
- LOAD bottleneck: 8.5 cy
- L2 delay: 6 cycles (t1, RFO, t0)
- L1 is single ported: no overlap during LD/ST
- ECM prediction w/ overlap: (8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s
- Measurement: 1.9 GLUP/s
If the model fails, we learn something.
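The three 2D Jacobi predictions can be reproduced with a short C sketch. The helper names are hypothetical. The rule that LOADs and STOREs together retire at two per cycle is an assumption inferred from the 8.5 cy LOAD bottleneck quoted on the slide.

```c
#include <assert.h>

/* Convert cycles per loop iteration into GLUP/s.
   lups: lattice-site updates per iteration, f_ghz: clock in GHz. */
double glups(double lups, double cycles, double f_ghz)
{
    assert(cycles > 0.0);
    return lups / cycles * f_ghz;
}

/* L1 load/store port time, ASSUMING that LOADs and STOREs together
   retire at two instructions per cycle (13 LOAD + 4 STORE -> 8.5 cy). */
double ldst_cycles(int loads, int stores)
{
    return (loads + stores) / 2.0;
}
```

Here `glups(8, 12, 3.5)` gives about 2.33 GLUP/s for data in L1, `glups(8, 12 + 6, 3.5)` gives about 1.56 GLUP/s for data in L2 without any overlap, and `glups(8, ldst_cycles(13, 4) + 6, 3.5)` gives about 1.93 GLUP/s when only the LD/ST time is non-overlapping.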

16 ECM model: the rules
1. LOADs in the L1 cache do not overlap with any other data transfer in the memory hierarchy
2. Everything else in the core overlaps perfectly with data transfers
3. The scaling limit is set by the ratio (# cycles per CL overall) / (# cycles per CL at the bottleneck)
4. The Roofline Model is recovered when assuming full overlap of all contributions
Example (time in cy):
- Core: MULT 4 cy, ADD 8 cy, STORE 3 cy, LOAD 6 cy
- Data transfers: L2-L1 9 cy, L3-L2 9 cy, MEM-L3 19 cy
- Single-core (data in L1): 8 cy (ADD)
- Single-core (data in memory): 6 + 9 + 9 + 19 cy = 43 cy
- Scaling limit: 43 / 19 = 2.3 cores
(c) RRZE 2014
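Rules 1 to 3 translate into a few lines of C. This is a sketch with hypothetical function names. Rounding the scaling limit up to whole cores is an assumption of this example, not something stated on the slide.

```c
#include <assert.h>

/* Rules 1 and 2: the non-overlapping L1 LOAD time and all inter-level
   transfers add up; everything else in the core overlaps with that sum. */
double ecm_cycles(double t_overlap, double t_load,
                  const double *t_transfer, int n)
{
    assert(n >= 0);
    double t_data = t_load;
    for (int i = 0; i < n; ++i)
        t_data += t_transfer[i];
    return t_data > t_overlap ? t_data : t_overlap;
}

/* Rule 3: number of cores needed to saturate the bottleneck,
   i.e. t_total / t_bottleneck rounded up (rounding is an assumption). */
int ecm_saturation_cores(double t_total, double t_bottleneck)
{
    assert(t_total > 0.0 && t_bottleneck > 0.0);
    int n = 1;
    while (n * t_bottleneck < t_total)
        ++n;
    return n;
}
```

With the example numbers, `ecm_cycles(8, 6, t, 0)` returns 8 cy for data in L1 and `ecm_cycles(8, 6, t, 3)` with `t = {9, 9, 19}` returns 43 cy for data in memory. `ecm_saturation_cores(43, 19)` returns 3, the first whole core count above the scaling limit of 2.3.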

17 ECM model: notation
- Core time = overlapping and non-overlapping contributions
- ECM prediction = maximum of the overlapping time and the sum of all other contributions
The slide defines a convenient shorthand notation for:
- the contributions (with the example from the previous slide)
- the predictions for data in different memory hierarchy levels
- experimental (measured) data
- the saturation assumption for the memory bottleneck
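As a sketch of what this shorthand looks like in the published ECM papers by the same authors, here it is filled in with the numbers of the previous slide (8 cy overlapping, 6 cy L1 loads, then 9, 9, and 19 cy for the transfers). The exact typography is an assumption; the slide's own rendering is not reproduced here.

```latex
% Contributions: overlapping || non-overlapping | L1-L2 | L2-L3 | L3-Mem
\{\, T_\mathrm{OL} \,\|\, T_\mathrm{nOL} \,|\, T_\mathrm{L1L2} \,|\, T_\mathrm{L2L3} \,|\, T_\mathrm{L3Mem} \,\}
  = \{\, 8 \,\|\, 6 \,|\, 9 \,|\, 9 \,|\, 19 \,\}\ \mathrm{cy}

% Prediction for data in memory
T_\mathrm{ECM}^\mathrm{Mem}
  = \max\bigl(T_\mathrm{OL},\; T_\mathrm{nOL} + T_\mathrm{L1L2} + T_\mathrm{L2L3} + T_\mathrm{L3Mem}\bigr)
  = \max(8,\; 6 + 9 + 9 + 19)\ \mathrm{cy} = 43\ \mathrm{cy}

% Predictions for data in L1, L2, L3, and memory
\{\, 8 \,\rceil\, 15 \,\rceil\, 24 \,\rceil\, 43 \,\}\ \mathrm{cy}
```

Each entry of the last line applies the same maximum with the transfer terms truncated at the respective level: max(8, 6) = 8, max(8, 6+9) = 15, max(8, 6+9+9) = 24, and 43 cy.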

18 ECM Model for DAXPY (AVX) on SNB 2.7 GHz (phinally)
[Slide content: loop, contributions, predictions]

19 ECM Model and measurements for array sum on SNB 2.7 GHz (phinally)
Naive = scalar, no unrolling (full 3 cy penalty per ADD)

20 ECM Model and measurements for 2D Jacobi (AVX) on SNB 2.7 GHz (phinally)
LC = layer condition satisfied in

21 Jacobi 2D: impact of inner loop blocking on SNB (phinally)
[Figure: measurements compared with the ECM prediction]

22 Jacobi 2D: Why outer loop blocking?
Extra data is prefetched from memory at block boundaries.

23 Kahan dot product

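24 Kahan dot product
Goal: Compute large sums (many operands) with controlled numerical error

    __attribute__((optimize("no-tree-vectorize")))
    void ddot_kahan_scalar_comp(int N, const double* a, const double* b, double* r)
    {
        int i;
        double sum = 0.0;
        double c = 0.0;               /* running compensation for lost low-order bits */
        for (i = 0; i < N; ++i) {
            double prod = a[i] * b[i];
            double y = prod - c;      /* apply the correction from the previous step */
            double t = sum + y;       /* low-order digits of y are lost here */
            c = (t - sum) - y;        /* recover what was lost; evaluate as written */
            sum = t;
        }
        (*r) = sum;
    }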

25 Example (from Wikipedia)
6-digit FP arithmetic: two values are added to an initial sum, one per step.
First step:
- y = input[i] - c
- t = sum + y: many digits have been lost!
- c = (t - sum) - y: this must be evaluated as written! (t - sum) recovers the assimilated part of y, which is compared with the full y.
- sum = t: inaccurate result
On the next step, c gives the error:
- y = input[i] - c: the shortfall from the previous stage is included. It is of a size similar to y, so most digits meet.
- t = sum + y: but few meet the digits of sum, and the result is rounded.
- c = (t - sum) - y: this extracts whatever went in. In this case, too much. The excess would be subtracted off next time.
- sum = t: this is the exact result, correctly rounded to 6 digits.
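The same effect can be observed in binary double precision. The sketch below uses hypothetical helper names and applies the recurrence of `ddot_kahan_scalar_comp` to a plain sum, so that it can be compared with naive accumulation. It assumes IEEE 754 doubles and a build without value-unsafe optimizations such as `-ffast-math`, which may remove the compensation term.

```c
#include <assert.h>

/* Plain summation: rounding errors accumulate with every addition. */
double sum_naive(const double *x, int n)
{
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Kahan (compensated) summation: c carries the rounding error of the
   previous addition into the next one. */
double sum_kahan(const double *x, int n)
{
    assert(n >= 0);
    double sum = 0.0;
    double c = 0.0;
    for (int i = 0; i < n; ++i) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   /* must be evaluated as written */
        sum = t;
    }
    return sum;
}
```

Summing one million copies of 0.1, the compensated result stays within 1e-9 of 100000, while the naive sum is expected to drift noticeably further away.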

26 ECM Model and measurements on Emmy (IVB 2.2 GHz, 3 cy/cl from memory)
- Standard DP ddot: scalar and AVX
- Kahan ddot: scalar and AVX
Conclusion:
- DP Kahan ddot saturates even in scalar mode
- SP Kahan will not saturate

27 Performance Modeling of Stencil Codes
Applying the ECM model to stencil updates:
- 3D Jacobi smoother (DP, AVX)
- Long-range stencil (SP, AVX)
(H. Stengel, RRZE)

28 Example 2: A 3D Jacobi smoother with AVX vectorization on an Intel Ivy Bridge processor

29 Jacobi 3D Manual Analysis

Operation   Count (1 LUP)   Cycle count (4x unroll + AVX = 16 LUP)
MUL         1               4
ADD         5               20
LOAD        6               24
STORE       1               8
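The cycle counts in the table follow from the per-update operation counts under a few throughput assumptions that are consistent with the numbers but not stated on the slide: one AVX instruction covers 4 DP updates, MUL, ADD, and LOAD each issue one AVX instruction per cycle, and an AVX STORE takes 2 cycles. A hypothetical helper makes the arithmetic explicit:

```c
#include <assert.h>

/* Cycles a port needs for `ops_per_lup` operations on `lups` updates
   when one SIMD instruction covers `simd_width` updates and the port
   sustains one instruction every `cy_per_instr` cycles. */
double port_cycles(int ops_per_lup, int lups, int simd_width,
                   double cy_per_instr)
{
    assert(simd_width > 0 && lups % simd_width == 0);
    int instructions = ops_per_lup * (lups / simd_width);
    return instructions * cy_per_instr;
}
```

For 16 LUP this gives `port_cycles(1, 16, 4, 1.0)` = 4 cy (MUL), `port_cycles(5, 16, 4, 1.0)` = 20 cy (ADD), `port_cycles(6, 16, 4, 1.0)` = 24 cy (LOAD), and `port_cycles(1, 16, 4, 2.0)` = 8 cy (STORE).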

30 Interlude: Intel Architecture Code Analyzer (IACA)
- Performs architecture-specific code analysis
- Prerequisite: mark start and end of the dominant work loop
  - In high-level code (documented)
  - In assembly code (see iacamarks.h)
- Does not influence code optimization (e.g. vectorization)
- Assembly loop might perform multiple updates per iteration (unrolling, SIMD)
Important reports (throughput mode):
- Block throughput: runtime of one loop iteration (core time)
- Throughput bottleneck: limiting resource for code execution
- Port pressure: dominant pipeline port

31 16 updates (4x unroll + AVX) = 2 cache lines per loop iteration
[Code listing with #pragma vector aligned]

32 Jacobi 3D ECM
Intel(R) Xeon(R) CPU E GHz, memory bandwidth 47 GB/s, code with #pragma vector aligned
Times [cy] for 8 LUP (DP) = 1 CL update = 0.5 loop iterations (ASM) = 0.5 * IACA output
In-core contributions:
- Non-LD/ST time: ADD 10 cy, MUL 2 cy, Reg-Reg 6 cy
- Stores: 4 cy
- FrontEnd stalls: 0.5 * (24.1 - 24) = 0.05 cy
- IACA throughput: 24.1 cy / 16 LUP
Data transfers:
- L1-REG (LD): 12 cy
- L2-L1: 10 cy
- L3-L2: 10 cy
- M-L3: 12 cy
- Sum: 44 cy
Single-core performance: 3.0 GHz / (44 cy / 8 LUP) = 545 MLUP/s
Measurement (N=400): 542 MLUP/s (~44 cy)

33 Socket Scaling
Intel(R) Xeon(R) CPU E GHz, memory bandwidth 47 GB/s

34 Example 3: 3D long-range stencil in single precision with AVX on Sandy Bridge

35 Example 3: 3D long-range stencil in SP with AVX on SNB
Core execution: 4 neighbors per direction
Operations per update (code):
- 27 LOAD (25 V, 1 ROC, 1 U)
- 1 STORE (U)
- 26 ADD
- 15 MUL
Core time & actual LOAD count: from IACA
Collaboration with D. Keyes & T. Malas (KAUST)

36 IACA example output
AVX vectorization, no unrolling: one iteration updates 8 SP (float) elements. Multiply all numbers by 2 to get the time for updating 1 cache line (16 floats).
- Core execution time (16 LUP) = 2 * 34.25 cy = 68.5 cy
- Data transfer, LOAD ports (REG-L1): 2 * 30.5 cy = 61 cy (128-bit loads)

37 Example 3: Data delay
Problem size: 260^3 (single precision); times in cy/CL
- Spatial blocking: layer condition at L3 and row condition in L1 are OK
- L1-REG loads, from IACA analysis: 61 cy
- 8 LOADS to V can be served directly by L3 cache + 1 from main memory
- L2-L1: 24 cy, L3-L2: 24 cy, Mem-L3: 17 cy
- MemBW = 40 GB/s
- Minimum data transfer to main memory: 4 WORD/LUP (LD: U, V, ROC; ST: U)
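The 17 cy for the memory level can be cross-checked from the bandwidth and the minimum data volume. The helper below is a hypothetical sketch. It assumes 4-byte single-precision words and the 2.7 GHz clock used in the single-core prediction for this example.

```c
#include <assert.h>

/* Cycles needed to move `bytes` over a memory interface that sustains
   `bw_gbs` GB/s, seen from a core clocked at `f_ghz` GHz. */
double mem_cycles(double bytes, double bw_gbs, double f_ghz)
{
    assert(bw_gbs > 0.0);
    return bytes * f_ghz / bw_gbs;
}
```

One cache line of updates (16 LUP) moves 4 WORD/LUP * 4 B * 16 LUP = 256 B, so `mem_cycles(256, 40, 2.7)` yields about 17.3 cy, in line with the 17 cy on the slide.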

38 Example 3: Putting it all together
Times in cy per CL (16 LUP, SP)
Core execution:
- Non-LD/ST cycles: ADD 52 cy, MULT 38 cy, Reg-Reg transfers 48 cy
- Stores: 4 cy
- IACA throughput: 68.5 cy / CL (SP)
- FrontEnd stalls overlap: (68.5 - 61) cy = 7.5 cy
Data delay:
- L1-REG (Load): 61 cy
- L2-L1: 24 cy
- L3-L2: 24 cy
- M-L3: 17 cy
- Sum: 126 cy
Optimization target!
Single-core performance (ECM model): 2.7 GHz / (126 cy / 16 LUP) = 343 MLUP/s
Measurement: 320 MLUP/s
Temporal blocking is useless: memory transfers account for only 17 of the 126 cy.

39 Socket scaling: memory bandwidth limit

40 ECM model: Conclusions & outlook
Saturation effects are ubiquitous; understanding them gives us the opportunity to
- Find out about optimization opportunities
- Save energy by letting cores idle (see power model later on)
- Put idle cores to better use (communication, functional decomposition)
Simple models work best. Do not try to complicate things unless it is really necessary!
Possible extensions to the ECM model:
- Accommodate latency effects
- Model simple architectural hazards
