Performance Modeling for Ranking Blocked Algorithms
1 Performance Modeling for Ranking Blocked Algorithms
Elmar Peise
Aachen Institute for Advanced Study in Computational Engineering Science (AICES)
Elmar Peise (AICES) Performance Modeling
2 Blocked algorithms
Inversion of a triangular matrix: L := L^-1, L in R^(n x n), partitioned into blocks L00, L10, L11, L20, L21, L22.
Variant 1: L10 := L10 L00;      L10 := -L11^-1 L10;    L11 := L11^-1
Variant 2: L21 := -L22^-1 L21;  L21 := L21 L11^-1;     L11 := L11^-1
Variant 3: L10 := -L11^-1 L10;  L20 := L20 + L21 L10;  L21 := L21 L11^-1;  L11 := L11^-1
Variant 4: L21 := -L22^-1 L21;  L20 := L20 - L21 L10;  L10 := L10 L00;     L11 := L11^-1
3 Execution time
Inversion of a triangular matrix: L := L^-1, L in R^(n x n).
[Plot: time [s] vs. matrix size n (up to 2,048) for variants 1-4; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
4 Efficiency
Inversion of a triangular matrix: L := L^-1, L in R^(n x n).
[Plot: efficiency [%] vs. matrix size n (up to 1,024) for variants 1-4; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
5 Blocked algorithms
Same four variants, now annotated with their observed ranking: variant 3 is 1st, variant 2 is 2nd, variant 1 is 3rd, variant 4 is 4th.
Can we rank the algorithms based on their update statements? No!
6 Outline
1 Sampling
2 Modeling
3 Prediction and Ranking
7 Sampling
Operation: B := 0.37 B L^-1, with B in R^(128 x 96) and L in R^(96 x 96).
Routine call: dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128)
Sampling via performance counters (PAPI): ticks, flops, L1misses
8 Matrix arguments
Routine call: dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128)
Performance is independent of the contents of L and B: assign a (random) memory chunk of sufficient size.
Input: (dtrsm, R, L, N, U, 128, 96, 0.37, , 128, , 128)
9 Multiple samples
Input:
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
Sampling via performance counters: ticks, flops, L1misses
10 Influence of caching
dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128)
[Plot: ticks (x10^6) across repeated experiments, in cache vs. out of cache, for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
11 Outline
1 Sampling
2 Modeling
3 Prediction and Ranking
12 Modeling
Understanding performance, model structure, model generation, modeling results
13 Argument types
Operation: B := alpha A^-1 B (side = L) or B := alpha B A^-1 (side = R).
dtrsm(side, uplo, transa, diag, m, n, alpha, A, lda, B, ldb)
Argument types: discrete (side, uplo, transa, diag); size (m, n); data: scalar (alpha), matrix (A, B), leading dimension (lda, ldb)
14 Discrete arguments
dtrsm(side, uplo, transa, diag, 256, 256, 0.5, A, 256, B, 256)
[Plot: ticks for all 16 combinations of (side, uplo, transa, diag), from (L, L, N, N) through (R, U, T, U), for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
15 Size arguments: small scale
dgemm(N, N, m, n, k, 0.5, A, m, B, k, 0.5, C, m)
[Plot: ticks vs. m/n/k for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
16 Size arguments: large scale
dgemm(N, N, m, n, k, 0.5, A, m, B, k, 0.5, C, m)
[Plot: ticks vs. m/n/k, varying m, n, and k individually and jointly, for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
17 Scalar arguments
dgemm(N, N, 256, 256, 256, alpha, A, 256, B, 256, beta, C, 256)
[Plot: ticks for GotoBLAS2, MKL, and ATLAS over the 16 special-case combinations of alpha and beta, from C := 0 and C := AB through C := AB + 0.5C and C := 0.5AB + 0.5C; AMD Opteron, 1 core]
18 Matrix arguments
dgemm(N, N, 256, 256, 256, 0.5, A, 256, B, 256, 0.5, C, 256)
No performance dependency on the matrix contents.
19 Leading dimension arguments
dgemm(N, N, 128, 128, 128, 0.5, A, lda, B, ldb, 0.5, C, ldc)
[Plot: ticks vs. lda/ldb/ldc for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
20 Modeling
Understanding performance, model structure, model generation, modeling results
21 Model structure
Input: (dtrsm, L, U, N, N, 256, 512, 0.7, A, 512, B, 512)
Model for dtrsm:
- parameter extraction: discrete case (L, U, N), sizes (256, 512)
- discrete cases: LLN, LLT, LUN, LUT, RLN, RUN, RLT, RUT
- per case and performance counter (ticks, flops, L1misses): piecewise polynomials in the size arguments
- per polynomial: the statistical quantities min, avg, std, max, evaluated at (256, 512)
Output: (min, avg, std, max) for each counter
22 Modeling
Understanding performance, model structure, model generation, modeling results
23 Modeler
The Modeler drives a Sampler through a Sampler Interface and holds one Routine Modeler per routine; each Routine Modeler holds one Piecewise Polynomial Modeler per discrete case (e.g. (L, U, N)), which produces one Model per performance counter (e.g. ticks).
24 Piecewise Polynomial Modelers
Two strategies: Model Expansion and Adaptive Refinement.
25 Model Expansion
[Figure: the modeled domain grows step by step, extending the completed region as long as the model stays accurate]
26 Adaptive Refinement
[Figure: regions are refined repeatedly; where the accuracy is bad (above the threshold) a region is split, where it is good the region is kept]
27 Modeling
Understanding performance, model structure, model generation, modeling results
28 Model Expansion for ticks
dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500)
[Heat maps on slides 28-31: relative error (0% to 20%) over m, n up to 1,024; AMD Opteron, 1 core, GotoBLAS2]
29 Opposite direction
30 Smaller approximation error bound
31 Smaller models
32 Adaptive Refinement for ticks
dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500)
[Heat maps on slides 32-34: relative error (0% to 20%) over m, n up to 1,024; AMD Opteron, 1 core, GotoBLAS2]
33 Smaller models
34 Smaller approximation error bound
35 Comparison
dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500)
[Plot: average error [%] vs. number of samples (up to 10^5) for Model Expansion and Adaptive Refinement; AMD Opteron, 1 core, GotoBLAS2]
36 Outline
1 Sampling
2 Modeling
3 Prediction and Ranking
37 Ranking
Blocked algorithm revisited: L := L^-1, partitioned into blocks L00, L10, L11, L20, L21, L22.
Variant 1: L10 := L10 L00;      L10 := -L11^-1 L10;    L11 := L11^-1
Variant 2: L21 := -L22^-1 L21;  L21 := L21 L11^-1;     L11 := L11^-1
Variant 3: L10 := -L11^-1 L10;  L20 := L20 + L21 L10;  L21 := L21 L11^-1;  L11 := L11^-1
Variant 4: L21 := -L22^-1 L21;  L20 := L20 - L21 L10;  L10 := L10 L00;     L11 := L11^-1
38 C implementation: Variant 1
Updates: L10 := L10 L00; L10 := -L11^-1 L10; L11 := L11^-1

extern void dtrmm_(char *side, char *uplo, char *transa, char *diag,
                   int *m, int *n, double *alpha, double *A, int *lda,
                   double *B, int *ldb);
extern void dtrsm_(char *side, char *uplo, char *transa, char *diag,
                   int *m, int *n, double *alpha, double *A, int *lda,
                   double *B, int *ldb);

int trinv1_(char *diag, int *n, double *A, int *lda, int *bsize)
{
    if (*n == 1) {
        if (diag[0] == 'N')
            *A = 1 / *A;
        return 0;
    }
    int ione = 1;
    double one = 1;
    double mone = -1;
    for (int p = 0; p < *n; p += *bsize) {
        int b = *bsize;
        if (p + b > *n)
            b = *n - p;
#define A00 (A)
#define A10 (A + p)
#define A11 (A + *lda * p + p)
        dtrmm_("R", "L", "N", diag, &b, &p, &one, A00, lda, A10, lda);  /* L10 := L10 L00     */
        dtrsm_("L", "L", "N", diag, &b, &p, &mone, A11, lda, A10, lda); /* L10 := -L11^-1 L10 */
        trinv1_(diag, &b, A11, lda, &ione);                             /* L11 := L11^-1      */
    }
    return 0;
}
39 Prediction
Input: trinv1(N, 300, A, 300, 100)
Compute routine invocations (update -> routine invocation):
p = 0:
  L10 := L10 L00      -> (dtrmm, R, L, N, N, 100, 0, 1, , 300, , 300)
  L10 := -L11^-1 L10  -> (dtrsm, L, L, N, N, 100, 0, -1, , 300, , 300)
  L11 := L11^-1       -> (trinv1, N, 100, , 300, 1)
p = 100:
  L10 := L10 L00      -> (dtrmm, R, L, N, N, 100, 100, 1, , 300, , 300)
  L10 := -L11^-1 L10  -> (dtrsm, L, L, N, N, 100, 100, -1, , 300, , 300)
  L11 := L11^-1       -> (trinv1, N, 100, , 300, 1)
p = 200:
  L10 := L10 L00      -> (dtrmm, R, L, N, N, 100, 200, 1, , 300, , 300)
  L10 := -L11^-1 L10  -> (dtrsm, L, L, N, N, 100, 200, -1, , 300, , 300)
  L11 := L11^-1       -> (trinv1, N, 100, , 300, 1)
Evaluate the performance models and accumulate the predicted ticks.
40 Results: cache-trashing
[Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
41 Results: in-cache
[Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
43 Results: probabilistic prediction
[Plot: efficiency vs. n (up to 1,024) for variants 1-4; measurements vs. predicted median, average, min, and max; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
44 Optimal block-size
[Plot: efficiency vs. block size for variants 1-4; measurements vs. predicted median, average, min, and max; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
45 Conclusion
1 Sampling: a performance measurement tool, applicable to various DLA routines
2 Modeling: an automatic modeling system, flexible and accurate
3 Prediction and Ranking: accurate predictions, correct ranking
46 Performance Modeling for Ranking Blocked Algorithms
Questions?
Funding from DFG is gratefully acknowledged.
More informationLinear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre
Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org
More informationChapter 4. Matrix and Vector Operations
1 Scope of the Chapter Chapter 4 This chapter provides procedures for matrix and vector operations. This chapter (and Chapters 5 and 6) can handle general matrices, matrices with special structure and
More informationNAG Fortran Library Routine Document F08BHF (DTZRZF).1
NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationHPC Numerical Libraries. Nicola Spallanzani SuperComputing Applications and Innovation Department
HPC Numerical Libraries Nicola Spallanzani n.spallanzani@cineca.it SuperComputing Applications and Innovation Department Algorithms and Libraries Many numerical algorithms are well known and largely available.
More informationINTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006
INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006 The Challenges of Multicore and Specialized Accelerators Jack Dongarra University of Tennessee
More informationArm Performance Libraries Reference Manual
Arm Performance Libraries Reference Manual Version 18.0.0 Document Number 101004_1800_00 Copyright c 2017 Arm Ltd. or its affiliates, Numerical Algorithms Group Ltd. All rights reserved. Non-Confidential
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationCUDA Toolkit 4.0 Performance Report. June, 2011
CUDA Toolkit 4. Performance Report June, 211 CUDA Math Libraries High performance math routines for your applications: cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse
More informationAccurate Cache and TLB Characterization Using Hardware Counters
Accurate Cache and TLB Characterization Using Hardware Counters Jack Dongarra, Shirley Moore, Philip Mucci, Keith Seymour, and Haihang You Innovative Computing Laboratory, University of Tennessee Knoxville,
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationThe LinBox Project for Linear Algebra Computation
The LinBox Project for Linear Algebra Computation A Practical Tutorial Daniel S. Roche Symbolic Computation Group School of Computer Science University of Waterloo MOCAA 2008 University of Western Ontario
More informationAuto-tuning Multigrid with PetaBricks
Auto-tuning with PetaBricks Cy Chan Joint Work with: Jason Ansel Yee Lok Wong Saman Amarasinghe Alan Edelman Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationNAG Fortran Library Routine Document F04DHF.1
F04 Simultaneous Linear Equations NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised
More informationRELIABLE GENERATION OF HIGH- PERFORMANCE MATRIX ALGEBRA
RELIABLE GENERATION OF HIGH- PERFORMANCE MATRIX ALGEBRA Jeremy G. Siek, University of Colorado at Boulder Joint work with Liz Jessup, Boyana Norris, Geoffrey Belter, Thomas Nelson Lake Isabelle, Colorado
More informationHigh Performance Linear Algebra
High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control
More informationHardware Sizing Guide OV
Hardware Sizing Guide OV3600 6.3 www.alcatel-lucent.com/enterprise Part Number: 0510620-01 Table of Contents Table of Contents... 2 Overview... 3 Properly Sizing Processing and for your OV3600 Server...
More information