Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University


1 Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University

2 Outline
Scientific Computation Kernels
  Matrix Multiplication
  Fast Fourier Transform (FFT)
Automated Performance Tuning (IEEE Proc. Vol. 93, No. 2, Feb. 2005)
  ATLAS
  FFTW
  SPIRAL

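3 Matrix Multiplication and the FFT
Matrix multiplication:
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Discrete Fourier transform of size N, with $\omega_N$ a primitive N-th root of unity:
$$y_k = \sum_{l=0}^{N-1} \omega_N^{kl}\, x_l$$
FFT for N = RS, with $k = k_1 + R k_2$ and $l = l_1 S + l_2$:
$$y_{k_1 + R k_2} = \sum_{l_2=0}^{S-1} \omega_S^{k_2 l_2}\, \omega_N^{k_1 l_2} \sum_{l_1=0}^{R-1} \omega_R^{k_1 l_1}\, x_{l_1 S + l_2}$$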
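The factorization N = RS behind the FFT can be checked numerically. The following is a minimal sketch, not the slide's own code: it assumes the convention ω_N = exp(-2πi/N) and the index split k = k1 + R·k2, l = l1·S + l2, and the function names `dft` and `cooley_tukey_step` are made up for illustration.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: y[k] = sum_l w_N^(k*l) * x[l], with w_N = exp(-2*pi*i/N)."""
    n = len(x)
    return [sum(cmath.exp(-2j * cmath.pi * k * l / n) * x[l] for l in range(n))
            for k in range(n)]

def cooley_tukey_step(x, r, s):
    """One divide-and-conquer step for N = r*s (assumed index split, see lead-in)."""
    n = r * s
    assert len(x) == n
    y = [0j] * n
    # Inner stage: for each l2, a size-r DFT over the stride-s subsequence x[l2::s].
    inner = [dft(x[l2::s]) for l2 in range(s)]          # inner[l2][k1]
    for k1 in range(r):
        # Twiddle factors w_N^(k1*l2) scale the inner results.
        t = [cmath.exp(-2j * cmath.pi * k1 * l2 / n) * inner[l2][k1]
             for l2 in range(s)]
        # Outer stage: a size-s DFT over l2 yields the outputs k1 + r*k2.
        outer = dft(t)
        for k2 in range(s):
            y[k1 + r * k2] = outer[k2]
    return y
```

Applying the step recursively to the smaller DFTs, instead of calling `dft` directly, is what brings the cost down from O(N^2) toward O(N log N).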


4 Basic Linear Algebra Subprograms (BLAS)
Level 1: vector-vector, O(n) data, O(n) operations
Level 2: matrix-vector, O(n^2) data, O(n^2) operations
Level 3: matrix-matrix, O(n^2) data, O(n^3) operations: data reuse, locality!
LAPACK built on top of BLAS (level 3)
Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms
GEMM: General Matrix Multiplication
SUBROUTINE DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
C := alpha*op( A )*op( B ) + beta*C, where op( X ) = X or X^T

5 DGEMM
*     Form  C := alpha*A*B + beta*C.
*
      DO 90, J = 1, N
         IF( BETA.EQ.ZERO )THEN
            DO 50, I = 1, M
               C( I, J ) = ZERO
   50       CONTINUE
         ELSE IF( BETA.NE.ONE )THEN
            DO 60, I = 1, M
               C( I, J ) = BETA*C( I, J )
   60       CONTINUE
         END IF
         DO 80, L = 1, K
            IF( B( L, J ).NE.ZERO )THEN
               TEMP = ALPHA*B( L, J )
               DO 70, I = 1, M
                  C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70          CONTINUE
            END IF
   80    CONTINUE
   90 CONTINUE
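For readers less used to Fortran, the same column-oriented update can be sketched in Python. This is an illustrative translation of the no-transpose case only; the name `gemm_nn` and the list-of-rows layout are assumptions, not part of BLAS.

```python
def gemm_nn(alpha, a, b, beta, c):
    """In-place C := alpha*A*B + beta*C for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    for j in range(n):                      # one column of C at a time
        for i in range(m):
            # beta == 0 overwrites C, so stale values in C are never read
            c[i][j] = 0.0 if beta == 0.0 else beta * c[i][j]
        for l in range(k):
            if b[l][j] != 0.0:              # skip work when B(L,J) is zero
                temp = alpha * b[l][j]
                for i in range(m):          # add temp * (column l of A) to column j of C
                    c[i][j] += temp * a[i][l]
    return c
```

The innermost loop runs down a column of A and C, which is the contiguous direction in Fortran's column-major storage; with Python lists of rows that locality benefit does not carry over, so the sketch only illustrates the arithmetic.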

6 Matrix Multiplication Performance

7 Matrix Multiplication Performance


8 Numerical Recipes
Numerical Recipes in C: The Art of Scientific Computing, 2nd Ed. William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Cambridge University Press, 1992.
"This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines."
1. Preliminaries
2. Solution of Linear Algebraic Equations
12. Fast Fourier Transform
19. Partial Differential Equations
20. Less-Numerical Algorithms

9 four1

10 four1 (cont)

11 FFT Performance

12 ATLAS Architecture and Search Parameters
NB: L1 data cache tile size
NCNB: L1 data cache tile size for non-copying version
MU, NU: register tile size
KU: unroll factor for k loop
LS: latency for computation scheduling
FMA: 1 if fused multiply-add available, 0 otherwise
FF, IF, NF: scheduling of loads
Yotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

13 ATLAS Code Generation Optimization for locality Cache tiling, Register tiling
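Cache tiling can be illustrated with a short sketch. The code below is not ATLAS output; it is a hand-written Python illustration in which the tile size `nb` plays the role of the NB parameter, and the function name `matmul_tiled` is made up for this example.

```python
def matmul_tiled(a, b, nb):
    """C = A*B for square n x n lists of rows, computed tile by tile with tile size nb."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                # Mini matrix multiply on one nb x nb tile triple;
                # min() handles edge tiles when nb does not divide n.
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, n)):
                        acc = c[i][j]               # keep the running sum in a scalar
                        for k in range(k0, min(k0 + nb, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The idea is that the three tiles touched by the inner loops fit in cache together, so each element loaded from memory is reused about nb times. In compiled code the best `nb` depends on the cache size, which is why it is a search parameter.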

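14 ATLAS Code Generation
Register tiling: MU + NU + MU*NU ≤ NR
Loop unrolling
Scalar replacement
Add/mul interleaving
Loop skewing
C_ij = C_ij + A_ik * B_kj
[Figure: an MU x NU register tile of C updated from an MU x K panel of A and a K x NU panel of B inside an NB x NB cache tile; the schedule issues MU*NU multiplies and MU*NU adds, with the adds skewed by the latency Ls.]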

15 ATLAS Search
Estimate machine parameters (C1, NR, FMA, LS)
  Used to bound search
Orthogonal line search (fix all parameters except one and search for the optimal value of this parameter)
Search order:
  NB
  MU, NU
  KU
  LS
  FF, IF, NF
  NCNB
  Cleanup codes
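Orthogonal line search can be sketched in a few lines. The following is a toy illustration under stated assumptions: the function `toy_cost` is a made-up stand-in for a measured runtime, the candidate ranges are arbitrary, and, unlike ATLAS, MU and NU are tuned one after the other rather than jointly.

```python
def orthogonal_line_search(cost, ranges, order, start):
    """Tune one parameter at a time, in the given order, holding the others fixed."""
    best = dict(start)
    for name in order:
        # Try every candidate value for this parameter with all others held fixed,
        # and keep the value with the lowest cost before moving on.
        best[name] = min(ranges[name], key=lambda v: cost({**best, name: v}))
    return best

def toy_cost(p):
    """Hypothetical cost model; a real tuner would time generated code instead."""
    return (p["NB"] - 40) ** 2 + (p["MU"] * p["NU"] - 8) ** 2 + (p["KU"] - 4) ** 2

ranges = {"NB": range(16, 81, 8), "MU": [1, 2, 4], "NU": [1, 2, 4], "KU": [1, 2, 4, 8]}
start = {"NB": 16, "MU": 1, "NU": 1, "KU": 1}
tuned = orthogonal_line_search(toy_cost, ranges, ["NB", "MU", "NU", "KU"], start)
```

The search evaluates the sum of the range sizes rather than their product, which keeps tuning time manageable. The price is that the result depends on the search order and the starting point, and interactions between parameters can be missed.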

16 Using FFTW

17 FFTW Infrastructure
Use dynamic programming to find an efficient way to combine code sequences.
Combine code sequences using the divide-and-conquer structure of the FFT.
Codelets: optimized code sequences for small FFTs.
Plan: encodes the divide-and-conquer strategy and stores twiddle factors.
Executor: computes the FFT of given data using the algorithm described by the plan.
[Figure: right-recursive plan tree.]
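The planner/executor split can be illustrated with a toy model. This sketch is not FFTW's API or implementation: the operation-count cost model, the `MAX_CODELET` limit, the plan tuples, and all function names are assumptions chosen for illustration (FFTW times real code rather than counting operations).

```python
import cmath
from functools import lru_cache

MAX_CODELET = 8   # hypothetical: sizes up to 8 count as "optimized" small kernels

def direct_dft(x):
    """Stand-in for a codelet: a direct DFT with w_N = exp(-2*pi*i/N)."""
    n = len(x)
    return [sum(cmath.exp(-2j * cmath.pi * k * l / n) * x[l] for l in range(n))
            for k in range(n)]

@lru_cache(maxsize=None)
def best_plan(n):
    """Dynamic programming: (cost, plan) for size n, reusing the best sub-plans."""
    candidates = [(n * n, ("direct", n))]            # toy cost: n^2 operations
    for r in range(2, min(n, MAX_CODELET + 1)):
        if n % r == 0:
            s = n // r
            sub_cost, sub_plan = best_plan(s)        # memoized best plan for size s
            # s kernels of size r, n twiddle multiplies, r recursive FFTs of size s
            cost = s * r * r + n + r * sub_cost
            # the plan stores its twiddle factors w_N^(k1*l2)
            tw = tuple(tuple(cmath.exp(-2j * cmath.pi * k1 * l2 / n)
                             for l2 in range(s)) for k1 in range(r))
            candidates.append((cost, ("split", r, s, tw, sub_plan)))
    return min(candidates, key=lambda cand: cand[0])

def execute(plan, x):
    """Executor: compute the DFT of x by following the plan."""
    if plan[0] == "direct":
        return direct_dft(x)
    _, r, s, tw, sub = plan
    y = [0j] * (r * s)
    inner = [direct_dft(x[l2::s]) for l2 in range(s)]    # s small kernels of size r
    for k1 in range(r):
        t = [tw[k1][l2] * inner[l2][k1] for l2 in range(s)]
        outer = execute(sub, t)                          # recursive size-s transform
        for k2 in range(s):
            y[k1 + r * k2] = outer[k2]
    return y
```

A plan is computed once per size with `cost, plan = best_plan(24)` and can then be reused for many inputs via `execute(plan, x)`, which mirrors the plan-once, execute-many usage pattern.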

18 SPIRAL system
User: specifies DSP transform, goes for a coffee (or an espresso for small transforms), platform-adapted implementation comes back
S P I R A L:
  Formula Generator (Mathematician): produces fast algorithm as SPL formula
  SPL Compiler (Expert Programmer): produces C/Fortran/SIMD code
  Search Engine: takes runtime on given platform; controls algorithm generation and controls implementation options

19 DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\; T^4_2\; (I_2 \otimes \mathrm{DFT}_2)\; L^4_2$$
Legend: DFT = Fourier transform, I = identity, L = permutation, T = diagonal matrix (twiddles), ⊗ = Kronecker product
algorithms reduce arithmetic cost O(n^2) → O(n log(n))
product of structured sparse matrices
mathematical notation exhibits structure
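The size-4 Cooley/Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) T (I_2 ⊗ DFT_2) L can be verified by multiplying out the four sparse factors. The sketch below assumes ω_4 = -i, T = diag(1, 1, 1, ω_4), and L as the stride permutation mapping (x0, x1, x2, x3) to (x0, x2, x1, x3); the helper names are made up for illustration.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

w = -1j                                     # assumed convention: w_4 = exp(-2*pi*i/4)
I2 = [[1, 0], [0, 1]]
DFT2 = [[1, 1], [1, -1]]
T42 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, w]]   # twiddle diagonal
L42 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]   # stride permutation

# Dense definition: entry (k, l) is w^(k*l).
DFT4 = [[w ** (k * l) for l in range(4)] for k in range(4)]

# Product of the four structured sparse factors.
product = matmul(matmul(matmul(kron(DFT2, I2), T42), kron(I2, DFT2)), L42)
```

Comparing `product` with `DFT4` entry by entry confirms the identity. Each factor has at most two nonzeros per row, which is where the reduction in arithmetic cost comes from.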

20 Algorithms = Ruletrees = Formulas
[Figure: rule tree for DCT^(II)_8, expanded by breakdown rules (labels include R3, R4, R6) into smaller DCT^(II), DCT^(IV), DST^(II) and F_2 nodes, shown next to the corresponding formulas; one rule expresses DCT^(II)_n through DCT^(II)_{n/2} and DCT^(IV)_{n/2}, another expresses DCT^(IV)_n through DCT^(II)_n.]

21 Generated DFT Vector Code: Pentium 4, SSE
[Figure: (pseudo) Gflop/s versus n for DFT 2^n; series: Spiral SSE, Intel MKL interl. (hand-tuned vendor assembly code), FFTW 2.1.3, Spiral C, Spiral C vect, SIMD-FFT.]
DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
speedups (relative to C code) up to a factor of about 3

22 Best DFT Trees, size 2^10 = 1024
[Figure: best trees found for Pentium 4 float, Pentium 4 double, Pentium III float and AthlonXP float; rows: scalar, C vect, SIMD.]
trees are platform/datatype dependent

23 Crosstiming of best trees on Pentium 4
[Figure: slowdown factor w.r.t. best versus n; series: Pentium 4 SSE, Pentium 4 SSE2, AthlonXP SSE, Pentium III SSE, Pentium 4 float.]
DFT 2^n, single precision, runtime of best trees found on other platforms
software adaptation is necessary
