SPIRAL Overview: Automatic Generation of DSP Algorithms & More *

1 SPIRAL Overview: Automatic Generation of DSP Algorithms & More * Jeremy Johnson (& SPIRAL Team), Drexel University. * The work is supported by DARPA DESA, NSF, and Intel. Material for presentation provided by Franz Franchetti, Yevgen Voronenko, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel and the rest of the SPIRAL team. CS 65 Program Generation and Opt.

2 Outline Introduction Challenge and goal PreSPIRAL Formulas, programs, and rewrite rules Multilinear Algebra and Block Recursive Algorithms SPIRAL (straightline code) SPL and Ruletrees SPIRAL (loops) Σ-SPL SPIRAL (vector, parallel, and more) Tagged SPL SPIRAL (library generation) Operator Language Beyond transforms

3 Outline Introduction Challenge and goal PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

4 Challenge of Obtaining Efficient Code Multiple threads: 2x Memory hierarchy: 5x Vector instructions: 3x High performance library development has become a nightmare

5 Spiral Overview. Research Goal: Teach computers to write fast libraries. Complete automation of implementation and optimization, including vectorization and parallelization. Functionality: linear transforms (discrete Fourier transform, filters, wavelets), BLAS, SAR imaging, en/decoding (Viterbi, EBCOT in JPEG2000), and more. Platforms: desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid. Collaboration with Intel (Kuck, Tang, Sabanin): parts of MKL/IPP generated with Spiral; IPP 6.0: ippg domain for Spiral generated code.

6 SPIRAL: Abstraction Levels. Carnegie Mellon, Drexel, UIUC. [Figure: abstraction levels from specification down to code. Transform: problem specification. Ruletree: easy manipulation for search. SPL: vector/parallel optimizations. Σ-SPL: loop optimizations. C Code: traditional human domain, C compiler tools available; illustrated by the loop for (i=0; i<8; i++) { t[2i+4]=x[i-4]+x[i+4]; }]

7 Program Generation in Spiral (Sketched). Transform (user specified) → Fast algorithm in SPL (many choices) → Σ-SPL [PLDI 5] → C Code. Optimization at all abstraction levels: parallelization and vectorization (SPL), loop optimizations (Σ-SPL), constant folding and scheduling (C code). Iteration of this process to search for the fastest. But that's not all.

8 Outline Introduction PreSPIRAL Formulas, programs, and rewrite rules [CSSP 9] Multilinear Algebra and Block Recursive Algorithms [JSC 9] SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

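9 DSP Algorithms & Matrix Factorizations. A transform is a matrix: the output vector (signal) $y$ is obtained from the input vector (signal) $x$ as $y = T x$. Example: $\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$, a product of sparse factors built from the Fourier transform, identity, permutation, diagonal matrix (twiddles), and Kronecker product.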

10 Rewrite Rules & Formula Manipulation

11 Rules and Dynamic Programming. The Cooley-Tukey (CT) rule combines a diagonal matrix (twiddles), Kronecker products, identities, and a permutation. [Figure: three ruletrees for DFT(16), splitting it into DFT(2) and DFT(8), DFT(4) and DFT(4), and DFT(8) and DFT(2).] Dynamic Programming (DP) searches over different applications of the CT rule, using the best transform found for recursive calls.
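The DP search over Cooley-Tukey splits can be sketched as a memoized recursion. This is only an illustration: SPIRAL times generated code, whereas the sketch below uses a made-up cost model (`base_cost`, `combine_cost`), and all function and variable names are hypothetical.

```c
#include <assert.h>

#define MAX_N 1024

/* Memo tables: cheapest cost found for DFT(n), and the left factor k of the
   chosen split n = k*m (0 means "no split, use the direct base case"). */
static double best_cost[MAX_N + 1];
static int best_split[MAX_N + 1];
static int solved[MAX_N + 1];

/* Hypothetical cost model standing in for a real runtime measurement. */
static double base_cost(int n) { return (double)n * (double)n; } /* direct DFT */
static double combine_cost(int n) { return (double)n; }          /* twiddles + permutation */

/* Returns the cheapest cost for DFT(n), reusing the best result already
   found for every smaller size (the DP assumption). */
double dp_search(int n) {
    assert(n >= 1 && n <= MAX_N);
    if (solved[n]) return best_cost[n];
    double best = base_cost(n);
    int split = 0;
    for (int k = 2; k < n; k++) {
        if (n % k != 0) continue;
        int m = n / k;
        /* CT rule: m copies of DFT(k), k copies of DFT(m), plus glue. */
        double cost = m * dp_search(k) + k * dp_search(m) + combine_cost(n);
        if (cost < best) { best = cost; split = k; }
    }
    solved[n] = 1;
    best_cost[n] = best;
    best_split[n] = split;
    return best;
}

/* Left factor of the best split found for DFT(n), or 0 for the base case. */
int dp_best_split(int n) {
    dp_search(n);
    return best_split[n];
}
```

Under this toy model, `dp_best_split(16)` selects the DFT(4) and DFT(4) ruletree; with measured runtimes the winner depends on the platform.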

12 Formulas to Code
t(0) = x(0); t(1) = x(2); t(2) = x(1); t(3) = x(3);
u(0) = t(0) + t(1); u(1) = t(0) - t(1); u(2) = t(2) + t(3); u(3) = t(2) - t(3);
v(0) = u(0); v(1) = u(1); v(2) = u(2); v(3) = i*u(3);
y(0) = v(0) + v(2); y(2) = v(0) - v(2); y(1) = v(1) + v(3); y(3) = v(1) - v(3);
Simplify:
u(0) = x(0) + x(2); u(1) = x(0) - x(2); u(2) = x(1) + x(3); u(3) = x(1) - x(3);
u(3) = i*u(3);
y(0) = u(0) + u(2); y(2) = u(0) - u(2); y(1) = u(1) + u(3); y(3) = u(1) - u(3);
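As a minimal sketch, the simplified straight-line code on this slide can be written in C99 with `<complex.h>`. The function name `dft4` is hypothetical, and the twiddle factor `i` is taken from the slide, so the result is y[k] = sum over j of i^(j*k) * x[j].

```c
#include <assert.h>
#include <complex.h>

/* 4-point DFT with twiddle factor i, as straight-line code:
   permutation, butterflies, and twiddle scaling are fused by hand. */
void dft4(double complex y[4], const double complex x[4]) {
    assert(x != 0 && y != 0);
    double complex u0 = x[0] + x[2];
    double complex u1 = x[0] - x[2];
    double complex u2 = x[1] + x[3];
    double complex u3 = I * (x[1] - x[3]);  /* the only nontrivial twiddle */
    y[0] = u0 + u2;
    y[2] = u0 - u2;
    y[1] = u1 + u3;
    y[3] = u1 - u3;
}
```

All inputs are read into temporaries before any output is written, so calling the function with `y` and `x` pointing to the same array is safe.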

13 Formulas and Architectures. [Figure: a parallel operation applies A to four blocks of x on processors 0 to 3 to produce y; a vector operation applies A to 4-way vector blocks.] Task: Rewrite formulas to introduce varying amounts of vectorization and parallelism and modify access patterns

14 Multilinear Algebra & Block Recursive Algs Matrix Multiplication Strassen's Algorithm

15 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPL and Ruletrees SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

16 SPL (Signal Processing Language) [PLDI ] SPL expresses transform algorithms as structured sparse matrix factorizations. Examples: SPL grammar in Backus-Naur form

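17 Compiling SPL to Code Using Templates
Stride permutation: for i=0..n-1 for j=0..m-1 y[i+n*j]=x[m*i+j]
Direct sum of A and B: y[0:1:n-1] = call A(x[0:1:n-1]) y[n:1:n+m-1] = call B(x[n:1:n+m-1])
Tensor product of identity and B: for i=0..n-1 y[im:1:im+m-1] = call B(x[im:1:im+m-1])
Tensor product of identity and B with a stride permutation on the input: for i=0..n-1 y[im:1:im+m-1] = call B(x[i:n:i+m-1])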
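Two of these templates can be sketched as ordinary C functions: the stride permutation loop and the loop that applies a kernel B to n contiguous blocks. The names `stride_perm`, `tensor_id_kernel`, `kernel`, and `f2` are hypothetical and not part of the SPL compiler.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel B reads a block of its input and writes a block of its output. */
typedef void (*kernel)(double *y, const double *x);

/* Stride permutation template: y[i + n*j] = x[m*i + j]. */
void stride_perm(double *y, const double *x, size_t m, size_t n) {
    assert(y != x);  /* out-of-place only */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i + n * j] = x[m * i + j];
}

/* Tensor template: apply B to each of the n contiguous blocks of length m. */
void tensor_id_kernel(double *y, const double *x, size_t n, size_t m, kernel B) {
    for (size_t i = 0; i < n; i++)
        B(y + i * m, x + i * m);
}

/* Example kernel: the 2-point butterfly (sum and difference). */
void f2(double *y, const double *x) {
    double a = x[0], b = x[1];
    y[0] = a + b;
    y[1] = a - b;
}
```

For example, `tensor_id_kernel(out, in, 2, 2, f2)` applies the butterfly to two adjacent pairs of the input.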

18 SPIRAL Architecture [IEEE 5]. Approach: Empirical search over alternative recursive algorithms. [Figure: feedback loop. Transform → Formula Generator → SPL Compiler → C Compiler → Timer; the SPIRAL Search Engine uses the measured runtime to steer the Formula Generator. Output: adapted implementation (DFT_8.c).]

19 DSP Algorithms: Transforms & Breakdown Rules Combining these rules yields many algorithms for every given transform

20 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) Σ-SPL SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

21 Problem: Fusing Permutations and Loops
Direct mapping: two passes over the working set, complex index computation
void sub(double *y, double *x) {
  double t[8];
  for (int i=0; i<=7; i++) t[(i/4)+2*(i%4)] = x[i];
  for (int i=0; i<4; i++) { y[2*i] = t[2*i] + t[2*i+1]; y[2*i+1] = t[2*i] - t[2*i+1]; }
}
C compiler cannot do this: one pass over the working set, simple index computation
void sub(double *y, double *x) {
  for (int j=0; j<=3; j++) { y[2*j] = x[j] + x[j+4]; y[2*j+1] = x[j] - x[j+4]; }
}
State-of-the-art: SPIRAL: hardcoded with templates. FFTW: hardcoded in the infrastructure. How does hardcoding scale?

22 New Approach for Loop Merging [PLDI 5]. [Figure: the search loop (Transform → Formula Generator → SPL formula → SPL Compiler → C Compiler → Timer → Search Engine → adapted implementation) with a new SPL Compiler. Inside the new compiler: SPL To Σ-SPL → Σ-SPL formula → Loop Merging → Σ-SPL formula → Index Simplification → Σ-SPL formula → Σ-SPL Compiler → Code.]

23 Σ-SPL. Four central constructs: Σ, G, S, Perm. Σ (sum) makes loops explicit. G_f (gather) reads data using the index mapping f. S_f (scatter) writes data using the index mapping f. Perm_f permutes data using the index mapping f. Every Σ-SPL formula still represents a matrix factorization. Example: [Figure: for j = 0, 1, 2, 3, a gather from the input, the butterfly F_2, and a scatter to the output.]
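A hedged C sketch of how a sum of scatter, kernel, and gather maps to a loop, using the size-8 butterfly example from the neighboring slides. The names `gather_index`, `scatter_index`, `f2`, and `sigma_spl_sum` are hypothetical, and the index mappings are written out by hand rather than derived by rewriting.

```c
#include <assert.h>

enum { N = 8, HALF = N / 2 };

/* Gather index mapping: iteration j reads x[j] and x[j + 4]. */
static int gather_index(int j, int k) { return j + HALF * k; }

/* Scatter index mapping: iteration j writes y[2j] and y[2j + 1]. */
static int scatter_index(int j, int k) { return 2 * j + k; }

/* The 2-point butterfly F_2. */
static void f2(double out[2], const double in[2]) {
    out[0] = in[0] + in[1];
    out[1] = in[0] - in[1];
}

/* y = sum over j of (scatter_j * F_2 * gather_j) x, written as one loop. */
void sigma_spl_sum(double y[N], const double x[N]) {
    assert(y != x);  /* out-of-place only */
    for (int j = 0; j < HALF; j++) {          /* the sum becomes the loop */
        double in[2], out[2];
        for (int k = 0; k < 2; k++)
            in[k] = x[gather_index(j, k)];    /* gather */
        f2(out, in);                          /* kernel */
        for (int k = 0; k < 2; k++)
            y[scatter_index(j, k)] = out[k];  /* scatter */
    }
}
```

Inlining the two index functions and the kernel yields the single merged loop `y[2*j] = x[j] + x[j+4]; y[2*j+1] = x[j] - x[j+4];`.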

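24 Loop Merging With Rewriting Rules. SPL → (SPL To Σ-SPL) → Σ-SPL → (Loop Merging) → Σ-SPL → (Index Simplification) → Σ-SPL → (Σ-SPL Compiler) → Code. Before loop merging the data flows from X through a temporary T to Y; after loop merging and index simplification it flows from X to Y directly. Code:
for (int j=0; j<=3; j++) { y[2*j] = x[j] + x[j+4]; y[2*j+1] = x[j] - x[j+4]; }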

25 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) Tagged SPL SPIRAL (library generation) Operator Language

26 SPL to Shared Memory Code: Basic Idea [SC 6]. Governing construct: tensor product. [Figure: four copies of A applied to contiguous blocks of x on processors 0 to 3, producing y.] p-way embarrassingly parallel, load-balanced. Problematic construct: permutations produce false sharing. Task: Rewrite formulas to extract tensor product + keep contiguous blocks.

27 Parallelization by Rewriting coarse platform model Load-balanced No false sharing

28 Same Approach for Other Parallel Paradigms Message Passing [ISPA 6] Vectorization [VecPar 6] MPI Cg/OpenGL for GPUs Verilog for FPGAs [DAC 8]

29 Example Results Multicore [SC 6] Code written by the computer is faster (for many sizes) than any human-written code

30 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) From Rules to complete library infrastructure Operator Language

31 Tasks To Be Automated (blue = hand-written, red = generated). FFTW: Codelet generator; Library infrastructure; Recursive functions needed; Vectorization (partially); Threading; Adaptation mechanism (plan); Codelets ( 2 types). Our Goal: Spiral; Library infrastructure; Recursive functions needed; Vectorization; Threading; Adaptation mechanism (plan); Codelets ( 2 types).

32 How Library Generation Works [Vor 8]. Input: Transforms + Breakdown rules; Library Target (FFTW, VSIPL, IPP FFT, ...). Library Structure: Recursion Step Closure, Parallelization / Vectorization; result: recursion step closure as Σ-SPL formulas. Library Implementation: Build library plan, Hot/cold partition, Generate target code. Output: High-performance library.

33 Breakdown Rules to Library Code
Cooley-Tukey Fast Fourier Transform (FFT), DFT, k=4
Naive implementation (2 extra functions needed):
void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  Z = permute(X)
  for i=0 to k-1 dft_subvec(m, Z, Y, …)
  for i=0 to n-1 Y[i] = Y[i]*T[i];
  for i=0 to m-1 dft_strided(k, Y, Y, …)
}

34 Breakdown Rules to Library Code
Cooley-Tukey Fast Fourier Transform (FFT), DFT
Naive implementation (2 extra functions needed):
void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  Z = permute(X)
  for i=0 to k-1 dft_subvec(m, Z, Y, …)
  for i=0 to n-1 Y[i] = Y[i]*T[i];
  for i=0 to m-1 dft_strided(k, Y, Y, …)
}
Optimized implementation (2 extra functions needed):
void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  for i=0 to k-1 dft_strided2(m, X, Y, …)
  for i=0 to m-1 dft_strided3_scaled(k, Y, Y, T, …)
}
How to discover these specialized variants automatically?
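The four steps of the naive implementation (permute, sub-DFTs on contiguous subvectors, twiddle scaling, strided DFTs) can be made runnable in a small sketch. To keep it exact and free of floating point, this sketch swaps the complex numbers of the slide for arithmetic modulo 17, where 3 is a primitive 16th root of unity; that substitution, the size limit of 16, and all helper names (`pow_mod`, `root_of_unity`, `dft_direct`, this version of `choose_factor`) are assumptions made for illustration.

```c
#include <assert.h>

enum { P = 17, ROOT = 3, MAX_N = 16 };  /* 3 has order 16 modulo 17 */

/* base^e mod P by repeated squaring. */
static int pow_mod(int base, int e) {
    int r = 1;
    base %= P;
    while (e > 0) {
        if (e & 1) r = r * base % P;
        base = base * base % P;
        e >>= 1;
    }
    return r;
}

/* Primitive n-th root of unity modulo P, for n dividing MAX_N. */
static int root_of_unity(int n) { return pow_mod(ROOT, MAX_N / n); }

/* Hypothetical factor choice: smallest nontrivial divisor of n. */
static int choose_factor(int n) {
    for (int k = 2; k * k <= n; k++)
        if (n % k == 0) return k;
    return n;
}

/* Direct O(n^2) DFT modulo P; the input is read at stride xs. */
void dft_direct(int n, const int *x, int xs, int *y) {
    int w = root_of_unity(n);
    for (int q = 0; q < n; q++) {
        int sum = 0;
        for (int j = 0; j < n; j++)
            sum = (sum + x[j * xs] * pow_mod(w, j * q)) % P;
        y[q] = sum;
    }
}

/* Naive recursive Cooley-Tukey DFT; entries of X must lie in 0..P-1. */
void dft(int n, const int X[], int Y[]) {
    assert(n >= 1 && MAX_N % n == 0);
    int k = choose_factor(n);
    if (k == n) { dft_direct(n, X, 1, Y); return; }  /* base case */
    int m = n / k;
    int Z[MAX_N], col[MAX_N];
    for (int i = 0; i < k; i++)          /* Z = permute(X): stride-k subvectors */
        for (int j = 0; j < m; j++)
            Z[i * m + j] = X[i + k * j];
    for (int i = 0; i < k; i++)          /* k DFTs of size m, contiguous */
        dft(m, Z + i * m, Y + i * m);
    int w = root_of_unity(n);
    for (int i = 0; i < k; i++)          /* twiddle scaling */
        for (int j = 0; j < m; j++)
            Y[i * m + j] = Y[i * m + j] * pow_mod(w, i * j) % P;
    for (int j = 0; j < m; j++) {        /* m DFTs of size k at stride m */
        dft_direct(k, Y + j, m, col);
        for (int i = 0; i < k; i++)
            Y[j + i * m] = col[i];
    }
}
```

The explicit copy into `Z` and the separate twiddle pass are exactly the extra passes that the optimized variant on the slide folds into its specialized strided and scaled sub-functions.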

35 Computing Recursion Step Closure
Input: transform T and a breakdown rule
Output: spawned recursion steps + Σ-SPL implementation
Algorithm:
1. Apply the breakdown rule
2. Convert to Σ-SPL
3. Apply loop merging + index simplification rules
4. Extract recursion steps
5. Repeat until closure is reached
Parameterization (not shown) derives the independent parameter set for each recursion step

36 Recursion Step Closure Examples. DFT (scalar): 4 mutually recursive functions. DCT4 (vectorized): 7 mutually recursive functions. Both are computed automatically and described using Σ-SPL formulas.

37 Benchmark: Complex DFT [Intel Core i7, 2.66 GHz, 4 cores]. [Plot: performance in Gflop/s versus input size (1k, 4k, 16k, 64k, 256k, 1M) for Spiral with 4 threads, Spiral with 1 thread, FFTW, and Intel IPP 6.0.]

38 Examples of Generated Libraries: RDFT, DHT, DCT2, DCT3, DCT4, DFT, Filter, Wavelet. 2-way vectorized, 2-threaded. Most are faster than hand-written libs. Code size: 8 2 KLOC or .5 5 MB. Generation time: 3 hours. Total: 3 KLOC / 3.3 MB of code generated in < 2 hours from a few simple algorithm specs. Intel IPP library 6.0 includes Spiral generated code.

39 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language Beyond transforms

40 Operators [DSL 9]. Definition: Operator: multiple complex vectors → multiple complex vectors. Higher-dimensional data is linearized. Operators are potentially nonlinear. Example: Matrix-matrix-multiplication (MMM), C = A B.

41 Operator Language

42 Example: Matrix Multiplication (MMM) Breakdown rules: capture various forms of blocking
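One form of blocking that such breakdown rules capture can be sketched as a tiled triple loop. This is a plain hand-written illustration, not Spiral output; the function name `mmm_blocked`, the square row-major layout, and the requirement that the block size divides n are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for row-major n x n matrices, tiled into nb x nb blocks.
   Each (ii, jj, kk) triple multiplies one block of A by one block of B. */
void mmm_blocked(size_t n, size_t nb, const double *A, const double *B, double *C) {
    assert(nb > 0 && n % nb == 0);
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t jj = 0; jj < n; jj += nb)
            for (size_t kk = 0; kk < n; kk += nb)
                /* small MMM on one tile; this is where a kernel would go */
                for (size_t i = ii; i < ii + nb; i++)
                    for (size_t k = kk; k < kk + nb; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + nb; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Applying the same decomposition again inside the tile loop gives multi-level blocking, which is the kind of choice a search over breakdown rules would explore.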

43 Matrix Multiplication Library. [Plots: Rank-k Update, k=4, performance in Gflop/s versus input size, one panel for single precision and one for double precision, on a Dual Intel Xeon 56, 3 GHz; curves for MKL, GotoBLAS, and the Spiral-generated library.]

44 References
[CSSP 9] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures, Circuits Systems Signal Process., 9(4): 449-500, 1990.
[JSC 9] R. W. Johnson, C. H. Huang and J. R. Johnson, Multilinear algebra and parallel programming, The Journal of Supercomputing, Vol. 5, Oct. 1991.
[IEEE 5] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proc. of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", Vol. 93, No. 2, 2005.
[IPDPS 2] Kang Chen and Jeremy Johnson, A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2002.
[SC 6] Franz Franchetti, Yevgen Voronenko and Markus Püschel, FFT Program Generation for Shared Memory: SMP and Multicore, Proc. Supercomputing (SC), 2006.
[ISPA 6] Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber, Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers, Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes in Computer Science, Springer, Vol. 4330, 2006.

45 References
[VecPar 8] Franz Franchetti, Yevgen Voronenko and Markus Püschel, A Rewriting System for the Vectorization of Signal Transforms, Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, 2006.
[DAC 8] Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel, Formal Datapath Representation and Manipulation for Implementing DSP Transforms, Proc. Design Automation Conference (DAC), 2008.
[IPDPS 4] Jeremy Johnson and Kang Chen, A Self-Adapting Distributed Memory Package for Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004.
[PLDI ] Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua, SPL: A Language and Compiler for DSP Algorithms, Proc. Programming Languages Design and Implementation (PLDI), 2001.
[PLDI 5] Franz Franchetti, Yevgen Voronenko and Markus Püschel, Formal Loop Merging for Signal Transforms, Proc. Programming Languages Design and Implementation (PLDI), 2005.
[Vor 8] Yevgen Voronenko, Library Generation for Linear Transforms, PhD thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008.
[DSL 9] Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel, Operator Language: A Program Generation Framework for Fast Kernels, Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, 2009.
