SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms

Size: px

Start display at page:

Download "SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms"

Clifton Morgan
5 years ago
Views:

1 SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms Markus Püschel Faculty José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar Inc.) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Students Franz Franchetti (TU Vienna) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Jianxin Xiong (UIUC)

2 Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT administered by the Army Directorate of Contracting.

3 Moore s Law and High(est est) Performance Scientific Computing (single processor, off-the-shelf) Moore s Law: processor-memory bottleneck short life cycles of computers very complex architectures vendor specific special instructions (MMX, SSE, FMA, ) undocumented features Effects on software/algorithms: arithmetic cost model not accurate for predicting runtime (one cache miss = floating point ops) better performance models hard to get best code is machine dependent (registers/caches size, structure) hand-tuned code becomes obsolete as fast as it is written compiler limitations full performance requires (in part) assembly coding Portable performance requires automation

4 SPIRAL Automates Implementation cuts development costs code less error-prone Optimization systematic exploration of alternatives both at algorithmic and code level Platform-Adaptation of DSP algorithms takes advantage of architecture specific features porting without loss of performance are performance critical A library generator for highly optimized signal processing algorithms

5 SPIRAL Approach given DSP Transform (DFT, DCT, Wavelets etc.) SPIRAL Search Space Possible Algorithms Possible Implementations Performance Evaluation Intelligent Search adapted implementation given Computing Platform (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, )

6 DSP Transform Algorithm

7 DSP DSP Algorithms: Example 4 Algorithms: Example 4-point DFT point DFT Cooley/Tukey FFT (size 4): = i i i i i ) ( ) ( L DFT I T I DFT DFT = Diagonal matrix (twiddles) Fourier transform Kronecker product Permutation Identity product of structured sparse matrices mathematical notation

8 DSP Algorithms: Terminology Transform DFTn parameterized matrix Rule DFT nm ( DFT I ) D ( I DFT ) P a breakdown strategy product of sparse matrices n m n m Ruletree DFT 8 DFT 2 DFT 4 DFT 2 DFT 2 recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Formula DFT ( F I ) D ( I ( I F )) P = L few constructs and primitives uniquely defines an algorithm can be translated into code 2 2

9 DSP Transforms discrete Fourier transform Walsh-Hadamard transform [ exp ( 2kli π n) ] DFT n = / WHT k = DFT 2 2 L DFT 2 ( II ) DCT n = [ cos( k( l + / 2) π / n) ] discrete cosine and sine ( IV ) DCT n Transforms (6 types) = [ cos( ( k + / 2) ( l + / 2) π / n) ] ( I ) DST n = [ sin ( klπ / n) ] MDCT = [ cos( ( k + ( n + ) / 2)( l + / 2) n) ] modified discrete cosine transform two-dimensional transform n 2 n π / T T Others: filters, discrete wavelet transforms, Haar, Hartley,

10 Rules = Breakdown Strategies DCT II (, / ) F2 ( ) 2 diag 2 ( ( II ) ( IV ) DCT DCT ) ( I F ) Q ( II ) DCT n P n / 2 n / 2 n / 2 2 DCT DCT DFT DFT F F n ( IV ) n n ( IV n nm S DCT ( II ) n D ) M L M r ( I ) ( I ) B ( DCT DST ) C base case recursive translation iterative n / 2 n / 2 recursive ( DFT I ) D ( I DFT ) P n k h) ( I n / d I d + k ) ( I n d Fd ( h)) h) Circ( h ) E m n ( / n m recursive recursive ( recursive DWT n( W ) ( DWTn / 2( W ) In / 2) P ( In / 2 n W ) E n n+ K+ ni ni ni+ + K+ nt WHT ( I WHT I ) MDCT i= n 2 n S DCT ( IV ) n P k recursive iterative/ recursive translation

11 Formula for a DCT, size 6

12 DSP Transform Algorithm (Formula) Implementation

13 Formulas in SPL ( compose ( diagonal ( 2*cos(/6*pi) 2*cos(3/6*pi) 2*cos(5/6*pi) 2*cos(7/6*pi) ) ) ( permutation ( ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( sqrt(/2) ) ) ) ( compose ( matrix ( ) ( (-) ) ) ( diagonal ( cos(3/8*pi)-sin(3/8*pi) sin(3/8*pi) cos(3/8*pi)+sin(3/8*pi) ) ) ( matrix ( ) ( ) ( ) ) ( permutation ( 2 ) )

14 SPL Syntax (Subset) matrix operations: (compose formula formula...) (tensor formula formula...) (direct_sum formula formula...) direct matrix description: (matrix (a a2...) (a2 a22...)...) (diagonal (d d2...)) (permutation (p p2...)) parameterized matrices: (I n) (F n) scalars:.5, 2/7, cos(..), w(3), pi,.2e-4 definition of new symbols: allows extension of SPL (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off controls loop unrolling

15 SPL Compiler, 4-point 4 FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) complex #codetype real fast algorithm as formula as SPL program f = x() + x(3) f = x() - x(3) f2 = x(2) + x(4) f3 = x(2) - x(4) f4 = (.d,-.d)*f(3) y() = f + f2 y(2) = f - f2 y(3) = f + f4 y(4) = f - f4 r = x() + x(5) r = x() - x(5) r2 = x(2) + x(6) r3 = x(2) - x(6) r4 = x(3) + x(7) r5 = x(3) - x(7) r6 = x(4) + x(8) r7 = x(4) - x(8) y() = r + r4 y(2) = r + r5 y(3) = r - r4 y(4) = r - r5 y(5) = r2 + r7 y(6) = r3 - r6 y(7) = r2 - r7 y(8) = r3 + r6

16 SPL Compiler: Summary SPL Formula Abstract Syntax Tree SPL Program Symbol Definition Parsing Symbol Table Intermediate Code Generation I-Code Intermediate Code Restructuring I-Code Optimization I-Code Target Code Generation C, FORTRAN function Template Definition Template Table Built-in optimizations: single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE Extensible through templates

17 SIMD Short Vector Extensions SIMD vector length = 4 (4-way) + x Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up Problems: SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited

18 Vector code generation from SPL formulas Naturally vectorizable construct A I 4 A k i= i i vector length (Current) generic construct completely vectorizable: P D ( A I ) E Vectorization in two steps: i υ i Q i x P i, Q i D i, E i. Formula manipulation using manipulation rules 2. Code generation (vector code + C code) A i ν y permutations diagonals arbitrary formulas SIMD vector length

19 DSP Transform Algorithm (Formula) Search Implementation

20 Why Search? DCT, type IV, size 6 ~3 formulas maaaany different formulas large spread in runtimes, even for modest size not due to arithmetic cost

21 Search Methods available in SPIRAL Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm) Possible Sizes Formulas Timed Results Exhaust Very small All Best DP All s-s (very) good Random All User decided fair/good Hill Climbing All s-s Good STEER All s-s (very) good Search over algorithm space and implementation options (degree of unrolling)

22 STEER Population n: Mutation Cross-Breeding expand differently Population n+: swap expansions Survival of Fittest

23 Learning to Generate Fast Algorithms Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHT

24 SPIRAL

25 SPIRAL system user DSP transform specifies goes for a coffee S P I R A L Formula Generator fast algorithm as SPL formula SPL Compiler C/Fortran/SIMD code controls algorithm generation controls implementation options runtime (performance) Search Engine (or an espresso for small transforms) platform-adapted implementation comes back

26 SPIRAL System: Summary Available for download: Easy installation (Unix: configure/make; Windows: install shield) Unix/Linux and Windows 98/ME/NT/2/XP Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I IV, MDCT, Filters, Wavelets, Toeplitz, Circulants Extensible

27 Experimental Results search methods (applicable to all transforms) high performance code (compared with FFTW) different transforms

28 Vectorized code WHT DFT normalized runtime normalized runtime transform size 2-dim DCT normalized runtime transform size speed-ups up to a factor of 2.5 beats hand-tuned Intel MKL (< 24) transform size (Pentium III, SSE)

29 Conclusions Closing the gap between math domain (algorithms) and implementation domain (programs) Mathematical computer representation of algorithms Automatic translation of algorithms into code Implement constructs not transforms Optimization as intelligent search/learning in the space of alternatives High level: Mathematical manipulation of algorithms Low level: Coding degrees of freedom

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing special topic course 8-799B spring 25 24 th and 25 th Lecture Apr. 7 and 2, 25 Instructor: Markus Pueschel TA: Srinivas Chellappa Research Projects Presentations