Tackling Parallelism With Symbolic Computation


1 Tackling Parallelism With Symbolic Computation
Markus Püschel and the Spiral team (only part shown)
With: Franz Franchetti, Yevgen Voronenko
Electrical and Computer Engineering, Carnegie Mellon University
This work was supported by DARPA, NSF-NGS/ITR, ACR, CPA, Intel, Mercury, National Instruments


2 The Problem: Example DFT
Discrete Fourier Transform (DFT) on 2 x Core 2 Duo, 3 GHz (single precision)
[Plot: performance in Gflop/s versus input size, comparing the best code with Numerical Recipes]
- Standard desktop computer, vendor compiler, using optimization flags
- All implementations have roughly the same operations count, $4n\log_2(n)$
- Maybe the DFT is just difficult?

3 Evolution of Processors (Intel)
- Era of parallelism
- High performance library development becomes increasingly difficult

4 DFT Plot: Analysis
- Multiple threads: 2x
- Memory hierarchy: 5x
- Vector instructions: 3x
High performance library development has become a nightmare

5 Organization
- Spiral: Brief overview
- Parallelization using symbolic computation
- Conclusion

6 Spiral
- Library generator for linear transforms (DFT, DCT, DWT, filters, ...) and recently more
- Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog
Research Goal: Teach computers to write fast libraries
- Complete automation of implementation and optimization
- Conquer the high algorithm level for automation
When a new platform comes out: Regenerate a retuned library
When a new platform paradigm comes out (e.g., multicore): Update the tool and then regenerate the library


7 How Spiral Works
Spiral: Complete automation of the implementation and optimization task
Basic ideas:
- Declarative representation of algorithms
- Rewriting systems to generate and optimize algorithms at a high level of abstraction
Flow:
- Problem specification (transform)
- Algorithm level: algorithm generation and algorithm optimization, producing an algorithm
- Implementation level: implementation and code optimization, producing C code
- Compilation level: compilation and compiler optimizations, producing a fast executable
- Search: measured performance is fed back to control the algorithm and implementation levels

8 Linear Transforms
Mathematically: Matrix-vector multiplication
\[ y = T x \]
where y is the output vector, T is the transform (a matrix), and x is the input vector
Example: Discrete Fourier transform (DFT)
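To make the matrix-vector view concrete, here is a minimal C sketch. The function names `dft_matrix` and `transform_apply` are hypothetical, and the primitive n-th root of unity `w` is passed in by the caller because the sign convention is not fixed by the slide text. It builds the matrix with entries w^(k*l) and applies it in the defining O(n^2) form.

```c
#include <assert.h>
#include <complex.h>

/* Fill the n x n matrix T (row-major) with T[k][l] = w^(k*l).
   The caller supplies w, a primitive n-th root of unity
   (for n = 4: I or -I, depending on the sign convention). */
void dft_matrix(int n, double complex w, double complex *T)
{
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        double complex wk = 1.0;            /* becomes w^k */
        for (int i = 0; i < k; i++)
            wk *= w;
        double complex entry = 1.0;         /* w^(k*l), starting at l = 0 */
        for (int l = 0; l < n; l++) {
            T[k * n + l] = entry;
            entry *= wk;
        }
    }
}

/* Apply a linear transform by definition: y = T * x. */
void transform_apply(int n, const double complex *T,
                     const double complex *x, double complex *y)
{
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        double complex sum = 0.0;
        for (int l = 0; l < n; l++)
            sum += T[k * n + l] * x[l];
        y[k] = sum;
    }
}
```

Calling `dft_matrix(4, -I, T)` followed by `transform_apply(4, T, x, y)` gives a 4-point DFT under the e^(-2*pi*i/n) convention. Fast algorithms compute the same y with far fewer operations.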

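9 Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):

\[
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\]

\[
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
\]

Legend: DFT = Fourier transform, T = diagonal matrix (twiddles), \otimes = Kronecker product, I = identity, L = permutation
- Algorithms are divide-and-conquer: Breakdown rules
- Mathematical, declarative representation: SPL (signal processing language)
- SPL describes the structure of the dataflow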
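The four factors named on this slide (permutation, Kronecker products with the identity, twiddle diagonal) can be read as four small dataflow stages applied right to left. The following C sketch is an illustration under stated assumptions, not Spiral output: the name `fft4` is hypothetical, and the primitive 4th root of unity `w` is a parameter so that either sign convention can be used.

```c
#include <assert.h>
#include <complex.h>

/* 4-point DFT computed as (DFT_2 (x) I_2) T (I_2 (x) DFT_2) L,
   with the factors applied right to left. w must be a primitive
   4th root of unity: I or -I. */
void fft4(double complex w, const double complex *x, double complex *y)
{
    double complex s[4], t[4];

    assert(w * w == -1.0);

    /* L: stride permutation, gathers even-indexed then odd-indexed inputs. */
    s[0] = x[0]; s[1] = x[2]; s[2] = x[1]; s[3] = x[3];

    /* I_2 (x) DFT_2: one 2-point butterfly per contiguous block. */
    t[0] = s[0] + s[1]; t[1] = s[0] - s[1];
    t[2] = s[2] + s[3]; t[3] = s[2] - s[3];

    /* T = diag(1, 1, 1, w): twiddle factors. */
    t[3] *= w;

    /* DFT_2 (x) I_2: 2-point butterflies at stride 2. */
    y[0] = t[0] + t[2]; y[1] = t[1] + t[3];
    y[2] = t[0] - t[2]; y[3] = t[1] - t[3];
}
```

Each stage touches the data in a different pattern (permuted, contiguous, pointwise, strided), which is the dataflow structure that the formula describes.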

10 Breakdown Rules (>200 for >50 Transforms)
Combining these rules yields many algorithms for every given transform

11 SPL to Sequential Code
Example: tensor product
- Correct code: easy
- Fast code: very difficult
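As a rough illustration of why correct code is the easy part, the two basic tensor product forms translate directly into loop nests. This is a sketch only: the function names and the row-major layout of A are assumptions, and nothing here is tuned for speed.

```c
#include <assert.h>

/* y = (I_n (x) A) x for an m x m matrix A (row-major):
   A is applied to n contiguous blocks of length m. */
void apply_I_tensor_A(int n, int m, const double *A,
                      const double *x, double *y)
{
    assert(n > 0 && m > 0);
    for (int b = 0; b < n; b++)
        for (int r = 0; r < m; r++) {
            double sum = 0.0;
            for (int c = 0; c < m; c++)
                sum += A[r * m + c] * x[b * m + c];
            y[b * m + r] = sum;
        }
}

/* y = (A (x) I_n) x: the same A, applied to n interleaved
   subvectors, each accessed at stride n. */
void apply_A_tensor_I(int n, int m, const double *A,
                      const double *x, double *y)
{
    assert(n > 0 && m > 0);
    for (int j = 0; j < n; j++)
        for (int r = 0; r < m; r++) {
            double sum = 0.0;
            for (int c = 0; c < m; c++)
                sum += A[r * m + c] * x[c * n + j];
            y[r * n + j] = sum;
        }
}
```

The two functions differ only in their index expressions: contiguous blocks in the first, stride-n access in the second. Making such loops fast (merging, unrolling, vectorizing, fitting the memory hierarchy) is where the difficulty lies.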


12 Program Generation in Spiral (Sketched)
- Transform (user specified)
- Fast algorithm in SPL (many choices)
- Σ-SPL
- C code
Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling
Iteration of this process to search for the fastest
But that's not all

13 SPL to Shared Memory Code: Basic Idea
Key construct: Tensor product
- p-way embarrassingly parallel, load-balanced
- [Diagram: x is split into four blocks; each block is multiplied by A on one of Processor 0 to Processor 3 to produce y]
Problematic construct: Permutations produce false sharing
- [Diagram: a permutation from x to y that crosses cacheline boundaries]
Task: Rewrite formulas to extract tensor product + avoid false sharing
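The key construct can be sketched as a parallel loop in which each processor applies A to its own contiguous block. This is an assumption-laden illustration, not Spiral's generated code: the function name is hypothetical, OpenMP is one possible threading choice, and `mu` stands for the cache line length measured in vector elements.

```c
#include <assert.h>

/* y = (I_p (x) A) x with one block per processor.
   A is m x m (row-major); x and y have length p*m.
   Each iteration reads and writes only its own contiguous block, so
   iterations are independent and do equal work. If m is a multiple of
   mu and y starts on a cache line boundary, no two iterations write
   to the same cache line, so there is no false sharing. */
void parallel_I_tensor_A(int p, int m, int mu, const double *A,
                         const double *x, double *y)
{
    assert(p > 0 && m > 0 && mu > 0);
    assert(m % mu == 0);   /* blocks end on cache line boundaries */

    /* Without OpenMP support this is simply a serial loop. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int proc = 0; proc < p; proc++) {
        const double *xb = x + proc * m;
        double *yb = y + proc * m;
        for (int r = 0; r < m; r++) {
            double sum = 0.0;
            for (int c = 0; c < m; c++)
                sum += A[r * m + c] * xb[c];
            yb[r] = sum;
        }
    }
}
```

A permutation, by contrast, may scatter single elements so that different processors write into the same cache line. This is why the stated task is to rewrite formulas until only block-structured constructs like this one remain.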

14 Step 1: Shared Memory Tags
Identify crucial hardware parameters
- Number of processors: p
- Cache line size: μ
Introduce them as tags in SPL
This means: SPL formula (algorithm) A is to be optimized for p processors and cache line size μ

15 Step 2: Identify Good Formulas
- Load balanced, avoiding false sharing
- Tagged operators (no further rewriting necessary)
Definition: A formula is fully optimized for (p, μ) if it is one of the above or of the form where A and B are fully optimized

16 Step 3: Identify Rewriting Rules
Goal: Transform formulas into fully optimized formulas
- Formulas rewritten, tags propagated
- There may be choices
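The slide's own rules did not survive in the text, so the following is only an illustration of what a rewriting rule does, using a standard tensor product identity: A_m (x) I_n = L^{mn}_m (I_n (x) A_m) L^{mn}_n. It replaces strided access by two explicit permutations around a block-structured construct. All function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* y = L^N_s x (stride permutation): y[j*(N/s) + i] = x[i*s + j]
   for 0 <= i < N/s and 0 <= j < s. */
void stride_perm(int N, int s, const double *x, double *y)
{
    assert(s > 0 && N > 0 && N % s == 0);
    int q = N / s;
    for (int i = 0; i < q; i++)
        for (int j = 0; j < s; j++)
            y[j * q + i] = x[i * s + j];
}

/* y = (I_n (x) A) x, A is m x m (row-major): contiguous blocks. */
void block_apply(int n, int m, const double *A,
                 const double *x, double *y)
{
    for (int b = 0; b < n; b++)
        for (int r = 0; r < m; r++) {
            double sum = 0.0;
            for (int c = 0; c < m; c++)
                sum += A[r * m + c] * x[b * m + c];
            y[b * m + r] = sum;
        }
}

/* y = (A (x) I_n) x, computed through the rewritten right-hand side
   L^{mn}_m (I_n (x) A) L^{mn}_n, applied right to left.
   Returns 0 on success, -1 if scratch memory cannot be allocated. */
int A_tensor_I_rewritten(int n, int m, const double *A,
                         const double *x, double *y)
{
    int N = m * n;
    double *z = malloc(sizeof *z * (size_t)N);
    double *u = malloc(sizeof *u * (size_t)N);
    if (z == NULL || u == NULL) {
        free(z);
        free(u);
        return -1;
    }
    stride_perm(N, n, x, z);     /* L^{mn}_n */
    block_apply(n, m, A, z, u);  /* I_n (x) A */
    stride_perm(N, m, u, y);     /* L^{mn}_m */
    free(z);
    free(u);
    return 0;
}
```

Both sides of the identity compute the same result, but they have different access patterns. A rewriting system can pick the side that suits the target, which is one way to read "there may be choices".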

17 Parallelization by Rewriting
Fully optimized (load-balanced, no false sharing) in the sense of our definition

18 Same Approach for Different Paradigms
- Threading
- Vectorization
- GPUs
- Verilog for FPGAs
Rigorous, correct by construction
Overcomes compiler limitations

19 Example Result: DFT on Intel Multicore
- Computer generated code: Up to 4 threads, 2-way vectorized, optimized for the memory hierarchy
- Faster than any hand-written code
- Similar results for other transforms and computing platforms


20 Same Approach Beyond Transforms
- Matrix-Matrix Multiplication
- Viterbi Decoder [Diagram: convolutional encoder, Viterbi decoder]
- JPEG 2000 (Wavelet, EBCOT) [Diagram: DWT, quantization, entropy coding (EBCOT + MQ)]
- Synthetic Aperture Radar (SAR) [Diagram: preprocessing, matched filtering, interpolation, 2D iFFT]


21 Conclusion
- Spiral: Successful approach to automate the development of performance libraries
- Commercially used by Intel

    void dft64(float *Y, float *X) {
        __m512 U912, U913, U914, U915,...
        __m512 *a2153, *a2155;
        a2153 = ((__m512 *) X);
        s1107 = *(a2153);
        s1108 = *((a ));
        t1323 = _mm512_add_ps(s1107,s1108);
        t1324 = _mm512_sub_ps(s1107,s1108);
        <many more lines>
        U926 = _mm512_swizupconv_r32( );
        s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
                    _mm512_set_1to16_ps( ),0xaaaa,a2154,u926),t1341),
                    _mm512_mask_sub_ps(_mm512_set_1to16_ps( ), ),
                    _mm512_swizupconv_r32(t1341,_mm_swiz_reg_cdab));
        U927 = _mm512_swizupconv_r32
        <many more lines>
    }

Key idea: Symbolic computation
- Domain specific declarative mathematical language
- Difficult optimizations through rewriting
