Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance.

Size: px

Start display at page:

Download "Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance."

Barry Campbell
5 years ago
Views:

1 Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Platforms Performance Applications

2 What is Spiral? Traditionally Spiral Approach Spiral High performance library optimized for given platform Comparable performance High performance library optimized for given platform

rewriting abstraction search algorithm space Architectural parameter:

3 Main Idea: Program Generation Model: common abstraction = spaces of matching formulas pick architecture space abstraction ν p μ defines rewriting abstraction search algorithm space Architectural parameter: Vector length, #processors, optimization Kernel: problem size, algorithm choice

of algorithms Rewriting systems to generate and optimize algorithms at a high level of abstraction algorithm

4 Search How Spiral Works Problem specification ( DFT 1024 or DFT ) Complete automation of the implementation and optimization task Algorithm Generation Algorithm Optimization controls Basic ideas: Declarative representation of algorithms Rewriting systems to generate and optimize algorithms at a high level of abstraction algorithm Implementation Code Optimization C code Compilation Compiler Optimizations controls performance Fast executable Spiral

Algorithms: Rules in Domain Specific Language

encoder Viterbi decoder 010001 11 10 00 01 10 01

Matrix-Matrix Multiplication Synthetic Aperture

5 Algorithms: Rules in Domain Specific Language Linear Transforms Viterbi Decoding convolutional encoder Viterbi decoder Matrix-Matrix Multiplication Synthetic Aperture Radar (SAR) = preprocessing matched filtering interpolation 2D ifft

6 One Approach for all Types of Parallelism Multithreading (Multicore) Vector SIMD (SSE, VMX/Altivec, ) Message Passing (Clusters, MPP) Streaming/multibuffering (Cell) Graphics Processors (GPUs) Gate-level parallelism (FPGA) HW/SW partitioning (CPU + FPGA)

embarrassingly parallel operator A A A A Processor 0 Processor 1

7 Example: Code Generation for Multicore CPUs Hardware abstraction: shared cache with cache lines Tensor product: embarrassingly parallel operator A A A A Processor 0 Processor 1 Processor 2 Processor 3 Permutation: problematic; may produce false sharing x y x y

8 Spiral: Meta-Tool to Autotuning Libraries Input: Transform: Algorithms: Vectorization: 2-way SSE Threading: Yes Output: Optimized library (10,000 lines of C++) For general input size (not collection of fixed sizes) Vectorized Multithreaded With runtime adaptation mechanism Performance competitive with hand-written code Spiral Library Generator High-Performance Library FFTW-like

9 Verification and Testing Verify algorithms symbolically =? Check rules through verification of instances =? Check code empirically =? DFT4([0,1,0,0]) DFT4([0.1,1.77,2.28,-55.3]) =? DFT4_rnd([0.1,1.77,2.28,-55.3]))

Range: Cell Phone To Supercomputer Global FFT (1D FFT, HPC Challenge) performance [Gflop/s] 6.

2GHz with NEON ISA SIMD vectorization + multi-threading BlueGene/P at Argonne National Laboratory

Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y.

10 Range: Cell Phone To Supercomputer Global FFT (1D FFT, HPC Challenge) performance [Gflop/s] 6.4 Tflop/s BlueGene/P Samsung i9100 Galaxy S II Dual-core ARM at 1.2GHz with NEON ISA SIMD vectorization + multi-threading BlueGene/P at Argonne National Laboratory 128k cores (quad-core CPUs) at 850 MHz SIMD vectorization + multi-threading + MPI G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

11 More Results: Spiral Outperforms Humans FFT on Multicore SAR SDR FFT on FPGA improvement

12 More Information:

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Franz Franchetti 1, Yevgen Voronenko 2, Gheorghe Almasi 3 1 University and SpiralGen, Inc. 2 AccuRay, Inc., 3 IBM Research