Specializing Code for FFT Libraries. Minhaj Ahmad Khan and Henri Pierre Charles, University of Versailles Saint-Quentin-en-Yvelines, France


1 Specializing Code for FFT Libraries
  Minhaj Ahmad Khan, Henri Pierre Charles
  University of Versailles Saint-Quentin-en-Yvelines, France

2 Outline
  Specialization Issues for FFT Libraries
  Limited Code Specialization
  Experimental Results
  Related Work
  Conclusion and Future Work
  LCPC

3 FFT Libraries Problem Statement
  Unavailability of values
    Optimizations largely depend on integer parameters
      Strides, loop trip counts, etc.
      Function parameters
  FFTW, Scimark, GSL-FFT, Numutils, FFT2, Kiss-FFT

    void Function(int size, double *a, double *b, int dist, int stride)
    {
        int i;
        /* stride and dist are only known at run time */
        for (i = 0; i < size; i++, a += dist, b += dist) {
            a[stride * 2] = a[stride] + b[stride];
        }
    }

4 Problem Statement
  Code Specialization
    Exposing the value (guided by profiling information)
    Optimizations
      Partial evaluation
      Software pipelining (swp), loop unrolling, etc.
    Performed at
      Static compile time
        Code explosion (I-cache impact)
      Dynamic compile time
        Overhead of code generation
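
As an illustration of what exposing a value gives the compiler, the sketch below hand-specializes the slide-3 Function for stride == 5 and dispatches to it. The names Function_generic, Function_stride5 and Function_dispatch are hypothetical and not from the slides. A static specializer would emit one such version per profiled value, which is where the code-size growth comes from.

```c
#include <assert.h>

/* Generic version from the slides: stride is unknown at compile time. */
void Function_generic(int size, double *a, double *b, int dist, int stride)
{
    for (int i = 0; i < size; i++, a += dist, b += dist)
        a[stride * 2] = a[stride] + b[stride];
}

/* Hypothetical version specialized for stride == 5: every index is now
 * a compile-time constant that the compiler can fold into immediates. */
void Function_stride5(int size, double *a, double *b, int dist)
{
    for (int i = 0; i < size; i++, a += dist, b += dist)
        a[10] = a[5] + b[5];
}

/* Hypothetical dispatcher: use the specialized version when it applies,
 * fall back to the generic code otherwise. */
void Function_dispatch(int size, double *a, double *b, int dist, int stride)
{
    assert(a != 0 && b != 0);
    if (stride == 5)
        Function_stride5(size, a, b, dist);
    else
        Function_generic(size, a, b, dist, stride);
}
```

Calling Function_dispatch(size, a, b, dist, 5) produces the same result as the generic version; only the code path differs.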

5 Limited Code Specialization
  Improved performance
    Values are available at static compile time
  Limited number of versions
    Through dynamic specialization: a single generic version is adapted to multiple values, and separate versions are kept only for values for which no generic template could be found
  Reduced runtime overhead
    Through specialization of a limited number of instructions

    //With Stride = 5
    { .mii
      sub r24=r30,r19
      add r11=40,r28
      nop.i 0 ;;
    }

    //With Stride = 7
    { .mii
      sub r24=r30,r19
      add r11=56,r28
      nop.i 0 ;;
    }

    //With Stride = 10
    { .mii
      sub r24=r30,r19
      add r11=80,r28
      nop.i 0 ;;
    }

6 Dynamic Specialization
  [Workflow diagram; boxes in the order they appear]
    Function Code
    Source Code Modification
    Specialized Code with Wrapper
    Static Compilation
    Specialized Templates
    Invariants Analysis
    Runtime Specializer Generation
    Formulae + Locations
    Native (Executable) Code
    Runtime Specializer with Binary Template Specializer
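
The role of the wrapper in this workflow can be sketched as follows. This is a simplified model under stated assumptions: the slides do not show the wrapper's code, so the policy of re-specializing only when the parameter changes, and the names Function_wrapper, specialize_template and template_code, are hypothetical. The "template" here is ordinary C reading a variable, standing in for binary code whose immediates would be rewritten.

```c
#include <assert.h>

/* Value the template is currently specialized for (-1: not yet). */
static long current_param = -1;
/* Counts specializer invocations so the effect can be observed. */
static int specializer_calls = 0;

/* Stand-in for the generated runtime specializer. The real one would
 * rewrite immediate fields inside the binary template. */
static void specialize_template(long param)
{
    current_param = param;
    specializer_calls++;
}

/* Stand-in for the binary template: behaves as if stride were baked in. */
static void template_code(int size, double *a, double *b, int dist)
{
    long stride = current_param;
    for (int i = 0; i < size; i++, a += dist, b += dist)
        a[stride * 2] = a[stride] + b[stride];
}

/* Hypothetical wrapper: specialize only when the parameter changes,
 * then run the (already adapted) template. */
void Function_wrapper(int size, double *a, double *b, int dist, int stride)
{
    assert(stride >= 0);
    if (stride != current_param)
        specialize_template(stride);
    template_code(size, a, b, dist);
}
```

With this policy, repeated calls using the same stride pay the specialization cost once, which matches the small average overhead the slides report, although the actual wrapper logic may differ.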

7 Dynamic Specialization
  Specialize the code for integer parameters to obtain versions
  Find a valid template
    Find similar code versions that differ only in immediate constants
    Values should follow an affine formula, V = A * param + B, where A and B are constants
    To confirm that the coefficients A and B at a given instruction are the same across versions, solve the system of equations
    For n parameters, n+1 versions must be validated
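
For the single-parameter case, the affine check can be sketched as below: two versions give two equations, V1 = A * p1 + B and V2 = A * p2 + B, which determine A and B, and any further version can be checked against them. The type AffineCoeffs and the functions fit_affine and check_affine are hypothetical names, and the sketch assumes integer coefficients.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    long a;  /* coefficient A */
    long b;  /* constant B */
} AffineCoeffs;

/* Solve v1 = A*p1 + B and v2 = A*p2 + B for integer A and B.
 * Returns false if the two samples do not determine an integer solution. */
bool fit_affine(long p1, long v1, long p2, long v2, AffineCoeffs *out)
{
    assert(out != 0);
    if (p1 == p2)
        return false;            /* need two distinct parameter values */
    long dp = p2 - p1;
    long dv = v2 - v1;
    if (dv % dp != 0)
        return false;            /* A would not be an integer */
    out->a = dv / dp;
    out->b = v1 - out->a * p1;
    return true;
}

/* Check that an additional version agrees with the fitted formula. */
bool check_affine(AffineCoeffs c, long param, long value)
{
    return c.a * param + c.b == value;
}
```

Using the immediates from the slides (40 for stride 5, 56 for stride 7), fit_affine yields A = 8 and B = 0, and the stride-10 version with immediate 80 passes check_affine.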

8 Dynamic Specialization (2)
  Find the range for which the template is valid
    Solve the system of equations a_i * param + b_i
    Instruction validation
  Generate a runtime specializer
    Modifies the binary code during execution to adapt it to these values
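
The slides do not say which constraints bound the valid range. One plausible constraint, assumed here purely for illustration, is that the computed value a_i * param + b_i must fit in the signed immediate field of the instruction being patched. The sketch below computes the parameter interval under that assumption; template_param_range, ParamRange and the imm_bits parameter are hypothetical, and a_i is assumed positive.

```c
#include <assert.h>
#include <stdbool.h>

/* Integer division rounding toward negative infinity. */
static long floor_div(long n, long d)
{
    long q = n / d;
    if ((n % d != 0) && ((n < 0) != (d < 0)))
        q--;
    return q;
}

/* Integer division rounding toward positive infinity. */
static long ceil_div(long n, long d)
{
    long q = n / d;
    if ((n % d != 0) && ((n < 0) == (d < 0)))
        q++;
    return q;
}

typedef struct {
    long lo;  /* smallest valid parameter value */
    long hi;  /* largest valid parameter value */
} ParamRange;

/* Range of param for which a*param + b fits in a signed field of
 * imm_bits bits. Assumes a > 0. Returns false if the range is empty. */
bool template_param_range(long a, long b, int imm_bits, ParamRange *out)
{
    assert(out != 0 && a > 0 && imm_bits >= 2 && imm_bits <= 30);
    long imm_min = -(1L << (imm_bits - 1));
    long imm_max = (1L << (imm_bits - 1)) - 1;
    out->lo = ceil_div(imm_min - b, a);
    out->hi = floor_div(imm_max - b, a);
    return out->lo <= out->hi;
}
```

For a template with several patched instructions, the overall valid range would be the intersection of the per-instruction ranges. With A = 8, B = 0 and an example field width of 14 bits, the range is -1024 to 1023.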

9 Example

    void Function(int size, double *a, double *b, int dist, int stride)
    {
        int i;
        /* stride and dist are only known at run time */
        for (i = 0; i < size; i++, a += dist, b += dist) {
            a[stride * 2] = a[stride] + b[stride];
        }
    }

10 Example

    //With Stride = 7
    Function:
    ...
    { .mii
      sub r24=r30,r19
      add r11=56,r28
      nop.i 0 ;;
    }
    { .mib
      add r14=28,r24
      nop.i 0
      br.cond.dptk .b1_3 ;;
    }
    ...
    { .mii
      add r24=28,r29
      add r19=28,r30
      add r16=56,r30
    }

    //With Stride = 5
    Function:
    ...
    { .mii
      sub r24=r30,r19
      add r11=40,r28
      nop.i 0 ;;
    }
    { .mib
      add r14=20,r24
      nop.i 0
      br.cond.dptk .b1_3 ;;
    }
    ...
    { .mii
      add r24=20,r29
      add r19=20,r30
      add r16=40,r30
    }
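
The comparison of two such versions can be sketched on a simplified model, in which each instruction is reduced to its text (with the immediate factored out) plus the immediate value. The Insn type and find_patch_sites function are hypothetical; a real implementation would work on decoded binary instructions rather than strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *text;  /* mnemonic and register operands, immediate removed */
    long imm;          /* immediate operand, 0 if the instruction has none */
} Insn;

/* Compare two versions instruction by instruction. Records the indices
 * where only the immediate differs. Returns the number of such sites,
 * or -1 if the versions differ in anything else (no common template)
 * or if more than max_sites differences are found. */
int find_patch_sites(const Insn *v1, const Insn *v2, size_t n,
                     size_t *sites, size_t max_sites)
{
    assert(v1 != NULL && v2 != NULL && sites != NULL);
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(v1[i].text, v2[i].text) != 0)
            return -1;                    /* different instruction */
        if (v1[i].imm != v2[i].imm) {
            if ((size_t)count == max_sites)
                return -1;                /* caller's buffer is full */
            sites[count++] = i;
        }
    }
    return count;
}
```

Applied to the first four instructions of the two listings above, the function reports the two add instructions as patch sites, which are then candidates for the affine fit.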

11 Generation of Runtime Specializer

    /* Writes new_value into the immediate field of one instruction,
       located by bundle and instruction position within the template. */
    void BinaryTemplateSpecializer(long base_address,
                                   int bundle_address,
                                   int instruction_address,
                                   int new_value);

    void Specializer_Function(long base_address, register long param)
    {
        BinaryTemplateSpecializer(base_address, 6, 1, param * 8 + 0);
        BinaryTemplateSpecializer(base_address, 7, 0, param * 4 + 0);
        ...
        BinaryTemplateSpecializer(base_address, 30, 0, param * 4 + 0);
        BinaryTemplateSpecializer(base_address, 30, 1, param * 4 + 0);
        BinaryTemplateSpecializer(base_address, 30, 2, param * 8 + 0);
        ...
    }
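
The same specializer can be viewed as a table of locations and formulae applied in a loop. The sketch below uses a deliberately simplified model in which the template is an array of immediates indexed by bundle and slot. It does not reproduce the real Itanium bundle encoding, and PatchSite and apply_patch_table are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { SLOTS_PER_BUNDLE = 3 };

/* One entry of the "Formulae + Locations" table. */
typedef struct {
    int  bundle;  /* bundle index within the template */
    int  slot;    /* instruction slot within the bundle (0..2) */
    long a;       /* coefficient A of the affine formula */
    long b;       /* constant B of the affine formula */
} PatchSite;

/* Simplified model: imm[bundle][slot] stands in for the immediate field
 * of the instruction at that location in the binary template. */
void apply_patch_table(long (*imm)[SLOTS_PER_BUNDLE], size_t n_bundles,
                       const PatchSite *table, size_t n_sites, long param)
{
    assert(imm != NULL && table != NULL);
    for (size_t i = 0; i < n_sites; i++) {
        const PatchSite *s = &table[i];
        assert(s->bundle >= 0 && (size_t)s->bundle < n_bundles);
        assert(s->slot >= 0 && s->slot < SLOTS_PER_BUNDLE);
        imm[s->bundle][s->slot] = s->a * param + s->b;
    }
}
```

With the five entries visible on the slide and param = 11, the model yields 88 and 44 at the patched locations, in line with the stride-11 listing on slide 13.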

12 Runtime Activities
  Initialization
    Specializer invocation
    Making the code segment modifiable
  Instruction specialization
    Calculation of the new value to insert (affine formula)
    Storing the new value at the specific location
  Cache coherence
    Flushing, synchronization
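
On a POSIX system with GCC or Clang, these three steps could look roughly like the sketch below. It is not the authors' implementation: patch_bytes is a hypothetical helper, it copies raw bytes instead of encoding an Itanium immediate, and it toggles page protection around the write rather than leaving the code segment writable. mprotect, sysconf and __builtin___clear_cache are the only platform facilities used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite len bytes at target, then restore final_prot on the pages
 * touched and flush the instruction cache for the patched range.
 * Returns 0 on success, -1 on failure. */
int patch_bytes(void *target, const void *new_bytes, size_t len,
                int final_prot)
{
    assert(target != NULL && new_bytes != NULL && len > 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    /* mprotect works on whole pages: round the range outwards. */
    uintptr_t mask  = (uintptr_t)page - 1;
    uintptr_t start = (uintptr_t)target & ~mask;
    uintptr_t end   = ((uintptr_t)target + len + mask) & ~mask;
    size_t    span  = (size_t)(end - start);

    /* 1. Make the pages writable. */
    if (mprotect((void *)start, span, PROT_READ | PROT_WRITE) != 0)
        return -1;

    /* 2. Store the new value at the specific location. */
    memcpy(target, new_bytes, len);

    /* 3. Restore protection and keep the instruction cache coherent. */
    if (mprotect((void *)start, span, final_prot) != 0)
        return -1;
    __builtin___clear_cache((char *)target, (char *)target + len);
    return 0;
}
```

For real code pages final_prot would typically be PROT_READ | PROT_EXEC, and the patching routine itself must not reside on a page it temporarily makes non-executable.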

13 Runtime View (After Specialization)

    //With Stride = 11
    { .mii
      sub r24=r30,r19
      add r11=88,r28
      nop.i 0 ;;
    }
    { .mib
      add r14=44,r24
      nop.i 0
      br.cond.dptk .b1_3 ;;
    }
    { .mii
      add r24=44,r29
      add r19=44,r30
      add r16=88,r30
    }

14 Performance Results
  icc v 9.0, Itanium-II 1.5GHz

                                 FFTW     GSL-FFT  Scimark  FFT2     Numutils  Kiss-FFT
  Avg. Specialization Overhead   1.00%    1.00%    2.00%    2.00%    1.00%     2.00%
  Code Size Increase             10.00%   2.00%    19.00%   10.00%   12.00%    11.00%
  Avg. Speedup

15 Performance Results (FFTW)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: n, m and t codelets

16 Performance Results (GSL-FFT)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: fft_complex_pass_n

17 Performance Results (GSL-FFT)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: fft_complex_radix2_transform

18 Performance Results (Scimark)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: fft_transform_internal

19 Performance Results (FFT2)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: join

20 Performance Results (Numutils)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: fft

21 Performance Results (Kiss-FFT)
  icc v 9.0, Itanium-II 1.5GHz
  [Plot: speedup vs. DFT size]
  Code specialized: kf_bfly2 and kf_bfly

22 Related Work
  Limited Code Specialization
    Static compile time: low-level code analysis, optimizations
    Dynamic compile time: binary instruction specialization
    Overhead: 12 to 20 CPI
  Tempo
    Static compile time: analyses, partial evaluation
    Dynamic compile time: Tick C/gcc to optimize
    Overhead: same as Tick C
  C-Mix
    Static compile time: partial evaluation
    Dynamic compile time: N.A.
    Overhead: static compile time
  Tick C
    Static compile time: analysis, different optimizations
    Dynamic compile time: optimizations using VCODE, ICODE
    Overhead: 100 (VCODE) or 300 to 800 (ICODE) CPI
  DCG
    Static compile time: analysis, optimizations
    Dynamic compile time: code generation
    Overhead: > CPI

23 Conclusion and Future Work
  Limited Code Specialization
    Performance improvement
    Reduced code explosion
    Minimal runtime overhead
    Heavily dependent on the optimizations performed by the compiler
  Runtime Specialization
    Template specialized and optimized at static compile time
    Low-level code analysis
    Generation of runtime specializer
    Runtime specialization of binary instructions
  Future work: cost analysis, dependency analysis, multiple platforms

24 -- Q & A --
