
1 Scout - A Source-to-Source Transformator for SIMD-Optimizations
Center for Information Services and High Performance Computing (ZIH), TU Dresden
Zellescher Weg 12, Willers-Bau A 105
Olaf Krzikalla (olaf.krzikalla@tu-dresden.de)

2 Scout - what is it?
- A newly developed source-to-source transformator
- Based on the open-source compiler Clang; usable as a preprocessor
- Optimizes and vectorizes C source code
- Provides common optimizations such as inlining, loop splitting and loop unrolling, which reduce control-flow statements and function calls in loop bodies
- Focuses on vectorizing loops with SIMD instructions at the source level; no extensive dependency or data-flow analysis is performed
- Pragma-controlled optimizations (a usage sketch follows below), e.g.
      #pragma scout function expand
      #pragma scout condition invariant
      #pragma scout loop vectorize [align]
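As an illustration of the pragma-driven workflow, here is a minimal hand-written sketch of how such an annotation could appear in user code; the function, array names and loop body are hypothetical, only the pragma spelling is taken from the slide:

    /* Ask Scout to vectorize the following loop; the optional "align"
       clause would additionally request aligned memory accesses. */
    void scale(float *restrict dst, const float *restrict src, int n)
    {
        #pragma scout loop vectorize
        for (int i = 0; i < n; ++i)
            dst[i] = 2.0f * src[i] + 1.0f;
    }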

3 Scout - why use it?
- Adaptable to various SIMD architectures (e.g. SSE2, Larrabee): vector size, alignment requirements and instruction sets are configurable via configuration files; this flexibility is a must-have, as SIMD is a very vibrant topic
- Independent of the compilers available on the target platforms
- Semi-automatic generation of SIMD code: writing SIMD code by hand quickly becomes tedious, while fully automatic vectorization must be conservative
- Aggressive optimizations are possible: the programmer has to actively mark the code parts to be optimized, and the transformed source code is still human-readable
- Open source guarantees completely transparent optimization approaches

4-9 Scout - how does it look like?
(Six example slides; their contents were not captured in this transcription.)

10 Scout - how does the vectorization work?
- Precondition: an inner Fortran-like loop
  - for-condition of the form i < expr or i != expr
  - for-increment of the form ++i or i++
  - increment and condition refer to the same integral variable
- Basic-block-level vectorization (see the sketch below)
  - basic block: a sequence of expressions without control-flow statements
  - complex expressions are split into sequences of binary operations
  - a library is used for common mathematical functions (e.g. abs, exp)
  - unsupported operations (function calls, control flow) are unrolled
- Optional: aligned memory access
  - introduces a prolog loop for alignment
  - apparently inefficient for CFD loops (mostly array-of-structures data)
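To make the transformation concrete, here is a hand-written sketch of the kind of source-level SIMD code described above, assuming an SSE configuration with vector size 4; the function and array names are made up, and the intrinsic-based output is illustrative rather than actual Scout output:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Original loop: meets the preconditions (condition i < n, increment ++i). */
    void mul_add(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] * b[i] + a[i];
    }

    /* A source-level SIMD version of the same loop: the vectorized part
       processes four elements per iteration with unaligned accesses,
       followed by a scalar epilog for the remaining iterations. */
    void mul_add_simd(float *c, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(_mm_mul_ps(va, vb), va));
        }
        for (; i < n; ++i)
            c[i] = a[i] * b[i] + a[i];
    }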

11 Scout - loop collapsing
Original perfectly nested loop:

    for (i = 0; i < i_range; ++i)
        for (j = 0; j < j_range; ++j)
            /* body */

Transformed perfectly nested loop:

    for (i = 0; i < i_range; ++i) {
        /* vectorized part (vector size is 4) */
        for (j = 0; j < j_range - 4; j += 4)
            /* vectorized body */
        /* epilog (not vectorized) */
        for (; j < j_range; ++j)
            /* body */
    }

12 Scout - loop collapsing
Collapsed loop:

    for (temp = 0; temp < i_range * j_range; ++temp) {
        /* compute i and j from temp */
        /* body */
    }

Multiple perfectly nested loops are a common pattern in existing CFD codes.

13 Scout - loop collapsing
Transformed collapsed loop:

    for (temp = 0; temp < i_range * j_range - 4; temp += 4) {
        /* compute the next four i from temp? */
        /* compute the next four j from temp? */
        /* vectorized body */
    }
    for (; temp < i_range * j_range; ++temp) {
        /* compute i and j from temp */
        /* body */
    }

After loop collapsing, the non-vectorized epilog is executed only once.
But: is there a cheap automatic method to compute i and j from temp, or is the programmer responsible for loop collapsing?
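As a point of reference for the question above, the most direct way to recover the indices is an integer division and a modulo per iteration; whether this is cheap enough inside a hot loop is exactly the open issue, and the sketch below is only an illustration, not Scout's strategy:

    /* Index recovery for the collapsed loop: one division and one modulo
       per iteration.  The variable names mirror the pseudo-code above. */
    for (temp = 0; temp < i_range * j_range; ++temp) {
        int i = temp / j_range;
        int j = temp % j_range;
        /* body */
    }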

14 Scout - perfectly nested loop-variant conditions
Yet another common pattern in our codes:

    for (i = 0; i < i_range; ++i)
        if (condition(i))
            /* body */

This prevents the complete loop from getting vectorized!

15 Scout - perfectly nested loop-variant conditions
A possible solution (a concrete sketch follows below):

    for (i = 0; i < i_range; ++i)
        if (condition(i)) {
            /* temporarily store i */
            if (/* enough i are stored */) {
                /* vectorized body */
                /* clear store */
            }
        }
    for (/* remaining i in the store */)
        /* body */

Is this approach extensible to imperfectly nested conditions?
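A hand-written rendering of this index-buffering idea in plain C, with a buffer size of 4 to match the vector width used earlier; condition() and body() are placeholders from the slide's pseudo-code, and the buffered part is shown scalar where the vectorized body would go:

    int store[4];
    int count = 0;
    for (i = 0; i < i_range; ++i) {
        if (condition(i)) {
            store[count++] = i;            /* temporarily store i */
            if (count == 4) {              /* enough i are stored */
                for (int k = 0; k < 4; ++k)
                    body(store[k]);        /* vectorized body would go here */
                count = 0;                 /* clear store */
            }
        }
    }
    for (int k = 0; k < count; ++k)        /* remaining i in the store */
        body(store[k]);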

16 Scout - today
Machine: Intel Core2 Duo, Windows Vista 32-bit
Two benchmark charts comparing original, inlined and vectorized code, each built with gcc 4.4, msvc 9.0 and intel 11.1 (charts not reproduced in this transcription):
- calcmuelaminar: 6 vectorized / 2 unrolled ops
- fluxviscfull_f (prturb disabled): 276 vectorized / 100 unrolled ops
Observations:
- Memory access is not measurably accelerated by SIMD
- Computationally intensive loops gain a significant speedup

17 Scout - the future
- Find or build a library of commonly used vectorized mathematical functions, e.g. _mm_abs_ps, _mm_exp_ps (Intel 11.1 already ships <ia32intrin.h>)
- Complete the Scout implementation:
  - currently it deals only with parts of CFD codes
  - vectorization of nested conditions
  - analyze the possible speedup of loop collapsing
- Investigate other language approaches (e.g. CUDA) with respect to Scout:
  - the analysis framework for loops already exists in Scout
  - extend Scout's source transformation capabilities
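As a flavour of what one entry of such a math library could look like, here is a common hand-written idiom for a vectorized absolute value built from standard SSE intrinsics; the function name is made up and this is not Scout's or Intel's actual implementation:

    #include <xmmintrin.h>

    /* Absolute value of four packed floats: clear the sign bit of every
       lane by AND-NOT-ing with a mask that has only the sign bits set. */
    static inline __m128 scout_abs_ps(__m128 x)
    {
        const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* 0x80000000 in each lane */
        return _mm_andnot_ps(sign_mask, x);
    }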
