Advanced Parallel Programming II
1 Advanced Parallel Programming II Alexander Leutgeb, RISC Software GmbH, Johannes Kepler University Linz
2 Introduction to Vectorization
3 Motivation Increase in the number of cores; threading techniques improve performance. But flops per cycle of the vector units increased as much as the number of cores, so not using vector units wastes flops/watt. For best performance: use all cores and make efficient use of the vector units. Ignoring the potential of vector units is as inefficient as using only one core.
4 Vector Unit Single Instruction Multiple Data (SIMD) units, mostly for floating point operations; data parallelization with one instruction. A 64-bit unit does 1 DP flop or 2 SP flops per operation, a 128-bit unit 2 DP flops or 4 SP flops. Multiple data elements are loaded into vector registers and used by the vector units. Some architectures execute more than one vector instruction per cycle (e.g. Sandy Bridge).
5 Parallel Execution The scalar version works on one element at a time: a[i] = b[i] + c[i] * d[i]. The vector version carries out the same instruction on many elements at a time: a[i:8] = b[i:8] + c[i:8] * d[i:8], i.e. a[i] = b[i] + c[i] * d[i] through a[i+7] = b[i+7] + c[i+7] * d[i+7] in a single operation.
6 Vector Registers
7 Vector Unit Usage (Programmer's View) From ease of use to programmer control: use vectorized libraries (e.g. Intel MKL); fully automatic vectorization; auto vectorization hints (#pragma ivdep); the SIMD feature (#pragma simd and SIMD function annotation); vector intrinsics (e.g. _mm_add_ps()); ASM code (e.g. addps).
8 Auto Vectorization
9 Auto Vectorization Modern compilers analyse loops in serial code to identify candidates for vectorization, perform loop transformations to enable it, and use the instruction set of the target architecture.
10 Common Compiler Switches GCC and ICC Disable optimization: -O0. Optimize for speed, no code size increase: -O1. Optimize for speed (default): -O2. High-level optimizer, e.g. loop unrolling: -O3. Aggressive optimizations (e.g. ipo, -O3, ...): -fast. Create symbols for debugging: -g. Generate assembly files: -S. OpenMP support: -openmp (ICC; GCC uses -fopenmp).
11 Architecture Specific Compiler Switches GCC Optimize for current machine: -march=native. Generate SSE v1 code: -msse. Generate SSE v2 code (default, may also emit SSE v1 code): -msse2. Generate SSE v3 code (may also emit SSE v1 and v2 code): -msse3. Generate SSSE3 code (may also emit SSE v1, v2, and v3 code): -mssse3. Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code): -msse4.1. Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and v4.1 code): -msse4.2. Generate AVX code: -mavx. Generate AVX v2 code: -mavx2.
12 Architecture Specific Compiler Switches ICC Optimize for current machine: -xhost. Generate SSE v1 code: -xsse1. Generate SSE v2 code (default, may also emit SSE v1 code): -xsse2. Generate SSE v3 code (may also emit SSE v1 and v2 code): -xsse3. Generate SSE v3 code for Atom-based processors: -xsse_atom. Generate SSSE3 code (may also emit SSE v1, v2, and v3 code): -xssse3. Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code): -xsse4.1. Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and v4.1 code): -xsse4.2. Generate AVX code: -xavx. (For Intel processors use -x, for non-Intel processors use -m.)
13 Example 4 Simple Vector Addition 1. Go to the directory example_4. 2. Compile simple.cpp without vectorization and execute the binary: icc -std=c++11 -O2 -no-vec simple.cpp (g++ -std=c++11 -O2 simple.cpp). 3. Compile simple.cpp with vectorization and execute the binary: icc -std=c++11 -O2 simple.cpp (g++ -std=c++11 -O2 -ftree-vectorize simple.cpp). 4. What is the difference in the execution times? 5. Can the execution times be improved further?
14 SIMD Vectorization Basics Vectorization offers a good performance improvement on floating point intensive code. Vectorized code can compute slightly different results than non-vectorized code (x87 FPU: 80-bit, SIMD: 64-bit arithmetic). Even for scalar operations the vector unit is used. Vectorization is only one aspect of improving performance; efficient use of the cache is also necessary.
15 Parallelization at No Cost We tell the compiler to vectorize, but that's not the whole story: there are cases where a compiler cannot vectorize the code. How can we analyse such situations? With the vectorization report, generated by the compiler. The most interesting part is what was not done and why: focus on the code paths that were not vectorized.
16 Example 5 Vectorization Report 1. Go to the directory example_5. 2. Compile simple.cpp with vectorization report generation enabled: icc -std=c++11 -O2 -vec-report=2 simple.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed simple.cpp). 3. Which positive/negative information does the vectorization report give? 4. Insert the following code before std::cout in calcsp() and compile again. Have a look at the vectorization report. for (i = 1; i < VECSIZE; i++) { a[i] = a[i] + a[i - 1]; }
17 Vectorization Report ICC Which code was vectorized? Which code was not vectorized? Compiler switch -vec-report<n>: n=0: no diagnostic information; n=1 (default): vectorized loops; n=2: vectorized/non-vectorized loops (and why); n=3: additional dependency information; n=4: only non-vectorized loops; n=5: only non-vectorized loops and dependency information; n=6: vectorized/non-vectorized loops with details.
18 Loop Unrolling Unrolling allows the compiler to reconstruct the loop for vector operations. for (i = 0; i < N; i++) { a[i] = b[i] * c[i]; } becomes for (i = 0; i < N; i += 4) { a[i] = b[i] * c[i]; a[i + 1] = b[i + 1] * c[i + 1]; a[i + 2] = b[i + 2] * c[i + 2]; a[i + 3] = b[i + 3] * c[i + 3]; } which maps to: load b(i, ..., i+3); load c(i, ..., i+3); operate b * c -> a; store a(i, ..., i+3).
19 Requirements for Auto Vectorization Countable, single entry and exit: while (i < 100) { a[i] = b[i] * c[i]; if (a[i] < 0.0) break; ++i; } // data-dependent exit condition: loop not vectorized. Straight-line code (no switch; if only with masking): for (int i = 0; i < length; i++) { float s = b[i] * b[i] - 4 * a[i] * c[i]; if (s >= 0) x[i] = sqrt(s); else x[i] = 0.; } // loop vectorized (because of masking)
20 Requirements for Auto Vectorization Only the innermost loop (caution in case of loop interchange or loop collapsing). No function calls, except: intrinsic math functions (sin, log, ...), inline functions, and elemental functions (__attribute__((vector))).
21 Inhibitors of Auto Vectorization Non-contiguous data: // arrays accessed with stride 2: for (int i=0; i<size; i+=2) b[i] += a[i] + x[i]; // inner loop accesses a with stride SIZE: for (int j=0; j<size; j++) for (int i=0; i<size; i++) b[i] += a[i][j] * x[j]; // indirect addressing of x using an index array: for (int i=0; i<size; i+=2) b[i] += a[i] * x[index[i]]; Stride 1 is best. Caution in case of multi-dimensional arrays: the Fortran 90 loop nest do j=1,n / do i=1,n / a(i,j)=b(i,j)*s / enddo / enddo corresponds to the C loop nest for (j=0; j<n; j++) for (i=0; i<n; i++) a[j][i] = b[j][i] * s; (Fortran is column-major, C is row-major).
22 Inhibitors of Auto Vectorization Inability to rule out aliasing (overlapping data). Runtime check possible: multi-versioned code (vectorized/non-vectorized): void my_cp(int nx, double* a, double* b) { for (int i = 0; i < nx; i++) a[i] = b[i]; } Runtime check not possible: non-vectorized: void my_combine(int* ioff, int nx, double* a, double* b, double* c) { for (int i = 0; i < nx; i++) { a[i] = b[i] + c[i + *ioff]; } } This would vectorize with strict aliasing (-ansi-alias).
23 Inhibitors of Auto Vectorization Vector dependencies. Read after write (RAW), non-vectorizable: for (i = 1; i < N; i++) a[i] = a[i - 1] + b[i]; Write after read (WAR), vectorizable: for (i = 0; i < N - 1; i++) a[i] = a[i + 1] + b[i]; Read after read (RAR), vectorizable: for (i = 0; i < N; i++) a[i] = b[i % M] + c[i]; Write after write (WAW), non-vectorizable: for (i = 0; i < N; i++) a[i % M] = b[i] + c[i];
24 Efficiency Aspects for Auto Vectorization Alignment Address alignment (SSE: 16 bytes, AVX: 32 bytes) is checked at compile time where possible; otherwise it is checked at runtime, producing peel and remainder loops. Explicit definition in code: local/global: __attribute__((aligned(16))); heap: _mm_malloc, _mm_free; compiler support: __assume_aligned(x, 16). Example: void fill(char* x) { for (int i=0; i<1024; i++) x[i]=1; } Peeling: peel = (uintptr_t)x & 0x0f; if (peel != 0) { peel = 16 - peel; for (i = 0; i < peel; i++) x[i] = 1; } for (i = peel; i < 1024; i++) x[i] = 1;
25 Efficiency Aspects for Auto Vectorization Data Layout Structure of Arrays (SoA) instead of Array of Structures (AoS). struct Vector3d { // AoS double x; double y; double z; }; gives the memory layout x y z x y z x y z x y z. struct Vectors3d { // SoA double* x; double* y; double* z; }; gives the layout x x x x y y y y z z z z.
26 Example 6 Non-Contiguous Data 1. Go to the directory example_6. 2. Compile contig.cpp with the vectorization report enabled: icc -std=c++11 -O2 -vec-report=2 contig.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed contig.cpp). 3. What does the vectorization report tell you?
27 Compiler Directives ICC Many compilers have directives for vectorization hints. ivdep (C: #pragma ivdep, Fortran: !dec$ ivdep): asserts no vector dependency in the loop (usage in the presence of an actual dependency leads to incorrect code). vector always (C: #pragma vector always, Fortran: !dec$ vector always). Elemental functions: __attribute__((vector)).
28 Compiler Directives ICC SIMD Extensions SIMD pragma and SIMD function annotation __attribute__((simd)). Differences to the traditional ivdep and vector always: the traditional pragmas are more like hints, the new SIMD extension is more like an assertion. Fine control over auto vectorization with additional clauses (vectorlength, private, linear, reduction, assert).
29 Example 7 Vector Dependency Go to the directory example_7. Compile the file forward.cpp and execute the binary: icc -std=c++11 -O2 -vec-report=2 forward.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec forward.cpp). What is the execution time? What does the vectorization report tell you? What happens if you split the inner loop into two separate update loops for b and a?
30 Vector Intrinsics
31 Vector Intrinsics void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c) { size_t i; __m256 va; __m256 vb; __m256 vc; for (i = 0; i < a->size; i += 8) { va = _mm256_loadu_ps(&a->data[i]); vb = _mm256_loadu_ps(&b->data[i]); vc = _mm256_mul_ps(va, vb); _mm256_storeu_ps(&c->data[i], vc); } }
32 Vector Intrinsics Different data types for different architectures: SSE: __m128, __m128d, __m128i; AVX: __m256, __m256d, __m256i. Different operations for different architectures: SSE: _mm_add_ps(), _mm_add_pd(), ...; AVX: _mm256_add_ps(), _mm256_add_pd(), ... Portability: implement a wrapper for the different architectures.
33 Cilk Plus Array Notation
34 Cilk Plus Array Notation C/C++ language extension with support for data parallel operations. New language array expression: array-expression[lower-bound : length : stride]. Default values of each argument in [:]: lower-bound: 0; length: length of the array; stride: 1 (if the stride is default, the second : may be omitted). array-expression[:]: the array section is the entire array of known length with stride 1.
35 Cilk Plus Array Notation array-expression[:][:] denotes a two-dimensional array section. Two new terms: rank (number of array sections of a single array; rank zero means scalar) and shape (length of each array section). In a statement, all expressions must have the same rank and shape, or rank zero (which is broadcast to each element). Built-in functions (reductions, ...). Overlap between LHS and RHS array expressions is undefined behaviour (unless they overlap exactly).
36 Cilk Plus Array Notation No new data types (use of the existing array types in C and C++). Short-hand for an entire array: array[:]. Exception: dynamically allocated arrays require array[start : length]. Availability: Intel C/C++ compiler, GCC 5, a Clang/LLVM fork. Arrays used in the following examples: int a[10]; int b[10]; int c[10][10]; int d[10];
37 Examples Cilk array notation and equivalent scalar C/C++ code: a[:] = 5; ≡ for (i = 0; i < 10; i++) a[i] = 5; a[0:7] = 5; ≡ for (i = 0; i < 7; i++) a[i] = 5; a[7:3] = 4; ≡ for (i = 7; i < (7 + 3); i++) a[i] = 4; a[0:5:2] = 5; ≡ for (i = 0; i < 10; i += 2) a[i] = 5; a[1:5:2] = 4; ≡ for (i = 1; i < 10; i += 2) a[i] = 4; a[:] = b[:]; ≡ for (i = 0; i < 10; i++) a[i] = b[i]; a[:] = b[:] + 5; ≡ for (i = 0; i < 10; i++) a[i] = b[i] + 5;
38 Examples Cilk array notation and equivalent scalar C/C++ code: d[:] = a[:] + b[:]; ≡ for (i = 0; i < 10; i++) d[i] = a[i] + b[i]; a[0:n] = 5; ≡ for (i = 0; i < n; i++) a[i] = 5; c[:][:] = 12; ≡ for (i = 0; i < 10; i++) for (j = 0; j < 10; j++) c[i][j] = 12; c[0:5:2][:] = 12; ≡ for (i = 0; i < 10; i += 2) for (j = 0; j < 10; j++) c[i][j] = 12; c[4][:] = a[:]; ≡ for (j = 0; j < 10; j++) c[4][j] = a[j]; func(a[:]); ≡ for (i = 0; i < 10; i++) func(a[i]); d[:] = a[b[:]]; ≡ for (i = 0; i < 10; i++) d[i] = a[b[i]]; a[b[:]] = d[:]; ≡ for (i = 0; i < 10; i++) a[b[i]] = d[i];
39 Examples Cilk array notation and equivalent scalar C/C++ code: if (5 == a[:]) b[:] = 1; else b[:] = 0; ≡ for (i = 0; i < 10; i++) { if (5 == a[i]) b[i] = 1; else b[i] = 0; } if ((5 == a[:]) || (8 == a[:])) b[:] = 1; else b[:] = 0; ≡ for (i = 0; i < 10; i++) { if ((5 == a[i]) || (8 == a[i])) b[i] = 1; else b[i] = 0; } a[:] = b[:] < 5 ? b[:] : a[:]; ≡ for (i = 0; i < 10; i++) a[i] = b[i] < 5 ? b[i] : a[i];
40 Functions A scalar function is applied to all elements of an array section (element type overloading in C++). Example: a[:] = sin(b[:]). The compiler may use a vectorized version of the function: a built-in function or a user-defined SIMD-enabled (elemental) function: __attribute__((vector)).
41 Builtin Functions sec_reduce_add(a[:]): scalar that is the sum of all elements in the array section. sec_reduce_mul(a[:]): scalar that is the product of all elements in the array section. sec_reduce_max(a[:]): scalar that is the largest element in the array section. sec_reduce_min(a[:]): scalar that is the smallest element in the array section. sec_reduce_max_ind(a[:]): integer index of the largest element in the array section. sec_reduce_min_ind(a[:]): integer index of the smallest element in the array section. sec_reduce_all_zero(a[:]): 1 if all elements of the array section are zero, else 0. sec_reduce_all_nonzero(a[:]): 1 if all elements of the array section are non-zero, else 0. sec_reduce_any_zero(a[:]): 1 if any element of the array section is zero, else 0. sec_reduce_any_nonzero(a[:]): 1 if any element of the array section is non-zero, else 0.
42 Example 8 Minimum Distance Computation
43 Example 8 Minimum Distance Computation Go to the directory example_8. Compile BinarySTLReader.cpp and distance.cpp: icc -O2 -openmp BinarySTLReader.cpp distance.cpp. Execute the binary. What are the execution times? Enable OpenMP support (currently commented out), compile, and execute again. What are the execution times?
44 Portability
45 Portability Initial situation: auto vectorization is enabled by default (Intel), with SSE2 as the default instruction set (-msse2). Goal: optimal performance on the target machine by using the latest instruction set architecture (-xavx). Problem: illegal instruction exception on older hardware. Solution: create several binaries; better: use CPU dispatch.
46 ICC CPU Dispatch Generation of multiple code paths via the compiler switch -ax. Base line path: other switches (e.g. -O3) apply to the base line path; specified via -x or -m (default -msse2). Alternative path: specified via -ax (e.g. -axavx). The path is selected based on the executing CPU.
47 Manual CPU Dispatch Usage of __attribute__((cpu_dispatch(cpuid, ...))). Example: __attribute__((cpu_dispatch(generic, future_cpu_16))) void dispatch_func() {}; __attribute__((cpu_specific(generic))) void dispatch_func() { /* Code for generic */ } __attribute__((cpu_specific(future_cpu_16))) void dispatch_func() { /* Code for future_cpu_16 */ } int main() { dispatch_func(); }
48 Conclusion Vectorization can improve the runtime performance of floating point intensive loops (increasing flops/watt). Auto vectorization: many factors inhibit auto vectorization (some do not apply to processors with features like scatter/gather); the vectorization report helps to identify (non-)vectorized code; if one compiler fails, another may succeed; the Intel compiler can provide hints for code modifications. Explicit vectorization: vector intrinsics are an abstraction over vector assembly; Cilk Plus array notation is a good abstraction for data parallelism.
49 Thank You! Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland
More informationPerformance Analysis and Optimization MAQAO Tool
Performance Analysis and Optimization MAQAO Tool Andrés S. CHARIF-RUBIAL Emmanuel OSERET {achar,emmanuel.oseret}@exascale-computing.eu Exascale Computing Research 11th VI-HPS Tuning Workshop MAQAO Tool
More informationCompilers and optimization techniques. Gabriele Fatigati - Supercomputing Group
Compilers and optimization techniques Gabriele Fatigati - g.fatigati@cineca.it Supercomputing Group The compilation is the process by which a high-level code is converted to machine languages. Born to
More informationImproving graphics processing performance using Intel Cilk Plus
Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationOpenCL Vectorising Features. Andreas Beckmann
Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels
More informationIntel Array Building Blocks (Intel ArBB) Technical Presentation
Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationTechnical Report. Research Lab: LERIA
Technical Report Improvement of Fitch function for Maximum Parsimony in Phylogenetic Reconstruction with Intel AVX2 assembler instructions Research Lab: LERIA TR20130624-1 Version 1.0 24 June 2013 JEAN-MICHEL
More informationCS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010
Parallel Programming Lecture 7: Introduction to SIMD Mary Hall September 14, 2010 Homework 2, Due Friday, Sept. 10, 11:59 PM To submit your homework: - Submit a PDF file - Use the handin program on the
More informationBoundary element quadrature schemes for multi- and many-core architectures
Boundary element quadrature schemes for multi- and many-core architectures Jan Zapletal, Michal Merta, Lukáš Malý IT4Innovations, Dept. of Applied Mathematics VŠB-TU Ostrava jan.zapletal@vsb.cz Intel MIC
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationImproving performance of the N-Body problem
Improving performance of the N-Body problem Efim Sergeev Senior Software Engineer at Singularis Lab LLC Contents Theory Naive version Memory layout optimization Cache Blocking Techniques Data Alignment
More informationCode optimization in a 3D diffusion model
Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Agenda Background Diffusion
More informationDan Stafford, Justine Bonnot
Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationCS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions
CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron January 14, 2016 Notice The material in this document is supplementary material to
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationIntel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationIntel Software and Services, Kirill Rogozhin
Intel Software and Services, 2016 Kirill Rogozhin Agenda Motivation for vectorization OpenMP 4.x programming model Intel Advisor: focus and characterize Enable vectorization on scalar code Speed-up your
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More informationOpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat
OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationSSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals
SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions
More informationIntel Compilers for C/C++ and Fortran
Intel Compilers for C/C++ and Fortran Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Important Optimization Options for HPC High Level Optimizations (HLO) Pragmas Interprocedural
More informationMasterpraktikum Scientific Computing
Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Logins Levels of Parallelism Single Processor Systems Von-Neumann-Principle
More informationCilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation
Cilk Plus in GCC GNU Tools Cauldron 2012 Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation July 10, 2012 Presentation Outline Introduction Cilk Plus components Implementation GCC Project Status
More informationCS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions
CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron October 12, 2015 Notice The material in this document is supplementary material to
More informationOPENMP FOR ACCELERATORS
7th International Workshop on OpenMP Chicago, Illinois, USA James C. Beyer, Eric J. Stotzer, Alistair Hart, and Bronis R. de Supinski OPENMP FOR ACCELERATORS Accelerator programming Why a new model? There
More informationPRACE PATC Course: Vectorisation & Basic Performance Overview. Ostrava,
PRACE PATC Course: Vectorisation & Basic Performance Overview Ostrava, 7-8.2.2017 1 Agenda Basic Vectorisation & SIMD Instructions IMCI Vector Extension Intel compiler flags Hands-on Intel Tool VTune Amplifier
More informationWhat s P. Thierry
What s new@intel P. Thierry Principal Engineer, Intel Corp philippe.thierry@intel.com CPU trend Memory update Software Characterization in 30 mn 10 000 feet view CPU : Range of few TF/s and
More informationAdvanced programming with OpenMP. Libor Bukata a Jan Dvořák
Advanced programming with OpenMP Libor Bukata a Jan Dvořák Programme of the lab OpenMP Tasks parallel merge sort, parallel evaluation of expressions OpenMP SIMD parallel integration to calculate π User-defined
More informationMAQAO Hands-on exercises FROGGY Cluster
MAQAO Hands-on exercises FROGGY Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/projects/pr-vi-hps-tw18/tutorial/maqao.tar.bz2
More informationParallel processing with OpenMP. #pragma omp
Parallel processing with OpenMP #pragma omp 1 Bit-level parallelism long words Instruction-level parallelism automatic SIMD: vector instructions vector types Multiple threads OpenMP GPU CUDA GPU + CPU
More informationNative Computing and Optimization. Hang Liu December 4 th, 2013
Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning
More informationVectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition
Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Aaron Birkland Consultant Cornell Center for Advanced Computing December 11, 2012 1 Simple Vectorization This lab serves as an
More informationOpenMP 4.5: Threading, vectorization & offloading
OpenMP 4.5: Threading, vectorization & offloading Michal Merta michal.merta@vsb.cz 2nd of March 2018 Agenda Introduction The Basics OpenMP Tasks Vectorization with OpenMP 4.x Offloading to Accelerators
More informationDynamic SIMD Scheduling
Dynamic SIMD Scheduling Florian Wende SC15 MIC Tuning BoF November 18 th, 2015 Zuse Institute Berlin time Dynamic Work Assignment: The Idea Irregular SIMD execution Caused by branching: control flow varies
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationCilk Plus GETTING STARTED
Cilk Plus GETTING STARTED Overview Fundamentals of Cilk Plus Hyperobjects Compiler Support Case Study 3/17/2015 CHRIS SZALWINSKI 2 Fundamentals of Cilk Plus Terminology Execution Model Language Extensions
More informationAn Intel Xeon Phi Backend for the ExaStencils Code Generator
Bachelor thesis An Intel Xeon Phi Backend for the ExaStencils Code Generator Thomas Lang Supervisor: Tutor: Prof. Christian Lengauer, Ph.D. Dr. Armin Größlinger 27th April 2016 Abstract Stencil computations
More informationDouble-precision General Matrix Multiply (DGEMM)
Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply
More informationCompiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005
Compiling for Performance on hp OpenVMS I64 Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compilers discussed C, Fortran, [COBOL, Pascal, BASIC] Share GEM optimizer
More informationSIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016
SIMD Instructions outside and inside Oracle 2c Laurent Léturgez 206 Whoami Oracle Consultant since 200 Former developer (C, Java, perl, PL/SQL) Owner@Premiseo: Data Management on Premise and in the Cloud
More informationShort Notes of CS201
#includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system
More information