Vc: Portable and Easy SIMD Programming with C++


Vc: Portable and Easy SIMD Programming with C++
Matthias Kretz
Frankfurt Institute for Advanced Studies, Institute for Computer Science, Goethe University Frankfurt
May 19th, 2014
HGS-HIRe Helmholtz Graduate School for Hadron and Ion Research
Matthias Kretz (CC BY-NC-SA 2.0)

Matthias Kretz (compeng.uni-frankfurt.de), May 19th, 2014
Outline: SIMD · Vc: Vector Types · Mandelbrot

[Photo by Kevin Dooley (CC BY 2.0)]

Do you code?

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

The scalar loop, instruction by instruction:

    i = 0
loop:
    load x[i + 1]
    load x[i]
    x[i + 1] - x[i]
    store dx[i]
    i += 1
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

Multiple operations in one instruction:

The same loop vectorized, W elements per iteration:

    i = 0
loop:
    load x[i + 1], ..., x[i + W]
    load x[i], ..., x[i + W - 1]
    x[i + 1] - x[i], ..., x[i + W] - x[i + W - 1]
    store dx[i], ..., dx[i + W - 1]
    i += W
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

SIMD: Single Instruction, Multiple Data
[Diagram: one vector instruction subtracts (u0, u1, u2, u3) from (u1, u2, u3, u4) element-wise, producing (u1-u0, u2-u1, u3-u2, u4-u3).]

SIMD is synchronous parallelism: think of W threads executing in lock-step.
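The lock-step model can be sketched in plain C++ (this is an illustration, not Vc code; W = 4 and the function name are assumptions): the inner loop over l plays the role of W "threads" all executing the same subtraction at the same time.

```cpp
#include <cassert>

// Sketch (plain C++, not Vc): emulate W = 4 SIMD lanes in lock-step.
// Each "lane" l executes the same operation on its own element, just
// as W threads running in lock-step would.
constexpr int W = 4;

void diff_lockstep(const float* x, float* dx, int n) {
    int i = 0;
    for (; i + W <= n - 1; i += W) {
        // one "vector instruction": W subtractions happen together
        for (int l = 0; l < W; ++l) {
            dx[i + l] = x[i + l + 1] - x[i + l];
        }
    }
    for (; i < n - 1; ++i) {  // scalar remainder
        dx[i] = x[i + 1] - x[i];
    }
}
```

A real SIMD implementation replaces the inner loop over l with a single instruction.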

How much did you miss? It depends on the data type and the target CPU: a factor of 2, 4, 8, 16, 32, ...

[Plot: single-precision FLOP/cycle per core, 1998 to 2014. Scalar code stays flat at a few FLOP/cycle, while SIMD grows to 32 FLOP/cycle on recent CPUs.]

Why don't you use SIMD?

How could you? Programming languages only provide scalar types and operations.

template <typename T> static inline Vector<T> calc(Vector<T> x) {
    typedef Vector<T> V;
    typedef typename V::Mask M;
    typedef Const<T> C;
    const M invalidMask = x < V::Zero();
    const M infinityMask = x == V::Zero();
    const M denormal = x <= C::min();
    x(denormal) *= V(Vc_buildDouble(1, 0, 54));
    V exponent = x.exponent();
    exponent(denormal) -= 54;
    x.setZero(C::exponentMask());
    x |= C::_1_2();
    const M smallX = x < C::_1_sqrt2();
    x(smallX) += x;
    x -= V::One();
    exponent(!smallX) += V::One();
    ...

Fundamental types in C++ map to hardware (registers/instructions) ...

... but SIMD hardware does not map to C++ types. I work on fixing this issue.

Vc is a zero-overhead wrapper around SIMD intrinsics with greatly improved syntax and semantics:
- recognizable types: float_v, double_v, int_v, ushort_v
- operator overloads: a + b * b
- masks and value vectors are different types: float_m mask = a < b;

The Idea

namespace AVX {
    template <typename T> class Vector {
        // target-specific data member (intrinsic)
    public:
        static constexpr size_t Size;
        ...
    };
    typedef Vector<float> float_v;
    typedef Vector<int> int_v;
    ...
}


The Idea (2)

namespace MIC { ... }
namespace SSE { ... }
namespace Scalar { ... }
...

The Idea (3)

namespace Vc {
    using AVX::Vector;
    using AVX::float_v;
    using AVX::int_v;
    ...
}

Portability Concerns: How to fill a float_v?

void f(float *data) {
    float_v x;
    x.load(data, Vc::Unaligned);
    ...
}

- float_v::Size (W) could be 1, 2, 4, 8, 16, ...
- Cannot portably assign float_v x = {1, 2, 3, 4};
- A load requires W consecutive values in memory
- Alignment: a guaranteed power-of-two factor in the pointer address, i.e. the number of least significant bits that are zero

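The alignment definition above can be made concrete in standard C++. This is a sketch, not Vc API: is_aligned() and alloc_vector_storage() are hypothetical helpers, and the 32-byte alignment stands in for what an AVX float_v would need.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Sketch: "aligned to A bytes" means the address is a multiple of A,
// i.e. its low log2(A) bits are zero. is_aligned() and
// alloc_vector_storage() are hypothetical helpers, not Vc API.
bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

float* alloc_vector_storage(std::size_t n_floats, std::size_t alignment) {
    // std::aligned_alloc requires the size to be a multiple of alignment
    const std::size_t bytes =
        ((n_floats * sizeof(float) + alignment - 1) / alignment) * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}
```

With such storage, an aligned (and typically faster) load can be used instead of Vc::Unaligned.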

Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

- the constructor initializes one SIMD register of target-specific size W with W consecutive values starting at array[offset]
- x * 2 + 1 multiplies the W values in x by 2 and adds 1; the integral constants 2 and 1 are broadcast into floating-point SIMD registers, and a fused multiply-add instruction is used if the target supports it
- store writes the W values from the SIMD register back, overwriting W values in array

Container Interface

float_v x = ...;
for (size_t i = 0; i < float_v::Size; ++i) {
    x[i] += i;
}

or, with C++11 and the latest Vc:

float_v x = ...;
int i = 0;
for (auto &scalar : x) {
    scalar += ++i;
}

No Implicit Context

With SIMD types you are always in scalar context, in contrast to vector loops: all SIMD context is attached to the type!

x = x * 2 + 1;              // SIMD operations
if (any_of(x < 0.f)) {      // scalar decision
    throw runtime_error("unexpected result");
}
x = sin(x);                 // more SIMD operations
auto scalar = x.sum();      // SIMD reduction

Guess what happens if you throw inside a vector loop.


Portable Masking

phi(phi < 0.f) += 360.f;
// or: where(phi < 0.f) phi += 360.f;

equivalent to:

for (auto &phi_entry : phi) {
    if (phi_entry < 0.f) {
        phi_entry += 360.f;
    }
}

but optimized for the target's SIMD instruction set.

Marketing Speak: Vc makes SIMD programming intuitive, portable, and fast! And free & open: LGPL licensed (BSD for Vc 1.0).


Example: Mandelbrot

Mandelbrot iteration: z_{n+1} = z_n² + c, with z_0 = 0 and c given by the canvas coordinates. Iterate until either |z_n|² ≥ 4 or n > n_max; n determines the color.
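A scalar reference for this iteration helps to read the vectorized version on the next slide; this is a sketch (the function name is an assumption, not from the talk), using std::norm for |z|².

```cpp
#include <cassert>
#include <complex>

// Scalar reference for the Mandelbrot iteration (not the Vc version):
// iterate z <- z^2 + c from z = 0 until |z|^2 >= 4 or n > n_max;
// the returned n determines the pixel color.
int mandelbrot_iterations(std::complex<float> c, int n_max) {
    std::complex<float> z = 0;
    int n = 0;
    while (std::norm(z) < 4.f && n <= n_max) {
        z = z * z + c;
        ++n;
    }
    return n;
}
```

The Vc version below runs this same loop for W canvas points at once, with a mask tracking which lanes are still inside.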

Mandelbrot with Vc

for (int y = 0; y < imageHeight; ++y) {
    const float_v c_imag = y0 + y * scale;
    for (int_v x = int_v::IndexesFromZero(); any_of(x < imageWidth);
            x += float_v::Size) {
        const std::complex<float_v> c(x0 + x * scale, c_imag);
        std::complex<float_v> z = c;
        int_v n = 0;
        int_m inside = norm(z) < 4.f;
        while (any_of(inside && n < 255)) {
            z = z * z + c;
            n(inside) += 1;
            inside = norm(z) < 4.f;
        }
        int_v colorValue = 255 - n;
        colorizeNextPixels(colorValue);
    }
}


Mandelbrot Speedup
[Plot: speedup (0 to 8) vs. image size (width/3 = height/2, 0 to 700 pixels) for the Vc::AVX, Vc::SSE (ternary), Vc::SSE (binary), and Vc::Scalar implementations.]


Abstractions (photo by INABA Tomoaki, CC BY-SA @flickr)

Auto-Vectorization Overview

- the compiler recognizes data parallelism; modern compilers are impressively smart!
- tightly coupled to loops
- the language standard requires that such a transformation keep the semantics of the original code unchanged
- the number of involved data structures increases complexity
- function calls (and therefore abstraction) can inhibit auto-vectorization

for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}
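The point about function calls can be illustrated with two versions of the same loop. This is a sketch for GCC/Clang (the noinline attribute and both function names are assumptions): the first form is one compilers typically auto-vectorize at -O2/-O3, while the call to an out-of-line function the compiler cannot see through typically inhibits it. Both compute identical results.

```cpp
#include <cassert>

// Sketch: abstraction inhibiting auto-vectorization.
// opaque_op() is hypothetical; noinline keeps the compiler from
// seeing through the call (GCC/Clang attribute, not standard C++).
__attribute__((noinline)) float opaque_op(float a, float b) {
    return a + b;
}

void vectorizable(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) a[i] += b[i];   // typically vectorized
}

void inhibited(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) a[i] = opaque_op(a[i], b[i]);  // typically not
}
```

Compiler reports (e.g. -fopt-info-vec on GCC) can confirm which loops were actually vectorized.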

Auto-Vectorization Improvements

- auto-vectorization capabilities are constantly being improved
- but no breakthrough can be expected
- compiler writers rather turn to explicit loop vectorization

Intrinsics Overview

- functions that wrap instructions; very target-specific
- inline assembly on steroids: the compiler does register allocation
- the compiler can (in theory) optimize as well as with scalar builtin types

for (int i = 0; i < N; i += 8) {
    _mm256_store_ps(&a[i], _mm256_add_ps(_mm256_load_ps(&a[i]),
                                         _mm256_load_ps(&b[i])));
}

Intrinsics Improvements?

- new instructions require new intrinsics; existing intrinsics must keep source compatibility (and stay C interfaces), so no improvements are possible
- GCC and Clang have a nicer alternative, the vector attribute: infix notation, subscripting, builtins, and better optimization opportunities (the compiler sees more of the developer's intent)
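A minimal sketch of the vector attribute mentioned above (GCC/Clang extension, not standard C++; the typedef name and function are assumptions): infix operators and subscripting work directly on the vector type.

```cpp
#include <cassert>

// GCC/Clang vector attribute: an 8-wide float vector (32 bytes).
// Infix operators compile to element-wise SIMD instructions.
typedef float v8sf __attribute__((vector_size(32)));

v8sf axpy(v8sf a, v8sf x, v8sf y) {
    return a * x + y;  // element-wise multiply-add across all 8 lanes
}
```

Compared to intrinsics, the compiler sees ordinary arithmetic here and can constant-fold, reassociate, and pick FMA instructions on its own.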

Vector Loops Overview

- #pragma vector with ICC
- #pragma omp simd with OpenMP 4 compatible compilers
- loop transformations similar to auto-vectorization
- difference: the concurrent execution semantics are explicit, so the compiler does not have to prove that scalar and vector execution are equivalent

#pragma omp simd
for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}

Vector Loops Challenges

- special semantics inside vector loops: cannot use exceptions, cannot do thread synchronization, function calls require annotated functions
- many more (important) arguments to the #pragma: safelen(length), linear(list[:linear-step]), aligned(list[:alignment]), private(list), lastprivate(list), reduction(operator:list), collapse(n)
- part of the algorithm's logic may therefore appear in the #pragma
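The "annotated functions" requirement looks like this sketch (function names are assumptions): #pragma omp declare simd asks the compiler to also emit a SIMD variant of the function that vectorized loops can call. Without -fopenmp or -fopenmp-simd the pragmas are ignored and the code stays valid scalar C++.

```cpp
#include <cassert>

// Sketch: an OpenMP 4 SIMD-annotated function, callable from
// #pragma omp simd loops without inhibiting vectorization.
#pragma omp declare simd
float scale_add(float x, float s) { return x * s + 1.0f; }

void apply(float* a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        a[i] = scale_add(a[i], 2.0f);
    }
}
```

This illustrates the trade-off from the slide: the vectorization contract lives in pragmas attached to the function and the loop, not in the types.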

SIMT

- the vector loops for GPU programming
- all code implicitly runs in SIMD context, with an attached index that signifies the SIMD lane
- think OpenCL for x86 families (CPU or Xeon Phi)

SIMDized Containers

- containers with overloaded operators; each operation semantically acts on all entries of the container, without any specific ordering
- std::valarray is such a class, but somewhat abandoned: runtime sized/allocated, cache inefficient, suboptimal mapping onto the SIMD width

std::valarray<float> a(n), b(n);
a += b;
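A slightly fuller std::valarray example (standard C++; the function name is an assumption): every expression acts element-wise on the whole container with no ordering guarantees between elements, which in principle permits a SIMD implementation.

```cpp
#include <cassert>
#include <valarray>

// std::valarray: operators act on all entries at once, no explicit loop.
std::valarray<float> saxpy(float a, const std::valarray<float>& x,
                           const std::valarray<float>& y) {
    return a * x + y;  // element-wise scalar-times-vector plus vector
}
```

The slide's criticism still applies: the container is runtime-sized and heap-allocated, so the width of each operation does not map onto the fixed SIMD register width.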

Array Notation

Intel Cilk Plus (known from Fortran):

a[:] += b[:];

SIMD Types

- types for SIMD registers and operations
- target-specific SIMD type width

for (int i = 0; i < N / float_v::Size; ++i) {
    a[i] += b[i];
}

Implementations (sorted by initial release): Vc, boost::simd (not in Boost; part of NT²), Prof. Agner Fog's vector classes, libsimdpp
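Note that the loop above only covers N / float_v::Size full vectors, so it implicitly assumes N is a multiple of the vector width. The usual portable pattern is a vector-wide main loop plus a scalar epilogue; a sketch in standard C++ (W = 8 and the function name are assumptions, standing in for float_v::Size on an AVX target):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: W-wide main loop plus scalar epilogue for the n % W
// remainder, the pattern SIMD-type loops need when n is not a
// multiple of the vector width.
constexpr std::size_t W = 8;

void add(float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + W <= n; i += W) {           // "vector" body, W lanes at a time
        for (std::size_t l = 0; l < W; ++l) a[i + l] += b[i + l];
    }
    for (; i < n; ++i) a[i] += b[i];       // scalar epilogue
}
```

With real SIMD types the inner loop over l becomes a single vector add; some libraries instead offer masked loads/stores for the remainder.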

Future of the C++ Standard

- everything is still open
- maybe two approaches, for high-level and for low-level needs: vector loops and SIMD types