Vc: Portable and Easy SIMD Programming with C++


1 Vc: Portable and Easy SIMD Programming with C++
Matthias Kretz
Frankfurt Institute for Advanced Studies, Institute for Computer Science, Goethe University Frankfurt
May 19th, 2014
HGS-HIRe Helmholtz Graduate School for Hadron and Ion Research
Matthias Kretz (CC BY-NC-SA 2.0)

2 Matthias Kretz (compeng.uni-frankfurt.de) May 19th, 2014
Outline: SIMD, Vc: Vector Types, Mandelbrot

3 [image by Kevin Dooley (CC BY 2.0)]

4 Do you code?

5
for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

6
i = 0
loop:
    load x[i+1]
    load x[i]
    x[i+1] - x[i]
    store dx[i]
    i += 1
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

7 multiple operations in one instruction

8
i = 0
loop:
    load x[1], x[2], ..., x[W]
    load x[0], x[1], ..., x[W-1]
    x[1] - x[0], x[2] - x[1], ..., x[W] - x[W-1]
    store dx[0], dx[1], ..., dx[W-1]
    i += W
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

9 SIMD: Single Instruction Multiple Data
[diagram: lane-wise subtraction of two vectors, u1 - u0, u2 - u1, u3 - u2, u4 - u3 computed in one instruction]

10 SIMD is synchronous parallelism
Think of W threads executing in lock-step

11 How much did you miss?
Depends on the data type and target CPU: a factor of 2, 4, 8, 16, 32, ...?

12 [plot: FLOP/cycle (single-precision float) by year]

13 [plot: FLOP/cycle (single-precision float) by year, Scalar vs. SIMD; SIMD reaches 32]

14 Why don't you use SIMD?

15 How could you? Programming languages only provide scalar types and operations.

16
template <typename T>
static inline Vector<T> calc(Vector<T> x)
{
    typedef Vector<T> V;
    typedef typename V::Mask M;
    typedef Const<T> C;
    const M invalidMask = x < V::Zero();
    const M infinityMask = x == V::Zero();
    const M denormal = x <= C::min();
    x(denormal) *= V(Vc_buildDouble(1, 0, 54));
    V exponent = x.exponent();
    exponent(denormal) -= 54;
    x.setZero(C::exponentMask());
    x |= C::_1_2();
    const M smallX = x < C::_1_sqrt2();
    x(smallX) += x;
    x -= V::One();
    exponent(!smallX) += V::One();
    ...

17 fundamental types in C++ map to hardware (registers/instructions)

18 but SIMD hardware does not map to C++ types
I work on fixing this issue

19 Vc is a zero-overhead wrapper around intrinsic SIMD functions, with greatly improved syntax and semantics:
recognizable types: float_v, double_v, int_v, ushort_v
operator overloads: a + b * b
masks and value vectors are different types: float_m mask = a < b;

20 The Idea

namespace AVX {
template <typename T> class Vector {
    // target-specific data member (intrinsic)
public:
    static constexpr size_t Size;
    ...
};
typedef Vector<float> float_v;
typedef Vector<int> int_v;
...
}


24 The Idea (2)

namespace MIC { ... }
namespace SSE { ... }
namespace Scalar { ... }
...

25 The Idea (3)

namespace Vc {
using AVX::Vector;
using AVX::float_v;
using AVX::int_v;
...
}

26 Portability Concerns: How to fill a float_v?

void f(float *data) {
    float_v x;
    x.load(data, Vc::Unaligned);
    ...
}

float_v::size (W) could be 1, 2, 4, 8, 16, ...
Cannot assign float_v x = {1, 2, 3, 4};
Requires W consecutive values in memory for a load
Alignment: the guaranteed power-of-two factor in the prime factorization of a pointer address (i.e. the number of least significant bits that are zero)


29 Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

initialize one SIMD register of target-specific size W with W consecutive values starting from array[offset]

30 Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

multiply W values in x by 2 and add 1
broadcast the integral 2 (and 1) to a floating-point SIMD register
use a fused multiply-add instruction if supported by the target

31 Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

store W values from the SIMD register, overwriting W values in array

32 Container Interface

float_v x = ...;
for (size_t i = 0; i < float_v::size; ++i) {
    x[i] += i;
}

or with C++11 and the latest Vc:

float_v x = ...;
int i = 0;
for (auto &scalar : x) {
    scalar += ++i;
}

33 No Implicit Context

With SIMD types you are always in scalar context, in contrast to vector loops; all SIMD context is attached to the type!

x = x * 2 + 1;              // SIMD operations
if (any_of(x < 0.f)) {      // scalar decision
    throw runtime_error("unexpected result");
}
x = sin(x);                 // more SIMD operations
auto scalar = x.sum();      // SIMD reduction

Guess what happens if you throw inside a vector loop...


37 Portable Masking

phi(phi < 0.f) += 360.f;
// or: where(phi < 0.f) phi += 360.f;

equivalent to:

for (auto &phi_entry : phi) {
    if (phi_entry < 0.f) {
        phi_entry += 360.f;
    }
}

but optimized for the target's SIMD instruction set

38 Marketing Speak
Vc makes SIMD programming intuitive, portable, and fast!
And free & open: LGPL licensed (BSD for Vc 1.0)

39 Outline: SIMD, Vc: Vector Types, Mandelbrot

40 Example: Mandelbrot

Mandelbrot iteration: z_{n+1} = z_n^2 + c, with z_0 = 0 and c = canvas coordinates
Iterate until either |z_n|^2 >= 4 or n > n_max
n determines the color

41 Mandelbrot with Vc

for (int y = 0; y < imageHeight; ++y) {
    const float_v c_imag = y0 + y * scale;
    for (int_v x = int_v::IndexesFromZero(); any_of(x < imageWidth);
            x += float_v::size) {
        const std::complex<float_v> c(x0 + x * scale, c_imag);
        std::complex<float_v> z = c;
        int_v n = 0;
        int_m inside = norm(z) < 4.f;
        while (any_of(inside && n < 255)) {
            z = z * z + c;
            n(inside) += 1;
            inside = norm(z) < 4.f;
        }
        int_v colorValue = n;
        colorizeNextPixels(colorValue);
    }
}


51 Mandelbrot Speedup
[plot: speedup vs. image size (width/3 = height/2 pixels) for Vc::AVX, Vc::SSE (ternary), Vc::SSE (binary), Vc::Scalar]

52 Outline: SIMD, Vc: Vector Types, Mandelbrot

53 Abstractions
[image by INABA Tomoaki (CC)]

54 Auto-Vectorization Overview

the compiler recognizes data parallelism
modern compilers are impressively smart!
but:
tightly coupled to loops
the language standard requires such a transformation to keep the semantics of the original code unchanged
a growing number of involved data structures increases complexity
function calls (and therefore abstraction) can inhibit auto-vectorization

for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}

55 Auto-Vectorization Improvements

Auto-vectorization capabilities are constantly being improved
No breakthrough can be expected
Compiler writers rather turn to explicit loop vectorization

56 Intrinsics Overview

functions that wrap instructions
very target specific
inline assembly on steroids:
the compiler does register allocation
the compiler can (in theory) optimize as well as with scalar builtin types

for (int i = 0; i < N; i += 8) {
    _mm256_store_ps(&a[i], _mm256_add_ps(_mm256_load_ps(&a[i]),
                                         _mm256_load_ps(&b[i])));
}

57 Intrinsics Improvements?

New instructions require new intrinsics
existing intrinsics must keep source compatibility (and stay C interfaces)
so no improvements are possible there
GCC and Clang have a nicer alternative: the vector attribute
infix notation, subscripting, builtins
better optimization opportunities (the compiler sees more of the developer's intent)

58 Vector Loops Overview

#pragma vector with ICC
#pragma omp simd with OpenMP 4 compatible compilers
loop transformations similar to auto-vectorization
difference: the concurrent execution semantics are explicit
the compiler does not have to prove that scalar and vector execution are equivalent

#pragma omp simd
for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}

59 Vector Loops Challenges

special semantics inside vector loops:
cannot use exceptions
cannot do thread synchronization
function calls require annotated functions
many more (important) arguments to the #pragma:
safelen(length), linear(list[:linear-step]), aligned(list[:alignment]), private(list), lastprivate(list), reduction(operator:list), collapse(n)
part of the algorithm's logic may therefore appear in the #pragma

60 SIMT

The vector loops for GPU programming
all code implicitly runs in SIMD context, with an attached index that signifies the SIMD lane
Think OpenCL for x86 families (CPU or Xeon Phi)

61 SIMDized Containers

containers with overloaded operators
each operation semantically acts on all entries of the container, without any specific ordering
std::valarray is such a class:
somewhat abandoned
runtime sized / allocated
cache inefficient
suboptimal mapping onto the SIMD width

std::valarray<float> a(n), b(n);
a += b;

62 Array Notation

Intel Cilk Plus (known from Fortran)

a[:] += b[:];

63 SIMD Types

types for SIMD registers and operations
target-specific SIMD type width

for (int i = 0; i < (N / float_v::size); ++i) {
    a[i] += b[i];
}

Implementations (sorted by initial release):
Vc
boost::simd (not in Boost, part of NT²)
Prof. Agner Fog's vector classes
libsimdpp

64 Future of the C++ Standard

everything is still open
maybe two approaches, for high-level and low-level needs:
Vector Loops
SIMD Types


More information

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only

More information

Improving graphics processing performance using Intel Cilk Plus

Improving graphics processing performance using Intel Cilk Plus Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming

AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming David Pfander, Malte Brunn, Dirk Pflüger University of Stuttgart, Germany May 25, 2018 Vancouver, Canada, iwapt18 May 25, 2018 Vancouver,

More information

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016 Copyright Khronos Group, 2014 - Page 1 SYCL SG14, February 2016 BOARD OF PROMOTERS Over 100 members worldwide any company is welcome to join Copyright Khronos Group 2014 SYCL 1. What is SYCL for and what

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

Pablo Halpern Parallel Programming Languages Architect Intel Corporation

Pablo Halpern Parallel Programming Languages Architect Intel Corporation Pablo Halpern Parallel Programming Languages Architect Intel Corporation CppCon, 8 September 2014 This work by Pablo Halpern is licensed under a Creative Commons Attribution

More information

Vectorization with Haswell. and CilkPlus. August Author: Fumero Alfonso, Juan José. Supervisor: Nowak, Andrzej

Vectorization with Haswell. and CilkPlus. August Author: Fumero Alfonso, Juan José. Supervisor: Nowak, Andrzej Vectorization with Haswell and CilkPlus August 2013 Author: Fumero Alfonso, Juan José Supervisor: Nowak, Andrzej CERN openlab Summer Student Report 2013 Project Specification This project concerns the

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

Growth in Cores - A well rehearsed story

Growth in Cores - A well rehearsed story Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

More information

Intel Array Building Blocks

Intel Array Building Blocks Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call

More information

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define

More information

CS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009.

CS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009. Parallel Programming Lecture 9: Task Parallelism in OpenMP Administrative Programming assignment 1 is posted (after class) Due, Tuesday, September 22 before class - Use the handin program on the CADE machines

More information

High Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward.

High Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward. High Performance Parallel Programming Multicore development tools with extensions to many-core. Investment protection. Scale Forward. Enabling & Advancing Parallelism High Performance Parallel Programming

More information

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP)

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) V. Akishina, I. Kisel, G. Kozlov, I. Kulakov, M. Pugach, M. Zyzak Goethe University of Frankfurt am Main 2015 Task Parallelism Parallelization

More information

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs

More information

Introduction to Runtime Systems

Introduction to Runtime Systems Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents

More information

Intel Software Development Products for High Performance Computing and Parallel Programming

Intel Software Development Products for High Performance Computing and Parallel Programming Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN

More information

Intel Advisor XE. Vectorization Optimization. Optimization Notice

Intel Advisor XE. Vectorization Optimization. Optimization Notice Intel Advisor XE Vectorization Optimization 1 Performance is a Proven Game Changer It is driving disruptive change in multiple industries Protecting buildings from extreme events Sophisticated mechanics

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list Compiling for Scalable Computing Systems the Merit of SIMD Ayal Zaks Intel Corporation Acknowledgements: too many to list Lo F IRST, Thanks for the Technion For Inspiration and Recognition of Science and

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

VECTORISATION. Adrian

VECTORISATION. Adrian VECTORISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Vectorisation Same operation on multiple data items Wide registers SIMD needed to approach FLOP peak performance, but your code must be

More information

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor

More information

Vectorization on KNL

Vectorization on KNL Vectorization on KNL Steve Lantz Senior Research Associate Cornell University Center for Advanced Computing (CAC) steve.lantz@cornell.edu High Performance Computing on Stampede 2, with KNL, Jan. 23, 2017

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

OpenMP C and C++ Application Program Interface Version 1.0 October Document Number

OpenMP C and C++ Application Program Interface Version 1.0 October Document Number OpenMP C and C++ Application Program Interface Version 1.0 October 1998 Document Number 004 2229 001 Contents Page v Introduction [1] 1 Scope............................. 1 Definition of Terms.........................

More information

Scheduling Image Processing Pipelines

Scheduling Image Processing Pipelines Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];

More information

COMP Parallel Computing. Programming Accelerators using Directives

COMP Parallel Computing. Programming Accelerators using Directives COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator

More information

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,

More information

SYCL-based Data Layout Abstractions for CPU+GPU Codes

SYCL-based Data Layout Abstractions for CPU+GPU Codes DHPCC++ 2018 Conference St Catherine's College, Oxford, UK May 14th, 2018 SYCL-based Data Layout Abstractions for CPU+GPU Codes Florian Wende, Matthias Noack, Thomas Steinke Zuse Institute Berlin Setting

More information

Allows program to be incrementally parallelized

Allows program to be incrementally parallelized Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

A short stroll along the road of OpenMP* 4.0

A short stroll along the road of OpenMP* 4.0 * A short stroll along the road of OpenMP* 4.0 Christian Terboven Michael Klemm Members of the OpenMP Language Committee 1 * Other names and brands

More information

A New Vectorization Technique for Expression Templates in C++ arxiv: v1 [cs.ms] 6 Sep 2011

A New Vectorization Technique for Expression Templates in C++ arxiv: v1 [cs.ms] 6 Sep 2011 A New Vectorization Technique for Expression Templates in C++ arxiv:1109.1264v1 [cs.ms] 6 Sep 2011 J. Progsch, Y. Ineichen and A. Adelmann October 28, 2018 Abstract Vector operations play an important

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

Now What? Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved..

Now What? Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved.. Now What? Sean Parent Principal Scientist Beauty 2 C++11 Standard 1338 Pages 2012 Adobe Systems Incorporated. All Rights Reserved. 3 C++11 Standard 1338 Pages C++98 Standard 757 Pages 2012 Adobe Systems

More information

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome

More information

Intel Array Building Blocks (Intel ArBB) Technical Presentation

Intel Array Building Blocks (Intel ArBB) Technical Presentation Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types

Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften vorgelegt beim Fachbereich 12 der Johann Wolfgang Goethe-Universität

More information

Shared Memory programming paradigm: openmp

Shared Memory programming paradigm: openmp IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM

More information

SIMD in Scientific Computing

SIMD in Scientific Computing SIMD in Scientific Computing Tim Haines (terminal) PhD Candidate University of Wisconsin-Madison Department of Astronomy State of Tree(PM)-based N-Body solvers in Astronomy SPH? GPU? Xeon Phi? Gadget2

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

Bring your application to a new era:

Bring your application to a new era: Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

OpenMP dynamic loops. Paolo Burgio.

OpenMP dynamic loops. Paolo Burgio. OpenMP dynamic loops Paolo Burgio paolo.burgio@unimore.it Outline Expressing parallelism Understanding parallel threads Memory Data management Data clauses Synchronization Barriers, locks, critical sections

More information

VisionCPP Mehdi Goli

VisionCPP Mehdi Goli VisionCPP Mehdi Goli Motivation Computer vision Different areas Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. Operations Large size of data Sequence of operations

More information

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Achieving High Performance Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Does Instruction Set Matter? We find that ARM and x86 processors are simply engineering design points optimized

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 9 SIMD vectorization using #pragma omp simd force

More information