Vc: Portable and Easy SIMD Programming with C++


1 Vc: Portable and Easy SIMD Programming with C++
Matthias Kretz
Frankfurt Institute for Advanced Studies, Institute for Computer Science, Goethe University Frankfurt
May 19th, 2014
HGS-HIRe Helmholtz Graduate School for Hadron and Ion Research
Matthias Kretz (CC BY-NC-SA 2.0)

2 Matthias Kretz (compeng.uni-frankfurt.de) May 19th, 2014
Outline: SIMD, Vc: Vector Types, Mandelbrot

3 [image by Kevin Dooley (CC BY 2.0)]

4 Do you code?

5
for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

6
i = 0
loop:
    load x[i+1]
    load x[i]
    x[i+1] - x[i]
    store dx[i]
    i += 1
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

7 multiple operations in one instruction

8
i = 0
loop:
    load x[1], x[2], ..., x[W]
    load x[0], x[1], ..., x[W-1]
    x[1] - x[0], x[2] - x[1], ..., x[W] - x[W-1]
    store dx[0], dx[1], ..., dx[W-1]
    i += W
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

9 SIMD: Single Instruction Multiple Data
[diagram: lane-wise subtraction of two vectors, u1 - u0, u2 - u1, u3 - u2, u4 - u3 computed in one instruction]

10 SIMD is synchronous parallelism
Think of W threads executing in lock-step

11 How much did you miss?
Depends on the data type and target CPU: a factor of 2, 4, 8, 16, 32, ...?

12 [plot: FLOP/cycle (single-precision float) by year]

13 [plot: FLOP/cycle (single-precision float) by year, Scalar vs. SIMD; SIMD reaches 32]

14 Why don't you use SIMD?

15 How could you? Programming languages only provide scalar types and operations.

16
template <typename T>
static inline Vector<T> calc(Vector<T> x)
{
    typedef Vector<T> V;
    typedef typename V::Mask M;
    typedef Const<T> C;
    const M invalidMask = x < V::Zero();
    const M infinityMask = x == V::Zero();
    const M denormal = x <= C::min();
    x(denormal) *= V(Vc_buildDouble(1, 0, 54));
    V exponent = x.exponent();
    exponent(denormal) -= 54;
    x.setZero(C::exponentMask());
    x |= C::_1_2();
    const M smallX = x < C::_1_sqrt2();
    x(smallX) += x;
    x -= V::One();
    exponent(!smallX) += V::One();
    ...

17 fundamental types in C++ map to hardware (registers/instructions)

18 but SIMD hardware does not map to C++ types
I work on fixing this issue

19 Vc is a zero-overhead wrapper around intrinsic SIMD functions, with greatly improved syntax and semantics:
recognizable types: float_v, double_v, int_v, ushort_v
operator overloads: a + b * b
masks and value vectors are different types: float_m mask = a < b;

20 The Idea

namespace AVX {
template <typename T> class Vector {
    // target-specific data member (intrinsic)
public:
    static constexpr size_t Size;
    ...
};
typedef Vector<float> float_v;
typedef Vector<int> int_v;
...
}


24 The Idea (2)

namespace MIC { ... }
namespace SSE { ... }
namespace Scalar { ... }
...

25 The Idea (3)

namespace Vc {
using AVX::Vector;
using AVX::float_v;
using AVX::int_v;
...
}

26 Portability Concerns: How to fill a float_v?

void f(float *data) {
    float_v x;
    x.load(data, Vc::Unaligned);
    ...
}

float_v::size (W) could be 1, 2, 4, 8, 16, ...
Cannot assign float_v x = {1, 2, 3, 4};
Requires W consecutive values in memory for a load
Alignment: the guaranteed power-of-two factor in the prime factorization of a pointer address (i.e. the number of least significant bits that are zero)


29 Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

initialize one SIMD register of target-specific size W with W consecutive values starting from array[offset]

30 Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

multiply W values in x by 2 and add 1
broadcast the integral 2 (and 1) to a floating-point SIMD register
use a fused multiply-add instruction if supported by the target

31 Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

store W values from the SIMD register, overwriting W values in array

32 Container Interface

float_v x = ...;
for (size_t i = 0; i < float_v::size; ++i) {
    x[i] += i;
}

or with C++11 and the latest Vc:

float_v x = ...;
int i = 0;
for (auto &scalar : x) {
    scalar += ++i;
}

33 No Implicit Context

With SIMD types you are always in scalar context, in contrast to vector loops; all SIMD context is attached to the type!

x = x * 2 + 1;              // SIMD operations
if (any_of(x < 0.f)) {      // scalar decision
    throw runtime_error("unexpected result");
}
x = sin(x);                 // more SIMD operations
auto scalar = x.sum();      // SIMD reduction

Guess what happens if you throw inside a vector loop...


37 Portable Masking

phi(phi < 0.f) += 360.f;
// or: where(phi < 0.f) phi += 360.f;

equivalent to:

for (auto &phi_entry : phi) {
    if (phi_entry < 0.f) {
        phi_entry += 360.f;
    }
}

but optimized for the target's SIMD instruction set

38 Marketing Speak
Vc makes SIMD programming intuitive, portable, and fast!
And free & open: LGPL licensed (BSD for Vc 1.0)

39 Outline: SIMD, Vc: Vector Types, Mandelbrot

40 Example: Mandelbrot

Mandelbrot iteration: z_{n+1} = z_n^2 + c, with z_0 = 0 and c = canvas coordinates
Iterate until either |z_n|^2 >= 4 or n > n_max
n determines the color

41 Mandelbrot with Vc

for (int y = 0; y < imageHeight; ++y) {
    const float_v c_imag = y0 + y * scale;
    for (int_v x = int_v::IndexesFromZero(); any_of(x < imageWidth);
            x += float_v::size) {
        const std::complex<float_v> c(x0 + x * scale, c_imag);
        std::complex<float_v> z = c;
        int_v n = 0;
        int_m inside = norm(z) < 4.f;
        while (any_of(inside && n < 255)) {
            z = z * z + c;
            n(inside) += 1;
            inside = norm(z) < 4.f;
        }
        int_v colorValue = n;
        colorizeNextPixels(colorValue);
    }
}


51 Mandelbrot Speedup
[plot: speedup vs. image size (width/3 = height/2 pixels) for Vc::AVX, Vc::SSE (ternary), Vc::SSE (binary), Vc::Scalar]

52 Outline: SIMD, Vc: Vector Types, Mandelbrot

53 Abstractions
[image by INABA Tomoaki (CC)]

54 Auto-Vectorization Overview

the compiler recognizes data parallelism
modern compilers are impressively smart!
but:
tightly coupled to loops
the language standard requires such a transformation to keep the semantics of the original code unchanged
a growing number of involved data structures increases complexity
function calls (and therefore abstraction) can inhibit auto-vectorization

for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}

55 Auto-Vectorization Improvements

Auto-vectorization capabilities are constantly being improved
No breakthrough can be expected
Compiler writers rather turn to explicit loop vectorization

56 Intrinsics Overview

functions that wrap instructions
very target specific
inline assembly on steroids:
the compiler does register allocation
the compiler can (in theory) optimize as well as with scalar builtin types

for (int i = 0; i < N; i += 8) {
    _mm256_store_ps(&a[i], _mm256_add_ps(_mm256_load_ps(&a[i]),
                                         _mm256_load_ps(&b[i])));
}

57 Intrinsics Improvements?

New instructions require new intrinsics
existing intrinsics must keep source compatibility (and stay C interfaces)
so no improvements are possible there
GCC and Clang have a nicer alternative: the vector attribute
infix notation, subscripting, builtins
better optimization opportunities (the compiler sees more of the developer's intent)

58 Vector Loops Overview

#pragma vector with ICC
#pragma omp simd with OpenMP 4 compatible compilers
loop transformations similar to auto-vectorization
difference: the concurrent execution semantics are explicit
the compiler does not have to prove that scalar and vector execution are equivalent

#pragma omp simd
for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}

59 Vector Loops Challenges

special semantics inside vector loops:
cannot use exceptions
cannot do thread synchronization
function calls require annotated functions
many more (important) arguments to the #pragma:
safelen(length), linear(list[:linear-step]), aligned(list[:alignment]), private(list), lastprivate(list), reduction(operator:list), collapse(n)
part of the algorithm's logic may therefore appear in the #pragma

60 SIMT

The vector loops for GPU programming
all code implicitly runs in SIMD context, with an attached index that signifies the SIMD lane
Think OpenCL for x86 families (CPU or Xeon Phi)

61 SIMDized Containers

containers with overloaded operators
each operation semantically acts on all entries of the container, without any specific ordering
std::valarray is such a class:
somewhat abandoned
runtime sized / allocated
cache inefficient
suboptimal mapping onto the SIMD width

std::valarray<float> a(n), b(n);
a += b;

62 Array Notation

Intel Cilk Plus (known from Fortran)

a[:] += b[:];

63 SIMD Types

types for SIMD registers and operations
target-specific SIMD type width

for (int i = 0; i < (N / float_v::size); ++i) {
    a[i] += b[i];
}

Implementations (sorted by initial release):
Vc
boost::simd (not in Boost, part of NT²)
Prof. Agner Fog's vector classes
libsimdpp

64 Future of the C++ Standard

everything is still open
maybe two approaches, for high-level and low-level needs:
Vector Loops
SIMD Types


More information

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only

More information

Improving graphics processing performance using Intel Cilk Plus

Improving graphics processing performance using Intel Cilk Plus Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming

AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming David Pfander, Malte Brunn, Dirk Pflüger University of Stuttgart, Germany May 25, 2018 Vancouver, Canada, iwapt18 May 25, 2018 Vancouver,

More information

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016 Copyright Khronos Group, 2014 - Page 1 SYCL SG14, February 2016 BOARD OF PROMOTERS Over 100 members worldwide any company is welcome to join Copyright Khronos Group 2014 SYCL 1. What is SYCL for and what

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

Pablo Halpern Parallel Programming Languages Architect Intel Corporation

Pablo Halpern Parallel Programming Languages Architect Intel Corporation Pablo Halpern Parallel Programming Languages Architect Intel Corporation CppCon, 8 September 2014 This work by Pablo Halpern is licensed under a Creative Commons Attribution

More information

Vectorization with Haswell. and CilkPlus. August Author: Fumero Alfonso, Juan José. Supervisor: Nowak, Andrzej

Vectorization with Haswell. and CilkPlus. August Author: Fumero Alfonso, Juan José. Supervisor: Nowak, Andrzej Vectorization with Haswell and CilkPlus August 2013 Author: Fumero Alfonso, Juan José Supervisor: Nowak, Andrzej CERN openlab Summer Student Report 2013 Project Specification This project concerns the

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

Growth in Cores - A well rehearsed story

Growth in Cores - A well rehearsed story Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

More information

Intel Array Building Blocks

Intel Array Building Blocks Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call

More information

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define

More information

CS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009.

CS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009. Parallel Programming Lecture 9: Task Parallelism in OpenMP Administrative Programming assignment 1 is posted (after class) Due, Tuesday, September 22 before class - Use the handin program on the CADE machines

More information

High Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward.

High Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward. High Performance Parallel Programming Multicore development tools with extensions to many-core. Investment protection. Scale Forward. Enabling & Advancing Parallelism High Performance Parallel Programming

More information

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP)

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) V. Akishina, I. Kisel, G. Kozlov, I. Kulakov, M. Pugach, M. Zyzak Goethe University of Frankfurt am Main 2015 Task Parallelism Parallelization

More information

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs

More information

Introduction to Runtime Systems

Introduction to Runtime Systems Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents

More information

Intel Software Development Products for High Performance Computing and Parallel Programming

Intel Software Development Products for High Performance Computing and Parallel Programming Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN

More information

Intel Advisor XE. Vectorization Optimization. Optimization Notice

Intel Advisor XE. Vectorization Optimization. Optimization Notice Intel Advisor XE Vectorization Optimization 1 Performance is a Proven Game Changer It is driving disruptive change in multiple industries Protecting buildings from extreme events Sophisticated mechanics

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list Compiling for Scalable Computing Systems the Merit of SIMD Ayal Zaks Intel Corporation Acknowledgements: too many to list Lo F IRST, Thanks for the Technion For Inspiration and Recognition of Science and

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

VECTORISATION. Adrian

VECTORISATION. Adrian VECTORISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Vectorisation Same operation on multiple data items Wide registers SIMD needed to approach FLOP peak performance, but your code must be

More information

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor

More information

Vectorization on KNL

Vectorization on KNL Vectorization on KNL Steve Lantz Senior Research Associate Cornell University Center for Advanced Computing (CAC) steve.lantz@cornell.edu High Performance Computing on Stampede 2, with KNL, Jan. 23, 2017

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

OpenMP C and C++ Application Program Interface Version 1.0 October Document Number

OpenMP C and C++ Application Program Interface Version 1.0 October Document Number OpenMP C and C++ Application Program Interface Version 1.0 October 1998 Document Number 004 2229 001 Contents Page v Introduction [1] 1 Scope............................. 1 Definition of Terms.........................

More information

Scheduling Image Processing Pipelines

Scheduling Image Processing Pipelines Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];

More information

COMP Parallel Computing. Programming Accelerators using Directives

COMP Parallel Computing. Programming Accelerators using Directives COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator

More information

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,

More information

SYCL-based Data Layout Abstractions for CPU+GPU Codes

SYCL-based Data Layout Abstractions for CPU+GPU Codes DHPCC++ 2018 Conference St Catherine's College, Oxford, UK May 14th, 2018 SYCL-based Data Layout Abstractions for CPU+GPU Codes Florian Wende, Matthias Noack, Thomas Steinke Zuse Institute Berlin Setting

More information

Allows program to be incrementally parallelized

Allows program to be incrementally parallelized Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

A short stroll along the road of OpenMP* 4.0

A short stroll along the road of OpenMP* 4.0 * A short stroll along the road of OpenMP* 4.0 Christian Terboven Michael Klemm Members of the OpenMP Language Committee 1 * Other names and brands

More information

A New Vectorization Technique for Expression Templates in C++ arxiv: v1 [cs.ms] 6 Sep 2011

A New Vectorization Technique for Expression Templates in C++ arxiv: v1 [cs.ms] 6 Sep 2011 A New Vectorization Technique for Expression Templates in C++ arxiv:1109.1264v1 [cs.ms] 6 Sep 2011 J. Progsch, Y. Ineichen and A. Adelmann October 28, 2018 Abstract Vector operations play an important

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

Now What? Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved..

Now What? Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved.. Now What? Sean Parent Principal Scientist Beauty 2 C++11 Standard 1338 Pages 2012 Adobe Systems Incorporated. All Rights Reserved. 3 C++11 Standard 1338 Pages C++98 Standard 757 Pages 2012 Adobe Systems

More information

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome

More information

Intel Array Building Blocks (Intel ArBB) Technical Presentation

Intel Array Building Blocks (Intel ArBB) Technical Presentation Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types

Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften vorgelegt beim Fachbereich 12 der Johann Wolfgang Goethe-Universität

More information

Shared Memory programming paradigm: openmp

Shared Memory programming paradigm: openmp IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM

More information

SIMD in Scientific Computing

SIMD in Scientific Computing SIMD in Scientific Computing Tim Haines (terminal) PhD Candidate University of Wisconsin-Madison Department of Astronomy State of Tree(PM)-based N-Body solvers in Astronomy SPH? GPU? Xeon Phi? Gadget2

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

Bring your application to a new era:

Bring your application to a new era: Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

OpenMP dynamic loops. Paolo Burgio.

OpenMP dynamic loops. Paolo Burgio. OpenMP dynamic loops Paolo Burgio paolo.burgio@unimore.it Outline Expressing parallelism Understanding parallel threads Memory Data management Data clauses Synchronization Barriers, locks, critical sections

More information

VisionCPP Mehdi Goli

VisionCPP Mehdi Goli VisionCPP Mehdi Goli Motivation Computer vision Different areas Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. Operations Large size of data Sequence of operations

More information

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Achieving High Performance Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Does Instruction Set Matter? We find that ARM and x86 processors are simply engineering design points optimized

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 9 SIMD vectorization using #pragma omp simd force

More information