Vc: Portable and Easy SIMD Programming with C++
1 Vc: Portable and Easy SIMD Programming with C++. Matthias Kretz, Frankfurt Institute, Institute for Computer Science, Goethe University Frankfurt. May 19th, 2014. HGS-HIRe Helmholtz Graduate School for Hadron and Ion Research. Slides (CC BY-NC-SA 2.0).
2 Matthias Kretz (compeng.uni-frankfurt.de), May 19th, 2014. Outline: SIMD, Vc: Vector Types, Mandelbrot.
3 [Image by Kevin Dooley (CC BY 2.0)]
4 Do you code?
5 for (int i = 0; i < N - 1; ++i) { dx[i] = x[i + 1] - x[i]; }
6 Scalar execution, one element per iteration: i = 0; loop: load x[i+1]; load x[i]; compute x[i+1] - x[i]; store dx[i]; i += 1; if (i < N - 1) goto loop. Source: for (int i = 0; i < N - 1; ++i) { dx[i] = x[i + 1] - x[i]; }
7 Vectorization: multiple operations in one instruction.
8 Vectorized execution, W elements per iteration: i = 0; loop: load x[i+1], ..., x[i+W]; load x[i], ..., x[i+W-1]; compute the W differences in one instruction; store dx[i], ..., dx[i+W-1]; i += W; if (i < N - 1) goto loop. Source: for (int i = 0; i < N - 1; ++i) { dx[i] = x[i + 1] - x[i]; }
9 SIMD: Single Instruction Multiple Data. One subtraction instruction computes (u1 - u0, u2 - u1, u3 - u2, u4 - u3) from the lane vectors (u1, u2, u3, u4) and (u0, u1, u2, u3).
10 SIMD is synchronous parallelism: think of W threads executing in lock-step.
11 How much did you miss? It depends on the data type and the target CPU: a factor of 2, 4, 8, 16, 32, ...
12 [Chart: single-precision FLOP/cycle by year, scalar peak]
13 [Chart: single-precision FLOP/cycle by year, scalar vs. SIMD peak (up to 32 FLOP/cycle)]
14 Why don't you use SIMD?
15 How could you? Programming languages only provide scalar types and operations.
16 template <typename T> static inline Vector<T> calc(Vector<T> x) {
  typedef Vector<T> V;
  typedef typename V::Mask M;
  typedef Const<T> C;
  const M invalidMask = x < V::Zero();
  const M infinityMask = x == V::Zero();
  const M denormal = x <= C::min();
  x(denormal) *= V(Vc_buildDouble(1, 0, 54));
  V exponent = x.exponent();
  exponent(denormal) -= 54;
  x.setZero(C::exponentMask());
  x |= C::_1_2();
  const M smallX = x < C::_1_sqrt2();
  x(smallX) += x;
  x -= V::One();
  exponent(!smallX) += V::One();
  ...
17 Fundamental types in C++ map to hardware (registers/instructions).
18 But SIMD hardware does not map to C++ types. I work on fixing this issue.
19 Vc is a zero-overhead wrapper around intrinsic SIMD functions with greatly improved syntax and semantics. Recognizable types: float_v, double_v, int_v, ushort_v. Operator overloads: a + b * b. Masks and value vectors are different types: float_m mask = a < b;
20 The Idea:
namespace AVX {
  template <typename T> class Vector {
    // target-specific data member (intrinsic)
  public:
    static constexpr size_t Size;
    ...
  };
  typedef Vector<float> float_v;
  typedef Vector<int> int_v;
  ...
}
24 The Idea (2):
namespace MIC { ... }
namespace SSE { ... }
namespace Scalar { ... }
...
25 The Idea (3):
namespace Vc {
  using AVX::Vector;
  using AVX::float_v;
  using AVX::int_v;
  ...
}
26 Portability Concerns. How to fill a float_v?
void f(float *data) {
  float_v x;
  x.load(data, Vc::Unaligned);
  ...
}
float_v::size (W) could be 1, 2, 4, 8, 16, ... Therefore you cannot assign float_v x = {1, 2, 3, 4}; and a load requires W consecutive values in memory. Alignment: the guaranteed power-of-two factor in the prime factorization of a pointer address, i.e. the number of least-significant address bits that are zero.
29 Infix Operators:
float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);
The constructor initializes one SIMD register of target-specific size W with W consecutive values starting from array[offset]. The expression multiplies the W values in x by 2 and adds 1: the integral 2 (and 1) is broadcast to a floating-point SIMD register, and a fused multiply-add instruction is used if the target supports it. The store writes the W values from the SIMD register back, overwriting W values in array.
32 Container Interface:
float_v x = ...;
for (size_t i = 0; i < float_v::size; ++i) { x[i] += i; }
or with C++11 and latest Vc:
float_v x = ...;
int i = 0;
for (auto &scalar : x) { scalar += i++; }
33 No Implicit Context. With SIMD types you are always in scalar context (in contrast to vector loops); all SIMD context is attached to the type!
x = x * 2 + 1;                 // SIMD operations
if (any_of(x < 0.f)) {         // scalar decision
  throw runtime_error("unexpected result");
}
x = sin(x);                    // more SIMD operations
auto scalar = x.sum();         // SIMD reduction
Guess what happens if you throw inside a vector loop.
37 Portable Masking:
phi(phi < 0.f) += 360.f;
// or: where(phi < 0.f) phi += 360.f;
equivalent to:
for (auto &phi_entry : phi) {
  if (phi_entry < 0.f) { phi_entry += 360.f; }
}
but optimized for the target's SIMD instruction set.
38 Marketing Speak: Vc makes SIMD programming intuitive, portable, and fast! And free & open: LGPL licensed (BSD for Vc 1.0).
39 Outline: SIMD, Vc: Vector Types, Mandelbrot.
40 Example: Mandelbrot. Mandelbrot iteration: z_{n+1} = z_n^2 + c, with z_0 = 0 and c given by the canvas coordinates. Iterate until either |z_n|^2 >= 4 or n > n_max; n determines the color.
41 Mandelbrot with Vc:
for (int y = 0; y < imageHeight; ++y) {
  const float_v c_imag = y0 + y * scale;
  for (int_v x = int_v::IndexesFromZero(); any_of(x < imageWidth); x += float_v::size) {
    const std::complex<float_v> c(x0 + x * scale, c_imag);
    std::complex<float_v> z = c;
    int_v n = 0;
    int_m inside = norm(z) < 4.f;
    while (any_of(inside && n < 255)) {
      z = z * z + c;
      n(inside) += 1;
      inside = norm(z) < 4.f;
    }
    int_v colorValue = n;
    colorizeNextPixels(colorValue);
  }
}
51 [Chart: Mandelbrot speedup over Vc::Scalar vs. image size (width/3 = height/2 pixels) for Vc::AVX, Vc::SSE (ternary), Vc::SSE (binary), Vc::Scalar]
52 Outline: SIMD, Vc: Vector Types, Mandelbrot.
53 Abstractions. [Image by INABA Tomoaki (CC)]
54 Auto-Vectorization Overview: the compiler recognizes data parallelism, and modern compilers are impressively smart. But auto-vectorization is tightly coupled to loops; the language standard requires such a transformation to keep the semantics of the original code unchanged; complexity increases with the number of involved data structures; and function calls (and therefore abstraction) can inhibit auto-vectorization.
for (int i = 0; i < N; ++i) { a[i] += b[i]; }
55 Auto-Vectorization Improvements: auto-vectorization capabilities are constantly being improved, but no breakthrough can be expected; compiler writers rather turn to explicit loop vectorization.
56 Intrinsics Overview: functions that wrap instructions; very target-specific; "inline assembly on steroids": the compiler does register allocation and can (in theory) optimize as well as with scalar builtin types.
for (int i = 0; i < N; i += 8) {
  _mm256_store_ps(&a[i], _mm256_add_ps(_mm256_load_ps(&a[i]), _mm256_load_ps(&b[i])));
}
57 Intrinsics Improvements? New instructions mean new intrinsics; existing intrinsics must keep source compatibility (and stay C interfaces), so no improvements are possible. GCC and Clang have a nicer alternative: the vector attribute, with infix notation, subscripting, builtins, and better optimization opportunities (the compiler sees more of the developer's intent).
58 Vector Loops Overview: #pragma vector with ICC, #pragma omp simd with OpenMP 4 compatible compilers. The loop transformations are similar to auto-vectorization; the difference is that the concurrent execution semantics are explicit, so the compiler does not have to prove that scalar and vector execution are equivalent.
#pragma omp simd
for (int i = 0; i < N; ++i) { a[i] += b[i]; }
59 Vector Loops Challenges: special semantics inside vector loops; cannot use exceptions; cannot do thread synchronization; function calls require annotated functions; many more (important) arguments to the #pragma: safelen(length), linear(list[:linear-step]), aligned(list[:alignment]), private(list), lastprivate(list), reduction(operator:list), collapse(n). Part of the algorithm's logic may therefore appear in the #pragma.
60 SIMT: the vector loops for GPU programming; all code implicitly runs in SIMD context with an attached index that signifies the SIMD lane. Think OpenCL for x86 families (CPU or Xeon Phi).
61 SIMDized Containers: containers with overloaded operators; each operation semantically acts on all entries of the container, without any specific ordering. std::valarray is such a class, but it is somewhat abandoned, runtime sized/allocated, cache inefficient, and maps suboptimally onto the SIMD width.
std::valarray<float> a(n), b(n);
a += b;
62 Array Notation: Intel Cilk Plus (known from Fortran): a[:] += b[:];
63 SIMD Types: types for SIMD registers and operations, with a target-specific SIMD type width.
for (int i = 0; i < (N / float_v::size); ++i) { a[i] += b[i]; }
Implementations (sorted by initial release): Vc, boost::simd (not in Boost, part of NT²), Prof. Agner Fog's vector classes, libsimdpp.
64 Future of the C++ Standard: everything is still open; maybe two approaches, for high-level and low-level needs: vector loops and SIMD types.
Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
More informationIntel Array Building Blocks
Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call
More informationtrisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee
trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define
More informationCS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009.
Parallel Programming Lecture 9: Task Parallelism in OpenMP Administrative Programming assignment 1 is posted (after class) Due, Tuesday, September 22 before class - Use the handin program on the CADE machines
More informationHigh Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward.
High Performance Parallel Programming Multicore development tools with extensions to many-core. Investment protection. Scale Forward. Enabling & Advancing Parallelism High Performance Parallel Programming
More informationHPC Practical Course Part 3.1 Open Multi-Processing (OpenMP)
HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) V. Akishina, I. Kisel, G. Kozlov, I. Kulakov, M. Pugach, M. Zyzak Goethe University of Frankfurt am Main 2015 Task Parallelism Parallelization
More informationOpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu
OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationIntel Software Development Products for High Performance Computing and Parallel Programming
Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN
More informationIntel Advisor XE. Vectorization Optimization. Optimization Notice
Intel Advisor XE Vectorization Optimization 1 Performance is a Proven Game Changer It is driving disruptive change in multiple industries Protecting buildings from extreme events Sophisticated mechanics
More informationCSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization
CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation
More informationCompiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list
Compiling for Scalable Computing Systems the Merit of SIMD Ayal Zaks Intel Corporation Acknowledgements: too many to list Lo F IRST, Thanks for the Technion For Inspiration and Recognition of Science and
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationVECTORISATION. Adrian
VECTORISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Vectorisation Same operation on multiple data items Wide registers SIMD needed to approach FLOP peak performance, but your code must be
More informationOpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN
OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationIntel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant
Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor
More informationVectorization on KNL
Vectorization on KNL Steve Lantz Senior Research Associate Cornell University Center for Advanced Computing (CAC) steve.lantz@cornell.edu High Performance Computing on Stampede 2, with KNL, Jan. 23, 2017
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationOpenMP C and C++ Application Program Interface Version 1.0 October Document Number
OpenMP C and C++ Application Program Interface Version 1.0 October 1998 Document Number 004 2229 001 Contents Page v Introduction [1] 1 Scope............................. 1 Definition of Terms.........................
More informationScheduling Image Processing Pipelines
Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationCOMP Parallel Computing. Programming Accelerators using Directives
COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator
More informationOpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen
OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,
More informationSYCL-based Data Layout Abstractions for CPU+GPU Codes
DHPCC++ 2018 Conference St Catherine's College, Oxford, UK May 14th, 2018 SYCL-based Data Layout Abstractions for CPU+GPU Codes Florian Wende, Matthias Noack, Thomas Steinke Zuse Institute Berlin Setting
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationA short stroll along the road of OpenMP* 4.0
* A short stroll along the road of OpenMP* 4.0 Christian Terboven Michael Klemm Members of the OpenMP Language Committee 1 * Other names and brands
More informationA New Vectorization Technique for Expression Templates in C++ arxiv: v1 [cs.ms] 6 Sep 2011
A New Vectorization Technique for Expression Templates in C++ arxiv:1109.1264v1 [cs.ms] 6 Sep 2011 J. Progsch, Y. Ineichen and A. Adelmann October 28, 2018 Abstract Vector operations play an important
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationNow What? Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved..
Now What? Sean Parent Principal Scientist Beauty 2 C++11 Standard 1338 Pages 2012 Adobe Systems Incorporated. All Rights Reserved. 3 C++11 Standard 1338 Pages C++98 Standard 757 Pages 2012 Adobe Systems
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationIntel Array Building Blocks (Intel ArBB) Technical Presentation
Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance
More informationOpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means
High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationExtending C++ for Explicit Data-Parallel Programming via SIMD Vector Types
Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften vorgelegt beim Fachbereich 12 der Johann Wolfgang Goethe-Universität
More informationShared Memory programming paradigm: openmp
IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM
More informationSIMD in Scientific Computing
SIMD in Scientific Computing Tim Haines (terminal) PhD Candidate University of Wisconsin-Madison Department of Astronomy State of Tree(PM)-based N-Body solvers in Astronomy SPH? GPU? Xeon Phi? Gadget2
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationBring your application to a new era:
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationOpenMP dynamic loops. Paolo Burgio.
OpenMP dynamic loops Paolo Burgio paolo.burgio@unimore.it Outline Expressing parallelism Understanding parallel threads Memory Data management Data clauses Synchronization Barriers, locks, critical sections
More informationVisionCPP Mehdi Goli
VisionCPP Mehdi Goli Motivation Computer vision Different areas Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. Operations Large size of data Sequence of operations
More informationAchieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013
Achieving High Performance Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Does Instruction Set Matter? We find that ARM and x86 processors are simply engineering design points optimized
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 9 SIMD vectorization using #pragma omp simd force
More information