Vc: Portable and Easy SIMD Programming with C++


Vc: Portable and Easy SIMD Programming with C++
Matthias Kretz
Frankfurt Institute for Advanced Studies, Institute for Computer Science, Goethe University Frankfurt
May 19th, 2014
HGS-HIRe Helmholtz Graduate School for Hadron and Ion Research
Matthias Kretz (CC BY-NC-SA 2.0)

Matthias Kretz (compeng.uni-frankfurt.de), May 19th, 2014
Outline: SIMD · Vc: Vector Types · Mandelbrot

[Photo by Kevin Dooley (CC BY 2.0)]

Do you code?

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

The scalar loop, instruction by instruction:

    i = 0
loop:
    load x[i + 1]
    load x[i]
    x[i + 1] - x[i]
    store dx[i]
    i += 1
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

Multiple operations in one instruction:

The same loop vectorized, W elements per iteration:

    i = 0
loop:
    load x[i + 1], ..., x[i + W]
    load x[i], ..., x[i + W - 1]
    x[i + 1] - x[i], ..., x[i + W] - x[i + W - 1]
    store dx[i], ..., dx[i + W - 1]
    i += W
    if (i < N - 1) goto loop

for (int i = 0; i < N - 1; ++i) {
    dx[i] = x[i + 1] - x[i];
}

SIMD: Single Instruction, Multiple Data
[Diagram: one vector instruction subtracts (u0, u1, u2, u3) from (u1, u2, u3, u4) element-wise, producing (u1-u0, u2-u1, u3-u2, u4-u3).]

SIMD is synchronous parallelism: think of W threads executing in lock-step.
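The lock-step model can be sketched in plain C++ (this is an illustration, not Vc code; W = 4 and the function name are assumptions): the inner loop over l plays the role of W "threads" all executing the same subtraction at the same time.

```cpp
#include <cassert>

// Sketch (plain C++, not Vc): emulate W = 4 SIMD lanes in lock-step.
// Each "lane" l executes the same operation on its own element, just
// as W threads running in lock-step would.
constexpr int W = 4;

void diff_lockstep(const float* x, float* dx, int n) {
    int i = 0;
    for (; i + W <= n - 1; i += W) {
        // one "vector instruction": W subtractions happen together
        for (int l = 0; l < W; ++l) {
            dx[i + l] = x[i + l + 1] - x[i + l];
        }
    }
    for (; i < n - 1; ++i) {  // scalar remainder
        dx[i] = x[i + 1] - x[i];
    }
}
```

A real SIMD implementation replaces the inner loop over l with a single instruction.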

How much did you miss? It depends on the data type and the target CPU: a factor of 2, 4, 8, 16, 32, ...

[Plot: single-precision FLOP/cycle per core, 1998 to 2014. Scalar code stays flat at a few FLOP/cycle, while SIMD grows to 32 FLOP/cycle on recent CPUs.]

Why don't you use SIMD?

How could you? Programming languages only provide scalar types and operations.

template <typename T> static inline Vector<T> calc(Vector<T> x) {
    typedef Vector<T> V;
    typedef typename V::Mask M;
    typedef Const<T> C;
    const M invalidMask = x < V::Zero();
    const M infinityMask = x == V::Zero();
    const M denormal = x <= C::min();
    x(denormal) *= V(Vc_buildDouble(1, 0, 54));
    V exponent = x.exponent();
    exponent(denormal) -= 54;
    x.setZero(C::exponentMask());
    x |= C::_1_2();
    const M smallX = x < C::_1_sqrt2();
    x(smallX) += x;
    x -= V::One();
    exponent(!smallX) += V::One();
    ...

Fundamental types in C++ map to hardware (registers/instructions) ...

... but SIMD hardware does not map to C++ types. I work on fixing this issue.

Vc is a zero-overhead wrapper around SIMD intrinsics with greatly improved syntax and semantics:
- recognizable types: float_v, double_v, int_v, ushort_v
- operator overloads: a + b * b
- masks and value vectors are different types: float_m mask = a < b;

The Idea

namespace AVX {
    template <typename T> class Vector {
        // target-specific data member (intrinsic)
    public:
        static constexpr size_t Size;
        ...
    };
    typedef Vector<float> float_v;
    typedef Vector<int> int_v;
    ...
}


The Idea (2)

namespace MIC { ... }
namespace SSE { ... }
namespace Scalar { ... }
...

The Idea (3)

namespace Vc {
    using AVX::Vector;
    using AVX::float_v;
    using AVX::int_v;
    ...
}

Portability Concerns: How to fill a float_v?

void f(float *data) {
    float_v x;
    x.load(data, Vc::Unaligned);
    ...
}

- float_v::Size (W) could be 1, 2, 4, 8, 16, ...
- Cannot portably assign float_v x = {1, 2, 3, 4};
- A load requires W consecutive values in memory
- Alignment: a guaranteed power-of-two factor in the pointer address, i.e. the number of least significant bits that are zero

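The alignment definition above can be made concrete in standard C++. This is a sketch, not Vc API: is_aligned() and alloc_vector_storage() are hypothetical helpers, and the 32-byte alignment stands in for what an AVX float_v would need.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Sketch: "aligned to A bytes" means the address is a multiple of A,
// i.e. its low log2(A) bits are zero. is_aligned() and
// alloc_vector_storage() are hypothetical helpers, not Vc API.
bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

float* alloc_vector_storage(std::size_t n_floats, std::size_t alignment) {
    // std::aligned_alloc requires the size to be a multiple of alignment
    const std::size_t bytes =
        ((n_floats * sizeof(float) + alignment - 1) / alignment) * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}
```

With such storage, an aligned (and typically faster) load can be used instead of Vc::Unaligned.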

Infix Operators

float_v x(&array[offset]);
x = x * 2 + 1;
x.store(&array[offset]);

- the constructor initializes one SIMD register of target-specific size W with W consecutive values starting at array[offset]
- x * 2 + 1 multiplies the W values in x by 2 and adds 1; the integral constants 2 and 1 are broadcast into floating-point SIMD registers, and a fused multiply-add instruction is used if the target supports it
- store writes the W values from the SIMD register back, overwriting W values in array

Container Interface

float_v x = ...;
for (size_t i = 0; i < float_v::Size; ++i) {
    x[i] += i;
}

or, with C++11 and the latest Vc:

float_v x = ...;
int i = 0;
for (auto &scalar : x) {
    scalar += ++i;
}

No Implicit Context

With SIMD types you are always in scalar context, in contrast to vector loops: all SIMD context is attached to the type!

x = x * 2 + 1;              // SIMD operations
if (any_of(x < 0.f)) {      // scalar decision
    throw runtime_error("unexpected result");
}
x = sin(x);                 // more SIMD operations
auto scalar = x.sum();      // SIMD reduction

Guess what happens if you throw inside a vector loop.


Portable Masking

phi(phi < 0.f) += 360.f;
// or: where(phi < 0.f) phi += 360.f;

equivalent to:

for (auto &phi_entry : phi) {
    if (phi_entry < 0.f) {
        phi_entry += 360.f;
    }
}

but optimized for the target's SIMD instruction set.

Marketing Speak: Vc makes SIMD programming intuitive, portable, and fast! And free & open: LGPL licensed (BSD for Vc 1.0).


Example: Mandelbrot

Mandelbrot iteration: z_{n+1} = z_n² + c, with z_0 = 0 and c given by the canvas coordinates. Iterate until either |z_n|² ≥ 4 or n > n_max; n determines the color.
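A scalar reference for this iteration helps to read the vectorized version on the next slide; this is a sketch (the function name is an assumption, not from the talk), using std::norm for |z|².

```cpp
#include <cassert>
#include <complex>

// Scalar reference for the Mandelbrot iteration (not the Vc version):
// iterate z <- z^2 + c from z = 0 until |z|^2 >= 4 or n > n_max;
// the returned n determines the pixel color.
int mandelbrot_iterations(std::complex<float> c, int n_max) {
    std::complex<float> z = 0;
    int n = 0;
    while (std::norm(z) < 4.f && n <= n_max) {
        z = z * z + c;
        ++n;
    }
    return n;
}
```

The Vc version below runs this same loop for W canvas points at once, with a mask tracking which lanes are still inside.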

Mandelbrot with Vc

for (int y = 0; y < imageHeight; ++y) {
    const float_v c_imag = y0 + y * scale;
    for (int_v x = int_v::IndexesFromZero(); any_of(x < imageWidth);
            x += float_v::Size) {
        const std::complex<float_v> c(x0 + x * scale, c_imag);
        std::complex<float_v> z = c;
        int_v n = 0;
        int_m inside = norm(z) < 4.f;
        while (any_of(inside && n < 255)) {
            z = z * z + c;
            n(inside) += 1;
            inside = norm(z) < 4.f;
        }
        int_v colorValue = 255 - n;
        colorizeNextPixels(colorValue);
    }
}


Mandelbrot Speedup
[Plot: speedup (0 to 8) vs. image size (width/3 = height/2, 0 to 700 pixels) for the Vc::AVX, Vc::SSE (ternary), Vc::SSE (binary), and Vc::Scalar implementations.]


Abstractions (photo by INABA Tomoaki, CC BY-SA @flickr)

Auto-Vectorization Overview

- the compiler recognizes data parallelism; modern compilers are impressively smart!
- tightly coupled to loops
- the language standard requires that such a transformation keep the semantics of the original code unchanged
- the number of involved data structures increases complexity
- function calls (and therefore abstraction) can inhibit auto-vectorization

for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}
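The point about function calls can be illustrated with two versions of the same loop. This is a sketch for GCC/Clang (the noinline attribute and both function names are assumptions): the first form is one compilers typically auto-vectorize at -O2/-O3, while the call to an out-of-line function the compiler cannot see through typically inhibits it. Both compute identical results.

```cpp
#include <cassert>

// Sketch: abstraction inhibiting auto-vectorization.
// opaque_op() is hypothetical; noinline keeps the compiler from
// seeing through the call (GCC/Clang attribute, not standard C++).
__attribute__((noinline)) float opaque_op(float a, float b) {
    return a + b;
}

void vectorizable(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) a[i] += b[i];   // typically vectorized
}

void inhibited(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) a[i] = opaque_op(a[i], b[i]);  // typically not
}
```

Compiler reports (e.g. -fopt-info-vec on GCC) can confirm which loops were actually vectorized.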

Auto-Vectorization Improvements

- auto-vectorization capabilities are constantly being improved
- but no breakthrough can be expected
- compiler writers rather turn to explicit loop vectorization

Intrinsics Overview

- functions that wrap instructions; very target-specific
- inline assembly on steroids: the compiler does register allocation
- the compiler can (in theory) optimize as well as with scalar builtin types

for (int i = 0; i < N; i += 8) {
    _mm256_store_ps(&a[i], _mm256_add_ps(_mm256_load_ps(&a[i]),
                                         _mm256_load_ps(&b[i])));
}

Intrinsics Improvements?

- new instructions require new intrinsics; existing intrinsics must keep source compatibility (and stay C interfaces), so no improvements are possible
- GCC and Clang have a nicer alternative, the vector attribute: infix notation, subscripting, builtins, and better optimization opportunities (the compiler sees more of the developer's intent)
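A minimal sketch of the vector attribute mentioned above (GCC/Clang extension, not standard C++; the typedef name and function are assumptions): infix operators and subscripting work directly on the vector type.

```cpp
#include <cassert>

// GCC/Clang vector attribute: an 8-wide float vector (32 bytes).
// Infix operators compile to element-wise SIMD instructions.
typedef float v8sf __attribute__((vector_size(32)));

v8sf axpy(v8sf a, v8sf x, v8sf y) {
    return a * x + y;  // element-wise multiply-add across all 8 lanes
}
```

Compared to intrinsics, the compiler sees ordinary arithmetic here and can constant-fold, reassociate, and pick FMA instructions on its own.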

Vector Loops Overview

- #pragma vector with ICC
- #pragma omp simd with OpenMP 4 compatible compilers
- loop transformations similar to auto-vectorization
- difference: the concurrent execution semantics are explicit, so the compiler does not have to prove that scalar and vector execution are equivalent

#pragma omp simd
for (int i = 0; i < N; ++i) {
    a[i] += b[i];
}

Vector Loops Challenges

- special semantics inside vector loops: cannot use exceptions, cannot do thread synchronization, function calls require annotated functions
- many more (important) arguments to the #pragma: safelen(length), linear(list[:linear-step]), aligned(list[:alignment]), private(list), lastprivate(list), reduction(operator:list), collapse(n)
- part of the algorithm's logic may therefore appear in the #pragma
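The "annotated functions" requirement looks like this sketch (function names are assumptions): #pragma omp declare simd asks the compiler to also emit a SIMD variant of the function that vectorized loops can call. Without -fopenmp or -fopenmp-simd the pragmas are ignored and the code stays valid scalar C++.

```cpp
#include <cassert>

// Sketch: an OpenMP 4 SIMD-annotated function, callable from
// #pragma omp simd loops without inhibiting vectorization.
#pragma omp declare simd
float scale_add(float x, float s) { return x * s + 1.0f; }

void apply(float* a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        a[i] = scale_add(a[i], 2.0f);
    }
}
```

This illustrates the trade-off from the slide: the vectorization contract lives in pragmas attached to the function and the loop, not in the types.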

SIMT

- the vector loops for GPU programming
- all code implicitly runs in SIMD context, with an attached index that signifies the SIMD lane
- think OpenCL for x86 families (CPU or Xeon Phi)

SIMDized Containers

- containers with overloaded operators; each operation semantically acts on all entries of the container, without any specific ordering
- std::valarray is such a class, but somewhat abandoned: runtime sized/allocated, cache inefficient, suboptimal mapping onto the SIMD width

std::valarray<float> a(n), b(n);
a += b;
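A slightly fuller std::valarray example (standard C++; the function name is an assumption): every expression acts element-wise on the whole container with no ordering guarantees between elements, which in principle permits a SIMD implementation.

```cpp
#include <cassert>
#include <valarray>

// std::valarray: operators act on all entries at once, no explicit loop.
std::valarray<float> saxpy(float a, const std::valarray<float>& x,
                           const std::valarray<float>& y) {
    return a * x + y;  // element-wise scalar-times-vector plus vector
}
```

The slide's criticism still applies: the container is runtime-sized and heap-allocated, so the width of each operation does not map onto the fixed SIMD register width.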

Array Notation

Intel Cilk Plus (known from Fortran):

a[:] += b[:];

SIMD Types

- types for SIMD registers and operations
- target-specific SIMD type width

for (int i = 0; i < N / float_v::Size; ++i) {
    a[i] += b[i];
}

Implementations (sorted by initial release): Vc, boost::simd (not in Boost; part of NT²), Prof. Agner Fog's vector classes, libsimdpp
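Note that the loop above only covers N / float_v::Size full vectors, so it implicitly assumes N is a multiple of the vector width. The usual portable pattern is a vector-wide main loop plus a scalar epilogue; a sketch in standard C++ (W = 8 and the function name are assumptions, standing in for float_v::Size on an AVX target):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: W-wide main loop plus scalar epilogue for the n % W
// remainder, the pattern SIMD-type loops need when n is not a
// multiple of the vector width.
constexpr std::size_t W = 8;

void add(float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + W <= n; i += W) {           // "vector" body, W lanes at a time
        for (std::size_t l = 0; l < W; ++l) a[i + l] += b[i + l];
    }
    for (; i < n; ++i) a[i] += b[i];       // scalar epilogue
}
```

With real SIMD types the inner loop over l becomes a single vector add; some libraries instead offer masked loads/stores for the remainder.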

Future of the C++ Standard

- everything is still open
- maybe two approaches, for high-level and for low-level needs: vector loops and SIMD types