OpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach

OpenMP: Vectorization and #pragma omp simd Markus Höhnerbach 1 / 26

Where does it come from?

c_i = a_i + b_i, for all i

  a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8
+ b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8
= c_1 c_2 c_3 c_4 c_5 c_6 c_7 c_8

2 / 26
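
In scalar code this is just the elementwise loop below; a SIMD unit performs several of these additions with a single instruction. A minimal sketch (the function name is illustrative):

  // Elementwise addition as depicted above: eight float additions
  // fit into one 256-bit vector instruction.
  void vec_add(int n, const float *a, const float *b, float *c) {
      for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }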

Why would I care? 4 / 26

Everywhere...

x86:    SSE 128 bit | AVX(2) 256 bit | AVX-512 (IMCI) 512 bit
ARM:    NEON 128 bit
POWER:  AltiVec/VMX/VSX 128 bit | QPX 256 bit
SPARC:  HPC-ACE 128 bit | HPC-ACE2 256 bit

5 / 26

Vectorization on Intel CPUs

[v]mova[p/s]s reg1, reg2/mem
reg: xmm0-xmm15 (128 bit), ymm0-ymm15 (256 bit), zmm0-zmm31 (512 bit)

vaddps: vectorized (packed single)
vaddss: scalar (scalar single)

How to see the assembly: add -S to the compiler command line.

6 / 26
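
A quick way to check what the compiler emits; a sketch, with an illustrative function name and typical GCC/Clang flags:

  // saxpy.c -- compile with: gcc -O3 -march=native -S saxpy.c
  // Then search saxpy.s for packed instructions (e.g. vaddps, vmulps,
  // vfmadd...ps) versus their scalar ...ss counterparts.
  void saxpy(int n, float s, const float *a, float *b) {
      for (int i = 0; i < n; i++)
          b[i] += s * a[i];
  }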

Vectorization: Indication

Profile first! Vectorization is worth it on the hot path!
- Increases usable memory bandwidth to cache
- Increases throughput of compute operations
- More power efficient
- Reduces frontend pressure (fewer instructions to decode)

Keep Amdahl's law in mind!

7 / 26

Cool! How can I use that?

- Libraries (MKL, OpenBLAS, BLIS, FFTW, NumPy, OpenCV)
- Hoping for a good compiler: autovectorization
- Assisting the compiler through annotations: the OpenMP SIMD pragma
- Writing intrinsics/assembly code (not covered here)

8 / 26

Autovectorization

The compiler needs to prove that the optimization is legal, and that it is beneficial (under almost all circumstances).

What could possibly go wrong?
- Conditionals (different vector lanes executing different code)
- Inner loops (might have different trip counts)
- Function calls (the called functions might not be vectorized)
- Cross-iteration dependencies

OpenMP addresses the last two points in particular.

9 / 26
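
For instance, a loop like the following sketch carries a dependence between iterations, so the compiler must refuse to vectorize it (names are illustrative):

  // Each iteration reads the value written by the previous one,
  // so the iterations cannot run in concurrent vector lanes.
  void prefix_sum(int n, float *a, const float *b) {
      for (int i = 1; i < n; i++)
          a[i] = a[i-1] + b[i];
  }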

The OpenMP simd pragma

- A unified, portable way to enforce vectorization of for loops
- Introduced in OpenMP 4.0
- Explicit vectorization of for loops
- Same restrictions as omp for, and then some
- Execution in chunks of SIMD length, each chunk executed concurrently
- Only directive allowed inside: omp ordered simd (OpenMP 4.5)
- Can be combined with omp for
- No exceptions

10 / 26
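
A minimal sketch of the basic usage, including the combined form with omp for (function names are illustrative; compile with OpenMP enabled, e.g. -fopenmp or -fopenmp-simd):

  // Vectorize only:
  void scale_add(int n, float s, float *a, const float *b) {
      #pragma omp simd
      for (int i = 0; i < n; i++)
          a[i] += s * b[i];
  }

  // Thread *and* vectorize (combined construct):
  void scale_add_threads(int n, float s, float *a, const float *b) {
      #pragma omp parallel for simd
      for (int i = 0; i < n; i++)
          a[i] += s * b[i];
  }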

Clauses

- safelen(len): maximum number of iterations per chunk
- simdlen(len): recommended number of iterations per chunk
- linear(var, ...: stride): variable is linear in the iteration number, with the given stride
- aligned(var, ...: alignment): alignment of a variable
- private, lastprivate, reduction, collapse: as with omp for

11 / 26
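
For example, a sketch of safelen: if a loop carries a dependence of distance 8, any chunk of at most 8 iterations is dependence-free, and safelen(8) tells the compiler so:

  // a[i] depends on a[i-8]; chunks of up to 8 iterations are safe,
  // so safelen(8) permits vectorization with vectors of that length.
  void shift_add(int n, float *a, const float *b) {
      #pragma omp simd safelen(8)
      for (int i = 8; i < n; i++)
          a[i] = a[i-8] + b[i];
  }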

Issues with your code

- Aliasing
- Alignment
- Floating point issues
- Correctness
- Function calls
- Ordering

12 / 26

Aliasing

float *a = ...;
float *b = ...;
float s;
...
for (int i = 0; i < N; i++) {
  a[i] += s * b[i];
}

The compiler does not know that a and b do not overlap. It has to be conservative.

13 / 26

Aliasing

float *a = ...;
float *b = ...;
float aa[N]; memcpy(aa, a, N * sizeof(float));
float bb[N]; memcpy(bb, b, N * sizeof(float));
float s;
...
for (int i = 0; i < N; i++) {
  aa[i] += s * bb[i];
}
memcpy(a, aa, N * sizeof(float));

14 / 26

Aliasing

float * restrict a = ...;
float * restrict b = ...;
float s;
...
for (int i = 0; i < N; i++) {
  a[i] += s * b[i];
}

15 / 26

Aliasing

float *a = ...;
float *b = ...;
float s;
...
#pragma omp simd
for (int i = 0; i < N; i++) {
  a[i] += s * b[i];
}

16 / 26

Alignment

Loading a chunk of data is cheaper if the address is aligned:
- Allows faster hardware instructions to load a vector
- Avoids cache line splits

Example: recent Intel CPUs have 64-byte cache lines and 32-byte vectors; the best alignment is 32 bytes.

Cache lines A, B, ...:

AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIII
  ####   <- Want to load this data? Unaligned (splits two cache lines).
    #### <- Want to load this data? Aligned (fits within one cache line).

17 / 26

Alignment

float *a; posix_memalign(&a, ...);
float *b; posix_memalign(&b, ...);
float s;
...
__assume_aligned(a, 32); // Intel
__assume_aligned(b, 32); // Intel
#pragma omp simd
for (int i = 0; i < N; i++) {
  a[i] += s * b[i];
}

18 / 26
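
On GCC/Clang the equivalent is the __builtin_assume_aligned builtin, which returns a pointer the compiler may treat as aligned; a sketch continuing the code above:

  // GCC/Clang: pa and pb are treated as 32-byte aligned from here on.
  float *pa = __builtin_assume_aligned(a, 32);
  float *pb = __builtin_assume_aligned(b, 32);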

Alignment

float *a; posix_memalign(&a, ...);
float *b; posix_memalign(&b, ...);
float s;
...
#pragma omp simd aligned(a, b: 32)
for (int i = 0; i < N; i++) {
  a[i] += s * b[i];
}

19 / 26

Floating Point Models

for (int i = 0; i < n; i++) {
  sum += a[i];
}

Vectorizing this sum reorders the additions, which can change the floating-point result. Either relax the floating-point model:

-ffast-math (GCC/Clang), /fp:fast (MSVC), -fp-model fast=2 (Intel)

or state the reduction explicitly:

#pragma omp simd reduction(+: sum)
for (int i = 0; i < n; i++) {
  sum += a[i];
}

https://msdn.microsoft.com/en-us/library/aa289157.aspx

20 / 26

Correctness

for (int i = 0; i < N; i++) {
  int j = d[i];
  a[j] += s * b[i];
}

Vectorization is only legal if the elements in d are distinct. This case occurs in applications!

#pragma omp simd
for (int i = 0; i < N; i++) {
  int j = d[i];
  a[j] += s * b[i];
}

21 / 26

Functions

// This won't vectorize unless foo is inlined.
float foo(float a, float * b, float c);

float s;
#pragma omp simd
for (int i = 0; i < N; i++) {
  int j = d[i];
  a[j] += foo(s, &b[i], a[j]);
}

22 / 26

The OpenMP declare simd directive

- Asks the compiler to generate a vectorized version of a function
- Allows vectorization of loops with function calls
- inbranch / notinbranch: generate a masked / an unmasked version
- Takes everything from the simd pragma, plus uniform
- uniform: argument does not change across a chunk
- linear: argument increases linearly with the iteration index

23 / 26

Functions

#pragma omp declare simd uniform(a) linear(b: 1)
float foo(float a, float * b, float c);

float s;
#pragma omp simd
for (int i = 0; i < N; i++) {
  int j = d[i];
  a[j] += foo(s, &b[i], a[j]);
}

24 / 26
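
The directive also goes on the function's definition so the compiler emits the vector variant. A sketch; only the signature comes from the slide, the body is hypothetical:

  // Same pragma on the definition; the body is illustrative only.
  #pragma omp declare simd uniform(a) linear(b: 1)
  float foo(float a, float *b, float c) {
      return a * (*b) + c;
  }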

Masks

#pragma omp simd
for (int i = 0; i < n; i++) {
  if (a[i] < 1.0) continue;
  // ..
  int j = d[i];
  a[j] = ...;
}

The conditional is handled by masking: all lanes execute the body, and a mask suppresses the effects in lanes where the condition fails.

25 / 26

Ordering

#pragma omp simd
for (int i = 0; i < n; i++) {
  if (a[i] < 1.0) continue;
  // ..
  int j = d[i];
  #pragma omp ordered simd
  a[j] = ...;
}

Use this if d does not contain distinct elements: the ordered simd block executes its statements in iteration order, so overlapping indices remain correct.

26 / 26