Vectorisation. James Briggs, COSMOS DiRAC.


Vectorisation. James Briggs, COSMOS DiRAC. April 28, 2015.

Session Plan
1. Overview
2. Implicit Vectorisation
3. Explicit Vectorisation
4. Data Alignment
5. Summary

Section 1 Overview

What is SIMD?

Scalar processing executes one element at a time: a0+b0=c0, then a1+b1=c1, and so on. Vector processing executes on multiple elements at a time in hardware: a single instruction adds (a0,a1,a2,a3) to (b0,b1,b2,b3), producing (c0,c1,c2,c3). This is Single Instruction Multiple Data (SIMD).
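The contrast can be sketched in plain C: the loop below is the scalar code, and a vectorising compiler turns its body into vector instructions that process several elements per instruction (the function name is illustrative):

```c
#include <stddef.h>

/* Scalar code: one add per iteration.  A vectorising compiler
 * rewrites this to process 4, 8, or 16 elements per instruction,
 * depending on the target's vector width (SSE/AVX/AVX-512). */
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```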

A Brief History

Pentium (1993): 32 bit.
MMX (1997): 64 bit.
Streaming SIMD Extensions (SSE in 1999, ..., SSE4.2 in 2008): 128 bit.
Advanced Vector Extensions (AVX in 2011, AVX2 in 2013): 256 bit.
Intel MIC Architecture (Intel Xeon Phi in 2012): 512 bit.

Why you should care about SIMD (1/2)

Big potential performance speed-ups per core. For double-precision floating point, vector width vs. theoretical speed-up over scalar:

128 bit: 2x potential for SSE.
256 bit: 4x potential for AVX.
256 bit: 8x potential for AVX2 (FMA).
512 bit: 16x potential for Xeon Phi (FMA).

Wider vectors allow for higher potential performance gains. A little programmer effort can often unlock a hidden 2-8x in your code!

Why you should care about SIMD (2/2)

The future: chip designers like SIMD (low cost, low power, big gains). Next-generation Intel Xeon and Xeon Phi (AVX-512): 512 bit. And it's not just Intel:

ARM NEON: 128 bit SIMD.
IBM POWER8: 128 bit (VMX).
AMD Piledriver: 256 bit SIMD (AVX + FMA).

Many Ways to Vectorise (from greatest ease of use to greatest programmer control):

Auto-vectorisation (no change to code)
Auto-vectorisation (with compiler hints)
Explicit vectorisation (e.g. OpenMP 4, Cilk Plus)
SIMD intrinsic classes (e.g. F32vec, Vc, Boost.SIMD)
Vector intrinsics (e.g. _mm_fmadd_pd(), _mm_add_ps(), ...)
Inline assembly (e.g. vaddps, vaddss, ...)


Section 2 Implicit Vectorisation

Auto-Vectorisation

The compiler analyses your loops and generates vectorised versions of them at the optimisation stage. Required Intel compiler flags:

Xeon: -O2 -xhost
MIC native: -O2 -mmic

With the Intel compiler, use -qopt-report=[n] to see whether a loop was auto-vectorised. Powerful, but the compiler cannot make unsafe assumptions.
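As a sketch (function name illustrative; report flags vary by compiler version), a loop of this shape will usually auto-vectorise without hints:

```c
/* Typically auto-vectorises at -O2/-O3, e.g.:
 *   icc -O2 -xhost -qopt-report=2 saxpy.c         (Intel)
 *   gcc -O2 -ftree-vectorize -fopt-info-vec saxpy.c  (GNU)
 * Unit stride, trip count loop-invariant, and restrict rules out
 * the aliasing the compiler would otherwise have to assume. */
void saxpy(int n, float alpha, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```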

Auto-Vectorisation

What the compiler checks for:

int *g_size;
void not_vectorisable(float *a, float *b, float *c, int *ind)
{
    for (int i = 0; i < *g_size; ++i) {
        int j = ind[i];
        c[j] = a[i] + b[i];
    }
}

Is *g_size loop-invariant?
Do a, b, and c point to different arrays? (Aliasing.)
Is ind[i] a one-to-one mapping?

Auto-Vectorisation

This will now auto-vectorise:

int *g_size;
void vectorisable(float *restrict a, float *restrict b,
                  float *restrict c, int *restrict ind)
{
    int n = *g_size;
    #pragma ivdep
    for (int i = 0; i < n; ++i) {
        int j = ind[i];
        c[j] = a[i] + b[i];
    }
}

Dereference *g_size outside of the loop.
The restrict keyword tells the compiler there is no aliasing.
#pragma ivdep tells the compiler there are no data dependencies between iterations.

Auto-Vectorisation Summary Minimal programmer effort. May require some compiler hints. Compiler can decide if scalar loop is more efficient. Powerful, but cannot make unsafe assumptions. Compiler will always choose correctness over performance.

Section 3 Explicit Vectorisation

Explicit Vectorisation There are more involved methods for generating the code you want. These can give you: Fine-tuned performance. Advanced things the auto-vectoriser would never think of. Greater performance portability. This comes at a price of increased programmer effort and possibly decreased portability.

Explicit Vectorisation

Compiler's responsibilities:
Allow the programmer to declare that code can and should be run in SIMD.
Generate the code that the programmer asked for.

Programmer's responsibilities:
Correctness (e.g. no dependencies or incorrect memory accesses).
Efficiency (e.g. alignment, strided memory access).

Vectorise with OpenMP 4.0 SIMD

OpenMP 4.0, ratified July 2013. Specifications: http://openmp.org/wp/openmp-specifications/ An industry standard. New in OpenMP 4.0: SIMD pragmas!

OpenMP Pragma SIMD

"The simd construct can be applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions)." - OpenMP 4.0 Specification

Syntax in C/C++:

#pragma omp simd [clause[, clause] ...]
for (int i = 0; i < N; ++i)
    ...

Syntax in Fortran:

!$omp simd [clause[, clause] ...]

OpenMP Pragma SIMD Clauses

safelen(len): len must be a power of 2. The compiler can assume vectorisation with a vector length of up to len to be safe.

private(v1, v2, ...): variables private to each SIMD lane.

linear(v1:step1, v2:step2, ...): for every iteration of the original scalar loop, v1 is incremented by step1, etc. It is therefore incremented by step1 * vector_length in the vectorised loop.

reduction(operator:v1, v2, ...): v1, v2, etc. are reduction variables for the operation operator.

collapse(n): combine n nested loops into one.

aligned(v1:base, v2:base, ...): tells the compiler that v1, v2, ... are aligned to base bytes.
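To illustrate safelen, a small sketch (names illustrative): this loop carries a dependence at distance 8, since a[i] reads a[i-8], so any vector length up to 8 is safe, but no more:

```c
/* a[i] depends on a[i-8]: iterations closer than 8 apart are
 * independent, so safelen(8) permits vector lengths up to 8. */
void shift_add(float *a, const float *b, int n)
{
    #pragma omp simd safelen(8)
    for (int i = 8; i < n; ++i)
        a[i] = a[i - 8] + b[i];
}
```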

OpenMP SIMD Example 1

The earlier example that wouldn't auto-vectorise will do so now with SIMD:

int *g_size;
void vectorisable(float *a, float *b, float *c, int *ind)
{
    #pragma omp simd
    for (int i = 0; i < *g_size; ++i) {
        int j = ind[i];
        c[j] = a[i] + b[i];
    }
}

The programmer asserts that there is no aliasing or loop variance. Explicit SIMD lets you express what you want, but correctness is your responsibility.

OpenMP SIMD Example 2

An example of a SIMD reduction:

int *g_size;
void vec_reduce(float *a, float *b, float *c, int *ind)
{
    float sum = 0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < *g_size; ++i) {
        int j = ind[i];
        c[j] = a[i] + b[i];
        sum += c[j];
    }
}

sum should be treated as a reduction.

OpenMP SIMD Example 3

An example of a SIMD reduction with the linear clause:

float sum = 0.0f;
float *p = a;
int step = 4;
#pragma omp simd reduction(+:sum) linear(p:step)
for (int i = 0; i < N; ++i) {
    sum += *p;
    p += step;
}

The linear clause tells the compiler that p has a linear relationship with respect to the iteration space, i.e. it is computable from the loop index: p_i = p_0 + i * step. It also means that p is SIMD-lane private. Its initial value is its value before the loop; after the loop, p is set to its value from the sequentially last iteration.

SIMD-Enabled Functions

SIMD-enabled functions allow user-defined functions to be vectorised when they are called from within vectorised loops. The declare simd directive and its modifying clauses specify the vector and scalar nature of the function arguments.

Syntax in C/C++:

#pragma omp declare simd [clause[, clause] ...]
function definition or declaration

Syntax in Fortran:

!$omp declare simd(proc_name) [clause[[,] clause] ...]
function definition or declaration

SIMD-Enabled Function Clauses

simdlen(len): len must be a power of 2. Generate a function that works for this vector length.

linear(v1:step1, v2:step2, ...): for every iteration of the original scalar loop, v1 is incremented by step1, etc. It is therefore incremented by step1 * vector_length in the vectorised loop.

uniform(a1, a2, ...): arguments a1, a2, etc. are not treated as vectors (constant values across SIMD lanes).

inbranch / notinbranch: the SIMD-enabled function is called only inside branches, or never from one.

aligned(v1:base, v2:base, ...): tells the compiler that v1, v2, ... are aligned to base bytes.
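A sketch of uniform and linear working together (function name illustrative): the table base pointer is identical across SIMD lanes, while the index is linear in the loop counter, so tab[i] becomes a unit-stride vector load rather than a gather:

```c
/* uniform(tab): the base pointer is broadcast to all lanes.
 * linear(i:1): i advances by 1 per scalar iteration, so tab[i]
 * is a contiguous load when called from a SIMD loop. */
#pragma omp declare simd uniform(tab) linear(i:1)
float lookup_scale(const float *tab, int i, float s)
{
    return tab[i] * s;
}
```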

SIMD-Enabled Functions Example

Write the function for one element and add the pragma:

#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;
}

You can call the scalar version as usual:

e = foo(a, b, c, d);

Or call the vectorised version in a SIMD loop:

#pragma omp simd
for (i = 0; i < n; ++i) {
    E[i] = foo(A[i], B[i], C[i], D[i]);
}

SIMD-Enabled Functions Recommendations SIMD-enabled functions still incur overhead. Inlining is always better, if possible.

Explicit Vectorisation - CilkPlus Array Notation

An extension to C/C++ that performs operations on sections of arrays in parallel. For example, vector addition:

A[:] = B[:] + C[:];

Looks like MATLAB/NumPy/Fortran... but in C/C++!

Explicit Vectorisation - CilkPlus Array Notation

Syntax:

A[:]
A[start_index:length]
A[start_index:length:stride]

Use : for all elements. length specifies the number of elements in the subset (N.B. a length, not an upper bound as in Fortran 90). stride is the distance between elements of the subset.
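Because the length-based syntax differs from Fortran, the plain-C equivalent of a strided section assignment may help; this is a sketch of the semantics only (names illustrative), not Cilk Plus code:

```c
/* Plain-C meaning of the Cilk Plus section assignment
 *   A[start:len:stride] = B[0:len];
 * len elements are copied; A is written at start, start+stride, ... */
void section_assign(float *A, int start, int len, int stride,
                    const float *B)
{
    for (int k = 0; k < len; ++k)
        A[start + k * stride] = B[k];
}
```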

Explicit Vectorisation - CilkPlus Array Notation

Array notation also works with SIMD-enabled functions:

A[:] = mysimdfn(B[:], C[:]);

Reductions on vectors are done via predefined functions, e.g.:

__sec_reduce_add, __sec_reduce_mul, __sec_reduce_all_zero,
__sec_reduce_all_nonzero, __sec_reduce_max, __sec_reduce_min, ...

Array Notation Performance Issues

Long form:

C[0:N] = A[0:N] + B[0:N];
D[0:N] = C[0:N] * C[0:N];

Short form:

for (i = 0; i < N; i += V) {
    C[i:V] = A[i:V] + B[i:V];
    D[i:V] = C[i:V] * C[i:V];
}

The long form is more elegant, but the short form will actually perform better. If we expand the expressions back into for loops: for large N, the long form kicks C out of cache before the second statement, so there is no reuse in the next loop. For an appropriate V in the short form, C can even be kept in registers. This applies to Fortran array syntax as well as Cilk Plus.

CilkPlus Availability

The following support Cilk Plus array notation (as well as its other features):

GNU GCC 4.9+: enable with -fcilkplus.
Clang/LLVM 3.5: not in the official branch yet, but a development branch exists at http://cilkplus.github.io/; enable with -fcilkplus.
Intel C/C++ compiler since version 12.0.

Implicit vs Explicit Vectorisation

Implicit:
Automatic dependency analysis (e.g. reductions).
Recognises idioms with data dependencies.
Non-inline functions are scalar.
Limited support for outer-loop vectorisation (possible at -O3).
Relies on the compiler's ability to recognise patterns/idioms it knows how to vectorise.

Explicit:
No dependency analysis (e.g. reductions must be declared explicitly).
Recognises idioms without data dependencies.
Non-inline functions can be vectorised.
Outer loops can be vectorised.
May be more portable across compilers.


Section 4 Data Alignment

Data Alignment - Why it Matters

Aligned load: the address is aligned. One cache line touched, one load instruction. The compiler generates a 2-version loop: vector body plus remainder.

Unaligned load: the address is not aligned. Potentially multiple cache lines and multiple instructions per load. The compiler generates a 3-version loop: peel, vector body, and remainder.

Data Alignment Workflow

1. Align your data.
2. Access your memory in an aligned way.
3. Tell the compiler the data is aligned.

1. Align Your Data

Automatic/static arrays in C/C++:

float a[1024] __attribute__((aligned(64)));

Heap arrays in C/C++:

float *a = _mm_malloc(1024 * sizeof(*a), 64);  /* on Intel/GNU */
_mm_free(a);                                   /* need this to free! */

(For non-Intel compilers there are also posix_memalign and aligned_alloc (C11).)

In Fortran:

real :: A(1024)
!dir$ attributes align : 64 :: A
real, allocatable :: B(:)
!dir$ attributes align : 64 :: B
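On non-Intel toolchains, the C11 aligned_alloc route looks like this sketch (helper name illustrative); note that aligned_alloc requires the size to be a multiple of the alignment, so we round up:

```c
#include <stdlib.h>
#include <stdint.h>

/* C11 portable 64-byte-aligned allocation.  The byte count is
 * rounded up to a multiple of 64, as aligned_alloc requires.
 * Free with plain free(), not _mm_free(). */
double *alloc_aligned64(size_t n)
{
    size_t bytes = ((n * sizeof(double) + 63) / 64) * 64;
    return aligned_alloc(64, bytes);
}
```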

2. Access Memory in an Aligned Way

Example:

float a[N] __attribute__((aligned(64)));
...
for (int i = 0; i < N; ++i)
    a[i] = ...;

Start each vectorised access from an aligned boundary, e.g. a[0], a[16], ...

3. Tell the Compiler

In C/C++:

#pragma vector aligned
#pragma omp simd aligned(p:64)
__assume_aligned(p, 16);
__assume(i % 16 == 0);

In Fortran:

!dir$ vector aligned
!$omp simd aligned(p:64)
!dir$ assume_aligned(p, 16)
!dir$ assume (mod(i,16) .eq. 0)

Alignment Example

float *a = _mm_malloc(n * sizeof(*a), 64);
float *b = _mm_malloc(n * sizeof(*b), 64);
float *c = _mm_malloc(n * sizeof(*c), 64);

#pragma omp simd aligned(a:64, b:64, c:64)
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
}

Aligning Multi-dimensional Arrays (1/2)

Consider a 15 x 15 array of doubles allocated as:

double *a = _mm_malloc(15 * 15 * sizeof(*a), 64);

a[0] is aligned, but the row starts a[i*15+0] for i > 0 are not. The following may therefore seg-fault:

for (int i = 0; i < n; ++i) {
    #pragma omp simd aligned(a:64)
    for (int j = 0; j < n; ++j) {
        b[j] += a[i*n+j];
    }
}

Aligning Multi-dimensional Arrays (2/2)

We need to add padding to every row of the array so that each row starts on a 64-byte boundary. For 15 x 15 doubles we should allocate 15 x 16. Useful code:

int n_pad = (n + 7) & ~7;
double *a = _mm_malloc(n * n_pad * sizeof(*a), 64);

The following is now valid:

for (int i = 0; i < n; ++i) {
    __assume(n_pad % 8 == 0);
    #pragma omp simd aligned(a:64)
    for (int j = 0; j < n; ++j) {
        b[j] += a[i*n_pad+j];
    }
}
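The rounding expression deserves a check: (n + 7) & ~7 rounds n up to the next multiple of 8 doubles, i.e. one 64-byte cache line per row step. A small sketch (helper name illustrative):

```c
/* Round n up to a multiple of 8 (doubles per 64-byte cache line),
 * giving the padded row stride for an aligned 2-D allocation. */
int padded_stride(int n)
{
    return (n + 7) & ~7;
}
```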

Section 5 Summary

Summary

What we have learned:

Why vectorisation is important, and how the vector units on modern processors can provide big speed-ups with often small effort.
Auto-vectorisation in modern compilers.
Explicit vectorisation with OpenMP 4.0 and array notation.
SIMD-enabled functions.
How to align data and why it helps SIMD performance.