
High Performance Computing: Tools and Applications
Edmond Chow
School of Computational Science and Engineering
Georgia Institute of Technology
Lecture 8

Processor-level SIMD

SIMD (single-instruction, multiple-data) instructions perform one operation on multiple words simultaneously. This is a form of data parallelism.

Recent SIMD versions:

    Nehalem        SSE 4.2  (128-bit)
    Sandy Bridge   AVX      (256-bit)
    Haswell        AVX2     (256-bit, with FMA)
    MIC            AVX-512  (512-bit)

Newer instruction sets cannot be used on older processors, e.g., AVX instructions cannot run on Nehalem.
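As a side note, you can check at compile time which instruction set the compiler is targeting via predefined macros. A minimal sketch (these macros are defined by gcc, clang, and icc when the corresponding architecture flags are given):

    #include <stdio.h>

    int main(void)
    {
    #if defined(__AVX512F__)
        printf("compiled with AVX-512 enabled\n");
    #elif defined(__AVX2__)
        printf("compiled with AVX2 enabled\n");
    #elif defined(__AVX__)
        printf("compiled with AVX enabled\n");
    #elif defined(__SSE4_2__)
        printf("compiled with SSE 4.2 enabled\n");
    #else
        printf("no AVX/SSE target macros defined\n");
    #endif
        return 0;
    }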

Example

Loops such as the following can exploit SIMD:

    for (i=0; i<n; i++)
        a[i] = a[i] + b[i];

With AVX, 8 floats or 4 doubles are computed at the same time.

Historical note: CRAY computers had vector units (often 64 words long) that operated in a pipelined fashion. Processor-level SIMD instructions are therefore often called short vector instructions, and using these instructions is called vectorization. Exploiting SIMD is essential in HPC codes, especially on processors with wide SIMD units, e.g., Intel Xeon Phi.

Many ways to exploit SIMD

- Use the auto-vectorizer in the compiler (i.e., do nothing except make sure that the coding style does not prevent vectorization).
- Use SIMD intrinsic functions in your code (when the auto-vectorizer does not do what you want).

Example use of intrinsic functions (AVX)

    __m256 qa, qb;
    for (i=0; i<n/8; i++) {
        qa = _mm256_load_ps(a);       /* aligned load of 8 floats */
        qb = _mm256_load_ps(b);
        qa = _mm256_add_ps(qa, qb);
        _mm256_store_ps(a, qa);
        a += 8;                       /* 8 floats per 256-bit register */
        b += 8;
    }
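The loop above assumes n is a multiple of 8; if it is not, a scalar cleanup loop handles the leftover elements. A minimal sketch (assuming #include <immintrin.h>, and using indexed addressing instead of pointer bumping):

    int i;
    /* vectorized part: full groups of 8 floats */
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 qa = _mm256_load_ps(a + i);
        __m256 qb = _mm256_load_ps(b + i);
        _mm256_store_ps(a + i, _mm256_add_ps(qa, qb));
    }
    /* scalar remainder: the last n % 8 elements */
    for (; i < n; i++)
        a[i] = a[i] + b[i];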

Data alignment

Many things can prevent automatic vectorization or produce suboptimal vectorized code. In the previous example, the loads into the vector qa must come from an address a that is 256-bit (32-byte) aligned.

    for (i=0; i<n; i++)
        a[i] = a[i] + b[i];

If a and/or b can be unaligned, then the compiler will not vectorize the code, or will generate extra code at the beginning and/or end of the loop to handle the misaligned elements.

- Need to allocate memory that is aligned.
- Need to tell the compiler that the memory is aligned.

Allocating aligned memory

In these examples, assume AVX-512, i.e., 64-byte alignment is needed.

    #include <xmmintrin.h>   /* declares _mm_malloc/_mm_free */
    ...
    buffer = _mm_malloc(num_bytes, 64);
    ...
    _mm_free(buffer);

Other functions for allocating aligned memory are also available:

    #include <stdlib.h>
    ...
    ret = posix_memalign(&buffer, 64, num_bytes);
    buffer = aligned_alloc(64, num_bytes);   /* C11 */
    ...
    free(buffer);
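Two details worth noting, sketched below: posix_memalign takes the pointer by address and returns 0 on success, and C11 aligned_alloc requires the size to be a multiple of the alignment. (The wrapper name alloc_aligned64 is made up for this example.)

    #include <stdlib.h>

    double *alloc_aligned64(size_t num_bytes)
    {
        void *buffer = NULL;
        /* returns 0 on success; nonzero means allocation failed */
        if (posix_memalign(&buffer, 64, num_bytes) != 0)
            return NULL;
        return (double *)buffer;   /* release with free() */
    }

    /* C11 alternative: aligned_alloc(64, num_bytes),
       where num_bytes must be a multiple of 64 */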

Compiler hints

Aligned memory on the stack:

    __declspec(align(64)) double a[4000];

Telling the compiler that memory is aligned:

    __assume_aligned(ptr, 64);
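The two hints above use the Intel compiler's spelling; with gcc and clang the usual equivalents are the following (a sketch, not an exhaustive list):

    /* gcc/clang: aligned array definition */
    double a[4000] __attribute__((aligned(64)));

    /* gcc/clang: promise the compiler that ptr is 64-byte aligned */
    double sum64(double *ptr, int n)
    {
        double *ap = __builtin_assume_aligned(ptr, 64);
        double s = 0.;
        for (int i = 0; i < n; i++)
            s += ap[i];
        return s;
    }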

Auto vectorization

Tell the compiler what architecture you are using. Examples:

- -mmic (Intel)
- -msse4.2 (Gnu)

Auto vectorization

- Vectorization is enabled with -O1 and above (below that, vector instructions are not used).
- The default is -O2, which is the same as -O or not specifying an optimization level flag.

Disabling vectorization

- -no-vec disables auto-vectorization.
- -no-simd disables vectorization of loops with Intel SIMD pragmas (see also -qno-openmp-simd).

Vectorization reports

The compiler can tell you how well your code was vectorized. Compile with:

    -qopt-report=1 -qopt-report-phase=vec

The compiler will output an *.optrpt file.
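For example, icc -O2 -qopt-report=1 -qopt-report-phase=vec foo.c writes foo.optrpt (the source file name here is just a placeholder). Higher report levels, up to -qopt-report=5, add detail, including the reasons a loop was not vectorized.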

Will these loops vectorize?

    for (i=0; i<n; i+=2)
        b[i] += a[i]*x[i];

Will these loops vectorize?

    for (i=0; i<n; i+=2)
        b[i] += a[i]*x[i];

    for (i=0; i<n; i++)
        b[i] += a[i]*x[index[i]];

Obstacles to vectorization

- non-contiguous memory access

Will these loops vectorize?

    for (i=1; i<=n; i++)
        a[i] = a[i-1] + 1.;

which expands to:

    a[1] = a[0] + 1;
    a[2] = a[1] + 1;
    a[3] = a[2] + 1;
    a[4] = a[3] + 1;

Will these loops vectorize?

    for (i=1; i<=n; i++)
        a[i] = a[i-1] + 1.;

which expands to:

    a[1] = a[0] + 1;
    a[2] = a[1] + 1;
    a[3] = a[2] + 1;
    a[4] = a[3] + 1;

    for (i=1; i<=n; i++)
        a[i-1] = a[i] + 1.;

which expands to:

    a[0] = a[1] + 1;
    a[1] = a[2] + 1;
    a[2] = a[3] + 1;
    a[3] = a[4] + 1;

Will these loops vectorize?

    for (i=1; i<=n; i++)
        a[i] = a[i-1] + 1.;

    for (i=1; i<=n; i++)
        a[i-1] = a[i] + 1.;

No and Yes. The first loop has a true loop-carried dependence: iteration i reads a[i-1], which was written by iteration i-1. The second loop has only an anti-dependence: each a[i] is read before any iteration overwrites it, so all the loads in a vector can be performed before the stores, and the loop vectorizes.

Obstacles to vectorization

- non-contiguous memory access
- data dependencies

Will this loop vectorize?

    double sum = 0.;
    for (j=0; j<n; j++)
        sum += a[j]*b[j];

Will this loop vectorize?

    double sum = 0.;
    for (j=0; j<n; j++)
        sum += a[j]*b[j];

Yes, the compiler recognizes this as a reduction. (Note that the vectorized reduction reorders the floating-point additions, so the result may differ slightly from that of the scalar loop; see the sketch below.)
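Conceptually, the vectorized reduction keeps several independent partial sums and combines them at the end. A scalar sketch of the idea (assuming a vector width of 4 doubles and n a multiple of 4, with a and b as above):

    double s0 = 0., s1 = 0., s2 = 0., s3 = 0.;
    for (int j = 0; j < n; j += 4) {
        s0 += a[j]   * b[j];     /* four independent accumulators, */
        s1 += a[j+1] * b[j+1];   /* one per vector lane            */
        s2 += a[j+2] * b[j+2];
        s3 += a[j+3] * b[j+3];
    }
    double sum = s0 + s1 + s2 + s3;   /* horizontal combine at the end */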

Will this loop vectorize?

    void add(int n, double *a, double *b, double *c)
    {
        for (int i=0; i<n; i++)
            c[i] = a[i] + b[i];
    }

Will this loop vectorize?

    void add(int n, double *a, double *b, double *c)
    {
        for (int i=0; i<n; i++)
            c[i] = a[i] + b[i];
    }

No: array c might overlap with array a or b. However, the compiler might generate code that tests for overlap at run time and chooses vectorized or scalar code accordingly.

Use the restrict keyword

Tell the compiler that a pointer is the only way to access an array:

    void add(int n, double * restrict a, double * restrict b,
             double * restrict c)
    {
        for (int i=0; i<n; i++)
            c[i] = a[i] + b[i];
    }

- restrict is not available on all compilers
- on Intel compilers, use -restrict on the compile line

Use #pragma ivdep

Tell the compiler to ignore any potential data dependencies:

    void add(int n, double *a, double *b, double *c)
    {
    #pragma ivdep
        for (int i=0; i<n; i++)
            c[i] = a[i] + b[i];
    }

Another example, where restrict cannot help:

    void ignore_vec_dep(int *a, int k, int c, int m)
    {
    #pragma ivdep
        for (int i=0; i<m; i++)
            a[i] = a[i + k] * c;   /* real dependence if k < 0;
                                      ivdep promises this does not occur */
    }

ivdep and other pragmas

- #pragma ivdep: ignore potential data dependencies in the loop
- #pragma novector: do not vectorize the loop
- #pragma loop_count(n): typical trip count; tells the compiler whether vectorization is worthwhile
- #pragma vector always: always vectorize
- #pragma vector aligned: asserts that data in the loop is aligned
- #pragma vector nontemporal: hints to the compiler that the data will not be reused, and therefore to use streaming stores that bypass cache
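A minimal sketch combining two of these hints (assuming a and b were allocated with 64-byte alignment as shown earlier; the function name is made up):

    void scaled_add(int n, double *a, double *b)
    {
    #pragma vector aligned   /* assert: a and b are aligned   */
    #pragma ivdep            /* assert: a and b do not overlap */
        for (int i = 0; i < n; i++)
            a[i] += 2.0 * b[i];
    }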

Which is better?

    struct Particle {
        double pos[3];
        double radius;
        int type;
    };
    /* array of structures */
    struct Particle allparticles[10000];

    struct AllParticles {
        double pos[3][10000];
        double radius[10000];
        int type[10000];
    };
    /* structure of arrays */
    struct AllParticles allparticles;

Array of structures (AOS) vs. structure of arrays (SOA)

- Data layout may affect vectorization, e.g., how will positions be updated in the AOS case? (See the sketch after this list.)
- SOA is often better for vectorization.
- AOS may also be bad for cache line alignment (but padding can be used here).
- If particles are accessed in a random order, and particle radius and type are accessed together with positions, then SOA may underutilize cache lines.
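For example, updating all x-coordinates in the SOA layout is a unit-stride loop that vectorizes directly (dt and vx are hypothetical time step and velocity names; allparticles is the SOA variable from the previous slide):

    /* unit-stride sweep over the x-coordinates in the SOA layout */
    for (int i = 0; i < 10000; i++)
        allparticles.pos[0][i] += dt * vx[i];

In the AOS layout, the same update touches memory with a stride of sizeof(struct Particle) bytes, so each cache line also brings in radius and type fields that the loop never uses.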

Can this pseudocode be vectorized?

    for i = 1 to npos
        for j = 1 to npos
            if (i == j) continue
            vec = p(i) - p(j)
            dist = norm(vec)
            force(i) += k*vec/(dist^2)
        end
    end

Trick to remove the if test

    for i = 1 to npos
        for j = 1 to npos
            vec = p(i) - p(j)
            dist = norm(vec) + eps
            force(i) += k*vec/(dist^2)
        end
    end

When i == j, vec is zero, so that term contributes zero; adding the small constant eps avoids the division by zero, and the branch-free loop body can be vectorized.
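A 1-D C sketch of the branch-free inner loop (p, force, k, and eps correspond to the pseudocode; eps is a tiny constant such as 1e-30, and the function name is made up):

    #include <math.h>

    void accumulate_force(int npos, int i, const double *p,
                          double *force, double k, double eps)
    {
        for (int j = 0; j < npos; j++) {
            double vec  = p[i] - p[j];
            double dist = fabs(vec) + eps;   /* the 1-D norm is |vec| */
            /* when j == i, vec == 0, so this term contributes zero */
            force[i] += k * vec / (dist * dist);
        }
    }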

Next time: #pragma omp simd