SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers. Hiroshi Inoue, IBM Research - Tokyo

What's SIMD? Single Instruction Multiple Data: the same operation is applied to multiple elements held in a vector register. A scalar add instruction produces a single result (input A0 + input B0 = output C0), while one SIMD add instruction adds whole vector registers element-wise (A0..A3 + B0..B3 = C0..C3). The vector length and the supported data types depend on the SIMD ISA.

SIMD is all around us:
- Intel x86: MMX (since the MMX Pentium released in 1997), with 64-bit vector registers (shared with the FPRs) and integer types only; SSE, SSE2, SSE3, SSE4, with 128-bit vector registers for integer, single fp, and double fp; AVX, AVX2, with 256-bit vector registers for single fp, double fp, and integer; Xeon Phi (co-processor), with 512-bit vector registers
- PowerPC: VMX (a.k.a. AltiVec, Velocity Engine) for single fp, plus VSX for double fp; the SPU of Cell BE (a SIMD-only ISA)
- ARM: NEON, NEON2

So what? SIMD yields higher peak performance: e.g. peak flops become 4x for double fp with 256-bit vector registers (AVX). So, can we get a 4x speedup just by buying SIMD processors? And can we get a 4x speedup just by recompiling the source code?

Outline
- Background: What's SIMD?
- Approaches for SIMD programming: explicit SIMD parallelization, automatic loop vectorization, programming models suitable for SIMD
- Summary

Manual programming with SIMD. We can write SIMD instructions explicitly using inline assembly! Compiler support can make explicit SIMD programming (somewhat) easier: data types that represent vector registers, and built-in methods that are directly mapped onto SIMD instructions (called "intrinsics"). Most processor vendors define C/C++ language extensions for their SIMD ISAs.

Examples of programming with SIMD intrinsics. Starting from an original scalar loop over 32-bit integer arrays: vectorized with AVX2 intrinsics, 8 (= 256 / 32) elements are processed at once; vectorized with VMX intrinsics, 4 (= 128 / 32) elements at once. In the intrinsic code, a dedicated vector type represents a vector register (four integers in the 128-bit VMX case), and the add intrinsic is mapped directly onto a SIMD add instruction.
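The code listings from this slide are not preserved in the transcription. As a sketch of the idea, here is a scalar loop over 32-bit integers and a hand-vectorized version using SSE2 intrinsics (128-bit registers, so 4 elements at once); the function names are illustrative, not from the slide:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Original scalar loop: one 32-bit add per iteration. */
void add_scalar(int *c, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vectorized with SSE2 intrinsics: 4 (= 128 / 32) elements at once.
 * __m128i represents a vector register holding four integers;
 * _mm_add_epi32 is mapped onto a single SIMD add instruction. */
void add_sse2(int *c, const int *a, const int *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)      /* scalar remainder loop */
        c[i] = a[i] + b[i];
}
```

With AVX2 the same loop would process 8 (= 256 / 32) elements per iteration using __m256i and _mm256_add_epi32.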

Explicit SIMD Programming: Pros & Cons. Explicit SIMD programming with intrinsics:
- can achieve the best possible performance in most cases
- is easier than assembly programming (no manual register allocation, instruction scheduling, etc.)
- is still very hard to program, debug, and maintain
- depends on the underlying processor architecture and is not portable among different processors
- is not suitable for platform-neutral languages such as Java and scripting languages
Often it also requires changes in algorithms and data layout for efficient SIMD exploitation.

Outline
- Background: What's SIMD?
- Approaches for SIMD programming: explicit SIMD parallelization, automatic loop vectorization, programming models suitable for SIMD
- Summary

Can compilers help more? Yes: automatic loop vectorization has been studied first for vector computers and later for SIMD instructions. The compiler analyzes a scalar loop and translates it to vector processing if possible. Major reasons that prevent vectorization include:
- loop-carried dependencies
- control flow (e.g. if statements) in the loop body
- memory alignment issues
- method calls in the loop body

Example of automatic loop vectorization. Given the original scalar loop, the compiler needs to check for several hazards:
- the input and output vectors may overlap each other (a loop-carried dependency)
- the vectors may not be properly aligned
- the loop count (N) may not be a multiple of the vector size
The compiler generates guard code to handle each case, or simply gives up on vectorization.
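The slide's listing is not preserved in the transcription; a loop of the kind being discussed might look like the following (names illustrative). Without further information, the compiler must assume the worst about each of the hazards above:

```c
/* Can the compiler vectorize this? Not without guard code:
 * - c may alias a or b (a possible loop-carried dependency),
 * - a, b, and c may not be aligned to the vector width,
 * - n may not be a multiple of the vector length. */
void axpy_like(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + 2.0f * b[i];
}
```

Typical guard code performs a runtime overlap check that falls back to the scalar loop, plus a scalar remainder loop for the last n mod VL iterations.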

Automatic loop vectorization: Pros & Cons. Automatic loop vectorization does not require source code changes, but the performance gain is limited compared to hand vectorization. (The slide charts the average performance gain of many loops over scalar, non-SIMD code by automatic and by manual vectorization [1], on POWER7 and on Nehalem.) Overall: higher productivity but lower expected performance gain compared to the explicit approach. [1] Maleki et al., An Evaluation of Vectorizing Compilers, PACT 2011.

Programmers can help automatic loop vectorization:
- By changing algorithms and data layout: the algorithm (e.g. SOR to red-black SOR) or the data layout (e.g. array of structures to structure of arrays)
- By adding explicit declarations or pragmas: the restrict keyword of C99, a declaration showing there is no aliasing (/* a[], b[], and c[] are independent in this function */), and standard or non-standard pragmas, e.g. #pragma omp simd (OpenMP 4.0), #pragma simd (icc), #pragma disjoint (xlc), etc.
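The declaration-based approach can be sketched as follows (function names illustrative): the C99 restrict qualifier removes the aliasing hazard, and an OpenMP 4.0 pragma lets the programmer assert outright that the loop is safe to vectorize.

```c
/* restrict promises the compiler that a, b, and c do not alias,
 * removing the need for a runtime overlap check. */
void add_restrict(float *restrict c, const float *restrict a,
                  const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* OpenMP 4.0: the programmer asserts the loop is vectorizable;
 * the pragma is ignored harmlessly when OpenMP is disabled. */
void add_simd(float *c, const float *a, const float *b, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```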

AOS and SOA: the key data layout transformation for efficient SIMD execution, avoiding discontiguous (gather/scatter) memory accesses (layouts below assume 4-way SIMD).
- Array of Structures (AOS): x0 y0 z0 | x1 y1 z1 | x2 y2 z2 | x3 y3 z3 | x4 y4 z4 | x5 y5 ...
- Structure of Arrays (SOA): x0 x1 x2 x3 x4 ... | y0 y1 y2 y3 y4 ... | z0 z1 z2 z3 z4 ...
- Hybrid (often performs best for hand vectorization): x0 x1 x2 x3 | y0 y1 y2 y3 | z0 z1 z2 z3 | x4 x5 x6 x7 | y4 ...
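In C, the two layouts can be sketched as below (names illustrative). With SOA, the x coordinates are contiguous in memory, so one vector load fills a register with x0..x3 directly; with AOS, the same values sit three floats apart and would need a gather.

```c
#define N 8

/* Array of Structures (AOS): x, y, z interleaved in memory. */
struct point { float x, y, z; };
struct point points_aos[N];

/* Structure of Arrays (SOA): each coordinate contiguous. */
struct points_soa {
    float x[N];
    float y[N];
    float z[N];
} points_soa;

/* Scaling all x values touches only contiguous memory in the
 * SOA layout, which is exactly what a vectorizer wants to see. */
void scale_x_soa(struct points_soa *p, float s)
{
    for (int i = 0; i < N; i++)
        p->x[i] *= s;
}
```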

Automatic vectorization for JIT compilers. Automatic vectorization can be implemented in JIT compilers as well as in static compilers. Pros & cons: a JIT compiler can select the best SIMD ISA at runtime (e.g. SSE or AVX), but the analysis for loop vectorization is often too costly for JIT compilation. To avoid excessive compilation time in a JIT compiler, it is possible to run the analysis offline and embed the information in the bytecode [2]. [2] Nuzman et al., Vapor SIMD: Auto-vectorize once, run everywhere, CGO 2011.

Tradeoff between programmability and performance. Is there a good way to balance performance and programmability?
- Explicit SIMD parallelization: higher performance, lower programmability, lower portability
- Automatic loop vectorization: lower performance, higher programmability, higher portability

Outline
- Background: What's SIMD?
- Approaches for SIMD programming: explicit SIMD parallelization, automatic loop vectorization, programming models suitable for SIMD
- Summary

New programming models suitable for SIMD programming:
- Stream processing: small programs (kernel functions) are applied to each element in data streams (e.g. StreamIt, Brook, RapidMind)
- SIMT (Single Instruction Multiple Threads): an execution model for GPUs introduced by Nvidia, in which each slot of a SIMD instruction works as an independent thread (e.g. CUDA, OpenCL)
In both, the programmer and the language runtime system collaborate to vectorize the code: the programmer identifies and explicitly expresses the parallelism, and the language runtime is responsible for optimizing the code for the underlying architecture.

Example of an OpenCL kernel [3]. The programmer writes a program for each data element as scalar code (called a kernel); the runtime applies the kernel to all data elements. [3] Neil Trevett, OpenCL BOF at SIGGRAPH 2010, https://www.khronos.org/assets/uploads/developers/library/2010_siggraph_bof_opencl/opencl-bof-intro-and-overview_siggraph-jul10.pdf
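The kernel from the slide is not preserved in the transcription, but the model can be mimicked in plain C: the programmer writes a scalar per-element function (the "kernel"), and a runtime loop, which a real OpenCL implementation would replace with SIMD lanes or GPU threads, applies it to every element. All names below are illustrative:

```c
/* The "kernel": scalar code for a single data element, identified
 * by its global index (OpenCL's get_global_id(0)). */
static void vec_add_kernel(int gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Stand-in for the language runtime: applies the kernel over the
 * whole index range, one logical work-item per element. An OpenCL
 * runtime is free to map these work-items onto SIMD slots. */
static void run_kernel(int n, const float *a, const float *b, float *c)
{
    for (int gid = 0; gid < n; gid++)
        vec_add_kernel(gid, a, b, c);
}
```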

SIMT architecture, which extends SIMD. Today's GPU architectures extend SIMD for efficient execution of SIMT code: each thread executing a kernel is mapped onto a slot of the SIMD instructions, with gather/scatter memory access support and predicated execution support to convert control flow into data flow. The SIMD ISAs of general-purpose processors are also moving in a similar direction (e.g. AVX2 adds gather support).
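Predicated execution can be sketched with SSE2 intrinsics: an if in the loop body becomes a compare producing an all-ones/all-zeros mask per lane, which then selects between the two results (data flow instead of control flow). A minimal sketch computing the per-element maximum, with illustrative names:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Branch-free per-lane select: mask = (a > b); c = mask ? a : b.
 * _mm_cmpgt_epi32 yields 0xFFFFFFFF in lanes where the condition
 * holds, so and/andnot/or implement the select without branching. */
void vmax_sse2(int *c, const int *a, const int *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i m  = _mm_cmpgt_epi32(va, vb);
        __m128i r  = _mm_or_si128(_mm_and_si128(m, va),
                                  _mm_andnot_si128(m, vb));
        _mm_storeu_si128((__m128i *)(c + i), r);
    }
    for (; i < n; i++)      /* scalar remainder loop */
        c[i] = a[i] > b[i] ? a[i] : b[i];
}
```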

Other programming models for SIMD:
- Frameworks or languages for explicit SIMD parallelization with an abstracted SIMD model for better portability: Boost.SIMD [4], VecImp [5]
- The SIMT processing model applied to the SIMD instructions of general-purpose processors: the Intel SPMD Program Compiler (ispc) [6] exploits the SIMT model (which it calls SPMD) for programming SSE and AVX
[4] Estérie et al., Boost.SIMD: generic programming for portable SIMDization, PACT 2012
[5] Leißa et al., Extending a C-like language for portable SIMD programming, PPoPP 2012
[6] Pharr et al., ispc: A SPMD Compiler for High-Performance CPU Programming, Innovative Parallel Computing, 2012

Summary. This talk categorized SIMD programming techniques into three approaches: explicit SIMD parallelization, automatic loop vectorization, and programming models suitable for SIMD. There is no one-size-fits-all solution; each approach has its own benefits and drawbacks.