Auto-Vectorization with GCC


Auto-Vectorization with GCC
Hanna Franzen, Kevin Neuenfeldt (RWTH)
HPAC (High Performance and Automatic Computing), Seminar on Code-Generation

Outline
1 Introduction: What Is Vectorization, What Are The Challenges
2 Structure Of GCC
3 Vectorization
4 Code Generation & Algorithm
5 Summary


What Is Vectorization
SIMD: Single Instruction, Multiple Data. Most modern CPUs have SIMD vector registers, typically 8 to 16 bytes wide. Check support on UNIX (Intel/AMD): cat /proc/cpuinfo | grep -E "\s(sse|mmx)\s"
Definition: Vectorization is the rewriting of loops into vector instructions.
Notation: a[i:j] denotes the elements between indices i and j (inclusive).
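A loop of independent element-wise operations is the canonical vectorization candidate; with -O3, GCC may rewrite the body below as c[0:n-1] = a[0:n-1] + b[0:n-1], processing VF elements per instruction. A minimal sketch (function and array names are ours, not from the slides):

```c
#include <stddef.h>

/* Scalar element-wise add: one addition per iteration.
 * The vectorizer may rewrite this as c[0:n-1] = a[0:n-1] + b[0:n-1],
 * i.e. VF additions per SIMD instruction. */
void vec_add(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Compiling this with `gcc -O3 -fopt-info-vec` reports whether the loop was vectorized.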

Challenges
- Memory references must (mostly) be consecutive
- Memory must be aligned to the natural vector-size boundary
- The number of iterations must be countable
- Data dependences: the semantics of the vectorized program must be identical to those of the original program

Example:
for (int i = 1; i < 5; i++) {
    a[i] = a[i-1] + b[i]; // S1
}
Assume a = [1,2,3,4,5] and b = [5,4,3,2,1].
Sequential: a = [1,5,8,10,11]. Vectorized (naively): a = [1,5,5,5,5].

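The loop-carried dependence in the example above can be checked directly; a minimal sketch (function names are ours) contrasting sequential execution with naive all-at-once vector semantics:

```c
#include <string.h>

/* Sequential semantics: iteration i reads the a[i-1] just written by
 * iteration i-1 (a loop-carried dependence). */
void seq(int *a, const int *b) {
    for (int i = 1; i < 5; i++)
        a[i] = a[i-1] + b[i];
}

/* Naive vector semantics: every lane reads the OLD a[i-1],
 * modelling a[1:4] = a[0:3] + b[1:4] done in one step. */
void naive_vec(int *a, const int *b) {
    int old[5];
    memcpy(old, a, sizeof old);
    for (int i = 1; i < 5; i++)
        a[i] = old[i-1] + b[i];
}
```

With a = [1,2,3,4,5] and b = [5,4,3,2,1], the two functions produce the two different results given on the slide, which is why the vectorizer must reject this loop.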


Structure Of GCC
Pipeline: source code, front end (optimizations on trees), GIMPLify, middle end (optimizations on GIMPLE, where vectorization happens), expand, back end (optimizations on RTL), codegen, assembly.
Intermediate languages (IL): GIMPLE, and RTL (Register Transfer Language).
Vector instructions are platform dependent: different processors have different SIMD instruction sets. When GCC is built, a machine description file records which operations are available.



Overview: Vectorization Steps
Analyze...
- loop form: exit condition, control flow, ...
- memory references: compute a function that describes each memory reference's modifications
- scalar dependence cycles
- data reference access patterns
- data reference alignment
- data reference dependences
Determine VF (Vectorization Factor).


Scalar Dependences
Use of scalar variables (no arrays, no pointers). Example: reduction.

for (i = 0; i < n; i++) {
    sum += A[i];
}


Scalar Dependences: Reduction

for (i = 0; i < n; i++) {
    sum += A[i];
}

Intel Core 2 Duo B950, VF = 4, N = 1024, 256 partial sums.
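The reduction above carries a dependence on sum, but it can still be vectorized by keeping one partial sum per vector lane and combining them after the loop. A sketch of that transformation under our own names, assuming VF = 4 and n divisible by 4:

```c
/* Reduction with VF = 4: accumulate 4 partial sums ("lanes"),
 * then do a final horizontal reduction, as the vectorizer would. */
int sum_partial(const int *A, int n) {  /* assumes n % 4 == 0 */
    int s[4] = {0, 0, 0, 0};
    for (int i = 0; i < n; i += 4)
        for (int l = 0; l < 4; l++)     /* models one vector add */
            s[l] += A[i + l];
    return s[0] + s[1] + s[2] + s[3];   /* horizontal reduction */
}
```

This reorders the additions, so for floating-point reductions GCC only applies it under flags such as -ffast-math; for integers the result is identical.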

Access Patterns
- Most SIMD instruction sets require consecutive memory access; otherwise permutations of the data are needed
- Some SIMD instruction sets support certain access patterns directly (e.g. odd/even for complex numbers)
- The cost of the permutations is critical to whether the vectorization pays off

Example:
for (i = 0; i < n; i++) {
    A[i*2] = x;
}

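The stride-2 access above touches only every other element, so gathering those elements into one contiguous vector register requires an extract-even style permutation on most ISAs. A scalar model of that permutation (names ours):

```c
/* Stride-2 access: gathering the even-indexed elements of A into a
 * contiguous vector models the extract-even shuffle a vectorizer
 * must insert; its cost counts against the vectorization. */
void extract_even(const int *A, int *out, int n_pairs) {
    for (int i = 0; i < n_pairs; i++)
        out[i] = A[2 * i];
}
```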

Alignment
Access to an unaligned memory location may be needed. The target platform needs at least one of these capabilities:
- memory reads/writes at unaligned locations
- general vector shuffling
- specialized alignment support
Most SIMD platforms load VS bytes from the closest previous aligned address (e.g. AltiVec, Alpha).
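One standard way to reach an aligned address is loop peeling: execute a few scalar iterations until the pointer is aligned to the vector size, then run the vector loop. A sketch of the peel-count computation (the function and the peeling strategy are our illustration, not GCC's internal code):

```c
#include <stdint.h>

/* Number of leading scalar iterations over floats needed before
 * p + peel is aligned to vs_bytes (the vector size in bytes). */
int peel_count(const float *p, unsigned vs_bytes) {
    uintptr_t mis = (uintptr_t)p % vs_bytes;
    return mis ? (int)((vs_bytes - mis) / sizeof(float)) : 0;
}
```

For a 16-byte vector of floats, the peel count is always between 0 and 3; after peeling, the remaining accesses through the pointer are aligned.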

Data Dependences
[Figure: dependence graph over statements S1-S4, with loop-independent (δ0) and loop-carried (δ1, δ2) dependence edges.]


Vectorization Algorithm I

for (i=1; i<=100; i++) {
    X[i] = Y[i] + 10;                   // S1
    for (j=1; j<=100; j++) {
        B[j] = A[j][N];                 // S2
        for (k=1; k<=100; k++) {
            A[j+1][k] = B[j] + C[j][k]; // S3
        }
        Y[i+j] = A[j+1][N];             // S4
    }
}

codegen(R, k, D): R region, k minimum nesting level, D dependence graph.
This is the basis for GCC's vectorization procedure.


Vectorization Algorithm II
[Figure: the dependence graph over S1-S4 from the example, with loop-independent (δ0) and loop-carried (δ1, δ2) edges; codegen processes it level by level.]

Vectorization Algorithm III

for (i=1; i<=100; i++) {
    codegen({S2,S3,S4}, 2, D2)
}
X[1:100] = Y[1:100] + 10;

[Figure: remaining dependence cycle among S2, S3, S4.]

Vectorization Algorithm IV

for (i=1; i<=100; i++) {
    for (j=1; j<=100; j++) {
        codegen({S2,S3}, 3, D3)
    }
    Y[i+1:i+100] = A[2:101][N];
}
X[1:100] = Y[1:100] + 10;

[Figure: remaining dependence between S2 and S3.]

Vectorization Algorithm V

for (i=1; i<=100; i++) {
    for (j=1; j<=100; j++) {
        B[j] = A[j][N];
        A[j+1][1:100] = B[j] + C[j][1:100];
    }
    Y[i+1:i+100] = A[2:101][N];
}
X[1:100] = Y[1:100] + 10;

Code Generation
[Diagram: the GCC pipeline from the Structure Of GCC slide (source code, front end, GIMPLify, middle end, expand, back end, codegen, assembly); vectorization runs in the middle end on GIMPLE, before expansion to RTL.]


Overview
Handled:
- scalar dependence cycles
- data reference dependences
- data reference access patterns
- data reference alignment
Still in progress:
- cost model
- MIMD support
- enhancement of existing techniques
- much more...


Summary: Speed Enhancement
Compile the code like this:
gcc example.c -o example_vect -O3

/usr/bin/time -f "%E time, %P CPU" ./example
0:00.13 time, 97% CPU
/usr/bin/time -f "%E time, %P CPU" ./example_vect
0:00.03 time, 93% CPU

Speed enhancement of roughly a factor of 4, matching VF = 4.


GCC Code
GCC 4.9 source: gcc/tree-vect*, 31,995 well-documented LOC.
Introduced on Jan 1st, 2004 in the lno branch. Initiated by Dorit Nuzman (IBM).

References
Dorit Nuzman, Richard Henderson (2006). Multi-platform Auto-vectorization. Proceedings of the International Symposium on Code Generation and Optimization (CGO '06), 281-294.
Dorit Naishlos (2004). Autovectorization in GCC.
Dorit Nuzman, Ayal Zaks (2006). Autovectorization in GCC: Two Years Later. GCC Summit Proceedings 2006, 145-158.
Alexandre Eichenberger, Peng Wu, Kevin O'Brien (2004). Vectorization for SIMD Architectures with Alignment Constraints. Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI), 82-93.
Ken Kennedy, John R. Allen (2001). Optimizing Compilers for Modern Architectures: A Dependence-based Approach.