Auto-Vectorization with GCC

Hanna Franzen, Kevin Neuenfeldt (RWTH)
HPAC: High Performance and Automatic Computing
Seminar on Code-Generation

Outline
  1 Introduction
      What Is Vectorization
      What Are The Challenges
  2 Structure Of GCC
  3 Vectorization
  4 Code Generation & Algorithm
  5 Summary

What Is Vectorization
  SIMD: Single Instruction, Multiple Data
  Most modern CPUs have SIMD vector registers, mostly 8 to 16 bytes wide
  Check support on Linux (Intel/AMD):
    cat /proc/cpuinfo | grep -E '\s(sse|mmx)\s'
  Definition: Vectorization is the rewriting of loops into vector instructions.
  Notation: a[i:j] = the elements from index i to index j (inclusive).
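The definition can be made concrete with a minimal sketch in plain C. It assumes a vectorization factor of 4 (four ints in a 16-byte register); the names add_scalar, add_strips and VF are illustrative and do not come from the slides. The inner lane loop stands in for one vector statement c[i:i+3] = a[i:i+3] + b[i:i+3].

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor: four ints per 16-byte register */

/* Scalar form: one addition per loop iteration. */
void add_scalar(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Strip-mined form: each outer iteration models one vector statement
 * c[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1]. */
void add_strips(const int *a, const int *b, int *c, size_t n) {
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        for (size_t lane = 0; lane < VF; lane++) {
            c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    for (; i < n; i++) {  /* scalar epilogue for the leftover elements */
        c[i] = a[i] + b[i];
    }
}
```

Both functions produce the same result for any n; the second only makes the grouping into vector-sized strips explicit.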

Challenges
  Memory references must (mostly) be consecutive
  Memory must be aligned to the natural vector size boundary
  Number of iterations must be countable
  Data dependences: the semantics of the vectorized program must be identical to the original program

  Example
    for (int i = 1; i < 5; i++) {
        a[i] = a[i-1] + b[i];  // S1
    }
    Assume a = [1,2,3,4,5] and b = [5,4,3,2,1].
    Sequential: a = [1,5,8,10,11]
    Vectorized: a = [1,5,5,5,5] (every a[i-1] is read before any a[i] is written)
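The two results on this slide can be reproduced with a short sketch. The function names are illustrative. The second function models a naive vector statement a[1:4] = a[0:3] + b[1:4] by taking a snapshot of a before any element is stored, which is an assumption about how such a statement would behave rather than actual GCC output.

```c
#include <string.h>

#define N 5

/* Sequential semantics: iteration i sees the a[i-1] written by iteration i-1. */
void recurrence_sequential(int a[N], const int b[N]) {
    for (int i = 1; i < N; i++) {
        a[i] = a[i - 1] + b[i];
    }
}

/* Naive vector semantics: all loads of a happen before any store,
 * so every lane reads the old value of a[i-1]. */
void recurrence_naive_vector(int a[N], const int b[N]) {
    int old_a[N];
    memcpy(old_a, a, sizeof old_a);
    for (int i = 1; i < N; i++) {
        a[i] = old_a[i - 1] + b[i];
    }
}
```

The loop-carried dependence of S1 on itself is what makes the two versions differ, so a vectorizer has to reject this loop.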

Structure Of GCC
  Pipeline: source code -> front end (optimizations on trees) -> GIMPLify -> middle end (optimizations on GIMPLE, including vectorization) -> expand -> back end (optimizations on RTL) -> codegen -> assembly
  Vector instructions are platform dependent
  Different processors have different SIMD instruction sets
  When GCC is built, it uses a machine description file to record which operations are available
  Intermediate languages (IL): GIMPLE and RTL (Register Transfer Language)

Overview: Vectorization Steps
  Analyze ...
    loop form: exit condition, control flow, ...
    memory references: compute a function that describes how each memory reference changes across iterations
    scalar dependence cycles
    data reference access patterns
    data reference alignment
    data reference dependences
  Determine VF (Vectorization Factor)

Scalar Dependences
  Use of scalar variables (no arrays, pointers)
  Example: Reduction
    for (i = 0; i < n; i++) {
        sum += A[i];
    }

Scalar Dependences: Reduction
  Example
    for (i = 0; i < n; i++) {
        sum += A[i];
    }
  Intel Core 2Duo B950
  VF = 4, N = 1024: 256 partial sums
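One common way to break the dependence cycle on sum is to keep VF independent partial sums and combine them after the loop. The sketch below assumes VF = 4 and integer data; with N = 1024 its main loop runs 1024 / 4 = 256 times. The names sum_scalar and sum_partial are illustrative, and this is a model of the idea, not GCC's generated code.

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor */

/* Original reduction: every iteration depends on the previous value of sum. */
long sum_scalar(const int *A, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}

/* Reduction with VF partial sums: lane l accumulates A[l], A[l+VF], ... */
long sum_partial(const int *A, size_t n) {
    long partial[VF] = {0};
    size_t i = 0;
    for (; i + VF <= n; i += VF) {           /* models one vector add per step */
        for (size_t lane = 0; lane < VF; lane++) {
            partial[lane] += A[i + lane];
        }
    }
    long sum = 0;
    for (size_t lane = 0; lane < VF; lane++) {  /* combine the partial sums */
        sum += partial[lane];
    }
    for (; i < n; i++) {                     /* scalar epilogue */
        sum += A[i];
    }
    return sum;
}
```

Integers are used on purpose: the partial sums add the elements in a different order, which is harmless for integer addition but can change a floating-point result, so GCC vectorizes floating-point reductions only when reassociation is allowed (for example with -ffast-math).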

Access Patterns
  Most SIMD instruction sets require consecutive memory access
  Otherwise, permutations of the data are needed
  Some SIMD instruction sets support certain access patterns (e.g. odd/even for complex numbers)
  The cost of the permutations is critical to whether vectorization is worthwhile
  Example
    for (i = 0; i < n; i++) {
        A[i*2] = x;
    }
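The strided store in the example can be modeled with consecutive accesses plus a lane selection, which is roughly the kind of extra data rearrangement the slide refers to. This is a sketch under stated assumptions: VF = 4, illustrative function names, and an array A that holds 2*n elements, because the consecutive version also reads and rewrites the odd positions.

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor */

/* Original loop: stride-2 store, touches only the even indices. */
void strided_scalar(int *A, size_t n, int x) {
    for (size_t i = 0; i < n; i++) {
        A[i * 2] = x;
    }
}

/* Consecutive model: load a chunk, replace the even lanes, store the chunk.
 * Assumes A has room for 2*n elements. */
void strided_blend(int *A, size_t n, int x) {
    size_t total = 2 * n;
    size_t base = 0;
    for (; base + VF <= total; base += VF) {
        int v[VF];
        for (size_t lane = 0; lane < VF; lane++) {
            v[lane] = A[base + lane];            /* consecutive load */
        }
        for (size_t lane = 0; lane < VF; lane++) {
            if ((base + lane) % 2 == 0) {
                v[lane] = x;                     /* select: even lanes get x */
            }
        }
        for (size_t lane = 0; lane < VF; lane++) {
            A[base + lane] = v[lane];            /* consecutive store */
        }
    }
    for (; base < total; base++) {               /* scalar epilogue */
        if (base % 2 == 0) {
            A[base] = x;
        }
    }
}
```

Half of the loaded and stored elements carry no useful work here, which illustrates why the cost of handling non-consecutive patterns decides whether vectorizing such a loop pays off.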

Alignment
  Access to an unaligned memory location is sometimes needed.
  The target platform needs at least one of these capabilities:
    memory read/write to an unaligned memory location
    general vector shuffling ability
    specialized alignment support
  Most SIMD platforms load VS (vector size) bytes from the closest preceding aligned address (e.g. AltiVec, Alpha).
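On a platform that only loads from aligned addresses, an unaligned vector can be assembled from two aligned loads and a shuffle. The sketch below models this with array indices instead of byte addresses: an index counts as aligned when it is a multiple of VS, and VS is measured in elements. The helper names are illustrative, and the buffer is assumed to be padded to a multiple of VS so that the second aligned load stays in bounds.

```c
#include <stddef.h>

#define VS 4  /* assumed vector size, counted in elements */

/* Aligned load: idx must be a multiple of VS. */
static void load_aligned(const int *mem, size_t idx, int out[VS]) {
    for (size_t lane = 0; lane < VS; lane++) {
        out[lane] = mem[idx + lane];
    }
}

/* Load VS elements starting at an arbitrary index using only aligned loads. */
void load_realigned(const int *mem, size_t start, int out[VS]) {
    size_t offset = start % VS;
    size_t floor_idx = start - offset;   /* closest preceding aligned index */
    int lo[VS];
    load_aligned(mem, floor_idx, lo);
    if (offset == 0) {                   /* already aligned: one load suffices */
        for (size_t lane = 0; lane < VS; lane++) {
            out[lane] = lo[lane];
        }
        return;
    }
    int hi[VS];
    load_aligned(mem, floor_idx + VS, hi);
    for (size_t lane = 0; lane < VS; lane++) {   /* shuffle lo and hi together */
        size_t src = offset + lane;
        out[lane] = (src < VS) ? lo[src] : hi[src - VS];
    }
}
```

In a loop over consecutive vectors the hi vector of one iteration can serve as the lo vector of the next, so the steady-state cost is one aligned load and one shuffle per vector.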

Data Dependences
  [Figure: dependence graph over statements S1 to S4 with δ-labeled edges; the edge details are not recoverable from the transcript]

Vectorization Algorithm I
    for (i = 1; i <= 100; i++) {
        X[i] = Y[i] + 10;                        // S1
        for (j = 1; j <= 100; j++) {
            B[j] = A[j][N];                      // S2
            for (k = 1; k <= 100; k++) {
                A[j+1][k] = B[j] + C[j][k];      // S3
            }
            Y[i+j] = A[j+1][N];                  // S4
        }
    }
  codegen(R, k, D): R region, k minimum nesting level, D dependence graph
  Basis for GCC's vectorization procedure

Vectorization Algorithm II
  [Figure: dependence graph over statements S1 to S4 with δ-labeled edges; the edge details are not recoverable from the transcript]

Vectorization Algorithm III
    for (i = 1; i <= 100; i++) {
        codegen({S2, S3, S4}, 2, D2)
    }
    X[1:100] = Y[1:100] + 10;
  [Figure: remaining dependence graph over S2, S3 and S4]

Vectorization Algorithm IV
    for (i = 1; i <= 100; i++) {
        for (j = 1; j <= 100; j++) {
            codegen({S2, S3}, 3, D3)
        }
        Y[i+1:i+100] = A[2:101][N];
    }
    X[1:100] = Y[1:100] + 10;
  [Figure: remaining dependence graph over S2 and S3]

Vectorization Algorithm V
    for (i = 1; i <= 100; i++) {
        for (j = 1; j <= 100; j++) {
            B[j] = A[j][N];
            A[j+1][1:100] = B[j] + C[j][1:100];
        }
        Y[i+1:i+100] = A[2:101][N];
    }
    X[1:100] = Y[1:100] + 10;
  [Figure: remaining dependence graph over S2 and S3]
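The final form can be checked against the original loop nest with a small C sketch. Each vector statement is written as a plain loop over its index range, the column index N is fixed to 50 as an assumption (the slides leave it open), and the Data struct and function names are illustrative. Arrays are sized so that the 1-based indices from the slides can be used directly.

```c
#define M 100  /* loop bound from the slides */
#define N 50   /* assumed column index, must satisfy 1 <= N <= M */

typedef struct {
    int X[M + 1];
    int Y[2 * M + 1];
    int B[M + 1];
    int A[M + 2][M + 1];
    int C[M + 1][M + 1];
} Data;

/* Original loop nest with statements S1 to S4. */
void run_original(Data *d) {
    for (int i = 1; i <= M; i++) {
        d->X[i] = d->Y[i] + 10;                         /* S1 */
        for (int j = 1; j <= M; j++) {
            d->B[j] = d->A[j][N];                       /* S2 */
            for (int k = 1; k <= M; k++) {
                d->A[j + 1][k] = d->B[j] + d->C[j][k];  /* S3 */
            }
            d->Y[i + j] = d->A[j + 1][N];               /* S4 */
        }
    }
}

/* Final form: S3 vectorized over k, S4 over j, S1 over i. */
void run_transformed(Data *d) {
    for (int i = 1; i <= M; i++) {
        for (int j = 1; j <= M; j++) {
            d->B[j] = d->A[j][N];                       /* S2 stays sequential */
            for (int k = 1; k <= M; k++) {              /* A[j+1][1:100] = B[j] + C[j][1:100] */
                d->A[j + 1][k] = d->B[j] + d->C[j][k];
            }
        }
        for (int j = 1; j <= M; j++) {                  /* Y[i+1:i+100] = A[2:101][N] */
            d->Y[i + j] = d->A[j + 1][N];
        }
    }
    for (int i = 1; i <= M; i++) {                      /* X[1:100] = Y[1:100] + 10 */
        d->X[i] = d->Y[i] + 10;
    }
}
```

S2 and S3 remain inside sequential i and j loops because they form a dependence cycle: S3 writes A[j+1][N], which S2 reads in the next j iteration. S1 can move behind the i loop because Y[i] is only written by earlier iterations of i.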

Code Generation
  Pipeline: source code -> front end (optimizations on trees) -> GIMPLify -> middle end (optimizations on GIMPLE, including vectorization) -> expand -> back end (optimizations on RTL) -> codegen -> assembly

Summary: Overview
  scalar dependence cycles
  data reference dependences
  data reference access patterns
  data reference alignment
  Still in progress:
    Cost model
    MIMD support
    Enhancement of existing techniques
    Much more...

Summary: Speed Enhancement
  Compile code like:
    gcc example.c -o example_vect -O3
  Timing:
    /usr/bin/time -f "%E time, %P CPU" ./example
    0:00.13 time, 97% CPU
    /usr/bin/time -f "%E time, %P CPU" ./example_vect
    0:00.03 time, 93% CPU
  Speedup of about a factor of 4, in line with VF = 4.
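The slides do not show example.c, so the following is only a hypothetical kernel of the kind such a measurement could use; every name in it is an assumption. It adds two float arrays repeatedly; a main function that calls run_benchmark with a large repetition count would be needed to turn it into a timed program. In GCC, -O3 enables -ftree-vectorize, a baseline could be built with -O3 -fno-tree-vectorize, and -fopt-info-vec reports which loops were vectorized.

```c
#include <stddef.h>

#define LEN 4096  /* assumed array length */

static float a[LEN], b[LEN], c[LEN];

/* noinline (a GCC attribute) keeps the kernel a separate function, so the
 * repeated calls below are not merged or removed by the optimizer. */
__attribute__((noinline))
static void add_kernel(float *restrict dst, const float *restrict x,
                       const float *restrict y, size_t n) {
    for (size_t i = 0; i < n; i++) {   /* candidate loop for vectorization */
        dst[i] = x[i] + y[i];
    }
}

/* Runs the kernel reps times and returns one element as a checksum. */
float run_benchmark(int reps) {
    for (size_t i = 0; i < LEN; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }
    for (int r = 0; r < reps; r++) {
        add_kernel(c, a, b, LEN);
    }
    return c[LEN - 1];
}
```

The restrict qualifiers tell the compiler that the three arrays do not overlap, which removes the need for a runtime alias check before the vectorized loop.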

GCC Code
  GCC 4.9 source: gcc/tree-vect*
  well documented
  LOC: (count missing in the transcript)
  Introduced on Jan 1st, 2004 in the lno branch
  Initiated by Dorit Nuzman (IBM)



More information

Tour of common optimizations

Tour of common optimizations Tour of common optimizations Simple example foo(z) { x := 3 + 6; y := x 5 return z * y } Simple example foo(z) { x := 3 + 6; y := x 5; return z * y } x:=9; Applying Constant Folding Simple example foo(z)

More information

Program Optimization Through Loop Vectorization

Program Optimization Through Loop Vectorization Program Optimization Through Loop Vectorization María Garzarán, Saeed Maleki William Gropp and David Padua Department of Computer Science University of Illinois at Urbana-Champaign Simple Example Loop

More information

Computer System Architecture

Computer System Architecture CSC 203 1.5 Computer System Architecture Department of Statistics and Computer Science University of Sri Jayewardenepura Instruction Set Architecture (ISA) Level 2 Introduction 3 Instruction Set Architecture

More information

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County Accelerating a climate physics model with OpenCL Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou University of Maryland Baltimore County Introduction The demand to increase forecast predictability

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

6.189 IAP Lecture 11. Parallelizing Compilers. Prof. Saman Amarasinghe, MIT IAP 2007 MIT

6.189 IAP Lecture 11. Parallelizing Compilers. Prof. Saman Amarasinghe, MIT IAP 2007 MIT 6.189 IAP 2007 Lecture 11 Parallelizing Compilers 1 6.189 IAP 2007 MIT Outline Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel

More information

Beware Of Your Cacheline

Beware Of Your Cacheline Beware Of Your Cacheline Processor Specific Optimization Techniques Hagen Paul Pfeifer hagen@jauu.net http://protocol-laboratories.net Jan 18 2007 Introduction Why? Memory bandwidth is high (more or less)

More information

Alias Analysis for Intermediate Code

Alias Analysis for Intermediate Code Alias Analysis for Intermediate Code Sanjiv K. Gupta Naveen Sharma System Software Group HCL Technologies, Noida, India 201 301 {sanjivg,naveens}@noida.hcltech.com Abstract Most existing alias analysis

More information

OpenMP 4.0: a Significant Paradigm Shift in High-level Parallel Computing (Not just Shared Memory anymore) Michael Wong OpenMP CEO

OpenMP 4.0: a Significant Paradigm Shift in High-level Parallel Computing (Not just Shared Memory anymore) Michael Wong OpenMP CEO OpenMP 4.0: a Significant Paradigm Shift in High-level Parallel Computing (Not just Shared Memory anymore) Michael Wong OpenMP CEO OpenMP 4.0: a Significant Paradigm Shift in High-level Parallel Computing

More information

Vectorization Past Dependent Branches Through Speculation

Vectorization Past Dependent Branches Through Speculation Vectorization Past Dependent Branches Through Speculation Majedul Haque Sujon R. Clint Whaley Center for Computation & Technology(CCT), Louisiana State University (LSU). University of Texas at San Antonio

More information

Final Exam Review Questions

Final Exam Review Questions EECS 665 Final Exam Review Questions 1. Give the three-address code (like the quadruples in the csem assignment) that could be emitted to translate the following assignment statement. However, you may

More information

An Overview to Compiler Design. 2008/2/14 \course\cpeg421-08s\topic-1a.ppt 1

An Overview to Compiler Design. 2008/2/14 \course\cpeg421-08s\topic-1a.ppt 1 An Overview to Compiler Design 2008/2/14 \course\cpeg421-08s\topic-1a.ppt 1 Outline An Overview of Compiler Structure Front End Middle End Back End 2008/2/14 \course\cpeg421-08s\topic-1a.ppt 2 Reading

More information

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron January 14, 2016 Notice The material in this document is supplementary material to

More information

SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input

More information

Compiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005

Compiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compiling for Performance on hp OpenVMS I64 Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compilers discussed C, Fortran, [COBOL, Pascal, BASIC] Share GEM optimizer

More information

C Language Constructs for Parallel Programming

C Language Constructs for Parallel Programming C Language Constructs for Parallel Programming Robert Geva 5/17/13 1 Cilk Plus Parallel tasks Easy to learn: 3 keywords Tasks, not threads Load balancing Hyper Objects Array notations Elemental Functions

More information

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

More information

Loop Transformations! Part II!

Loop Transformations! Part II! Lecture 9! Loop Transformations! Part II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Loop Unswitching Hoist invariant control-flow

More information

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Lecture 4 Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Partners? Announcements Scott B. Baden / CSE 160 / Winter 2011 2 Today s lecture Why multicore? Instruction

More information

Bentley Rules for Optimizing Work

Bentley Rules for Optimizing Work 6.172 Performance Engineering of Software Systems SPEED LIMIT PER ORDER OF 6.172 LECTURE 2 Bentley Rules for Optimizing Work Charles E. Leiserson September 11, 2012 2012 Charles E. Leiserson and I-Ting

More information

A Parallelizing Compiler for Multicore Systems

A Parallelizing Compiler for Multicore Systems A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)

More information

What Do Compilers Do? How Can the Compiler Improve Performance? What Do We Mean By Optimization?

What Do Compilers Do? How Can the Compiler Improve Performance? What Do We Mean By Optimization? What Do Compilers Do? Lecture 1 Introduction I What would you get out of this course? II Structure of a Compiler III Optimization Example Reference: Muchnick 1.3-1.5 1. Translate one language into another

More information

CS 2461: Computer Architecture 1

CS 2461: Computer Architecture 1 Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code

More information

CSC2/458 Parallel and Distributed Systems Machines and Models

CSC2/458 Parallel and Distributed Systems Machines and Models CSC2/458 Parallel and Distributed Systems Machines and Models Sreepathi Pai January 23, 2018 URCS Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics Outline Recap Scalability Taxonomy

More information

Some features of modern CPUs. and how they help us

Some features of modern CPUs. and how they help us Some features of modern CPUs and how they help us RAM MUL core Wide operands RAM MUL core CP1: hardware can multiply 64-bit floating-point numbers Pipelining: can start the next independent operation before

More information

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Guest Lecturer: Alan Christopher 3/08/2014 Spring 2014 -- Lecture #19 1 Neuromorphic Chips Researchers at IBM and

More information

OpenMP. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 Worksharing constructs To date: #pragma omp parallel created a team of threads We distributed

More information

GCC An Architectural Overview

GCC An Architectural Overview GCC An Architectural Overview Diego Novillo dnovillo@redhat.com Red Hat Canada OSDL Japan - Linux Symposium Tokyo, September 2006 Topics 1. Overview 2. Development model 3. Compiler infrastructure 4. Current

More information

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only

More information

Leftover. UMA vs. NUMA Parallel Computer Architecture. Shared-memory Distributed-memory

Leftover. UMA vs. NUMA   Parallel Computer Architecture. Shared-memory Distributed-memory Leftover Parallel Computer Architecture Shared-memory Distributed-memory 1 2 Shared- Parallel Computer Shared- Parallel Computer System Bus Main Shared Programming through threading Multiple processors

More information

arxiv: v1 [cs.pf] 21 Dec 2014

arxiv: v1 [cs.pf] 21 Dec 2014 Performance comparison between Java and JNI for optimal implementation of computational micro-kernels Nassim Halli Univ. Grenoble, France Aselta Nanographics, France nassim.halli@aselta.com Henri-Pierre

More information

An Architectural Overview of GCC

An Architectural Overview of GCC An Architectural Overview of GCC Diego Novillo dnovillo@redhat.com Red Hat Canada Linux Symposium Ottawa, Canada, July 2006 Topics 1. Overview 2. Development model 3. Compiler infrastructure 4. Intermediate

More information

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron October 12, 2015 Notice The material in this document is supplementary material to

More information

Usually, target code is semantically equivalent to source code, but not always!

Usually, target code is semantically equivalent to source code, but not always! What is a Compiler? Compiler A program that translates code in one language (source code) to code in another language (target code). Usually, target code is semantically equivalent to source code, but

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Compiler Theory. (GCC the GNU Compiler Collection) Sandro Spina 2009

Compiler Theory. (GCC the GNU Compiler Collection) Sandro Spina 2009 Compiler Theory (GCC the GNU Compiler Collection) Sandro Spina 2009 GCC Probably the most used compiler. Not only a native compiler but it can also cross-compile any program, producing executables for

More information

Parallel Processing SIMD, Vector and GPU s

Parallel Processing SIMD, Vector and GPU s Parallel Processing SIMD, ector and GPU s EECS4201 Comp. Architecture Fall 2017 York University 1 Introduction ector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating

More information

Automatic Polyhedral Optimization of Stencil Codes

Automatic Polyhedral Optimization of Stencil Codes Automatic Polyhedral Optimization of Stencil Codes ExaStencils 2014 Stefan Kronawitter Armin Größlinger Christian Lengauer 31.03.2014 The Need for Different Optimizations 3D 1st-grade Jacobi smoother Speedup

More information

OpenMP 4.0. Mark Bull, EPCC

OpenMP 4.0. Mark Bull, EPCC OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!

More information

Flynn Taxonomy Data-Level Parallelism

Flynn Taxonomy Data-Level Parallelism ecture 27 Computer Science 61C Spring 2017 March 22nd, 2017 Flynn Taxonomy Data-Level Parallelism 1 New-School Machine Structures (It s a bit more complicated!) Software Hardware Parallel Requests Assigned

More information

Exploring Parallelism At Different Levels

Exploring Parallelism At Different Levels Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities

More information