FADA: Fuzzy Array Dataflow Analysis


M. Belaoucha, D. Barthou, S. Touati

27/06/2008

Abstract

This document explains the basis of fuzzy array dataflow analysis (FADA) and its applications to the parallelization of code fragments for multicore architectures. Usual data dependence analysis works for a restricted set of programs, mainly the so-called static control programs. However, typical C code fragments do not fit into this class of programs, and consequently current compilers do not have precise data dependence analysis for them. FADA extends array data dependence analysis to take into account irregular code structures such as while-loops, function calls, non-affine array accesses, etc. FADA is more powerful but requires more compilation time. Thanks to FADA, we are able to extract hidden thread-level parallelism from code fragments so that it can be exploited on multicore architectures. Other applications are also possible and are described in this report.

1 Introduction

Dependence analysis is a task required by many code optimizations and transformations. If the dependence information is precise, more optimizations become possible and more parallelism can be detected and extracted. Below, we explain how the proposed method provides support for a larger set of programs, and how we can obtain more precise dependencies while recovering full information about them.

FADA was proposed by Barthou et al. [4] and detailed in Barthou's Ph.D. thesis [1]. FADA is an adaptation of Feautrier's exact dataflow analysis [5] to irregular programs. This analysis computes instance-wise dependencies for a given program. Feautrier's method computes exact dependence information for static control programs (also called regular programs) with the following properties:

- Loops have bounds (as Fortran DO-loops);
- conditions of if and while statements, and array indexes, are affine functions of the surrounding loop counters and of parameters (symbolic constants).

If a code fragment does not fit these characteristics, then Feautrier's method cannot be used. In our context, we call irregular programs (non static control programs) the set of code fragments that do not fit Feautrier's model: that is, all code fragments that contain almost any control structure of imperative programs (while-loops, do-while loops, for-loops, if-then-else, function calls, and even pointers).
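To make this distinction concrete, here are two small C fragments (our own illustration, not taken from the report). The first is a static control loop nest; the second is irregular because of its while-loop, its data-dependent exit condition and its indirect subscript (the arrays, the predicate p and the indirection idx are hypothetical names):

    /* static control: affine bounds and affine subscripts */
    for (i = 0; i < N; i++)
        for (j = 0; j <= i; j++)
            X[i][j] = Y[i + j] + 1;

    /* irregular: while-loop, non-affine exit condition p(), indirection idx[] */
    i = 0;
    while (i < N && p(Z[i])) {
        W[idx[i]] = Z[i];
        i++;
    }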

Thanks to FADA, we can compute the best affine approximation of the dependencies of this set of irregular programs. The approximation can be improved by using information related to the application context; we then talk about reducing fuzziness.

FADA is composed of two building blocks. First, FADA performs an instance-wise dependence analysis encoding all non-linear constraints in parameters, where the approximation is wide. Second, the resulting dependence information is refined by the second building block, whose aim is to get better approximations of the non-affine constraints. By handling this kind of constraint, FADA becomes a powerful static dataflow analysis.

2 The FADA Processing

The following subsections explain the two building blocks of FADA, used to compute more precise dependencies.

2.1 Step 1: Basic Analysis

This step computes the exact source of each referenced program variable and each array cell in a given static control program. It also makes approximations and produces parametric dependence relations in the case of irregular programs.

for (i=1; i<=N; ++i) {
  if (F(i-1))
S0:   ... = B[i-1] ...;
  if (!F(i))
S1:   B[i] = ...;
}

Figure 1: A simple example to explain the FADA modelling

Program 1 (Fig. 1) has two statements accessing the same array B. Statement S0 reads B cells and statement S1 produces B values. F in the if-conditions represents a non-affine boolean function call. This is a typical example of what we call an irregular code fragment. We have a typical producer-consumer pattern here, and S0 may depend on S1 through the values of B cells. An instance-wise approach checks whether statement instances are dependent (for all possible iterations). To fix the iteration numbers, we denote by ir the iteration at which S0 reads a B cell, and by iw the iteration at which S1 writes into the same B cell. FADA solves a system of constraints to detect data dependences. The system of constraints is modelled as follows.

S0 and S1 must be executed:

    1 <= ir <= N,        (1)
    F(ir - 1) = true,    (2)

S0 and S1 must access the same cell:

    1 <= iw <= N,        (3)
    F(iw) = false,       (4)
    ir - 1 = iw.         (5)

If we look for a flow dependence, S1 must be executed before S0:

    iw <= ir.            (6)

FADA builds the set of constraints (1)-(6) and gives it to a parametric integer solver (PIP) in order to compute the source of the value B[ir-1] read by S0: that is, the set of statement instances that write into B[ir-1]. Computing a source amounts to finding the maximal value of iw that satisfies constraints (1)-(6). If such a maximal value of iw exists, we deduce that S0 and S1 are data-dependent; otherwise, they are completely independent.

The main problem with irregular programs is that the FADA system of constraints contains non-affine constraints (here (2) and (4)), which are not supported by available parametric solvers such as PIP. FADA bypasses this limitation by considering properties of the possible source of the dependence. For that, it introduces a parameter with an affine description that approximates the computation of sources without the non-affine constraints. These constraints are semantically described by the added parameter, but they are left out during the computation of the source. As a consequence, the analysis may conservatively report some false dependencies.

For program 1, the presence of the non-affine constraints (2) and (4) causes FADA's basic analysis to compute conservative dependence information with possible false dependencies: some constraints can never be satisfied, but the solver cannot detect it. This first step of FADA states that instance ir of S0 may be data-dependent on instance ir-1 of S1 through the array cell B[ir-1]. However, the next step of FADA can demonstrate that there is no dependence between S0 and S1 in program 1.
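To make the approximation concrete, here is a sketch, in our own notation rather than that of [4], of what the basic analysis computes for program 1. Dropping the non-affine constraints (2) and (4), the affine part of the system gives

    source(<S0, ir>) = max { iw | 1 <= iw <= N, iw = ir - 1, iw <= ir } = ir - 1   (for ir >= 2),

so the candidate source <S1, ir - 1> remains, but only under the non-affine conditions F(ir - 1) = true (from (2)) and F(ir - 1) = false (from (4) and (5)), which the basic analysis keeps as unresolved parameters. The advanced analysis described next observes that these two conditions are contradictory.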

2.2 Step 2: Advanced Analysis

Working around non-affine constraints introduces some fuzziness and thereby produces false dependencies. Reducing fuzziness (by handling the non-affine constraints) means eliminating these false dependencies. FADA proposes three mechanisms to handle the non-affine constraints separately, and to approximate them by finding the smallest affine set.

Structural analysis

It tries to take advantage of the control-flow properties of if and while constructs in order to eliminate their non-affine conditions. For example, we know that for each execution of an if, the condition evaluates to either true or false, hence exactly one of the branches is executed. It means that if we compare two write operations (on the same variable), one in the then part and the other in the else part, we can deduce that in all cases the variable will be written by one of the two operations (see Ex. 2 in the next section).

Iterative analysis

It tries to find the condition under which two expressions are equivalent. Given two non-affine constraints, they have the same value for two iterations if the expressions are syntactically equivalent and the variables referenced in these expressions have the same sources. This implies that the sources of these referenced variables are computed beforehand, possibly through another FADA; this is why it is called iterative analysis (see Ex. 3.1 and 4 in the next section).

Translating properties

It uses some (internal and/or external) knowledge to interpret non-affine functions and translate them into an appropriate affine form. For instance, the user can define invariants on arrays used for indirections, depending on the application context. The process is similar to automatic theorem proving based on an advanced PROLOG-like iterative unification-resolution process. The intuition is that, from predicates involving non-affine constraints, the system automatically deduces some affine consequences. To illustrate how it works, let us study the implication of Fig. 2, in which constraints (7) and (8) let us deduce constraint (9):

    p(i) => x(i),         (7)
    not p(i) => y(i),     (8)
    x(i) or y(i).         (9)

Figure 2: Logical unification

So we can eliminate the predicate p (called the unifier) without knowing anything about it. This is beneficial when p is a non-affine predicate while x and y are linear ones. Here, FADA assumes that the non-affine functions are pure, i.e., without side effects: for the same argument values, these functions return the same results; otherwise the mechanism of translating properties cannot be applied.
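As a hedged illustration of such external knowledge (our own example, not taken from the report): suppose the user asserts that the indirection array perm holds pairwise-distinct values. Translating this property into the affine fact that perm[i] != perm[i'] whenever i != i' is enough to prove that the loop below carries no dependence on X, so it is parallel (perm, the array X and the function g, assumed pure, are hypothetical names):

    /* user-supplied invariant: perm[0..N-1] are pairwise distinct */
    for (i = 0; i < N; i++)
        X[perm[i]] = g(i);   /* distinct iterations write distinct cells of X */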

The next section presents some applications of FADA that enhance program execution times on multicore architectures.

3 Applications

Usual code optimisation methods have to preserve the original code semantics. In many cases, preserving semantics requires preserving flow dependencies. The main motivation of FADA is to give more precise descriptions of dependencies for a larger set of programs, in order to offer more optimisation choices. We demonstrate in this section how FADA's results can be used to detect strongly hidden parallelism, to improve communications in manually parallelized applications, and to validate transformations that improve instruction-level parallelism (ILP).

3.1 Parallelism Detection

For program 1 (shown in the previous section) we have seen that FADA's basic analysis (the first FADA step) provides a conservative result because of the presence of non-affine constraints. The translating-properties engine (the second FADA step) can handle them and performs an exact analysis under the single hypothesis that F is a pure function (without side effects). The generated constraints describing the dependence between S0 and S1 are (1)-(6) as described in the previous section. Here, we are particularly interested in constraints (2) and (4). By combining (2), (4) and (5), FADA deduces that they cannot be satisfied whatever the function F, which means that there is no dependence between S0 and S1.

for (i=0; i<n; ++i) {
  if (p(i))
S0:   T = ...;
  else
S1:   T = ...;
S2:   ... = T;
}

Figure 3: Applying structural analysis for parallelism detection

For program 3, the non-affine if-condition can be bypassed by performing structural analysis to exhibit parallelism. We know that during each iteration either the if-part or the else-part is executed, and exactly one of them is executed. Consequently, the value of T read by S2 is produced (by S0 or by S1) during the same iteration, so there is no dependence carried by the i-loop. It means that this loop is parallel. We note that the methods of Pugh and Wonnacott [8] and of Creusillet [2] can also detect parallelism in this program. This sort of code can be found in scientific programs, for example fuzzy-logic operators [9] and metaheuristics operators [7].
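Once the i-loop of Fig. 3 is known to be parallel, one straightforward way to exploit this result (a minimal sketch of ours, not prescribed by the report; the bodies of S0, S1 and S2 and the predicate p are hypothetical) is an OpenMP annotation that privatizes T:

    /* hypothetical pure predicate standing for p(i) in Fig. 3 */
    static int p(int i) { return (i % 3) == 0; }

    void fig3_parallel(int n, const double *x, const double *y, double *out)
    {
        double T;
        /* legal because the analysis proves the i-loop carries no dependence */
        #pragma omp parallel for private(T)
        for (int i = 0; i < n; ++i) {
            if (p(i))
                T = 2.0 * x[i];      /* plays the role of S0 */
            else
                T = y[i] + 1.0;      /* plays the role of S1 */
            out[i] = T;              /* plays the role of S2 */
        }
    }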

In program 4, we can merge the two for-loops, but first we need to know whether the while-loops execute the same iterations, which can be checked by the iterative analysis. Comparing the while-conditions requires computing the sources of the A[i,j] read by statements W0 and W1. These two sources are not defined in this program portion (no A cell is written here), yet FADA can deduce that they have the same definition, hence exactly the same values. Consequently, the while-conditions are equivalent for each i and each j.

for (i=0; i<n; ++i) {
W0:  while (f(A[i,j])) {
S0:    B[i,j] = ...;
       ++j;
     }
}
for (i=0; i<n; ++i) {
W1:  while (f(A[i,j])) {
S1:    ... = B[i,j];
       ++j;
     }
}

Figure 4: Similar controls with non-affine domains

We can merge all similar controls to obtain program 5, in which FADA can obviously detect the parallelism of the i-loop.

for (i=0; i<n; ++i) {
W:   while (f(A[i,j])) {
S0:    B[i,j] = ...;
S1:    ... = B[i,j];
       ++j;
     }
}

Figure 5: Merging similar controls using FADA results

3.2 Improving Communications

Synchronisations and communications are costly operations in a parallel application. We show how these operations can be improved by using the results of the FADA analysis.

#pragma omp parallel cyclic
for (int i=1; i<=n; ++i) {
    while (c(i,j))
S0:     A[a[i],b[j]] = ...;
}

#pragma omp parallel cyclic
for (int i=1; i<=n; ++i) {
    while (c(i,j))
S1:     ... = A[a[i],b[j]];
}

Figure 6: Optimizing synchronizations with FADA

For the OpenMP program 6 (Fig. 6), the values of the A cells are produced during the execution of the first parallel for, and the same values are read in the second parallel for. Here, the output of the first loop is used as input to the second one. Bernstein's conditions are not respected, and that imposes a barrier (implicitly inserted) at the end of the first parallel loop. With FADA's results we can do better (knowing some specificities of OpenMP) by eliminating the implicit barrier without undermining the program's correctness. For that purpose, we rely on two facts: the while-loops of the first and the second for-loops are equivalent, and the values of the arrays a and b read by S0 and S1 have exactly the same sources.

These two facts can easily be checked by performing the FADA iterative analysis on the elements of A, a and b. The results of FADA lead us to say that the A values read by S1 are all produced by S0 and only by S0, so an OpenMP parallelization of the two for-loops with the same scheduling strategy guarantees that any iteration i of the two for-loops is scheduled on the same thread. Therefore, the reference A[a[i],b[j]] becomes a thread-local datum (of the thread on which iteration i is scheduled) and is no longer an external input. Bernstein's conditions are not violated, so we can avoid the implicit barrier by adding a nowait clause at the end of the first parallel for.
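In standard OpenMP the barrier can only be dropped between two work-sharing loops of the same parallel region, so one possible realization of the above is the sketch below (ours, not the report's; it assumes both loops have the same trip count and the same static schedule, and the helpers c() and produce() as well as the array arguments are hypothetical):

    int    c(int i, int j);                       /* stands for the while condition */
    double produce(int i, int j);                 /* stands for the value written by S0 */

    void fig6_nowait(int n, double **A, const int *a, const int *b, double *out)
    {
        #pragma omp parallel
        {
            #pragma omp for schedule(static) nowait   /* remove the implicit barrier */
            for (int i = 1; i <= n; ++i)
                for (int j = 0; c(i, j); ++j)         /* stands for the while loop */
                    A[a[i]][b[j]] = produce(i, j);    /* S0 */

            /* identical static schedules: iteration i runs on the same thread in
               both loops, so the cells written by S0 for this i are read by the
               same thread in S1, and no barrier is needed in between */
            #pragma omp for schedule(static)
            for (int i = 1; i <= n; ++i)
                for (int j = 0; c(i, j); ++j)
                    out[i] += A[a[i]][b[j]];          /* S1 */
        }
    }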

In general, we have to compute the input and the output of a thread to know which variables must be communicated, and then apply Bernstein's conditions to decide whether a synchronization is needed. Conventional dataflow analyses are more conservative than FADA and would lead to the communication of unused variables. We can also use FADA's results to compute exact input and output sets.

3.3 Irregular Code Transformation

In general, we can transform a program through a scheduling strategy in order to improve it. A lot of work has been done in this area for static control programs, but much less for irregular ones. Bastoul et al. [3] proposed a tool, CLooG, for affine schedules: CLooG generates efficient static control code, and schedules can in addition be driven by the user through meta-compilation directives. We want to do the same thing for irregular programs. The fundamental issue is to generate the irregular code matching a given schedule, for a given dependence graph (computed by FADA) and a schedule (introduced by the user via directives).

We started with the well-known transformation deep jam [6] (Fig. 7 shows a simplified deep-jam example). It boils down to an unroll-and-fusion applied to for-loops containing while-loops.

#pragma unroll(i,2)
for (i=0; i<n; ++i) {
    while (f(A[i,j])) {
        B[i,j] = ...;
        ++j;
    }
}
fuse(while)

Figure 7: A simplified DeepJam example

Generating a scheduled irregular code is not an easy task. For instance, deep jam had never been implemented: all test transformations have been applied manually. We are interested in a (semi-)automatic method to generate correct code matching a given program and a given schedule.
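To make the target of such a transformation concrete, here is a rough, hand-written sketch (our own, not from the report) of what a deep-jammed version of the code of Fig. 7 could look like after unrolling i by 2 and fusing the two resulting while-loops; the per-iteration counters j0 and j1, their initialisation, and the helpers f() and produce() are assumptions:

    int    f(double x);                       /* non-affine while condition */
    double produce(int i, int j);             /* stands for the body of the while loop */

    void fig7_deepjam(int n, double **A, double **B)
    {
        for (int i = 0; i < n; i += 2) {      /* i unrolled by 2 */
            int j0 = 0, j1 = 0;               /* assumed initial values of j */
            int run0 = f(A[i][j0]);
            int run1 = (i + 1 < n) && f(A[i+1][j1]);
            while (run0 || run1) {            /* the two while loops, fused */
                if (run0) {
                    B[i][j0] = produce(i, j0);
                    ++j0;
                    run0 = f(A[i][j0]);
                }
                if (run1) {
                    B[i+1][j1] = produce(i+1, j1);
                    ++j1;
                    run1 = f(A[i+1][j1]);
                }
            }
        }
    }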

4 Open Questions

4.1 How does FADA pass its information to the compiler?

When FADA finishes its analysis, the collected data dependence information can be used in two ways:

- in a semi-automatic code parallelization method (middle-term research) used by an expert; this is planned in the context of the PhD thesis of Marouane Belaoucha;
- in an automatic code parallelization method; this is currently being done inside the GRAPHITE project in collaboration with INRIA-Saclay. We have a PhD project that will integrate FADA inside GRAPHITE and gcc. Consequently, FADA information should be available inside the gcc infrastructure within the next years.

4.2 How can we implement thread-level parallelization thanks to FADA?

FADA allows us either to detect parallel irregular loops, or to help transform an irregular loop into parallel code fractions. In both cases, the implementation of the thread-level parallelism can be done with OpenMP directives or with POSIX threads, as sketched below.
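As a minimal sketch (ours, not part of the report) of the POSIX-threads alternative: once FADA has proven a loop parallel, its iteration space can simply be split into contiguous chunks, one per thread (the per-iteration work iteration() is hypothetical):

    #include <pthread.h>

    void iteration(int i);                 /* body of one iteration of the parallel loop */

    struct chunk { int begin, end; };

    static void *run_chunk(void *arg)
    {
        struct chunk *c = (struct chunk *)arg;
        for (int i = c->begin; i < c->end; ++i)
            iteration(i);
        return NULL;
    }

    void parallel_loop(int n, int nthreads)
    {
        pthread_t    tid[nthreads];
        struct chunk ck[nthreads];
        for (int t = 0; t < nthreads; ++t) {
            ck[t].begin = t * n / nthreads;
            ck[t].end   = (t + 1) * n / nthreads;
            pthread_create(&tid[t], NULL, run_chunk, &ck[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(tid[t], NULL);
    }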

5 Conclusion

The main property of the FADA method is that it recovers full information about dependencies. Its advanced analyses take advantage of the relations between the non-affine constraints in order to define precise dependencies.

We illustrated how FADA can be beneficial in several areas: parallelism detection, enhancement of communication and synchronization schemes, and source-to-source transformation of irregular programs. The short-term objective is to use FADA to detect parallelism; other applications are under study. In the ParMA project, FADA could come behind the STEP tool in order to perform the dataflow analysis leading to advanced optimizations on the kernels resulting from its processing. We are also studying the integration of FADA in the GCC middle end through the polyhedral environment GRAPHITE. This would enable us to take many languages as input (C/C++, FORTRAN, ...) and to take advantage of advanced techniques (constant propagation, SSA, induction variable analysis, ...) already implemented in GCC. We are also studying how to use the full information provided by FADA to perform some extra tasks. Our main interests are: validating transformations, code generation from non-affine schedules, and automatic parallelization of irregular codes for multicore processors.

References

[1] Denis Barthou. Array Dataflow Analysis in Presence of Non-Affine Constraints. PhD thesis, Université de Versailles Saint-Quentin-en-Yvelines.
[2] Béatrice Creusillet, François Irigoin. Exact vs. approximate array region analyses. Languages and Compilers for Parallel Computing.
[3] Cédric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma, Olivier Temam. Putting polyhedral loop transformations to work. International Workshop on Languages and Compilers for Parallel Computing (LCPC).
[4] Denis Barthou, Jean-François Collard, Paul Feautrier. Fuzzy array dataflow analysis. Journal of Parallel and Distributed Computing, 40(2).
[5] Paul Feautrier. Dataflow analysis of array and scalar references. Parallel Processing Letters.
[6] Patrick Carribault, Stéphane Zuckerman, Albert Cohen, William Jalby. Deep jam: Conversion of coarse-grain parallelism to fine-grain and vector parallelism. Journal of Instruction-Level Parallelism, 9.
[7] Marco Tomassini. A survey of genetic algorithms. Reviews of Computational Physics.
[8] William Pugh, David Wonnacott. Nonlinear array dependence analysis. Languages, Compilers and Run-Time Systems for Scalable Computers.
[9] Lotfi A. Zadeh. Fuzzy sets. Information and Control, 8.
