Intel Parallel Composer. Stephen Blair-Chappell Intel Compiler Labs


1 Intel Parallel Composer Stephen Blair-Chappell Intel Compiler Labs

2 Intel Parallel Composer
Develop effective applications with a C/C++ compiler and comprehensive threaded libraries (code & debug phase). Easier, faster parallelism for Windows* apps: C/C++ compiler and advanced threaded libraries, built-in parallel debugger, OpenMP* support. Save time and increase productivity.

3 Key Features of Composer
Extensions for parallelism: simple concurrent functionality (__task/__taskcomplete), vectorization support for the SSE2/SSE3/SSSE3/SSE4 instruction sets, OpenMP 3.0
Seamless integration into Microsoft Visual Studio*
Intel Parallel Debugger Extensions - a plug-in to Visual Studio*
Intel Threading Building Blocks
C++ lambda function support (enables simpler interfacing with Intel TBB)
Intel Integrated Performance Primitives: integrated array notation, data-parallel Intel IPP functions
Parallel build (/MP) feature
Diagnostics to help develop parallel programs (/Qdiag-enable:thread)
Threading tutorials with sample code

4 Intel Parallel Composer Extend parallel debugging capabilities Adds a new class of data breakpoints Data race detection Allows filtering to control amount of data collected Serializes parallel regions without recompilation Adds window to visualize logs 4

5 Compiler Pro 11.0 vs. Composer
Full C and C++ support: both
Fortran support: Compiler Pro 11.0 only (Composer: no Fortran; Fortran 2003 support, several features, and many more features expected in a future Intel compiler, no Fortran 2003 in Composer)
__task/__taskcomplete parallel exploration: Composer only
OpenMP 3.0, valarray specializations, lambda functions: both
Intel Threading Building Blocks, Intel Integrated Performance Primitives: both
Intel Math Kernel Library: Compiler Pro 11.0 only
C++ Parallel Debug Plug-in with SSE/vector window and OpenMP parallelism window (Windows only): Composer (Compiler Pro 11.0: no Windows debugger)
GUI debugger with SSE/vector window and OpenMP parallelism window (Linux only): Compiler Pro 11.0 (Composer: no Linux debugger)
Code Coverage Utility, Test Select Utility: Compiler Pro 11.0
Full Fortran interoperability (except when IPO is used), decimal floating-point: Compiler Pro 11.0

6 Section 2 New Features: Concurrent Functionality, New C++0x Features, Debugger Extensions

7 Concurrent Functionality - Idea
The parallel programming extensions are intended for quickly getting a program parallelized without learning a great deal about APIs: a few keywords and the program is parallelized. If the constructs are not powerful enough in terms of data control, there may be a need to look into other, more comprehensive parallel programming methodologies such as OpenMP.

8 Concurrent Functionality
Introduction of novel C/C++ language extensions to make parallel programming easier. Four new keywords are introduced, used as statement prefixes: __taskcomplete, __task, __par, and __critical. To benefit from the parallelism afforded by these keywords, the switch /Qpar must be used; the runtime system manages the actual degree of parallelism.

9 Concurrent Functionality - example

int a[1000], b[1000], c[1000];

void f_sum(int length, int *a, int *b, int *c)
{
    int i;
    for (i = 0; i < length; i++) {
        c[i] = a[i] + b[i];
    }
}

// Serial call
f_sum(1000, a, b, c);

// Parallel call
__taskcomplete {
    __task f_sum(500, a, b, c);
    __task f_sum(500, a+500, b+500, c+500);
}

10 New C++0x Features
New C++0x features enabled by the switch /Qstd=c++0x (Windows):
lambda functions
static assertions
rvalue references
C99-compliant preprocessor
__func__ predefined identifier
variadic templates
extern templates
and some more
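As a minimal, hypothetical sketch (the file and function names are invented here, not from the slides), a few of these features can be combined in one translation unit compiled with /Qstd=c++0x:

// sketch.cpp -- hypothetical example; build with: icl /Qstd=c++0x sketch.cpp
#include <cstdio>
#include <utility>      // std::move

static_assert(sizeof(void*) >= 4, "expected at least a 32-bit target");   // static assertion

void consume(int&& v)   // rvalue-reference parameter: takes ownership instead of copying
{
    std::printf("moved %d inside %s\n", v, __func__);   // __func__ predefined identifier
}

int main()
{
    int x = 42;
    consume(std::move(x));                        // bind an rvalue reference
    auto square = [](int n) { return n * n; };    // lambda function
    std::printf("%d\n", square(7));
    return 0;
}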

11 C++0x: Futures
A future is a mechanism used to provide values that will be accessed in the future and resolved asynchronously. The definition of futures does not specify whether the computation of the given expression starts immediately or only when the result is requested. Futures are realized in the Intel Compiler by templates.

intel::future<page*> future_page;
do {
    // on user click
    if (user clicked NEXT) {            // pseudocode condition
        page = future_page.get();       // wait for next page to finish loading
    } else {
        future_page.cancel();           // user clicked END, speculation wasted
        break;
    }
} while (1);

12 C++0x: Lambda Functions
A lambda abstraction defines an unnamed function. Lambda functions in C++ provide:
treating functions as first-class objects
composing functions inline
treating functions as class objects
This enhances concurrent code, as it is possible to pass around code chunks like objects (the ability to pass code as a parameter). A lambda is written with the [capture] introducer syntax:

std::vector<int> somelist;
int total = 0;
std::for_each(somelist.begin(), somelist.end(),
              [&total](int x) { total += x; });
std::cout << total;
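Because Intel TBB algorithms accept any callable object, a lambda can be passed to them directly. The following is a small illustrative sketch (not from the slides) that assumes the TBB headers are available:

// Hypothetical sketch: a lambda passed straight into tbb::parallel_for.
#include <cstdio>
#include <vector>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

int main()
{
    tbb::task_scheduler_init init;              // needed by TBB releases of this era
    std::vector<float> data(1000, 1.0f);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
        [&data](const tbb::blocked_range<size_t>& r) {   // the loop body is just a lambda
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;
        });
    std::printf("%f\n", data[0]);
    return 0;
}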

13 VALARRAY methods accelerated by Intel IPP
valarray is a C++ standard library (STL) container class for arrays, providing array methods for high performance computing. The operations are designed to exploit low-level hardware features, for example vectorization. To take full advantage of valarray, you need an optimizing C++ compiler that recognizes valarray as an intrinsic type and replaces operations on such types with Intel IPP library calls.

// Create a valarray of ints.
valarray_t::value_type ibuf[10] = {0,1,2,3,4,5,6,7,8,9};
valarray_t vi(ibuf, 10);

// Create a valarray of bools for a mask.
maskarray_t::value_type mbuf[10] = {1,0,1,1,1,0,0,1,1,0};
maskarray_t mask(mbuf, 10);

// Double the values of the masked array
vi[mask] += static_cast<valarray_t>(vi[mask]);

14 Intel Parallel Composer Debugger Extensions

15 Key Features
Shared Data Access Detection: break on thread shared data access; re-entrant function detection
SIMD SSE Debugging Window
Enhanced OpenMP* Support: serialize OpenMP threaded application execution on the fly; insight into thread groups, barriers, locks, wait lists etc.

16 Shared Data Access Detection
Shared data access is a major problem in multithreaded applications: it can cause hard-to-diagnose intermittent program failure, so tool support is required for detection.
The technology is built on:
Code instrumentation by the Intel compiler (memory access instrumentation of the application)
A debug runtime library (RTL) that collects data access traces and triggers the debugger
A GUI extension to the Visual Studio or IDB debug engine that reports and visualizes RTL events while debugging
The combination enables a large variety of additional debug use cases.

17 Shared Data Access Detection
Data sharing detection is part of the overall debug process
Breakpoint model (stop on detection)
GUI extensions show results and link to source
Filter capabilities to hide false positives
New powerful data breakpoint types: stop when a 2nd thread accesses a specific address; stop on read from an address
Key user benefit: a simplified feature to detect shared data accesses from multiple threads

18 Shared Data Access Detection - Filtering
Data sharing detection is selective
Data filter: specific data items and variables can be excluded
Code filter: functions, source files, and address ranges can be excluded

19 Re-Entrant Call Detection
Automatically halts execution when a function is executed by more than one thread at any given point in time. Helps identify reentrancy requirements and problems in multi-threaded applications.

20 Enhanced OpenMP* Debugging Support
Dedicated OpenMP runtime object information windows: OpenMP task and spawn tree lists; barrier and lock information; task wait lists; thread team worker lists. Detailed execution state information for OpenMP applications (deadlock detection).
Serialize parallel regions: change the number of parallel threads dynamically during runtime to 1 or N (all); verify code correctness for serial vs. parallel execution. User benefit: identify whether a runtime issue is really parallelism related. Influences execution behavior without recompiling!

21 OpenMP* Task Details

22 Parallel Debug Plug-In
Allows filtering to control the amount of data collected
Adds a window to visualize logs
Can serialize parallel regions

23 Parallel Run-Control - Use Cases
Stepping parallel loops. Problem: state investigation is difficult because threads stop at arbitrary positions. Parallel Debugger Support: add a syncpoint to stop team threads at the same location (lock step instead of normal stepping past a breakpoint). User benefit: get and keep a defined program state; operations like private data comparison now become meaningful.
Serial execution. Problem: a parallel loop computes a wrong result - is it a concurrency or an algorithm issue? Parallel Debugger Support: runtime access to the OpenMP num_threads property; set it to 1 for serial execution of the next parallel block (disable/enable parallelism on demand). User benefit: verification of the algorithm on the fly without slowing the entire application down to serial execution; on-demand serial debugging without recompile/restart.

24 SIMD SSE Debugging Window
SIMD window (new): supports evaluation of arbitrary-length expressions; SSE registers display of variables used for SIMD operations; in-depth insight into data parallelization and vectorization.

25 Section 3 Creating Parallel Code

26 Implementing Parallelism - Different Methods
Three ways of achieving parallelism:
Automatic, via the compiler (no code changes)
Programming: OpenMP, native threads (Win32, POSIX), Threading Building Blocks, MPI
Using parallel-enabled libraries: MKL, IPP

27 Auto Parallelism Loop-level parallelism automatically supplied by the compiler 27

28 Auto-parallelization
Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.
Windows*: /Qparallel /Qpar_report[n]
Linux*: -parallel -par_report[n]
Mac*: -parallel -par_report[n]
The compiler can identify easy candidates for parallelization, but large applications are difficult to analyze.
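As a hedged illustration (the file name and loop are invented, not part of the deck), a simple loop with independent iterations such as the one below is the kind of candidate the auto-parallelizer can thread when built with the switches above:

/* sum.c -- hypothetical example of a loop the auto-parallelizer can thread.
   Build (Windows): icl /Qparallel /Qpar_report1 sum.c
   Build (Linux):   icc -parallel -par_report1 sum.c                      */
#include <stdio.h>
#define N 1000000
static double a[N], b[N], c[N];

int main(void)
{
    int i;
    for (i = 0; i < N; i++) {          /* independent iterations: an easy candidate */
        c[i] = 2.0 * a[i] + b[i];
    }
    printf("%f\n", c[N/2]);            /* keep the result live so the loop is not removed */
    return 0;
}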

29 Optimisation Results - pi application
Time taken (secs) and speedup compared for the default build, auto-vectorisation, auto-parallelism, and auto-vectorisation & auto-parallelism combined.

30 Sample implementations of Parallel Programming: MPI, POOP (Parallel Object Oriented Programming). *Other names and brands may be claimed as the property of others.

31 No Threads The Sample Application

32 Our running example: the PI program - numerical integration
Mathematically, we know that the integral from 0 to 1 of 4.0/(1+x²) dx equals π, with F(x) = 4.0/(1+x²).
We can approximate the integral as a sum of rectangles: the sum over i = 1..N of F(xᵢ)·Δx ≈ π, where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.

33 PI Program: the sequential program

static long num_steps = /* value elided on the slide */ ;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}

34 Native Threads

35 Threads and Parallel Programming
Operating Systems 101:
Process: a unit of work managed by an OS, with its own address space (the heap) and OS-managed resources.
Threads: resources within a process that execute the instructions in a program. They have their own program counter and a private memory region (a stack) but share the other resources within the process, including the heap.
Threads are the natural unit of execution for parallel programs on shared-memory hardware. The threads share memory, so data structures don't have to be torn apart into distinct pieces.

36 Programming with Native Threads
The OS provides an API for creating, managing, and destroying threads: the Windows* threading API, or POSIX threads (on Linux).
Advantage of thread libraries: the thread library gives you detailed control over the threads.
Disadvantage of thread libraries: the thread library REQUIRES that you take detailed control over the threads.
*Other names and brands may be claimed as the property of others.

37 Win32 API

#include <windows.h>
#include <stdio.h>
#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = /* value elided on the slide */ ;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
    int i, start;
    double x, sum = 0.0;

    start = *(int *) arg;
    step = 1.0/(double) num_steps;

    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
}

void main ()
{
    double pi;
    int i;
    DWORD threadID;
    int threadArg[NUM_THREADS];

    for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i+1;

    InitializeCriticalSection(&hUpdateMutex);

    for (i = 0; i < NUM_THREADS; i++) {
        thread_handles[i] = CreateThread(0, 0,
            (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);

    pi = global_sum * step;
    printf(" pi is %f \n", pi);
}

38 Win32 thread library: It's not as bad as it looks
The same program, annotated: set up multi-threading support; define the work each thread will do and pack it into a function; update the final answer one thread at a time; set up arguments and book keeping, and launch the threads; wait for all the threads to finish; compute and print the final answer.

39 Win32 API (the complete Pi program, as listed on slide 37)

40 Threading Building Blocks

41 Featured Components
Task scheduler
Generic parallel algorithms: parallel_for, parallel_reduce, pipeline, parallel_sort, parallel_while, parallel_scan
Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector

42 Concurrent Containers
The library provides highly concurrent containers. STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container. Standard practice is to wrap a lock around STL containers, which turns the container into a serial bottleneck. The library instead provides fine-grained locking or lockless implementations: worse single-thread performance, but better scalability. The containers can be used with the library, OpenMP, or native threads.
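A small illustrative sketch (not from the slides) of a concurrent container filled from an OpenMP parallel region; it assumes a TBB version that provides concurrent_queue::try_pop (newer releases do, older ones spelled the non-blocking pop pop_if_present):

// Illustrative sketch: tbb::concurrent_queue filled from an OpenMP parallel loop.
#include <cstdio>
#include <omp.h>
#include "tbb/concurrent_queue.h"

int main()
{
    tbb::concurrent_queue<int> q;
    #pragma omp parallel for
    for (int i = 0; i < 100; ++i)
        q.push(i);                     // thread-safe push, no explicit lock needed

    int item, count = 0;
    while (q.try_pop(item))            // drain the queue serially
        ++count;
    std::printf("popped %d items\n", count);
    return 0;
}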

43 Generic Programming for C++ developers Best known example is C++ STL Enables distribution of broadly-useful high-quality algorithms and data structures Write best possible algorithm with fewest constraints Do not force particular data structure on user Classic example: STL std::sort Instantiate algorithm to specific situation C++ template instantiation, partial specialization, and inlining make resulting code efficient 43

44 template <typename Range, typename Body>
void parallel_for(const Range& range, const Body& body);

Requirements for the parallel_for Body:
Body::Body(const Body&) - copy constructor
Body::~Body() - destructor
void Body::operator()(Range& subrange) const - apply the body to subrange

parallel_for partitions the original range into subranges and deals out subranges to worker threads in a way that balances load, uses the cache efficiently, and scales.

45 Serial Example

static void SerialUpdateVelocity() {
    for( int i=1; i<UniverseHeight-1; ++i )
#pragma ivdep
        for( int j=1; j<UniverseWidth-1; ++j )
            V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
}

The Intel TBB product includes complete serial and parallel versions of this seismic wave simulation.

46 Parallel Version

struct UpdateVelocityBody {
    void operator()( const blocked_range<int>& range ) const {
        int end = range.end();
        for( int i = range.begin(); i < end; ++i ) {
#pragma ivdep
            for( int j=1; j<UniverseWidth-1; ++j )
                V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
        }
    }
};

void ParallelUpdateVelocity() {
    parallel_for( blocked_range<int>( 1, UniverseHeight-1, GrainSize ),   // GrainSize establishes the grain size
                  UpdateVelocityBody() );
}

Pattern: blue = original code, red = provided by TBB, black = boilerplate for the library.

47 Task scheduler example
The range is split recursively until the grain size is reached; the resulting tasks are available to thieves (work stealing).

48 TBB Class

class ParallelPi {
public:
    double pi;
    ParallelPi() : pi(0) {}
    ParallelPi(ParallelPi &body, tbb::split) : pi(0) {}
    void operator()(const tbb::blocked_range<int> &r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            float x = Step * ((float)i - 0.5);
            pi += 4.0 / (1.0 + x*x);
        }
    }
    void join(ParallelPi &body) { pi += body.pi; }
};

49 TBB Class

int main() {
    ParallelPi Pi;
    parallel_reduce( tbb::blocked_range<int>(0, INTERVALS, 100), Pi );
    printf("Pi = %f\n", Pi.pi / INTERVALS);
}

50 Section 5 Expressing Parallelism Using the Intel Compiler Stephen Blair-Chappell Technical Consulting Engineer Intel Compiler Labs

51 OpenMP A deeper dive into OpenMP 51

52 What is OpenMP?
Portable, shared-memory multiprocessing API for Fortran 77, Fortran 90, C, and C++, with multi-vendor support on both Unix and Windows. Standardizes loop-level parallelism, supports coarse-grained parallelism, and combines serial and parallel code in a single source (no need for a separate source code revision). See www.openmp.org for standard documents, tutorials, and sample code. Intel is a premier member of the OpenMP Architecture Review Board.

53 Parallel APIs: OpenMP*
OpenMP: an API for writing multithreaded applications - a set of compiler directives and library routines for parallel application programmers. Makes it easy to create multithreaded (MT) programs in Fortran, C and C++. Standardizes the last 15 years of SMP practice.
(The slide is decorated with example directives and calls such as #pragma omp critical, #pragma omp parallel for private(A, B), C$OMP PARALLEL REDUCTION (+: A, B), CALL OMP_SET_NUM_THREADS(10), omp_set_lock(lck), setenv OMP_SCHEDULE dynamic, and so on.)

54 OpenMP Architecture Fork-Join Model Worksharing constructs Synchronization constructs Directive/pragma-based parallelism Extensive API for finer control 54

55 OpenMP Runtime (solution stack)
The application and user drive parallelism through directives, translated by the compiler, and through environment variables; both feed the runtime library, which manages the threads in the operating system.

56 OpenMP Programming Model: Fork-Join Parallelism
The master thread (shown in red on the slide) spawns a team of threads as needed. Parallelism is added incrementally until performance goals are met: i.e., the sequential program evolves into a parallel program. The diagram shows sequential parts alternating with parallel regions, including a nested parallel region.
*Other names and brands may be claimed as the property of others.

57 Intel Compiler Switches for OpenMP
OpenMP support: /Qopenmp
OpenMP diagnostic reports: /Qopenmp_report{0|1|2}

58 Basic Syntax - Fork-Join Model
Threads are created as the parallel pragma is crossed. Data is classed as shared among threads or private to each thread. Several threads (e.g. 4) are created on entry; threads either spin or sleep between regions.

main() {
    #pragma omp parallel \
            shared(a) private(i)
    {
        // this code is parallel
        ...
    }
}

59 Hello World
This program runs on three threads and prints the output shown below:

void main()
{
    int i;
    #pragma omp parallel
    {
        printf("Hello World\n");
        #pragma omp for
        for (i = 0; i <= 4; i++) {
            printf("Iter: %d", i);
        }
        printf("Goodbye World\n");
    }
}

Output:
Hello World
Hello World
Hello World
Iter: 1
Iter: 2
Iter: 3
Iter: 4
Goodbye World
Goodbye World
Goodbye World

60 Parallel Loop Model
Threads are created; data is classified as shared or private.

void* work(float* A) {
    #pragma omp parallel for \
            shared(A) private(i)
    for (i = 1; i <= 12; i++) {
        /* iterations divided among threads */
    }
}

Iterations are distributed across the threads, A is shared, and there is a barrier at the end of the loop. Threads either spin or sleep between regions.

61 Data Scope Attributes
The default status can be modified with default(shared | none)
Scoping attribute clauses: shared(varname, ...), private(varname, ...)

62 The Private Clause
Reproduces the variable for each thread. The variables are un-initialized (a C++ object is default constructed), and any value external to the parallel region is undefined.

void* work(float* c, int N) {
    float x, y;
    int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }
}

63 Example: Dot Product

float dot_prod(float* a, float* b, int N) {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

What is wrong?

64 Protect Shared Data
Must protect access to shared, modifiable data.

float dot_prod(float* a, float* b, int N) {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}

65 OpenMP* Critical Construct
#pragma omp critical [(lock_name)]
Defines a critical region on a structured block. Threads wait their turn; only one at a time calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs is optional, but may increase performance.

float R1, R2;
#pragma omp parallel
{
    float A, B;
    #pragma omp for
    for (int i = 0; i < niters; i++) {
        B = big_job(i);
        #pragma omp critical (R1_lock)
        consum(B, &R1);
        A = bigger_job(i);
        #pragma omp critical (R2_lock)
        consum(A, &R2);
    }
}

66 OpenMP* Reduction Clause
reduction (op : list)
The variables in list must be shared in the enclosing parallel region. Inside the parallel or work-sharing construct: a PRIVATE copy of each list variable is created and initialized depending on the op; these copies are updated locally by the threads; at the end of the construct, the local copies are combined through op into a single value and combined with the value in the original SHARED variable.

67 Reduction Example

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}

A local copy of sum is made for each thread; all local copies of sum are added together and stored in the global variable.

68 C/C++ Reduction Operations
A range of associative and commutative operators can be used with reduction; the initial values are the ones that make sense:
+ : 0
* : 1
- : 0
& : ~0
| : 0
^ : 0
&& : 1
|| : 0

69 Schedule Clause

#pragma omp parallel for schedule (static, 8)
for ( int i = start; i <= end; i += 2 ) {
    if ( TestForPrime(i) ) gPrimesFound++;
}

Iterations are divided according to the schedule statement.

70 OpenMP Parallel for with a reduction

#include <omp.h>
static long num_steps = /* value elided on the slide */ ;
double step;
#define NUM_THREADS 2

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}

*Other names and brands may be claimed as the property of others.

71 OpenMP FOR-schedule schemas
The schedule clause defines how loop iterations are assigned to threads: a compromise between two opposite goals, best thread load balancing and minimal controlling overhead. The slide compares a single thread with schedule(static), schedule(guided, f) and schedule(dynamic, c) over an iteration space of N.

72 Iterative worksharing versus code replication
Iterative worksharing -- prints Hello World 10 times, regardless of the number of threads:

#pragma omp parallel for
for (int i = 0; i < 10; i++) {
    printf("Hello World\n");
}

Code replication -- assuming a team of 4 threads, prints Hello World 40 times:

#pragma omp parallel
for (int i = 0; i < 10; i++) {
    printf("Hello World\n");
}

73 Parallel Sections
Independent sections of code can execute concurrently:

#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}

74 Implicit Barriers
Several OpenMP constructs have implicit barriers: do, for, single, sections. Unnecessary barriers hurt performance; suppress them, when safe, with nowait:

!$omp do
[...]
!$omp end do nowait

#pragma omp for nowait
for(...) [...];

!$omp sections
[...]
!$omp end sections nowait

#pragma omp single nowait
{ [...] }

75 Static and Dynamic Extent
Static extent (or lexical extent) is the code that is lexically within the parallel/end parallel directive. Dynamic extent includes the static extent and the entire call tree of any subroutine or function called in the static extent. OpenMP directives in the dynamic extent of a parallel region are called orphaned directives. An orphaned worksharing construct behaves as if the construct were within the lexical extent: the work is divided across the thread team. The only difference is that slightly different data scoping rules apply.

76 Static and Dynamic Extent

program main
!$omp parallel        ! <- static extent
  call foo            ! <-
!$omp end parallel    ! <- end (+ dynamic extent)
end program

subroutine foo        ! <-
!$omp do              ! orphaned
  do i = 1,100
  enddo               ! <-
!$omp end do
  call X              ! also dynamic extent
end subroutine foo    ! <-

77 Communication and data scope In OpenMP every variable has a scope that is either shared or private By default, all variables have shared scope Data scoping clauses that can appear on parallel constructs: The shared and private clauses explicitly scope specific variables The firstprivate and lastprivate clauses perform initialization and finalization of private variables The default clause changes the default scoping rules when variables are not explicitly scoped The reduction clause explicitly identifies reduction variables 77

78 Communication and data scope Each thread has a private stack used for automatic variables For all other program variables, the parallel constructs can scope each variable as shared, private, or reduction Private variables need to be initialized at the start of a parallel construct. The firstprivate clause will initialize from the global instance For parallelized loops, the lastprivate clause will update the global instance from the private value computed with the last iteration 78
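A minimal sketch of firstprivate and lastprivate in C (the variable names are illustrative, not from the slides):

/* Minimal sketch of firstprivate/lastprivate (illustrative names). */
#include <stdio.h>

int main(void)
{
    int offset = 100;   /* each thread's private copy starts from this value (firstprivate) */
    int last   = 0;     /* receives the value computed in the final iteration (lastprivate) */
    int i;
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 8; i++) {
        last = offset + i;             /* works on the private copy inside the loop */
    }
    printf("last = %d\n", last);       /* 107: the value from iteration i == 7 */
    return 0;
}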

79 Communication and data scope
Variables default to shared scope, except in these cases:
Loop index variables to which a parallel do or parallel for applies default to private
Local variables in subroutines in the dynamic extent default to private, unless they are marked with the save attribute (Fortran) or as static (C/C++)
Data scoping clauses only apply to the named variables within the lexical extent
Global variables in orphaned constructs are shared by default, regardless of the attribute in the lexical extent
Automatic (stack allocated) variables in orphaned constructs are always private
Formal parameters to a subroutine in the dynamic extent acquire their scope from that of the actual variables in the caller's context

80 Environment Variables
Standard environment variables:
OMP_SCHEDULE - runtime schedule and optional iteration chunk size
OMP_NUM_THREADS - number of worker threads; defaults to the number of processors
OMP_DYNAMIC - enables dynamic adjustment of the number of threads
OMP_NESTED - enables nested parallelism
Intel extension environment variables:
KMP_ALL_THREADS - maximum number of threads in a parallel region
KMP_BLOCKTIME - thread wait time before sleeping at the end of a parallel region
KMP_LIBRARY - runtime execution mode: throughput (default) for multi-user systems, turnaround for a dedicated (single-user) system
KMP_STACKSIZE - worker thread stack size

81 Synchronization
Two kinds of synchronization: mutual exclusion synchronization, for exclusive access to data by only one thread at a time, and event synchronization, for imposing a thread execution order.
Most commonly used synchronization directives:
Data synchronization gives a thread exclusive access to a shared variable: !$omp critical ... !$omp end critical protects an arbitrarily large block of structured code; !$omp atomic protects a single assignment that updates a scalar variable.
Event synchronization signals the occurrence of an event that all threads must synchronize their execution on: !$omp barrier specifies a point in the program where each thread must wait for all other threads to arrive.
Other, less frequently used synchronization directives: !$omp master, !$omp flush, !$omp ordered.
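A short C sketch (names illustrative, not from the slides) contrasting atomic, which protects a single scalar update, with critical, which protects a larger structured block:

/* Illustrative sketch: atomic for a single scalar update, critical for a larger block. */
#include <stdio.h>

int main(void)
{
    int hits = 0;
    double total = 0.0;
    int i;
    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        #pragma omp atomic             /* one scalar update: atomic is enough */
        hits++;

        #pragma omp critical           /* arbitrarily large structured block */
        {
            total += i * 0.5;
        }
    }
    printf("hits=%d total=%f\n", hits, total);
    return 0;
}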

82 The Workqueuing Model
Work units need not all be known or pre-computed at the beginning of the construct (e.g. while loops and recursive functions). taskq specifies an environment (the queue); task specifies the units of work (dynamic). One thread executes the taskq block, enqueuing each task it encounters; all other threads dequeue and execute work from the queue. This is an Intel-specific extension to OpenMP* 2.5; it has been accepted into the new 3.0 version (to be released Q2/08).

83 Why Workqueuing?

Serial:
while (p != NULL) {
    do_work(p->data);
    p = p->next;
}

Workqueuing:
#pragma intel omp parallel taskq
{
    while (p != NULL) {
        #pragma intel omp task
        do_work(p->data);
        p = p->next;
    }
}

Worksharing:
int n = 0, i = 0;
NODEPTR q = p;
NODEPTR *r;
while (q != NULL) { n++; q = q->next; }
r = allocate(n * sizeof(NODEPTR));
while (p != NULL) { r[i++] = p; p = p->next; }
#pragma omp parallel for
for (i = 0; i < n; i++)
    do_work(r[i]->data);
free(r);

84 OpenMP* 3.0 Support
Intel Compilers 11.0 for C++ and Fortran will be fully compliant with the OpenMP* 3.0 standard (not released yet - very likely May 08; a draft is available).
Four major extensions:
Tasking for unstructured parallelism
Loop collapsing
Enhanced loop scheduling control
Better support for nested parallelism
and, for all who waited, it eventually allows unsigned int for the loop index variable

85 OpenMP* 3.0 Tasking
Maybe the most relevant new feature. A task has: code to execute, a data environment (it owns its data), and an assigned thread that executes the code and uses the data.
Two activities: packaging and execution. Each encountering thread packages a new instance of a task (code and data); some thread in the team executes the task at some later time.
A task is nothing really new to OpenMP: implicitly, each PARALLEL directive creates tasks, but they have been transparent objects. OpenMP* 3.0 makes tasking explicit.

86 OpenMP* 3.0 Tasking - Definitions
Task construct: the task directive plus a structured block.
Task: the package of code and instructions for allocating data, created when a thread encounters a task construct.
Task region: the dynamic sequence of instructions produced by the execution of a task by a thread.

#pragma omp task [clause[[,] clause] ...]
    structured-block

where clause is one of: if (expression), untied, shared (list), private (list), firstprivate (list), default(shared | none)

87 OpenMP* 3.0 Tasking Example - Postorder Tree Traversal

void postorder(node *p) {
    if (p->left) {
        #pragma omp task
        postorder(p->left);
    }
    if (p->right) {
        #pragma omp task
        postorder(p->right);
    }
    #pragma omp taskwait   // wait for descendants
    process(p->data);
}

Task scheduling point: threads may switch to execute other tasks. The parent task is suspended until its children tasks complete.

88 OpenMP* 3.0 Enhanced Schedule Control
Makes schedule(runtime) more useful: you can now get/set it with the library routines omp_set_schedule() and omp_get_schedule(), and implementations may use their own schedule kinds.
Adds a new schedule kind, AUTO, which gives full freedom to the runtime to determine the scheduling of iterations to threads.
Allows C++ random access iterators as loop control variables in parallel loops.
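A minimal sketch of the runtime schedule API described above, assuming an OpenMP 3.0 compiler and runtime:

/* Sketch of the OpenMP 3.0 runtime schedule API. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_sched_t kind;
    int chunk, i;

    omp_set_schedule(omp_sched_dynamic, 4);    /* kind and chunk used by schedule(runtime) */
    omp_get_schedule(&kind, &chunk);
    printf("kind=%d chunk=%d\n", (int)kind, chunk);

    #pragma omp parallel for schedule(runtime) /* picks up the setting made above */
    for (i = 0; i < 100; i++) {
        /* work */
    }
    return 0;
}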

89 OpenMP* 3.0 Loop Collapsing
Allows collapsing of perfectly nested loops:

!$omp parallel do collapse(2)
do i = 1, n
  do j = 1, n
    ...
  end do
end do

The compiler will form a single loop and then parallelize it. The scheduling of the combined iteration space follows the order of the original, sequential execution.
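The same idea in C/C++, as a hedged sketch (the array shape and scaling function are invented for illustration):

/* Hedged C equivalent of the Fortran collapse example (illustrative names). */
void scale(int n, double a[n][n], double s)
{
    int i, j;
    #pragma omp parallel for collapse(2)   /* i and j form one combined iteration space */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] *= s;
}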

90 OpenMP 3.0: Nested Parallelism
Better support for nested parallelism through per-thread internal control variables: this allows, for example, calling omp_set_num_threads() inside a parallel region to control the team size for the next level of parallelism.
Library routines to determine the depth of nesting, the IDs of parent/grandparent threads, and the team sizes of parent/grandparent teams:
omp_get_active_level()
omp_get_ancestor_thread_num(level)
omp_get_team_size(level)
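A small illustrative sketch (not from the deck) that exercises the nested-parallelism routines listed above:

/* Illustrative sketch of the nested-parallelism routines (OpenMP 3.0 runtime assumed). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                        /* allow nested parallel regions */
    #pragma omp parallel num_threads(2)
    {
        omp_set_num_threads(3);               /* per-thread ICV: team size of the next level */
        #pragma omp parallel
        {
            #pragma omp critical
            printf("level=%d outer thread=%d inner team size=%d\n",
                   omp_get_active_level(),
                   omp_get_ancestor_thread_num(1),
                   omp_get_team_size(2));
        }
    }
    return 0;
}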


More information

HPCSE - I. «OpenMP Programming Model - Part I» Panos Hadjidoukas

HPCSE - I. «OpenMP Programming Model - Part I» Panos Hadjidoukas HPCSE - I «OpenMP Programming Model - Part I» Panos Hadjidoukas 1 Schedule and Goals 13.10.2017: OpenMP - part 1 study the basic features of OpenMP able to understand and write OpenMP programs 20.10.2017:

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.

More information

15-418, Spring 2008 OpenMP: A Short Introduction

15-418, Spring 2008 OpenMP: A Short Introduction 15-418, Spring 2008 OpenMP: A Short Introduction This is a short introduction to OpenMP, an API (Application Program Interface) that supports multithreaded, shared address space (aka shared memory) parallelism.

More information

Review. Lecture 12 5/22/2012. Compiler Directives. Library Functions Environment Variables. Compiler directives for construct, collapse clause

Review. Lecture 12 5/22/2012. Compiler Directives. Library Functions Environment Variables. Compiler directives for construct, collapse clause Review Lecture 12 Compiler Directives Conditional compilation Parallel construct Work-sharing constructs for, section, single Synchronization Work-tasking Library Functions Environment Variables 1 2 13b.cpp

More information

EE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California

EE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California EE/CSCI 451 Introduction to Parallel and Distributed Computation Discussion #4 2/3/2017 University of Southern California 1 USC HPCC Access Compile Submit job OpenMP Today s topic What is OpenMP OpenMP

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections.

More information

OpenMP Application Program Interface

OpenMP Application Program Interface OpenMP Application Program Interface DRAFT Version.1.0-00a THIS IS A DRAFT AND NOT FOR PUBLICATION Copyright 1-0 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed

More information

Intel Thread Building Blocks, Part II

Intel Thread Building Blocks, Part II Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization

More information

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013 OpenMP António Abreu Instituto Politécnico de Setúbal 1 de Março de 2013 António Abreu (Instituto Politécnico de Setúbal) OpenMP 1 de Março de 2013 1 / 37 openmp what? It s an Application Program Interface

More information

Parallel Programming using OpenMP

Parallel Programming using OpenMP 1 Parallel Programming using OpenMP Mike Bailey mjb@cs.oregonstate.edu openmp.pptx OpenMP Multithreaded Programming 2 OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard

More information

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs

More information

Parallel Programming using OpenMP

Parallel Programming using OpenMP 1 OpenMP Multithreaded Programming 2 Parallel Programming using OpenMP OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard to perform shared-memory multithreading

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

OpenMP. OpenMP. Portable programming of shared memory systems. It is a quasi-standard. OpenMP-Forum API for Fortran and C/C++

OpenMP. OpenMP. Portable programming of shared memory systems. It is a quasi-standard. OpenMP-Forum API for Fortran and C/C++ OpenMP OpenMP Portable programming of shared memory systems. It is a quasi-standard. OpenMP-Forum 1997-2002 API for Fortran and C/C++ directives runtime routines environment variables www.openmp.org 1

More information

Programming Shared-memory Platforms with OpenMP

Programming Shared-memory Platforms with OpenMP Programming Shared-memory Platforms with OpenMP John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 7 31 February 2017 Introduction to OpenMP OpenMP

More information

Shared Memory Programming Paradigm!

Shared Memory Programming Paradigm! Shared Memory Programming Paradigm! Ivan Girotto igirotto@ictp.it Information & Communication Technology Section (ICTS) International Centre for Theoretical Physics (ICTP) 1 Multi-CPUs & Multi-cores NUMA

More information

Programming with OpenMP* Intel Software College

Programming with OpenMP* Intel Software College Programming with OpenMP* Intel Software College Objectives Upon completion of this module you will be able to use OpenMP to: implement data parallelism implement task parallelism Agenda What is OpenMP?

More information

Shared Memory Parallelism - OpenMP

Shared Memory Parallelism - OpenMP Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openmp/#introduction) OpenMP sc99 tutorial

More information

Programming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen Programming with Shared Memory PART II HPC Fall 2012 Prof. Robert van Engelen Overview Sequential consistency Parallel programming constructs Dependence analysis OpenMP Autoparallelization Further reading

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

OpenMP - Introduction

OpenMP - Introduction OpenMP - Introduction Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı - 21.06.2012 Outline What is OpenMP? Introduction (Code Structure, Directives, Threads etc.) Limitations Data Scope Clauses Shared,

More information

Efficiently Introduce Threading using Intel TBB

Efficiently Introduce Threading using Intel TBB Introduction This guide will illustrate how to efficiently introduce threading using Intel Threading Building Blocks (Intel TBB), part of Intel Parallel Studio XE. It is a widely used, award-winning C++

More information

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche.

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche. Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche massimo.bernaschi@cnr.it OpenMP by example } Dijkstra algorithm for finding the shortest paths from vertex 0 to the

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections.

More information