Intel Parallel Composer. Stephen Blair-Chappell Intel Compiler Labs


1 Intel Parallel Composer Stephen Blair-Chappell Intel Compiler Labs

2 Intel Parallel Composer
Develop effective applications with a C/C++ compiler and comprehensive threaded libraries (code & debug phase). Easier, faster parallelism for Windows* apps: C/C++ compiler and advanced threaded libraries, built-in parallel debugger, OpenMP* support. Save time and increase productivity.

3 Key Features of Composer
Extensions for parallelism: simple concurrent functionality (__task/__taskcomplete), vectorization support for the SSE2/SSE3/SSSE3/SSE4 instruction sets, OpenMP 3.0
Seamless integration into Microsoft Visual Studio*
Intel Parallel Debugger Extensions - a plug-in to Visual Studio*
Intel Threading Building Blocks
C++ lambda function support (enables simpler interfacing with Intel TBB)
Intel Integrated Performance Primitives: integrated array notation, data-parallel Intel IPP functions
Parallel build (/MP) feature
Diagnostics to help develop parallel programs (/Qdiag-enable:thread)
Threading tutorials with sample code

4 Intel Parallel Composer Extend parallel debugging capabilities Adds a new class of data breakpoints Data race detection Allows filtering to control amount of data collected Serializes parallel regions without recompilation Adds window to visualize logs 4

5 Compiler Pro 11.0 vs. Composer
Full C and C++ support: both
Fortran support: Compiler Pro 11.0 only (Composer: no Fortran; Fortran 2003 support, several features, and many more features expected in a future Intel compiler, no Fortran 2003 in Composer)
__task/__taskcomplete parallel exploration: Composer only
OpenMP 3.0, valarray specializations, lambda functions: both
Intel Threading Building Blocks, Intel Integrated Performance Primitives: both
Intel Math Kernel Library: Compiler Pro 11.0 only
C++ Parallel Debug Plug-in with SSE/vector window and OpenMP parallelism window (Windows only): Composer (Compiler Pro 11.0: no Windows debugger)
GUI debugger with SSE/vector window and OpenMP parallelism window (Linux only): Compiler Pro 11.0 (Composer: no Linux debugger)
Code Coverage Utility, Test Select Utility: Compiler Pro 11.0
Full Fortran interoperability (except when IPO is used), decimal floating-point: Compiler Pro 11.0

6 Section 2 New Features: Concurrent Functionality, New C++0x Features, Debugger Extensions

7 Concurrent Functionality - Idea
The parallel programming extensions are intended for quickly getting a program parallelized without learning a great deal about APIs: a few keywords and the program is parallelized. If the constructs are not powerful enough in terms of data control, there may be a need to look into other, more comprehensive parallel programming methodologies such as OpenMP.

8 Concurrent Functionality
Introduction of novel C/C++ language extensions to make parallel programming easier. Four new keywords are introduced, used as statement prefixes: __taskcomplete, __task, __par, and __critical. To benefit from the parallelism afforded by these keywords, the switch /Qpar must be used; the runtime system manages the actual degree of parallelism.

9 Concurrent Functionality - example

int a[1000], b[1000], c[1000];

void f_sum(int length, int *a, int *b, int *c)
{
    int i;
    for (i = 0; i < length; i++) {
        c[i] = a[i] + b[i];
    }
}

// Serial call
f_sum(1000, a, b, c);

// Parallel call
__taskcomplete {
    __task f_sum(500, a, b, c);
    __task f_sum(500, a+500, b+500, c+500);
}

10 New C++0x Features
New C++0x features enabled by the switch /Qstd=c++0x (Windows):
lambda functions
static assertions
rvalue references
C99-compliant preprocessor
__func__ predefined identifier
variadic templates
extern templates
and some more
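As a minimal, hypothetical sketch (the file and function names are invented here, not from the slides), a few of these features can be combined in one translation unit compiled with /Qstd=c++0x:

// sketch.cpp -- hypothetical example; build with: icl /Qstd=c++0x sketch.cpp
#include <cstdio>
#include <utility>      // std::move

static_assert(sizeof(void*) >= 4, "expected at least a 32-bit target");   // static assertion

void consume(int&& v)   // rvalue-reference parameter: takes ownership instead of copying
{
    std::printf("moved %d inside %s\n", v, __func__);   // __func__ predefined identifier
}

int main()
{
    int x = 42;
    consume(std::move(x));                        // bind an rvalue reference
    auto square = [](int n) { return n * n; };    // lambda function
    std::printf("%d\n", square(7));
    return 0;
}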

11 C++0x: Futures
A future is a mechanism used to provide values that will be accessed in the future and resolved asynchronously. The definition of futures does not specify whether the computation of the given expression starts immediately or only when the result is requested. Futures are realized in the Intel Compiler by templates.

intel::future<page*> future_page;
do {
    // on user click
    if (user clicked NEXT) {            // pseudocode condition
        page = future_page.get();       // wait for next page to finish loading
    } else {
        future_page.cancel();           // user clicked END, speculation wasted
        break;
    }
} while (1);

12 C++0x: Lambda Functions
A lambda abstraction defines an unnamed function. Lambda functions in C++ provide:
treating functions as first-class objects
composing functions inline
treating functions as class objects
This enhances concurrent code, as it is possible to pass around code chunks like objects (the ability to pass code as a parameter). A lambda is written with the [capture] introducer syntax:

std::vector<int> somelist;
int total = 0;
std::for_each(somelist.begin(), somelist.end(),
              [&total](int x) { total += x; });
std::cout << total;
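Because Intel TBB algorithms accept any callable object, a lambda can be passed to them directly. The following is a small illustrative sketch (not from the slides) that assumes the TBB headers are available:

// Hypothetical sketch: a lambda passed straight into tbb::parallel_for.
#include <cstdio>
#include <vector>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

int main()
{
    tbb::task_scheduler_init init;              // needed by TBB releases of this era
    std::vector<float> data(1000, 1.0f);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
        [&data](const tbb::blocked_range<size_t>& r) {   // the loop body is just a lambda
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;
        });
    std::printf("%f\n", data[0]);
    return 0;
}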

13 VALARRAY methods accelerated by Intel IPP
valarray is a C++ standard library (STL) container class for arrays, providing array methods for high performance computing. The operations are designed to exploit low-level hardware features, for example vectorization. To take full advantage of valarray, you need an optimizing C++ compiler that recognizes valarray as an intrinsic type and replaces operations on such types with Intel IPP library calls.

// Create a valarray of ints.
valarray_t::value_type ibuf[10] = {0,1,2,3,4,5,6,7,8,9};
valarray_t vi(ibuf, 10);

// Create a valarray of bools for a mask.
maskarray_t::value_type mbuf[10] = {1,0,1,1,1,0,0,1,1,0};
maskarray_t mask(mbuf, 10);

// Double the values of the masked array
vi[mask] += static_cast<valarray_t>(vi[mask]);

14 Intel Parallel Composer Debugger Extensions

15 Key Features
Shared Data Access Detection: break on thread shared data access; re-entrant function detection
SIMD SSE Debugging Window
Enhanced OpenMP* Support: serialize OpenMP threaded application execution on the fly; insight into thread groups, barriers, locks, wait lists etc.

16 Shared Data Access Detection
Shared data access is a major problem in multithreaded applications: it can cause hard-to-diagnose intermittent program failure, so tool support is required for detection.
The technology is built on:
Code instrumentation by the Intel compiler (memory access instrumentation of the application)
A debug runtime library (RTL) that collects data access traces and triggers the debugger
A GUI extension to the Visual Studio or IDB debug engine that reports and visualizes RTL events while debugging
The combination enables a large variety of additional debug use cases.

17 Shared Data Access Detection
Data sharing detection is part of the overall debug process
Breakpoint model (stop on detection)
GUI extensions show results and link to source
Filter capabilities to hide false positives
New powerful data breakpoint types: stop when a 2nd thread accesses a specific address; stop on read from an address
Key user benefit: a simplified feature to detect shared data accesses from multiple threads

18 Shared Data Access Detection - Filtering
Data sharing detection is selective
Data filter: specific data items and variables can be excluded
Code filter: functions, source files, and address ranges can be excluded

19 Re-Entrant Call Detection
Automatically halts execution when a function is executed by more than one thread at any given point in time. Helps identify reentrancy requirements and problems in multi-threaded applications.

20 Enhanced OpenMP* Debugging Support
Dedicated OpenMP runtime object information windows: OpenMP task and spawn tree lists; barrier and lock information; task wait lists; thread team worker lists. Detailed execution state information for OpenMP applications (deadlock detection).
Serialize parallel regions: change the number of parallel threads dynamically during runtime to 1 or N (all); verify code correctness for serial vs. parallel execution. User benefit: identify whether a runtime issue is really parallelism related. Influences execution behavior without recompiling!

21 OpenMP* Task Details

22 Parallel Debug Plug-In
Allows filtering to control the amount of data collected
Adds a window to visualize logs
Can serialize parallel regions

23 Parallel Run-Control - Use Cases
Stepping parallel loops. Problem: state investigation is difficult because threads stop at arbitrary positions. Parallel Debugger Support: add a syncpoint to stop team threads at the same location (lock step instead of normal stepping past a breakpoint). User benefit: get and keep a defined program state; operations like private data comparison now become meaningful.
Serial execution. Problem: a parallel loop computes a wrong result - is it a concurrency or an algorithm issue? Parallel Debugger Support: runtime access to the OpenMP num_threads property; set it to 1 for serial execution of the next parallel block (disable/enable parallelism on demand). User benefit: verification of the algorithm on the fly without slowing the entire application down to serial execution; on-demand serial debugging without recompile/restart.

24 SIMD SSE Debugging Window
SIMD window (new): supports evaluation of arbitrary-length expressions; SSE registers display of variables used for SIMD operations; in-depth insight into data parallelization and vectorization.

25 Section 3 Creating Parallel Code

26 Implementing Parallelism - Different Methods
Three ways of achieving parallelism:
Automatic, via the compiler (no code changes)
Programming: OpenMP, native threads (Win32, POSIX), Threading Building Blocks, MPI
Using parallel-enabled libraries: MKL, IPP

27 Auto Parallelism Loop-level parallelism automatically supplied by the compiler 27

28 Auto-parallelization
Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.
Windows*: /Qparallel /Qpar_report[n]
Linux*: -parallel -par_report[n]
Mac*: -parallel -par_report[n]
The compiler can identify easy candidates for parallelization, but large applications are difficult to analyze.
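As a hedged illustration (the file name and loop are invented, not part of the deck), a simple loop with independent iterations such as the one below is the kind of candidate the auto-parallelizer can thread when built with the switches above:

/* sum.c -- hypothetical example of a loop the auto-parallelizer can thread.
   Build (Windows): icl /Qparallel /Qpar_report1 sum.c
   Build (Linux):   icc -parallel -par_report1 sum.c                      */
#include <stdio.h>
#define N 1000000
static double a[N], b[N], c[N];

int main(void)
{
    int i;
    for (i = 0; i < N; i++) {          /* independent iterations: an easy candidate */
        c[i] = 2.0 * a[i] + b[i];
    }
    printf("%f\n", c[N/2]);            /* keep the result live so the loop is not removed */
    return 0;
}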

29 Optimisation Results - pi application
Time taken (secs) and speedup compared for the default build, auto-vectorisation, auto-parallelism, and auto-vectorisation & auto-parallelism combined.

30 Sample implementations of Parallel Programming: MPI, POOP (Parallel Object Oriented Programming). *Other names and brands may be claimed as the property of others.

31 No Threads The Sample Application

32 Our running example: the PI program - numerical integration
Mathematically, we know that the integral from 0 to 1 of 4.0/(1+x²) dx equals π, with F(x) = 4.0/(1+x²).
We can approximate the integral as a sum of rectangles: the sum over i = 1..N of F(xᵢ)·Δx ≈ π, where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.

33 PI Program: the sequential program

static long num_steps = /* value elided on the slide */ ;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}

34 Native Threads

35 Threads and Parallel Programming
Operating Systems 101:
Process: a unit of work managed by an OS, with its own address space (the heap) and OS-managed resources.
Threads: resources within a process that execute the instructions in a program. They have their own program counter and a private memory region (a stack) but share the other resources within the process, including the heap.
Threads are the natural unit of execution for parallel programs on shared-memory hardware. The threads share memory, so data structures don't have to be torn apart into distinct pieces.

36 Programming with Native Threads
The OS provides an API for creating, managing, and destroying threads: the Windows* threading API, or POSIX threads (on Linux).
Advantage of thread libraries: the thread library gives you detailed control over the threads.
Disadvantage of thread libraries: the thread library REQUIRES that you take detailed control over the threads.
*Other names and brands may be claimed as the property of others.

37 Win32 API

#include <windows.h>
#include <stdio.h>
#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = /* value elided on the slide */ ;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
    int i, start;
    double x, sum = 0.0;

    start = *(int *) arg;
    step = 1.0/(double) num_steps;

    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
}

void main ()
{
    double pi;
    int i;
    DWORD threadID;
    int threadArg[NUM_THREADS];

    for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i+1;

    InitializeCriticalSection(&hUpdateMutex);

    for (i = 0; i < NUM_THREADS; i++) {
        thread_handles[i] = CreateThread(0, 0,
            (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);

    pi = global_sum * step;
    printf(" pi is %f \n", pi);
}

38 Win32 thread library: It's not as bad as it looks
The same program, annotated: set up multi-threading support; define the work each thread will do and pack it into a function; update the final answer one thread at a time; set up arguments and book keeping, and launch the threads; wait for all the threads to finish; compute and print the final answer.

39 Win32 API (the complete Pi program, as listed on slide 37)

40 Threading Building Blocks

41 Featured Components
Task scheduler
Generic parallel algorithms: parallel_for, parallel_reduce, pipeline, parallel_sort, parallel_while, parallel_scan
Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector

42 Concurrent Containers
The library provides highly concurrent containers. STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container. Standard practice is to wrap a lock around STL containers, which turns the container into a serial bottleneck. The library instead provides fine-grained locking or lockless implementations: worse single-thread performance, but better scalability. The containers can be used with the library, OpenMP, or native threads.
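A small illustrative sketch (not from the slides) of a concurrent container filled from an OpenMP parallel region; it assumes a TBB version that provides concurrent_queue::try_pop (newer releases do, older ones spelled the non-blocking pop pop_if_present):

// Illustrative sketch: tbb::concurrent_queue filled from an OpenMP parallel loop.
#include <cstdio>
#include <omp.h>
#include "tbb/concurrent_queue.h"

int main()
{
    tbb::concurrent_queue<int> q;
    #pragma omp parallel for
    for (int i = 0; i < 100; ++i)
        q.push(i);                     // thread-safe push, no explicit lock needed

    int item, count = 0;
    while (q.try_pop(item))            // drain the queue serially
        ++count;
    std::printf("popped %d items\n", count);
    return 0;
}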

43 Generic Programming for C++ developers Best known example is C++ STL Enables distribution of broadly-useful high-quality algorithms and data structures Write best possible algorithm with fewest constraints Do not force particular data structure on user Classic example: STL std::sort Instantiate algorithm to specific situation C++ template instantiation, partial specialization, and inlining make resulting code efficient 43

44 template <typename Range, typename Body>
void parallel_for(const Range& range, const Body& body);

Requirements for the parallel_for Body:
Body::Body(const Body&) - copy constructor
Body::~Body() - destructor
void Body::operator()(Range& subrange) const - apply the body to subrange

parallel_for partitions the original range into subranges and deals out subranges to worker threads in a way that balances load, uses the cache efficiently, and scales.

45 Serial Example

static void SerialUpdateVelocity() {
    for( int i=1; i<UniverseHeight-1; ++i )
#pragma ivdep
        for( int j=1; j<UniverseWidth-1; ++j )
            V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
}

The Intel TBB product includes complete serial and parallel versions of this seismic wave simulation.

46 Parallel Version

struct UpdateVelocityBody {
    void operator()( const blocked_range<int>& range ) const {
        int end = range.end();
        for( int i = range.begin(); i < end; ++i ) {
#pragma ivdep
            for( int j=1; j<UniverseWidth-1; ++j )
                V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
        }
    }
};

void ParallelUpdateVelocity() {
    parallel_for( blocked_range<int>( 1, UniverseHeight-1, GrainSize ),   // GrainSize establishes the grain size
                  UpdateVelocityBody() );
}

Pattern: blue = original code, red = provided by TBB, black = boilerplate for the library.

47 Task scheduler example
The range is split recursively until the grain size is reached; the resulting tasks are available to thieves (work stealing).

48 TBB Class

class ParallelPi {
public:
    double pi;
    ParallelPi() : pi(0) {}
    ParallelPi(ParallelPi &body, tbb::split) : pi(0) {}
    void operator()(const tbb::blocked_range<int> &r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            float x = Step * ((float)i - 0.5);
            pi += 4.0 / (1.0 + x*x);
        }
    }
    void join(ParallelPi &body) { pi += body.pi; }
};

49 TBB Class

int main() {
    ParallelPi Pi;
    parallel_reduce( tbb::blocked_range<int>(0, INTERVALS, 100), Pi );
    printf("Pi = %f\n", Pi.pi / INTERVALS);
}

50 Section 5 Expressing Parallelism Using the Intel Compiler Stephen Blair-Chappell Technical Consulting Engineer Intel Compiler Labs

51 OpenMP A deeper dive into OpenMP 51

52 What is OpenMP?
Portable, shared-memory multiprocessing API for Fortran 77, Fortran 90, C, and C++, with multi-vendor support on both Unix and Windows. Standardizes loop-level parallelism, supports coarse-grained parallelism, and combines serial and parallel code in a single source (no need for a separate source code revision). See www.openmp.org for standard documents, tutorials, and sample code. Intel is a premier member of the OpenMP Architecture Review Board.

53 Parallel APIs: OpenMP*
OpenMP: an API for writing multithreaded applications - a set of compiler directives and library routines for parallel application programmers. Makes it easy to create multithreaded (MT) programs in Fortran, C and C++. Standardizes the last 15 years of SMP practice.
(The slide is decorated with example directives and calls such as #pragma omp critical, #pragma omp parallel for private(A, B), C$OMP PARALLEL REDUCTION (+: A, B), CALL OMP_SET_NUM_THREADS(10), omp_set_lock(lck), setenv OMP_SCHEDULE dynamic, and so on.)

54 OpenMP Architecture Fork-Join Model Worksharing constructs Synchronization constructs Directive/pragma-based parallelism Extensive API for finer control 54

55 OpenMP Runtime (solution stack)
The application and user drive parallelism through directives, translated by the compiler, and through environment variables; both feed the runtime library, which manages the threads in the operating system.

56 OpenMP Programming Model: Fork-Join Parallelism
The master thread (shown in red on the slide) spawns a team of threads as needed. Parallelism is added incrementally until performance goals are met: i.e., the sequential program evolves into a parallel program. The diagram shows sequential parts alternating with parallel regions, including a nested parallel region.
*Other names and brands may be claimed as the property of others.

57 Intel Compiler Switches for OpenMP
OpenMP support: /Qopenmp
OpenMP diagnostic reports: /Qopenmp_report{0|1|2}

58 Basic Syntax - Fork-Join Model
Threads are created as the parallel pragma is crossed. Data is classed as shared among threads or private to each thread. Several threads (e.g. 4) are created on entry; threads either spin or sleep between regions.

main() {
    #pragma omp parallel \
            shared(a) private(i)
    {
        // this code is parallel
        ...
    }
}

59 Hello World
This program runs on three threads and prints the output shown below:

void main()
{
    int i;
    #pragma omp parallel
    {
        printf("Hello World\n");
        #pragma omp for
        for (i = 0; i <= 4; i++) {
            printf("Iter: %d", i);
        }
        printf("Goodbye World\n");
    }
}

Output:
Hello World
Hello World
Hello World
Iter: 1
Iter: 2
Iter: 3
Iter: 4
Goodbye World
Goodbye World
Goodbye World

60 Parallel Loop Model
Threads are created; data is classified as shared or private.

void* work(float* A) {
    #pragma omp parallel for \
            shared(A) private(i)
    for (i = 1; i <= 12; i++) {
        /* iterations divided among threads */
    }
}

Iterations are distributed across the threads, A is shared, and there is a barrier at the end of the loop. Threads either spin or sleep between regions.

61 Data Scope Attributes
The default status can be modified with default(shared | none)
Scoping attribute clauses: shared(varname, ...), private(varname, ...)

62 The Private Clause
Reproduces the variable for each thread. The variables are un-initialized (a C++ object is default constructed), and any value external to the parallel region is undefined.

void* work(float* c, int N) {
    float x, y;
    int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }
}

63 Example: Dot Product

float dot_prod(float* a, float* b, int N) {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

What is wrong?

64 Protect Shared Data
Must protect access to shared, modifiable data.

float dot_prod(float* a, float* b, int N) {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}

65 OpenMP* Critical Construct
#pragma omp critical [(lock_name)]
Defines a critical region on a structured block. Threads wait their turn; only one at a time calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs is optional, but may increase performance.

float R1, R2;
#pragma omp parallel
{
    float A, B;
    #pragma omp for
    for (int i = 0; i < niters; i++) {
        B = big_job(i);
        #pragma omp critical (R1_lock)
        consum(B, &R1);
        A = bigger_job(i);
        #pragma omp critical (R2_lock)
        consum(A, &R2);
    }
}

66 OpenMP* Reduction Clause
reduction (op : list)
The variables in list must be shared in the enclosing parallel region. Inside the parallel or work-sharing construct: a PRIVATE copy of each list variable is created and initialized depending on the op; these copies are updated locally by the threads; at the end of the construct, the local copies are combined through op into a single value and combined with the value in the original SHARED variable.

67 Reduction Example

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}

A local copy of sum is made for each thread; all local copies of sum are added together and stored in the global variable.

68 C/C++ Reduction Operations
A range of associative and commutative operators can be used with reduction; the initial values are the ones that make sense:
+ : 0
* : 1
- : 0
& : ~0
| : 0
^ : 0
&& : 1
|| : 0

69 Schedule Clause

#pragma omp parallel for schedule (static, 8)
for ( int i = start; i <= end; i += 2 ) {
    if ( TestForPrime(i) ) gPrimesFound++;
}

Iterations are divided according to the schedule statement.

70 OpenMP Parallel for with a reduction

#include <omp.h>
static long num_steps = /* value elided on the slide */ ;
double step;
#define NUM_THREADS 2

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}

*Other names and brands may be claimed as the property of others.

71 OpenMP FOR-schedule schemas
The schedule clause defines how loop iterations are assigned to threads: a compromise between two opposite goals, best thread load balancing and minimal controlling overhead. The slide compares a single thread with schedule(static), schedule(guided, f) and schedule(dynamic, c) over an iteration space of N.

72 Iterative worksharing versus code replication
Iterative worksharing -- prints Hello World 10 times, regardless of the number of threads:

#pragma omp parallel for
for (int i = 0; i < 10; i++) {
    printf("Hello World\n");
}

Code replication -- assuming a team of 4 threads, prints Hello World 40 times:

#pragma omp parallel
for (int i = 0; i < 10; i++) {
    printf("Hello World\n");
}

73 Parallel Sections
Independent sections of code can execute concurrently:

#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}

74 Implicit Barriers
Several OpenMP constructs have implicit barriers: do, for, single, sections. Unnecessary barriers hurt performance; suppress them, when safe, with nowait:

!$omp do
[...]
!$omp end do nowait

#pragma omp for nowait
for(...) [...];

!$omp sections
[...]
!$omp end sections nowait

#pragma omp single nowait
{ [...] }

75 Static and Dynamic Extent
Static extent (or lexical extent) is the code that is lexically within the parallel/end parallel directive. Dynamic extent includes the static extent and the entire call tree of any subroutine or function called in the static extent. OpenMP directives in the dynamic extent of a parallel region are called orphaned directives. An orphaned worksharing construct behaves as if the construct were within the lexical extent: the work is divided across the thread team. The only difference is that slightly different data scoping rules apply.

76 Static and Dynamic Extent

program main
!$omp parallel        ! <- static extent
  call foo            ! <-
!$omp end parallel    ! <- end (+ dynamic extent)
end program

subroutine foo        ! <-
!$omp do              ! orphaned
  do i = 1,100
  enddo               ! <-
!$omp end do
  call X              ! also dynamic extent
end subroutine foo    ! <-

77 Communication and data scope In OpenMP every variable has a scope that is either shared or private By default, all variables have shared scope Data scoping clauses that can appear on parallel constructs: The shared and private clauses explicitly scope specific variables The firstprivate and lastprivate clauses perform initialization and finalization of private variables The default clause changes the default scoping rules when variables are not explicitly scoped The reduction clause explicitly identifies reduction variables 77

78 Communication and data scope Each thread has a private stack used for automatic variables For all other program variables, the parallel constructs can scope each variable as shared, private, or reduction Private variables need to be initialized at the start of a parallel construct. The firstprivate clause will initialize from the global instance For parallelized loops, the lastprivate clause will update the global instance from the private value computed with the last iteration 78
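A minimal sketch of firstprivate and lastprivate in C (the variable names are illustrative, not from the slides):

/* Minimal sketch of firstprivate/lastprivate (illustrative names). */
#include <stdio.h>

int main(void)
{
    int offset = 100;   /* each thread's private copy starts from this value (firstprivate) */
    int last   = 0;     /* receives the value computed in the final iteration (lastprivate) */
    int i;
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 8; i++) {
        last = offset + i;             /* works on the private copy inside the loop */
    }
    printf("last = %d\n", last);       /* 107: the value from iteration i == 7 */
    return 0;
}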

79 Communication and data scope
Variables default to shared scope, except in these cases:
Loop index variables to which a parallel do or parallel for applies default to private
Local variables in subroutines in the dynamic extent default to private, unless they are marked with the save attribute (Fortran) or as static (C/C++)
Data scoping clauses only apply to the named variables within the lexical extent
Global variables in orphaned constructs are shared by default, regardless of the attribute in the lexical extent
Automatic (stack allocated) variables in orphaned constructs are always private
Formal parameters to a subroutine in the dynamic extent acquire their scope from that of the actual variables in the caller's context

80 Environment Variables
Standard environment variables:
OMP_SCHEDULE - runtime schedule and optional iteration chunk size
OMP_NUM_THREADS - number of worker threads; defaults to the number of processors
OMP_DYNAMIC - enables dynamic adjustment of the number of threads
OMP_NESTED - enables nested parallelism
Intel extension environment variables:
KMP_ALL_THREADS - maximum number of threads in a parallel region
KMP_BLOCKTIME - thread wait time before sleeping at the end of a parallel region
KMP_LIBRARY - runtime execution mode: throughput (default) for multi-user systems, turnaround for a dedicated (single-user) system
KMP_STACKSIZE - worker thread stack size

81 Synchronization
Two kinds of synchronization: mutual exclusion synchronization, for exclusive access to data by only one thread at a time, and event synchronization, for imposing a thread execution order.
Most commonly used synchronization directives:
Data synchronization gives a thread exclusive access to a shared variable: !$omp critical ... !$omp end critical protects an arbitrarily large block of structured code; !$omp atomic protects a single assignment that updates a scalar variable.
Event synchronization signals the occurrence of an event that all threads must synchronize their execution on: !$omp barrier specifies a point in the program where each thread must wait for all other threads to arrive.
Other, less frequently used synchronization directives: !$omp master, !$omp flush, !$omp ordered.
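A short C sketch (names illustrative, not from the slides) contrasting atomic, which protects a single scalar update, with critical, which protects a larger structured block:

/* Illustrative sketch: atomic for a single scalar update, critical for a larger block. */
#include <stdio.h>

int main(void)
{
    int hits = 0;
    double total = 0.0;
    int i;
    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        #pragma omp atomic             /* one scalar update: atomic is enough */
        hits++;

        #pragma omp critical           /* arbitrarily large structured block */
        {
            total += i * 0.5;
        }
    }
    printf("hits=%d total=%f\n", hits, total);
    return 0;
}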

82 The Workqueuing Model
Work units need not all be known or pre-computed at the beginning of the construct (e.g. while loops and recursive functions). taskq specifies an environment (the queue); task specifies the units of work (dynamic). One thread executes the taskq block, enqueuing each task it encounters; all other threads dequeue and execute work from the queue. This is an Intel-specific extension to OpenMP* 2.5; it has been accepted into the new 3.0 version (to be released Q2/08).

83 Why Workqueuing?

Serial:
while (p != NULL) {
    do_work(p->data);
    p = p->next;
}

Workqueuing:
#pragma intel omp parallel taskq
{
    while (p != NULL) {
        #pragma intel omp task
        do_work(p->data);
        p = p->next;
    }
}

Worksharing:
int n = 0, i = 0;
NODEPTR q = p;
NODEPTR *r;
while (q != NULL) { n++; q = q->next; }
r = allocate(n * sizeof(NODEPTR));
while (p != NULL) { r[i++] = p; p = p->next; }
#pragma omp parallel for
for (i = 0; i < n; i++)
    do_work(r[i]->data);
free(r);

84 OpenMP* 3.0 Support
Intel Compilers 11.0 for C++ and Fortran will be fully compliant with the OpenMP* 3.0 standard (not released yet - very likely May 08; a draft is available).
Four major extensions:
Tasking for unstructured parallelism
Loop collapsing
Enhanced loop scheduling control
Better support for nested parallelism
and, for all who waited, it eventually allows unsigned int for the loop index variable

85 OpenMP* 3.0 Tasking
Maybe the most relevant new feature. A task has: code to execute, a data environment (it owns its data), and an assigned thread that executes the code and uses the data.
Two activities: packaging and execution. Each encountering thread packages a new instance of a task (code and data); some thread in the team executes the task at some later time.
A task is nothing really new to OpenMP: implicitly, each PARALLEL directive creates tasks, but they have been transparent objects. OpenMP* 3.0 makes tasking explicit.

86 OpenMP* 3.0 Tasking - Definitions
Task construct: the task directive plus a structured block.
Task: the package of code and instructions for allocating data, created when a thread encounters a task construct.
Task region: the dynamic sequence of instructions produced by the execution of a task by a thread.

#pragma omp task [clause[[,] clause] ...]
    structured-block

where clause is one of: if (expression), untied, shared (list), private (list), firstprivate (list), default(shared | none)

87 OpenMP* 3.0 Tasking Example - Postorder Tree Traversal

void postorder(node *p) {
    if (p->left) {
        #pragma omp task
        postorder(p->left);
    }
    if (p->right) {
        #pragma omp task
        postorder(p->right);
    }
    #pragma omp taskwait   // wait for descendants
    process(p->data);
}

Task scheduling point: threads may switch to execute other tasks. The parent task is suspended until its children tasks complete.

88 OpenMP* 3.0 Enhanced Schedule Control
Makes schedule(runtime) more useful: you can now get/set it with the library routines omp_set_schedule() and omp_get_schedule(), and implementations may use their own schedule kinds.
Adds a new schedule kind, AUTO, which gives full freedom to the runtime to determine the scheduling of iterations to threads.
Allows C++ random access iterators as loop control variables in parallel loops.
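A minimal sketch of the runtime schedule API described above, assuming an OpenMP 3.0 compiler and runtime:

/* Sketch of the OpenMP 3.0 runtime schedule API. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_sched_t kind;
    int chunk, i;

    omp_set_schedule(omp_sched_dynamic, 4);    /* kind and chunk used by schedule(runtime) */
    omp_get_schedule(&kind, &chunk);
    printf("kind=%d chunk=%d\n", (int)kind, chunk);

    #pragma omp parallel for schedule(runtime) /* picks up the setting made above */
    for (i = 0; i < 100; i++) {
        /* work */
    }
    return 0;
}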

89 OpenMP* 3.0 Loop Collapsing
Allows collapsing of perfectly nested loops:

!$omp parallel do collapse(2)
do i = 1, n
  do j = 1, n
    ...
  end do
end do

The compiler will form a single loop and then parallelize it. The scheduling of the combined iteration space follows the order of the original, sequential execution.
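The same idea in C/C++, as a hedged sketch (the array shape and scaling function are invented for illustration):

/* Hedged C equivalent of the Fortran collapse example (illustrative names). */
void scale(int n, double a[n][n], double s)
{
    int i, j;
    #pragma omp parallel for collapse(2)   /* i and j form one combined iteration space */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] *= s;
}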

90 OpenMP 3.0: Nested Parallelism
Better support for nested parallelism through per-thread internal control variables: this allows, for example, calling omp_set_num_threads() inside a parallel region to control the team size for the next level of parallelism.
Library routines to determine the depth of nesting, the IDs of parent/grandparent threads, and the team sizes of parent/grandparent teams:
omp_get_active_level()
omp_get_ancestor_thread_num(level)
omp_get_team_size(level)
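A small illustrative sketch (not from the deck) that exercises the nested-parallelism routines listed above:

/* Illustrative sketch of the nested-parallelism routines (OpenMP 3.0 runtime assumed). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                        /* allow nested parallel regions */
    #pragma omp parallel num_threads(2)
    {
        omp_set_num_threads(3);               /* per-thread ICV: team size of the next level */
        #pragma omp parallel
        {
            #pragma omp critical
            printf("level=%d outer thread=%d inner team size=%d\n",
                   omp_get_active_level(),
                   omp_get_ancestor_thread_num(1),
                   omp_get_team_size(2));
        }
    }
    return 0;
}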


More information

HPCSE - I. «OpenMP Programming Model - Part I» Panos Hadjidoukas

HPCSE - I. «OpenMP Programming Model - Part I» Panos Hadjidoukas HPCSE - I «OpenMP Programming Model - Part I» Panos Hadjidoukas 1 Schedule and Goals 13.10.2017: OpenMP - part 1 study the basic features of OpenMP able to understand and write OpenMP programs 20.10.2017:

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.

More information

15-418, Spring 2008 OpenMP: A Short Introduction

15-418, Spring 2008 OpenMP: A Short Introduction 15-418, Spring 2008 OpenMP: A Short Introduction This is a short introduction to OpenMP, an API (Application Program Interface) that supports multithreaded, shared address space (aka shared memory) parallelism.

More information

Review. Lecture 12 5/22/2012. Compiler Directives. Library Functions Environment Variables. Compiler directives for construct, collapse clause

Review. Lecture 12 5/22/2012. Compiler Directives. Library Functions Environment Variables. Compiler directives for construct, collapse clause Review Lecture 12 Compiler Directives Conditional compilation Parallel construct Work-sharing constructs for, section, single Synchronization Work-tasking Library Functions Environment Variables 1 2 13b.cpp

More information

EE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California

EE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California EE/CSCI 451 Introduction to Parallel and Distributed Computation Discussion #4 2/3/2017 University of Southern California 1 USC HPCC Access Compile Submit job OpenMP Today s topic What is OpenMP OpenMP

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections.

More information

OpenMP Application Program Interface

OpenMP Application Program Interface OpenMP Application Program Interface DRAFT Version.1.0-00a THIS IS A DRAFT AND NOT FOR PUBLICATION Copyright 1-0 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed

More information

Intel Thread Building Blocks, Part II

Intel Thread Building Blocks, Part II Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization

More information

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013 OpenMP António Abreu Instituto Politécnico de Setúbal 1 de Março de 2013 António Abreu (Instituto Politécnico de Setúbal) OpenMP 1 de Março de 2013 1 / 37 openmp what? It s an Application Program Interface

More information

Parallel Programming using OpenMP

Parallel Programming using OpenMP 1 Parallel Programming using OpenMP Mike Bailey mjb@cs.oregonstate.edu openmp.pptx OpenMP Multithreaded Programming 2 OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard

More information

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs

More information

Parallel Programming using OpenMP

Parallel Programming using OpenMP 1 OpenMP Multithreaded Programming 2 Parallel Programming using OpenMP OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard to perform shared-memory multithreading

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

OpenMP. OpenMP. Portable programming of shared memory systems. It is a quasi-standard. OpenMP-Forum API for Fortran and C/C++

OpenMP. OpenMP. Portable programming of shared memory systems. It is a quasi-standard. OpenMP-Forum API for Fortran and C/C++ OpenMP OpenMP Portable programming of shared memory systems. It is a quasi-standard. OpenMP-Forum 1997-2002 API for Fortran and C/C++ directives runtime routines environment variables www.openmp.org 1

More information

Programming Shared-memory Platforms with OpenMP

Programming Shared-memory Platforms with OpenMP Programming Shared-memory Platforms with OpenMP John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 7 31 February 2017 Introduction to OpenMP OpenMP

More information

Shared Memory Programming Paradigm!

Shared Memory Programming Paradigm! Shared Memory Programming Paradigm! Ivan Girotto igirotto@ictp.it Information & Communication Technology Section (ICTS) International Centre for Theoretical Physics (ICTP) 1 Multi-CPUs & Multi-cores NUMA

More information

Programming with OpenMP* Intel Software College

Programming with OpenMP* Intel Software College Programming with OpenMP* Intel Software College Objectives Upon completion of this module you will be able to use OpenMP to: implement data parallelism implement task parallelism Agenda What is OpenMP?

More information

Shared Memory Parallelism - OpenMP

Shared Memory Parallelism - OpenMP Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openmp/#introduction) OpenMP sc99 tutorial

More information

Programming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen Programming with Shared Memory PART II HPC Fall 2012 Prof. Robert van Engelen Overview Sequential consistency Parallel programming constructs Dependence analysis OpenMP Autoparallelization Further reading

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

OpenMP - Introduction

OpenMP - Introduction OpenMP - Introduction Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı - 21.06.2012 Outline What is OpenMP? Introduction (Code Structure, Directives, Threads etc.) Limitations Data Scope Clauses Shared,

More information

Efficiently Introduce Threading using Intel TBB

Efficiently Introduce Threading using Intel TBB Introduction This guide will illustrate how to efficiently introduce threading using Intel Threading Building Blocks (Intel TBB), part of Intel Parallel Studio XE. It is a widely used, award-winning C++

More information

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche.

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche. Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche massimo.bernaschi@cnr.it OpenMP by example } Dijkstra algorithm for finding the shortest paths from vertex 0 to the

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections.

More information