Tools for OpenMP Programming
Dieter an Mey
Center for Computing and Communication, Aachen University
anmey@rz.rwth-aachen.de
Tools for OpenMP Programming
- Debugging of OpenMP codes
- KAP/Pro Toolset from KAI/Intel: Guide compilers, Assure, GuideView
- TotalView from Etnus
Debugging of OpenMP Programs (1)
Prepare the serial code:
- Carefully select a reasonable test case!
- Is the serial program delivering the right results? (use at least -O3)
- How about compiler warnings (lint, f90 -Xlist)?
- Fortran: put all local variables on the stack: f90 -stackvar ...
Now try the OpenMP version:
- Check the stacksize limits! export STACKSIZE=... / ulimit -s ...
- Respect compiler messages (f90: USE omp_lib; f90 -xcommonchk -vpara -xloopinfo -XlistMP ...)
- Try the OpenMP dummy library (link with -[x]openmp=stubs; Guide: execute with KMP_LIBRARY=serial)
Debugging of OpenMP Programs (2)
- Is the OpenMP program running correctly with a single thread?
- Is the OpenMP program running correctly, at least sometimes, with more than one thread? Race conditions? Thread safety? Use of static variables within a parallel region? (f90: SAVE, DATA, ...; C: static, extern)
- Check your program with Assure (Intel Thread Checker)!
- Compare the Sun and Guide compilers! (guide* ... -WGopt=0) When compiling with Guide, compile without optimization and with -g, then use the TotalView debugger together with Guide.
- Turn single parallel regions on and off!
- Serialize single parts of long parallel regions: !$omp single ... !$omp end single
- Introduce additional barriers for testing.
- Do different rounding errors matter? -fsimple=0; don't parallelize reductions.
Debugging of OpenMP Programs (3): Data Races
The typical OpenMP programming error is a data race: one thread modifies a memory location which another thread reads or writes within the same region (between two synchronisation points).
Take care: the order in which parallel loop iterations execute is non-deterministic and may change from run to run.
Test: the serial code should give the same answers when the parallelized loop is run backwards.
Assure traces all memory references and detects possible data races. It verifies that the OpenMP code gives the same results as a serial program run. In many cases private clauses, barriers, or critical regions are missing.
Assure does not accept OpenMP runtime functions. (The Thread Checker does.)
TotalView: Debugging of OpenMP Programs (1)
See the TotalView User's Guide:
- Each parallel region is outlined into a separate routine.
- Each parallel loop is outlined into a separate routine.
- The names of these outlined routines are based on the name of the calling routine and the line number of the parallel directive.
- Shared variables are declared in the calling routine and passed to the outlined routine. Private variables are declared in the outlined routine.
- The slave threads are created on entry to the parallel region.
- You must not step into a parallel region; instead, run into a previously set breakpoint.
TotalView: Debugging of OpenMP Programs (2)
Use the Guide OpenMP compilers, because TotalView does not yet support OpenMP debugging with the Sun compilers. Compile and link separately:

Fortran:
  #!/bin/ksh
  guidef90 -c -WG,-cmpo=i -WGkeepcpp prog.f90
    -or-
  guidef90 -c -WG,-cmpo=i -g prog.f90
  guidef90 -o a.out -WG,-cmpo=i -g prog.o
  export OMP_NUM_THREADS=2
  totalview a.out

C:
  #!/bin/ksh
  guidec -c -g prog.c
  guidec -o a.out -g prog.o
  export OMP_NUM_THREADS=2
  totalview a.out
KAP/Pro Toolset: Guide Compilers versus Sun Compilers
Guide compilers (guidef77 / guidef90 / guidec / guidec++):
- preprocessors replacing OpenMP constructs by calls to an additional runtime library using pthreads, invoking the underlying native Fortran / C compilers
- guide*: any optimization level of the underlying native compiler can be selected => debugging is possible
- guide*: supported by the TotalView parallel debugger
- guidef90: no internal subroutines in parallel regions
- guidec++ includes the famous KCC C++ compiler
Sun compilers:
- CC: automatically turns on -xO3 => debugging is impossible
- cc / f90 / f95: new option for debugging: -xopenmp=noopt
- f90 / f95 / cc: the combination of OpenMP and auto-parallelization is supported
Attention: different performance characteristics, different defaults!
Assure Usage
Like the Guide compilers, Assure is a preprocessor which
- instruments the source code
- collects additional information about the code
- invokes the native compiler

  assurec / assurec++ / assuref77 / assuref90 -WGpname=project \
    -fast ... sourcefiles -o a.out

The executable is run in serial mode (and takes a lot of memory and run time):
- all memory references are traced
- possible data races are detected in a postprocessing phase (for the given dataset!)

  a.out

The results of the analysis can be reported in line mode or presented with a GUI:

  assureview -pname=project -txt
  assureview -pname=project
Assure Example: Jacobi (1)

!$omp parallel private(resid,k_local)
      k_local = 1
      do while (k_local.le.maxit .and. error.gt.tol)
!$omp do
        do j=1,m; do i=1,n; uold(i,j) = u(i,j); enddo; enddo
!$omp single
        error = 0.0
!$omp end single
!$omp do reduction(+:error)
        do j..; do i..; resid=..; u(i,j)=..; error=..; enddo; enddo
!$omp single
        error = sqrt(error)/dble(n*m)
!$omp end single
        k_local = k_local + 1
      enddo
!$omp master
      k = k_local
!$omp end master
!$omp end parallel
Assure Example: Jacobi (2)

!$omp parallel private(resid,k_local)
      k_local = 1
      do while (k_local.le.maxit .and. error.gt.tol)
!$omp do
        do j=1,m; do i=1,n; uold(i,j) = u(i,j); enddo; enddo
        error = 0.0
!$omp do reduction(+:error)
        do j..; do i..; resid=..; u(i,j)=..; error=..; enddo; enddo
!$omp single
        error = sqrt(error)/dble(n*m)
!$omp end single
        k_local = k_local + 1
      enddo
!$omp master
      k = k_local
!$omp end master
!$omp end parallel
Assure Example: Jacobi (3)
[assureview GUI screenshots, two slides]

Assure Example: Jacobi (4)
[assureview GUI screenshot]
Assure Example: Thermoflow (1)

c$omp parallel
      ...
c$omp do private(l,tmp)
      DO I=1,N
        L = ind(I)
        tmp = X(L)*a(I) + Y(L)*b(I)
        X(L) = X(L) - tmp*a(I)
        Y(L) = Y(L) - tmp*b(I)
      END DO
c$omp end do
      ...
c$omp end parallel

User: the values of the index array IND are certainly disjoint!
But: Assure complains.
Check:

c$omp single
      open (unit=99,file="ind.dat")
      do i = 1,n
        write(99,*) ind(i)
      end do
      close (99)
c$omp end single

  sort ind.dat > ind.sort
  sort -u ind.dat > ind.usort
  diff ind.sort ind.usort
  98d97
  < 1085
  1619d1617
  < 505

2 values out of 2000 occurred twice!
Assure Example: Thermoflow (2)

C$omp parallel
      ...
      DO iter = 1,maxiter
c$omp do
        DO I = 3,n-2
          y(i) = (x(i-1) + x(i) + x(i+1)) / 3.0d0
        END DO
c$omp end do
c$omp do
        DO I = 3,n-2
          x(i) = y(i)
        END DO
c$omp end do
        x(2) = y(3)
        x(n-1) = y(n-2)
      END DO
      ...
C$omp end parallel

Assure complains! What is wrong?
Assure Example: Thermoflow (3)

C$omp parallel
      ...
      DO iter = 1,maxiter
c$omp do
        DO I = 3,n-2
          y(i) = (x(i-1) + x(i) + x(i+1)) / 3.0d0
        END DO
c$omp end do
c$omp do
        DO I = 3,n-2
          x(i) = y(i)
        END DO
c$omp end do nowait        <- this barrier can be omitted
c$omp single
        x(2) = y(3)
        x(n-1) = y(n-2)
c$omp end single           <- this barrier was missing
      END DO
      ...
C$omp end parallel
Assure: My Advice
Never put an OpenMP code into production ...
... without using Assure ...
Intel Thread Checker
... or the Intel Thread Checker, which has been the successor of Assure since Intel bought KAI. Currently the Thread Checker only runs on the MS Windows platform.
GuideView Usage
Compile with the Guide compiler:
  guidec / guidec++ / guidef77 / guidef90 -c -fast ... sourcefiles
Link with the Guide compiler driver and add the -WGstats option:
  guidec / guidec++ / guidef77 / guidef90 -WGstats -fast ... objectfiles -o a.out
Execute the program; at the end a statistics file is written:
  OMP_NUM_THREADS=4 a.out
Visualize the statistics file with the GuideView GUI:
  guideview
GuideView Example: Jacobi (1)
Barriers 1-4 mark the four implicit barriers of the region:

!$omp parallel private(resid,k_local)
      k_local = 1
      do while (k_local.le.maxit .and. error.gt.tol)
!$omp do
        do j=1,m; do i=1,n; uold(i,j) = u(i,j); enddo; enddo
                                          ! Barrier 1
!$omp single
        error = 0.0
!$omp end single                          ! Barrier 2
!$omp do reduction(+:error)
        do j..; do i..; resid=..; u(i,j)=..; error=..; enddo; enddo
                                          ! Barrier 3
!$omp single
        error = sqrt(error)/dble(n*m)
!$omp end single                          ! Barrier 4
        k_local = k_local + 1
      enddo
!$omp master
      k = k_local
!$omp end master
!$omp end parallel
GuideView Example: Jacobi (2)
Barrier 1 / Barrier 2 / Barrier 3 / Barrier 4
[GuideView screenshots next to the source code of the previous slide, two slides]
GuideView Example: Jacobi (3)
Legend of the GuideView display:
- Wait at a barrier
- Wait at the end of a parallel region
- Overhead when entering a parallel region
- Parallel time
- Waiting at a critical region
- Waiting for a lock
GuideView Example: TFS (1)
[GuideView screenshots, two slides]
Loop Scheduling Example: Matrix Transpose (1)

  export OMP_NUM_THREADS=8
  ulimit -s 300000
  export STACKSIZE=300000
  guidef90 -WGstats -fast transpose.f90
  export KMP_STATSFILE=static8.gvs
  export OMP_SCHEDULE=static,8
  a.out
  guideview

!$omp parallel do schedule(runtime) private(h)
      do i = 1, n-1
        do j = i+1, n
          h = a(j,i)
          a(j,i) = a(i,j)
          a(i,j) = h
        end do
      end do
!$omp end parallel do
Loop Scheduling Example: Matrix Transpose (2)
Matrix size: 5000x5000, 11 repetitions

  schedule     time (sec)
  static         6.30
  static,1      10.41
  static,8       4.12
  dynamic,1     10.24
  dynamic,8      4.96
  guided,1       3.35
  guided,8       3.31
Loop Scheduling Example: Matrix Transpose (3)
[Bar chart: average time (sec), matrix size 5000x5000, for the schedules static, static,1, static,8, dynamic,1, dynamic,8, guided,1, guided,8, compiled with guidef90 and with f90 -openmp; the best version using the Sun compiler and the best version using the Guide compiler are marked]
Summary
Debugging of OpenMP codes:
- Parallelize carefully!
- Watch out for compiler messages (-XlistMP).
- Use Assure (or the Thread Checker).
- Most likely, using a debugger on OpenMP codes is not necessary. If it is, you can use TotalView in combination with Guide.
Runtime analysis of OpenMP codes:
- Sun's Analyzer is an excellent and very powerful tool.
- On the OpenMP directive level, GuideView statistics are sometimes easier to understand.