
Speeding Up Reactive Transport Code Using OpenMP
By Jared McLaughlin

OpenMP
- A standard for parallelizing Fortran and C/C++ on shared-memory systems
- Minimal changes to the sequential code are required, so parallelization can be done incrementally
- Works with OpenMP-compliant and ordinary compilers: directives begin with the sentinel !$OMP, which a compiler without OpenMP support simply treats as a comment
- No message passing between processors
- Fine- and coarse-grained parallelism (Do loops, Sections)
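
As an illustration (a minimal sketch, not taken from the reactive transport code itself), the following free-form Fortran program parallelizes a simple loop. A compiler built without OpenMP support treats the !$OMP directive and the conditionally compiled !$ lines as comments, so the sequential version is preserved.

program omp_minimal
!$ use omp_lib                       ! compiled only when OpenMP is enabled
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real, allocatable :: c(:)
   allocate(c(n))
!$OMP PARALLEL DO SHARED(c) PRIVATE(i)
   do i = 1, n
      c(i) = sqrt(real(i))           ! iterations are split between the threads
   end do
!$OMP END PARALLEL DO
!$ print *, 'threads available:', omp_get_max_threads()
   print *, 'c(n) =', c(n)
   deallocate(c)
end program omp_minimal

With gfortran, for example, the parallel version is built by adding -fopenmp; without that flag the same file compiles as serial code.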

Threads
What is a thread? An independent stream of execution that shares the program's memory with the other threads.
- A team of threads is forked at the start of a parallel region and joined back together at the end of the region.
- The thread that encounters the parallel region and forks the team is called the master; the additional threads are the workers.

Parallel Region Construct
!$OMP PARALLEL clause1 clause2 ...
   parallel code is placed here
!$OMP END PARALLEL

Optional clauses include:
- PRIVATE (list)
- SHARED (list)
- DEFAULT (PRIVATE | SHARED | NONE)
- FIRSTPRIVATE (list)
- REDUCTION (operator:list)
- IF (scalar logical expression)
- NUM_THREADS (scalar integer expression)
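
A small sketch (not from the original code) showing the fork/join behavior of a parallel region: the master thread reports the team size and each worker reports its thread number.

program fork_join_demo
   use omp_lib
   implicit none
   integer :: tid
!$OMP PARALLEL PRIVATE(tid) NUM_THREADS(4)
   tid = omp_get_thread_num()        ! 0 for the master, 1..3 for the workers
   if (tid == 0) then
      print *, 'master thread, team size =', omp_get_num_threads()
   else
      print *, 'worker thread', tid
   end if
!$OMP END PARALLEL                   ! implicit join: only the master continues
   print *, 'back to a single (master) thread'
end program fork_join_demo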

Work-Sharing Constructs

!$OMP DO clause1 clause2 ...
   DO i = 1, N
      parallel code is placed here
   END DO
!$OMP END DO end_clause
Optional clauses include PRIVATE (list), FIRSTPRIVATE (list), LASTPRIVATE (list), REDUCTION (operator:list), SCHEDULE (type, chunk)

!$OMP SINGLE clause1 clause2 ...
   ...
!$OMP END SINGLE end_clause
Optional clauses include PRIVATE (list), FIRSTPRIVATE (list)

!$OMP SECTIONS clause1 clause2 ...
!$OMP SECTION
   parallel code is placed here
!$OMP SECTION
   parallel code is placed here
!$OMP END SECTIONS end_clause
Optional clauses include PRIVATE (list), FIRSTPRIVATE (list), LASTPRIVATE (list), REDUCTION (operator:list)

!$OMP WORKSHARE
   ...
!$OMP END WORKSHARE end_clause

Clauses
- SHARED (list): the same memory location of the variable is available to all threads; the variable exists before and after the parallel region. You must check that no race conditions occur.
- DEFAULT (NONE | SHARED | PRIVATE): any variables not listed in other clauses are defaulted to shared or private; NONE requires every variable to be declared explicitly in a SHARED or PRIVATE clause.
- PRIVATE (list): each thread has its own copy of the variable, local to that parallel construct. Private variables must be initialized inside the parallel region and are undefined outside of it. Do-loop counters are always private.
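
The sketch below (variable names are illustrative, not from the reactive transport code) combines the DO and SINGLE constructs with the data-scope clauses above.

subroutine scale_and_sum(n, a, b, total)
   implicit none
   integer, intent(in)  :: n
   real,    intent(in)  :: a(n)
   real,    intent(out) :: b(n)
   real,    intent(out) :: total
   integer :: i
   real    :: factor
   factor = 2.0
   total  = 0.0
!$OMP PARALLEL DEFAULT(NONE) SHARED(n, a, b, factor) PRIVATE(i) REDUCTION(+:total)
!$OMP DO
   do i = 1, n                  ! the loop counter i is always private
      b(i) = factor * a(i)      ! a, b, factor are shared; distinct i, so no race
      total = total + b(i)      ! REDUCTION prevents a race on total
   end do
!$OMP END DO
!$OMP SINGLE
   print *, 'loop finished'     ! executed by exactly one thread
!$OMP END SINGLE
!$OMP END PARALLEL
end subroutine scale_and_sum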

Clauses (continued)
- FIRSTPRIVATE (list): initializes each thread's private copy with the value of the original variable on entry to the parallel region.
- LASTPRIVATE (list): on exit, the original variable takes the value from the last loop iteration or the final section.
- REDUCTION (operator:list): ensures that a shared variable location is written to by one thread at a time; each thread keeps a private copy of the shared variable that is combined into it at the end of the parallel region. Operators include +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR.
- IF (scalar logical expression): runs the parallel region sequentially if the expression is false.
- NUM_THREADS (scalar integer expression): sets the number of threads the region is forked into (an optional clause that is not required).

- SCHEDULE (type, chunk): type can be STATIC, DYNAMIC, or GUIDED and strongly affects the efficiency of the code.
  - STATIC divides the iterations among the threads at the start; if a chunk size is set, the last thread may get a different number of iterations than the others. It offers the best performance when all iterations require the same computational time.
  - DYNAMIC gives each thread a chunk of work and hands out another chunk when it finishes; if chunk is not specified the default is one, which clearly increases overhead.
  - GUIDED combines the two, handing out large chunks first and then smaller chunks that decrease roughly exponentially.
- NOWAIT: an end clause that lets threads skip the implied barrier at the end of a work-sharing region and continue on to the next one; without it, all threads wait there to catch up and synchronize with each other.
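
A brief illustrative sketch (names assumed, not from the original code) of the scheduling and data clauses; NOWAIT is safe here only because the second loop does not read what the first loop wrote.

subroutine schedule_demo(n, work, aux, last_val)
   implicit none
   integer, intent(in)  :: n
   real,    intent(out) :: work(n), aux(n)
   real,    intent(out) :: last_val
   integer :: i
   real    :: offset, x
   offset = 10.0
!$OMP PARALLEL
!$OMP DO SCHEDULE(GUIDED) FIRSTPRIVATE(offset) LASTPRIVATE(x)
   do i = 1, n
      x = offset + real(i)**2     ! each thread's offset copy starts at 10.0
      work(i) = x
   end do
!$OMP END DO NOWAIT               ! threads move on without waiting here
!$OMP DO SCHEDULE(DYNAMIC, 8)     ! chunks of 8 iterations handed out on demand
   do i = 1, n
      aux(i) = 0.5 * real(i)      ! independent of work, so NOWAIT above is safe
   end do
!$OMP END DO
!$OMP END PARALLEL
   last_val = x                   ! LASTPRIVATE: value from the final iteration
end subroutine schedule_demo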

Van der Pas, R. (2005, June 1-4). An Introduction into OpenMP. Presented at the University of Oregon.

REACTIVE TRANSPORT MODELING

Performance Analysis
Note: debug versus release mode

Governing equation (advection-dispersion-reaction):
\frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x} + D \frac{\partial^2 C}{\partial x^2} + SS + \text{reactions}

Operator split (each sub-equation is solved in turn every time step):
\frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x}
\frac{\partial C}{\partial t} = D \frac{\partial^2 C}{\partial x^2}
\frac{\partial C}{\partial t} = SS
\frac{\partial C}{\partial t} = \text{reactions}

100 Species in 1D Column after 40 years

100 Species Problem Specifics
- Parallelized in two places (see the sketch after this list):
  - The advection-dispersion equation, with a parallel Do loop whose species iterations are split between threads
  - The reactions, with a parallel Do loop whose node iterations are split between threads, the same way RT3D is done
- Results presented from Debug-mode runs

Simulation time (yr)                  40
Length (m)                            2000
Velocity (m/yr)                       5
Δx (m)                                1
Δt (yr)                               0.1
Dispersion coefficient Dx (m^2/yr)    50
Courant number                        0.5
Peclet number                         0.1
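
A hedged sketch of how the two parallelized loops described above might be organized within one operator-split time step; the subroutine and variable names (solve_adv_disp, solve_node_reactions, conc) are assumptions for illustration, not the actual code.

subroutine advance_one_step(nnodes, nspecies, conc, dt)
   implicit none
   integer, intent(in)    :: nnodes, nspecies
   real,    intent(inout) :: conc(nnodes, nspecies)     ! concentration at each node
   real,    intent(in)    :: dt
   integer :: k, j
   ! Advection-dispersion: species iterations are split between the threads
!$OMP PARALLEL DO SHARED(conc, nnodes, nspecies, dt) PRIVATE(k)
   do k = 1, nspecies
      call solve_adv_disp(conc(:, k), nnodes, dt)         ! hypothetical 1D ADE solver
   end do
!$OMP END PARALLEL DO
   ! Reactions: node iterations are split between the threads (as in RT3D)
!$OMP PARALLEL DO SHARED(conc, nnodes, nspecies, dt) PRIVATE(j)
   do j = 1, nnodes
      call solve_node_reactions(conc(j, :), nspecies, dt) ! hypothetical ODE solver
   end do
!$OMP END PARALLEL DO
end subroutine advance_one_step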

Timing (Static Scheduling)

Threads                    1             2             3             4
Program Run Time           172.77765     88.43443      62.32897      49.229939
Program Speedup            -             1.953737      2.772028      3.5096052
Program Efficiency         -             0.976869      0.924009      0.8774013
Reaction Run Time          134.536829    66.42403      44.81983      35.355174
Reaction Speedup           -             2.025424      3.001726      3.80529393
Reaction Efficiency        -             1.012712      1.000575      0.95132348
Adv-Disp Run Time          35.341182     18.9151       14.41493      10.701726
Adv-Disp Speedup           -             1.868411      2.451707      3.3023815
Adv-Disp Efficiency        -             0.934206      0.817236      0.82559538
Time Spent in Reactions    77.87%        75.11%        71.91%        71.82%
Time Spent in Adv-Disp     20.45%        21.39%        23.13%        21.74%

Don't focus on the program speedup and efficiency, just the parallelized sections.

Timing (Guided Scheduling)

Threads                    1             2             3             4
Program Run Time           172.77765     93.36956      65.11877      51.98954
Program Speedup            -             1.850471      2.65327       3.32331561
Program Efficiency         -             0.925235      0.884423      0.8308289
Reaction Run Time          134.536829    66.10992      44.28971      33.657039
Reaction Speedup           -             2.035048      3.037654      3.99728654
Reaction Efficiency        -             1.017524      1.012551      0.99932164
Adv-Disp Run Time          35.341182     23.54525      17.70636      14.985179
Adv-Disp Speedup           -             1.50099       1.99596       2.35840907
Adv-Disp Efficiency        -             0.750495      0.66532       0.58960227
Time Spent in Reactions    77.87%        70.80%        68.01%        64.74%
Time Spent in Adv-Disp     20.45%        25.22%        27.19%        28.82%

Timing (Static Adv-Disp & Guided Reactions)

Threads                    1             2             3             4
Program Run Time           172.77765     88.07855      61.75762      48.335757
Program Speedup            -             1.961631      2.797673      3.57453076
Program Efficiency         -             0.980816      0.932558      0.89363269
Reaction Run Time          134.536829    66.12452      44.33819      34.313412
Reaction Speedup           -             2.034598      3.034333      3.92082341
Reaction Efficiency        -             1.017299      1.011444      0.98020585
Adv-Disp Run Time          35.341182     18.84921      14.26557      10.735491
Adv-Disp Speedup           -             1.874943      2.477375      3.29199494
Adv-Disp Efficiency        -             0.937471      0.825792      0.82299873
Time Spent in Reactions    77.87%        75.07%        71.79%        70.99%
Time Spent in Adv-Disp     20.45%        21.40%        23.10%        22.21%

[Charts: "Superlinear Speedup" and "100 Species Runtimes"]

[Charts: "100 Species Speedup" and "Vinyl Chloride after 10000 days"]

RT3D Problem Specifics
- A program called MT3D solves the advection, dispersion, and source/sink equations and calls the RT3D subroutines to solve the reactions equation.
- The specific problem solved in this example was the sequential decay of PCE, TCE, DCE, and VC.
- The continuous source spill concentration of PCE was 1000 mg/L at the well; the initial levels of all chemicals in the aquifer were 0.0 mg/L.
- The site was 510 m x 310 m x 100 m, giving a 51 x 31 x 10 grid.
- The reactions solved were as follows:
  R_{PCE} = -k_1 [PCE]
  R_{TCE} = k_1 Y_{TCE/PCE} [PCE] - k_2 [TCE]
  R_{DCE} = k_2 Y_{DCE/TCE} [TCE] - k_3 [DCE]
  R_{VC}  = k_3 Y_{VC/DCE} [DCE] - k_4 [VC]
- Results presented from a Release-mode version.

Parameters:
k1 = 0.005 day^-1
k2 = 0.003 day^-1
k3 = 0.002 day^-1
k4 = 0.001 day^-1
Y_{TCE/PCE} = 0.7920
Y_{DCE/TCE} = 0.7377
Y_{VC/DCE}  = 0.6445

Timing Loop Around Row Do Loop (Static Scheduling)

Threads                    1             2             3             4
Program Run Time           394.4463      284.5834      258.6008      240.5579
Program Speedup            -             1.386048      1.525309      1.639715
Program Efficiency         -             0.693024      0.508436      0.409929
RT3D Run Time              229.1753      120.3458      94.81803      75.2866
RT3D Speedup               -             1.904307      2.417002      3.044039
RT3D Efficiency            -             0.952153      0.805667      0.76101
Time Spent in RT3D         58.10%        42.29%        36.67%        31.30%

Don't focus on the program speedup and efficiency, just the parallelized sections.
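
As a minimal sketch (this is not the actual RT3D user-defined reaction interface; the subroutine and argument names are illustrative), the four sequential-decay rates at a single node could be evaluated as:

subroutine sequential_decay_rates(pce, tce, dce, vc, r)
   implicit none
   real, intent(in)  :: pce, tce, dce, vc       ! concentrations (mg/L)
   real, intent(out) :: r(4)                    ! rates for PCE, TCE, DCE, VC
   ! first-order rate constants (1/day) and yield coefficients from the slide
   real, parameter :: k1 = 0.005, k2 = 0.003, k3 = 0.002, k4 = 0.001
   real, parameter :: y_tce_pce = 0.7920, y_dce_tce = 0.7377, y_vc_dce = 0.6445
   r(1) = -k1 * pce                             ! R_PCE
   r(2) =  k1 * y_tce_pce * pce - k2 * tce      ! R_TCE
   r(3) =  k2 * y_dce_tce * tce - k3 * dce      ! R_DCE
   r(4) =  k3 * y_vc_dce  * dce - k4 * vc       ! R_VC
end subroutine sequential_decay_rates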

Timing Loop Around Row Do Loop (Guided Scheduling)

Threads                    1             2             3             4
Program Run Time           394.4463      280.7615      247.1585      227.8098
Program Speedup            -             1.404916      1.595924      1.731472
Program Efficiency         -             0.702458      0.531975      0.432868
RT3D Run Time              229.1753      117.2388      80.37176      60.98532
RT3D Speedup               -             1.954774      2.851441      3.757877
RT3D Efficiency            -             0.977387      0.95048       0.939469
Time Spent in RT3D         58.10%        41.76%        32.52%        26.77%

[Chart: "RT3D Decay Problem Runtimes"]

Conclusion
Clearly, the speedup OpenMP can deliver is limited by the available computer architecture. Much more speedup is possible with hundreds of processors in a cluster system, possibly using Message Passing Interface (MPI) routines, but OpenMP leaves the sequential code intact, is easy to implement, and achieves substantial speedup when only a limited number of processors are available in a shared-memory system. Options for future research include a hybrid MPI/OpenMP code that combines the benefits of both standards.

OpenMP is available primarily in commercial compilers such as Intel Visual Fortran and the PGI compilers. The Omni compiler (http://phase.hpcc.jp/omni/) may be a free option with OpenMP support; I have not tried it, so I don't know whether it works.