
Speeding Up Reactive Transport Code Using OpenMP
By Jared McLaughlin

OpenMP
- A standard for parallelizing Fortran and C/C++ on shared-memory systems
- Minimal changes to the sequential code are required, so parallelization can be done incrementally
- Works with OpenMP-compliant and ordinary compilers: directives begin with the sentinel !$OMP, which a compiler without OpenMP support simply treats as a comment
- No message passing between processors
- Fine- and coarse-grained parallelism (Do loops, Sections)
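
As an illustration (a minimal sketch, not taken from the reactive transport code itself), the following free-form Fortran program parallelizes a simple loop. A compiler built without OpenMP support treats the !$OMP directive and the conditionally compiled !$ lines as comments, so the sequential version is preserved.

program omp_minimal
!$ use omp_lib                       ! compiled only when OpenMP is enabled
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real, allocatable :: c(:)
   allocate(c(n))
!$OMP PARALLEL DO SHARED(c) PRIVATE(i)
   do i = 1, n
      c(i) = sqrt(real(i))           ! iterations are split between the threads
   end do
!$OMP END PARALLEL DO
!$ print *, 'threads available:', omp_get_max_threads()
   print *, 'c(n) =', c(n)
   deallocate(c)
end program omp_minimal

With gfortran, for example, the parallel version is built by adding -fopenmp; without that flag the same file compiles as serial code.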

Threads
What is a thread? An independent stream of execution that shares the program's memory with the other threads.
- A team of threads is forked at the start of a parallel region and joined back together at the end of the region.
- The thread that encounters the parallel region and forks the team is called the master; the additional threads are the workers.

Parallel Region Construct
!$OMP PARALLEL clause1 clause2 ...
   parallel code is placed here
!$OMP END PARALLEL

Optional clauses include:
- PRIVATE (list)
- SHARED (list)
- DEFAULT (PRIVATE | SHARED | NONE)
- FIRSTPRIVATE (list)
- REDUCTION (operator:list)
- IF (scalar logical expression)
- NUM_THREADS (scalar integer expression)
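
A small sketch (not from the original code) showing the fork/join behavior of a parallel region: the master thread reports the team size and each worker reports its thread number.

program fork_join_demo
   use omp_lib
   implicit none
   integer :: tid
!$OMP PARALLEL PRIVATE(tid) NUM_THREADS(4)
   tid = omp_get_thread_num()        ! 0 for the master, 1..3 for the workers
   if (tid == 0) then
      print *, 'master thread, team size =', omp_get_num_threads()
   else
      print *, 'worker thread', tid
   end if
!$OMP END PARALLEL                   ! implicit join: only the master continues
   print *, 'back to a single (master) thread'
end program fork_join_demo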

Work-Sharing Constructs

!$OMP DO clause1 clause2 ...
   DO i = 1, N
      parallel code is placed here
   END DO
!$OMP END DO end_clause
Optional clauses include PRIVATE (list), FIRSTPRIVATE (list), LASTPRIVATE (list), REDUCTION (operator:list), SCHEDULE (type, chunk)

!$OMP SINGLE clause1 clause2 ...
   ...
!$OMP END SINGLE end_clause
Optional clauses include PRIVATE (list), FIRSTPRIVATE (list)

!$OMP SECTIONS clause1 clause2 ...
!$OMP SECTION
   parallel code is placed here
!$OMP SECTION
   parallel code is placed here
!$OMP END SECTIONS end_clause
Optional clauses include PRIVATE (list), FIRSTPRIVATE (list), LASTPRIVATE (list), REDUCTION (operator:list)

!$OMP WORKSHARE
   ...
!$OMP END WORKSHARE end_clause

Clauses
- SHARED (list): the same memory location of the variable is available to all threads; the variable exists before and after the parallel region. You must check that no race conditions occur.
- DEFAULT (NONE | SHARED | PRIVATE): any variables not listed in other clauses are defaulted to shared or private; NONE requires every variable to be declared explicitly in a SHARED or PRIVATE clause.
- PRIVATE (list): each thread has its own copy of the variable, local to that parallel construct. Private variables must be initialized inside the parallel region and are undefined outside of it. Do-loop counters are always private.
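
The sketch below (variable names are illustrative, not from the reactive transport code) combines the DO and SINGLE constructs with the data-scope clauses above.

subroutine scale_and_sum(n, a, b, total)
   implicit none
   integer, intent(in)  :: n
   real,    intent(in)  :: a(n)
   real,    intent(out) :: b(n)
   real,    intent(out) :: total
   integer :: i
   real    :: factor
   factor = 2.0
   total  = 0.0
!$OMP PARALLEL DEFAULT(NONE) SHARED(n, a, b, factor) PRIVATE(i) REDUCTION(+:total)
!$OMP DO
   do i = 1, n                  ! the loop counter i is always private
      b(i) = factor * a(i)      ! a, b, factor are shared; distinct i, so no race
      total = total + b(i)      ! REDUCTION prevents a race on total
   end do
!$OMP END DO
!$OMP SINGLE
   print *, 'loop finished'     ! executed by exactly one thread
!$OMP END SINGLE
!$OMP END PARALLEL
end subroutine scale_and_sum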

Clauses (continued)
- FIRSTPRIVATE (list): initializes each thread's private copy with the value of the original variable on entry to the parallel region.
- LASTPRIVATE (list): on exit, the original variable takes the value from the last loop iteration or the final section.
- REDUCTION (operator:list): ensures that a shared variable location is written to by one thread at a time; each thread keeps a private copy of the shared variable that is combined into it at the end of the parallel region. Operators include +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR.
- IF (scalar logical expression): runs the parallel region sequentially if the expression is false.
- NUM_THREADS (scalar integer expression): sets the number of threads the region is forked into (an optional clause that is not required).

- SCHEDULE (type, chunk): type can be STATIC, DYNAMIC, or GUIDED and strongly affects the efficiency of the code.
  - STATIC divides the iterations among the threads at the start; if a chunk size is set, the last thread may get a different number of iterations than the others. It offers the best performance when all iterations require the same computational time.
  - DYNAMIC gives each thread a chunk of work and hands out another chunk when it finishes; if chunk is not specified the default is one, which clearly increases overhead.
  - GUIDED combines the two, handing out large chunks first and then smaller chunks that decrease roughly exponentially.
- NOWAIT: an end clause that lets threads skip the implied barrier at the end of a work-sharing region and continue on to the next one; without it, all threads wait there to catch up and synchronize with each other.
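
A brief illustrative sketch (names assumed, not from the original code) of the scheduling and data clauses; NOWAIT is safe here only because the second loop does not read what the first loop wrote.

subroutine schedule_demo(n, work, aux, last_val)
   implicit none
   integer, intent(in)  :: n
   real,    intent(out) :: work(n), aux(n)
   real,    intent(out) :: last_val
   integer :: i
   real    :: offset, x
   offset = 10.0
!$OMP PARALLEL
!$OMP DO SCHEDULE(GUIDED) FIRSTPRIVATE(offset) LASTPRIVATE(x)
   do i = 1, n
      x = offset + real(i)**2     ! each thread's offset copy starts at 10.0
      work(i) = x
   end do
!$OMP END DO NOWAIT               ! threads move on without waiting here
!$OMP DO SCHEDULE(DYNAMIC, 8)     ! chunks of 8 iterations handed out on demand
   do i = 1, n
      aux(i) = 0.5 * real(i)      ! independent of work, so NOWAIT above is safe
   end do
!$OMP END DO
!$OMP END PARALLEL
   last_val = x                   ! LASTPRIVATE: value from the final iteration
end subroutine schedule_demo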

Van der Pas, R. (2005, June 1-4). An Introduction into OpenMP. Presented at the University of Oregon.

REACTIVE TRANSPORT MODELING

Performance Analysis
Note: debug versus release mode

Governing equation (advection-dispersion-reaction):
\frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x} + D \frac{\partial^2 C}{\partial x^2} + SS + \text{reactions}

Operator split (each sub-equation is solved in turn every time step):
\frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x}
\frac{\partial C}{\partial t} = D \frac{\partial^2 C}{\partial x^2}
\frac{\partial C}{\partial t} = SS
\frac{\partial C}{\partial t} = \text{reactions}

100 Species in 1D Column after 40 years

100 Species Problem Specifics
- Parallelized in two places (see the sketch after this list):
  - The advection-dispersion equation, with a parallel Do loop whose species iterations are split between threads
  - The reactions, with a parallel Do loop whose node iterations are split between threads, the same way RT3D is done
- Results presented from Debug-mode runs

Simulation time (yr)                  40
Length (m)                            2000
Velocity (m/yr)                       5
Δx (m)                                1
Δt (yr)                               0.1
Dispersion coefficient Dx (m^2/yr)    50
Courant number                        0.5
Peclet number                         0.1
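
A hedged sketch of how the two parallelized loops described above might be organized within one operator-split time step; the subroutine and variable names (solve_adv_disp, solve_node_reactions, conc) are assumptions for illustration, not the actual code.

subroutine advance_one_step(nnodes, nspecies, conc, dt)
   implicit none
   integer, intent(in)    :: nnodes, nspecies
   real,    intent(inout) :: conc(nnodes, nspecies)     ! concentration at each node
   real,    intent(in)    :: dt
   integer :: k, j
   ! Advection-dispersion: species iterations are split between the threads
!$OMP PARALLEL DO SHARED(conc, nnodes, nspecies, dt) PRIVATE(k)
   do k = 1, nspecies
      call solve_adv_disp(conc(:, k), nnodes, dt)         ! hypothetical 1D ADE solver
   end do
!$OMP END PARALLEL DO
   ! Reactions: node iterations are split between the threads (as in RT3D)
!$OMP PARALLEL DO SHARED(conc, nnodes, nspecies, dt) PRIVATE(j)
   do j = 1, nnodes
      call solve_node_reactions(conc(j, :), nspecies, dt) ! hypothetical ODE solver
   end do
!$OMP END PARALLEL DO
end subroutine advance_one_step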

Timing (Static Scheduling)

Threads                    1             2             3             4
Program Run Time           172.77765     88.43443      62.32897      49.229939
Program Speedup            -             1.953737      2.772028      3.5096052
Program Efficiency         -             0.976869      0.924009      0.8774013
Reaction Run Time          134.536829    66.42403      44.81983      35.355174
Reaction Speedup           -             2.025424      3.001726      3.80529393
Reaction Efficiency        -             1.012712      1.000575      0.95132348
Adv-Disp Run Time          35.341182     18.9151       14.41493      10.701726
Adv-Disp Speedup           -             1.868411      2.451707      3.3023815
Adv-Disp Efficiency        -             0.934206      0.817236      0.82559538
Time Spent in Reactions    77.87%        75.11%        71.91%        71.82%
Time Spent in Adv-Disp     20.45%        21.39%        23.13%        21.74%

Don't focus on the program speedup and efficiency, just the parallelized sections.

Timing (Guided Scheduling)

Threads                    1             2             3             4
Program Run Time           172.77765     93.36956      65.11877      51.98954
Program Speedup            -             1.850471      2.65327       3.32331561
Program Efficiency         -             0.925235      0.884423      0.8308289
Reaction Run Time          134.536829    66.10992      44.28971      33.657039
Reaction Speedup           -             2.035048      3.037654      3.99728654
Reaction Efficiency        -             1.017524      1.012551      0.99932164
Adv-Disp Run Time          35.341182     23.54525      17.70636      14.985179
Adv-Disp Speedup           -             1.50099       1.99596       2.35840907
Adv-Disp Efficiency        -             0.750495      0.66532       0.58960227
Time Spent in Reactions    77.87%        70.80%        68.01%        64.74%
Time Spent in Adv-Disp     20.45%        25.22%        27.19%        28.82%

Timing (Static Adv-Disp & Guided Reactions)

Threads                    1             2             3             4
Program Run Time           172.77765     88.07855      61.75762      48.335757
Program Speedup            -             1.961631      2.797673      3.57453076
Program Efficiency         -             0.980816      0.932558      0.89363269
Reaction Run Time          134.536829    66.12452      44.33819      34.313412
Reaction Speedup           -             2.034598      3.034333      3.92082341
Reaction Efficiency        -             1.017299      1.011444      0.98020585
Adv-Disp Run Time          35.341182     18.84921      14.26557      10.735491
Adv-Disp Speedup           -             1.874943      2.477375      3.29199494
Adv-Disp Efficiency        -             0.937471      0.825792      0.82299873
Time Spent in Reactions    77.87%        75.07%        71.79%        70.99%
Time Spent in Adv-Disp     20.45%        21.40%        23.10%        22.21%

[Charts: "Superlinear Speedup" and "100 Species Runtimes"]

[Charts: "100 Species Speedup" and "Vinyl Chloride after 10000 days"]

RT3D Problem Specifics
- A program called MT3D solves the advection, dispersion, and source/sink equations and calls the RT3D subroutines to solve the reactions equation.
- The specific problem solved in this example was the sequential decay of PCE, TCE, DCE, and VC.
- The continuous source spill concentration of PCE was 1000 mg/L at the well; the initial levels of all chemicals in the aquifer were 0.0 mg/L.
- The site was 510 m x 310 m x 100 m, giving a 51 x 31 x 10 grid.
- The reactions solved were as follows:
  R_{PCE} = -k_1 [PCE]
  R_{TCE} = k_1 Y_{TCE/PCE} [PCE] - k_2 [TCE]
  R_{DCE} = k_2 Y_{DCE/TCE} [TCE] - k_3 [DCE]
  R_{VC}  = k_3 Y_{VC/DCE} [DCE] - k_4 [VC]
- Results presented from a Release-mode version.

Parameters:
k1 = 0.005 day^-1
k2 = 0.003 day^-1
k3 = 0.002 day^-1
k4 = 0.001 day^-1
Y_{TCE/PCE} = 0.7920
Y_{DCE/TCE} = 0.7377
Y_{VC/DCE}  = 0.6445

Timing Loop Around Row Do Loop (Static Scheduling)

Threads                    1             2             3             4
Program Run Time           394.4463      284.5834      258.6008      240.5579
Program Speedup            -             1.386048      1.525309      1.639715
Program Efficiency         -             0.693024      0.508436      0.409929
RT3D Run Time              229.1753      120.3458      94.81803      75.2866
RT3D Speedup               -             1.904307      2.417002      3.044039
RT3D Efficiency            -             0.952153      0.805667      0.76101
Time Spent in RT3D         58.10%        42.29%        36.67%        31.30%

Don't focus on the program speedup and efficiency, just the parallelized sections.
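
As a minimal sketch (this is not the actual RT3D user-defined reaction interface; the subroutine and argument names are illustrative), the four sequential-decay rates at a single node could be evaluated as:

subroutine sequential_decay_rates(pce, tce, dce, vc, r)
   implicit none
   real, intent(in)  :: pce, tce, dce, vc       ! concentrations (mg/L)
   real, intent(out) :: r(4)                    ! rates for PCE, TCE, DCE, VC
   ! first-order rate constants (1/day) and yield coefficients from the slide
   real, parameter :: k1 = 0.005, k2 = 0.003, k3 = 0.002, k4 = 0.001
   real, parameter :: y_tce_pce = 0.7920, y_dce_tce = 0.7377, y_vc_dce = 0.6445
   r(1) = -k1 * pce                             ! R_PCE
   r(2) =  k1 * y_tce_pce * pce - k2 * tce      ! R_TCE
   r(3) =  k2 * y_dce_tce * tce - k3 * dce      ! R_DCE
   r(4) =  k3 * y_vc_dce  * dce - k4 * vc       ! R_VC
end subroutine sequential_decay_rates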

Timing Loop Around Row Do Loop (Guided Scheduling)

Threads                    1             2             3             4
Program Run Time           394.4463      280.7615      247.1585      227.8098
Program Speedup            -             1.404916      1.595924      1.731472
Program Efficiency         -             0.702458      0.531975      0.432868
RT3D Run Time              229.1753      117.2388      80.37176      60.98532
RT3D Speedup               -             1.954774      2.851441      3.757877
RT3D Efficiency            -             0.977387      0.95048       0.939469
Time Spent in RT3D         58.10%        41.76%        32.52%        26.77%

[Chart: "RT3D Decay Problem Runtimes"]

Conclusion
Clearly, the speedup OpenMP can deliver is limited by the available computer architecture. Much more speedup is possible with hundreds of processors in a cluster system, possibly using Message Passing Interface (MPI) routines, but OpenMP leaves the sequential code intact, is easy to implement, and achieves substantial speedup when only a limited number of processors are available in a shared-memory system. Options for future research include a hybrid MPI/OpenMP code that combines the benefits of both standards.

OpenMP is available primarily in commercial compilers such as Intel Visual Fortran and the PGI compilers. The Omni compiler (http://phase.hpcc.jp/omni/) may be a free option with OpenMP support; I have not tried it, so I don't know whether it works.