Testing of PVODE, a Parallel ODE Solver

Michael R. Wittman
Lawrence Livermore National Laboratory
Center for Applied Scientific Computing
UCRL-ID-125562
August 1996

DISCLAIMER

This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.

DISCLAIMER

Portions of this document may be illegible in electronic image products. Images are produced from the best available original document.

Testing of PVODE, a Parallel ODE Solver*

Michael R. Wittman†

August 9, 1996

*Research performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract W-7405-ENG-48.
†Summer student with CASC, May-August 1996. Permanent address: wittman@purdue.edu

1 Introduction

The purpose of this paper is to discuss the issues involved with, and the results from, testing of two example programs that use PVODE: pvkx and pvnx. These two programs are intended to provide a template for users to follow when writing their own code. However, we also used them (primarily pvkx) to do performance testing and visualization. This work was done on a Cray T3D, a Sparc 10, and a Sparc 5.

2 The Test Problems

The pvkx test problem is one that was used in [1] as an example problem for CVODE [1, 2]. It is a system derived from a two-dimensional diurnal kinetics-diffusion PDE system with two species. The PDEs have the form

    ∂c^i/∂t = Kh ∂^2c^i/∂x^2 + V ∂c^i/∂x + ∂/∂y( Kv(y) ∂c^i/∂y ) + R^i(c^1, c^2, t),   i = 1, 2,

where

    Kh = 4×10^-6,   Kv(y) = 10^-8 e^(y/5),   V = 0.001,

    R^1(c^1, c^2, t) = -k1 c^1 - k2 c^1 c^2 + 7.4×10^16 k3(t) + k4(t) c^2,
    R^2(c^1, c^2, t) =  k1 c^1 - k2 c^1 c^2 - k4(t) c^2,

    k1 = 6.031,   k2 = 4.66×10^-16,
    k3(t) = exp[-22.62/sin(ωt)] for sin(ωt) > 0,  and 0 for sin(ωt) ≤ 0,
    k4(t) = exp[-7.601/sin(ωt)] for sin(ωt) > 0,  and 0 for sin(ωt) ≤ 0,
    ω = π/43,200,

with 0 ≤ x ≤ 20 km, 30 ≤ y ≤ 50 km, and 0 ≤ t ≤ 86,400 seconds (one day). The PDEs are discretized by finite differencing on a uniform mesh. Homogeneous Neumann boundary conditions are used. The initial conditions are given by

    c^i(x, y) = c_i,scale (1 - x̃^2 + 0.5 x̃^4) [0.75 + 0.25 tanh(ỹ/δ)],
    x̃ = (x - 10)/10,   ỹ = (y - 40)/10,   δ = 0.01,
    c_1,scale = 10^6,   c_2,scale = 10^12.

For the parallel implementation of the problem, the mesh was split up over a two-dimensional array of processors. In fact, the mesh was actually specified by the size of the submesh on each processor and the size of each dimension of the processor array. The advantage of this approach was the automatic consistency of the size of the data on each processor, which in turn made the code dealing with the data on all processors simpler. The minor disadvantage was the slight inflexibility in the mesh sizes that could be used. Since the problem is stiff, we used the SPGMR module with a block-diagonal preconditioner to get our solutions. For the testing, the problem was used as is, with some modifications for writing the mesh to disk and for timing. The specifics of these modifications will be discussed later.

pvnx is a program which models an ODE system based on a non-stiff one-dimensional problem similar to pvkx. The testing of this problem was done primarily to verify that the solution was correct and was consistent with the CVODE version of this problem. No visualization was done with this code, but we did need to modify it to do non-blocking communication, which will be discussed in the next section.

3 MPI Communication

One of the problems that became apparent during the testing process was the possibility of deadlocks occurring during MPI communication. Originally, in both the pvkx and pvnx problems, all sends and receives were done using MPI_Send() and MPI_Recv(). Because the MPI libraries that we were using (EPCC and MPICH) apparently implemented these calls using buffering, we were not getting deadlocks. However, according to the MPI specification, no buffering is required to be done on these two calls. Thus an MPI_Send() could potentially wait for the corresponding MPI_Recv() before returning, and vice versa. This posed a problem in both pvkx and pvnx, since at certain points in those codes, all processors communicate with their neighbors simultaneously using MPI_Send() and MPI_Recv().

To resolve these problems, we adopted a strategy of first doing non-blocking receives, then blocking sends, then waiting on the non-blocking receives. This strategy is endorsed by the MPI manual, as it makes it more likely that receive buffers will be available when data gets to each processor, thus limiting the amount of system buffering and copying necessary. The call sequence used was:

    MPI_Irecv();
    MPI_Send();
    MPI_Wait();
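The following is a minimal sketch of this call ordering for a single x-direction neighbor exchange on a two-dimensional array of PEs. It is illustrative only: the PE-array dimensions, message size, tag, and buffer names are assumptions rather than the actual pvkx values, and the y-direction exchange would be handled the same way.

    /* Sketch of the Irecv/Send/Wait ordering described above.
       NPEX, NPEY, NDATA, and the tag are illustrative assumptions. */
    #include <stdio.h>
    #include <mpi.h>

    #define NPEX  4      /* assumed PEs in x */
    #define NPEY  2      /* assumed PEs in y */
    #define NDATA 100    /* assumed doubles per boundary exchange */

    int main(int argc, char *argv[])
    {
      int i, rank, ix, iy, left, right;
      double sendbuf[NDATA], recvbuf[NDATA];
      MPI_Request req;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Position of this PE in the NPEX x NPEY processor array (row-major). */
      ix = rank % NPEX;
      iy = rank / NPEX;

      /* x-direction neighbors; MPI_PROC_NULL makes the boundary calls no-ops. */
      left  = (ix == 0)        ? MPI_PROC_NULL : rank - 1;
      right = (ix == NPEX - 1) ? MPI_PROC_NULL : rank + 1;

      for (i = 0; i < NDATA; i++) sendbuf[i] = (double)rank;  /* dummy boundary data */

      /* 1. Post the non-blocking receive first, ...                             */
      MPI_Irecv(recvbuf, NDATA, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req);
      /* 2. ... then do the blocking send, ...                                   */
      MPI_Send(sendbuf, NDATA, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
      /* 3. ... and wait last; recvbuf must not be read before the wait returns. */
      MPI_Wait(&req, &status);

      printf("PE %d at (%d,%d) finished its exchange\n", rank, ix, iy);

      MPI_Finalize();
      return 0;
    }

Using MPI_PROC_NULL at the domain boundary is simply a convenience in this sketch; it lets every PE issue the same three calls unconditionally.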

Using the non-blocking method of communication involved a small amount of additional complexity over the blocking method. This was due to the propagation of one MPI_Request handle from MPI_Irecv() to MPI_Wait() for each communication, as well as the fact that the data in the user buffer given to MPI_Irecv() could not be accessed until after the MPI_Wait() call.

4 Tests on Sparc 5s and Sparc 10s

Limited testing was done on Sparc 5 and Sparc 10 workstations. The available Sparc 10 workstations were running SunOS 4.1.3, and the available Sparc 5 machines were running Solaris. On the Sparc 10 machines, an existing MPICH installation (version 1.0.11) was used. On the Sparc 5 machines, version 1.0.12 of MPICH was installed from scratch and used.

MPI on workstations is designed to run parallel programs as separate processes on (possibly different) sequential machines as if they were running as separate PEs on a parallel machine. Thus we were able to test both the pvkx and pvnx parallel programs on our workstations. On the Sparc 10s, the code for both test programs executed as expected, albeit much more slowly than on a true parallel machine. The code was able to execute as separate processes on the same computer, and also as separate processes on different computers. On the Sparc 5s, the code executed correctly, but was unable to run using more than one process. This is probably because the MPI version for Solaris is still in development and has not undergone testing.

5 One-PE Version of MPI

To do simple testing of the non-communication aspects of our parallel programs, a one-processor MPI library was developed. The main purpose of this library was to allow programs using MPI to compile and run on sequential machines as if they were running as a single-process job on a parallel machine. The implementation of this library involved writing functions that mimicked the MPI calls used in our parallel programs and provided the expected results for a single-processor run. In addition, some error checking of proper MPI initialization and communication was included in the library. Using this library we were able to compile and run our codes on individual workstations with very little effort. A sketch of this idea appears below.
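A minimal sketch of what such a one-PE stand-in might look like is given below. The particular subset of functions, the treatment of MPI_PROC_NULL, and the error handling are assumptions for illustration, not a description of the library that was actually written; in practice the types and prototypes would live in a replacement mpi.h header.

    /* mpi_one_pe.c: illustrative sketch of a one-processor MPI stand-in. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef int MPI_Comm;
    typedef int MPI_Datatype;

    #define MPI_COMM_WORLD  0
    #define MPI_SUCCESS     0
    #define MPI_PROC_NULL  (-1)

    static int initialized = 0;

    static void check_init(const char *caller)
    {
      if (!initialized) {
        fprintf(stderr, "one-PE MPI: %s called before MPI_Init\n", caller);
        exit(1);
      }
    }

    int MPI_Init(int *argc, char ***argv)
    {
      (void)argc; (void)argv;
      initialized = 1;
      return MPI_SUCCESS;
    }

    int MPI_Comm_size(MPI_Comm comm, int *size)
    {
      (void)comm;
      check_init("MPI_Comm_size");
      *size = 1;                 /* a single-process "parallel machine" */
      return MPI_SUCCESS;
    }

    int MPI_Comm_rank(MPI_Comm comm, int *rank)
    {
      (void)comm;
      check_init("MPI_Comm_rank");
      *rank = 0;                 /* the lone PE is rank 0 */
      return MPI_SUCCESS;
    }

    int MPI_Send(void *buf, int count, MPI_Datatype type, int dest,
                 int tag, MPI_Comm comm)
    {
      (void)buf; (void)count; (void)type; (void)tag; (void)comm;
      check_init("MPI_Send");
      if (dest == MPI_PROC_NULL) return MPI_SUCCESS;   /* boundary no-op */
      /* A correct single-PE run has no other PE to talk to. */
      fprintf(stderr, "one-PE MPI: unexpected MPI_Send to PE %d\n", dest);
      exit(1);
    }

    int MPI_Finalize(void)
    {
      initialized = 0;
      return MPI_SUCCESS;
    }

Linking a test code against stubs of this kind exercises the problem setup, the PVODE calls, and the output logic on a sequential machine without any MPI installation.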

6 Visualization of Cray-T3D Results

Visualization of results from the pvkx code on the Cray T3D was necessary to ensure that the code was indeed correct, and that the problem that the code was solving was neither too difficult nor too easy for the mesh that was used. The visualization procedure for the pvkx problem used a three-step approach, because the amount of data generated by the program was potentially very large and because the necessary visualization tool (Matlab) existed on a different machine.

The first step in the visualization procedure was the actual writing to disk of the mesh data after each time interval in the problem. This was done by modifying pvkx to have each processor write its own data to a separate file for each time interval (a sketch of this step is given after Fig. 1). This approach greatly simplified the writing of data to disk, as it did not require the synchronization of processors that writing all data to a single file would have. The second step was the use of the sequential program collect.c on the T3D, which collected the data from a number of individual processor data files and merged them into a single larger data file. The third step was the use of a Matlab script, running on a Sparc 10, to call the collect program on the T3D and retrieve the necessary data for each time interval for each species in the problem. The Matlab program then did a surface plot of the data for both species, one time interval at a time.

Since we were working with large amounts of data at times, we incorporated some flexibility in dealing with the data into each of the three steps of the visualization. This flexibility entailed allowing any of the three steps to limit the amount of data seen by the next step to any rectangular subset of the entire processor mesh. Being able to limit the amount of data was absolutely essential for runs of 64 processors or more. (Visualizing the entire mesh for a 256-processor run would have generated 59 MB of data, far beyond Matlab's capabilities for an interactive session.) See Fig. 1 for an example surface plot.

7 Timing of Cray-T3D Results

In addition to visualization, timing and performance testing of the pvkx code was also done. This testing was done over various processor mesh and submesh sizes for the three message-passing libraries that we used: SHMEM, MPICH, and EPCC. The main purpose of the testing was to evaluate whether one of the two MPI libraries (MPICH and EPCC) could compete in terms of speed with the Cray-native SHMEM library. The hope was that at least one of the MPI libraries would be nearly as fast as the SHMEM library, since the MPI code has the benefit of being portable.

For the timing itself, we used both the native Cray cpused() timing function and the MPI_Wtime() call within the pvkx code itself. The results indicated that the values returned by MPI_Wtime() were consistent with those from cpused(), differing only by a scaling factor, so for our results and analysis we simply used MPI_Wtime(). The timing covered almost the entire main function of the pvkx code; the only code not executed during the timing was the finalization code, which could not be executed before the last MPI call. During the timing tests, the pvkx-specific visualization code was turned off using a preprocessor switch, in an attempt to avoid effects of disk I/O which might skew the time data. A sketch of this timing instrumentation appears below.
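As a sketch of how such timing can be wrapped around the integration loop, the fragment below uses MPI_Wtime() and a preprocessor switch in the spirit described above; the helper names, the WRITE_MESH macro, and the number of output intervals are hypothetical stand-ins, not the actual pvkx modifications.

    /* Illustrative sketch of MPI_Wtime()-based timing of the main loop. */
    #include <stdio.h>
    #include <mpi.h>

    #define NOUT 12                         /* assumed number of output intervals */

    /* Hypothetical stand-ins for the PVODE integration step and the
       per-PE visualization output. */
    static void advance_to_next_output(int iout) { (void)iout; }
    static void write_local_mesh(int rank, int iout) { (void)rank; (void)iout; }

    int main(int argc, char *argv[])
    {
      int rank, iout;
      double t_start, t_elapsed, t_max;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      t_start = MPI_Wtime();                /* wall clock on this PE */

      for (iout = 0; iout < NOUT; iout++) {
        advance_to_next_output(iout);       /* the work being timed */
    #ifdef WRITE_MESH
        write_local_mesh(rank, iout);       /* compiled out for timing runs so
                                               that disk I/O does not skew them */
    #endif
      }

      t_elapsed = MPI_Wtime() - t_start;

      /* Report the slowest PE's time, which bounds the run as a whole. */
      MPI_Reduce(&t_elapsed, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
      if (rank == 0) printf("elapsed wall-clock time: %g s\n", t_max);

      MPI_Finalize();
      return 0;
    }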

Figure 1: Visualization output for a 64 PE run of pvkx on a 400 x 400 mesh (surface plot of c2).
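Returning to step 1 of the visualization procedure in Section 6, the sketch below shows the kind of per-processor output that avoids any synchronization between PEs: each processor writes its own local subgrid to its own file for each output interval. The file-name convention, the small header, and the data layout are assumptions for illustration, not the actual modification made to pvkx.

    /* Illustrative sketch of per-PE output for one output interval. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Write the local mxsub x mysub subgrid (2 species per point) owned by
       PE `rank` for output interval `iout` to a file of its own. */
    void write_local_subgrid(int rank, int iout, int mxsub, int mysub,
                             const double *udata)
    {
      char fname[64];
      FILE *fp;

      sprintf(fname, "pvkx_pe%03d_t%03d.dat", rank, iout);   /* hypothetical name */
      fp = fopen(fname, "wb");
      if (fp == NULL) { perror(fname); exit(1); }

      /* A small header lets the collect program reassemble the global mesh. */
      fwrite(&rank,  sizeof(int), 1, fp);
      fwrite(&mxsub, sizeof(int), 1, fp);
      fwrite(&mysub, sizeof(int), 1, fp);

      /* Local solution data: 2 species per grid point. */
      fwrite(udata, sizeof(double), (size_t)(2 * mxsub * mysub), fp);

      fclose(fp);
    }

A merge step like collect.c can then read these headers, place each subgrid in its proper position in the global array, and write the single file that the Matlab script reads.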

8 Testing Procedure

The testing procedure for the pvkx code was essentially an edit-compile-run cycle for each test done. The processor submesh sizes, the size of the processor mesh, and the toggle for mesh printing were all hardwired using #define. In order to do a new test with a different-sized mesh, these settings in the pvkx file were changed. The compile was very straightforward, with the exception that changing libraries (e.g., MPICH to EPCC) involved changing variables in the Makefile. The run was also straightforward for runs of 32 processors or less, which could be submitted interactively. For 64- to 256-processor runs, we used a C-shell script which simplified the batch queue submission process by creating a batch script on the fly and submitting it to the NQS to run later. For those runs that were for timing purposes, the output containing the data for the run was filed based on the run specifications. For runs that were for visualization purposes, the processor data files were kept in a subdirectory, and the Matlab script on the Sparc 10 was run to do the visualization.

9 Test Results

For the pvkx code, the test results were very encouraging. Our trials were done over 64, 128, and 256 processors, in which we either kept the total mesh size constant or scaled it linearly with the number of processors. In all of these trials, the EPCC implementation of MPI was consistently competitive with the native SHMEM library. The MPICH implementation, however, was considerably slower than either of the other two libraries, and its performance actually appeared to degrade with increasing numbers of processors. See Figs. 2 and 3 for plots of the performance of the pvkx problem over a variable-sized grid. Figs. 4 and 5 show the performance of the problem on a constant-sized mesh with a varying number of processors.

References

[1] S. D. Cohen and A. C. Hindmarsh, CVODE User Guide, Lawrence Livermore National Laboratory report UCRL-MA-118618, September 1994.

[2] S. D. Cohen and A. C. Hindmarsh, CVODE, a Stiff/Nonstiff ODE Solver in C, Computers in Physics, vol. 10, no. 2 (March/April 1996), pp. 138-143.

Figure 2: Execution time for 50 x 50 submeshes (8, 16, 32 x 8 PE meshes; curves for MPICH, EPCC, and SHMEM; x-axis: processors).

Figure 3: Execution time weighted by work done (number of f() calls) (8, 16, 32 x 8 PE meshes, 50 x 50 subgrids; curves for MPICH, EPCC, and SHMEM).

Figure 4: Execution time for a fixed 800 x 400 mesh (8, 16, 32 x 8 PE meshes with 100, 50, 25 x 50 subgrids; x-axis: processors).

Figure 5: Execution time weighted by work done (number of f() calls) (8, 16, 32 x 8 PE meshes with 100, 50, 25 x 50 subgrids).