Practical MPI for the Geissler group

Anna
August 12, 2011

Contents

1 Introduction
  1.1 What is MPI?
  1.2 Resources
  1.3 A tiny glossary
  1.4 MPI implementations
2 Writing MPI code
  2.1 MPI program design
  2.2 Basic functions: overhead
  2.3 Basic functions: send/receive
  2.4 Send/receive examples
3 Running MPI code
  3.1 Learning about your MPI installation
  3.2 Compiling in general
  3.3 Compiling on NERSC
  3.4 Running locally
  3.5 Submitting jobs on quaker, muesli, or lers
  3.6 Submitting jobs on NERSC

1 Introduction

1.1 What is MPI?

MPI stands for Message Passing Interface. It is a set of specifications for message-passing libraries, and it has many implementations in many languages (including C/C++, Fortran, and Python). It's good for CPU parallel-programming tasks where your processes are running the same code mostly independently, but may need to exchange small pieces of information once in a while. Typical uses for Geissler group members might be:

- data analysis (one analysis per process, each with a different datafile)
- replica exchange/parallel tempering simulations (one simulation per process, each with a different temperature or other set of parameters)

This tutorial focuses on what I learned while writing a replica exchange simulation in C++. Tasks that are probably not well suited to MPI include anything using a shared-memory model, programs that require sharing large amounts of data with complex internal structure (e.g., whole system configurations), or programs that utilize a large number of processes for only a small fraction of the total wall-clock time (because the processes hog cluster space even when they're inactive).

1.2 Resources

The group has two MPI reference books floating around:

- Parallel Programming with MPI by Peter S. Pacheco
- Parallel Programming in C with MPI and OpenMP by Michael J. Quinn

Have a glance through them to get yourself started, then start googling to answer specific debugging questions; a few of the links I found were particularly useful for runtime issues.

1.3 A tiny glossary

core: aka processor; a unit of computing hardware that executes MPI code
node: a physical computer, like your desktop; modern ones have several cores
process: what's running on a core, executing a complete copy of your code
job: aka session; the collection of all N processes that you run together as a single command-line operation
rank: the ID number of a process (between 0 and N-1)
message: a packet of information passed between two or more processes within the same job
communicator: what passes messages between processes; the only one you probably need to know about is MPI_COMM_WORLD

1.4 MPI implementations

I chose to use the C++ bindings of the OpenMPI library, a very popular implementation that's already installed on muesli, lers, quaker, and the NERSC machines, and possibly on your workstation. All of the code snippets in Section 2, and some of the compiling and running instructions, are specific to that implementation.
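
To give you a feel for what an OpenMPI program looks like before we get into the details in Section 2, here is a minimal, self-contained example of my own (not taken from the replica exchange code) that just has each process announce its rank. It should compile with mpic++ and run with mpirun as described in Section 3.

#include "mpi.h"
#include <iostream>

int main(int argc, char* argv[]) {
    // start up MPI; every process in the job executes all of this
    MPI_Init(&argc, &argv);

    // rank identifies this process; nprocs is the size of the job
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    std::cout << "hello from rank " << rank
              << " of " << nprocs << std::endl;

    // shut down MPI before exiting
    MPI_Finalize();
    return 0;
}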

This is not your only option, though! There are other widely used C/C++ implementations, most notably MPICH, which is available on the franklin and hopper NERSC machines. There's also a C++ implementation using the Boost framework that lets you pass STL types, at least one Python implementation, and lots of others. Look for one that is well documented, supports your language of choice, is already present or easy to install on the machines you want to use, works with your favorite debugging and IDE suites, etc.

2 Writing MPI code

2.1 MPI program design

Step one in any parallel computing project is figuring out what tasks in your code are parallelizable. Good candidates are tasks that involve doing the same operation many times on different pieces of data, where the result of each operation depends only on the input data to that operation and not on the output of any of the other operations. Relevant examples of this class of program tasks include MD force computation (each particle is independent), data analysis like computing g(r) (each configuration is independent), and replica exchange simulations (each replica is independent). In general, Monte Carlo is less parallelizable than molecular dynamics because a single-particle move usually depends on the result of the previous single-particle move (but ask Carl for some nice counterexamples). You don't need every part of your code to be parallelizable, but you'll probably get better speed-ups if you parallelize the computationally intensive tasks.

Step two is identifying the input and output data of the to-be-parallelized tasks. MPI is a distributed-memory system, so processes only have contact with each other by passing messages. This is nice because one process will never accidentally overwrite memory being used by another process. The flip side is that communication takes place over the network, so thinking about ways to minimize the amount of data being passed around may be worthwhile. When parallelizing the system propagation steps between replica exchange moves, I decided that the input to each replica would be its new simulation parameters (e.g., temperature, pressure, ε, or σ for an NPT Lennard-Jones system), and the output would be the set of energies of the replica's final configuration under all possible simulation parameters.

Step three is deciding how to allocate tasks to MPI processes. For MD force computation, it would be silly to have one process per particle, but it might be reasonable to assign 128 particles out of a 1024-particle simulation to each of 8 processes. For replica exchange, one replica per process makes sense. I also decided to have a master process that coordinates the simulation, collecting information from and distributing information to each replica process; the replica processes are then the slaves. A common convention for master-slave program designs is to assign rank 0 to the master. A minimal sketch of such a layout is shown below.
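
As a rough sketch of what that kind of design looks like in code (the two functions here are hypothetical placeholders, not functions from my actual simulation), the top level of a master/slave program can be as simple as a single branch on rank:

#include "mpi.h"

// hypothetical placeholders: the real versions would coordinate swaps
// (master) or propagate one replica and report its energies (slave)
void run_master() { /* collect from and distribute to every slave */ }
void run_slave()  { /* do the actual simulation work */ }

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // convention: rank 0 is the master, everyone else is a slave
    if (rank == 0)
        run_master();
    else
        run_slave();

    MPI_Finalize();
    return 0;
}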

Step four is thinking about how to incorporate this new functionality into your code. Decide how to isolate the to-be-parallelized tasks into one or more functions, what messages should be passed between which processes at what times, and which processes should run which parts of the code. (In general, all processes run the same executable, but the flow of each process through the code can be controlled with statements like if (rank == somerank).) Cycle through steps 2-4 until you have a design you're happy with.

Step five is actually writing the MPI code. Don't do step five until you've done steps 1-4, especially if your code isn't under version control. Speaking of which, step zero: put your code under version control (ask Todd about using the group git server).

2.2 Basic functions: overhead

To use the OpenMPI library in C++, put this line with your other include statements

#include "mpi.h"

and make sure the mpi.h file is in your path. Somewhere early in your program, before any other MPI command, you have to initialize MPI. This part is executed by every process, although the local value of rank will be different for each process.

// passes the command-line arguments to MPI
// (the command-line arguments don't have to do anything, though)
MPI_Init(&argc, &argv);

// initialize the variable rank to the rank of this process
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// initialize the variable nprocs to the total number of processes
int nprocs;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

The rank and nprocs variables (which can be named anything you like, incidentally) are very useful for controlling the flow of your code. For instance, you could include lines such as

if (rank < nprocs/2) {
    // code that only half the processes should execute
}

and the code will execute as expected because rank and nprocs have the values that make sense. Don't expect something like

for (int irank = 0; irank < nprocs; irank++) {
    // code that every process should execute once
}

because every process executes every line of code once by default; that code snippet would result in every process executing the code inside the for-loop nprocs times.
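
If what you actually want is to divide the iterations of a loop among the processes, one common pattern (a sketch of mine, not code from the replica exchange program; ntasks is assumed to be the total number of independent tasks) is to have each process take every nprocs-th iteration:

// rank 0 handles tasks 0, nprocs, 2*nprocs, ...;
// rank 1 handles tasks 1, nprocs+1, ...; and so on
for (int itask = rank; itask < ntasks; itask += nprocs) {
    // work on task itask; exactly one process handles each task
}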

Another use for rank can be finding the right input files. Because every process has the same argc and argv, you may not want your simulation parameters or data-file names to be command-line arguments. One option is to put the parameters for each process in separate files, and make a function that reads the appropriate input file and sets up that process.

// set up name of input file for this process
char input_file_name[60];
sprintf(input_file_name, "params_for_rank_%d.txt", rank);

// pass file name to file-reading function
read_input_file(input_file_name);

A big part of the point of MPI is making things go faster, so you'll probably want to know how long different parts of your code take to run in wall-time. Instead of the ctime C/C++ functions that I usually use for profiling, MPI provides the function MPI_Wtime for this purpose. It returns a double that represents the current time in seconds; the zero of time is arbitrary but fixed throughout the run-time of the process. So it's best to use MPI_Wtime in pairs:

// get current time
double starttime = MPI_Wtime();

// some code that takes time

// get current time
double endtime = MPI_Wtime();

// see how long the code took
cout << "this took " << endtime - starttime << " seconds ";
cout << "or " << (endtime - starttime)/3600/24 << " days" << endl;

One last thing you may be curious about is what physical computer your processes have found themselves on. I think there are ways to access at least part of this information in the submit script (see Section 3), but you can find out directly from within your executable too. The command for this is MPI_Get_processor_name, and it's used like so:

// initialize arguments for MPI_Get_processor_name
int namelen;
char nodename[MPI_MAX_PROCESSOR_NAME];

// call function, overwrites arguments
MPI_Get_processor_name(nodename, &namelen);

// output
cout << "rank " << rank << " running on node " << nodename << endl;
cout << "name is " << namelen << " chars long" << endl;
cout << "max name length was " << MPI_MAX_PROCESSOR_NAME << endl;

Finally, after the last MPI function has been called, you need to clean things up:

// clean up MPI
MPI_Finalize();

2.3 Basic functions: send/receive

Now let's get to the MP part of MPI: message passing. Every process can exchange messages with every other process, and there are a variety of functions that allow different sorts of communication patterns: send/receive, broadcast/reduce, gather/scatter, ring pass, etc. Send/receive is the simplest one, just one message being passed from one process to another, and that's the only one I'll cover here. The other communication methods are collective, in contrast to the point-to-point nature of send/receive. Cloud computing algorithms are typically based on collective communication (e.g., Google's patented MapReduce and Apache's open-source Hadoop), so there's a significant possibility that it's worth your while to look into collective communication options.

The functions MPI_Send and MPI_Recv have a similar syntax:

int MPI_Send(void* data_to_send, int count, MPI_Datatype datatype,
             int dest_rank, int tag, MPI_Comm communicator);

int MPI_Recv(void* data_to_recv, int count, MPI_Datatype datatype,
             int source_rank, int tag, MPI_Comm communicator,
             MPI_Status* status);

The first argument of each function is the message data itself, and all the others are the envelope that allows the message to be processed correctly on each end. Let's take a closer look at each argument.

data_to_send and data_to_recv are pointers to pre-allocated blocks of memory that hold (or will hold) the message data. Since this implementation of MPI deals with pointers and arrays instead of STL types like vectors, your code has to deal with pointers too; sorry! The memory at data_to_recv gets overwritten by MPI_Recv.

count is the number of values in the message. If you're sending an array containing 8 floats, then count should be equal to 8 in MPI_Send, and at least 8 in MPI_Recv.

datatype is the MPI equivalent of the C++ datatype that you used to initialize your message: MPI_FLOAT, MPI_DOUBLE, MPI_CHAR, etc. There's also an option for MPI_PACKED if you want to send a structure.

dest_rank must match the rank of the process calling MPI_Recv, and source_rank must match the rank of the process calling MPI_Send.

tag is there for fine-tuning the point-to-point communication, but I don't have much of a use for it. The tag in MPI_Send must be an actual integer, whereas the tag in MPI_Recv can be a wildcard like MPI_ANY_TAG.

communicator is typically MPI_COMM_WORLD, which we saw in the MPI start-up code snippet. All processes are members of MPI_COMM_WORLD, so it will probably meet your needs, but there are other options for communicators if you want something more specialized.

status is a structure of type MPI_Status, with members status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR (all of type int).

The return values of MPI_Send and MPI_Recv are error codes, but MPI usually just dies if something goes wrong.
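
One situation where the status argument earns its keep is receiving from whichever process happens to report first. The following sketch (my own illustration, not part of the replica exchange code) uses the MPI_ANY_SOURCE wildcard and then reads the sender's actual rank and tag back out of status:

// receive one double from whichever process sends first
double result;
MPI_Status status;
MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

// status records who actually sent the message, and with what tag
cout << "got " << result << " from rank " << status.MPI_SOURCE
     << " with tag " << status.MPI_TAG << endl;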

2.4 Send/receive examples

Here's an example where process 2 sends the array [0, 2, 4, 6, 8] to process 1, and process 1 sends the array [4, 4, 4, 4, 4] back to process 2. Note how the statically allocated arrays are passed to MPI_Send and MPI_Recv with &, and note that process 1 receives the first message before it sends the second message. (If both processes tried to send before they tried to receive, the code would hang indefinitely.)

// initialize empty static arrays
int first_message[5];
int second_message[5];
int count = 5;

// initialize status
MPI_Status status;

if (rank == 2) {
    // put some values in the first array
    for (int i = 0; i < count; i++)
        first_message[i] = i*rank;

    // send the first message with tag=count
    MPI_Send(&first_message, count, MPI_INT, 1, count, MPI_COMM_WORLD);

    // receive the second message
    MPI_Recv(&second_message, count, MPI_INT, 1, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

} else if (rank == 1) {
    // put some values in the second array
    for (int i = 0; i < count; i++)
        second_message[i] = count - rank;

    // receive the first message
    MPI_Recv(&first_message, count, MPI_INT, 2, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    // send the second message with tag=0
    MPI_Send(&second_message, count, MPI_INT, 2, 0, MPI_COMM_WORLD);
}
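
If two processes really do need to exchange data and it's awkward to guarantee the send/receive ordering by hand, MPI also provides MPI_Sendrecv, which performs the send and the receive in a single call and so can't deadlock against a matching MPI_Sendrecv on the other process. I didn't end up needing it for replica exchange, so treat this as a sketch; partner_rank is assumed to hold the rank of the other process, and outgoing is assumed to have been filled in already:

// exchange five ints with a partner process in one deadlock-free call
int outgoing[5], incoming[5];
MPI_Status status;
MPI_Sendrecv(outgoing, 5, MPI_INT, partner_rank, 0,   // what we send, and to whom
             incoming, 5, MPI_INT, partner_rank, 0,   // where the reply goes, and from whom
             MPI_COMM_WORLD, &status);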

Here's a more complicated example: say you have a bunch of slave processes running simulations of a system with liquid and vapor phases, and you want the master process to make a histogram of the z-velocities of all the vapor particles in all the simulations to check against a Maxwell-Boltzmann distribution. (Patrick Varilly was doing something similar the other day.) Because each slave simulation may have a different number of vapor particles, and to avoid hard-coding the maximum possible number of vapor particles, we'll use dynamic memory allocation; note that the resulting * and & usages are a bit different than in the previous example. This example also introduces the function MPI_Get_count(&status, datatype, &real_count), which figures out the number of values actually received. The value of real_count can be less than the value of count passed to MPI_Recv, making MPI_Get_count useful for debugging in addition to how I used it here.

if (rank > 0) {  // slaves only

    // ask a function for the number of vapor particles
    int n_vapor_particles = get_number_in_vapor();

    // initialize empty dynamic array
    float * z_vels = new float[n_vapor_particles];

    // pass pointer to z_vels array to a function
    // that puts in the correct values
    get_vapor_z_vels(z_vels);

    // send to master (rank 0) with tag 1
    MPI_Send(z_vels, n_vapor_particles, MPI_FLOAT, 0, 1, MPI_COMM_WORLD);

    // free the dynamic array once the send has returned
    delete [] z_vels;

}  // end slaves

// can have other code here
// sends and receives need not be close to each other in the code file
// they just have to happen in the right order when the code is executed

if (rank == 0) {  // master only

    // initialize status
    MPI_Status status;

    // loop over all slaves
    for (int irank = 1; irank < nprocs; irank++) {

        // initialize empty dynamic array
        // assume that n_max_particles is initialized elsewhere,
        // perhaps from a configuration file
        float * z_vels = new float[n_max_particles];

        // receive from slave
        MPI_Recv(z_vels, n_max_particles, MPI_FLOAT, irank, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        // initialize n_vapor_particles with length of received message
        int n_vapor_particles;
        MPI_Get_count(&status, MPI_FLOAT, &n_vapor_particles);

        // output to a log filestream (initialized elsewhere)
        logfile << "master received z-velocities of " << n_vapor_particles
                << " vapor particles from slave with rank " << irank << endl;

        // loop over received values only
        for (int iparticle = 0; iparticle < n_vapor_particles; iparticle++) {
            // pass values to a histogramming function
            add_velocity_to_histogram(z_vels[iparticle]);
        }  // end loop over particles

        // free the dynamic array before the next slave's data arrives
        delete [] z_vels;

    }  // end loop over slaves
}  // end master
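
If you would rather not rely on an n_max_particles upper bound at all, the master can instead ask MPI how large the incoming message is before allocating, using MPI_Probe together with MPI_Get_count. I used the fixed-size buffer above, so this is just a sketch of the alternative, meant to sit inside the same loop over irank:

// peek at the pending message from slave irank without receiving it yet
MPI_Status status;
MPI_Probe(irank, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

// ask how many floats are waiting, then allocate exactly that many
int n_vapor_particles;
MPI_Get_count(&status, MPI_FLOAT, &n_vapor_particles);
float * z_vels = new float[n_vapor_particles];

// now the receive is guaranteed to fit
MPI_Recv(z_vels, n_vapor_particles, MPI_FLOAT, irank, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

// ... histogram the values as before, then free the buffer ...
delete [] z_vels;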

3 Running MPI code

3.1 Learning about your MPI installation

The command ompi_info outputs a bunch of information about your local OpenMPI installation, most of which I don't know how to deal with. You can grep for some specific things, for example:

[anna@quaker test_mpi]$ ompi_info | grep "Open MPI:"
Open MPI:
[anna@quaker test_mpi]$ ompi_info | grep Prefix
Prefix: /opt/openmpi

3.2 Compiling in general

MPI code must be compiled with an MPI-specific compiler. For instance, the C++ compiler that's equivalent to g++ is mpic++. The MPI compilers can be used for non-MPI code too, so you can try a test compilation of your code before you even start adding MPI functionality. First, just try your usual compilation command, replacing g++ with mpic++. If you compile on the command-line,

[anna@quaker test_mpi]$ mpic++ myprogram.cpp -o myprogram.exe

or if you use a makefile, change the value of CXX in the file, e.g.,

# CXX=g++
CXX=mpic++

If that doesn't work, you may have to add the path to the compiler to your PATH. For instance, if you found in the previous section that Prefix: /opt/openmpi, then

[anna@quaker test_mpi]$ export PATH=$PATH:/opt/openmpi/bin

or add it in your ~/.bash_profile or ~/.bashrc files. You can also find the path using which:

[anna@quaker test_mpi]$ which mpic++
/opt/openmpi/bin/mpic++

If you have trouble with library linking at runtime, it may help to add this line to the makefile:

LDLIBSOPTIONS=$(shell mpic++ --showme:link)

3.3 Compiling on NERSC

The NERSC machines use the Portland Group compilers by default, instead of the GNU compilers (e.g., g++) or the Intel compilers (which I've never used before). To swap to the GNU compilers on hopper or franklin, type

nid00007 a/anna> module swap PrgEnv-pgi PrgEnv-gnu

and compile using CC instead of mpic++. Note that this implicitly uses MPICH instead of OpenMPI, so you have to comment out any OpenMPI-specific lines in your makefile, but otherwise it works without a hitch! To compile with g++ and OpenMPI on carver, type

carver% module swap pgi gcc
carver% module swap openmpi openmpi-gcc

then compile using mpic++ as usual. Different compilers have different strengths and weaknesses, so it may be worth your time to try out the PGI and Intel ones. To see what modules are currently loaded:

carver% module list
Currently Loaded Modulefiles:
  1) pgi/10.8   2) openmpi/

3.4 Running locally

Suppose you have a non-MPI executable called myprogram.exe that takes a single command-line argument myconfigfile.txt, such that you'd typically run the program using the command

[anna@quaker test_mpi]$ ./myprogram.exe myconfigfile.txt

Then to run an MPI version of this program with 5 local processes, simply type

[anna@quaker test_mpi]$ mpirun -np 5 myprogram.exe myconfigfile.txt

Replace the 5 in the number-of-processes flag -np 5 with the actual number of processes you want to run. To run with the valgrind memory debugger and profiler, type

[anna@quaker test_mpi]$ mpirun -np 5 valgrind myprogram.exe myconfigfile.txt

When running parallel jobs locally, your speed-up will be limited by the number of cores on your local machine. One way to find out how many cores you have is to type top and then hit 1. On quaker, the first few lines of the resulting display are something like

Tasks: 2143 total, 1 running, 2142 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 16.4%sy, 0.0%ni, 83.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.3%us, 16.0%sy, 0.0%ni, 83.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.7%us, 17.4%sy, 0.0%ni, 82.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.3%us, 17.3%sy, 0.0%ni, 82.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.3%us, 16.3%sy, 0.0%ni, 82.7%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
Cpu5 : 0.7%us, 17.4%sy, 0.0%ni, 81.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
Mem: k total, k used, k free, k buffers
Swap: k total, 6516k used, k free, k cached

which makes me think that the quaker interactive node has 6 cores. Another source of information about the number of cores is the file /proc/cpuinfo.

3.5 Submitting jobs on quaker, muesli, or lers

Although it's definitely possible to submit jobs by typing a qsub command on the command-line, it's easier in the long run to set up a submit script that keeps track of your flags. In the submit script below, all the flags (the things prefaced by #$) could be added to your command-line call if you really wanted to. There are lots of other flags out there, some of which I should probably be using; check them out by reading the qsub man page.

test_mpi]$ cat submit_mpi.sh
#!/bin/sh

# run this script by typing the following command,
# replacing $n with the actual number of processes:
# qsub -pe orte $n ./submit_mpi.sh

# use bash as your shell
#$ -S /bin/bash

# change this to your job name
#$ -N myjobname

# run from the current working directory
#$ -cwd

# don't join stdout and stderr
#$ -j n

# export environment variables
#$ -V

echo This job is being run on $(hostname --short)
echo Running $NSLOTS processes

# change this to your actual executable and arguments
mpirun -np $NSLOTS ./myprogram.exe myconfigfile.txt

This sample MPI submit script is very similar to a non-MPI submit script, but has the variable $NSLOTS, which doesn't appear to be initialized anywhere. $NSLOTS is actually an SGE built-in variable that's initialized by the -pe orte flag, which I've chosen to keep on the command line to make it easier to run different numbers of processes. So to submit a 5-process job to quaker using this submit script, type

[anna@quaker test_mpi]$ qsub -pe orte 5 ./submit_mpi.sh

The -pe orte flag also sets the parallel environment to the value orte. There are other options for the parallel environment, such as mpich, but orte worked best for me. The qconf utility is a good source of information about things like this:

[anna@quaker test_mpi]$ qconf -spl
make
mpi
mpich
orte
[anna@quaker test_mpi]$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

3.6 Submitting jobs on NERSC

NERSC is the supercomputing facility at LBNL. If you need more cores than are available on our group clusters, or just want your simulations to run much much much faster without changing a line of your code, NERSC is your best bet. The NERSC user info site has lots of information about how to use their system; spend a while looking around there, especially the computational systems and queues and policies sections, to decide if NERSC is right for you. Currently, we have hours on their clusters through the Joint Center for Artificial Photosynthesis (JCAP) project, and possibly through other projects; Phill should have a rough idea of the computational resources available to us. For JCAP, start by emailing Lin-Wang Wang <lwwang@lbl.gov> to get an account on the NERSC system and a budget of JCAP hours. There's a form you have to fill out and possibly fax, then a few webforms to click through.

Expect it to take a few days before you have ssh access to the clusters. Sign in with your username and password to see how many hours you have available. Your hours can be used on any of the NERSC computers (hopper, franklin, carver, etc). Each of these computers has different software, hardware, and queue configurations, so choose one that fits your needs. Carver looks nice on paper because some queues have long maximum walltimes, but there may be a usage surcharge for using carver, and my jobs spent days in the queue before running. Hopper ended up being the best answer for me: my code starts running sooner on hopper than on carver, and hopper uses global scratch whereas franklin has a separate scratch system.

You will probably want to write the output of your jobs to a scratch directory, either local ($SCRATCH) or global ($GSCRATCH) depending on what computer you're on. Your allocated disk space is much larger in scratch than in your home directory, and I/O is much faster. You can even use scratch as if it were your home directory, e.g., you can submit your jobs from scratch. Carver has gnuplot and hopper doesn't, so another benefit of using global scratch is that you can easily run jobs on hopper and then analyze them on carver. Note that your data won't be automatically backed up no matter what directory it's in, and files may even be purged periodically, so remember to back up your data to a safe place (one or more disks hosted by the group, or NERSC's storage system HPSS).

The NERSC systems use PBS/Torque instead of SGE/Rocks for queue management. The flags are similar to the SGE flags, but are prefixed by #PBS instead of #$, and I think the flags have to be the first thing in the submit script file (i.e., no comments before or among the flags). In the sample submit scripts below, replace -q debug or -q regular with your queue of choice; replace -l walltime=00:30:00 with your actual maximum walltime (format HH:MM:SS); and replace all the other names, numbers, and directory-handling commands with reasonable values. Also note that runtime library linking errors may be resolved by swapping to your correct compiler modules within your submit script, not by anything Google might suggest about changing your $LD_LIBRARY_PATH.

There are two main differences between submit scripts on hopper and carver. First, carver uses mpirun -np $nprocs, as on quaker, whereas hopper uses aprun -n $nprocs to do the same thing. Second, they use different syntax and criteria for deciding the number of cores allotted to your job, although both only allocate cores in multiples of the number of processors per node. On carver, there are 8 processors per node, so if you want 16 processors then use the flag -l nodes=2:ppn=8. Hopper has 24 processors per node, so the number in the -l mppwidth flag must be a multiple of 24. On both machines, you will be allocated and charged for the number of processors you request with this flag, which may be larger than the number you actually utilize with the mpirun -np or aprun -n commands (for example, the hopper script below requests mppwidth=144, six full nodes, but launches only 122 processes), so plan your processor use accordingly. Be warned: if you don't use this flag, all your processes will run on a single node, making your job painfully slow and possibly running it out of memory.

Here's a submit script for the debug queue on carver.

bash-3.2$ cat submit_carver_debug.pbs
#!/bin/bash
#PBS -S /bin/bash
#PBS -N debug_job_name
#PBS -j n
#PBS -V
#PBS -q debug
#PBS -l walltime=00:30:00
#PBS -l nodes=2:ppn=8
#PBS -M your_address@host.com
#PBS -m aeb

### SET THESE VARIABLES BY HAND ###
nprocs=16   # should match the -l nodes=xx:ppn=8 line above
output_prefix=debug_output
###################################

# output compute node and number of MPI processes
echo This job is being run on $(hostname --short)
echo $nprocs

# set up with correct modules for GNU compilers
# fixes runtime errors involving incorrect linking of
# libstdc++ and GLIBCXX libraries
module swap pgi gcc
module swap openmpi openmpi-gcc
module list

# set up to use scratch
output_path=$SCRATCH/$output_prefix
echo $output_path
if [ ! -d $output_path ]; then
    mkdir $output_path
else
    # assume the directory's content shouldn't already be there
    echo "Deleting existing data in $output_path"
    rm -r $output_path/*
fi

# move to current working directory (like -cwd flag in SGE)
cd $PBS_O_WORKDIR

# run job
mpirun -np $nprocs myprogram.exe myargs

And here's a submit script for the regular queue on hopper.

anna@hopper03:/global/scratch/sd/anna/production_output> cat submit_regular_hopper.pbs
#!/bin/bash
#PBS -S /bin/bash
#PBS -N regular_job_name
#PBS -j n
#PBS -V
#PBS -q regular
#PBS -l walltime=36:00:00
#PBS -M your_address@host.com
#PBS -m aeb
#PBS -l mppwidth=144

### SET VARIABLES BY HAND ###
nprocs=122
output_prefix=production_output
#############################

# output compute node and number of MPI processes
echo This job is being run on $(hostname --short)
echo $nprocs

# set up with correct modules for GNU compilers
# I didn't check whether this is necessary for hopper
# like it is for carver, but why not
module swap PrgEnv-pgi PrgEnv-gnu
module list

# set up to use global scratch
output_path=$GSCRATCH/$output_prefix
echo $output_path
if [ ! -d $output_path ]; then
    mkdir $output_path
else
    # assume the directory should already be there
    echo "directory found at $output_path, leaving it there"
fi

# run job
cd $PBS_O_WORKDIR
aprun -n $nprocs myprogram.exe myargs
