Simple examples of how to run MPI programs via PBS on the Taurus HPC cluster.

MPI setup

A number of MPI implementations are installed on the cluster. You can list, load, and unload them with the module command (module avail/load/list/unload):

    [me@ln01 ~]$ module avail
    ------------------------------- /opt/modulefiles -------------------------------
    gcc-4.8.1       intel-14.0.0    mpich-3.1.3     openmpi-1.8.2
    impi-4.1.1      mkl-14.0.0      mpich2-1.0      python-3.4.1

    [me@ln01 ~]$ module load openmpi-1.8.2
    [me@ln01 ~]$ module load python-3.4.1

    [me@ln01 ~]$ module list
    Currently Loaded Modulefiles:
      1) openmpi-1.8.2   2) python-3.4.1

    [me@ln01 ~]$ module unload python-3.4.1
    [me@ln01 ~]$ module unload openmpi-1.8.2

Which module to use is a matter of taste. The examples below use openmpi-1.8.2.

Job submission

The cluster currently uses the PBS/Torque batch queuing system. See the Torque docs for further information: http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml

For command line usage, you can simplify things by writing a script that sets up the environment and executes the command you wish to run, along with any PBS options you'd like to specify for resource allocation, etc.
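As a rough mental model, loading a module mostly amounts to prepending the package's directories to your environment. A manual sketch of what module load openmpi-1.8.2 does (the install prefix below is taken from the mpirun --prefix used later on this page, and is an assumption about this particular cluster, not official module behavior):

```shell
# Approximate effect of "module load openmpi-1.8.2": put the package's
# bin/ and lib/ directories at the front of the search paths.
# Prefix is an assumption based on the --prefix used elsewhere on this page.
export PATH=/opt/software/openmpi-1.8.2.gnu/bin:$PATH
export LD_LIBRARY_PATH=/opt/software/openmpi-1.8.2.gnu/lib:$LD_LIBRARY_PATH
# The MPI wrappers should now resolve from that prefix first:
echo "$PATH" | cut -d: -f1
```

After a real module load, `which mpicc` is a quick way to confirm which MPI installation is active.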
Here's a basic example that executes /bin/echo on one node and writes the string HELLO! to standard out.

hello_echo.pbs

    #PBS -l nodes=1:ppn=1
    #PBS -o hello_echo.out
    #PBS -e hello_echo.err
    /bin/echo "HELLO!"

PBS redirects stdout and stderr to the files named in the PBS directives (-o for stdout, -e for stderr). The default path is the current working directory. Submit the script for execution by logging into the head node (taurus.xao.ac.cn) and running qsub, which prints the new job's id:

    [me@taurus ~]$ qsub hello_echo.pbs
    129.taurus.xao.ac.cn

You can check on the status of your job with the qstat command:

    [me@taurus ~]$ qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    129.taurus                hello_echo.pbs   me                     0 Q batch

MPI examples

Here's an example which requests 72 processors on 6 nodes (make sure the $PATH for the MPI binary is accessible on all nodes): the script below requests 6 computational nodes with 12 processors each. The code below is a simple MPI program in which every process in the communicator reports its rank and host:

hostname_mpi.c

    /* program hello shows the process rank and hostname */
    /* using openmpi-1.8.2, by Hailong Zhang, 2015-1-10 */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>   /* for gethostname() */

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char hostname[256];
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(processor_name, &namelen);
        gethostname(hostname, 255);
        printf("hello world! I am process number: %d of %d on host %s with processor %s\n",
               rank, size, hostname, processor_name);
        MPI_Finalize();
        return 0;
    }

message_send_recv_mpi.c

    /* program jieli shows how to send and receive data between different hosts */
    /* using openmpi-1.8.2, by Hailong Zhang, 2015-1-14 */
    /* run with the following command and type a negative number such as -9 to stop:
     *   mpirun -hostfile die --prefix "/opt/software/openmpi-1.8.2.gnu/" -np 42 \
     *          -x "LD_LIBRARY_PATH=/opt/software/openmpi-1.8.2.gnu/lib" ./a.out
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value, size, namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(processor_name, &namelen);
        do {
            if (rank == 0) {
                fprintf(stderr, "\n Please give me a new value= ");
                scanf("%d", &value);
                fprintf(stderr, "\n\nprocess %d read <-<-<- (%d)\n\n\n", rank, value);
                if (size > 1) {
                    MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                    fprintf(stderr, "\nprocess %d (FROM %s) send (%d) ->-> to ->-> process %d\n",
                            rank, processor_name, value, rank + 1);
                }
            } else {
                MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
                fprintf(stderr, "\nprocess %d [FROM %s] receive (%d) <-<- from <-<- process %d\n",
                        rank, processor_name, value, rank - 1);
                if (rank < size - 1) {
                    MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                    fprintf(stderr, "\nprocess %d (FROM %s) send (%d) ->-> to ->-> process %d\n",
                            rank, processor_name, value, rank + 1);
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);
        } while (value > 0);
        MPI_Finalize();
        return 0;
    }

my_hello_mpi.c

    #include <mpi.h>
    #include <string.h>
    #include <stdio.h>

    int main(int argc, char* argv[])
    {
        int my_rank;           // rank of process
        int p;                 // number of processes
        int source;            // rank of sender
        int dest;              // rank of receiver
        int tag = 0;           // tag for messages (initialized; the original left it unset)
        char message[100];     // storage for messages
        MPI_Status status;     // return status for receive

        // Start up MPI
        MPI_Init(&argc, &argv);
        // Find out process rank
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        // Find out number of processes
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (my_rank != 0) {    // slave processes
            sprintf(message, "Greetings from process %d!", my_rank);
            dest = 0;
            // send the message to the master
            MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        } else {               // master process
            printf("i am the Master! My rank is %d.\n", my_rank);
            for (source = 1; source < p; source++) {
                // receive a message from source
                MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
                printf("%s\n", message);
            }
        }

        // Shut down MPI
        MPI_Finalize();
        return 0;
    }

processes_say_hello_to_each_other_mpi.c

    /* program shows how all processes say hello to each other,
     * using openmpi-1.8.2, by Hailong Zhang, 2015-1-15.
     * Run with the following command, with an -np number greater than 2:
     *   mpirun -hostfile die --prefix "/opt/software/openmpi-1.8.2.gnu/" -np 4 \
     *          -x "LD_LIBRARY_PATH=/opt/software/openmpi-1.8.2.gnu/lib" ./a.out
     * The contents of the hostfile "die" are:
     *   gpu01 slots=6
     *   gpu03 slots=10
     *   gpu16 slots=24
     * The slots number is how many cores a node can provide.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    void Hello(void);

    int main(int argc, char *argv[])
    {
        int me, namelen, size;
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            fprintf(stderr, "systest requires at least 2 processes\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Get_processor_name(processor_name, &namelen);
        fprintf(stderr, "process %d is alive on %s\n", me, processor_name);
        MPI_Barrier(MPI_COMM_WORLD);  // synchronize before the hello exchange
        Hello();
        MPI_Finalize();
        return 0;
    }

    void Hello(void)
    {
        int nproc, me;
        int type = 1;
        int buffer[2], node;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        if (me == 0) {
            printf("\n Hello test from all to all.\n");
            fflush(stdout);
        }
        for (node = 0; node < nproc; node++) {
            if (node != me) {
                buffer[0] = me;
                buffer[1] = node;
                MPI_Send(buffer, 2, MPI_INT, node, type, MPI_COMM_WORLD);
                MPI_Recv(buffer, 2, MPI_INT, node, type, MPI_COMM_WORLD, &status);
                if ((buffer[0] != node) || (buffer[1] != me)) {
                    fprintf(stderr, "Hello: %d != %d or %d != %d\n",
                            buffer[0], node, buffer[1], me);
                    printf("mismatch on hello process ids; node = %d\n", node);
                }
                printf("hello from %d to %d\n", me, node);
                fflush(stdout);
            }
        }
    }

random_mpi.c

    /* program shows random message send and receive with a random source
     * and random tag, and the use of status.MPI_SOURCE and status.MPI_TAG.
     * using openmpi-1.8.2, by Hailong Zhang, 2015-1-15.
     * Run with the following command, with an -np number greater than 2:
     *   mpirun -hostfile die --prefix "/opt/software/openmpi-1.8.2.gnu/" -np 4 \
     *          -x "LD_LIBRARY_PATH=/opt/software/openmpi-1.8.2.gnu/lib" ./a.out
     * The contents of the hostfile "die" are:
     *   gpu01 slots=6
     *   gpu03 slots=10
     *   gpu16 slots=24
     * The slots number is how many cores a node can provide.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size, i, buf[1];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) {
            for (i = 0; i < 100 * (size - 1); i++) {
                MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                printf("msg=%d from %d with tag %d\n",
                       buf[0], status.MPI_SOURCE, status.MPI_TAG);
            }
        } else {
            for (i = 0; i < 100; i++) {
                buf[0] = rank + i;
                MPI_Send(buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Now you can compile the program with mpicc:

    [me@taurus ~]$ mpicc hostname_mpi.c -o hostname_mpi

hostname_mpi.pbs
    #!/bin/sh
    #PBS -N hostname_mpi
    #PBS -l nodes=6:ppn=12
    #PBS -q batch
    #PBS -V
    #PBS -o hostname_mpi.o
    #PBS -e hostname_mpi.e
    nprocs=`wc -l < $PBS_NODEFILE`
    cd $PBS_O_WORKDIR
    mpirun --mca btl openib,self -np $nprocs -hostfile $PBS_NODEFILE ./hostname_mpi

The "--mca btl openib,self" parameter tells mpirun to use the InfiniBand network for message passing; "self" enables a process to communicate with itself. Now the program can be submitted to PBS for execution:

    [me@taurus ~]$ qsub hostname_mpi.pbs

Whatever was written to standard output or standard error can be found in your working directory, in the files hostname_mpi.o and hostname_mpi.e.

Tips & Tricks

PDSH is an incredibly useful tool for cluster-wide process management. It can execute any command on any node in the HPC cluster. For detailed usage see the man page (available on any host on Taurus):

    [me@taurus ~]$ module load pdsh-2.26
    [me@taurus ~]$ man pdsh

Some examples follow. To see your active processes on all the nodes (sorted by hostname):

    pdsh -R ssh -w gpu[01-16] "ps -ef | grep $USER" | sort -k 1
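The trailing sort -k 1 works because pdsh prefixes every line of remote output with the originating hostname and a colon, so sorting on the first field groups the lines by node. A quick local simulation of that output format (the uptime strings below are fabricated for illustration, no ssh involved):

```shell
# pdsh emits lines like "gpu03: <remote output>"; sorting on field 1
# groups them by host. Fake pdsh output for illustration:
{ echo "gpu03:  14:02:11 up 5 days,  load average: 0.10"
  echo "gpu01:  14:02:11 up 2 days,  load average: 0.03"
  echo "gpu02:  14:02:11 up 9 days,  load average: 0.74"
} | sort -k 1
```

The gpu01 line comes out first, then gpu02, then gpu03, regardless of the order in which the nodes replied.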
To see the current CPU load for all machines in the cluster (sorted by hostname):

    pdsh -R ssh -w gpu[01-16] 'uptime' | sort -k 1

Combine the above with the watch command to get a cluster-wide top (with a 5-second refresh):

    watch -n 5 "pdsh -R ssh -w gpu[01-16] uptime | sort -k 1"

Use the NVIDIA SMI tool to report GPU-related information for all GPUs:

    [me@taurus ~]$ pdsh -R ssh -w gpu[01-16] "nvidia-smi -q"

Just get the current GPU temperature for all devices in the cluster:

    [me@taurus ~]$ pdsh -R ssh -w gpu[01-16] "nvidia-smi -q -d 'TEMPERATURE'"

Or filter down to just the temperature readings themselves:

    [me@taurus ~]$ pdsh -R ssh -w gpu[01-16] "nvidia-smi -q -d 'TEMPERATURE' | grep Gpu"
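pdsh's gpu[01-16] syntax expands to the hosts gpu01 through gpu16. If pdsh isn't loaded, the same host list can be generated in plain shell, for example to drive an ssh loop; a minimal sketch:

```shell
# Expand the gpu[01-16] range by hand; seq -w zero-pads every number
# to the width of the largest, giving gpu01 ... gpu16.
for i in $(seq -w 1 16); do
    echo "gpu$i"     # replace echo with, e.g.: ssh "gpu$i" uptime
done
```

This prints 16 lines, gpu01 through gpu16, matching the set of nodes pdsh would target.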