Task farming on Blue Gene
Fiona J. L. Reid

July 3, 2006

Abstract

In this paper we investigate how to implement a trivial task farm on the EPCC eserver Blue Gene/L system, BlueSky. This is achieved by adding a small number of MPI calls to an existing serial code. We illustrate the method using example codes and demonstrate it to be successful by application to a real user code.
Contents

1 Introduction
2 IBM eserver Blue Gene
3 Implementing a trivial task farm on Blue Gene
  3.1 Encapsulate the serial code with MPI calls
4 Test cases - ClockModel code
5 Conclusions
6 Appendix
  6.1 Fortran 90 version of the serial test code
  6.2 Fortran 90 version of the serial test code with MPI calls added
  6.3 C version of the serial test code
  6.4 C version of the serial test code with MPI calls added
1 Introduction

Many serial codes are limited by the total CPU time that they require to run. Often the individual tasks are actually independent of one another and can therefore potentially be run simultaneously (in parallel) on different processors. This approach can greatly reduce the actual time required to obtain a scientific result. For example, consider a code which takes 1 hour to execute and requires 1000 runs to obtain a reliable solution. On a single processor this would require 1000 hours (42 days) of continuous runs. The same result could be obtained in just 1 hour if all the runs can be performed simultaneously using 1000 processors. Distributing the separate runs across many processors in such a way is known as task farming.

Trivial task farming (or job farming) is one of the most common forms of parallelism available. It relies on being able to decompose your problem into a number of identical but independent serial tasks. Essentially, each processor (or node) runs its own copy of the serial code with its own input file(s) and output file(s). There is no communication required between the processes. The trivial task farming method is particularly suited to examining large independent parameter spaces or large independent datasets. Provided all tasks complete at the same time there will be no load imbalance and linear scaling will be obtained. Trivial task farming can be very efficient and on many systems is relatively easy to implement.

For example, a Monte Carlo simulation would be a good candidate for the trivial task farm approach. In a Monte Carlo simulation the same model is typically run many times (with slightly different start points). This allows statistically significant summaries of the overall model behaviour to be built up. As each model takes approximately the same length of time to run, linear scaling will be attainable.
The main advantages and disadvantages of the trivial task farm approach are given below:

Advantages

- Generally easy to implement (on some systems it can be carried out via the batch system directly, e.g. lomond, or via a task farm harness, e.g. HPCx)
- Can be very efficient, provided tasks take the same length of time
- Linear scaling can be achieved
- Existing serial code can be used with minimal modification - in fact, in some situations no modifications to the serial code are required whatsoever
- No communication overheads
- User may not require detailed knowledge of MPI techniques

Disadvantages

- If tasks take different amounts of time then execution time will be governed by the slowest process
- Data/parameter space must be truly independent
- Not ideal for problems requiring communication between processes
- May restrict future code development - e.g. problem size will be limited to that which can fit on a single processor
2 IBM eserver Blue Gene

BlueSky is an IBM eserver Blue Gene/L system consisting of a single cabinet containing 1024 compute chips (nodes). Each compute node consists of a dual-core 700 MHz PowerPC 440 processor with 512 MB of RAM. A compute node can operate in two modes: Coprocessor (CO) mode or Virtual-Node (VN) mode. In Coprocessor mode one core handles communication whilst the other handles computation, with 512 MB main memory available to the compute core. The idea behind this is that it is possible for the programmer to overlap communications and computations and thus obtain optimal performance. In Virtual-Node mode both cores are used simultaneously for computation, with 256 MB main memory available to each core.

In addition to the compute nodes there are also dedicated I/O nodes. The BlueSky service is a relatively I/O rich system and is configured with one I/O node for every eight compute nodes. The compute nodes run a lightweight Linux-derived compute node kernel (CNK). The kernel offers only very limited functionality. The I/O nodes run a full Linux kernel. The rationale is to keep the compute nodes as uninterrupted by the operating system as possible by outsourcing the usual operating system tasks to dedicated additional hardware. For example, on BlueSky the compute nodes (in CO mode) can access 508 MB of the total 512 MB main memory, i.e. the CNK requires 4 MB. By comparison, a single 16-processor node of the HPCx [1, 2] system has 32 GB main memory; however, only 26.9 GB can be accessed by user code, with the rest being required by the operating system.

Finally, there are four front-end nodes which provide the user interface to BlueSky. The front-end nodes consist of an IBM eserver BladeCenter JS20 with 4 blades. The front-end nodes run SUSE Linux and can be used for editing, compilation and job submission. Further details of the BlueSky system can be found at [3].
For the purposes of performing a trivial task farm, users can think of the system as either up to 1024 processors each with 512 MB main memory (CO mode) or up to 2048 processors each with 256 MB main memory (VN mode).

3 Implementing a trivial task farm on Blue Gene

Ideally we would like to run multiple copies (one copy per processor) of a serial code simultaneously, with each copy capable of accessing its own input/output file(s). On many high performance computing (HPC) systems (e.g. lomond [4, 5], various Linux clusters) the batch system can be used to execute multiple serial executables simultaneously, with each running on a different processor. Unfortunately, this is not possible on either the HPCx or Blue Gene systems. Both HPCx and Blue Gene use the IBM scheduling software, LoadLeveler, which does not allow more than one executable to be run simultaneously [6]. On the HPCx system this problem was overcome by using a task farm harness code which allows users to run multiple copies of a serial code with different input/output files without any modification to the serial code. Essentially, the task farm harness code consists of an MPI wrapper code which invokes the serial executable by using the system() function/subroutine, e.g.
in Fortran:

   call system("./serialexename")

or in C:

   int retcode;
   retcode = system("./serialexename");

would run the serial executable serialexename. Due to the reduced operating system installed on the compute nodes, the Blue Gene system does not allow calls to system on the compute nodes (backend). This means that the task farm harness code cannot be used and therefore another method of invoking the serial code must be found. As a result, all of the methods considered in this paper will require some modifications to the serial code.

In testing the different methods of implementing a trivial task farm on Blue Gene we make the following assumptions:

1. The user has an existing serial code which runs on a single Blue Gene node
2. The memory requirements of the serial code do not exceed 512 MB (CO mode) or 256 MB (VN mode)
3. The serial code can have both input and output file(s) or parameter sets
4. The file unit numbers (Fortran) are declared as variables within the serial code. If the file unit numbers are hard-wired then the serial code should be amended and tested prior to the addition of any MPI calls.

To simulate such a problem a simple test code has been written. The test code performs some simple statistical computations on an input dataset. The input data set consists of a vector of data of length nmax. The output file contains the statistics (mean and standard deviation) as computed from the input data. The full source for the test code is given in the Appendix.

Several different approaches to implementing a task farm on Blue Gene are investigated:

1. Encapsulate the serial code with MPI calls
2. Place the serial code inside a function and call this function from an MPI template code

Both these approaches require very careful consideration of how file I/O is handled.

3.1 Encapsulate the serial code with MPI calls

For this approach an existing serial code is encapsulated with MPI calls. The encapsulated code will then be able to run on any number of processors.
Input/output files require careful consideration. Essentially the procedure for a Fortran code is as follows:

Add

   include "mpif.h"

directly after the implicit none statement.
Add the following block of code directly after all type declarations:

   ! MPI related declarations
   integer :: errcode, rank

   ! New declarations to handle input/output files
   character (len=4) :: dir_id

   ! Initialise MPI
   call MPI_INIT(errcode)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, errcode)

Add

   call MPI_FINALIZE(errcode)

directly before the end program statement.

These changes will enable a copy of the serial code to run on each processor simultaneously. However, the input and output file names still require further consideration. Without further modification each processor will attempt to open the file taskfarm_data.dat and will attempt to write output to the file taskfarm_results.output. Clearly, this would result in the same input file being read in by all processors when in fact the user may wish a different file to be read in on each processor. Additionally, as all processors will attempt to write to the same output file, output generated on one processor could potentially be over-written by another processor. Therefore, some method of distinguishing which files are read from/written to by each processor is required.

Probably the simplest way to achieve this is to place the input/output files in directories which are labelled in accordance with their MPI rank (or label the files in accordance with their MPI rank). The procedure for doing this for a Fortran code is as follows. We use the MPI rank to define a character variable, dir_id, which will be used to determine the directory which contains the input/output files for a particular process, e.g.

   write(dir_id,'(i4.4)') rank

This statement should be executed after the call to MPI_INIT and before any file open statements. As each MPI process has its own copy of the input/output file unit numbers, iounit_in and iounit_out, we do not need to change the file unit numbers (or file pointers for C/C++ codes); we only need to change the file names.
We then modify all references to the input/output file names within the code so that they are preceded by "dir"//dir_id// with an additional / directly before the original file name, e.g.

   open(unit=iounit_in, file = "dir"//dir_id//"/taskfarm_data.dat")

Finally, before running the code you will need to ensure that the correct number of dir???? directories have been created and that the relevant input files are placed inside these directories. For example, if 4 processors are used, then the following directories need to be created prior to running the code: dir0000, dir0001, dir0002 and dir0003. The relevant input file(s) also need to be copied/moved into the relevant directory. A Unix shell script could be used to achieve this. It may also be possible to achieve this via the LoadLeveler batch script.

A full version of the modified code is contained in the Appendix. It should be noted that the main body of the serial code remains completely unchanged. With the exception of the file name specific modifications, all modifications occur at the beginning and end of the code.

If the user doesn't want to place the input/output files into separate directories then the input and output filenames can simply be appended with the rank as follows:

   open(unit=iounit_in, file = "taskfarm_data"//dir_id//".dat")

C or C++ codes can also be treated in a similar manner. A simple C example (testcode_serial.c) and a corresponding modified code containing the required MPI calls (testcode_serial_MPIwrapper.c), which allow the code to be run on several processors, are included in the Appendix for reference.

4 Test cases - ClockModel code

To investigate the ease of applying this approach we have tested the method described in this paper on a real user supplied code. The code is a serial C code which models the biological clock of plants and was supplied by Professor Andrew Millar, University of Edinburgh. The serial code exists as a number of source files (*.c) and a single header file (*.h) to which all the source files refer. The serial code reads in a number of input files (up to 4) and writes to 3 output files. One of these output files is used for both input and output as the code executes (e.g. a solution is written out and subsequently re-read). The input and output files are opened from a number of different source files and therefore careful consideration of variable scope is required. The procedure used to implement a trivial task farm on the ClockModel code was very similar to that described in Section 3.1.
The only additional complication arose from the fact that the input/output files are opened from both the main program and other functions outwith the scope of the main program. This means that the character variable (char dir_id[5] in the sample code, allowing for the null terminator) used to control the output directory needs to be in global scope. This can be achieved by specifying this variable as an external variable within the header file (e.g. extern char dir_id[]) and then defining the variable prior to main(), e.g.

   #include "headerfile.h"

   char dir_id[5];

   int main(int argc, char* argv[])
   ...

We have successfully tested the trivial task farm on BlueSky by verifying that the same (or similar) results are obtained on all processors when using identical input files on each processor. Due to the nature of the code identical results cannot be obtained, as a random number generator is used.
5 Conclusions

Trivial task farming of a serial code can be performed relatively easily on the BlueSky machine, allowing users to utilise a large number of processors simultaneously. Minimal modification to an existing serial code is required and no detailed MPI knowledge is needed. The method has been tested successfully on a real user application.

Acknowledgements

We would like to acknowledge the following for their support and assistance: Mark Bull and Joachim Hein.

References

[1] User's Guide to the HPCx Service (Version 2.02)
[2] HPCx web page
[3] User Guide to EPCC's BlueGene/L Service (Version 1.0), bgapps/userguide/bguser/bguser.html
[4] Introduction to the University of Edinburgh HPC Service (Version 3.00)
[5] Lomond web page
[6] S. Kannan, P. Mayes, M. Roberts, D. Brelsford, and J. F. Skovira (2001). Workload Management with LoadLeveler, IBM Corp, SG
6 Appendix

6.1 Fortran 90 version of the serial test code

! Serial code which is used to test various ways of performing a trivial
! task farm on Blue Gene. The code reads in a vector of data from the input
! file with unit iounit_in and writes out the mean and standard deviation to
! the output file with unit iounit_out.
! This test code attempts to simulate a typical user code that may be
! appropriate for trivial task farming.

program testcode_serial
  implicit none

  integer, parameter :: nmax = 10
  real, dimension(nmax) :: adata
  real :: stddev = 0.0, mean = 0.0
  integer :: i
  integer :: iounit_in, iounit_out

  iounit_in = 10    ! Input data
  iounit_out = 11   ! Output data

  ! Open input and output data files
  open(unit=iounit_in, file = "taskfarm_data.dat")
  open(unit=iounit_out, file = "taskfarm_results.output")

  ! Read in input data from file with unit number iounit_in
  do i = 1, nmax
     read(iounit_in,*,err=100) adata(i)
  end do
100 continue
  write(*,*) "total number of points read from file = ", i-1

  ! Close input file
  close(iounit_in)

  ! Compute mean and standard deviation
  mean = sum(adata)/nmax
  do i = 1, nmax
     stddev = stddev + (adata(i) - mean)**2.0d0
  end do
  stddev = sqrt(stddev/nmax)

  ! Write results to output file with unit number iounit_out
  write(iounit_out,101) "standard deviation = ", stddev, " mean = ", mean
101 format(a21,x,f10.3,a8,x,f10.3)

  ! Close output file
  close(iounit_out)

end program testcode_serial

6.2 Fortran 90 version of the serial test code with MPI calls added

! Serial code with MPI calls inserted which will be used to perform a trivial
! task farm on Blue Gene. The code reads in a vector of data from the input
! file with unit iounit_in and writes out the mean and standard deviation to
! the output file with unit iounit_out.
! This test code attempts to simulate a typical user code that may be
! appropriate for trivial task farming and includes the necessary MPI calls
! in order to run the code on multiple processors.

program testcode_serial_mpiwrapper
  implicit none
  include "mpif.h"

  integer, parameter :: nmax = 10
  real, dimension(nmax) :: adata
  real :: stddev = 0.0, mean = 0.0
  integer :: i
  integer :: iounit_in, iounit_out

  ! MPI related declarations
  integer :: errcode, rank

  ! New declarations to handle input/output files
  character (len=4) :: dir_id

  iounit_in = 10    ! Input data
  iounit_out = 11   ! Output data

  ! Initialise MPI
  call MPI_INIT(errcode)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, errcode)

  ! Use the rank to define the directory name
  write(dir_id,'(i4.4)') rank

  ! Open input and output data files in directory dir****. The value
  ! of "****" is determined by the rank of the process
  open(unit=iounit_in, file = "dir"//dir_id//"/taskfarm_data.dat")
  open(unit=iounit_out, file = "dir"//dir_id//"/taskfarm_results.output")

  ! Read in input data from file with unit number iounit_in
  do i = 1, nmax
     read(iounit_in,*,err=100) adata(i)
  end do
100 continue
  write(*,*) "total number of points read from file = ", i-1

  ! Close input file
  close(iounit_in)

  ! Compute mean and standard deviation
  mean = sum(adata)/nmax
  do i = 1, nmax
     stddev = stddev + (adata(i) - mean)**2.0d0
  end do
  stddev = sqrt(stddev/nmax)

  ! Write results to output file with unit number iounit_out
  write(iounit_out,101) "standard deviation = ", stddev, " mean = ", mean
101 format(a21,x,f10.3,a8,x,f10.3)

  ! Close output file
  close(iounit_out)

  ! Finalise MPI
  call MPI_FINALIZE(errcode)

end program testcode_serial_mpiwrapper

6.3 C version of the serial test code

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define nmax 10

int main()
{
    float adata[nmax];
    float stddev = 0.0, mean = 0.0;
    int i;
    int count = 0;
    char fnamein[100], fnameout[100];
    FILE *fpin, *fpout;

    /* Open input and output files */
    sprintf(fnamein, "taskfarm_data.dat");
    if (NULL == (fpin = fopen(fnamein, "r"))) {
        fprintf(stderr, "Cannot open <%s>\n", fnamein);
        exit(-1);
    }
    sprintf(fnameout, "taskfarm_results.output");
    if (NULL == (fpout = fopen(fnameout, "w"))) {
        fprintf(stderr, "Cannot open <%s>\n", fnameout);
        exit(-1);
    }

    /* Read in input data from file with file pointer fpin */
    for (i = 0; i < nmax; i++) {
        fscanf(fpin, "%f", &adata[i]);
        count = count + 1;
    }
    printf("total number of points read from file = %d \n", count);

    /* Close the input file */
    fclose(fpin);

    /* Compute mean and standard deviation */
    for (i = 0; i < nmax; i++) {
        mean = mean + adata[i];
    }
    mean = mean/nmax;

    for (i = 0; i < nmax; i++) {
        stddev = stddev + (adata[i] - mean)*(adata[i] - mean);
    }
    stddev = sqrt(stddev/nmax);

    /* Write results to output file */
    fprintf(fpout, "standard deviation =%8.3f, mean = %8.3f \n", stddev, mean);

    /* Close output file */
    fclose(fpout);

    return 0;
}

6.4 C version of the serial test code with MPI calls added

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <mpi.h>
#define nmax 10

int main(int argc, char *argv[])
{
    float adata[nmax];
    float stddev = 0.0, mean = 0.0;
    int i;
    int count = 0;
    char fnamein[100], fnameout[100];
    FILE *fpin, *fpout;

    /* MPI related declarations */
    int rank;

    /* New declarations to handle input/output files */
    char dir_id[5];

    /* Initialise MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Create character variable dir_id to control input/output directory */
    sprintf(dir_id, "%4.4d", rank);

    /* Open input and output files */
    sprintf(fnamein, "dir%s/taskfarm_data.dat", dir_id);
    if (NULL == (fpin = fopen(fnamein, "r"))) {
        fprintf(stderr, "Cannot open <%s>\n", fnamein);
        exit(-1);
    }

    sprintf(fnameout, "dir%s/taskfarm_results.output", dir_id);
    if (NULL == (fpout = fopen(fnameout, "w"))) {
        fprintf(stderr, "Cannot open <%s>\n", fnameout);
        exit(-1);
    }

    /* Read in input data from file with file pointer fpin */
    for (i = 0; i < nmax; i++) {
        fscanf(fpin, "%f", &adata[i]);
        count = count + 1;
    }
    printf("total number of points read from file = %d \n", count);
    /* Close the input file */
    fclose(fpin);

    /* Compute mean and standard deviation */
    for (i = 0; i < nmax; i++) {
        mean = mean + adata[i];
    }
    mean = mean/nmax;

    for (i = 0; i < nmax; i++) {
        stddev = stddev + (adata[i] - mean)*(adata[i] - mean);
    }
    stddev = sqrt(stddev/nmax);

    /* Write results to output file */
    fprintf(fpout, "standard deviation =%8.3f, mean = %8.3f \n", stddev, mean);

    /* Close output file */
    fclose(fpout);

    /* Finalise MPI */
    MPI_Finalize();

    return 0;
}
More informationUNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Fall Programming Assignment 1 (updated 9/16/2017)
UNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Fall 2017 Programming Assignment 1 (updated 9/16/2017) Introduction The purpose of this programming assignment is to give you
More informationOpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 16, 2011 CPD (DEI / IST) Parallel and Distributed Computing 18
More informationMPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018
MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,
More informationProgramming with MPI. Pedro Velho
Programming with MPI Pedro Velho Science Research Challenges Some applications require tremendous computing power - Stress the limits of computing power and storage - Who might be interested in those applications?
More informationCSE 374 Programming Concepts & Tools
CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 8 C: Miscellanea Control, Declarations, Preprocessor, printf/scanf 1 The story so far The low-level execution model of a process (one
More informationTHE UNIVERSITY OF WESTERN ONTARIO. COMPUTER SCIENCE 211a FINAL EXAMINATION 17 DECEMBER HOURS
Computer Science 211a Final Examination 17 December 2002 Page 1 of 17 THE UNIVERSITY OF WESTERN ONTARIO LONDON CANADA COMPUTER SCIENCE 211a FINAL EXAMINATION 17 DECEMBER 2002 3 HOURS NAME: STUDENT NUMBER:
More informationHolland Computing Center Kickstart MPI Intro
Holland Computing Center Kickstart 2016 MPI Intro Message Passing Interface (MPI) MPI is a specification for message passing library that is standardized by MPI Forum Multiple vendor-specific implementations:
More informationMPI 1. CSCI 4850/5850 High-Performance Computing Spring 2018
MPI 1 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives
More informationParallel Computing: Overview
Parallel Computing: Overview Jemmy Hu SHARCNET University of Waterloo March 1, 2007 Contents What is Parallel Computing? Why use Parallel Computing? Flynn's Classical Taxonomy Parallel Computer Memory
More informationEarly experience with Blue Gene/P. Jonathan Follows IBM United Kingdom Limited HPCx Annual Seminar 26th. November 2007
Early experience with Blue Gene/P Jonathan Follows IBM United Kingdom Limited HPCx Annual Seminar 26th. November 2007 Agenda System components The Daresbury BG/P and BG/L racks How to use the system Some
More informationParallel Programming Overview
Parallel Programming Overview Introduction to High Performance Computing 2019 Dr Christian Terboven 1 Agenda n Our Support Offerings n Programming concepts and models for Cluster Node Core Accelerator
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationProgramming with MPI on GridRS. Dr. Márcio Castro e Dr. Pedro Velho
Programming with MPI on GridRS Dr. Márcio Castro e Dr. Pedro Velho Science Research Challenges Some applications require tremendous computing power - Stress the limits of computing power and storage -
More informationLesson 1. MPI runs on distributed memory systems, shared memory systems, or hybrid systems.
The goals of this lesson are: understanding the MPI programming model managing the MPI environment handling errors point-to-point communication 1. The MPI Environment Lesson 1 MPI (Message Passing Interface)
More informationPGAS: Partitioned Global Address Space
.... PGAS: Partitioned Global Address Space presenter: Qingpeng Niu January 26, 2012 presenter: Qingpeng Niu : PGAS: Partitioned Global Address Space 1 Outline presenter: Qingpeng Niu : PGAS: Partitioned
More informationParallel Performance of the XL Fortran random_number Intrinsic Function on Seaborg
LBNL-XXXXX Parallel Performance of the XL Fortran random_number Intrinsic Function on Seaborg Richard A. Gerber User Services Group, NERSC Division July 2003 This work was supported by the Director, Office
More informationBIL 104E Introduction to Scientific and Engineering Computing. Lecture 14
BIL 104E Introduction to Scientific and Engineering Computing Lecture 14 Because each C program starts at its main() function, information is usually passed to the main() function via command-line arguments.
More informationFaculty of Electrical and Computer Engineering Department of Electrical and Computer Engineering Program: Computer Engineering
Faculty of Electrical and Computer Engineering Department of Electrical and Computer Engineering Program: Computer Engineering Course Number EE 8218 011 Section Number 01 Course Title Parallel Computing
More informationIntroduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc.
Introduction to MPI SHARCNET MPI Lecture Series: Part I of II Paul Preney, OCT, M.Sc., B.Ed., B.Sc. preney@sharcnet.ca School of Computer Science University of Windsor Windsor, Ontario, Canada Copyright
More informationBuilding Library Components That Can Use Any MPI Implementation
Building Library Components That Can Use Any MPI Implementation William Gropp Mathematics and Computer Science Division Argonne National Laboratory Argonne, IL gropp@mcs.anl.gov http://www.mcs.anl.gov/~gropp
More informationFINAL TERM EXAMINATION SPRING 2010 CS304- OBJECT ORIENTED PROGRAMMING
FINAL TERM EXAMINATION SPRING 2010 CS304- OBJECT ORIENTED PROGRAMMING Question No: 1 ( Marks: 1 ) - Please choose one Classes like TwoDimensionalShape and ThreeDimensionalShape would normally be concrete,
More informationPROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18
PROCESS VIRTUAL MEMORY CS124 Operating Systems Winter 2015-2016, Lecture 18 2 Programs and Memory Programs perform many interactions with memory Accessing variables stored at specific memory locations
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationFile IO and command line input CSE 2451
File IO and command line input CSE 2451 File functions Open/Close files fopen() open a stream for a file fclose() closes a stream One character at a time: fgetc() similar to getchar() fputc() similar to
More informationOverview Interactive Data Language Design of parallel IDL on a grid Design of IDL clients for Web/Grid Service Status Conclusions
GRIDL: High-Performance and Distributed Interactive Data Language Svetlana Shasharina, Ovsei Volberg, Peter Stoltz and Seth Veitzer Tech-X Corporation HPDC 2005, July 25, 2005 Poster Overview Interactive
More informationProgramming Techniques for Supercomputers. HPC RRZE University Erlangen-Nürnberg Sommersemester 2018
Programming Techniques for Supercomputers HPC Services @ RRZE University Erlangen-Nürnberg Sommersemester 2018 Outline Login to RRZE s Emmy cluster Basic environment Some guidelines First Assignment 2
More informationIBM PSSC Montpellier Customer Center. Content
Content IBM PSSC Montpellier Customer Center Standard Tools Compiler Options GDB IBM System Blue Gene/P Specifics Core Files + addr2line Coreprocessor Supported Commercial Software TotalView Debugger Allinea
More informationCS Operating Systems Lab 3: UNIX Processes
CS 346 - Operating Systems Lab 3: UNIX Processes Due: February 15 Purpose: In this lab you will become familiar with UNIX processes. In particular you will examine processes with the ps command and terminate
More informationNIOS CPU Based Embedded Computer System on Programmable Chip
1 Objectives NIOS CPU Based Embedded Computer System on Programmable Chip EE8205: Embedded Computer Systems This lab has been constructed to introduce the development of dedicated embedded system based
More informationContents. Chapter 1 Overview of the JavaScript C Engine...1. Chapter 2 JavaScript API Reference...23
Contents Chapter 1 Overview of the JavaScript C Engine...1 Supported Versions of JavaScript...1 How Do You Use the Engine?...2 How Does the Engine Relate to Applications?...2 Building the Engine...6 What
More informationPractical Introduction to Message-Passing Interface (MPI)
1 Outline of the workshop 2 Practical Introduction to Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Theoretical / practical introduction Parallelizing your
More informationAssignment 3 MPI Tutorial Compiling and Executing MPI programs
Assignment 3 MPI Tutorial Compiling and Executing MPI programs B. Wilkinson: Modification date: February 11, 2016. This assignment is a tutorial to learn how to execute MPI programs and explore their characteristics.
More informationCompute Cluster Server Lab 2: Carrying out Jobs under Microsoft Compute Cluster Server 2003
Compute Cluster Server Lab 2: Carrying out Jobs under Microsoft Compute Cluster Server 2003 Compute Cluster Server Lab 2: Carrying out Jobs under Microsoft Compute Cluster Server 20031 Lab Objective...1
More informationHello, World! in C. Johann Myrkraverk Oskarsson October 23, The Quintessential Example Program 1. I Printing Text 2. II The Main Function 3
Hello, World! in C Johann Myrkraverk Oskarsson October 23, 2018 Contents 1 The Quintessential Example Program 1 I Printing Text 2 II The Main Function 3 III The Header Files 4 IV Compiling and Running
More informationSupercomputing in Plain English
Supercomputing in Plain English An Introduction to High Performance Computing Part VI: Distributed Multiprocessing Henry Neeman, Director The Desert Islands Analogy Distributed Parallelism MPI Outline
More informationACEnet for CS6702 Ross Dickson, Computational Research Consultant 29 Sep 2009
ACEnet for CS6702 Ross Dickson, Computational Research Consultant 29 Sep 2009 What is ACEnet? Shared resource......for research computing... physics, chemistry, oceanography, biology, math, engineering,
More informationFractals exercise. Investigating task farms and load imbalance
Fractals exercise Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationName :. Roll No. :... Invigilator s Signature : INTRODUCTION TO PROGRAMMING. Time Allotted : 3 Hours Full Marks : 70
Name :. Roll No. :..... Invigilator s Signature :.. 2011 INTRODUCTION TO PROGRAMMING Time Allotted : 3 Hours Full Marks : 70 The figures in the margin indicate full marks. Candidates are required to give
More informationMPI Mechanic. December Provided by ClusterWorld for Jeff Squyres cw.squyres.com.
December 2003 Provided by ClusterWorld for Jeff Squyres cw.squyres.com www.clusterworld.com Copyright 2004 ClusterWorld, All Rights Reserved For individual private use only. Not to be reproduced or distributed
More informationPCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.
PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed
More informationSharpen Exercise: Using HPC resources and running parallel applications
Sharpen Exercise: Using HPC resources and running parallel applications Contents 1 Aims 2 2 Introduction 2 3 Instructions 3 3.1 Log into ARCHER frontend nodes and run commands.... 3 3.2 Download and extract
More informationC Compilation Model. Comp-206 : Introduction to Software Systems Lecture 9. Alexandre Denault Computer Science McGill University Fall 2006
C Compilation Model Comp-206 : Introduction to Software Systems Lecture 9 Alexandre Denault Computer Science McGill University Fall 2006 Midterm Date: Thursday, October 19th, 2006 Time: from 16h00 to 17h30
More informationCSE 303 Midterm Exam
CSE 303 Midterm Exam October 29, 2008 Name Sample Solution The exam is closed book, except that you may have a single page of hand written notes for reference. If you don t remember the details of how
More informationPROGRAMMAZIONE I A.A. 2017/2018
PROGRAMMAZIONE I A.A. 2017/2018 FUNCTIONS INTRODUCTION AND MAIN All the instructions of a C program are contained in functions. üc is a procedural language üeach function performs a certain task A special
More informationPreview from Notesale.co.uk Page 6 of 52
Binary System: The information, which it is stored or manipulated by the computer memory it will be done in binary mode. RAM: This is also called as real memory, physical memory or simply memory. In order
More informationKilling Zombies, Working, Sleeping, and Spawning Children
Killing Zombies, Working, Sleeping, and Spawning Children CS 333 Prof. Karavanic (c) 2015 Karen L. Karavanic 1 The Process Model The OS loads program code and starts each job. Then it cleans up afterwards,
More informationCpSc 1010, Fall 2014 Lab 10: Command-Line Parameters (Week of 10/27/2014)
CpSc 1010, Fall 2014 Lab 10: Command-Line Parameters (Week of 10/27/2014) Goals Demonstrate proficiency in the use of the switch construct and in processing parameter data passed to a program via the command
More informationMessage Passing Interface (MPI)
CS 220: Introduction to Parallel Computing Message Passing Interface (MPI) Lecture 13 Today s Schedule Parallel Computing Background Diving in: MPI The Jetson cluster 3/7/18 CS 220: Parallel Computing
More informationDocker task in HPC Pack
Docker task in HPC Pack We introduced docker task in HPC Pack 2016 Update1. To use this feature, set the environment variable CCP_DOCKER_IMAGE of a task so that it could be run in a docker container on
More informationIntroduction to parallel computing concepts and technics
Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationIntroduction to Parallel Programming Message Passing Interface Practical Session Part I
Introduction to Parallel Programming Message Passing Interface Practical Session Part I T. Streit, H.-J. Pflug streit@rz.rwth-aachen.de October 28, 2008 1 1. Examples We provide codes of the theoretical
More informationLecture 7: Distributed memory
Lecture 7: Distributed memory David Bindel 15 Feb 2010 Logistics HW 1 due Wednesday: See wiki for notes on: Bottom-up strategy and debugging Matrix allocation issues Using SSE and alignment comments Timing
More informationCS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco
CS 326 Operating Systems C Programming Greg Benson Department of Computer Science University of San Francisco Why C? Fast (good optimizing compilers) Not too high-level (Java, Python, Lisp) Not too low-level
More informationCS342 - Spring 2019 Project #3 Synchronization and Deadlocks
CS342 - Spring 2019 Project #3 Synchronization and Deadlocks Assigned: April 2, 2019. Due date: April 21, 2019, 23:55. Objectives Practice multi-threaded programming. Practice synchronization: mutex and
More informationNAG Library Function Document nag_dtr_load (f16qgc)
1 Purpose NAG Library Function Document nag_dtr_load () nag_dtr_load () initializes a real triangular matrix. 2 Specification #include #include void nag_dtr_load (Nag_OrderType order,
More information