ITNPBD7 Cluster Computing Spring Using Condor

The aim of this practical is to work through the Condor examples demonstrated in the lectures and adapt them to alternative tasks.

Before we start, you will need to map a network drive to \\wsv.cs.stir.ac.uk\datasets and then copy the directory condor from this network drive to the top-level directory on your H drive (the rest of this sheet assumes that you have kept the name condor for this directory).

In order to submit Condor jobs, we need to be logged in to the Condor submission host, which is called pound. Connect to pound using the PuTTY terminal client. This is the same application you have used on the lab PCs to connect to the Hadoop cluster, but this time you should create a different connection for the machine pound.cs.stir.ac.uk using port 541. When you log in to pound you will start in a home directory (/home/<username>) on this machine that is mapped to the same location as your Windows file store. The examples that follow use the username dec, but you should substitute your own username in its place.

We can start by confirming our initial login directory. To check which directory we are currently in on Unix or Linux, use the pwd command in the PuTTY terminal window. For the user dec, we should expect to get back the result:

    /home/dec

You should now also be able to check that all the relevant example files were copied to your main CS file store correctly by typing ls condor. You should see the following output if they are in the correct place:

    seqcount  tspga  wordcount  hello  ipcheck

These correspond to the 5 sub-directories that contain the Condor examples we will be using in this practical.

Computing Science & Mathematics

Hello World in Java

We will begin by submitting the basic Hello World Condor jobs for Java, Python and C. Each of these examples is in a separate sub-directory of the hello directory. We will try the Java example first and then look at the differences between this submission job and the Python and C jobs. Move to the Java hello example by typing:

    cd /home/<username>/condor/hello/hello.j

Confirm that you are in the right directory by typing pwd and check that all the relevant files are in your current directory using the ls command. The ls command should produce the following listing:

    Hello.jar  sub-hello.txt

You can view the contents of text files using the command more <filename>. In this case we can look at the contents of the submission file sub-hello.txt using the command more sub-hello.txt, which should produce the output:

    universe   = java
    executable = Hello.jar
    arguments  = Hello
    jar_files  = Hello.jar
    log        = hello.log
    error      = hello.err
    output     = hello-out.txt
    Queue 1

This is similar to the example seen in the lecture. The points to note from this example are, in entry order:

1. It uses the Java universe.
2. The executable to run is Hello.jar.
3. The initial class that will be run from the executable archive is called Hello (denoted by the arguments field). In this example there are no further arguments to be passed to the Hello class (our args array in main will be empty).
4. The Java files we need are contained in the Java archive Hello.jar (in some cases we may need to specify more than one jar file).
5. Submission log data for all the queued jobs will be saved in hello.log.
6. Error log data for all the queued jobs will be saved in hello.err.
7. Any console output produced from our job via System.out.println will be saved to hello-out.txt.
8. We will queue a single job.

Before we submit this job, we will take a quick look at the status of the Condor pool to check that there are spare nodes free.
To do this, type condor_status and review the output at the end. If you want to see it a page at a time, use condor_status | more, which will pipe the output of the condor_status command through to the more command and allow you to use the Spacebar to see a page of output at a time. Although the numbers might vary a little, the status output should look similar to the following (excluding the Preempting and Backfill columns):

                     Total  Owner  Claimed  Unclaimed  Matched
    X86_64/LINUX        72      0       16         56        0
    X86_64/WINDOWS     396     32       20        344        0
    Total              468     32       36        400        0

We can see from this output that there are 400 unclaimed nodes in the cluster, of which 56 are Linux nodes and 344 are Windows nodes. Since our Java job does not specify a particular OS and is able to run on either node type, it could run on any of the 400 nodes. Given that we only need 1 node for our single queued job, this is not important here. In reality it is quite possible that you will have much larger jobs and that only 40 or so nodes may be free (the Linux nodes tend to be the most contested since they can reliably run longer jobs without interruption).

To submit our hello world job, we use the command condor_submit <submit file>. In this case try typing:

    condor_submit sub-hello.txt

This should produce output similar to the following:

    Submitting job(s).
    1 job(s) submitted to cluster 7903.

You can check on the status of your job queue by typing condor_q <username> (e.g. condor_q dec), although you would need to be quick in the above example since the job is likely to be queued and run very quickly. You can also omit the <username> option, in which case you will see the jobs that all users have currently submitted to the queue. If you check the queue and see output similar to the following:

    -- Submitter: pound0.cs.stir.ac.uk : <139.153.252.220:47347> : pound0.cs.stir.ac.uk
     ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD

    0 jobs; 0 idle, 0 running, 0 held

this indicates that your job has completed.
If you now type ls -l you should see a more detailed listing and should observe that there are 3 new files in your directory with modification dates matching today. These are the 3 output files we specified in our submission file. You can quickly view their contents using the more command. If all has gone to plan, the file hello-out.txt will contain the words Hello World, the file hello.log will contain details about the job submission, execution and termination phases, and hello.err should be empty.
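The source for the class packed into Hello.jar is not listed in this sheet. As a rough illustration only, a class of that kind might look like the sketch below, assuming it does nothing more than print to standard output. The greeting helper is a hypothetical addition that makes the message easy to check.

```java
// Hypothetical sketch of a class like Hello; the real source inside Hello.jar is not shown in this sheet.
class Hello {
    // Illustrative helper: builds the message so it can be checked without running main.
    static String greeting() {
        return "Hello World";
    }

    public static void main(String[] args) {
        // In the Java universe, Condor saves standard output to the file named by "output".
        System.out.println(greeting());
    }
}
```

If you wanted to build your own archive from a file like this, javac Hello.java followed by jar cf Hello.jar Hello.class would produce a Hello.jar with the same shape as the one supplied.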

Hello World in C & Python

We will now look at submitting a Condor job for code that has been compiled for a particular OS and hardware. In this case we will use a simple C program that was compiled on pound, which is a Linux node similar to the other Condor Linux nodes. To ensure compatibility, it is worth compiling code on a submission node: if it builds and runs on one node, it should run on the remaining nodes of that type. If you are still in the same directory as you were in for the previous example, you can move to the C example directory by typing:

    cd ../hello.c

This changes directory to the one above you (the ..) and then down into a directory called hello.c. If you are in a different location and wish to move to the right directory, you can also type:

    cd /home/<username>/condor/hello/hello.c

This gives the cd command the full path to where you want to go, starting from the top level of the file store (the /). If you now type ls you should get a listing of the hello.c directory as follows:

    hello.c  hello.out  sub-helloc.txt

The source code for this example is in the file hello.c (which can be viewed by typing more hello.c). The compiled version of this source code is the file hello.out and the Condor submission file is sub-helloc.txt. View the contents of the submission file by typing more sub-helloc.txt and confirm that you see the following:

    universe     = Vanilla
    executable   = hello.out
    output       = out.txt
    error        = out.err
    log          = out.log
    arguments    = stuff
    requirements = OpSys == "LINUX"
    queue 1

In contrast to the initial Java example, the universe entry has been changed to Vanilla and we are supplying a compiled executable called hello.out. The requirements entry restricts the job to Linux nodes, since the executable was compiled for Linux. The other entries are similar to before (the arguments entry is ignored by this program but can be used to supply command line arguments where needed). Submit this job by typing condor_submit sub-helloc.txt and then try to see the job in the queue by typing condor_q <username> (e.g. condor_q dec).
When the job has completed, use the more command to confirm the expected output in out.txt and the log and error information in out.log and out.err respectively.

If you change directory to the Python example (cd ../hello.py), you should be able to repeat the above submission process for the Python example using the submission file sub-hellopy.txt. After confirming that the supplied Python script in hello.py works as expected, you could also try editing the Python script and getting it to do something different. Unless you are familiar with a Unix/Linux editor such as vi, you will find it easiest to edit hello.py from Windows using TextPad (do not use Notepad since it does not handle Unix new line characters properly).

IP Checker: Using the Process ID as a command line argument

The next example uses the IP checker code that was demonstrated in lectures. This can be a useful debugging tool that enables you to confirm that code has run on a particular node. To change to the directory containing this example, type:

    cd /home/<username>/condor/ipcheck

Using the ls command, you should be able to confirm that this directory contains two files: the Java archive NodeIP.jar and the Condor submission file sub-nodeip.txt. This example demonstrates using the process ID as a command line argument that is passed in to the NodeIP program. If you examine the contents of the submission file using the command more sub-nodeip.txt, you should observe the line:

    arguments = NodeIP $(Process)

The $(Process) entry will be replaced with a job number, such that if we had used Queue 3 to request 3 jobs, the arguments passed in would be 0, 1 and 2. In order to see the output of each individual job, we have also used the $(Process) macro to set the output file name for each job via the line:

    output = chkout$(process).txt

For the example above with 3 jobs queued, this would result in three output files being generated called chkout0.txt, chkout1.txt and chkout2.txt. Note that these files will contain whatever was sent to standard output by your program (in Java this would be lines where you used System.out.print or System.out.println).
To confirm this mechanism, submit the job file sub-nodeip.txt, check the queue and, when all jobs have completed, list the contents of your working directory via the ls command. You should observe that you now have 10 output files labelled chkout0.txt to chkout9.txt. Examine the contents of some of these files using the more command and confirm that the Task number matches the job number.
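The source of NodeIP is not reproduced in this sheet, so the following is only a sketch of how a program of that kind could pick up the $(Process) value and report where it ran. The class name NodeIPSketch, the report helper and the exact wording of the output line are all assumptions made for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch of a program like NodeIP; the real class in NodeIP.jar may differ.
class NodeIPSketch {
    // Illustrative helper: formats one report line from a task number and a host description.
    static String report(String task, String host) {
        return "Task " + task + " ran on " + host;
    }

    public static void main(String[] args) {
        // Condor substitutes $(Process) before launch, so args[0] arrives as "0", "1", "2", ...
        String task = args.length > 0 ? args[0] : "unknown";
        String host;
        try {
            InetAddress local = InetAddress.getLocalHost();
            host = local.getHostName() + " (" + local.getHostAddress() + ")";
        } catch (UnknownHostException e) {
            host = "unresolved host";
        }
        // Standard output ends up in the per-job file named by the "output" entry.
        System.out.println(report(task, host));
    }
}
```

Because the main class name comes first in the arguments entry, the process number is the first element of the args array seen by main.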

Passing input files to Condor jobs

The next example demonstrates passing an input file to a Condor job using the word count example shown in lectures. To change to the example directory, type:

    cd /home/<username>/condor/wordcount

List the contents of this directory and confirm that there are 5 files present. This job requires us to split a source text file into parts, send each of the parts to a separate Condor job and then merge the resultant output when the jobs have completed. The 5 files are as follows:

1. wp.txt  Our document to analyse (War and Peace).
2. split.sh  A script to run the Java class Splitter, which will split wp.txt into 5 parts.
3. sub-count.txt  The Condor submission file that creates 5 jobs, 1 for each of the above parts.
4. wordcount.jar  The Java archive containing the Java classes we need. The source code for the classes contained in this archive was shown in the lectures.
5. merge.sh  A script that merges the partial counts into a single sorted list of counts.

Using the more command, view the contents of the files split.sh, sub-count.txt and merge.sh and check that you understand what they are doing. For the Condor submission file sub-count.txt, you should observe that we instruct each job to count a process numbered input file (wp-0.txt, wp-1.txt etc.) via the command:

    Arguments = Counter wp-$(process).txt 1

and that we ensure that the relevant file is copied over to the Condor job via the commands:

    transfer_input_files  = wp-$(process).txt
    should_transfer_files = YES

To try this example out, we first need to split up the document wp.txt using the script split.sh. Run the splitter script by typing the following:

    ./split.sh

Now type ls and confirm that there are 5 new files numbered wp-0.txt to wp-4.txt. These just contain wp.txt split into 5 equal-sized sections. Submit the Condor submission file sub-count.txt and check the Condor queue to confirm when the job has completed.
Once complete, list the contents of the current directory and confirm that 5 word count files have been deposited back in your working directory. The files should be labelled c-wp-0.txt to c-wp-4.txt. Each of the files was created directly by an instance of the Counter program running on Condor. They were not created via standard output as in previous examples and there is therefore no mention of them in the Condor submission file. The Condor client has noticed that new files were created compared to those that it copied across and, due to the command:

    when_to_transfer_output = ON_EXIT

it has copied these output files back to the submission directory. Examine the content of a couple of the word count files (e.g. c-wp-2.txt) using the more command and check that each contains an unordered list of word counts.

We now need to merge the c-wp-0.txt to c-wp-4.txt files into one word count file. This can be achieved using the script merge.sh, which contains 2 commands. The first one runs the Java class Merge that is in the Java archive wordcount.jar. This merges all files with the prefix c-wp and saves the output to a single file called mergedcounts.txt (omitting all words with a count of fewer than 50). The second command uses the Unix sort command to sort mergedcounts.txt on the second column (the -k 2 part) in reverse numeric order (the -nr part). The output of sort is redirected to the file sortedmerge.txt (the > sortedmerge.txt part). To run this script, type:

    ./merge.sh

and then use the ls command to confirm that 2 new files have appeared called mergedcounts.txt and sortedmerge.txt. Use the more command to see the unsorted and sorted versions of the word distribution counts.

Collecting DNA sequence counts

We are now going to look at a similar Condor job that applies the same principles as above to searching for given lengths of DNA sequences. Use the cd command to change directory to:

    /home/<username>/condor/seqcount

If you list the contents of this directory via the ls command, you should observe that 6 files are present. The files have the following roles:

1. seq-dna.txt  The DNA sequence that we are going to analyse (this is a subset of a much larger sequence).
2. split.sh  A script to split seq-dna.txt into 5 parts. This uses exactly the same code as the previous example but a different file to split.
3. sub-sequence3.txt and sub-sequence9.txt  Condor submission files to count sequences of length 3 and 9 respectively.
4. seqcount.jar  A Java archive containing the Splitter, Sequences and Merge classes.
5. merge.sh  A script to merge the partial sequence counts. Again this uses the same code as the previous example but with a different set of files.

Using the more command, examine the contents of all of the above files except for the seqcount.jar file and check that you understand the role and structure of each file.
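Both merge.sh scripts call a Merge class whose source is not reproduced in this sheet. The sketch below shows one way such a merge could work. It assumes that each partial file holds one "key count" pair per line separated by white space, which is a guess based on the sort -k 2 step. The class and method names are hypothetical, and the sorting that merge.sh delegates to the Unix sort command is folded into the same class here for convenience.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a merge step; the real Merge class in the jar files may differ.
class MergeSketch {
    // Adds the "key count" lines of one partial result file into the running totals.
    static void addCounts(Map<String, Integer> totals, List<String> lines) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            totals.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
        }
    }

    // Drops keys whose total is below minCount and returns "key count" lines, largest count first.
    static List<String> sortedLines(Map<String, Integer> totals, int minCount) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.add(e);
            }
        }
        // Ties are broken alphabetically so that the output order is deterministic.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : kept) {
            out.add(e.getKey() + " " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = new HashMap<>();
        addCounts(totals, List.of("the 40", "and 30")); // stands in for one partial file
        addCounts(totals, List.of("the 25", "war 60")); // stands in for another
        sortedLines(totals, 50).forEach(System.out::println);
    }
}
```

In this toy run the two partial counts for "the" add up to 65, "war" stays at 60, and "and" is dropped because its total of 30 is below the threshold of 50.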

Our first task is to split the DNA sequence up. This can be achieved by running split.sh as follows:

    ./split.sh

This will produce 5 partial sequence files numbered seq-dna-0.txt to seq-dna-4.txt. Confirm that these files have been produced by listing the contents of the directory.

Now submit the Condor submission file sub-sequence9.txt and check the queue to confirm when the jobs have finished. This should produce 5 files, each containing the sequence counts for one of the partial sequences. If you list the directory contents, you should see these files listed as c-9-seq-dna-0.txt to c-9-seq-dna-4.txt. One of the command line arguments that is passed to the Condor job is the length of sequence to search for. In this case the jobs were instructed to look for sequences of length 9 and this value has been incorporated into the output file name. The Condor submission file sub-sequence3.txt is almost identical to the one we just submitted but requests a search for sequences of length 3. Since the output of this latter job would have files labelled c-3-seq-dna-0.txt to c-3-seq-dna-4.txt, we avoid mixing up the output files for different types of job. You will often need to consider how you will differentiate your output data when you have a large number of similar jobs with different parameter settings, so the above example should be considered as one possible mechanism.

The last step in this example is to merge the individual count files into one merged and sorted file. This follows exactly the same steps as the previous example and can be achieved by running the script merge.sh as follows:

    ./merge.sh

Using the ls command, you should observe that two files have been produced: seqcounts9.txt and sortedseq9.txt. These are the unsorted and sorted merged files as before. Use more to examine sortedseq9.txt and observe the most commonly occurring DNA sequences of length 9 bases in our sample.
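The counting done by each job can be pictured with the sketch below. It is not the code from seqcount.jar: the class and method names are hypothetical, and it assumes that every overlapping window of the requested length is counted, which the supplied Sequences class may or may not do.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of fixed-length sequence counting; not the code from seqcount.jar.
class SequenceCountSketch {
    // Counts every overlapping window of the given length (an assumption about the windowing).
    static Map<String, Integer> countSequences(String dna, int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be at least 1");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + length <= dna.length(); i++) {
            counts.merge(dna.substring(i, i + length), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Windows of length 3 in ATCATCG are ATC, TCA, CAT, ATC and TCG.
        countSequences("ATCATCG", 3).forEach((seq, n) -> System.out.println(seq + " " + n));
    }
}
```

One design point to keep in mind when a sequence is split into parts: a window that straddles the boundary between two part files is only counted if the splitter overlaps the parts, and this sheet does not say whether the supplied Splitter does so.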

Analysing the performance of a GA

The last task we will look at is using Condor to examine the performance of a stochastic process. In this case we wish to look at the effect of mutation rate on the performance of a GA trying to solve the Travelling Salesman Problem (TSP). In order to study this properly we need to run the same job many times to even out the effects of randomness. The example we will look at is in the directory:

    /home/<username>/condor/tspga

Change to this directory and list its contents. You should see 7 files that have the following roles:

1. locations.txt  A file containing the name and coordinates of the locations to visit. Each Condor process will use this file as the problem to solve. You could easily replace it with a different set of location coordinates.
2. locations.png  A map of the locations.
3. TSPGA.jar  A Java archive containing the GA code to be run.
4. sub-tspga.txt  The Condor submission file.
5. makeresults.sh  A script that concatenates the result sets into 3 different files depending upon the mutation rate that was used. It then calls an R script to build a plot of the results.
6. header.txt  A file to supply column headers for the concatenated files.
7. boxplotdistance.r  The R script that creates a plot of the results.

In this case we are going to run 3 sets of jobs with a different mutation value for each set. If you look at the contents of the Condor submission file (sub-tspga.txt) you should be able to see the arguments that are set up to run the GA with the different values. The relevant lines are:

    arguments = Solver GA 0.01
    output    = sol-01-$(process).txt
    Queue 20

    arguments = Solver GA 0.05
    output    = sol-05-$(process).txt
    Queue 20

    arguments = Solver GA 0.10
    output    = sol-10-$(process).txt
    Queue 20

Unlike the previous examples, this submission file alters the arguments and output settings before each set of jobs is queued.
This enables a single Condor submission to process a range of different parameter values in one go. In this case it varies the mutation rate from 0.01 to 0.05 and then 0.10. Submit the sub-tspga.txt job to Condor and wait for it to complete. You should note that Condor will report back the total number of jobs that were submitted, which should be 60 in this case.

Once the jobs have finished and the queue is empty, you should find that your directory has filled up with 60 different files consisting of 3 groups of 20 files. Each group contains a set of runs for a given mutation rate (sol-01-*.txt, sol-05-*.txt and sol-10-*.txt). Use the more command to check the contents of some of these files. You should notice that each file just contains one line stating the number of generations that were run and the score that was achieved.

The next step is to use the makeresults.sh script to concatenate these output files into the relevant file for the given mutation rate. It will also put a header at the start of each results file so that we can refer to a particular column of data in R. To run this script, type:

    ./makeresults.sh

You should notice that 3 results text files are produced and also a picture in the file results.pdf. If you look at your directory in the Windows file explorer you should be able to view the results.pdf file that you have just created.

You can see that it would be very easy to create new problem files and new mutation parameters to test with little extra effort. We could also increase the runs from 20 per mutation rate to 500 with minimal changes. The work often comes in the design stage of parallelising a problem, but this can be eased through experience and adaptation of similar examples such as the ones you have used in this practical.
cat & R

For reference, the contents of makeresults.sh are as follows:

    cat header.txt sol-01* > results-01.txt
    cat header.txt sol-05* > results-05.txt
    cat header.txt sol-10* > results-10.txt
    R CMD BATCH boxplotdistance.r

The cat command concatenates all the files that are listed, and the wildcard character * matches a group of files that have a common pattern. In this case, each group of results is identifiable by the file name prefix sol-01, sol-05 or sol-10. The contents of each of these sets of files will be put into the files results-01.txt, results-05.txt and results-10.txt respectively, with the contents of header.txt preceding them. This is quite a useful procedure when trying to organise a set of results data.
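If you ever need the same header-plus-wildcard concatenation from inside a Java program rather than from the shell, a sketch along the following lines would do it. This is not part of the supplied example: the class and method names are hypothetical, and it uses the standard java.nio.file glob matching in place of the shell wildcard.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative Java counterpart of: cat header.txt sol-01* > results-01.txt
class ConcatSketch {
    // Returns the header lines followed by the lines of every file in dir that matches the glob.
    static List<String> headerThenMatches(Path header, Path dir, String glob) {
        try {
            List<Path> matches = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
                for (Path p : stream) {
                    matches.add(p);
                }
            }
            // The shell expands wildcards in sorted order; a directory stream does not, so sort here.
            Collections.sort(matches);
            List<String> lines = new ArrayList<>(Files.readAllLines(header));
            for (Path p : matches) {
                lines.addAll(Files.readAllLines(p));
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(".");
        Path header = dir.resolve("header.txt");
        if (!Files.exists(header)) {
            System.out.println("header.txt not found; run this from the tspga directory");
            return;
        }
        Files.write(dir.resolve("results-01.txt"), headerThenMatches(header, dir, "sol-01*"));
    }
}
```

For a one-off job the shell version is shorter and clearer; a Java version only earns its place when the concatenation is one step inside a larger program.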

The final step is to run the R script boxplotdistance.r. R is a statistical analysis package that uses its own scripting language to instruct it to perform a particular task. The R script contains the following instructions:

    sink("output.txt")
    res01<-read.table("results-01.txt",header=TRUE)
    res05<-read.table("results-05.txt",header=TRUE)
    res10<-read.table("results-10.txt",header=TRUE)
    pdf("results.pdf")
    boxplot(res01$distance,res05$distance,res10$distance,
            names=c("0.01","0.05","0.10"),xlab="Mutation Rate",ylab="Distance")
    sink()

The first line of this file instructs R to send all output to the file output.txt so that you can check the script ran as intended. The next 3 lines read the data in the results files into data frames called res01, res05 and res10. The line pdf("results.pdf") instructs R to send graphical output to the file results.pdf. The line that starts with boxplot creates a box plot using the distance column in each of the 3 data frames that were built earlier, giving each of them a specific label and also labelling the x and y axes. The last line closes the redirection of the output.

Adapting the DNA Sequence Analysis Process

The source code used to build the DNA sequence distribution count is available in the condor/seqcode directory that you copied over from the network share. Examine this code and then adapt it to a new sequence analysis task. For example, you could search for all occurrences of patterns that start with the sequence ATC and are 9 characters long. You should be able to generate a new JAR file and substitute it for the one used in the seqcount example. If you restrict your new code to only modifying the countsequences method in Sequences.java, the rest of the scripts and submission file should work as before. You can also test your code directly in Eclipse using the sample.txt file, which contains just 20 lines of the sequence. To do this, you can create a run profile with the command line arguments:

    sample.txt 9

This should produce an output file called c-9-sample.txt containing a count of the sequences that matched your pattern.
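As a starting point for the suggested adaptation, the sketch below counts only those windows that begin with a given prefix such as ATC. It is a stand-alone illustration with hypothetical names and an assumed signature, so you will need to fit the idea to the real countsequences method in Sequences.java yourself.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone illustration; adapt the idea to the real countsequences method.
class PrefixSequenceSketch {
    // Counts overlapping windows of the given length that start with the given prefix.
    static Map<String, Integer> countWithPrefix(String dna, String prefix, int length) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + length <= dna.length(); i++) {
            // startsWith(prefix, i) tests the prefix at offset i without creating a substring.
            if (dna.startsWith(prefix, i)) {
                counts.merge(dna.substring(i, i + length), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // ATCGGGTTT appears twice in this made-up test string.
        countWithPrefix("ATCGGGTTTATCGGGTTTA", "ATC", 9)
                .forEach((seq, n) -> System.out.println(seq + " " + n));
    }
}
```

Keeping the prefix test separate from the counting makes it easy to swap in a different filter later, for example a pattern that must end with a particular sequence.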

Appendix: Condor Commands

    condor_status                 Check the status of the Condor pool
    condor_submit <submit file>   Submit the jobs described in the given <submit file>
    condor_q <username>           Check the Condor queue for the given user name
    condor_rm <username>          Remove all Condor jobs belonging to <username>
    condor_rm <cluster>           Remove all jobs with the given cluster id

Useful Unix Commands

    cd <path>                     Change directory to the given <path>
    ls                            Lists the contents of the current directory
    ls <path>                     Lists the contents of the directory given by <path>
    man <command>                 Give the manual page for the given <command>
    mkdir <directory>             Makes a new directory with the supplied name
    more <file>                   Show the contents of the named <file> (use the Return key for the
                                  next line, Spacebar for the next page and q to quit)
    pwd                           Prints the path to the current working directory
    rm <file>                     Remove the given file (note that there is no undo for this action)
    sort -k 2 -nr <file>          Sort on the second column using numeric sort and reverse the sort order
    > <file>                      Redirects standard output from any Unix command to the given <file>

You can use the Tab key to try to autocomplete a file name. For example, if you type the first few characters of the file name and press Tab, the best match to the prefix you have typed will be autocompleted for you. Pressing Tab twice will show the available matches.

The majority of Unix and Linux commands come with many extra options that allow you to request more detail or apply the command in a particular way. The options are usually indicated by a minus symbol followed by one or more letters. For example, the ls command can provide more detailed output via the -l option (e.g. ls -l). To find out what a command does and what the other possible options are, use the man command (short for manual) followed by the name of the command that you want more information on (e.g. man ls). Note that manual entries are also available for the Condor commands if you wish to find out more about their capabilities. Use the Return key and Spacebar to go through the pages that are produced and the q key to stop reading the output.