A Guide to Condor
Joe Antognini
October 25, 2013

1 Condor is on Our Network

What is an Our Network? The computers in the OSU astronomy department are all networked together. In fact, they're networked together in the most intimate of ways. You will find that if you unplug your ethernet cable your computer will not work so good. Things that you take for granted on your computer will not work. Things like ls. So don't unplug your computer.

One of the consequences of this is that you can see the files on everyone's computer on the network. Just go to /home and you'll see all the computers on the network. If you have files on your computer that you don't want anyone to see, remember to change your permissions appropriately (chmod 600 foo).

Another consequence of this is that you can harness the power of anyone else's machine. All you have to do is ssh into their computer. Then you can run your favorite programs using their CPU. This is useful if they have a more powerful computer than you or you are already using all of your cores. (Though if you have more than a few jobs to run, you really should use Condor, which we'll get to in a bit.)

It's good form to ask the owner if you can use their computer if they're going to be using it while your job is running. After all, you don't want your jobs to interfere with their ability to watch cat videos on YouTube. Moreover, if you're going to use their computer, remember to use nice -n 19 foo_program. This will set your job to lowest priority so that it doesn't interfere with anything they're running.

If you don't like the owner of the computer you're using, you may be tempted to use nice to set the priority of your job to be as high as possible so that the owner can't use his or her computer at all. But as it turns out you can't set the priority to anything higher than the owner's default priority unless you're root. So you'll just have to politely ask David Will for the password to root.
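If you'd like to convince yourself that nice is doing what it says, here is a minimal sketch. The function name run_politely is something I made up for illustration (it is not a standard command); it just wraps whatever you hand it in nice -n 19. Calling nice with no arguments prints the niceness of the current process, which makes for a handy check.

```shell
#!/bin/bash
# run_politely is a hypothetical helper, not a standard command:
# it runs its arguments at the lowest CPU priority.
run_politely() {
    nice -n 19 "$@"
}

# With no arguments, nice prints the niceness it is running at,
# so this shows the priority a real job would inherit.
run_politely nice

# In practice you would run your own program instead, e.g.:
#   run_politely ./foo_program
```

Started from a shell at the default niceness, the check should report 19, the lowest priority there is.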

One other thing to remember is that nice only limits your CPU usage. Some jobs don't use a lot of processing power, but spend a lot of time shuffling bits around the computers and are therefore I/O limited. If this is the case, running nice won't prevent you from slowing down your nice friend's computer. If you believe that your job will involve a lot of I/O, preface the command with ionice, as in ionice -c 3 foo_program. (Class 3 is the idle class, so your job only gets disk time when nobody else wants it. Class 0 means "no class at all" and won't help.)
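Here is a rough sketch of how you might combine the two, so that a job is polite about both CPU and disk. The name run_quietly is hypothetical, and I've assumed you want the idle I/O class (class 3). The function first checks that ionice actually works on the machine and falls back to plain nice if it doesn't.

```shell
#!/bin/bash
# run_quietly is a hypothetical helper name. It lowers both the CPU
# priority (nice) and the I/O priority (ionice, idle class) of a command.
run_quietly() {
    # Probe first: ionice may be missing, or may not be allowed to
    # change the I/O class on this machine.
    if ionice -c 3 true 2>/dev/null; then
        ionice -c 3 nice -n 19 "$@"
    else
        # Assumption: plain nice is an acceptable fallback.
        nice -n 19 "$@"
    fi
}

# Stand-in for a real job; replace with ./foo_program and its arguments.
run_quietly echo "foo_program would run here"
```

The probe costs almost nothing and means the same wrapper can be used on any machine you ssh into, whether or not ionice is available there.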

2 Who Is This Condor and What Is He Doing to My Computer?

ssh'ing into other people's computers to run your jobs is all well and good, and certainly better than running them all on your own computer if you have a lot. But there's a better way: the Condor Way. Condor takes advantage of all the computers in the network to run large batches of computing jobs. All you do is submit all the jobs you want run to Condor, and then Condor will automagically distribute them to all the computers on the network. The jobs will then run on the spare CPU cycles of those computers.

2.1 Condor is opt-in

By default, your computer will not be on the Condor network; you have to opt in. You should do that. If you have large batches of jobs to run, Condor will help immensely. If you don't plan on running large batches of jobs, you should do it anyway for the benefit of those who do. By design, putting your computer on the Condor network should have no adverse consequences.

Some people will claim that Condor slows down your computer. I think they're full of it. Having sixty Firefox tabs open with YouTube videos will make your computer slow. [1] The only thing Condor did was give them something to blame their slow computer on. But even if you're inclined to believe them, put your computer on the network anyway and see for yourself. If you think your computer is running unbearably slow, you can always ask David Will to remove you from the network later. (Don't feel guilty about it; it's easy for him to do that.)

2.2 Is Condor for you?

You will find Condor useful if you have a program you have to run a large number of times (roughly 10 or more), maybe with different arguments on each run. In order for this whole Condor thing to work, your program has to satisfy two constraints: (1) the program has to run independently on everyone else's computer, and (2) the program has to print all its output to stdout or stderr.

To unpack these two constraints a bit, before submitting a batch of jobs to Condor, make sure that your program can run on everyone else's machine independently. This just means that your program shouldn't depend on any libraries or files that exist only on your own machine. If you have 100 jobs running across the network that are all calling for a file on your computer, your computer will be slow and none of the jobs will run quickly.

For a similar reason, make sure that your program is not writing its output directly to a file. Instead print your program's output to stdout. If you have 100 jobs across the network which are all trying to write files to your computer, none of them will be able to do so efficiently and your jobs will all run super slowly. Condor works best by dealing with stdout and stderr. Condor will collect all the output to stdout and save it up and then write it to your computer in batches so that the jobs aren't slowed down by waiting for your computer to write data.

[1] Seriously, though, Firefox has a memory leak, so you'll benefit from closing it occasionally.

3 How to Use Condor for Fun, ???, and Profit

So you're convinced. You see you and Condor living together happily ever after. More specifically, you have four thousand jobs you need to run before tomorrow. What do?

3.1 Submitting individual jobs

Each job is submitted individually to Condor. So the first thing you have to know to submit large batches to Condor is how to submit an individual job. The heart of the Condor submission process is the Condor submit file. A Condor submit file is a text file that looks something like this:

    Executable = program_foo
    Requirements = (OpSys == "LINUX" && Arch == "X86_64")
    Rank = Machine == "milkyway.astronomy.osu.edu"
    universe = vanilla
    arguments = --foo 123 --bar 456 --baz 789
    output = foo.dat
    error = foo.err
    log = foo.log
    queue 1

Okay, let's go over this. The first line (Executable) just says what program you want to run. Don't put arguments here. Save those for later. One thing to note here is that this program is going to run on someone else's computer. (Which is probably what you're hoping!) So make sure that your program doesn't require libraries that are only on your machine. Be sure that your program can run independently on everyone else's machines.

The Requirements line specifies that Condor should only put your jobs on 64-bit machines running Linux. If you want, you can edit this so that your jobs will only run on 32-bit machines or on both. (Though a lot of programs will only run on one kind of architecture, so test that out before submitting all your jobs.) There's not really much point in including 32-bit machines, though, since they don't have a lot of computing power on our network.
The only reason you might want to do it is if you don't have a whole lot of jobs you want to run (so you're not computing-power limited) and you do your code development on a 32-bit machine.

The Rank line lists machines that Condor should give priority to. Some computers on the network are more powerful than others, so you may want to have Condor preferentially put your jobs on certain computers. I've put
Stanek's machine on here just as an example, but you can change this to whatever computers you want. You can also put multiple computers on this line by separating the computers' addresses like this:

    Rank = (Machine == "foo.astronomy.osu.edu") || (Machine == "bar.astronomy.osu.edu")

The universe line isn't important. Do not pay attention to the man behind the curtain. [2]

On the arguments line, just write down any arguments that you would supply to your program.

The output line specifies the file to which Condor will write data printed to stdout. Similarly, the error line specifies the file to which Condor will write data printed to stderr. Finally, Condor itself will generate a log file which will contain information like when the job was submitted, when it started running, when it finished, etc. The log line specifies the file to which Condor will write this log.

The last line is the queue line. This tells Condor how many times to run this job. If you want to run the exact same program 100 times, change this line to queue 100.

So that's the Condor submit file! Suppose you have saved it to a file called C_Submit. Now to submit this job to Condor, all you have to do is run condor_submit C_Submit.

[2] There are fancier versions of Condor that can do much cooler things. But it's a pain for David Will to install, so he hasn't done it. The vanilla universe just specifies that we are using the most basic version of Condor. But if enough people start using Condor, we might be able to convince David Will that it's worth his time to upgrade to a cooler version of Condor!

3.2 Submitting many jobs

So what if you want to submit a whole bunch of jobs with different arguments? Well, all you have to do is submit a whole bunch of individual jobs many times. The easiest way to do this is through a bash script. Suppose you wanted to run a program called program_foo 100 times with an argument --bar x where x varies from 1 to 100. Then you would write a bash script which generates the appropriate C_Submit file and then submits it to Condor. It should look something like this:

    #!/bin/bash

    i=1
    iend=100
    # -le rather than -lt, so that the last job (i = 100) is included
    while [ $i -le $iend ]; do
        # Write a fresh submit file for job i. Single quotes keep the
        # shell from interpreting the parentheses, && and inner quotes.
        echo 'Executable = program_foo' > C_Submit
        echo 'Requirements = (OpSys == "LINUX" && Arch == "X86_64")' >> C_Submit
        echo 'Rank = Machine == "milkyway.astronomy.osu.edu"' >> C_Submit
        echo "universe = vanilla" >> C_Submit
        echo "arguments = --bar $i" >> C_Submit
        echo "output = foo_${i}.dat" >> C_Submit
        echo "error = foo_${i}.err" >> C_Submit
        echo "log = foo_${i}.log" >> C_Submit
        echo "queue 1" >> C_Submit
        condor_submit C_Submit
        sleep 0.2   # give Condor a moment between submissions
        let i++
    done

One note here is that I've added a sleep command. If you try to submit jobs to Condor too rapidly, Condor can sometimes get confused.

3.3 A more elegant way of doing something similar

If your index runs from 0 to some number n - 1 (say, from 0 to 499), you don't have to explicitly write a loop. Instead, you can just submit a single Condor script which looks like this:

    Executable = program_foo
    Requirements = (OpSys == "LINUX" && Arch == "X86_64")
    Rank = Machine == "milkyway.astronomy.osu.edu"
    universe = vanilla
    arguments = --bar $(Process)
    output = foo_$(Process).dat
    error = foo_$(Process).err
    log = foo_$(Process).log
    queue 500

The queue line tells Condor to run this process 500 times, and $(Process) is a variable which runs from 0 to 499.

4 Some other loose ends

So that's it! There are a few other things that will be useful to know. You may want to check on the status of your jobs and on the status of the Condor network in general. The two commands that will be most useful are condor_q and condor_status. The condor_status command lists all the computers on the Condor network and says whether they're running a job or not. The other command is condor_q, which will list all the jobs which have been submitted to Condor. You may be only interested in checking to see how your jobs are getting along. In that case,
type condor_q username to display just your own jobs. If you don't want to see every individual job, but just the total number you have left, pipe the output to tail -1.

Another command you might use on occasion is condor_run. If you have just one job you want to run on Condor (say, foo.sh), all you have to do is condor_run foo.sh. This can be useful if you want to test something out and you don't want to go to all the trouble of making a C_Submit file.

Finally, if you realize you've made a mistake and you want to get rid of your jobs, just type condor_rm username. This will remove all of your jobs. You can also remove individual jobs by typing condor_rm followed by the job's Condor ID.

4.1 Debugging

Sometimes you may submit a bunch of jobs only to find them all sitting idle. To see what's going wrong, the command condor_q -analyze is inordinately useful. In all probability, what has happened is that you have made a typo in the Requirements line of your C_Submit file such that none of the computers on the network fulfill your requirement. The condor_q -analyze command will suggest which requirements to change.

4.2 I/O heavy jobs

If your job involves a lot of I/O, the people will start to become restless and grumble. Condor only prevents your jobs from using too much CPU time. It won't do anything to prevent someone's computer from reading and writing a lot of data. If your job is I/O limited, you will slow down people's computers and people will hate you more than they already do. To get around this, change your executable to ionice and put your own program as an argument. As you might guess, ionice is like nice, but for I/O operations instead of CPU cycles. It has a different priority system though, so read the documentation.

4.3 A last note on long jobs

If your job takes longer than a day to run, you may run into problems. Users who haven't run jobs in a while (roughly a day or more) are given priority. If your job takes longer than a day, you might find your job booted to make room for a newer user.
This is generally not too big a deal since your job will start running again once the new user's job has finished running. But if two people both want to run jobs that take longer than a day, they'll alternately get kicked off to make room for each other and no one's jobs will get done. In those situations you should talk to the other person and come to some agreement in the Department Thunderdome.
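
Going back to the I/O heavy jobs of section 4.2 for a moment, here is a rough sketch of what "change your executable to ionice" might look like in practice. Several things here are my assumptions rather than anything specific to our network: that ionice lives at /usr/bin/ionice on every machine, that you want the idle I/O class (-c 3), that /path/to/program_foo is a placeholder for wherever your program really is, and the file name C_Submit_ionice. The script only writes the submit file; the actual submission is left commented out.

```shell
#!/bin/bash
# Sketch: write a submit file that runs program_foo under ionice.
# Assumptions: ionice is at /usr/bin/ionice on every machine, and
# /path/to/program_foo is a placeholder for your program's real path.
# The quoted 'EOF' keeps the shell from touching $(Process).
cat > C_Submit_ionice <<'EOF'
Executable = /usr/bin/ionice
Requirements = (OpSys == "LINUX" && Arch == "X86_64")
universe = vanilla
arguments = -c 3 /path/to/program_foo --bar $(Process)
output = foo_$(Process).dat
error = foo_$(Process).err
log = foo_$(Process).log
queue 10
EOF

echo "wrote C_Submit_ionice"

# Once the paths are right for your program, submit it with:
#   condor_submit C_Submit_ionice
```

Your program and its own arguments simply move onto the arguments line, after the options meant for ionice. It is probably worth trying this with a single job (queue 1) before unleashing a full batch.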

5 Acknowledgements

Sadly, I was so tired when writing a draft of this document that Ben Shappee managed to improve it. Rubab Khan also told me it would be a good idea to talk about condor_run. So I did.

The End.