HowTo: FermiGrid for MAP Users

Tom Roberts, Muons, Inc.
March 2010

Contents

    Introduction
    Requirements
    Background
    Glossary
    Security and Privacy
    Basic Properties of FermiGrid
    One-Time Tasks
    Anatomy of Grid Jobs
    Preparation and Submission
    Execution
    Completion
    Submitting Multiple Jobs using G4beamline
    Example: Investigate a simple target in a solenoid
    Gotchas

Introduction

Fermilab has several Linux clusters that are combined into a grid-based system called FermiGrid [http://fermigrid.fnal.gov]. There are thousands of CPUs in these clusters, and members of the Muon Accelerator Program (MAP) can use them for MAP-related computations. This document is a synopsis of what is required to use FermiGrid, specifically for members of MAP, and a description of how to do useful work with it.

Tom Roberts is the MAP liaison to FermiGrid, and questions or problems should ordinarily be directed to him. If he is unavailable, Scott Berg is an alternate.

Requirements

There are several requirements for using FermiGrid:

1. A Fermilab Kerberos principal, plus the knowledge and tools required to use it to access Fermilab computer systems.
2. Awareness of the conditions in the Fermilab policy for using FermiGrid, and acceptance of them [http://fermigrid.fnal.gov/policy.html].
3. A login on the FermiGrid login machines (fnpcsrv1, fnpcsrv2, fnpcsrv532, fnpcsrv564).

Here is how to satisfy these requirements:

1. Fermilab Kerberos principal: this is available from the Fermilab Computing Division, and participating in MAP is sufficient reason to obtain one. If you don't already have one, talk to Tom Roberts, who will guide you through the process; expect it to take several days. The MAP director will need to approve your request (this should be a rubber stamp). If you already have one, you don't need a new one.

2. Policy: read http://fermigrid.fnal.gov/policy.html. Note that incidental use is not permitted, so only MAP-related computations can use grid resources under the map group. If you don't agree with these policies, don't use FermiGrid.

3. Login on the FermiGrid login machines: tell Tom Roberts and he will request this for you. He will also request that you be added to the map group and to the fermilab/map virtual organization.

Background

A useful resource containing tutorials on using FermiGrid is http://fermigrid.fnal.gov/fermigridschool.html. Note that FermiGrid 101 spends a lot of time on installing the tools; you can skip that, as the tools are already installed on the FermiGrid login machines you will be using.

FermiGrid consists of several clusters of Linux machines. Some are dedicated to specific experiments, but the GP (General Purpose) cluster is available to us. The GP cluster currently has about 1900 job slots (cores). It often runs at >99% utilization, but jobs are usually still handled in a timely fashion.

At present, MAP is set up only to use FermiGrid, not any of the external grids. This could be expanded in the future.

Glossary

The grid uses nomenclature all its own, and to understand both the documents and what is going on you need to become familiar with the common terms. NOTE: these are my personal interpretations of these terms; they are not at all official. Google is your friend, as is https://twiki.grid.iu.edu/twiki/bin/view/documentation/glossaryofterms. Here are the terms I have found useful (in a logical order, not alphabetical):

Grid (computing)
    A collection of resources and virtual organizations that share a common system of grid trust, and thus share computing resources. A set of standards to facilitate this sharing has been published by the Open Science Grid organization (OSG) [http://www.opensciencegrid.org/]. NOTE: European grids often use a different set of standards (e.g. the MICE grid).

Virtual Organization (VO)
    A collection of users sharing a common purpose or project, and utilizing the same grid trust. Fermilab has created a VO named fermilab.

Group
    A subset of a VO. This is done to reduce the overhead of creating a VO for a small project. The Muon Accelerator Program has a group named map; this is a group within the fermilab VO, called fermilab/map.

Grid Trust
    An agreement between the VO and grid providers (such as FermiGrid) that describes what resources can be used. After approval, it is embodied as a certificate. MAP has executed a grid trust agreement with Fermilab. How you obtain your personal grid certificate (really a proxy) is described below.

Certificate
    A computer-readable cryptographic file that represents a user's or system's identity; often includes a list of authorized privileges.

Proxy
    A secondary certificate that stands in for an underlying grid trust certificate for a short time (~1 week). Used so the underlying semi-permanent certificate is less exposed to attack by bad guys.

Cluster (computing)
    A collection of computing resources. Usually consists of a few login machines plus a larger number of worker nodes. Usually all run Linux, and usually all machines share common disk space. Note that FermiGrid consists of several clusters, most of which are dedicated to specific experiments. We are using the GP (General Purpose) cluster.

Login Machine
    A machine in a computing cluster that permits external logins. Usually ssh is used for access (on FermiGrid, Kerberized ssh is required). Usually shares disk storage with the worker nodes of the cluster. Used to assemble jobs, to submit them to worker nodes via some batch processing system, and then to obtain results. On FermiGrid, condor is used as the batch system (quite common).

Worker node
    A machine in a cluster used for running jobs. Usually external logins are not allowed. Usually shares disk resources with the login nodes.

Compute Element (CE)
    An interface machine that permits the submission of grid jobs. On FermiGrid this gateway uses condor. But you run condor_submit on a login node; it talks to the CE to actually submit the job on the cluster.

Condor
    A batch submission system that provides detailed control of jobs. Used on FermiGrid to submit jobs, monitor them, and cancel them when necessary. It is very flexible, and therefore rather complex. Most of the complexity is hidden from you when you use the job-submit scripts I have written. http://www.cs.wisc.edu/condor/

Cluster (condor)
    A collection of related jobs all submitted at one time, sharing a single executable file (arguments for individual jobs can differ). Do not confuse this with a computing cluster.

/grid/data
    Disk space on FermiGrid for the input and output of jobs. Currently 24 Terabytes, 59% used. Verbally called "grid data". Our quota is currently 0.4 Terabytes (can be increased). Not backed up, so be sure to copy important files elsewhere.

/grid/app
    Disk space on FermiGrid for applications. Currently 0.3 Terabytes, 73% used. Verbally called "grid app". Mounted read-only on worker nodes; read-write on login nodes. Our quota is currently 0.08 Terabytes (can be increased). Backed up daily to tape.

OSG_DATA
    An environment variable set to /grid/data on FermiGrid. Called $DATA in some older documentation.

OSG_APP
    An environment variable set to /grid/app on FermiGrid. Called $APP in some older documentation.

OSG_WN_TMP
    An environment variable set to local worker-node disk space. Called $WN_TMP in some older documentation.

HOME
    The usual environment variable, set to the home directory of your login. Valid on login machines, but not on worker nodes. This home directory is local to FermiGrid and is different from your AFS home on other Fermilab machines.

AFS
    The Andrew File System, a worldwide filesystem used at Fermilab for home directories and other storage on most other Fermilab machines. A Kerberos ticket is required to access Fermilab data in AFS. Available on login nodes but not on worker nodes.

Security and Privacy

On the grid, there is strong security to prevent unauthorized users from accessing resources (implemented cryptographically using certificates). Once authorized and admitted to a grid system, however, there is almost no security, and certainly no expectation of privacy.

So, for instance, on FermiGrid you must have a valid Kerberos ticket and use ssh to access the login machines, and you must have a valid grid certificate to submit jobs. But on these machines everyone is a member of the group fnalgrid, and virtually all files and directories are group readable and writable. This is so because all jobs on worker nodes run as user fnalgrid and group fnalgrid. Consequently, any user on a login machine can read and write essentially anybody's files. This obviously requires some care and discipline on the part of all users.

Basic Properties of FermiGrid

The GP cluster used by MAP currently has about 1900 job slots (cores), and often runs with >99% occupancy. Some of the systems implementing worker nodes have 8 cores, 24 Gigabytes of RAM, and a local disk drive with >200 Gigabytes available; others have 4 cores, 4 Gigabytes of RAM, and ~100 Gigabytes of disk available. So individual jobs should not expect more than 1 Gigabyte of RAM or about 10 Gigabytes of disk (with advanced condor usage you can require nodes with more RAM or disk space). If your job has multiple threads, they should not total more than ~100% of a single CPU; most grid jobs are single-threaded. Neither G4beamline nor ICOOL is likely to approach these limits in individual jobs.

Remarkably, there is no CPU time limit on a job; instead, a reaper process runs every Sunday, killing jobs more than a week old (i.e. jobs it has already seen once). So jobs that start on Monday can get almost 2 weeks of runtime.

There are disk quotas: at present MAP has a quota of 400 Gigabytes on /grid/data and 80 Gigabytes on /grid/app. If this proves insufficient, we can obtain more space.

There is a limit on the number of simultaneous jobs, which varies for each VO. At present MAP can have 100 simultaneous jobs. If this proves insufficient, we can get it increased.

Because of the job submission and startup overheads, it is undesirable for jobs to run less than about 10 minutes. Because of both common sense and the reaper, it is undesirable for jobs to run more than about 48 hours.

Generally you should make a trial run of your job on a login or worker node before submitting a large number of jobs; this is a check of its sanity, and it provides an estimate of its runtime, which you can use to set the number of events per job.

The login nodes are for preparing, submitting, and monitoring jobs only. You should not attempt to perform large computations on them, but building programs and running modest analysis programs are OK.

The disks used on FermiGrid are implemented on a BlueArc RAID system that is accessible by all login and worker nodes. While it is a high-capacity and high-performance system, it cannot possibly handle hundreds of jobs accessing the same files simultaneously (hundreds of jobs simultaneously accessing different files is generally OK, as the different files are located on different physical disk drives). So, to avoid a focused overload, it is necessary to have each job copy input data onto node-local disk space, run using node-local disk space, and copy output data back out from node-local space. Moreover, in case many jobs start nearly simultaneously, the copies should be protected by a throttling mechanism that limits the number of simultaneous copies for the entire job series. It is the responsibility of all users to avoid such overloads; the scripts I have written do so.

One-Time Tasks

Once you have met all the requirements, you must set up your environment to use FermiGrid. I'll describe this for bash users, but other shells can be used with similar commands.

The primary login machine is fnpcsrv1, but it shares home directories and user names with all FermiGrid login machines. You will want to set up your customary Linux environment on it, possibly by copying .bash_profile and .bashrc from some other machine. I'll assume that initially you will just use the job submission scripts that I have written, leaving the creation of new types of jobs until after you have gained experience with FermiGrid. So you should do the following:

    (on your local machine)
    kinit
    ssh fnpcsrv1.fnal.gov
    (now on fnpcsrv1... login banner, etc.)
    mkdir /grid/data/$USER
    # edit .bash_profile to put /grid/app/$USER into PATH
    # possibly put "cd /grid/data/$USER" into .bash_profile

Then log out and log in again. A sketch of such .bash_profile edits appears at the end of this section.

Since your HOME directory is not visible to worker nodes, I have found it convenient to work exclusively in /grid/data/$USER, essentially as if it were my HOME. Using that directory will distinguish your files from everybody else's; this is just good etiquette on a shared filesystem. Beware: it is not backed up, and it is subject to being seen, and even accidentally overwritten or deleted, by other users.
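For concreteness, the .bash_profile edits mentioned above might look like the following. This is only a sketch; whether to cd into /grid/data/$USER automatically is a matter of taste:

    # --- possible .bash_profile additions on fnpcsrv1 (sketch only) ---
    # make programs installed under /grid/app/$USER available on the command line
    export PATH=/grid/app/$USER:$PATH

    # optionally start each login session in the shared work area
    if [ -d /grid/data/$USER ]; then
        cd /grid/data/$USER
    fi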

Anatomy of Grid Jobs

There are three basic aspects to grid jobs: preparation and submission, execution, and completion. The first and last are usually performed manually on a login node, while the execution occurs on worker nodes, usually many in parallel. In some cases the job of combining the output files is large enough to justify using worker nodes for that as well.

Preparation and Submission

preparation
    Performed manually. You construct a directory in /grid/data/$USER for this specific set of jobs. Generally you must include all input files the jobs will use, and you usually create a compressed tarball containing them (to minimize disk I/O overheads). The job-submit scripts I write permit you to symbolically link in files and directories; the script follows the links and includes all files and directories in the tarball.

submission
    1. To submit jobs to the grid, you need a proxy certificate. This is described in the documentation, but it is easiest to just use my script "proxy"; it converts your Kerberos certificate to a grid certificate valid for the remaining renewal time of the former.
    2. Condor requires the creation of a submit file, which tells condor how to submit each job. The job-submit scripts I write do this internally, and then execute the condor_submit command to submit the jobs.

Execution

The execution of a single job on a worker node generally consists of three stages: setup, run, and finalize.

setup
    Sets up the input files in $OSG_WN_TMP. Generally this consists of:
    1. Obtain a copy ticket
    2. Copy files to $OSG_WN_TMP
    3. Release the copy ticket
    4. Un-tar the copied files if necessary
    The use of a copy ticket is designed to avoid a focused overload on /grid/data when a large number of similar jobs execute simultaneously. This is good etiquette on a shared filesystem.

run
    Executes the program, reading and writing $OSG_WN_TMP. Note that $OSG_WN_TMP is on a local drive that is not subject to overloads from multiple jobs.

finalize
    Copies output files from $OSG_WN_TMP to /grid/data. Generally this consists of:
    1. Obtain a copy ticket
    2. Copy files to /grid/data/$USER/...
    3. Release the copy ticket
    Again, the copy ticket avoids a focused overload on /grid/data when a large number of similar jobs execute simultaneously.

Sketches of a submit file and of a worker-node script implementing these stages follow.
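For orientation, here is a rough sketch of the kind of submit file that gets handed to condor_submit. Every name in it (myjob.run, myjob.submit, the job name, the argument layout) is an illustrative assumption, not the actual content produced by the job-submit scripts, which also add whatever FermiGrid-specific settings the site requires:

    # myjob.submit -- illustrative sketch, NOT an actual generated file
    universe   = vanilla
    executable = myjob.run
    # each .run script receives: name, job-id, first, last
    arguments  = R=6,L=300,ev=1..20000 $(Cluster) 1 20000
    output     = Output/R=6,L=300,ev=1..20000.out
    error      = Output/R=6,L=300,ev=1..20000.log
    log        = Jobs.log
    queue

The scripts then run condor_submit on the generated file; condor_q shows the queued jobs, and condor_rm cancels them if necessary.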

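And here is a minimal sketch of a worker-node script implementing the setup/run/finalize pattern. The commands get_copy_ticket and release_copy_ticket are hypothetical stand-ins for the author's throttle mechanism, and SUBMITDIR is a made-up path; the real .run scripts generated by submit_g4bl differ in detail:

    #!/bin/bash
    # Illustrative sketch of a worker-node job script (not the real .run script).
    NAME="$1"; JOBID="$2"; FIRST="$3"; LAST="$4"
    SUBMITDIR=/grid/data/yourname/myjob      # hypothetical submit-side directory

    # setup: copy the input tarball to node-local disk, throttled
    WORKDIR="$OSG_WN_TMP/$NAME"              # per-job subdirectory (see Gotchas)
    mkdir -p "$WORKDIR" && cd "$WORKDIR" || exit 1
    get_copy_ticket                          # hypothetical throttle
    cp "$SUBMITDIR/input.tar.gz" .
    release_copy_ticket
    tar xzf input.tar.gz

    # run: read and write only node-local disk
    g4bl target.g4bl first="$FIRST" last="$LAST"

    # finalize: copy the result back to /grid/data, throttled
    get_copy_ticket
    cp g4beamline.root "$SUBMITDIR/Output/$NAME.root"
    release_copy_ticket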
Completion

In some cases there may be no need for this stage.

completion
    Combine multiple output files from multiple jobs into a single file for analysis.

Submitting Multiple Jobs using G4beamline

A common task in MAP is to scan various parameter values, looking for an optimal set of parameters for a given configuration. The submit_g4bl script is specifically written to address this efficiently and easily. It has these features:

- Command-line syntax similar to g4bl; the parameter names and values are passed to g4bl, so the parameter names you use depend on the parameters used in the g4bl input file
- Simple command-line syntax to loop over values of a parameter
- Simple command-line syntax to run multiple identical jobs with different events
- Multiple parameter loops can be nested, including the event loop
- All parameter values are put into the names of the output files, including event numbers
- For setup, creates a tarball of the current directory, following all symbolic links
- Adheres to the canons of good etiquette (using my throttle script)

This script constructs a .submit file for condor, and a .run shell script that is executed by each job. The arguments to .run are: name, job-id, first, last; these are given in the .submit file to condor. At execution time, the actual parameters to g4bl are parsed from the job name; an illustrative sketch of that parsing follows.
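For illustration only, parsing g4bl parameters out of such a job name might look like the following in bash. This is a sketch of the idea, not the actual code generated by submit_g4bl:

    # sketch: turn a job name like "R=10,L=300,Angle=0,Bz=10,ev=1..20000"
    # into g4bl arguments, skipping the ev=... field (it becomes first=/last=)
    NAME="R=10,L=300,Angle=0,Bz=10,ev=1..20000"
    ARGS=""
    IFS=',' read -ra FIELDS <<< "$NAME"
    for f in "${FIELDS[@]}"; do
        case "$f" in
            ev=*) ;;                    # handled via first=/last= instead
            *)    ARGS="$ARGS $f" ;;
        esac
    done
    echo "g4bl target.g4bl$ARGS first=... last=..."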

Here is the help text from "submit_g4bl help":

    submit_g4bl - submit multiple g4bl jobs to FermiGrid

    USAGE: submit_g4bl input.g4bl a=b c=1,10,2 d=1:5:17 ev=1,10000,1000

    This script will tar up the contents of the current directory,
    FOLLOWING symbolic links; the jobs it submits start out by un-taring
    into OSG_WN_TMP and using files from there; the jobs finish by copying
    g4beamline.root to the submit directory, using the name of the job.
    The "throttle" command is used to avoid overloads on /grid/data.

    This script creates a "Jobs.log" file containing the status of the
    jobs it submits. All output files are put into the Output directory.
    The current files in Output are moved to OldOutput at the start of
    this script.

    NOTE: arguments cannot contain spaces or semicolons, and they cannot
    contain commas or colons unless they are defining a loop or list.

    Job names are generated by the script, using the parameter
    names=values from the command line, separated by commas. The job name
    also defines the parameters to g4bl on its command line. The above
    command line generates 150 jobs with names:
        a=b,c=1,d=1,ev=1..1000
        a=b,c=1,d=1,ev=1001..2000
        ... 8 more with varying events
        a=b,c=1,d=5,ev=1..1000
        ... 9 more with varying events
        a=b,c=1,d=17,ev=1..1000
        ... 9 more with varying events
        a=b,c=3,d=1,ev=1..1000
        ... etc.
    The output files will be named for the job (with .root appended).
    The "combine" script can be used to combine the 150 output files
    into 15, putting them into the submit directory.

    Loops are defined by an argument like c=1,10,2 -- this generates a
    loop in the usual DO-loop fashion from 1 to 10 (inclusive),
    incrementing by 2; values are floats, not ints. The increment cannot
    be omitted.

    Lists are defined by an argument like d=1:5:17 -- this generates a
    loop that iterates the value of d over the elements in the list;
    values are strings, and can thus be any type (int, float, or string).

    If multiple loops or lists are given, the first is outermost and the
    last is innermost. Note also that ev=a,b,c is special and if present
    gives the innermost loop on event numbers (see below).

    Events are handled specially with an argument ev=1,10000,1000 -- this
    generates the innermost loop over event numbers, defining the
    variables first and last to g4bl, but appending ev=1..1000 to the
    first job name, ev=1001..2000 to the second job name, etc. Regardless
    of where ev=a,b,c appears in the command line, it always generates
    the innermost loop and comes last in the job name. (".." is the Ada
    operator for an inclusive interval.)

    The input.g4bl file MUST follow these conventions:
    * histoFile is not set (so the output file is g4beamline.root)
    * output format is root (the default)
    * input.g4bl and any auxiliary files it uses should be referenced
      from the current directory (e.g. magnet field maps, window
      profiles, etc.) -- they should be symbolically linked to the
      current directory before submit_g4bl is run, so they will be
      transferred to the worker node during job startup. Symbolic links
      to subdirectories are fine.
    * the only exception to the previous point is a large beam file,
      which should be referenced via an absolute pathname starting
      /grid/data/...
    * the beam command should include "firstEvent=$first lastEvent=$last",
      because the ev=... argument to submit_g4bl will be converted to
      "first=1 last=1000" (etc.) on the g4bl command-line.
    * the parameters used on the submit_g4bl command line (except ev=)
      should be used in input.g4bl to control the simulation

    This script throttles job startup and completion to avoid a focused
    overload on the disk (/grid/data).
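The combine script itself is not reproduced in this document. As a generic illustration only (not necessarily how combine works internally), ROOT's standard hadd utility can merge a set of per-job .root files into one; it adds histograms and concatenates NTuples:

    # merge the event-range files for one parameter point (hypothetical names)
    hadd a=b,c=1,d=1.root a=b,c=1,d=1,ev=*.root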

Example: Investigate a simple target in a solenoid

An example of using FermiGrid is my investigation of a simple target in a solenoid magnetic field. This is primarily a demonstration tool, and not really a design for a real-world target station.

G4beamline input file: target.g4bl

    * target.g4bl 20100220 TJR
    *
    * 8 GeV proton beam into a Tungsten target in a uniform Bz field.
    param -unset R=6 L=300 Bz=20 Angle=0
    physics QGSP_BERT
    beam gaussian particle=proton firstEvent=$first lastEvent=$last beamZ=0.0 \
        sigmaX=1.5 sigmaY=1.5 sigmaXp=0.005 sigmaYp=0.005 rotation=X$Angle \
        meanMomentum=8000.0 sigmaP=8 meanT=0.0 sigmaT=3
    trackcuts kill=nu_e,anti_nu_e,nu_mu,anti_nu_mu,e+,e-,neutron,gamma
    fieldexpr Field radius=2000 length=240000 Bz=$Bz
    place Field z=0
    tubs Pipe innerRadius=2000 outerRadius=2001 length=80000 kill=1
    place Pipe z=0 front=1
    cylinder Target outerRadius=$R length=$L material=W color=.6,.6,.6
    place Target z=$L/2 rotation=X$Angle y=$L/2*tan(-$Angle*3.14159/180)
    particlefilter Filter radius=2000 length=0.1 keep=pi+,pi-,mu+,mu- color=1,0,0
    place Filter
    zntuple zloop=1000,60000,1000
    place Filter z=60001

Note that target.g4bl uses these parameters:

    R       radius of the tungsten target
    L       length of the tungsten target
    Bz      Z component of the uniform solenoid field
    Angle   angle (degrees) of the target relative to the solenoid axis

The submit_g4bl script is used to scan over R, Bz, and Angle simultaneously (a loop over L as well was considered too many jobs; a previous scan gave 300 as its optimal value).

Before submitting a large number of jobs to FermiGrid, I ran a short test locally on the login node, to determine how many events to run in each job:

    g4bl target.g4bl R=6 L=300 Angle=2 Bz=20 first=1 last=1000

From the runtime of this command, I determined that 20,000 events should take about an hour.

Command line:

    submit_g4bl target.g4bl R=3:6:10:15 L=300 Angle=0:1:2:3:5 Bz=2:5:10:20 \
        ev=1,20000,20000

This command submits a total of 80 jobs. They scan 4 values of R, 5 values of Angle, and 4 values of Bz.

Output files:

    ls Output
    R=10,L=300,Angle=0,Bz=10,ev=1..20000.log
    R=10,L=300,Angle=0,Bz=10,ev=1..20000.out
    R=10,L=300,Angle=0,Bz=10,ev=1..20000.root
    R=10,L=300,Angle=0,Bz=2,ev=1..20000.log
    R=10,L=300,Angle=0,Bz=2,ev=1..20000.out
    R=10,L=300,Angle=0,Bz=2,ev=1..20000.root
    ... total of 240 files

For this task, sufficient events were generated in each job, so there is no need to combine multiple output files.

To analyze the output files, I wrote a Root macro that obtains the values of the parameters from the file names, and writes a summary file consisting of one row per file, with columns containing the parameters and interesting values from each run (number of muons, number of pions, transverse sigmas, etc.), as a function of Z (1 to 60 meters). This is too long to include here; ask for details.

Gotchas

There are a number of rather unwelcome surprises when using FermiGrid. Here are the ones I have found:

- A shell script that is to be run by condor must begin with #!/bin/bash. Usually this line is optional, but for condor it is mandatory.

- $OSG_WN_TMP is not unique to your job, so if you assume that it is and two of your jobs land on different cores of the same worker node, there will probably be trouble. My scripts work in $OSG_WN_TMP/job-name.

- Your proxy certificate may run out while your jobs are running. When this happens, they are automatically put on hold status in condor. Fortunately, you can fix this simply by obtaining a new proxy certificate and issuing the condor_release command. But only you can do this.
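To make the last two gotchas concrete, here is a small sketch. The job name is hypothetical, and "proxy" refers to the script mentioned under Preparation and Submission:

    # in a job script: always work in a per-job subdirectory of $OSG_WN_TMP
    mkdir -p "$OSG_WN_TMP/R=6,L=300,ev=1..20000"
    cd "$OSG_WN_TMP/R=6,L=300,ev=1..20000"

    # on a login node, if jobs went on hold because your proxy expired:
    proxy                    # obtain a fresh proxy certificate
    condor_q -hold           # list your held jobs and the hold reasons
    condor_release -all      # release all of your held jobs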