LCG Short Demo
  Markus Schulz, LCG
  FZK, 30 September 2003
LCG-1 Demo Outline
  Monitoring tools and where to get documentation
  Getting started
  Running simple jobs
  Using the information system
  More on JDL
  Data management
  Markus.Schulz@cern.ch
LCG-1 Deployment Status
  Up-to-date status can be seen here: http://www.grid-support.ac.uk/goc/monitoring/dashboard/dashboard.html
    Links to maps with the sites that are in operation
    Links to the GridICE-based monitoring tool (history of VOs' jobs, etc.), which uses information provided by the information system
    Tables with deployment status
  Sites currently in LCG-1 (expect 18-20 by the end of 2003):
    PIC-Barcelona (RB), Budapest (RB), CNAF (RB), FermiLab (FNAL), FZK, Krakow, Moscow (RB), RAL (RB), Taipei (RB), Tokyo
  Sites to enter soon:
    BNL, Prague, (Lyon), several tier-2 centres in Italy and Spain
  Sites preparing to join:
    Pakistan, Sofia, Switzerland
  Total number of CPUs: ~120 WNs (the number of sites matters more)
The Basics
  Get the LCG-1 User Guide: http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=eis/homepage
  Get a certificate
    Go to the CA that is responsible for you and request a user certificate
    The list of CAs can be found here: http://lcg-registrar.cern.ch/pki_certificates.html
    Follow the instructions on how to load the certificate into a web browser. Do this.
  Register with LCG and a VO of your choice: http://lcg-registrar.cern.ch/
  If your certificate is not in PEM format, convert it using openssl
    Ask your CA how to do this
  Find a user interface machine: http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1status
    We use adc0014 at
Get ready
  Check your certificate in ~/.globus
    $ grid-cert-info
  Is the certificate valid? This should return OK:
    $ openssl verify -CApath /etc/grid-security/certificates ~/.globus/usercert.pem
  Generate a proxy (valid for 12 h)
    $ grid-proxy-init       (will ask for your pass phrase)
    $ grid-proxy-info       (shows details, such as how many hours remain until the proxy expires)
    $ grid-proxy-destroy
  For long jobs, register a long-term credential with the proxy server
    $ myproxy-init -s adc0024 -d -n       (creates a proxy with one week duration)
    $ myproxy-info -s adc0024 -d
    $ myproxy-destroy -s adc0024 -d
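Since a proxy lasts only about 12 hours, scripts often check the remaining lifetime before submitting. The following is a minimal sketch of that check; proxy_needs_renewal is a hypothetical helper (not part of the Globus tools), and the 2-hour default threshold is an arbitrary assumption.

```shell
# Sketch only: proxy_needs_renewal is a hypothetical helper, not a Globus command.
# Arguments: remaining proxy lifetime in seconds, minimum acceptable hours (default 2).
# Succeeds (exit status 0) when the proxy has less time left than the threshold.
proxy_needs_renewal() {
    seconds_left=$1
    min_hours=${2:-2}
    [ "$seconds_left" -lt $((min_hours * 3600)) ]
}
```

Assuming grid-proxy-info -timeleft prints the remaining lifetime in seconds on your installation, it could be used on the UI like this:
  $ proxy_needs_renewal "$(grid-proxy-info -timeleft)" 2 && grid-proxy-init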
Job Submission
  Basic command: edg-job-submit test.jdl
    Many, many options; see the WLMS manual for details (docs for WLMS: http://www.infn.it/workload-grid)
    Try the -help option (very useful: -o to get the job id in a file)
  Tiny JDL file:
    executable = "testjob.sh";
    StdOutput = "testjob.out";
    StdError = "testjob.err";
    InputSandbox = {"./testjob.sh"};
    OutputSandbox = {"testjob.out","testjob.err"};
  Output of the submission:
    Connecting to host lxshare0380.cern.ch, port 7772
    Logging to host lxshare0380.cern.ch, port 9002
    ================================ edg-job-submit Success =====================================
    The job has been successfully submitted to the Network Server.
    Use edg-job-status command to check job current status. Your job identifier (edg_jobid) is:
    - https://lxshare0380.cern.ch:9000/1gmdxnfzed1o0b9bjfc3lw
    The edg_jobid has been saved in the following file:
    /afs/cern.ch/user/m/markusw/test/demo/out
    =============================================================================================
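When -o is not used, the job identifier has to be picked out of the submission output. The sketch below shows one way to do that with sed; extract_jobid is a hypothetical helper, and it assumes the identifier has the https://host:port/id shape seen in the output above.

```shell
# Sketch only: extract_jobid is a hypothetical helper, not an EDG command.
# Reads edg-job-submit output on stdin and prints the first job identifier,
# assuming identifiers look like https://<host>:<port>/<unique string>.
extract_jobid() {
    sed -n 's|.*\(https://[^[:space:]/]*:[0-9][0-9]*/[^[:space:]]*\).*|\1|p' | head -n 1
}
```

On the UI this could be combined with the submit command, for example:
  $ jobid=$(edg-job-submit test.jdl | extract_jobid)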
Work Load Management System
  Input Sandbox is what you take with you to the node
  Output Sandbox is what you get back
  Failed jobs are resubmitted
  [Figure: job flow through the WLMS. Job status: submitted, waiting, ready, scheduled, running, done, cleared. Stages: arrived on RB, matching, on CE, processed, output back, user done. Components: UI (sandbox), Network Server, Match-Maker/Broker, Job Adapter, Workload Manager, RB storage, Replica Catalog, Information Service (CE and SE characteristics and status), Job Controller - CondorG, Log Monitor, Logging & Bookkeeping, CE]
Work Load Management System
  The services that bring the resources and the jobs together
    They usually live on a node called the RB (Resource Broker)
    Keep track of the status of jobs (LBS: Logging and Bookkeeping Service)
    Talk to the Globus gatekeepers (CE) and resource managers (LRMS) on the remote sites
    Match jobs with sites where data and resources are available
    Resubmit jobs that fail
  Uses almost all services: IS, RLS, GSI, ...
    Walking through a job might be instructive (see the job flow diagram)
  The user describes the job and its requirements using JDL (Job Description Language):
    [
    JobType = "Normal";
    Executable = "gridtest";
    StdError = "stderr.log";
    StdOutput = "stdout.log";
    InputSandbox = {"home/joda/test/gridtest"};
    OutputSandbox = {"stderr.log", "stdout.log"};
    InputData = {"lfn:green", "guid:red"};
    DataAccessProtocol = "gridftp";
    Requirements = other.gluehostoperatingsystemnameopsys == "LINUX" && other.gluecestatefreecpus >= 4;
    Rank = other.gluecepolicymaxcputime;
    ]
  Docs for WLMS: http://www.infn.it/workload-grid
Where to Run?
  Before submitting a job you might want to see where you can run:
    edg-job-list-match <jdl>
  Switching RBs
    Use --config-vo <vo conf file> and --config <conf file> (see the User Guide)
    Find out which RBs you could use
  Sample output:
    Connecting to host lxshare0380.cern.ch, port 7772
    ***************************************************************************
    COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
    *CEId*
    adc0015.cern.ch:2119/jobmanager-lcgpbs-infinite
    adc0015.cern.ch:2119/jobmanager-lcgpbs-long
    adc0015.cern.ch:2119/jobmanager-lcgpbs-short
    adc0018.cern.ch:2119/jobmanager-pbs-infinite
    adc0018.cern.ch:2119/jobmanager-pbs-long
    adc0018.cern.ch:2119/jobmanager-pbs-short
    dgce0.icepp.s.u-tokyo.ac.jp:2119/jobmanager-lcgpbs-infinite
    dgce0.icepp.s.u-tokyo.ac.jp:2119/jobmanager-lcgpbs-long
    dgce0.icepp.s.u-tokyo.ac.jp:2119/jobmanager-lcgpbs-short
    grid-w1.ifae.es:2119/jobmanager-lcgpbs-infinite
    grid-w1.ifae.es:2119/jobmanager-lcgpbs-long
    grid-w1.ifae.es:2119/jobmanager-lcgpbs-short
    hik-lcg-ce.fzk.de:2119/jobmanager-lcgpbs-infinite
    hik-lcg-ce.fzk.de:2119/jobmanager-lcgpbs-long
    hik-lcg-ce.fzk.de:2119/jobmanager-lcgpbs-short
    hotdog46.fnal.gov:2119/jobmanager-pbs-infinite
    hotdog46.fnal.gov:2119/jobmanager-pbs-long
    hotdog46.fnal.gov:2119/jobmanager-pbs-short
    lcg00105.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite
    lcg00105.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-long
    lcg00105.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-short
    lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-infinite
    lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-long
    lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-short
    lhc01.sinp.msu.ru:2119/jobmanager-lcgpbs-infinite
    lhc01.sinp.msu.ru:2119/jobmanager-lcgpbs-long
    lhc01.sinp.msu.ru:2119/jobmanager-lcgpbs-short
    wn-02-29-a.cr.cnaf.infn.it:2119/jobmanager-lcgpbs-infinite
    wn-02-29-a.cr.cnaf.infn.it:2119/jobmanager-lcgpbs-long
    wn-02-29-a.cr.cnaf.infn.it:2119/jobmanager-lcgpbs-short
    zeus02.cyf-kr.edu.pl:2119/jobmanager-lcgpbs-infinite
    zeus02.cyf-kr.edu.pl:2119/jobmanager-lcgpbs-long
    zeus02.cyf-kr.edu.pl:2119/jobmanager-lcgpbs-short
    ***************************************************************************
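The match list repeats each host once per queue, so it can help to condense it. The sketch below assumes CE identifiers of the form host:port/jobmanager-queue, as in the listing above; ce_hosts and ce_queues are hypothetical helper names, not EDG commands.

```shell
# Sketch only: ce_hosts and ce_queues are hypothetical helpers, not EDG commands.
# Both read edg-job-list-match output on stdin and ignore lines that are not CE ids.

# Print each matching CE host once, sorted.
ce_hosts() {
    sed -n 's|^[[:space:]]*\([^[:space:]:/]*\):[0-9][0-9]*/jobmanager-.*|\1|p' | sort -u
}

# Print the job managers (queues) offered by the host given as first argument.
ce_queues() {
    awk -v host="$1" '
        { sub(/^[ \t]+/, "") }            # drop leading whitespace
        index($0, host ":") == 1 {        # line starts with "host:"
            n = split($0, parts, "/")
            print parts[n]                # text after the last "/"
        }'
}
```

For example, on the UI:
  $ edg-job-list-match test.jdl | ce_hosts
  $ edg-job-list-match test.jdl | ce_queues hik-lcg-ce.fzk.de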
And then?
  Check the status: edg-job-status -v <0|1|2> -i <file with id>
    Many options; play with it, try -help
    Use --noint when working with scripts
  In case of problems: edg-job-get-logging-info (shows a lot of information)
    The amount of detail is controlled by the -v option
  Get the output sandbox: edg-job-get-output (its options also work on collections of jobs)
    Output in /tmp/joboutput/1gmdxnfzed1o0b9bjfc3lw
  Remove the job: edg-job-cancel
    Getting the output cancels the job; canceling a canceled job is an error
Information System
  Have a look at the status page to find a BDII
  Query the BDII (use an LDAP browser or the ldapsearch command)
    Sample: BDII at lxshare0222.cern.ch
  Have a look at the man pages and explore the BDII, RGIIS, CE and SE
  BDII
    ldapsearch -LLL -x -H ldap://lxshare0222.cern.ch:2170 -b "mds-vo-name=local,o=grid" "(objectclass=gluece)" dn
  Regional GIIS
    ldapsearch -LLL -x -H ldap://adc0026.cern.ch:2135 -b "mds-vo-name=lcgeast,o=grid" "(objectclass=gluece)" dn
  CE
    ldapsearch -LLL -x -H ldap://adc0018.cern.ch:2135 -b "mds-vo-name=local,o=grid"
  SE
    ldapsearch -LLL -x -H ldap://adc0021.cern.ch:2135 -b "mds-vo-name=local,o=grid"
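The four queries above differ only in host, port, VO name and object class. As a rough sketch, a small wrapper can assemble the command line from those parts; glue_query is a hypothetical name, and it only prints the command so it can be inspected before it is run.

```shell
# Sketch only: glue_query is a hypothetical helper, not part of OpenLDAP or LCG.
# Arguments: host, port, mds-vo-name, GLUE object class.
# Prints (does not run) an ldapsearch command that lists the matching DNs.
glue_query() {
    host=$1
    port=$2
    vo_name=$3
    objectclass=$4
    printf 'ldapsearch -LLL -x -H ldap://%s:%s -b "mds-vo-name=%s,o=grid" "(objectclass=%s)" dn\n' \
        "$host" "$port" "$vo_name" "$objectclass"
}
```

To execute the printed command on a machine with ldapsearch installed:
  $ eval "$(glue_query lxshare0222.cern.ch 2170 local gluece)"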
GLUE SCHEMA
  See Appendix B in the LCG-1 User Guide
  Many categories; some attributes that are defined might not be filled yet
  Describes CE, cluster, hosts, SE, batch system, etc.
  Too many for this presentation: see the User Guide
GLUE SCHEMA
  Attributes for the Computing Element
  CE (objectclass GlueCE)
    GlueCEUniqueID: unique identifier for the CE
    GlueCEName: human-readable name of the service
  Info (objectclass GlueCEInfo)
    GlueCEInfoLRMSType: name of the local batch system
    GlueCEInfoLRMSVersion: version of the local batch system
    GlueCEInfoGRAMVersion: version of GRAM
    GlueCEInfoHostName: fully qualified name of the host where the gatekeeper runs
    GlueCEInfoGateKeeperPort: port number for the gatekeeper
    GlueCEInfoTotalCPUs: number of CPUs in the cluster associated to the CE
  Policy (objectclass GlueCEPolicy)
    GlueCEPolicyMaxWallClockTime: maximum wall clock time available to jobs submitted to the CE
    GlueCEPolicyMaxCPUTime: maximum CPU time available to jobs submitted to the CE
    GlueCEPolicyMaxTotalJobs: maximum allowed total number of jobs in the queue
    GlueCEPolicyMaxRunningJobs: maximum allowed number of running jobs in the queue
    GlueCEPolicyPriority: information about the service priority
  State (objectclass GlueCEState)
    GlueCEStateRunningJobs: number of running jobs
    GlueCEStateWaitingJobs: number of jobs not running
    GlueCEStateTotalJobs: total number of jobs (running + waiting)
    GlueCEStateStatus: queue status: queueing (jobs are accepted but not run), production (jobs are accepted and run), closed (jobs are neither accepted nor run), draining (jobs are not accepted but those in the queue are run)
    GlueCEStateWorstResponseTime: worst possible time between the submission of a job and the start of its execution
    GlueCEStateEstimatedResponseTime: estimated time between the submission of a job and the start of its execution
    GlueCEStateFreeCPUs: number of CPUs available to the scheduler
  Job (objectclass GlueCEJob; currently not filled, the Logging and Bookkeeping service can provide this information)
    GlueCEJobLocalOwner: local user name of the job's owner
    GlueCEJobGlobalOwner: GSI subject of the real job's owner
    GlueCEJobLocalID: local job identifier
    GlueCEJobGlobalId: global job identifier
    GlueCEJobStatus: job status: SUBMITTED, WAITING, READY, SCHEDULED, RUNNING, ABORTED, DONE, CLEARED, CHECKPOINTED
    GlueCEJobSchedulerSpecific: any scheduler-specific information
  Access control (objectclass GlueCEAccessControlBase)
    GlueCEAccessControlBaseRule: a rule defining any access restrictions to the CE. Current semantics: VO = a VO name, DENY = an X.509 user subject
  Cluster (objectclass GlueCluster)
    GlueClusterUniqueID: unique identifier for the cluster
    GlueClusterName: human-readable name of the cluster
  Subcluster (objectclass GlueSubCluster)
    GlueSubClusterUniqueID: unique identifier for the subcluster
    GlueSubClusterName: human-readable name of the subcluster
GLUE SCHEMA
  Host (objectclass GlueHost)
    GlueHostUniqueId: unique identifier for the host
    GlueHostName: human-readable name of the host
  Architecture (objectclass GlueHostArchitecture)
    GlueHostArchitecturePlatformType: platform description
    GlueHostArchitectureSMPSize: number of CPUs
  Operating system (objectclass GlueHostOperatingSystem)
    GlueHostOperatingSystemOSName: OS name
    GlueHostOperatingSystemOSRelease: OS release
    GlueHostOperatingSystemOSVersion: OS or kernel version
  Benchmark (objectclass GlueHostBenchmark)
    GlueHostBenchmarkSI00: SpecInt2000 benchmark
  Application software (objectclass GlueHostApplicationSoftware)
    GlueHostApplicationSoftwareRunTimeEnvironment: list of software installed on this host
  Processor (objectclass GlueHostProcessor)
    GlueHostProcessorVendor: name of the CPU vendor
    GlueHostProcessorModel: name of the CPU model
    GlueHostProcessorVersion: version of the CPU
    GlueHostProcessorOtherProcessorDescription: other description for the CPU
    GlueHostProcessorClockSpeed: clock speed of the CPU
    GlueHostProcessorInstructionSet: name of the instruction set architecture of the CPU
    GlueHostProcessorFeatures: list of optional features of the CPU
    GlueHostProcessorCacheL1: size of the unified L1 cache
    GlueHostProcessorCacheL1I: size of the instruction L1 cache
    GlueHostProcessorCacheL1D: size of the data L1 cache
    GlueHostProcessorCacheL2: size of the unified L2 cache
  Main memory (objectclass GlueHostMainMemory)
    GlueHostMainMemoryRAMSize: physical RAM
    GlueHostMainMemoryRAMAvailable: unallocated RAM
    GlueHostMainMemoryVirtualSize: size of the configured virtual memory
    GlueHostMainMemoryVirtualAvailable: available virtual memory
  Network adapter (objectclass GlueHostNetworkAdapter)
    GlueHostNetworkAdapterName: name of the network card
    GlueHostNetworkAdapterIPAddress: IP address of the network card
    GlueHostNetworkAdapterMTU: the MTU size for the LAN to which the network card is attached
    GlueHostNetworkAdapterOutboundIP: permission for outbound connectivity
    GlueHostNetworkAdapterInboundIP: permission for inbound connectivity
  Processor load (objectclass GlueHostProcessorLoad)
    GlueHostProcessorLoadLast1Min: one-minute average processor availability for a single node
    GlueHostProcessorLoadLast5Min: 5-minute average processor availability for a single node
    GlueHostProcessorLoadLast15Min: 15-minute average processor availability for a single node
GLUE SCHEMA
  SMP load (objectclass GlueHostSMPLoad)
    GlueHostSMPLoadLast1Min: one-minute average processor availability for a single node
    GlueHostSMPLoadLast5Min: 5-minute average processor availability for a single node
    GlueHostSMPLoadLast15Min: 15-minute average processor availability for a single node
  Storage device (objectclass GlueHostStorageDevice)
    GlueHostStorageDeviceName: name of the storage device
    GlueHostStorageDeviceType: storage device type
    GlueHostStorageDeviceTransferRate: maximum transfer rate for the device
    GlueHostStorageDeviceSize: size of the device
    GlueHostStorageDeviceAvailableSpace: amount of free space
  Local file system (objectclass GlueHostLocalFileSystem)
    GlueHostLocalFileSystemRoot: path name or other information defining the root of the file system
    GlueHostLocalFileSystemSize: size of the file system in bytes
    GlueHostLocalFileSystemAvailableSpace: amount of free space in bytes
    GlueHostLocalFileSystemReadOnly: true if the file system is read-only
    GlueHostLocalFileSystemType: file system type
    GlueHostLocalFileSystemName: the name for the file system
    GlueHostLocalFileSystemClient: host unique id of clients allowed to remotely access this file system
  Remote file system (objectclass GlueHostRemoteFileSystem)
    GlueHostRemoteFileSystemRoot: path name or other information defining the root of the file system
    GlueHostRemoteFileSystemSize: size of the file system in bytes
    GlueHostRemoteFileSystemAvailableSpace: amount of free space in bytes
    GlueHostRemoteFileSystemReadOnly: true if the file system is read-only
    GlueHostRemoteFileSystemType: file system type
    GlueHostRemoteFileSystemName: the name for the file system
    GlueHostRemoteFileSystemServer: host unique id of the server which provides access to the file system
  File (objectclass GlueHostFile)
    GlueHostFileName: name for the file
    GlueHostFileSize: file size in bytes
    GlueHostFileCreationDate: file creation date and time
    GlueHostFileLastModified: date and time of the last modification of the file
    GlueHostFileLastAccessed: date and time of the last access to the file
    GlueHostFileLatency: time taken to access the file in seconds
    GlueHostFileLifeTime: time for which the file will stay on the storage device
    GlueHostFileOwner: name of the owner of the file
GLUE SCHEMA
  Attributes for the Storage Element
  Storage Service (objectclass GlueSE)
    GlueSEUniqueId: unique identifier of the storage service (URI)
    GlueSEName: human-readable name for the service
    GlueSEPort: port number that the service listens on
    GlueSEHostingSL: unique identifier of the storage library hosting the service
  Storage Service State (objectclass GlueSEState)
    GlueSEStateCurrentIOLoad: system load (for example, number of files in the queue)
  Storage Service Access Protocol (objectclass GlueSEAccessProtocol)
    GlueSEAccessProtocolType: protocol type to access or transfer files
    GlueSEAccessProtocolPort: port number for the protocol
    GlueSEAccessProtocolVersion: protocol version
    GlueSEAccessProtocolAccessTime: time to access a file using this protocol
    GlueSEAccessProtocolSupportedSecurity: security features supported by the protocol
  Storage Library (objectclass GlueSL)
    GlueSLName: human-readable name of the storage library
    GlueSLUniqueId: unique identifier of the machine providing the storage service
    GlueSLService: unique identifier for the provided storage service
  Local File system (objectclass GlueSLLocalFileSystem)
    GlueSLLocalFileSystemRoot: path name (or other information) defining the root of the file system
    GlueSLLocalFileSystemName: name of the file system
    GlueSLLocalFileSystemType: file system type (e.g. NFS, AFS, etc.)
    GlueSLLocalFileSystemReadOnly: true if the file system is read-only
    GlueSLLocalFileSystemSize: total space assigned to this file system
    GlueSLLocalFileSystemAvailableSpace: total free space in this file system
    GlueSLLocalFileSystemClient: unique identifiers of clients allowed to access the file system remotely
    GlueSLLocalFileSystemServer: unique identifier of the server exporting this file system (only for remote file systems)
  Remote File system (objectclass GlueSLRemoteFileSystem)
    GlueSLRemoteFileSystemRoot: path name (or other information) defining the root of the file system
    GlueSLRemoteFileSystemName: name of the file system
    GlueSLRemoteFileSystemType: file system type (e.g. NFS, AFS, etc.)
    GlueSLRemoteFileSystemReadOnly: true if the file system is read-only
    GlueSLRemoteFileSystemSize: total space assigned to this file system
    GlueSLRemoteFileSystemAvailableSpace: total free space in this file system
    GlueSLRemoteFileSystemServer: unique identifier of the server exporting this file system
GLUE SCHEMA
  File Information (objectclass GlueSLFile)
    GlueSLFileName: file name
    GlueSLFileSize: file size
    GlueSLFileCreationDate: file creation date and time
    GlueSLFileLastModified: date and time of the last modification of the file
    GlueSLFileLastAccessed: date and time of the last access to the file
    GlueSLFileLatency: time needed to access the file
    GlueSLFileLifeTime: file lifetime
    GlueSLFilePath: file path
  Directory Information (objectclass GlueSLDirectory)
    GlueSLDirectoryName: directory name
    GlueSLDirectorySize: directory size
    GlueSLDirectoryCreationDate: directory creation date and time
    GlueSLDirectoryLastModified: date and time of the last modification of the directory
    GlueSLDirectoryLastAccessed: date and time of the last access to the directory
    GlueSLDirectoryLatency: time needed to access the directory
    GlueSLDirectoryLifeTime: directory lifetime
    GlueSLDirectoryPath: directory path
  Architecture (objectclass GlueSLArchitecture)
    GlueSLArchitectureType: type of storage hardware (e.g. disk, RAID array, tape library, etc.)
  Performance (objectclass GlueSLPerformance)
    GlueSLPerformanceMaxIOCapacity: maximum bandwidth between the service and the network
  Storage Space (objectclass GlueSA)
    GlueSARoot: pathname of the directory containing the files of the storage space
  Policy (objectclass GlueSAPolicy)
    GlueSAPolicyMaxFileSize: maximum file size
    GlueSAPolicyMinFileSize: minimum file size
    GlueSAPolicyMaxData: maximum allowed amount of data that a single job can store
    GlueSAPolicyMaxNumFiles: maximum allowed number of files that a single job can store
    GlueSAPolicyMaxPinDuration: maximum allowed lifetime for non-permanent files
    GlueSAPolicyQuota: total available space
    GlueSAPolicyFileLifeTime: lifetime policy for the contained files
  Access Control Base (objectclass GlueSAAccessControlBase)
    GlueSAAccessControlBaseRule: list of the access control rules
  State (objectclass GlueSAState)
    GlueSAStateAvailableSpace: total space available in the storage space
    GlueSAStateUsedSpace: used space in the storage space
More on JDL
  Based on Condor ClassAds syntax (the parser is very sensitive)
  Simple statements: attribute = value;
    Arguments = "1 2 3 -wall";     (passes arguments to the executable)
    The input sandbox can handle wildcards such as * and ?
    Environment = {"DTEAM_PATH=$HOME/dteam", "TEAM=dteam"};
    OutputSE = "adc0021.cern.ch";     (selects a CE close to this SE for the job)
  A larger example:
    [
    InputSandbox = {"home/joda/test/gridtest", "/tmp/test/*"};
    OutputSandbox = {"stderr.log", "stdout.log"};
    InputData = {"lfn:green", "guid:red"};
    DataAccessProtocol = {"file", "gridftp"};
    Requirements = other.gluehostoperatingsystemnameopsys == "LINUX" && other.gluecestatefreecpus >= 4 && Member("alice3-4", other.gluehostapplicationsoftwareruntimeenvironment);
    Rank = other.gluecestatefreecpus;
    MyProxyServer = "wn-02-36-a.cr.cnaf.infn.it";
    RetryCount = 7;
    ]
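Because the JDL parser is sensitive to quoting and semicolons, generating the file from a script avoids hand-typing mistakes. The following is a minimal sketch that prints a JDL in the style of the tiny testjob.sh example; write_jdl is a hypothetical helper, and the choice of attributes is an assumption rather than a complete job description.

```shell
# Sketch only: write_jdl is a hypothetical helper, not an EDG command.
# Arguments: path to a job script, followed by optional arguments for the job.
# Prints a small JDL description on stdout, quoting every string value.
write_jdl() {
    script=$1
    shift
    name=$(basename "$script")
    base=${name%.sh}                      # testjob.sh -> testjob
    printf 'Executable = "%s";\n' "$name"
    if [ "$#" -gt 0 ]; then
        printf 'Arguments = "%s";\n' "$*"
    fi
    printf 'StdOutput = "%s.out";\n' "$base"
    printf 'StdError = "%s.err";\n' "$base"
    printf 'InputSandbox = {"%s"};\n' "$script"
    printf 'OutputSandbox = {"%s.out","%s.err"};\n' "$base" "$base"
}
```

For example:
  $ write_jdl ./testjob.sh 1 2 3 -wall > test.jdl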
More on JDL
  OutputData specifies where files should go
    If no LFN is specified, WP2 selects one
    If no SE is specified, the close SE is chosen
    At the end of the job the files are moved from the WN and registered
    A file with the result of this operation is created and added to the sandbox: DSUpload_<unique jobstring>.out
  Example:
    OutputData = {
      [
      OutputFile = "toto.out";
      StorageElement = "adc0021.cern.ch";
      LogicalFileName = "thebesttotoever";
      ],
      [
      OutputFile = "toto2.out";
      StorageElement = "adc0021.cern.ch";
      LogicalFileName = "thebesttotoever2";
      ]
    };
Data
  Users should use high-level tools (references and details: User Guide for LCG and WP2)
    Avoid globus-url-copy and the edg-gridftp-X tools
    Except maybe the following: X = exists, ls, mkdir
  The edg-replica-manager tools (edg-rm) allow you to:
    Move files around (UI->SE, WN->SE)
    Register files in the RLS
    Replicate them between SEs
    Many options: see -help and the documentation
  Move a file from the UI to an SE
    Where to?
      edg-rm --vo=dteam printinfo
    Copy and register:
      edg-rm --vo=dteam copyandregisterfile file:`pwd`/load -d srm://adc0021.cern.ch/flatfiles/se00/dteam/markus/t1 -l lfn:markus1
    guid:dc9760d7-f36a-11d7-864b-925f9e8966fe is returned
    A hostname is sufficient for -d (without -d, the RM decides where the file goes)
Data
  Replicate a file to another SE (guid needed):
    edg-rm --vo=dteam replicatefile guid:dc9760d7-f36a-11d7-864b-925f9e8966fe -d wn-02-30-a.cr.cnaf.infn.it
  To list replicas:
    edg-rm --vo=dteam listreplicas guid:dc9760d7-f36a-11d7-864b-925f9e8966fe
  To delete replicas, use:
    deletefile guid:xxx -s se.cern.ch
  To find all aliases of a file:
    First: edg-rm -i --vo=dteam listguid lfn:mm2     (returns the guid)
    Then: edg-rmc -I aliasesforguid -h rlsdteam.cern.ch -p 7777 --vo=dteam guid
  Listing an SE directory:
    edg-rm -i --vo=dteam list srm://adc0021.cern.ch/flatfiles/se00/dteam/markus/     (broken)
    Use instead: edg-gridftp-ls --verbose gsiftp://adc0021.cern.ch/flatfiles/se00/dteam/markus
File access from a job
  The WLMS (RB) creates the .brokerinfo file and moves it to the WN
    This is used to answer questions about the site you are on
  Get the first input file name (uses .brokerinfo):
    infile=`edg-brokerinfo getinputdata | cut -d " " -f 1`
  Get the first close SE:
    closese=`edg-brokerinfo getclosese | cut -d " " -f 1`
  Get the TURL:
    TURL=`edg-rm --vo=dteam gbf $infile -d $closese -t file`
  Get the file name:
    Localfile=`echo $TURL | cut -d : -f 2`
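The last step splits the TURL at the first colon to obtain a local path. A minimal sketch of the same split using shell parameter expansion follows; turl_protocol and turl_to_path are hypothetical helper names. Unlike cut -d : -f 2, this keeps everything after the first colon, so a path that itself contains a colon is not truncated.

```shell
# Sketch only: hypothetical helpers for splitting a TURL of the form protocol:path.

# Print the protocol part, i.e. everything before the first ":".
turl_protocol() {
    printf '%s\n' "${1%%:*}"
}

# Print the path part, i.e. everything after the first ":".
turl_to_path() {
    printf '%s\n' "${1#*:}"
}
```

Inside a job this could replace the cut call, for example:
  Localfile=`turl_to_path "$TURL"`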