Grid Computing: Introduction & Parachute method
  Socle 2006, Clermont-Ferrand (@lal Orsay)
  Olivier Dadoun, LAL, Orsay
  dadoun@lal.in2p3.fr
  www.dadoun.net
  October 2006
Contents
  - Preamble
  - Introduction to Grid Computing
  - Authentication & authorization
  - Job submission examples
  - Parachute method
  - Conclusion
  Socle October 2006 Olivier Dadoun
Preamble
  One of our goals is to evaluate the background in the detector due to backscattered secondaries from losses along the extraction line of:
  - the disrupted beam,
  - the Compton,
  - the pairs.
  This is CPU-time consuming. The tool used to answer it: BDSIM, based on Geant4.
  Running BDSIM for 500K disrupted-beam particles takes one week with 160 jobs (~60 days of CPU time on a 2.8 GHz Intel CPU with 2 GB), and batch jobs wait a long time in the queue.
  So I decided to use the Grid (with a Parachute).
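As a quick check of the figures above, the following sketch works out the load per job. It assumes (this is not stated on the slide) that the ~60 days is the total CPU time summed over all jobs and that particles and CPU time are split evenly across the 160 jobs. The variable names are illustrative only.

```python
# Figures quoted on the slide.
particles = 500_000      # disrupted-beam particles
jobs = 160               # batch jobs
total_cpu_days = 60      # approximate; assumed to be the sum over all jobs

# Assumption: the work is split evenly across the jobs.
particles_per_job = particles // jobs
cpu_hours_per_job = total_cpu_days * 24 / jobs

print(f"{particles_per_job} particles per job")
print(f"about {cpu_hours_per_job:.0f} CPU hours per job")
```

Under these assumptions each job needs only a few CPU hours, which suggests that the one-week turnaround is dominated by queue time rather than by computation.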
Introduction
  Definition:
  - Allow scientists from multiple domains to use, share, and manage geographically distributed resources transparently.
  - "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." (The Grid, I. Foster and C. Kesselman, 1998)
  The name's origin:
  - In analogy with the power grid, a computational grid should be easy to use, hiding the complex internal processes.
  Virtual Organization (VO):
  - An organization of people from different institutions with common goals, who share computational resources to achieve those goals, is a Virtual Organization (VO) from the Grid point of view.
Major European Grid Projects
  European funded:
  - European DataGrid
  - CrossGrid
  - DataTAG
  - DEISA
  - LHC Computing Grid
  - EGEE
Infrastructure LCG / EGEE
  Enabling Grids for E-sciencE (EGEE):
  - Provide and manage a European grid infrastructure to support researchers from many disciplines (Biomedical Applications, Earth Science, Computational Chemistry, and High-Energy Physics).
  LHC Computing Grid (LCG):
  - Prepare, deploy, and operate the computing environment that allows physicists to analyze the data from the LHC detectors.
  LCG and EGEE have similar aims:
  - LCG: worldwide collaboration (one field)
  - EGEE: European grid (many fields)
LCG/EGEE Production Service
  > 200 sites, > 20 kCPU, > 13 PB
Virtual Organization
  "A set of individuals and/or institutions defined by such sharing rules is what we call a virtual organization." (I. Foster, C. Kesselman, S. Tuecke, 2000)
  A VO represents a collaboration defined by:
  - people from different institutions with common goals
  - shared computational resources to achieve those goals
  - the same data
  - the same analysis rules
  - the same access rights
  The ILC and CALICE VOs already exist and are in use: why not use them?
ILC @ Grid
  - The ILC and CALICE VOs are hosted at DESY.
  - CALICE is supported by DESY and Imperial College (IC).
  - Registration to ILC and CALICE is managed by LCG (http://lcgregistrar.cern.ch); they have become so-called global VOs in EGEE.
  - ILC is currently supported by ~10 UKI sites, LAL, DESY, ... (as of 04/04/2006: 27 CEs, 3500 CPUs, 42 TB, 6 RBs).
  - The CALICE test beam data were moved between DESY and IC using Grid tools (GridFTP, SRM, LFC).
  Source: Andreas Gellrich, DESY, ILC Meeting, Cambridge, 04.04.2006 (http://grid.desy.de/talks/)
  NB: CCIN2P3 has supported the ILC and CALICE VOs since last summer. To use the grid there, copy your .globus directory into your home directory and set up the grid environment with the lcg_env.sh (.csh) file in $THRONG_DIR.
Schematics
  [Figure: schematic of job submission and data transfer. Labels: Resource Broker, job submission, JDL, User Interface, cert, ssh, data transfer, output, Computing Element, NFS, Storage Element, disks, workers.]
What do we need?
  A User Interface (UI) account, and then:
  1. Authentication (i.e. who are you?)
     - Certificate Authorities (CA), electronic certificate (cert.), X.509
     - The user generates a time-limited proxy
  2. Authorization (i.e. what can you do?)
     - Done by the Virtual Organization (VO)
  Public Key Infrastructure: uses the Grid Security Infrastructure (GSI) from Globus.
Authentication & authorization (1)
  1. Personal certificate: https://igc.services.cnrs.fr/grid-fr
  - For Mac OS X users: don't use Safari.
  - For any OS, I suggest Firefox (at least for the grid site).
Authentication & authorization (2)
  3. Export, convert, and install your certificate
  4. VO registration: https://lcg-registrar.cern.ch/cgi-bin/register/account.pl
Proxy and myproxy
  Create a proxy: grid-proxy-init (default lifetime: 12 h)
      lx2/dadoun % grid-proxy-init
      Your identity: /O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Olivier Dadoun
      Enter GRID pass phrase for this identity:
      Creating proxy... Done
      Your proxy is valid until: Sat Sep 23 02:25:33 2006
  Delete a proxy: grid-proxy-destroy
  Information on your proxy: grid-proxy-info
  If you need a longer-lived proxy, use a proxy server (<host_name> is the name of the proxy server):
      myproxy-init -d -s <host_name>
      myproxy-info -d -s <host_name>
      myproxy-destroy -d -s <host_name>
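Since a proxy is time-limited, it can be useful to check how much lifetime is left before submitting long jobs. The sketch below parses the "valid until" timestamp in the form shown in the transcript above. The function names and the one-hour margin are hypothetical, and it assumes an English (C) locale for the day and month names.

```python
from datetime import datetime, timedelta

# Format of the "valid until" timestamp as it appears in the transcript above.
VALID_UNTIL_FORMAT = "%a %b %d %H:%M:%S %Y"


def proxy_time_left(valid_until: str, now: datetime) -> timedelta:
    """Return the time remaining before the proxy expires (negative if expired)."""
    expiry = datetime.strptime(valid_until, VALID_UNTIL_FORMAT)
    return expiry - now


def needs_renewal(valid_until: str, now: datetime,
                  margin: timedelta = timedelta(hours=1)) -> bool:
    """True when less than `margin` (a hypothetical safety margin) is left."""
    return proxy_time_left(valid_until, now) < margin


# Expiry time taken from the transcript; "now" is a made-up example value.
expiry_text = "Sat Sep 23 02:25:33 2006"
now = datetime(2006, 9, 22, 14, 25, 33)
print(proxy_time_left(expiry_text, now))
print(needs_renewal(expiry_text, now))
```

In practice the timestamp would come from the output of grid-proxy-init or grid-proxy-info rather than from a hard-coded string.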
Hello World submission level 0 (1)
  JDL (HelloWord_level0.jdl):
      Executable = "/bin/echo";
      Arguments = "HelloWorld";
      StdError = "hello.err";
      StdOutput = "hello.out";
      OutputSandbox = {"hello.out", "hello.err"};
  Submission:
      lx2/dadoun % edg-job-submit --vo ilc -o out HelloWord_level0.jdl
      Selected Virtual Organisation name (from --vo option): ilc
      Connecting to host grid09.lal.in2p3.fr, port 7772
      Logging to host grid09.lal.in2p3.fr, port 9002
      ================= edg-job-submit Success ====================
      The job has been successfully submitted to the Network Server.
      Use edg-job-status command to check job current status.
      Your job identifier (edg_jobid) is:
      - https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
      The edg_jobid has been saved in the following file:
      /users/delphi/dadoun/datagridtutorial/test/out
      ======================================================
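The JDL above is a short list of "Name = value;" attributes, so it can be generated rather than typed by hand. The following is a minimal sketch, not part of the original talk: the helper names jdl_value and render_jdl are hypothetical, and it only handles the two value kinds used on this slide (quoted strings and lists of quoted strings).

```python
def jdl_value(value):
    """Format a Python value as a quoted JDL string or a {...} list of strings."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(f'"{item}"' for item in value) + "}"
    return f'"{value}"'


def render_jdl(attributes):
    """Render a dict of attribute names and values as 'Name = value;' lines."""
    lines = [f"{name} = {jdl_value(value)};" for name, value in attributes.items()]
    return "\n".join(lines) + "\n"


# The attributes of the level 0 example from the slide.
hello_level0 = {
    "Executable": "/bin/echo",
    "Arguments": "HelloWorld",
    "StdError": "hello.err",
    "StdOutput": "hello.out",
    "OutputSandbox": ["hello.out", "hello.err"],
}
print(render_jdl(hello_level0), end="")
```

The rendered text can be written to a .jdl file and passed to edg-job-submit as shown in the transcript.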
Hello World submission level 0 (2)
      lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
      *************************************************************
      BOOKKEEPING INFORMATION:
      Status info for the Job : https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
      Current Status: Done (Success)
      Exit code: 0
      Status Reason: Job terminated successfully
      Destination: ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long
      reached on: Wed Sep 20 10:14:42 2006
      *************************************************************
  The job ran successfully when the status is Done (Success) and the exit code is 0.
  NB: if the exit code != 0, the job had a problem while running; the stderr can help to debug.
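The success criterion above (status Done (Success) and exit code 0) can be checked automatically. The sketch below parses status text laid out one "key: value" pair per line, which is an assumption about the real edg-job-status layout since the transcript here is flattened. The names parse_status and job_succeeded are hypothetical.

```python
# Sample text copied from the transcript, assuming one field per line.
SAMPLE_STATUS = """\
Status info for the Job : https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long
reached on: Wed Sep 20 10:14:42 2006
"""


def parse_status(text):
    """Split each 'key: value' line on its first colon and return a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip separator lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def job_succeeded(fields):
    """Apply the criterion from the slide: Done (Success) and exit code 0."""
    return (fields.get("Current Status") == "Done (Success)"
            and fields.get("Exit code") == "0")


status = parse_status(SAMPLE_STATUS)
print(status["Current Status"], job_succeeded(status))
```

Splitting on the first colon only keeps values such as the destination, which itself contains a port number, intact.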
Hello World submission level 0 (3)
      lx2/dadoun % edg-job-get-output https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
      Retrieving files from host: grid09.lal.in2p3.fr ( for https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg )
      *********************************************************************************
      JOB GET OUTPUT OUTCOME
      Output sandbox files for the job:
      - https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
      have been successfully retrieved and stored in the directory:
      /users/delphi//dadoun/joboutput/dadoun_ma4eskm9sxt85bjb4onvdg
      *********************************************************************************
  We can check that HelloWorld is in the stdout file and that the stderr file is empty (exit code 0).
Hello World submission level 1 (1)
  JDL with an InputSandbox (HelloWord.jdl):
      Executable = "HelloWord.sh";
      StdOutput = "hello.out";
      StdError = "hello.err";
      InputSandbox = {"HelloWord.sh"};
      OutputSandbox = {"hello.out", "hello.err"};
  HelloWord.sh:
      #!/bin/bash
      echo HelloWord
  InputSandbox: cannot exceed a few MB.
  Submission:
      lx2/dadoun % edg-job-submit --vo ilc -o out HelloWord.jdl
      Selected Virtual Organisation name (from --vo option): ilc
      Connecting to host grid09.lal.in2p3.fr, port 7772
      Logging to host grid09.lal.in2p3.fr, port 9002
      ============edg-job-submit Success ==============================
      The job has been successfully submitted to the Network Server.
      Use edg-job-status command to check job current status.
      Your job identifier edg_jobid is:
      https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      The edg_jobid has been saved in the following file:
      /users/delphi/dadoun/datagridtutorial/test/out
      =============================================================
Hello World submission level 1 (2)
  First check: the job is scheduled.
      lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      *************************************************************
      BOOKKEEPING INFORMATION:
      Status info for the Job : https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      Current Status: Scheduled
      Status Reason: Job successfully submitted to Globus
      Destination: fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-ilc
      reached on: Mon Sep 18 15:07:47 2006
      *************************************************************
  Second check: the job is running.
      lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      *************************************************************
      BOOKKEEPING INFORMATION:
      Status info for the Job : https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      Current Status: Running
      Status Reason: Job successfully submitted to Globus
      Destination: fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-ilc
      reached on: Mon Sep 18 15:11:25 2006
      *************************************************************
Hello World submission level 1 (3)
  Third check: the job is done.
      lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      *************************************************************
      BOOKKEEPING INFORMATION:
      Status info for the Job : https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      Current Status: Done (Success)
      Exit code: 0
      Status Reason: Job terminated successfully
      Destination: fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-ilc
      reached on: Mon Sep 18 15:13:48 2006
      *************************************************************
  Retrieve the output:
      lx2/dadoun % edg-job-get-output https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      Retrieving files from host: grid09.lal.in2p3.fr ( for https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig )
      *********************************************************************************
      JOB GET OUTPUT OUTCOME
      Output sandbox files for the job:
      - https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
      have been successfully retrieved and stored in the directory:
      /users/delphi//dadoun/joboutput/dadoun_3fpxyrq8cbcdxokz-qjnig
      *********************************************************************************
LCG commands (LHC Computing Grid)
  Configuration:
      export LCG_CATALOG_TYPE=lfc
      export LFC_HOST=grid-lfc.desy.de
  Useful commands:
  - List a file or directory:
      lfc-ls /grid/ilc
  - Copy a file to the SE (for the ilc VO):
      lcg-cr --vo ilc file:`pwd`/your_file -l lfn:/path/your_file
  - Copy a file from the SE to the UI:
      1. You need the Globally Unique IDentifier (GUID):
         lcg-lg --vo ilc lfn:/your_path/file
      2. lcg-cp --vo ilc GUID file:`pwd`/file
  - Erase the file:
      1. You need the Site File Name (sfn):
         lcg-lr --vo ilc GUID
      2. lcg-del --vo ilc sfn:sfn
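The copy and erase procedures above are two-step sequences, so a small wrapper can help avoid typing mistakes. The sketch below only builds the argument lists, using the commands and options shown on the slide and nothing else. The function names are hypothetical, and it assumes that the GUID and the replica name are passed exactly as printed by lcg-lg and lcg-lr.

```python
import os


def cmd_register(vo, local_file, lfn, cwd=None):
    """lcg-cr: copy a local file to the SE under a logical file name."""
    cwd = cwd or os.getcwd()  # plays the role of `pwd` on the slide
    return ["lcg-cr", "--vo", vo, f"file:{cwd}/{local_file}", "-l", f"lfn:{lfn}"]


def cmd_get_guid(vo, lfn):
    """lcg-lg: look up the GUID of a logical file name."""
    return ["lcg-lg", "--vo", vo, f"lfn:{lfn}"]


def cmd_copy_to_ui(vo, guid, local_file, cwd=None):
    """lcg-cp: copy the file identified by `guid` to the UI."""
    cwd = cwd or os.getcwd()
    return ["lcg-cp", "--vo", vo, guid, f"file:{cwd}/{local_file}"]


def cmd_list_replicas(vo, guid):
    """lcg-lr: list the replicas (site file names) of a GUID."""
    return ["lcg-lr", "--vo", vo, guid]


def cmd_delete(vo, replica):
    """lcg-del: erase one replica, named as printed by lcg-lr (assumption)."""
    return ["lcg-del", "--vo", vo, replica]


print(" ".join(cmd_register("ilc", "your_file", "/path/your_file", cwd="/home/user")))
```

On a UI where the LCG tools are installed, each list could be handed to subprocess.run; here nothing is executed, so the sketch can be tried anywhere.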
Underlying Technology
  - The relative capabilities of CPU, storage, and network shape the computing architecture.
  - Physics data: a continuous flux of up to 1 GB/s onto the grid (~1 DVD every 5 s).
  - With optical fiber we expect 10 GB/s (~2 DVD/s).
  - Data transfer should no longer be a challenge.
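The DVD equivalents quoted above can be reproduced with a short calculation. This sketch assumes a single-layer DVD of 4.7 GB (decimal gigabytes), which is not stated on the slide. The function names are illustrative.

```python
DVD_GB = 4.7  # assumed capacity of a single-layer DVD, in decimal gigabytes


def seconds_per_dvd(rate_gb_per_s):
    """Time needed to transfer one DVD's worth of data at the given rate."""
    return DVD_GB / rate_gb_per_s


def dvds_per_second(rate_gb_per_s):
    """Number of DVDs' worth of data transferred each second."""
    return rate_gb_per_s / DVD_GB


print(f"1 GB/s: one DVD every {seconds_per_dvd(1.0):.1f} s")
print(f"10 GB/s: {dvds_per_second(10.0):.1f} DVD/s")
```

With this capacity the results are consistent with the rounded figures on the slide (about one DVD per 5 s, and about 2 DVD/s).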
Parachute method for BDSIM: how I use Geant4 on the grid
  1. Compile and run on an interactive SL machine (CCIN2P3 machines)
  2. Copy the binary and the associated libraries to lx2
  3. Copy all the libraries needed by BDSIM to lx2 (for BDSIM: Geant4, CLHEP, ROOT, ...)
  4. Define all the variables and run BDSIM on lx2
  5. Copy everything to the Storage Element
  6. Write all the scripts needed to run BDSIM on the grid
A few words on how to run BDSIM
  1. Gmad file
     - Detector and extraction line descriptions (also some G4 flags: thresholdcutcharged)
  2. Input bunch file
     - Output from the guinea-pig simulation
  One gmad file corresponds to one input bunch file (when you change the bunch file, you need to change the gmad file).
Parachute method for BDSIM: how I use Geant4 on the grid
  [Figure: workflow of the Parachute method]
  - SE: tar ball with Geant4, CLHEP, ROOT.
  - UI (SL3 @ LAL): bash scripts and n JDLs, sent to the RB.
  - InputSandbox: sh script (how to run BDSIM), gmad file, GuineaPig file.
  - Computing Element: install the libraries and the files from the JDLs on the workers, run the shell script, copy the ROOT output to the SE.
  - UI: get the ROOT files.
  - The GuineaPig files are also produced on the GRID (SEED is now an argument of the program; Cécile, François and Guy) and stored on the SE.
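The "n JDLs" step of the workflow above can be sketched as a small generator: one JDL per gmad/bunch pair, each shipping the run script and its two input files in the InputSandbox. This is an illustration under stated assumptions and not the scripts used in the talk. All file names (run_bdsim.sh, job_N.gmad, bunch_N.dat, bdsim_N.jdl) and the way arguments are passed are hypothetical, and only JDL attributes that appear earlier in these slides are used.

```python
def parachute_jdl(index, script="run_bdsim.sh"):
    """Build the JDL text for job number `index` (all file names are hypothetical)."""
    gmad = f"job_{index}.gmad"      # one gmad file per input bunch file
    bunch = f"bunch_{index}.dat"    # GuineaPig output used as input bunch
    stdout = f"bdsim_{index}.out"
    stderr = f"bdsim_{index}.err"
    lines = [
        f'Executable = "{script}";',
        f'Arguments = "{gmad} {bunch}";',
        f'StdOutput = "{stdout}";',
        f'StdError = "{stderr}";',
        f'InputSandbox = {{"{script}", "{gmad}", "{bunch}"}};',
        # Only the logs come back in the sandbox; the ROOT output goes to the SE.
        f'OutputSandbox = {{"{stdout}", "{stderr}"}};',
    ]
    return "\n".join(lines) + "\n"


def build_jdls(n):
    """Return a dict mapping JDL file names to their text for jobs 1..n."""
    return {f"bdsim_{i}.jdl": parachute_jdl(i) for i in range(1, n + 1)}


jdls = build_jdls(3)
print(jdls["bdsim_1.jdl"], end="")
```

Each entry can be written to disk and submitted with edg-job-submit as in the Hello World examples. Keeping the large libraries in the tar ball on the SE, rather than in the InputSandbox, respects the size limit of a few MB noted earlier.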
Gains and problems with the Parachute
  Gains:
  1. No disk space problem for storing my data
  2. At least a factor of 10 gained compared to the CCALI clusters (where most of the time is spent in the queue)
  Problems:
  1. Lost jobs (waiting, no recovery): a job may hang in the waiting status when a problem arises at the RB level
  2. Proxy-expired problem (still under investigation): 10%
  3. Crashes for unknown reasons: a few percent
Conclusions and prospects
  Parachute method:
  - > 95% successful jobs for GuineaPig
  - 85% successful jobs for BDSIM
  Note: in the context of GRIF, I also used XtremWeb (Oleg Lodygensky, LAL) for GuineaPig production. It needs to be tested with BDSIM.
  Maybe we need one VO, with ILC and CALICE joined, and all the common software for both VOs installed: Geant4, CLHEP, and ROOT at least.
  I would like to thank Charles Loomis (LAL) for useful discussions.