Grid Computing Introduction & Parachute method

APC-Grid, February 2007
Olivier Dadoun, LAL, Orsay
http://flc-mdi.lal.in2p3.fr
dadoun@lal.in2p3.fr
www.dadoun.net
October 2006

Contents

- Machine Detector Interface (MDI) purpose
- Introduction to Grid computing
- Authentication & authorization
- Very simple job submission examples
- Parachute method
- Conclusion

All the examples can be found at: http://dadoun.net/informatique/apcgridexamples.tar.gz

The ILC project: $e^+e^-$ linear collider

- $e^+e^-$ polarized collisions
- $\sqrt{s}$ = 500 GeV - 1 TeV
- 2820 bunches of $2\times10^{10}$ particles at 5 Hz
- $\sigma_y \sim 5.7$ nm $\sim 1\%\,\sigma_x$
- $L \sim 2\times10^{34}$ cm$^{-2}$ s$^{-1}$

[Figure: detector layout with solenoid, calorimeters, TPC and vertex detector]

Machine Detector Interface purpose

- Depending on the beam parameter set, the post-collision beam can be strongly degraded (beamstrahlung photons). [Figure: BDSIM simulation]
- We need to extract the beam and transport it with minimal losses to the dump (10-20 MW).
- In any case there will be some losses along the extraction line (>300 m):
  - damage to the beam-line magnets, especially the SC magnets
  - background generation
- One of our goals: evaluate the particles backscattered into the detector region using the BDSIM toolkit (Geant4 based).
- NB: 4 detector concepts x 3 (4) extraction lines

Why do I use the grid?

- Geant4-based programs (Mokka, BDSIM) are CPU-time consuming.
- Running BDSIM for 500K disrupted beam particles with a 50 m extraction line takes one week (human time) on BQS at CC-IN2P3 (with 160 jobs, and jobs wait a long time in the queue).
- So I decided to use the Grid, under the ILC Virtual Organization.

Introduction

Definition:
- Allow scientists from multiple domains to use, share, and manage geographically distributed resources transparently.
- "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." The Grid, I. Foster and C. Kesselman, 1998

The name's origin:
- In analogy with the power grid, a computational grid should be easy to use, hiding the complex internal processes.

An organization of people from different institutions with common goals, who share computational resources to achieve those goals, is a Virtual Organization (VO) from the Grid point of view.

Major European Grid Projects (European funded)

- European DataGrid
- CrossGrid
- DataTAG
- DEISA
- LHC Computing Grid
- EGEE

Infrastructure LCG / EGEE

- Enabling Grids for E-sciencE (EGEE): provide and manage a European grid infrastructure to support researchers from many disciplines (biomedical applications, Earth science, computational chemistry and high-energy physics).
- LHC Computing Grid (LCG): prepare, deploy, and operate the computing environment that allows physicists to analyze the data from the LHC detectors.
- LCG and EGEE have similar aims:
  - LCG: worldwide collaboration (one field)
  - EGEE: European grid (many fields)

LCG/EGEE Production Service

- > 200 sites
- > 20 kCPU
- > 13 PB

Virtual Organization

"A set of individuals and/or institutions defined by such sharing rules is what we call a virtual organization." I. Foster, C. Kesselman, S. Tuecke (2000)

A VO represents a collaboration that is defined by:
- People from different institutions with common goals
- Shared computational resources to achieve those goals:
  - same data
  - same rules to analyze
  - same access rights

The ILC VO is currently supported by ~10 UKI sites, LAL, DESY, ... (04/04/2006: 27 CEs, 3500 CPUs, 42 TB, 6 RBs)

Schematics

[Figure: job submission and data transfer between the User Interface (ssh login, cert, JDL), the Resource Broker, the Computing Element (workers, NFS) and the Storage Element (disks); the output is returned to the user.]

What do we need?

User Interface (UI) account

1. Authentication (i.e. who are you?)
   - Certificate Authorities (CA), electronic certificate (cert.) X509
   - The user generates a time-limited proxy
2. Authorization (i.e. what can you do?)
   - Done by the Virtual Organization (VO)

Public Key Infrastructure: uses the Grid Security Infrastructure (GSI) from Globus

Authentication & authorization (1)

1. Personal certificate: https://igc.services.cnrs.fr/grid-fr

You will receive your certificate.

For Mac OS X users:
- don't use Safari (I suggest Firefox, at least for the grid site)
- don't use the mailer provided by default

Authentication & authorization (2)

3. Export, convert and install your certificate
   Copy your cert.p12 to your UI machine (~/.globus folder), then:

    openssl pkcs12 -in cert.p12 -clcerts -nokeys -out usercert.pem
    openssl pkcs12 -in cert.p12 -nocerts -out userkey.pem

4. VO registration: https://lcg-registrar.cern.ch/cgi-bin/register/account.pl
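
The conversion in step 3 can be wrapped in a small helper. This is only a sketch: the function names convert_p12 and protect_globus_files are hypothetical, and the file modes (private key readable by the owner only, certificate world-readable) are an assumption based on what the Globus proxy tools usually expect, not something stated on the slide.

```shell
#!/bin/sh
# protect_globus_files DIR: restrict permissions on the two PEM files.
# Assumption: the proxy tools refuse a private key that others can read.
protect_globus_files() {
    chmod 644 "$1/usercert.pem" && chmod 400 "$1/userkey.pem"
}

# convert_p12 P12_FILE [GLOBUS_DIR]: split a PKCS#12 bundle into
# usercert.pem and userkey.pem using the openssl commands from the slide.
# openssl prompts for the import password and for a new key pass phrase.
convert_p12() {
    p12=$1
    dir=${2:-$HOME/.globus}
    mkdir -p "$dir" || return 1
    openssl pkcs12 -in "$p12" -clcerts -nokeys -out "$dir/usercert.pem" || return 1
    openssl pkcs12 -in "$p12" -nocerts -out "$dir/userkey.pem" || return 1
    protect_globus_files "$dir"
}

# Typical use on the UI (not run here):
#   convert_p12 cert.p12
```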

Proxy and myproxy

Create a proxy: grid-proxy-init (12 h lifetime by default)

    lx2/dadoun % grid-proxy-init
    Your identity: /O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Olivier Dadoun
    Enter GRID pass phrase for this identity:
    Creating proxy... Done
    Your proxy is valid until: Sat Sep 23 02:25:33 2006

Delete a proxy: grid-proxy-destroy
Information on your proxy: grid-proxy-info

If you need a proxy with a longer lifetime, use a proxy server (<host_name> is the name of the proxy server):

    myproxy-init -d -s <host_name>
    myproxy-info -d -s <host_name>
    myproxy-destroy -d -s <host_name>

Since December 2006 each VO uses voms-proxy-init.
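
Because the default proxy only lives 12 h, it can help to check the remaining lifetime before submitting. The sketch below is an illustration, not part of the original examples: the function name proxy_has_time_left and the 7200 s threshold are hypothetical, and it assumes grid-proxy-info -timeleft prints the remaining lifetime in seconds.

```shell
#!/bin/sh
# proxy_has_time_left SECONDS_LEFT MIN_SECONDS
# Succeeds (exit status 0) when the remaining proxy lifetime is at least
# MIN_SECONDS, i.e. long enough to cover the expected job duration.
proxy_has_time_left() {
    seconds_left=$1
    min_seconds=$2
    # An empty, negative or non-numeric value means no usable proxy.
    case $seconds_left in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$seconds_left" -ge "$min_seconds" ]
}

# Typical use on the UI before submitting (not run here):
#   left=$(grid-proxy-info -timeleft 2>/dev/null)
#   proxy_has_time_left "$left" 7200 || grid-proxy-init
```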

Hello World submission level 0 (1)

myfirstjdl.jdl:

    Executable = "/bin/ls";
    Arguments = "-rtla";
    StdError = "first.err";
    StdOutput = "first.out";
    OutputSandbox = {"first.out", "first.err"};

Submission:

    lx2/dadoun % edg-job-submit --vo ilc -o out myfirstjdl.jdl
    Selected Virtual Organisation name (from --vo option): ilc
    Connecting to host grid09.lal.in2p3.fr, port 7772
    Logging to host grid09.lal.in2p3.fr, port 9002
    ================= edg-job-submit Success ====================
    The job has been successfully submitted to the Network Server.
    Use edg-job-status command to check job current status.
    Your job identifier (edg_jobid) is:
    - https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
    The edg_jobid has been saved in the following file:
    /users/delphi/dadoun/datagridtutorial/apc/level0/output
    ======================================================

Hello World submission level 0 (2)

    lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
    *************************************************************
    BOOKKEEPING INFORMATION:
    Status info for the Job : https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
    Current Status:     Done (Success)
    Exit code:          0
    Status Reason:      Job terminated successfully
    Destination:        ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long
    reached on:         Wed Sep 20 10:14:42 2006
    *************************************************************

The job ran successfully when the status is Done (Success) and the exit code is 0.
NB: if the exit code != 0, there was a problem while running the job; stderr can help to debug.
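
The success criterion above (Done (Success) and exit code 0) can be checked in a script by parsing the text printed by edg-job-status. This is a minimal sketch: the function names job_status and job_succeeded are hypothetical, and the parsing assumes the "Current Status:" and "Exit code:" labels appear exactly as in the output shown on this slide.

```shell
#!/bin/sh
# job_status: read edg-job-status output on stdin and print the value of
# the "Current Status:" field, without surrounding whitespace.
job_status() {
    sed -n 's/^[[:space:]]*Current Status:[[:space:]]*//p' | sed 's/[[:space:]]*$//'
}

# job_succeeded: read edg-job-status output on stdin; succeed only when
# the status is "Done (Success)" and the exit code is 0.
job_succeeded() {
    report=$(cat)
    status=$(printf '%s\n' "$report" | job_status)
    code=$(printf '%s\n' "$report" |
        sed -n 's/^[[:space:]]*Exit code:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ "$status" = "Done (Success)" ] && [ "$code" = "0" ]
}

# Typical use on the UI (not run here):
#   edg-job-status "$jobid" | job_succeeded && edg-job-get-output "$jobid"
```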

Hello World submission level 0 (3)

    lx2/dadoun % edg-job-get-output https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
    Retrieving files from host: grid09.lal.in2p3.fr ( for https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg )
    *********************************************************************************
    JOB GET OUTPUT OUTCOME
    Output sandbox files for the job:
    - https://grid09.lal.in2p3.fr:9000/ma4eskm9sxt85bjb4onvdg
    have been successfully retrieved and stored in the directory:
    /users/delphi//dadoun/joboutput/dadoun_ma4eskm9sxt85bjb4onvdg
    *********************************************************************************

We can check that the job output is in first.out; the first.err file is empty (exit code 0).

Comment on the output in first.out

    -rw-r--r-- 1 ilcsgm ilc    0 Feb 26 16:01 .maradona.https_3a_2f_2fgrid-rb1.desy.de_3a9000_2ffgoiajqghqqecn1_5fjkly3q.output
    drwxr-xr-x 3 ilcsgm ilc 4096 Feb 26 16:01 ..
    -rw-r--r-- 1 ilcsgm ilc  807 Feb 26 16:01 .BrokerInfo
    -rw-r--r-- 1 ilcsgm ilc    0 Feb 26 16:01 first.out
    -rw-r--r-- 1 ilcsgm ilc    0 Feb 26 16:01 first.err
    drwxr-xr-x 2 ilcsgm ilc 4096 Feb 26 16:01 .

Don't erase those files in your future scripts (rm .* is not a good idea).

Hello World submission level 1 (1)

JDL with an InputSandbox:

    Executable = "HelloWorld.sh";
    StdOutput = "hello.out";
    StdError = "hello.err";
    InputSandbox = {"HelloWorld.sh"};
    OutputSandbox = {"hello.out", "hello.err"};

HelloWorld.sh:

    #!/bin/bash
    # Quoted so that the shell does not try to interpret ":)"
    echo "Hello World :)"

InputSandbox: cannot exceed a few MB.

    lx2/dadoun % edg-job-submit --vo ilc -o out HelloWord.jdl
    Selected Virtual Organisation name (from --vo option): ilc
    Connecting to host grid09.lal.in2p3.fr, port 7772
    Logging to host grid09.lal.in2p3.fr, port 9002
    ============edg-job-submit Success ==============================
    The job has been successfully submitted to the Network Server.
    Use edg-job-status command to check job current status.
    Your job identifier edg_jobid is:
    https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    The edg_jobid has been saved in the following file:
    /users/delphi/dadoun/datagridtutorial/test/out
    =============================================================
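
When many similar jobs are submitted, the JDL file can be generated instead of written by hand, which keeps the executable, the sandboxes and the output names consistent. A minimal sketch, assuming the same five attributes as in the level 1 example; the function name write_jdl and the naming rule (output files named after the script) are hypothetical choices.

```shell
#!/bin/sh
# write_jdl SCRIPT: print a JDL description that ships SCRIPT in the input
# sandbox, runs it, and returns its stdout/stderr in the output sandbox.
write_jdl() {
    script=$1
    exe=$(basename "$script")
    name=$(basename "$script" .sh)
    cat <<EOF
Executable    = "$exe";
StdOutput     = "$name.out";
StdError      = "$name.err";
InputSandbox  = {"$script"};
OutputSandbox = {"$name.out", "$name.err"};
EOF
}

# Typical use on the UI (not run here):
#   write_jdl HelloWorld.sh > HelloWorld.jdl
#   edg-job-submit --vo ilc -o out HelloWorld.jdl
```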

Hello World submission level 1 (2)

    lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    *************************************************************
    BOOKKEEPING INFORMATION:
    Status info for the Job : https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    Current Status:     Scheduled
    Status Reason:      Job successfully submitted to Globus
    Destination:        fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-ilc
    reached on:         Mon Sep 18 15:07:47 2006
    *************************************************************

    lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    *************************************************************
    BOOKKEEPING INFORMATION:
    Status info for the Job : https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    Current Status:     Running
    Status Reason:      Job successfully submitted to Globus
    Destination:        fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-ilc
    reached on:         Mon Sep 18 15:11:25 2006
    *************************************************************

Hello World submission level 1 (3)

    lx2/dadoun % edg-job-status https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    *************************************************************
    BOOKKEEPING INFORMATION:
    Status info for the Job : https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    Current Status:     Done (Success)
    Exit code:          0
    Status Reason:      Job terminated successfully
    Destination:        fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-ilc
    reached on:         Mon Sep 18 15:13:48 2006
    *************************************************************

    lx2/dadoun % edg-job-get-output https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    Retrieving files from host: grid09.lal.in2p3.fr ( for https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig )
    *********************************************************************************
    JOB GET OUTPUT OUTCOME
    Output sandbox files for the job:
    - https://grid09.lal.in2p3.fr:9000/3fpxyrq8cbcdxokz-qjnig
    have been successfully retrieved and stored in the directory:
    /users/delphi//dadoun/joboutput/dadoun_3fpxyrq8cbcdxokz-qjnig
    *********************************************************************************

LCG commands (LHC Computing Grid)

Configure it:

    export LCG_CATALOG_TYPE=lfc
    export LFC_HOST=grid-lfc.desy.de

Useful commands:

- List a file or directory:

    lfc-ls /grid/ilc

- Copy a file onto the SE (for the ilc VO):

    lcg-cr --vo ilc file:`pwd`/your_file -l lfn:/path/your_file

- Copy a file from the SE to the UI:
  1. You need the Globally Unique IDentifier (GUID):

    lcg-lg --vo ilc lfn:/your_path/file

  2. Then copy it:

    lcg-cp --vo ilc GUID file:`pwd`/file

- Erase the file from the SE:
  1. You need the Site File Name (sfn):

    lcg-lr --vo ilc GUID

  2. Then delete it:

    lcg-del --vo ilc sfn:sfn

Underlying Technology

- The relative capability of CPU, storage, and network impacts the computing architecture.
- Physics data: continuous flux of up to 1 GB/s onto the grid (~1 DVD every 5 s).
- Using optical fibre we expect 10 GB/s (~2 DVD/s).
- Data transfer will no longer be a challenge.

Parachute method using ROOT

1. Compile and run myhisto on an interactive SL machine.
2. Copy all the libraries and headers needed by myhisto to your UI (essentially the ROOT libraries).
3. Define all the environment variables and run myhisto on your UI.
4. Copy everything (as a tar ball) onto your SE.
5. Write the script needed to run myhisto on the grid and to copy the output onto your SE.

Parachute method using ROOT

[Figure: workflow. The UI (SL3 @ LAL) sends the InputSandbox shell script (how to run myhisto) through the RB to the Computing Element. The Computing Element installs the tar ball with the needed ROOT files from the SE, executes myhisto and copies the ROOT output onto the SE. The UI then gets the ROOT file from the SE.]
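
The shell script shipped in the InputSandbox could look like the sketch below. It is an illustration under stated assumptions, not the author's actual script: the function names, the tar ball layout (a root/ directory plus the myhisto binary), the output file name myhisto.root and the LFN paths are all hypothetical. The lcg-lg, lcg-cp and lcg-cr calls and the LFC settings follow the forms given on the LCG commands slide.

```shell
#!/bin/sh
VO=ilc

# output_lfn BASE TAG: build a distinct catalogue name per job so that the
# outputs of different jobs do not collide (hypothetical naming rule).
output_lfn() {
    printf 'lfn:%s/myhisto_%s.root\n' "$1" "$2"
}

# run_parachute TARBALL_LFN OUTPUT_BASE TAG: the worker-side steps.
run_parachute() {
    tarball_lfn=$1
    out_base=$2
    tag=$3
    workdir=$(pwd)

    # Catalogue settings, as on the LCG commands slide.
    LCG_CATALOG_TYPE=lfc
    LFC_HOST=grid-lfc.desy.de
    export LCG_CATALOG_TYPE LFC_HOST

    # 1. Fetch the tar ball from the SE: GUID lookup, then copy.
    guid=$(lcg-lg --vo "$VO" "$tarball_lfn") || return 1
    lcg-cp --vo "$VO" "$guid" "file:$workdir/parachute.tar.gz" || return 1

    # 2. Unpack it and point the environment at the shipped ROOT libraries
    #    (assumed to live under root/ inside the tar ball).
    tar xzf parachute.tar.gz || return 1
    ROOTSYS=$workdir/root
    LD_LIBRARY_PATH=$ROOTSYS/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    export ROOTSYS LD_LIBRARY_PATH

    # 3. Run the program (assumed to write myhisto.root in the work dir).
    ./myhisto || return 1

    # 4. Copy the ROOT output onto the SE and register it in the catalogue.
    lcg-cr --vo "$VO" "file:$workdir/myhisto.root" \
        -l "$(output_lfn "$out_base" "$tag")"
}

# Typical call at the end of the sandbox script (not run here; paths are
# placeholders):
#   run_parachute lfn:/your_path/parachute.tar.gz /your_path/output 001
```

Passing a per-job tag keeps each output under its own catalogue name, so a resubmitted or parallel job does not try to register the same LFN twice.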

Gains and problems with the Parachute method

Gains:
1. No disk space problem to store my data
2. At least a factor of 10 faster compared to BQS (where most of the time is spent in the queue)

Problems:
1. Lost jobs: wait, no recovery. A job may hang in waiting status when some problem arises at the RB level.
2. Proxy expired problem (10%). Understood: grid-proxy-init & voms-proxy-init confusion in the RB.
3. Crashes for unknown reasons: a few percent

Conclusions

Parachute method:
- > 95% of successful jobs for simple jobs (self-contained programs)
- 85% of successful jobs for huge simulations using Geant4, CLHEP, ROOT, ...

Note: in the context of GRIF I also used XtremWeb (on Linux and Mac OS X machines; need to install my software also on Windows, wmshare).

I would like to thank Charles Loomis (LAL) for useful discussions.

Use the Grid