Eurogrid: a glideinWMS-based portal for CDF data analysis - 19th January 2012. S. Amerio (INFN Padova) on behalf of the Eurogrid support group


CDF computing model
The CDF computing model is based on a central farm located at Fermilab (CDFGrid) and two access points to Grid resources: NAmCAF for the American OSG grid and Eurogrid for the European LCG. All are based on the Central Analysis Farm (CAF): the CAF provides users with a uniform interface to PC farms at different sites running different middleware and batch systems, and the CAF software provides tools for submitting, managing and monitoring batch jobs.

Eurogrid: the CDF glideinWMS-based portal
In Europe CDF can exploit a dedicated farm at the CNAF Tier1 and LCG resources in Italy, Spain, France and Germany. In the past the dedicated CNAF resources were accessed through the CNAFCaf portal (Condor based), while the grid ones were accessed through LcgCAF (gLite based). Starting in June 2011, both portals have been merged (thanks to G. Compostella) into a single one, Eurogrid, based on glideinWMS. Why? It reduces the manpower needed (one portal, with CDF responsible for the frontend only), it is simpler for users, and glideinWMS is Condor based, hence easy to integrate with the CAF code.

Eurogrid at a glance
Components (see diagram): user's desktop; Eurogrid headnode @ CNAF, running the CAF code (Submitter, Condor local queue, Monitor, Mailer) and the VO Frontend; glidein Factory @ UCSD.
Key points: job submission, job execution, authentication, code distribution and remote DB access, monitoring, output retrieval.

Job submission
On the CDF user's desktop: authentication (Kerberos) and job submission via cafsubmit (portal + job options), which checks the syntax, contacts the headnode and submits the job.
On the headnode, the Submitter: handles authentication (more details later); prepares the job so it can be executed by Condor; creates a class-ad containing all the information on the job provided by the user (type of job, queue, OS, etc.); runs condor_submit to send the job to the main schedd. A sketch of this last step follows.
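The following is a minimal sketch, not the actual CAF Submitter code, of the step just described: building a Condor submit description from the user's options and handing it to condor_submit on the headnode. The helper name submit_to_condor and the user_opts dictionary are hypothetical.

# Minimal sketch (not the real Submitter): turn the user's request into a
# Condor submit description and pass it to condor_submit on the headnode.
import os
import subprocess
import tempfile

def submit_to_condor(user_opts):
    # user_opts is a hypothetical dictionary built from the cafsubmit options
    submit = [
        "Executable = stage/cafexe",
        "Universe = vanilla",
        'requirements = (Memory >= 200) && (OpSys == "LINUX") && '
        'stringListMember(GLIDEIN_Site, DESIRED_Sites)',
        '+DESIRED_Sites = "%s"' % ",".join(user_opts["sites"]),
        "Queue %d" % user_opts["sections"],
    ]
    fd, path = tempfile.mkstemp(suffix=".jdl")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(submit) + "\n")
    # hand the job over to the main schedd
    subprocess.check_call(["condor_submit", path])

submit_to_condor({"sites": ["Bari", "Legnaro"], "sections": 250})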

Job execution
Cafexe (the CDF job wrapper) on the worker node: fetches the job tarball from the squid server; runs the job; regularly renews the Kerberos ticket; when the job is done, tars the user's files and copies them via rcp/fcp to the user's areas; finally removes all user files and destroys the Kerberos tickets (kdestroy). A schematic rendition of these steps follows.
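A schematic Python rendition of the wrapper steps listed above (not the real Cafexe code); URLs, file names and the renewal interval are placeholders.

# Schematic sketch of the Cafexe steps; all names here are placeholders.
import subprocess
import threading
try:
    from urllib.request import urlretrieve   # Python 3
except ImportError:
    from urllib import urlretrieve            # Python 2

def renew_ticket(interval=3600):
    # periodically renew the forwarded Kerberos ticket while the job runs
    subprocess.call(["kinit", "-R"])
    timer = threading.Timer(interval, renew_ticket, [interval])
    timer.daemon = True   # do not keep the process alive after the job ends
    timer.start()

# 1. fetch the job tarball through the site squid server
urlretrieve("http://cdfheadsrv.cr.cnaf.infn.it:2080/condorcafstage/job_in.tgz",
            "job_in.tgz")
subprocess.check_call(["tar", "xzf", "job_in.tgz"])
# 2. run the user job while keeping the ticket alive
renew_ticket()
subprocess.call(["./user_job.sh"])
# 3. pack the output and copy it back with Kerberos authentication
subprocess.check_call(["tar", "czf", "output.tgz", "results"])
subprocess.call(["rcp", "output.tgz", "user@somehost.fnal.gov:/data/output.tgz"])
# 4. clean up and destroy the credentials
subprocess.call(["rm", "-rf", "results", "output.tgz", "job_in.tgz"])
subprocess.call(["kdestroy"])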

Authentication
Users send their job to Eurogrid using their Kerberos ticket. The headnode contacts the Kerberized Certificate Authority (FNAL KCA) to turn the Kerberos ticket into an X.509 Grid certificate, and then calls voms-proxy-init (against the CNAF VOMS server) to set up a valid proxy that can be used to submit to the grid. The user's job is submitted and executed with the user's own credentials. During job execution the Kerberos ticket is renewed at fixed intervals. A sketch of this credential chain follows.
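The sketch below illustrates the credential chain; the exact commands and options used by the Eurogrid headnode are assumptions (kx509 and voms-proxy-init are the usual tools for this conversion, but the flags shown are not taken from the actual code).

# Sketch of the Kerberos -> X.509 -> VOMS proxy chain; flags are assumptions.
import os
import subprocess

def make_grid_proxy(krb5_ccache, x509_proxy, vo="cdf"):
    env = dict(os.environ, KRB5CCNAME=krb5_ccache, X509_USER_PROXY=x509_proxy)
    # 1. Kerberos ticket -> short-lived X.509 certificate via the FNAL KCA
    subprocess.check_call(["kx509"], env=env)
    # 2. add VOMS attributes so the proxy is accepted by the LCG sites
    subprocess.check_call(["voms-proxy-init", "-voms", vo, "-noregen"], env=env)

make_grid_proxy("/home/cdfcaf/cnafcafdata/cafcondor/tickets/krb5cc_tonel",
                "/home/cdfcaf/cnafcafdata/cafcondor/tickets/x509cc_tonel")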

DB access and code distribution
MC simulation jobs need access to the FNAL database to retrieve run conditions. To access the DB at FNAL, the DB queries are translated into HTTP requests using Frontier (the Frontier library is shipped in the job tarball) and Squid proxies are used as caches to improve scalability and performance (proxy cache in front of the FNAL Oracle DB server). A job may also need CDF-specific software: the CDF code is available at most Eurogrid sites through AFS (AFS client on the worker node, CDF code on the AFS server). A conceptual illustration of the Frontier access pattern follows.
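Conceptual illustration only: real jobs use the Frontier client library shipped in the tarball, but the key idea is that a read-only DB query becomes an HTTP GET that the site squid can cache; the Frontier server URL and query encoding below are invented, only the squid proxy URL comes from the backup slides.

# Conceptual sketch of Frontier-over-HTTP; server URL and query are invented.
import os
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen           # Python 2

# route the request through the site squid cache
os.environ["http_proxy"] = "http://squid-cdf-2.cr.cnaf.infn.it:3128"

# hypothetical Frontier-style request: the SQL is encoded in the URL, so
# identical run-conditions queries from many jobs hit the cache, not Oracle
url = ("http://frontier.example.fnal.gov:8000/cdf/Frontier?"
       "encoded_run_conditions_query")
data = urlopen(url).read()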

Output storage
The user's job output is copied, using rcp/fcp with Kerberos authentication, to CDF storage at FNAL or CNAF (and to tape for MC or reprocessed data) or back to the user's desktop.

Monitoring
A command-line monitor on the user's desktop periodically requests information from the monitoring daemons on the headnode; a web monitor is served by the headnode webserver. Interactive monitoring exploits the glideinWMS virtual monitoring machines on the worker nodes: users can run ps, ls, cat, etc. directly on the worker node.

Web monitor (screenshot)

Matching resources and user needs
User needs ("I need to run on CDF data", "I need CDF software") have to be matched to what each site offers: CDF software and data (CNAF), CDF software only, or neither CDF software nor data. We want job submission on Eurogrid to be as easy and transparent as possible for users, so we exploited: match_expressions to match jobs and sites; glideinWMS custom scripts to set the correct environment, compat libs and krb5 libs before running Cafexe on the node.

Site Selection
Each job ClassAd (Condor submit file) defines its DESIRED_Sites (CAF: SubmitModule.py):

Executable = exe.sh
Universe = vanilla
requirements = (Memory>=200) && (OpSys == "LINUX") && ((Arch == "INTEL") || (Arch == "x86") || (Arch == "x86_64") || (Arch == "X86") || (Arch == "X86_64")) && stringListMember(GLIDEIN_Site, DESIRED_Sites)
[ ]
+DESIRED_Sites = "Bari,Legnaro"
Queue

Frontend configuration:

<frontend ...>
  <match match_expr='(job.has_key("DESIRED_Sites") and (glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")))'>
    <factory ...>
      <match_attrs>
        <match_attr name="GLIDEIN_Site" type="string"/>
      </match_attrs>
    </factory>
    <job ...>
      <match_attrs>
        <match_attr name="DESIRED_Sites" type="string"/>
      </match_attrs>
      [ ]
    </job>
    [ ]
  </match>
</frontend>

With this configuration jobs are sent only to matching sites, and glideins are sent only to sites listed as desired by the jobs in the queue. A sketch of how the match_expr is evaluated follows.
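The match_expr above is a Python expression that the frontend evaluates for every (job, factory entry) pair; the following sketch shows the idea with made-up sample data (the frontend internals are of course more involved).

# Sketch of how the frontend's match_expr selects (job, entry) pairs.
class Job(dict):
    has_key = dict.__contains__   # Python 3 dicts dropped has_key()

match_expr = ('job.has_key("DESIRED_Sites") and '
              '(glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(","))')

jobs = [Job(DESIRED_Sites="Bari,Legnaro"), Job(DESIRED_Sites="CNAF")]
entries = [{"attrs": {"GLIDEIN_Site": "Bari"}},
           {"attrs": {"GLIDEIN_Site": "KIT"}}]

for job in jobs:
    for glidein in entries:
        if eval(match_expr):
            print("job wanting %s matches entry at %s"
                  % (job["DESIRED_Sites"], glidein["attrs"]["GLIDEIN_Site"]))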

Site Configuration file
Sites are selected at submission time (SubmitModule.py) using info from this configuration file (/home/cdfcaf/cnafcaf/config/glideinwms_sites.cfg); a sketch of the resulting selection logic follows the listing.

[global_sites]
; list all sites available, i.e. the sites considered "production ready"
list=cnaf,legnaro,bari

[sam_enabled_sites]
; list sites where SAM can run
list=cnaf

[cdfsoft_enabled_sites]
; list sites that have cdfsoft
list=cnaf,bari

; Now you can list a different configuration for each group.
; Users submitting to those groups are automatically redirected to the
; sites you put in the list: beware that if a user submits to a group and
; uses SAM or specifies a list of desired sites via --site, then the sites
; where the job will be redirected will be the INTERSECTION of the group,
; SAM and site requirements.
; Exception: group=test does not check the desired_sites against the list
; of available production sites.
[test]
; list sites you are currently testing
list=kit,ccin2p3

[italy]
list=cnaf,legnaro,bari

[MCprod]
list=legnaro,bari
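A sketch, not the actual SubmitModule.py code, of the selection logic the comments above describe: start from the group's site list and intersect it with the production, SAM-enabled and user-requested sites, except for the test group.

# Sketch of the INTERSECTION logic described in the configuration comments.
try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2

def select_sites(cfg_path, group, use_sam=False, user_sites=None):
    cfg = configparser.ConfigParser()
    cfg.read(cfg_path)
    sites = set(cfg.get(group, "list").split(","))
    if group != "test":   # the test group skips the production-site check
        sites &= set(cfg.get("global_sites", "list").split(","))
        if use_sam:
            sites &= set(cfg.get("sam_enabled_sites", "list").split(","))
        if user_sites:    # sites requested with --site
            sites &= set(s.lower() for s in user_sites)
    return sorted(sites)

# e.g. an italy-group job that needs SAM can only be redirected to CNAF
print(select_sites("glideinwms_sites.cfg", "italy", use_sam=True))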

Site Setup with glideinWMS custom scripts
Sites are set up using glideinWMS custom scripts (a sketch of the publishing mechanism follows the list):
define_clibs_env.sh: the glidein software unpacks compatlibs.tgz and the script sets the correct library path for the job.
define_krb5.sh: after krb5.tgz is unpacked, the script sets the correct PATH for the krb5 binaries and libraries.
setup_site.sh: sets some site-specific environment variables (e.g. the CDF software PATH and the SAM environment variables).
publish_cdfsoft.sh: if the CDFSOFT PATH is accessible on the worker node, publishes cdfsoft_installed=1.
publish_os.sh: publishes the OS version (SL4/SL5/UNKNOWN).
At submission time users can ask for cdfsoft and a specific OS, and the job will then match only the correct resources (i.e. worker nodes).
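A sketch in the spirit of publish_cdfsoft.sh (the real custom scripts are shell): check whether the CDF software area is visible on the worker node and append the result to the glidein_config file so it can be advertised. The path and the publishing format are simplified assumptions; the real scripts also register the attribute in the condor_vars file.

# Sketch of a publish-style glideinWMS custom script; names are simplified.
import os
import sys

glidein_config = sys.argv[1]                        # passed in by the glidein
cdfsoft_path = "/afs/infn.it/project/cdf/cdfsoft"   # example AFS location

installed = "1" if os.path.isdir(cdfsoft_path) else "0"
with open(glidein_config, "a") as f:
    f.write("cdfsoft_installed %s\n" % installed)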

Performance
Plots from the web monitor: running jobs in the last 3 months, running jobs per site (yesterday), waiting jobs in the last 3 months.

Performance
Plots: glideins running at CNAF in the last month; jobs per site (CNAF, Bari, Legnaro, Padova, Pisa, Roma 1 and 2, Spain, France, Germany).

Main problems during installation
None really important. Would it be possible to add more examples and use cases to the documentation? For example, the use of match_expression in the frontend configuration is not very clear for non-experts.

Main problems during running
We hit the maximum number of parallel jobs on the headnode and a headnode reboot was needed: the headnode RAM had not been taken into account properly. Would it be possible to have specific instructions on how to optimize the configuration for the headnode hardware?
A site went bad (no cdfsoft) after Condor startup, but the glidein had already published cdfsoft_installed=1, so all jobs on that node failed. We will implement a cron job to periodically check the health of the different sites.
Sometimes we found it difficult to understand Condor log messages. Would it be possible to have a more direct link to descriptions of the Condor error messages?

Eurogrid and the UCSD factory
Responses from the UCSD admins are very fast and precise, thank you! A few requests :-) Is it possible to add a link to the factory monitor from the main glideinWMS page? It would be very useful (and less work for you) if we (the VO) could look at some of your log files.

Summary
We are very happy with glideinWMS (and Condor): it is relatively easy to install and maintain even for non-experts. Performance in the first months is encouraging and we have had really good feedback from users.

BACKUP

Example
1. The glidein software unpacks the tarball compat_libs.tgz and, if successful, sets the variable CDFGLIDE_COMPAT_LIB_DIR and writes it into the glidein_config file, which is passed as input to the other scripts.
2. After that, other files are transferred: the config file and the cdf_compat_libs.sh script, which is set to executable and will be run by the glidein.
3. In the cdf_compat_libs.sh script, config variables are used to interact with the environment of the job.
4. The script then creates the necessary symlinks to the unpacked libraries and sets the correct environment variables, writing them into the glidein_config and condor_vars files.

Class-ad example (I)

Executable = stage/cafexe
Arguments = -s 130 -job_start_section 1 -job_end_section 250 -user tonel -cdfsoft /afs/infn.it/project/cdf/cdfsoft -krb5cc krb5cc_tonel -infile job_in.tgz -inurl http://cdfheadsrv.cr.cnaf.infn.it:2080/condorcafstage -insha1 7e266d1d7a433c9a067b2c0d76204d027aa108f4 -proxyfile proxies.cfg -proxyidx 18 -outfile rcp tonel@npisa30.fnal.gov:/data/npisa30/a/feirsteo/bsana/fit/10fbfit/coverage/outcaf/universes-ultimate/2d/universe-17/bs_toymc-tonel-universe-17-univ-17-fit-2-d-sec-130.tgz -outfile2 fcp cdfdata@cdfdata01.cr.cnaf.infn.it:/storage/gpfs_cdf0/icaf/cdfdata/icaf//icaf_out_tonel_130.tgz -iomonfile.logiofile.log -icaf cdfdata01.cr.cnaf.infn.it /storage/gpfs_cdf0/icaf/tonel/scratch -mintime 300 -maxtime 165600 -submit_time 1326682582 -log CafExe_130.log -jobout job_130.out -joberr job_130.err -iomap iomap.txt -frontier http://squid-cdf-2.cr.cnaf.infn.it:3128 -sam_cpp_api -callback 131.154.129.142 8125 -cafname EUROGRID --./run-pvalue-universes-2d-ultimate.csh 130 17 2
Universe = vanilla
+AccountingGroup = "group_cdfgeneric.tonel"
requirements = (Memory>=200) && (Disk>=4000000) && (OpSys == "LINUX") && ((Arch == "INTEL") || (Arch == "x86") || (Arch == "x86_64") || (Arch == "X86") || (Arch == "X86_64")) && stringListMember(GLIDEIN_Site, DESIRED_Sites) && (TARGET.os_installed == "SL5")
Environment = LD_LIBRARY_PATH=/lib:/user/lib
x509userproxy = /home/cdfcaf/cnafcafdata/cafcondor/tickets/x509cc_tonel
Environment = LD_LIBRARY_PATH=/lib:/usr/lib
Notification = Never
+Owner = undefined

Class-ad example (II)

copy_to_spool = false
job_lease_duration = 10000
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = /home/cdfcaf/cnafcafdata/cafcondor/tickets/krb5cc_tonel,/home/cdfcaf/cnafcaf/config/proxies.cfg,stage/iomap.txt,job. usterid
encrypt_input_files = /home/cdfcaf/cnafcafdata/cafcondor/tickets/krb5cc_tonel
on_exit_remove = true
Log = job.log
Output = section_130.out
Error = section_130.err
Periodic_Remove = (((JobStatus == 2) && ((CurrentTime - JobCurrentStartDate) > 168000)) =?= True)
+CAFPool = "EUROGRID"
+CAFGroup = "long"
+CAFAcctGroup = "group_cdfgeneric"
+CAFSection = 130
+CAFDH = "none"
+CAFAllowPreempt = TRUE
+CAFMaxTime = 166800
+DESIRED_Sites = "CNAF,Legnaro,Bari,KIT,CCIN2P3,PIC,Pisa,Roma1,Roma2,Padova"
Queue