Factory Ops Site Debugging

Similar documents
Primer for Site Debugging

glideinwms architecture by Igor Sfiligoi, Jeff Dost (UCSD)

Condor-G: HTCondor for grid submission. Jaime Frey (UW-Madison), Jeff Dost (UCSD)

Configuring a glideinwms factory

glideinwms Training Glidein Internals How they work and why by Igor Sfiligoi, Jeff Dost (UCSD) glideinwms Training Glidein internals 1

Flying HTCondor at 100gbps Over the Golden State

Introducing the HTCondor-CE

Shooting for the sky: Testing the limits of condor. HTCondor Week May 2015 Edgar Fajardo On behalf of OSG Software and Technology

OSG Lessons Learned and Best Practices. Steven Timm, Fermilab OSG Consortium August 21, 2006 Site and Fabric Parallel Session

glideinwms UCSD Condor tunning by Igor Sfiligoi (UCSD) UCSD Jan 18th 2012 Condor Tunning 1

Eurogrid: a glideinwms based portal for CDF data analysis - 19th January 2012 S. Amerio. (INFN Padova) on behalf of Eurogrid support group

Look What I Can Do: Unorthodox Uses of HTCondor in the Open Science Grid

A Virtual Comet. HTCondor Week 2017 May Edgar Fajardo On behalf of OSG Software and Technology

Building Campus HTC Sharing Infrastructures. Derek Weitzel University of Nebraska Lincoln (Open Science Grid Hat)

CCB The Condor Connection Broker. Dan Bradley Condor Project CS and Physics Departments University of Wisconsin-Madison

glideinwms Frontend Installation

History of SURAgrid Deployment

Troubleshooting Grid authentication from the client side

The HTCondor CacheD. Derek Weitzel, Brian Bockelman University of Nebraska Lincoln

How to install Condor-G

Care and Feeding of HTCondor Cluster. Steven Timm European HTCondor Site Admins Meeting 8 December 2014

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

BOSCO Architecture. Derek Weitzel University of Nebraska Lincoln

One Pool To Rule Them All The CMS HTCondor/glideinWMS Global Pool. D. Mason for CMS Software & Computing

Outline. ASP 2012 Grid School

An update on the scalability limits of the Condor batch system

Grid Compute Resources and Grid Job Management

Agent Teamwork Research Assistant. Progress Report. Prepared by Solomon Lane

Architecture Proposal

HTCondor overview. by Igor Sfiligoi, Jeff Dost (UCSD)

Introduction to SciTokens

glideinwms: Quick Facts

GROWL Scripts and Web Services

Introduction to HTCondor

HTCONDOR USER TUTORIAL. Greg Thain Center for High Throughput Computing University of Wisconsin Madison

Tutorial 4: Condor. John Watt, National e-science Centre

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

Troubleshooting Grid authentication from the client side

Scalability and interoperability within glideinwms

AutoPyFactory: A Scalable Flexible Pilot Factory Implementation

Regional SEE-GRID-SCI Training for Site Administrators Institute of Physics Belgrade March 5-6, 2009

Corral: A Glide-in Based Service for Resource Provisioning

! " #$%! &%& ' ( $ $ ) $*+ $

glideinwms experience with glexec

Using the MyProxy Online Credential Repository

Installation of CMSSW in the Grid DESY Computing Seminar May 17th, 2010 Wolf Behrenhoff, Christoph Wissing

HIGH-THROUGHPUT COMPUTING AND YOUR RESEARCH

ARC integration for CMS

Managing Grid Credentials

LHCb experience running jobs in virtual machines

CREAM-WMS Integration

Twitch Plays Pokémon: Twitch s Chat Architecture. John Rizzo Sr Software Engineer

Installation and Administration

Day 9: Introduction to CHTC

Interfacing HTCondor-CE with OpenStack: technical questions

glite Grid Services Overview

An XACML Attribute and Obligation Profile for Authorization Interoperability in Grids

Who am I? I m a python developer who has been working on OpenStack since I currently work for Aptira, who do OpenStack, SDN, and orchestration

DIRAC pilot framework and the DIRAC Workload Management System

The SciTokens Authorization Model: JSON Web Tokens & OAuth

VC3. Virtual Clusters for Community Computation. DOE NGNS PI Meeting September 27-28, 2017

What s new in HTCondor? What s coming? HTCondor Week 2018 Madison, WI -- May 22, 2018

First evaluation of the Globus GRAM Service. Massimo Sgaravatto INFN Padova

Grid Security Infrastructure

MyProxy Server Installation

Xrootd Monitoring for the CMS Experiment

Grids and Security. Ian Neilson Grid Deployment Group CERN. TF-CSIRT London 27 Jan

Edinburgh (ECDF) Update

Distributed production managers meeting. Armando Fella on behalf of Italian distributed computing group

Grid Computing Fall 2005 Lecture 16: Grid Security. Gabrielle Allen

Grid Authentication and Authorisation Issues. Ákos Frohner at CERN

Client tools know everything

Globus Toolkit Firewall Requirements. Abstract

CMS Tier-2 Program for user Analysis Computing on the Open Science Grid Frank Würthwein UCSD Goals & Status

CMS experience of running glideinwms in High Availability mode

BOSCO Architecture. Derek Weitzel University of Nebraska Lincoln

Gridbus Portlets -- USER GUIDE -- GRIDBUS PORTLETS 1 1. GETTING STARTED 2 2. AUTHENTICATION 3 3. WORKING WITH PROJECTS 4

Logging. About Logging. This chapter describes how to log system messages and use them for troubleshooting.

High Throughput WAN Data Transfer with Hadoop-based Storage

Table of Contents OSG Blueprint Introduction What Is the OSG? Goals of the Blueprint Document Overview...

An XACML Attribute and Obligation Profile for Authorization Interoperability in Grids

ALHAD G. APTE, BARC 2nd GARUDA PARTNERS MEET ON 15th & 16th SEPT. 2006

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

Connecting Restricted, High-Availability, or Low-Latency Resources to a Seamless Global Pool for CMS

Kerberos & HPC Batch systems. Matthieu Hautreux (CEA/DAM/DIF)

Enabling Distributed Scientific Computing on the Campus

IMS Bench SIPp. Introduction. Table of contents

op5 Monitor 4.1 Manual

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet

Understanding StoRM: from introduction to internals

The EU DataGrid Fabric Management

Introduction to Distributed HTC and overlay systems

Bookkeeping and submission tools prototype. L. Tomassetti on behalf of distributed computing group

WMS overview and Proposal for Job Status

Service Bus Guide. September 21, 2018 Version For the most recent version of this document, visit our documentation website.

The University of Oxford campus grid, expansion and integrating new partners. Dr. David Wallom Technical Manager

Microsoft Exchange Proxy Settings Outlook 2010 Gpo

InterPBX Communication System

EUROPEAN MIDDLEWARE INITIATIVE

! " # " $ $ % & '(()

Transcription:

Factory Ops Site Debugging This talk shows detailed examples of how we debug site problems By Jeff Dost (UCSD) Factory Ops Site Debugging 1

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 2

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 3

Validation analyze_entries shows 100% validation failure, mostly affecting CMSG-v1_0: strt fval 0job val idle wst badp waste time total CMS_T2_US_Purdue_hadoop 100% 100% 100% 100% 0% 100% 100% 72 72 207 Use entry_ls to find logs: $ entry_ls -fo CMS_T2_US_Purdue_hadoop fecmsglobal 20140605 /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t2_us_purdue_hadoop/job.2193146.0.out /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t2_us_purdue_hadoop/job.2193193.0.out /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t2_us_purdue_hadoop/job.2193720.0.out See if validation script generated an xml report: $ cat_xmlresult /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t2_us_purdue_hadoop/job. 2193720.0.out less Factory Ops Site Debugging 4

Validation Looks like grid-proxy-init is not found: <result> <status>error</status> <metric name="testid" ts="2014-06-03t13:20:56-04:00" uri="local">main/setup_x509.sh</metric> <metric name="failure" ts="2014-06-03t13:20:29-04:00" uri="local">wn_resource</metric> <metric name="command" ts="2014-06-03t13:20:35-04:00" uri="local">grid-proxy-init</metric> </result> <detail> Validation failed in main/setup_x509.sh. grid-proxy-init command not found in path! </detail> Find affected worker nodes: $ entry_ls -fo CMS_T2_US_Purdue_hadoop fecmsglobal 20140605 xargs cat_xmlresult get_wns sort uniq cms-e002.rcac.purdue.edu Problem seems isolated to a single node Open a ticket on site and report the error and affected WN hostname Factory Ops Site Debugging 5

Validation analyze_entries shows 100% validation failure, mostly affecting CMSG-v1_0: strt fval 0job val idle wst badp waste time total CMS_T3_US_FIU_fiupg 100% 100% 100% 100% 0% 100% 100% 87 87 260 Use entry_ls to find logs: $ entry_ls -fo CMS_T3_US_FIU_fiupg fecmsglobal 20140615 /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t3_us_fiu_fiupg/job.2094373.0.err /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t3_us_fiu_fiupg/job.2094384.0.err /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t3_us_fiu_fiupg/job.2094520.0.err See if validation script generated an xml report: $ cat_xmlresult.py /var/log/gwms-factory/client/user_fecmsglobal/glidein_gfactory_instance/entry_cms_t3_us_fiu_fiupg/ job.2098206.0.out less Factory Ops Site Debugging 6

Validation Unfortunately this one has no report: <result> <status>error</status> <metric name="testid" ts="2014-06-15t22:53:41-04:00" uri="local">client/discover_cmssw.sh</metric> </result> <detail> Validation failed in client/discover_cmssw.sh. The test script did not produce an XML file. No further information available. </detail> If no report the next best thing is to open the logs and look for any contextual info: Signature OK for client:discover_cmssw.e5lfkp.sh. cmsset_default.sh not found!\n Looked in /cmsset_default.sh and /osg/apps/cmssoft/cms/cmsset_default.sh and /cms.cern.ch/cmsset_default.sh and /cvmfs/cms.cern.ch/cmsset_default.sh === Validation error in /var/opt/condor/execute/dir_12458/glide_d12512/client/discover_cmssw.sh === Factory Ops Site Debugging 7

Validation In this case it is due to the glidein failing to find CMS Software. It was likely a misconfiguration at the site. If it is a FE validation script and you don't know how to interpret it, ask other operators at osggfactory-support@physics.ucsd.edu If we don't know either, then we email the FE admin to ask for help Factory Ops Site Debugging 8

Validation Get list of worker nodes where failures occurred: $ entry_ls -fo CMS_T3_US_FIU_fiupg fecmsglobal 20140615 xargs cat_xmlresult get_wns sort uniq compute-0-10.local compute-0-11.local compute-0-13.local compute-0-15.local compute-0-16.local Open a ticket with the site and provide the error and list of affected nodes Factory Ops Site Debugging 9

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 10

Rundiff CMS_T1_UK_RAL_arc_ce01- Registered significantly lower than running on factorystatus Factory Ops Site Debugging 11

Rundiff Run entry_q to ensure they aren't just stale glideins (runtime >> maxwalltime) $ entry_q CMS_T1_UK_RAL_arc_ce01 head -- Schedd: schedd_glideins2@gfactory-1.t2.ucsd.edu : <169.228.38.36:50104> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 2463093.6 fecmsglobal 6/18 11:33 2+07:21:49 R 0 0.0 glidein_startup.sh 2463106.0 fecmsglobal 6/18 11:41 2+07:21:49 R 0 0.0 glidein_startup.sh 2463106.1 fecmsglobal 6/18 11:41 2+07:18:38 R 0 0.0 glidein_startup.sh Factory Ops Site Debugging 12

Rundiff CompletedStats don't show obvious validation errors Fraction wasted due to node validation Factory Ops Site Debugging 13

Rundiff Pick a few glideins and look at StartdLog: $ cat_startdlog job.2469098.0.err grep '^06' less... 06/20/14 14:42:10 (pid:5080) condor_write(): Socket closed when trying to write 4096 bytes to collector vocms032.cern.ch:9938, fd is 9 06/20/14 14:42:10 (pid:5080) Buf::write(): condor_write() failed TCP connections between the glidein (worker node) and User Collector are dropping Because this is not happening to glideins from the same VO at other sites it is likely occurring on the site side Common causes are NAT or firewall issues Open ticket with site and include found error Factory Ops Site Debugging 14

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 15

Held All CMSG-v1_0 glideins going held at CMSHTPC_T2_US_UCSD_gw4 Factory Ops Site Debugging 16

Held Check hold reason: $ entry_q CMSHTPC_T2_US_UCSD_gw4 -held -- Schedd: schedd_glideins3@gfactory-1.t2.ucsd.edu : <169.228.38.36:46438> ID OWNER HELD_SINCE HOLD_REASON 2236925.0 fecmsglobal 5/27 21:14 Error connecting to schedd osg-gw-4.t2.ucsd.edu: AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to authenticate. Globus is reporting error (655360:97) AUTHENTICATE:1004:Failed to authenticate using FS Site seems to be rejecting proxy. Check to make sure it hasn't expired Find the proxy on disk: entry_q CMSHTPC_T2_US_UCSD_gw4 2236925.0 -format '%s\n' x509userproxy /var/lib/gwms-factory/client-proxies/user_fecmsglobal/glidein_gfactory_instance/credential_cmsg-v1_0.main_41186 Factory Ops Site Debugging 17

Held Proxy has not expired, must be a CE problem: $ sudo voms-proxy-info -all /var/lib/gwms-factory/client-proxies/user_fecmsglobal/glidein_gfactory_instance/credential_cmsgv1_0.main_411868 subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1687647593 issuer : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch identity : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch type : RFC compliant proxy strength : 1024 bits path : /var/lib/gwms-factory/client-proxies/user_fecmsglobal/glidein_gfactory_instance/credential_cmsg-v1_0.main_411868 timeleft : 66:50:11 key usage : Digital Signature, Key Encipherment === VO cms extension information === VO : cms subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch issuer : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch attribute : /cms/role=pilot/capability=null attribute : /cms/role=null/capability=null attribute : /cms/dcms/role=null/capability=null attribute : /cms/escms/role=null/capability=null attribute : /cms/itcms/role=null/capability=null attribute : /cms/local/role=null/capability=null attribute : /cms/uscms/role=null/capability=null timeleft : 66:50:11 uri : voms2.cern.ch:15002 Open ticket with site with hold error and proxy info Factory Ops Site Debugging 18

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 19

Waiting CMS_T2_US_Purdue_cmstest1 100% Waiting, 0 Running condor_activity log reveals CE is unreachable: 026 (1555490.000.000) 06/23 10:37:45 Detected Down Globus Resource... GridResource: condor cmstest1.rcac.purdue.edu cmstest1.rcac.purdue.edu:9619 020 (1555490.001.000) 06/23 10:37:45 Detected Down Globus Resource RM-Contact: cmstest1.rcac.purdue.edu... 026 (1555490.001.000) 06/23 10:37:45 Detected Down Globus Resource... GridResource: condor cmstest1.rcac.purdue.edu cmstest1.rcac.purdue.edu:9619 Factory Ops Site Debugging 20

Waiting CMS_T2_US_Purdue_cmstest1 100% Waiting, 0 Running Confirm this with telnet: $ telnet cmstest1.rcac.purdue.edu 9619 Trying 128.211.140.100... telnet: connect to address 128.211.140.100: Connection refused Factory Ops Site Debugging 21

Waiting Look for downtime notice in RSV: Site in UNKNOWN status, lots of red X failures, and no sign of planned maintenance Open ticket on site, and provide downtime message, RSV link, and how long it has been down Factory Ops Site Debugging 22

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 23

Pending OSG_US_Buffalo_u2-grid abruptly stopped working and went 100% pending Factory Ops Site Debugging 24

Pending The constant 0 slope of Pending is suspicious Typically this means we are no longer getting accurate updates back from the gatekeeper Factory Ops Site Debugging 25

Pending condor_activity shows no obvious issues other than jobs never getting past pending stage: 000 (1595555.009.000) 07/05 11:31:54 Job submitted from host: <129.79.53.27:32667> 027 (1595555.009.000) 07/05 11:32:07 Job submitted to grid resource GridResource: condor u2-grid.ccr.buffalo.edu u2-grid.ccr.buffalo.edu:9619... GridJobId: condor u2-grid.ccr.buffalo.edu u2-grid.ccr.buffalo.edu:9619 533839.0 Try removing pending glideins to see if fresh ones start Factory Ops Site Debugging 26

Pending First find cluster id of last pending glidein: entry_q OSG_US_Buffalo_u2-grid -- Schedd: schedd_glideins3@glidein.grid.iu.edu : <129.79.53.27:32667> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1607244.0 feosgflock 7/5 16:04 0+00:00:00 I 0 0.0 glidein_startup.sh 1607244.1 feosgflock 7/5 16:04 0+00:00:00 I 0 0.0 glidein_startup.sh 1607244.2 feosgflock 7/5 16:04 0+00:00:00 I 0 0.0 glidein_startup.sh... 1607268.9 feosgflock 7/5 16:11 0+00:00:00 I 0 0.0 glidein_startup.sh Remove all glideins up to and including this one: entry_rm OSG_US_Buffalo_u2-grid -const 'clusterid<=1607268' Jobs matching constraint (((GlideinFactory=?="OSGGOC")&&(GlideinName=?="v3_0")&&(GlideinEntryName=? ="OSG_US_Buffalo_u2-grid"))&&(clusterid<=1607268)) have been marked for removal Factory Ops Site Debugging 27

Pending In this case, removing did the trick, and fresh glideins started running normally If that still doesn't work, open ticket with site, and provide pilot proxy info, as well as a few grid job ids along with submit times seen in activity log, e.g.: 000 (1595555.009.000) 07/05 11:31:54 Job submitted from host: <129.79.53.27:32667> 027 (1595555.009.000) 07/05 11:32:07 Job submitted to grid resource... GridResource: condor u2-grid.ccr.buffalo.edu u2-grid.ccr.buffalo.edu:9619 GridJobId: condor u2-grid.ccr.buffalo.edu u2-grid.ccr.buffalo.edu:9619 533839.0 Factory Ops Site Debugging 28

When Gridmanager Loses Track In the last example the gridmanager no longer was receiving updates from the CE One useful trick to determine this is by comparing the entry on this factory with the same entry on other factories If the problem is only observed on one factory, it is usually safe to assume it is just gridmanager CE issues In this case, the glideins can be considered stale and can be safely removed Factory Ops Site Debugging 29

Overview Validation Rundiff Held Waiting Pending Unmatched Factory Ops Site Debugging 30

Unmatched Fermilab FE 100% waste in analyze_entries, no glideins running user jobs (100% idle): strt fval 0job val idle wst badp waste time total CMS_T2_US_Nebraska_Red_gw1 0% 0% 100% 0% 100% 100% 100% 1259 1259 3296 CMS_T2_US_Nebraska_Red_gw2 0% 0% 100% 0% 100% 100% 100% 1181 1181 3095 But condor isn't failing and there are no validation errors. Next, look at factorystatus Factory Ops Site Debugging 31

Unmatched FactoryStatus shows Fermilab glideins are 100% Unmatched at Nebraska Factory Ops Site Debugging 32

Unmatched At this point it is likely a Frontend problem Glideins are being submitted on behalf of the FE users at Nebraska...But the actual user jobs are not matching the glideins that started up Contact the FE admin and ask them to investigate their glidein start expressions Give the list of sites we see this on, and for how long it has been happening Factory Ops Site Debugging 33