OSG Lessons Learned and Best Practices. Steven Timm, Fermilab. OSG Consortium, August 21, 2006, Site and Fabric Parallel Session.

Introduction. Ziggy wants his supper at 5:30 PM. Users submit most jobs at 4:59 PM (on Friday). What to do so the system will stay up and Ziggy will get his supper on time?

Outline: Pre-Installation Planning, Hardware Selection, File Systems, Post-Install Configuration, Troubleshooting, How to Track a Job.

Hardware selection. Compute Element head node: should have at least 4GB of RAM (we have 6GB and we still swap); at least 2 CPUs, as fast as you can get; enough disk to hold the VDT software and logs (250GB or so). For a small to medium size cluster, it can be the NFS server for home areas too. If you are supporting a VO, use a 2nd machine as your VOMS server, of equal size to the above. GUMS (Grid User Management System) is a good way to manage users; if you're running it, use a 3rd server of equal size to the above.

File systems. Some documents have already been written; see DocDB: OSG Compute Element Best Practices, http://osg-docdb.opensciencegrid.org/cgi-bin/showdocument?docid=379, and NFS Lite, http://osg-docdb.opensciencegrid.org/cgi-bin/showdocument?docid=382.

Four File Systems Needed. Home areas for each user: quota of 50MB per user; nothing permanent is to be stored here. APP area for each VO's applications: quota of 10GB per VO; is there any way to do garbage collection here, or to make the VOs do it? DATA area for each VO's data: 2GB per job slot; again, is there any way to make the VOs clean up after themselves? Right now we use quotas to keep any one VO from overrunning the area. Even if you have SITE_READ/SITE_WRITE storage elements, you should have DATA too. WN_TMP area on each worker node: in the past VDS has required this to be a fixed path; in OSG 0.5.0 this will change so that it can be assigned at execute time.
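
As a rough illustration of the quota numbers above (not from the original slides), per-user and per-VO limits can be set with the standard Linux quota tools, assuming the file systems are mounted with usrquota/grpquota; the user name, group name, and mount points here are placeholders.
  # 50MB soft / 55MB hard per-user block quota on /home (1KB blocks); "alice" is a placeholder user
  setquota -u alice 50000 55000 0 0 /home
  # 10GB block quota for a placeholder VO group on the APP area
  setquota -g cmsgrp 10000000 11000000 0 0 /app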

To NFS or not to NFS. Pre-web-services GRAM makes pig-headed use of NFS and tolerates NFS failures badly (WS-GRAM doesn't need so much NFS). Small to medium cluster (120 machines): put user home areas on the Globus gatekeeper machine itself, and make a second server with APP and DATA. Condor negotiator/collector and NFS servers don't mix. Big cluster: use the NFS-Lite configuration or some global file system for home areas. A BlueArc NAS head has worked well for us thus far; others out there include Ibrix, Panasas, etc.

Pre-install planning, Software. Condor: install it first and separately, outside the VDT area; Condor needs to be upgraded more often than the rest of the VDT, and some versions may not be available in the VDT, so use the RPM. We install the RPM on every worker node; it helps with network trouble. It is needed for ManagedFork even if your main batch system isn't Condor. Accounts: create them in advance; we suggest the /sbin/nologin shell. Ganglia: have your gmetad somewhere other than your head node; it is a big network traffic generator and a CPU load too.
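
A minimal sketch of pre-creating a grid account with a no-login shell, as suggested above; the account name is a placeholder.
  # Pre-create a pool account that cannot log in interactively
  useradd -m -s /sbin/nologin uscms01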

Software, during install. I install the worker node client on the head node and on each worker; this puts a big load on the VDT server and on the CRL servers, and in 0.5.0 we plan to use Squid to mitigate it. ManagedFork: this is a lifesaver, and the only way to log and control what users are doing with the fork jobmanager. $VDT_LOCATION/globus/TRUSTED_CA is the default location for certificates now, but /etc/grid-security/certificates must also be symlinked to it, otherwise Condor >= 6.7.18 doesn't know where to get certificates from. Also make sure the worker node client's TRUSTED_CA points here too.
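
A sketch of the certificate symlink described above, assuming VDT_LOCATION is set in the environment as usual.
  # Point the standard grid-security certificate path at the VDT certificate area
  ln -s $VDT_LOCATION/globus/TRUSTED_CA /etc/grid-security/certificates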

Monitoring. 4 major monitoring packages: MonALISA; MIS-CI (feeds GridCat, did feed the ACDC dashboard); CEMon -> Generic Information Provider; GRIS -> Generic Information Provider. The frequencies of all monitoring software in the OSG are adjustable somehow; there is not enough time to go into detail here. Going slower than the default frequency is essential to smooth operation: the more things calling condor_q, the more room for trouble. Recent report: grep can be faster with LOCALE=C rather than US English.
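
An illustration of the locale tip above; the usual environment variable is LC_ALL (or LANG), and the log file here is just the gatekeeper log used elsewhere in these slides.
  # Byte-oriented matching in the C locale is often faster than a UTF-8/English locale
  LC_ALL=C grep GATEKEEPER_JM_ID /var/log/messages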

Monitoring. If a VO is in grid3-user-vo-map.txt, you need to support it, and grid3-user-vo-map.txt has to be right before GIP is configured. Generic Information Provider: be sure it advertises the right number of job slots, otherwise it's a "kick me" sign to the LCG. Stuff breaks: look at the daily MonALISA email and check that GRIS is up. CEMon is new; we have been running it at FNAL under OSG 0.4.1, and it can be memory hungry. Do we have to split some monitoring off to another node? Maybe eventually, especially with the advent of web services. Monitoring should use condor_q -format and condor_status -format.
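
A sketch of what using -format means in practice: ask condor_q and condor_status for only the attributes the monitor needs, which is much cheaper than parsing the full default output. The attribute choices here are illustrative.
  # Print only cluster.proc and job status instead of the full queue listing
  condor_q -format "%d." ClusterId -format "%d " ProcId -format "%d\n" JobStatus
  # Print only slot names and states from the collector
  condor_status -format "%s " Name -format "%s\n" State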

Importance of a central submit node. Have a central submit node!! Discourage people from submitting from their desktops: you can't control the configuration of desktops, and there are lots of ways to make them insecure. With a central submit node you can kill rogue Condor-G processes if necessary, and you get to see the same Condor errors the users are seeing. Most serious grid work needs a bigger staging disk than a desktop will provide anyway. Laptops get turned off, and jobs get held that would otherwise succeed.
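
For the point about killing rogue Condor-G jobs, a hedged example of removing one user's jobs from the central submit node; the owner name is a placeholder.
  # Remove every job owned by one user on this schedd
  condor_rm -constraint 'Owner == "someuser"'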

Tracking jobs. From the submit side, get the GridJobId with condor_q -l | grep GridJobId:
GridJobId = "gt2 fngp-osg.fnal.gov/jobmanager-condor https://fngp-osg.fnal.gov:18764/10515/1156102663/"
Then in /var/log/messages on the gatekeeper:
Aug 20 14:37:48 fngp-osg gridinfo[10515]: JMA 2006/08/20 14:37:48 GATEKEEPER_JM_ID 2006-08-20.14:37:43.0000010457.0000000000 for /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Charles C. Polly/UID=polly on 131.225.167.42
Aug 20 14:37:48 fngp-osg gridinfo[10515]: JMA 2006/08/20 14:37:48 GATEKEEPER_JM_ID 2006-08-20.14:37:43.0000010457.0000000000 mapped to minboone (12755, 5468)
Aug 20 14:37:48 fngp-osg gridinfo[10515]: JMA 2006/08/20 14:37:48 GATEKEEPER_JM_ID 2006-08-20.14:37:43.0000010457.0000000000 has GRAM_SCRIPT_JOB_ID 507209 manager type condor
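
Continuing the example above, a sketch of chaining from the gatekeeper log entry to the local batch job; the job-manager ID and Condor cluster number are the ones shown in the slide.
  # Find all gatekeeper log lines for this job-manager instance
  grep "GATEKEEPER_JM_ID 2006-08-20.14:37:43.0000010457.0000000000" /var/log/messages
  # Then look up the local Condor job it was mapped to
  condor_q 507209
  condor_history 507209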

Wish List. A better audit trail in Condor-G: make the source job ID available to the remote site and vice versa. A globus-job-manager reaper to kill hung processes. Better home area cleanup of /home/.globus/.gass_cache/*. The MySQL root account is open, why? Policies for VOs to clean up APP and DATA. A watchdog to look for broken services.

Conclusion. Ziggy is getting his supper!

Condor Configuration, what we do. The main condor_schedd is on the same node as the OSG compute element. A second dedicated machine is the Condor collector/negotiator; it should not be an NFS server of any sort. A third machine is for local submits to the cluster and runs CondorView. We have requested 3 more machines: a second Globus gatekeeper/condor_schedd machine for internal jobs, a separate server for Postgres, and a login node for submitting outbound grid jobs.
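
A minimal sketch (not from the slides) of how such a split is usually expressed in the Condor configuration: the compute element runs the schedd, while the dedicated central-manager machine runs only the collector and negotiator. The host name is a placeholder.
  # On the compute element / gatekeeper node
  DAEMON_LIST = MASTER, SCHEDD
  # On the dedicated collector/negotiator machine
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
  # Both point at the central manager
  CONDOR_HOST = condor-cm.example.org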

Condor continued. From the previous batch system users are used to no preemption, and we try to keep it that way. Group quotas keep any one group from taking over the whole cluster with week-long jobs; each job is auto-assigned to a group via a wrapper on condor_submit. Priority factors ensure local user jobs start ahead of OSG jobs. We use condor_quill on our busiest schedd (the database frontend makes condor_q and condor_history go faster). We use TCP to talk to the collector. The Condor config file and the Condor software are installed independently on each node, no NFS. IMPORTANT: change HOSTALLOW_WRITE to be something other than *.
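
A hedged sketch of the Condor knobs mentioned above, with made-up group names, quota numbers, and domain; the macro names are from the Condor 6.x era and should be checked against your version.
  # Group quotas: keep any one group from taking the whole pool (values are slot counts)
  GROUP_NAMES = group_cms, group_cdf
  GROUP_QUOTA_group_cms = 200
  GROUP_QUOTA_group_cdf = 100
  # Jobs land in a group via the AccountingGroup attribute set by the condor_submit wrapper
  # Send collector updates over TCP instead of UDP
  UPDATE_COLLECTOR_WITH_TCP = True
  # (older versions also want COLLECTOR_SOCKET_CACHE_SIZE set on the collector for TCP updates)
  # Do NOT leave write access open to the whole world
  HOSTALLOW_WRITE = *.yourdomain.example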