glideinWMS architecture, by Igor Sfiligoi and Jeff Dost (UCSD)

Outline: a high-level overview of glideinWMS, followed by a description of its components.

glideinWMS from 10k feet

Refresher on HTCondor: an HTCondor pool is composed of three pieces: a central manager (collector and negotiator), submit nodes (schedds), and execute nodes (startds).

What is a glidein? A glidein is just a properly configured execute node submitted as a Grid job.

What is glideinWMS? glideinWMS is an automated tool for submitting glideins on demand.

glideinWMS architecture: glideinWMS has three logical pieces.

glideinWMS architecture: glidein_startup configures and starts the HTCondor execution daemons at the sites, handling runtime environment discovery and validation. The Factory knows about the sites and does the submission, providing the Grid knowledge and troubleshooting. The Frontend knows about the user jobs and requests glideins, providing the site selection logic and job monitoring.

Cardinality: an N-to-M relationship. Each Frontend can talk to many Factories, and each Factory may serve many Frontends.

Many operators: the Factory and the Frontend are usually operated by different people. Frontends are VO specific: they are operated by VO admins, and each sets the policies for its users. Factories are generic and do not need to be affiliated with any group; the Factory operators' main task is Grid monitoring and troubleshooting.

glideinWMS Components

glidein_startup: the glidein_startup script configures and starts HTCondor on the worker node.

glidein_startup tasks: validate the node environment, download the HTCondor binaries, configure HTCondor (performed by plugins), start the HTCondor daemon(s), collect post-mortem monitoring info, and clean up.
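The ordering of those phases is the key point. A minimal Python sketch of that sequence follows; the real glidein_startup is a shell script, and the helper bodies below are illustrative placeholders rather than the actual glideinWMS logic.

    # Sketch of the glidein_startup task sequence (illustrative only).
    import shutil, subprocess, sys, tempfile

    def validate_node(min_free_gb=10):
        # e.g. require some free disk space in the working area
        free_gb = shutil.disk_usage(".").free / 1e9
        return free_gb >= min_free_gb

    def main():
        workdir = tempfile.mkdtemp(prefix="glidein_")
        if not validate_node():
            sys.exit(1)                      # validation failed: report and quit
        # download the HTCondor binaries and run the config plugins (omitted), then:
        proc = subprocess.Popen(["condor_master", "-f"], cwd=workdir)  # start the daemons
        proc.wait()                          # daemons run until the glidein lifetime expires
        # collect post-mortem monitoring info, then clean up
        shutil.rmtree(workdir, ignore_errors=True)

    if __name__ == "__main__":
        main()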

glidein_startup plugins: config files and scripts are downloaded via HTTP, from both the Factory and the Frontend Web servers. A local Web proxy (e.g. Squid) can be used if available at the site. The mechanism is tamper-proof and cache-coherent.
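One common way to make cached HTTP downloads tamper-evident is to verify each file against a hash taken from a signed file list fetched first. The sketch below illustrates the idea only; it is not the actual glideinWMS download code, and the URL is hypothetical.

    # Hash-verified HTTP download (works through a site Squid via the proxy env vars).
    import hashlib, urllib.request

    def fetch_verified(url, expected_sha256):
        data = urllib.request.urlopen(url).read()
        digest = hashlib.sha256(data).hexdigest()
        if digest != expected_sha256:
            raise RuntimeError(f"tampered or stale download: {url}")
        return data

    # usage: the expected hash would come from a signed description file
    # script = fetch_verified("http://factory.example.org/stage/setup.sh", "ab12...")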

glidein_startup scripts: the standard plugins perform basic Grid node validation (certs, disk space, etc.) and set up HTCondor. VO-provided plugins are optional but can be anything; for example, the CMS frontend checks for the CMS software. The Factory admin can also provide plugins; we supply one that verifies the correct OS (rhel6, rhel7). Details about the plugins can be found at http://glideinwms.fnal.gov/doc.prd/factory/custom_scripts.html
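As an illustration of the kind of check such a plugin performs, here is a hypothetical OS-validation script. The real custom-script contract (arguments, glidein_config handling) is described at the page linked above; the accepted versions below are made up.

    # Hypothetical validation plugin: fail the glidein on an unsupported OS release.
    import re, sys

    ALLOWED = {"7", "8", "9"}   # hypothetical list of accepted EL major versions

    def el_major_version(path="/etc/redhat-release"):
        try:
            text = open(path).read()
        except OSError:
            return None
        m = re.search(r"release (\d+)", text)
        return m.group(1) if m else None

    if __name__ == "__main__":
        version = el_major_version()
        if version not in ALLOWED:
            print(f"Unsupported OS release: {version}", file=sys.stderr)
            sys.exit(1)         # a non-zero exit marks the node as invalid
        sys.exit(0)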

Glidein Factory: the Factory knows about the Grid and submits glideins.

Glidein Factory: the Factory knows how to submit to sites. Sites are described in a local config, and only trusted and tested sites are included. Each site entry in the config contains the contact info (hostname, grid type, queue name), the site config (startup dir, OS type, ...), the VOs supported, and other attributes (site name, core count, max memory, ...). The config is admin maintained and periodically compared against external information databases. See http://glideinwms.fnal.gov/doc.prd/factory/configuration.html
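The real Factory configuration is an XML file, as described at the link above. Purely to illustrate what one entry carries, here is a hypothetical entry rendered as a Python dict; every value is made up.

    # Hypothetical illustration of a Factory site entry (the real config is XML).
    site_entry = {
        "name": "Example_Site_CE1",            # entry name
        "gatekeeper": "ce1.example.org",       # contact hostname
        "gridtype": "condor",                  # e.g. an HTCondor-CE
        "queue": "default",
        "work_dir": "/scratch",                # startup dir on the worker node
        "os": "rhel9",
        "supported_vos": ["CMS", "OSG"],
        "attrs": {"GLIDEIN_CPUS": 8, "GLIDEIN_MaxMemMBs": 16000},
    }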

Glidein Factory role: the glidein Factory is just a slave; the Frontend(s) tell it how many glideins to submit and where. Once the glideins start to run, they report to the VO collector and the Factory is no longer involved. The communication is based on ClassAds, and the Factory has a Collector for this purpose.

Factory collector: the Factory collector handles all the communication. See http://glideinwms.fnal.gov/doc.prd/factory/design_data_exchange.html
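Conceptually, a Frontend request travels through the Factory collector as a ClassAd. The sketch below shows the flavor of such an ad using the classad module that ships with the HTCondor Python bindings; the attribute names are illustrative, not the exact glideinWMS ad schema (see the design_data_exchange page above for that).

    # Illustrative Frontend -> Factory request ad (attribute names are made up).
    import classad   # installed together with the htcondor Python bindings

    request = classad.ClassAd({
        "ReqName": "Example_Site_CE1",     # which Factory entry to use
        "ReqIdleGlideins": 30,             # keep this many glideins idle in the queue
        "ReqMaxGlideins": 500,             # never grow beyond this many
        "ClientName": "example_frontend",
    })
    print(request)   # what would be advertised to the Factory collector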

Frontends: the Factory admin decides which Frontends to serve. A Frontend must provide a valid proxy with a known DN, needed to talk to the collector. The Factory config has further fine-grained controls.

Glidein submission: the Factory EntryGroup process uses Condor-G to submit the glideins. Condor-G does the heavy lifting; the Factory just monitors the progress.
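A Condor-G glidein submission is essentially a grid-universe job whose executable is glidein_startup. The sketch below uses the htcondor Python bindings (version 10 series API) purely for illustration; the Factory itself drives condor_submit, and the CE hostname, file names, and arguments here are hypothetical.

    # Sketch of a Condor-G (grid universe) glidein submission.
    import htcondor

    sub = htcondor.Submit({
        "universe": "grid",
        "grid_resource": "condor ce1.example.org ce1.example.org:9619",  # an HTCondor-CE
        "executable": "glidein_startup.sh",
        "arguments": "-v std ...",                     # real glidein arguments omitted
        "output": "glidein_$(Cluster).$(Process).out",
        "error": "glidein_$(Cluster).$(Process).err",
        "log": "glidein.log",
    })

    schedd = htcondor.Schedd()
    result = schedd.submit(sub, count=10)   # ask for 10 pilots at this entry
    print("submitted cluster", result.cluster())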

Credentials/proxy: the proxy is provided by the VO Frontend. The proxy is encrypted and then delivered in the ClassAd; the Factory provides the encryption key (PKI). The proxy is stored on disk, and each VO is mapped to a different UID.
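The general pattern is hybrid encryption: the Factory publishes an RSA public key, the Frontend encrypts a fresh symmetric key with it and the proxy with the symmetric key, and both ciphertexts ride in the ClassAd. The sketch below shows that pattern with the Python cryptography package; it is not the actual glideinWMS crypto code, and the proxy bytes are a placeholder.

    # Hybrid (PKI) delivery of a proxy, for illustration only.
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    factory_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    proxy_bytes = b"-----BEGIN CERTIFICATE-----..."    # placeholder VO proxy

    sym_key = Fernet.generate_key()                    # per-request symmetric key
    enc_proxy = Fernet(sym_key).encrypt(proxy_bytes)   # goes into the ClassAd
    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)
    enc_key = factory_key.public_key().encrypt(sym_key, oaep)  # also into the ClassAd

    # The Factory, holding the private key, reverses the two steps:
    recovered_key = factory_key.decrypt(enc_key, oaep)
    assert Fernet(recovered_key).decrypt(enc_proxy) == proxy_bytes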

VO Frontend: the Frontend monitors the user HTCondor pool, does the matchmaking, and requests glideins.

VO Frontend: the VO Frontend is the brain of a glideinWMS-based pool; it acts like a site-level negotiator.

Two-level matchmaking: the Frontend triggers the glidein submission, while the regular negotiator matches jobs to glideins.

Frontend logic: the glideinWMS glidein request logic is based on the principle of constant pressure. The Frontend requests that a certain number of idle glideins be kept in the Factory queue at all times; it does not request a specific total number of glideins. This is done because of the asynchronous nature of the system: both the Factory and the Frontend run in a polling loop and talk to each other indirectly.

Frontend logic: the Frontend matches the job attributes against the entry attributes and then counts the matched idle jobs. A fraction of this number (up to 1/3) becomes the pressure request. The matchmaking expression is defined by the Frontend admin, not the user.
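A minimal sketch of that pressure calculation follows. The attribute names, the cap, and the exact use of the 1/3 fraction are illustrative, not the precise glideinWMS algorithm.

    # Constant-pressure request for one Factory entry (illustrative).
    def requested_idle_glideins(idle_jobs, match_expr, entry, fraction=1/3, cap=100):
        """How many idle glideins to keep queued at one Factory entry."""
        matched = [job for job in idle_jobs if match_expr(job, entry)]
        if not matched:
            return 0
        return min(max(1, int(len(matched) * fraction)), cap)

    # usage with a hypothetical match expression defined by the Frontend admin
    jobs = [{"RequestMemory": 2000, "DESIRED_Sites": {"Example_Site"}} for _ in range(90)]
    entry = {"GLIDEIN_Site": "Example_Site", "GLIDEIN_MaxMemMBs": 16000}
    expr = lambda j, e: (e["GLIDEIN_Site"] in j["DESIRED_Sites"]
                         and j["RequestMemory"] <= e["GLIDEIN_MaxMemMBs"])
    print(requested_idle_glideins(jobs, expr, entry))   # roughly a third of the 90 matched jobs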

Frontend config: the Frontend owns the glidein proxy and delegates it to the Factory(s) when requesting glideins; it must keep the proxy valid at all times (usually via cron). The VO Frontend can (and should) provide VO-specific validation scripts, and it can (and should) set the glidein start expression, used by the VO negotiator for the final matchmaking.
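A cron-driven check of the proxy lifetime is the usual way to catch an expiring glidein proxy early. The sketch below is one possible such check; the proxy path and threshold are hypothetical, and it assumes the voms-proxy-info tool from the VOMS clients is installed.

    # Cron-style proxy lifetime check (illustrative; renewal itself is VO specific).
    import subprocess, sys

    PROXY = "/var/lib/gwms-frontend/pilot.proxy"   # hypothetical proxy location
    MIN_SECONDS = 24 * 3600                        # want at least a day of lifetime left

    out = subprocess.run(["voms-proxy-info", "-file", PROXY, "-timeleft"],
                         capture_output=True, text=True)
    timeleft = int(out.stdout.strip()) if out.returncode == 0 else 0
    if timeleft < MIN_SECONDS:
        print(f"proxy at {PROXY} needs renewal ({timeleft}s left)", file=sys.stderr)
        sys.exit(1)    # let cron report the failure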

Summary: glideins are just properly configured HTCondor execute nodes submitted as Grid jobs, and glideinWMS is a mechanism to automate glidein submission. glideinWMS is composed of three logical entities, two of them actual services: the glidein_startup script that starts HTCondor on the site worker node, the Glidein Factories that submit glideins to the Grid, and the VO Frontends that drive the Factories based on user demand.

Pointers: the glideinWMS development team is reachable at glideinwms-support@fnal.gov. The official project Web page is http://glideinwms.fnal.gov. The OSG glidein factory at UCSD: http://www.t2.ucsd.edu/twiki2/bin/view/ucsdtier2/osggfactory and http://gfactory-1.t2.ucsd.edu/factory/monitor/

Acknowledgments: glideinWMS is a CMS-led project developed at FNAL. glideinWMS factory operations at UCSD are sponsored by OSG. The funding comes from NSF, DOE, and the UC system.