Building Campus HTC Sharing Infrastructures. Derek Weitzel, University of Nebraska-Lincoln (Open Science Grid Hat)


HCC: Campus Grids Motivation. We have 3 clusters in 2 cities. Our largest (4,400 cores) is always full.

HCC: Campus Grids Motivation. Workflows may require more power than is available on a single cluster, certainly more than a full cluster can provide. Offload single-core jobs to idle resources, making room for specialized (MPI) jobs.

HCC Campus Grid Framework Goals
- Encompass: the campus grid should reach all clusters on the campus.
- Transparent execution environment: there should be an identical user interface for all resources, whether running locally or remotely.
- Decentralization: a user should be able to utilize his local resource even if it becomes disconnected from the rest of the campus. An error on a given cluster should only affect that cluster.


Encompass Challenges. Clusters have different job schedulers: PBS & Condor. Each cluster has its own policies: user priorities, allowed users. We may need to expand outside the campus.

HCC Model for a Campus Grid. [Diagram: a user submits to their local campus clusters (PBS); the Campus Grid Factory (CGF) overlays the local clusters and provides an OSG interface to other campuses and to the OSG: "me, my friends, and everyone else."]

Preferences/Observations. Prefer not installing Condor on every worker node when PBS is already there; it is less intrusive for sysadmins. PBS and Condor should coordinate job scheduling: running Condor jobs look like idle cores to PBS, and we don't want PBS to kill Condor jobs if it doesn't have to.

Problem: PBS & Coordination Initial: is running a job. Worker Node Worker Node Idle PBS Idle PBS 9

Problem: PBS & Coordination PBS Starts a job restarts job Worker Node Worker Node Restart PBS Idle PBS 10

Problem: PBS & Coordination Real Problem: PBS doesn t know about Sees nodes as idle. PBS: Idle Worker Node Worker Node Idle PBS Idle PBS 11

Campus Grid Goals - Technologies
- Encompassed: BLAHP, Glideins (see earlier talk by Igor/Jeff), Campus Grid Factory
- Transparent execution environment: Flocking, Glideins
- Decentralized: Campus Grid Factory, Flocking

Encompassed: BLAHP. Written for the European Grid Initiative. Translates a Condor job into a PBS job. Distributed with Condor. With BLAHP, Condor can provide a single interface for all jobs, whether Condor or PBS.
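As a rough sketch of what BLAHP-backed submission looks like from the Condor side (the executable and file names here are illustrative, not from the talk), a grid-universe submit file of the "batch" type hands the job to the local PBS scheduler through BLAHP:

    universe      = grid
    grid_resource = batch pbs
    executable    = analyze.sh
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue

The Campus Factory uses this route to place its glideins on PBS; end users can keep ordinary vanilla-universe submit files.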

Putting it all Together. The Campus Grid Factory (http://sourceforge.net/apps/trac/campusfactory/wiki) looks for idle jobs in the users' schedds. [Architecture diagram: user login nodes each run a schedd for user submissions; the cluster login node runs the Factory, a schedd, and a Collector/Negotiator; the PBS head node runs the PBS scheduler; the PBS worker nodes run PBS moms that host Condor starters.]

Putting it all Together. Provides an on-demand pool for unmodified Condor clients via flocking. [Same architecture diagram: user schedds flock their idle jobs to the Campus Grid Factory's pool.]
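A minimal sketch of the flocking configuration this implies, assuming a hypothetical factory host cgf.example.edu and submit host login1.example.edu (names are illustrative, not from the talk): the user's schedd lists the factory pool as a flocking target, and the factory authorizes the submit host in return.

    # condor_config on the user's submit (schedd) machine
    FLOCK_TO = cgf.example.edu

    # condor_config on the Campus Grid Factory's collector/negotiator
    FLOCK_FROM  = login1.example.edu
    ALLOW_WRITE = $(ALLOW_WRITE), login1.example.edu

Because flocking is a standard Condor feature, the user's submit files and habits do not change at all.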

Putting it all Together. Creates an on-demand Condor cluster: Condor + Glideins + BLAHP + GlideinWMS + glue. [Same architecture diagram.]

Campus Grid Factory. Glideins on the worker nodes create the on-demand overlay cluster. [Same architecture diagram.]

Advantages for the Local Scheduler. Allows PBS to know about and account for outside jobs. Can co-schedule with local user priorities. PBS can preempt grid jobs in favor of local jobs.
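This works because each glidein is itself an ordinary batch job, so PBS sees and accounts for it like any other job and can preempt it under local policy. A hedged sketch of such a pilot job (the wrapper script name and resource requests are hypothetical):

    #!/bin/bash
    #PBS -l nodes=1:ppn=1,walltime=12:00:00
    #PBS -N glidein
    # Hypothetical wrapper: unpacks Condor, starts a startd that reports
    # to the campus factory's pool, and exits at the idle/walltime limit.
    ./start_glidein.sh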

Advantages of the Campus Factory. The user is presented with a uniform interface to resources. Can create an overlay network on any resource BLAHP can submit to: PBS, LSF, Condor. Uses well-established technologies: Condor, BLAHP, Glideins.

Problem with Pilot Submission. Problem with the Campus Factory: if it sees idle jobs, it assumes they will run on Glideins. But jobs may require specific software or RAM sizes, so the Campus Factory will waste cycles submitting Glideins that stay idle. Past solutions were filters, albeit sophisticated ones.
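For example (attribute names here are illustrative, not from the talk), a job with requirements like the following sits idle in the schedd, yet a factory that only counts idle jobs keeps submitting generic glideins that can never match it:

    # Vanilla-universe job that no generic glidein can satisfy
    # (HAS_MATLAB is a hypothetical custom machine attribute)
    requirements = (Memory >= 16384) && (HAS_MATLAB =?= True)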

Advanced Pilot Scheduling. What if we equated: Completed Glidein = Offline Node?

Advanced Scheduling: OfflineAds. OfflineAds were added to Condor for power management: when nodes are not needed, Condor can turn them off, but it needs to keep track of which nodes it has turned off and their (possibly special) abilities. OfflineAds describe a turned-off computer.

Advanced Scheduling: OfflineAds. Submitted Glidein = Offline Node. When a Glidein is no longer needed, it turns off; keep the Glidein's description in an OfflineAd. When a match is detected against the OfflineAd, submit an actual Glidein. It is reasonable to expect that submitting to the local scheduler (via BLAHP) will yield a similar Glidein.
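A hedged sketch of the mechanism, reusing Condor's power-management machinery: the collector keeps the departed glidein's ad marked Offline = True, the negotiator can still match jobs against it, and a match triggers submission of a real glidein. The config knob below is the standard persistent-ad log; the attribute values are illustrative, not from the talk.

    # condor_config on the factory's collector: keep (offline) ads on disk
    COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/OfflineAdsLog

    # Key attributes retained in the glidein's offline ad
    Offline    = True
    Memory     = 16384
    HAS_MATLAB = True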

Extending Beyond the Campus. Nebraska does not have idle resources: we are already running jobs on Firefly (~4,300 cores).

Extending Beyond the Campus - Options. In order to meet the transparent-execution goal, we need to send jobs outside the campus. Options for getting outside the campus: flocking to external Condor clusters, or a grid workflow manager such as GlideinWMS.

Extending Beyond the Campus: GlideinWMS. Expand further with the OSG production grid. GlideinWMS creates an on-demand Condor cluster on grid resources; the campus grid can flock to this on-demand cluster just as it would to another local cluster.

Campus Grid at Nebraska. Prairiefire runs PBS/Condor (like Purdue); Firefly is PBS only and is reached through the CGF overlay; a GlideinWMS interface connects to the OSG, and the campus grid can also flock to Purdue (DiaGrid). [Diagram: user jobs flow from the HCC campus across Prairiefire and the Firefly overlay, and out through the GlideinWMS OSG interface or by flocking to Purdue.]

HCC Campus Grid: 8 Million Hours. [Usage plot.]

Questions? [Closing slide repeats the campus grid model diagram: local PBS clusters joined by the CGF overlay, with the OSG interface to other campuses and the OSG: "me, my friends, and everyone else."]