Corral: A Glide-in Based Service for Resource Provisioning


Corral: A Glide-in Based Service for Resource Provisioning
Gideon Juve, USC Information Sciences Institute (juve@usc.edu)

Outline
- Throughput Applications
- Grid Computing
- Multi-level Scheduling and Glideins
- Corral
- Example: SCEC CyberShake
- Future Work

Throughput Applications
Characterized by:
- Many tasks: thousands to millions
- Short task runtimes: may be less than 60s
- Tasks are commonly serial
- Key performance metric: time-to-solution
Examples: scientific workflows, parameter sweeps, master-worker, pleasantly parallel applications

Grid Computing
Grids:
- Benefit: provide plenty of computing resources
- Challenge: using those resources effectively
Grid overheads:
- Queuing delays
- Software overheads
- Scheduling delays
- Scheduling policies
=> Bad performance for throughput applications!
Some solutions:
- Task clustering (workflows)
- Advance reservations

Multi-level Scheduling
- A way for an application to use the grid without the overheads
- Overlay a personal cluster on top of grid resources
- Pilot jobs install and run a user-level resource manager, which contacts an application-specific scheduler to be matched with application jobs (see the sketch below)
- Glidein: how to do MLS using Condor
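
To make the dispatch advantage concrete, here is a toy Python sketch of the pull model that multi-level scheduling enables: pilots hold their slots for their whole lifetime and pull tasks straight from the application's own queue, so no per-task trip through the site batch scheduler is needed. This is illustrative only; real glideins run Condor daemons, not threads.

    # Toy pull-based dispatch: pilots claim tasks directly from the
    # application's queue, bypassing the site batch scheduler.
    import queue
    import threading

    tasks = queue.Queue()
    for i in range(100):                  # many short tasks
        tasks.put(f"task-{i}")

    def pilot(name):
        # Each pilot occupies one grid slot and repeatedly pulls
        # application tasks with no per-task queuing delay.
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                return                    # no work left: pilot exits
            print(f"{name} running {task}")
            tasks.task_done()

    pilots = [threading.Thread(target=pilot, args=(f"pilot-{n}",))
              for n in range(4)]
    for p in pilots:
        p.start()
    for p in pilots:
        p.join()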

Benefits of MLS and Glideins
- Running short jobs on the grid: Condor dispatches jobs faster than, e.g., Globus
- Bypass site scheduling policies: use application-specific policies, e.g. prioritize jobs based on application needs
- Avoid competition for resources: glideins reserve resources for multiple jobs, which minimizes queuing delays
- Better application scalability: compared to GT2 GRAM, for example, fewer jobmanagers => reduced load on the gateway

Corral
- Resource provisioning system
  - Uses the multi-level scheduling model
  - Allocates resources explicitly rather than implicitly: pay the allocation cost once and reuse the resources
  - Effectively minimizes grid overheads
  - Requires resource specification
- Web service
  - Automates the installation and configuration of Condor on grid sites
  - Submits glideins to provision resources

How Corral Works
[Diagram: local site with the user, the Corral server, and the Condor central manager; grid site with Globus, the LRM, a head node, and worker nodes]
In addition to the Condor central manager, the user runs a service container that hosts the Corral web service.

How Corral Works: Create Site
The user creates a new site by sending a request to Corral. The request specifies details about the grid site, including job submission information, file system paths, etc. (A hypothetical request is sketched below.)
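
The slides don't show the request format, so the following is a purely hypothetical sketch of the kind of information such a request might carry; every field name here is invented for illustration and is not Corral's actual schema.

    # Hypothetical site description for a Corral "create site" request.
    # All field names are invented; consult the Corral documentation
    # for the real schema.
    site = {
        "name": "ranger",                                       # label for this grid site
        "gatekeeper": "gatekeeper.example.edu/jobmanager-sge",  # GRAM contact
        "install_path": "/shared/apps/glidein",                 # shared-FS install location
        "local_path": "/tmp/glidein",                           # node-local scratch
        "queue": "normal",                                      # LRM queue for glidein jobs
        "project": "TG-ABC123",                                 # allocation to charge
    }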

How Corral Works: Install & Configure Condor
Corral installs Condor executables on a shared file system at the grid site. The appropriate executables are automatically selected and downloaded from a central repository based on architecture, operating system, and libraries.

How Corral Works: Create Glideins
The user provisions resources by submitting glidein requests to Corral. The request specifies the number of hosts/CPUs to acquire and the duration of the reservation.

How Corral Works: Submit Glidein Job
Corral translates the user's request into a glidein job, which is submitted to the grid site. (A rough sketch of such a job follows.)
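
For sites reached through pre-WS Globus GRAM, a glidein job might look roughly like the RSL below. The gatekeeper contact, paths, and values are placeholders for illustration, not Corral's actual output.

    # Sketch: submitting a glidein job via pre-WS Globus GRAM.
    import subprocess

    rsl = """&
     (executable = "/shared/apps/glidein/glidein_run")
     (count = 40)
     (hostCount = 5)
     (maxWallTime = 120)
     (queue = "normal")
     (jobType = multiple)
    """

    with open("glidein.rsl", "w") as f:
        f.write(rsl)

    # -b: batch submit; -r: the remote GRAM resource contact
    subprocess.run(["globusrun", "-b",
                    "-r", "gatekeeper.example.edu/jobmanager-sge",
                    "-f", "glidein.rsl"], check=True)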

How Corral Works: Start Glidein
The glidein job starts Condor daemons (master and startd) on the worker nodes of the grid site.
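
A minimal sketch of the kind of configuration a glidein startd might run with: the macro names (CONDOR_HOST, DAEMON_LIST, START, STARTD_NOCLAIM_SHUTDOWN) are real Condor settings, but the values and the Python wrapper are illustrative assumptions.

    # Sketch: write a minimal glidein Condor configuration.
    glidein_config = """\
    # Report to the user's central manager
    CONDOR_HOST = cm.example.edu
    # Run only the master and startd on worker nodes
    DAEMON_LIST = MASTER, STARTD
    # Node-local scratch space
    LOCAL_DIR = /tmp/glidein
    # Accept any matching job; never suspend or preempt
    START = TRUE
    SUSPEND = FALSE
    PREEMPT = FALSE
    # Exit after 20 minutes with no claims
    STARTD_NOCLAIM_SHUTDOWN = 1200
    """

    with open("condor_config.local", "w") as f:
        f.write(glidein_config)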

How Corral Works: Contact Central Manager
The Condor daemons contact the user's central manager and become part of the user's resource pool.

How Corral Works: Submit Application Job
The user submits application jobs to their Condor pool. The jobs are matched with available glidein resources. (A sketch of such a submission follows.)
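
As a sketch, an application job destined for glidein slots can be an ordinary vanilla-universe submit description; the executable name and job count below are placeholders (the executable name borrows from the CyberShake example later in the talk).

    # Sketch: submit 100 application jobs to the user's Condor pool.
    import subprocess

    submit = """\
    universe     = vanilla
    executable   = seismogram_synthesis
    arguments    = $(Process)
    requirements = (Arch == "X86_64") && (OpSys == "LINUX")
    output       = job.$(Process).out
    error        = job.$(Process).err
    log          = job.log
    queue 100
    """

    with open("job.sub", "w") as f:
        f.write(submit)
    subprocess.run(["condor_submit", "job.sub"], check=True)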

How Corral Works: Run Application Job
The application jobs are dispatched to the remote workers for execution.

How Corral Works: Remove Glideins
When the user is finished with their application, they cancel their glidein request (or let it expire automatically).

How Corral Works: Cancel Glidein Job
When the request is cancelled, Corral kills the glidein job, which causes the Condor daemons on the worker nodes to un-register themselves from the user's pool and exit.

How Corral Works: Remove Site
The user can submit multiple glidein requests for a single site. When the user is done with the site, they ask Corral to remove it.

How Corral Works: Uninstall Condor
Finally, Corral removes Condor from the shared file system at the grid site. This removes all executables, configuration files, and logs.

Features
- Auto-configuration: detects architecture, OS, and glibc to select a package; determines the public IP (if any); generates the configuration file (see the sketch below)
- Large requests: 1 glidein job = N slots
- Multiple interfaces: command-line, SOAP, Java API
- Automatic resubmission: indefinitely, N times, or until a date/time
- Notifications: asynchronous API for integration with other tools
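
The auto-configuration step presumably probes the platform along these lines; this Python sketch shows the idea, with the package-name mapping invented for illustration.

    # Sketch: probe architecture, OS, and glibc to pick a package.
    # The package naming scheme here is invented, not Corral's.
    import platform

    arch = platform.machine()               # e.g. "x86_64"
    opsys = platform.system()               # e.g. "Linux"
    libc = "-".join(platform.libc_ver())    # e.g. "glibc-2.17"

    package = f"condor-{opsys.lower()}-{arch}-{libc}.tar.gz"
    print("would download:", package)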

Networking Challenges
- Firewalls / private IPs
  - Block communication between glideins and the pool
  - Workarounds: GCB/CCB, VPN, or running the central manager on the head node
  - Glideins can't be used on some sites
- Port usage
  - Condor requires many ports
  - An issue for LOWPORT/HIGHPORT firewall holes (see the sketch below)
  - TCP TIME_WAIT can consume ports
- WAN issues
  - Large glidein pools look like DDoS attacks
  - Traffic sometimes gets blocked
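
Condor's LOWPORT and HIGHPORT settings are real configuration macros that confine its ports to a firewall hole, as sketched below with an example range. With many daemons per node, a narrow range can be exhausted, especially while closed connections linger in TCP TIME_WAIT.

    # Sketch: confine Condor to an example 100-port firewall hole.
    port_config = """\
    LOWPORT = 20000
    HIGHPORT = 20100
    """

    with open("condor_config.local", "a") as f:
        f.write(port_config)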

SCEC CyberShake
- Probabilistic seismic hazard analysis workflow: how hard will the ground shake in the future?
- Uses Pegasus and DAGMan for workflow management

Transformation          Tasks      Runtime (s)
SGT Extraction          7,000      139
Seismogram Synthesis    420,000    48
Peak Ground Motion      420,000    1
Total tasks: 847,000    Mean runtime: 25.45 s
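
As a quick sanity check on the table, the total is the sum of the task counts and the mean runtime is the task-weighted average of the three transformations:

    # Verify the table's totals from its rows.
    rows = [("SGT Extraction", 7_000, 139),
            ("Seismogram Synthesis", 420_000, 48),
            ("Peak Ground Motion", 420_000, 1)]

    total_tasks = sum(t for _, t, _ in rows)
    mean = sum(t * r for _, t, r in rows) / total_tasks
    print(total_tasks, round(mean, 2))   # 847000 25.45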

CyberShake Progress
- Using Corral since January
- Provisioning resources from the TeraGrid: 185 requests, 33,137 slots, 240,496 CPU hours
- Application progress
  - Jan 2009 - Apr 2009: >11.3M tasks, >352K jobs
  - May 2009 (planned): ~168M tasks, ~5M jobs
- With glideins, a site can be completed in ~3 hours on 800 cores (down from 18+ hours)

Future Work
- Dynamic provisioning: automatically grow/shrink the pool according to application needs
- Support for other Condor features: GSI security, CCB for firewall traversal (GCB already supported), grid matchmaking, parallel universe
- Remote pool? Deploy the collector/negotiator/schedd as well

Try It Out
Website: http://pegasus.isi.edu/corral
Problems: juve@usc.edu
Questions?