The GridWay Approach for Job Submission and Management on Grids

Eduardo Huedo, Rubén S. Montero, Ignacio M. Llorente

Laboratorio de Computación Avanzada, Centro de Astrobiología (INTA - CSIC), associated to NASA Astrobiology Institute
Distributed Systems Architecture & Security Group, Dpto. de Arquitectura de Computadores y Automática, Facultad de Informática (UCM)

Outline
- Motivation
- The GridWay Framework
- Resource Selection
- Adaptive Job Execution
- Example: Opportunistic Job Migration
- Summary

Motivation I

Globus Toolkit: enables secure multiple-domain operation with different resource managers and access policies.

Globus components:
- Resource Management (GRAM)
- Data Management (GridFTP & Replica Catalog)
- Grid Security Infrastructure (GSI)
- Information Service (MDS)

Questions left to the user, with the corresponding job-management step:
- Where do I execute my job? (resource selection)
- What do I need (files, ...)? (resource preparation)
- How do I execute my job? (job submission)
- How is my job doing? (monitoring)
- Can I move my job to a better host? (migration)
- How do I retrieve job output? (termination)

Motivation II

A Grid is characterized by:
- High fault rate: network
- Dynamic cost: time of the day (working / nonworking hours), resource load (peak / off-peak)
- Dynamic availability: job cancellation by remote administrator, addition and removal of resources
- Dynamic load: shared resources; idle hosts become saturated, and vice versa

A job must be able to migrate among grid resources to obtain application performance and fault tolerance.

The GridWay Framework

Project goal: easy and efficient execution of jobs on heterogeneous and dynamic grids in a "submit & forget" fashion.

Design guidelines:
- Easily adaptable (modular design)
- Easily scalable (decentralized architecture)
- Easily deployable (user privileges, standard services)
- Easily extensible (use of non-standard services)
- Easily applicable (ready to use for a wide range of applications)

The GridWay Framework: User Interface

gwps: displays job information and status

    JID AID TID DM        SM       GSM STIME ETIME EXETIME EXIT HOST    TEMPLATE
    0   --  --  submitted prologue --  --:-- --:-- --:--   --   columba job_template
    1   --  --  zombie    done     --  27:37 28:07 00:30   0    ursa    job_template
    7   --  --  pending   done     --  --:-- --:-- --:--   --   draco   job_template

gwhistory: displays job execution history

    HOST                             RANK STIME ETIME EXETIME MIGRATION_REASON
    columba.dacya.ucm.es             100  --:-- --:-- --:--   --
    ursa.dacya.ucm.es/jobmanager-grd 50   27:41 27:52 00:11   discovery timeout

gwkill: signals a job (kill, stop, resume, reschedule)
gwsubmit: submits a job or an array job
gwwait: waits for the zombie state of a job (any, all, set)

Client API: allows interaction with each module (DRMAA subset)
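The gwps listing above is plain whitespace-separated text, so it can be post-processed with a few lines of scripting. The following is a minimal sketch, not part of GridWay: it assumes that no field contains spaces (which would not hold for the MIGRATION_REASON column of gwhistory) and that the DM column carries the job state that gwwait waits on. The function names parse_gwps and finished_jobs are hypothetical.

```python
# Sample text copied from the gwps listing in the slides.
GWPS_OUTPUT = """\
JID AID TID DM SM GSM STIME ETIME EXETIME EXIT HOST TEMPLATE
0 -- -- submitted prologue -- --:-- --:-- --:-- -- columba job_template
1 -- -- zombie done -- 27:37 28:07 00:30 0 ursa job_template
7 -- -- pending done -- --:-- --:-- --:-- -- draco job_template
"""


def parse_gwps(text):
    """Split a whitespace-separated gwps listing into one dict per job."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    jobs = []
    for fields in rows:
        # Assumption: every row has exactly one value per header column.
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(fields)}")
        jobs.append(dict(zip(header, fields)))
    return jobs


def finished_jobs(jobs):
    """Return the jobs whose DM column reads 'zombie' (assumed to mean finished)."""
    return [job for job in jobs if job["DM"] == "zombie"]


jobs = parse_gwps(GWPS_OUTPUT)
for job in finished_jobs(jobs):
    print(job["JID"], job["HOST"], job["EXETIME"], job["EXIT"])
```

Keeping every value as a string avoids guessing how placeholders such as -- and --:-- should be interpreted.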

The GridWay Architecture

[Figure: GridWay architecture. Client host labels: Resource Selector (requirements, rank expression), Dispatch Manager, Request Manager, Performance Monitor, Performance Degradation Evaluator, performance profile, Submission Agent, Job Pool, Submission Manager, job files (executable, I/O files, checkpoint). Execution host labels: GateKeeper, JobManager, jobs, performance profile. Services and protocols shown: MDS GIIS/GRIS, GridFTP, GRAM request, GRAM callback, GASS.]

Resource Selector I

Requirements (LDAP filter):

    (&(Mds-Computer-isa=sparc) (Mds-Memory-Ram-free256))

- Static information (OS, architecture)
- User-provided

Rank expression:

    FQDN          stage-rm    exec-rm         rank
    ursa.ucm.es   jobmanager  jobmanager-sge  50
    draco.ucm.es  jobmanager  jobmanager      25

- User-provided executable
- Characterizes the discovered hosts

Steps performed for the Dispatch Manager:
- Discovery: Globus Monitoring and Discovery Service (MDS), filtered LDAP search
- Authorization test
- Selection: LDAP queries to the GRIS for dynamic information (CPU load, ...)

[Figure: the diagram shows three information-service nodes labeled GRIIS.]
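The discovery step described above (match static requirements, then apply an authorization test) can be illustrated with a small in-memory filter. This is only a sketch under stated assumptions: it does not query MDS or speak LDAP, it handles equality requirements only, and the Host class, the discover function, the example.org host names and the attribute values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    fqdn: str
    attributes: dict   # static MDS-style attributes (hypothetical values)
    authorized: bool   # outcome of the authorization test (hypothetical)


def discover(hosts, requirements):
    """Keep hosts that match every required attribute and pass the authorization test."""
    return [
        host for host in hosts
        if all(host.attributes.get(key) == value for key, value in requirements.items())
        and host.authorized
    ]


# Hypothetical host pool, for illustration only.
hosts = [
    Host("a.example.org", {"Mds-Computer-isa": "sparc"}, True),
    Host("b.example.org", {"Mds-Computer-isa": "i686"}, True),
    Host("c.example.org", {"Mds-Computer-isa": "sparc"}, False),
]

candidates = discover(hosts, {"Mds-Computer-isa": "sparc"})
print([host.fqdn for host in candidates])
```

Only the hosts that survive this filter would then be queried for dynamic information and ranked.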

Resource Selector II

The rank is the estimated execution time (lowest is best):

$\mathrm{Rank} = T_{exe}(h_n, h_n) = T_{cpu}(h_n, h_n) + T_{xfer}(h_n, h_n)$

Estimated computational time, based on:
- Computational work already performed
- Dynamic performance of the host

Estimated file transfer time between:
- Client host and candidate execution host: job submission and monitoring, file staging (executable, input/output files)
- File server and candidate execution host: input/output files
- Candidate execution host and current execution host: restart files

Adaptive Job Execution

Job adaptation is achieved by automatic job migration when:
- A new, better resource is discovered (opportunistic migration)
- The remote host or its network connection fails
- The job is cancelled or suspended
- A performance degradation is detected
- The requirements of the application change (self-migration)

Migration gain (opportunistic migration and performance degradation), which is compared against a user threshold:

$G_m = \frac{\mathrm{Rank}(h_{n-1}, t_{n-1}) - \mathrm{Rank}(h_n, t_n)}{\mathrm{Rank}(h_{n-1}, t_{n-1})}$
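The rank and migration-gain definitions above reduce to a few lines of arithmetic. The sketch below is illustrative rather than GridWay's implementation: the function names and all timing values are hypothetical, and it assumes that migration is triggered when the gain is strictly greater than the user threshold, since the slides do not state the comparison operator.

```python
def estimated_execution_time(t_cpu, t_xfer):
    """Rank of a host: estimated CPU time plus estimated file transfer time."""
    return t_cpu + t_xfer


def select_host(estimates):
    """Return (host, rank) for the host with the lowest rank.

    estimates maps a host name to a (t_cpu, t_xfer) pair in seconds.
    """
    ranks = {host: estimated_execution_time(*pair) for host, pair in estimates.items()}
    best = min(ranks, key=ranks.get)
    return best, ranks[best]


def migration_gain(current_rank, candidate_rank):
    """Relative reduction in estimated execution time obtained by migrating."""
    return (current_rank - candidate_rank) / current_rank


def should_migrate(current_rank, candidate_rank, threshold):
    # Assumption: migrate only when the gain strictly exceeds the user threshold.
    return migration_gain(current_rank, candidate_rank) > threshold


# Hypothetical estimates, in seconds.
current_rank = 400.0
candidates = {"hostA": (250.0, 50.0), "hostB": (300.0, 90.0)}

best_host, best_rank = select_host(candidates)
print(best_host, best_rank, migration_gain(current_rank, best_rank))
print(should_migrate(current_rank, best_rank, threshold=0.2))
```

Because the transfer time is part of the rank, a faster host can still lose to the current one when restart files are expensive to move.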

Example: Opportunistic Migration

Experimental testbed:

    Host     Model               Speed    OS         Memory  Network
    ursa     Sun Blade 100       500 MHz  Solaris 8  256 MB  LAN
    draco    Sun Ultra 1         167 MHz  Solaris 8  128 MB  LAN
    columba  Intel Pentium MMX   233 MHz  Linux 2.4  160 MB  LAN
    cepheus  Intel Pentium Pro   200 MHz  Linux 2.4  64 MB   LAN
    solea    Sun Enterprise 250  296 MHz  Solaris 8  256 MB  MAN

Host roles marked on the slide: client host, execution host, file server.

Experiment:
- A CFD code solves the 3D Navier-Stokes equations using an iterative multigrid method.
- The application is initially submitted to draco.
- The job is rescheduled when columba and solea become available at different iterations of the application running on draco.

Job template:

    #Job template
    EXECUTABLE = NS3D.$GW_ARCH
    ARGUMENTS = input
    INPUT= gsftp://cepheus/mesh input
    OUTPUT=profile
    STDOUT=stdout.$GW_JOB_ID
    RESTART_FILE=checkpoint
    REQUIREMENTS=host_req.ldif
    RANK=rank.sh

Example: Opportunistic Migration (results)

[Figure: dynamic ranks of solea and columba at different execution points, with three points marked 1, 2 and 3.]

[Figure: measured execution profile of the application when it is actually migrated at different iterations.]

1. Migration to solea is profitable until iteration 2 is reached.
2. From the fourth iteration, the best host is columba (nearest).
3. From the fifth iteration, the performance gain is not high enough to compensate for the file transfer overhead.
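The job template in the example above is a list of KEY = VALUE settings with $-prefixed variables. A minimal sketch of reading such a template follows. It is not GridWay's parser: the function names parse_template and expand are hypothetical, the substitution values for GW_ARCH and GW_JOB_ID are made up, and how GridWay actually expands these variables is an assumption.

```python
from string import Template

# Template text copied from the example slide.
JOB_TEMPLATE = """\
#Job template
EXECUTABLE = NS3D.$GW_ARCH
ARGUMENTS = input
INPUT= gsftp://cepheus/mesh input
OUTPUT=profile
STDOUT=stdout.$GW_JOB_ID
RESTART_FILE=checkpoint
REQUIREMENTS=host_req.ldif
RANK=rank.sh
"""


def parse_template(text):
    """Read KEY = VALUE lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


def expand(value, variables):
    """Replace $NAME placeholders; unknown names are left untouched."""
    return Template(value).safe_substitute(variables)


settings = parse_template(JOB_TEMPLATE)
# Hypothetical values for the two variables used in the template.
variables = {"GW_ARCH": "sparc", "GW_JOB_ID": "7"}
print(expand(settings["EXECUTABLE"], variables))
print(expand(settings["STDOUT"], variables))
```

Expanding the architecture variable at dispatch time is what would let a single template name a different executable on each platform in the testbed.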

Summary

Related work:
- Job management within the same administration domain: Condor, Load Sharing Facility (LSF), Sun Grid Engine, Portable Batch System (PBS)
- Job management for interconnection of multiple domains: Sun Grid Engine Enterprise Edition, Condor Flocking
- Globus middleware for job management: Condor/G, AppLeS, Nimrod/G
- Job adaptation: Cactus Worm, GrADS, GridWay