CREAM-WMS Integration

Similar documents
Architecture of the WMS

A Practical Approach for a Workflow Management System

Advanced Job Submission on the Grid

Grid Compute Resources and Grid Job Management

Job submission and management through web services: the experience with the CREAM service

Problemi di schedulazione distribuita su Grid

ISTITUTO NAZIONALE DI FISICA NUCLEARE Sezione di Padova

LCG-2 and glite Architecture and components

Condor-G: HTCondor for grid submission. Jaime Frey (UW-Madison), Jeff Dost (UCSD)

Gergely Sipos MTA SZTAKI

Parallel Computing in EGI

Grid Compute Resources and Job Management

CS 167 Final Exam Solutions

WMS Application Program Interface: How to integrate them in your code

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

DIRAC pilot framework and the DIRAC Workload Management System

Troubleshooting Grid authentication from the client side

DataGrid D EFINITION OF ARCHITECTURE, TECHNICAL PLAN AND EVALUATION CRITERIA FOR SCHEDULING, RESOURCE MANAGEMENT, SECURITY AND JOB DESCRIPTION

DataGrid. Document identifier: Date: 24/11/2003. Work package: Partner: Document status. Deliverable identifier:

Minutes of WP1 meeting Milan (20 th and 21 st March 2001).

WMS overview and Proposal for Job Status

BOSCO Architecture. Derek Weitzel University of Nebraska Lincoln

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

Overview of WMS/LB API

DataGrid TECHNICAL PLAN AND EVALUATION CRITERIA FOR THE RESOURCE CO- ALLOCATION FRAMEWORK AND MECHANISMS FOR PARALLEL JOB PARTITIONING

International Collaboration to Extend and Advance Grid Education. glite WMS Workload Management System

Performing Administrative Tasks

PeerSim: A Brief Introduction

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.

Cluster Nazionale CSN4

The glite middleware. Presented by John White EGEE-II JRA1 Dep. Manager On behalf of JRA1 Enabling Grids for E-sciencE

Workload Management. Stefano Lacaprara. CMS Physics Week, FNAL, 12/16 April Department of Physics INFN and University of Padova

Distributed Systems. The main method of distributed object communication is with remote method invocation

GROWL Scripts and Web Services

EUROPEAN MIDDLEWARE INITIATIVE

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Introduction to Personal Computers Using Windows 8 Course 01 - Getting to Know PCs and the Windows 8 User Interface

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2

What s new in HTCondor? What s coming? HTCondor Week 2018 Madison, WI -- May 22, 2018

Overview Job Management OpenVZ Conclusions. XtreemOS. Surbhi Chitre. IRISA, Rennes, France. July 7, Surbhi Chitre XtreemOS 1 / 55

Distributed production managers meeting. Armando Fella on behalf of Italian distributed computing group

3D Input Devices for the GRAPECluster Project Independent Study Report By Andrew Bak. Sponsored by: Hans-Peter Bischof Department of Computer Science

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Interconnect EGEE and CNGRID e-infrastructures

MPLAB Harmony Compatibility Worksheet

AutoPyFactory: A Scalable Flexible Pilot Factory Implementation

XRAY Grid TO BE OR NOT TO BE?

What s new in HTCondor? What s coming? European HTCondor Workshop June 8, 2017

Interfacing HTCondor-CE with OpenStack: technical questions

glite Grid Services Overview

Call Back supports Suspend/Resume CallBack notification for both intracluster and intercluster QSIG

Chapter 1: Distributed Information Systems

Factory Ops Site Debugging

DataGrid. Logging and Bookkeeping Service for the DataGrid. Date: October 8, Workpackage: Document status: Deliverable identifier:

Using LSF with Condor Checkpointing

Virtualization. A very short summary by Owen Synge

IRIX Resource Management Plans & Status

Chapter 3: Processes. Operating System Concepts 9 th Edit9on

Monitoring the Usage of the ZEUS Analysis Grid

Paxos provides a highly available, redundant log of events

ADP Reporting Skills Business Requirements ADP Pro User Conference

Multiple Broker Support by Grid Portals* Extended Abstract

DMTCP: Fixing the Single Point of Failure of the ROS Master

Asynchronous Interactions and Managing Modeless UI with Autodesk Revit API

Dynamic Workflows for Grid Applications

User Tools and Languages for Graph-based Grid Workflows

Using Devices with Microsoft HealthVault

Midterm Exam, October 24th, 2000 Tuesday, October 24th, Human-Computer Interaction IT 113, 2 credits First trimester, both modules 2000/2001

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute

The glite middleware. Ariel Garcia KIT

Kattis Team Guide version 1.1

Computational Steering

PEERNET Document Conversion Service 3.0

Navigating Cisco Prime Health and Utilization Monitor Tasks in LMS 4.1

Workload management at KEK/CRC -- status and plan

Troubleshooting Grid authentication from the client side

DHCP and DNS. Nick Copyright Conditions: GNU FDL (seehttp:// A computing department

Lesson 9 Transcript: Backup and Recovery

GRID COMPANION GUIDE

EUROPEAN MIDDLEWARE INITIATIVE

Integration of the guse/ws-pgrade and InSilicoLab portals with DIRAC

Give Your Site a Boost With memcached. Ben Ramsey

Trinity File System (TFS) Specification V0.8

Edinburgh (ECDF) Update

Cisco Unified CM Trace

The Google File System

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

The Actor Model, Part Two. CSCI 5828: Foundations of Software Engineering Lecture 18 10/23/2014

Windows Interrupts

CHEP 2013 October Amsterdam K De D Golubkov A Klimentov M Potekhin A Vaniachine

Exceptions in Java

DMTN-003: Description of v1.0 of the Alert Production Simulator

(Partition Map) Exchange - under the hood

HTCONDOR USER TUTORIAL. Greg Thain Center for High Throughput Computing University of Wisconsin Madison

Cluster Administration

Configuring Trace. Configuring Trace Parameters CHAPTER

Bookkeeping and submission tools prototype. L. Tomassetti on behalf of distributed computing group

Hortonworks Data Platform

EGEE and Interoperation

BPEL4Job: A Fault-Handling Design for Job Flow Management

Transcription:

-WMS Integration Luigi Zangrando (zangrando@pd.infn.it) Moreno Marzolla (marzolla@pd.infn.it) JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 1

Preliminaries We had a meeting in PD between Francesco G., Marco C. and the JRA1-PD team; in the meeting we discussed about -WMS integration We also had some preliminary discussions with the JRA1-CZ people about interaction with LB We now present the outcome of the meeting JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 2

should be invoked: Directly from the UI (direct job submission) Through the WMS usage scenario Submission through the WMS Direct Job Submission WMS JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 3

WMS- integration / 1 In analogy with JC+LM, a new component XYZ is needed for the interaction with (not found yet a fancy name) XYZ takes the job management requests from its filelist (the new mechanism proposed by Francesco G. is not yet available; in any case the migration should be straightforward) The WM chooses the filelist where to put jobs by reading the ce-id (e.g., grid005.pd.infn.it:8443/cream-lsf-grid02); not nice, but don't have a better solution at the moment XYZ invokes a Helper for formatting the JDL to be compliant (basically needed to rearrange the input/output sandbox parameters) XYZ manages the submission to (see next slides); XYZ keeps the mapping between the GridjobId and jobid Failed submissions are reinserted into the WM s filelist as in the current implementation XYZ FileList NS Helper WM WMProxy FileList FileList JC+LM Condor Helpers MM JA JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 4

WMS- Integration / 2 XYZ Implementation approach: NS WMProxy Requirement: XYZ should have the lowest possible impact on the current WMS architecture (less chance to avoid major changes to existing components); should be easily plugged inside the WMS Interim solution: XYZ is an external process which communicates with WM using a filelist In the long term we will explore the possibility to make XYZ an internal component (thread) of WM This clearly requires a welldefined API in order not to couple XYZ with WM; there should be an abstraction layer between WM and CE-specific support engines XYZ FileList Helper FileList FileList JC+LM Condor Helpers MM WM JA JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 5

A thread of XYZ receives notifications about the job status changes from CEMon closely coupled with CE As a fail-safe mechanism, another thread is needed to poll the status of all jobs still alive For each status change, XYZ must do basically what LM does now for job status changes WMS- Integration / 3 Job Status Changes XYZ Job Status Notification Handler Notifications sent by CEMon CEMon Periodic status polling JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 6

WMS- Integration / 4 Interaction with LB The existing LB states are ok also for this WMS- integration Since supports suspending jobs, a new SUSPENDED status would be useful Or can we have a SUSPENDED flag attached to SCHEDULED and RUNNING states is enough? (as suggested by Ales) XYZ will log to LB the same kind of events logged by JC+LM for non- CEs (e.g., Condor submissions) The Job Wrapper will log to LB as currently done XYZ Logging Component LB Event Notification JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 7

WMS- Integration / 5 Bulk Jobs At the moment they are handled by Condor DAGMan, which is not present in the proposed scenario CNAF is exploring a different implementation not using Condor but this will not be finalized in the short term Support for DAGs (which will imply support for parametric and collection jobs as well) will be implemented in Idea: allow submission of a whole DAG directly to the CE This is a requirement coming from many users To be studied when a whole DAG should be handled directly by To start, this could be specified by the user in the jdl JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 8

XYZ implementation issues The protocol between XYZ and must be reliable and well thought There are potential problems when software components or connections fail 1. XYZ crashes 2. crashes 3. XYZ link crashes (logically equivalent to 1 or 2) We have to ensure that jobs do not get lost in the system Zombie jobs We are thinking to implement a lease-based mechanism as suggested also by Francesco G. JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 9

Job Lease Standard mechanism commonly used in distributed systems Each job has an attribute (the lease) which is basically the time to live of the job When the lease expires, the job is removed from the system on both sides (WMS and ) Leases are renewed by XYZ as long as XYZ and can talk to each other JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 10

XYZ crashes Data from the WM comes via filelist; no problem here Fault Scenarios If the lease for a given job is over, jobs have been removed on When XYZ comes alive, it checks all jobs and renew the leases if necessary; jobs which have been removed by will be removed by XYZ as well crashes has a simple FS-based journalling mechanism which ensures that no jobs get lost When comes up again, it checks the leases and purges expired jobs; if the lease is long enough, has the possibility to be restarted in time We are thinking about a utility to renew the lease if the WMS is brought down, e.g., for maintenance JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 11

Job Lease XYZ crashes XYS Tries to renew lease Lease expires, job is removed from XYZ Cream restarts sees lease expired, purges jobs JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 12

Proxy Renewal When XYZ realizes that a job proxy has been renewed on the WMS node, XYZ must invoke the appropriate proxy renewal function on How does XYZ realize that a proxy has been renewed? Polling or something better? JRA1 IT-CZ Cluster Meeting, Rome, 12 13/9/2005 13