Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB

Similar documents
GridDB: A Data-Centric Overlay for Scientific Grids

GridDB: A Data-Centric Overlay for Scientific Grids

Grid Computing Systems: A Survey and Taxonomy

Carelyn Campbell, Ben Blaiszik, Laura Bartolo. November 1, 2016

The Problem of Grid Scheduling

Scientific Workflows

Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey

Introduction to Grid Computing

Managing large-scale workflows with Pegasus

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Workflow, Planning and Performance Information, information, information Dr Andrew Stephen M c Gough

Techniques for Efficient Execution of Large-Scale Scientific Workflows in Distributed Environments

A High-Level Distributed Execution Framework for Scientific Workflows

An Introduction to Software Architecture. David Garlan & Mary Shaw 94

Virtual Data in CMS Analysis

Part II. Integration Use Cases

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

USING THE BUSINESS PROCESS EXECUTION LANGUAGE FOR MANAGING SCIENTIFIC PROCESSES. Anna Malinova, Snezhana Gocheva-Ilieva

Accelerating the Scientific Exploration Process with Kepler Scientific Workflow System

Integrating Existing Scientific Workflow Systems: The Kepler/Pegasus Example

Building blocks: Connectors: View concern stakeholder (1..*):

The Future of High Performance Computing

Database Assessment for PDMS

Automatic Generation of Workflow Provenance

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute

XSEDE High Throughput Computing Use Cases

Pegasus. Automate, recover, and debug scientific computations. Rafael Ferreira da Silva.

02 - Distributed Systems

3 Workflow Concept of WS-PGRADE/gUSE

Heterogeneous Workflows in Scientific Workflow Systems

Verteilte Systeme (Distributed Systems)

Pegasus. Automate, recover, and debug scientific computations. Mats Rynge

MICROSOFT BUSINESS INTELLIGENCE

IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK

L3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat

WORKFLOW ENGINE FOR CLOUDS

Grid Compute Resources and Job Management

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

Dac-Man: Data Change Management for Scientific Datasets on HPC systems

Introduction to K2View Fabric

02 - Distributed Systems

Systematic Cooperation in P2P Grids

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Oracle Data Integrator 12c: Integration and Administration

Wide Area Query Systems The Hydra of Databases

S1 Informatic Engineering

Pegasus. Pegasus Workflow Management System. Mats Rynge

Appendix A - Glossary(of OO software term s)

Knowledge Discovery Services and Tools on Grids

Service Mesh and Microservices Networking

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Kepler and Grid Systems -- Early Efforts --

Scientific Workflow Tools. Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego

Distributed System: Definition

2/4/2019 Week 3- A Sangmi Lee Pallickara

Schema-Agnostic Indexing with Azure Document DB

Managing Exploratory Workflows

A Federated Grid Environment with Replication Services

CAS 703 Software Design

Outline. Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems

DISTRIBUTED SYSTEMS. Second Edition. Andrew S. Tanenbaum Maarten Van Steen. Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON.

TIBCO Complex Event Processing Evaluation Guide

Web Services for Visualization

05 Indirect Communication

JBoss DNA. Randall Hauch Principal Software Engineer JBoss Data Services

Outline. Mate: A Tiny Virtual Machine for Sensor Networks Philip Levis and David Culler. Motivation. Applications. Mate.

Large Scale Computing Infrastructures

Oracle Financial Services Profitability Management Application Pack v Minor Release #3 ( )

Active Endpoints. ActiveVOS Platform Architecture Active Endpoints

Fast Track Model Based Design and Development with Oracle9i Designer. An Oracle White Paper August 2002

Qlik Sense Enterprise architecture and scalability

Oracle Financial Services Enterprise Modeling User Guide. Release July 2015

WAN-DDS A wide area data distribution capability

Alteryx Technical Overview

SDS: A Scalable Data Services System in Data Grid

Automating Real-time Seismic Analysis

Assignment 5. Georgia Koloniari

70-532: Developing Microsoft Azure Solutions

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

CAS 703 Software Design

Sphinx: A Scheduling Middleware for Data Intensive Applications on a Grid

What is a Web Service?

Apache Flink. Alessandro Margara

B. By not making any configuration changes because, by default, the adapter reads input files in ascending order of their lastmodifiedtime.

Distributed Systems Principles and Paradigms. Chapter 01: Introduction. Contents. Distributed System: Definition.

Developing Microsoft Azure Solutions (70-532) Syllabus

Mapping Abstract Complex Workflows onto Grid Environments

Grid-Based Data Mining and the KNOWLEDGE GRID Framework

European Component Oriented Architecture (ECOA ) Collaboration Programme: Architecture Specification Part 2: Definitions

Virtualization of Workflows for Data Intensive Computation

WGL A Workflow Generator Language and Utility

DATA MANAGEMENT SYSTEMS FOR SCIENTIFIC APPLICATIONS

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

An Introduction to Big Data Formats

Programming Knowledge Discovery Workflows in Service-Oriented Distributed Systems

Distributed Systems Principles and Paradigms. Chapter 01: Introduction

Transcription:

Grids and Workflows

Overview Scientific workflows and Grids Taxonomy Example systems Kepler revisited Data Grids Chimera GridDB 2

Workflows and Grids Given a set of workflow tasks and a set of resources, how do we map them to Grid resources? What are some other challenges? 3

Executing Scientific Workflows on Grids Grids can address many challenges of scientific workflow execution Scalability Detached execution Many systems have been developed to aid in design and execution of Grid workflows 4

Taxonomy Classifies 4 elements of workflow systems in context of Grid computing Workflow design Workflow scheduling Fault Tolerance Data Movement 5

Workflow Design Workflow structure indicates temporal relationship between tasks Can be Directed Acyclic Graph (DAG) or non-dag DAG-based Sequence (ordered series of tasks) Parallel (tasks that run concurrently) Choice (task executed at runtime if all conditions are true) Non-DAG Iteration (sections of workflow can be repeated) 6

Workflow Design Workflow Model/Specification defines workflow including task definition and structure definition Abstract model Workflow specified without referring to specific resources Concrete model Bind workflow tasks to specific resources Applications that use abstract can generate concrete model before or during execution 7

Workflow Design Workflow Composition System enables users to assemble components into workflows User-directed Users edit workflows directly Language-based (e.g., XML) Graph-based (e.g., Kepler) Automatic Generate workflows from higher-level requirements, e.g., data products, input values Difficult to capture functionality of components 8

Workflow Scheduling Scheduling architecture can be centralized, hierarchical, or decentralized Centralized- one central scheduler makes decisions for all tasks in a workflow Hierarchical- central manager assigns subworkflows to lower-level schedulers Decentralized- multiple schedulers that can communicate with each other and balance load Optimality/scalability tradeoff 9

Workflow Scheduling How to map workflows onto resources? Decisions can be based on current task or subworkflow (local) or entire workflow (global) Global decisions may produce better results, but high overhead 10

Workflow Scheduling How to translate abstract models to concrete models? Static concrete models generated before execution User directed or simulation based Dynamic make decisions at runtime Prediction-based or just in time 11

Workflow Scheduling Scheduling workflow applications in distributed system is NP-complete Use heuristics to match users Quality of Service constraints (deadline, budget) Performance-driven minimize overall execution time Market-driven minimize usage price Trust-driven- select resources based on trust properties (security, reputation, site vulnerability, etc) 12

Fault Tolerance Failures may occur for a variety of reasons: network failure, overloaded resource conditions, non-availability of components Failure handling: task-level and workflowlevel Task-level mask the effects of the failure Workflow-level manipulate workflow structure 13

Fault Tolerance Task level Retry Alternate resource Checkpoint/restart Replication Workflow level Alternate task Redundancy User-defined exception handling Rescue workflow 14

Intermediate Data Movement Input files of tasks need to be staged at remote site before processing tasks Output files may be required by child tasks processed on other resources User directed movement specified as part of workflow Automatic system does it automatically Approaches can be centralized, mediated, or peer-to-peer 15

Intermediate Data Movement Centralized Easy to implement Good when large-scale data flow not required Mediated Intermediate data managed by distributed data management system Good when want to keep data for later use Peer-to-Peer Good for large-scale data transfer But more difficulties to deployment 16

Some examples Kepler Taverna Triana GrADS Pegasus 17

Kepler Classification Structure non-dag Graph-based Centralized architecture Many user-defined features Scheduling Fault tolerance Data movement 18

Taverna Workflow management system of the mygrid project Workflow can be expressed either graphically (Kepler-like GUI) or XML-based language (SCUFL) Allows implicit iteration over incoming datasets Allows multithreading to speed up interation Good for services capable of simultaneous processing, e.g., those backed by a cluster 19

Triana Visual workflow-oriented data analysis environment Clients can log in to Triana Controlling Service (TCS) TCS can execute locally or distribute based on distribution policy Parallel no host-based communication Peer-to-peer intermediate data passed between hosts Resources dynamically allocated 20

GrADS Grid Application Development Software Application-level task scheduling Goal: minimize overall job completion time (makespan) performance driven Scheduler maps tasks to resources using heuristics Weighted sum of expected execution time on resource and expected cost of data movement Monitors performance of executing tasks and reschedules as needed 21

Pegasus Workflow manager in GriPhyN Maps abstract workflow to available Grid resources and generates executable workflow DAG structure Two methods for resource selection: Random allocation Performance prediction Intermediate data registered with replica service (mediated approach) 22

Summary and Challenges Many projects have graphical workflow modeling language Standardization needed Quality of Service (QoS) not well addressed QoS needed at both specification and execution level Market-driven strategies will become increasingly important Optimal schedule requires estimates of task execution time Analytical models (GrADS) or historical performance (Pegasus) Better fault tolerance needed 23

Executing Kepler on the Grid Many challenges to Grid workflows, including: Authentication Data movement Remote service execution Grid job submission Scheduling and resource management Fault tolerance Logging and provenance User interaction May be difficult for domain scientists 24

Example Grid Workflow Stage-execute-fetch: 1. Stage local files to remote server 2. Execute computational experiment on remote resource Local server 3. Fetch results back to local environment Remote server 25

Why not use a script? Script does not specify low-level task scheduling and communication May be platform-dependent Can t be easily reused 26

Some Kepler Grid Actors Copy copy files from one resource to another during execution Stage actor local to remote host Fetch actor - remote to local host Job execution actor submit and run a remote job Monitoring actor notify user of failures Service discovery actor import web services from a service repository or web site 27

Data Grids Chimera GridDB 28

Data Grids Communities collaboratively construct collections of derived data Flat files, relational tables, persistent object structures Relationships between data objects corresponding to computational procedures used to derive one from the other 29

Relationships among Programs,Computations, Data Data Created by Produced by Consumed by Programs Execution of Computations 30

Challenges I ve come across some interesting data, but I need to understand how it was constructed before I can trust it for my purposes. I want to search an astronomical database for galaxies with certain characteristics. If a program that does this exists, I won t need to write one from scratch. I want to apply an astronomical analysis program to millions of objects. If the program has already been run and the results stored, I ll save weeks of computation. I ve detected a calibration error in an instrument and want to know which derived data to recompute. 31

Virtual Data Track how data products are derived Ability to create and/or recreate products using this knowledge Virtual data management operations Re-materialize deleted data products Generate data products defined but not created Regenerate data when dependencies or programs change Create replicas at remote locations when cheaper than transfer 32

Chimera (Foster et al., 2002) (now GriPhyN VDS) Virtual data system Two main components: Virtual data catalog (VDC) Implements virtual data schema Virtual data language interpreter Implements tasks to call VDC operations Queries can return a representation of tasks that will generate a specified data product 33

Chimera Architecture Chimera Virtual Data Language (definition and query) VDL Interpreter (manipulate derivations and Transformations) Virtual Data Applications Task Graphs (compute and data movement tasks, with Dependencies) Data Grid Resources (distributed execution and data management) SQL Virtual Data Catalog (implements Chimera Virtual Data Schema) 34

Some definitions Transformation an executable program Derivation an execution of a transformation Data object named entity that may be consumed or produced by a derivation Logical file name Replica catalog maps logical name to physical location Data objects can also be relations or objects 35

Chimera Virtual Data Language TR t1 ( output a2, input a1, none env= 100000, none pa = 500 ) { app vanilla= /usr/bin/app3 ; app parg = -p ${none:pa}; app farg = -x y ; arg stdout = ${output:a2}; profile env.maxmem = ${none:env}; } t1 reads input file a1 and produces a2 app is application to run (/usr/bin/app3) args are default argument values stdout redirects output to a2 36

Chimera VDL DV t1 ( a2=@{output:run1.exp15.t1932.summary}, a1=@{input:run1.exp15.t1932.raw}, env = 20000, pa= 600 ); String after DV indicates transformation to be invoked (t1) Corresponding invocation: export MAXMEM=20000 /usr/bin/app3 p 600 \ -f run1.exp15.t1932.raw x y \ > run1.exp15.t1932.summary 37

Queries VDL implemented in SQL Queries allow one to search for transformations by name, application name, input LFN(s), output LFN(s), argument matches, or other metadata Query results indicate if desired transformations already exist in data grid Retrieve them if they do Create them if they do not 38

Example: SDSS Galactic Structure Detection Applied virtual data to locating galactic clusters in image collection Sky tiled into set of fields For each field, search for clusters in that field and some set of neighbors Use brightest cluster galaxy (BCG) and brightest red galaxy (BRG) to determine cluster candiates 39

SDSS Galactic Structure Detection 1. fieldprep extract required measurements from galaxies and produce files with this data (~40x smaller than original files) 2. brgsearch unweighted BCG likelihood for each galaxy 3. bcgsearch weighted BCG likelihood (most expensive step 4. bcgcoalesce determine whether a galaxy is most likely galaxy in the neighborhood 5. getcatalog remove extraneous data and store result in compact format 40

SDSS Galactic Structure Detection getcatalog is a function that can invoke the four prior dependent steps Generate virtual results for entire sky by defining one derivation of getcatalog for each field 41

Virtual Data Summary Performs bookkeeping to track large scale productions Can be thought of as paradigm for management of batch job production scripts or a makefile for data production Data production can be performed interactively in parallel by users Virtual data grid acts as a cache 42

Pegasus and Chimera Pegasus can construct an abstract workflow using Chimera Before mapping tasks to resources, Pegasus reduces abstract workflow by eliminating materialized data products Assumes more costly to reproduce dataset than to access existing results Pegasus can automatically generate a workflow using metadata description of desired data product using AI planning 43

GridDB (Liu and Franklin, 2004) Data-centric overlay for scientific grid data analysis Manage data entities rather than processes Idea: provide interactive database interface to Grid computing 44

GridDB: Background Assumptions: Scientific analysis programs can be abstracted as typed functions, and invocations as function calls While most scientific data is not relational, there is a subset with relational characteristics 45

Benefits of GridDB Declarative interface Type checking Interactive Query Processing Memoization support Data provenance Co-existence with process-centric middleware 46

High-Energy Physics Example Scientists want to replace a slow but trusted detector simulation with faster, less precise one To ensure soundness of new simulation, need to compare response of new and old simulation for various physics events 47

High Energy Physics Abstract Workflow <pmas> gen <pmas>.evts atlfast atlsim imas = x <pmas>.atlfast imas = y <pmas>.atlsim 48

Grid Invocation 101 200 101.atlfast 101.atlsim 200.atlfast 200.atlsim diff pmas 49

GridDB Modeling Principles Programs and workflows can be represented as functions An important subset of data in workflow can be represented as relations relational cover Represent inputs and outputs to workflows as relational tables 50

High Energy Physics Example grn Input gid pmas g00 101 g99 200 Output Output frn fid f00 f99 fimas 102 srn sid s00 s99 simas 100 51

GridDB Architecture GridDB client DML y x streaming tuples GridDB overlay Request Manager Query Processor Scheduler RDBMS (PostgresQL) data,catalog procs,specs,files Process-centric middleware Grid Resources 52

Basic actions Workflow setup create sandbox entitysets and connect as inputs/outputs Data procurement submission of inputs to workflow, triggering function evaluations to create output entities Automatic views for streaming partial results 53

Basic actions 1. grn:set(g); frn:set(f); srn:set(s); 2. (frn,srn) = simcomparemap(grn); 3. INSERT INTO grn VALUES pmas= {100,,200}; 4. SELECT * FROM autoview(grn, frn,srn); 54

Summary GridDB can leverage relational database functionality Provides interactive data-centric interface What are some challenges/limitations? 55