Remote Workflow Enactment using Docker and the Generic Execution Framework in EUDAT


Remote Workflow Enactment using Docker and the Generic Execution Framework in EUDAT Asela Rajapakse Max Planck Institute for Meteorology EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-infrastructures. Contract No. 654065 www.eudat.eu

The Team: Emanuel Dima (Uni Tuebingen), Pascal Dugenie (CINES), Weu Qiu (Uni Tuebingen), Asela Rajapakse (MPI-M), Luca Trani (KNMI) 2

Presentation outline: 1) A short overview of EUDAT, 2) The ideas behind the Generic Execution Framework (GEF), 3) The technology powering the GEF: Docker, 4) GEF services and a GEF instance, 5) An example of Docker containerization, 6) Further aspects of GEF development 3

A problem facing European Research Infrastructures The trend is towards proliferation and diversification of Research Infrastructures in Europe. All RIs are facing data management challenges: How to handle the growing amount of data? How to find data sets? How to make the most of the data? Stakeholders can benefit from pan-European solutions, and synergies can be exploited! 4

EUDAT's objective EUDAT (short for European Data) aims to fulfill the vision of the High Level Expert Group on Scientific Data 2 and to provide an integrated, cost-effective, and sustainable pan-European solution for sharing, preserving, accessing, and performing computations with primary and secondary research data. (EUDAT2020 Description of Work) 2 High Level Expert Group on Scientific Data: Riding the wave: How Europe can gain from the rising tide of scientific data (2010), final report to the European Commission 5

EUDAT2020: 35 Partners

The EUDAT Service Suite Meant to solve the basic data service requirements of participating RIs. EUDAT2020 Service Suite Overview: https://www.eudat.eu/services 7

The EUDAT Service Suite Diagrams taken from the official EUDAT presentation. EUDAT2020 Service Suite Overview: https://www.eudat.eu/services 8

The EUDAT Service Suite Although the basic services are being taken up by communities, they are still under constant development. 9

EUDAT and the GEF The Generic Execution Framework started as the topic of a research activity in EUDAT1. GEF development is continuing in EUDAT2 and is conducted in Work Package 5 (Service Building) and Work Package 8 (Data Life Cycle across Communities), the latter necessitating research into possible GEF extensions. 10

The ideas behind the GEF The GEF is meant to address the following problem: With growing data sizes, it becomes increasingly costly and time-consuming to transfer data from their storage location to a data processing location. 11

The ideas behind the GEF Move the tools, not the data! Minimize data transfer cost when no heavy computation is required by containerizing the software tools required for data processing and moving them closer to the data. Reproduce results more easily by reusing containers! The same container and software tool templates can be used over and over ensuring the software environment and the processing tools remain the same. 12

The technology powering the GEF: Docker "Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run: code, […] system tools, system libraries." (Docker website) Docker containers are based on Docker images, which can be viewed as templates for container instances. Every Docker image is built on top of a base image, e.g. Ubuntu or BusyBox. We will look at an example later! 13

The technology powering the GEF: Docker Containers are run by the Docker Engine on a Linux system and share the host kernel. Versions for Mac OS and Windows are available now, but they too run a virtualized Linux system. If software can be installed on a Linux system, chances are good it can be containerized with Docker. We will look at an example later! 14

GEF services/docker containers The GEF relies on so-called GEF services that are customized by the user to perform the required tasks: Docker images specifically annotated to allow handling by the GEF. GEF service instances are Docker containers that are spun up for execution close to the data. User communities are solely responsible for the contents of their images. 15
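The talk does not show what these annotations look like. One plausible sketch is to carry GEF metadata in Docker image labels via the LABEL instruction; the label keys and values below are hypothetical illustrations, not the actual GEF annotation scheme.

```dockerfile
# Hypothetical GEF annotations expressed as Docker image labels;
# the actual label keys used by the GEF are not shown in this talk.
FROM ubuntu:14.04
LABEL "eudat.gef.service.name"="wfcatalog-collector" \
      "eudat.gef.service.input"="/mnt/input" \
      "eudat.gef.service.output"="/mnt/output"
```

Labels of this kind travel with the image, so a GEF instance could inspect them (e.g. via `docker inspect`) to decide how to wire up a container's inputs and outputs.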

GEF services/docker containers The schematic view of a GEF service instance running on a Dockerized host. 16

GEF services/docker containers Processing results from the container can be stored in a Docker Volume managed by the Docker Engine. Docker Volumes can be shared between containers and even be containerized to move them to a different node in a cluster. 18
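At the Docker CLI, sharing results through a named volume looks roughly like this (the volume, container, and image names are illustrative):

```
$ docker volume create results
$ docker run -v results:/output --name producer collector:1
$ docker run -v results:/output ubuntu ls /output
```

The second container sees whatever the first one wrote to /output, without the data ever leaving the Docker host.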

A GEF instance The container/gef service invocations on the hosts are controlled by a Docker Machine integrated with a GEF instance. 19

Workflow integration Users have two ways to communicate with a GEF instance: through a graphical user interface or programmatically through an HTTP API. 20
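As an illustration of what programmatic access might look like, a client could POST a job request to a GEF instance. The endpoint path ("/api/jobs"), the JSON field names, and the example PID below are assumptions for illustration, not the actual GEF API.

```python
import json

# Build an HTTP job-submission request for a GEF instance.
# The endpoint path and JSON field names are hypothetical
# illustrations, not the actual GEF API.
def build_job_request(base_url, service_image, input_pid):
    url = base_url.rstrip("/") + "/api/jobs"
    payload = json.dumps({
        "serviceImage": service_image,  # GEF service (Docker image) to run
        "inputPID": input_pid,          # persistent identifier of the input data
    })
    return url, payload

url, body = build_job_request("https://gef.example.org/", "collector:1",
                              "11304/abcd-1234")
```

The resulting request could then be sent with any HTTP client; the point is only that a workflow engine can drive the GEF without a human at the GUI.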

Reproducibility of processing results The capability to reproduce the results of a data processing job is of major concern to scientists: instantiating every run from a single GEF service image is meant to make this easier. In the future, using a central repository for all GEF service images will foster reuse of existing images. 21

The GEF image repository In the future, all GEF service images are to be stored in a central repository for trusted and tested images: This will be a dedicated Docker Hub repository. The GEF will only accept images from this repository. The service images will be immutable once stored in the repository (versions are possible). The repository is publicly accessible and all stored images are open to the peer-review process. 22
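Publishing a versioned service image to such a Docker Hub repository would follow the usual tag-and-push convention (the organization name below is illustrative):

```
$ docker tag collector:1 eudat/collector:1.0
$ docker push eudat/collector:1.0
```

Once pushed under a version tag, the image can be treated as immutable: a new version gets a new tag rather than overwriting the old one.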

The GEF image repository For now, during development, GEF service images are not yet public, and once the GEF becomes a full prototype, access to it will be restricted. 23

The Dockerization of a workflow Using a branch of ObsPy, a Python framework for processing seismological data, to extract metadata from an archive of waveform data: 1) Install the mseed-qc branch of ObsPy. 2) Configure the extraction through a JSON file. 3) Run the WFCatalogCollector.py script. Thanks to Luca Trani and Mathijs Koymans from KNMI (the Royal Netherlands Meteorological Institute) 24

The Dockerization of a workflow The bulk of the work is creating the image, most easily through a Dockerfile.

# Ubuntu 14.04 base image
FROM ubuntu:14.04
MAINTAINER Mathijs Koymans

# Create an app directory
RUN mkdir -p /usr/src/collector
WORKDIR /usr/src/collector

# Copy the application source to this directory
COPY . /usr/src/collector

# Get necessary software (e.g. git to fetch
# the branch, vim for interactive editing)
RUN apt-get update
RUN apt-get install -y git
RUN apt-get install -y libxml2-dev libxslt1-dev
RUN apt-get install -y python-pip
RUN apt-get install -y libfreetype6-dev
RUN apt-get install -y pkg-config
RUN apt-get install -y vim

The Dockerization of a workflow

# Numpy needs to be installed manually
# Also include PyMongo if the database is used
RUN pip install numpy
RUN pip install pymongo

# Fetch the branch from ltrani@github
RUN git clone https://github.com/ltrani/obspy.git

# Set the work directory to inside the fetched ObsPy source
WORKDIR /usr/src/collector/obspy

# Switch to and install the branch
RUN git checkout mseed-qc
RUN pip install -e .

# Set the workdir back to the main source
WORKDIR /usr/src/collector

The Docker image is named and built.

$ docker build -t collector:1 .

This takes care of installation. We now only need to run the container and start the script. 26

The Dockerization of a workflow The container is started with a /mnt directory linking it to the archive on the host.

$ docker run --name collector_container -v /path/to/archive:/mnt -d -t collector:1

We open an interactive terminal to the container and can modify the configuration file for our extraction.

$ docker exec -it collector_container bash
root@50e3f57e16e6:/# vi config.json

Similarly, the Python script inside the container is executed with the link to the archive.

$ docker exec collector_container bash -c "python WFCatalogCollector.py --dir /mnt/~" 27
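The config.json edited above could equally be generated outside the container and copied in. A minimal sketch of writing such a file from Python follows; the option names are hypothetical and may differ from the actual WFCatalogCollector schema.

```python
import json

# Hypothetical extraction settings; the real config.json schema of
# WFCatalogCollector may use different keys.
config = {
    "archive": "/mnt",   # directory holding the waveform archive
    "flags": True,       # include miniSEED quality flags
    "csegs": False,      # skip continuous-segment metadata
}

# Write the configuration the collector script would read
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Generating the file this way keeps the configuration under version control alongside the Dockerfile, instead of editing it interactively inside a running container.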

The Dockerization of a workflow This was merely meant to give you a basic idea of what Dockerization looks like. 28

Security through guidelines and trust Many people are still concerned about Docker security. When configured appropriately, Docker is reasonably secure. We will provide a set of guidelines and best practices to users to ensure that the Docker images they create do not pose security risks. These guidelines, together with the measure of trust associated with accredited EUDAT users, will guarantee an acceptable level of security. 29
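Typical hardening guidance for image authors can be sketched directly in a Dockerfile; this is an illustration of common practices, not the actual EUDAT guidelines.

```dockerfile
# Sketch of common hardening practices (not the official EUDAT guidelines)

# Pin the base image to a known version
FROM ubuntu:14.04

# Install only what the tool needs, and clean the apt cache
RUN apt-get update \
    && apt-get install -y --no-install-recommends python-pip \
    && rm -rf /var/lib/apt/lists/*

# Create an unprivileged user and drop root inside the container
RUN useradd --create-home collector
USER collector
```

Pinned bases keep builds reproducible, and running as a non-root user limits the damage a compromised service container can do on the host.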

The current status of the integration with existing services GEF service containers are to obtain their data through EUDAT services. For now, we are developing dedicated access containers to connect to B2SHARE and to B2DROP in our current implementation of a first use case, and we aim for full authorization and authentication through B2ACCESS. 30

Collaboration with Work Package 8 Results from the Joint Research Activity in Work Package 8 will feed into the GEF development as they appear along the way. We envision work on support for a provenance model, directives, dynamic data, and semantic annotation. 31

The next steps The next major goals: Set up a prototype instance that can be used for our use cases. Further address security concerns (repository and best practices for prospective users). 32

Come forward and scrutinize it! The GEF is being designed around community requirements. We are still looking for new requirements and use cases. Asela Rajapakse asela.rajapakse@mpimet.mpg.de 33

Thank you for your attention! 34

Questions? 35