Overview Job Management OpenVZ Conclusions. XtreemOS. Surbhi Chitre. IRISA, Rennes, France. July 7, Surbhi Chitre XtreemOS 1 / 55

Similar documents
Linux-CR: Transparent Application Checkpoint-Restart in Linux

Toward An Integrated Cluster File System

OS Security III: Sandbox and SFI

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

Making Applications Mobile

Linux-CR: Transparent Application Checkpoint-Restart in Linux

Process Migration via Remote Fork: a Viable Programming Model? Branden J. Moor! cse 598z: Distributed Systems December 02, 2004

DMTCP: Fixing the Single Point of Failure of the ROS Master

An introduction to checkpointing. for scientific applications

Operating system security models

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Eclipse-Based CodeWarrior Debugger

Large Scale Sky Computing Applications with Nimbus

VCS-276.exam. Number: VCS-276 Passing Score: 800 Time Limit: 120 min File Version: VCS-276

Table of Contents 1.1. Introduction. Overview of vsphere Integrated Containers 1.2

CS370 Operating Systems

an Object-Based File System for Large-Scale Federated IT Infrastructures

Web Self Service Administrator Guide. Version 1.1.2

Kerberos & HPC Batch systems. Matthieu Hautreux (CEA/DAM/DIF)

An introduction to checkpointing. for scientifc applications

Cycle Sharing Systems

[VMICMV6.5]: VMware vsphere: Install, Configure, Manage [V6.5]

VMware vsphere: Install, Configure, Manage plus Optimize and Scale- V 6.5. VMware vsphere 6.5 VMware vcenter 6.5 VMware ESXi 6.

Sky Computing on FutureGrid and Grid 5000 with Nimbus. Pierre Riteau Université de Rennes 1, IRISA INRIA Rennes Bretagne Atlantique Rennes, France

Relational Database Service. User Guide. Issue 05 Date

Outline. Cgroup hierarchies

UK LUG 10 th July Lustre at Exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc.

Operating systems Architecture

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2

Understanding StoRM: from introduction to internals

Migration. 22 AUG 2017 VMware Validated Design 4.1 VMware Validated Design for Software-Defined Data Center 4.1

Table of Contents 1.1. Overview. Containers, Docker, Registries vsphere Integrated Containers Engine

VMware vsphere 6.5: Install, Configure, Manage (5 Days)

Scale-Out backups with Bareos and Gluster. Niels de Vos Gluster co-maintainer Red Hat Storage Developer

Announcements Processes: Part II. Operating Systems. Autumn CS4023

Smart Grid Embedded Cyber Security: Ensuring Security While Promoting Interoperability

3.1 Introduction. Computers perform operations concurrently

P.Buncic, A.Peters, P.Saiz for the ALICE Collaboration

Originally prepared by Lehigh graduate Greg Bosch; last modified April 2016 by B. Davison

EGEE and Interoperation

Tanium IaaS Cloud Solution Deployment Guide for Microsoft Azure

Outline. Cgroup hierarchies

CXS Citrix XenServer 6.0 Administration

Managing Switch Stacks

Advanced Memory Management

Amazon Web Services Training. Training Topics:

vsphere Networking Update 1 ESXi 5.1 vcenter Server 5.1 vsphere 5.1 EN

Access Control. CMPSC Spring 2012 Introduction Computer and Network Security Professor Jaeger.

2/14/2012. Using a layered approach, the operating system is divided into N levels or layers. Also view as a stack of services

CA IT Client Manager / CA Unicenter Desktop and Server Management

Rootless Containers with runc. Aleksa Sarai Software Engineer

OS security mechanisms:

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content

Fast Forward I/O & Storage

Multiprocessors 2007/2008

Amazon Web Services (AWS) Training Course Content

Upgrading to Parallels Virtuozzo Containers 4.6

SMD149 - Operating Systems

Process Time. Steven M. Bellovin January 25,

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.

Fedora 12 Essentials

ElasterStack 3.2 User Administration Guide - Advanced Zone

Diplomado Certificación

Cloud Computing /AWS Course Content

Data Protection Guide

Data Protection Guide

CS370 Operating Systems

Course CXS-203 Citrix XenServer 6.0 Administration

Bitnami MySQL for Huawei Enterprise Cloud

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Flicker: An Execution Infrastructure for TCB Minimization

1 Virtualization Recap

Distributed File Storage in Multi-Tenant Clouds using CephFS

MIGSOCK A Migratable TCP Socket in Linux

XtreemFS a case for object-based storage in Grid data management. Jan Stender, Zuse Institute Berlin

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Confinement (Running Untrusted Programs)

Chapter 2: System Structures. Operating System Concepts 9 th Edition

Exam Name: IBM Certified System Administrator - WebSphere Application Server Network Deployment V7.0

vsphere Installation and Setup Update 2 Modified on 10 JULY 2018 VMware vsphere 6.5 VMware ESXi 6.5 vcenter Server 6.5

Lecture 1 Introduction (Chapter 1 of Textbook)

RESOURCE MANAGEMENT MICHAEL ROITZSCH

Application of Virtualization Technologies & CernVM. Benedikt Hegner CERN

State of the Linux Kernel

SnapCenter Software 4.0 Concepts Guide

CSE 565 Computer Security Fall 2018

Data Security and Privacy. Unix Discretionary Access Control

Process Concepts. CSC400 - Operating Systems. 3. Process Concepts. J. Sumey

Detail the learning environment, remote access labs and course timings

Transparent TCP Recovery

PROCESS CONTROL BLOCK TWO-STATE MODEL (CONT D)

Project Loom Ron Pressler, Alan Bateman June 2018

RESOURCE MANAGEMENT MICHAEL ROITZSCH

Accessing the Console and Re-Installing your VPS Manually

VMware Mirage Getting Started Guide

Azure Sphere: Fitting Linux Security in 4 MiB of RAM. Ryan Fairfax Principal Software Engineering Lead Microsoft

Allowing Users to Run Services at the OLCF with Kubernetes

HPE VMware ESXi and vsphere 5.x, 6.x and Updates Getting Started Guide

CO Oracle WebLogic Server 12c. Administration II. Summary. Introduction. Prerequisites. Target Audience. Course Content.

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

Transcription:

XtreemOS Surbhi Chitre IRISA, Rennes, France July 7, 2009 Surbhi Chitre XtreemOS 1 / 55

Surbhi Chitre XtreemOS 2 / 55

Outline What is XtreemOS What features does it provide in XtreemOS How is it new and different Conclusion Surbhi Chitre XtreemOS 3 / 55

Discussion Path 1 2 3 4 Surbhi Chitre XtreemOS 4 / 55

Discussion Path VO Management Security Entity Management 1 2 3 4 Surbhi Chitre XtreemOS 5 / 55

Grids VO Management Security Entity Management Surbhi Chitre XtreemOS 6 / 55

VO Management Security Entity Management Linux based Operating System for the next generation grids Installation CD - mobile, PC, cluster flavor Distributed Resource Abstraction Secure Resource Sharing Scalable - encorporates millions of nodes, users High Availability - replicated services Legacy applications executed Execute applications like./application Job Monitoring, isolation, fault tolerance provided Surbhi Chitre XtreemOS 7 / 55

Types of Actors VO Management Security Entity Management 1 User Offload huge computations to grid Security Monitoring 2 Administrator or Owner of Resource Non trusted users should not be allowed Node should not be attacked 3 Application Developer Easy to develop applications with no modification Surbhi Chitre XtreemOS 8 / 55

VO Management VO Management Security Entity Management Surbhi Chitre XtreemOS 9 / 55

VO Management VO Management Security Entity Management Requirements VOs have lifespan Resource sharing on demand Users, Resources - freely join / leave VO, members of multiple VO VO user account different from local account VO-level Policy - can a user access a VO resource Node Policy - can a VO user access this resource Surbhi Chitre XtreemOS 10 / 55

VO Management VO Management Security Entity Management XtreemOS features Natively supported Confidentiality, Integrity, Authenticity provided Manages VO lifecycle Manages users, resources VO credentials, distribution Enforces VO and node policies upon resource usage Authenticate and provide access control to local nodes Surbhi Chitre XtreemOS 11 / 55

Security VO Management Security Entity Management Single Sign On Pluggable Authentication Module Name service switch and key retention - Session Management Grid/VO users are never local users Surbhi Chitre XtreemOS 12 / 55

Single Sign On VO Management Security Entity Management Surbhi Chitre XtreemOS 13 / 55

Single Sign On VO Management Security Entity Management Surbhi Chitre XtreemOS 14 / 55

Single Sign On VO Management Security Entity Management Surbhi Chitre XtreemOS 15 / 55

VO Management Security Entity Management Entities: Users, Resources and Services Resources are discovered - advanced p2p techniques Services are decentralised and replicated User information is stored in a DHT. Surbhi Chitre XtreemOS 16 / 55

Data Management - XtreemFS VO Management Security Entity Management Distributed, spanning the grid Posix like Self replicating Transactional consistency Grid User has a corresponding fs space Automatically mounted when a user executes a job Surbhi Chitre XtreemOS 17 / 55

VO Management Security Entity Management Offload execution of heavy jobs in the outside grids! Execute jobs securely Have control over your jobs When it fails because of some problem outside the job, restart it from a last know point Debug a job if it fails from a last know point Store the output securely. Surbhi Chitre XtreemOS 18 / 55

Discussion Path Job Submission 1 2 3 4 Surbhi Chitre XtreemOS 19 / 55

Job Submission Surbhi Chitre XtreemOS 20 / 55

Job Submission Surbhi Chitre XtreemOS 21 / 55

Job Submission Surbhi Chitre XtreemOS 22 / 55

Job Submission Surbhi Chitre XtreemOS 23 / 55

Why checkpoint Job Submission Grid Node stops participation, restart job on some other node Debugging Recover job from some last known state Checkpointing - restart used Surbhi Chitre XtreemOS 24 / 55

Job Submission Surbhi Chitre XtreemOS 25 / 55

Multiple checkpointers Job Submission Surbhi Chitre XtreemOS 26 / 55

Discussion Path - Introduction Requirements Solution 1 2 3 4 Surbhi Chitre XtreemOS 27 / 55

- Introduction Requirements Solution Operating system evolution. Virtual Private Server Containers - root access, separate filesystem, process tree, network stacks, IPC objects and other resources Resources - limits and gurantees. Containers - checkpointed, restarted, migrated. On migration - network connections can be resumed. Surbhi Chitre XtreemOS 28 / 55

How do you get? - Introduction Requirements Solution - patch to Linux kernel Compile and install the kernel Configure grub Reboot in this kernel Separately Download Userspace utilities vzctl - managing containers. vzpkg - templates vzquota - quotas Surbhi Chitre XtreemOS 29 / 55

On the surface - kernel? - Introduction Requirements Solution Root container - id always 0 All the processes now start in a root container A new container is a child of this container. Surbhi Chitre XtreemOS 30 / 55

Create a container - Example - Introduction Requirements Solution Template - root filesystem and default programs to run in the container vzctl - userpace utility for creating, starting, executing, stopping and deleting containers. vzctl create 101 ostemplate <template name> vzctl set 101 ipadd 192.168.10.10 nameserver 192.168.10.2 save vzctl start vzctl set 101 userpasswd root:test vzctl exec 101 /etc/init.d/ssh start ssh 192.168.10.1 Surbhi Chitre XtreemOS 31 / 55

- Introduction Requirements Solution What happens when you start a container? A new VPS created. Virtual process id and real process id User space applications - see only virtual process id. Processes in the container can be seen from outside (ps/pstree) Surbhi Chitre XtreemOS 32 / 55

process hierarchy - Introduction Requirements Solution Surbhi Chitre XtreemOS 33 / 55

Checkpoint and Restart - Introduction Requirements Solution vzctl chkpnt 101 dumpfile dump vzctl restore 101 dumpfile dump A process runnning inside a container can be multiprocess, have IPC, have threads, access files etc, it will still be checkpointed. As long as the container is restarted immediately, the open network connections can be saved. When you restart after a long time, the connection can be lost Surbhi Chitre XtreemOS 34 / 55

Requirement - Introduction Requirements Solution Track a job execution Submit a job to a container through the Application execution manager of XtreemOS. Checkpoint, restart and migrate a job through the Checkpoint Restart Manager of XtreemOS Surbhi Chitre XtreemOS 35 / 55

Job Submission - Introduction Requirements Solution Job should be submitted to a container than to a native kernel. Identify that the job needs. Done through jsdl tag addition A job can contain job units. Job units of a same job should be submitted to a same container. A new container should be created. Surbhi Chitre XtreemOS 36 / 55

- Introduction Requirements Solution Job Submission - no foreign dependencies Surbhi Chitre XtreemOS 37 / 55

Job Submission - Introduction Requirements Solution Containers can be accessed using root access only! XtreemFS dir should be accessed by the corresponding user. Job submission with no foreign process dependency - does not allow checkpointing. loader application shall launch a grid job. loader shall setgid and setuid appropriately. loader application shall help in job tracking Stage the job executables and the dependencies to the container Surbhi Chitre XtreemOS 38 / 55

- Introduction Requirements Solution Simple solution - socket communication Surbhi Chitre XtreemOS 39 / 55

Job tracking - Introduction Requirements Solution Cleanup needed after the job has finished. Eg: XtreemFS unmount, container cleanup etc. No wait() or waitpid() from a process in the root container. Container cannot suspend/die automatically when the job has terminated. Cannot use kernel connectors. server process should exit. container should be suspended for future restart Job is now alive as long as the container is alive and running - checkpointing based on this. Surbhi Chitre XtreemOS 40 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 41 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 42 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 43 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 44 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 45 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 46 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 47 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 48 / 55

Job submission - Introduction Requirements Solution Surbhi Chitre XtreemOS 49 / 55

Checkpoint, Restart - Introduction Requirements Solution Checkpoint Given a jobid a container id should be identified You checkpoint this container which executes the job If a container is not running, do not checkpoint Restart Given a jobid the dump file should be located Given a jobid the container id should be identified A server should be restarted on the same port If a container is already running, do not restart After job finishes execution, suspend the container - for future restart Surbhi Chitre XtreemOS 50 / 55

To sum it - Introduction Requirements Solution Integration Foreign process dependency should be avoided at job submission for enabling checkpointing. Job tracking is important for job monitoring and cleanup of job integration lets an XtreemOS be submitted to a container. The job can be checkpointed and restarted to any node, once networking is set up. Surbhi Chitre XtreemOS 51 / 55

Discussion Path 1 2 3 4 Surbhi Chitre XtreemOS 52 / 55

New Features Not a middleware but a O.S Support for interactive applications shall come soon All required grid related components in one CD execute legacy application the legacy way -./elf-executable Surbhi Chitre XtreemOS 53 / 55

XtreemOS is not a Middleware but a O.S Provides true abstraction of grid resources Provides VO management and data management Provides scalability, security, fault tolerance, high availability Provides possibility of adding different security mechanisms Provides job monitoring, isolation, fault tolerance, debugging. Provides possibility of adding different checkpointers Easy to use, adminster, legacy application support Good solution for grids Surbhi Chitre XtreemOS 54 / 55

Thank You! Surbhi Chitre XtreemOS 55 / 55