Building a Virtualized Desktop Grid. Eric Sedore

Similar documents
Paperspace. Architecture Overview. 20 Jay St. Suite 312 Brooklyn, NY Technical Whitepaper

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review

Veeam Cloud Connect. Version 8.0. Administrator Guide

SAFEGUARDING YOUR VIRTUALIZED RESOURCES ON THE CLOUD. May 2012

HPC learning using Cloud infrastructure

<Insert Picture Here> Virtualisierung mit Oracle VirtualBox und Oracle Solaris Containern

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Network+ Guide to Networks 6 th Edition

A Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis

Xen and CloudStack. Ewan Mellor. Director, Engineering, Open-source Cloud Platforms Citrix Systems

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet

Securing Containers Using a PNSC and a Cisco VSG

Designing the Stable Infrastructure for Kernel-based Virtual Machine using VPN-tunneled VNC

Securing Containers Using a PNSC and a Cisco VSG

Building a Big IaaS Cloud. David /

Nested Virtualization and Server Consolidation

Application of Virtualization Technologies & CernVM. Benedikt Hegner CERN

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. reserved. Insert Information Protection Policy Classification from Slide 8

ElasterStack 3.2 User Administration Guide - Advanced Zone

System Specification

Parallels EMEA Partner Roadshow Parallels Virtualization Portfolio how can I get my piece of the virtualization cake?

SECURING THE NEXT GENERATION DATA CENTER. Leslie K. Lambert Juniper Networks VP & Chief Information Security Officer July 18, 2011

TECHNICAL DESCRIPTION

ticrypt DEPLOYMENT OVERVIEW AND TIMELINE Information about hardware, deployment, and on-boarding

Experiences with OracleVM 3.3

StorageCraft OneBlox and Veeam 9.5 Expert Deployment Guide

Secure VFX in the Cloud. Microsoft Azure

*Performance and capacities are measured under ideal testing conditions using PAN-OS.0. Additionally, for VM

CLOUD COMPUTING IT0530. G.JEYA BHARATHI Asst.Prof.(O.G) Department of IT SRM University

Brocade Virtual Traffic Manager and Parallels Remote Application Server

StorageCraft OneXafe and Veeam 9.5

Microsoft Windows Embedded Server Overview

Virtualizing Oracle 11g/R2 RAC Database on Oracle VM: Methods/Tips

Aurora, RDS, or On-Prem, Which is right for you

Hyper-v Install Virtual Guest Services Greyed Out

70-414: Implementing an Advanced Server Infrastructure Course 01 - Creating the Virtualization Infrastructure

Virtual Machines. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

HySecure Quick Start Guide. HySecure 5.0

*Performance and capacities are measured under ideal testing conditions using PAN-OS 8.0. Additionally, for VM

ITRI Cloud OS: An End-to-End OpenStack Solution

Implementing the Twelve-Factor App Methodology for Developing Cloud- Native Applications

Huawei FusionCloud Desktop Solution 5.1 Resource Reuse Technical White Paper HUAWEI TECHNOLOGIES CO., LTD. Issue 01.

VNS3 Configuration. Quick Launch for first time VNS3 users in Azure

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Network Virtualisation Vision and Strategy_ (based on lesson learned) Telefónica Global CTO

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

Cisco 4000 Series Integrated Services Routers: Architecture for Branch-Office Agility

Virtualization of the MS Exchange Server Environment

WIND RIVER TITANIUM CLOUD FOR TELECOMMUNICATIONS

VMs at a Tier-1 site. EGEE 09, Sander Klous, Nikhef

Monitoring Remote Access VPN Services

VMware Horizon 7 Administration Training

Dell EMC Ready Architectures for VDI

vrealize Operations Management Pack for NSX for Multi-Hypervisor

Baremetal with Apache CloudStack

Special Topics: CSci 8980 Edge History

INNOV-4: Fun With Virtualization. Or, How I learned to love computers that don t really exist...

VMware Enterprise Desktop Solutions. Tommy Walker Enterprise Desktop Specialist Engineer Desktop Platform Solutions

Cloud Operations Using Microsoft Azure. Nikhil Shampur

Data Centers and Cloud Computing

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 8

EFOLDER SHADOWPROTECT CONTINUITY CLOUD GUIDE

Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

Dell EMC Ready System for VDI on VxRail

What s New with VMware vcloud Director 8.0

MyCloud Computing Business computing in the cloud, ready to go in minutes

OPS-9: Fun With Virtualization. John Harlow. John Harlow. About John Harlow

Tutorial 4: Condor. John Watt, National e-science Centre

Directions in Data Centre Virtualization and Management

VMware vshield Edge Design Guide

Virtualization Overview. Joel Jaeggli AFNOG SS-E 2013

Improve Existing Disaster Recovery Solutions with VMware NSX

Data Centers and Cloud Computing. Data Centers

CS 350 Winter 2011 Current Topics: Virtual Machines + Solid State Drives

25 Best Practice Tips for architecting Amazon VPC

VIRTUAL CENTRAL LOCK

Table of Contents 1.1. Introduction. Overview of vsphere Integrated Containers 1.2

Citrix CloudPlatform (powered by Apache CloudStack) Version 4.5 Concepts Guide

VNS3 Configuration. IaaS Private Cloud Deployments

Survey of ETSI NFV standardization documents BY ABHISHEK GUPTA FRIDAY GROUP MEETING FEBRUARY 26, 2016

What s New in VMware vsphere 5.1 Platform

Deploying Cloud Network Services Prime Network Services Controller (formerly VNMC)

View4.6 PCoIP Secure Gateway FAQs.

Create a pfsense router for your private lab network template

Private Cloud Database Consolidation Name, Title

<Insert Picture Here> Introducing Oracle WebLogic Server on Oracle Database Appliance

Virtualizing Oracle 11g/R2 RAC Database on Oracle VM: Methods/Tips

Azure Certification BootCamp for Exam (Developer)

Multiprocessor Scheduling. Multiprocessor Scheduling

VMworld disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no

Autonomic Condor Clouds. David Wolinsky ACIS P2P Group University of Florida

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

ROYAL INSTITUTE OF INFORMATION & MANAGEMENT

Outline. ASP 2012 Grid School

Securing VMware NSX-T J U N E 2018

Enterprise print management in VMware Horizon

On-demand provisioning of HEP compute resources on cloud sites and shared HPC centers

Minfy MS Workloads Use Case

Transcription:

Building a Virtualized Desktop Grid Eric Sedore essedore@syr.edu

Why create a desktop grid? One prong of an three pronged strategy to enhance research infrastructure on campus (physical hosting, HTC grid, private research cloud) Create a common, no cost (to them), resource pool for research community - especially beneficial for researchers with limited access to compute resources Attract faculty/researchers Leverage an existing resource Use as a seed to work toward critical mass in the research community

Goals Create Condor pool sizeable enough for significant computational work (initial success = 2000 concurrent cores) Create and deploy grid infrastructure rapidly (6 months) Secure and low impact enough to run on any machine on campus Create a adaptive research environment (virtualization) Simple for distributed desktop administrators to add computers to grid Automated methods for detecting/enabling Intel-VT (for hypervisor) Automated hypervisor deployment

Integration of Existing Components Condor VirtualBox Windows 7 (64 bit) TCL / FreeWrap Condor VM Catapult (glue) AD Group Policy Preference

Typical Challenges introducing the Grid (FUD) Security You want to use my computer? Where does my research data go? Technical Hypervisor / VM Management Scalability After you put the grid on my computer Governance Who gets access to my resources? How does the scheduling work?

Security

Security on the client Grid processes run as a non-privileged user Virtualization to abstract research environment / interaction VM s on the local drive are encrypted at all times (using certificate of non-privileged user) Local cached repository and when running in a slot Utilize Windows 7 encrypted file system Allows grid work on machines with end users as local administrators To-do create a signature to ensure researcher (and admins) that the VM started is approved and has not been modified (i.e. not modified to be a botnet)

Securing/Protecting the Infrastructure Create an isolated private 10.x.x.x. network via VPN tunnels (pfsense and OpenVPN) Limit bandwidth for each research VM to protect against a network DOS Research VM s NAT d on desktops Other standard protections Firewalls, ACL s

Public Network Research VM s ITS-SL6- LSCSOFT ITS-SL6- LSCSOFT ITS-SL6- LSCSOFT ITS-SL6- LSCSOFT ITS-SL6- LSCSOFT ITS-SL6- LSCSOFT OpenVPN End-Point (pfsense) / FW / Router 10.x.x.x network Bottleneck for higher bandwidth jobs Condor Infrastructure Roles Condor Submit Server

Technical

Condor VM Coordinator (CMVC) Condor s VM agent on the desktop Manage distribution of local virtual machine repository Manage encryption of virtual machines Runs as non-privileged user reduces adoption barriers Pseudo Scheduler Rudimentary logic for when to allow grid activity Windows specific is there a user logged in?

Why did you write CVMC? Runs as non-privileged user (and needs windows profile) Mistrust in a 3 rd party agent (condor client) on all campus desktops especially when turned over to the research community even with the strong sandbox controls in condor Utilizes built-in MS Task Scheduler for idle detection no processes running in user s context for activity detection VM repository management Encryption It seemed so simple when I started

Job Configuration Requirements = ( TARGET.vm_name == "its-u11- boinc-20120415" ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.HasFileTransfer ) ClassAd addition vm_name = "its-u11-boinc-20120415 CVMC Uses vm_name ClassAd to determine which VM to launch Jobs without vm_name can use running VM s (assuming the requirements match) but they won t startup new VM s

Condor Back -end Condor Queue Idle State Web Server Task Scheduler Slot 1 CVMC Slot 2 Win 7 Client VM Repo VirtualBox Slot

Technical Challenges Host resource starvation Leave memory for the host OS Memory controls on jobs (within Condor) Unique combination of approaches implementing Condor CVMC / Web service VM distribution Build custom VM s based on job needs vs. scavenging existing operating system configurations Hypervisor expects to have an interactive session environment (windows profile) Reinventing the wheel on occasion

How do you ensure low impact? When no one is logged in CVMC will allow grid load regardless of the time When a user is logged in CVMC will kill grid load at 7 AM and not allow it to run again until 5 PM (regardless if the machine is idle) Leave the OS memory (512MB-1GB) so it does not page out key OS components (using a simple memory allocation method) Do not cache VM disks will keep OS from filling its memory cache with VM I/O traffic

Keep OS from Caching VM I/O

Next Steps Grow the research community depth and diversity Increase pool size ~12,000 cores which are eligible Infrastructure Scalability Condor (tuning/sizing) Network / Storage (NFS Parrot / Chirp)

Solving the Data Transfer Problem Born from an unfinished side- project 7+ years ago. Goal: maximize the compute resources available to LIGO s search for gravitational waves More cycles == a better search. Problem: huge input data, impractical to move w/job. How to... Run on other LIGO Data Grid sites without a shared filesystem? Run on clusters outside the LIGO Data Grid lacking LIGO data? Tools to get the job done: ihope, GLUE, Pegasus, Condor Checkpointing, and Condor- C. People: Kayleigh Bohémier, Duncan Brown, Peter Couvares. Help from SU ITS, Pegasus Team, Condor Team

Idea: Cross- Pool Checkpoint Migration Condor_compiled (checkpointable) jobs. Jobs start on a LIGO pool with local data. Jobs read in data and pre- process. Jobs call checkpoint_and_exit(). Pegasus workflow treats checkpoint image as output, and provides it as input to a second Condor- C job. Condor- C job transfers and executes standalone checkpoint on remote pool, and transfers results back.

Devil in the Details Condor checkpoint_and_exit() caused the job to exit with SIGUSR2, so we needed to catch that and treat it as success. Standalone checkpoint images didn t like to restart in a different cwd, even if they shouldn t care, so we had to binary edit each checkpoint image to replace the hard- coded /path/to/cwd with.////////////! Will be fixed in Condor 7.8? Pegasus needed minor mods to support Condor- C grid jobs w/condor file transfer Fixed for next Pegasus release.

Working Solution Move jobs that do not require input files on the SUGAR cluster to the remote campus cluster.