Supporting GPUs in Docker Containers on Apache Mesos

Similar documents
Nvidia GPU Support on Mesos: Bridging Mesos Containerizer and Docker Containerizer

Using DC/OS for Continuous Delivery

S INSIDE NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORK CONTAINERS

POWERING THE INTERNET WITH APACHE MESOS

Scale your Docker containers with Mesos

Introduction to Mesos and the Datacenter Operating System

Onto Petaflops with Kubernetes

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

Mesosphere and the Enterprise: Run Your Applications on Apache Mesos. Steve Wong Open Source Engineer {code} by Dell

IBM Leading High Performance Computing and Deep Learning Technologies

Container Pods with Docker Compose in Apache Mesos

BUILDING A GPU-FOCUSED CI SOLUTION

Building a Data-Friendly Platform for a Data- Driven Future

Deployment Patterns using Docker and Chef

Orchestration Ownage: Exploiting Container-Centric Datacenter Platforms

Building/Running Distributed Systems with Apache Mesos

Launching StarlingX. The Journey to Drive Compute to the Edge Pilot Project Supported by the OpenStack

Seven Decision Points When Considering Containers

Deploying Applications on DC/OS

Containerizing GPU Applications with Docker for Scaling to the Cloud

AGILE DEVELOPMENT AND PAAS USING THE MESOSPHERE DCOS

Important DevOps Technologies (3+2+3days) for Deployment

WHITE PAPER. RedHat OpenShift Container Platform. Benefits: Abstract. 1.1 Introduction

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

Kubernetes: Integration vs Native Solution

Choosing the Right Container Infrastructure for Your Organization

Sunil Shah SECURE, FLEXIBLE CONTINUOUS DELIVERY PIPELINES WITH GITLAB AND DC/OS Mesosphere, Inc. All Rights Reserved.

OpenStack Magnum Pike and the CERN cloud. Spyros

The four forces of Cloud Native

Container Orchestration on Amazon Web Services. Arun

Managing and Protecting Persistent Volumes for Kubernetes. Xing Yang, Huawei and Jay Bryant, Lenovo

State of OpenShift on Bare Metal

Go Faster: Containers, Platforms and the Path to Better Software Development (Including Live Demo)

Managing Deep Learning Workflows

2016 Mesosphere, Inc. All Rights Reserved.

OpenStack Mitaka Release Overview

Distributed Data on Distributed Infrastructure. Claudius Weinberger & Kunal Kusoorkar, ArangoDB Jörg Schad, Mesosphere

@joerg_schad Nightmares of a Container Orchestration System

CS-580K/480K Advanced Topics in Cloud Computing. Container III

The Path to GPU as a Service in Kubernetes Renaud Gaubert Lead Kubernetes Engineer

MESOS A State-Of-The-Art Container Orchestrator Mesosphere, Inc. All Rights Reserved. 1

Lessons Learned: Deploying Microservices Software Product in Customer Environments Mark Galpin, Solution Architect, JFrog, Inc.

Bright Cluster Manager: Using the NVIDIA NGC Deep Learning Containers

Advanced Continuous Delivery Strategies for Containerized Applications Using DC/OS

How to Keep UP Through Digital Transformation with Next-Generation App Development

Red Hat Cloud Platforms with Dell EMC. Quentin Geldenhuys Emerging Technology Lead

Mesosphere and Percona Server for MongoDB. Peter Schwaller, Senior Director Server Eng. (Percona) Taco Scargo, Senior Solution Engineer (Mesosphere)

利用 Mesos 打造高延展性 Container 環境. Frank, Microsoft MTC

SCALING LIKE TWITTER WITH APACHE MESOS

MesosCon Qian Zhang (IBM China), Jie Yu (Mesosphere) OCI Support in Mesos Mesosphere, Inc. All Rights Reserved. 1

Running MarkLogic in Containers (Both Docker and Kubernetes)

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

CONTAINERS AND MICROSERVICES WITH CONTRAIL

Secure Kubernetes Container Workloads

MANAGING MESOS, DOCKER, AND CHRONOS WITH PUPPET

TEN LAYERS OF CONTAINER SECURITY

Sharing High-Performance Devices Across Multiple Virtual Machines

Migration and Building of Data Centers in IBM SoftLayer

Revolutionizing Open. Cecilia Carniel IBM Power Systems Scale Out sales

An Introduction to Kubernetes

Distributed CI: Scaling Jenkins on Mesos and Marathon. Roger Ignazio Puppet Labs, Inc. MesosCon 2015 Seattle, WA

High Performance Containers. Convergence of Hyperscale, Big Data and Big Compute

CONTINUOUS DELIVERY WITH DC/OS AND JENKINS

Patching and Updating your VM SUSE Manager. Donald Vosburg, Sales Engineer, SUSE

A Whirlwind Tour of Apache Mesos

Mesosphere and Percona Server for MongoDB. Jeff Sandstrom, Product Manager (Percona) Ravi Yadav, Tech. Partnerships Lead (Mesosphere)

Facilitating Collaborative Analysis in SWAN

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

DGX UPDATE. Customer Presentation Deck May 8, 2017

Deep learning in MATLAB From Concept to CUDA Code

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

Containerization Dockers / Mesospere. Arno Keller HPE

In-cluster Open Source Testing Framework

Democratizing Machine Learning on Kubernetes

Merging Enterprise Applications with Docker* Container Technology

OpenShift Roadmap Enterprise Kubernetes for Developers. Clayton Coleman, Architect, OpenShift

Amir Zipory Senior Solutions Architect, Redhat Israel, Greece & Cyprus

I D C T E C H N O L O G Y S P O T L I G H T. V i r t u a l and Cloud D a t a Center Management

NVIDIA DEEP LEARNING INSTITUTE

Think Small to Scale Big

Lab 3. On-Premises Deployments (Optional)

Networking & Security for Mesos

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

Python based Data Science on Cray Platforms Rob Vesse, Alex Heye, Mike Ringenburg - Cray Inc C O M P U T E S T O R E A N A L Y Z E

Genomics on Cisco Metacloud + SwiftStack

TRAINING AND CERTIFICATION UPDATE

CONTINUOUS DELIVERY WITH MESOS, DC/OS AND JENKINS

SAMPLE CHAPTER IN ACTION. Roger Ignazio. FOREWORD BY Florian Leibert MANNING

Containerised Development of a Scientific Data Management System Ben Leighton, Andrew Freebairn, Ashley Sommer, Jonathan Yu, Simon Cox LAND AND WATER

Implementing the Twelve-Factor App Methodology for Developing Cloud- Native Applications

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

KEYNOTE How open source and hybrid cloud brought Microsoft and Red Hat together

SUSE s vision for agile software development and deployment in the Software Defined Datacenter

Deep Learning Inferencing on IBM Cloud with NVIDIA TensorRT

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Allowing Users to Run Services at the OLCF with Kubernetes

64-bit ARM Unikernels on ukvm

NVIDIA DLI HANDS-ON TRAINING COURSE CATALOG

Integrated Management of OpenPOWER Converged Infrastructures. Revolutionizing the Datacenter

Transcription:

Supporting GPUs in Docker Containers on Apache Mesos MesosCon Europe - 2016 Kevin Klues Senior Software Engineer Mesosphere Yubo Li Staff Researcher IBM Research China

Kevin Klues Yubo Li Kevin Klues is a Senior Software Engineer at Mesosphere working on the Mesos core team. Since joining Mesosphere, Kevin has been involved in the design and implementation of a number of Mesos s core subsystems, including GPU isolation and Pods. When not working, you can usually find Kevin on a snowboard or up in the mountains in some capacity or another. Dr. Yubo Li is a Staff Researcher at IBM Research, China. He is the architect of the GPU acceleration and deep-learning as a service (Dlaas) components of SuperVessel, an open-access cloud running OpenStack on OpenPOWER machines. He is currently working on GPU support for several cloud container technologies, including Mesos, Kubernetes, Marathon and Open Stack. Email: klueska@mesosphere.com Slack: @klueska Email: liyubobj@cn.ibm.com Slack: @liyubobj

Why GPUs? GPUs are the tool of choice for many big-data cloud applications Deep Learning Natural Language Processing Genome Sequencing 3

Why GPUs? Credit: www.slideshare.net/datasciencemd/deep-learning-with-gpus 4

Why GPUs? Mesos users have been asking for GPU support for years First email asking for it can be found in the dev-list archives from 2011 The request rate has increased dramatically in the last 9-12 months 5

Why GPUs? Without direct support for GPUs, there is no isolation guarantee No built-in coordination to restrict access to GPUs Possible for multiple frameworks / tasks to access GPUs at the same time Ad-hoc solutions have emerged to give access to GPUs, but they all rely on co-operative scheduling rather than true isolation. 6

Why GPUs? Enterprise users currently partition clusters to make use of GPUs They d like to consolidate them Not willing to do so without isolation guarantees Unified Job Scheduler 7

Why GPUs? MARATHON METRONOME (Long Running) (Batch) ON-PREMISE CLOUD GPU-Accelerated Node GPU-Accelerated Node Source: www.nvidia.com/object/apache-mesos.html 8

Why Docker? Extremely popular image format for containers Build once run everywhere Configure once run anything Dockerized Applications Image Pulls Source: DockerCon 2016 Keynote by Docker s CEO Ben Golub 9

Why Docker? Nvidia-docker Wrapper around docker to allow GPUs to be used inside docker containers Source: https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/ 10

Why Docker? Machine Learning Frameworks Support exists for many popular machine learning frameworks with nvidia-docker (including TensorFlow, Caffe, CNTK, etc.) Source: https://data-shaker.com/docker-tensorflow-with-jupyter-notebook-on-windows/ 11

Overall Goal Test locally with nvidia-docker Deploy to production with Mesos 12

Overview of Talk Challenges of supporting Nvidia GPUs in docker containers How nvidia-docker addresses these challenges How Apache Mesos addresses these challenges Isolation Demo Demo from IBM 13

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy 14

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy Buy some GPUs 15

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy Buy some GPUs Install them on your box Linux Kernel 16

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy Buy some GPUs Install them on your box Install the base nvidia drivers nvidia base libraries nvidia-kernel-module Linux Kernel 17

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy Buy some GPUs Install them on your box Install the base nvidia drivers Install some advanced toolkit libraries CUDA / TensorFlow libraries nvidia base libraries nvidia-kernel-module Linux Kernel 18

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy Buy some GPUs Install them on your box Install the base nvidia drivers Install some advanced toolkit libraries Link a GPU accelerated application against these libraries Application CUDA / TensorFlow libraries nvidia base libraries nvidia-kernel-module Linux Kernel 19

Challenges of Supporting Nvidia GPUs in Docker containers Before containers it was easy Buy some GPUs Install them on your box Install the base nvidia drivers Install some advanced toolkit libraries Link a GPU accelerated application against these libraries Run your application Application CUDA / TensorFlow libraries nvidia base libraries nvidia-kernel-module Linux Kernel 20

Challenges of Supporting Nvidia GPUs in Docker containers So what about containers? Buy some GPUs Install them on your box Install the nvidia-kernel-modules nvidia-kernel-module Linux Kernel 21

Challenges of Supporting Nvidia GPUs in Docker containers Container So what about containers? Buy some GPUs Install them on your box Install the nvidia-kernel-module Build a docker image Bundle the base nvidia libraries Bundle some advanced toolkit libraries Bundle a GPU accelerated application to use these libraries Application CUDA / TensorFlow libraries nvidia base libraries nvidia-kernel-module Linux Kernel 22

Challenges of Supporting Nvidia GPUs in Docker containers Container So what about containers? Buy some GPUs Install them on your box Install the nvidia-kernel-module Build a docker image Bundle the base nvidia libraries Bundle some advanced toolkit libraries Bundle a GPU accelerated application to use these libraries Run your docker container Application CUDA / TensorFlow libraries nvidia base libraries nvidia-kernel-module Linux Kernel 23

Challenges of Supporting Nvidia GPUs in Docker containers Container Straightforward, right? Application CUDA / TensorFlow libraries nvidia base libraries nvidia-kernel-module Linux Kernel 24

Challenges of Supporting Nvidia GPUs in Docker containers Will only work if the kernel / user driver versions match Container Application CUDA / TensorFlow libraries nvidia base libraries (v1) nvidia-kernel-module (v1) nvidia-kernel-module (v2) Linux Kernel Linux Kernel 25

Challenges of Supporting Nvidia GPUs in Docker containers Won t work if they don t Container Application CUDA / TensorFlow libraries nvidia base libraries (v1) nvidia-kernel-module (v1) nvidia-kernel-module (v2) Linux Kernel Linux Kernel 26

Challenges of Supporting Nvidia GPUs in Docker containers Either way, you have to map in the GPU devices somehow Container Application CUDA / TensorFlow libraries nvidia base libraries (v1) nvidia-kernel-module (v1) nvidia-kernel-module (v2) Linux Kernel Linux Kernel 27

nvidia-docker and GPUs Components of nvidia-docker Set of custom docker images that set custom labels / run some commands nvidia-docker-plugin ( standard docker volume plugin) nvidia-docker (wrapper script around docker itself) docker run... nvidia-docker run... 28

nvidia-docker and GPUs nvidia-docker-plugin Finds all standard nvidia libraries / binaries on the host and consolidates them into a single place as a docker volume /var/lib/docker/volumes nvidia_xxx.xx (version number) bin lib lib64 29

nvidia-docker and GPUs nvidia-docker wrapper script Looks for the label: com.nvidia.volumes.needed = nvidia_driver When found, it maps the nvidia_xxx.xx volume into the container at: /usr/local/nvidia Enumerates all GPUs on the machine and maps them into the container as available devices Passes all other docker options straight through to docker 30

nvidia-docker and GPUs Container Application Container Application CUDA / TensorFlow libraries CUDA / TensorFlow libraries nvidia base libraries (v1) nvidia base libraries (v2) nvidia-kernel-module (v1) nvidia-kernel-module (v2) Linux Kernel Linux Kernel 31

nvidia-docker and GPUs Container Application Container Application CUDA / TensorFlow libraries CUDA / TensorFlow libraries nvidia base libraries (v1) nvidia base libraries (v2) nvidia-kernel-module (v1) nvidia-kernel-module (v2) Linux Kernel Linux Kernel 32

Apache Mesos and GPUs Mimics functionality of nvidia-docker Supports nvidia docker images with custom labels Maps consolidated volume of binaries / libraries into /usr/local/nvidia Enumerates GPUs and injects them into containers Additionally isolates access to GPUs between tasks Not possible with nvidia-docker alone 33

Apache Mesos and GPUs Multiple Containerizer support Mesos (aka unified) containerizer (fully supported) Docker containerizer (currently being worked on) Why support both? Many people are asking for Docker containerizer support to bridge the feature gap People are already familiar with existing docker tools Unified containerizer needs time to mature 34

Apache Mesos and GPUs GPU_RESOURCES framework capability Frameworks must opt-in to receive offers with GPU resources Prevents legacy frameworks from consuming non-gpu resources and starving out GPU jobs Use agent attributes to select specific type of GPU resources Agents advertise the type of GPUs they have installed via attributes Only accept an offer if the attributes match the GPU type you want 35

Apache Mesos and GPUs Mesos Agent Containerizer API (Unified) Mesos Containerizer Memory CPU Isolator API 36

Apache Mesos and GPUs Mesos Agent Containerizer API (Unified) Mesos Containerizer GPU Memory CPU Isolator API 37

Apache Mesos and GPUs Mesos Agent Nvidia GPU Isolator Containerizer API (Unified) Mesos Containerizer GPU Memory CPU Isolator API Nvidia GPU Allocator Linux devices cgroup 38

Apache Mesos and GPUs Mesos Agent Mimics Nvidia GPUfunctionality Isolator of nvidia-docker-plugin Containerizer API (Unified) Mesos Containerizer GPU Memory CPU Isolator API Nvidia GPU Allocator Nvidia Volume Manager Linux devices cgroup 39

Apache Mesos and GPUs Mesos Agent Containerizer API (Unified) Mesos Containerizer Nvidia GPU Allocator Nvidia Volume Manager Nvidia GPU Isolator GPU Memory CPU Isolator API Linux devices cgroup 40

Apache Mesos and GPUs Containerizer API Mesos Agent Composing Containerizer Docker Containerizer GPU (Unified) Mesos Containerizer GPU Memory CPU Isolator API Nvidia GPU Allocator Nvidia Volume Manager 41

Implementation Timeline Support for the Unified containerizer Supports both image-less and docker-image based containers (fully compatible with nvidia-docker) Supported in the recent Mesos 1.0.0 release (with patch fixes in 1.0.1) Supported in upcoming Marathon 1.3 release (coming in next few weeks) Targeted for DC/OS 1.9 release ( mid October) Support for the Docker Containerizer Designed and partially implemented Target Mesos release: 1.1.0 Target Marathon release: 1.4 Target DC/OS release: 1.10 42

Looking forward Expose Underlying GPU Topology Include in offers sent to schedulers Help make more informed decision about which GPUs to choose Add support for other GPU / accelerator types AMD, Intel, Custom FPGAs, etc. 43

Special Thanks to All Collaborators Vikram Ditya Seetharami Seelam Andrew Iles Yong Feng Jonathan Calmels Guangya Liu Felix Abecassis Ian Downes Rob Todd Niklas Nielson Rajat Phull Connor Doyle Shivi Fotedar Benjamin Mahler Tim Chen 44

ISOLATION DEMO https://github.com/klueska-mesosphere/mesos-gpu-docker 45

46

DEMO FROM IBM 47