Onto Petaflops with Kubernetes

Vishnu Kannan, Google Inc. (vishh@google.com)

Key Takeaways
- Kubernetes can manage hardware accelerators at scale
- Kubernetes provides a playground for ML
- The ML journey with Kubernetes is just getting started

Kubernetes Overview
- Kubernetes is a multi-workload platform
- Supports CPU, memory and storage
- Simplifies packaging
- Portable across clouds
- Multiple managed solutions exist
- Many success stories

Kubernetes Overview (diagram: one master node and two worker nodes)

GPUs in Kubernetes
- Massively parallel workloads demand GPUs
- Some of those workloads need scale and simplified packaging
- Kubernetes now supports GPUs
- Compute per node increases by orders of magnitude

Portability
- There are lots of GPUs
- Add FPGAs, ASICs, NICs to the mix
- Each with its own SW and HW requirements
- Container portability is a challenge
(Diagram labels: Container, App, Library, Drivers, Kernel, GPU, FPGA, Infiniband)

Portability
- Support multiple versions of kernels, drivers, user-facing APIs and hardware
- Not a trivial problem
- Kubernetes provides primitives
- Better left to Kubernetes vendors
(Diagram labels: Container, App, Library, Drivers, Kernel, GPU, FPGA, Infiniband)

Hardware Device Plugins
- Kubernetes workloads are expected to be portable
- Kubernetes cannot support 100+ kinds of hardware natively
- Solution for managing infra dependencies at scale
- Package SW requirements as a container, manage via Kubernetes
- One-click HW integration with Kubernetes - Magic!
- Hardware vendors author device plugins
- Users consume Kubernetes APIs

Hardware Device Plugins
Step 1 - Hardware vendor publishes device plugins for Kubernetes
Step 2 - Kubernetes cluster admins deploy device plugins
    kubectl create -f http://vendor.com/device-plugin-daemonset.yaml
Step 3 - Device plugin registers with the Kubelet (node agent) and advertises devices
Step 4 - Applications can request the newly available device as a Kubernetes resource
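Step 4 can be sketched as a pod manifest built in Python. This is a minimal illustration, not something shown in the talk: the resource name vendor.com/device, the pod name and the image are hypothetical placeholders, and the actual resource name is whatever the vendor's device plugin advertises. Device-plugin resources are requested under the container's resource limits.

```python
import json


def pod_requesting_device(name, image, resource_name, count):
    """Build a Pod manifest whose single container asks for `count`
    units of a device-plugin resource (for example a GPU)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Devices advertised by a plugin are requested as limits.
                    "resources": {"limits": {resource_name: count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # All three names below are hypothetical placeholders.
    manifest = pod_requesting_device(
        "accelerated-app", "example.com/app:latest", "vendor.com/device", 1
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with kubectl create -f, the same command used in Step 2.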

Challenges
- Many hardware configurations
- Performance guarantees
- Heterogeneous nodes
- Bin packing vs performance
- Monitoring
- Better scheduling primitives
(Diagram labels: two CPUs, each with its own memory, linked by QPI; each CPU attached over PCIe to GPUs with their own GPU memory)
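The "bin packing vs performance" tension can be illustrated with a toy placement model. This is only a sketch under simplifying assumptions (identical GPUs, no topology, requests handled in arrival order); it is not how the Kubernetes scheduler is implemented, and the function name place is made up for the example.

```python
def place(requests, capacities, strategy):
    """Assign each GPU request to a node.

    strategy="pack"   picks the feasible node with the fewest free GPUs.
    strategy="spread" picks the feasible node with the most free GPUs.
    Returns (placement, free): the chosen node index per request
    (None if the request cannot be placed) and the free GPUs left per node.
    """
    free = list(capacities)
    placement = []
    pick = min if strategy == "pack" else max
    for need in requests:
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            placement.append(None)  # request stays pending
            continue
        node = pick(candidates, key=lambda i: free[i])
        free[node] -= need
        placement.append(node)
    return placement, free


if __name__ == "__main__":
    requests = [1, 1, 2, 4]   # GPUs asked for by four pods
    capacities = [4, 4]       # two nodes with four GPUs each
    for strategy in ("pack", "spread"):
        print(strategy, place(requests, capacities, strategy))
```

With these inputs, packing places all four requests, while spreading leaves the 4-GPU request pending because free GPUs end up fragmented across both nodes. Spreading, on the other hand, keeps small jobs from sharing a node's PCIe and memory bandwidth, which is the performance side of the trade-off.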

ML Playground with Kubernetes

ML and Kubernetes
- ML benefits from Kubernetes and accelerators
- Productionizing ML is a devops problem
- Requires large amounts of compute and data
- Many dependencies: hardware, libraries, frameworks, language runtimes
- Interactive, streaming and batch
- Various tools and solutions
- Difficult to deploy and manage lifecycle

Single Node ML Playground
Single node: two steps to productivity
1. Install Kubernetes (kubeadm/minikube)
2. Install Kubernetes apps pre-configured with the ML stack of choice
- Works across distros and OSes
- Extensible to other hardware configurations
- Portable to the cloud

(Stack diagram, layers as listed on the slide)
- Quota, Logging, Monitoring, NFS, Ceph, Auth
- Spark ML, Airflow, Jupyter, Cassandra, MySQL, TensorFlow, Caffe, TF-Serving, Flask+Scikit
- Kubernetes
- Operating system (Linux, Windows)
- CPU, Memory, SSD, Disk, GPU, FPGA, ASIC, NIC
- GCP, AWS, Azure, On-prem

Scale out ML Playground
- Distributed training
- Hyperparameter search
- Batch preprocessing
- Automate using Kubernetes APIs!
- Compose with your tools of choice
- Build common ML platforms
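As one way to picture "automate using Kubernetes APIs" for a hyperparameter search, the sketch below generates one Job manifest per point in a parameter grid. It is an illustration under assumptions, not the speaker's tooling: the image name, the hp-search- name prefix and the command-line flag format are hypothetical, and submitting the manifests (with kubectl or a client library) is left out.

```python
import itertools
import json


def hyperparameter_jobs(image, grid):
    """Return one batch/v1 Job manifest per combination of values in `grid`.

    `grid` maps a flag name to the list of values to try.
    """
    names = sorted(grid)
    jobs = []
    combos = itertools.product(*(grid[n] for n in names))
    for index, values in enumerate(combos):
        # Each combination becomes the trainer's command-line arguments.
        args = [f"--{n}={v}" for n, v in zip(names, values)]
        jobs.append({
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": f"hp-search-{index}"},
            "spec": {
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {"name": "trainer", "image": image, "args": args}
                        ],
                    }
                }
            },
        })
    return jobs


if __name__ == "__main__":
    # Hypothetical image and grid, for illustration only.
    grid = {"learning-rate": [0.1, 0.01], "batch-size": [32, 64]}
    for job in hyperparameter_jobs("example.com/trainer:latest", grid):
        print(json.dumps(job))
```

Each Job runs independently, so the cluster decides where and when every trial runs; the search logic only has to produce manifests and collect results.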

Kubernetes on GCP
- Container Engine: managed Kubernetes
- Auto scaling, auto upgrades, auto repairs
- Backed by Kubernetes maintainers and Google SREs
- Supports NVIDIA K80 GPUs (Alpha), more in the future

Kubernetes on GCP: Tensor Processing Units
- ASICs designed for neural nets
- V1, announced in 2016: 15-30X higher performance
- V2, announced in 2017: Cloud TPUs, 180 teraflops each
- Combined into Pods: 11.5 petaflops
- You may not need that much! That's ok!
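The two figures quoted for Cloud TPUs (180 teraflops each, 11.5 petaflops per pod) imply a device count that the slide does not state. The short calculation below derives it; treat the result as an estimate from the slide's rounded numbers rather than a published specification.

```python
import math

TFLOPS_PER_CLOUD_TPU = 180   # per-device figure from the slide
POD_PETAFLOPS = 11.5         # per-pod figure from the slide


def devices_needed(target_petaflops, tflops_per_device):
    """Smallest number of devices whose combined peak reaches the target."""
    return math.ceil(target_petaflops * 1000 / tflops_per_device)


if __name__ == "__main__":
    n = devices_needed(POD_PETAFLOPS, TFLOPS_PER_CLOUD_TPU)
    # 11.5 PF / 180 TF is about 63.9, so 64 devices; 64 * 180 TF = 11.52 PF.
    print(n, n * TFLOPS_PER_CLOUD_TPU / 1000)
```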

Kubernetes on GCP
- Container Engine: managed Kubernetes
- NVIDIA K80s, more in the future
- Tensor Processing Units (TPUs): ~25x faster for neural nets
- BigQuery: analyze massively large data; fully managed, petabyte scale, low cost
- Global load balancing: global predictions
- Cloud AI: Image, Video, Text, Translation, MLaaS

Demo

Demo Overview
Goal is to show what Kubernetes can provide
1. Interactive ML with Jupyter and TensorFlow on Kubernetes
2. Analyze, train, package and deploy a model from Jupyter on Kubernetes
3. Focus on rapid iteration
4. Customizable to other workflows
    helm install https://github.com/vishh/helm-chart/raw/master/jupyterhub-v0.4.0.tgz

Demo walkthrough (diagram sequence: a master and two nodes, Node 1 and Node 2, one with a GPU)
1. Deploy JupyterHub: the master runs a proxy pod and a hub pod, exposed as the Proxy Service and the Hub Service
2. Login: the login request arrives at the Proxy Service
3. The proxy forwards the request to the hub
4. The hub creates a TF notebook pod, and the master runs it on the node with the GPU
5. The login is redirected to the notebook

ML and Kubernetes Journey

Existing Solutions
1. Pachyderm
2. TensorFlow Serving - Inception Helm Chart
3. Pipeline AI - TensorFlow and Spark on Kubernetes
4. Clarifai - Webinar
5. eBay Machine Learning
6. Baidu - PaddlePaddle on Kubernetes
7. OpenAI - TensorFlow on Kubernetes
8. Distributed TF training
9. And more...

Accelerated Journey
- Enabling accelerated workloads on Kubernetes is a journey
- Contributors across many organizations
- Interested in GPUs? Join the community
- There are problems that need to be solved across the stack
- Be part of a thriving OSS community!

Summary
- Kubernetes can manage hardware accelerators at scale
- Kubernetes provides a playground for ML
- The ML journey with Kubernetes is just getting started
- Join the party!

Thank you
@vishh [github, slack]
@vishnukanan [Twitter]