TensorFlow on vivo

TensorFlow on Kubernetes @ vivo xidianwangtao@gmail.com

Agenda
Distributed TensorFlow
Why TensorFlow on Kubernetes
How TensorFlow on Kubernetes
Deploy Architecture
Step By Step
The Major Problems I Have Encountered
Todo List

Outrageously large models Improving accuracy with up to 68 billion parameters https://www.cs.toronto.edu/~hinton/absps/outrageously.pdf

Distributed TensorFlow: Derek Murray @ TensorFlow DEV SUMMIT 2017 https://www.youtube.com/watch?time_continue=703&v=la_M6bCV91M

Distributed TensorFlow Model

In-graph Replication

Between-graph Replication

Async/Sync Training

Between-graph + Async
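The difference between synchronous and asynchronous updates on a parameter server can be illustrated without TensorFlow. The sketch below is a toy, single-process simulation under stated assumptions: one scalar parameter, a quadratic loss per worker, and hypothetical names (`TARGETS`, `sync_step`, `async_step`). It is not the TensorFlow implementation.

```python
# Toy simulation of parameter-server updates (not TensorFlow code).
# Worker i minimises (w - TARGETS[i])**2, so the shared optimum is the mean target.
TARGETS = [1.0, 2.0, 3.0]  # hypothetical per-worker data
LEARNING_RATE = 0.1


def gradient(w, target):
    """Gradient of (w - target)**2 with respect to w."""
    return 2.0 * (w - target)


def sync_step(w):
    """Synchronous: wait for every worker, average the gradients, apply once."""
    grads = [gradient(w, t) for t in TARGETS]
    return w - LEARNING_RATE * sum(grads) / len(grads)


def async_step(w):
    """Asynchronous: each worker applies its own gradient as soon as it is ready.

    Simulated by applying the workers' updates one after another, each computed
    from whatever value the parameter server currently holds.
    """
    for t in TARGETS:
        w = w - LEARNING_RATE * gradient(w, t)
    return w


def train(step, iterations=200, w=0.0):
    """Run the given update rule for a fixed number of iterations."""
    for _ in range(iterations):
        w = step(w)
    return w


w_sync = train(sync_step)
w_async = train(async_step)
print("sync:", round(w_sync, 3), "async:", round(w_async, 3))
```

In this toy the asynchronous run settles near, but not exactly at, the synchronous optimum, because the most recently applied update always pulls the parameter toward one worker's target.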

Motivation TensorFlow Task GPU GPU Task Task HDFS Read TensorFlow

Kubernetes is Suitable ResourceQuota, LimitRanger GPU (only limits) PLEG EFK Read Glusterfs, Ceph) TensorFlow

HDFS vs GlusterFS vs CephFS
GlusterFS: 12 GB/s
HDFS: 3 GB/s
CephFS: 2 GB/s
GlusterFS read performance is the best. http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf
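To make the throughput comparison above concrete, the following sketch converts the quoted read rates into the time needed for one full pass over a dataset. The dataset size (`DATASET_GB`) is a hypothetical value chosen for round numbers; only the three GB/s figures come from the slide.

```python
# Read throughput figures quoted on the slide, in GB/s.
READ_THROUGHPUT_GBPS = {"GlusterFS": 12.0, "HDFS": 3.0, "CephFS": 2.0}

DATASET_GB = 1200.0  # hypothetical dataset size, not from the slides


def read_seconds(dataset_gb, throughput_gbps):
    """Seconds needed to read the whole dataset once at a constant rate."""
    return dataset_gb / throughput_gbps


read_times = {
    name: read_seconds(DATASET_GB, rate)
    for name, rate in READ_THROUGHPUT_GBPS.items()
}

# Print the file systems from fastest to slowest.
for name, seconds in sorted(read_times.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.0f} s per full pass")
```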

GlusterFS + K8S + TF

HDFS + K8S + TF
Kubernetes volume types: GCEPersistentDisk, AWSElasticBlockStore, CephFS, Cinder, AzureFile, Glusterfs, AzureDisk, VsphereVolume, FC (Fibre Channel), Quobyte Volumes, FlexVolume, HostPath, Flocker, VMware Photon, NFS, Portworx Volumes, iSCSI, ScaleIO Volumes, RBD, StorageOS

Kubernetes Job
Components: kube-scheduler, kube-apiserver, etcd, kube-controller-manager (JobController, created by NewJobController and started by Run)
Job spec fields: .spec.completions, .spec.parallelism, .spec.activeDeadlineSeconds, .spec.backoffLimit
RestartPolicy: Never or OnFailure
syncJob: func (jm *JobController) syncJob(key string) (bool, error)
manageJob: func (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error)
Indexer, SatisfiedExpectations
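The Job fields listed above fit together roughly as in the sketch below, which builds a batch/v1 Job manifest as a plain dict and prints it as JSON (kubectl accepts JSON as well as YAML). The function name `make_worker_job`, the Job name, the image, and the default values are assumptions for illustration. In the batch/v1 API, `backoffLimit` sits directly under `spec`, and `restartPolicy` belongs to the Pod template.

```python
import json


def make_worker_job(name, image, completions=1, parallelism=1,
                    active_deadline_seconds=3600, backoff_limit=6):
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,        # successful Pods required
            "parallelism": parallelism,        # Pods allowed to run at once
            "activeDeadlineSeconds": active_deadline_seconds,
            "backoffLimit": backoff_limit,     # retries before the Job fails
            "template": {
                "spec": {
                    # A Job only allows Never or OnFailure here.
                    "restartPolicy": "Never",
                    "containers": [{"name": name, "image": image}],
                }
            },
        },
    }


# Hypothetical name and image, for illustration only.
job = make_worker_job("tf-worker-0", "example/tensorflow:latest")
print(json.dumps(job, indent=2))
```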

Components

Step 1 - User Node: copy $AlgorithmName to the User Node directory `/var/www/html/$username/$algorithmname/`, together with run.sh (for $AlgorithmName). The User Node serves `/var/www/html/` through httpd, so $AlgorithmName is reachable over HTTP at `http://$usernodeip:80/$username/$algorithmname`

Step 2- User Node `/opt/tensorflow/` tfcluster_template.yaml.jinja HDFS https://github.com/tensorflow/ecosystem/blob/master/render_template.py GlusterFS

Tensorboard

Config: in tfcluster_template.yaml.jinja, set name, worker_replicas, ps_replicas, and script (the script's HTTP address)

Step 3 - TensorFlow Cluster: run `python render_template.py tfcluster_template.yaml.jinja > wangtao.yaml` to render the k8s YAML, then run `kubectl apply -f wangtao.yaml` to create the Between-Graph TensorFlow Cluster
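The rendering step can be approximated as follows. render_template.py from tensorflow/ecosystem uses Jinja2; this sketch swaps in `string.Template` from the standard library so that it runs without dependencies. The template text, the port 2222, the `name-job-index` naming, the image, and the choice to render every task as a Job are all assumptions made for brevity, not the contents of tfcluster_template.yaml.jinja.

```python
from string import Template

# Hypothetical, heavily shortened stand-in for one task entry of the template.
JOB_TEMPLATE = Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: $name-$job-$index
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: $image
        args:
        - --job_name=$job
        - --task_index=$index
        - --ps_hosts=$ps_hosts
        - --worker_hosts=$worker_hosts
""")


def hosts(name, job, replicas, port=2222):
    """Comma-separated host:port list for one job type (assumed naming)."""
    return ",".join(f"{name}-{job}-{i}:{port}" for i in range(replicas))


def render(name, image, worker_replicas, ps_replicas):
    """Render one YAML document per ps and worker task."""
    ps_hosts = hosts(name, "ps", ps_replicas)
    worker_hosts = hosts(name, "worker", worker_replicas)
    docs = []
    for job, replicas in (("ps", ps_replicas), ("worker", worker_replicas)):
        for index in range(replicas):
            docs.append(JOB_TEMPLATE.substitute(
                name=name, job=job, index=index, image=image,
                ps_hosts=ps_hosts, worker_hosts=worker_hosts))
    return "---\n".join(docs)


rendered = render("wangtao", "example/tensorflow:latest",
                  worker_replicas=2, ps_replicas=1)
print(rendered)
```

Every task receives the full ps and worker host lists plus its own job name and index, which is what between-graph replication needs to build the same cluster description in each process.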

TensorFlow Cluster Kubernetes Dashboard namespace ps worker

PS log

Worker log

PS and Worker Service

Secret for Harbor

Worker recreate pod
kubernetes 1.7: kubelet --maximum-dead-containers; Job YAML .spec.activeDeadlineSeconds
kubernetes 1.8: kubelet --maximum-dead-containers; Job YAML .spec.activeDeadlineSeconds; Job YAML .spec.backoffLimit (default 6)

Command or args? No such file or directory args shell command shell

Headless Service is Suitable
Sometimes you don't need or want load-balancing and a single service IP. In this case, you can create headless services by specifying "None" for the cluster IP (spec.clusterIP). For such Services, a cluster IP is not allocated, kube-proxy does not handle these services, and there is no load balancing or proxying done by the platform for them. How DNS is automatically configured depends on whether the service has selectors defined.

kubelet --cluster-dns=10.254.0.2 --cluster-domain=tensorflow.pro.vivo
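With the cluster domain set as above, a Service is resolvable under the usual Kubernetes pattern `<service>.<namespace>.svc.<cluster-domain>`. The sketch below builds the addresses that a TensorFlow task list could use. The Service naming (`name-job-index`), the namespace, and the port 2222 are assumptions for illustration; only the cluster domain comes from the kubelet flags.

```python
CLUSTER_DOMAIN = "tensorflow.pro.vivo"  # from the kubelet --cluster-domain flag


def service_fqdn(service, namespace, cluster_domain=CLUSTER_DOMAIN):
    """Fully qualified DNS name of a Service inside the cluster."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def task_addresses(name, job, replicas, namespace, port=2222):
    """host:port entries for one job type, one headless Service per task."""
    addresses = []
    for index in range(replicas):
        service = f"{name}-{job}-{index}"  # assumed Service naming
        addresses.append(f"{service_fqdn(service, namespace)}:{port}")
    return addresses


worker_addresses = task_addresses("wangtao", "worker", 2, namespace="wangtao")
for address in worker_addresses:
    print(address)
```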

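Take Care: Dockerfile ENV
In the Dockerfile: ENV CLASSPATH .:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)
Output of env in the container: CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)
ENV does not run command substitution, so the $(...) expression stays in the value as literal text.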

Workaround: set Pod.spec.containers.command to [ "/bin/sh", "-c", "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); " ]
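The reason the shell-based form behaves differently can be reproduced locally. In the sketch below, the first call places a value containing `$(...)` straight into the environment, the way a Dockerfile ENV line does, and the second lets `/bin/sh` perform the export. The paths and the `echo expanded` command are hypothetical stand-ins for the hadoop classpath call, and the example assumes a POSIX `/bin/sh` is available.

```python
import subprocess

# Hypothetical value; "echo expanded" stands in for "hadoop classpath --glob".
LITERAL = ".:/opt/tools.jar:$(echo expanded)"

# Case 1: the value is put into the environment as-is. The shell expands
# $CLASSPATH but does not scan the result again for $(...).
from_env = subprocess.run(
    ["/bin/sh", "-c", 'printf "%s" "$CLASSPATH"'],
    env={"CLASSPATH": LITERAL, "PATH": "/usr/bin:/bin"},
    stdout=subprocess.PIPE, text=True, check=True,
).stdout

# Case 2: the export happens inside the shell, so $(...) is executed first.
from_shell = subprocess.run(
    ["/bin/sh", "-c",
     'export CLASSPATH=.:/opt/tools.jar:$(echo expanded); '
     'printf "%s" "$CLASSPATH"'],
    stdout=subprocess.PIPE, text=True, check=True,
).stdout

print("from environment:", from_env)
print("from shell export:", from_shell)
```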

PS Process Hang

https://github.com/tensorflow/tensorflow/issues/4713 Shared Queue? Maybe
TensorFlow on Kubernetes: namespace, worker job succeeded, kill hang PS Deployment, delete namespace
DevOps/TaaS: Events, succeeded worker, Kubernetes

Reuse PV? https://github.com/kubernetes/kubernetes/issues/48609
When you delete a PVC, the corresponding PV becomes Released.
Make the PV available to everybody: delete PV.Spec.ClaimRef. Such a PV can be bound to any PVC (assuming that capacity, access mode and selectors match).
Make the PV available to a specific PVC: pre-fill PV.Spec.ClaimRef with a pointer to the PVC. Leave PV.Spec.ClaimRef.UID empty, as the PVC does not need to exist at this point and you don't know the PVC's UID. This PV can be bound only to the specified PVC.
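The two options for a Released PV amount to small edits of the PV object. The sketch below applies them to a PV represented as a plain dict, as it would look after loading its YAML or JSON. The helper names and the sample PV are hypothetical, and sending the result back to the cluster is left out.

```python
import copy


def make_available_to_all(pv):
    """Return a copy of the PV with spec.claimRef removed."""
    out = copy.deepcopy(pv)
    out.get("spec", {}).pop("claimRef", None)
    return out


def reserve_for_claim(pv, namespace, claim_name):
    """Return a copy of the PV pre-bound to one PVC, with no uid set."""
    out = copy.deepcopy(pv)
    out.setdefault("spec", {})["claimRef"] = {
        "kind": "PersistentVolumeClaim",
        "namespace": namespace,
        "name": claim_name,
        # uid is deliberately omitted: the PVC may not exist yet.
    }
    return out


# Hypothetical Released PV that still points at a deleted PVC.
released_pv = {
    "kind": "PersistentVolume",
    "metadata": {"name": "pv-train-data"},
    "spec": {
        "capacity": {"storage": "100Gi"},
        "claimRef": {
            "kind": "PersistentVolumeClaim",
            "namespace": "old-namespace",
            "name": "old-claim",
            "uid": "stale-uid",
        },
    },
}

open_pv = make_available_to_all(released_pv)
reserved_pv = reserve_for_claim(released_pv, "wangtao", "train-data")
print(open_pv["spec"])
print(reserved_pv["spec"]["claimRef"])
```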

global secret
1. Create the secret: kubectl create secret docker-registry harborsecret --docker-server=registry.vivo.xyz:4443 --docker-username=admin --docker-password=harbor1234 --docker-email=xxxxxxx@vivo.com
2. Run kubectl get secret harborsecret -o yaml and read the `.dockercfg` value

3. In tfcluster_template.yaml.jinja, add the harbor secret YAML to the namespace, with `.dockercfg` set to the .dockercfg value from step 2, so that the image can be pulled
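The secret that these steps copy into each namespace can also be generated directly. The sketch below builds a Secret manifest of type `kubernetes.io/dockercfg`, assuming the legacy `.dockercfg` layout (registry server mapped to username, password, email and a base64 `auth` field). The helper name, the namespace, and the credentials are placeholders, not values to use as-is.

```python
import base64
import json


def dockercfg_secret(name, namespace, server, username, password, email):
    """Build a kubernetes.io/dockercfg Secret manifest as a plain dict."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    dockercfg = {
        server: {
            "username": username,
            "password": password,
            "email": email,
            "auth": auth,
        }
    }
    # Secret data values are base64-encoded.
    payload = base64.b64encode(json.dumps(dockercfg).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockercfg",
        "data": {".dockercfg": payload},
    }


# Placeholder namespace and credentials, for illustration only.
secret = dockercfg_secret(
    "harborsecret", "wangtao", "registry.vivo.xyz:4443",
    "admin", "example-password", "user@example.com",
)
print(json.dumps(secret, indent=2))
```

Pods in that namespace would then name the secret under `imagePullSecrets` so that the kubelet can pull images from the private registry.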

Thinking worker PS? chief task restart.

Todo List NVIDIA GPU GPU IO? TensorFlow GPU TaaS TensorFlow TensorFlow Cluster TensorFlow Serving K8S Jupyter Notebook, Tensorboard

Q&A Thank you for your time. WaltonWang @ CSDN, OSCHINA