TensorFlow on Kubernetes @ vivo xidianwangtao@gmail.com

Agenda
- Distributed TensorFlow
- Why TensorFlow on Kubernetes
- How TensorFlow on Kubernetes
- Deploy Architecture
- Step By Step
- The Major Problems I Have Encountered
- Todo List

Outrageously large models Improving accuracy with up to 68 billion parameters https://www.cs.toronto.edu/~hinton/absps/outrageously.pdf

Distributed TensorFlow
Derek Murray @ TensorFlow DEV SUMMIT 2017, "Distributed TensorFlow"
https://www.youtube.com/watch?time_continue=703&v=la_M6bCV91M

Distributed TensorFlow Model

In-graph Replication

Between-graph Replication

Async/Sync Training

Between-graph + Async
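
A toy sketch of the difference between sync and async training, without any framework. It is not TensorFlow code: the function names (`grad`, `sync_step`, `async_step`) and the one-weight model are assumptions made only for illustration. In sync mode the parameter server applies one averaged gradient per step. In async mode it applies each worker's gradient as it arrives, even though that gradient was computed from weights that may already be stale.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def sync_step(w, batches, lr):
    """All workers read the same w; the PS applies the averaged gradient once."""
    grads = [grad(w, x, y) for x, y in batches]
    return w - lr * sum(grads) / len(grads)


def async_step(w, batches, lr):
    """All workers read w at the same moment; the PS applies each gradient
    on arrival, so later gradients are stale relative to the updated weight."""
    stale_w = w
    for x, y in batches:
        w = w - lr * grad(stale_w, x, y)
    return w


# One batch per worker (two workers), illustrative data only.
batches = [(1.0, 2.0), (2.0, 4.0)]
print("sync :", sync_step(0.0, batches, 0.1))
print("async:", async_step(0.0, batches, 0.1))
```

With the same inputs the async update moves the weight further in one round, because every gradient is applied at full strength instead of being averaged.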

Why TensorFlow on Kubernetes

Motivation TensorFlow Task GPU GPU Task Task HDFS Read TensorFlow

Kubernetes is Suitable
- ResourceQuota, LimitRanger
- GPU (only limits)
- PLEG
- EFK
- Read (GlusterFS, Ceph)
- TensorFlow

HDFS vs GlusterFS vs Ceph
Read performance:
- GlusterFS: 12 GB/s
- HDFS: 3 GB/s
- CephFS: 2 GB/s
GlusterFS read performance is best.
http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf

How TensorFlow on Kubernetes

GlusterFS + K8S + TF

HDFS + K8S + TF
Kubernetes persistent volume types: GCEPersistentDisk, AWSElasticBlockStore, CephFS, Cinder, AzureFile, Glusterfs, AzureDisk, VsphereVolume, FC (Fibre Channel), Quobyte Volumes, FlexVolume, HostPath, Flocker, VMware Photon, NFS, Portworx Volumes, iSCSI, ScaleIO Volumes, RBD, StorageOS

Job
Kube-scheduler, Kube-apiserver, etcd
Kube-controller-manager: JobController (NewJobController, Run)
Job fields:
- .spec.completions
- .spec.parallelism
- .spec.activeDeadlineSeconds
- .spec.backoffLimit
- .spec.template.spec.restartPolicy: Never or OnFailure
JobController:
- syncJob: func (jm *JobController) syncJob(key string) (bool, error)
- manageJob: func (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error)
- Indexer, SatisfiedExpectations
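
To show where the Job fields listed above sit in a manifest, here is a minimal sketch that builds a batch/v1 Job as a Python dict and prints it as JSON, which kubectl also accepts. The helper name `worker_job`, the labels, the container name, and the image are hypothetical. Note that `backoffLimit` only exists from Kubernetes 1.8 on.

```python
import json


def worker_job(name, index, image, deadline_seconds=None, backoff_limit=6):
    """Build a batch/v1 Job manifest (as a dict) for one worker task."""
    spec = {
        "completions": 1,                 # .spec.completions
        "parallelism": 1,                 # .spec.parallelism
        "backoffLimit": backoff_limit,    # .spec.backoffLimit (Kubernetes >= 1.8)
        "template": {
            "metadata": {"labels": {"name": name, "job": "worker", "task": str(index)}},
            "spec": {
                # A Job only allows Never or OnFailure here.
                "restartPolicy": "OnFailure",
                "containers": [{"name": "tensorflow", "image": image}],
            },
        },
    }
    if deadline_seconds is not None:
        spec["activeDeadlineSeconds"] = deadline_seconds  # .spec.activeDeadlineSeconds
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "%s-worker-%d" % (name, index)},
        "spec": spec,
    }


# Hypothetical values, for illustration only.
job = worker_job("demo", 0, "registry.example/tf:latest", deadline_seconds=3600)
print(json.dumps(job, indent=2))
```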

Deploy Architecture

Components

Step By Step

Step 1 - User Node
Copy $AlgorithmName to `/var/www/html/$username/$algorithmname/` on the User Node, along with run.sh for $AlgorithmName.
The User Node serves `/var/www/html/` with httpd, so $AlgorithmName is available over HTTP at `http://$usernodeip:80/$username/$algorithmname`.

Step 2 - User Node
`/opt/tensorflow/` on the User Node: tfcluster_template.yaml.jinja (HDFS, GlusterFS)
render_template.py: https://github.com/tensorflow/ecosystem/blob/master/render_template.py

Tensorboard

Config
In tfcluster_template.yaml.jinja, set name, worker_replicas, ps_replicas, and script (the HTTP URL of the script).
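
The settings name, worker_replicas, and ps_replicas are enough to derive the host lists that every task in a between-graph cluster needs. The sketch below shows one way to do that in plain Python. The `<name>-<job>-<index>` naming scheme, the port 2222, and the helper names are assumptions for illustration and are not taken from the actual template.

```python
def task_names(name, job, replicas):
    """Names of the per-task Services, e.g. wangtao-worker-0."""
    return ["%s-%s-%d" % (name, job, i) for i in range(replicas)]


def cluster_flags(name, worker_replicas, ps_replicas, port=2222):
    """Comma-separated host:port lists for the ps and worker jobs."""
    ps = ["%s:%d" % (h, port) for h in task_names(name, "ps", ps_replicas)]
    workers = ["%s:%d" % (h, port) for h in task_names(name, "worker", worker_replicas)]
    return {"ps_hosts": ",".join(ps), "worker_hosts": ",".join(workers)}


flags = cluster_flags("wangtao", worker_replicas=2, ps_replicas=1)
print("--ps_hosts=" + flags["ps_hosts"])
print("--worker_hosts=" + flags["worker_hosts"])
```

Each task would receive the same two lists plus its own job name and task index.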

Step 3 - TensorFlow Cluster
Render the k8s yaml: `python render_template.py tfcluster_template.yaml.jinja > wangtao.yaml`
Create the Between-Graph TensorFlow Cluster: `kubectl apply -f wangtao.yaml`

TensorFlow Cluster Kubernetes Dashboard namespace ps worker

PS log

Worker log

PS and Worker Service

Secret for Harbor

The Major Problems I Have Encountered

Worker recreate pod
kubernetes 1.7:
- kubelet: --maximum-dead-containers
- Job Yaml: .spec.activeDeadlineSeconds
kubernetes 1.8:
- kubelet: --maximum-dead-containers
- Job Yaml: .spec.activeDeadlineSeconds
- Job Yaml: .spec.backoffLimit (default 6)

Command or args? No such file or directory args shell command shell

Headless Service is Suitable
Sometimes you don't need or want load-balancing and a single service IP. In this case, you can create headless services by specifying "None" for the cluster IP (spec.clusterIP). For such Services, a cluster IP is not allocated, kube-proxy does not handle these services, and there is no load balancing or proxying done by the platform for them. How DNS is automatically configured depends on whether the service has selectors defined.

kubelet --cluster-dns=10.254.0.2 --cluster-domain=tensorflow.pro.vivo
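
With the cluster domain set as above, a Service resolves under `<service>.<namespace>.svc.<cluster-domain>`. The small sketch below builds that name. The service and namespace values are hypothetical. The default cluster domain is the one from the kubelet flag.

```python
def service_fqdn(service, namespace, cluster_domain="tensorflow.pro.vivo"):
    """Fully qualified DNS name of a Kubernetes Service."""
    return "%s.%s.svc.%s" % (service, namespace, cluster_domain)


# Hypothetical service and namespace names.
print(service_fqdn("wangtao-ps-0", "wangtao"))
```

For a headless Service with selectors, this name resolves directly to the Pod IPs instead of a cluster IP.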

Take Care Dockerfile ENV
In Dockerfile:
ENV CLASSPATH .:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)
env in the container:
CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)
Dockerfile ENV does not perform command substitution, so the $(...) part is stored literally.

Workaround
Pod.spec.containers.command: [ /bin/sh, -c, export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); ]
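
A sketch of how such a command list could be generated when building the Pod spec programmatically. It wraps the export and the real training command in one `/bin/sh -c` string, so the shell expands `$(...)` at container start. The helper name `shell_command` and the training command are hypothetical. The paths are the ones from the slide.

```python
HADOOP = "/usr/lib/hadoop-2.6.1/bin/hadoop"
TOOLS_JAR = "/usr/lib/jvm/java-1.8.0/lib/tools.jar"


def shell_command(train_cmd):
    """Container command that sets CLASSPATH in a shell, then runs train_cmd."""
    export = "export CLASSPATH=.:%s:$(%s classpath --glob)" % (TOOLS_JAR, HADOOP)
    # Everything after "-c" must be one string, otherwise the shell only
    # sees the first word as the script to run.
    return ["/bin/sh", "-c", "%s; %s" % (export, train_cmd)]


# Hypothetical training command.
cmd = shell_command("python /opt/train.py")
print(cmd)
```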

Ps Process Hang

https://github.com/tensorflow/tensorflow/issues/4713
Shared Queue? Maybe
- TensorFlow on Kubernetes: once the worker jobs in a namespace have succeeded, kill the hanging PS Deployment (delete namespace)
- DevOps/TaaS: watch Kubernetes Events for succeeded workers

Reuse PV?
https://github.com/kubernetes/kubernetes/issues/48609
When you delete a PVC, the corresponding PV becomes Released.
- Make the PV available to everybody: delete PV.Spec.ClaimRef. Such a PV can be bound to any PVC (assuming that capacity, access mode and selectors match).
- Make the PV available to a specific PVC: pre-fill PV.Spec.ClaimRef with a pointer to a PVC. Leave PV.Spec.ClaimRef.UID empty, as the PVC does not need to exist at this point and you don't know the PVC's UID. This PV can be bound only to the specified PVC.
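
The two options can be expressed as JSON patch bodies for `kubectl patch pv <pv-name> -p '<patch>'`. The sketch below only builds the patch strings and does not talk to a cluster. The helper names and the PVC name and namespace are hypothetical. In a merge-style patch, a null value removes the field.

```python
import json


def release_patch():
    """Clear spec.claimRef so the Released PV can be bound by any PVC."""
    return json.dumps({"spec": {"claimRef": None}})


def reserve_patch(pvc_name, pvc_namespace):
    """Point spec.claimRef at one PVC and clear the old UID and resourceVersion,
    since the new PVC may not exist yet and its UID is unknown."""
    claim_ref = {
        "name": pvc_name,
        "namespace": pvc_namespace,
        "uid": None,
        "resourceVersion": None,
    }
    return json.dumps({"spec": {"claimRef": claim_ref}})


print(release_patch())
print(reserve_patch("train-data", "wangtao"))  # hypothetical PVC
```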

global secret
1. Create the secret: `kubectl create secret docker-registry harborsecret --docker-server=registry.vivo.xyz:4443 --docker-username=admin --docker-password=harbor1234 --docker-email=xxxxxxx@vivo.com`
2. `kubectl get secret harborsecret -o yaml` and read the `.dockercfg` value.

3. In tfcluster_template.yaml.jinja, add the harbor secret yaml to each namespace, with the `.dockercfg` value from step 2, so that the image can be pulled.
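
For reference, the `.dockercfg` value is base64-encoded JSON keyed by the registry server. The sketch below builds an equivalent Secret manifest as a Python dict, assuming the legacy `kubernetes.io/dockercfg` secret type. Newer kubectl versions may produce `.dockerconfigjson` instead. The helper name and all credentials shown are placeholders.

```python
import base64
import json


def dockercfg_secret(name, namespace, server, username, password, email):
    """Build a kubernetes.io/dockercfg Secret manifest for one namespace."""
    auth = base64.b64encode(("%s:%s" % (username, password)).encode()).decode()
    cfg = {server: {"username": username, "password": password,
                    "email": email, "auth": auth}}
    # Secret data values must themselves be base64-encoded.
    payload = base64.b64encode(json.dumps(cfg).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockercfg",
        "data": {".dockercfg": payload},
    }


# Placeholder registry and credentials.
secret = dockercfg_secret("harborsecret", "wangtao", "registry.example:4443",
                          "user", "secret", "user@example.com")
print(json.dumps(secret, indent=2))
```

Pods in that namespace would then reference the secret by name under imagePullSecrets.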

Thinking worker PS? chief task restart.

Todo List NVIDIA GPU GPU IO? TensorFlow GPU TaaS TensorFlow ü ü TensorFlow Cluster ü ü TensorFlow Serving K8S Jupyter Notebook, Tensorboard

Q&A Thank you for your time. WaltonWang @ CSDN, OSCHINA