TensorFlow on Kubernetes @ vivo xidianwangtao@gmail.com
Agenda: Distributed TensorFlow; Why TensorFlow on Kubernetes; How TensorFlow on Kubernetes; Deploy Architecture; Step By Step; The Major Problems I Have Encountered; Todo List
Outrageously large models: improving accuracy with up to 68 billion parameters
https://www.cs.toronto.edu/~hinton/absps/outrageously.pdf
Distributed TensorFlow: Derek Murray @ TensorFlow Dev Summit 2017, "Distributed TensorFlow"
https://www.youtube.com/watch?time_continue=703&v=la_M6bCV91M
Distributed TensorFlow Model
In-graph Replication
Between-graph Replication
Async/Sync Training
Between-graph + Async
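The difference between the sync and async training modes named above can be sketched with a toy single-weight example. This is only an illustration in plain Python, not TensorFlow code: the function names, the learning rate and the gradient values are hypothetical, and real async workers would compute their gradients against possibly stale parameters.

```python
from typing import List


def sync_update(weight: float, grads: List[float], lr: float) -> float:
    """Synchronous step: wait for every worker, average the gradients, apply once."""
    avg = sum(grads) / len(grads)
    return weight - lr * avg


def async_update(weight: float, grads: List[float], lr: float) -> float:
    """Asynchronous steps: apply each worker's gradient as soon as it arrives."""
    for g in grads:
        weight = weight - lr * g
    return weight


if __name__ == "__main__":
    worker_grads = [0.2, 0.4]  # hypothetical gradients from two workers
    print("sync :", sync_update(1.0, worker_grads, lr=0.5))
    print("async:", async_update(1.0, worker_grads, lr=0.5))
```

With the same gradients, the sync variant moves the weight once per round while the async variant moves it once per worker, which is why async training usually makes more (but noisier) updates per unit of time.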
Why TensorFlow on Kubernetes
Motivation TensorFlow Task GPU GPU Task Task HDFS Read TensorFlow
Kubernetes is Suitable: ResourceQuota, LimitRanger; GPU (only limits); PLEG; EFK; Read (GlusterFS, Ceph); TensorFlow
HDFS vs GlusterFS vs Ceph
Read throughput: GlusterFS 12 GB/s, HDFS 3 GB/s, CephFS 2 GB/s. GlusterFS read performance is best.
http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf
How TensorFlow on Kubernetes
GlusterFS + K8S + TF
HDFS + K8S + TF
Kubernetes volume types: GCEPersistentDisk, AWSElasticBlockStore, CephFS, Cinder, AzureFile, Glusterfs, AzureDisk, VsphereVolume, FC (Fibre Channel), Quobyte Volumes, FlexVolume, HostPath, Flocker, VMware Photon, NFS, Portworx Volumes, iSCSI, ScaleIO Volumes, RBD, StorageOS
Job: kube-scheduler, kube-apiserver, etcd, kube-controller-manager (JobController: NewJobController, Run)
Job fields: .spec.completions, .spec.parallelism, .spec.activeDeadlineSeconds, .spec.backoffLimit; .spec.template.spec.restartPolicy: Never or OnFailure
syncJob: func (jm *JobController) syncJob(key string) (bool, error)
manageJob: func (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error)
Indexer, SatisfiedExpectations
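As a rough illustration of where the Job fields above live, the following sketch builds a batch/v1 Job manifest as a plain Python dict. The job name, image and default values are hypothetical. Note that in batch/v1 `backoffLimit` sits directly under the Job's `.spec`, while `restartPolicy` belongs to the pod template.

```python
import json
from typing import Any, Dict


def worker_job(name: str, image: str, completions: int = 1, parallelism: int = 1,
               active_deadline_seconds: int = 3600,
               backoff_limit: int = 6) -> Dict[str, Any]:
    """Build a Job manifest (hypothetical name/image) with the fields JobController reads."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,
            "parallelism": parallelism,
            "activeDeadlineSeconds": active_deadline_seconds,
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # A Job's pod template only accepts Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be fed to `kubectl apply -f -`.
    print(json.dumps(worker_job("demo-worker-0", "example/tf:latest"), indent=2))
```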
Deploy Architecture
Components
Step By Step
Step 1 - User Node: copy $AlgorithmName to `/var/www/html/$username/$algorithmname/` on the User Node with run.sh ($AlgorithmName). httpd on the User Node serves `/var/www/html/`, so $AlgorithmName is reachable over HTTP at `http://$usernodeip:80/$username/$algorithmname`.
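The mapping from the directory on the User Node to the HTTP address can be sketched as below. The function name and its defaults are hypothetical; only the path layout and URL pattern come from the step above.

```python
import posixpath
from typing import Tuple
from urllib.parse import quote


def publish_paths(user_node_ip: str, username: str, algorithm_name: str,
                  doc_root: str = "/var/www/html", port: int = 80) -> Tuple[str, str]:
    """Return (directory on the User Node, URL under which httpd exposes it)."""
    local_dir = posixpath.join(doc_root, username, algorithm_name) + "/"
    url = f"http://{user_node_ip}:{port}/{quote(username)}/{quote(algorithm_name)}"
    return local_dir, url


if __name__ == "__main__":
    directory, url = publish_paths("10.0.0.1", "wangtao", "mnist")
    print(directory)
    print(url)
```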
Step 2 - User Node: `/opt/tensorflow/` holds tfcluster_template.yaml.jinja (HDFS and GlusterFS versions) and render_template.py (https://github.com/tensorflow/ecosystem/blob/master/render_template.py)
Tensorboard
Config: in tfcluster_template.yaml.jinja, set name, worker_replicas, ps_replicas, and script; script is the HTTP URL of the training script.
Step 3 - Create the TensorFlow Cluster: run `python render_template.py tfcluster_template.yaml.jinja > wangtao.yaml` to generate the k8s YAML, then `kubectl apply -f wangtao.yaml` to create the Between-Graph TensorFlow Cluster.
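What the rendering step has to compute from name, worker_replicas and ps_replicas can be sketched in plain Python (the real render_template.py uses Jinja2). The host naming scheme `<name>-<job>-<index>`, the port 2222 and the flag names `--job_name`, `--task_index`, `--ps_hosts`, `--worker_hosts` are assumptions borrowed from common distributed TensorFlow examples, not taken from the template used here.

```python
from typing import List


def task_hosts(name: str, job: str, replicas: int, port: int = 2222) -> List[str]:
    """One host:port per replica, e.g. wangtao-ps-0:2222 (assumed naming scheme)."""
    return [f"{name}-{job}-{i}:{port}" for i in range(replicas)]


def task_flags(name: str, worker_replicas: int, ps_replicas: int,
               job: str, index: int, port: int = 2222) -> List[str]:
    """Flags for one task; every task gets the same cluster view but its own identity."""
    replicas = {"ps": ps_replicas, "worker": worker_replicas}
    if job not in replicas:
        raise ValueError(f"unknown job: {job}")
    if not 0 <= index < replicas[job]:
        raise ValueError(f"index {index} out of range for job {job}")
    return [
        f"--job_name={job}",
        f"--task_index={index}",
        "--ps_hosts=" + ",".join(task_hosts(name, "ps", ps_replicas, port)),
        "--worker_hosts=" + ",".join(task_hosts(name, "worker", worker_replicas, port)),
    ]


if __name__ == "__main__":
    for flag in task_flags("wangtao", worker_replicas=2, ps_replicas=1, job="worker", index=0):
        print(flag)
```

In between-graph replication each ps and worker pod runs the same script with its own pair of job name and task index, which is why the template only needs the replica counts.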
TensorFlow Cluster in Kubernetes Dashboard: namespace, ps, worker
PS log
Worker log
PS and Worker Service
Secret for Harbor
The Major Problems I Have Encountered
Worker pod recreation
Kubernetes 1.7:
- kubelet: --maximum-dead-containers
- Job YAML: .spec.activeDeadlineSeconds
Kubernetes 1.8:
- kubelet: --maximum-dead-containers
- Job YAML: .spec.activeDeadlineSeconds
- Job YAML: .spec.backoffLimit (default 6)
Command or args? No such file or directory args shell command shell
Headless Service is Suitable
Sometimes you don't need or want load-balancing and a single service IP. In this case, you can create headless services by specifying "None" for the cluster IP (spec.clusterIP). For such Services, a cluster IP is not allocated, kube-proxy does not handle these services, and there is no load balancing or proxying done by the platform for them. How DNS is automatically configured depends on whether the service has selectors defined.
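A headless Service for one ps or worker task might look like the following sketch, again as a plain Python dict. The label keys (`job`, `task`), the service naming and the port are hypothetical; the only part taken from the text above is `clusterIP: "None"`.

```python
import json
from typing import Any, Dict


def headless_service(cluster: str, job: str, index: int, port: int = 2222) -> Dict[str, Any]:
    """Service without a cluster IP; DNS resolves straight to the selected pod."""
    name = f"{cluster}-{job}-{index}"
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "None",  # the string "None", not a null value
            "selector": {"job": job, "task": str(index)},  # hypothetical labels
            "ports": [{"port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(headless_service("wangtao", "ps", 0), indent=2))
```

Because each TensorFlow task is addressed individually in the cluster spec, load balancing across pods is not wanted here, and a selector-based headless Service gives every task a stable DNS name.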
kubelet --cluster-dns=10.254.0.2 --cluster-domain=tensorflow.pro.vivo
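With the `--cluster-domain` shown above, Service DNS names follow the standard Kubernetes pattern `<service>.<namespace>.svc.<cluster-domain>`. A small sketch, with hypothetical service and namespace names:

```python
def service_fqdn(service: str, namespace: str,
                 cluster_domain: str = "tensorflow.pro.vivo") -> str:
    """Fully qualified DNS name of a Service under the given cluster domain."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # Hypothetical ps task in a hypothetical "wangtao" namespace.
    print(service_fqdn("wangtao-ps-0", "wangtao"))
```

Pods in the same namespace can also use the short name (here `wangtao-ps-0`), since the kubelet adds the namespace search domains to each pod's resolver configuration.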
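Take Care with Dockerfile ENV
In Dockerfile: `ENV CLASSPATH .:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)`
env in the container then shows the literal, unexpanded string: `CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)`
Dockerfile ENV does not perform command substitution, so `$(...)` is never evaluated.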
Workaround: set CLASSPATH in the container command, where a shell expands `$(...)`.
Pod.spec.containers.command: [ /bin/sh, -c, export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); ]
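One way to generate such a command list when rendering the pod spec is sketched below. The helper name and the training command are hypothetical; the point is only that the export and the actual program go into a single `/bin/sh -c` string so that the shell evaluates `$(...)` at container start.

```python
from typing import Dict, List


def shell_command(exports: Dict[str, str], train_cmd: str) -> List[str]:
    """Wrap exports plus the real command into one /bin/sh -c invocation."""
    # Values are deliberately left unquoted so the shell expands $(...) inside them.
    export_part = "; ".join(f"export {key}={value}" for key, value in exports.items())
    return ["/bin/sh", "-c", f"{export_part}; {train_cmd}"]


if __name__ == "__main__":
    classpath = (".:/usr/lib/jvm/java-1.8.0/lib/tools.jar:"
                 "$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob)")
    # "python train.py" is a placeholder for the real training command.
    print(shell_command({"CLASSPATH": classpath}, "python train.py"))
```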
PS Process Hang
https://github.com/tensorflow/tensorflow/issues/4713
Shared Queue? Maybe.
- TensorFlow on Kubernetes: run each cluster in its own namespace; once the worker Jobs have succeeded, kill the hung PS Deployment by deleting the namespace.
- DevOps/TaaS: watch Events for succeeded workers, then clean up through Kubernetes.
Reuse PV?
https://github.com/kubernetes/kubernetes/issues/48609
When you delete a PVC, the corresponding PV becomes Released.
- Make the PV available to everybody: delete PV.Spec.ClaimRef. Such a PV can be bound to any PVC (assuming that capacity, access mode and selectors match).
- Make the PV available to a specific PVC: pre-fill PV.Spec.ClaimRef with a pointer to the PVC. Leave PV.Spec.ClaimRef.UID empty, as the PVC does not need to exist at this point and you don't know the PVC's UID. This PV can be bound only to the specified PVC.
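The two options can be sketched as transformations on a PV manifest held as a Python dict. The helper names and the sample PV are hypothetical; applying the result to a cluster (for example with kubectl) is left out.

```python
import copy
from typing import Any, Dict


def make_available_to_all(pv: Dict[str, Any]) -> Dict[str, Any]:
    """Drop spec.claimRef so the Released PV can bind to any matching PVC."""
    out = copy.deepcopy(pv)
    out.get("spec", {}).pop("claimRef", None)
    return out


def reserve_for_claim(pv: Dict[str, Any], namespace: str, claim: str) -> Dict[str, Any]:
    """Pre-fill spec.claimRef for one PVC; uid is omitted because it is not known yet."""
    out = copy.deepcopy(pv)
    out.setdefault("spec", {})["claimRef"] = {"namespace": namespace, "name": claim}
    return out


if __name__ == "__main__":
    released = {
        "kind": "PersistentVolume",
        "metadata": {"name": "pv-demo"},
        "spec": {"claimRef": {"namespace": "old", "name": "old-pvc", "uid": "1234"}},
    }
    print(make_available_to_all(released))
    print(reserve_for_claim(released, "wangtao", "train-data"))
```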
Global secret
1. Create the secret: `kubectl create secret docker-registry harborsecret --docker-server=registry.vivo.xyz:4443 --docker-username=admin --docker-password=harbor1234 --docker-email=xxxxxxx@vivo.com`
2. Run `kubectl get secret harborsecret -o yaml` and copy the `.dockercfg` value.
3. In tfcluster_template.yaml.jinja, add the harbor secret YAML to each namespace with that `.dockercfg` value, so the pods can pull the image.
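For reference, the `.dockercfg` value is base64-encoded JSON keyed by the registry address. The sketch below rebuilds such a Secret manifest per namespace; it assumes the legacy `kubernetes.io/dockercfg` secret type that kubectl produced in this Kubernetes generation (newer versions emit `.dockerconfigjson` instead), and all argument values in the example are placeholders.

```python
import base64
import json
from typing import Any, Dict


def dockercfg_secret(name: str, namespace: str, server: str,
                     username: str, password: str, email: str) -> Dict[str, Any]:
    """Build a legacy dockercfg pull secret for one namespace."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    cfg = {server: {"username": username, "password": password,
                    "email": email, "auth": auth}}
    data = base64.b64encode(json.dumps(cfg).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockercfg",
        "data": {".dockercfg": data},
    }


if __name__ == "__main__":
    secret = dockercfg_secret("harborsecret", "demo-namespace", "registry.example:4443",
                              "user", "not-a-real-password", "user@example.com")
    print(json.dumps(secret, indent=2))
```

Pods then reference the secret by name under `imagePullSecrets` in their spec.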
Thinking worker PS? chief task restart.
Todo List
- NVIDIA GPU
- GPU IO?
- TensorFlow GPU
- TaaS: TensorFlow Cluster, TensorFlow Serving on K8S, Jupyter Notebook, Tensorboard
Q&A Thank you for your time. WaltonWang @ CSDN, OSCHINA