Previously, we described the distributed computation features of TensorFlow and gave some examples of problems, such as cluster management and process lifecycle management, that the TensorFlow framework does not readily solve on its own. These problems happen to be what Kubernetes is good at. This article uses Kubernetes to build a distributed computation platform for TensorFlow. Its features and architecture are shown below:
First of all, we need to create a Kubernetes cluster on which to build and run TensorFlow. Its architecture is shown below:
The nodes are virtual machines running Ubuntu 16.04 Server LTS. Their details are as follows:
| IP Address | Role | CPU | RAM |
|---|---|---|---|
| 172.16.35.13 | Master, Etcd, NFS | 2 vCore | 4 GB |
$ git clone https://github.com/kairen/kube-ansible.git
$ cd kube-ansible
$ ./setup-vagrant -n 3 -m 4096
Cluster Size: 1 master, 3 node.
VM Size: 1 vCPU, 4096 MB
VM Info: ubuntu16, virtualbox
Start deploying?(y): y
Once the commands above complete, we can use vagrant to log in to each virtual machine and operate Kubernetes:
$ vagrant ssh master1
$ kubectl get node
NAME      STATUS         AGE   VERSION
master1   Ready,master   2h    v1.6.2
node1     Ready          2h    v1.6.2
node2     Ready          2h    v1.6.2
node3     Ready          2h    v1.6.2
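Before moving on, it can help to confirm that no node is stuck in a non-Ready state. The following is a minimal sketch, not part of kube-ansible: the function name `not_ready_nodes` is made up for illustration, and it assumes the column layout shown in the output above (node name first, status second).

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose STATUS column
# does not start with "Ready" (for example "NotReady").
# Assumes NAME is the first column and STATUS the second, as shown above.
not_ready_nodes() {
  kubectl get node --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (on master1):
#   not_ready_nodes    # prints nothing when every node reports Ready
```

An empty result suggests the cluster is ready for the next steps; any printed name is a node worth inspecting first.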
After the Kubernetes cluster is deployed, we install an NFS server on node master1 to provide shared storage, using the following commands:
$ sudo apt-get install -y nfs-kernel-server
$ sudo mkdir -p /var/nfs
$ cat <<EOF | sudo tee /etc/exports
/var/nfs *(rw,sync,no_subtree_check)
EOF
$ sudo systemctl restart nfs-server.service
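To check that the export is actually being served, one option is to query the server's export list. The sketch below assumes the `showmount` utility is available on the machine running it; the function name `export_is_visible` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if PATH appears in the export list of SERVER.
# `showmount -e` prints a header line followed by one "<path> <clients>"
# line per export, so the header (line 1) is skipped.
export_is_visible() {
  server=$1
  path=$2
  showmount -e "$server" |
    awk -v p="$path" 'NR > 1 && $1 == p { found = 1 } END { exit !found }'
}

# Usage with the values from this article:
#   export_is_visible 172.16.35.13 /var/nfs && echo "export OK"
```

Note that the worker nodes generally also need an NFS client (the `nfs-common` package on Ubuntu) to mount the export; whether the deployment above installs it is not stated here.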
With Kubernetes and the NFS server in place, we can proceed to build the TensorFlow application. Below is an example:
$ git clone https://github.com/kairen/workshop413.git
$ cd workshop413/k8s/lab01
$ ls
client.yml  pv.yml  tensorboard.yml  worker.yml
Here we will see four files.
First, we need to create a Persistent Volume and a Persistent Volume Claim to provide shared storage. Edit pv.yml and modify the following content:
nfs:
  path: /var/nfs
  server: 172.16.35.13
After that, create the objects with the kubectl command and check their status:
$ kubectl create -f pv.yml
$ kubectl get pv,pvc
If “Bound” appears in the STATUS field, the Persistent Volume Claim has been bound to the Persistent Volume, and Pods can mount the shared storage through the claim.
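For readers who want to see the overall shape of such a manifest, the sketch below writes an illustrative NFS-backed Persistent Volume and a matching claim to a separate file. It is not the actual pv.yml from the workshop repository: the object names (`tf-nfs-pv`, `tf-nfs-pvc`), the 10Gi size, and the access mode are assumptions. Only the `nfs` section reuses the values given above.

```shell
#!/bin/sh
# Write an illustrative manifest to pv-example.yml.
# NOT the workshop's pv.yml: names, size and access mode are assumptions.
cat > pv-example.yml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tf-nfs-pv            # hypothetical name
spec:
  capacity:
    storage: 10Gi            # assumed size
  accessModes:
    - ReadWriteMany          # lets several Pods mount the volume read-write
  nfs:
    path: /var/nfs
    server: 172.16.35.13
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-nfs-pvc           # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF

# To try it against a cluster:
#   kubectl create -f pv-example.yml
#   kubectl get pv,pvc
```

The claim binds to the volume because the requested size and access mode match what the volume offers. A cluster with a default StorageClass may behave differently.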
Next, we need to create the Master service. Edit client.yml and modify the content as follows:
externalIPs:
- 172.16.35.9
Then use the kubectl command to create the Master service and check its status:
$ kubectl create -f client.yml
$ kubectl get svc,po
NAME             CLUSTER-IP       EXTERNAL-IP     PORT(S)          AGE
svc/kubernetes   126.96.36.199    <none>          443/TCP          3h
svc/tf-client    188.8.131.52     ,172.16.35.9    8888:30805/TCP   1m

NAME                           READY   STATUS    RESTARTS   AGE
po/tf-client-998136869-3f8x9   1/1     Running   0          1m
When the status shows “Running”, we can open http://172.16.35.9:8888 in a browser.
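Instead of re-running `kubectl get` by hand until the Pod is up, the wait can be scripted. The following is a minimal sketch under stated assumptions: the function name `wait_for_running` and the 120-second default timeout are made up for illustration, and the Pod name in the usage line is the one from the output above (it will differ in other deployments).

```shell
#!/bin/sh
# Hypothetical helper: poll until the named Pod reports phase "Running",
# giving up after roughly TIMEOUT seconds (default 120, an arbitrary choice).
# Note: phase "Running" means the containers have started; it does not
# guarantee that the application inside is already serving requests.
wait_for_running() {
  pod=$1
  timeout=${2:-120}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    phase=$(kubectl get po "$pod" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Running" ]; then
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  return 1
}

# Usage, with the Pod name from the output above:
#   wait_for_running tf-client-998136869-3f8x9 && echo "Pod is running"
```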
Although we now have the Master service, we still have no workers to actually execute the computation. We therefore use the worker.yml file to create the Worker service. Edit the file and modify the following content:
externalIPs:
- 172.16.35.9
After editing, use the following commands to create the Worker service and check its status:
$ kubectl create -f worker.yml
$ kubectl get svc,po
...
po/tf-worker-1891917189-q9v58   1/1   Running   0   1m
After creating the Worker, we can open http://172.16.35.9:8888/ in a browser to enter Jupyter and work with it interactively:
Finally, we create the visualization application. The tensorboard.yml file describes and creates TensorBoard. Edit and modify its content as follows:
externalIPs:
- 172.16.35.9
After that, use the following commands to create the TensorBoard service and check its status:
$ kubectl create -f tensorboard.yml
$ kubectl get po,svc
NAME                            READY   STATUS    RESTARTS   AGE
po/tf-dashboard-237766735-m9qt5 1/1     Running   0          1m
..
NAME               CLUSTER-IP       EXTERNAL-IP     PORT(S)          AGE
svc/tf-dashboard   184.108.40.206   ,172.16.35.9    6006:32621/TCP   1m
...
Once these commands finish, we can again use a browser to open http://172.16.35.9:6006/ and enter TensorBoard to view the visualized computation.
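The two web endpoints can also be checked from a terminal. The sketch below assumes `curl` is installed; the function name `service_reachable` and the 5-second timeout are illustrative choices. It treats any 2xx or 3xx response as a sign that the service is answering, since a web UI may respond with a redirect rather than a page.

```shell
#!/bin/sh
# Hypothetical helper: succeed if URL answers with a 2xx or 3xx status
# within 5 seconds. curl reports 000 when no HTTP response was received.
service_reachable() {
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1")
  case "$code" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage with the addresses used in this article:
#   service_reachable http://172.16.35.9:8888/ && echo "Jupyter is reachable"
#   service_reachable http://172.16.35.9:6006/ && echo "TensorBoard is reachable"
```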
Using Kubernetes to manage TensorFlow makes system management simpler and easier. When the cluster fails, Kubernetes monitoring and logging help us locate the problem quickly. To manage the distributed processes, we no longer need to start the TensorFlow Master and Workers by hand; we only need to describe Kubernetes Deployments and Services in YAML files to create the services rapidly. As for storage, the official TensorFlow distributed framework does not provide shared storage, so data cannot be shared out of the box. The Kubernetes Persistent Volume and Persistent Volume Claim features, however, let different applications share data. Running distributed TensorFlow on Kubernetes also opens up many other interesting scenarios; the TensorFlow on Kubernetes Workshop, for example, demonstrates four of them. Please refer to the link above if interested.
Written by 白凱仁, software engineer at 迎棧科技