TensorFlow on Kubernetes (Part 2)

In the previous article, we introduced TensorFlow's distributed computation features and gave some examples of problems, such as cluster management and process lifecycle management, that the TensorFlow framework cannot easily solve by itself. These problems happen to be what Kubernetes is good at. This article uses Kubernetes to construct a distributed computation platform for TensorFlow. Its features and architecture are shown below:

Kubernetes

First of all, we need to create a Kubernetes cluster on which to build and run TensorFlow. Its architecture is listed below:

The nodes run as virtual machines, with Ubuntu 16.04 Server LTS as the operating system. Below is the detailed information:

We use the kube-ansible tool to deploy the test environment; before starting, we need to download and install the Vagrant and VirtualBox tools:

$ git clone https://github.com/kairen/kube-ansible.git
$ cd kube-ansible
$ ./setup-vagrant -n 3 -m 4096
Cluster Size: 1 master, 3 node.
VM Size: 1 vCPU, 4096 MB
VM Info: ubuntu16, virtualbox
Start deploying?(y): y

Once the above commands complete, we can use vagrant to log in to each virtual machine and operate Kubernetes:

$ vagrant ssh master1
$ kubectl get node
NAME         STATUS         AGE       VERSION
master1      Ready,master   2h        v1.6.2
node1        Ready          2h        v1.6.2
node2        Ready          2h        v1.6.2
node3        Ready          2h        v1.6.2
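
Optionally, before moving on, we can also confirm that the control-plane components report as healthy:

$ kubectl get cs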

After the Kubernetes cluster is deployed, we can install an NFS server on the master1 node to provide shared storage, using the following commands:

$ sudo apt-get install -y nfs-kernel-server
$ sudo mkdir -p /var/nfs
$ cat <<EOF | sudo tee /etc/exports
/var/nfs *(rw,sync,no_subtree_check)
EOF
$ sudo systemctl restart nfs-server.service
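
To verify that the directory is actually exported, we can query the NFS server from any node that has the nfs-common package installed (172.16.35.13 is master1's address, the same one referenced in pv.yml below):

$ showmount -e 172.16.35.13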

TensorFlow Applications

With Kubernetes deployed, we can proceed to build the TensorFlow application. Below is an example:

$ git clone https://github.com/kairen/workshop413.git
$ cd workshop413/k8s/lab01
$ ls
client.yml  pv.yml  tensorboard.yml  worker.yml

Here we see four files.

First, we need to create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) to provide shared storage. Edit pv.yml and modify the following content:

nfs:
  path: /var/nfs
  server: 172.16.35.13
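
For reference, here is a minimal sketch of what such a pv.yml could look like; the actual manifest in the repository may differ, and the resource names and capacity here are assumptions:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: tf-volume          # assumed name; match whatever pv.yml actually uses
spec:
  capacity:
    storage: 5Gi           # assumed size
  accessModes:
  - ReadWriteMany          # NFS allows many readers and writers
  nfs:
    path: /var/nfs
    server: 172.16.35.13
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-volume
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi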

After saving the file, create the resources and check their status with the kubectl command:

$ kubectl create -f pv.yml
$ kubectl get pv,pvc

If the STATUS field shows “Bound”, the PersistentVolumeClaim has been successfully bound to the volume, and Pods can now mount it.

Next, we need to create the Master service. Edit client.yml and modify the content as follows:

externalIPs:
- 172.16.35.9
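
In context, externalIPs sits inside the Service spec. A rough sketch is shown here; the service name and port come from the output below, while the selector is an assumption:

apiVersion: v1
kind: Service
metadata:
  name: tf-client
spec:
  type: NodePort
  selector:
    app: tf-client         # assumed label on the client Pods
  ports:
  - port: 8888             # Jupyter port, per the output below
  externalIPs:
  - 172.16.35.9            # reachable address of a cluster node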

Then, use the kubectl command to create the Master service and check the status:

$ kubectl create -f client.yml
$ kubectl get svc,po
NAME               CLUSTER-IP        EXTERNAL-IP     PORT(S)          AGE
svc/kubernetes     192.160.0.1       <none>          443/TCP          3h
svc/tf-client      192.175.52.16     172.16.35.9     8888:30805/TCP   1m
NAME                              READY     STATUS    RESTARTS   AGE
po/tf-client-998136869-3f8x9      1/1       Running   0          1m

When the status shows “Running”, we can go ahead and browse to http://172.16.35.9:8888.

Even though we have the Master service, we still have no worker nodes actually executing the computation. Therefore, we use the worker.yml file to create the Worker service. Edit it and modify the following content:

externalIPs:
- 172.16.35.9
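
Besides the Service, worker.yml also deploys the worker Pods themselves. Here is a rough sketch of what its Deployment part might look like; the image, labels, and port are assumptions, and the important detail is mounting the PersistentVolumeClaim from pv.yml so all workers share the same storage:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-worker
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tf-worker
    spec:
      containers:
      - name: tf-worker
        image: tensorflow/tensorflow:1.1.0   # assumed image
        ports:
        - containerPort: 2222                # conventional TensorFlow gRPC port
        volumeMounts:
        - name: tf-volume
          mountPath: /var/nfs
      volumes:
      - name: tf-volume
        persistentVolumeClaim:
          claimName: tf-volume               # the PVC created from pv.yml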

After finishing the above edit, create the Worker service and check its status with the following commands:

$ kubectl create -f worker.yml
$ kubectl get svc,po
...
po/tf-worker-1891917189-q9v58     1/1       Running   0          1m

After creating the Worker, we can open http://172.16.35.9:8888/ in a browser and use Jupyter for execution and interaction:
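
Before running a job, we can confirm from master1 that the Worker service actually has endpoints behind it (assuming the Service is named tf-worker, matching the Pod name prefix above):

$ kubectl get endpoints tf-worker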

Finally, we will create the visualization application. We use the tensorboard.yml file to describe and create TensorBoard. Edit and modify the content as follows:

externalIPs:
- 172.16.35.9
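
The key point of tensorboard.yml is that the TensorBoard container points its log directory at the shared volume, so it can read the event files written by the training jobs. Here is a rough sketch of the container section, where the image and paths are assumptions:

containers:
- name: tf-dashboard
  image: tensorflow/tensorflow:1.1.0        # assumed image
  command: ["tensorboard", "--logdir=/var/nfs", "--host=0.0.0.0"]
  ports:
  - containerPort: 6006                     # TensorBoard default port
  volumeMounts:
  - name: tf-volume                         # the shared PVC-backed volume
    mountPath: /var/nfs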

After that, use the following command to create the TensorBoard service, and then check the status:

$ kubectl create -f tensorboard.yml
$ kubectl get po,svc
NAME                              READY     STATUS    RESTARTS   AGE
po/tf-dashboard-237766735-m9qt5   1/1       Running   0          1m
...
NAME               CLUSTER-IP        EXTERNAL-IP     PORT(S)          AGE
svc/tf-dashboard   192.174.81.212    172.16.35.9     6006:32621/TCP   1m
...

After finishing the above commands, we can once again open http://172.16.35.9:6006/ in a browser and use TensorBoard to view the visualized training process.

Summary

By using Kubernetes to manage TensorFlow, we make the system simpler to administer and easier to use. When something in the cluster fails, Kubernetes Monitoring and Logging let us quickly find out where the problem is. To manage the distributed processes, we no longer need to start the TensorFlow Master and Worker by hand; we simply describe Kubernetes Deployments and Services in YAML files to rapidly create the services. As for storage, the official TensorFlow distributed framework does not provide shared storage, so data cannot be shared between nodes; the Kubernetes PersistentVolume and PersistentVolumeClaim features, however, let us share data and make it available to different applications. Running distributed TensorFlow on Kubernetes opens up many interesting scenarios, such as the TensorFlow on Kubernetes Workshop, which demonstrates four of them. Please refer to the link above if interested.

End

by 白凱仁 (Kai-Jen Pai), Software Engineer at 迎棧科技 (inwinSTACK)
