Using Kubeflow to manage TensorFlow applications (Part 2)


Deploying Kubeflow

This section explains how to use ksonnet to deploy Kubeflow onto a Kubernetes cluster. First, initialize the ksonnet application directory on the master node:

$ ks init my-kubeflow

If you encounter the following error, you can create a GitHub token to access the GitHub API. Please refer to GitHub rate limiting errors for details:

ERROR GET https://api.github.com/repos/ksonnet/parts/commits/master: 403 API rate limit exceeded for 122.146.93.152.
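One common workaround is to export the token through the `GITHUB_TOKEN` environment variable before running `ks`, which ksonnet picks up for its GitHub API calls. The token value below is a placeholder:

```shell
# Create a personal access token on GitHub (no scopes are needed for
# public repositories), then export it so ks uses it for API calls.
# <your-token> is a placeholder for your actual token.
$ export GITHUB_TOKEN=<your-token>
```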

Next, install the Kubeflow packages into the application directory:

$ cd my-kubeflow
$ ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
$ ks pkg install kubeflow/core
$ ks pkg install kubeflow/tf-serving
$ ks pkg install kubeflow/tf-job

Then build the Kubeflow core component, which includes JupyterHub and the TensorFlow job controller:

$ kubectl create namespace kubeflow
$ kubectl create clusterrolebinding tf-admin --clusterrole=cluster-admin --serviceaccount=default:tf-job-operator
$ ks generate core kubeflow-core --name=kubeflow-core --namespace=kubeflow

# Enable collection of anonymous usage metrics; skip these two commands if you do not want this
$ ks param set kubeflow-core reportUsage true
$ ks param set kubeflow-core usageId $(uuidgen)

# Deploy Kubeflow
$ ks param set kubeflow-core jupyterHubServiceType LoadBalancer
$ ks apply default -c kubeflow-core

Please refer to Usage Reporting for more details on usage collection.

Check the results of the Kubeflow component deployment after completion:

$ kubectl -n kubeflow get po -o wide
NAME                                 READY  STATUS   RESTARTS  AGE  IP              NODE
ambassador-7956cf5c7f-6hngq          2/2    Running  0         34m  10.244.41.132   kube-gpu-node1
ambassador-7956cf5c7f-jgxnd          2/2    Running  0         34m  10.244.152.134  kube-gpu-node2
ambassador-7956cf5c7f-jww2d          2/2    Running  0         34m  10.244.41.133   kube-gpu-node1
spartakus-volunteer-8c659d4f5-bg7kn  1/1    Running  0         34m  10.244.152.135  kube-gpu-node2
tf-hub-0                             1/1    Running  0         34m  10.244.152.133  kube-gpu-node2
tf-job-operator-78757955b-2jbdh      1/1    Running  0         34m  10.244.41.131   kube-gpu-node1

At this point, you can log in to the Jupyter Notebook, but first you need to modify the Kubernetes Service as follows:

$ kubectl -n kubeflow get svc -o wide
NAME              TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)   AGE  SELECTOR
ambassador        ClusterIP  10.101.157.91   <none>       80/TCP    45m  service=ambassador
ambassador-admin  ClusterIP  10.107.24.138   <none>       8877/TCP  45m  service=ambassador
k8s-dashboard     ClusterIP  10.111.128.104  <none>       443/TCP   45m  k8s-app=kubernetes-dashboard
tf-hub-0          ClusterIP  None            <none>       8000/TCP  45m  app=tf-hub
tf-hub-lb         ClusterIP  10.105.47.253   <none>       80/TCP    45m  app=tf-hub

# Edit the svc to change its Type to LoadBalancer, and add externalIPs set to the Master IP
$ kubectl -n kubeflow edit svc tf-hub-lb
...
spec:
  type: LoadBalancer
  externalIPs:
  - 172.22.132.41
...
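Instead of editing the Service interactively, the same change can be applied non-interactively with `kubectl patch`; a sketch, assuming the same `tf-hub-lb` Service and master IP as above:

```shell
# Switch tf-hub-lb to LoadBalancer and pin it to the master IP in one step
$ kubectl -n kubeflow patch svc tf-hub-lb \
    -p '{"spec": {"type": "LoadBalancer", "externalIPs": ["172.22.132.41"]}}'
```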

Testing Kubeflow

Before starting the test, create an NFS PV for Kubeflow Jupyter to use:

$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    server: 172.22.132.41
    path: /nfs-data
EOF
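For reference, a PersistentVolumeClaim that would bind to this PV might look like the following sketch. The claim name here is illustrative; JupyterHub's spawner typically creates per-user claims on its own:

```shell
# Illustrative only: a claim matching the PV's size and access mode.
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
```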

Once it’s done, connect to http://Master_IP and enter your account information/password to log in.

After logging in, click the Start My Server button to open the Spawner options for the server. By default, several images are available:

  • CPU: gcr.io/kubeflow-images-staging/tensorflow-notebook-cpu
  • GPU: gcr.io/kubeflow-images-staging/tensorflow-notebook-gpu

This article also uses these GCP-built images for testing (the GPU image is currently built against CUDA 8).

If your CUDA version is different, please modify the GCP TensorFlow Notebook image or the Kubeflow TensorFlow Notebook image and rebuild it.
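A rough sketch of such a rebuild is shown below. The base image tag, package choices, and registry name are assumptions, not the official build; note also that prebuilt TensorFlow pip wheels target specific CUDA versions, so a CUDA 9.1 setup may require building TensorFlow from source instead:

```shell
# Hypothetical rebuild for CUDA 9.1 + cuDNN 7; adjust versions to your drivers
$ cat <<EOF > Dockerfile
FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install jupyter
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root"]
EOF
$ docker build -t my-registry/tensorflow-notebook-gpu:cuda9.1 .
```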

If you are using a GPU, execute the following command to confirm whether GPU resources can be allocated:

$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME              GPU
kube-gpu-master1  <none>
kube-gpu-node1    1
kube-gpu-node2    1
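A quick way to exercise the allocatable GPUs shown above is a throwaway pod that requests one through the `nvidia.com/gpu` resource. The pod name and image tag below are illustrative:

```shell
# Illustrative: schedule a pod with one GPU and run nvidia-smi once
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.1-cudnn7-runtime-ubuntu16.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```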

Finally, click Spawn to finish establishing the server, as shown below:

Test this using the CPU first. Because this article has CUDA 9.1 + cuDNN 7 installed, you will have to build your own image.

Next, wait for Kubernetes to download the image file, then it will start normally as shown below:

After it starts, click New > Python 3 to create a Notebook and paste the following sample program:

from __future__ import print_function
import tensorflow as tf

hello = tf.constant('Hello TensorFlow!')
s = tf.Session()
print(s.run(hello))

If executed correctly, the following figure will be shown:

If you want to shut down the notebook server, you can click Control Panel.

In addition, since Kubeflow installs the TF Operator to manage TFJobs, you can manually create a job through Kubernetes:

$ kubectl create -f https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/tf_job.yaml

$ kubectl get po
NAME                             READY  STATUS   RESTARTS  AGE
example-job-ps-qq6x-0-pdx7v      1/1    Running  0         5m
example-job-ps-qq6x-1-2mpfp      1/1    Running  0         5m
example-job-worker-qq6x-0-m5fm5  1/1    Running  0         5m
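The `tf_job.yaml` above defines a TFJob custom resource. A minimal sketch of what such a manifest looks like is shown here; the field names follow the v1alpha1 API of that era and the image name is illustrative, so treat this as an outline rather than the exact upstream example:

```shell
# Minimal TFJob sketch (v1alpha1-era schema; image name is a placeholder)
$ cat <<EOF | kubectl create -f -
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-job
spec:
  replicaSpecs:
  - replicas: 2
    tfReplicaType: PS
    template:
      spec:
        containers:
        - name: tensorflow
          image: my-registry/tf-sample:latest
        restartPolicy: OnFailure
  - replicas: 1
    tfReplicaType: WORKER
    template:
      spec:
        containers:
        - name: tensorflow
          image: my-registry/tf-sample:latest
        restartPolicy: OnFailure
EOF
```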

If you want to remove Kubeflow related components from the Kubernetes cluster, you can input the following:

$ ks delete default -c kubeflow-core

Prev: Using kubefed to build a Kubernetes Federation (On-premises)

Next: Kubernetes NVIDIA Device Plugins

Recap: Using Kubeflow to manage TensorFlow applications (Part 1)

by 白凱仁, Software Engineer at inwinSTACK
