Using Kubeflow to manage TensorFlow applications (Part 2)

Deploying Kubeflow

This section explains how to use ksonnet to deploy Kubeflow onto a Kubernetes cluster. First, initialize the ksonnet application directory on the master node:

$ ks init my-kubeflow

If you encounter the following error, you can create a GitHub personal access token to raise the GitHub API rate limit. Please refer to GitHub rate limiting errors for details.

ERROR GET 403 API rate limit exceeded for
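ksonnet can use a GitHub token for authenticated API requests, which have a much higher rate limit. A minimal sketch (the token value is a placeholder you must replace with your own):

```shell
# Generate a personal access token on GitHub (no special scopes are needed
# for public repositories), then export it so ksonnet picks it up.
# <your-token> is a placeholder, not a real value.
export GITHUB_TOKEN=<your-token>
```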

Next, install the Kubeflow packages into the application directory:

$ cd my-kubeflow
$ ks registry add kubeflow
$ ks pkg install kubeflow/core
$ ks pkg install kubeflow/tf-serving
$ ks pkg install kubeflow/tf-job

Then build the Kubeflow core component, which contains JupyterHub and the TensorFlow job controller:

$ kubectl create namespace kubeflow
$ kubectl create clusterrolebinding tf-admin --clusterrole=cluster-admin --serviceaccount=default:tf-job-operator
$ ks generate core kubeflow-core --name=kubeflow-core --namespace=kubeflow

# Enable collection of anonymous usage metrics; skip these two commands if you do not want this.
$ ks param set kubeflow-core reportUsage true
$ ks param set kubeflow-core usageId $(uuidgen)

# Deploy Kubeflow
$ ks param set kubeflow-core jupyterHubServiceType LoadBalancer
$ ks apply default -c kubeflow-core

Please refer to Usage Reporting for more detailed information about usage collection.

Check the results of the Kubeflow component deployment after completion:

$ kubectl -n kubeflow get po -o wide
NAME                                  READY   STATUS    RESTARTS   AGE   IP   NODE
ambassador-7956cf5c7f-6hngq           2/2     Running   0          34m        kube-gpu-node1
ambassador-7956cf5c7f-jgxnd           2/2     Running   0          34m        kube-gpu-node2
ambassador-7956cf5c7f-jww2d           2/2     Running   0          34m        kube-gpu-node1
spartakus-volunteer-8c659d4f5-bg7kn   1/1     Running   0          34m        kube-gpu-node2
tf-hub-0                              1/1     Running   0          34m        kube-gpu-node2
tf-job-operator-78757955b-2jbdh       1/1     Running   0          34m        kube-gpu-node1

At this point you can log in to the Jupyter Notebook, but first you need to modify the Kubernetes Service with the following commands:

$ kubectl -n kubeflow get svc -o wide
NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE   SELECTOR
ambassador         ClusterIP                <none>        80/TCP     45m   service=ambassador
ambassador-admin   ClusterIP                <none>        8877/TCP   45m   service=ambassador
k8s-dashboard      ClusterIP                <none>        443/TCP    45m   k8s-app=kubernetes-dashboard
tf-hub-0           ClusterIP   None         <none>        8000/TCP   45m   app=tf-hub
tf-hub-lb          ClusterIP                <none>        80/TCP     45m   app=tf-hub

# Edit the svc to change Type to LoadBalancer, and add externalIPs set to the Master IP.
$ kubectl -n kubeflow edit svc tf-hub-lb
...
spec:
  type: LoadBalancer
  externalIPs:
  - ...

Testing Kubeflow

Before starting the test, create an NFS PV for Kubeflow Jupyter to use:

$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    server:
    path: /nfs-data
EOF
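To check that the PV above is actually bindable before spawning a notebook, you can create a matching claim by hand. This is a sketch: the claim name `nfs-pvc` and the requested size are illustrative, not from the original article, and the JupyterHub spawner normally creates its own claim for you.

```shell
# Create a PVC that the nfs-pv above can satisfy (name "nfs-pvc" is illustrative).
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

# The claim should report STATUS "Bound" once it attaches to nfs-pv.
kubectl -n kubeflow get pvc nfs-pvc
```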

Once it’s done, connect to http://Master_IP and enter a username and password to log in.

After logging in, click the Start My Server button to bring up the Spawner options for the server. By default, several images are available:


This article uses the following GCP-built images for testing (the GPU images currently target CUDA 8):

If your CUDA version differs, modify the GCP TensorFlow Notebook image or the Kubeflow TensorFlow Notebook image and rebuild it.
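As a rough sketch of such a rebuild (the base image tag and registry names below are placeholders chosen for illustration, not the official Kubeflow build files), you can rebase a notebook image on a TensorFlow build that matches your CUDA version and push it to your own registry:

```shell
# Hypothetical rebuild: pick a TensorFlow GPU base image whose tag matches
# your installed CUDA/cuDNN version, add the JupyterHub single-user bits,
# and push the result. Image names here are placeholders.
cat > Dockerfile <<'EOF'
# Choose a tag built against your CUDA version.
FROM tensorflow/tensorflow:1.8.0-gpu
RUN pip install --no-cache-dir jupyterhub
EOF

docker build -t my-registry/tensorflow-notebook-gpu:custom .
docker push my-registry/tensorflow-notebook-gpu:custom
```

You would then enter the pushed image name in the Spawner's image field instead of one of the defaults.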

If you are using a GPU, execute the following commands to confirm if resources can be allocated:

$ kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME               GPU
kube-gpu-master1   <none>
kube-gpu-node1     1
kube-gpu-node2     1

Finally, click Spawn to finish establishing the server, as shown below:

Test this using the CPU first. Because this article's environment has CUDA 9.1 + cuDNN 7 installed, you will have to build your own image.

Next, wait for Kubernetes to download the image file, then it will start normally as shown below:

After it starts, click New > Python 3 to create a Notebook and paste the following sample program:

from __future__ import print_function

import tensorflow as tf

hello = tf.constant('Hello TensorFlow!')
s = tf.Session()
print(s.run(hello))

If it executes correctly, you will see output as in the following figure:

If you want to shut down the server, you can click Control Panel.

In addition, since Kubeflow installs the TF Operator to manage TFJobs, you can also create a job manually through Kubernetes.

$ kubectl create -f
$ kubectl get po
NAME                              READY   STATUS    RESTARTS   AGE
example-job-ps-qq6x-0-pdx7v       1/1     Running   0          5m
example-job-ps-qq6x-1-2mpfp       1/1     Running   0          5m
example-job-worker-qq6x-0-m5fm5   1/1     Running   0          5m
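The manifest file passed to kubectl create -f is not shown above. A minimal sketch that would produce pods like those listed (two PS replicas and one worker), assuming the v1alpha1 TFJob API current at the time and a generic TensorFlow image as placeholders, might look like:

```shell
# Sketch of a TFJob manifest matching the pod names above.
# The API version, image, and container command are assumptions for illustration.
cat <<EOF | kubectl create -f -
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-job
spec:
  replicaSpecs:
  - replicas: 2
    tfReplicaType: PS
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.8.0
        restartPolicy: OnFailure
  - replicas: 1
    tfReplicaType: WORKER
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.8.0
        restartPolicy: OnFailure
EOF
```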

If you want to remove the Kubeflow components from the Kubernetes cluster, run the following:

$ ks delete default -c kubeflow-core


by 白凱仁, Software Engineer at inwinSTACK (迎棧科技)
