Using Kubeflow to manage TensorFlow applications (Part 1)

Kubeflow is Google’s open source machine learning toolkit. Its goal is to simplify running machine learning workloads on Kubernetes. Kubeflow does not try to rebuild other services; instead, it aims to provide a straightforward way to develop ML systems and deploy them to a variety of infrastructures. Because it is built on Kubernetes, Kubeflow can run anywhere Kubernetes runs.

Kubeflow provides the following components:

  1. JupyterHub, a tool for creating and managing interactive Jupyter notebooks.
  2. TensorFlow Training Controller, which can be configured to use either CPUs or GPUs and scaled to the size of the cluster with a single setting.
  3. TensorFlow Serving, a container for serving trained models.

Kubeflow aims to make machine learning faster and easier by relying on what Kubernetes already does well:

  1. Implement simple, repeatable, portable deployments on different infrastructures (Laptop <-> ML rig <-> Training Cluster <-> Production cluster).
  2. Deploy and manage loosely coupled microservices.
  3. Scale according to demand.

Node Information

This test environment uses a physical machine running Ubuntu 16.04 Server.

Prerequisites

Before using Kubeflow, you will need to ensure that you have met the following conditions:

All nodes have the required versions of the NVIDIA driver, CUDA, Docker, and NVIDIA Docker installed correctly. Please refer to Installing Nvidia Docker 2.

(Optional) All GPU nodes have cuDNN v7.1.2 for CUDA 9.1 installed. Download it from NVIDIA cuDNN, then unpack it and copy the files into the CUDA directories:

$ tar xvf cudnn-9.1-linux-x64-v7.1.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
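To confirm which cuDNN version ended up in the CUDA include directory, one option is to read the version macros from the copied header. The sketch below assumes the header defines `CUDNN_MAJOR`, `CUDNN_MINOR`, and `CUDNN_PATCHLEVEL` (as cuDNN 7 headers do) and that it lives at the path used by the `cp` command above. `cudnn_version` is a hypothetical helper name, not an NVIDIA tool.

```shell
#!/usr/bin/env bash
# cudnn_version HEADER: print the cuDNN version recorded in a cudnn.h
# header as MAJOR.MINOR.PATCH. Hypothetical helper, for illustration only.
cudnn_version() {
  awk '
    $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
    $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
    $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
    END { print major "." minor "." patch }
  ' "$1"
}

# Location assumed from the cp commands above; skipped if the file is absent.
header=/usr/local/cuda/include/cudnn.h
if [ -f "$header" ]; then
  echo "cuDNN $(cudnn_version "$header")"
fi
```

If the printed version differs from the archive you unpacked, an older header is probably still in place.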

All nodes are deployed with kubeadm as a Kubernetes v1.9+ cluster. Please refer to Deploying Kubernetes Clusters with kubeadm.
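The v1.9+ requirement can be checked in a script with a small version comparison. This is a minimal sketch that assumes GNU `sort -V` is available (it is on Ubuntu 16.04). `version_ge` and the sample value of `cluster_version` are hypothetical; substitute the version your own cluster reports.

```shell
#!/usr/bin/env bash
# version_ge HAVE WANT: succeed if HAVE >= WANT. A leading "v" is ignored.
# Relies on GNU sort -V (version sort): the smaller of the two versions
# sorts first, so HAVE >= WANT exactly when WANT comes out on top.
version_ge() {
  local have="${1#v}" want="${2#v}"
  [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}

# Hypothetical value; replace it with the version reported by your cluster.
cluster_version="v1.9.6"
if version_ge "$cluster_version" "1.9.0"; then
  echo "Kubernetes $cluster_version meets the v1.9+ requirement"
else
  echo "Kubernetes $cluster_version is too old" >&2
fi
```

A plain string comparison would rank 1.10 below 1.9, which is why the sketch uses version sort.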

The Kubernetes cluster requires NVIDIA Device Plugins to be installed. Please refer to Installing Kubernetes NVIDIA Device Plugins.
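Once the device plugin is running, a throwaway pod that requests one GPU is a quick way to confirm that the scheduler can see the hardware. The sketch below only writes the manifest. The pod name, file name, and image tag are assumptions for illustration, while `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin.

```shell
#!/usr/bin/env bash
# Write a minimal pod manifest that requests one GPU and runs nvidia-smi.
# File name, pod name, and image tag are illustrative assumptions.
manifest="${GPU_TEST_MANIFEST:-gpu-test.yaml}"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.1-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

echo "wrote $manifest"
# Next steps, run by hand on the master:
#   kubectl create -f gpu-test.yaml
#   kubectl logs gpu-test
```

If the pod stays in Pending, the node is most likely not advertising any `nvidia.com/gpu` resources yet.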

Create an NFS server and install nfs-common on each Kubernetes node, then create a PersistentVolume (PV) in Kubernetes for Kubeflow to use:

# run on the master

$ sudo apt-get update && sudo apt-get install -y nfs-server

$ sudo mkdir /nfs-data
$ echo "/nfs-data *(rw,sync,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports

$ sudo /etc/init.d/nfs-kernel-server restart

# run on each node

$ sudo apt-get update && sudo apt-get install -y nfs-common
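With the `/nfs-data` export in place, the PV can be described in a small manifest. The following is a sketch rather than the exact manifest used in this environment: the PV name, the 30Gi capacity, the access mode, and the server address (a documentation placeholder) are all assumptions to adjust. Only the `/nfs-data` path comes from the steps above.

```shell
#!/usr/bin/env bash
# Write a PersistentVolume manifest backed by the /nfs-data export.
# NFS_SERVER must be set to the address of your NFS server; the default
# below is a documentation placeholder, not a real host.
nfs_server="${NFS_SERVER:-192.0.2.10}"
pv_manifest="${PV_MANIFEST:-nfs-pv.yaml}"

cat > "$pv_manifest" <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 30Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    server: ${nfs_server}
    path: /nfs-data
EOF

echo "wrote $pv_manifest"
# Next steps, run by hand on the master:
#   kubectl create -f nfs-pv.yaml
#   kubectl get pv
```

The PV should show up as Available until a claim binds to it.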

Install ksonnet 0.9.2 as follows:

$ wget

$ tar xvf ks_0.9.2_linux_amd64.tar.gz

$ sudo cp ks_0.9.2_linux_amd64/ks /usr/local/bin/

$ ks version
version: 0.9.2
jsonnet version: v0.9.5
client-go version: 1.8

To be continued

by 白凱仁, software engineer at 迎棧科技 (inwinSTACK)

