TensorFlow on Kubernetes (Part 1)

TensorFlow on Kubernetes (Part 1)

In 2016, there was a human vs machine contest; AlphaGo defeated a South Korean professional Go player Lee SeDol, which brought global attention. After that, some even think that machine will replace humans gradually in the near future. However, many technology giants, such as Google, Amazon, Microsoft, and Facebook, have already invested a lot in Artificial Intelligence (AI), while other well-known hardware providers, such as Intel, Qualcomm, and Nvidia, have also invested in AI for quite some time; nowadays even in automobile industry, many companies have conducted research on autonomous driving technology. No doubt about it that AI has become national campaign of all industries.

After some of tech giants utilized open source one after another to release their original source code associated with deep learning and machine learning frameworks, many small- and mid-cap companies are able to participate in this AL national activity, hoping to extract valuable models from existing resources to increase their own advantages by using this technology. Among the open-source deep learning frameworks, the most popular one is TensorFlow by Google Brian Team. It used open source to acquire the sharing power of social media, which accelerated the development of machine learning in a very short time.

This article aims to describe how TensorFlow conducts distributed data training. Meanwhile, as enterprise adoption of container technology increases, traditional IT architecture was gradually replaced. This article will also explain how to use Google Kubernetes Engine (GKE), a Google container system, to integrate with TensorFlow.


TensorFlow is a library that uses Dataflow Graphs to describe the process of numerical computation. The nodes of Data Flow Graphs indicate mathematical computation, whereas Edges represent interrelated multidimensional data arrays, which is Tensor. The flexibility of the architecture allows users to compute on different platforms. For example, one or multiple CPU (or GPU), and smart handheld devices or servers.

TensorFlow can be literally split into two parts which are listed below:

  • Tensor: n-dimensional arrays. For example, 1-dimensional array is a vector, 2-dimensional array is a matrix.
  • Flow: dataflow of calculation process in a graph

Take the following figure as an example. it illustrates how to establish a graph to describe computation:

The above example explains graph computation. The following figure shows the execution result.

Distributed Computing

TensorFlow cluster include one or multiple job(s). Each job can be divided into one or multiple task(s). Simply put, cluster is a set of jobs, while Job is a set of tasks. A cluster object mainly focuses on particular level object, such as training neural network, operating multiple machine in parallel. A cluster object can be defined by tf.train.ClusterSpec.

As mentioned above, TensorFlow cluster is a set of tasks; each task is a service, and each service can be divided into two parts, Master Service and Worker Service as listed below, to provide to client for operation.

  • Master Service: A RPC service that remotely connect to a set of distributed devices. It mainly provides tensorflow::Session interface, and communicates with tasks via Worker Service.
  • Worker Service: A RPC logic that utilize local devices (CPU or GPU) to compute some graphs through worker_service.proto interface for implementation.

TensorFlow Server is a process that run a tf.train.Server instance. It is a cluster member, which differentiates between Master and Worker.

A job of TensorFlow can be divided into multiple tasks with the same functionalities. Job can also be classified as Parameter Server and Worker. Both of their functionalities are listed below:

  • Parameter Server: Primarily update variables based on gradient, and save them in tf.Variable. It can be explained that it is a variable that only saves model. It also stores the replica of a variable.  
  • Worker: or compute node. Mainly execute intensive graph operation resources, and compute gradient according to the variables, and store graph replicas.

The replications of TensorFlow contain two modes: In-graph and Between-graph. Their differences are described below:

  • In-graph: has only one client (primarily calling tf::Session process), and specify the variables and op to the corresponding job to finish, therefore data distribution is done by only one client. The setup of this method is simple.
  • Between-graph: multiple independent clients establish same graph (including variables), and map these parameters to ps through tf.train.replica_device_setter.

The training variables were stored at Parameter Server, and data do not need to be allocated. Database shard will be saved at each individual compute node; therefore, each compute node is able to compute by itself. After computation, simply inform Parameter Server about the variables to be updated. It is suitable for Terabyte data storage, which save a large amount of data transmission time, also is a recommended mode of deep learning.

In addition to the differences between replication modes, the training mode also differentiate synchronous from asynchronous approaches as follows:

  • Synchronous: Each graph replica reads the same value of Parameter Server, and compute gradient in parallel.

Each time updating

  • Asynchronous: after gradient computed, parameter can be updated. Different replica has different coordination progress. Therefore, computation resources will be fully utilized.

Distributed Computation Problems:

In native TensorFlow, even though it already supports features of distributed calculation, there still are three major problems existing:

  • Cluster orchestration: while managing cluster, any processes require to start up or shut down manually. While malfunctions occur, we don’t know exactly what the problem is. It does not have monitoring and logging features, and it cannot be scheduled.
  • Process lifecycle: We cannot distinguish if a process finishes normally or exits abnormally. All processes require to be managed manually. When encountering large-scale cluster management, it will cause operators overload.
  • Shared storage: Since TensorFlow does not have shared storage system by default, data need to be imported manually or downloaded via network, which results in data transmission problem.

In summary, if we want to use TensorFlow to manage a trained cluster, we need to use its related tools to accomplish our objective. From the perspective of bottleneck problems, Kubernetes is the best choice for container cluster management system.


by 白凱仁/迎棧科技軟體工程師


Select list(s)*