In 2016, a human-versus-machine contest drew global attention when AlphaGo defeated the South Korean professional Go player Lee Sedol. Afterward, some even began to predict that machines would gradually replace humans in the near future. Technology giants such as Google, Amazon, Microsoft, and Facebook had already invested heavily in Artificial Intelligence (AI), and well-known hardware vendors such as Intel, Qualcomm, and Nvidia had been investing in AI for quite some time as well; nowadays even in the automobile industry, many companies conduct research on autonomous driving technology. There is no doubt that AI has become a race across all industries.
After several tech giants open-sourced their deep learning and machine learning frameworks one after another, many small and mid-sized companies were able to join this AI movement, hoping to build valuable models on top of existing resources and gain a competitive edge from the technology. Among the open-source deep learning frameworks, the most popular is TensorFlow, developed by the Google Brain Team. By going open source, it harnessed the sharing power of the community, which accelerated the development of machine learning in a very short time.
This article describes how TensorFlow conducts distributed data training. Meanwhile, as enterprise adoption of container technology increases, traditional IT architectures are gradually being replaced. This article will therefore also explain how to integrate TensorFlow with Google Kubernetes Engine (GKE), Google's managed container orchestration service.
TensorFlow is a library that uses dataflow graphs to describe numerical computation. The nodes of a dataflow graph represent mathematical operations, while the edges represent the multidimensional data arrays, called tensors, that flow between them. The flexibility of this architecture allows users to run computation on different platforms: one or more CPUs (or GPUs), smart handheld devices, or servers.
A TensorFlow program can be split into the two parts listed below: constructing the computation graph, and executing the graph in a session.
Take the following figure as an example; it illustrates how to establish a graph that describes a computation:
The example above shows how a graph describes a computation. The following figure shows the execution result.
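The graph-construction and execution steps described above can be sketched in code. This is a minimal illustration using the TensorFlow 1.x API (accessed through `tf.compat.v1` so it also runs under TF 2.x); the constants are arbitrary example values, not taken from the article's figures.

```python
import tensorflow.compat.v1 as tf  # TF 1.x API, also available under TF 2.x
tf.disable_v2_behavior()

# Part 1: construct the dataflow graph. Nodes are operations, edges carry
# tensors; nothing is computed yet at this point.
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
c = tf.add(a, b, name="c")

# Part 2: execute the graph in a session.
with tf.Session() as sess:
    result = sess.run(c)
print(result)  # → 7.0
```

Only `sess.run` triggers actual computation; building the graph merely records the operations and their data dependencies.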
A TensorFlow cluster includes one or more jobs, and each job can be divided into one or more tasks. Simply put, a cluster is a set of jobs, and a job is a set of tasks. A cluster object is typically dedicated to a particular high-level objective, such as training a neural network with multiple machines operating in parallel. A cluster can be defined with tf.train.ClusterSpec.
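A minimal sketch of defining such a cluster with tf.train.ClusterSpec follows; the host addresses are hypothetical placeholders, and the job layout (one "ps" task, two "worker" tasks) is just an example.

```python
import tensorflow as tf

# A cluster with two jobs: a "ps" job with one task and a "worker" job with
# two tasks. Each entry maps a job name to its tasks' network addresses.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

print(sorted(cluster.jobs))         # → ['ps', 'worker']
print(cluster.num_tasks("worker"))  # → 2
```

The ClusterSpec only describes the cluster's membership; it does not start any processes by itself.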
As mentioned above, a TensorFlow cluster is a set of tasks. Each task is a service, and each service consists of two parts, the Master Service and the Worker Service listed below, which are exposed to clients for operation.
A TensorFlow server is a process that runs a tf.train.Server instance. It is a member of a cluster, and is distinguished by its role as Master or Worker.
A TensorFlow job can be divided into multiple tasks with the same functionality. Jobs are typically classified as Parameter Server and Worker; their responsibilities are listed below:
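In code, every task starts its own tf.train.Server and then branches on its job name, as in this sketch (TF 1.x API via `tf.compat.v1`; the localhost ports are hypothetical, and in practice `job_name`/`task_index` would come from command-line flags rather than being hard-coded):

```python
import tensorflow.compat.v1 as tf  # TF 1.x API, also available under TF 2.x
tf.disable_v2_behavior()

# Hypothetical two-task cluster on local ports 2222/2223.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# Hard-coded here for illustration; normally parsed from flags.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # a parameter server just stores and serves variables
else:
    # Workers build the training graph here and connect sessions to
    # server.target, a gRPC address such as grpc://localhost:2223.
    print(server.target)
```

The same script is launched on every machine; only the job name and task index differ per process.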
TensorFlow replication comes in two modes, in-graph and between-graph. Their differences are described below:
The training variables are stored on the Parameter Server, and the training data does not need to be centrally distributed: each compute node holds its own data shard, so every node can compute on its own. After computation, a node simply informs the Parameter Server which variables to update. This mode is suitable for terabyte-scale datasets, since it saves a great deal of data transmission time, and it is the recommended mode for deep learning.
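A between-graph setup along these lines can be sketched with tf.train.replica_device_setter, which pins variables onto the "ps" job while each worker builds its own copy of the graph (TF 1.x API via `tf.compat.v1`; cluster addresses and the variable shape are hypothetical):

```python
import tensorflow.compat.v1 as tf  # TF 1.x API, also available under TF 2.x
tf.disable_v2_behavior()

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each worker runs this same construction code (between-graph replication).
# replica_device_setter places variables on the "ps" job, so all workers
# read and update the same shared parameters.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights = tf.Variable(tf.zeros([10]), name="weights")

print(weights.device)  # the variable is placed on the parameter server
```

Operations not placed by the setter (such as the forward pass) stay on the worker's own device, so only variable reads and gradient updates cross the network to the Parameter Server.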
In addition to the differences between replication modes, training also differs between synchronous and asynchronous update approaches, as follows:
In synchronous mode, each update waits until the gradients from all workers have been collected and aggregated, then applies a single update to the parameters; in asynchronous mode, each worker applies its own gradients as soon as they are ready, without waiting for the others.
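The synchronous approach can be expressed with tf.train.SyncReplicasOptimizer, as in this sketch (TF 1.x API via `tf.compat.v1`; the learning rate and replica counts are arbitrary example values):

```python
import tensorflow.compat.v1 as tf  # TF 1.x API, also available under TF 2.x
tf.disable_v2_behavior()

# Synchronous mode: the wrapper collects gradients from
# `replicas_to_aggregate` workers, aggregates them, and applies one update.
# Using the plain optimizer by itself instead gives asynchronous mode,
# where every worker applies its own gradients immediately.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=2, total_num_replicas=2)
```

Workers then call `sync_opt.minimize(loss)` exactly as they would on the plain optimizer; the synchronization barrier is handled inside the wrapper.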
Distributed Computation Problems:
Although native TensorFlow already supports distributed computation, three major problems remain:
In summary, if we want to manage a TensorFlow training cluster, we need additional tooling to accomplish this. From the perspective of these bottlenecks, Kubernetes is the best choice of container cluster management system.
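As a sketch of that direction, one worker task of the training job could be declared as a Kubernetes Job object, letting Kubernetes handle scheduling and restarts. The image tag, script path, and flag names below are hypothetical illustrations, not taken from the article:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-worker-0              # hypothetical name, one Job per task
spec:
  template:
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:1.4.0   # hypothetical image tag
        command: ["python", "/app/train.py",
                  "--job_name=worker", "--task_index=0"]
        ports:
        - containerPort: 2222    # the port the tf.train.Server listens on
      restartPolicy: OnFailure   # restart the task if training crashes
```

One such manifest per parameter-server and worker task, plus a Service for each to give it a stable address, is the basic pattern for running distributed TensorFlow on Kubernetes or GKE.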
Written by 白凱仁, Software Engineer at inwinSTACK (迎棧科技)