I am trying to train a DNN model using TensorFlow. My script has two variables: one is a dense feature and the other is a sparse feature. Each minibatch pulls the full dense feature and pulls the specified rows of the sparse feature using embedding_lookup_sparse; the feedforward can only begin after the sparse feature is ready. I run my script with 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job using the TensorFlow timeline and found that one of the 20 parameter servers is very slow compared to the other 19. There is no dependency between the different parts of the trainable variables. I am not sure if there is a bug or a limitation, e.g. that TensorFlow can only queue 40 fan-out requests. Any idea how to debug this? Thanks in advance.

tensorflow timeline profiling
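For reference, a minimal sketch of the setup described above; the variable names, shapes, and the use of tf.train.replica_device_setter are assumptions for illustration, not the actual script:

```python
import tensorflow as tf  # TF 1.x-style distributed graph code assumed

NUM_PS = 20  # hypothetical: one /job:ps task per parameter server

# replica_device_setter places variables round-robin across the PS tasks.
with tf.device(tf.train.replica_device_setter(ps_tasks=NUM_PS)):
    # Dense variable, pulled in full every minibatch.
    dense_weights = tf.get_variable("dense_weights", shape=[1000, 256])

    # Large sparse (embedding) variable; only the rows referenced by the
    # minibatch are pulled, via embedding_lookup_sparse.
    sparse_weights = tf.get_variable("sparse_weights", shape=[10000000, 64])

dense_input = tf.placeholder(tf.float32, shape=[None, 1000])
sparse_ids = tf.sparse_placeholder(tf.int64)

# The feedforward can only start once the sparse lookup has finished.
sparse_embed = tf.nn.embedding_lookup_sparse(
    sparse_weights, sparse_ids, sp_weights=None, combiner="sum")
hidden = tf.concat([tf.matmul(dense_input, dense_weights), sparse_embed], axis=1)
```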
3 Answers
#1
1
It sounds like you might have exactly 2 variables, one is stored at PS0 and the other at PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them at separate parameter servers.
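As a rough illustration of this suggestion, a minimal sketch using tf.fixed_size_partitioner; the variable name, shape, and shard count are assumptions:

```python
import tensorflow as tf  # TF 1.x API assumed

# With a partitioner, the single large variable is split into shards that
# tf.train.replica_device_setter can place on different PS tasks.
with tf.variable_scope(
        "embeddings",
        partitioner=tf.fixed_size_partitioner(num_shards=20)):
    sparse_weights = tf.get_variable(
        "sparse_weights",
        shape=[10000000, 64],  # hypothetical vocabulary size and dimension
        initializer=tf.truncated_normal_initializer(stddev=0.01))

# embedding_lookup_sparse accepts the partitioned variable directly and
# issues lookups against the shards in parallel.
sparse_ids = tf.sparse_placeholder(tf.int64)
embedded = tf.nn.embedding_lookup_sparse(
    sparse_weights, sparse_ids, sp_weights=None, combiner="sum")
```

tf.min_max_variable_partitioner or tf.variable_axis_size_partitioner can be used instead if you want to bound the per-shard size rather than fix the shard count.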
#2
1
This is kind of a hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it on chrome://trace).
The steps you have to perform are:
- download the TensorFlow source and check out a correct branch (r0.12, for example)
- modify the only place that calls the SetTimelineLabel method inside executor.cc:
  - instead of only recording non-transferable nodes, you want to record Send/Recv nodes as well
  - be careful to call SetTimelineLabel only once inside NodeDone, as it sets the text string of a node, which is parsed later from a Python script
- build TensorFlow from the modified source
- modify the model code (for example, inception_distributed_train.py) to use Timeline and graph metadata correctly (see the sketch below)
Then you can run the training and retrieve one JSON file per iteration! :)
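For the last step, here is a minimal sketch of the usual Timeline / run-metadata pattern inside the training loop; sess, train_op, and step are assumed to exist in the surrounding code:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Inside the training loop; sess, train_op and step come from the surrounding code.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Dump one Chrome-trace JSON file per iteration for offline analysis.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline_step_%d.json" % step, "w") as f:
    f.write(tl.generate_chrome_trace_format())
```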
#3
0
Some suggestions that were too big for a comment:
You can't see data transfer in the timeline because the tracing of Send/Recv is currently turned off; some discussion here -- https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly build that is 5 days old or newer) you can turn on verbose logging with export TF_CPP_MIN_VLOG_LEVEL=1, and it shows second-level timestamps (see here about higher granularity).
So with VLOG, perhaps you can use the messages generated by this line to see the times at which Send ops are generated.
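If you would rather enable this from the script than from the shell, a small sketch; the variable generally has to be set before TensorFlow is imported so the C++ runtime picks it up:

```python
import os

# Set the verbose-logging level before importing TensorFlow so the C++
# runtime reads it; Send/Recv-related log lines then appear on stderr.
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "1"

import tensorflow as tf  # noqa: E402 (import deliberately after the env var)
```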