Distributed training with a Keras model and tf.Estimator

Date: 2021-06-17 13:54:29

Following the example here, one can create a tf.Estimator from an existing Keras model. At the beginning, that page states that by doing so, one gains the benefits of the tf.Estimator, such as increased training speed due to distributed training. Sadly, when I run the code, only one of the GPUs in my system is used for computation; therefore, there is no speed increase. How exactly can I use distributed training with an Estimator built from a Keras model?

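For context, the conversion in question looks roughly like this (a minimal sketch, assuming the TF 1.x-era `tf.keras.estimator.model_to_estimator` API; the model and `model_dir` path are illustrative):

```python
import tensorflow as tf

# Any compiled Keras model works here; ResNet50 is just an example.
model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Wrap the Keras model as an Estimator; training then goes through
# estimator.train(input_fn=...) instead of model.fit(...).
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    model_dir='/tmp/model_dir')  # hypothetical checkpoint directory
```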

I stumbled upon this method:


distributed_model = tf.keras.utils.multi_gpu_model(model, gpus=2)

which sounds like it would take care of this problem. But it is not working at the moment, as it creates a graph that uses the get_slice(..) method defined in tensorflow/python/keras/_impl/keras/utils/training_utils.py, and this method fails with the following error message:


Traceback (most recent call last):
  File "hub.py", line 75, in <module>
    estimator = create_model_estimator()
  File "hub.py", line 67, in create_model_estimator
    estimator = tf.keras.estimator.model_to_estimator(keras_model=new_model, custom_objects={'tf': tf}, model_dir=model_dir, config=run_config)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 302, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 231, in _save_first_checkpoint
    custom_objects)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 109, in _clone_and_build_model
    model = models.clone_model(keras_model, input_tensors=input_tensors)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py", line 1557, in clone_model
    return _clone_functional_model(model, input_tensors=input_tensors)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py", line 1451, in _clone_functional_model
    output_tensors = topology._to_list(layer(computed_tensor, **kwargs))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 258, in __call__
    output = super(Layer, self).__call__(inputs, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 696, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/layers/core.py", line 630, in call
    return self.function(inputs, **arguments)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/utils/training_utils.py", line 156, in get_slice
    shape = array_ops.shape(data)
NameError: name 'array_ops' is not defined

So, what can I do to use both of my GPUs to train a model with a tf.Estimator object?


Edit: By switching the version/build of TensorFlow, I was able to get rid of the previous error message, but now I get this one:


Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value res2a_branch2c/bias
         [[Node: res2a_branch2c/bias/_482 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1142_res2a_branch2c/bias", _device="/job:localhost/replica:0/task:0/device:GPU:0"](res2a_branch2c/bias)]]
         [[Node: bn4a_branch2a/beta/_219 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_878_bn4a_branch2a/beta", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Maybe this is connected to this issue?


1 solution

#1



You should set up a distributed running config.


You can refer to this demo of the TensorFlow high-level API (Estimator) for distributed training:


https://github.com/colinwke/wide_deep_demo
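A minimal sketch of what such a distributed running config could look like, assuming a TF 1.8+ build where `tf.estimator.RunConfig` accepts a distribution strategy (the `model` variable and `model_dir` path are illustrative, not from the answer):

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto each available GPU
# and averages gradients across the replicas (in-graph replication).
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)

# Passing the strategy as train_distribute makes estimator.train()
# run on both GPUs instead of just one.
config = tf.estimator.RunConfig(train_distribute=strategy)

estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,           # a compiled Keras model
    model_dir='/tmp/model_dir',  # hypothetical checkpoint directory
    config=config)
```

This config-based approach replaces the need for `multi_gpu_model`: the replication is handled by the Estimator itself rather than by rewriting the Keras graph.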
