Error when using the grpc+mpi protocol in distributed TensorFlow

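Question:

I just compiled TensorFlow (master) with MPI support, and I now specify the "grpc+mpi" protocol in the tf.train.Server object. However, whenever I try to start the training process, exactly one worker always fails with this error: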

F ./tensorflow/contrib/mpi/mpi_utils.h:47] Failed to convert worker name to MPI index: ps:0:0

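Each time I reproduce the error, a different worker fails to "convert". What it fails to convert is actually an attribute of the parameter server (ps:0:0), yet the message calls it a "worker" name, which seems rather suspicious to me.

With the "standard" protocol "grpc", the whole training procedure works fine.

Each worker, as well as the single parameter server, runs on its own dedicated machine (no shared machines). The OpenMPI version is 2.1.1.

How should I go about debugging this? Unfortunately, I know very little about MPI.

Thanks,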

Mat

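Answer

I ran into the same problem when using TensorFlow with MPI support. The cause was that I was not using mpirun to launch the training program.

For example, here is my training script, mpi_train.sh: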

#!/bin/bash

# Pick this node's job name and task id from its short hostname.
host=$(hostname -s)
if [[ $host == "node-1" ]]; then
    job_name=ps
    task_id=0
elif [[ $host == "node-2" ]]; then
    job_name=worker
    task_id=0
elif [[ $host == "node-3" ]]; then
    job_name=worker
    task_id=1
fi

cd /test/models/inception

# Start the trainer with the MPI-enabled protocol.
bazel-bin/inception/imagenet_distributed_train \
    --batch_size=32 \
    --data_dir=/test/data/ILSVRC2012 \
    --job_name=${job_name} \
    --task_id=${task_id} \
    --ps_hosts=10.0.20.14:2276 \
    --worker_hosts=10.0.20.15:2276,10.0.20.16:2276 \
    --protocol=grpc+mpi \
    --max_steps=1020
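Since the failure came from starting the script without mpirun, a guard at the top of mpi_train.sh could catch that mistake early. The following is only a sketch under one assumption: the launcher is Open MPI, whose mpirun exports OMPI_COMM_WORLD_RANK to each process it starts. The function name require_mpirun is hypothetical and is not part of TensorFlow or Open MPI.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if this process was started by Open MPI's mpirun.
# Assumption: Open MPI's mpirun sets OMPI_COMM_WORLD_RANK for every launched process.
require_mpirun() {
    if [ -z "${OMPI_COMM_WORLD_RANK:-}" ]; then
        echo "error: not started by mpirun, but grpc+mpi is requested" >&2
        return 1
    fi
    return 0
}

# Intended use near the top of mpi_train.sh:
#   require_mpirun || exit 1
```

With another MPI implementation the environment variable would have a different name, so the check would need adjusting.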

I should have launched my training script with mpirun:

mpirun -host 10.0.0.14,10.0.0.15,10.0.0.16 /test/mpi_train.sh
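Before running the full training job, it may help to confirm that mpirun really starts one process on each machine. The sketch below prints the rank, the world size, and the short hostname of each process. It assumes Open MPI, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE. The function name report_mpi_slot is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print where this MPI process landed.
# Assumption: Open MPI's mpirun exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE.
# Outside mpirun both values are reported as "unset".
report_mpi_slot() {
    echo "rank ${OMPI_COMM_WORLD_RANK:-unset} of ${OMPI_COMM_WORLD_SIZE:-unset} on $(hostname -s)"
}

report_mpi_slot
```

Saved as a script and passed to the same mpirun command in place of /test/mpi_train.sh, this should print one line per host. A line showing "unset" would suggest the process was not started through mpirun.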
