Distributed TensorFlow with HDFS fails with: No lease on File does not exist. Holder DFSClient_NONMAPREDUCE
-
How the distributed TensorFlow script is launched:
https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/hadoop.md -
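The driver code of the distributed program uses multiprocessing to start the ps, master, and worker processes separately:
multiprocessing.Process(target=start_dist, args=(params, ps_index, 'ps', '')).start()
time.sleep(10.0)  # adding this line resolved the error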
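The launch line above can be expanded into a minimal sketch of a staggered start. This is an illustration under assumptions, not the original script: `launch_staggered` and the `jobs` list are hypothetical names, and the `start_dist` body here is only a stand-in that keeps the argument order of the original call.

```python
import multiprocessing
import time


def start_dist(params, task_index, job_name, extra):
    # Hypothetical stand-in for the original start_dist, which would start
    # the TensorFlow server or training loop for this role.
    print(f"starting {job_name}:{task_index}")


def launch_staggered(params, jobs, delay_seconds=10.0):
    """Start one process per (job_name, task_index), pausing after each start."""
    processes = []
    for job_name, task_index in jobs:
        proc = multiprocessing.Process(
            target=start_dist, args=(params, task_index, job_name, "")
        )
        proc.start()
        processes.append(proc)
        # Give this process time to finish touching the shared HDFS
        # directory before the next one is started.
        time.sleep(delay_seconds)
    return processes
```

It would typically be called under an `if __name__ == "__main__":` guard, for example `launch_staggered(params, [("ps", 0), ("master", 0), ("worker", 0)])`. A fixed delay only makes the overlap unlikely; it does not guarantee that the processes never touch the directory at the same time.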
The following error is reported during startup:
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /root/Deles/pipeline.config (inode 59875): File does not exist. Holder DFSClient_NONMAPREDUCE_184200389_1 does not have any open files.
-
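Analysis: a likely cause is that several processes read and create the same directory at the same time: https://www.cnblogs.com/wangxiaowei/p/3317479.html
Fix: when launching the processes, sleep for 10 seconds after starting each one (time.sleep(10.0)).
-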
To view the logs with TensorBoard, the environment from step 1 above must be set up in the same way:
CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob) tensorboard --logdir hdfs://master:9000/root/Deles --port 6007 &
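The same launch can be sketched from Python, which is convenient when TensorBoard is started by the driver script itself. This is only a sketch: all four helper names are hypothetical, and it assumes the `hadoop` and `tensorboard` executables are installed as in the shell command above.

```python
import os
import subprocess


def tensorboard_command(logdir, port):
    """Build the tensorboard argument list for a log directory."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


def env_with_hadoop_classpath(classpath, base_env=None):
    """Return a copy of the environment with CLASSPATH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CLASSPATH"] = classpath
    return env


def hadoop_classpath(hadoop_home):
    """Run `hadoop classpath --glob` and return its output (needs Hadoop)."""
    hadoop_bin = os.path.join(hadoop_home, "bin", "hadoop")
    result = subprocess.run(
        [hadoop_bin, "classpath", "--glob"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def launch_tensorboard(logdir, port, hadoop_home):
    """Start TensorBoard in the background with the Hadoop classpath set."""
    env = env_with_hadoop_classpath(hadoop_classpath(hadoop_home))
    # Popen returns immediately, similar to the trailing "&" in the shell.
    return subprocess.Popen(tensorboard_command(logdir, port), env=env)
```

For the command above this would be `launch_tensorboard("hdfs://master:9000/root/Deles", 6007, os.environ["HADOOP_HOME"])`.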