Distributed TensorFlow with HDFS fails with: No lease on File does not exist. Holder DFSClient_NONMAPREDUCE

  1. How the distributed TensorFlow script is launched:
    https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/hadoop.md

  2. The distributed program uses multiprocessing to start the ps, master, and worker tasks as separate processes:

    multiprocessing.Process(target=start_dist, args=(params, ps_index, 'ps', '')).start()
    time.sleep(10.0)  # adding this pause after each start resolved the error
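
The staggered start above could be wrapped in a small helper. This is only a minimal sketch: `launch_staggered` and the task tuples are hypothetical names, and the `start_dist` body here is a placeholder for the real function, which the post does not show.

```python
import multiprocessing
import time


def start_dist(params, task_index, job_name, extra):
    """Placeholder (assumption): the real start_dist starts one TensorFlow task."""
    print(f"starting {job_name}:{task_index}")


def launch_staggered(tasks, target, delay_s=10.0,
                     process_factory=multiprocessing.Process, sleep=time.sleep):
    """Start one process per argument tuple, pausing delay_s seconds after each start.

    process_factory and sleep are injectable so the launch order can be
    checked without spawning real processes or actually waiting.
    """
    procs = []
    for args in tasks:
        proc = process_factory(target=target, args=args)
        proc.start()
        procs.append(proc)
        sleep(delay_s)  # give this task time to finish touching HDFS first
    return procs
```

A real launch would call something like `launch_staggered([(params, 0, 'ps', ''), (params, 0, 'master', ''), (params, 0, 'worker', '')], start_dist)` from under an `if __name__ == "__main__":` guard.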
    

The following error is reported during startup:

    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /root/Deles/pipeline.config (inode 59875): File does not exist. Holder DFSClient_NONMAPREDUCE_184200389_1 does not have any open files.

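  1. Analysis: a likely cause is that several processes read and create the same directory at the same time, so one process removes or overwrites a file while another still holds the write lease on it: https://www.cnblogs.com/wangxiaowei/p/3317479.html
    Fix: when starting the processes with multiprocessing, sleep for 10 seconds after each start.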

  2. Viewing the logs with TensorBoard requires the same environment setup as in step 1 above:

    CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob) tensorboard --logdir hdfs://master:9000/root/Deles --port 6007 &
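
The same launch can be sketched from Python, which makes the CLASSPATH step explicit. The function names below are hypothetical, and the sketch assumes `hadoop` lives under `HADOOP_HOME/bin` and `tensorboard` is on the `PATH`, as in the shell command above.

```python
import os
import subprocess


def hadoop_classpath_cmd(hadoop_home):
    """Command that prints the expanded Hadoop classpath."""
    return [os.path.join(hadoop_home, "bin", "hadoop"), "classpath", "--glob"]


def tensorboard_cmd(logdir, port):
    """TensorBoard command line for the given log directory and port."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


def launch_tensorboard(hadoop_home, logdir, port):
    """Start TensorBoard with CLASSPATH set so it can read hdfs:// paths."""
    classpath = subprocess.check_output(
        hadoop_classpath_cmd(hadoop_home), text=True).strip()
    env = dict(os.environ, CLASSPATH=classpath)  # copy, do not mutate os.environ
    return subprocess.Popen(tensorboard_cmd(logdir, port), env=env)
```

For the setup in this post the call would be `launch_tensorboard(os.environ["HADOOP_HOME"], "hdfs://master:9000/root/Deles", 6007)`.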
    

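Returning to the lease error: a fixed 10-second sleep works, but it only guesses how long the first task needs. One possible alternative, not what this post used, is to let a single designated task write the shared file while the others poll until it exists. The sketch below is hypothetical throughout (`wait_for_path`, `prepare_shared_file`, and the `is_chief` flag are illustrative names). It defaults to `os.path.exists` so it runs locally; for HDFS paths one would presumably pass `tf.io.gfile.exists` as `exists`.

```python
import os
import time


def wait_for_path(path, exists=os.path.exists, timeout_s=60.0, poll_s=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Block until exists(path) is true; return False if timeout_s elapses first."""
    deadline = clock() + timeout_s
    while not exists(path):
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


def prepare_shared_file(path, is_chief, write_file, **wait_kwargs):
    """Only the chief task writes path; every other task waits for it to appear."""
    if is_chief:
        write_file(path)
        return True
    return wait_for_path(path, **wait_kwargs)
```

With this shape only one client ever opens the file for writing, which avoids the situation where several writers compete for a lease on the same path.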