Uncaught exception while reverting partial writes to file ...(Too many open files)

While running jobs on a new Spark cluster, any moderately large workload involving a fair number of map and reduce tasks would fail with the following error:

 Uncaught exception while reverting partial writes to file ...(Too many open files)
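Before digging into cluster settings, it can help to confirm how many file descriptors a process is actually holding. A minimal check on Linux; the `PID` here is a stand-in (the current shell) rather than a real Spark executor pid from the original post:

```shell
# Count the open file descriptors of a process.
# PID is illustrative: using the current shell as a stand-in
# for a Spark executor pid.
PID=$$
ls /proc/$PID/fd | wc -l
```

Comparing this count against `ulimit -n` shows how close the process is to the per-process open-files ceiling.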

At first I assumed the Spark cluster simply didn't have enough memory, since the same job ran fine both on another, larger cluster and on my own single machine with only 6 GB. But after increasing the cluster's memory and the number of workers and executors, the same error kept appearing. When I asked in a tech chat group, someone suggested the CentOS system limits might not be set properly. Following that lead, I compared the system parameters of the problematic Spark cluster against a working one:

[Screenshot: `ulimit -a` output — problematic cluster (left) vs. working cluster (right)]
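The side-by-side comparison above comes from `ulimit -a` on each machine; a sketch of the commands, assuming a Bash login shell:

```shell
# Print all per-user resource limits for the current session
ulimit -a

# The two values that turned out to matter here
ulimit -u   # max user processes (only 1024 on the broken cluster)
ulimit -n   # open files
```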

It turned out that on the problematic cluster, max user processes (-u) was only 1024. After raising max user processes to its maximum value and rebooting, the original error disappeared.

After the change (while I was at it, I also raised the open files limit to the maximum):

[Screenshot: `ulimit -a` output after the change]

For how to change these CentOS system parameters, see blog post 1 and blog post 2.
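The persistent version of this change typically goes through `/etc/security/limits.conf` (and on CentOS 6, the 1024-process cap often comes from `/etc/security/limits.d/90-nproc.conf` overriding it). A sketch with illustrative values, not taken from the original post:

```
# /etc/security/limits.conf -- illustrative values
# <domain>  <type>  <item>    <value>
*           soft    nproc     65535
*           hard    nproc     65535
*           soft    nofile    65535
*           hard    nofile    65535
```

These entries take effect for new login sessions (via pam_limits); the post above also rebooted after making the change.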

The problem that had tormented me for days was finally solved.