Summary of task counts for a cluster run
Ran a spark-sql job on the cluster to initialize a large wide table from nearly twenty years of data.
1,700 tasks (roughly 10 min of computation + 30 min of writing)
Two years of data: 130 GB
Three years of data: 190 GB
Average per year: 60+ GB (about 20 million rows)
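For context, a minimal Scala sketch of what such an initialization job could look like; the table names, query, and columns are hypothetical, and only the scale and the repartition(10) (discussed below) come from these notes:

    import org.apache.spark.sql.SparkSession

    object WideTableInit {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("wide-table-init")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical wide-table query; the real one scans nearly twenty
        // years of data and fanned out into the 1,700 tasks noted above.
        val wide = spark.sql(
          """SELECT e.*, u.age, u.region
            |FROM dw.fact_events e
            |JOIN dw.dim_user u ON e.user_id = u.user_id
            |WHERE e.dt >= '2000-01-01'""".stripMargin)

        // Write the full result through just 10 partitions; the memory
        // implications of this repartition(10) are discussed below.
        wide.repartition(10)
          .write
          .mode("overwrite")
          .saveAsTable("dw.wide_table")

        spark.stop()
      }
    }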
Resources requested at submit time:
Each executor gets 16 GB in total: 12 GB of heap plus 4 GB of YARN memory overhead, via these two spark-submit flags:
--executor-memory 12g \
--conf spark.yarn.executor.memoryOverhead=4096m \
Because the job does a repartition(10) (see the sketch above), the write funnels through only 10 partitions, and each partition's task has to fit inside one executor's 16 GB.
This configuration therefore tops out at roughly 10 × 16 GB = 160 GB, i.e. a little over two years of data.
Pushing past that limit fails with:
ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 16.1 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
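Per the error's own hint, one fix is to raise the overhead at submit time (e.g. --conf spark.yarn.executor.memoryOverhead=6144m; the value is illustrative). The other, following the partition math above, is to repartition more finely so each task carries less data. A sketch of the latter, reusing the hypothetical job above with an illustrative partition count:

    import org.apache.spark.sql.DataFrame

    // Assumption: nearly twenty years at 60+ GB/year is roughly 1.2 TB.
    // 10 partitions capped the run at about 10 x 16 GB = 160 GB; with 100
    // partitions each task carries ~12 GB, which fits a 16 GB executor.
    def writeWide(wide: DataFrame): Unit =
      wide.repartition(100)
        .write
        .mode("overwrite")
        .saveAsTable("dw.wide_table")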