Overview

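The main idea is that today's distributed SGD has a flaw: training takes a long time, and once a certain number of machines are in use, adding more compute no longer shortens training time or improves model quality.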

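The paper therefore proposes using online distillation for large-scale distributed neural network training, aiming for faster training and better model accuracy. It introduces the concept of codistillation and, through large-scale experiments, finds that codistillation improves accuracy, speeds up training, and is easy to use in practice.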

Definition of codistillation:
Knowledge Distillation(6)——Large scale distributed neural net training through online distillation
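The authors train multiple copies of the same network in parallel, and they apply the distillation loss before the models have converged.
To me this paper feels like an upgraded version of Deep Mutual Learning: many identical networks are trained in parallel, and the result is that they all learn faster and learn better.
With the same budget of hundreds of GPUs, scaling up the batch size for SGD brings little benefit. Codistillation looks like it trains too many models, but because they all learn together and from one another, they converge faster~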
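
To make the "learn together, learn from one another" idea concrete, here is a minimal sketch of a per-copy loss for a single example. It is not the paper's implementation. It assumes the distillation term is a cross-entropy against the average of the other copies' predictions, and that this average is treated as a fixed target. The names `codistillation_losses` and `alpha` are hypothetical and only serve the illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target_probs, log_softmax(logits)))


def codistillation_losses(all_logits, label, alpha=1.0):
    """Return one loss per model copy for a single training example.

    all_logits: one logit vector per model copy (at least two copies).
    label: index of the true class.
    alpha: hypothetical weight on the distillation term (an assumption).
    """
    n = len(all_logits)
    if n < 2:
        raise ValueError("codistillation needs at least two model copies")
    num_classes = len(all_logits[0])
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    probs = [[math.exp(lp) for lp in log_softmax(l)] for l in all_logits]

    losses = []
    for i in range(n):
        # Average the predictions of every copy except copy i.
        peers = [probs[j] for j in range(n) if j != i]
        teacher = [sum(p[k] for p in peers) / len(peers)
                   for k in range(num_classes)]
        hard = cross_entropy(one_hot, all_logits[i])   # usual label loss
        soft = cross_entropy(teacher, all_logits[i])   # match the peers
        losses.append(hard + alpha * soft)
    return losses


# Two copies of the same network, one 2-class example whose true class is 0.
print(codistillation_losses([[2.0, 0.0], [1.0, 0.5]], label=0))
```

With `alpha=0.0` each copy falls back to plain supervised training. A nonzero `alpha` adds the term that pulls each copy toward what its peers currently predict, even though none of them has converged yet.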

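That said, this paper is not much help to me personally, since I don't have 128 GPUs... so it's just interesting reading for me.
The big industry labs still go for brute force, but small lab workshops have their own ways of playing too, haha~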


Large scale distributed neural network training through online distillation
