Weight Initialization for Neural Networks: Does It Matter?

Machine learning and deep learning techniques have encroached on every possible domain you can think of, and with the increasing availability of digitized data and the advancing computational capability of modern computers, they will likely flourish further in the near future. As part of this, every day more and more people are taking on the task of building and training models. As a newbie in the domain, I too have had to build and train a few neural network models. Most of the time, my primary goal is to make a model that is both highly accurate and well generalized. To achieve that, I usually break my head over finding proper hyper-parameters, over which regularization techniques to apply, or over whether I need to deepen my model further, and so on. But I often forget to play around with weight-initialization techniques, and I believe this is the case for many others like me.


Wait, weight initialization? Does it matter at all? I dedicate this article to finding out. When I went over the internet looking for an answer, I found an overwhelming amount of information. Some articles talk about the mathematics behind these techniques, some compare them theoretically, and some focus on deciding whether uniform initialization is better than normal. In this situation, I resorted to a result-based approach: I tried to find out the impact of weight-initialization techniques via a small experiment. I applied a few of these techniques to a model and tried to visualize the training process itself, keeping aside the goal of achieving high accuracy for the time being.


The setup:

The Model | Image by Author

I have created a simple CNN model combining Conv-2D, Max-pool, Dropout, Batch-norm, and Dense layers with Relu activation (except for the last layer, which uses Softmax). I trained it on the classification task of the Cifar-10 data set. I initialized the model with six different kernel-initialization methods and analyzed the training phase, training for 30 epochs with a batch size of 512 using an SGD optimizer. The six initialization methods used in this experiment are:


  • Glorot Uniform
  • Glorot Normal
  • He Uniform
  • He Normal
  • Random Uniform
  • Random Normal
Code snippet for model build and training | Image by Author
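Since the snippet above survives only as an image, here is a minimal Keras sketch of the kind of model build and training loop described (Conv-2D, Max-pool, Dropout, Batch-norm, and Dense layers with Relu, a Softmax output, SGD, 30 epochs, batch size 512). The filter counts, dropout rates, and exact layer ordering are my assumptions, not the author's original code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and normalize the Cifar-10 data set
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_model(initializer):
    # A simple CNN; the filter counts and dropout rates are assumptions
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      kernel_initializer=initializer,
                      input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu",
                      kernel_initializer=initializer),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu",
                     kernel_initializer=initializer),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax",
                     kernel_initializer=initializer),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# The six kernel initializers compared in the experiment,
# passed by their registered Keras string names
initializers = ["glorot_uniform", "glorot_normal", "he_uniform",
                "he_normal", "random_uniform", "random_normal"]

histories = {}
for init in initializers:
    model = build_model(init)
    histories[init] = model.fit(x_train, y_train,
                                epochs=30, batch_size=512,
                                validation_data=(x_test, y_test))
```

Passing an initializer as a string uses Keras's registered initializers with their default arguments; an Initializer instance (e.g. tf.keras.initializers.HeUniform()) works just as well.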

Note: The default kernel initialization method for Keras layers is Glorot Uniform.


Experiment results:

Let us first have a look at how uniform and normal initialization behaved for each of the three method families.


Validation accuracy for different initialization methods | Image by Author

For the first two method families, the validation accuracy curve for normal initialization almost always followed the uniform one. For the Glorot methods (first plot), the curve took a slow start and then began to rise. After the tenth epoch, both the normal and uniform curves behaved very erratically, but over the last few epochs the normal curve trended downward while the uniform one trended upward overall.


For the random methods (last plot), the trends of the two curves didn't match at all. The overall validation accuracy for Random Normal remained very low, below 40% (apart from two sharp upward spikes), whereas the uniform method showed better results, although its curve is very unstable and jumpy. After the 23rd epoch, both curves showed a drop in validation accuracy. Echoing the Glorot methods, here too we see a very slow start up to epoch 5, with validation accuracy as low as ~10%. For both the Glorot and Random methods, the validation accuracy curve never settled into a proper convergence pattern and remained unstable throughout the training process.


The He methods (middle plot) performed better than the other two in this respect. The curves for the uniform and normal variants are similar and tracked each other to the end of the 30-epoch training. Both show an overall upward trend, and the curves are comparatively less jumpy. Unlike Glorot and Random, with the He methods the validation accuracy started to rise from the very first epoch, without a slow start.


Let us now take a deeper dive and analyze the performance of these methods on a few desirable aspects.


Comparison of validation accuracy and loss among uniform techniques | Image by Author
Comparison of validation accuracy and loss among normal techniques | Image by Author
  • Accuracy: As far as accuracy is concerned, the highest validation accuracy is achieved using the Glorot methods (blue curves), although, as the curves suggest, they failed to hold that position and instead behaved very unstably. The He methods (orange curves) are far more stable and reliable than the other two, and their validation accuracy is comparable with Glorot's. The Random techniques, both uniform and normal, produced the lowest validation accuracy throughout the training. The table below gives an overall summary of how these methods behaved.


Accuracy comparison table | Image by Author
  • Convergence: Nowadays people are training very deep models; GPT-3 has 175 billion parameters to train, so fast and distinct convergence is always a desirable property. As I already mentioned, the Glorot and Random methods' curves are very unstable and jumpy, so it is difficult to deduce whether convergence was achieved at all. Looking at the overall trend, the Random initialization methods performed very poorly: training with Random Normal converged at around the 40% mark, Random Uniform stayed below 50%, and the Random curves took at least 15-16 epochs to reach even that level of validation accuracy. For Glorot Uniform and Normal initialization, the validation accuracy converged at 50-60% (with some random spikes above 60%), and the convergence trend only started to take shape after 15 epochs. The He curves rose steadily and crossed the 50% mark at around 12 epochs (the He Normal curve was faster); after that, both the He uniform and normal curves continued upward and finished at around the 60% mark. Also, looking at the loss curves, the starting loss for the Glorot Uniform and Random Uniform methods is low compared to He Uniform. This is probably because the He method initializes the weights with a somewhat larger spread; the variance sketch after this list makes that concrete.


  • Stability and randomness of training: The validation accuracy comparison curves show very clearly that the Glorot and Random methods are very jumpy, and as a consequence they hinder the designer's ability to decide on an early stop of the training process, a vital trick for reducing over-fitting and producing a well-generalized model (a minimal early-stopping sketch follows the plots below). On the other hand, when the model is initialized with the He Uniform or He Normal methods, the validation accuracy curves show very stable and consistent behavior, which also gives the designer confidence in the training process: they can stop at any desired point. To check that these findings hold up, we trained the model five times with each of the initialization methods.

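To substantiate the claim about He initialization's larger spread, here is a quick sketch based on the standard formulas behind each initializer: Glorot Normal draws with std = sqrt(2 / (fan_in + fan_out)), He Normal with std = sqrt(2 / fan_in), the uniform variants use limits sqrt(6 / (fan_in + fan_out)) and sqrt(6 / fan_in) respectively, and Keras's Random Normal/Uniform defaults use a fixed scale of 0.05:

```python
import math

def init_scales(fan_in, fan_out):
    # Standard deviation (normal variants) or limit (uniform variants)
    # each initializer would use for a layer of the given fan-in/fan-out
    return {
        "glorot_normal std":  math.sqrt(2.0 / (fan_in + fan_out)),
        "glorot_uniform lim": math.sqrt(6.0 / (fan_in + fan_out)),
        "he_normal std":      math.sqrt(2.0 / fan_in),
        "he_uniform lim":     math.sqrt(6.0 / fan_in),
        "random_normal std":  0.05,  # Keras default
        "random_uniform lim": 0.05,  # Keras default range [-0.05, 0.05]
    }

# Example: a 3x3 conv with 32 input and 64 output channels,
# so fan_in = 3*3*32 = 288 and fan_out = 3*3*64 = 576
for name, scale in init_scales(288, 576).items():
    print(f"{name}: {scale:.4f}")
```

For this example layer, He Normal's std (~0.083) is noticeably larger than Glorot Normal's (~0.048) or the fixed 0.05 of the random initializers, which is consistent with the higher starting loss observed for He Uniform.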

Validation accuracy comparison for He Uniform initialization (multiple runs) | Image by Author
Validation accuracy comparison for Glorot Uniform (multiple runs) | Image by Author
Validation accuracy comparison for Random Uniform (multiple runs) | Image by Author

The plots clearly show how jumpy the training process can be when initialized with the Glorot and Random methods. The Random Uniform method failed to show any improvement up to the 5th epoch, and the story is the same for Glorot Uniform up to the 3rd epoch.

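Since stable curves are what make an early stop trustworthy, here is a minimal sketch of wiring Keras's EarlyStopping callback into the training call from the snippet above; the monitored metric and patience value are my assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation accuracy has not improved for 5 epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor="val_accuracy", patience=5,
                           restore_best_weights=True)

model = build_model("he_uniform")
history = model.fit(x_train, y_train, epochs=30, batch_size=512,
                    validation_data=(x_test, y_test),
                    callbacks=[early_stop])
```

With a jumpy curve like Random Uniform's, a small patience value risks stopping on a transient dip or spike, which is exactly the decision-making problem described above.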

  • Generalization: Along with accuracy, we as designers want a model that generalizes well, so that it performs well on unseen data. We therefore expect our model to show a minimal gap between training and validation performance.

Training vs Validation (Glorot Uniform) | Image by Author
Training vs Validation (Random Uniform) | Image by Author
Training vs Validation (He Uniform) | Image by Author

As the plots above show, the gap between training and validation is narrowest when the model is initialized with the He Uniform method; the validation loss and training loss curves almost superimpose on one another. The gap is widest when the model is initialized with the Random Uniform technique, which ends up highly over-fitted.
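To reproduce this kind of gap plot from the histories collected earlier, here is a minimal matplotlib sketch (the histories dict comes from the training loop above):

```python
import matplotlib.pyplot as plt

def plot_gap(history, title):
    # Training vs validation loss: the vertical distance between
    # the two curves visualizes the generalization gap
    plt.plot(history.history["loss"], label="training loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.title(title)
    plt.legend()
    plt.show()

plot_gap(histories["he_uniform"], "Training vs Validation (He Uniform)")
```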

  • Environment: Yes, you read that correctly: the heading is environment. An article published in MIT Technology Review says, "Training a single AI model can emit as much carbon as five cars in their lifetimes".


As AI shapes the future of humankind, we must also, while training our models, reduce the carbon footprint generated by exploiting the computational power of computers, to keep the world habitable. Since model training is a cyclic process, we should look for techniques that converge fast. Looking at the training curves, we can see that models initialized with the He methods are fast to converge, and their smooth training curves enable the designer to stop early.


Conclusion:

While the deep learning community remains divided on which initialization method works best, and on whether to use the uniform or the normal variant, this experiment gives me confidence that the He initialization methods performed better on a few aspects under the setting of this task and model:


1) Very stable and smooth training progression

2) Fast convergence to a desirable validation accuracy

3) A steady increase in validation accuracy from the very first epochs

4) Similar behavior across multiple runs, with less randomness in the training process

5) Clear convergence and a very thin generalization gap (training vs testing) compared to the others

Translated from: https://towardsdatascience.com/weight-initialization-for-neural-networks-does-it-matter-e2fd99b3e91f
