The Mathematics Behind Deep Learning

Deep neural networks (DNNs) are essentially formed by connecting multiple perceptrons, where a perceptron is a single neuron. Think of an artificial neural network (ANN) as a system containing a set of inputs that are fed along weighted paths. These inputs are then processed, and an output is produced to perform some task. Over time, the ANN ‘learns’, and different paths are developed. Various paths can have different weightings, and paths that are found to be more important (or that produce more desirable results) are assigned higher weightings within the model than those that produce less desirable results.

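To make the ‘weighted paths’ idea concrete, here is a minimal sketch of a single perceptron in Python using NumPy; the input values, weights, and the choice of a sigmoid activation are illustrative assumptions rather than anything specific to a particular library.

```python
import numpy as np

def perceptron(x, w, b):
    """A single neuron: a weighted sum of the inputs plus a bias,
    passed through a sigmoid activation."""
    z = np.dot(w, x) + b             # weighted paths: w1*x1 + w2*x2 + w3*x3 + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes the sum into (0, 1)

x = np.array([0.5, 1.0, -0.3])       # three example inputs
w = np.array([0.8, -0.2, 0.4])       # one weight per input path
print(perceptron(x, w, b=0.1))       # roughly 0.545 for these illustrative values
```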

Within a DNN, if all the inputs are densely connected to all the outputs, then these layers are referred to as dense layers. Additionally, DNNs can contain multiple hidden layers. A hidden layer is basically the point between the input and output of the neural network, where the activation function does a transformation on the information being fed in. It is referred to as a hidden layer because it is not directly observable from the system’s inputs and outputs. The deeper the neural network, the more the network can recognize from data.

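As a rough illustration (not tied to any particular framework), a small network with one hidden layer can be written as two dense transformations, where the hidden layer applies an activation function to the weighted inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # a common non-linear activation

# A tiny fully connected network: 3 inputs -> 4 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # dense layer into the hidden layer
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # dense layer into the output

def forward(x):
    h = relu(W1 @ x + b1)            # hidden layer: transform the incoming information
    return W2 @ h + b2               # output layer: produce the prediction

print(forward(np.array([1.0, 0.5, -2.0])))
```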

However, although learning as much as possible from the data is the goal, deep learning models can suffer from overfitting. This occurs when a model learns too much from the training data, including random noise. Models are then able to determine very intricate patterns within the data, but this negatively affects the performance on new data. The noise picked up in the training data does not apply to new or unseen data, and the model is unable to generalize the patterns found. Non-linearity is also of high importance in deep learning models. Although the model will learn a lot from having multiple hidden layers, applying linear forms to non-linear problems will result in poor performance.

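One way to see why non-linearity matters: if the hidden layers apply no activation function, stacking them collapses into a single linear transformation, so the extra depth adds no expressive power. A quick sketch with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

deep_linear = W2 @ (W1 @ x)           # two stacked linear layers, no activation between them
single_linear = (W2 @ W1) @ x         # one layer with the combined weight matrix
print(np.allclose(deep_linear, single_linear))  # True: the extra depth alone adds nothing
```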

Image by Trist’n Joseph

The question now becomes, “how do these layers learn things?” Well, let us apply an ANN to a real scenario to solve a problem and understand how the model would be trained to accomplish its goal. With the current pandemic, many schools have transitioned to virtual learning, and this has caused some students to be concerned about their chances of passing their courses. The ‘will I pass this class’ problem is one that any artificial intelligence system should be able to solve.

For simplicity, let us consider that this model only has 3 inputs: the number of lectures the student attended, the amount of time spent on assignments, and the number of times that internet connection was lost throughout lectures. The output of this model will be a binary classification; either the student passes the course or they do not. It is now the end of the semester and student A has attended 21 lectures, spent 90 hours on assignments, and lost internet connection 7 times over the semester. These inputs are fed into the model, and the output predicts that the student has a 5% chance of passing the course. A week later, final grades are released, and student A passed the course. So, what went wrong with the model’s prediction?

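With randomly initialised weights, the model’s prediction for student A is essentially arbitrary, which is exactly the situation described here. A hedged sketch (the weights and their scale are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Student A's inputs: lectures attended, hours on assignments, connection drops.
x = np.array([21.0, 90.0, 7.0])

# Randomly initialised weights: the model has not learned anything yet,
# so the predicted pass probability carries no real information.
rng = np.random.default_rng(42)
w, b = rng.normal(scale=0.1, size=3), 0.0
print(sigmoid(w @ x + b))   # an essentially arbitrary pass probability
```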

Technically, nothing went wrong. The model worked exactly as it was currently built to work. The issue is that the model has no idea what is going on. We had only initialized some weights on the pathways, and the model does not yet know right from wrong; thus, the weights are incorrect. This is where the learning comes in. The idea is that the model needs to understand when it is wrong, and we do this by calculating some form of ‘loss’. The loss being calculated is dependent on the problem at hand, but it typically involves minimizing the discrepancy between the predicted output and the actual output.

Image by Trist’n Joseph

In the scenario presented above, there is only one student and one point of error to minimize. However, this is typically not the case. Now, consider that there are multiple students and multiple discrepancies to minimize. The overall loss, then, would typically be calculated as the average of the differences between all predictions and actual observations.

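In other words, if L_i is the loss on the i-th observation, the overall loss is the mean (1/n) * (L_1 + ... + L_n). A tiny sketch with invented per-student losses:

```python
import numpy as np

# Per-student losses, produced by whatever loss function fits the problem.
per_example_loss = np.array([0.21, 1.05, 0.08, 0.67])

# The overall loss averages across all predictions and actual observations.
overall_loss = per_example_loss.mean()
print(overall_loss)   # (0.21 + 1.05 + 0.08 + 0.67) / 4 = 0.5025
```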

Recall that the loss being calculated is dependent on the problem at hand. Therefore, since our current problem is a binary classification, an appropriate loss calculation would be a cross-entropy loss. The idea behind this function is that it compares the predicted distribution of whether a student will pass the course to the actual distribution, and attempts to minimize the differences between these distributions.

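For a binary outcome, the cross-entropy loss works out to -(1/n) * sum(y*log(p) + (1-y)*log(1-p)), where y is the actual pass/fail label and p is the predicted pass probability. A minimal sketch (the labels and probabilities are invented for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy between actual pass/fail labels (0 or 1)
    and predicted pass probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])               # actual outcomes for four students
y_pred = np.array([0.05, 0.10, 0.80, 0.65])   # predicted pass probabilities
print(binary_cross_entropy(y_true, y_pred))   # large, driven by the confident miss on the first student
```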

Suppose instead that we no longer want to predict whether the student will pass the class, but we now want to predict the grade that they will get for the class. The cross-entropy loss would no longer be an appropriate method. Rather, the mean squared error loss would be more appropriate. This method is suitable for a regression problem, and the idea is that it will try to minimize the squared difference between the actual value and a predicted value.

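Written out, the mean squared error is (1/n) * sum((y_i - ŷ_i)^2) over the actual grades y_i and predicted grades ŷ_i. A short sketch with invented grades:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted grades."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([78.0, 64.0, 91.0])      # actual course grades
y_pred = np.array([70.0, 60.0, 88.0])      # predicted grades
print(mean_squared_error(y_true, y_pred))  # (64 + 16 + 9) / 3 ≈ 29.7
```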

Image by Trist’n Joseph

Now that we understand some loss functions, we can get into loss optimization and model training. A key factor in having good DNNs is having appropriate weights. The loss optimization should attempt to find a set of weights, W, that will minimize the calculated loss. If there is only one weight component, then it is possible to plot the weight and the loss on a 2-D graph and then choose the weight which minimizes the loss. However, most DNNs have multiple weight components, and visualizing an n-dimensional graph is quite hard.

Instead, the derivative of the loss function with respect to all the weights is calculated to determine the direction of maximum ascent. Now that the model understands which way is up and which way is down, it travels downwards until it reaches a point of convergence at a local minimum. Once this descent is complete, a set of optimal weights is returned, and this is what should be used for the DNN (assuming that the model was well developed).

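The update rule behind this descent is simply w := w - η * ∇L(w), where η is the learning rate discussed below. A toy sketch on a loss whose minimum is known (the quadratic loss here is made up purely for illustration):

```python
import numpy as np

def loss(w):
    return np.sum((w - 3.0) ** 2)     # toy loss, minimised at w = [3, 3]

def grad(w):
    return 2.0 * (w - 3.0)            # gradient of the loss with respect to the weights

w = np.array([-1.0, 7.0])             # arbitrary starting weights
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * grad(w)   # step opposite the gradient, i.e. downhill
print(w, loss(w))                     # w ends up near [3, 3] with loss near 0
```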

The process of calculating this derivative is known as backpropagation, and it is essentially the chain rule from calculus. Consider the neural network shown above: how does a small change in the first set of weights affect the final loss? This is what the derivative, or gradient, seeks to explain. But the first set of weights is fed into a hidden layer, which then has another set of weights leading to the predicted output and the loss. So, the effect that a change in the first set of weights has on the hidden layer should also be considered. Now, these were the only two parts within the network. But if there are more weights to consider, this process is continued by applying the chain rule from output to input.

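As a hedged sketch of that chain rule, take a one-input, one-hidden-unit network: h = σ(w1·x), ŷ = w2·h, L = (ŷ - y)². The gradient of the loss with respect to w1 is the product of three local derivatives, which can be checked numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 2.0, 1.0            # one input and its target
w1, w2 = 0.5, -0.3         # first-layer and second-layer weights

# Forward pass
h = sigmoid(w1 * x)        # hidden activation
y_hat = w2 * h             # prediction
loss = (y_hat - y) ** 2

# Backward pass: chain rule from the loss back to w1
dloss_dyhat = 2 * (y_hat - y)
dyhat_dh = w2
dh_dw1 = h * (1 - h) * x
dloss_dw1 = dloss_dyhat * dyhat_dh * dh_dw1

# Sanity check against a finite-difference approximation
eps = 1e-6
loss_shifted = (w2 * sigmoid((w1 + eps) * x) - y) ** 2
print(dloss_dw1, (loss_shifted - loss) / eps)   # the two values should agree closely
```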

Image by Trist’n Joseph

Another important factor to consider when training a DNN is the learning rate. As the model travels to find an optimal set of weights, it needs to update its weights by some factor. Although this could seem trivial, determining the factor by which the model should move is quite difficult. If the factor is too small, then the model can either run for an exponentially long period of time or get trapped somewhere that is not the global minimum. If the factor is too large, then the model might completely miss the target point and then diverge.

Although a fixed rate can sometimes work well, an adaptive learning rate reduces the chances of the problems previously mentioned. That is, the factor will change depending on the current gradient, the size of the current weights, or some other signal that affects where the model should move next in its search for the optimal weights.

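As one hedged example of this idea, an Adagrad-style rule scales the step for each weight by the history of its gradients, so weights that have already seen large gradients take smaller steps (the toy loss below is the same made-up quadratic as before):

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)                   # gradient of the toy loss minimised at [3, 3]

w = np.array([-1.0, 7.0])
base_rate, eps = 0.5, 1e-8
accumulated = np.zeros_like(w)               # running sum of squared gradients per weight

for _ in range(200):
    g = grad(w)
    accumulated += g ** 2
    # Adagrad-style step: the effective rate shrinks for weights whose
    # gradients have historically been large.
    w = w - (base_rate / (np.sqrt(accumulated) + eps)) * g
print(w)   # both weights move toward 3, with per-weight step sizes adapting over time
```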

Image by Trist’n Joseph

As can be seen, DNNs are built on calculus and some statistics. Evaluating the mathematics behind these processes is useful because it can help one understand what is truly happening within the model, and this can lead to developing better models overall. But even if the concepts are not easily understood, most frameworks come with tools such as automatic differentiation, so no worries. Happy coding!

digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/

deepai.org/machine-learning-glossary-and-terms/hidden-layer-machine-learning#:~:text=In%20neural%20networks%2C%20a%20hidden,inputs%20entered%20into%20the%20network.

ncbi.nlm.nih.gov/pmc/articles/PMC4960264/

towardsdatascience.com/introduction-to-artificial-neural-networks-ann-1aea15775ef9

explainthatstuff.com/introduction-to-neural-networks.html

neuralnetworksanddeeplearning.com/

mathsisfun.com/calculus/derivatives-rules.html

d2l.ai/chapter_appendix-mathematics-for-deep-learning/multivariable-calculus.html

Other Useful Materials:

deeplearning.mit.edu/

math.ucdavis.edu/~kouba/CalcOneDIRECTORY/chainruledirectory/ChainRule.html

youtube.com/watch?v=tGVnBAHLApA

https://www.inertia7.com/tristn

youtube.com/watch?v=aircAruvnKk

youtube.com/watch?v=bfmFfD2RIcg

https://towardsdatascience.com/what-is-deep-learning-adf5d4de9afc

Translated from: https://towardsdatascience.com/the-mathematics-behind-deep-learning-f6c35a0fe077