The Backpropagation Algorithm and Its Implementation

Background

Last year I read the first few chapters of Neural Networks and Deep Learning and picked up the basics of the backpropagation algorithm. I wanted to implement it myself back then, but other things got in the way and I set it aside. Now that I have time, and because I want to sort the algorithm out properly, I finally worked through the formulas and wrote the code.

------ This post uses only fully connected layers as the example; it is mainly a record of my recent understanding ------

Defining the Fully Connected Network

[Figure: the fully connected network used below, with 2 input neurons, two hidden layers of 3 neurons each, and 2 output neurons]
As shown in the figure above, the notation is as follows:
$w_{ij}^{l}$: the weight between the $i$-th neuron in layer $l$ and the $j$-th neuron in layer $l+1$.
$b_{i}^{l}$: the bias of the $i$-th neuron in layer $l$.
$z_{i}^{l}$: the input of the $i$-th neuron in layer $l$, i.e. the weighted sum of the previous layer's outputs plus the bias.
$a_{i}^{l}$: the output of the $i$-th neuron in layer $l$, obtained by applying the activation function to its input.
Every neuron here uses the sigmoid activation function $s(x)=\frac{1}{1+e^{-x}}$; differentiating $s(x)$ with respect to $x$ gives $s^{\prime}(x)=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}=s(x)(1-s(x))$.
The network has two output values, which can be read as the two class probabilities of a binary classifier. It also takes two input values, so you can think of every sample as having two features.
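As a minimal sketch (the names `sigmoid` and `sigmoid_prime` are my own, and NumPy is assumed), the activation and its derivative can be written as:

```python
import numpy as np

def sigmoid(x):
    """s(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """s'(x) = s(x) * (1 - s(x)); used later when differentiating."""
    s = sigmoid(x)
    return s * (1.0 - s)
```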

Next, let's run through an example of the network's forward computation.

Forward Computation

Feed in a sample $X$ with two feature values:

$$X=\{x_{1},x_{2}\}$$

The two feature values enter the input layer, so they are the outputs of layer 1. From them we can compute, in turn, the input and the activation (the output passed on to the next layer) of every neuron in layer 2:

$$z_{1}^{2}=w_{11}^{1}\cdot x_{1}+w_{21}^{1}\cdot x_{2}+b_{1}^{2}$$

$$z_{2}^{2}=w_{12}^{1}\cdot x_{1}+w_{22}^{1}\cdot x_{2}+b_{2}^{2}$$

$$z_{3}^{2}=w_{13}^{1}\cdot x_{1}+w_{23}^{1}\cdot x_{2}+b_{3}^{2}$$

$$a_{1}^{2}=s(z_{1}^{2})$$

$$a_{2}^{2}=s(z_{2}^{2})$$

$$a_{3}^{2}=s(z_{3}^{2})$$

Next, compute the input and the output (to the next layer) of every neuron in layer 3:

$$z_{1}^{3}=w_{11}^{2}\cdot a_{1}^{2}+w_{21}^{2}\cdot a_{2}^{2}+w_{31}^{2}\cdot a_{3}^{2}+b_{1}^{3}$$

$$z_{2}^{3}=w_{12}^{2}\cdot a_{1}^{2}+w_{22}^{2}\cdot a_{2}^{2}+w_{32}^{2}\cdot a_{3}^{2}+b_{2}^{3}$$

$$z_{3}^{3}=w_{13}^{2}\cdot a_{1}^{2}+w_{23}^{2}\cdot a_{2}^{2}+w_{33}^{2}\cdot a_{3}^{2}+b_{3}^{3}$$

$$a_{1}^{3}=s(z_{1}^{3})$$

$$a_{2}^{3}=s(z_{2}^{3})$$

$$a_{3}^{3}=s(z_{3}^{3})$$

With the layer-3 activations in hand, we can compute layer 4, the output layer:

$$z_{1}^{4}=w_{11}^{3}\cdot a_{1}^{3}+w_{21}^{3}\cdot a_{2}^{3}+w_{31}^{3}\cdot a_{3}^{3}+b_{1}^{4}$$

$$z_{2}^{4}=w_{12}^{3}\cdot a_{1}^{3}+w_{22}^{3}\cdot a_{2}^{3}+w_{32}^{3}\cdot a_{3}^{3}+b_{2}^{4}$$

$$a_{1}^{4}=s(z_{1}^{4})$$

$$a_{2}^{4}=s(z_{2}^{4})$$
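To make the forward pass concrete, here is a rough NumPy sketch of the 2-3-3-2 network above. The variable names (`W1`, `b2`, `forward`, ...) and the random initialization are my own choices; each matrix entry `W[i, j]` plays the role of $w_{ij}^{l}$, so a layer computes $z = W^{T}a + b$, and `sigmoid` is the function defined earlier.

```python
rng = np.random.default_rng(0)          # arbitrary initialization for this sketch
W1 = rng.standard_normal((2, 3))        # w_ij^1: layer 1 (2 neurons) -> layer 2 (3 neurons)
W2 = rng.standard_normal((3, 3))        # w_ij^2: layer 2 -> layer 3
W3 = rng.standard_normal((3, 2))        # w_ij^3: layer 3 -> layer 4 (2 outputs)
b2 = rng.standard_normal(3)             # biases of layer 2
b3 = rng.standard_normal(3)             # biases of layer 3
b4 = rng.standard_normal(2)             # biases of layer 4

def forward(x):
    """Forward pass; every z and a is returned because backprop reuses them."""
    z2 = W1.T @ x + b2;  a2 = sigmoid(z2)
    z3 = W2.T @ a2 + b3; a3 = sigmoid(z3)
    z4 = W3.T @ a3 + b4; a4 = sigmoid(z4)
    return z2, a2, z3, a3, z4, a4

x = np.array([0.5, -1.0])               # a sample X = {x1, x2}
z2, a2, z3, a3, z4, a4 = forward(x)
```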

We now have the network outputs $a_{1}^{4},a_{2}^{4}$ and compare them with the ground truth. Let the label of sample $X$ be $Y=\{y_{1},y_{2}\}$. The gap between the network's prediction and the truth is the loss; here we use the squared error:

$$loss=\frac{(y_{1}-a_{1}^{4})^{2}+(y_{2}-a_{2}^{4})^{2}}{2}$$
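Continuing the sketch above with a made-up label, the loss is just:

```python
y = np.array([1.0, 0.0])                # a hypothetical label Y = {y1, y2}
loss = 0.5 * np.sum((y - a4) ** 2)      # ((y1 - a1^4)^2 + (y2 - a2^4)^2) / 2
```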
With a loss value in hand, we should optimize the network parameters to keep driving the loss down. We use the most common method, gradient descent, to minimize the loss, since the direction opposite to the gradient is the direction in which the function value decreases fastest. So the next task is to compute the gradient of the loss with respect to every $w$ and $b$, and then update these parameters with a learning rate $lr$:

$$w = w-lr\cdot\frac{\partial loss}{\partial w} \tag{1}$$

$$b = b-lr\cdot\frac{\partial loss}{\partial b} \tag{2}$$

Repeat this long enough and the loss will eventually drop to a value we are happy with.
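A minimal sketch of update rules (1) and (2); the helper name `sgd_step` and the learning rate are my own choices:

```python
def sgd_step(param, grad, lr=0.1):
    """One gradient-descent step: param <- param - lr * (d loss / d param)."""
    return param - lr * grad
```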
Computing the gradient for every $w$ and $b$ is, like the forward pass, mostly manual labor. Next we use the chain rule to work out each $\frac{\partial loss}{\partial w}$ and $\frac{\partial loss}{\partial b}$ in turn.

Chain-Rule Differentiation

Starting from the last layer, we need $\frac{\partial loss}{\partial w_{11}^{3}}$, $\frac{\partial loss}{\partial w_{12}^{3}}$, $\frac{\partial loss}{\partial w_{21}^{3}}$, $\frac{\partial loss}{\partial w_{22}^{3}}$, $\frac{\partial loss}{\partial w_{31}^{3}}$, $\frac{\partial loss}{\partial w_{32}^{3}}$, as well as $\frac{\partial loss}{\partial b_{1}^{4}}$ and $\frac{\partial loss}{\partial b_{2}^{4}}$.

Referring to the forward equations above and reading them from back to front until $b_{1}^{4}$ appears:

$$loss =\frac{(y_{1}-a_{1}^{4})^{2}+(y_{2}-a_{2}^{4})^{2}}{2}$$

$$a_{1}^{4}=s(z_{1}^{4})$$

$$z_{1}^{4}=w_{11}^{3}\cdot a_{1}^{3}+w_{21}^{3}\cdot a_{2}^{3}+w_{31}^{3}\cdot a_{3}^{3}+b_{1}^{4}$$

Now we can apply the chain rule to get the partial derivative of $loss$ with respect to $b_{1}^{4}$:

$$\begin{aligned} \frac{\partial loss}{\partial b_{1}^{4}}&=\frac{\partial loss}{\partial a_{1}^{4}}\cdot\frac{\partial a_{1}^{4}}{\partial z_{1}^{4}}\cdot\frac{\partial z_{1}^{4}}{\partial b_{1}^{4}} \\ &=-\frac{1}{2}\cdot 2\cdot(y_{1}-a_{1}^{4})\cdot s(z_{1}^{4})\cdot(1-s(z_{1}^{4})) \\ &=-(y_{1}-a_{1}^{4})\cdot s(z_{1}^{4})\cdot(1-s(z_{1}^{4})) \end{aligned}$$

In the same way we obtain:

$$\begin{aligned} \frac{\partial loss}{\partial b_{2}^{4}}&=\frac{\partial loss}{\partial a_{2}^{4}}\cdot\frac{\partial a_{2}^{4}}{\partial z_{2}^{4}}\cdot\frac{\partial z_{2}^{4}}{\partial b_{2}^{4}} \\ &=-\frac{1}{2}\cdot 2\cdot(y_{2}-a_{2}^{4})\cdot s(z_{2}^{4})\cdot(1-s(z_{2}^{4})) \\ &=-(y_{2}-a_{2}^{4})\cdot s(z_{2}^{4})\cdot(1-s(z_{2}^{4})) \end{aligned}$$

$$\frac{\partial loss}{\partial w_{11}^{3}} = \frac{\partial loss}{\partial a_{1}^{4}} \cdot \frac{\partial a_{1}^{4}}{\partial z_{1}^{4}} \cdot \frac{\partial z_{1}^{4}}{\partial w_{11}^{3}}$$

$$\frac{\partial loss}{\partial w_{12}^{3}} = \frac{\partial loss}{\partial a_{2}^{4}} \cdot \frac{\partial a_{2}^{4}}{\partial z_{2}^{4}} \cdot \frac{\partial z_{2}^{4}}{\partial w_{12}^{3}}$$
Carrying on in the same way yields the partial derivative of every parameter in this layer.
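Sticking with the earlier sketch, the last-layer gradients derived above can be written compactly. The names `db4` and `dW3` are my own, and `dW3[i, j]` corresponds to $\frac{\partial loss}{\partial w_{ij}^{3}}$; the last factor $\frac{\partial z_{j}^{4}}{\partial w_{ij}^{3}}$ is simply $a_{i}^{3}$.

```python
# d loss / d b_j^4 = -(y_j - a_j^4) * s(z_j^4) * (1 - s(z_j^4)), one entry per output neuron
db4 = -(y - a4) * sigmoid_prime(z4)

# d loss / d w_ij^3 = db4[j] * a_i^3, because dz_j^4 / dw_ij^3 = a_i^3
dW3 = np.outer(a3, db4)
```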
Once the last layer is done, we move back to the second-to-last layer and compute $\frac{\partial loss}{\partial w_{11}^{2}}$, $\frac{\partial loss}{\partial w_{12}^{2}}$, $\frac{\partial loss}{\partial w_{13}^{2}}$, $\frac{\partial loss}{\partial w_{21}^{2}}$, $\frac{\partial loss}{\partial w_{22}^{2}}$, $\frac{\partial loss}{\partial w_{23}^{2}}$, $\frac{\partial loss}{\partial w_{31}^{2}}$, $\frac{\partial loss}{\partial w_{32}^{2}}$, $\frac{\partial loss}{\partial w_{33}^{2}}$, as well as $\frac{\partial loss}{\partial b_{1}^{3}}$, $\frac{\partial loss}{\partial b_{2}^{3}}$, and $\frac{\partial loss}{\partial b_{3}^{3}}$.

This layer sits a bit deeper. To find $\frac{\partial loss}{\partial b_{1}^{3}}$, again read the forward equations from back to front:

$$loss =\frac{(y_{1}-a_{1}^{4})^{2}+(y_{2}-a_{2}^{4})^{2}}{2}$$

$$a_{1}^{4}=s(z_{1}^{4})$$

$$a_{2}^{4}=s(z_{2}^{4})$$

$$z_{1}^{4}=w_{11}^{3}\cdot a_{1}^{3}+w_{21}^{3}\cdot a_{2}^{3}+w_{31}^{3}\cdot a_{3}^{3}+b_{1}^{4}$$

$$z_{2}^{4}=w_{12}^{3}\cdot a_{1}^{3}+w_{22}^{3}\cdot a_{2}^{3}+w_{32}^{3}\cdot a_{3}^{3}+b_{2}^{4}$$

$$a_{1}^{3}=s(z_{1}^{3})$$

$$z_{1}^{3}=w_{11}^{2}\cdot a_{1}^{2}+w_{21}^{2}\cdot a_{2}^{2}+w_{31}^{2}\cdot a_{3}^{2}+b_{1}^{3}$$

until $b_{1}^{3}$ appears, and then differentiate:

$$\frac{\partial loss}{\partial b_{1}^{3}} = \frac{\partial loss}{\partial a_{1}^{4}} \cdot \frac{\partial a_{1}^{4}}{\partial z_{1}^{4}} \cdot \frac{\partial z_{1}^{4}}{\partial a_{1}^{3}} \cdot \frac{\partial a_{1}^{3}}{\partial z_{1}^{3}} \cdot \frac{\partial z_{1}^{3}}{\partial b_{1}^{3}} + \frac{\partial loss}{\partial a_{2}^{4}} \cdot \frac{\partial a_{2}^{4}}{\partial z_{2}^{4}} \cdot \frac{\partial z_{2}^{4}}{\partial a_{1}^{3}} \cdot \frac{\partial a_{1}^{3}}{\partial z_{1}^{3}} \cdot \frac{\partial z_{1}^{3}}{\partial b_{1}^{3}}$$
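In the sketch's notation, this two-term sum reuses `db4` from the last layer; `db1_3` is my own name for $\frac{\partial loss}{\partial b_{1}^{3}}$, and the commented line shows the same computation for the whole layer at once.

```python
# d loss / d b_1^3: both chains end with s'(z_1^3) * 1, and their first three
# factors are exactly db4[0] * w_11^3 and db4[1] * w_12^3
db1_3 = (db4[0] * W3[0, 0] + db4[1] * W3[0, 1]) * sigmoid_prime(z3)[0]

# all three layer-3 bias gradients at once:
# db3 = (W3 @ db4) * sigmoid_prime(z3)
```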

Good. Next, look at $\frac{\partial loss}{\partial w_{11}^{2}}$:

$$loss =\frac{(y_{1}-a_{1}^{4})^{2}+(y_{2}-a_{2}^{4})^{2}}{2}$$

$$a_{1}^{4}=s(z_{1}^{4})$$

$$a_{2}^{4}=s(z_{2}^{4})$$

$$z_{1}^{4}=w_{11}^{3}\cdot a_{1}^{3}+w_{21}^{3}\cdot a_{2}^{3}+w_{31}^{3}\cdot a_{3}^{3}+b_{1}^{4}$$

$$z_{2}^{4}=w_{12}^{3}\cdot a_{1}^{3}+w_{22}^{3}\cdot a_{2}^{3}+w_{32}^{3}\cdot a_{3}^{3}+b_{2}^{4}$$

$$a_{1}^{3}=s(z_{1}^{3})$$

$$z_{1}^{3}=w_{11}^{2}\cdot a_{1}^{2}+w_{21}^{2}\cdot a_{2}^{2}+w_{31}^{2}\cdot a_{3}^{2}+b_{1}^{3}$$

We have reached $w_{11}^{2}$, so differentiate:

$$\frac{\partial loss}{\partial w_{11}^{2}} = \frac{\partial loss}{\partial a_{1}^{4}} \cdot \frac{\partial a_{1}^{4}}{\partial z_{1}^{4}} \cdot \frac{\partial z_{1}^{4}}{\partial a_{1}^{3}} \cdot \frac{\partial a_{1}^{3}}{\partial z_{1}^{3}} \cdot \frac{\partial z_{1}^{3}}{\partial w_{11}^{2}} + \frac{\partial loss}{\partial a_{2}^{4}} \cdot \frac{\partial a_{2}^{4}}{\partial z_{2}^{4}} \cdot \frac{\partial z_{2}^{4}}{\partial a_{1}^{3}} \cdot \frac{\partial a_{1}^{3}}{\partial z_{1}^{3}} \cdot \frac{\partial z_{1}^{3}}{\partial w_{11}^{2}}$$
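Compared with the bias case, only the last factor changes: $\frac{\partial z_{1}^{3}}{\partial w_{11}^{2}}=a_{1}^{2}$ instead of $\frac{\partial z_{1}^{3}}{\partial b_{1}^{3}}=1$. In the sketch it is therefore a one-liner (`dw11_2` is again a made-up name):

```python
# d loss / d w_11^2 = (d loss / d b_1^3) * a_1^2
dw11_2 = db1_3 * a2[0]
```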
The remaining parameters are computed with exactly the same method, so I won't spell them out.
Once the partial derivatives of every layer's parameters are in hand, update the parameters once with the gradient-descent formulas (1) and (2). Then keep repeating this loop of forward computation, backward differentiation, and parameter update until the loss drops toward its minimum.
By now you have probably spotted the problem: the deeper we differentiate, the more we find that the leading factors of many partial derivatives are identical and have already been computed, so evaluating every derivative this way involves a great deal of unnecessary repeated work. How can we avoid it? That is where the backpropagation algorithm comes in to speed up the gradient computation.

The Backpropagation Algorithm

The way I understand backpropagation is that it introduces $\delta$: while computing gradients from back to front, each intermediate gradient is saved as soon as it is computed, and the gradients of earlier layers are then computed directly from the saved values. This avoids the repeated work.
OK, here is the idea of the algorithm.
First define $\delta$:
$$\delta_{i}^{l}=\frac{\partial loss}{\partial z_{i}^{l}}$$
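Applied to the output layer of our sketch, this definition is exactly the quantity computed earlier as `db4`, since $\frac{\partial z_{j}^{4}}{\partial b_{j}^{4}}=1$; caching it (as a hypothetical `delta4`) is what lets earlier layers reuse it instead of redoing the whole chain:

```python
# delta_j^4 = d loss / d z_j^4 = -(y_j - a_j^4) * s'(z_j^4)
delta4 = -(y - a4) * sigmoid_prime(z4)   # identical to db4 above
```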
To be continued tomorrow.