Neural Networks and Deep Learning -- Class 3: Shallow neural networks

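3.1 Neural network overview

Roughly, a neural network chains two logistic regressions together: the output of the first becomes the input of the second.

a[1]: layer 1; a[2]: layer 2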

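3.2 Neural network representation

Hidden layer: the true values of its nodes are not observed in the training set.

a[0] = x; a denotes the activations, i.e. the values passed on to the next layer.

Input layer: either called layer 0, or not counted as a separate layer at all [the convention in papers].

Each layer has its own parameters, scalars or matrices (make sure you understand the matrix dimensions).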

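3.3 Computing a neural network's output

Vectorization

The different nodes within one layer are stacked vertically: each node contributes one row of W[l] and one entry of the column vector a[l].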
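
The vertical stacking can be sketched for a single example as follows. This is a minimal illustration, not the course's reference code: the layer sizes (3 inputs, 4 hidden units), the random seed, and all variable names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                              # assumed sizes: 3 inputs, 4 hidden units

x = rng.standard_normal((n_x, 1))            # one example as a column vector, a[0] = x
W1 = rng.standard_normal((n_h, n_x)) * 0.01  # one row per hidden node
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

# Node-by-node version: each hidden node computes its own z.
z1_loop = np.zeros((n_h, 1))
for i in range(n_h):
    z1_loop[i, 0] = W1[i, :] @ x[:, 0] + b1[i, 0]

# Vectorized version: stacking the nodes' weight rows gives one matrix product.
z1 = W1 @ x + b1          # shape (4, 1)
a1 = np.tanh(z1)          # shape (4, 1)
z2 = W2 @ a1 + b2         # shape (1, 1)
a2 = sigmoid(z2)          # shape (1, 1), the network's output
```

Both versions give the same z1; the vectorized one just replaces the loop over nodes with a single matrix product.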

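3.4 Vectorizing across multiple examples

The horizontal index (columns) runs over the training examples.

The vertical index (rows) runs over the nodes (input features for X, units for a hidden layer).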
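
As a sketch of the same idea across m examples: the loop below processes one column at a time, and the vectorized version handles all columns at once. The sizes (3 features, 4 hidden units, 5 examples) and the names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5                        # assumed sizes

X = rng.standard_normal((n_x, m))            # each column is one training example
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))

# Loop over examples: one column in, one column out.
A1_loop = np.zeros((n_h, m))
for i in range(m):
    x_i = X[:, i:i + 1]                      # keep it a (n_x, 1) column
    A1_loop[:, i:i + 1] = np.tanh(W1 @ x_i + b1)

# Vectorized: b1 is broadcast across the m columns.
Z1 = W1 @ X + b1                             # shape (n_h, m)
A1 = np.tanh(Z1)                             # row = hidden unit, column = example
```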

3.5 Explanation of the vectorized implementation

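3.6 Activation functions

tanh(z): its output is centered around 0, which acts like centering the data and makes learning easier for the next layer.

sigmoid: use it in the output layer for binary classification, where the output must lie between 0 and 1.

ReLU: the default choice for hidden layers; learning is faster.

leaky ReLU: max(0.01z, z); for negative z it keeps a small, gentle slope.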
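
The four functions above can be written in a few lines of NumPy. This is a minimal sketch; the function names and the sample input z are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # For negative z the output is slope * z instead of 0.
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 2.0])
outputs = {
    "sigmoid": sigmoid(z),
    "tanh": tanh(z),
    "relu": relu(z),
    "leaky_relu": leaky_relu(z),
}
```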

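3.7 Why non-linear activation functions are needed

Without them, the output is just a linear combination of the input features, so the hidden layers could be removed outright.

A linear activation function makes sense only in the output layer (for example in regression, where the output is a real number), or in hidden layers in some compression-related cases.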
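
A quick numerical check of the claim that linear hidden layers can be removed: two layers with identity activations collapse into a single linear layer with W = W2 W1 and b = W2 b1 + b2. The shapes and names below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 6))              # 3 features, 6 examples (assumed)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))

# Two layers, both with a linear (identity) activation.
A1 = W1 @ X + b1
Y_two_layers = W2 @ A1 + b2

# One equivalent linear layer.
W_eq = W2 @ W1                               # shape (1, 3)
b_eq = W2 @ b1 + b2                          # shape (1, 1)
Y_one_layer = W_eq @ X + b_eq
```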

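3.8 Derivatives of activation functions

sigmoid: g'(z) = g(z)(1 - g(z))

tanh: g'(z) = 1 - g(z)^2

ReLU: g'(z) = 0 for z < 0 and 1 for z > 0; at z = 0 it can be defined as either 0 or 1 (a subgradient).

leaky ReLU: g'(z) = 0.01 for z < 0 and 1 for z > 0; at z = 0 it can be defined as either 0.01 or 1.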
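
The derivative formulas can be sanity-checked against a central finite difference. A minimal sketch; the helper names are hypothetical, and the test points deliberately avoid z = 0, where ReLU and leaky ReLU are not differentiable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Convention chosen here: derivative 0 at z == 0.
    return (z > 0).astype(float)

def d_leaky_relu(z, slope=0.01):
    # Convention chosen here: derivative `slope` at z == 0.
    return np.where(z > 0, 1.0, slope)

def numeric_grad(f, z, eps=1e-6):
    # Central difference approximation of f'(z), element-wise.
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.array([-1.5, -0.3, 0.7, 2.0])
err_sigmoid = np.max(np.abs(d_sigmoid(z) - numeric_grad(sigmoid, z)))
err_tanh = np.max(np.abs(d_tanh(z) - numeric_grad(np.tanh, z)))
```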

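3.9 Gradient descent for neural networks

Initializing the parameters randomly, rather than setting them all to 0, is important.

keepdims=True keeps the reduced axis in the output (shape (n, 1) instead of (n,)), which avoids awkward rank-1 arrays.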
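
A small sketch of why keepdims=True matters when computing a bias gradient. The array dZ, its shape, and the learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
dZ = rng.standard_normal((4, 5))             # assumed: 4 units, 5 examples
m = dZ.shape[1]

db_rank1 = np.sum(dZ, axis=1) / m            # shape (4,): a rank-1 array
db = np.sum(dZ, axis=1, keepdims=True) / m   # shape (4, 1): a proper column

b = np.zeros((4, 1))
learning_rate = 0.1
b = b - learning_rate * db                   # stays (4, 1)

# Pitfall: (4, 1) minus (4,) silently broadcasts to (4, 4).
bad = np.zeros((4, 1)) - db_rank1
```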

3.10 Backpropagation intuition

Watch for the places where a transpose has to be added.

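Make sure the dimensions in every matrix operation match.

Work through the derivation of backpropagation for gradient descent yourself!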
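
The transposes and dimension checks can be seen in a compact sketch of backpropagation for a one-hidden-layer network. It assumes tanh hidden units, a sigmoid output, and the cross-entropy cost; the function names, data, sizes, and learning rate are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def cost(X, Y, params):
    _, A2 = forward(X, params)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, params):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1, A2 = forward(X, params)
    dZ2 = A2 - Y                                   # (1, m)
    dW2 = dZ2 @ A1.T / m                           # (1, n_h): transpose of A1
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (1, 1)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)             # (n_h, m): transpose of W2
    dW1 = dZ1 @ X.T / m                            # (n_h, n_x): transpose of X
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (n_h, 1)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(4)
X = rng.standard_normal((2, 8))                    # assumed: 2 features, 8 examples
Y = (rng.random((1, 8)) > 0.5).astype(float)
params = [rng.standard_normal((4, 2)) * 0.5, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1))]
grads = backward(X, Y, params)

# Plain gradient descent: every gradient has the shape of its parameter.
cost_before = cost(X, Y, params)
trained = params
for _ in range(200):
    step = backward(X, Y, trained)
    trained = [p - 0.1 * g for p, g in zip(trained, step)]
cost_after = cost(X, Y, trained)
```

Each gradient has exactly the shape of the parameter it updates, which is a convenient check that the transposes are in the right places.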

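3.11 Random initialization

If W is initialized to all zeros, every hidden unit produces the same output whatever the input and receives the same update. Two nodes then compute exactly the same function and stay completely symmetric, so having more than one is pointless.

The units need to differ so that they compute different functions.

The random initial weights are usually kept small; otherwise z easily lands in the flat regions of sigmoid or tanh, where the gradient is small, so gradient descent, and therefore learning, is slow.

For a shallow neural network a scale of 0.01 is fine; some cases call for a different value.

b can be initialized to 0 without causing any problem.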
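
The symmetry problem can be reproduced in a few lines. This sketch assumes a tanh hidden layer with a sigmoid output and a toy dataset; the `train` helper, the step count, and the learning rate are hypothetical choices, not the course's code.

```python
import numpy as np

def train(W1, W2, X, Y, steps=50, lr=0.5):
    """Gradient descent on a tanh-hidden, sigmoid-output net; biases start at 0."""
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # uses W2 before it is updated
        W2 = W2 - lr * (dZ2 @ A1.T) / m
        b2 = b2 - lr * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 = W1 - lr * (dZ1 @ X.T) / m
        b1 = b1 - lr * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1, W2

rng = np.random.default_rng(5)
X = rng.standard_normal((2, 20))
Y = (X[0:1] + X[1:2] > 0).astype(float)      # toy labels, assumed for the demo

# All-zero weights: every hidden unit stays identical (here W1 never leaves 0).
W1_zero, _ = train(np.zeros((4, 2)), np.zeros((1, 4)), X, Y)

# Small random weights: the units start different and stay different.
W1_rand, _ = train(rng.standard_normal((4, 2)) * 0.01,
                   rng.standard_normal((1, 4)) * 0.01, X, Y)
```

With zero weights and tanh units, A1 is 0 and W2 is 0, so both dW1 and dW2 vanish and only b2 ever changes.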

[Quiz]

Questions I got wrong:

7. Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”, True/False?

False. Logistic Regression doesn't have a hidden layer. If you initialize the weights to zeros, the first example x fed in the logistic regression will output zero but the derivatives of the Logistic Regression depend on the input x (because there's no hidden layer) which is not zero. So at the second iteration, the weights values follow x's distribution and are different from each other if x is not a constant vector.
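
The explanation above can be checked numerically with a sketch of one gradient step for logistic regression starting from zero weights. The data, labels, and learning rate are made up; with w = 0 the pre-activation z is 0 (so every prediction is 0.5), yet the gradient still depends on x.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((3, 10))             # assumed: 3 features, 10 examples
Y = (X[0:1] - X[2:3] > 0).astype(float)      # toy labels
m = X.shape[1]

w = np.zeros((3, 1))                         # all-zero initialization
b = 0.0

A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))     # every prediction is 0.5
dw = X @ (A - Y).T / m                       # (3, 1), depends on X, so non-zero
w_after_one_step = w - 0.1 * dw              # entries now differ from each other
```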

9. 10.

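W1: (4, 2)  b1: (4, 1)

W2: (1, 4)  b2: (1, 1)

Z1 and A1: (4, m), from np.dot(W1, X): (4, 2) times (2, m)

These two questions are a great help in understanding how the matrix dimensions change during propagation.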

Key questions

1. $$a^{[2]}_4$$ is the activation output by the $$4^{th}$$ neuron of the $$2^{nd}$$ layer

$$a^{[2](12)}$$ denotes the activation vector of the $$2^{nd}$$ layer for the $$12^{th}$$ training example.

$$X$$ is a matrix in which each column is one training example.

5. A = np.random.randn(4,3)

B = np.sum(A, axis = 1, keepdims = True)  # summing along axis 1 collapses the columns into a single column

B.shape=(4, 1)

8. You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?

A: This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.

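If the weights are set too large, z jumps straight out to the far ends of tanh, where the curve is nearly flat and grows slowly.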
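
A rough sketch of the saturation effect: it compares the average tanh gradient, 1 - tanh(z)^2, for weights scaled by 0.01 and by 1000. The input data, layer sizes, and the helper name `mean_tanh_grad` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((3, 100))            # assumed: 3 features, 100 examples

def mean_tanh_grad(scale):
    """Average of 1 - tanh(z)^2 over a layer whose weights are randn * scale."""
    W = rng.standard_normal((4, 3)) * scale
    Z = W @ X
    return float(np.mean(1.0 - np.tanh(Z) ** 2))

grad_small = mean_tanh_grad(0.01)            # z near 0: gradient close to 1
grad_large = mean_tanh_grad(1000)            # z far out: gradient close to 0
```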
