Over the past couple of days I read the first chapter of the online book Neural Networks and Deep Learning and watched Stanford's open course on Machine Learning, studying the two main neuron models used in neural networks and an important algorithm in machine learning: stochastic gradient descent. A summary follows.
For a computational model to count as a neural network, it generally needs a large number of interconnected nodes (neurons) with two characteristics:
1. Each neuron processes the weighted inputs it receives from neighboring neurons through some particular output function (also called an activation function).
2. The strength of the signal passed between neurons is defined by so-called weights, and the algorithm keeps learning on its own, adjusting these weights.
On this basis, the neural network model is trained on large amounts of data.
A few concepts:
cost function: quantitatively measures how far the output computed for a given input deviates from the correct value
learning algorithm: self-corrects based on the value of the cost function, so as to find the optimal weights between neurons as quickly as possible
Perceptron neurons:

Figure 1: Perceptron neuron
Here x1, x2, x3 are the inputs, which must be binary (0 or 1), and the output is likewise binary. The weights w are the key, and the hard part, of the design. The perceptron computes its output as follows:

\[
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\
1 & \text{if } \sum_j w_j x_j > \text{threshold}
\end{cases}
\]

Simplifying, with w and x denoting the weight and input vectors and the bias defined as b = -threshold:

\[
\text{output} =
\begin{cases}
0 & \text{if } w \cdot x + b \le 0 \\
1 & \text{if } w \cdot x + b > 0
\end{cases}
\]
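As a small illustrative sketch (not part of the book's code), the bias form of the perceptron rule can be written directly in Python; the weights, bias, and inputs below are made up for the example:

import numpy as np

def perceptron(x, w, b):
    # Perceptron rule: output 1 if w . x + b > 0, otherwise 0
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical example: two binary inputs, weights 2 and -1, bias b = -1
print(perceptron(np.array([1, 0]), np.array([2.0, -1.0]), -1.0))  # prints 1
print(perceptron(np.array([0, 1]), np.array([2.0, -1.0]), -1.0))  # prints 0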
Sigmoid neurons:

Figure 2: Sigmoid neurons

Comparing perceptron neurons with sigmoid neurons, their structure is the same, but the values they work with differ: the inputs of a sigmoid neuron can take any value between 0 and 1, and the output is not 0 or 1 but \(\sigma(w \cdot x + b)\), where \(\sigma\) is called the sigmoid function, defined as:

\[ \sigma(z) \equiv \frac{1}{1 + e^{-z}} \]

So for inputs x1, x2, ..., weights w1, w2, ..., and bias b, the output of a sigmoid neuron is:

\[ \frac{1}{1 + \exp\!\left(-\sum_j w_j x_j - b\right)} \]

From this formula we can plot the response curve of the sigmoid function, shown below.
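As a quick sketch (not from the book), the sigmoid can be evaluated with NumPy to see the smooth S-shaped curve the plot refers to; the sample points are arbitrary:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)); squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Values near 0 for very negative z, 0.5 at z = 0, and near 1 for large z
z = np.linspace(-6, 6, 5)
print(sigmoid(z))  # approximately [0.0025, 0.0474, 0.5, 0.9526, 0.9975]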
The architecture of neural networks
As shown in the figure above, a neural network consists of an input layer, an output layer, and one or more hidden layers. Such multilayer networks are called multilayer perceptrons, or MLPs.
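As an illustrative sketch, the list of layer sizes determines the shapes of an MLP's weight matrices and bias vectors (this mirrors how the Network class further below initializes itself); the [2, 3, 1] sizes are just an example:

import numpy as np

# A three-layer MLP: 2 input neurons, a hidden layer of 3, and 1 output neuron
sizes = [2, 3, 1]

# One bias vector per non-input layer; one weight matrix per pair of adjacent
# layers (rows = neurons in the later layer, columns = neurons in the earlier one)
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

print([w.shape for w in weights])  # [(3, 2), (1, 3)]
print([b.shape for b in biases])   # [(3, 1), (1, 1)]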
Gradient descent:
To check whether, for all training inputs x, the weights and biases we have chosen give outputs approximately equal to y(x), we use a cost function (also called a loss or objective function):

\[ C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2 \tag{6} \]

Here w denotes the collection of all weights in the network, b all the biases, n the total number of training inputs, and a the vector of outputs from the network (which depends on x, w, and b).
If C(w, b) ≈ 0, then y(x) is approximately equal to the output a for every training input x, which is exactly what we want.
If C(w, b) is large, then for many inputs y(x) is far from the output a.
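As an illustrative sketch (not part of the book's code), the quadratic cost in equation (6) can be computed directly with NumPy; the column-vector shapes and sample values below are assumptions of this example:

import numpy as np

def quadratic_cost(targets, outputs):
    # C(w, b) = 1/(2n) * sum over inputs of ||y(x) - a||^2
    n = len(targets)
    return sum(np.linalg.norm(y - a) ** 2
               for y, a in zip(targets, outputs)) / (2.0 * n)

# One made-up training input: desired output y and network output a
y = [np.array([[1.0], [0.0]])]
a = [np.array([[0.8], [0.1]])]
print(quadratic_cost(y, a))  # approximately 0.025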
The goal of the training algorithm is to minimize C(w, b); in other words, we want to find a set of weights w and biases b that makes C(w, b) as small as possible.
The algorithm we use for this is gradient descent.
We want to find the lowest point in the figure above. The tool is the gradient from calculus, which points in the direction of greatest rate of change, i.e., it tells us along which direction C(w, b) decreases fastest. This is the core idea of gradient descent. (Here v1 and v2 stand in for w and b.)
Let Δv1 and Δv2 denote small changes in the v1 and v2 directions, and ΔC the resulting change in C(v1, v2):

\[ \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \tag{7} \]

The idea now is to choose Δv1 and Δv2 so that ΔC is negative, so that C keeps moving toward smaller values.
Define the gradient vector:

\[ \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^{T} \]

Equation (7) can then be rewritten as:

\[ \Delta C \approx \nabla C \cdot \Delta v \]

We choose

\[ \Delta v = -\eta \nabla C \]

where η is a small positive parameter (called the learning rate).
Substituting this back gives:

\[ \Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2 \le 0 \]

We can then update v over and over:

\[ v \to v' = v - \eta \nabla C \]
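As a minimal sketch of the update rule v → v' = v - η∇C, here it is applied to a toy cost C(v1, v2) = v1^2 + v2^2, whose gradient is known in closed form (this toy cost is only for illustration, not the network's cost):

import numpy as np

def grad_C(v):
    # Gradient of the toy cost C(v1, v2) = v1^2 + v2^2
    return 2 * v

eta = 0.1                    # learning rate
v = np.array([3.0, -2.0])    # arbitrary starting point
for step in range(50):
    v = v - eta * grad_C(v)  # v -> v' = v - eta * grad C

print(v)  # very close to the minimum at [0, 0]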
How is gradient descent applied to a neural network? We use it to repeatedly adjust the weights w and biases b so that the cost function (6) is minimized. The update rules are:

\[ w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \]
\[ b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l} \]
Stochastic gradient descent (SGD)
Plain gradient descent learns very slowly when the number of training inputs is large, because every update requires the gradient over all of them. To speed learning up, a new algorithm, stochastic gradient descent, is used instead: it estimates the gradient ∇C from a small, randomly chosen sample of training inputs,

\[ \nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j} \]

where m is the number of randomly chosen training inputs, and the samples X1, X2, ..., Xm are called a mini-batch.
This yields the update rules:

\[ w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \]
\[ b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \]
Code implementing the network above for simple handwritten digit recognition:
"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network. Gradients are calculated
using backpropagation. Note that I have focused on making the code
simple, easily readable, and easily modifiable. It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network. For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron. The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1. Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent. The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs. The other non-optional parameters are
        self-explanatory. If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out. This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x. ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book. Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on. It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
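To actually train this network on MNIST, the usage from chapter 1 of the book looks roughly like the following. It assumes the mnist_loader module that ships with the book's code repository is available alongside network.py, and, like network.py itself, it is written for Python 2 (the code above uses xrange and print statements):

import mnist_loader
import network

# Load MNIST as lists of (input, label) pairs formatted for this network
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

# 784 input pixels (28x28 images), one hidden layer of 30 neurons, 10 output classes
net = network.Network([784, 30, 10])

# 30 epochs, mini-batch size 10, learning rate eta = 3.0,
# reporting accuracy on the test data after each epoch
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)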
Time to keep charging ahead: object-oriented programming in C++ / Python, and building a systematic picture of neural networks!