Reading Notes - Binarized Neural Networks

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1


M. Courbariaux, et al., Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1, (2016).


Abstract

Binarized Neural Networks (BNNs): at inference (run-time), both the weights and the activations are binary; during training, the binary weights and activations are used when computing the parameter gradients.

A method for training BNNs is proposed.

During the forward pass, BNNs drastically reduce memory size and accesses, and most arithmetic operations become bit-wise operations, which greatly improves power efficiency.

Introduction

Binarized Neural Networks (BNNs) are networks in which, at inference (run-time), both the weights and the activations are binary, and in which the binary weights and activations are used when computing the parameter gradients during training.

1 Binarized Neural Networks

1.1 Deterministic vs. Stochastic Binarization

When training a BNN, the weights and activations are constrained to $\pm 1$.

Binarization functions:

(1) Deterministic binarization function:

$$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{1}$$

(2) Stochastic binarization function:

$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x) \\ -1 & \text{with probability } 1 - p \end{cases} \tag{2}$$

where $\sigma$ is the "hard sigmoid" function:

$$\sigma(x) = \mathrm{clip}\left(\frac{x + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right)$$

The stochastic binarization function performs better than the deterministic one, but it requires the hardware to generate random bits, which makes it harder to apply in practice.
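A minimal NumPy sketch of the two binarization functions (the function names and the use of NumPy are illustrative assumptions, not the paper's code):

```python
import numpy as np

def binarize_deterministic(x):
    """Deterministic binarization (Eq. 1): +1 if x >= 0, else -1."""
    return np.where(x >= 0, 1.0, -1.0)

def hard_sigmoid(x):
    """Hard sigmoid: clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stochastic(x, rng):
    """Stochastic binarization (Eq. 2): +1 with probability hard_sigmoid(x), else -1."""
    return np.where(rng.random(np.shape(x)) < hard_sigmoid(x), 1.0, -1.0)

rng = np.random.default_rng(0)
x = np.array([-1.5, -0.2, 0.0, 0.4, 2.0])
print(binarize_deterministic(x))      # [-1. -1.  1.  1.  1.]
print(binarize_stochastic(x, rng))    # random, biased towards the sign of x
```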

1.2 Gradient Computation and Accumulation

The gradients of the weights are real-valued and are accumulated in real-valued variables.

Stochastic Gradient Descent (SGD) explores the parameter space in small, noisy steps; accumulating the stochastic gradient contributions in each weight averages out the noise.

Adding noise to the weights and activations when computing the parameter gradients acts as a form of regularization and helps the model generalize better.

The BNN training method can be seen as a variant of Dropout: instead of randomly setting activations to zero, both the weights and the activations are binarized.

1.3 Propagating Gradients Through Discretization

Sign function quantization:

$$q = \mathrm{Sign}(r)$$

Assume an estimator $g_q$ of the gradient $\frac{\partial C}{\partial q}$ is known; then the straight-through estimator of the gradient $\frac{\partial C}{\partial r}$ is

$$g_r = g_q 1_{|r| \leq 1} \tag{4}$$

This estimator preserves the gradient information, but cancels the gradient when $|r|$ is too large.
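A minimal sketch of the straight-through estimator, assuming NumPy and illustrative function names:

```python
import numpy as np

def sign_forward(r):
    """Forward pass through the quantizer: q = Sign(r)."""
    return np.where(r >= 0, 1.0, -1.0)

def sign_backward(g_q, r):
    """Straight-through estimator (Eq. 4): g_r = g_q * 1_{|r| <= 1}.
    The incoming gradient g_q passes through where |r| <= 1 and is cancelled elsewhere."""
    return g_q * (np.abs(r) <= 1.0)
```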

  • Algorithm 1: Training a BNN

$C$: the loss function for the minibatch
$\lambda$: the learning-rate decay factor
$L$: the number of layers
$\circ$: element-wise multiplication.

Binarize(): how to binarize the weights and activations (deterministic or stochastic);
Clip(): how to clip the weights;
BatchNorm(): how to batch-normalize the activations;
BackBatchNorm(): how to back-propagate the gradient through the normalization layer;
Update(): how to update the parameters once their gradients are known (e.g. ADAM or AdaMax).


The derivative $1_{|r| \leq 1}$ can be viewed as propagating the gradient through a "hard tanh", the piece-wise linear activation function:

$$\mathrm{Htanh}(x) = \mathrm{Clip}(x, -1, 1) = \max(-1, \min(1, x)) \tag{5}$$

A hidden unit obtains its binary activation through the sign-function non-linearity, and its weights are handled in two steps:

(1) Constrain the real-valued weights to lie between $-1$ and $+1$: whenever a weight update pushes $w^r$ outside $[-1, +1]$, project it back onto $-1$ or $+1$, i.e. clip the weights during training.

(2) Whenever the weight $w^r$ is used, binarize it: $w^b = \mathrm{Sign}(w^r)$.
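A short sketch of this two-step weight handling for one update (the names `w_real`, `grad`, and `lr` are illustrative, and plain SGD is used here, whereas the paper's Algorithm 1 uses ADAM/AdaMax):

```python
import numpy as np

def binarize_weights(w_real):
    """Step (2): whenever the weights are used, binarize them: w^b = Sign(w^r)."""
    return np.where(w_real >= 0, 1.0, -1.0)

def update_real_weights(w_real, grad, lr):
    """Step (1): update the real-valued weights, then clip them back into [-1, +1]."""
    w_real = w_real - lr * grad            # gradients are accumulated in real-valued variables
    return np.clip(w_real, -1.0, 1.0)      # project w^r back onto [-1, +1]
```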

1.4 Shift-based Batch Normalization

  • Algorithm 2: Shift-based Batch Normalization (SBN)

$AP2(x) = \mathrm{Sign}(x) \times 2^{\mathrm{round}(\log_2 |x|)}$: approximate power-of-two;
$\ll \gg$: both left and right binary shift.

[Note: the formula as written by the authors contains an error.]

Derivation:

$$x = \mathrm{Sign}(x) \, 2^{\log_2 |x|} \approx \mathrm{Sign}(x) \times 2^{\mathrm{round}(\log_2 |x|)}$$
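A tiny NumPy illustration of this AP2 ("approximate power of two") operation; the helper name is an assumption, and zero inputs are not handled:

```python
import numpy as np

def ap2(x):
    """AP2(x) = Sign(x) * 2^round(log2|x|): the closest power of two, keeping the sign of x.
    Multiplying by AP2(x) can then be implemented as a left/right binary shift."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

print(ap2([0.23, -3.1, 6.0]))   # [ 0.25 -4.    8.  ]
```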

(1) BatchNorm:

Mean: $\mu_B = \frac{1}{m} \sum_{i = 1}^{m} x_i$
Variance: $\sigma_B^2 = \frac{1}{m} \sum_{i = 1}^{m} (x_i - \mu_B)^2$
Normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sigma_B}$
Scale and shift: $y_i = \gamma \hat{x}_i + \beta$

(2) Shift-based BatchNorm:

Let $C(x_i) = x_i - \mu_B$, so that $\sigma_B^2 = \frac{1}{m} \sum_{i = 1}^{m} C^2(x_i)$.

Approximate the squaring operation $C^2(x_i)$ with a shift:

$$\begin{aligned} C^2(x_i) &= C(x_i) \times \mathrm{Sign}(C(x_i)) \, 2^{\log_2 |C(x_i)|} \\ &\approx C(x_i) \times \mathrm{Sign}(C(x_i)) \, 2^{\mathrm{round}(\log_2 |C(x_i)|)} \\ &= |C(x_i)| \, 2^{\mathrm{round}(\log_2 |C(x_i)|)} \\ &= |C(x_i)| \ll\gg \mathrm{round}(\log_2 |C(x_i)|) \end{aligned}$$

Approximate variance:

$$\sigma_B^2 \approx \frac{1}{m} \sum_{i = 1}^{m} |C(x_i)| \ll\gg \mathrm{round}(\log_2 |C(x_i)|)$$

Approximate the division $C(x_i) / \sigma_B$ with a shift:

$$\begin{aligned} \frac{C(x_i)}{\sigma_B} &= \frac{C(x_i)}{\mathrm{Sign}(\sigma_B) \, 2^{\log_2 |\sigma_B|}} \\ &\approx \frac{C(x_i)}{\mathrm{Sign}(\sigma_B) \, 2^{\mathrm{round}(\log_2 |\sigma_B|)}} \\ &= \frac{C(x_i)}{2^{\mathrm{round}(\log_2 \sigma_B)}} \\ &= C(x_i) \ll\gg \mathrm{round}(\log_2 \sigma_B) \end{aligned}$$

Normalize:

$$\hat{x}_i \approx C(x_i) \ll\gg \mathrm{round}(\log_2 \sigma_B)$$

Scale and shift:

$$y_i = \mathrm{Sign}(\gamma) \, \hat{x}_i \ll\gg \mathrm{round}(\log_2 |\gamma|) + \beta$$
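Putting the derivation together, a minimal sketch of shift-based batch normalization on one minibatch, assuming NumPy; `gamma` and `beta` are the learnable scale/shift parameters, zero variance is not handled, and the usual numerical-stability epsilon is omitted:

```python
import numpy as np

def ap2(x):
    """Closest power of two with the sign of x."""
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

def shift_based_batchnorm(x, gamma, beta):
    """Shift-based BatchNorm: every multiplication/division is replaced by a
    multiplication with a signed power of two, i.e. a binary shift."""
    c = x - x.mean(axis=0)                                  # centered inputs C(x_i)
    var = np.mean(c * ap2(c), axis=0)                       # sigma_B^2 ~ mean of C(x_i) * AP2(C(x_i))
    x_hat = c * 2.0 ** (-np.round(np.log2(np.sqrt(var))))   # divide by sigma_B via a shift
    return ap2(gamma) * x_hat + beta                        # scale by AP2(gamma), then add beta
```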

1.5 Shift-based AdaMax

  • Algorithm 4: Shift-based AdaMax optimizer

$g_t^2$: element-wise square, $g_t \circ g_t$.
Default settings: $\alpha = 2^{-10}$, $1 - \beta_1 = 2^{-3}$, $1 - \beta_2 = 2^{-10}$.
All vector operations are element-wise.
$\beta_1^t$ and $\beta_2^t$ denote $\beta_1$ and $\beta_2$ raised to the power $t$.
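A hedged sketch of the idea behind shift-based AdaMax: standard AdaMax in which the scalings are replaced by AP2 powers of two, so they become shifts. This is a sketch under that assumption, not a line-by-line transcription of Algorithm 4, and the names are illustrative:

```python
import numpy as np

def ap2(x):
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

def shift_adamax_step(theta, g, m, v, t,
                      alpha=2.0**-10, beta1=1.0 - 2.0**-3, beta2=1.0 - 2.0**-10):
    """One optimizer step. m is the biased first moment, v the exponentially
    weighted infinity norm of the gradients; v is assumed strictly positive."""
    m = beta1 * m + (1.0 - beta1) * g        # 1 - beta1 = 2^-3, i.e. a shift of g
    v = np.maximum(beta2 * v, np.abs(g))     # infinity-norm second moment
    step = ap2(alpha / (1.0 - beta1 ** t)) * m * ap2(1.0 / v)   # all scalings via AP2 shifts
    return theta - step, m, v
```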


1.6 The First Layer

Within two adjacent layers, the output of the first is the input of the second, so the inputs of every layer except the first are binary.

Continuous-valued inputs can be handled as fixed-point numbers; with 8-bit fixed-point inputs, the weighted sum can be written as:

$$s = x \cdot w^b$$

$$s = \sum_{n = 1}^{8} 2^{n - 1} (x^n \cdot w^b)$$

where $x$ is a vector of 1024 8-bit inputs, $x_1^8$ is the most significant bit of the first input, $w^b$ is a vector of 1024 1-bit weights, and $s$ is the resulting weighted sum.
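A small NumPy check of this bit-plane decomposition (the 1024-element sizes follow the example above; the function name is illustrative):

```python
import numpy as np

def first_layer_dot(x_uint8, w_bin):
    """s = sum_{n=1}^{8} 2^(n-1) * (x^n . w^b), where x^n is the vector of n-th bits."""
    s = 0.0
    for n in range(1, 9):
        bit_n = (x_uint8 >> (n - 1)) & 1                 # n-th bit plane of the inputs
        s += 2.0 ** (n - 1) * np.dot(bit_n, w_bin)       # one binary dot product per bit plane
    return s

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=1024, dtype=np.uint8)      # 1024 8-bit fixed-point inputs
w = np.where(rng.random(1024) < 0.5, -1.0, 1.0)          # 1024 1-bit (+1/-1) weights
assert np.isclose(first_layer_dot(x, w), np.dot(x.astype(np.float64), w))
```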

  • Algorithm 5: BNN inference


2 Benchmark Results

2.1 Multi-Layer Perceptron (MLP) on MNIST (Theano)


2.2 MLP on MNIST (Torch7)

2.3 ConvNet on CIFAR-10 (Theano)


2.4 ConvNet on CIFAR-10 (Torch7)

2.5 ConvNet on SVHN


3 Very Power Efficient in the Forward Pass

3.1 Memory Size and Accesses


3.2 XNOR-Count

In a BNN, most arithmetic operations of the forward pass are replaced by 1-bit XNOR-count operations (an XNOR followed by a population count).
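A pure-Python sketch of the XNOR-count dot product for two $\pm 1$ vectors packed into integers (the packing scheme and names are illustrative, not the paper's GPU kernel):

```python
import numpy as np

def pack(v):
    """Pack a {-1,+1} vector into an integer: bit i is 1 iff v[i] == +1."""
    return sum(1 << i for i, s in enumerate(v) if s > 0)

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors of length n from their packed bits:
    dot = 2 * popcount(XNOR(a, b)) - n."""
    agree = ~(a_bits ^ b_bits) & ((1 << n) - 1)    # 1 where the two signs agree
    return 2 * bin(agree).count("1") - n

rng = np.random.default_rng(0)
a = np.where(rng.random(64) < 0.5, -1.0, 1.0)
b = np.where(rng.random(64) < 0.5, -1.0, 1.0)
assert xnor_popcount_dot(pack(a), pack(b), 64) == int(np.dot(a, b))
```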

3.3 Exploiting Filter Repetitions

The number of unique binary convolution filters is bounded by the filter size; for example, a $3 \times 3$ binary filter can take only $2^9 = 512$ distinct values.

4 Seven Times Faster on GPU at Run-Time


5 Discussion and Related Work