[Paper Reading Notes] Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

2. Convolutional Neural Networks

2.2.1 Normalization layers

  Normalization layers (LRN, BN) require a very large dynamic range for their intermediate values (it is the intermediate values, not the parameters, that need it). In AlexNet, for example, the intermediate values of the LRN layers are $2^{14}$ times larger than any intermediate value from another layer. For this reason the thesis assumes LRN and BN layers are implemented in floating point, and it concentrates on approximating the other layer types.

2.4 Computational Complexity and Memory Requirements

  The complexity of deep CNNs can be split into two parts: the convolutional layers contain more than 90% of the required arithmetic operations, while the fully connected layers contain over 90% of the network parameters.

  In other words, feature extraction in the convolutional layers is compute-intensive, while the fully connected layers are memory-intensive.
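  As a quick back-of-the-envelope check (the layer shapes below are approximate AlexNet dimensions and are my own assumption, not numbers from the thesis), a few lines of Python reproduce this 90/90 split:

```python
# Rough operation/parameter counts for AlexNet-like layers (shapes are
# approximate and only meant to illustrate the 90/90 split).
conv_layers = [
    # (out_h, out_w, out_ch, in_ch, kernel)
    (55, 55, 96, 3, 11),
    (27, 27, 256, 96, 5),
    (13, 13, 384, 256, 3),
    (13, 13, 384, 384, 3),
    (13, 13, 256, 384, 3),
]
fc_layers = [(9216, 4096), (4096, 4096), (4096, 1000)]  # (in, out)

conv_macs = sum(h * w * co * ci * k * k for h, w, co, ci, k in conv_layers)
conv_params = sum(co * ci * k * k for _, _, co, ci, k in conv_layers)
fc_macs = sum(i * o for i, o in fc_layers)
fc_params = fc_macs  # one weight per multiply in a fully connected layer

print(f"conv share of MACs:   {conv_macs / (conv_macs + fc_macs):.1%}")
print(f"fc   share of params: {fc_params / (conv_params + fc_params):.1%}")
```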

2.6 Neural Networks With Limited Numerical Precision

  Most deep learning frameworks use 32-bit or 64-bit floating point for CNN training and inference. (This thesis describes the process of quantizing a full-precision network to limited-precision numbers.)

2.6.1 Quantization


   As shown in the data-path figure of the paper, the framework quantizes both the layer inputs and the weights to simplify hardware simulation, so the multiplication units require less area, but it accumulates the products in 32-bit floating point. Adders are much cheaper than multipliers in terms of area and power.
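   A minimal NumPy sketch of such a simulated data path (my own illustration, not Ristretto's actual code; `quantize_fixed` and its bit widths are assumptions): operands are snapped to a fixed-point grid before the multiply, while the accumulator stays in 32-bit floating point.

```python
import numpy as np

def quantize_fixed(x, frac_bits=8, bit_width=16):
    """Quantize to a signed fixed-point grid with `frac_bits` fractional bits."""
    step = 2.0 ** -frac_bits
    max_val = (2 ** (bit_width - 1) - 1) * step
    return np.clip(np.round(x / step) * step, -max_val, max_val)

def quantized_dot(inputs, weights, frac_bits=8):
    """Multiply quantized operands, accumulate in float32 (adders stay cheap)."""
    q_in = quantize_fixed(inputs, frac_bits).astype(np.float32)
    q_w = quantize_fixed(weights, frac_bits).astype(np.float32)
    acc = np.float32(0.0)
    for a, b in zip(q_in, q_w):
        acc += a * b  # products of low-precision values, fp32 accumulator
    return acc

rng = np.random.default_rng(0)
x, w = rng.normal(size=64), rng.normal(size=64)
print(quantized_dot(x, w), float(x @ w))  # close, but not bit-identical
```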


2.6.2 Rounding Schemes

Round nearest even

$$round(x) = \begin{cases} \lfloor x \rfloor, & \text{if } \lfloor x \rfloor \leq x \leq \lfloor x \rfloor + \frac{\epsilon}{2} \\ \lfloor x \rfloor + \epsilon, & \text{if } \lfloor x \rfloor + \frac{\epsilon}{2} < x \leq \lfloor x \rfloor + \epsilon \end{cases}$$

  Here $\epsilon$ denotes the quantization step size and $\lfloor x \rfloor$ the largest quantization value less than or equal to $x$.
  As round-nearest-even is deterministic, this rounding scheme is chosen for inference.
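  A small Python sketch of the deterministic formula above (the tie at exactly $\lfloor x \rfloor + \frac{\epsilon}{2}$ goes to the lower grid point, as written; the function name is mine):

```python
import math

def round_nearest(x, eps):
    """Deterministic rounding to the grid {k*eps}: follows the formula above,
    with a value exactly at floor(x)+eps/2 going to the lower grid point."""
    floor_x = math.floor(x / eps) * eps
    return floor_x if x <= floor_x + eps / 2 else floor_x + eps

print(round_nearest(0.26, 0.25))  # 0.25
print(round_nearest(0.40, 0.25))  # 0.5
```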

Round stochastic

$$round(x) = \begin{cases} \lfloor x \rfloor, & \text{w.p. } 1-\frac{x-\lfloor x \rfloor}{\epsilon} \\ \lfloor x \rfloor + \epsilon, & \text{w.p. } \frac{x-\lfloor x \rfloor}{\epsilon} \end{cases}$$

  Stochastic rounding adds randomness to the quantization procedure, which can have an averaging effect during training.
  We chose to use this rounding scheme when quantizing network parameters during fine-tuning.
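  A corresponding sketch of stochastic rounding; the key property is that the expected value of the rounded number equals $x$, which is what produces the averaging effect during training:

```python
import math, random

def round_stochastic(x, eps):
    """Round x to the grid {k*eps} with probability proportional to proximity;
    E[round_stochastic(x)] == x, which gives an averaging effect in training."""
    floor_x = math.floor(x / eps) * eps
    p_up = (x - floor_x) / eps  # probability of rounding up
    return floor_x + eps if random.random() < p_up else floor_x

random.seed(0)
samples = [round_stochastic(0.30, 0.25) for _ in range(100000)]
print(sum(samples) / len(samples))  # ~0.30 on average
```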

2.6.3 Optimization in Discrete Parameter Space

  In the traditional setting of 64-bit floating point training, this optimization is a continuous problem with a smooth error surface.
  For quantized networks, this error surface becomes discrete.

  There are two ways to train a quantized network:

  1. Train the network with quantized parameters right from the start.
  2. Train the network in the continuous domain, then quantize the parameters, and finally fine-tune in the discrete parameter space.

  Parameters are updated via full-precision shadow weights: the small weight updates $\Delta w$ are applied to the full-precision weights $w$, whereas the discrete weights $w'$ are sampled (quantized) from the full-precision weights.
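  A toy sketch of this shadow-weight scheme (function and variable names are mine, and the quantizer is a simple round-to-nearest stand-in): the gradient is evaluated at the discrete weights $w'$, but the update is accumulated in the full-precision copy $w$.

```python
import numpy as np

def quantize(w, eps=2.0 ** -4):
    """Map full-precision weights onto the discrete grid (round to nearest)."""
    return np.round(w / eps) * eps

def sgd_step_with_shadow_weights(w_full, grad_fn, lr=0.01):
    """One training step: forward/backward uses the discrete weights w',
    but the small update lr*grad is applied to the full-precision shadow copy."""
    w_discrete = quantize(w_full)  # w' sampled from the full-precision w
    grad = grad_fn(w_discrete)     # gradient evaluated at w'
    w_full -= lr * grad            # update accumulates in full precision
    return w_full, w_discrete

# Toy example: minimize ||w||^2 with quantized forward passes.
w = np.random.default_rng(0).normal(size=5)
for _ in range(100):
    w, w_q = sgd_step_with_shadow_weights(w, grad_fn=lambda wq: 2 * wq)
print(w_q)
```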

3. Network Approximation

3.1.1 Fixed Point Approximation

  Traditionally neural networks are trained in 32-bit floating point.
  Several approaches have been proposed to save resources compared to 32-bit floating point:

  1. Fixed-point arithmetic, which is less resource-hungry than floating-point arithmetic (see the sketch after this list).
  2. Binary weights, i.e. weights constrained to +1 and -1 (a similar proposal represents the weights of the same network with +1, 0 and -1).
    While these fixed-point schemes work well with small networks, only limited work has been done on large CNNs like AlexNet.
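  For the fixed-point option, a minimal sketch of mapping floats to a Qm.n representation (the bit split and names are my own choice for illustration):

```python
import numpy as np

def to_fixed_point(x, int_bits=4, frac_bits=4):
    """Approximate a float array in Qm.n fixed point: `int_bits` integer bits
    (including sign) and `frac_bits` fractional bits, with saturation."""
    scale = 2 ** frac_bits
    qmin = -(2 ** (int_bits + frac_bits - 1))
    qmax = 2 ** (int_bits + frac_bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)  # integer representation
    return q / scale                              # value it stands for

w = np.array([-3.7, -0.26, 0.1, 1.96, 9.0])
print(to_fixed_point(w))  # 9.0 saturates at the largest Q4.4 value (7.9375)
```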

3.1.2 Network Pruning and Shared Weights

  Off-chip memory access accounts for a significant part of the total energy budget of any data-intensive application, so an important step is to compress the network parameters.
   The compression process can be divided into a three-step pipeline (a toy sketch of the first two steps follows the list):

  1. Remove the ‘unimportant’ connections of a trained network. The resulting sparse network is then retrained to regain its classification accuracy, and the pruning step is repeated.
  2. Cluster the remaining parameters to form shared weights. These shared weights are again fine-tuned to find optimal centroids.
  3. Apply a lossless data compression scheme (Huffman coding) to the final weights.
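  A toy sketch of steps 1 and 2 (magnitude pruning and k-means weight sharing), assuming scikit-learn's `KMeans` for the clustering; the retraining/fine-tuning passes and the Huffman coding of step 3 are omitted:

```python
import numpy as np
from sklearn.cluster import KMeans  # used here just to illustrate weight sharing

def prune(weights, threshold=0.1):
    """Step 1: drop 'unimportant' connections whose magnitude is below threshold."""
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def share_weights(weights, mask, n_clusters=16):
    """Step 2: cluster the surviving weights; each weight is replaced by its
    cluster centroid, so only an index per weight plus a small codebook is stored."""
    surviving = weights[mask].reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(surviving)
    shared = np.zeros_like(weights)
    shared[mask] = km.cluster_centers_[km.labels_].ravel()
    return shared, km.cluster_centers_.ravel()

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(64, 64))
w_pruned, mask = prune(w)
w_shared, codebook = share_weights(w_pruned, mask)
print(f"sparsity: {1 - mask.mean():.1%}, codebook size: {codebook.size}")
```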

3.1.3 Binary Networks

  Since memory access has a relatively high energy cost, it is desirable to reduce the network parameter size.

   The first such work, BinaryConnect, represents the weights in binary format instead of the traditional 32-bit floating point. (This approach reduces the parameter size by a factor of 32 and removes the need for multiplications in the forward path. BinaryConnect achieves near state-of-the-art performance on 10-class datasets.)

  A follow-up idea turns the multiplications in backward propagation into bit shifts, which significantly reduces the hardware requirements of accelerators.

  Combining the two previous ideas, the ‘Binarized Neural Network’ uses binary weights and binary layer activations, constrained to +1 and -1 for both forward and backward propagation. Convolutional neural networks consist mainly of multiply-accumulate operations; in a binarized network these are replaced by binary XNOR and bit-count operations. It also uses shift-based batch normalization and a shift-based parameter update.
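  To make the XNOR/bit-count replacement concrete, here is a toy dot product between two ±1 vectors packed as bitmasks (entirely my own illustration, not code from the paper):

```python
def pack(vec):
    """Pack a list of +1/-1 values into an integer bitmask (bit=1 means +1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed ±1 vectors: XNOR counts matching signs,
    so dot = 2*popcount(xnor) - n."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 where the signs agree
    return 2 * bin(xnor).count("1") - n

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
print(binary_dot(pack(a), pack(b), len(a)))  # 2
print(sum(x * y for x, y in zip(a, b)))      # 2 (reference result)
```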