【1】Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation -- NIPS2014

Paper: https://arxiv.org/pdf/1404.0736.pdf
Code: https://cs.nyu.edu/~denton/compress_conv.zip
Contributions.

  1. A collection of generic methods for exploiting the redundancy inherent in deep CNNs.
  2. Empirical speedups of 2-3x on convolutional layers and a 5-10x reduction in the number of parameters of fully connected layers.

Monochromatic Convolution Approximation.
Let $W\in \mathbb{R}^{C\times X \times Y \times F}$, e.g. $(C,X,Y,F)=(3,7,7,96)$ for a first convolutional layer with 96 filters of size $7\times 7\times 3$.
For every output feature $f$, consider the matrix $W_f \in \mathbb{R}^{C\times (XY)}$.
Compute the SVD $W_f = U_fS_fV_f^{T}$, where $U_f \in \mathbb{R}^{C\times C}$ $(3\times 3)$, $S_f \in \mathbb{R}^{C\times XY}$ $(3\times 49)$, $V_f \in \mathbb{R}^{XY\times XY}$ $(49\times 49)$.
We can take the rank-1 approximation of $W_f$: $\tilde{W}_f = \tilde{U}_f\tilde{S}_f\tilde{V}_f^{T}$, where $\tilde{U}_f\in \mathbb{R}^{C\times 1}$, $\tilde{S}_f\in \mathbb{R}$, $\tilde{V}_f\in \mathbb{R}^{1\times XY}$.

Further cluster the $F$ left singular vectors $\tilde{U}_f$ into $C'$ clusters, $C' < F$, using k-means.
Then $\tilde{W}_f = U_{c_f}\tilde{S}_f\tilde{V}_f^T$, where $U_{c_f}\in\mathbb{R}^{C\times 1}$ is the center of cluster $c_f$.
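The monochromatic approximation above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' released code: the function name, the hand-rolled k-means, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def monochromatic_approx(W, n_clusters, kmeans_iters=50, seed=0):
    """Rank-1 SVD per output filter, then k-means on the left singular
    vectors ("color" components). W has shape (C, X, Y, F) as in the note;
    n_clusters plays the role of C' < F."""
    C, X, Y, F = W.shape
    colors = np.empty((F, C))        # U~_f for each filter
    mono = np.empty((F, X * Y))      # s_f * V~_f^T for each filter
    for f in range(F):
        Wf = W[:, :, :, f].reshape(C, X * Y)
        U, S, Vt = np.linalg.svd(Wf, full_matrices=False)
        colors[f] = U[:, 0]
        mono[f] = S[0] * Vt[0]
    # plain k-means on the F color vectors (illustrative, not tuned)
    rng = np.random.default_rng(seed)
    centers = colors[rng.choice(F, n_clusters, replace=False)]
    for _ in range(kmeans_iters):
        d = ((colors[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = colors[assign == c].mean(0)
    # reconstruct W~_f = U_{c_f} s_f V~_f^T for every filter f
    W_approx = np.einsum('fc,fp->cpf', centers[assign], mono)
    return W_approx.reshape(C, X, Y, F), assign
```

Each reconstructed filter slice is an outer product of a shared color vector and the filter's spatial component, so it has rank 1 by construction.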

Biclustering Approximation.
Let $W\in \mathbb{R}^{C\times X \times Y \times F}$, and form $W_C\in\mathbb{R}^{C\times (XYF)}$ and $W_F\in\mathbb{R}^{(CXY)\times F}$.
Cluster the rows of $W_C$ into $G$ clusters.
Cluster the columns of $W_F$ into $H$ clusters.
This yields $G\times H$ sub-tensors $W(C_i, :, :, F_j)$, each $W_S\in\mathbb{R}^{\frac{C}{G}\times(XY)\times\frac{F}{H}}$.
Each sub-tensor contains similar elements, and thus is easier to fit with a low-rank approximation.
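The grouping step can be sketched as two independent k-means runs, one on the rows of $W_C$ and one on the columns of $W_F$. A minimal sketch, assuming a simple Lloyd's-iteration k-means; the function name and shapes are illustrative.

```python
import numpy as np

def bicluster(W, G, H, iters=50, seed=0):
    """Cluster input channels into G groups (rows of W_C) and output
    filters into H groups (columns of W_F). W has shape (C, X, Y, F)."""
    C, X, Y, F = W.shape
    WC = W.reshape(C, X * Y * F)       # one row per input channel
    WF = W.reshape(C * X * Y, F).T     # one row per output filter

    def kmeans(data, k):
        rng = np.random.default_rng(seed)
        centers = data[rng.choice(len(data), k, replace=False)]
        for _ in range(iters):
            d = ((data[:, None, :] - centers[None]) ** 2).sum(-1)
            a = d.argmin(1)
            for c in range(k):
                if (a == c).any():
                    centers[c] = data[a == c].mean(0)
        return a

    return kmeans(WC, G), kmeans(WF, H)
```

The two label vectors then index the $G\times H$ sub-tensors on which the low-rank fits below are applied.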

  1. Outer product decomposition (rank-$K$)
    $W^{k+1}\leftarrow W^k-\alpha \otimes \beta\otimes \gamma$,
    where $\alpha\in \mathbb{R}^C$, $\beta\in\mathbb{R}^{XY}$, $\gamma\in\mathbb{R}^{F}$.
  2. SVD decomposition
    View $W$ as a matrix $W\in\mathbb{R}^{m\times nk}$ and approximate it as $\tilde{W}\approx\tilde{U}\tilde{S}\tilde{V}^T$.
    $W$ can be compressed even further by applying a second SVD to $\tilde{V}$.
    Let $K_1$ and $K_2$ denote the ranks used in the first and second SVD.
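The greedy deflation in item 1 can be sketched as follows: fit one rank-1 term $\alpha\otimes\beta\otimes\gamma$ by alternating least squares, subtract it, and repeat $K$ times. This is an illustrative sketch (the paper's actual fitting procedure may differ); the function name and ALS iteration count are assumptions.

```python
import numpy as np

def rank_k_outer(W3, K, als_iters=30):
    """Greedy rank-K outer-product decomposition of a 3-tensor
    W3 in R^{C x XY x F}. Returns the K (alpha, beta, gamma) terms
    and the final residual W^K."""
    R = W3.copy()
    terms = []
    for _ in range(K):
        a = np.ones(R.shape[0])
        b = np.ones(R.shape[1])
        c = np.ones(R.shape[2])
        for _ in range(als_iters):
            # each update is the least-squares optimum with the
            # other two factors held fixed
            a = np.einsum('ijk,j,k->i', R, b, c) / ((b @ b) * (c @ c))
            b = np.einsum('ijk,i,k->j', R, a, c) / ((a @ a) * (c @ c))
            c = np.einsum('ijk,i,j->k', R, a, b) / ((a @ a) * (b @ b))
        terms.append((a.copy(), b.copy(), c.copy()))
        R = R - np.einsum('i,j,k->ijk', a, b, c)   # W^{k+1} <- W^k - a⊗b⊗c
    return terms, R
```

Each deflation step removes the best rank-1 fit found by ALS, so the Frobenius norm of the residual decreases with $K$.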

Settings.