Correlation Layer


While reading "A Survey on Deep Learning Techniques for Stereo-based Depth Estimation", I came across this passage in Section 4.1.1:

The main advantage of the correlation over the $ L2 $ distance is that it can be implemented using a layer of $ 2D $ or $ 1D $ convolutional operations, called correlation layer. A correlation layer does not require training since the filters are in fact the features computed by the second branch of the network.

What is a correlation layer?

A correlation layer measures the similarity between positions of two feature maps. According to the description above, the correlation can be computed with convolution operations (this is its main advantage over the $ L2 $ distance), and a correlation layer requires no training, because its convolution kernels are in fact data coming from the second branch of the network.
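To make "the filters are the features computed by the second branch" concrete, here is a tiny sketch of my own (PyTorch, with random tensors standing in for real features): correlating $ f_1 $ against the feature vector at one position of $ f_2 $ is exactly a 1×1 convolution whose filter is data from the second branch rather than a learned weight.

```python
import torch
import torch.nn.functional as F

c, h, w = 8, 5, 5
f1 = torch.randn(1, c, h, w)  # features from the first branch
f2 = torch.randn(1, c, h, w)  # features from the second branch

# Use the feature vector of f2 at one position x2 as a 1x1 convolution filter.
x2 = (2, 3)
kernel = f2[0, :, x2[0], x2[1]].view(1, c, 1, 1)

# conv2d with this data-derived filter computes <f1(x1), f2(x2)> for every x1
# at once (F.conv2d is cross-correlation, so no kernel flip gets in the way).
corr_map = F.conv2d(f1, kernel)  # shape (1, 1, h, w)

# Sanity check against an explicit dot product at one location x1.
x1 = (1, 1)
expected = torch.dot(f1[0, :, x1[0], x1[1]], f2[0, :, x2[0], x2[1]])
print(torch.allclose(corr_map[0, 0, x1[0], x1[1]], expected))  # True
```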

The reasons for using a correlation layer are as follows:

  • To compare two images, only the similarity and its spatial position need to be kept; the content of each image by itself should not matter.
    Suppose two image pairs share the same affine transformation parameters but differ in content: if the model paid attention to the pixel-level content of the feature maps, the two pairs would produce different outputs even though the transformation is the same.
  • Simply adding or subtracting the per-channel features of the two images cannot recover the correct similarity when the matching points lie far apart, so that approach cannot handle matching under large scene changes. A correlation map followed by normalization, in contrast, still highlights the best-matching point even when it is far away (a toy numeric sketch follows this list).
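Here is the promised toy 1-D sketch (my own example; NumPy, with circular shifts for simplicity): when the match is displaced by a large offset, per-position subtraction reveals nothing, while correlation over candidate displacements recovers the offset.

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.standard_normal(16)   # toy 1-D "feature map"
f2 = np.roll(f1, 6)            # same content, displaced by a large shift of 6

# Per-position subtraction: a large residual everywhere, no hint of the shift.
print(np.abs(f1 - f2).mean())

# Correlation against every candidate displacement: the true shift stands out.
scores = [np.dot(f1, np.roll(f2, -disp)) for disp in range(16)]
print(int(np.argmax(scores)))  # 6 -> the displacement is recovered
```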

Implementation

The correlation layer was first introduced in "FlowNet: Learning Optical Flow with Convolutional Networks", a paper on optical flow and the first deep-learning-based method competitive with traditional optical flow approaches. Here I only look at how the correlation layer is computed.

First, the figure, to make things easier to follow.

[Figure 2 from the FlowNet paper: the FlowNetSimple (top) and FlowNetCorr (bottom) architectures]

I am also quoting the original text below so the two can be read side by side; my understanding is not necessarily correct.

Contracting part.

A simple choice is to stack both input images together and feed them through a rather generic network, allowing the network to decide itself how to process the image pair to extract the motion information. This is illustrated in Fig. 2 (top). We call this architecture consisting only of convolutional layers ‘FlowNetSimple’.

This paragraph describes FlowNetSimple: the two images are stacked together (as the figure shows, the initial input therefore has 6 channels) and fed into the network for feature extraction.
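As a one-line illustration of the stacking (my own sketch; the 384×512 resolution is just an example):

```python
import torch

img1 = torch.randn(1, 3, 384, 512)  # first RGB frame
img2 = torch.randn(1, 3, 384, 512)  # second RGB frame

# FlowNetSimple stacks the pair along the channel axis: 3 + 3 = 6 input channels.
x = torch.cat([img1, img2], dim=1)
print(x.shape)  # torch.Size([1, 6, 384, 512])
```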

Another approach is to create two separate, yet identical processing streams for the two images and to combine them at a later stage as shown in Fig. 2 (bottom). With this architecture the network is constrained to first produce meaningful representations of the two images separately and then combine them on a higher level. This roughly resembles the standard matching approach when one first extracts features from patches of both images and then compares those feature vectors. However, given feature representations of two images, how would the network find correspondences?

The other approach is FlowNetCorr: two separate but identical processing streams are created for the two images and combined at a later stage, as shown in Fig. 2 (bottom). With this architecture the network is forced to first produce a meaningful representation of each image separately, and only then combine them at a higher level. This roughly resembles the standard matching pipeline, where one first extracts features from patches of both images and then compares those feature vectors. But given the feature representations of the two images, how does the network find correspondences?

To aid the network in this matching process, we introduce a ‘correlation layer’ that performs multiplicative patch comparisons between two feature maps. An illustration of the network architecture ‘FlowNetCorr’ containing this layer is shown in Fig. 2 (bottom). Given two multi-channel feature maps $ f_{1}, f_{2}: \mathbb{R}^{2} \to \mathbb{R}^{c} $, with w, h, and c being their width, height and number of channels, our correlation layer lets the network compare each patch from $ f_{1} $ with each patch from $ f_{2} $.

In other words: given two multi-channel feature maps $ f_1, f_2: \mathbb{R}^2 \to \mathbb{R}^c $, where w, h, and c are their width, height and number of channels, the correlation layer (shown in the figure above) lets the network compare each patch of $ f_1 $ with each patch of $ f_2 $ via multiplicative patch comparisons.

For now we consider only a single comparison of two patches. The ‘correlation’ of two patches centered at $ x_{1} $ in the first map and $ x_{2} $ in the second map is then defined as
$$c\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)=\sum_{\mathbf{o} \in[-k, k] \times[-k, k]}\left\langle\mathbf{f}_{1}\left(\mathbf{x}_{1}+\mathbf{o}\right), \mathbf{f}_{2}\left(\mathbf{x}_{2}+\mathbf{o}\right)\right\rangle \tag{1}$$
for a square patch of size $ K := 2k+1 $. Note that Eq. 1 is identical to one step of a convolution in neural networks, **but instead of convolving data with a filter, it convolves data with other data**. For this reason, it has no trainable weights.

First consider a single comparison of two patches. Let $ x_{1} $ denote the patch centered at $ x_{1} $ in the first map and $ x_{2} $ the patch centered at $ x_{2} $ in the second map; the "correlation" of the two patches is then $ c(x_{1}, x_{2}) $, where each square patch has size $ K := 2k+1 $ (the paper uses k := 0). Eq. 1 is equivalent to one step of a convolution in a neural network, except that the data is convolved not with a learned kernel but with data from the other branch. For this reason the layer has no trainable weights.
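Here is a direct, unoptimized transcription of Eq. 1 into NumPy (my own sketch; the function name and shapes are made up, and no boundary handling is done):

```python
import numpy as np

def patch_correlation(f1, f2, x1, x2, k=0):
    """Eq. 1: sum of channel-wise dot products over a (2k+1) x (2k+1) patch.

    f1, f2: feature maps of shape (h, w, c); x1, x2: (row, col) patch centers.
    """
    total = 0.0
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            v1 = f1[x1[0] + dy, x1[1] + dx]  # feature vector of length c
            v2 = f2[x2[0] + dy, x2[1] + dx]
            total += np.dot(v1, v2)          # <f1(x1 + o), f2(x2 + o)>
    return total

rng = np.random.default_rng(0)
f1 = rng.standard_normal((8, 8, 4))
f2 = rng.standard_normal((8, 8, 4))
# With k = 0, as in the paper, this reduces to one dot product of two c-vectors.
print(patch_correlation(f1, f2, (3, 3), (4, 5), k=0))
```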

Computing $ c(x_{1}, x_{2}) $ involves $ c \cdot K^{2} $ multiplications. Comparing all patch combinations involves $ w^{2} \cdot h^{2} $ such computations, yields a large result and makes efficient forward and backward passes intractable. Thus, for computational reasons we limit the maximum displacement for comparisons and also introduce striding in both feature maps.

Computing one $ c(x_{1}, x_{2}) $ takes $ c \cdot K^{2} $ multiplications. Computing the correlation between all pairs of patches requires $ w^{2} \cdot h^{2} $ such evaluations, i.e. $ w^{2} \cdot h^{2} \cdot c \cdot K^{2} $ multiplications in total, which produces far too much data and makes the forward and backward passes intractable. Therefore a maximum displacement is imposed to restrict the comparison range, and separate strides are introduced for $ x_{1} $ and $ x_{2} $.
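A back-of-the-envelope version of this count (my own numbers: 64 × 48 × 256 is roughly the feature-map size at which FlowNetCorr applies the correlation, and K = 1 corresponds to k = 0):

```python
w, h, c, K = 64, 48, 256, 1

per_pair = c * K**2           # multiplications for one c(x1, x2)
num_pairs = (w * h) ** 2      # all patch combinations: w^2 * h^2

print(f"{num_pairs:,}")             # 9,437,184 correlation values to store
print(f"{num_pairs * per_pair:,}")  # ~2.4 billion multiplications per image pair
```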

The computation count here may not be easy to follow; see this link, which explains it quite clearly.

Given a maximum displacement $ d $, for each location $ x_{1} $ we compute correlations $ c(x_{1}, x_{2}) $ only in a neighborhood of size $ D := 2d + 1 $, by limiting the range of $ x_{2} $. We use strides $ s_{1} $ and $ s_{2} $, to quantize $ x_{1} $ globally and to quantize $ x_{2} $ within the neighborhood centered around $ x_{1} $.

The computation is reduced in two ways. First, given a maximum displacement $ d $, for each $ x_{1} $ in $ f_{1} $ we compute correlations only between $ x_{1} $ and those $ x_{2} $ that lie in the neighborhood of size $ D := 2d+1 $ in $ f_{2} $ centered at the position corresponding to $ x_{1} $. Each $ x_{1} $ therefore has $ D^{2} $ candidate positions $ x_{2} $, and since $ f_{1} $ contains $ w \cdot h $ positions $ x_{1} $, a total of $ w \cdot h \cdot D^{2} $ values $ c(x_{1}, x_{2}) $ have to be computed. Second, strides are introduced: $ s_{1} $ controls how $ x_{1} $ traverses $ f_{1} $, and $ s_{2} $ controls how $ x_{2} $ traverses the neighborhood of size $ D := 2d+1 $ centered at the position corresponding to $ x_{1} $. When $ s_{1} > 1 $, the number of sampled positions $ x_{1} $ is less than $ w \cdot h $; when $ s_{2} > 1 $, the number of sampled positions $ x_{2} $ per neighborhood is less than $ D^{2} $. The paper uses $ d = 20, s_{1} = 1, s_{2} = 2 $, which yields 441 sampled displacements per position: $ (((2d+1) - K)/s_{2} + 1)^{2} = 441 $.

I have also looked at some other blog posts and noticed that a few of them confuse $ k $ with $ d $; note that these are two different concepts. One way to think about it: $ k $ determines the size of the "kernel" ($ (2k+1) \times (2k+1) \times \text{channels} $), while $ d $ roughly determines the number of such kernels, at least for a single $ x_{1} $.
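Putting the pieces together, here is a plain NumPy sketch of the whole layer for k = 0 (my own unvectorized reference implementation; the function name and shapes are made up, and out-of-range $ x_{2} $ positions are treated as zero). With the paper's values $ d = 20, s_1 = 1, s_2 = 2 $ it reproduces the 441 channels computed above.

```python
import numpy as np

def correlation_layer(f1, f2, d=20, s1=1, s2=2):
    """Correlation with max displacement d and strides s1, s2, for k = 0.

    f1, f2: feature maps of shape (h, w, c).
    Output: (len(rows), len(cols), n_disp**2), where n_disp displacements
    per axis are sampled from [-d, d] with stride s2.
    """
    h, w, c = f1.shape
    offsets = list(range(-d, d + 1, s2))       # x2 sampling inside the neighborhood
    rows = list(range(0, h, s1))               # x1 sampling over f1, stride s1
    cols = list(range(0, w, s1))
    out = np.zeros((len(rows), len(cols), len(offsets) ** 2))
    for i, y in enumerate(rows):
        for j, x in enumerate(cols):
            for n, (dy, dx) in enumerate((a, b) for a in offsets for b in offsets):
                y2, x2 = y + dy, x + dx
                if 0 <= y2 < h and 0 <= x2 < w:  # zero padding outside f2
                    out[i, j, n] = np.dot(f1[y, x], f2[y2, x2])
    return out

rng = np.random.default_rng(0)
f1 = rng.standard_normal((24, 32, 8))
f2 = rng.standard_normal((24, 32, 8))
print(correlation_layer(f1, f2).shape)  # (24, 32, 441): 21 offsets per axis, 21**2 = 441
```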

In theory, the result produced by the correlation is four-dimensional: for every combination of two $ 2D $ positions we obtain a correlation value, i.e. the scalar product of the two vectors which contain the values of the cropped patches respectively. In practice we organize the relative displacements in channels. This means we obtain an output of size $ (w \times h \times D^{2}) $. For the backward pass we implemented the derivatives with respect to each bottom blob accordingly.

The result of the correlation is, in theory, four-dimensional; one can think of the $ w \times h \times D^{2} $ output as $ w \times h \times D \times D $. As for the "vectors": a position in a feature map originally holds a vector of size $ 1 \times 1 \times \text{channels} $; after the correlation, each output position holds a vector of size $ 1 \times 1 \times D^{2} $ (or fewer entries, depending on the strides), and each entry of that vector is the correlation between the vector at $ x_1 $ and one vector $ x_2 $ in its neighborhood of $ f_{2} $, i.e. simply the dot product of the two vectors. At first I could not make sense of the last sentence. "Bottom blob" is Caffe terminology for a layer's input, so it most likely means that the authors implemented the gradients of the correlation with respect to both input feature maps: although the layer has no trainable weights (there is no learned kernel), it is still differentiable with respect to $ f_{1} $ and $ f_{2} $, so backpropagation can flow through it into both branches. If anyone understands the details better, please do enlighten me!
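To back up that reading, here is a minimal check of my own (in PyTorch rather than the paper's Caffe, with a circular shift standing in for proper border handling) that a weight-free correlation still passes gradients to both inputs:

```python
import torch

f1 = torch.randn(1, 16, 24, 32, requires_grad=True)
f2 = torch.randn(1, 16, 24, 32, requires_grad=True)

# Correlation for a single displacement (dy, dx) = (0, 2): shift f2, then take
# the channel-wise dot product at every position (k = 0: single-pixel patches).
f2_shifted = torch.roll(f2, shifts=(0, -2), dims=(2, 3))
corr = (f1 * f2_shifted).sum(dim=1)  # shape (1, 24, 32)

# No trainable weights anywhere, yet both input branches receive gradients.
corr.sum().backward()
print(f1.grad is not None, f2.grad is not None)  # True True
```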