Paper Reading Notes: From Image-level to Pixel-level Labeling with Convolutional Networks


Introduction

Task

Weakly supervised learning for image semantic segmentation, using only image-level class labels.

Contribution

  • Combines MIL (Multiple Instance Learning) with a classification CNN
  • Achieves state-of-the-art experimental results among weakly supervised methods

Framework

Train

For an input image $I: 3\times h \times w$, a backbone (i.e. Overfeat + a segmentation net) produces feature maps $Y: (|C|+1) \times h^{o} \times w^{o}$. Then $Y$ passes through LSE (Log-Sum-Exp) pooling, yielding per-class scores $s: (|C|+1) \times 1 \times 1$. Finally, a softmax cross-entropy loss is computed on $s$, and its gradients are backpropagated to train the backbone.
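The training forward pass above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the backbone output `Y` is faked with random maps, the class count (20 + background), the map size, and `r=5.0` are assumed values, and the loss here is a plain single-label softmax cross-entropy against one image label.

```python
import numpy as np

rng = np.random.default_rng(0)

def lse_pool(Y, r=5.0):
    """LSE pooling: aggregate (|C|+1, h_o, w_o) feature maps into (|C|+1,) scores."""
    flat = Y.reshape(Y.shape[0], -1)
    m = flat.max(axis=1)                       # per-class max, for numerical stability
    return m + np.log(np.exp(r * (flat - m[:, None])).mean(axis=1)) / r

def softmax_xent(s, k):
    """Softmax cross-entropy of pooled scores s against image-level label k."""
    z = s - s.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[k]

# toy forward pass: pretend the backbone produced these feature maps
Y = rng.normal(size=(21, 16, 16))              # 20 classes + background, 16x16 output maps
s = lse_pool(Y, r=5.0)                         # (21,) image-level class scores
loss = softmax_xent(s, k=3)                    # loss against image label k=3
```

In a real setup `loss` would be backpropagated through the pooling and the backbone; only the forward computation is shown here.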

Inference

Let $p_{i,j}(k)$ denote the value of $Y$ at location $(i,j)$ for the $k^{th}$ class, and let the image-level prior (ILP) $p(k)$ be the softmax of $s$. The two are combined per pixel:
$$\widehat{y}_{i,j}=p_{i,j}(k|I) \times p(k|I)$$
Finally, $\widehat{y}_{i,j}$ is interpolated back to the input image resolution, and a threshold (the smoothing prior) yields the final segmentation result.

Log-Sum-Exp(LSE)

$$s^k = \frac{1}{r}\log \left[ \frac{1}{h^o w^o} \sum\limits_{i,j} \exp\left( r\, s_{i,j}^k \right)\right]$$
LSE is a pooling method that reduces $Y: (|C|+1) \times h^{o} \times w^{o}$ to $s: (|C|+1) \times 1 \times 1$, and it is smoother than hard pooling: when $r$ is high, LSE approaches max pooling; when $r$ is low, it approaches average pooling.

In terms of accuracy, LSE pooling outperforms both max pooling and sum pooling.

Summary

  • LSE is a smoother pooling method than max and average pooling; it may prove useful in other settings as well.