BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

Research background 

It can be broadly applied to the fields of augmented reality devices, autonomous driving, and video surveillance. These applications have a high demand for efficient inference speed for fast interaction or response.

在现实应用中,图像分割常常需要有效的预测和快速的响应。

Recently, the algorithms of real-time semantic segmentation have shown that there are mainly three approaches to accelerate the model.

try to restrict the input size to reduce the computation complexity by cropping or resizing.

prune the channels of the network to boost the inference speed especially in the early stages of the base model.

ENet proposes to drop the last stage of the model in pursuit of an extremely tight framework.

近期论文中有三种方法用以解决图像分割的快速响应需求,首先,限制输入大小,这种方法缺点为会丢失空间细节,尤其是图像边界(裁剪和resize);其次,修剪通道(模型前几层),缺点是弱化了空间容量,减少精度;最后,减少最后几层的数量(遗弃下采样层),缺点是感受野会下降。

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

(a)减小输入尺寸和轻量模型(减少前面几层通道和遗弃后面层)(b)u型网络,获得丰富的空间信息,但是是以检测速度为代价。(c)本文模型,使用spatial path和context path分别获得底层特征的空间信息和高层特征的全局感受野信息。

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

 

we propose a Spatial Path to preserve the spatial size of the original input image and encode affluent spatial information. The Spatial Path contains three layers.

使用Spatial Path保留图像空间尺寸和丰富的空间信息,由于Spatial Path只使用了三层卷积,所有不会耗费太多的运算时间,并且没有池化能够保留图像大部分信息。

The Context Path utilize lightweight model and global average pooling to provide large receptive field. In this work, the lightweight model, like Xception, can downsample the feature map fast to obtain large receptive field, which encodes high level semantic context information.

Context Path 使用了轻量模型,能够获得全局信息的同时减少运算时间。

In the Context Path, we propose a specific Attention Refinement Module (ARM) to refine the features of each stage. ARM employs global average pooling to capture global context and computes an attention vector to guide the feature learning.

使用了注意力机制,来指导和加强获得的高层特征图。

Given the different level of the features, we first concatenate the output features of Spatial Path and Context Path. And then we utilize the batch normalization to balance the scales of the features. Next, we pool the concatenated feature to a feature vector and compute a weight vector. This weight vector can re-weight the features, which amounts to feature selection and combination.

将获得的低层特征(空间信息)和高层特征(全局信息)以一定平衡的方式融合在一起。以网络图来看像是使用了两种注意力机制处理两种特征图拼接后的特征图。

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

消融实验(以一定的变量来判断某一结构是否有效)表明,本文提出的方法均能提高模型的准确率。

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

本文提出的方法能够在准确率和速度上取得最好效果

完(笑)