Paper Reading: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
1、Paper overview
The starting point of this paper is to consider model size and efficiency together for classification networks: the goal is to make the model bigger while keeping it efficient (i.e., fast at inference). The authors point out that conventional approaches scale a model along a single dimension, such as the input resolution, the network depth, or the network width (the number of feature-map channels). This paper instead treats resolution, depth, and width jointly, arguing that they are related: for example, when the resolution is increased, the depth and width should be increased as well. A compound coefficient φ is proposed to tie the three together. The authors first verify the compound scaling method on MobileNet and ResNet, where it does improve accuracy, and then use NAS to search their own baseline network; gradually increasing the compound coefficient φ on top of that baseline yields the EfficientNet family.
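For reference, the compound scaling rule controlled by φ can be written out as follows; this is a reconstruction of the paper's Equation 3, where α, β, γ are constants determined by a small grid search:

```latex
% Compound scaling (paper's Eq. 3): one coefficient \phi scales all three dimensions.
\begin{aligned}
\text{depth:}      \quad & d = \alpha^{\phi} \\
\text{width:}      \quad & w = \beta^{\phi} \\
\text{resolution:} \quad & r = \gamma^{\phi} \\
\text{s.t.}        \quad & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
```

Since the FLOPS of a ConvNet scale roughly with d · w² · r², the constraint α · β² · γ² ≈ 2 means that each unit increase of φ roughly doubles the total FLOPS (about 2^φ overall).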
The figure showing EfficientNets dominating the ImageNet leaderboard is below:
A diagram of the compound scaling method is shown below:
In this paper, we want to study and rethink the process
of scaling up ConvNets. In particular, we investigate the
central question: is there a principled method to scale up
ConvNets that can achieve better accuracy and efficiency?
Our empirical study shows that it is critical to balance all
dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with constant ratio. Based on this observation, we
propose a simple yet effective compound scaling method.
Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients.
2、Why the compound scaling method makes sense
Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs
more layers to increase the receptive field and more channels
to capture more fine-grained patterns on the bigger image. In
fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018)
and empirical results (Zagoruyko & Komodakis, 2016) both
show that there exists certain relationship between network
width and depth, but to our best knowledge, we are the
first to empirically quantify the relationship among all three
dimensions of network width, depth, and resolution.
Observation 1 – Scaling up any dimension of network
width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.
Observation 2 – In order to pursue better accuracy and
efficiency, it is critical to balance all dimensions of network
width, depth, and resolution during ConvNet scaling.
3、Determining α, β, γ for a given compute budget (i.e., the workflow of the compound scaling method)
Main idea: a small grid search based on Equations (2) and (3), followed by scaling up with Equation (3).
STEP 1: fix φ = 1 (i.e., assume roughly twice the baseline compute is available) and grid-search α, β, γ under the constraint α·β²·γ² ≈ 2, α ≥ 1, β ≥ 1, γ ≥ 1; for EfficientNet-B0 the best values are α = 1.2, β = 1.1, γ = 1.15.
STEP 2: fix α, β, γ and scale up the baseline network with different values of φ to obtain EfficientNet-B1 through B7, as in the sketch below.
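A minimal Python sketch of this procedure, assuming the grid-searched values α = 1.2, β = 1.1, γ = 1.15 reported in the paper; the function name and the baseline resolution below are illustrative, not taken from the official EfficientNet code:

```python
# Coefficients reported in the paper for the EfficientNet-B0 baseline,
# found by a small grid search with phi = 1 under alpha * beta^2 * gamma^2 ≈ 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float, base_resolution: int = 224):
    """Return (depth multiplier, width multiplier, input resolution) for a given phi."""
    depth_mult = ALPHA ** phi        # more layers
    width_mult = BETA ** phi         # more channels per layer
    resolution = int(round(base_resolution * GAMMA ** phi))  # larger input image
    return depth_mult, width_mult, resolution


# The constraint keeps FLOPS growth close to 2**phi, because total FLOPS
# scale roughly with depth * width^2 * resolution^2.
assert abs(ALPHA * BETA ** 2 * GAMMA ** 2 - 2.0) < 0.1

for phi in range(8):  # phi = 0 is B0; larger phi roughly corresponds to B1..B7
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}px, ~{2 ** phi}x FLOPS")
```

Note that the officially released B1-B7 models round the resulting depth/width/resolution to convenient values, so the printed multipliers only approximate the published configurations.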
4、The origin of EfficientNet-B0 and its main building blocks
Main idea: neural architecture search (NAS)
Inspired by (Tan et al., 2019), we develop our baseline network by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same search space as (Tan et al., 2019), and use ACC(m) × [FLOPS(m)/T]^w as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS and w = -0.07 is a hyperparameter for controlling the trade-off between accuracy and FLOPS. Unlike (Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not targeting any specific hardware device. Our search produces an efficient network, which we name EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is similar to MnasNet, except our EfficientNet-B0 is slightly bigger due to the larger FLOPS target (our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building block is mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we also add squeeze-and-excitation optimization (Hu et al., 2018).
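The search objective quoted above is simple enough to restate in code. The function below only illustrates the reward formula ACC(m) × [FLOPS(m)/T]^w with the paper's w = -0.07 and T = 400M; it is not the actual architecture-search code:

```python
def nas_reward(accuracy: float, flops: float,
               target_flops: float = 400e6, w: float = -0.07) -> float:
    """Multi-objective reward from the paper: ACC(m) * (FLOPS(m) / T) ** w.

    With w = -0.07, models above the FLOPS target T are penalized and
    models below it get a mild bonus, trading accuracy against cost.
    """
    return accuracy * (flops / target_flops) ** w


# Example: a model at 78% accuracy and 800M FLOPS vs. one at 77% and 400M FLOPS.
print(nas_reward(0.78, 800e6))   # ≈ 0.743 (penalized for 2x the target FLOPS)
print(nas_reward(0.77, 400e6))   # = 0.770 (exactly at the target)
```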
The architecture of EfficientNet-B0 is as follows:
mobile inverted bottleneck MBConv: this is the inverted residual block from MobileNetV2, i.e., it is built on depthwise separable convolution: a 1×1 expansion conv, a depthwise conv, and a 1×1 linear projection conv, with a shortcut connection when the stride is 1 and the input/output channels match. In EfficientNet a squeeze-and-excitation module is also added inside the block.
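A minimal PyTorch sketch of an MBConv block with squeeze-and-excitation, written from the MobileNetV2/SENet descriptions rather than copied from the official EfficientNet code, so details such as BatchNorm momentum, drop-connect, and exact activation placement are simplified:

```python
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Mobile inverted bottleneck: 1x1 expand -> depthwise conv -> SE -> 1x1 project."""

    def __init__(self, in_ch, out_ch, expand_ratio=6, kernel_size=3, stride=1, se_ratio=0.25):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)

        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),  # EfficientNet uses the swish/SiLU activation
        ) if expand_ratio != 1 else nn.Identity()

        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride,
                      padding=kernel_size // 2, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
        )

        # Squeeze-and-excitation: global pooling -> two 1x1 convs -> per-channel gates.
        se_ch = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_ch, se_ch, 1),
            nn.SiLU(),
            nn.Conv2d(se_ch, mid_ch, 1),
            nn.Sigmoid(),
        )

        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),  # no activation here: linear bottleneck
        )

    def forward(self, x):
        out = self.expand(x)
        out = self.depthwise(out)
        out = out * self.se(out)
        out = self.project(out)
        return x + out if self.use_residual else out


# Quick shape check.
block = MBConv(in_ch=16, out_ch=24, expand_ratio=6, kernel_size=3, stride=2)
print(block(torch.randn(1, 16, 112, 112)).shape)  # -> torch.Size([1, 24, 56, 56])
```

The 1×1 projection stays linear (no activation), which is the "linear bottleneck" design from MobileNetV2, and the residual shortcut is used only when the stride is 1 and the input/output channel counts match.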
5、EfficientNets benchmark results
6、EfficientNets class activation map visualization
In order to further understand why our compound scaling
method is better than others, Figure 7 compares the class
activation map (Zhou et al., 2016) for a few representative
models with different scaling methods. All these models are
scaled from the same baseline, and their statistics are shown
in Table 7. Images are randomly picked from ImageNet
validation set. As shown in the figure, the model with compound scaling tends to focus on more relevant regions with
more object details, while other models either lack object details or are unable to capture all objects in the images.
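For readers unfamiliar with class activation maps (Zhou et al., 2016), a minimal sketch of the computation using a torchvision classifier; this is generic CAM code for illustration, not the paper's own visualization script:

```python
import torch
from torchvision import models

# A classifier whose head is global-average-pooling + linear, which is the
# structure CAM requires (older torchvision versions use pretrained=True instead).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(feat=o))

x = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed ImageNet image
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(dim=1).item()

# CAM = weighted sum of the last conv feature maps, using the fc weights
# of the predicted class as per-channel weights.
fmap = features["feat"][0]                 # (C, H, W)
weights = model.fc.weight[cls]             # (C,)
cam = torch.relu((weights[:, None, None] * fmap).sum(dim=0))
cam = cam / (cam.max() + 1e-8)             # normalize to [0, 1], then upsample to overlay
print(cam.shape)                           # torch.Size([7, 7])
```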