[CV-Paper 18] Image Classification 01: HighwayNet-2015
Original paper: LINK
Citations: 1280 (10/09/2020)
Highway Networks
Abstract
There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
【Note: A full paper extending this study is available at http://arxiv.org/abs/1507.06228, with additional references, experiments and analysis.】
1. Introduction
Many recent empirical breakthroughs in supervised machine learning have been achieved through the application of deep neural networks. Network depth (referring to the number of successive computation layers) has played perhaps the most important role in these successes. For instance, the top-5 image classification accuracy on the 1000-class ImageNet dataset has increased from ∼84% (Krizhevsky et al., 2012) to ∼95% (Szegedy et al., 2014; Simonyan & Zisserman, 2014) through the use of ensembles of deeper architectures and smaller receptive fields (Ciresan et al., 2011a;b; 2012) in just a few years.
On the theoretical side, it is well known that deep networks can represent certain function classes exponentially more efficiently than shallow ones (e.g. the work of Håstad (1987); Håstad & Goldmann (1991) and recently of Montufar et al. (2014)). As argued by Bengio et al. (2013), the use of deep networks can offer both computational and statistical efficiency for complex tasks.
However, training deeper networks is not as straightforward as simply adding layers. Optimization of deep networks has proven to be considerably more difficult, leading to research on initialization schemes (Glorot & Bengio, 2010; Saxe et al., 2013; He et al., 2015), techniques of training networks in multiple stages (Simonyan & Zisserman, 2014; Romero et al., 2014) or with temporary companion loss functions attached to some of the layers (Szegedy et al., 2014; Lee et al., 2015).
In this extended abstract, we present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow which is inspired by Long Short Term Memory recurrent neural networks (Hochreiter & Schmidhuber, 1995). Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation. We call such paths information highways, and such networks highway networks.
In preliminary experiments, we found that highway networks as deep as 900 layers can be optimized using simple Stochastic Gradient Descent (SGD) with momentum. For up to 100 layers we compare their training behavior to that of traditional networks with normalized initialization (Glorot & Bengio, 2010; He et al., 2015). We show that optimization of highway networks is virtually independent of depth, while for traditional networks it suffers significantly as the number of layers increases. We also show that architectures comparable to those recently presented by Romero et al. (2014) can be directly trained to obtain similar test set accuracy on the CIFAR-10 dataset without the need for a pre-trained teacher network.
1.1. Notation
We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. $\mathbf{0}$ and $\mathbf{1}$ denote vectors of zeros and ones respectively, and $\mathbf{I}$ denotes an identity matrix. The function $\sigma(x)$ is defined as $\sigma(x) = \frac{1}{1 + e^{-x}},\ x \in \mathbb{R}$.
2. Highway Networks
A plain feedforward neural network typically consists of $L$ layers where the $l$-th layer ($l \in \{1, 2, \dots, L\}$) applies a nonlinear transform $H$ (parameterized by $\mathbf{W}_{\mathbf{H},l}$) on its input $\mathbf{x}_l$ to produce its output $\mathbf{y}_l$. Thus, $\mathbf{x}_1$ is the input to the network and $\mathbf{y}_L$ is the network's output. Omitting the layer index and biases for clarity,

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}). \tag{1}$$
$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms. For a highway network, we additionally define two nonlinear transforms $T(\mathbf{x}, \mathbf{W_T})$ and $C(\mathbf{x}, \mathbf{W_C})$ such that

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot C(\mathbf{x}, \mathbf{W_C}). \tag{2}$$
We refer to $T$ as the transform gate and $C$ as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set $C = 1 - T$, giving

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot (1 - T(\mathbf{x}, \mathbf{W_T})). \tag{3}$$
The dimensionality of $\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ must be the same for Equation (3) to be valid. Note that this re-parametrization of the layer transformation is much more flexible than Equation (1). In particular, observe that

$$\mathbf{y} = \begin{cases} \mathbf{x}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{4}$$
Similarly, for the Jacobian of the layer transform,

$$\frac{d\mathbf{y}}{d\mathbf{x}} = \begin{cases} \mathbf{I}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H'(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{5}$$
Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the $i$th unit computes $y_i = H_i(\mathbf{x})$, a highway network consists of multiple blocks such that the $i$th block computes a block state $H_i(\mathbf{x})$ and transform gate output $T_i(\mathbf{x})$. Finally, it produces the block output $y_i = H_i(\mathbf{x}) \cdot T_i(\mathbf{x}) + x_i \cdot (1 - T_i(\mathbf{x}))$, which is connected to the next layer.
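As an illustration, the following is a minimal sketch of a fully-connected highway layer implementing Equation (3). It assumes PyTorch; the class name `HighwayLayer`, the choice of ReLU for $H$, and the specific bias value are illustrative and not prescribed by the paper.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Fully-connected highway layer: y = H(x)*T(x) + x*(1 - T(x))."""
    def __init__(self, dim, activation=torch.relu):
        super().__init__()
        self.H = nn.Linear(dim, dim)          # block state transform H(x, W_H)
        self.T = nn.Linear(dim, dim)          # transform gate T(x, W_T)
        self.activation = activation
        nn.init.constant_(self.T.bias, -1.0)  # bias the gate towards carry behavior (Section 2.2)

    def forward(self, x):
        h = self.activation(self.H(x))        # candidate transformation of the input
        t = torch.sigmoid(self.T(x))          # gate values in (0, 1)
        return h * t + x * (1.0 - t)          # Equation (3): transform + carry
```

With $t \approx 0$ the layer copies its input, and with $t \approx 1$ it behaves like a plain layer, matching Equation (4).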
2.1. Constructing Highway Networks
As mentioned earlier, Equation (3) requires that the dimensionality of $\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ be the same. In cases when it is desirable to change the size of the representation, one can replace $\mathbf{x}$ with $\hat{\mathbf{x}}$ obtained by suitably sub-sampling or zero-padding $\mathbf{x}$. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.
Convolutional highway layers are constructed similar to fully connected layers. Weight-sharing and local receptive fields are utilized for both H and T transforms. We use zero-padding to ensure that the block state and transform gate feature maps are the same size as the input.
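Below is a hedged sketch of a convolutional highway layer along the same lines, again assuming PyTorch; the class name, kernel size and activation are assumptions. Zero-padding keeps the H and T feature maps the same size as the input, as described above.

```python
import torch
import torch.nn as nn

class ConvHighwayLayer(nn.Module):
    """Convolutional highway layer with 'same'-size zero-padding."""
    def __init__(self, channels, kernel_size=3, activation=torch.relu):
        super().__init__()
        pad = kernel_size // 2                              # zero-padding so output size == input size
        self.H = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.T = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.activation = activation
        nn.init.constant_(self.T.bias, -1.0)                # carry-biased initialization

    def forward(self, x):
        h = self.activation(self.H(x))                      # block state feature maps
        t = torch.sigmoid(self.T(x))                        # transform gate feature maps
        return h * t + x * (1.0 - t)
```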
2.2. Training Deep Highway Networks
For plain deep networks, training with SGD stalls at the beginning unless a specific weight initialization scheme is used such that the variance of the signals during forward and backward propagation is preserved initially (Glorot & Bengio, 2010; He et al., 2015). This initialization depends on the exact functional form of H.
For highway layers, we use the transform gate defined as $T(\mathbf{x}) = \sigma(\mathbf{W_T}^T \mathbf{x} + \mathbf{b_T})$, where $\mathbf{W_T}$ is the weight matrix and $\mathbf{b_T}$ the bias vector for the transform gates. This suggests a simple initialization scheme which is independent of the nature of $H$: $\mathbf{b_T}$ can be initialized with a negative value (e.g. -1, -3 etc.) such that the network is initially biased towards carry behavior. This scheme is strongly inspired by the proposal of Gers et al. (1999) to initially bias the gates in a Long Short-Term Memory recurrent network to help bridge long-term temporal dependencies early in learning. Note that $\sigma(x) \in (0, 1),\ \forall x \in \mathbb{R}$, so the conditions in Equation (4) can never be exactly true.
In our experiments, we found that a negative bias initialization was sufficient for learning to proceed in very deep networks for various zero-mean initial distributions of $\mathbf{W_H}$ and different activation functions used by $H$. This is a significant property, since in general it may not be possible to find effective initialization schemes for many choices of $H$.
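As a small illustration of the carry-biased initialization, the snippet below reuses the hypothetical `HighwayLayer` sketched in Section 2 and sets a more strongly negative gate bias; the specific values are only examples in the spirit of the -1/-3 suggestion above.

```python
import torch

# A freshly initialized highway layer with a strongly negative gate bias
# behaves almost like the identity function on its input.
layer = HighwayLayer(dim=50)
torch.nn.init.constant_(layer.T.bias, -3.0)   # stronger carry bias

x = torch.randn(8, 50)
y = layer(x)
print(torch.mean(torch.abs(y - x)))           # small value: most of the input is carried through
```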
3. Experiments
3.1. Optimization
Very deep plain networks become difficult to optimize even when using the variance-preserving initialization scheme from (He et al., 2015). To show that highway networks do not suffer from depth in the same way, we run a series of experiments on the MNIST digit classification dataset. We measure the cross entropy error on the training set in order to investigate optimization, without conflating it with generalization issues.
We train both plain networks and highway networks with the same architecture and varying depth. The first layer is always a regular fully-connected layer followed by 9, 19, 49, or 99 fully-connected plain or highway layers and a single softmax output layer. The number of units in each layer is kept constant and it is 50 for highways and 71 for plain networks. That way the number of parameters is roughly the same for both. To make the comparison fair we run a random search of 40 runs for both plain and highway networks to find good settings for the hyperparameters. We optimized the initial learning rate, momentum, learning rate decay rate, activation function for H (either ReLU or tanh) and, for highway networks, the value for the transform gate bias (between -1 and -10). All other weights were initialized following the scheme introduced by (He et al., 2015).
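The following is a rough sketch of how such a stack could be assembled, reusing the hypothetical `HighwayLayer` from Section 2: a plain fully-connected layer first sets the representation width, followed by the highway layers and a linear classification layer. The depth and width follow the text; the learning rate and momentum shown are placeholders, since the paper selects hyperparameters by random search.

```python
import torch
import torch.nn as nn

def build_highway_mnist_model(depth=49, width=50, num_classes=10):
    """Plain first layer -> `depth` highway layers -> softmax classifier."""
    layers = [nn.Flatten(), nn.Linear(28 * 28, width), nn.ReLU()]
    layers += [HighwayLayer(width) for _ in range(depth)]
    layers += [nn.Linear(width, num_classes)]          # softmax is applied inside the loss
    return nn.Sequential(*layers)

model = build_highway_mnist_model(depth=49)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # placeholder hyperparameters
loss_fn = nn.CrossEntropyLoss()                         # cross entropy error on the training set
```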
The convergence plots for the best performing networks for each depth can be seen in Figure 1. While 10-layer plain networks show very good performance, their performance significantly degrades as depth increases. Highway networks, on the other hand, do not seem to suffer from an increase in depth at all. The final result of the 100-layer highway network is about 1 order of magnitude better than the 10-layer one, and is on par with the 10-layer plain network. In fact, we started training a similar 900-layer highway network on CIFAR-100 which is only at 80 epochs as of now, but so far has shown no signs of optimization difficulties. It is also worth pointing out that the highway networks always converge significantly faster than the plain ones.
3.2. Comparison to Fitnets
Deep highway networks are easy to optimize, but are they also beneficial for supervised learning where we are interested in generalization performance on a test set? To address this question, we compared highway networks to the thin and deep architectures termed Fitnets proposed recently by Romero et al. (2014) on the CIFAR-10 dataset augmented with random translations. Results are summarized in Table 1.
Romero et al. (2014) reported that training using plain backpropagation was only possible for maxout networks with depth up to 5 layers when the number of parameters was limited to ∼250K and the number of multiplications to ∼30M. Training of deeper networks was only possible through the use of a two-stage training procedure and addition of soft targets produced from a pre-trained shallow teacher network (hint-based training). Similarly, it was only possible to train 19-layer networks with a budget of 2.5M parameters using hint-based training.
We found that it was easy to train highway networks with numbers of parameters and operations comparable to Fitnets directly using backpropagation. As shown in Table 1, Highway 1 and Highway 4, which are based on the architectures of Fitnet 1 and Fitnet 4 respectively, obtain similar or higher accuracy on the test set. We were also able to train thinner and deeper networks: a 19-layer highway network with ∼1.4M parameters and a 32-layer highway network with ∼1.25M parameters both perform similarly to the teacher network of Romero et al. (2014).
4. Analysis
In Figure 2 we show some inspections on the inner workings of the best 50 hidden layer fully-connected highway networks trained on MNIST (top row) and CIFAR-100 (bottom row). The first three columns show, for each transform gate, the bias, the mean activity over 10K random samples, and the activity for a single random sample respectively. The block outputs for the same single sample are displayed in the last column.
The transform gate biases of the two networks were initialized to -2 and -4 respectively. It is interesting to note that contrary to our expectations most biases actually decreased further during training. For the CIFAR-100 network the biases increase with depth forming a gradient. Curiously this gradient is inversely correlated with the average activity of the transform gates as seen in the second column. This indicates that the strong negative biases at low depths are not used to shut down the gates, but to make them more selective. This behavior is also suggested by the fact that the transform gate activity for a single example (column 3) is very sparse. This effect is more pronounced for the CIFAR-100 network, but can also be observed to a lesser extent in the MNIST network.
Figure 1. Comparison of the optimization of plain networks and highway networks of various depths. All networks were optimized using SGD. The curves shown are for the best hyperparameter settings obtained for each configuration using a random search. Plain networks become increasingly difficult to optimize as depth increases, while highway networks with up to 100 layers can still be optimized well.
Table 1. CIFAR-10 test set accuracy of convolutional highway networks with rectified linear activations and sigmoid gates. For comparison, results reported by Romero et al. (2014) using maxout networks are also shown. Fitnets were trained using a two-step training procedure with soft targets from a trained teacher network, which was itself trained using backpropagation. We trained all highway networks directly using backpropagation. * indicates networks trained only on a set of 40K examples out of the 50K examples in the training set.
The last column of Figure 2 displays the block outputs and clearly visualizes the concept of “information highways”. Most of the outputs stay constant over many layers forming a pattern of stripes. Most of the change in outputs happens in the early layers (≈ 10 for MNIST and ≈ 30 for CIFAR-100). We hypothesize that this difference is due to the higher complexity of the CIFAR-100 dataset.
Figure 2. Visualization of certain internals of the blocks in the best 50 hidden layer highway networks trained on MNIST (top row) and CIFAR-100 (bottom row). The first hidden layer is a plain layer which changes the dimensionality of the representation to 50. Each of the 49 highway layers (y-axis) consists of 50 blocks (x-axis). The first column shows the transform gate biases, which were initialized to -2 and -4 respectively. In the second column the mean output of the transform gate over 10,000 training examples is depicted. The third and fourth columns show the transform gate output and block output for a single random training sample.
In summary it is clear that highway networks actually utilize the gating mechanism to pass information almost unchanged through many layers. This mechanism serves not just as a means for easier training, but is also heavily used to route information in a trained network. We observe very selective activity of the transform gates, varying strongly in reaction to the current input patterns.
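A small, hypothetical sketch of the kind of inspection behind Figure 2 (column 2), reusing the `HighwayLayer` and the `nn.Sequential` model sketched earlier; the function name and structure are illustrative only.

```python
import torch

@torch.no_grad()
def mean_gate_activity(model, inputs):
    """Collect the mean transform-gate output per highway layer over a batch of samples."""
    activities = []
    x = inputs
    for module in model:                          # nn.Sequential is iterable over its sub-modules
        if isinstance(module, HighwayLayer):
            t = torch.sigmoid(module.T(x))        # transform gate output for this layer's input
            activities.append(t.mean(dim=0))      # average gate activity over the batch
        x = module(x)                             # forward through the module as usual
    return torch.stack(activities)                # shape: [num_highway_layers, width]
```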
5. Conclusion
Learning to route information through neural networks has helped to scale up their application to challenging problems by improving credit assignment and making training easier (Srivastava et al., 2015). Even so, training very deep networks has remained difficult, especially without considerably increasing total network size.
Highway networks are novel neural network architectures which enable the training of extremely deep networks using simple SGD. While the traditional plain neural architectures become increasingly difficult to train with increasing network depth (even with variance-preserving initialization), our experiments show that optimization of highway networks is not hampered even as network depth increases to a hundred layers.
The ability to train extremely deep networks opens up the possibility of studying the impact of depth on complex problems without restrictions. Various activation functions which may be more suitable for particular problems but for which robust initialization schemes are unavailable can be used in deep highway networks. Future work will also attempt to improve the understanding of learning in highway networks.