Computing Visual Perception to Enable Real-Time Object Recognition

Visual search has become a necessity for many multimedia applications running on today's computing systems. Tasks such as recognizing parts of a scene in an image, detecting items in a retail store, and navigating an autonomous drone are highly relevant in today's rapidly changing environment. Many of the underlying functions in these tasks rely on high-resolution camera sensors that capture light intensity-based data. The data is then processed by different algorithms to perform tasks such as object detection, object recognition, and image segmentation.

Visual attention has gained a lot of traction in computational neuroscience research over the past few years. Various computational models have used low-level features to build information maps, which are then fused together to form what is popularly called a saliency map. Given an image to observe, this saliency map, in essence, provides a compact representation of what is most important in the image. The map can then be used as a mechanism to zoom in on the regions of interest (RoIs) that matter most in the image. For example, in Figure 1, different saliency models show the extent to which pixels representing the bicyclist pop out in stark contrast to the background.

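To make the fusion idea concrete, the following is a schematic sketch, not tied to any specific published model, of how normalized low-level feature maps (e.g., intensity, color, or orientation responses) might be combined into a single saliency map; the function name and the min-max normalization scheme are illustrative assumptions.

```python
import numpy as np

def fuse_feature_maps(feature_maps):
    """Fuse low-level feature maps into one saliency map (illustrative)."""
    normalized = []
    for fmap in feature_maps:
        # Min-max normalize each map so no single feature dominates.
        lo, hi = fmap.min(), fmap.max()
        normalized.append((fmap - lo) / (hi - lo + 1e-12))
    # Average the normalized maps into a single saliency map.
    return np.mean(normalized, axis=0)
```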

These models assume that the human eye uses its full resolution across the entire field of view (FOV). However, the resolution drops off from the center of the fovea towards the periphery, and the human visual system (HVS) is adept at foveating so as to investigate areas in the periphery when attention is drawn in that direction. In other words, our eyes foveate to allow points of interest to fall on the fovea, the region of highest resolution. It is only after this foveation process that we are capable of gathering complete information from the object of interest that drew our attention. The HVS is thus built in such a way that moving the eyes is necessary to process information from all around one's environment. It is for this reason that humans tend to select nearby locations more frequently than distant targets, and saliency maps need to be computed with this in mind to improve the predictive power of the models. Understanding the efficiency with which our eyes intelligently take in pertinent information to perform different tasks has a significant impact on building the next generation of autonomous systems. Building a foveation framework to test and improve saliency models for real-time autonomous navigation is the focus of this work.

We choose an information-theoretic computational saliency model, Attention based on Information Maximization (AIM), as the building block for our foveation framework. AIM has been benchmarked against many other saliency models and has been shown to predict human fixations closely. The model computes visual content as a measure of surprise or rarity using Shannon's self-information. The algorithm is divided into three major sections, as shown in Figure 2. The first section creates a sparse representation of the image by projecting it onto a set of learnt basis functions. The next section performs density estimation using a histogram back-projection technique. Finally, a log-likelihood is computed to give the final information map. For more details on the algorithm and the theory behind it, the reader is referred to [1].
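
As an illustration of these three stages, here is a minimal sketch, not the reference implementation of [1], that computes a self-information map given a pre-learnt basis set; the function names and the histogram bin count are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def aim_information_map(gray, bases, num_bins=256):
    """Sketch of AIM's three stages for a grayscale float image.

    bases: array of shape (num_filters, patch_h, patch_w) holding
    pre-learnt basis functions (e.g., ICA bases from natural images).
    """
    info_map = np.zeros_like(gray, dtype=np.float64)
    for basis in bases:
        # Stage 1: sparse representation -- project the image onto a
        # learnt basis function via convolution.
        coeffs = convolve2d(gray, basis, mode="same", boundary="symm")
        # Stage 2: density estimation -- histogram the coefficients and
        # back-project each pixel's bin probability.
        hist, edges = np.histogram(coeffs, bins=num_bins)
        prob = hist / coeffs.size
        bin_idx = np.digitize(coeffs, edges[1:-1])
        p = np.maximum(prob[bin_idx], 1e-12)  # guard against log(0)
        # Stage 3: accumulate the log-likelihood; rare responses carry
        # high Shannon self-information (-log p).
        info_map += -np.log(p)
    return info_map
```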

Figure 3: Breakdown of the image into a three-level architecture with a central high-resolution fovea, a mid-resolution region, and a low-resolution region. The lowest level covers the entire FOV.

In order to model the steep roll-off in resolution from fovea to periphery, once the image is captured by the camera sensor, we build a three-level Gaussian pyramid as shown in Figure 3. To do this, we first extract a 50% high-resolution center region from Level 1 as our fovea. After blurring and downsampling, a second region is cropped out from Level 2, representing the mid-resolution region. Another round of blurring and downsampling leaves us with the entire FOV, but at a much lower resolution (Level 3). Note that as the resolution drops off, the FOV gradually increases in our framework.
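
A minimal OpenCV sketch of this pyramid construction is shown below; the 50% fovea crop follows the text, while the mid-level crop fraction and the function names are illustrative assumptions.

```python
import cv2

def center_crop(img, frac):
    """Crop a centered region covering `frac` of each dimension."""
    h, w = img.shape[:2]
    ch, cw = int(h * frac), int(w * frac)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    return img[y0:y0 + ch, x0:x0 + cw]

def build_foveated_pyramid(image, fovea_frac=0.5, mid_frac=0.75):
    # Level 1: high-resolution central fovea (50% of the full frame).
    fovea = center_crop(image, fovea_frac)
    # Level 2: Gaussian blur + 2x downsample, then crop a wider
    # mid-resolution region around the center (mid_frac is assumed).
    level2 = cv2.pyrDown(image)
    mid = center_crop(level2, mid_frac)
    # Level 3: blur and downsample again, keeping the entire FOV at
    # the lowest resolution.
    periphery = cv2.pyrDown(level2)
    return fovea, mid, periphery
```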

For our experiments, we use a ½" format C-mount fisheye lens with a focal length of 1.4 mm and a field of view of 185°. The captured images are 1920×1920 pixels. The images have some inherent nonlinearity as one moves away from the center, similar to the way the human eye perceives the surrounding world. We run AIM on each of these three regions, which returns corresponding information maps. These information maps represent the salient regions at different resolutions, as shown in Figure 4 (c). There are a number of ways to fuse these information maps into a final multi-resolution saliency map. We believe that an adaptive weighting function on each of these maps will be a valuable parameter to tune in a dynamic environment. However, for this work, which focuses on static images, we use weights of w1 = 1/3, w2 = 2/3 and w3 = 1 for the high-resolution fovea, the mid-resolution region and the low-resolution region, respectively. We use these weights because pixels in the fovea occur thrice across the pyramid while pixels in the mid-resolution region occur twice; the weights thus prevent the final saliency map from being overly center-biased. Since these maps are of different sizes, they are appropriately up-sampled and zero-padded before being summed.
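
The fusion step might look like the following sketch, which continues the assumptions of the pyramid sketch above: each information map is upsampled back to full-frame scale, zero-padded into a full-size canvas where needed, weighted, and summed.

```python
import numpy as np
import cv2

def zero_pad_center(small, full_shape):
    """Place a map at the center of a zero canvas of size full_shape."""
    canvas = np.zeros(full_shape, dtype=np.float32)
    h, w = small.shape
    y0, x0 = (full_shape[0] - h) // 2, (full_shape[1] - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = small
    return canvas

def fuse_information_maps(fovea_map, mid_map, low_map, full_shape,
                          weights=(1 / 3.0, 2 / 3.0, 1.0)):
    H, W = full_shape
    w1, w2, w3 = weights
    # The fovea map is already at full resolution for its crop: just pad.
    fovea_full = zero_pad_center(fovea_map, full_shape)
    # The mid map was computed at half resolution: upsample 2x, then pad.
    mid_up = cv2.resize(mid_map.astype(np.float32), None, fx=2.0, fy=2.0,
                        interpolation=cv2.INTER_LINEAR)
    mid_full = zero_pad_center(mid_up, full_shape)
    # The low-resolution map covers the whole FOV: resize to full frame.
    low_full = cv2.resize(low_map.astype(np.float32), (W, H),
                          interpolation=cv2.INTER_LINEAR)
    # Weighted sum counters the center bias from overlapping levels.
    return w1 * fovea_full + w2 * mid_full + w3 * low_full
```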

Figure 4: Methodology of the proposed framework from left to right:
(a) Input Image (b) Image Pyramids with increasing FOV (c) Visual Attention Saliency Maps (d) Multi-resolution Attention Map obtained by fusing (c) with different weights

To validate our model, termed Multi-Resolution AIM (MR-AIM), we ran experiments on a series of patterns as shown in Figure 5. First, we considered a series of spatially distributed red dots of the same dimensions against a black background (Figures 5 (a) and 5 (b)). As can be seen in the saliency result (Figures 5 (e) and 5 (f)), there is a gradual decrease in saliency as one moves away from the fovea (red corresponds to regions of higher saliency, while blue corresponds to regions of lower saliency).

Figure 5: Saliency results for different spatial perturbations
(a)-(d) Input Images, (e)-(h) Saliency Results

Onsets are considered to drive visual attention in a dynamic environment, so in Figure 5 (c) we next considered the arrival of new objects of interest within the fovea (red dot) and towards the periphery (yellow dot). The maximum response is obtained in the region around the yellow dot (Figure 5 (g)). Next, we consider a movement of the yellow dot further away from the fovea (Figure 5 (d)). Again, we notice a slight shift in saliency, moving attention towards the center (Figure 5 (h)). These experiments give us valuable information on the mechanisms of our model when the object of interest is moving relative to the fovea.

Figure 6: Qualitative comparisons

Our next set of experiments compared the multi-resolution model with the original AIM model and evaluated the former in terms of both quality and performance. It should be noted here that the dataset provided in [4] has images of maximum size 1024×768, while the framework designed here is ideally targeted towards high-resolution images that contain many salient objects. Figure 6 (Row 1) shows an example of such an image, with size increasing from left to right. Row 2 depicts results from the original AIM model. Row 3 shows the output of MR-AIM. For smaller image sizes, AIM does a very good job of spotting the main RoIs. But as the image size increases, it starts to pick edges as most salient. This is due to the limited size (21×21) of the basis kernels used. Increasing the size of the kernels is not a viable option for a real-time system, since that would in turn increase the computation time. MR-AIM has no such problem: since it operates on smaller image sizes at different resolutions, it can detect objects at different scales. There is a bias towards objects in the center, but the weights do a significant job of capturing RoIs towards the periphery as well. It should be noted that MR-AIM would not pick up objects that become extremely salient in the periphery, but adding other channels of saliency, such as motion, would make the model more robust in a dynamic environment.

Figure 7: Qualitative comparison of Itti’s model (yellow circle shows focus of attention), AIM (red box shows maximum saliency score) and MR-AIM (red box shows maximum saliency score) at different time instances captured from a video.

Another set of experiments was run on a series of video frames captured in an environment with sufficient activity in the periphery to attract attention, as shown in Figure 7. We compare our model to other models rather than verifying against actual eye-tracking data, since such data is not readily available for high-definition images. The top row shows the results of Itti's model [2] for frame numbers 17, 22 and 27. The middle row shows AIM's results for the respective frames, while the bottom row shows MR-AIM's response. For a fair comparison, we deactivated the inhibition of return in Itti's model. Both AIM and MR-AIM successfully capture the onset of the bicyclist in frame 22. These experiments give us confidence in the qualitative performance of our proposed model. MR-AIM was also benchmarked on the MIT Saliency dataset and detailed results can be found here. Our work was published in [5] and OpenCV-based source code is also available here.

References:

[1] N. D. B. Bruce and J. K. Tsotsos, “Saliency Based on Information Maximization,” Advances in Neural Information Processing Systems, vol. 18, pp. 155–162, 2006.

[2] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.

[3] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to Predict Where Humans Look,” IEEE International Conference on Computer Vision (ICCV), 2009.

[4] T. Judd, F. Durand, and A. Torralba, “A Benchmark of Computational Models of Saliency to Predict Human Fixations,” 2012.

[5] S. Advani, J. Sustersic, K. Irick, and V. Narayanan, “A Multi-Resolution Saliency Framework to Drive Foveation,” Proceedings of the 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.