DeepMind Science paper: Neural scene representation and rendering (a walkthrough of the official blog post)

There is more than meets the eye when it comes to how we understand a visual scene: our brains draw on prior knowledge to reason and to make inferences that go far beyond the patterns of light that hit our retinas. For example, when entering a room for the first time, you instantly recognise the items it contains and where they are positioned. If you see three legs of a table, you will infer that there is probably a fourth leg with the same shape and colour hidden from view. Even if you can’t see everything in the room, you’ll likely be able to sketch its layout, or imagine what it looks like from another perspective.
These visual and cognitive tasks are seemingly effortless to humans, but they represent a significant challenge to our artificial systems. Today, state-of-the-art visual recognition systems are trained using large datasets of annotated images produced by humans. Acquiring this data is a costly and time-consuming process, requiring individuals to label every aspect of every object in each scene in the dataset. As a result, often only a small subset of a scene’s overall contents is captured, which limits the artificial vision systems trained on that data. As we develop more complex machines that operate in the real world, we want them to fully understand their surroundings: where is the nearest surface to sit on? What material is the sofa made of? Which light source is creating all the shadows? Where is the light switch likely to be?

In this work, published in Science (Open Access version), we introduce the Generative Query Network (GQN), a framework within which machines learn to perceive their surroundings by training only on data obtained by themselves as they move around scenes. Much like infants and animals, the GQN learns by trying to make sense of its observations of the world around it. In doing so, the GQN learns about plausible scenes and their geometrical properties, without any human labelling of the contents of scenes.

The GQN model is composed of two parts: a representation network and a generation network. The representation network takes the agent's observations as its input and produces a representation (a vector) which describes the underlying scene. The generation network then predicts (‘imagines’) the scene from a previously unobserved viewpoint.
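To make this two-part structure concrete, here is a deliberately small sketch in PyTorch. It is not the paper's implementation: the published representation network is a deeper convolutional architecture and the generator is a recurrent latent-variable renderer, while this sketch uses a plain CNN encoder and a one-shot decoder with a single Gaussian latent. The 64x64 image size, the 7-number viewpoint encoding and all layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RepresentationNetwork(nn.Module):
    """Encodes one (image, viewpoint) observation into a scene-representation vector."""
    def __init__(self, r_dim=256, view_dim=7):
        super().__init__()
        self.conv = nn.Sequential(                      # 3x64x64 image -> 64x8x8 features
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 8 * 8 + view_dim, r_dim)

    def forward(self, image, viewpoint):
        h = self.conv(image).flatten(start_dim=1)
        return self.fc(torch.cat([h, viewpoint], dim=1))    # (batch, r_dim)

class GenerationNetwork(nn.Module):
    """Predicts ('imagines') the image seen from a query viewpoint, given a scene representation."""
    def __init__(self, r_dim=256, view_dim=7, z_dim=64):
        super().__init__()
        self.z_dim = z_dim
        self.fc = nn.Linear(r_dim + view_dim + z_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(                    # 64x8x8 features -> 3x64x64 image
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, r, query_viewpoint, z=None):
        if z is None:                                   # stochastic 'imagination': sample a latent
            z = torch.randn(r.size(0), self.z_dim, device=r.device)
        h = self.fc(torch.cat([r, query_viewpoint, z], dim=1))
        return self.deconv(h.view(-1, 64, 8, 8))        # (batch, 3, 64, 64) predicted image
```

The split mirrors the description above: the encoder summarises what is in the scene, and the generator works out how that content looks from whichever camera it is asked about.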

The representation network does not know which viewpoints the generation network will be asked to predict, so it must find an efficient way of describing the true layout of the scene as accurately as possible. It does this by capturing the most important elements, such as object positions, colours and the room layout, in a concise distributed representation. During training, the generator learns about typical objects, features, relationships and regularities in the environment. This shared set of ‘concepts’ enables the representation network to describe the scene in a highly compressed, abstract manner, leaving it to the generation network to fill in the details where necessary. For instance, the representation network will succinctly represent ‘blue cube’ as a small set of numbers and the generation network will know how that manifests itself as pixels from a particular viewpoint.
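The compression described here comes from collapsing the per-view encodings into a single vector; the paper aggregates them with an elementwise sum over however many observations are available, so the representation stays the same size no matter how much the agent has seen. Continuing the sketch above, and substituting a plain reconstruction loss for the paper's variational training objective to keep things short:

```python
rep_net, gen_net = RepresentationNetwork(), GenerationNetwork()

# context_images: (B, K, 3, 64, 64); context_views: (B, K, 7); query_view: (B, 7)
def training_step(context_images, context_views, query_view, query_image):
    K = context_images.shape[1]
    # One concise vector describes the whole scene, however many views were observed.
    r = sum(rep_net(context_images[:, k], context_views[:, k]) for k in range(K))
    prediction = gen_net(r, query_view)                 # 'imagine' the unseen viewpoint
    loss = nn.functional.mse_loss(prediction, query_image)
    loss.backward()                                     # optimiser step omitted from the sketch
    return loss
```

In words: encode each context view, add the encodings together, ask the generator for a viewpoint it has never observed, and train both networks end-to-end on the mismatch.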

We performed controlled experiments on the GQN in a collection of procedurally-generated environments in a simulated 3D world, containing multiple objects in random positions, colours, shapes and textures, with randomised light sources and heavy occlusion. After training on these environments, we used GQN’s representation network to form representations of new, previously unobserved scenes. We showed in our experiments that the GQN exhibits several important properties:

  • The GQN’s generation network can ‘imagine’ previously unobserved scenes from new viewpoints with remarkable precision. When given a scene representation and new camera viewpoints, it generates sharp images without any prior specification of the laws of perspective, occlusion, or lighting. The generation network is therefore an approximate renderer that is learned from data:
[Figure: renders of held-out scenes from new query viewpoints]
  • The GQN’s representation network can learn to count, localise and classify objects without any object-level labels. Even though its representation can be very small, the GQN’s predictions at query viewpoints are highly accurate and almost indistinguishable from ground-truth. This implies that the representation network perceives accurately, for instance identifying the precise configuration of blocks that make up the scenes below:
[Figure: predicted versus ground-truth views of scenes composed of coloured blocks]
  • The GQN can represent, measure and reduce uncertainty. It is capable of accounting for uncertainty in its beliefs about a scene even when its contents are not fully visible, and it can combine multiple partial views of a scene to form a coherent whole. This is shown by its first-person and top-down predictions in the figure below. The model expresses its uncertainty through the variability of its predictions, which gradually reduces as it moves around the maze (grey cones indicate observation locations, yellow cone indicates query location):
[Figure: first-person and top-down predictions in a maze; grey cones mark observation locations, the yellow cone marks the query location]
  • The GQN’s representation allows for robust, data-efficient reinforcement learning. When given GQN’s compact representations, state-of-the-art deep reinforcement learning agents learn to complete tasks in a more data-efficient manner compared to model-free baseline agents, as shown in the figure below. To these agents, the information encoded in the generation network can be seen to be ‘innate’ knowledge of the environment:
[Figure: data-efficiency comparison between an agent using GQN representations and a model-free baseline trained on raw pixels]
Using GQN we observe substantially more data-efficient policy learning, obtaining convergence-level performance with approximately 4 times fewer interactions than a standard method using raw pixels.
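Two of the properties above lend themselves to short sketches built on the same toy modules. For the reinforcement learning result, the only structural change is that the agent's policy reads the compact scene representation instead of raw pixels; a hypothetical policy head (layer sizes and action count invented for illustration) might look like this:

```python
num_actions = 4                                         # hypothetical discrete action space
policy = nn.Sequential(                                 # policy head over the 256-d scene vector
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, num_actions),
)
obs_image, obs_view = torch.randn(1, 3, 64, 64), torch.randn(1, 7)   # stand-in observation
action_logits = policy(rep_net(obs_image, obs_view))
```

The uncertainty behaviour in the third bullet can be mimicked by exploiting the latent that the sketch generator samples on every call: rendering the same query viewpoint several times yields different plausible completions, and their per-pixel spread is a simple stand-in for the variability visualised in the maze figure (an illustration of the idea, not the paper's procedure):

```python
@torch.no_grad()
def uncertainty_map(r, query_view, n_samples=20):
    # Draw several 'imaginations' of the same viewpoint and measure how much they disagree.
    samples = torch.stack([gen_net(r, query_view) for _ in range(n_samples)])
    return samples.std(dim=0)       # (B, 3, 64, 64): large where the scene is under-determined
```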

GQN builds upon a large literature of recent related work in multi-view geometry, generative modelling, unsupervised learning and predictive learning, which we discuss here, in the Science paper and the Open Access version. It illustrates a novel way to learn compact, grounded representations of physical scenes. Crucially, the proposed approach does not require domain-specific engineering or time-consuming labelling of the contents of scenes, allowing the same model to be applied to a range of different environments. It also learns a powerful neural renderer that is capable of producing accurate images of scenes from new viewpoints.

Our method still has many limitations when compared to more traditional computer vision techniques, and has currently only been trained to work on synthetic scenes. However, as new sources of data become available and advances are made in our hardware capabilities, we expect to be able to investigate the application of the GQN framework to higher resolution images of real scenes. In future work, it will also be important to explore the application of GQNs to broader aspects of scene understanding, for example by querying across space and time to learn a common sense notion of physics and movement, as well as applications in virtual and augmented reality.

While there is still much more research to be done before our approach is ready to be deployed in practice, we believe this work is a sizeable step towards fully autonomous scene understanding.