Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture
Authors:
David Eigen: Dept. of Computer Science, Courant Institute, New York University
Rob Fergus: Facebook AI Research
Abstract:
1. A single multi-scale convolutional network handles three tasks: depth prediction, surface normal estimation, and semantic labeling.
2. It refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation.
1. Introduction
In this paper, we address three of these tasks: depth prediction, surface normal estimation, and semantic segmentation, all using a single common architecture.
First, adapting the model to a new task requires only an appropriate training set and loss function.
Second, a single architecture simplifies the implementation of systems that require multiple modalities.
Third, much of the computation can be shared between modalities, making the system more efficient.
2. Related Work
Most of these systems use ConvNets to find only local features, or generate descriptors of discrete proposal regions; by contrast, our network uses both local and global views to predict a variety of output types.
3. Model Architecture
The first scale in the network predicts a coarse but spatially varying set of features for the entire image area, based on a large, full-image field of view; this is accomplished through the use of two fully-connected layers.
The job of the second scale is to produce predictions at a mid-level resolution, by incorporating a more detailed but narrower view of the image along with the full-image information supplied by the coarse network.
The final scale of our model refines the predictions to higher resolution.
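A minimal PyTorch sketch of this three-scale design. All layer sizes, channel counts, and the 256x256 input resolution are illustrative assumptions for this sketch, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleNet(nn.Module):
    """Illustrative three-scale network; all sizes are assumptions."""
    def __init__(self, out_channels=1):
        super().__init__()
        # Scale 1: coarse, full-image view ending in two fully-connected layers.
        self.s1_conv = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=4, padding=2), nn.ReLU(),    # 256 -> 64
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.MaxPool2d(4),                                        # 32 -> 8
        )
        self.s1_fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1024), nn.ReLU(),
            nn.Linear(1024, 64 * 16 * 16),  # reshaped into a coarse feature map
        )
        # Scale 2: mid-resolution predictions from the image plus coarse features.
        self.s2 = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, out_channels, 5, padding=2),
        )
        # Scale 3: refines the scale-2 predictions at higher resolution.
        self.s3 = nn.Sequential(
            nn.Conv2d(3 + out_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )

    def forward(self, x):  # x: (B, 3, 256, 256) in this sketch
        f1 = self.s1_fc(self.s1_conv(x)).view(-1, 64, 16, 16)
        f1 = F.interpolate(f1, size=(128, 128), mode='bilinear', align_corners=False)
        x_half = F.interpolate(x, size=(128, 128), mode='bilinear', align_corners=False)
        p2 = self.s2(torch.cat([x_half, f1], dim=1))   # mid-resolution prediction
        p2_up = F.interpolate(p2, size=(256, 256), mode='bilinear', align_corners=False)
        p3 = self.s3(torch.cat([x, p2_up], dim=1))     # refined prediction
        return p2, p3
```

Only the output layer changes across tasks: out_channels=1 for depth, 3 for normals, and the number of classes for semantic labeling.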
4. Tasks
We apply the same architecture to each of the three tasks we investigate: depth, normals, and semantic labeling. Each makes use of a different loss function and target data defining the task.
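For depth, the model is trained in log-depth space with a scale-invariant error plus first-order gradient-matching terms. A minimal sketch; the term weights follow the paper's formulation, but treat the implementation details as assumptions:

```python
import torch

def depth_loss(pred_log, gt_log):
    """pred_log, gt_log: (B, 1, H, W) log-depth maps."""
    d = pred_log - gt_log
    # Scale-invariant term: penalizes the variance of the log error, so a
    # global depth offset costs less than errors in scene structure.
    si = (d ** 2).flatten(1).mean(1) - 0.5 * d.flatten(1).mean(1) ** 2
    # Gradient terms: encourage predicted depth edges to line up with
    # edges in the ground truth.
    dx = d[..., :, 1:] - d[..., :, :-1]
    dy = d[..., 1:, :] - d[..., :-1, :]
    return si.mean() + (dx ** 2).mean() + (dy ** 2).mean()
```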
To predict surface normals, we change the output from one channel to three, and predict the x, y and z components of the normal at each pixel.
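A sketch of a matching objective: predicted vectors are normalized to unit length per pixel and scored by their dot product with the ground truth (the reduction details are assumptions of this sketch):

```python
import torch.nn.functional as F

def normals_loss(pred, gt):
    """pred, gt: (B, 3, H, W); gt normals assumed unit length."""
    pred_unit = F.normalize(pred, dim=1)        # unit-length normal per pixel
    return -(pred_unit * gt).sum(dim=1).mean()  # negative mean dot product
```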
For semantic labeling, we use a pixelwise softmax classifier to predict a class label for each pixel.
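A minimal sketch of this loss; the 40-class count is illustrative (e.g. an NYUDepth-style label set):

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 40, 64, 64)           # (B, C, H, W) per-pixel scores
labels = torch.randint(0, 40, (2, 64, 64))    # (B, H, W) integer class ids
loss = nn.CrossEntropyLoss()(logits, labels)  # softmax + NLL over every pixel
```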
5. Training
We train our model in two phases using SGD: First, we jointly train both Scales 1 and 2. Second, we fix the parameters of these scales and train Scale 3.
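Continuing the MultiScaleNet sketch above, the two-phase schedule might look as follows (learning rates and momentum are placeholders, and the attribute names belong to the sketch, not the paper's code):

```python
import torch

model = MultiScaleNet(out_channels=1)

# Phase 1: jointly train Scales 1 and 2; Scale 3 is left untouched.
phase1 = (list(model.s1_conv.parameters())
          + list(model.s1_fc.parameters())
          + list(model.s2.parameters()))
opt1 = torch.optim.SGD(phase1, lr=1e-3, momentum=0.9)

# Phase 2: freeze Scales 1 and 2, then train Scale 3 alone.
for p in phase1:
    p.requires_grad_(False)
opt2 = torch.optim.SGD(model.s3.parameters(), lr=1e-3, momentum=0.9)
```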
We augment the training data with random scaling, in-plane rotation, translation, color shifts, flips, and contrast adjustments.
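The key detail is that targets must be transformed consistently with the inputs; for example, zooming the image by a factor s makes the scene appear closer, so depth values are divided by s. A partial sketch covering scaling, flips, and color (ranges are illustrative assumptions):

```python
import random
import torch
import torch.nn.functional as F

def augment(img, depth):
    """img: (3, H, W) float image; depth: (H, W) metric depth."""
    s = random.uniform(1.0, 1.5)                          # random zoom factor
    img = F.interpolate(img[None], scale_factor=s,
                        mode='bilinear', align_corners=False)[0]
    depth = F.interpolate(depth[None, None], scale_factor=s,
                          mode='nearest')[0, 0] / s       # closer when zoomed in
    if random.random() < 0.5:                             # horizontal flip
        img, depth = img.flip(-1), depth.flip(-1)
    img = img * torch.empty(3, 1, 1).uniform_(0.8, 1.2)   # per-channel color
    return img, depth
```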
5.3 Combining Depth and Normals
6. Performance Experiments
SIFT Flow: dense correspondence across different scenes.
7. Probe Experiments
7.2 Effect of Depth and Normals Inputs
(i) How important are the depth and normals inputs relative to RGB in the semantic labeling task?
(ii) What might happen if we were to replace the true depth and normals inputs with the predictions made by our network?