3D Human Pose Estimation = 2D Pose Estimation + Matching(2017)

Abstract

While many approaches try to directly predict 3D pose from image measurements, we explore a simple architecture that reasons through intermediate 2D pose predictions. The approach rests on two observations:
(1) Deep neural nets have revolutionized 2D pose estimation, producing accurate 2D predictions even for poses with self-occlusions;
(2) “Big-data” sets of 3D mocap data are now readily available, making it tempting to “lift” predicted 2D poses to 3D through simple memorization (e.g., nearest neighbors).
The architecture is straightforward to implement with off-the-shelf 2D pose estimation systems and 3D mocap libraries.

Introduction

The paper introduces human pose estimation and notes that existing work largely relies on highly sensored environments. The goal here is to estimate 3D pose from a single RGB image. Current 2D estimators already perform well, even under self-occlusion, so the task becomes predicting depth values for the estimated 2D joints. Inspired by the success of data-driven architectures, we explore a simple non-parametric encoding of such high-level constraints: given a 3D pose library, we generate a large number of 2D projections (from virtual camera views). Given this training set of paired (2D, 3D) data and predictions from a 2D pose estimation algorithm, we report back the depth from the 3D pose associated with the closest matching 2D example from our library.
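The library construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the choice of cameras that only rotate around the vertical axis, and the `camera_distance` and `focal_length` defaults are all assumptions made for the example.

```python
import numpy as np

def rotation_about_y(angle_rad):
    """Rotation matrix for a turn of angle_rad around the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def build_library(poses_3d, num_views=8, camera_distance=5.0, focal_length=1.0):
    """Pair every 3D pose with 2D projections from several virtual cameras.

    poses_3d: (P, N, 3) array of root-centred poses (hypothetical input format).
    Returns (lib_2d, lib_3d) with shapes (P * num_views, N, 2) and
    (P * num_views, N, 3); lib_3d holds the poses in camera coordinates,
    so its last column is the depth of each joint.
    """
    lib_2d, lib_3d = [], []
    angles = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    for pose in poses_3d:
        for angle in angles:
            cam_pose = pose @ rotation_about_y(angle).T  # rotate into the camera frame
            cam_pose[:, 2] += camera_distance            # place the pose in front of the camera
            # Perspective projection: divide x and y by depth.
            lib_2d.append(focal_length * cam_pose[:, :2] / cam_pose[:, 2:3])
            lib_3d.append(cam_pose)
    return np.stack(lib_2d), np.stack(lib_3d)
```

Each 3D pose appears once per virtual view, which matches the idea of storing a separate (2D, 3D) pair for every camera.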

Generalization

3D data is hard to annotate and is usually captured in laboratory settings, whereas 2D datasets are far more diverse. Our two-stage pipeline makes use of different training sets for different stages, resulting in a system that can predict 3D poses from “in-the-wild” images.

Evaluation

See the original paper for details.

Related Works

See the original paper for details.

Approach

We make use of a probabilistic formulation over variables including the image $I$, the 3D pose $X\in \mathbb{R}^{N\times 3}$, and the 2D pose $x\in \mathbb{R}^{N\times 2}$, where $N$ is the number of articulated joints. The joint probability can be written as:
$p(X,x,I)=p(X|x,I)\cdot p(x|I)\cdot p(I)\space (1)$

Conditional independence

We assume that the 3D pose $X$ is conditionally independent of the image $I$ given the 2D pose $x$. That is, once the 2D estimate is given, the corresponding 3D estimate is not affected by the 2D image measurements. This factorization still allows $p(x|I)$ to be arbitrarily complex, which is likely needed to accurately model complex interactions between 2D projections and image features during occlusions. Given this conditional independence, one can write:
$p(X,x,I)=p(X|x)\cdot p(x|I)\cdot p(I)\space (2)$
Here $p(x|I)$ is an image-based CNN that predicts 2D keypoint heatmaps, and $p(X|x)$ is a non-parametric nearest-neighbor (NN) model.
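As a hedged aside (a reading of the factorization, not a formula quoted from the paper): marginalizing (2) over the 2D pose gives the 3D posterior, and a two-stage pipeline can be read as approximating that sum by its dominant term. The symbol $x^*$ is introduced here for illustration and denotes the most likely 2D pose under the CNN; the approximation assumes $p(x|I)$ is sharply peaked.
$p(X|I)=\sum_x p(X|x)\cdot p(x|I)\approx p(X|x^*),\quad x^*=\arg\max_x p(x|I)$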

Image-Based 2D Pose Estimation

Given the independence assumption above, we first need to predict the 2D pose from the given image measurements. We model the conditional of the 2D pose given an image as
$p(x|I)=CNN(I)\space (3)$
We assume the CNN is a nonlinear function that returns $N$ 2D heatmaps (or marginal distributions over the location of individual joints). We make use of convolutional pose machines (CPMs), which return precisely $N$ heatmaps, one for each body joint. We normalize the heatmaps so that they can be interpreted as marginal distributions for each joint.
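The normalization step can be illustrated with a short sketch. It assumes the heatmaps arrive as an `(N, H, W)` array of non-negative scores, and it reads off a 2D pose by taking the per-joint argmax. Both the input layout and the argmax readout are assumptions for this example, not details confirmed by the notes.

```python
import numpy as np

def normalize_heatmaps(heatmaps, eps=1e-12):
    """Rescale each (H, W) heatmap so it sums to 1, i.e. a marginal over pixels."""
    heatmaps = np.clip(heatmaps, 0.0, None)            # guard against negative scores
    totals = heatmaps.sum(axis=(1, 2), keepdims=True)  # one total per joint
    return heatmaps / (totals + eps)

def heatmaps_to_joints(heatmaps):
    """Return an (N, 2) array of (x, y) pixel locations, one per joint (argmax readout)."""
    n, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1).astype(float)
```

The resulting `(N, 2)` array plays the role of the 2D pose $x$ that is passed on to the 3D matching stage.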

Nonparametric 3D shape model

We model $p(X|x)$ with a non-parametric nearest-neighbor model and follow a notational convention where $X=[X,Y,Z]$ and $x=[x,y]$. Assume that we have a library of 3D poses $\{X_i\}$, each paired with a particular camera projection matrix $\{M_i\}$, such that the associated 2D poses are given by $\{M_i(X_i)\}$. (If we want to consider multiple cameras for a single 3D pose, we add another copy of the 3D pose with a different camera matrix to our library.) We define a distribution over 3D poses based on reprojection error:
$p(X=X_i|x)\propto \exp\left(-\frac{1}{\sigma^2}\|M_i(X_i)-x\|^2\right)\space (4)$
The MAP estimate is given by the 1-nearest neighbor (1NN).
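Equation (4) and its 1NN MAP estimate can be written in a few lines of NumPy. This is a sketch under stated assumptions: the library is stored as precomputed arrays `lib_2d` (the projections $M_i(X_i)$) and `lib_3d` (the poses $X_i$), and the function names and the default `sigma` are hypothetical.

```python
import numpy as np

def pose_posterior(query_2d, lib_2d, sigma=1.0):
    """Evaluate Eq. (4): a softmax over negative squared reprojection errors.

    query_2d: (N, 2) predicted 2D pose; lib_2d: (L, N, 2) library projections.
    Returns (posterior, sq_err), both of shape (L,).
    """
    diffs = lib_2d - query_2d[None]          # (L, N, 2)
    sq_err = (diffs ** 2).sum(axis=(1, 2))   # squared reprojection error per exemplar
    logits = -sq_err / sigma ** 2
    logits -= logits.max()                   # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(), sq_err

def lift_to_3d(query_2d, lib_2d, lib_3d, sigma=1.0):
    """MAP estimate: return the 3D pose of the nearest 2D exemplar and its index."""
    posterior, _ = pose_posterior(query_2d, lib_2d, sigma)
    best = int(posterior.argmax())
    return lib_3d[best], best
```

Because the exponential is monotonic, the argmax of the posterior is the exemplar with the smallest reprojection error, so `sigma` affects how peaked the distribution is but not which exemplar is returned.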

Virtual cameras

We can further reduce the squared reprojection error by searching over small perturbations of each camera. This involves solving a camera resectioning problem [9], where an iterative solver can be initialized with $M_i$:
$M_i^*=\operatorname*{arg\,min}_{M} \|M(X_i)-x\|^2\space (5)$
In practice, construct a shortlist of $k$ candidates that score well according to Eq. (4), and re-sort them according to the optimal camera matrix. Optimizing the camera parameters gives a slight performance improvement. $k=10$ is used.
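The shortlist-and-re-rank step can be sketched as below. One substitution is made and should be read as such: instead of iterative perspective resectioning initialized at $M_i$, the example fits an affine camera in closed form by linear least squares. This is simpler but less constrained than the camera model in (5). The function names and the array-based library layout are assumptions for the example.

```python
import numpy as np

def fit_affine_camera(pose_3d, pose_2d):
    """Least-squares affine camera A (4, 2) with [X 1] @ A ~ x.

    Stand-in for the resectioning in Eq. (5); returns (A, squared residual).
    """
    homog = np.hstack([pose_3d, np.ones((pose_3d.shape[0], 1))])  # (N, 4)
    A, *_ = np.linalg.lstsq(homog, pose_2d, rcond=None)
    residual = float(((homog @ A - pose_2d) ** 2).sum())
    return A, residual

def rerank_shortlist(query_2d, lib_2d, lib_3d, k=10):
    """Take the k best exemplars under Eq. (4), then re-sort by refined camera error."""
    sq_err = ((lib_2d - query_2d[None]) ** 2).sum(axis=(1, 2))
    shortlist = np.argsort(sq_err)[:k]            # k nearest under the fixed cameras
    refined = np.array([fit_affine_camera(lib_3d[i], query_2d)[1] for i in shortlist])
    order = np.argsort(refined)                   # smallest refined error first
    return shortlist[order], refined[order]
```

Only the shortlist is refined, so the per-candidate camera fit runs $k$ times rather than once per library entry.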

Warped exemplars

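To address the camera projection problem, the paper proposes a simple and effective method; see the original paper for details.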

Experiments

Two comparison protocols are used: the first compares against [35] and [25], the second against [37] and [30].

Diagnostics

We now perform an extensive set of diagnostics to reveal the strengths of our individual components, as well as an upper-bound analysis that is useful for guiding future work.

Conclusion

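The method first performs 2D pose estimation and then matches against 3D exemplars. 2D datasets are used to train the initial image-processing module, and 3D datasets are used to train the subsequent 3D reasoning module. Given such reliable 2D estimates, we show that one can efficiently impute depth through simple memorization and warping of a 3D pose library.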
