FaceShifter: Towards High Fidelity And Occlusion Aware Face Swapping

3. Methods

Let $X_s$ denote the source image, which provides the identity, and $X_t$ the target image, which provides the attributes (including pose, expression, scene lighting, and background).

FaceShifter consists of two stages. In stage 1, the Adaptive Embedding Integration Network (AEI-Net) generates a high-fidelity face-swapping result $\hat{Y}_{s,t}$; in stage 2, the Heuristic Error Acknowledging Refinement Network (HEAR-Net) handles facial occlusions and refines the output into the final result $Y_{s,t}$.

3.1. Adaptive Embedding Integration Network

As shown in Figure 3(a), the stage-1 network consists of three parts:

  • the Identity Encoder $\bm{z}_{id}(X_s)$ (orange), which extracts the identity information
  • the Multi-level Attributes Encoder $\bm{z}_{att}(X_t)$ (gray), which extracts the attribute information
  • the Adaptive Attentional Denormalization (AAD) Generator (green), which integrates the identity and attribute information to generate the face-swapping result

Identity Encoder
A state-of-the-art face recognition model (from [13]) is adopted; the last feature vector generated before the final FC layer is taken as the identity embedding $\bm{z}_{id}(X_s)$.
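A minimal PyTorch sketch of this idea; the `resnet50` backbone is only a stand-in for the actual pretrained face recognition model of [13], whose weights would be loaded in practice:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class IdentityEncoder(nn.Module):
    """Stand-in for the pretrained face recognition model [13]."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # in practice: load face-recognition weights
        # drop the final FC layer; keep everything that produces the feature vector before it
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x_s):
        z_id = self.features(x_s).flatten(1)  # (B, 2048) last feature vector before FC
        return F.normalize(z_id, dim=1)       # unit-norm identity embedding
```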

Multi-level Attributes Encoder

Face attributes, such as pose, expression, lighting, and background, require more spatial information than identity does.

To preserve the attribute information, multi-level feature maps are taken as the attribute embedding (previous works compressed the attribute information into a single vector).

Concretely, $X_t$ is fed into a U-Net-like network, and the feature map from each level of the decoder is collected as $\bm{z}_{att}(X_t)$:
$$\bm{z}_{att}(X_t)=\left\{ \bm{z}_{att}^1(X_t), \bm{z}_{att}^2(X_t), \cdots, \bm{z}_{att}^n(X_t) \right\} \qquad (1)$$
where $\bm{z}_{att}^k(X_t)$ denotes the feature map output by the $k$-th level of the U-Net decoder.

Notably, the Multi-level Attributes Encoder requires no attribute annotations; it learns to extract attribute information automatically through self-supervised training.
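A toy PyTorch sketch of such an encoder; the depth, channel widths, and the choice of which maps count as decoder levels are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEncoder(nn.Module):
    """Toy U-Net-like encoder that returns every decoder feature map."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, 4, 2, 1)     # H/2
        self.enc2 = nn.Conv2d(32, 64, 4, 2, 1)    # H/4
        self.enc3 = nn.Conv2d(64, 128, 4, 2, 1)   # H/8 (bottleneck)
        self.dec3 = nn.ConvTranspose2d(128, 64, 4, 2, 1)      # H/4
        self.dec2 = nn.ConvTranspose2d(64 + 64, 32, 4, 2, 1)  # H/2, skip from enc2
        self.dec1 = nn.ConvTranspose2d(32 + 32, 16, 4, 2, 1)  # H,   skip from enc1

    def forward(self, x_t):
        e1 = F.leaky_relu(self.enc1(x_t), 0.1)
        e2 = F.leaky_relu(self.enc2(e1), 0.1)
        e3 = F.leaky_relu(self.enc3(e2), 0.1)
        d3 = F.leaky_relu(self.dec3(e3), 0.1)
        d2 = F.leaky_relu(self.dec2(torch.cat([d3, e2], 1)), 0.1)
        d1 = F.leaky_relu(self.dec1(torch.cat([d2, e1], 1)), 0.1)
        return [e3, d3, d2, d1]  # z_att^1 ... z_att^n, coarse to fine
```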

Given this attribute embedding, we require the face-swapping result $\hat{Y}_{s,t}$ to have the same attribute embedding as the target image $X_t$.

Adaptive Attentional Denormalization Generator
This step integrates the two embeddings $\bm{z}_{id}(X_s)$ and $\bm{z}_{att}(X_t)$ to generate the face-swapping result $\hat{Y}_{s,t}$.

Previous works integrate the embeddings by feature concatenation, which produces blurry results. We therefore propose Adaptive Attentional Denormalization (AAD), which integrates the two embeddings in an adaptive fashion.

Let $\bm{h}_{in}^k$ denote the input of the $k$-th AAD layer. First, $\bm{h}_{in}^k$ is instance-normalized:
$$\bar{\bm{h}}_k=\frac{\bm{h}_{in}^k-\bm{\mu}^k}{\bm{\sigma}^k} \qquad (2)$$

Step 1: attribute embedding integration.
The AAD layer receives $\bm{z}_{att}^k\in\mathbb{R}^{C_{att}^k\times H^k\times W^k}$ as input and applies convolutions to it to obtain $\gamma_{att}^k, \beta_{att}^k\in\mathbb{R}^{C^k\times H^k\times W^k}$.

These are used to denormalize $\bar{\bm{h}}_k$, yielding the attribute activation $\bm{A}^k$:
$$\bm{A}^k=\gamma_{att}^k\otimes\bar{\bm{h}}_k+\beta_{att}^k \qquad (3)$$

Step 2: identity embedding integration.
The identity embedding $\bm{z}_{id}$ extracted from $X_s$ is passed through FC layers to obtain $\gamma_{id}^k, \beta_{id}^k\in\mathbb{R}^{C^k}$.

In the same way, $\bar{\bm{h}}_k$ is denormalized to obtain the identity activation $\bm{I}^k$:
$$\bm{I}^k=\gamma_{id}^k\otimes\bar{\bm{h}}_k+\beta_{id}^k \qquad (4)$$

Step 3: adaptive attention mask.
A conv + sigmoid is applied to $\bar{\bm{h}}_k$ to learn an attentional mask $\bm{M}^k$, which is finally used to blend $\bm{A}^k$ and $\bm{I}^k$:
$$\bm{h}_{out}^k=\left( 1-\bm{M}^k \right)\otimes\bm{A}^k+\bm{M}^k\otimes\bm{I}^k \qquad (5)$$
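A minimal PyTorch sketch of one AAD layer implementing Eqs. (2)-(5); the kernel sizes and `c_id=512` are assumptions, and $\bm{z}_{att}^k$ is assumed to already match the spatial size of $\bm{h}_{in}^k$:

```python
import torch
import torch.nn as nn

class AADLayer(nn.Module):
    """Adaptive Attentional Denormalization layer (Eqs. 2-5)."""
    def __init__(self, c_h, c_att, c_id=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(c_h, affine=False)  # Eq. (2)
        self.conv_gamma = nn.Conv2d(c_att, c_h, 3, 1, 1)  # gamma_att^k
        self.conv_beta = nn.Conv2d(c_att, c_h, 3, 1, 1)   # beta_att^k
        self.fc_gamma = nn.Linear(c_id, c_h)              # gamma_id^k
        self.fc_beta = nn.Linear(c_id, c_h)               # beta_id^k
        self.conv_mask = nn.Conv2d(c_h, 1, 3, 1, 1)       # attentional mask M^k

    def forward(self, h_in, z_att, z_id):
        h_bar = self.norm(h_in)                                      # Eq. (2)
        A = self.conv_gamma(z_att) * h_bar + self.conv_beta(z_att)   # Eq. (3)
        g = self.fc_gamma(z_id)[..., None, None]  # broadcast over H^k, W^k
        b = self.fc_beta(z_id)[..., None, None]
        I = g * h_bar + b                                            # Eq. (4)
        M = torch.sigmoid(self.conv_mask(h_bar))                     # conv + sigmoid
        return (1 - M) * A + M * I                                   # Eq. (5)
```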

Figure 3(c) illustrates these three steps. Multiple AAD layers are then stacked to form an AAD ResBlk, as shown in Figure 3(b).
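A possible AAD ResBlk layout, reusing the `AADLayer` sketch above; the exact arrangement of convolutions and activations here is a guess at Figure 3(b), not the paper's verified design:

```python
import torch.nn as nn
import torch.nn.functional as F

class AADResBlk(nn.Module):
    """Residual block built from AAD layers (cf. Fig. 3(b))."""
    def __init__(self, c_in, c_out, c_att, c_id=512):
        super().__init__()
        self.aad1 = AADLayer(c_in, c_att, c_id)
        self.conv1 = nn.Conv2d(c_in, c_in, 3, 1, 1)
        self.aad2 = AADLayer(c_in, c_att, c_id)
        self.conv2 = nn.Conv2d(c_in, c_out, 3, 1, 1)
        # 1x1 conv on the shortcut when the channel count changes
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, h, z_att, z_id):
        x = self.conv1(F.relu(self.aad1(h, z_att, z_id)))
        x = self.conv2(F.relu(self.aad2(x, z_att, z_id)))
        return x + self.skip(h)
```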

Training Losses
First, a multi-scale discriminator is employed, yielding the adversarial loss $\mathcal{L}_{adv}$.

Next, the identity preservation loss $\mathcal{L}_{id}$ is defined as
$$\mathcal{L}_{id}=1-\cos\left( \bm{z}_{id}\left( \hat{Y}_{s,t} \right), \bm{z}_{id}\left( X_s \right) \right) \qquad (6)$$

Then the attribute preservation loss $\mathcal{L}_{att}$:
$$\mathcal{L}_{att}=\frac{1}{2}\sum_{k=1}^{n}\left\| \bm{z}_{att}^k\left( \hat{Y}_{s,t} \right) - \bm{z}_{att}^k\left( X_t \right) \right\|_2^2 \qquad (7)$$

During training, $X_t=X_s$ is used for 80% of the samples, and the reconstruction loss $\mathcal{L}_{rec}$ is defined as
$$\mathcal{L}_{rec}=\begin{cases} \frac{1}{2}\left\| \hat{Y}_{s,t}-X_t \right\|_2^2 & \text{if}\ X_t=X_s\\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

The full objective for AEI-Net is
$$\mathcal{L}_{{\rm AEI\text{-}Net}}=\mathcal{L}_{adv}+\lambda_{att}\mathcal{L}_{att}+\lambda_{id}\mathcal{L}_{id}+\lambda_{rec}\mathcal{L}_{rec} \qquad (9)$$
with $\lambda_{att}=\lambda_{rec}=10$ and $\lambda_{id}=5$.
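The objective might be assembled as follows; `z_id` and `z_att` stand for the two encoders, `adv_loss` is computed separately by the multi-scale discriminator, and `same_id` flags the 80% of samples where $X_t=X_s$:

```python
import torch.nn.functional as F

def aei_net_loss(y_hat, x_s, x_t, z_id, z_att, adv_loss, same_id,
                 lam_att=10.0, lam_id=5.0, lam_rec=10.0):
    # Eq. (6): identity preservation via cosine distance
    l_id = 1 - F.cosine_similarity(z_id(y_hat), z_id(x_s), dim=1).mean()
    # Eq. (7): squared L2 distance between multi-level attribute embeddings
    l_att = 0.5 * sum(((a - b) ** 2).sum(dim=(1, 2, 3)).mean()
                      for a, b in zip(z_att(y_hat), z_att(x_t)))
    # Eq. (8): pixel reconstruction, active only when X_t = X_s
    l_rec = (0.5 * ((y_hat - x_t) ** 2).sum(dim=(1, 2, 3)).mean()
             if same_id else y_hat.new_zeros(()))
    # Eq. (9): weighted total
    return adv_loss + lam_att * l_att + lam_id * l_id + lam_rec * l_rec
```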

3.2. Heuristic Error Acknowledging Refinement Network

The stage-1 results preserve the target attributes well, but fail to preserve occlusions present in $X_t$.

Existing works train an additional face segmentation network for this, which has two drawbacks: it requires occlusion annotations, and it generalizes poorly to unseen occlusion types.
In experiments, when $X_t$ is an image containing occlusions and we set $X_s=X_t$, the reconstruction $\hat{Y}_{t,t}={\rm AEI\text{-}Net}(X_t, X_t)$ loses the occlusions it should have reproduced. Comparing $\hat{Y}_{t,t}$ with $X_t$ therefore reveals which regions of the image are occlusions.

The heuristic error is thus defined as
$$\Delta Y_t=X_t-{\rm AEI\text{-}Net}(X_t, X_t) \qquad (10)$$

As shown in Figure 4(b), HEAR-Net is essentially a U-Net that takes $\Delta Y_t$ and $\hat{Y}_{s,t}$ as input and outputs the final face-swapping result $Y_{s,t}$:
$$Y_{s,t}={\rm HEAR\text{-}Net}\left( \hat{Y}_{s,t}, \Delta Y_t \right) \qquad (11)$$
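Putting Eqs. (10) and (11) together, two-stage inference is just a few forward passes; `aei_net` and `hear_net` here stand for the trained stage-1 and stage-2 networks:

```python
import torch

@torch.no_grad()
def face_swap(aei_net, hear_net, x_s, x_t):
    """Two-stage inference combining Eqs. (10) and (11)."""
    y_hat = aei_net(x_s, x_t)             # stage 1: raw swap, preserves attributes
    delta_y_t = x_t - aei_net(x_t, x_t)   # Eq. (10): heuristic error, marks occlusions
    return hear_net(y_hat, delta_y_t)     # Eq. (11): occlusion-aware refinement
```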

HEAR-Net is trained with three loss terms.
The first is the identity preservation loss $\mathcal{L}_{id}'$:
$$\mathcal{L}_{id}'=1-\cos\left( \bm{z}_{id}\left( Y_{s,t} \right), \bm{z}_{id}\left( X_s \right) \right) \qquad (12)$$
The second is the change loss $\mathcal{L}_{chg}'$, which keeps the refined result close to the stage-1 result:
$$\mathcal{L}_{chg}'=\left| \hat{Y}_{s,t}-Y_{s,t} \right| \qquad (13)$$
The third is the reconstruction loss $\mathcal{L}_{rec}'$:
$$\mathcal{L}_{rec}'=\begin{cases} \frac{1}{2}\left\| Y_{s,t}-X_t \right\|_2^2 & \text{if}\ X_t=X_s\\ 0 & \text{otherwise} \end{cases} \qquad (14)$$
The overall loss is the sum of the three:
$$\mathcal{L}_{{\rm HEAR\text{-}Net}}=\mathcal{L}_{rec}'+\mathcal{L}_{id}'+\mathcal{L}_{chg}' \qquad (15)$$
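A sketch of the combined HEAR-Net objective, with the same hypothetical `z_id` encoder and `same_id` flag as in the AEI-Net loss sketch above:

```python
import torch.nn.functional as F

def hear_net_loss(y, y_hat, x_s, x_t, z_id, same_id):
    # Eq. (12): identity preservation
    l_id = 1 - F.cosine_similarity(z_id(y), z_id(x_s), dim=1).mean()
    # Eq. (13): change loss, L1 distance to the stage-1 result
    l_chg = (y_hat - y).abs().mean()
    # Eq. (14): reconstruction, active only when X_t = X_s
    l_rec = (0.5 * ((y - x_t) ** 2).sum(dim=(1, 2, 3)).mean()
             if same_id else y.new_zeros(()))
    # Eq. (15): sum of the three terms
    return l_rec + l_id + l_chg
```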