Auto-Encoding Variational Bayes: Reading Notes

Notation

$p_{θ} (z | x)$ : intractable posterior
$p_{θ} (x | z)$ : probabilistic decoder
$q_{ϕ} (z | x)$ : recognition model, variational approximation to $p_{θ} (z | x)$ , also regarded as a probabilistic encoder
$p_{θ} (z) p_{θ} (x | z)$ : generative model
$ϕ$ : variational parameters
$θ$ : generative parameters

Abbreviation

SGVB: Stochastic Gradient Variational Bayes
AEVB: Auto-Encoding Variational Bayes
ML: maximum likelihood
MAP: maximum a posteriori

Motivation

Problem
- How to perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distribution $p_{θ} (z | x)$ and large datasets?
Existing Solution and Difficulty
- VB: involves the optimization of an approximation to the intractable posterior
- mean-field: requires analytical solutions of expectations w.r.t. the approximate posterior, which are also intractable in the general case

Contribution of this paper

(1) SGVB estimator: an estimator of the variational lower bound
- yielded by a reparameterization of the variational lower bound
- simple & differentiable & unbiased
- straightforward to optimize using standard stochastic gradient ascent techniques
(2) AEVB algorithm
- using SGVB to optimize a recognition model that allows us to perform very efficient approximate posterior inference using simple ancestral sampling, which in turn allows us to efficiently learn the model parameters, without the need of expensive iterative inference schemes (such as MCMC) per datapoint.
- condition: an i.i.d. dataset $X = \{x^{(i)}\}_{i = 1}^{N}$ with a continuous latent variable $z$ per datapoint

Methodology

assumption

directed graphical models with continuous latent variables
i.i.d. dataset with latent variables per datapoint
- where we would like to perform
  - ML or MAP inference on the (global) parameters $θ$
  - variational inference on the latent variable $z$
$p_{θ} (z)$ and $p_{θ} (x | z)$ : both PDFs are differentiable almost everywhere w.r.t. both $θ$ and $z$

target case

intractability
- $p_{θ} (x) = \int p_{θ} (z) p_{θ} (x | z) d z$ : so we cannot evaluate or differentiate it.
- $p_{θ} (z | x) = \frac{p_{θ} (x | z) p_{θ} (z)}{p_{θ} (x)}$ : so the EM algorithm cannot be used.
- the required integrals for any reasonable mean-field VB algorithm: so the VB algorithm cannot be used.
- these intractabilities already appear for moderately complicated likelihood functions $p_{θ} (x | z)$ , e.g. a neural network with a nonlinear hidden layer
a large dataset
- batch optimization is too costly => minibatch or single datapoints
- sampling-based solutions are too slow, e.g. Monte Carlo EM, since they involve a typically expensive sampling loop per datapoint.

solution and application

efficient approximate ML or MAP estimation for $θ$ (full VB: Appendix F)
- allow us to mimic the hidden random process and generate artificial data that resemble the real data
efficient approximate posterior inference of $z$ given $x$ , i.e. an approximation to $p_{θ} (z | x)$ , for a choice of $θ$
- useful for coding or data representation tasks
efficient approximate marginal inference of $x$ : Appendix D
- allow us to perform all kinds of inference tasks where $p (x)$ is required, such as image denoising, inpainting, and super-resolution.

1. derivation of the variational bound

\log p_{θ} (x^{(1)}, \dots, x^{(N)}) = \sum_{i = 1}^{N} \log p_{θ} (x^{(i)})

Here we write $x^{i}$ as shorthand for $x^{(i)}$ .

\begin{aligned}
\log p (x^{i}) &= \int_{z} q (z | x^{i}) \log p (x^{i}) d z \quad \text{($q$ can be any distribution)} \\
&= \int_{z} q (z | x^{i}) \log \frac{p (z, x^{i})}{p (z | x^{i})} d z \\
&= \int_{z} q (z | x^{i}) \log [\frac{q (z | x^{i})}{p (z | x^{i})} \cdot \frac{p (z, x^{i})}{q (z | x^{i})}] d z \\
&= \int_{z} q (z | x^{i}) \log \frac{q (z | x^{i})}{p (z | x^{i})} d z + \int_{z} q (z | x^{i}) \log \frac{p (z, x^{i})}{q (z | x^{i})} d z \\
&= D_{K L} [q_{ϕ} (z | x^{i}) ∥ p_{θ} (z | x^{i})] + L_{B} (θ, ϕ; x^{i})
\end{aligned}

Because $D_{K L} \geq 0$ , $L_{B}$ is a lower bound on $\log p_{θ} (x^{i})$ and is called the (variational) lower bound:

\log p_{θ} (x^{i}) \geq L_{B} (θ, ϕ; x^{i})

\begin{aligned}
L_{B} (θ, ϕ; x^{i}) &= E_{q_{ϕ} (z | x^{i})} [\log p_{θ} (x^{i}, z) - \log q_{ϕ} (z | x^{i})] \\
&= \int_{z} q (z | x^{i}) \log \frac{p (z, x^{i})}{q (z | x^{i})} d z \\
&= \int_{z} q (z | x^{i}) \log \frac{p (x^{i} | z) p (z)}{q (z | x^{i})} d z \\
&= - D_{K L} [q (z | x^{i}) ∥ p (z)] + \int_{z} q (z | x^{i}) \log p (x^{i} | z) d z \\
&= - D_{K L} [q_{ϕ} (z | x^{i}) ∥ p_{θ} (z)] + E_{q_{ϕ} (z | x^{i})} [\log p_{θ} (x^{i} | z)]
\end{aligned}

$- D_{K L} [q_{ϕ} (z | x^{i}) ∥ p_{θ} (z)]$ : acts as a regularizer
$E_{q_{ϕ} (z | x^{i})} [\log p_{θ} (x^{i} | z)]$ : an expected negative reconstruction error
TARGET: differentiate and optimize $L_{B}$ w.r.t. both $ϕ$ and $θ$ . However, $\nabla_{ϕ} L_{B}$ is problematic, because the expectation is taken w.r.t. $q_{ϕ}$ , which itself depends on $ϕ$ .
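
The decomposition $\log p (x^{i}) = D_{K L} + L_{B}$ can be checked numerically on a model where every term is tractable. The sketch below assumes a toy model that is not from the paper: $p (z) = N (0, 1)$ and $p (x | z) = N (z, 1)$ , so that $p (x) = N (0, 2)$ and $p (z | x) = N (x / 2, 1 / 2)$ . All function names are illustrative.

```python
import math

def kl_gauss(m_q, v_q, m_p, v_p):
    """KL[N(m_q, v_q) || N(m_p, v_p)] for 1-D Gaussians (v denotes a variance)."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

def log_marginal(x):
    """log p(x) for the assumed toy model, where x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def lower_bound(x, m, v):
    """L_B = -KL[q || p(z)] + E_q[log p(x | z)] with q(z | x) = N(m, v)."""
    expected_recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)
    return -kl_gauss(m, v, 0.0, 1.0) + expected_recon

def kl_to_posterior(x, m, v):
    """KL[q || p(z | x)], with p(z | x) = N(x / 2, 1 / 2) in the toy model."""
    return kl_gauss(m, v, x / 2.0, 0.5)

x, m, v = 1.0, 0.3, 0.64
gap = log_marginal(x) - lower_bound(x, m, v)
print(gap, kl_to_posterior(x, m, v))  # the two numbers should agree
```

Setting $q$ to the true posterior (`m = x / 2`, `v = 0.5`) makes the KL term vanish, so the bound becomes tight in that case.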

2. Solution-1 Naive Monte Carlo gradient estimator

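disadvantages: high variance & impractical for this purpose. The formula as it appears in the paper is:
$\nabla_{ϕ} E_{q_{ϕ} (z)} [f (z)] = E_{q_{ϕ} (z)} [f (z) \nabla_{q_{ϕ} (z)} \log q_{ϕ} (z)] ≃ \frac{1}{L} \sum_{l = 1}^{L} f (z) \nabla_{q_{ϕ} (z^{l})} \log q_{ϕ} (z^{l})$ where $z^{l} \sim q_{ϕ} (z | x^{i})$

I have not yet studied the theory of Monte Carlo gradient estimators, but judging from the later formulas in the paper, $f$ should be evaluated at each sample $z^{l}$ and the gradient of $\log q_{ϕ}$ should be taken w.r.t. $ϕ$ . Written that way, the estimator is:

\nabla_{ϕ} E_{q_{ϕ} (z)} [f (z)] = E_{q_{ϕ} (z)} [f (z) \nabla_{ϕ} \log q_{ϕ} (z)] ≃ \frac{1}{L} \sum_{l = 1}^{L} f (z^{l}) \nabla_{ϕ} \log q_{ϕ} (z^{l}) \quad \text{where } z^{l} \sim q_{ϕ} (z | x^{i})

The factor $\nabla_{ϕ} \log q_{ϕ} (z^{l})$ cannot be dropped: $\frac{1}{L} \sum_{l} f (z^{l})$ on its own estimates $E_{q_{ϕ} (z)} [f (z)]$ , not its gradient, so the two expressions are not equivalent.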
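
As a rough illustration of this estimator, the sketch below uses a toy problem that is an assumption of these notes, not an example from the paper: $q_{ϕ} (z) = N (μ, 1)$ with $ϕ = μ$ and $f (z) = z^{2}$ . Then $E [f (z)] = μ^{2} + 1$ , the exact gradient is $2 μ$ , and $\nabla_{μ} \log q (z) = z - μ$ .

```python
import random

def score_function_gradient(f, mu, num_samples, rng):
    """Estimate d/dmu E_{z ~ N(mu, 1)}[f(z)] as the mean of f(z) * (z - mu).

    (z - mu) is the gradient of log N(z; mu, 1) with respect to mu.
    """
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, 1.0)
        total += f(z) * (z - mu)
    return total / num_samples

rng = random.Random(0)
estimate = score_function_gradient(lambda z: z * z, mu=1.0,
                                   num_samples=200_000, rng=rng)
print(estimate)  # should be close to the exact gradient 2 * mu = 2
```

Even on this one-dimensional problem a large number of samples is needed for a stable estimate, which is the high-variance behaviour noted above.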

3. Solution-2 SGVB estimator

reparameterize $\tilde{z} \sim q_{ϕ} (z | x)$ using a differentiable transformation $g_{ϕ} (ϵ, x)$ of an (auxiliary) noise variable $ϵ$
- under certain mild conditions for a chosen approximate posterior $q_{ϕ} (z | x)$
  $\tilde{z} = g_{ϕ} (ϵ, x) \quad \text{with} \quad ϵ \sim p (ϵ)$
form Monte Carlo estimates as follows:
$q_{ϕ} (z | x) \prod_{i} d z_{i} = p (ϵ) \prod_{i} d ϵ_{i}$
$\int q_{ϕ} (z | x) f (z) d z = \int p (ϵ) f (z) d ϵ = \int p (ϵ) f (g_{ϕ} (ϵ, x)) d ϵ$
$E_{q_{ϕ} (z | x^{i})} [f (z)] = E_{p (ϵ)} [f (g_{ϕ} (ϵ, x^{i}))] ≃ \frac{1}{L} \sum_{l = 1}^{L} f (g_{ϕ} (ϵ^{l}, x^{i}))$ where $ϵ^{l} \sim p (ϵ)$
apply the MC estimator technique to $L_{B} (θ, ϕ; x^{i})$ , yielding two SGVB estimators ${\tilde{L}}^{A}$ and ${\tilde{L}}^{B}$ :
${\tilde{L}}^{A} (θ, ϕ; x^{i}) = \frac{1}{L} \sum_{l = 1}^{L} [\log p_{θ} (x^{i}, z^{i, l}) - \log q_{ϕ} (z^{i, l} | x^{i})]$
${\tilde{L}}^{B} (θ, ϕ; x^{i}) = - D_{K L} (q_{ϕ} (z | x^{i}) ∥ p_{θ} (z)) + \frac{1}{L} \sum_{l = 1}^{L} \log p_{θ} (x^{i} | z^{i, l})$
where $z^{i, l} = g_{ϕ} (ϵ^{l}, x^{i})$ and $ϵ^{l} \sim p (ϵ)$
minibatch with size $M$
$L_{B} (θ, ϕ; X) ≃ {\tilde{L}}^{M} (θ, ϕ; X^{M}) = \frac{N}{M} \sum_{i = 1}^{M} \tilde{L} (θ, ϕ; x^{i})$
- the number of samples $L$ per datapoint can be set to 1 as long as the minibatch size $M$ is large enough, e.g. $M = 100$
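
To make the reparameterization concrete, here is a minimal sketch on a toy problem (an assumption of these notes, not from the paper): $q_{ϕ} (z) = N (μ, 1)$ with $ϕ = μ$ , $g_{ϕ} (ϵ) = μ + ϵ$ , and $f (z) = z^{2}$ . Because $z$ is a deterministic function of $μ$ and $ϵ$ , the gradient can be taken inside the expectation: $\nabla_{μ} f (μ + ϵ) = 2 (μ + ϵ)$ .

```python
import random
import statistics

def pathwise_gradient_samples(mu, num_samples, rng):
    """Per-sample gradients d/dmu f(g(eps)) for f(z) = z^2 and g(eps) = mu + eps."""
    samples = []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)       # eps ~ p(eps) = N(0, 1)
        z = mu + eps                    # z = g_phi(eps)
        samples.append(2.0 * z)         # df/dz * dz/dmu = 2 z * 1
    return samples

rng = random.Random(0)
grads = pathwise_gradient_samples(mu=1.0, num_samples=50_000, rng=rng)
print(statistics.fmean(grads))      # should be close to the exact gradient 2
print(statistics.variance(grads))   # per-sample variance, 4 in theory here
```

For this toy problem each per-sample gradient is $2 (μ + ϵ)$ , so its variance is exactly 4 regardless of $μ$ .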

3.1 Minibatch version of AEVB algorithm

$θ, ϕ \leftarrow$ Initialize parameters
repeat
- $X^{M} \leftarrow$ Random minibatch of $M$ datapoints
- $ϵ \leftarrow$ Random samples from noise distribution $p (ϵ)$
- $g \leftarrow \nabla_{θ, ϕ} {\tilde{L}}^{M} (θ, ϕ; X^{M}, ϵ)$
- $θ, ϕ \leftarrow$ Update parameters using gradients $g$ (e.g. SGD or Adagrad)
until convergence of parameters $θ, ϕ$
return $θ, ϕ$
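
The loop above can be sketched end to end on a model small enough to differentiate by hand. Everything model-specific here is an assumption chosen for illustration, not the paper's setup: the generative model is $z \sim N (0, 1)$ , $x | z \sim N (z + c, 1)$ with $θ = c$ ; the recognition model is $q_{ϕ} (z | x) = N (a x + b, s^{2})$ with $ϕ = (a, b, \log s)$ ; the estimator is ${\tilde{L}}^{B}$ with $L = 1$ ; and the update is plain SGD ascent on the minibatch average rather than the $N / M$ scaled sum.

```python
import math
import random

def train_aevb(data, steps=4000, batch_size=100, lr=0.01, seed=0):
    """Minibatch AEVB on the assumed toy model, with hand-derived gradients."""
    rng = random.Random(seed)
    a = b = c = rho = 0.0                       # initialize theta = c, phi = (a, b, rho)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)    # X^M: random minibatch
        g_a = g_b = g_c = g_rho = 0.0
        s = math.exp(rho)                       # s = std of q(z | x)
        for x in batch:
            eps = rng.gauss(0.0, 1.0)           # eps ~ p(eps), one sample (L = 1)
            m = a * x + b                       # encoder mean
            z = m + s * eps                     # z = g_phi(eps, x)
            r = x - z - c                       # d log p(x | z) / dz
            # Gradients of -KL(q || p(z)) + log p(x | z), where
            # -KL = 0.5 * (1 + log s^2 - m^2 - s^2).
            d_m = -m + r
            g_a += d_m * x
            g_b += d_m
            g_rho += 1.0 - s * s + s * eps * r  # chain rule through s = exp(rho)
            g_c += r
        # Assumption: average over the minibatch instead of scaling by N / M.
        a += lr * g_a / batch_size
        b += lr * g_b / batch_size
        c += lr * g_c / batch_size
        rho += lr * g_rho / batch_size
    return a, b, c, math.exp(rho)

# Synthetic data from the assumed model with c_true = 1.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) + 1.0 + data_rng.gauss(0.0, 1.0)
        for _ in range(1000)]
a, b, c, s = train_aevb(data)
print(a, b, c, s * s)
```

In this model the exact posterior is $N ((x - c) / 2, 1 / 2)$ , so the learned values should approach $a = 0.5$ , $b = - c / 2$ , $s^{2} = 0.5$ , with $c$ near the sample mean of the data.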

choice of $q_{ϕ} (z | x)$ , $p (ϵ)$ , and $g_{ϕ} (ϵ, x)$

There are three basic approaches:

Tractable inverse CDF. Let $ϵ \sim U (0, I)$ , $g_{ϕ} (ϵ, x)$ be the inverse CDF of $q_{ϕ} (z | x)$ .
- Examples: Exponential, Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel and Erlang distributions.
“location-scale” family of distributions: choose the standard distribution (with location = 0, scale = 1) as the auxiliary variable $ϵ$ , and let $g (\cdot) = \text{location} + \text{scale} \cdot ϵ$
- Examples: Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular and Gaussian distributions.
- the VAE described below falls into this case
Composition: It is often possible to express random variables as different transformations of auxiliary variables.
- Examples: Log-Normal (exponentiation of normally distributed variable), Gamma (a sum over exponentially distributed variables), Dirichlet (weighted sum of Gamma variates), Beta, Chi-Squared, and F distributions.
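
One instance of each approach, as a minimal sketch. The three distributions are picked from the example lists above, and the function names are illustrative. In each case the randomness comes only from a parameter-free noise draw, and the distribution parameters enter through a differentiable transformation.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse CDF: z = -log(1 - u) / rate with u ~ U(0, 1)."""
    u = rng.random()                    # u in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate

def sample_gaussian(loc, scale, rng):
    """Location-scale: z = loc + scale * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return loc + scale * eps

def sample_log_normal(loc, scale, rng):
    """Composition: exponentiate a reparameterized Gaussian variable."""
    return math.exp(sample_gaussian(loc, scale, rng))

rng = random.Random(0)
n = 20_000
mean_exp = sum(sample_exponential(2.0, rng) for _ in range(n)) / n
mean_gauss = sum(sample_gaussian(1.0, 2.0, rng) for _ in range(n)) / n
mean_logn = sum(sample_log_normal(0.0, 0.5, rng) for _ in range(n)) / n
print(mean_exp, mean_gauss, mean_logn)
# theoretical means: 1 / 2, 1, and exp(0.5 ** 2 / 2)
```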

Variational Auto-Encoder

let $p_{θ} (z) = N (z; 0, I)$
$p_{θ} (z | x)$ is intractable
use a neural network for $q_{ϕ} (z | x)$
$ϕ$ and $θ$ are optimized jointly with the AEVB algorithm
params of $p_{θ} (x | z)$ are computed from $z$ with an MLP (multi-layer perceptron, here a fully-connected neural network with a single hidden layer)
- multivariate Gaussian: in case of real-valued data
  $\log p (x | z) = \log N (x; μ, σ^{2} I)$ where $μ = W_{μ} h + b_{μ}$ , $\log σ^{2} = W_{σ} h + b_{σ}$ , $h = \tanh (W_{h} z + b_{h})$
- Bernoulli: in case of binary data
  $\log p (x | z) = \sum_{i = 1}^{D} x_{i} \log y_{i} + (1 - x_{i}) \cdot \log (1 - y_{i})$ where $y = f_{σ} (W_{y} \tanh (W_{h} z + b_{h}) + b_{y})$ and $f_{σ} (\cdot)$ is the elementwise sigmoid activation function
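
The Bernoulli case can be written out directly. The sketch below uses plain Python lists for the weight matrices and has no numerical safeguards (a real implementation would clamp $y$ away from 0 and 1). The helper names `affine` and `sigmoid` and the layer sizes in the example are assumptions for illustration.

```python
import math

def affine(W, b, v):
    """Return W v + b, where W is a list of rows."""
    return [sum(w * v_j for w, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bernoulli_log_likelihood(x, z, W_h, b_h, W_y, b_y):
    """log p(x | z) = sum_i x_i log y_i + (1 - x_i) log(1 - y_i)."""
    h = [math.tanh(t) for t in affine(W_h, b_h, z)]   # h = tanh(W_h z + b_h)
    y = [sigmoid(t) for t in affine(W_y, b_y, h)]     # y = f_sigma(W_y h + b_y)
    return sum(x_i * math.log(y_i) + (1 - x_i) * math.log(1 - y_i)
               for x_i, y_i in zip(x, y))

# Latent dim 2, hidden dim 3, data dim 4; all-zero weights give y_i = 0.5.
W_h = [[0.0, 0.0] for _ in range(3)]
b_h = [0.0] * 3
W_y = [[0.0, 0.0, 0.0] for _ in range(4)]
b_y = [0.0] * 4
ll = bernoulli_log_likelihood([1, 0, 1, 1], [0.3, -1.2], W_h, b_h, W_y, b_y)
print(ll)  # equals 4 * log(0.5), since every y_i = 0.5
```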

From the above, we have:

\log q_{ϕ} (z | x^{i}) = \log N (z; μ^{i}, (σ^{i})^{2} I)

where $μ^{i}$ and $σ^{i}$ are outputs of the encoding MLP.

We sample $z^{i, l} \sim q_{ϕ} (z | x^{i})$ using $z^{i, l} = g_{ϕ} (ϵ^{l}, x^{i}) = μ^{i} + σ^{i} ⊙ ϵ^{l}$ , where $ϵ^{l} \sim N (0, I)$ and $⊙$ denotes the element-wise product.

Here both $p_{θ} (z)$ and $q_{ϕ} (z | x)$ are Gaussian, so we can use the estimator ${\tilde{L}}^{B}$ , since the KL divergence term can be computed analytically. Then we have:

\begin{aligned}
L (θ, ϕ; x^{i}) &≃ - D_{K L} (q_{ϕ} (z | x^{i}) ∥ p_{θ} (z)) + \frac{1}{L} \sum_{l = 1}^{L} \log p_{θ} (x^{i} | z^{i, l}) \\
&= \frac{1}{2} \sum_{j = 1}^{J} (1 + \log ((σ_{j}^{i})^{2}) - (μ_{j}^{i})^{2} - (σ_{j}^{i})^{2}) + \frac{1}{L} \sum_{l = 1}^{L} \log p_{θ} (x^{i} | z^{i, l})
\end{aligned}

where $z^{i, l} = μ^{i} + σ^{i} ⊙ ϵ^{l}$ , $ϵ^{l} \sim N (0, I)$ , and $J$ is the dimensionality of $z$ .

Solution of $- D_{K L} (q_{ϕ} (z) ∥ p_{θ} (z))$ , Gaussian case

Let $J$ be the dimensionality of $z$ , then we have:

\int q_{ϕ} (z) \log p_{θ} (z) d z = \int N (z; μ, σ^{2}) \log N (z; 0, I) d z = - \frac{J}{2} \log (2 π) - \frac{1}{2} \sum_{j = 1}^{J} (μ_{j}^{2} + σ_{j}^{2})

And:

\int q_{ϕ} (z) \log q_{ϕ} (z) d z = \int N (z; μ, σ^{2}) \log N (z; μ, σ^{2}) d z = - \frac{J}{2} \log (2 π) - \frac{1}{2} \sum_{j = 1}^{J} (1 + \log σ_{j}^{2})

Therefore:

- D_{K L} (q_{ϕ} (z) ∥ p_{θ} (z)) = \int q_{ϕ} (z) (\log p_{θ} (z) - \log q_{ϕ} (z)) d z = \frac{1}{2} \sum_{j = 1}^{J} (1 + \log (σ_{j}^{2}) - μ_{j}^{2} - σ_{j}^{2})
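
The closed form can be sanity-checked against a Monte Carlo average of $\log p_{θ} (z) - \log q_{ϕ} (z)$ over samples $z \sim q_{ϕ}$ . This is a minimal sketch; the values of `mu` and `sigma` are arbitrary illustrative choices.

```python
import math
import random

def neg_kl_closed_form(mu, sigma):
    """0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)."""
    return 0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                     for m, s in zip(mu, sigma))

def log_normal_pdf(z, m, s):
    """log N(z; m, s^2) for a scalar z."""
    return -0.5 * math.log(2 * math.pi * s * s) - (z - m) ** 2 / (2 * s * s)

def neg_kl_monte_carlo(mu, sigma, num_samples, rng):
    """Average of log p(z) - log q(z) over z ~ q, summed over dimensions."""
    total = 0.0
    for _ in range(num_samples):
        for m, s in zip(mu, sigma):
            z = m + s * rng.gauss(0.0, 1.0)
            total += log_normal_pdf(z, 0.0, 1.0) - log_normal_pdf(z, m, s)
    return total / num_samples

mu, sigma = [0.5, -1.0], [0.8, 1.5]
exact = neg_kl_closed_form(mu, sigma)
approx = neg_kl_monte_carlo(mu, sigma, 100_000, random.Random(0))
print(exact, approx)  # the two values should be close
```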

The original note attached the corresponding proof here as an image, which is not reproduced.

Visualization

Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables $z$ .
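
A minimal sketch of that grid construction for a 2-D latent space, using the standard library's `statistics.NormalDist`. The choice of cell midpoints $(k + 0.5) / n$ is an assumption made here to stay away from 0 and 1, where the Gaussian inverse CDF is infinite; the paper does not specify the exact spacing.

```python
from statistics import NormalDist

def latent_grid(n):
    """Return an n x n grid of 2-D latent values z.

    Linearly spaced points in the open unit square are mapped through the
    inverse CDF of the standard Gaussian, one coordinate at a time.
    """
    standard_normal = NormalDist(0.0, 1.0)
    probs = [(k + 0.5) / n for k in range(n)]           # assumed spacing
    coords = [standard_normal.inv_cdf(p) for p in probs]
    return [[(zx, zy) for zx in coords] for zy in coords]

grid = latent_grid(5)
for row in grid:
    print(["(%.2f, %.2f)" % z for z in row])
```

Each $z$ in the grid would then be passed through the decoder $p_{θ} (x | z)$ to produce one tile of the visualization.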