Mathematical Principle of GAN
Introduction
The potential advantage of deep learning lies in representing complicated probability densities of data with large-scale hierarchical models. So far, the great success of deep learning has mostly been in discriminative models, which usually map high-dimensional sensory inputs to categorical labels. Training discriminative models benefits greatly from the back-propagation algorithm, dropout, and ReLU activations with well-defined derivatives. In contrast to discriminative models, however, deep generative models have performed inferiorly. This is largely due to the intractability of optimising the joint likelihood.
Based upon the observation above, Goodfellow et al. proposed the generative adversarial net (GAN). As the name implies, a GAN consists of two networks: a generator and a discriminator. The generator produces fake data meant to look as real as the genuine data, which are assumed to be drawn from some unknown distribution (only God knows which), while the discriminator tries to improve itself at distinguishing the genuine from the fake. The whole process is like a gambling game.
This formulation is actually a general framework, since both the discriminator and the generator can be deep models of any kind. For the sake of simplicity, in this work we only consider multi-layer perceptrons (MLPs). Besides, the data produced by the generator are obtained by transforming random noise. With this approach, the entire model can be trained using back-propagation and dropout, without approximate inference or Markov chains.
Generative Adversarial Net
In this section, we examine the GAN in detail and derive its optimisation objective.
As mentioned above, both the generator and the discriminator are MLPs for simplicity. Conceptually, the generator is expected to be a mapping that transforms a random noise vector $z \sim p_z$ into a generated sample $G(z)$; the discriminator takes as input either genuine or generated data and outputs the probability of the input being genuine (drawn from the unknown distribution that only God knows).
Suppose we have a training set of $m$ genuine samples $\{x^{(1)}, \dots, x^{(m)}\}$ drawn from the unknown data distribution $p_{\text{data}}$, together with $m$ noise samples $\{z^{(1)}, \dots, z^{(m)}\}$ drawn from a prior $p_z$, from which the generator produces the fake samples $G(z^{(i)})$. Label the genuine samples $1$ and the fake samples $0$, and let the discriminator $D$ output the probability of its input being genuine.

Further, the log likelihood of the discriminator on this labelled set is obtained:
\[
\ell(D) = \frac{1}{m} \sum_{i=1}^{m} \log D\big(x^{(i)}\big) + \frac{1}{m} \sum_{i=1}^{m} \log\Big(1 - D\big(G(z^{(i)})\big)\Big).
\]

According to the law of large numbers, as $m \to \infty$ the empirical averages converge to the corresponding expectations, so the objective becomes
\[
V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].
\]
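To make this convergence concrete, the sketch below estimates $V(D, G)$ by Monte Carlo for a toy configuration; the particular choices of $p_{\text{data}}$, $p_z$, $D$ and $G$ here are assumptions made purely for illustration, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ingredients (illustrative assumptions):
# p_data = N(0, 1), p_z = Uniform(-1, 1), G(z) = 2z, D(x) = sigmoid(1 - x^2).
def D(x):
    return 1.0 / (1.0 + np.exp(-(1.0 - x ** 2)))

def G(z):
    return 2.0 * z

for m in (10, 1_000, 100_000):
    x = rng.standard_normal(m)      # x ~ p_data
    z = rng.uniform(-1.0, 1.0, m)   # z ~ p_z
    v = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
    print(f"m = {m:>6}: empirical V = {v:.4f}")
# The empirical value stabilises as m grows, as the law of large numbers promises.
```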
Back to our aim: the whole process is tantamount to a gambling game in which the generator tries to fool the discriminator while the discriminator tries to tell the genuine from the fake. Note that the log likelihood above is the objective with respect to the discriminator, which seeks to maximise it, whereas the generator seeks to minimise it. The game is therefore the min-max problem
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].
\]
Theoretical Results
The generator $G$ implicitly defines a probability distribution $p_g$ over the generated samples $G(z)$, $z \sim p_z$. The question raised at the beginning is whether $p_g$ can exactly recover the data distribution $p_{\text{data}}$.

Firstly, given any generator $G$, we derive the corresponding optimal discriminator $D$.

Proposition 1. For a given $G$, the optimal discriminator is
\[
D^{*}_{G}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}.
\]
Proof.
For the given $G$, the training criterion for the discriminator is to maximise
\[
V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big],
\]
where the first term can be written as the integral $\int_x p_{\text{data}}(x) \log D(x)\, dx$.

The second term, by the change of random variable $x = G(z)$, equals
\[
\mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] = \int_x p_g(x) \log\big(1 - D(x)\big)\, dx.
\]
This equality is not that straightforward (refer to the details in Appendix C: Change of Random Variable in Measure Theory).

Consequently,
\[
V(D, G) = \int_x \Big[ p_{\text{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big) \Big]\, dx.
\]

Since, for any $(a, b) \in \mathbb{R}^2_{\ge 0} \setminus \{(0, 0)\}$, the function $y \mapsto a \log y + b \log(1 - y)$ attains its maximum on $[0, 1]$ at $y = \frac{a}{a + b}$, maximising the integrand pointwise with $a = p_{\text{data}}(x)$ and $b = p_g(x)$ yields exactly $D^{*}_{G}$.
This completes the proof. #
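As a quick numerical sanity check (an illustration added here, not part of the proof), the snippet below verifies that $y = a/(a+b)$ maximises $a \log y + b \log(1-y)$ on $(0, 1)$ for an arbitrary positive pair $(a, b)$:

```python
import numpy as np

a, b = 0.7, 0.3          # stand-ins for p_data(x) and p_g(x) at some fixed x
y = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
f = a * np.log(y) + b * np.log(1 - y)

print(f"numerical argmax = {y[np.argmax(f)]:.4f}, a/(a+b) = {a / (a + b):.4f}")
# Both print 0.7000 (up to grid resolution), matching Proposition 1 pointwise.
```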
By this proposition, plugging $D^{*}_{G}$ back into $V(D, G)$, we further reformulate this min-max gambling game as minimising the quantity
\[
C(G) = \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D^{*}_{G}(x)\big] + \mathbb{E}_{x \sim p_g}\big[\log\big(1 - D^{*}_{G}(x)\big)\big].
\]

Now, by presenting Theorem 1, we answer the question raised at the beginning: whether $p_g$ can exactly recover $p_{\text{data}}$.

Theorem 1. The global minimum of $C(G)$ is achieved if and only if $p_g = p_{\text{data}}$, at which point $C(G) = -\log 4$.
Proof.
Substituting $D^{*}_{G}$ into $C(G)$, then subtracting and adding $\log 4$,
\[
C(G) = -\log 4 + \mathrm{KL}\Big(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\Big) + \mathrm{KL}\Big(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\Big) = -\log 4 + 2\, \mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big),
\]
where $\mathrm{JSD}$ denotes the Jensen-Shannon divergence. Since the K-L divergence is non-negative (property (3) in Appendix A), $C(G) \ge -\log 4$.
The equality holds iff both K-L terms vanish, i.e. iff $p_{\text{data}} = \frac{p_{\text{data}} + p_g}{2} = p_g$.
This completes the proof. #
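The value $-\log 4$ can also be checked numerically. The sketch below (an illustration with made-up discrete densities) evaluates $C(G)$ through the optimal discriminator of Proposition 1:

```python
import numpy as np

def C(p_data, p_g):
    """C(G) for discrete densities, via the optimal discriminator of Prop. 1."""
    d_star = p_data / (p_data + p_g)
    return np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

p_data = np.array([0.2, 0.5, 0.3])
print(C(p_data, p_data))                      # -log 4 ≈ -1.3863
print(C(p_data, np.array([0.6, 0.2, 0.2])))   # strictly greater than -log 4
```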
As derived above, at the global optimum the generator is able to duplicate the process by which God generates the data.
Formally, we present the GAN training algorithm below. For each training iteration: first, for $k$ steps, sample a minibatch of $m$ noise samples $\{z^{(1)}, \dots, z^{(m)}\}$ from $p_z$ and a minibatch of $m$ genuine samples $\{x^{(1)}, \dots, x^{(m)}\}$ from $p_{\text{data}}$, and update the discriminator by ascending its stochastic gradient
\[
\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \Big[ \log D\big(x^{(i)}\big) + \log\Big(1 - D\big(G(z^{(i)})\big)\Big) \Big];
\]
then sample a fresh minibatch of $m$ noise samples and update the generator by descending its stochastic gradient
\[
\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\Big(1 - D\big(G(z^{(i)})\big)\Big).
\]
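A minimal sketch of this procedure in PyTorch follows; the architectures, hyper-parameters, and the stand-in for $p_{\text{data}}$ are illustrative assumptions, not prescriptions from the algorithm itself.

```python
import torch
import torch.nn as nn

# Illustrative hyper-parameters (assumptions, not from the algorithm):
noise_dim, data_dim, m, k, lr = 16, 2, 64, 1, 1e-3

G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=lr)
opt_g = torch.optim.SGD(G.parameters(), lr=lr)

def sample_data(m):
    # Stand-in for p_data: a shifted 2-D Gaussian (only God knows the real one).
    return torch.randn(m, data_dim) * 0.5 + 1.0

for it in range(10_000):
    # k steps of stochastic gradient ascent on the discriminator.
    for _ in range(k):
        x, z = sample_data(m), torch.randn(m, noise_dim)
        # Ascending log D(x) + log(1 - D(G(z))) == descending its negation;
        # detach() keeps this step from updating the generator.
        loss_d = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # One step of stochastic gradient descent on the generator.
    z = torch.randn(m, noise_dim)
    loss_g = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```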
Noticeably, the algorithm presented differs from our optimisation objective. The objective says to optimise with respect to $D$ until optimality and only then optimise with respect to $G$; in practice, however, this would incur prohibitive computational cost and usually results in overfitting. Rather, the algorithm optimises $D$ for $k$ steps and then optimises $G$ for one step, and this procedure is carried out iteratively until convergence. In this way, as long as $G$ changes sufficiently slowly, $D$ stays near its optimal solution. Naturally, again, we are quite curious about the quantitative difference between these two procedures; this is addressed by the following proposition.
We state the convergence of the algorithm without proof.
Proposition 2. Suppose that $G$ and $D$ have enough capacity, that at each step of the algorithm the discriminator is allowed to reach its optimum $D^{*}_{G}$ for the current $G$, and that $p_g$ is then updated so as to improve the criterion
\[
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D^{*}_{G}(x)\big] + \mathbb{E}_{x \sim p_g}\big[\log\big(1 - D^{*}_{G}(x)\big)\big];
\]
then $p_g$ converges to $p_{\text{data}}$.
For a better understanding, we take a look at the following pedagogical illustration.
The lowest horizontal line represents the domain of the noise $z$, and the upper horizontal line represents the data domain $x$; the arrows between them depict the mapping $x = G(z)$. As is shown, the generated distribution $p_g$ is pulled towards $p_{\text{data}}$ over the course of training while the discriminator's output flattens, until at the optimum $p_g = p_{\text{data}}$ and $D(x) = \frac{1}{2}$ everywhere.
Appendix
A. K-L Divergence
In probability and information theory, the K-L divergence, also called relative entropy or information gain, is a “measure” of the distance between two probability distributions. Here its formal definition is presented, in discrete form and continuous form, respectively.
Given the probability mass functions $P$ and $Q$ of two discrete random variables,
\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}.
\]

Given the probability density functions $p$ and $q$ of two continuous random variables,
\[
D_{\mathrm{KL}}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx.
\]
Mathematically, the K-L divergence is not a true metric, since it violates one of the axioms of a metric, namely symmetry; that is, in general
\[
D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P).
\]
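This asymmetry is easy to witness numerically; the following sketch (with two arbitrarily chosen distributions) computes the divergence in both directions:

```python
import numpy as np

def kl(p, q):
    """Discrete K-L divergence, using the 0 * log 0 = 0 convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))   # two different values: the divergence is asymmetric
```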
There are a few important properties of K-L divergence:
(1) The K-L divergence is well-defined if and only if, for every $i$ with $Q(i) = 0$, we also have $P(i) = 0$ (that is, $P$ is absolutely continuous with respect to $Q$).
(2) For any $i$ with $P(i) = 0$, the corresponding term is interpreted as zero, since $\lim_{t \to 0^+} t \log t = 0$.
(3) $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality if and only if $P = Q$ (Gibbs' inequality).
Now we prove the third property. Since $\log$ is concave, Jensen's inequality gives
\[
-D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{Q(i)}{P(i)} \le \log \sum_i P(i) \frac{Q(i)}{P(i)} = \log \sum_i Q(i) = \log 1 = 0,
\]
with equality if and only if $Q(i)/P(i)$ is constant, i.e. $P = Q$.
This completes the proof. #
B. Variational Calculus
Variational calculus is a natural extension of differentiation from functions of variables to functionals of functions.
Given a functional of the form $F[y] = \int L\big(x, y(x), y'(x)\big)\, dx$, a necessary condition for $y$ to be an extremum of $F$ is that the functional derivative of $F$ vanish, which for this form of $F$ reduces to the Euler-Lagrange equation
\[
\frac{\partial L}{\partial y} - \frac{d}{dx} \frac{\partial L}{\partial y'} = 0.
\]
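As a worked example (added here to connect back to Proposition 1), treat $V(D, G) = \int \big[ p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x)) \big]\, dx$ as a functional of $D$. The integrand contains no $D'$, so the Euler-Lagrange equation reduces to $\partial L / \partial D = 0$:
\[
\frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0 \quad \Longrightarrow \quad D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\]
which is exactly the optimal discriminator $D^{*}_{G}$.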
C. Change of Random Variable in Measure Theory
We claim that for any (bounded, measurable) function $f$,
\[
\mathbb{E}_{z \sim p_z}\big[f(G(z))\big] = \mathbb{E}_{x \sim p_g}\big[f(x)\big],
\]
where $p_g$ is the density of the distribution of $x = G(z)$ induced by $z \sim p_z$. Taking $f(x) = \log(1 - D(x))$ gives the equality used in the proof of Proposition 1.

Proof:
First, we define a measure space $(Z, \mathcal{F}, \mu)$, where $Z$ is the noise space, $\mathcal{F}$ is a $\sigma$-algebra on $Z$, and $\mu$ is the probability measure whose density is $p_z$. The mapping $G$ induces the pushforward measure $\mu \circ G^{-1}$ on the data space $X$, whose density is by definition $p_g$. By the change-of-variable theorem for pushforward measures,
\[
\int_Z f\big(G(z)\big)\, d\mu(z) = \int_X f(x)\, d\big(\mu \circ G^{-1}\big)(x) = \int_X f(x)\, p_g(x)\, dx.
\]
This completes the proof. #
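A quick Monte Carlo illustration of this identity (the concrete choices of $p_z$, $G$ and $f$ below are assumptions for the example): pushing $z \sim \mathcal{N}(0, 1)$ through $G(z) = e^z$ induces precisely the standard log-normal distribution, so both estimates below should agree.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.log1p(x)   # an arbitrary test function

# Left side: E_{z ~ N(0,1)}[f(G(z))] with G(z) = exp(z).
z = rng.standard_normal(1_000_000)
lhs = f(np.exp(z)).mean()

# Right side: E_{x ~ p_g}[f(x)]; the pushforward of N(0, 1) under exp()
# is the standard log-normal distribution, so sample it directly.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
rhs = f(x).mean()

print(lhs, rhs)   # the two estimates agree to Monte Carlo accuracy
```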