Machine Learning (2): Estimating the Probability Density -- Mixture of Gaussians



Chenjing Ding
2018/02/21


notation    meaning
M           the number of mixture components
p(j)        weight of the j-th mixture component
p(x|θj)     the j-th mixture component
p(x|θ)      mixture density
θj          parameters of the j-th component

1. Mixture of Multivariate Gaussians

In some cases a single Gaussian distribution cannot represent p(x|θ) (see the red model in figure 1), so in this chapter we estimate the density as a mixture of multivariate Gaussians.

1.1 Obtaining the mixture density

Weight of mixture component:

$$p(j) = \pi_j$$

Mixture component:

$$p(x \mid \theta_j) = \mathcal{N}(x \mid \mu_j, \Sigma_j)$$

Mixture density:

$$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$$
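
As a quick illustration, here is a minimal NumPy/SciPy sketch that evaluates this mixture density. The two-component parameter values below are made up for the example, not taken from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2-D: p(j) = pi[j], theta_j = (mu[j], Sigma[j]).
pi = np.array([0.3, 0.7])                          # p(j), must sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigma = [np.eye(2), 2.0 * np.eye(2)]               # component covariances

def mixture_density(x):
    """p(x|theta) = sum_j p(x|theta_j) p(j)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(pi, mu, Sigma))

print(mixture_density(np.array([1.0, 1.0])))  # density at a single point
```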


figure 1: mixture density

2. Maximum Likelihood

Using maximum likelihood to estimate μj, write down the log-likelihood and set its derivative with respect to μj to zero:

$$E_n(\theta) = \ln p(x_n \mid \theta), \qquad E(\theta) = \sum_{n=1}^{N} E_n(\theta) = \sum_{n=1}^{N} \ln p(x_n \mid \theta)$$

$$\frac{\partial E(\theta)}{\partial \mu_j}
= \sum_{n=1}^{N} \frac{\partial p(x_n \mid \theta)/\partial \mu_j}{p(x_n \mid \theta)}
= \sum_{n=1}^{N} \frac{p(j)\, \partial p(x_n \mid \theta_j)/\partial \mu_j}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}
= \sum_{n=1}^{N} \frac{p(j)\, \Sigma_j^{-1}(x_n - \mu_j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}
= \Sigma_j^{-1} \sum_{n=1}^{N} (x_n - \mu_j)\, \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}$$

Defining

$$\gamma_j(x_n) = \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}$$

and setting $\partial E(\theta)/\partial \mu_j = 0$ gives

$$\mu_j = \frac{\sum_{n=1}^{N} x_n\, \gamma_j(x_n)}{\sum_{n=1}^{N} \gamma_j(x_n)}$$

Problem with estimating μj

μj depends on γj(xn), and γj(xn) in turn depends on μj, so there is no analytical solution. Moreover,

$$\gamma_J(x_n) = \frac{p(J)\, p(x_n \mid \theta_J)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)} = \frac{p(x_n \mid j = J, \theta)\, p(J)}{p(x_n \mid \theta)} = \frac{p(x_n, j = J \mid \theta)}{p(x_n \mid \theta)} = p(j = J \mid x_n, \theta)$$

Thus γj(xn) represents the "responsibility" of component j for the mixture density given xn. If we can estimate γj(xn), we can obtain μj, and K-Means clustering is helpful for that.
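
To make the circular dependence concrete, here is a small sketch (toy data and parameter guesses are placeholders) that computes γj(xn) from the current parameters and then takes one fixed-point step for μj; the EM algorithm in section 4 iterates exactly this pattern:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy data and current parameter guesses (all values are placeholders).
X = np.array([[0.1, 0.2], [2.9, 3.1], [0.3, -0.1]])   # N x d samples
pi = np.array([0.5, 0.5])                             # p(j)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])               # current mu_j
Sigma = [np.eye(2), np.eye(2)]                        # current Sigma_j

# gamma[n, j] = p(j) p(x_n|theta_j) / sum_k p(k) p(x_n|theta_k)
lik = np.column_stack([multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                       for j in range(2)])
gamma = (lik * pi) / (lik * pi).sum(axis=1, keepdims=True)

# One fixed-point step: mu_j becomes the gamma-weighted mean of the data.
mu_new = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```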

3. K-Means Clustering

K-Means clustering assigns each data point to one of K clusters according to its distance to the mean of each cluster.

3.1 Steps

step 1: Initialization: pick K arbitrary centroids (cluster means).

step 2: Assign each sample to the closest centroid.

step 3: Adjust the centroids to be the means of the samples assigned to them.

step 4: Go to step 2 until nothing changes in step 3.
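
A minimal NumPy sketch of these four steps, assuming random samples as the initial centroids (function and variable names are my own, not from the text):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-Means: X is an (N, d) array; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # step 1: pick K arbitrary centroids (here: K distinct random samples)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each sample to the closest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[assign == k].mean(axis=0)
                                  if np.any(assign == k) else centroids[k]
                                  for k in range(K)])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign
```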


figure 2: the process of K-Means clustering (K = 2)

3.2 Objective function

K-Means optimizes the following objective function:

$$L = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2, \qquad r_{nk} = \begin{cases} 1, & k = \arg\min_{k'} \lVert x_n - \mu_{k'} \rVert^2 \\ 0, & \text{otherwise} \end{cases}$$
rnk is an indicator variable that equals 1 exactly when μk is the nearest cluster center to point xn.
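
Given the hard assignments rnk, the objective is easy to evaluate; a small sketch that reuses the output of the hypothetical `kmeans` function above:

```python
import numpy as np

def kmeans_objective(X, centroids, assign):
    """L = sum_n ||x_n - mu_{assign(n)}||^2; r_nk picks only the assigned center."""
    return np.sum(np.linalg.norm(X - centroids[assign], axis=1) ** 2)
```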

3.3 Advantages and Disadvantages

Advantage:

  • simple and fast to compute
  • converges to a local minimum of the within-cluster squared error

Disadvantage:

  • sensitive to initialization
  • sensitive to outliers
  • difficult to set K properly
  • can only detect spherical clusters (see figure 3)

figure 3: the problems of K-Means clustering (K = 2)

4. EM Algorithm

Once K-Means clustering has assigned each sample to a cluster, we can compute θj = (μj, Σj) from the samples in cluster j and then estimate γj(xn), the "responsibility" of component j for the mixture density.
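
A small sketch of this initialization, reusing the hypothetical `kmeans` output from section 3 (it assumes every cluster holds at least two samples; `eps` is my own small regularizer, anticipating section 4.4):

```python
import numpy as np

def init_from_kmeans(X, centroids, assign, eps=1e-6):
    """Estimate theta_j = (mu_j, Sigma_j) and p(j) from a K-Means result."""
    K, d = centroids.shape
    mu = centroids
    # per-cluster sample covariance; eps * I keeps it invertible
    Sigma = np.stack([np.cov(X[assign == j].T) + eps * np.eye(d)
                      for j in range(K)])
    pi = np.bincount(assign, minlength=K) / len(X)   # p(j): cluster fractions
    return pi, mu, Sigma
```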

4.1 K-Means Clustering Revisited

step 1: Initialization: pick K arbitrary centroids [compute $\theta_j^{(0)} = (\mu_j^{(0)}, \Sigma_j^{(0)})$]

step 2: Assign each sample to the closest centroid [compute $\gamma_j(x_n)$: E-step]

step 3: Adjust the centroids to be the means of the samples assigned to them [compute $\theta_j^{(\tau)} = (\mu_j^{(\tau)}, \Sigma_j^{(\tau)})$: M-step]

step 4: Go to step 2 (until no change)

The process is almost the same as in K-Means clustering, but in K-Means each point is hard-assigned to exactly one cluster; there is no soft-assignment concept like γj(xn).

4.2 E-step & M-step

E-step: softly assign samples to mixture components:

$$\gamma_j(x_n) = \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}, \qquad j = 1, \ldots, M,\; n = 1, \ldots, N$$

M-step: re-estimate the parameters (separately for each mixture component) based on the soft assignments:
$$\hat{N}_j = \sum_{n=1}^{N} \gamma_j(x_n), \qquad \hat{p}(j) = \frac{\hat{N}_j}{N}$$

$$\hat{\mu}_j^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}, \qquad \hat{\Sigma}_j^{\text{new}} = \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\text{new}})(x_n - \hat{\mu}_j^{\text{new}})^{\mathrm{T}}$$
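
Putting the E-step and M-step together, here is a hedged sketch of the full EM loop for a MoG; the random initialization and fixed iteration count are simplistic placeholders (section 4.4 suggests initializing with K-Means instead):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, M, n_iter=50, seed=0):
    """Fit an M-component Gaussian mixture to X (N x d) with plain EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(M, 1.0 / M)                          # p(j)
    mu = X[rng.choice(N, size=M, replace=False)]      # better: K-Means init
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * M)
    for _ in range(n_iter):
        # E-step: soft assignments gamma[n, j]
        lik = np.column_stack([multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                               for j in range(M)])
        gamma = (lik * pi) / (lik * pi).sum(axis=1, keepdims=True)
        # M-step: re-estimate each component from the soft assignments
        Nj = gamma.sum(axis=0)                        # N_j
        pi = Nj / N                                   # p(j)
        mu = (gamma.T @ X) / Nj[:, None]              # mu_j^new
        for j in range(M):
            diff = X - mu[j]                          # Sigma_j^new
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]
    return pi, mu, Sigma
```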

4.3 Advantages

  • Very general: with enough components, a mixture can represent any (continuous) distribution.
  • Once trained, very fast to evaluate.
  • Can be updated online.

4.4 Caveats

  1. Introduce regularization:
    instead of $\Sigma^{-1}$, use $(\Sigma + \sigma I)^{-1}$ to avoid a singular $\Sigma$ ($|\Sigma| \to 0$) making $p(x_n \mid \theta_j)$ go to infinity; see the sketch after this list.
  2. Initialize with K-Means to get better results.
    Typical steps:
    run K-Means several times (e.g. 10~100 runs);
    pick the best result (lowest objective L);
    use this result to initialize EM.
  3. EM for MoG is computationally expensive.
  4. The number of mixture components M must be selected properly (a model selection problem).
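
For caveat 1, a minimal sketch of the regularization idea, assuming a small constant σ (here the hypothetical parameter `sigma`) added along the diagonal:

```python
import numpy as np

def regularize(Sigma, sigma=1e-6):
    """Return Sigma + sigma * I: keeps the covariance invertible, so
    p(x_n|theta_j) cannot blow up when a component collapses onto one point."""
    d = Sigma.shape[-1]
    return Sigma + sigma * np.eye(d)
```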