Machine Learning (2): Estimating the Probability Density -- Mixture of Gaussians



Chenjing Ding
2018/02/21


notation    meaning
M           the number of mixture components
p(j)        weight of the j-th mixture component
p(x|θj)     the j-th mixture component
p(x|θ)      mixture density
θj          parameters of the j-th component

1. Mixture of Multivariate Gaussians

In some cases a single Gaussian distribution cannot represent p(x|θ) (see the red model in figure 1), so in this chapter we estimate the density as a mixture of multivariate Gaussians.

1.1 Obtaining the mixture density

Weight of mixture component:

$$p(j) = \pi_j$$

Mixture component:

$$p(x \mid \theta_j) = \mathcal{N}(x \mid \mu_j, \Sigma_j)$$

Mixture density:

$$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$$
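
As a quick illustration, here is a minimal NumPy/SciPy sketch that evaluates this mixture density. The two-component parameter values below are made up for the example, not taken from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2-D: p(j) = pi[j], theta_j = (mu[j], Sigma[j]).
pi = np.array([0.3, 0.7])                          # p(j), must sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigma = [np.eye(2), 2.0 * np.eye(2)]               # component covariances

def mixture_density(x):
    """p(x|theta) = sum_j p(x|theta_j) p(j)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(pi, mu, Sigma))

print(mixture_density(np.array([1.0, 1.0])))  # density at a single point
```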


figure 1: mixture density

2. Maximum Likelihood

Using maximum likelihood to estimate μj, write down the log-likelihood and set its derivative with respect to μj to zero:

$$E_n(\theta) = \ln p(x_n \mid \theta), \qquad E(\theta) = \sum_{n=1}^{N} E_n(\theta) = \sum_{n=1}^{N} \ln p(x_n \mid \theta)$$

$$\frac{\partial E(\theta)}{\partial \mu_j}
= \sum_{n=1}^{N} \frac{\partial p(x_n \mid \theta)/\partial \mu_j}{p(x_n \mid \theta)}
= \sum_{n=1}^{N} \frac{p(j)\, \partial p(x_n \mid \theta_j)/\partial \mu_j}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}
= \sum_{n=1}^{N} \frac{p(j)\, \Sigma_j^{-1}(x_n - \mu_j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}
= \Sigma_j^{-1} \sum_{n=1}^{N} (x_n - \mu_j)\, \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}$$

Defining

$$\gamma_j(x_n) = \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}$$

and setting $\partial E(\theta)/\partial \mu_j = 0$ gives

$$\mu_j = \frac{\sum_{n=1}^{N} x_n\, \gamma_j(x_n)}{\sum_{n=1}^{N} \gamma_j(x_n)}$$

Problem with estimating μj

μj depends on γj(xn), and γj(xn) in turn depends on μj, so there is no analytical solution. Moreover,

$$\gamma_J(x_n) = \frac{p(J)\, p(x_n \mid \theta_J)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)} = \frac{p(x_n \mid j = J, \theta)\, p(J)}{p(x_n \mid \theta)} = \frac{p(x_n, j = J \mid \theta)}{p(x_n \mid \theta)} = p(j = J \mid x_n, \theta)$$

Thus γj(xn) represents the "responsibility" of component j for the mixture density given xn. If we can estimate γj(xn), we can obtain μj, and K-Means clustering is helpful for that.
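
To make the circular dependence concrete, here is a small sketch (toy data and parameter guesses are placeholders) that computes γj(xn) from the current parameters and then takes one fixed-point step for μj; the EM algorithm in section 4 iterates exactly this pattern:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy data and current parameter guesses (all values are placeholders).
X = np.array([[0.1, 0.2], [2.9, 3.1], [0.3, -0.1]])   # N x d samples
pi = np.array([0.5, 0.5])                             # p(j)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])               # current mu_j
Sigma = [np.eye(2), np.eye(2)]                        # current Sigma_j

# gamma[n, j] = p(j) p(x_n|theta_j) / sum_k p(k) p(x_n|theta_k)
lik = np.column_stack([multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                       for j in range(2)])
gamma = (lik * pi) / (lik * pi).sum(axis=1, keepdims=True)

# One fixed-point step: mu_j becomes the gamma-weighted mean of the data.
mu_new = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```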

3. K-Means Clustering

K-Means clustering assigns each data point to one of K clusters according to its distance to the mean of each cluster.

3.1 Steps

step 1: Initialization: pick K arbitrary centroids (cluster means).

step 2: Assign each sample to the closest centroid.

step 3: Adjust the centroids to be the means of the samples assigned to them.

step 4: Go to step 2 until nothing changes in step 3.
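
A minimal NumPy sketch of these four steps, assuming random samples as the initial centroids (function and variable names are my own, not from the text):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-Means: X is an (N, d) array; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # step 1: pick K arbitrary centroids (here: K distinct random samples)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each sample to the closest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[assign == k].mean(axis=0)
                                  if np.any(assign == k) else centroids[k]
                                  for k in range(K)])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign
```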


figure 2: the process of K-Means clustering (K = 2)

3.2 Objective function

K-Means optimizes the following objective function:

$$L = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2, \qquad r_{nk} = \begin{cases} 1, & k = \arg\min_{k'} \lVert x_n - \mu_{k'} \rVert^2 \\ 0, & \text{otherwise} \end{cases}$$
rnk is an indicator variable that equals 1 exactly when μk is the nearest cluster center to point xn.
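
Given the hard assignments rnk, the objective is easy to evaluate; a small sketch that reuses the output of the hypothetical `kmeans` function above:

```python
import numpy as np

def kmeans_objective(X, centroids, assign):
    """L = sum_n ||x_n - mu_{assign(n)}||^2; r_nk picks only the assigned center."""
    return np.sum(np.linalg.norm(X - centroids[assign], axis=1) ** 2)
```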

3.3 Advantages and Disadvantages

Advantage:

  • simple and fast to compute
  • converges to a local minimum of the within-cluster squared error

Disadvantage:

  • sensitive to initialization
  • sensitive to outliers
  • difficult to set K properly
  • can only detect spherical clusters (see figure 3)

figure 3: the problems of K-Means clustering (K = 2)

4. EM Algorithm

Once K-Means clustering has assigned each sample to a cluster, we can compute θj = (μj, Σj) from the samples in cluster j and then estimate γj(xn), the "responsibility" of component j for the mixture density.
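
A small sketch of this initialization, reusing the hypothetical `kmeans` output from section 3 (it assumes every cluster holds at least two samples; `eps` is my own small regularizer, anticipating section 4.4):

```python
import numpy as np

def init_from_kmeans(X, centroids, assign, eps=1e-6):
    """Estimate theta_j = (mu_j, Sigma_j) and p(j) from a K-Means result."""
    K, d = centroids.shape
    mu = centroids
    # per-cluster sample covariance; eps * I keeps it invertible
    Sigma = np.stack([np.cov(X[assign == j].T) + eps * np.eye(d)
                      for j in range(K)])
    pi = np.bincount(assign, minlength=K) / len(X)   # p(j): cluster fractions
    return pi, mu, Sigma
```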

4.1 K-Means Clustering Revisited

step 1: Initialization: pick K arbitrary centroids [compute $\theta_j^{(0)} = (\mu_j^{(0)}, \Sigma_j^{(0)})$]

step 2: Assign each sample to the closest centroid [compute $\gamma_j(x_n)$: E-step]

step 3: Adjust the centroids to be the means of the samples assigned to them [compute $\theta_j^{(\tau)} = (\mu_j^{(\tau)}, \Sigma_j^{(\tau)})$: M-step]

step 4: Go to step 2 (until no change)

The process is almost the same as in K-Means clustering, but in K-Means each point is hard-assigned to exactly one cluster; there is no soft-assignment concept like γj(xn).

4.2 E-step & M-step

E-step: softly assign samples to mixture components:

$$\gamma_j(x_n) = \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}, \qquad j = 1, \ldots, M,\; n = 1, \ldots, N$$

M-step: re-estimate the parameters (separately for each mixture component) based on the soft assignments:
$$\hat{N}_j = \sum_{n=1}^{N} \gamma_j(x_n), \qquad \hat{p}(j) = \frac{\hat{N}_j}{N}$$

$$\hat{\mu}_j^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}, \qquad \hat{\Sigma}_j^{\text{new}} = \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\text{new}})(x_n - \hat{\mu}_j^{\text{new}})^{\mathrm{T}}$$
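
Putting the E-step and M-step together, here is a hedged sketch of the full EM loop for a MoG; the random initialization and fixed iteration count are simplistic placeholders (section 4.4 suggests initializing with K-Means instead):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, M, n_iter=50, seed=0):
    """Fit an M-component Gaussian mixture to X (N x d) with plain EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(M, 1.0 / M)                          # p(j)
    mu = X[rng.choice(N, size=M, replace=False)]      # better: K-Means init
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * M)
    for _ in range(n_iter):
        # E-step: soft assignments gamma[n, j]
        lik = np.column_stack([multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                               for j in range(M)])
        gamma = (lik * pi) / (lik * pi).sum(axis=1, keepdims=True)
        # M-step: re-estimate each component from the soft assignments
        Nj = gamma.sum(axis=0)                        # N_j
        pi = Nj / N                                   # p(j)
        mu = (gamma.T @ X) / Nj[:, None]              # mu_j^new
        for j in range(M):
            diff = X - mu[j]                          # Sigma_j^new
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]
    return pi, mu, Sigma
```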

4.3 Advantages

  • Very general: with enough components, a mixture can represent any (continuous) distribution.
  • Once trained, very fast to evaluate.
  • Can be updated online.

4.4 Caveats

  1. Introduce regularization:
    instead of $\Sigma^{-1}$, use $(\Sigma + \sigma I)^{-1}$ to avoid a singular $\Sigma$ ($|\Sigma| \to 0$) making $p(x_n \mid \theta_j)$ go to infinity; see the sketch after this list.
  2. Initialize with K-Means to get better results.
    Typical steps:
    run K-Means several times (e.g. 10~100 runs);
    pick the best result (lowest objective L);
    use this result to initialize EM.
  3. EM for MoG is computationally expensive.
  4. The number of mixture components M must be selected properly (a model selection problem).
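
For caveat 1, a minimal sketch of the regularization idea, assuming a small constant σ (here the hypothetical parameter `sigma`) added along the diagonal:

```python
import numpy as np

def regularize(Sigma, sigma=1e-6):
    """Return Sigma + sigma * I: keeps the covariance invertible, so
    p(x_n|theta_j) cannot blow up when a component collapses onto one point."""
    d = Sigma.shape[-1]
    return Sigma + sigma * np.eye(d)
```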