Gaussian Mixture Models | Machine Learning Derivation Series (13)

I. Overview

Taking one-dimensional data as an example, the figure below shows that weighting and summing several single Gaussian models yields a Gaussian mixture model, which clearly has stronger fitting capacity than any single Gaussian:

[Figure: several weighted 1-D Gaussian components summed into a single mixture density]

As a two-dimensional example, the figure below shows two dense regions of data, so the corresponding probability distribution has two peaks. A Gaussian mixture model can be viewed as a generative model: to generate a data point, first select one of the Gaussian distributions, then draw the sample from the selected Gaussian:

[Figure: 2-D data with two dense regions and the corresponding two-peaked mixture density]

Combining the two descriptions above, we can characterize the Gaussian mixture model from two angles:

  1. Geometric view: weighted average

From the geometric view, a Gaussian mixture model is a weighted average of several Gaussian distributions:

$$p(x)=\sum_{k=1}^{K}\alpha_{k}N(\mu_{k},\Sigma_{k}),\quad \sum_{k=1}^{K}\alpha_{k}=1$$
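This weighted average is easy to see numerically. Below is a minimal sketch of the geometric view in Python; the weights, means, and standard deviations are illustrative values chosen here, not values from the text:

```python
# Geometric view: a 1-D mixture density is a weighted average of K
# Gaussian densities. All parameter values below are illustrative.
import numpy as np
from scipy.stats import norm

alpha = np.array([0.3, 0.5, 0.2])   # mixing weights alpha_k, sum to 1
mu    = np.array([-2.0, 0.0, 3.0])  # component means mu_k
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def mixture_pdf(x):
    """p(x) = sum_k alpha_k * N(x | mu_k, sigma_k^2)."""
    return sum(a * norm.pdf(x, m, s) for a, m, s in zip(alpha, mu, sigma))

xs = np.linspace(-5.0, 6.0, 200)
density = mixture_pdf(xs)           # the weighted-average density curve
```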

  2. Mixture-model (generative-model) view

From this view, a Gaussian mixture model is a generative model containing a latent variable:

$x$: observed variable
$z$: latent variable

$z$ is the latent variable indicating which Gaussian distribution the sample $x$ belongs to; its distribution is given by the following table:

| $z$ | $C_1$ | $C_2$ | $\cdots$ | $C_K$ |
| --- | --- | --- | --- | --- |
| $p$ | $p_1$ | $p_2$ | $\cdots$ | $p_K$ |

The probability $p_k$ here is exactly the weight in the geometric view's weighted average, so the two perspectives describe the same model. A sampling sketch of the generative process follows below.
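A minimal sketch of the two-stage generative process, assuming illustrative 2-D parameters with $K=2$ (these values are not from the text):

```python
# Generative view: first draw the latent component z from the categorical
# distribution p, then draw x from the chosen Gaussian. Illustrative
# parameters only.
import numpy as np

rng = np.random.default_rng(0)
p      = np.array([0.4, 0.6])                        # p(z = C_k)
mus    = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def sample(n):
    z = rng.choice(len(p), size=n, p=p)              # step 1: pick a Gaussian
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z                                      # step 2: draw from it

X, z = sample(500)   # two clusters, matching the two density peaks above
```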

We can draw the probabilistic graphical model of the Gaussian mixture:

[Figure: probabilistic graphical model of the GMM, drawn with plate notation]

The solid dots denote the model parameters, and the $N$ in the lower-right corner denotes the number of samples.

II. Attempting to Solve with Maximum Likelihood Estimation

$X$: observed data $\rightarrow X=(x_{1},x_{2},\cdots,x_{N})$
$(X,Z)$: complete data
$\theta$: parameters $\rightarrow \theta=\left\{p_{1},p_{2},\cdots,p_{K},\mu_{1},\mu_{2},\cdots,\mu_{K},\Sigma_{1},\Sigma_{2},\cdots,\Sigma_{K}\right\},\ \sum_{k=1}^{K}p_{k}=1$

These are our data and the parameters to be estimated. Next we express the probability $p(x)$:

$$\begin{aligned} p(x)&=\sum_{z}p(x,z)\\ &=\sum_{k=1}^{K}p(x,z=C_{k})\\ &=\sum_{k=1}^{K}p(z=C_{k})\cdot p(x|z=C_{k})\\ &=\sum_{k=1}^{K}p_{k}\cdot N(x|\mu_{k},\Sigma_{k}) \end{aligned}$$

We now try to solve this parameter-estimation problem by maximum likelihood. The conclusion first: maximum likelihood estimation cannot solve a parameter-estimation problem that contains latent variables, in the sense that it yields no closed-form solution. Let us see why:

$$\begin{aligned} \hat{\theta}_{MLE}&=\underset{\theta}{argmax}\; log\; p(X)\\ &=\underset{\theta}{argmax}\; log\prod_{i=1}^{N}p(x_{i})\\ &=\underset{\theta}{argmax}\sum_{i=1}^{N}log\; p(x_{i})\\ &=\underset{\theta}{argmax}\sum_{i=1}^{N}{\color{Red}{log\sum_{k=1}^{K}}}p_{k}\cdot N(x_{i}|\mu_{k},\Sigma_{k}) \end{aligned}$$

The reason no closed-form maximum likelihood solution exists is the summation that appears inside the $log$ (highlighted in red above). We could of course optimize numerically with gradient descent, but for models with latent variables the EM algorithm is the more suitable choice.
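Even without a closed form, the log-likelihood itself is straightforward to evaluate, which is what a gradient-based approach would optimize. A sketch using the log-sum-exp trick for numerical stability, reusing `p`, `mus`, `Sigmas`, and `X` from the sampling example above:

```python
# The obstacle is the "log of a sum": the inner sum over k sits inside
# the log, so the terms do not decouple. Numerically this is still easy
# to evaluate via log-sum-exp.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, p, mus, Sigmas):
    # log_comp[i, k] = log p_k + log N(x_i | mu_k, Sigma_k)
    log_comp = np.stack(
        [np.log(p[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
         for k in range(len(p))], axis=1)
    # sum_i log sum_k exp(log_comp[i, k])
    return logsumexp(log_comp, axis=1).sum()
```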

III. Solving with the EM Algorithm

Since the EM algorithm needs both the joint probability $p(x,z)$ and the posterior $p(z|x)$, we first write out these two probabilities:

$$p(x,z)=p(z)\cdot p(x|z)=p_{z}\cdot N(x|\mu_{z},\Sigma_{z})$$

$$p(z|x)=\frac{p(x,z)}{p(x)}=\frac{p_{z}\cdot N(x|\mu_{z},\Sigma_{z})}{\sum_{k=1}^{K}p_{k}\cdot N(x|\mu_{k},\Sigma_{k})}$$
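In code, the posterior for every sample and component is the familiar "responsibility" matrix. A minimal sketch, reusing the illustrative parameters from above:

```python
# Posterior p(z_i = C_k | x_i): the responsibility of component k for
# sample i, computed directly from the formula above.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, p, mus, Sigmas):
    # numerator: p_k * N(x_i | mu_k, Sigma_k) for every i and k
    num = np.stack(
        [p[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
         for k in range(len(p))], axis=1)            # shape (N, K)
    return num / num.sum(axis=1, keepdims=True)      # normalize over k

gamma = responsibilities(X, p, mus, Sigmas)          # gamma[i, k] = p(z_i = C_k | x_i)
```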

  1. E step

$$\begin{aligned} Q(\theta,\theta^{t})&=\sum_{Z}log\; p(X,Z|\theta)\cdot p(Z|X,\theta^{t})\\ &=\sum_{z_{1}z_{2}\cdots z_{N}}log\;\prod_{i=1}^{N}p(x_{i},z_{i}|\theta)\cdot\prod_{i=1}^{N}p(z_{i}|x_{i},\theta^{t})\\ &=\sum_{z_{1}z_{2}\cdots z_{N}}\sum_{i=1}^{N}log\; p(x_{i},z_{i}|\theta)\cdot\prod_{i=1}^{N}p(z_{i}|x_{i},\theta^{t})\\ &=\sum_{z_{1}z_{2}\cdots z_{N}}\left[log\; p(x_{1},z_{1}|\theta)+log\; p(x_{2},z_{2}|\theta)+\cdots+log\; p(x_{N},z_{N}|\theta)\right]\cdot\prod_{i=1}^{N}p(z_{i}|x_{i},\theta^{t}) \end{aligned}$$

Each term of this expansion can be simplified. Take the first term as an example:

$$\begin{aligned} &\sum_{z_{1}z_{2}\cdots z_{N}}log\; p(x_{1},z_{1}|\theta)\prod_{i=1}^{N}p(z_{i}|x_{i},\theta^{t})\\ =&\sum_{z_{1}z_{2}\cdots z_{N}}log\; p(x_{1},z_{1}|\theta)\cdot p(z_{1}|x_{1},\theta^{t})\prod_{i=2}^{N}p(z_{i}|x_{i},\theta^{t})\\ =&\sum_{z_{1}}log\; p(x_{1},z_{1}|\theta)\cdot p(z_{1}|x_{1},\theta^{t})\sum_{z_{2}z_{3}\cdots z_{N}}\prod_{i=2}^{N}p(z_{i}|x_{i},\theta^{t})\\ =&\sum_{z_{1}}log\; p(x_{1},z_{1}|\theta)\cdot p(z_{1}|x_{1},\theta^{t})\sum_{z_{2}z_{3}\cdots z_{N}}p(z_{2}|x_{2},\theta^{t})p(z_{3}|x_{3},\theta^{t})\cdots p(z_{N}|x_{N},\theta^{t})\\ =&\sum_{z_{1}}log\; p(x_{1},z_{1}|\theta)\cdot p(z_{1}|x_{1},\theta^{t})\underset{=1}{\underbrace{\sum_{z_{2}}p(z_{2}|x_{2},\theta^{t})}}\underset{=1}{\underbrace{\sum_{z_{3}}p(z_{3}|x_{3},\theta^{t})}}\cdots\underset{=1}{\underbrace{\sum_{z_{N}}p(z_{N}|x_{N},\theta^{t})}}\\ =&\sum_{z_{1}}log\; p(x_{1},z_{1}|\theta)\cdot p(z_{1}|x_{1},\theta^{t}) \end{aligned}$$

In the same way, for the general term we obtain:

$$\sum_{z_{1}z_{2}\cdots z_{N}}log\; p(x_{i},z_{i}|\theta)\prod_{j=1}^{N}p(z_{j}|x_{j},\theta^{t})=\sum_{z_{i}}log\; p(x_{i},z_{i}|\theta)\cdot p(z_{i}|x_{i},\theta^{t})$$

Continuing to simplify $Q(\theta,\theta^{t})$, we obtain:

$$\begin{aligned} Q(\theta,\theta^{t})&=\sum_{z_{1}}log\; p(x_{1},z_{1}|\theta)\cdot p(z_{1}|x_{1},\theta^{t})+\cdots+\sum_{z_{N}}log\; p(x_{N},z_{N}|\theta)\cdot p(z_{N}|x_{N},\theta^{t})\\ &=\sum_{i=1}^{N}\sum_{z_{i}}log\; p(x_{i},z_{i}|\theta)\cdot p(z_{i}|x_{i},\theta^{t})\\ &=\sum_{i=1}^{N}\sum_{z_{i}}log\;[p_{z_{i}}\cdot N(x_{i}|\mu_{z_{i}},\Sigma_{z_{i}})]\cdot\frac{p_{z_{i}}^{t}\cdot N(x_{i}|\mu_{z_{i}}^{t},\Sigma_{z_{i}}^{t})}{\sum_{k=1}^{K}p_{k}^{t}\cdot N(x_{i}|\mu_{k}^{t},\Sigma_{k}^{t})}\\ &=\sum_{i=1}^{N}\sum_{z_{i}}log\;[p_{z_{i}}\cdot N(x_{i}|\mu_{z_{i}},\Sigma_{z_{i}})]\cdot p(z_{i}|x_{i},\theta^{t})\\ &=\sum_{k=1}^{K}\sum_{i=1}^{N}log\;[p_{k}\cdot N(x_{i}|\mu_{k},\Sigma_{k})]\cdot p(z_{i}=C_{k}|x_{i},\theta^{t})\\ &=\sum_{k=1}^{K}\sum_{i=1}^{N}[log\; p_{k}+log\; N(x_{i}|\mu_{k},\Sigma_{k})]\cdot p(z_{i}=C_{k}|x_{i},\theta^{t}) \end{aligned}$$

(The fraction $\frac{p_{z_{i}}^{t}\cdot N(x_{i}|\mu_{z_{i}}^{t},\Sigma_{z_{i}}^{t})}{\sum_{k=1}^{K}p_{k}^{t}\cdot N(x_{i}|\mu_{k}^{t},\Sigma_{k}^{t})}$ does not depend on $\theta$, so we abbreviate it as $p(z_{i}|x_{i},\theta^{t})$. In the second-to-last step, the inner sum over the values $C_{1},\cdots,C_{K}$ of $z_{i}$ is written as a sum over $k$, and the two finite sums are exchanged.)
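The final form of $Q$ is a double sum that is simple to evaluate: fix the responsibilities at $\theta^{t}$, then score $log\;p_{k}+log\;N(x_{i}|\mu_{k},\Sigma_{k})$ at a candidate $\theta$. A sketch, reusing `responsibilities()` from above:

```python
# Q(theta, theta_t): responsibilities gamma are computed once at theta_t;
# the log terms are evaluated at the candidate theta.
import numpy as np
from scipy.stats import multivariate_normal

def Q(theta, gamma, X):
    p, mus, Sigmas = theta
    total = 0.0
    for k in range(len(p)):
        log_term = np.log(p[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        total += (gamma[:, k] * log_term).sum()   # inner sum over i
    return total                                  # outer sum over k
```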

  2. M step

The EM iteration rule is:

$$\theta^{t+1}=\underset{\theta}{argmax}\; Q(\theta,\theta^{t})$$

Below we take $p^{t+1}=(p_{1}^{t+1},p_{2}^{t+1},\cdots,p_{K}^{t+1})^{T}$ as an example of how the update is derived. Since only the $log\; p_{k}$ part of $Q$ depends on $p$, the maximization reduces to:

$$p^{t+1}=\underset{p}{argmax}\sum_{k=1}^{K}\sum_{i=1}^{N}log\; p_{k}\cdot p(z_{i}=C_{k}|x_{i},\theta^{t}),\quad s.t.\;\sum_{k=1}^{K}p_{k}=1$$

This is equivalent to the following constrained optimization problem:

$$\left\{\begin{matrix} \underset{p}{min}-\sum_{k=1}^{K}\sum_{i=1}^{N}log\; p_{k}\cdot p(z_{i}=C_{k}|x_{i},\theta^{t})\\ s.t.\;\sum_{k=1}^{K}p_{k}=1 \end{matrix}\right.$$

We then solve it with the method of Lagrange multipliers:

$$\begin{aligned} L(p,\lambda)&=-\sum_{k=1}^{K}\sum_{i=1}^{N}log\; p_{k}\cdot p(z_{i}=C_{k}|x_{i},\theta^{t})+\lambda(\sum_{k=1}^{K}p_{k}-1)\\ \frac{\partial L}{\partial p_{k}}&=-\sum_{i=1}^{N}\frac{1}{p_{k}}\cdot p(z_{i}=C_{k}|x_{i},\theta^{t})+\lambda\overset{\triangle}{=}0 \end{aligned}$$

Multiplying through by $p_{k}$ and evaluating at the optimum $p_{k}^{t+1}$ gives

$$-\sum_{i=1}^{N}p(z_{i}=C_{k}|x_{i},\theta^{t})+p_{k}^{t+1}\lambda=0$$

Summing this equation over $k=1,2,\cdots,K$ determines $\lambda$:

$$-\underset{=N}{\underbrace{\sum_{i=1}^{N}\underset{=1}{\underbrace{\sum_{k=1}^{K}p(z_{i}=C_{k}|x_{i},\theta^{t})}}}}+\underset{=1}{\underbrace{\sum_{k=1}^{K}p_{k}^{t+1}}}\lambda=0\Rightarrow-N+\lambda=0\Rightarrow\lambda=N$$

Substituting $\lambda=N$ back into $-\sum_{i=1}^{N}p(z_{i}=C_{k}|x_{i},\theta^{t})+p_{k}^{t+1}\lambda=0$ gives

$$-\sum_{i=1}^{N}p(z_{i}=C_{k}|x_{i},\theta^{t})+p_{k}^{t+1}N=0\Rightarrow p_{k}^{t+1}=\frac{\sum_{i=1}^{N}p(z_{i}=C_{k}|x_{i},\theta^{t})}{N}$$

We have shown the M-step computation with $p^{t+1}$ as the example; the remaining parameters are obtained in the same spirit of maximizing the $Q$ function. Once a full round of parameters has been computed, the iteration continues until convergence; a sketch of one complete iteration follows below.
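Maximizing the same $Q$ over $\mu_{k}$ and $\Sigma_{k}$ yields the standard weighted-mean and weighted-covariance updates, with $\gamma_{ik}=p(z_{i}=C_{k}|x_{i},\theta^{t})$:

$$\mu_{k}^{t+1}=\frac{\sum_{i=1}^{N}\gamma_{ik}x_{i}}{\sum_{i=1}^{N}\gamma_{ik}},\quad \Sigma_{k}^{t+1}=\frac{\sum_{i=1}^{N}\gamma_{ik}(x_{i}-\mu_{k}^{t+1})(x_{i}-\mu_{k}^{t+1})^{T}}{\sum_{i=1}^{N}\gamma_{ik}}$$

Putting the E step and M step together, here is a minimal sketch of one full EM iteration (not a production implementation; it reuses `responsibilities()` and the illustrative data `X` from above):

```python
# One EM iteration: the E step computes the responsibilities at theta^t;
# the M step applies the derived update p_k = sum_i gamma_ik / N together
# with the weighted-mean and weighted-covariance updates above.
import numpy as np

def em_step(X, p, mus, Sigmas):
    N = len(X)
    gamma = responsibilities(X, p, mus, Sigmas)   # E step
    Nk = gamma.sum(axis=0)                        # effective count per component
    p_new = Nk / N                                # the update derived above
    mus_new, Sigmas_new = [], []
    for k in range(len(p)):
        mu_k = gamma[:, k] @ X / Nk[k]            # weighted mean
        d = X - mu_k
        Sigma_k = (gamma[:, k] * d.T) @ d / Nk[k] # weighted covariance
        mus_new.append(mu_k)
        Sigmas_new.append(Sigma_k)
    return p_new, mus_new, Sigmas_new

# Iterate until the log-likelihood stops improving (a fixed iteration
# count stands in for a real convergence check here).
for _ in range(50):
    p, mus, Sigmas = em_step(X, p, mus, Sigmas)
```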