Machine Learning Series: The Maximum Entropy Model

  Before diving into the maximum entropy model, we first need to understand a closely related concept: the exponential family.

The Exponential Family

  The exponential family is a family of distributions that includes many of the probability distributions we commonly encounter: both the Bernoulli distribution, the representative discrete distribution, and the Gaussian distribution, the representative continuous distribution, belong to it. Abstracting a distribution into this family exposes shared properties that make certain problems easier to solve. The general form of an exponential-family distribution is:

$$p(x \mid \theta)=h(x) \exp\left(\theta^{\mathrm{T}} \phi(x)-A(\theta)\right)$$

  Here $h(x)$ is generally a simple multiplicative term, most often a constant. $\theta$ denotes the model parameters, $\phi(x)$ is called the sufficient statistic, and $A(\theta)$ is called the log partition function; it normalizes the probability terms and takes the form:

$$A(\theta)=\log \int_{x \in X} h(x) \exp\left(\theta^{\mathrm{T}} \phi(x)\right) \mathrm{d} x$$

  So how do common distributions transform into this form?

Bernoulli Distribution as an Exponential Family

  The Bernoulli distribution has the form:

$$P(X=x)=p^{x}(1-p)^{1-x}$$

  where $p$ is the probability that $X=1$. A few transformations give:

$$\begin{aligned} P(X=x) &=\exp\left(\log\left(p^{x}(1-p)^{1-x}\right)\right) \\ &=\exp\left(x \log p+(1-x)\log(1-p)\right) \\ &=\exp\left(x(\log p-\log(1-p))+\log(1-p)\right) \\ &=\exp\left(x\log\frac{p}{1-p}-\log\frac{1}{1-p}\right) \end{aligned}$$

  Matching against the general form: $h(x)=1$, $\phi(x)=x$, $\theta=\log\frac{p}{1-p}$, and $A(\theta)=\log\frac{1}{1-p}=\log\left(1+\frac{p}{1-p}\right)=\log\left(1+e^{\theta}\right)$.
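  As a quick sanity check, here is a minimal Python sketch (my own, not from the original text) verifying that this exponential-family form reproduces the standard Bernoulli pmf:

```python
import numpy as np

def bernoulli_expfam(x, p):
    """Bernoulli pmf written as h(x) * exp(theta * phi(x) - A(theta))."""
    theta = np.log(p / (1 - p))     # natural parameter
    A = np.log(1 + np.exp(theta))   # log partition function, equals -log(1-p)
    h, phi = 1.0, x                 # base measure and sufficient statistic
    return h * np.exp(theta * phi - A)

# Compare with the standard form p^x * (1-p)^(1-x)
p = 0.3
for x in (0, 1):
    assert np.isclose(bernoulli_expfam(x, p), p**x * (1 - p)**(1 - x))
```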

Gaussian Distribution as an Exponential Family

$$\begin{aligned} N\left(x \mid \mu, \sigma^{2}\right) &=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right] \\ &=\frac{1}{\sqrt{2\pi}}\exp\left[-\log\sigma-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right] \\ &=\frac{1}{\sqrt{2\pi}}\exp\left[-\log\sigma-\frac{1}{2\sigma^{2}}\left(x^{2}-2x\mu+\mu^{2}\right)\right] \\ &=\frac{1}{\sqrt{2\pi}}\exp\left[-\log\sigma-\frac{\mu^{2}}{2\sigma^{2}}+\frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}\right] \end{aligned}$$

  Matching terms: $h(x)=\frac{1}{\sqrt{2\pi}}$, $\theta=\left[\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}}\right]$, $\phi(x)=\left[x, x^{2}\right]$, and for $A(\theta)$:

$$\begin{aligned} A(\theta) &=\log\sigma+\frac{\mu^{2}}{2\sigma^{2}} \\ &=\log\sqrt{\sigma^{2}}+\left(\frac{\mu}{\sigma^{2}}\right)^{2}\times\frac{\sigma^{2}}{2} \\ &=-\log\sqrt{\frac{1}{\sigma^{2}}}+\theta_{1}^{2}\times\frac{1}{2\times\frac{1}{\sigma^{2}}} \\ &=-\log\sqrt{-2\times\left(-\frac{1}{2\sigma^{2}}\right)}-\theta_{1}^{2}\times\frac{1}{4\times\left(-\frac{1}{2\sigma^{2}}\right)} \\ &=-\log\sqrt{-2\theta_{2}}-\frac{\theta_{1}^{2}}{4\theta_{2}} \\ &=-\frac{1}{2}\log(-2\theta_{2})-\frac{\theta_{1}^{2}}{4\theta_{2}} \end{aligned}$$
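  The same kind of numeric check works here; the sketch below (again my own, with hypothetical names) confirms that this parameterization reproduces the Gaussian density:

```python
import numpy as np

def gaussian_expfam(x, mu, sigma2):
    """Gaussian density with theta = [mu/s2, -1/(2*s2)] and phi(x) = [x, x^2]."""
    theta = np.array([mu / sigma2, -1.0 / (2 * sigma2)])
    phi = np.array([x, x**2])
    A = -0.5 * np.log(-2 * theta[1]) - theta[0]**2 / (4 * theta[1])
    h = 1.0 / np.sqrt(2 * np.pi)
    return h * np.exp(theta @ phi - A)

# Compare with the standard Gaussian density
mu, sigma2, x = 1.0, 2.0, 0.5
standard = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.isclose(gaussian_expfam(x, mu, sigma2), standard)
```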

Properties of the Exponential Family

  • The first derivative of $A(\theta)$ equals the expectation of $\phi(x)$:

$$\begin{aligned} \frac{\partial A(\theta)}{\partial \theta} &=\frac{\partial}{\partial \theta}\left(\log \int h(x) \exp\left\{\theta^{\mathrm{T}} \phi(x)\right\} \mathrm{d} x\right) \\ &=\frac{\int h(x) \exp\left\{\theta^{\mathrm{T}} \phi(x)\right\} \phi(x) \mathrm{d} x}{\exp(A(\theta))} \\ &=\int h(x) \exp\left\{\theta^{\mathrm{T}} \phi(x)-A(\theta)\right\} \phi(x) \mathrm{d} x \\ &=\int p(x \mid \theta) \phi(x) \mathrm{d} x \\ &=E_{\theta}[\phi(x)] \end{aligned}$$

  • The second derivative of $A(\theta)$ equals the variance of $\phi(x)$:

$$\begin{aligned} \frac{\partial^{2} A(\theta)}{\partial \theta\, \partial \theta^{\mathrm{T}}} &=\int h(x) \exp\left(\theta^{\mathrm{T}} \phi(x)-A(\theta)\right) \phi(x)\left(\phi(x)-\frac{\partial A(\theta)}{\partial \theta}\right) \mathrm{d} x \\ &=\int p(x \mid \theta) \phi(x)\left(\phi(x)-E_{\theta}[\phi(x)]\right) \mathrm{d} x \\ &=\int p(x \mid \theta) \phi^{2}(x) \mathrm{d} x-E_{\theta}[\phi(x)] \int p(x \mid \theta) \phi(x) \mathrm{d} x \\ &=E_{\theta}\left[\phi^{2}(x)\right]-E_{\theta}^{2}[\phi(x)] \\ &=\operatorname{Var}_{\theta}[\phi(x)] \end{aligned}$$

  Therefore, once a distribution is written in exponential-family form, the expectation and variance of $\phi(x)$ can be obtained in two ways: directly from the distribution itself, or by differentiating the log partition function. In practice, simply choose whichever is easier.
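  For the Bernoulli case above, $A(\theta)=\log(1+e^{\theta})$, so both properties are easy to verify numerically; the following sketch (my own) uses finite differences:

```python
import numpy as np

# Bernoulli: A(theta) = log(1 + e^theta) and phi(x) = x, so
# A'(theta) should equal E[x] = p and A''(theta) should equal Var[x] = p(1-p).
A = lambda t: np.log(1 + np.exp(t))

theta = 0.8
p = 1 / (1 + np.exp(-theta))    # invert theta = log(p / (1 - p))
eps = 1e-5

dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2

assert np.isclose(dA, p, atol=1e-6)             # first derivative ~ mean
assert np.isclose(d2A, p * (1 - p), atol=1e-4)  # second derivative ~ variance
```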

  For brevity, the exponential family will be written from here on in the following simpler form, where $Z(\theta)=e^{A(\theta)}$:

$$p(x \mid \theta)=\frac{1}{Z(\theta)}\, h(x)\, \mathrm{e}^{\theta^{\mathrm{T}} \phi(x)}$$

The Maximum Entropy Model

  Let the random variable $X$ have probability distribution $P(X)$. Its entropy is:

$$H(P)=-\sum_{X}P(X)\log P(X)$$
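  As a quick illustration (a worked example of my own, not from the original), for a binary variable with $P(X{=}1)=p$ the entropy is maximized by the uniform choice $p=\frac{1}{2}$:

$$H(P)=-p\log p-(1-p)\log(1-p), \qquad \frac{\mathrm{d}H}{\mathrm{d}p}=\log\frac{1-p}{p}=0 \;\Rightarrow\; p=\frac{1}{2},\ H=\log 2$$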

  When building a probabilistic model we usually face constraints, and often these constraints alone do not determine a unique model: many models satisfy them, behaving essentially identically on the constrained part of the space while differing on the unconstrained part. The maximum entropy idea resolves this by assigning equal probability across the unconstrained subspace, which pins down a unique distribution.

  The maximum entropy principle is therefore: among all models satisfying the known constraints, choose the one with the largest entropy. Contrast this with decision trees, where splitting keeps reducing the uncertainty about which class an instance belongs to until a suitable classification is reached; since that is a process of steadily decreasing uncertainty, the split with the smallest entropy is chosen.

  We usually want the model to learn the distributional characteristics of the data, so we can impose the constraint that the model's feature expectation equals the empirical feature expectation of the training sample. The two are defined as follows:

  • Model feature expectation: the expectation of feature $f(x,y)$ with respect to the conditional distribution $p(y \mid x)$ the model is to learn:

$$p(f)=\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y)$$

  • Empirical feature expectation: the expectation of feature $f(x,y)$ with respect to the empirical distribution $\tilde{p}(x, y)$ of the training sample:

$$\tilde{p}(f)=\sum_{x, y} \tilde{p}(x, y) f(x, y)=\sum_{x, y} \frac{1}{N} \operatorname{count}(x, y) f(x, y)$$

  This gives us a constraint:

$$\sum_{x, y} \tilde{p}(x, y) f(x, y)=\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y)$$
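  In code, both sides of this constraint are straightforward counting; here is a minimal sketch for discrete $x, y$ (function names are my own):

```python
from collections import Counter

def empirical_expectation(samples, f):
    """Left side: sum over (x, y) of p~(x, y) * f(x, y)."""
    N = len(samples)
    return sum(c / N * f(x, y) for (x, y), c in Counter(samples).items())

def model_expectation(samples, f, p_y_given_x, ys):
    """Right side: sum over x, y of p~(x) * p(y|x) * f(x, y)."""
    N = len(samples)
    x_counts = Counter(x for x, _ in samples)
    return sum(cx / N * p_y_given_x(y, x) * f(x, y)
               for x, cx in x_counts.items() for y in ys)
```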

  At the same time, we want the model's distribution to maximize the entropy $H(p)$:

$$H(p)=-\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)$$

  The problem has now become a constrained optimization problem, with objective and constraints:

$$\begin{aligned} p^{*} &=\operatorname{argmax}_{p} H(p)=\operatorname{argmax}_{p} -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x) \\ \text{s.t.}\quad & p(y \mid x) \geqslant 0, \quad \forall x, y \\ & \sum_{y} p(y \mid x)=1, \quad \forall x \\ & \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_{i}(x, y)=\sum_{x, y} \tilde{p}(x, y) f_{i}(x, y), \quad i \in\{1,2, \ldots, n\} \end{aligned}$$

  How do we find $p$? Applying the method of Lagrange multipliers gives the Lagrangian $\xi(p, \Lambda, \gamma)$:

$$\begin{aligned} \xi(p, \Lambda, \gamma)=&-\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)+\sum_{i=1}^{n} \lambda_{i}\left(\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_{i}(x, y)\right. \\ &\left.-\sum_{x, y} \tilde{p}(x, y) f_{i}(x, y)\right)+\gamma\left(\sum_{y} p(y \mid x)-1\right) \end{aligned}$$

  where $\Lambda$ denotes the set of multipliers $\lambda_{1},\cdots,\lambda_{n}$. Holding $\Lambda$ and $\gamma$ fixed and differentiating with respect to $p$:

$$\frac{\partial \xi}{\partial p(y \mid x)}=-\tilde{p}(x)(\log p(y \mid x)+1)+\sum_{i=1}^{n} \lambda_{i} \tilde{p}(x) f_{i}(x, y)+\gamma$$

  Setting the derivative to zero yields the extremum of $p$:

$$-\tilde{p}(x)(\log p(y \mid x)+1)+\sum_{i=1}^{n} \lambda_{i} \tilde{p}(x) f_{i}(x, y)+\gamma=0$$

  Rearranging gives:

$$p(y \mid x)=\exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right) \exp\left(\frac{\gamma}{\tilde{p}(x)}-1\right)$$

  Applying the normalization constraint $\sum_{y} p(y \mid x)=1, \forall x$ once more and substituting:

$$\begin{aligned} \sum_{y} p(y \mid x) &=\sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right) \exp\left(\frac{\gamma}{\tilde{p}(x)}-1\right) \\ \Rightarrow\quad 1 &=\sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right) \exp\left(\frac{\gamma}{\tilde{p}(x)}-1\right) \\ \Rightarrow\quad \exp\left(\frac{\gamma}{\tilde{p}(x)}-1\right) &=\frac{1}{\sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right)} \end{aligned}$$

  Now $p(y \mid x)$ can be written as:

$$p(y \mid x)=\exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right) \frac{1}{\sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right)}$$

  Defining $Z(x)$ as:

$$Z(x)=\sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right)$$

  the optimal model is:

$$p(y \mid x)=\frac{1}{Z(x)} \exp\left(\sum_{i=1}^{n} \lambda_{i} f_{i}(x, y)\right)$$
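  This solution is just a softmax over feature-weighted scores. A minimal sketch (my own code, hypothetical names):

```python
import numpy as np

def max_ent_predict(lambdas, features, x, ys):
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x).

    lambdas:  one weight per feature
    features: list of functions f_i(x, y)
    ys:       all candidate labels
    """
    scores = np.array([sum(l * f(x, y) for l, f in zip(lambdas, features))
                       for y in ys])
    scores -= scores.max()         # subtract the max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()     # normalization by Z(x)
```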

  After the Lagrange multiplier computation, the optimization variable has changed from the model $p$ to the multipliers $\lambda$. Each $\lambda_i$ now acts like the weight of a feature: the model multiplies every feature by its weight to judge the probability that the data belongs to each $y$.

Solving via Maximum Likelihood

  How, then, do we estimate the parameters $\lambda$? Using maximum likelihood. The model expands to:

$$p(y \mid x, \lambda)=\frac{\exp\left[\sum_{i} \lambda_{i} f_{i}(x, y)\right]}{\sum_{y^{\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime}\right)\right]}$$

  Maximizing the log-likelihood $\log p(y \mid x,\lambda)$ over the data gives:

$$\begin{aligned} \log p(Y \mid X) &=\sum_{(x, y) \in(X, Y)} \log p(y \mid x, \lambda) \\ &=\sum_{(x, y) \in(X, Y)} \log \frac{\exp\left[\sum_{i} \lambda_{i} f_{i}(x, y)\right]}{\sum_{y^{\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime}\right)\right]} \\ &=\sum_{(x, y) \in(X, Y)} \sum_{i} \lambda_{i} f_{i}(x, y)-\sum_{(x, y) \in(X, Y)} \log \sum_{y^{\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime}\right)\right] \end{aligned}$$

  Differentiating with respect to $\lambda_i$:

$$\begin{aligned} \frac{\partial \log P(Y \mid X)}{\partial \lambda_{i}} &=\sum_{(x, y) \in(X, Y)} f_{i}(x, y)-\sum_{(x, y) \in(X, Y)} \frac{1}{\sum_{y^{\prime\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime\prime}\right)\right]} \frac{\partial \sum_{y^{\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime}\right)\right]}{\partial \lambda_{i}} \\ &=\sum_{(x, y) \in(X, Y)} f_{i}(x, y)-\sum_{(x, y) \in(X, Y)} \frac{1}{\sum_{y^{\prime\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime\prime}\right)\right]} \sum_{y^{\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime}\right)\right] f_{i}\left(x, y^{\prime}\right) \\ &=\sum_{(x, y) \in(X, Y)} f_{i}(x, y)-\sum_{(x, y) \in(X, Y)} \sum_{y^{\prime}} \frac{\exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime}\right)\right]}{\sum_{y^{\prime\prime}} \exp\left[\sum_{i} \lambda_{i} f_{i}\left(x, y^{\prime\prime}\right)\right]} f_{i}\left(x, y^{\prime}\right) \\ &=\sum_{(x, y) \in(X, Y)} f_{i}(x, y)-\sum_{(x, y) \in(X, Y)} \sum_{y^{\prime}} p\left(y^{\prime} \mid x, \lambda\right) f_{i}\left(x, y^{\prime}\right) \end{aligned}$$

  The gradient of the log-likelihood is therefore the difference between the empirical feature expectation of the training data and the feature expectation of the model. When the gradient is zero, the constraint is satisfied exactly:

$$\sum_{(x, y) \in(X, Y)} f_{i}(x, y)=\sum_{(x, y) \in(X, Y)} \sum_{y^{\prime}} p\left(y^{\prime} \mid x, \lambda\right) f_{i}\left(x, y^{\prime}\right)$$

  So the parameters $\lambda$ can be updated iteratively by gradient ascent on the likelihood (equivalently, gradient descent on its negative).
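  Continuing the earlier sketch, one update step might look like the following (hypothetical, reusing `max_ent_predict` from above):

```python
import numpy as np

def gradient_step(lambdas, features, data, ys, lr=0.1):
    """One ascent step: grad_i = empirical count of f_i - expected count."""
    grad = np.zeros(len(features))
    for x, y in data:
        probs = max_ent_predict(lambdas, features, x, ys)
        for i, f in enumerate(features):
            grad[i] += f(x, y)                                        # empirical term
            grad[i] -= sum(p * f(x, yp) for p, yp in zip(probs, ys))  # model term
    return lambdas + lr * grad
```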

The Maximum Entropy Likelihood Algorithm

  The gradient of the maximum entropy objective was derived above. Let $X$ take $dim_{x}$ distinct values and $Y$ take $dim_{y}$ distinct values. We can then define $dim_{x} \times dim_{y}$ features, each indicating whether a particular $(x, y)$ pair occurs:

$$f_{i, j}(x, y)=\begin{cases} 1 & i=x \text{ and } j=y \\ 0 & \text{otherwise} \end{cases}$$
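  Such indicator features could be built as follows (a sketch of my own; note the default-argument trick that captures the loop variables):

```python
def make_indicator_features(dim_x, dim_y):
    """One feature f_{i,j}(x, y) = 1 iff (x, y) == (i, j), else 0."""
    return [(lambda x, y, i=i, j=j: 1.0 if (x, y) == (i, j) else 0.0)
            for i in range(dim_x) for j in range(dim_y)]
```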

  The model has $dim_{x}\times dim_{y}$ parameters $\lambda_{i,j}$, each one the weight of the corresponding feature. If a sample has $N$ features, then the probability of outcome $y_{j}$ for a given sample $x$ is:

$$p\left(y_{j} \mid x\right)=\frac{\exp\left(\sum_{i} \lambda_{i, j} f_{i, j}\left(x_{i}, y_{j}\right)\right)}{\sum_{j^{\prime}} \exp\left(\sum_{i} \lambda_{i, j^{\prime}} f_{i, j^{\prime}}\left(x_{i}, y_{j^{\prime}}\right)\right)}$$

  Knowing this probability, we can compute a single sample's accumulated contribution to every feature, where each feature's accumulation is:

$$f_{i, j}(x)=\sum_{j^{\prime}} p\left(y_{j^{\prime}} \mid x\right) f_{i, j}\left(x_{i}, y_{j^{\prime}}\right)$$

  and likewise the accumulated contribution of a batch of samples to every feature:

$$f_{i, j}(X)=\sum_{(x, y) \in(X, Y)} \sum_{j^{\prime}} p\left(y_{j^{\prime}} \mid x\right) f_{i, j}\left(x_{i}, y_{j^{\prime}}\right)$$

  The complete algorithm assembles these pieces: compute the empirical feature counts once, then repeatedly compute the model's expected feature counts, take their difference as the gradient, and update $\lambda$.

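  Below is a minimal end-to-end sketch of that training loop under the indicator-feature setup (my own reconstruction from the steps above, not code from the original):

```python
import numpy as np

def train_max_ent(data, dim_x, dim_y, lr=0.1, iters=200):
    """Gradient ascent for the maximum entropy model with indicator
    features f_{i,j}(x, y) = 1 iff (x, y) == (i, j)."""
    lam = np.zeros((dim_x, dim_y))          # one weight per feature

    # Empirical feature counts: how often each (x, y) pair occurs.
    empirical = np.zeros((dim_x, dim_y))
    for x, y in data:
        empirical[x, y] += 1.0

    for _ in range(iters):
        # Expected feature counts under the current model.
        expected = np.zeros((dim_x, dim_y))
        for x, _ in data:
            scores = lam[x] - lam[x].max()  # score of every y given this x
            probs = np.exp(scores)
            probs /= probs.sum()            # p(y | x) = exp(lam[x, y]) / Z(x)
            expected[x] += probs
        lam += lr * (empirical - expected)  # gradient = empirical - expected
    return lam

# Tiny usage example: x in {0, 1}, y in {0, 1, 2}
data = [(0, 0), (0, 0), (0, 1), (1, 2), (1, 2), (1, 1)]
lam = train_max_ent(data, dim_x=2, dim_y=3)
probs = np.exp(lam[0] - lam[0].max())
print(probs / probs.sum())  # approaches the empirical p(y | x=0) = [2/3, 1/3, 0]
```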
