Statistical Parameter Estimation

In statistical decision making we assume the posterior probabilities are known, which is often unrealistic. In some cases, we instead assume the data approximately follow some distribution (most commonly a Gaussian) and estimate that distribution's parameters from the data.

Maximum Likelihood Estimation

We generally assume that what we observed is what was most likely to occur; intuitively, then, maximum likelihood estimation finds the parameters under which the assumed distribution best matches the observed data. Formally: choose θ such that:
$$ \hat{\theta}=\arg \max_{\theta}\, p(X \mid \theta) $$
The formula above covers the case of a single observation. With multiple (independent) observations, maximum likelihood maximizes the joint probability of all observations, i.e., the product of their individual probabilities:
$$ \hat{\theta}=\arg \max_{\theta}\left[\log \prod_{k=1}^{N} p\left(x^{(k)} \mid \theta\right)\right]=\arg \max_{\theta}\left[\sum_{k=1}^{N} \log p\left(x^{(k)} \mid \theta\right)\right] $$
Taking the logarithm turns the product into a sum, which makes differentiation much more convenient. Maximum likelihood estimation is thus, at heart, an optimization problem.
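As a concrete sketch (illustrative code with made-up data, not part of the original derivation), the following fits a univariate Gaussian by numerically minimizing the negative log-likelihood with `scipy.optimize.minimize`; the optimizer should land on the closed-form MLE answers, i.e., the sample mean and the 1/N-normalized sample standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # samples from N(2, 1.5^2)

def neg_log_likelihood(params, x):
    """Negative Gaussian log-likelihood; minimizing it maximizes the likelihood."""
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# The optimizer should agree with the closed-form MLE: sample mean and 1/N std.
print(mu_hat, data.mean())
print(sigma_hat, data.std())          # np.std defaults to the 1/N (MLE) normalization
```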

Example: the Multivariate Gaussian Distribution

For a multivariate Gaussian (a Gaussian distribution over several variables):
$$ p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) $$
where:
$$ \mu=\begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \vdots \\ \mu_{D} \end{bmatrix} $$

$$ \Sigma=\frac{1}{N} X^{T} X $$

are the mean vector and the covariance matrix, respectively (here $X$ is the $N \times D$ data matrix with the mean already subtracted from each row). Given $N$ observations, maximum likelihood estimates the mean and covariance so that:
$$ E(\theta)=\sum_{k=1}^{N} \log p\left(x_{k} \mid \theta\right)=\sum_{k=1}^{N}\left[-\frac{D}{2} \log (2 \pi)-\frac{1}{2} \log |\Sigma|-\frac{1}{2}\left(x_{k}-\mu\right)^{T} \Sigma^{-1}\left(x_{k}-\mu\right)\right] $$
is maximized, i.e.:
$$ \hat{\theta}=\arg \max_{\theta} E(\theta) $$
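Before differentiating, it can help to sanity-check the log-likelihood itself. A minimal sketch (with assumed example values for μ, Σ, and x) that evaluates the log of the density above and cross-checks it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of the multivariate Gaussian, following the formula above."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)          # numerically stable log|Sigma|
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])

print(gaussian_logpdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).logpdf(x))   # should print the same value
```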
1. First, estimate the mean by taking the partial derivative with respect to μ:
$$ \frac{\partial E(\theta)}{\partial \mu}=\sum_{k=1}^{N} \Sigma^{-1}\left(x_{k}-\mu\right)=0 \qquad \left(\text{using } \frac{\partial\left(X^{T} A X\right)}{\partial X}=\left(A+A^{T}\right) X \text{ and } \left(\Sigma^{-1}\right)^{T}=\Sigma^{-1}\right) $$

$$ \mu=\frac{1}{N} \sum_{k=1}^{N} x_{k} $$

2. Next, estimate the covariance by differentiating with respect to Σ (in fact, with respect to $\Sigma^{-1}$; this does not change the result but makes the derivation simpler):
$$ \frac{\partial E(\theta)}{\partial \Sigma^{-1}}=\frac{\partial}{\partial \Sigma^{-1}}\left(\sum_{k=1}^{N} \frac{1}{2} \log \left|\Sigma^{-1}\right|-\frac{1}{2}\left(x_{k}-\mu\right)^{T} \Sigma^{-1}\left(x_{k}-\mu\right)\right) $$
This expression is somewhat involved, so we handle its two terms separately.

First term:
For a symmetric matrix $A$:
$$ \left|A^{-1}\right|=|A|^{-1}, \qquad \frac{\partial \log |A|}{\partial A}=2 A^{-1}-\operatorname{Diag}\left(A^{-1}\right) $$

Hence:
$$ \frac{\partial \log \left|\Sigma^{-1}\right|}{\partial \Sigma^{-1}}=2 \Sigma-\operatorname{Diag}(\Sigma) $$

Second term:
For a symmetric matrix $A$:
$$ X^{T} A X=\operatorname{Tr}\left(A X X^{T}\right), \qquad \frac{\partial \operatorname{Tr}(A B)}{\partial A}=B+B^{T}-\operatorname{Diag}(B) $$
from which:
$$ \frac{\partial\left(X^{T} A X\right)}{\partial A}=2 X X^{T}-\operatorname{Diag}\left(X X^{T}\right) $$

Hence:
$$ \frac{\partial\left(x_{k}-\mu\right)^{T} \Sigma^{-1}\left(x_{k}-\mu\right)}{\partial \Sigma^{-1}}=2\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}-\operatorname{Diag}\left(\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}\right) $$

Combining the two terms:
$$ \begin{aligned} \frac{\partial E(\theta)}{\partial \Sigma^{-1}} &=\frac{1}{2} \sum_{k=1}^{N}\left[2 \Sigma-2\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}-\operatorname{Diag}(\Sigma)+\operatorname{Diag}\left(\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}\right)\right] \\ &=\frac{1}{2} \sum_{k=1}^{N}\left[2\left(\Sigma-\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}\right)-\operatorname{Diag}\left(\Sigma-\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}\right)\right]=0 \end{aligned} $$
(using the linearity of Diag: $\operatorname{Diag}(A)+\operatorname{Diag}(B)=\operatorname{Diag}(A+B)$)
Solving this gives:
$$ \sum_{k=1}^{N}\left[\Sigma-\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T}\right]=0 \;\Rightarrow\; \Sigma=\frac{1}{N} \sum_{k=1}^{N}\left(x_{k}-\mu\right)\left(x_{k}-\mu\right)^{T} $$
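Both closed-form solutions are easy to verify numerically. A minimal sketch (synthetic data, assumed true parameters) that applies the two formulas and checks the covariance against numpy's biased (1/N) estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=10_000)   # N x D data matrix

mu_hat = X.mean(axis=0)                           # (1/N) sum_k x_k
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)        # (1/N) sum_k (x_k - mu)(x_k - mu)^T

print(mu_hat)            # close to true_mu
print(Sigma_hat)         # close to true_Sigma
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))   # True
```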

Bias and Variance of Estimators

Parts of this section draw on the Zhihu article by @大海啊你全是水.

Bias of an Estimator

Bias is the gap between the expected value of the estimator and the true value:
$$ \operatorname{bias}\left(\hat{\theta}_{m}\right)=\mathbb{E}\left(\hat{\theta}_{m}\right)-\theta $$
Take the Gaussian distribution as an example. Its maximum likelihood parameter estimates are:
$$ \hat{\mu}_{m}=\bar{x}, \qquad \hat{\sigma}_{m}^{2}=\frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)}-\hat{\mu}_{m}\right)^{2} $$
For the derivation, see the blog post by @我是8位的.

The bias of the mean estimate is:
$$ \begin{aligned} \operatorname{bias}\left(\hat{\mu}_{m}\right) &=\mathbb{E}\left(\hat{\mu}_{m}\right)-\mu \\ &=\mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right]-\mu \\ &=\left(\frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[x^{(i)}\right]\right)-\mu \\ &=\left(\frac{1}{m} \sum_{i=1}^{m} \mu\right)-\mu \\ &=\mu-\mu=0 \end{aligned} $$
So the mean estimator is unbiased. The bias of the variance estimate is:
$$ \operatorname{bias}\left(\hat{\sigma}_{m}^{2}\right)=\mathbb{E}\left[\hat{\sigma}_{m}^{2}\right]-\sigma^{2}=\frac{m-1}{m} \sigma^{2}-\sigma^{2}=-\frac{\sigma^{2}}{m} $$
The expectation here is computed as follows:
$$ \begin{aligned} \mathbb{E}\left[\hat{\sigma}_{m}^{2}\right] &=\mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)}-\hat{\mu}_{m}\right)^{2}\right] \\ &=\mathbb{E}\left\{\frac{1}{m} \sum_{i=1}^{m}\left[\left(x^{(i)}\right)^{2}-2 x^{(i)} \hat{\mu}_{m}+\hat{\mu}_{m}^{2}\right]\right\} \\ &=\mathbb{E}\left\{\frac{1}{m} \sum_{i=1}^{m}\left[\left(x^{(i)}\right)^{2}\right]-\frac{2 \hat{\mu}_{m}}{m} \sum_{i=1}^{m}\left[x^{(i)}\right]+\frac{1}{m} \sum_{i=1}^{m}\left(\hat{\mu}_{m}^{2}\right)\right\} \\ &=\mathbb{E}\left\{\frac{1}{m} \sum_{i=1}^{m}\left[\left(x^{(i)}\right)^{2}\right]-2 \hat{\mu}_{m}^{2}+\hat{\mu}_{m}^{2}\right\} \\ &=\mathbb{E}\left\{\frac{1}{m} \sum_{i=1}^{m}\left[\left(x^{(i)}\right)^{2}\right]-\hat{\mu}_{m}^{2}\right\} \\ &=\frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[\left(x^{(i)}\right)^{2}\right]-\mathbb{E}\left(\hat{\mu}_{m}^{2}\right) \end{aligned} $$

By the definition of variance:
$$ \operatorname{Var}(x)=\mathbb{E}\left\{[x-\mathbb{E}(x)]^{2}\right\}=\mathbb{E}\left\{x^{2}-2 x \mathbb{E}(x)+[\mathbb{E}(x)]^{2}\right\}=\mathbb{E}\left(x^{2}\right)-[\mathbb{E}(x)]^{2} $$

$$ \mathbb{E}\left(x^{2}\right)=\operatorname{Var}(x)+[\mathbb{E}(x)]^{2} $$

Applying this identity:
$$ \begin{aligned} \mathbb{E}\left\{\left[x^{(i)}\right]^{2}\right\} &=\operatorname{Var}\left[x^{(i)}\right]+\left\{\mathbb{E}\left[x^{(i)}\right]\right\}^{2}=\sigma^{2}+\mu^{2} \\ \mathbb{E}\left(\hat{\mu}_{m}^{2}\right) &=\operatorname{Var}\left[\hat{\mu}_{m}\right]+\left[\mathbb{E}\left(\hat{\mu}_{m}\right)\right]^{2}=\operatorname{Var}\left\{\frac{1}{m} \sum_{i=1}^{m}\left[x^{(i)}\right]\right\}+\mu^{2} \\ &=\frac{1}{m^{2}} \sum_{i=1}^{m} \operatorname{Var}\left[x^{(i)}\right]+\mu^{2} \\ &=\frac{\sigma^{2}}{m}+\mu^{2} \end{aligned} $$
Finally, the expectation of the variance estimator is:
$$ \mathbb{E}\left[\hat{\sigma}_{m}^{2}\right]=\sigma^{2}+\mu^{2}-\frac{\sigma^{2}}{m}-\mu^{2}=\frac{m-1}{m} \sigma^{2} $$
So the MLE variance estimator is biased. Rescaling by $m/(m-1)$, i.e., dividing by $m-1$ instead of $m$, yields the familiar unbiased sample variance.
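A short simulation (with assumed values for σ² and m) makes this bias visible: averaged over many repeated samples, the 1/m estimator comes out near $\frac{m-1}{m}\sigma^{2}$, while the 1/(m−1) estimator recovers $\sigma^{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 4.0, 10, 100_000

# Draw many independent samples of size m and estimate the variance each time.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
var_mle = samples.var(axis=1, ddof=0)        # divide by m     (biased MLE)
var_unbiased = samples.var(axis=1, ddof=1)   # divide by m - 1 (unbiased)

print(var_mle.mean())        # ~ (m - 1) / m * sigma2 = 3.6
print(var_unbiased.mean())   # ~ sigma2 = 4.0
```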

Variance of an Estimator

The variance of an estimator is, as the name suggests:
$$ \operatorname{Var}(\hat{\theta}) $$
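For the sample mean, the derivation above gave $\operatorname{Var}(\hat{\mu}_{m})=\sigma^{2}/m$; a quick simulation (assumed numbers) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 4.0, 10, 100_000

# Repeatedly draw samples of size m and record the sample mean each time.
mu_hats = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m)).mean(axis=1)
print(mu_hats.var())         # ~ sigma2 / m = 0.4
```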
In most cases, bias and variance cannot both be minimized at once, much as family and career compete for one's time:
[Figure: the bias–variance tradeoff]

Bayesian Estimation

Maximum likelihood estimation treats the parameter θ as a fixed but unknown quantity, whereas Bayesian estimation treats θ as a random variable. We may have a rough idea of how θ is distributed, i.e., a prior p(θ); the observed samples then let us derive the posterior density p(θ|X) of θ.
Below we work through the simple univariate Gaussian as an example.

First, the Gaussian density:
$$ f(x)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right) $$
Suppose we observe the following samples, drawn from a Gaussian whose variance is known, and we want to estimate the mean μ:
$$ X=\left\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\right\} $$
and suppose we roughly know that μ itself follows a Gaussian distribution (the prior):
$$ p_{0}(\theta)=\frac{1}{\sqrt{2 \pi} \sigma_{0}} \exp \left(-\frac{1}{2 \sigma_{0}^{2}}\left(\theta-\mu_{0}\right)^{2}\right) $$
Bayes' rule then gives the posterior density p(θ|X) of μ:
$$ \begin{aligned} p(\theta \mid X) &=\frac{p(X \mid \theta)\, p_{0}(\theta)}{p(X)}=\frac{p_{0}(\theta)}{p(X)} \prod_{k=1}^{N} p\left(x^{(k)} \mid \theta\right) \\ &=\frac{1}{\sqrt{2 \pi}\, \sigma_{0}} \exp \left(-\frac{1}{2 \sigma_{0}^{2}}\left(\theta-\mu_{0}\right)^{2}\right) \frac{1}{p(X)} \prod_{k=1}^{N}\left[\frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x^{(k)}-\theta\right)^{2}\right)\right] \end{aligned} $$
When choosing θ, we generally pick the value with the largest posterior probability, i.e.:
$$ \hat{\theta}=\arg \max_{\theta}\, p(\theta \mid X) $$
The parameter θ is then found by solving this optimization problem. Continuing the example, take the logarithm of p(θ|X) and differentiate (the log avoids differentiating a long product; here θ is just the mean μ):
$$ \frac{\partial}{\partial \theta} \log p(\theta \mid X)=\frac{\partial}{\partial \mu}\left(-\frac{1}{2 \sigma_{0}^{2}}\left(\mu-\mu_{0}\right)^{2}+\sum_{k=1}^{N} \frac{-1}{2 \sigma^{2}}\left(x^{(k)}-\mu\right)^{2}\right)=0 $$
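Setting this derivative to zero and solving for μ (filling in the algebra that leads to the result below):

$$ \begin{aligned} -\frac{1}{\sigma_{0}^{2}}\left(\mu-\mu_{0}\right)+\frac{1}{\sigma^{2}} \sum_{k=1}^{N}\left(x^{(k)}-\mu\right) &=0 \\ \left(\frac{1}{\sigma_{0}^{2}}+\frac{N}{\sigma^{2}}\right) \mu &=\frac{\mu_{0}}{\sigma_{0}^{2}}+\frac{1}{\sigma^{2}} \sum_{k=1}^{N} x^{(k)} \\ \mu &=\frac{\sigma^{2} \mu_{0}+\sigma_{0}^{2} \sum_{k=1}^{N} x^{(k)}}{\sigma^{2}+N \sigma_{0}^{2}} \end{aligned} $$

Regrouping the last expression gives the prior/likelihood decomposition: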

$$ \mu_{N}=\underbrace{\frac{\sigma^{2}}{\sigma^{2}+N \sigma_{0}^{2}}\, \mu_{0}}_{\text{prior}}+\underbrace{\frac{N \sigma_{0}^{2}}{\sigma^{2}+N \sigma_{0}^{2}} \cdot \frac{1}{N} \sum_{k=1}^{N} x^{(k)}}_{\text{maximum likelihood}} $$

Here the first term is the prior contribution and the second the likelihood contribution. When there are few observations, the prior dominates the estimate; as the number of observations grows, the likelihood term takes over.
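To close the example, a minimal sketch (assumed prior and noise parameters; `posterior_mean` is just an illustrative helper name) that implements μ_N and shows the estimate moving from the prior mean toward the sample mean as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                    # known standard deviation of the likelihood
mu0, sigma0 = 0.0, 1.0         # Gaussian prior on the unknown mean
true_mu = 3.0

def posterior_mean(x, sigma, mu0, sigma0):
    """mu_N from the formula above: a weighted blend of prior mean and sample mean."""
    N = len(x)
    w_prior = sigma**2 / (sigma**2 + N * sigma0**2)
    w_likelihood = N * sigma0**2 / (sigma**2 + N * sigma0**2)
    return w_prior * mu0 + w_likelihood * x.mean()

for N in (1, 10, 1000):
    x = rng.normal(true_mu, sigma, size=N)
    print(N, posterior_mean(x, sigma, mu0, sigma0))
# With N = 1 the estimate stays near the prior mean mu0 = 0;
# as N grows it converges to the sample mean (about 3.0).
```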