十分钟带你了解PCA

在我们之前分类器的讨论中，如SVM、贝叶斯判别等，都假定已给出了特征向量维数确定的样本集，其中各样本的每一维都是该样本的一个特征。然而不同的特征对于分类器设计的影响是不同的，如果将数目很多的测量值不做分析，全部直接用作分类特征，不但耗时，而且会影响到分类的效果，产生“特征维数灾难”问题。因此，我们需要对特征进行选择和提取，即“降维”。

简介

PCA，全名主成分分析（Principal Component Analysis），又称为K-L变换，是一种特征提取的方法。与简单地删掉某 $n - k$ 个特征不同，PCA将原来的特征做正交变换，获得的每个数据都是原来 $n$ 个数据的线性组合，然后从新的数据中选出少数几个，使其尽可能多地反映各类模式之间的差异。而这些特征间又尽可能相互独立，则比单纯的选择方法更灵活、更有效。而PCA就是一种适用于任意概率密度函数的正交变换。

PCA的离散展开式

设一连续的随机实函数 $\mathbf{x}(t), T_1 \le t \le T_2$ ，则 $\mathbf{x}(t)$ 可用已知的正交函数集 $\{\phi_j(t), j = 1, 2, \dots \}$ 的线性组合来展开，即

$\begin{aligned} \mathbf{x}(t) &= a_1\phi_1(t) + a_2\phi_2(t) + \cdots + a_j\phi_j(t) + \cdots \\ &= \sum_{j = 1}^{\infty}a_j\phi_j(t), \quad T_1 \le t \le T_2\\ \end{aligned}$

式中， $a_j$ 为展开式的随机系数， $\phi_j(t)$ 为一连续的正交函数，它应满足，

$\int_{T_1}^{T_2} \phi_i(t)\tilde{\phi}_j(t) = \left\{ \begin{aligned} &0, \quad if \ i \ne j, \\ &1, \quad if \ i == j. \\ \end{aligned} \right.$

其中， $\tilde{\phi}_j(t)$ 为 $\phi_j(t)$ 的共轭复数式。

将上式写成离散的正交函数形式，使连续随机函数 $\mathbf{x}(t)$ 和连续正交函数 $\phi_j(t)$ 在区间 $T_1 \le t \le T_2$ 内被等间隔采样为 $n$ 个离散点，即，

$\begin{aligned} \mathbf{x}(t) &\to \{\mathbf{x}(1), \mathbf{x}(2), \dots \mathbf{x}(n)\} \\ \phi_j(t) &\to \{\phi_j(1), \phi_j(2), \dots \phi_j(n)\} \\ \end{aligned}$

写成向量形式，

$\begin{aligned} \mathbf{x} &= (\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(n))^{T} \\ \phi_j &= (\phi_j(1), \phi_j(2), \dots, \phi_j(n))^{T} \\ \end{aligned}$

将原展开式取 $n$ 项近似，有，

$\mathbf{x} = \sum_{j = 1}^{n}a_j \phi_j = \Phi\mathbf{a}, \quad T_1 \le t \le T_2$

其中， $\mathbf{a}$ 为展开式中随机系数的向量形式，即

$\mathbf{a} = (a_1, a_2, \dots, a_n)^{T}$

$\Phi$ 是一 $n \times n$ 的矩阵，

$\Phi = (\phi_1, \phi_2, \dots, \phi_n) = \left[ \begin{aligned} &\phi_1(1) \quad &\phi_2(1) \quad &\dots \quad &\phi_n(1) \\ &\phi_1(2) \quad &\phi_2(2) \quad &\dots \quad &\phi_n(2) \\ &\dots \quad &\dots \quad &\dots \quad &\dots \\ &\phi_1(n) \quad &\phi_2(n) \quad &\dots \quad &\phi_n(n) \\ \end{aligned} \right]$

其中，每一列为正交函数集中的一个函数，小括号内的序号为正交函数的采样点次序。因此， $\Phi$ 实质上是由 $\phi_j$ 向量组成的正交变换矩阵，它将 $\mathbf{x}$ 变换成 $\mathbf{a}$ 。

正交向量集的确定

在前面的讨论中，我们讨论了PCA的离散展开式，其实际上就是将原样本 $\mathbf{x}$ 变换为 $\mathbf{a}$ 。而变换的重点则是正交向量集 $\Phi$ 的确定。

在直接讨论正交向量集 $\Phi$ 之前，我们不妨先看看其他基本参量。设随机向量 $\mathbf{x}$ 的总体自相关矩阵为 $R = E\{\mathbf{x}\mathbf{x}^{T}\}$ ，由

$\mathbf{x} = \sum_{j = 1}^{n}a_j \phi_j = \Phi\mathbf{a}, \quad T_1 \le t \le T_2$

将 $\mathbf{x} = \Phi\mathbf{a}$ 代入 $R = E\{\mathbf{x}\mathbf{x}^{T}\}$ ，得

$R = E\{\Phi\mathbf{a}\mathbf{a}^{T}\Phi^{T}\} = \Phi E\{\mathbf{a}\mathbf{a}^{T}\}\Phi^{T}$

因为我们希望向量 $\mathbf{a}$ 的各个不同分量应统计独立，即应使 $(a_1, a_2, \dots, a_j, \dots, a_n)$ 满足以下关系，

$E(a_ia_j) = \left\{ \begin{aligned} &\lambda_i, \quad &if \ i = j \\ &0, \quad &if \ i \ne j \\ \end{aligned} \right.$

写成矩阵形式，应使： $E\{a a^{T}\} = D_{\lambda}$ ，其中 $D_{\lambda}$ 为对角形矩阵，其互相关成分均为0，即，

$D_{\lambda} = \left[ \begin{aligned} &\lambda_1 \quad &0 \quad &\dots \quad &\dots \quad &0 \\ &0 \quad &\lambda_2 \quad &\dots \quad &\dots \quad &0 \\ &\dots \quad &\dots \quad &\dots \quad &\dots \quad &\dots \\ &0 \quad &0 \quad &\dots \quad &\dots \quad &\lambda_n \\ \end{aligned} \right]$

则，

$R = \Phi D_{\lambda}\Phi^{T}$

由于 $\Phi$ 中的各个向量 $\phi_j$ 都相互归一正交，故有，

$R\Phi = \Phi D_{\lambda} \Phi^{T} \Phi = \Phi D_{\lambda}$

其中， $\phi_j$ 向量对应为，

$R \phi_j = \lambda_j\phi_j$

由矩阵知识可以看出， $\lambda_j$ 是 $\mathbf{x}$ 的自相关矩阵 $R$ 的特征值， $\phi_j$ 是对应的特征向量。因为 $R$ 是实对称矩阵，其不同特征值对应的特征向量应正交，即，

$\phi_i^{T} \phi_j = \left\{ \begin{aligned} &0 \quad &if \ i \ne j \\ &1 \quad &if \ i = j \\ \end{aligned} \right.$

计算步骤

好了，罗里吧嗦说了这么多，大家可能会有疑问。PCA到底是什么？到底怎么做啊？

首先，PCA用于特征选择相当于一种线性变换，若从 $\Phi$ 这 $n$ 个特征向量中取出 $m$ 个组成变换矩阵 $\hat{\Phi}$ ，即

$\hat{\Phi} = (\phi_1, \phi_2, \dots, \phi_m), m < n$

此时， $\hat{\Phi}$ 是一个 $n \times m$ 维矩阵， $\mathbf{x}$ 是 $n$ 维向量，经过 $\hat{\Phi}^{T}\mathbf{x}$ 变换，即得到降维为 $m$ 的新向量。

因此，PCA的计算步骤如下，

计算整体样本的均值，并对全部样本减去均值，以使均值成为新座标轴的原点；
求随机向量 $\mathbf{x}$ 的自相关矩阵： $R = E\{\mathbf{x}\mathbf{x}^{T}\}$ ；
求出矩阵 $R$ 的特征值 $\lambda_j$ 和对应的特征向量 $\phi_j, j = 1, 2, \dots, n$ ，得矩阵
$\Phi = (\phi_1, \phi_2, \dots, \phi_n)$
从中选取按照特征值大小，降序选取前 $m$ 个特征向量 $\phi_j$ ，构成矩阵 $\hat{\Phi}$ ；
计算展开式系数
$\mathbf{a} = \hat{\Phi}^{T}\mathbf{x}$

十分钟带你了解PCA

有效性

好了，PCA的计算步骤我们清楚了。可是，它为什么有效呢？它是如何保证降维后的特征向量与原特征向量的误差尽可能地小的？不急，我们接下来就来证明这一点。

我们已知，对于 $\mathbf{x} = \sum_{j = 1}^{n}a_j\phi_j$ ，现取 $m$ 项，对略去的系数项用预先选定的常数 $b$ 代替，此时对 $\mathbf{x}$ 的估计值为，

$\hat{\mathbf{x}} = \sum_{j = 1}^{m}a_j\phi_j + \sum_{j = m + 1}^{n}b\phi_j$

则产生的误差为，

$\Delta{\mathbf{x}} = \mathbf{x} - \hat{\mathbf{x}} = \sum_{j = m + 1}^{n}(a_j - b)\phi_j$

则 $\Delta{\mathbf{x}}$ 的均方误差为，

$\bar{\varepsilon}^2 = E\{||\Delta \mathbf{x}||\}^2 = \sum_{i = m + 1}^{n} \{E(a_j - b)^2\}$

要使 $\bar{\varepsilon}^2$ 最小，则对 $b$ 的选择应满足，

$\frac{\partial}{\partial b}E(a_j - b)^2 = -2[E(a_j - b)] = 0$

因此， $b = E(a_j)$ ，即对省略掉的a中的分量，应使用它们的数学期望来代替，此时的误差为，

$\begin{aligned} \bar{\varepsilon}^2 &= \sum_{j = m + 1}^{n}E[(a_j - E\{a_j\})^{2}] \\ &= \sum_{j = m + 1}^{n}\phi_{j}^{T}E[(\mathbf{x} - E\{\mathbf{x}\})(\mathbf{x} - E\{\mathbf{x}\})^{T}]\phi_j \\ &= \sum_{j = m + 1}^{n}\phi_j^TC_{\mathbf{x}}\phi_j\\ \end{aligned}$

其中， $C_{\mathbf{x}}$ 为 $\mathbf{x}$ 的协方差矩阵。设 $\lambda_j$ 为 $C_{\mathbf{x}}$ 的第 $j$ 个特征值， $\phi_j$ 是与 $\lambda_j$ 对应的特征向量，则

$C_{\mathbf{x}}\phi_j = \lambda_j\phi_j$

由于

$\phi_j^{T}C_{\mathbf{x}}\phi_j = \lambda_j$

因此，

$\bar{\varepsilon}^2 = \sum_{j = m + 1}^{n}\phi_j^TC_{\mathbf{x}}\phi_j = \sum_{j = m + 1}^{n}\lambda_j$

由此可以看出， $\lambda_j$ 值越小，误差也越小。因此，我们也可以说对于PCA，最小化样本均方误差等价于最大化样本方差。

总结

PCA是在均方误差最小的意义下获得数据压缩（降维）的最佳变换，且不受模式分布的限制。对于一种类别的模式特征提取，它不存在特征分类问题，只是实现用低维的 $m$ 个特征来表示原来高维的 $n$ 个特征，使其误差最小，亦即使其整个模式分布结构尽可能保持不变。

参考文献

黄庆明，《第四章.ppt》

郭嘉丰，《无监督学习——维度约简》

十分钟带你了解PCA

简介

PCA的离散展开式

正交向量集的确定

计算步骤

有效性

总结

参考文献

相关推荐