NNLM (Neural Network Language Model) is the language model proposed by Bengio et al. in the 2003 paper "A Neural Probabilistic Language Model".

Basic idea

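Assume that every word in the vocabulary is associated with a continuous feature vector.
Assume a continuous, smooth probability model that takes a sequence of word vectors as input and outputs the joint probability of that sequence:
$\hat{P}(w_1,w_2,...,w_T) = \prod_{t=1}^T\hat{P}(w_t|w_1,w_2,...,w_{t-1})$
$\hat{P}(w_t|w_1,w_2,...,w_{t-1}) \approx \hat{P}(w_t|w_{t-1},w_{t-2},...,w_{t-n+1})$
where $w_t$ denotes the $t$-th word.
The word-vector weights and the parameters of the n-gram probability model are learned simultaneously.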
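The n-gram truncation above means that each training example is a target word together with the n-1 words before it. A minimal sketch of that windowing follows; the helper name `context_target_pairs` and the toy sentence are illustrative assumptions, not taken from the paper.

```python
def context_target_pairs(tokens, n):
    """Return (context, target) pairs for an n-gram model.

    Each context holds the n-1 words preceding the target, i.e. the
    conditioning words (w_{t-n+1}, ..., w_{t-1}) for the target w_t.
    Positions with fewer than n-1 predecessors are skipped (no padding).
    """
    pairs = []
    for t in range(n - 1, len(tokens)):
        context = tuple(tokens[t - n + 1:t])
        pairs.append((context, tokens[t]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = context_target_pairs(sentence, n=4)
for context, target in pairs:
    print(context, "->", target)
```

With n = 4 each target is predicted from the three words before it, so the six-word sentence yields three training pairs.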

Network structure

[Figure: NNLM network architecture]

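The model uses a three-layer structure: the first layer is a projection (mapping) layer, the second a hidden layer, and the third an output layer. Viewed end to end, we feed in the one-hot vectors of the context words and want the conditional probability of the next word, so the network's job is to fit a function that maps one-hot vectors to that probability distribution. The network in the figure above is easier to understand when taken apart layer by layer: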

Projection layer

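First comes a linear projection layer. To compute the probability of $w_n$, the one-hot vectors of $w_1,w_2,...,w_{n-1}$ are fed in one after another, and each is multiplied by an embedding matrix $C_{m \times V}$, where
$m$ is the dimension of the embedding vectors,
$V$ is the size of the vocabulary,
and the matrix $C$ is itself learned. This step simply maps each one-hot vector to its word vector.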

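Example: given a text of N words and a vocabulary of size V

Word vector W: a one-hot vector of size [100,000 × 1]; W(t) denotes the one-hot vector of the t-th word

Embedding matrix C: dimension [m × V], with V = 100,000; Google chose m = 300 in its experiments

Computation: projection matrix C [300 × 100,000] times word vector W(t) [100,000 × 1] gives a [300 × 1] vector
For example, when the 4th word is predicted from the previous 3 words, this operation is repeated three times, giving three [300 × 1] vectors
Stacking these three [300 × 1] vectors vertically gives a [900 × 1] vector.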
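Multiplying C by a one-hot vector just selects one column of C, so in practice the product is replaced by a lookup. The sketch below checks this on toy sizes using only the standard library; the sizes, the word ids in `context`, and the helper names are illustrative assumptions.

```python
import random

V, m = 10, 4          # toy sizes (the text uses V = 100,000 and m = 300)
random.seed(0)
# C has shape [m x V]: column j is the feature vector of word j.
C = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(m)]


def one_hot(index, size):
    """One-hot vector (as a flat list) with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def embed(index):
    """Read column `index` of C directly, without any multiplication."""
    return [row[index] for row in C]


context = [3, 7, 1]   # hypothetical word ids of the 3 preceding words
projected = [matvec(C, one_hot(j, V)) for j in context]
assert projected == [embed(j) for j in context]

# Stack the three [m x 1] vectors into one [(3*m) x 1] input vector x.
x = [value for vector in projected for value in vector]
print(len(x))  # 12, i.e. 3 * m
```

The lookup costs m reads per word instead of m × V multiplications, which is why embedding layers are implemented as table lookups.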

Input layer & hidden layer

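Let $h$ be the number of hidden units.

The projection layer yields the input vector $x_{(n-1)m \times 1}$, the concatenation of the word vectors of the previous n-1 words
Input-to-hidden weight matrix (the hidden layer weights): $H_{h \times (n-1)m}$
Direct input-to-output weight matrix applied to the input vector $x$: $W_{V \times (n-1)m}$
Hidden-to-output weight matrix (the hidden-to-output weights): $U_{V \times h}$
Hidden-layer bias: $d_{h \times 1}$
Output-layer bias: $b_{V \times 1}$
Output: $y = b+Wx+U \tanh (d+Hx)$

When there are no direct connections from the input feature vectors to the output layer, the matrix $W$ is set to 0.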
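The forward computation $y = b + Wx + U\tanh(d + Hx)$ can be sketched on toy sizes as follows. This is a dependency-free illustration, not the paper's implementation: the sizes, the random initialization range, and the `direct` flag are assumptions, and $U$ is taken as $V \times h$ so that $y$ has one score per vocabulary word.

```python
import math
import random

random.seed(0)
V, m, h, n = 8, 3, 5, 4        # toy sizes; n - 1 = 3 context words
in_dim = (n - 1) * m


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


H = rand_matrix(h, in_dim)     # input-to-hidden weights, h x (n-1)m
d = [0.0] * h                  # hidden bias, h x 1
U = rand_matrix(V, h)          # hidden-to-output weights, V x h
W = rand_matrix(V, in_dim)     # direct input-to-output weights, V x (n-1)m
b = [0.0] * V                  # output bias, V x 1


def forward(x, direct=True):
    """Unnormalized scores y = b + W x + U tanh(d + H x)."""
    hidden = [math.tanh(di + zi) for di, zi in zip(d, matvec(H, x))]
    y = [bi + ui for bi, ui in zip(b, matvec(U, hidden))]
    if direct:                 # direct=False is the same as setting W to zero
        y = [yi + wi for yi, wi in zip(y, matvec(W, x))]
    return y


x = [random.uniform(-1, 1) for _ in range(in_dim)]
y = forward(x)
print(len(y))  # one score per vocabulary word
```

Calling `forward(x, direct=False)` drops the $Wx$ term, which corresponds to the no-direct-connection variant.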

Output layer

Once $y$ has been computed, a softmax over the $V$ vocabulary words turns the scores into probabilities:
$\hat{P}(w_t|w_{t-1},w_{t-2},...,w_{t-n+1}) = {e^{y_{w_t}} \over \sum_{i=1}^{V} e^{y_i}}$
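A small sketch of the softmax step, assuming the scores $y$ are held in a plain list. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result; the example scores are made up.

```python
import math


def softmax(scores):
    """Turn scores y into probabilities; subtracting max(y) avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


y = [2.0, 1.0, 0.1, -1.0]      # illustrative scores for a 4-word vocabulary
probs = softmax(y)
print(probs)
print(sum(probs))              # sums to 1 up to rounding
```

The normalizing sum runs over the whole vocabulary, which is the main computational cost of this model when $V$ is large.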

Loss function & backpropagation

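The training objective is the regularized average log-likelihood, which is maximized (its negative serves as the loss):
$L = {1 \over T} \sum_{t} \log\hat{P}(w_t|w_{t-1},w_{t-2},...,w_{t-n+1}) + R(\theta)$
Parameters to update, $\theta$:
$\theta = (b,d,W,U,H,C)$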
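To get a feel for the size of $\theta$, the entries of each component can simply be counted from its shape. The sketch below takes $U$ as $V \times h$ (one row per vocabulary word); the function name is hypothetical, and the values of m and h are illustrative choices rather than settings quoted in these notes.

```python
def nnlm_parameter_count(V, m, h, n, direct=True):
    """Count free parameters of theta = (b, d, W, U, H, C)."""
    in_dim = (n - 1) * m
    counts = {
        "b": V,                 # output bias
        "d": h,                 # hidden bias
        "W": V * in_dim if direct else 0,
        "U": V * h,
        "H": h * in_dim,
        "C": V * m,             # one m-dimensional vector per word
    }
    return counts, sum(counts.values())


# V is the value quoted in these notes; m, h, n are illustrative choices.
counts, total = nnlm_parameter_count(V=17964, m=100, h=60, n=5)
print(counts)
print(total)
```

With direct connections the total sums to $V(1 + nm + h) + h(1 + (n-1)m)$, so the count grows linearly in the vocabulary size $V$ and in the context length $n$.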
Backpropagation with stochastic gradient ascent on the log-likelihood:
$\theta \leftarrow \theta + \varepsilon {\partial \log\hat{P}(w_t|w_{t-1},w_{t-2},...,w_{t-n+1}) \over \partial \theta}$
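Backpropagation starts from the gradient of the log-probability with respect to the output scores, which for a softmax is $\partial \log\hat{P}(w_t) / \partial y_j = 1_{[j = w_t]} - p_j$. The sketch below checks that expression against finite differences and takes one ascent step; the scores, target index, and step size are made-up values.

```python
import math


def log_prob(scores, target):
    """log softmax(scores)[target], computed stably."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target] - log_norm


def grad_log_prob(scores, target):
    """d log P(target) / d scores_j = [j == target] - p_j."""
    return [(1.0 if j == target else 0.0) - math.exp(log_prob(scores, j))
            for j in range(len(scores))]


y = [0.5, -0.2, 1.3]            # illustrative scores
target = 1                      # hypothetical index of the observed word w_t
grad = grad_log_prob(y, target)

# Finite-difference check of the analytic gradient.
eps = 1e-6
for j in range(len(y)):
    bumped = list(y)
    bumped[j] += eps
    numeric = (log_prob(bumped, target) - log_prob(y, target)) / eps
    assert abs(numeric - grad[j]) < 1e-4

# One ascent step on the scores (equivalently on the output bias b).
step = 0.1
y_new = [yi + step * gi for yi, gi in zip(y, grad)]
print(log_prob(y, target), "->", log_prob(y_new, target))
```

Because $y = b + \dots$, the gradient with respect to $b$ equals the gradient with respect to $y$; gradients for $U$, $H$, $W$, $d$ and $C$ follow from it by the chain rule.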

Hyperparameters

The paper sets $V = 17964$

Initial learning rate $\varepsilon_0 = 10^{-3}$

Decaying learning rate $\varepsilon_t = {\varepsilon_0 \over 1+rt}$, where $t$ is the number of parameter updates done so far and $r$ is a heuristically chosen decrease factor, $r = 10^{-8}$
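The schedule is easy to tabulate. A minimal sketch, using the $\varepsilon_0$ and $r$ values quoted above as defaults; the function name and the sample values of $t$ are illustrative.

```python
def learning_rate(t, eps0=1e-3, r=1e-8):
    """epsilon_t = eps0 / (1 + r * t), with t the number of updates done."""
    return eps0 / (1.0 + r * t)


for t in (0, 10**6, 10**8, 10**9):
    print(t, learning_rate(t))
```

With $r = 10^{-8}$ the rate only halves after $10^8$ updates, so the decay is very gradual.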

The n-gram order is extended to n = 5

Questions

Rationale and significance of connecting the input layer directly to the output layer

The results do not allow to say whether the direct connections from input to output are useful or not, but suggest that on a smaller corpus at least, better generalization can be obtained without the direct input-to-output connections, at the cost of longer training: without direct connections the network took twice as much time to converge (20 epochs instead of 10), albeit to a slightly lower perplexity.
A reasonable interpretation is that direct input-to-output connections provide a bit more capacity and faster learning of the “linear” part of the mapping from word features to log-probabilities.

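In other words, the paper leaves open whether the direct connections help: on the smaller corpus the network without them generalized slightly better but needed 20 epochs instead of 10 to converge, and the authors suggest that the direct connections add some capacity and speed up learning of the "linear" part of the mapping.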

In short, compute was scarce at the time, and direct connections were a simple, blunt way to make training converge faster.

Initialization of the word-vector matrix C

Random initialization of the word features was done (similarly to initialization of neural network weights), but we suspect that better results might be obtained with a knowledge-based initialization.

The feature vectors associated with each word are learned, but they could be initialized using prior knowledge of semantic features.

Random initialization

To be added

Future work

Reproduction code
