Bidirectional Attention Flow for Machine Comprehension

Paper and code: code 1, PyTorch implementation

Model architecture

The model consists of six main layers, as shown in the figure below:

Character Embedding Layer

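A character-level convolutional neural network (character-level CNN) maps each word to a high-dimensional vector space. This network was proposed by Kim in 2014.

Let the words of the context paragraph be $\{ x_1, x_2, \dots , x_T \}$ and the words of the query be $\{ q_1, q_2, \dots , q_J \}$. A character-level CNN produces a character-based embedding for each word. A character embedding layer first maps every character to a vector whose dimension equals the number of CNN input channels. The CNN then convolves over the character sequence, and max-pooling over the whole word reduces the CNN output to a word vector of fixed size, regardless of word length.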
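The character pipeline described above (embed characters, convolve, max-pool over the word) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `char_cnn_word_vector`, the zero-padding of short words, and the toy sizes are all assumptions.

```python
import numpy as np

def char_cnn_word_vector(char_ids, char_emb, filters):
    """Embed one word from its characters with a 1-D CNN and max-over-time pooling.

    char_ids: list of character indices for one word
    char_emb: (vocab_size, c_in) character embedding table
    filters:  (n_filters, width, c_in) convolution filters
    """
    n_filters, width, c_in = filters.shape
    x = char_emb[char_ids]                       # (word_len, c_in)
    # Assumption: zero-pad short words so that at least one window fits.
    if x.shape[0] < width:
        pad = np.zeros((width - x.shape[0], c_in))
        x = np.vstack([x, pad])
    n_windows = x.shape[0] - width + 1
    feats = np.empty((n_windows, n_filters))
    for i in range(n_windows):
        window = x[i:i + width]                  # (width, c_in)
        # Dot product of every filter with the current window.
        feats[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return feats.max(axis=0)                     # (n_filters,)

# Toy sizes (illustrative): 30 characters, 8 input channels, 100 filters of width 5.
rng = np.random.default_rng(0)
char_emb = rng.normal(size=(30, 8))
filters = rng.normal(size=(100, 5, 8))
short_word = char_cnn_word_vector([1, 2, 3], char_emb, filters)
long_word = char_cnn_word_vector(list(range(12)), char_emb, filters)
print(short_word.shape, long_word.shape)
```

Both words end up with a vector of length 100, which is the point of the max-pooling step: the output size depends only on the number of filters.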

Word Embedding Layer

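Pre-trained word vectors (GloVe) map each word to a fixed-size high-dimensional vector.

The word vector produced by the character-level CNN is concatenated with the word embedding and passed through a highway network, giving the context matrix $X \in \mathbb{R}^{d \times T}$ and the query matrix $Q \in \mathbb{R}^{d \times J}$.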
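For readers unfamiliar with highway networks, a single highway layer can be sketched as below. The notes do not give the layer's details, so the ReLU nonlinearity, the use of one layer, and all parameter names here are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer applied to every column of x.

    x: (d, T) matrix whose columns are word vectors.
    The gate t decides, per dimension, how much of the transformed
    value h replaces the original input x.
    """
    h = np.maximum(0.0, W_h @ x + b_h)   # candidate transform (ReLU is an assumption)
    t = sigmoid(W_t @ x + b_t)           # transform gate in (0, 1)
    return t * h + (1.0 - t) * x         # gated mix of transform and carry

# Toy sizes (illustrative): d = 4, T = 6.
rng = np.random.default_rng(0)
d, T = 4, 6
x = rng.normal(size=(d, T))
W_h, W_t = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_h, b_t = np.zeros((d, 1)), np.zeros((d, 1))
out = highway_layer(x, W_h, b_h, W_t, b_t)
print(out.shape)  # (4, 6)
```

The output has the same shape as the input, so the layer can be stacked, and a strongly negative gate bias makes the layer pass its input through almost unchanged.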

Contextual Embedding Layer

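The highway network outputs $X$ and $Q$ are each encoded with a Bi-LSTM, which captures the local relationships among the words within $X$ and within $Q$. The forward and backward LSTM outputs are concatenated, giving $H \in \mathbb{R}^{2d \times T}$ for $X$ and $U \in \mathbb{R}^{2d \times J}$ for $Q$.

These first three layers capture features of the query and the context at different granularities (character, word, phrase).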

Attention Flow Layer

This layer models the relationship between the context paragraph and the query. It combines the query and context vectors to produce a set of query-aware feature vectors.

First, $H$ and $U$ are used to compute the similarity matrix $S \in \mathbb{R}^{T \times J}$ between the context and the query:
$\begin{aligned} S_{t j} &=\alpha\left(H_{: t}, U_{: j}\right) \\ \alpha(h, u) &=w_{(S)}^{T}[h ; u ; h \odot u] \end{aligned}$
Here $S_{tj}$ is the similarity between the $t$-th context word and the $j$-th query word. $H_{:t} \in \mathbb{R}^{2d}$ is the $t$-th column of $H$, and $U_{:j} \in \mathbb{R}^{2d}$ is the $j$-th column of $U$. $\alpha (h, u)$ is a trainable scalar function of $h$ and $u$ that computes the score, with parameter vector $w_{(S)} \in \mathbb{R}^{6d}$. $\odot$ denotes element-wise multiplication and $[ ; ]$ denotes vector concatenation.
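The definition of $S$ can be written out directly in NumPy. This is a deliberately unoptimized sketch that follows the formula term by term; the function name and the toy sizes are assumptions.

```python
import numpy as np

def similarity_matrix(H, U, w_s):
    """Compute S[t, j] = w_s . [h; u; h * u] for every context/query word pair.

    H: (2d, T) context encodings, U: (2d, J) query encodings, w_s: (6d,).
    """
    T, J = H.shape[1], U.shape[1]
    S = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            S[t, j] = w_s @ np.concatenate([h, u, h * u])
    return S

# Toy sizes (illustrative): d = 2, T = 5 context words, J = 3 query words.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
U = rng.normal(size=(4, 3))
w_s = rng.normal(size=12)
S = similarity_matrix(H, U, w_s)
print(S.shape)  # (5, 3)
```

Because $w_{(S)}$ splits into three blocks of size $2d$, the double loop can be replaced by three matrix products in a real implementation; the loop form is kept here only because it mirrors the equation.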

Context-to-query attention (C2Q):

C2Q computes the query information received by each context word, that is, which query words are most relevant to each context word. Each row of the similarity matrix $S$ is normalized with a softmax, and the resulting weights are used to take a weighted sum of the query vectors, giving the C2Q matrix $\hat{U}$:
$\begin{array}{c}{a_{t}=\operatorname{softmax}\left(S_{t:}\right)} \\ {\hat{U}_{: t}=\sum_{j} a_{t j} U_{: j}}\end{array}$
Here $a_t \in \mathbb{R}^{J}$ holds the normalized similarities between the $t$-th context word and every query word, and $\hat{U}_{:t} \in \mathbb{R}^{2d}$ is the attention-weighted sum of all query vectors for the $t$-th context word, so $\hat{U} \in \mathbb{R}^{2d \times T}$.
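A short NumPy sketch of the C2Q step, assuming $S$ has already been computed. The function names are illustrative; the example uses an all-zero $S$ so that every query word gets equal weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def c2q_attention(S, U):
    """S: (T, J) similarity matrix, U: (2d, J) query encodings -> U_hat: (2d, T)."""
    A = softmax(S, axis=1)    # (T, J); row t is a_t and sums to 1
    return U @ A.T            # column t is sum_j a_tj * U[:, j]

# With S = 0 the attention is uniform, so each column of U_hat is the mean query vector.
S = np.zeros((5, 3))
U = np.arange(12, dtype=float).reshape(4, 3)
U_hat = c2q_attention(S, U)
print(U_hat.shape)  # (4, 5)
```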

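Query-to-context attention (Q2C):
Q2C computes which context words are most similar to some query word and are therefore important for answering the question. For each context word, take the maximum of its row of $S$ over all query words (written $\max_{col}$ below), normalize these $T$ maxima with a softmax, and use the result to take a weighted sum of the context vectors. The resulting vector $\hat{h}$ is then tiled $T$ times to give $\hat{H} \in \mathbb{R}^{2d \times T}$.
$\begin{array}{c}{b=\operatorname{softmax}\left(\max _{c o l}(S)\right)} \\ {\hat{h}=\sum_{t} b_{t} H_{: t} \in \mathbb{R}^{2 d}}\end{array}$
Here $b \in \mathbb{R}^{T}$ and $\hat{h} \in \mathbb{R}^{2d}$.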
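The Q2C step can be sketched the same way. Note that the maximum is taken over the query dimension, so that $b$ has one entry per context word. The function names and the toy input are assumptions; the example makes one context word dominate so the effect is easy to see.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def q2c_attention(S, H):
    """S: (T, J) similarity matrix, H: (2d, T) context encodings -> H_hat: (2d, T)."""
    b = softmax(S.max(axis=1))                        # (T,): one weight per context word
    h_hat = H @ b                                     # (2d,): weighted sum of context vectors
    return np.tile(h_hat[:, None], (1, H.shape[1]))   # repeat the same column T times

# Context word 1 matches query word 0 very strongly, so it receives almost all the weight.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
S = np.zeros((5, 3))
S[1, 0] = 1000.0
H_hat = q2c_attention(S, H)
print(H_hat.shape)  # (4, 5)
```

Every column of `H_hat` is identical, which is exactly what the tiling step in the text describes.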

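$\hat{U} \in \mathbb{R}^{2d \times T}$ and $\hat{H} \in \mathbb{R}^{2d \times T}$ have the same shape as $H \in \mathbb{R}^{2d \times T}$. The three matrices are combined column by column to form $G$:
$\mathbf{G}_{: t}=\beta\left(\mathbf{H}_{: t}, \hat{\mathbf{U}}_{: t}, \hat{\mathbf{H}}_{: t}\right) \in \mathbb{R}^{d_{\mathrm{G}}}$
Here $\beta$ can be any trainable function, such as a multilayer perceptron; the paper uses a simple concatenation:
$\beta(h, \hat{u}, \hat{h})=[h ; \hat{u} ; h \odot \hat{u} ; h \odot \hat{h}] \in \mathbb{R}^{8 d}$
so $d_G = 8d$ and $G \in \mathbb{R}^{8d \times T}$.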
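With the concatenation form of $\beta$, building $G$ is a single stacking operation. A minimal sketch, with an assumed function name and toy sizes:

```python
import numpy as np

def query_aware_representation(H, U_hat, H_hat):
    """Stack [h; u_hat; h * u_hat; h * h_hat] for every context position.

    All three inputs have shape (2d, T); the result has shape (8d, T).
    """
    return np.concatenate([H, U_hat, H * U_hat, H * H_hat], axis=0)

# Toy sizes (illustrative): d = 2, T = 5.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
U_hat = rng.normal(size=(4, 5))
H_hat = rng.normal(size=(4, 5))
G = query_aware_representation(H, U_hat, H_hat)
print(G.shape)  # (16, 5)
```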

Modeling Layer

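$G$ is passed through a Bi-LSTM (two layers in the paper) to obtain $M \in \mathbb{R}^{2d \times T}$, which captures the interaction among the context words conditioned on the query. Each column of $M$ carries contextual information about the corresponding word with respect to the entire context and the query.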

Output Layer

This layer predicts the start position $p^1$ and the end position $p^2$ of the answer:
$\begin{aligned} p^{1} &=\operatorname{softmax}\left(w_{\left(p^{1}\right)}^{T}[G ; M]\right) \\ p^{2} &=\operatorname{softmax}\left(w_{\left(p^{2}\right)}^{T}\left[G ; M^{2}\right]\right) \end{aligned}$
Here $M^2 \in \mathbb{R}^{2d \times T}$ is obtained by passing the Modeling Layer output $M$ through one more Bi-LSTM, and the weight vectors $w_{(p^1)}$ and $w_{(p^2)}$ are in $\mathbb{R}^{10d}$.
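The two softmax formulas translate into a few lines of NumPy. The `best_span` helper is not part of the notes above: it is an assumed decoding step that picks the pair (start, end) with start <= end that maximizes $p^1_{start} \cdot p^2_{end}$, added only to show how the two distributions might be used together.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def span_distributions(G, M, M2, w_p1, w_p2):
    """G: (8d, T), M and M2: (2d, T), w_p1 and w_p2: (10d,) -> p1, p2 of shape (T,)."""
    p1 = softmax(w_p1 @ np.concatenate([G, M], axis=0))
    p2 = softmax(w_p2 @ np.concatenate([G, M2], axis=0))
    return p1, p2

def best_span(p1, p2):
    """Assumed decoding rule: maximize p1[start] * p2[end] subject to start <= end."""
    best, best_score, best_start = (0, 0), -1.0, 0
    for end in range(len(p1)):
        if p1[end] > p1[best_start]:
            best_start = end                  # best start among positions 0..end
        score = p1[best_start] * p2[end]
        if score > best_score:
            best_score, best = score, (best_start, end)
    return best

# Toy sizes (illustrative): d = 1, T = 6.
rng = np.random.default_rng(0)
G, M, M2 = rng.normal(size=(8, 6)), rng.normal(size=(2, 6)), rng.normal(size=(2, 6))
p1, p2 = span_distributions(G, M, M2, rng.normal(size=10), rng.normal(size=10))
print(best_span([0.1, 0.7, 0.2], [0.6, 0.1, 0.3]))  # (1, 2)
```

In the hand-made example, taking each argmax independently would give start 1 and end 0, which is not a valid span; the constrained search returns (1, 2) instead.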

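Loss function

The loss is the sum of the negative log probabilities of the true start and end positions, averaged over all examples:
$L(\theta)=-\frac{1}{N} \sum_{i}^{N}\left[\log \left(p_{y_{i}^{1}}^{1}\right)+\log \left(p_{y_{i}^{2}}^{2}\right)\right]$
Here $y^{1}_{i}$ and $y^{2}_{i}$ are the true start and end positions of the answer for the $i$-th example.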
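As a worked instance of the loss, the sketch below evaluates it on two hand-made examples. The function name `span_nll` and the batch layout (one row of probabilities per example) are assumptions.

```python
import numpy as np

def span_nll(p1, p2, y1, y2):
    """Average negative log-likelihood of the true start and end positions.

    p1, p2: (N, T) predicted start/end distributions, one row per example.
    y1, y2: (N,) integer arrays with the true start/end indices.
    """
    rows = np.arange(len(y1))
    return -np.mean(np.log(p1[rows, y1]) + np.log(p2[rows, y2]))

# Two examples over T = 2 positions.
p1 = np.array([[0.5, 0.5], [0.25, 0.75]])
p2 = np.array([[1.0, 0.0], [0.5, 0.5]])
y1 = np.array([0, 1])
y2 = np.array([0, 1])
loss = span_nll(p1, p2, y1, y2)
print(loss)  # -(log 0.5 + log 1.0 + log 0.75 + log 0.5) / 2
```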

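Experiments

Tokenizer: PTB Tokenizer
Character embedding: 100 filters, each of width 5
LSTM hidden size: 100, which is $d$ in the paper
Optimizer: AdaDelta
Minibatch size: 60
Initial learning rate: 0.5
Training length: 12 epochs
Dropout: rate 0.2, applied to the CNN, the LSTMs, and the linear transformation before the softmax
Moving averages: during training, moving averages of all model weights are maintained with an exponential decay rate of 0.999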

For more on the exponential moving average, see this [code tutorial](https://github.com/wuliytTaotao/tensorflow-tutorial/blob/master/Deep_Learning_with_TensorFlow/1.4.0/Chapter04/4. 滑动平均模型.ipynb)
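The moving-average rule itself is simple enough to show in plain Python. This is a generic sketch of an exponential moving average over named weights, assuming a fixed decay of 0.999; it is not taken from the paper's code or from the tutorial linked above, and `ema_update` is a hypothetical helper name.

```python
def ema_update(shadow, weights, decay=0.999):
    """Return updated moving averages: decay * shadow + (1 - decay) * current weight."""
    return {name: decay * shadow[name] + (1.0 - decay) * w
            for name, w in weights.items()}

# The shadow copy starts at the initial weight and drifts slowly toward new values.
shadow = {"w": 1.0}
for step_weights in ({"w": 0.0}, {"w": 0.0}):
    shadow = ema_update(shadow, step_weights)
print(shadow["w"])  # close to 0.999 ** 2
```

With a decay this close to 1, the averaged weights change much more slowly than the raw weights, which smooths out the noise of individual minibatch updates.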

Best results on SQuAD: dev set EM 72.6, F1 80.7; test set EM 68.0, F1 77.3.

Open questions

How does the "flow" show up in the model?
Where does the memory-less property show up?
