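Introduction

Siamese CBOW comes from a 2016 paper by Tom Kenter et al.: Siamese CBOW: Optimizing Word Embeddings for Sentence Representations.
The authors note that many current sentence representations are simply the average of the word vectors in the sentence. This works reasonably well, but nothing in that approach optimizes the vectors specifically for representing sentences.
The paper therefore proposes a way to train sentence vectors that borrows from the CBOW model: the context sentences around a target sentence are used to predict the target sentence.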

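Data Source

The Toronto Book Corpus is used. It contains more than 70 million preprocessed sentences, kept in their original running order. Sentences on either side of a paragraph boundary may not be coherent with each other, but at this data scale that noise can be ignored.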

Model

The model has four parts, described in turn below: the input, an embedding layer, a cosine layer, and a prediction layer. See the original paper for the architecture figure.

Input

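For each target sentence, its preceding and following sentences serve as positive examples, and a number of randomly drawn sentences (2, say) serve as negative examples. Tokens are at word level. The input is arranged as [sentence, pre-sentence, post-sentence, neg-sentence, neg-sentence].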
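As a rough illustration of that input layout, here is a minimal sketch that assembles one training example from an ordered list of sentences. The function name `build_example`, the whitespace tokenization, the lowercasing, and the choice to exclude the target's neighbours from the negative pool are all assumptions made for this example, not details taken from the paper.

```python
import random


def build_example(sentences, index, num_negatives=2, rng=None):
    """Return [target, previous, next, negative_1, ..., negative_k] as token lists."""
    if not 0 < index < len(sentences) - 1:
        raise ValueError("index must have both a previous and a next sentence")
    rng = rng or random.Random()
    # Assumption: negatives are drawn from every sentence except the
    # target and its two neighbours.
    excluded = {index - 1, index, index + 1}
    candidates = [i for i in range(len(sentences)) if i not in excluded]
    negative_ids = rng.sample(candidates, num_negatives)
    chosen = [index, index - 1, index + 1] + negative_ids
    # Assumption: word-level tokens obtained by lowercasing and splitting on whitespace.
    return [sentences[i].lower().split() for i in chosen]


corpus = [
    "the cat sat on the mat",
    "it purred quietly",
    "then it fell asleep",
    "stock prices rose sharply",
    "the river flooded the town",
    "she painted the old fence",
]
example = build_example(corpus, index=1, num_negatives=2, rng=random.Random(0))
for tokens in example:
    print(tokens)
```

The first three entries are always the target and its two neighbours; only the negatives depend on the random seed.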

Embedding Layer

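Each input sentence is looked up in the initialized Embedding layer to obtain its embedding matrix W (one row per word); the rows of that matrix are then averaged to give the sentence vector.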
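The lookup-and-average step can be sketched in a few lines of plain Python. The tiny two-dimensional embedding table and the name `sentence_vector` are made up for illustration; a real model would use a trainable embedding matrix with far more dimensions.

```python
def sentence_vector(tokens, embeddings):
    """Average the word vectors of `tokens` into a single sentence vector."""
    rows = [embeddings[token] for token in tokens]  # the sentence's embedding matrix W
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Hypothetical 2-dimensional embeddings, chosen only to make the arithmetic easy to follow.
toy_embeddings = {
    "good": [1.0, 2.0],
    "movie": [3.0, 4.0],
}
print(sentence_vector(["good", "movie"], toy_embeddings))
```

Because the sentence vector is just a mean, the only trainable parameters in this step are the word embeddings themselves.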

Cosine Layer

The cosine similarity between sentence and each of the other sentences is computed and used as the similarity between the sentences.
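A minimal sketch of this step, operating on sentence vectors represented as plain lists of floats. Returning 0.0 for a zero-length vector is an assumption made here to avoid a division by zero, not something the paper specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def similarities(target, others):
    """Cosine similarity of the target sentence vector to each other sentence vector."""
    return [cosine(target, other) for other in others]


print(similarities([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]))
```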

Prediction Layer

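The similarities are fed directly into a softmax to produce the final output:
$$p_\theta(s_i, s_j) = \frac{e^{\cos(s_i^\theta, s_j^\theta)}}{\sum_{s_x \in S_+ \cup S_-} e^{\cos(s_i^\theta, s_x^\theta)}}$$
where $s_x^\theta$ denotes the vector of sentence $s_x$ under the current model parameters $\theta$. Given a set of positive examples $S_+$ and a set of negative examples $S_-$ for the current input, the expected output takes the form:
$$p(s_i, s_j) = \begin{cases} \frac{1}{|S_+|}, & s_j \in S_+ \\ 0, & s_j \in S_- \end{cases}$$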
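To make the prediction step concrete, the sketch below applies a softmax to a list of cosine similarities, ordered as [pre-sentence, post-sentence, neg-sentence, neg-sentence]. The similarity values are invented for illustration, and subtracting the maximum before exponentiating is a standard numerical-stability habit rather than something the paper describes.

```python
import math


def softmax(scores):
    """Turn a list of similarity scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical cosine similarities: two positives followed by two negatives.
cosines = [0.8, 0.7, 0.1, -0.2]
probabilities = softmax(cosines)
print(probabilities)
```

Since cosine values lie in [-1, 1], the softmax here can never become very peaked; the probabilities stay fairly close to each other even for a well-trained model.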

Loss Function

The loss function is then the categorical cross-entropy:
$$L = -\sum_{s_j \in S_+ \cup S_-} p(s_i, s_j) \cdot \log p_\theta(s_i, s_j)$$
where $p(s_i, s_j)$ is the expected probability and $p_\theta(s_i, s_j)$ is the predicted probability.
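A small sketch of the cross-entropy computation for a single target sentence. It assumes the expected distribution spreads its mass evenly over the positive examples and gives zero to the negatives; the helper names, the predicted probabilities, and the small `eps` guard against `log(0)` are all choices made for this example.

```python
import math


def target_distribution(num_positive, num_negative):
    """Assumed target: equal probability on each positive, zero on each negative."""
    return [1.0 / num_positive] * num_positive + [0.0] * num_negative


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Categorical cross-entropy between the target and predicted distributions."""
    return -sum(
        p * math.log(q + eps)
        for p, q in zip(target_probs, predicted_probs)
        if p > 0.0  # terms with zero target probability contribute nothing
    )


expected = target_distribution(num_positive=2, num_negative=2)
predicted = [0.4, 0.4, 0.1, 0.1]  # hypothetical softmax output
loss = cross_entropy(expected, predicted)
print(loss)
```

Only the positive examples contribute terms to the sum, but the negatives still matter: they appear in the softmax denominator, so lowering their similarity raises the predicted probability of the positives.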

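Results

The authors compare the method on 20 datasets against a set of baselines and find that it improves on the baselines for most tasks.
They also analyze the effect of the relevant parameters; see the original paper for the results table and the details.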

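Conclusion

The paper builds a new method for sentence representations that is simple and effective, and that can be adapted with little effort to other kinds of data, such as similar-sentence detection.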

References

1. Tom Kenter et al., Siamese CBOW: Optimizing Word Embeddings for Sentence Representations, 2016.
