NLP365 Day 113: NLP Papers Summary — Extractive and Abstractive Neural Document Summarisation
INSIDE AI NLP365
Project #NLP365 (+1) is where I document my NLP learning journey every single day in 2020. Feel free to check out what I have been learning over the last 257 days here. At the end of this article, you can find previous paper summaries grouped by NLP area :)
Today’s NLP paper is On Extractive and Abstractive Neural Document Summarization with Transformer Language Models. Below are the key takeaways of the research paper.
Objective and Contribution
Proposed an abstractive summarisation method for long documents. This is achieved through a two-step process of extractive and abstractive summarisation. The output of the extractive step is used to train the abstractive transformer language model, and this extractive step has been shown to be very important to the final summarisation results. In addition, the generated abstractive summaries are more abstractive than those of previous work that employed the copy mechanism, and they also yielded higher ROUGE scores. The contributions are:
- Demonstrated the effectiveness of transformer language models in summarising long scientific articles, outperforming Seq2Seq models
- The proposed model was able to produce more abstractive summaries than previous work and still achieve higher ROUGE scores
The human summarisation process
- Read and understand the source document
- Select the most important parts of the source document
- Paraphrase the key concepts in these important parts
- Generate a coherent and fluent output summary
Datasets
There are four different long document summarisation datasets:
- arXiv
- PubMed
- BigPatent
- Newsroom
Framework
The proposed framework is broken into two independent components:
- Extractive summarisation. A hierarchical document model that either copies or classifies sentences in the document to build the extractive summary
- Abstractive summarisation. The extractive summary, together with the document, is used to condition the transformer language model
Extractive summarisation
The extractive step involves sentence extraction using two different hierarchical document models: hierarchical seq2seq sentence pointer and sentence classifier. The goal is to filter out noisy sentences and extract important sentences to better train our transformer language model. The hierarchical seq2seq sentence pointer has an encoder-decoder architecture:
- The encoder is a bidirectional LSTM at both the word and sentence level (hierarchical)
- The decoder is an autoregressive LSTM
The hierarchical encoder combines the word-level and sentence-level bidirectional LSTMs. The token-level biLSTM encodes each sentence in the document to obtain sentence embeddings. The sentence-level biLSTM encodes these sentence embeddings to obtain the document representation. The decoder is an autoregressive LSTM that takes the hidden state of the previously extracted sentence as input and predicts the next sentence to be extracted.
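The pointer mechanism above can be sketched as follows. This is a minimal illustration under stated simplifications, not the paper's implementation: mean-pooling stands in for the token-level and sentence-level biLSTMs, and a plain dot product stands in for the decoder's learned scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 8  # embedding size (illustrative only)

def encode_sentence(token_embs):
    # Stand-in for the token-level biLSTM: mean-pool token embeddings
    # into a single sentence embedding.
    return token_embs.mean(axis=0)

def pointer_step(decoder_state, sent_reps, already_picked):
    # Score each sentence against the decoder state (dot-product scoring
    # stands in for learned attention) and pick the best unextracted one.
    scores = sent_reps @ decoder_state
    scores[list(already_picked)] = -np.inf
    return int(np.argmax(scores))

# A toy "document": 4 sentences with 5 token embeddings each.
doc = [rng.normal(size=(5, EMB)) for _ in range(4)]
sent_reps = np.stack([encode_sentence(s) for s in doc])

# Autoregressive extraction: the decoder input at each step is the
# representation of the previously extracted sentence (zeros at step 1).
picked, state = [], np.zeros(EMB)
for _ in range(2):  # extract a 2-sentence summary
    idx = pointer_step(state, sent_reps, picked)
    picked.append(idx)
    state = sent_reps[idx]

print(picked)  # indices of the extracted sentences, in extraction order
```

In the actual model the decoder state comes from an LSTM and the scores from a trained attention module; the skeleton above only shows the autoregressive pick-without-replacement loop.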
Similar to the pointer network, the sentence classifier uses a hierarchical LSTM to encode the document and produce a sequence of sentence embeddings. The final document representation is the average of these sentence embeddings. This document representation is concatenated to each sentence embedding and fed into a neural network with a sigmoid function to obtain the probability that each sentence is included in the extractive summary.
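A sketch of the sentence classifier, under the same caveat: random vectors stand in for the LSTM sentence embeddings, and a single untrained linear layer stands in for the paper's scoring network.

```python
import numpy as np

rng = np.random.default_rng(1)
EMB = 8  # embedding size (illustrative only)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sentence embeddings from the hierarchical encoder (random stand-ins).
sent_embs = rng.normal(size=(6, EMB))

# Final document representation: the average of the sentence embeddings.
doc_rep = sent_embs.mean(axis=0)

# Concatenate the document representation to each sentence embedding and
# score with a linear layer + sigmoid to get inclusion probabilities.
W = rng.normal(size=2 * EMB)
probs = np.array([sigmoid(np.concatenate([s, doc_rep]) @ W)
                  for s in sent_embs])

print(probs)  # per-sentence probability of being in the extractive summary
```

Unlike the pointer model, this classifier scores every sentence independently, so the summary is just the top-scoring sentences rather than an ordered autoregressive selection.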
Abstractive summarisation
We trained a single transformer language model from scratch using "formatted" data. The transformer language model is GPT-2. Language models are trained by factorising the joint distribution of words autoregressively. This inspires us to organise the training data in a format where we put the ground-truth summary after the information the model would normally use to generate summaries. In this way, we model the joint distribution of document and summary during training, and use the conditional distribution (given the document) to generate the summary at inference. The training data is therefore formatted into 4 different sections:
- Paper introduction. The assumption is that the introduction contains enough information to generate the abstract
- Extracted summary (from the extractive summarisation step)
- Abstract (the ground-truth summary)
- Rest of the paper. This serves to train the language model to understand the domain language
For some datasets, the introduction section would be the entire document, as there is no rest-of-the-paper section. The figure below shows the overall framework.
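The formatting scheme can be sketched as a simple function. The delimiter tokens below (`<intro>`, `<extract>`, `<abstract>`, `<rest>`) are placeholders of my own, not the paper's actual special tokens.

```python
def format_example(introduction, extracted_summary, abstract, rest_of_paper):
    """Assemble one training example in the paper's four-section order.

    The ground-truth abstract is placed AFTER the conditioning text
    (introduction + extracted summary), so at inference time the model
    is fed everything up to the <abstract> marker and generates from there.
    """
    parts = [
        "<intro>", introduction,
        "<extract>", extracted_summary,
        "<abstract>", abstract,
        "<rest>", rest_of_paper,
    ]
    return "\n".join(p for p in parts if p)

example = format_example(
    introduction="We study long-document summarisation ...",
    extracted_summary="Sentences picked by the extractive model ...",
    abstract="This paper proposes ...",
    rest_of_paper="2. Related work ...",
)
print(example)
```

Because a single language model sees all four sections during training, it learns the joint distribution of document and summary, while generation only ever conditions on the document-side sections.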
Results and Analysis
Tables 2 and 4 show that our extractive models outperformed all previous extractive baselines on both the arXiv and PubMed datasets. On the Newsroom dataset (table 6), our TLM outperformed the other abstractive model, Seq2Seq, by a massive margin, and also outperformed the pointer-generator network. However, the Exconsumm model dominates the extractive and mixed results.
The best-performing TLM (TLM-I+E (G,M)) outperformed previous abstractive results on most ROUGE metrics except ROUGE-L. We believe this might be because we don't have a copy mechanism in place, making it very challenging to get exact matches on large n-grams. The figure below supports this hypothesis, as the copy mechanism of the discourse-aware model can copy up to 25-grams from the source document. In addition, the figure also shows that our TLM generated more abstractive summaries than previous work, as indicated by the low percentage of n-gram overlap between generated summaries and source documents.
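Abstractiveness is typically measured by the fraction of summary n-grams that do not appear in the source document. A minimal sketch of such a novel n-gram metric (not the paper's exact evaluation code, and using naive whitespace tokenisation):

```python
def ngrams(tokens, n):
    # Set of all contiguous n-grams in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(summary, source, n):
    """Fraction of summary n-grams NOT found in the source.

    Higher = more abstractive (more paraphrasing);
    lower = more copying from the source.
    """
    summ = ngrams(summary.split(), n)
    if not summ:
        return 0.0
    src = ngrams(source.split(), n)
    return len(summ - src) / len(summ)

source = "the model copies long spans from the source document"
summary = "the model paraphrases the source document"
print(novel_ngram_ratio(summary, source, 2))  # → 0.4
```

A copy-mechanism model pushes this ratio towards 0 for large n (long spans copied verbatim), which is exactly the pattern the discourse-aware baseline shows and the TLM avoids.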
We also measure the upper-bound performance of our TLM (TLM-I+E (G,G)) by including the ground-truth extracted sentences in both training and testing. Lastly, the figure below shows qualitative examples of summaries generated by our TLM.
Conclusion and Future Work
The fluency and coherency of the generated summaries are strong. However, there remains the problem of abstractive summaries generating fabricated or inaccurate content. Potential future work could focus more on factual correctness and coherency when evaluating summarisation models.
Source:
[1] Subramanian, S., Li, R., Pilault, J. and Pal, C., 2019. On extractive and abstractive neural document summarization with transformer language models. arXiv preprint arXiv:1909.03186.
Originally published at https://ryanong.co.uk on April 22, 2020.