Day 116 of NLP365: NLP Papers Summary – Data-driven Summarization of Scientific Articles


INSIDE AI NLP365

Project #NLP365 (+1) is where I document my NLP learning journey every single day in 2020. Feel free to check out what I have been learning over the last 257 days here. At the end of this article, you can find previous paper summaries grouped by NLP areas :)


Today’s NLP paper is Data-driven Summarization of Scientific Articles. Below are the key takeaways of the research paper.


Objective and Contribution

The paper created two multi-sentence summarisation datasets from scientific articles, title-abstract pairs (title-gen) and abstract-body pairs (abstract-gen), and applied a wide range of extractive and abstractive models to them. The title-gen dataset consists of 5 million biomedical papers, whereas the abstract-gen dataset consists of 900K papers. The analysis shows that scientific papers are suitable for data-driven summarisation.


What is data-driven summarisation?

It is a way of saying that the recent SOTA results of summarisation models rely heavily on large volumes of training data.


Datasets

The two evaluation datasets are title-gen and abstract-gen. Title-gen was constructed from MEDLINE and abstract-gen from PubMed. Title-gen pairs each paper's abstract with its title, whereas abstract-gen pairs the full body text (without tables and figures) with the abstract as the summary. The text processing pipeline is as follows (a minimal sketch of these steps appears after the list):


  1. Tokenisation and lowercasing
  2. Removal of URLs
  3. Numbers are replaced by a # token
  4. Only pairs with an abstract length of 150–370 tokens, a title length of 6–25 tokens, and a body length of 700–10000 tokens are kept
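
To make the pipeline concrete, below is a minimal Python sketch of the preprocessing and filtering steps. The token-length thresholds are the ones listed above; the regexes and the whitespace tokenisation are my own simplifying assumptions, not the authors' code.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # step 2: remove URLs
NUM_RE = re.compile(r"\d+(\.\d+)?")    # step 3: replace numbers with a # token

def preprocess(text):
    """Lowercase, strip URLs, replace numbers with #, then tokenise."""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = NUM_RE.sub("#", text)
    return text.split()  # naive whitespace tokenisation, for illustration only

def keep_pair(title_tokens, abstract_tokens, body_tokens):
    """Step 4: apply the token-length filters reported in the paper."""
    return (6 <= len(title_tokens) <= 25
            and 150 <= len(abstract_tokens) <= 370
            and 700 <= len(body_tokens) <= 10000)
```

For title-gen only the title and abstract filters are relevant; for abstract-gen, the abstract and body ones.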

We also computed the Overlap score and Repeat score for each data pair. The Overlap score measures the overlapping tokens between the summary (title or abstract) and the input text (abstract or full body). The Repeat score measures the average overlap of each sentence in a text with the remainder of that text; it captures the repetitive content in a paper's body, where the same concepts are repeated over and over again. Below are the summary statistics of both datasets.

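The paper defines these scores precisely; as a rough illustration of the idea only (under my own assumption of a simple normalised token overlap, which may differ from the authors' exact formulas), they could be computed like this:

```python
def overlap_score(summary_tokens, source_tokens):
    """Fraction of distinct summary tokens that also appear in the source text.
    Illustrative only; the paper's exact definition may differ."""
    summary_vocab = set(summary_tokens)
    if not summary_vocab:
        return 0.0
    return len(summary_vocab & set(source_tokens)) / len(summary_vocab)

def repeat_score(sentences):
    """Average overlap of each sentence (a list of tokens) with the rest of the text."""
    if len(sentences) < 2:
        return 0.0
    scores = []
    for i, sentence in enumerate(sentences):
        rest = [tok for j, other in enumerate(sentences) if j != i for tok in other]
        scores.append(overlap_score(sentence, rest))
    return sum(scores) / len(scores)
```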

[Figure: Statistics of the title-gen and abstract-gen datasets [1]]

Experimental Setup and Results

Model Comparison

  1. Extractive summarisation methods. There are two unsupervised baselines: TFIDF-emb and rwmd-rank. TFIDF-emb creates a sentence representation by computing a weighted sum of its constituent word embeddings (a rough sketch of this idea follows the list). Rwmd-rank ranks sentences by how similar each sentence is to all the other sentences in the document. RWMD stands for Relaxed Word Mover's Distance, the measure used to compute sentence similarity; LexRank is then used to rank the sentences.


  2. Abstractive summarisation methods. There are three baselines: lstm, fconv, and c2c. Lstm is the standard LSTM encoder-decoder model with a word-level attention mechanism. Fconv is a CNN encoder-decoder operating at the subword level, splitting words into smaller units using byte-pair encoding (BPE). C2c is a character-level encoder-decoder model: it builds character-level representations of the input using a CNN and feeds them into an LSTM encoder-decoder (a minimal sketch of this encoder also follows the list). Character-level models are good at handling rare and out-of-vocabulary (OOV) words.

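As an illustration of the TFIDF-emb idea mentioned above, here is a sketch under my own assumptions, not the authors' implementation: each sentence is represented as the TF-IDF-weighted sum of its word embeddings, and sentences are then ranked by cosine similarity to the document centroid (one plausible ranking choice; the paper's extractive ranking step may differ).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_embeddings(sentences, word_vectors, dim=100):
    """Represent each sentence as the TF-IDF-weighted sum of its word embeddings.

    sentences    : list of sentence strings
    word_vectors : dict mapping word -> np.ndarray of shape (dim,), e.g.
                   pretrained embeddings loaded elsewhere (assumed available)
    """
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(sentences)            # (n_sentences, vocab_size)
    vocab = vectorizer.get_feature_names_out()
    reps = np.zeros((len(sentences), dim))
    for i in range(len(sentences)):
        row = tfidf[i].tocoo()
        for j, weight in zip(row.col, row.data):
            word = vocab[j]
            if word in word_vectors:
                reps[i] += weight * word_vectors[word]
    return reps

def rank_by_centroid(reps):
    """Rank sentences by cosine similarity to the mean sentence vector."""
    centroid = reps.mean(axis=0)
    sims = reps @ centroid / (np.linalg.norm(reps, axis=1) * np.linalg.norm(centroid) + 1e-8)
    return np.argsort(-sims)
```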
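
To make the c2c idea concrete, here is a minimal PyTorch sketch of a character-level CNN front-end feeding an LSTM encoder. It is a deliberately simplified illustration of the general architecture described above, not the authors' model; in the full system an attentional decoder would sit on top of the encoder outputs.

```python
import torch
import torch.nn as nn

class CharCNNEncoder(nn.Module):
    """Character-level CNN front-end feeding an LSTM encoder (c2c-style sketch)."""
    def __init__(self, n_chars=128, char_dim=16, hidden_dim=64):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        # 1-D convolution over the character sequence builds local n-gram features.
        self.conv = nn.Conv1d(char_dim, hidden_dim, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        x = self.char_embed(char_ids)                 # (batch, seq_len, char_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, hidden_dim, seq_len)
        x = x.transpose(1, 2)                         # (batch, seq_len, hidden_dim)
        outputs, state = self.lstm(x)
        return outputs, state                         # outputs would feed an attentional decoder

# Smoke test on random character ids.
encoder = CharCNNEncoder()
dummy_chars = torch.randint(0, 128, (2, 50))
outputs, _ = encoder(dummy_chars)
print(outputs.shape)  # torch.Size([2, 50, 64])
```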

Results

The evaluation metrics are ROUGE scores, the METEOR score, the Overlap score and the Repeat score. Despite their weaknesses, ROUGE scores are standard in summarisation. The METEOR score is borrowed from machine translation, and the Overlap score measures to what extent the models simply copy text directly from the input as the summary. The Repeat score measures how often the summary contains repeated phrases, a common problem in abstractive summarisation.

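As a quick reminder of what ROUGE-1 actually measures, here is a from-scratch unigram-overlap sketch. Reported results should of course use an official ROUGE implementation; this is only to show the idea.

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap ROUGE-1 F1 (illustrative; use an official ROUGE
    implementation for reported results)."""
    cand_counts = Counter(candidate_tokens)
    ref_counts = Counter(reference_tokens)
    overlap = sum((cand_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the model generates a title".split(),
                "the model produces a title".split()))  # 0.8
```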

[Figure: Results for the title-gen and abstract-gen datasets [1]]

For the title-gen results (Table 2), rwmd-rank is the best extractive model; however, c2c (an abstractive model) outperformed all extractive models by a large margin, including the oracle. Both c2c and fconv achieved similar results, with similarly high Overlap scores. For the abstract-gen results (Table 3), lead-10 was a strong baseline and only the extractive models managed to outperform it. All extractive models achieved similar ROUGE scores and similar Repeat scores. The abstractive models performed poorly on ROUGE but outperformed all other models on METEOR, so it was difficult to draw a firm conclusion.


Qualitative evaluation, as is common, was also conducted on the generated summaries. See below for an example of the title-gen qualitative evaluation. The observations are as follows:


  1. There is large variation in the sentence locations selected by extractive models on title-gen, with the first sentence of the abstract being the most important
  2. Many abstractively generated titles tend to be of high quality, demonstrating the models' ability to select important information
  3. Lstm tends to generate more novel words, whereas c2c and fconv tend to copy more from the input text
  4. The generated titles occasionally make mistakes: using incorrect words, being too generic, or failing to capture the main point of the paper. This can lead to factual inconsistencies
  5. For abstract-gen, the introduction and conclusion sections appear to be the most relevant for generating the abstract. However, important content is spread across sections, and sometimes the reader cares more about the methodology and results
  6. The output of the fconv abstractive model is of poor quality, lacking coherence and content flow. There is also the common problem of repeated sentences or phrases in the summary
[Figure: Sentence selection analysis [1]]

[Figure: Qualitative summarisation results of different models [1]]

Conclusion and Future Work

The results were mixed: the models performed well on title generation but struggled with abstract generation. This can be explained by the difficulty of understanding long input and output sequences. Future work includes hybrid extractive-abstractive end-to-end approaches.


Source:

[1] Nikolov, N.I., Pfeiffer, M. and Hahnloser, R.H., 2018. Data-driven summarization of scientific articles. arXiv preprint arXiv:1804.08875.


Originally published at https://ryanong.co.uk on April 25, 2020.


Aspect Extraction / Aspect-based Sentiment Analysis

Summarisation

Others

Translated from: https://towardsdatascience.com/day-116-of-nlp365-nlp-papers-summary-data-driven-summarization-of-scientific-articles-3fba016c733b
