Notes on AAAI 2020 NLP Papers on Grammatical Error Correction

MaskGEC: Improving Neural Grammatical Error Correction via Dynamic Masking

Motivation
NMT methods need a fairly large parallel corpus of error-annotated sentence pairs

Method
Add random masks to the original source sentences dynamically during the training procedure.

Results
The experiments on NLPCC 2018 Task 2 show that our MaskGEC model improves the performance of neural GEC models. Moreover, our single model for Chinese GEC outperforms the current state-of-the-art ensemble system in NLPCC 2018 Task 2 without any extra knowledge.

We substitute the original sentences with the noisy ones directly on the source side, without increasing the size of the training set. By introducing noise, the generalization ability of the grammatical error correction model is enhanced in our approach.

It is worth mentioning that the choice of the NMT framework is not a focus of this paper. We expect that other seq2seq models would benefit from
our approach.

We add noise to the source sentence X with a certain probability in the j-th epoch of the training process, dynamically re-sampling the noise every epoch.
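
A minimal sketch of how this dynamic noising could sit inside a training loop; the mask symbol, the noising probability, and `model.train_step` are illustrative assumptions rather than the authors' implementation:

```python
import random

MASK_TOKEN = "[PAD]"   # hypothetical masking symbol
NOISE_PROB = 0.15      # assumed per-token noising probability

def add_noise(tokens, prob=NOISE_PROB):
    """Replace each source token with the mask symbol with probability `prob`."""
    return [MASK_TOKEN if random.random() < prob else tok for tok in tokens]

def train(model, corpus, num_epochs):
    for epoch in range(num_epochs):
        for src_tokens, tgt_tokens in corpus:
            # The noise is re-sampled in every epoch, so the same sentence
            # pair yields a different noisy source each time it is seen,
            # without ever enlarging the stored training set.
            noisy_src = add_noise(src_tokens)
            model.train_step(noisy_src, tgt_tokens)
```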

Noising schemes
Padding Substitution
Random Substitution
Word Frequency Substitution
Homophone Substitution
Mixed Substitution
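
A rough token-level sketch of what these substitution schemes might look like; the function names, the vocabulary, and the homophone dictionary are placeholders for illustration only:

```python
import random
from functools import partial

def padding_substitution(tokens, prob, pad="[PAD]"):
    """Replace each token with a padding symbol with probability `prob`."""
    return [pad if random.random() < prob else t for t in tokens]

def random_substitution(tokens, prob, vocab):
    """Replace each token with a word drawn uniformly from the vocabulary."""
    return [random.choice(vocab) if random.random() < prob else t for t in tokens]

def word_frequency_substitution(tokens, prob, vocab, weights):
    """Replace each token with a word sampled according to corpus frequency."""
    return [random.choices(vocab, weights=weights)[0] if random.random() < prob else t
            for t in tokens]

def homophone_substitution(tokens, prob, homophones):
    """Replace each token with one of its homophones (a natural choice for Chinese)."""
    return [random.choice(homophones[t]) if t in homophones and random.random() < prob else t
            for t in tokens]

def mixed_substitution(tokens, prob, schemes):
    """Pick one of the noising schemes at random for this sentence."""
    return random.choice(schemes)(tokens, prob)

# Example: bind the extra arguments so every scheme takes (tokens, prob).
vocab = ["我", "你", "在", "学习", "中文"]
schemes = [padding_substitution,
           partial(random_substitution, vocab=vocab),
           partial(homophone_substitution, homophones={"在": ["再"]})]
print(mixed_substitution(["我", "在", "学习", "中文"], 0.2, schemes))
```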

Based on the Char-Transformer, we also re-implement the source-word dropout method proposed by Junczys-Dowmunt et al. (2018). Following their work, we set the full embedding vector for a source word to 0 with a probability p_src; all other embedding values are scaled by 1/(1 − p_src). They showed that dropout over source words can bring gains for neural grammatical error correction. By introducing corruption on the source side, the model is taught to place less trust in the input and to apply corrections more aggressively.
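
A sketch of that source-word dropout in PyTorch-style code: with probability p_src the whole embedding vector of a source word is zeroed, and the kept embeddings are rescaled by 1/(1 − p_src). The function name and tensor shapes are assumptions, not the authors' code:

```python
import torch

def source_word_dropout(src_embeddings, p_src, training=True):
    """Drop whole source-word embeddings with probability p_src.

    src_embeddings: float tensor of shape (batch, src_len, emb_dim).
    Kept embeddings are rescaled by 1 / (1 - p_src), as in inverted dropout.
    """
    if not training or p_src == 0.0:
        return src_embeddings
    batch, src_len, _ = src_embeddings.shape
    # One Bernoulli draw per source word, broadcast over the embedding dimension.
    keep = (torch.rand(batch, src_len, 1, device=src_embeddings.device) >= p_src).float()
    return src_embeddings * keep / (1.0 - p_src)
```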

The approach of Junczys-Dowmunt et al. (2018) reaches 32.15 F0.5, an improvement of 4.29 F0.5 over the Char-Transformer model. Despite this, our dynamic masking model still beats it by a significant margin of 4.82 F0.5. The reason is that our proposed dynamic masking method yields more diverse noisy sentence pairs, which may benefit the generalization ability of our GEC model.

The main difference between our dynamic masking method and the approaches above is that our method serves, to some extent, as a kind of regularization while training the seq2seq model. We do not synthesize new training data explicitly; instead, our dynamic masking approach is more like a token-level dropout.

Towards Minimal Supervision BERT-Based Grammar Error Correction (Student Abstract)

Motivation:
GEC requires large amounts of annotated data, which limits its application in data-limited settings.

We try to incorporate contextual information from a pre-trained language model to make better use of limited annotation and to benefit multilingual scenarios.

A shortcoming of our current approach is that it only produces as many corrections as there are masked input tokens.

For our preliminary experiments we focus on a simplified single-edit setting, where we attempt to correct sentences with a single error (assuming oracle error-identification annotations). The goal is to assess the capabilities of pre-trained BERT-like models assuming perfect performance for all other components (e.g., error identification).
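
A minimal sketch of this single-edit setting with a masked language model: the erroneous position is assumed to be given by an oracle, that token is replaced with [MASK], and the top prediction fills it in. The Hugging Face fill-mask pipeline, the model choice, and the example sentence are illustrative assumptions:

```python
from transformers import pipeline

# Any BERT-style masked LM would do; bert-base-cased is just an example.
fill_mask = pipeline("fill-mask", model="bert-base-cased")

def correct_single_error(tokens, error_index):
    """Mask the oracle-identified erroneous token and let BERT fill it in."""
    masked = tokens.copy()
    masked[error_index] = fill_mask.tokenizer.mask_token
    prediction = fill_mask(" ".join(masked))[0]["token_str"]
    corrected = tokens.copy()
    corrected[error_index] = prediction.strip()
    return corrected

# Example: a sentence with a single error at position 1.
print(correct_single_error(["He", "go", "to", "school", "."], 1))
```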

Related and Future Work
A retrieve-edit model has been proposed for text generation (Guu et al. 2017). However, the editing is one-shot, and sentences with multiple grammatical errors could further reduce the similarity between the correct form and the oracle sentence. An iterative decoding approach (Ge, Wei, and Zhou 2018) or a neural language model as the scoring function (Choe et al. 2019) has been employed for GEC. To the best of our knowledge, there is no prior work applying pre-trained contextual models to grammatical error correction. In future work, we will additionally model error fertility, allowing us to exactly predict the number of necessary [MASK] tokens. Last, we will employ a re-ranking mechanism which scores the candidate outputs from BERT, taking into account larger context and specific grammar properties.
Better span detection
Although BERT can predict all the missing tokens in a sentence in a reasonable way, predicting the correct words can easily fall into redundant editing. Our experiments show that simply rephrasing the whole sentence using BERT leads to too diverse an output. Instead, a prior error-span detection step could be necessary for efficient GEC, and it is part of our future work.
Partial masking and fluency measures
Multi-masking, or masking an informative part of the sentence, will lead to loss of original information and will allow unwanted freedom in the predictions; see Table 3 for examples. Put in plain terms, multi-masking allows BERT to get too creative. Instead, we will investigate partial masking strategies (Zhou et al. 2019) which could alleviate this problem. Fluency is an important measure when employing an iterative approach (Napoles, Sakaguchi, and Tetreault 2016). We plan to explore fluency measures as part of our re-ranking mechanisms.