2019.9.5 note
A Structural Probe for Finding Syntax in Word Representations
- The probe identifies a linear transformation under which squared L2 distance encodes the distance between words in the parse tree, and one in which squared L2 norm encodes depth in the parse tree. Using this probe, we show that such transformations exist, providing evidence that entire syntax trees are embedded implicitly in deep models’ vector geometry.
This defines a distance probe $d_B(h_i, h_j)^2 = \big(B(h_i - h_j)\big)^\top \big(B(h_i - h_j)\big)$ and a depth probe $\|B h_i\|_2^2$ for BERT embeddings $h_i$, and finds that the learned linear transformation $B$ recovers the distances and depths of the parse tree.
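A minimal PyTorch sketch of the two probes; the hidden size, probe rank, and random inputs are assumed for illustration, not taken from the paper:

```python
import torch

# Assumed sizes for illustration: hidden size 768, probe rank 128.
hidden_dim, probe_rank, seq_len = 768, 128, 10
B = torch.nn.Parameter(0.01 * torch.randn(probe_rank, hidden_dim))  # learned probe matrix
H = torch.randn(seq_len, hidden_dim)  # contextual embeddings h_1..h_n of one sentence

# Distance probe: squared L2 distance under B, trained to match parse-tree distances.
diffs = H[:, None, :] - H[None, :, :]          # (n, n, d) pairwise differences
pred_dist = (diffs @ B.T).pow(2).sum(-1)       # d_B(h_i, h_j)^2, shape (n, n)

# Depth probe: squared L2 norm under B, trained to match parse-tree depth.
pred_depth = (H @ B.T).pow(2).sum(-1)          # ||B h_i||^2, shape (n,)

# Training (not shown) minimizes the gap between these predictions and the
# gold tree distances / depths, averaged over the corpus.
print(pred_dist.shape, pred_depth.shape)
```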
Analyzing the Structure of Attention in a Transformer Language Model
- This work visualizes attention for individual instances and analyzes the interaction between attention and syntax over a large corpus.
- This work finds that attention targets different POS at different layer depths within the model, and that attention aligns with dependency relations most strongly in the middle layers. This work also finds that the deepest layers of the model capture the most distant relationships.
BERT Rediscovers the Classical NLP Pipeline
- A wave of recent work has begun to “probe” state-of-the-art models to understand whether they are representing language in a satisfying way. This work builds on that line of work, focusing on the BERT model, and uses a suite of probing tasks derived from the traditional NLP pipeline to quantify where specific types of linguistic information are encoded.
- This work adopts two metrics: Scalar Mixing Weights and Cumulative Scoring. For Scalar Mixing Weights, they compute the center of gravity of the mixing weights over layers, $E_s[\ell] = \sum_{\ell=1}^{L} \ell \cdot \mathrm{softmax}(s)_\ell$. For Cumulative Scoring, the goal is to estimate at which layer of the encoder a target can be correctly predicted. Mixing weights cannot tell us this directly, because they are learned as parameters and do not correspond to a distribution over data. Instead, this work defines a differential score $\Delta_\tau^{(\ell)} = \mathrm{Score}_\tau(P_\tau^{(\ell)}) - \mathrm{Score}_\tau(P_\tau^{(\ell-1)})$, which measures how much better the probing task does when one additional encoder layer is observed, and the expected layer $E_\Delta[\ell] = \sum_{\ell=1}^{L} \ell \cdot \Delta_\tau^{(\ell)} \big/ \sum_{\ell=1}^{L} \Delta_\tau^{(\ell)}$.
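A rough numerical illustration of the two metrics, with made-up mixing weights and probe scores in place of real ones:

```python
import numpy as np

num_layers = 12
rng = np.random.default_rng(0)

# Center of gravity of the learned scalar mixing weights (weights here are made up).
s = rng.standard_normal(num_layers)             # raw scalar-mix parameters
w = np.exp(s) / np.exp(s).sum()                 # softmax -> mixing weights
center_of_gravity = (np.arange(1, num_layers + 1) * w).sum()

# Expected layer from cumulative scoring: Score(l) is the probe score when it sees
# layers 0..l; delta(l) is the gain from adding layer l (scores here are made up).
scores = np.sort(rng.random(num_layers + 1))    # cumulative scores for layers 0..L
delta = np.diff(scores)                         # differential scores for layers 1..L
expected_layer = (np.arange(1, num_layers + 1) * delta).sum() / delta.sum()

print(center_of_gravity, expected_layer)
```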
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT Model.
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
In this work, a set of tasks with task-specific labeled training data is picked. Then, for each task, an ensemble of different neural nets (the teacher) is trained. The teacher is used to generate, for each task-specific training sample, a set of soft targets. Given the soft targets of the training datasets across multiple tasks, a single MT-DNN (the student) is trained using multi-task learning and backpropagation.
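A hedged sketch of the soft-target distillation loss; the mixing weight alpha and the exact combination with hard labels are assumptions for illustration, not the paper's precise recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, hard_labels=None, alpha=0.5):
    """Cross-entropy between the teacher ensemble's averaged probabilities
    (soft targets) and the student's predictions, optionally mixed with the
    usual hard-label loss."""
    soft_loss = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
    if hard_labels is None:
        return soft_loss
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example: teacher_probs would come from averaging the ensemble's softmax outputs.
student_logits = torch.randn(8, 3)
teacher_probs = F.softmax(torch.randn(8, 3), dim=-1)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_probs, labels))
```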
Visualizing and Measuring the Geometry of BERT
- Investigating how BERT represents syntax, this work describes evidence that attention matrices contain grammatical representations. It also provides mathematical arguments that may explain the particular form of the parse tree embeddings described. Turning to semantics, using visualizations of the activations created by different pieces of text, this work shows suggestive evidence that BERT distinguishes word senses at a very fine level. Moreover, much of this semantic information appears to be encoded in a relatively low-dimensional subspace.
- This work discusses the mathematics of embedding trees in Euclidean space via power-p embeddings (a small verification sketch follows this list).
- This work visualizes the geometry of word senses.
See also the paper A Structural Probe for Finding Syntax in Word Representations.
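A quick numerical check of the power-2 (Pythagorean) tree-embedding idea on a made-up toy tree: give every edge its own orthonormal direction and embed each node as the sum of the edge vectors on its root path, so squared Euclidean distance recovers tree distance.

```python
import itertools
import numpy as np

# Toy tree as parent pointers (node 0 is the root); the tree itself is made up.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
nodes = [0, 1, 2, 3, 4, 5]
edges = list(parent.items())                      # one edge per non-root node

# Each edge gets its own orthonormal basis vector.
basis = {child: np.eye(len(edges))[i] for i, (child, _) in enumerate(edges)}

def embed(v):
    vec = np.zeros(len(edges))
    while v in parent:          # walk up to the root, summing edge vectors
        vec += basis[v]
        v = parent[v]
    return vec

def tree_distance(u, v):
    def root_path(x):
        path = [x]
        while x in parent:
            x = parent[x]
            path.append(x)
        return path
    pu, pv = root_path(u), root_path(v)
    common = len(set(pu) & set(pv))               # shared ancestors incl. the LCA
    return (len(pu) - common) + (len(pv) - common)

# Squared Euclidean distance between embeddings equals the tree distance.
for u, v in itertools.combinations(nodes, 2):
    assert np.isclose(np.sum((embed(u) - embed(v)) ** 2), tree_distance(u, v))
print("power-2 embedding check passed")
```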
Parameter-Efficient Transfer Learning for NLP
This work proposes transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. The module is a bottleneck with a skip connection, inserted into each Transformer layer; a rough sketch follows:
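A minimal PyTorch sketch of a bottleneck adapter; the hidden size, bottleneck size, and GELU nonlinearity are illustrative assumptions (the paper inserts two such adapters into every Transformer layer, after the attention and feed-forward sublayers):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a skip
    connection so a near-identity initialization leaves the frozen network intact."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the original signal

# Only adapter (plus layer-norm and classifier) parameters are trained per task;
# the pretrained Transformer weights stay frozen.
x = torch.randn(2, 10, 768)
print(Adapter()(x).shape)
```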
Probing Neural Network Comprehension of Natural Language Arguments
- This work identifies statistical cues in the ARCT dataset, whose items have the structure Reason (since) Warrant -> Claim and Reason (since) Alternative Warrant -> not Claim.
- This work discusses productivity and coverage of using the presence of “not” in the warrant to predict the label in ARCT. Across the whole dataset, if you pick the warrant with “not” you will be right 61% of the time, which covers 64% of all data points.
- They conduct an adversarial attack on the test set: a “not” is added to negate the Claim, so the correct Warrant and the Alternative swap. However, BERT only achieves about 50% accuracy on the adversarial test set.
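A toy re-computation of productivity and coverage for the “not” cue, on made-up examples rather than the real ARCT data:

```python
# w0/w1 are the two candidate warrants, label is the index of the correct one.
examples = [
    {"w0": "it does not help", "w1": "it helps", "label": 0},
    {"w0": "it is cheap", "w1": "it is not cheap", "label": 0},
    {"w0": "no negation here", "w1": "none here either", "label": 1},
]

def cue_applies(ex):
    # the heuristic is usable when exactly one of the two warrants contains "not"
    return ("not" in ex["w0"].split()) != ("not" in ex["w1"].split())

def cue_prediction(ex):
    return 0 if "not" in ex["w0"].split() else 1

applicable = [ex for ex in examples if cue_applies(ex)]
coverage = len(applicable) / len(examples)                      # fraction of points the cue covers
productivity = sum(cue_prediction(ex) == ex["label"]            # how often the cue is right
                   for ex in applicable) / len(applicable)
print(f"coverage={coverage:.2f} productivity={productivity:.2f}")
```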
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
This work introduces a new evaluation set called HANS (Heuristic Analysis for NLI Systems). BERT performs poorly on HANS.
RoBERTa
A Robustly Optimized BERT Pretraining Approach:
(1) Training the model longer, with bigger batches, over more data
(2) Removing the next sentence prediction objective
(3) Training on longer sequences
(4) Dynamically changing the masking pattern applied to the training data (avoid using the same mask for each training instance in every epoch)
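A small sketch of dynamic masking as described in (4), sampling a fresh mask each time a sequence is seen; the plain [MASK]-only replacement (no 80/10/10 rule) is a simplification:

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Sample a new mask on the fly (dynamic masking) instead of fixing one
    mask during preprocessing (static masking)."""
    out, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            out.append(mask_token)
            targets.append(tok)      # prediction target for the masked position
        else:
            out.append(tok)
            targets.append(None)
    return out, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(2):
    print(dynamic_mask(tokens))      # a different masking pattern each epoch
```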
Sense BERT
SenseBERT: Driving Some Sense into BERT.
This work not only predicts the masked words but also their WordNet supersenses, replacing BERT’s word embeddings with joint word-sense embeddings.
Sentence-BERT
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Span-BERT
SpanBERT extends BERT by
(1) masking contiguous random spans, rather than random tokens
(2) a Span Boundary Objective (SBO): each token $x_i$ in a masked span $(x_s, \dots, x_e)$ is predicted as $y_i = f(\mathbf{x}_{s-1}, \mathbf{x}_{e+1}, \mathbf{p}_{i-s+1})$ from the two boundary representations and a relative position embedding, training the span boundary representations to predict the entire content of the masked span without relying on the individual token representations within it (a rough sketch of such a head follows).
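A rough sketch of an SBO-style prediction head; the paper implements $f$ as a two-layer feed-forward network with GeLU and layer normalization, but the sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpanBoundaryHead(nn.Module):
    """Predict each token inside a masked span from the two boundary
    representations plus a relative position embedding."""
    def __init__(self, hidden=768, max_span=10, vocab=30522):
        super().__init__()
        self.pos = nn.Embedding(max_span, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
        )
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x_left, x_right, rel_pos):
        # x_left = boundary token before the span, x_right = boundary token after it,
        # rel_pos = position of the target token within the span.
        h = torch.cat([x_left, x_right, self.pos(rel_pos)], dim=-1)
        return self.out(self.mlp(h))

head = SpanBoundaryHead()
x_left, x_right = torch.randn(4, 768), torch.randn(4, 768)
rel_pos = torch.tensor([0, 1, 2, 3])
print(head(x_left, x_right, rel_pos).shape)   # (4, vocab) logits for the masked tokens
```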
ERNIE (Baidu ERNIE) extends BERT by masking a whole Chinese word.
It also states that Next Sentence Prediction can be removed (this is mentioned in many other papers, too).
Towards a Deep and Unified Understanding of Deep Neural Models in NLP
This work defines a unified information-based measure to provide quantitative explanations on how intermediate layers of deep NLP models leverage information of input words.
See also the author’s own articles on this work.
XLNet
(1) XLNet enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (implemented with attention masks).
(2) XLNet overcomes the limitations of BERT (the independence assumption among masked tokens and the pretrain-finetune discrepancy of [MASK]) thanks to its autoregressive formulation.
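A simplified illustration of how an attention mask realizes a permuted factorization order (content-stream only; the full two-stream attention is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 6
order = rng.permutation(seq_len)            # one sampled factorization order z
rank = np.empty(seq_len, dtype=int)
rank[order] = np.arange(seq_len)            # position of each token within z

# Position i may attend to position j only if j comes no later than i in the
# factorization order, so tokens keep their original positions while the
# prediction order is permuted.
attn_mask = rank[None, :] <= rank[:, None]
print(attn_mask.astype(int))
```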