2019.9.5 note

A Structural Probe for Finding Syntax in Word Representations

  1. The probe identifies a linear transformation under which squared L2 distance encodes the distance between words in the parse tree, and one in which squared L2 norm encodes depth in the parse tree. Using this probe, we show that such transformations exist, providing evidence that entire syntax trees are embedded implicitly in deep models’ vector geometry.

This defines $f(x) = A v_x$ for BERT embedding $v_x$ and a learned linear map $A$, and the distance $d(x, y) = (f(x) - f(y))^T (f(x) - f(y))$, i.e. squared L2 distance after the transformation. This finds that this $d$ can learn the distances on parse trees.
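
A minimal sketch of fitting such a distance probe, assuming contextual embeddings and gold tree distances are available (both are random stand-ins below, and the probe rank and training loop are illustrative, not the authors' exact setup):

```python
import torch

# Stand-ins: `h` plays the role of BERT token embeddings for one sentence,
# `tree_dist` the gold parse-tree distances between its tokens.
seq_len, dim, probe_rank = 10, 768, 128
h = torch.randn(seq_len, dim)
tree_dist = torch.randint(0, 8, (seq_len, seq_len)).float()
tree_dist = (tree_dist + tree_dist.T) / 2        # make the toy distances symmetric

A = torch.nn.Parameter(torch.randn(probe_rank, dim) * 0.01)
optimizer = torch.optim.Adam([A], lr=1e-3)

for step in range(100):
    f = h @ A.T                                  # f(x) = A v_x for every token
    diff = f.unsqueeze(1) - f.unsqueeze(0)       # pairwise f(x) - f(y)
    pred = (diff ** 2).sum(-1)                   # squared L2 distance under A
    loss = (pred - tree_dist).abs().mean()       # L1 loss against tree distances
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```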

Analyzing the Structure of Attention in a Transformer Language Model

  1. This work visualizes attention for individual instances and analyzes the interaction between attention and syntax over a large corpus.
  2. This work finds that attention targets different POS at different layer depths within the model, and that attention aligns with dependency relations most strongly in the middle layers. This work also finds that the deepest layers of the model capture the most distant relationships.

BERT Rediscovers the Classical NLP Pipeline

  1. A wave of recent work has begun to “probe” state-of-the-art models to understand whether they are representing language in a satisfying way. This work builds on that line of work, focusing on the BERT model, and uses a suite of probing tasks derived from the traditional NLP pipeline to quantify where specific types of linguistic information are encoded.
  2. This work adopts two metrics: Scalar Mixing Weights and Cumulative Scoring. For Scalar Mixing Weights, they compute the center of gravity of the learned per-layer mixing weights. For Cumulative Scoring, the aim is to estimate at which layer in the encoder a target $(s_1, s_2, \text{label})$ can be correctly predicted. Mixing weights cannot tell us this directly, because they are learned as parameters and do not correspond to a distribution over data. Instead, this work defines the expected layer as $E[l] = \frac{\sum_{l} l\,(\mathrm{Score}(P^{(l)}) - \mathrm{Score}(P^{(l-1)}))}{\sum_{l} (\mathrm{Score}(P^{(l)}) - \mathrm{Score}(P^{(l-1)}))}$, where the differential score $\mathrm{Score}(P^{(l)}) - \mathrm{Score}(P^{(l-1)})$ measures how much better we do on the probing task when we observe one additional encoder layer.
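
A minimal sketch of the expected-layer computation, assuming a list of per-layer cumulative probing scores (the numbers below are hypothetical):

```python
# `cum_scores[l]` is the probing score when using encoder layers 0..l.
def expected_layer(cum_scores):
    layers = range(1, len(cum_scores))
    diffs = [cum_scores[l] - cum_scores[l - 1] for l in layers]   # differential scores
    return sum(l * d for l, d in zip(layers, diffs)) / sum(diffs)

cum_scores = [0.50, 0.62, 0.71, 0.80, 0.83, 0.84]   # hypothetical layer-wise scores
print(expected_layer(cum_scores))                   # center of mass of improvements
```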

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT Model.

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

In this work, a set of tasks with task-specific labeled training data is picked. Then, for each task, an ensemble of different neural nets (the teacher) is trained, and the teacher is used to generate a set of soft targets for each task-specific training sample. Given the soft targets of the training datasets across multiple tasks, a single MT-DNN (the student) is trained using multi-task learning and backpropagation.
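
A minimal sketch of training on the teacher's soft targets; the temperature, tensor shapes, and KL formulation are illustrative rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

temperature = 2.0
teacher_logits = torch.randn(32, 3)                 # e.g. averaged over the ensemble
student_logits = torch.randn(32, 3, requires_grad=True)

soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
distill_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
distill_loss.backward()      # combined with the hard-label task losses in practice
```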

Visualizing and Measuring the Geometry of BERT

  1. Investigating how BERT represents syntax, this work describes evidence that attention matrices contain grammatical representations. This work also provides mathematical arguments that may explain the particular form of the parse tree embeddings described in A Structural Probe for Finding Syntax in Word Representations. Turning to semantics, using visualizations of the activations created by different pieces of text, this work shows suggestive evidence that BERT distinguishes word senses at a very fine level. Moreover, much of this semantic information appears to be encoded in a relatively low-dimensional subspace.

  2. This work discusses the mathematics of embedding trees in Euclidean space, in particular power-$p$ tree embeddings, in which the $p$-th power of the Euclidean distance between two node embeddings equals their tree distance (the parse-tree geometry recovered by the structural probe corresponds to $p = 2$); see the sketch after this list.

  3. This work visualizes the geometry of word senses.

    See also the paper A Structural Probe for Finding Syntax in Word Representations.
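
A minimal sketch of a power-2 (Pythagorean) tree embedding on a made-up toy tree: give every edge its own orthogonal unit vector and embed each node as the sum of the edge vectors on its path from the root, so squared Euclidean distance equals tree distance.

```python
import numpy as np

edges = [(0, 1), (1, 2), (1, 3), (0, 4)]       # toy tree; node 0 is the root
dim = len(edges)
emb = {0: np.zeros(dim)}
for i, (parent, child) in enumerate(edges):
    basis = np.zeros(dim)
    basis[i] = 1.0                             # a distinct unit vector per edge
    emb[child] = emb[parent] + basis

# Squared Euclidean distance between nodes 2 and 4 equals their tree distance (3).
print(np.sum((emb[2] - emb[4]) ** 2))          # -> 3.0
```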

Parameter-Efficient Transfer Learning for NLP

This work proposes transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. Each adapter (shown as a figure in the paper) is a bottleneck feed-forward layer with a skip connection, inserted after each sub-layer of the Transformer; see the sketch below.
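
A minimal sketch of such an adapter, assuming the bottleneck-plus-skip-connection design; the sizes and the ReLU nonlinearity are illustrative:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # few parameters per task
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, h):
        # Only the adapter is trained; the backbone Transformer stays frozen.
        return h + self.up(torch.relu(self.down(h)))

h = torch.randn(2, 16, 768)      # (batch, seq_len, hidden) from a frozen sub-layer
print(Adapter()(h).shape)        # torch.Size([2, 16, 768])
```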

Probing Neural Network Comprehension of Natural Language Arguments

  1. This work identifies statistical cues in the ARCT dataset, whose arguments have the structure Reason ∧ Warrant → Claim and Reason ∧ Alternative → ¬Claim.
  2. This work discusses the productivity and coverage of using the presence of “not” in the warrant to predict the label in ARCT: across the whole dataset, picking the warrant that contains “not” is right 61% of the time (productivity), and this cue covers 64% of all data points (coverage); see the sketch after this list.
  3. They conduct an adversarial attack on the test set: they negate the Claim by adding “not”, so the Warrant and the Alternative should swap roles. However, BERT only achieves about 50% accuracy (roughly chance) on the adversarial test set.
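
A minimal sketch of measuring productivity and coverage of the “not” cue on a made-up toy dataset; each item is the two candidate warrants and the index of the correct one:

```python
def cue_stats(dataset, cue="not"):
    applicable, correct = 0, 0
    for w0, w1, label in dataset:
        has0, has1 = cue in w0.split(), cue in w1.split()
        if has0 == has1:
            continue                          # cue does not discriminate here
        applicable += 1
        predicted = 0 if has0 else 1          # pick the warrant containing the cue
        correct += int(predicted == label)
    coverage = applicable / len(dataset)
    productivity = correct / applicable if applicable else 0.0
    return productivity, coverage

toy = [("it is not safe", "it is safe", 0),
       ("they will win", "they will not win", 0),
       ("prices rise", "prices fall", 1)]
print(cue_stats(toy))                         # (0.5, 0.666...)
```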

Right for the Wrong Reasons

This work introduces a new evaluation set called HANS (Heuristic Analysis for NLI Systems), built around syntactic heuristics such as lexical overlap between premise and hypothesis. BERT performs poorly on HANS.

RoBERTa

A Robustly Optimized BERT Pretraining Approach:

(1) Training the model longer, with bigger batches, over more data

(2) Removing the next sentence prediction objective

(3) Training on longer sequences

(4) Dynamically changing the masking pattern applied to the training data (avoid using the same mask for each training instance in every epoch)
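
A minimal sketch of dynamic masking, drawing a fresh random mask every time a sequence is served; the token IDs are illustrative and BERT's 80/10/10 mask/random/keep split is omitted:

```python
import random

MASK_ID = 103                                 # illustrative [MASK] token id

def dynamically_mask(token_ids, mask_prob=0.15):
    masked, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)                # predict the original token here
        else:
            masked.append(tok)
            labels.append(-100)               # position ignored by the MLM loss
    return masked, labels

sequence = [7592, 2088, 2003, 2307, 1012]
print(dynamically_mask(sequence))             # a different mask on every call/epoch
```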

Sense BERT

SenseBERT: Driving Some Sense into BERT.

This work not only predicts the masked words but also their WordNet supersenses, and it changes BERT's word embeddings into joint word-and-sense embeddings.
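
A minimal sketch of a joint masked-word plus supersense prediction loss, assuming hidden states of the masked positions and supersense labels are available; the sizes and prediction heads are illustrative, not SenseBERT's exact architecture:

```python
import torch
import torch.nn as nn

hidden, vocab_size, n_supersenses = 768, 30522, 45
h = torch.randn(8, hidden)                          # hidden states at masked positions
word_targets = torch.randint(0, vocab_size, (8,))
sense_targets = torch.randint(0, n_supersenses, (8,))

word_head = nn.Linear(hidden, vocab_size)
sense_head = nn.Linear(hidden, n_supersenses)
ce = nn.CrossEntropyLoss()
loss = ce(word_head(h), word_targets) + ce(sense_head(h), sense_targets)
loss.backward()
```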

Sentence-BERT

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
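
The paper encodes both sentences with the same BERT network, pools the token vectors into fixed-size sentence embeddings, and compares them with, e.g., cosine similarity. A minimal sketch of that siamese setup, with `encode` as a stand-in for the real BERT-plus-pooling encoder:

```python
import torch
import torch.nn.functional as F

def encode(token_embeddings):                 # stand-in for BERT + mean pooling
    return token_embeddings.mean(dim=0)       # (seq_len, hidden) -> (hidden,)

sent_a = torch.randn(12, 768)                 # stand-in for BERT token outputs
sent_b = torch.randn(9, 768)
u, v = encode(sent_a), encode(sent_b)
print(float(F.cosine_similarity(u, v, dim=0)))
```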

Span-BERT

SpanBERT extends BERT by

(1) masking contiguous random spans, rather than random tokens

(2) Span Boundary Objective (predicting the tokens of a masked span $[s, e]$ from the boundary representations $x_{s-1}$ and $x_{e+1}$, plus a position embedding of the target token): training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
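
A minimal sketch of such a boundary-based prediction head, combining the two boundary representations with a relative-position embedding of the target token; the 2-layer MLP and sizes are illustrative rather than SpanBERT's exact head:

```python
import torch
import torch.nn as nn

hidden, vocab_size, max_span = 768, 30522, 10
pos_emb = nn.Embedding(max_span, hidden)            # relative position inside the span
mlp = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.GELU(),
                    nn.Linear(hidden, vocab_size))

x_left = torch.randn(1, hidden)                     # representation of x_{s-1}
x_right = torch.randn(1, hidden)                    # representation of x_{e+1}
rel_pos = torch.tensor([2])                         # e.g. third token inside the span
logits = mlp(torch.cat([x_left, x_right, pos_emb(rel_pos)], dim=-1))
print(logits.shape)                                 # torch.Size([1, 30522])
```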

ERNIE (Baidu ERNIE) extends BERT by masking a whole Chinese word.

SpanBERT also finds that the Next Sentence Prediction objective can be removed (this is mentioned in many other papers, too).

Towards a Deep and Unified Understanding of Deep Neural Models in NLP

This work defines a unified information-based measure to provide quantitative explanations on how intermediate layers of deep NLP models leverage information of input words.

See the author’s own articles as well.

XLNET

(1) XLNET enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (implemented with attention masks)

(2) XLNET overcomes the limitations of BERT (the pretrain-finetune discrepancy introduced by [MASK] tokens and the independence assumption among masked tokens) thanks to its autoregressive formulation
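
A minimal sketch of the attention mask for one sampled factorization order (the simple single-stream case; XLNet's actual implementation uses two-stream attention):

```python
import torch

def permutation_mask(seq_len):
    order = torch.randperm(seq_len)           # a random factorization order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)       # rank[i] = position of token i in the order
    # mask[i, j] = True means token i may attend to token j,
    # i.e. j comes at or before i in the factorization order.
    return rank.unsqueeze(1) >= rank.unsqueeze(0)

print(permutation_mask(5))                    # boolean (5, 5) attention mask
```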