Some Notes on Understanding Transformers/BERT

Transformers

  1. Compared with an LSTM, a Transformer's advantage is that it can take the whole sentence as input and rely on attention to understand the relationships between words. The corresponding downside is that word order, an important piece of information, is lost: the words of the input sentence could be shuffled arbitrarily without changing anything. So the positional information of each word has to be explicitly encoded into the model. There are several ways to do this; one is to represent each word's position with a positional encoding vector of the same dimension as the word embedding and add it element-wise to the word embedding (a numpy sketch appears after this list). Defining the encoding for positions much longer than the training sentences lets the model comfortably generate sentences longer than anything seen in training.

  2. An LSTM/RNN carries long-range information in its hidden states, whereas a Transformer relies on attention. For example, when translating the English sentence "The animal didn't cross the street because it was too tired", at the moment the model handles the word "it" it assigns more attention to "animal", so part of the encoding of "animal" gets blended into the representation of "it" that is passed on to the decoder.

  3. How self-attention is computed:

    • The "self-" in self-attention indicates that the attention weights are computed within the same sentence: each word, say "it", attends to the other words of its own sentence (including itself).
    • Q, K, and V are the vectors obtained by multiplying the embedding E with three different matrices (think of them as three projected copies of E). Q represents the word itself as the one doing the querying, and K is what other words are compared against: multiplying a word's Q with every word's K gives raw scores that are passed through a softmax. V is the representation that actually gets forwarded to the next step (the feed-forward layer), after being weighted by those softmax scores. Finally the score-weighted V vectors are summed element-wise, and the result is that word's self-attention output (again a vector); see the numpy sketch after this list.
    • Multi-headed self-attention: a single z1 still doesn't capture the other words strongly enough, so during training word i is still easily dominated by its own representation. The Transformer therefore uses 8 such Q/K/V sets so that information from more subspaces is represented (different attention heads focus on different words). The 8 resulting Z matrices are concatenated and reshaped by a linear transform into one final Z matrix that is fed to the feed-forward layer.
  4. Transformers also use residual connections, e.g. the X inside LayerNorm(X + Z): since Z is already a multi-layer abstraction of the input, adding the residual X makes training more stable (it keeps the model from spending the start of training learning only from noise). The same applies in the decoder, and in fact around every normalization layer; it appears as the Add & Norm line at the end of the attention sketch below.

  5. Minor note: even though the machine can take the whole sentence as input and compute the encoder output (the K and V matrices) in parallel, the decoder still predicts tokens one by one (sequentially).

  6. Beam search: essentially greedy decoding, but keeping several top candidates alive at each step instead of only the single best one (sketched below).
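
To make note 1 concrete, here is a minimal numpy sketch of one common choice, the fixed sinusoidal positional encoding, added element-wise to the word embeddings. The dimensions and the `word_embeddings` array are made up for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed positional encodings.

    Each row encodes one position with interleaved sine/cosine waves of different
    frequencies, so positions beyond those seen in training still get well-defined encodings.
    """
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# Add the positional encoding to (made-up) word embeddings by element-wise sum.
word_embeddings = np.random.randn(10, 64)            # 10 tokens, d_model = 64
pe = sinusoidal_positional_encoding(512, 64)         # max_len chosen much larger than any sentence
inputs = word_embeddings + pe[:10]                   # position information is now mixed in
```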
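
A hand-rolled numpy sketch of the Q/K/V computation from note 3, with 8 heads concatenated and projected back to the model dimension, followed by the residual Add & Norm step from note 4. All weight matrices here are random placeholders standing in for learned parameters; this is a sketch of the idea, not a faithful reimplementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_q, W_k, W_v):
    """Single-head self-attention over a sentence of embeddings E, shape (seq_len, d_model)."""
    Q = E @ W_q          # what each word uses to "ask" about the others
    K = E @ W_k          # what each word offers to be compared against
    V = E @ W_v          # what each word passes on once it is attended to
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))     # (seq_len, seq_len) attention weights
    return scores @ V                            # score-weighted sum of V vectors -> Z

def multi_head_self_attention(E, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples; W_o projects the concatenation back to d_model."""
    Z = np.concatenate([self_attention(E, *h) for h in heads], axis=-1)
    return Z @ W_o

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

# toy dimensions: d_model = 64, 8 heads of size 8 each, a 5-word sentence
rng = np.random.default_rng(0)
d_model, n_heads, d_head, seq_len = 64, 8, 8, 5
E = rng.normal(size=(seq_len, d_model))          # word embeddings (+ positional encoding)
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

Z = multi_head_self_attention(E, heads, W_o)
out = layer_norm(E + Z)      # the residual "Add & Norm" step from note 4
```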
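
For notes 5 and 6: the decoder emits one token at a time, and beam search simply keeps the top-k partial hypotheses at each step instead of committing to the single greedy choice. `next_token_logprobs` is a hypothetical stand-in for one decoder step, and the toy bigram table replaces a real model; production decoders usually add length normalization and other tricks.

```python
import heapq
from math import log

def beam_search(next_token_logprobs, bos, eos, beam_size=3, max_len=20):
    """Keep the `beam_size` best partial sequences at every step (beam_size=1 is greedy decoding).

    `next_token_logprobs(prefix)` stands in for one decoder step: given the tokens
    generated so far, it returns a dict {token: log-probability}.
    """
    beams = [(0.0, [bos])]                               # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:                           # finished hypotheses carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # keep only the top `beam_size` hypotheses by total log-probability
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]                                   # best complete hypothesis

# toy "decoder": a fixed bigram table instead of a real Transformer decoder
bigrams = {
    "<s>": {"the": log(0.6), "a": log(0.4)},
    "the": {"cat": log(0.5), "dog": log(0.3), "</s>": log(0.2)},
    "a":   {"cat": log(0.7), "</s>": log(0.3)},
    "cat": {"</s>": log(1.0)},
    "dog": {"</s>": log(1.0)},
}
print(beam_search(lambda seq: bigrams[seq[-1]], "<s>", "</s>", beam_size=2))
```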

References:
http://jalammar.github.io/illustrated-transformer/

BERT

  1. BERT (Bidirectional Encoder Representations from Transformers) is less a new ML architecture than a training strategy built on top of the Transformer architecture. The main idea of this strategy is to exploit the context present in the training text: before training on the downstream task, the model is pre-trained (usually on unlabeled data) to obtain better initial parameters. For example:
    1. Word embeddings: the embeddings fed into the Transformer are in practice weights trained on some other huge corpus, which may have little to do with the data at hand. For instance, the word "can" appears in the corpus far more often with the meaning "be able to" than with its other meanings, but in the sentence "I want to open the can" the rarer meaning "tin can" becomes much more likely because of the surrounding context. So if the input embeddings can be based more on the current context, learning benefits.
      1. How it's done: general word embeddings are trained from a corpus UNI-directionally, meaning the model predicts the next word conditioned on the previous words. To get bidirectional word embeddings, it is impossible to do something like p(current word | previous words, future words), because the future words would themselves have to be predicted from the current word first; that would be using hindsight. So what the clever Google people do is mask out 15% of the words in a sentence and train the network to predict each masked word from its context (in BERT the prediction comes from a small output layer on top of the Transformer encoder). The task needs no human labels: the loss measures how well BERT predicts the missing words (cross-entropy at the masked positions only), not a reconstruction error over the whole sentence. That loss is then backpropagated into the word embeddings, giving them more contextual information. A minimal sketch of the masking step follows this list.
    2. Because the paper at the time targeted QA-style tasks, they also added a pre-trained prediction of whether one sentence follows another (this is task-specific pre-training, designed for QA, and not necessarily applicable to all tasks).
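
A minimal pure-Python sketch of the masking step described above, on a whitespace-tokenized toy sentence. The real BERT pipeline works on WordPiece tokens over a huge corpus, and of the chosen positions only 80% become [MASK] (10% become a random token, 10% stay unchanged); that refinement is omitted here.

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_tokens(tokens, mask_prob=MASK_PROB, seed=None):
    """Randomly hide ~15% of the tokens; the hidden originals become the targets
    the model must predict from the surrounding context."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # remember what the model should predict at this position
            masked[i] = MASK
    return masked, targets

sentence = "i want to open the can".split()
masked, targets = mask_tokens(sentence, seed=3)
print(masked)    # e.g. ['i', 'want', 'to', 'open', 'the', '[MASK]']
print(targets)   # e.g. {5: 'can'} -> the loss is computed only at these positions
```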

References:
https://medium.com/@jonathan_hui/nlp-bert-transformer-7f0ac397f524
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html