1 Background

[DL Notes 5] The Transformer Model and Self-Attention
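The attention model (an RNN-based Seq2seq model with attention) cannot be parallelized, and it ignores the relationships among the words within the input sentence and among the words within the target sentence. To address this, Google proposed the Transformer model in the 2017 paper "Attention is all you need". The Transformer's defining feature is that it discards the RNN and CNN architectures entirely. The model rests on 2 main concepts:
1. Self attention (replacing the RNN): captures the relationships among the words within the input sentence and within the target sentence, which were previously ignored
2. Multi-head: addresses parallelization and excessive computational cost, since the heads are computed in parallel and each works at a reduced dimension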

2 Model architecture

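Both the Transformer and the Seq2seq model consist of 2 parts: an Encoder and a Decoder. The difference is that the Transformer's Encoder is a stack of 6 Encoders, and the same holds for the Decoder.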

3 Rewriting the Decoder formula of the attention model

The context vector of the attention model can be explained in terms of Query, Key, and Value:

Each word in the input sentence is made up of an <address Key, element Value> pair, i.e. its word embedding vector
Each word in the output sentence is a Query
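
Under this reading, the context vector for one output word can be sketched as a weighted sum of Values. The notation here is an assumption made for illustration: $q$ is the Query, $k_i$ and $v_i$ are the Key and Value of the $i$-th input word, $n$ is the input length, and $\mathrm{score}$ stands for whatever similarity function the attention model uses.

$$
a_i = \frac{\exp(\mathrm{score}(q, k_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(q, k_j))}, \qquad c = \sum_{i=1}^{n} a_i \, v_i
$$

The weights $a_i$ sum to 1, so $c$ is an average of the Values that leans toward the input words whose Keys best match the Query.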

4 Scaled Dot-Product Attention

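The Transformer computes the attention score in much the same way as the attention model, except that it also divides by $\sqrt{d_k}$. The purpose is to keep large dot products from pushing the softmax output to be either 0 or 1.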
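
A minimal NumPy sketch of this computation may help. The function names, the random inputs, and the shapes are assumptions chosen for illustration, not code from the paper; the only part taken from the text is the division by $\sqrt{d_k}$ before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k), scaled dot products
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Illustrative random inputs: 3 queries, 5 key/value pairs, d_k = d_v = 64.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (3, 64)
print(weights.shape)  # (3, 5)
```

With unit-variance inputs, the unscaled dot products have variance around $d_k$, so dropping the scaling makes the softmax rows far more peaked; that is the saturation the scaling is meant to avoid.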

5 Three ways the Transformer computes attention

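In Encoder self attention, the key, value, and query all come from the output of the previous Encoder layer. Decoder self attention works the same way, taking all three from the previous Decoder layer.
To keep the Decoder from jumping ahead to the second half of the sentence while it is still translating the first half, future positions are masked (set to -∞) before the softmax in Decoder self attention. This ensures that the prediction at position i can depend only on the outputs at positions before i.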
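
The masking step can be sketched as follows. The helper names `causal_mask` and `masked_softmax` are hypothetical. The mask here lets position i attend to positions up to and including i, which assumes the common setup where the Decoder inputs are shifted right by one position, so that the prediction for position i still only uses earlier outputs.

```python
import numpy as np

def causal_mask(n):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Masked-out scores become -inf, so they contribute exp(-inf) = 0.
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row max is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative 4x4 score matrix for a 4-token target sequence.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))  # upper triangle (future positions) is all zeros
```
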
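Encoder-Decoder attention differs from Encoder/Decoder self attention: its Query comes from the Decoder self attention, while its Key and Value are the output of the Encoder.
Starting from the input word sequence fed to the Encoder, the Encoder output becomes the Key and Value of the attention vectors. These are then passed to the encoder-decoder attention layer, where they help the Decoder decide which positions of the input sequence to attend to while decoding.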
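
To make the difference in where Query, Key, and Value come from concrete, here is a small sketch of Encoder-Decoder attention. The sizes, the random stand-in activations, and the projection matrices `W_q`, `W_k`, `W_v` are all assumptions for illustration.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention over a single sequence pair.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d_model = 16  # small illustrative size, not the paper's setting

# Stand-ins for real activations (assumed shapes):
encoder_output = rng.normal(size=(7, d_model))  # 7 source tokens
decoder_hidden = rng.normal(size=(3, d_model))  # 3 target tokens, after decoder self attention

# Hypothetical learned projections, randomly initialized here.
W_q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

Q = decoder_hidden @ W_q   # Query comes from the Decoder side
K = encoder_output @ W_k   # Key comes from the Encoder output
V = encoder_output @ W_v   # Value comes from the Encoder output

context, weights = attention(Q, K, V)
print(context.shape)  # (3, 16): one context vector per target position
print(weights.shape)  # (3, 7): each target position's weights over source positions
```

Row t of `weights` is the distribution over input positions that target position t attends to.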

6 Multi-head attention

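If we compute only a single attention, it is hard to capture the information from all the different subspaces of the input sentence. Multi-head attention was proposed to improve on this: instead of computing one attention with $d_{model}$-dimensional key, value, and query, the key, value, and query are linearly projected h times into different subspaces, with dimensions $d_q$, $d_k$, and $d_v$ respectively, and attention is computed on each projection separately. Here $d_k = d_v = d_{model}/h = 64$ (the paper uses $d_{model} = 512$ and $h = 8$). In other words, the inputs are projected onto h heads.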
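
The projection onto h heads can be sketched in NumPy as below. Several details are assumptions beyond the text above: the function and parameter names are hypothetical, the h per-head projections are stored as the column blocks of one $d_{model} \times d_{model}$ matrix, and the head outputs are concatenated and passed through an output projection `W_o`, as in the paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, h):
    """X_q: (n_q, d_model), X_kv: (n_kv, d_model) -> (n_q, d_model)."""
    n_q, d_model = X_q.shape
    n_kv = X_kv.shape[0]
    d_k = d_model // h  # per-head dimension, d_model / h

    # Project, then split the last axis into h heads: (h, n, d_k).
    Q = (X_q @ W_q).reshape(n_q, h, d_k).transpose(1, 0, 2)
    K = (X_kv @ W_k).reshape(n_kv, h, d_k).transpose(1, 0, 2)
    V = (X_kv @ W_v).reshape(n_kv, h, d_k).transpose(1, 0, 2)

    # Scaled dot-product attention, computed for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n_q, n_kv)
    heads = softmax(scores) @ V                       # (h, n_q, d_k)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n_q, h * d_k)
    return concat @ W_o

# Sizes from the text: d_model = 512, h = 8, so d_k = d_v = 64.
d_model, h = 512, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(10, d_model))  # 10 tokens, used here for self attention
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out = multi_head_attention(X, X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (10, 512)
```

Because each head works in 64 dimensions rather than 512, the total cost of the 8 heads stays close to that of one full-dimensional attention.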

Reference tutorial

Seq2seq pay Attention to Self Attention: Part 2(中文版)
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4