NOTES of NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE by Dzmitry Bahdanau et al. (2016)
Traditional
An encoder neural network reads and encodes a source sentence
into a fixed-length vector.
A decoder then outputs a translation from the encoded vector.
The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair,
is jointly trained to maximize the probability of a correct translation given a source sentence.
Issue: the neural network needs to compress all the necessary information of a source sentence into a fixed-length vector, which makes it difficult to cope with long sentences (especially ones longer than those in the training corpus).
Our work
Align and translate jointly.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
The model predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
It does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This allows the model to cope better with long sentences.
Problem formulation of translation
Translation is equivalent to finding a target sentence $\mathbf{y}$ that maximizes
the conditional probability of $\mathbf{y}$ given a source sentence $\mathbf{x}$, i.e., $\arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$.
In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus.
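As a toy illustration of this maximum-likelihood objective (not from the paper), the sketch below scores a tiny made-up parallel corpus with a stand-in cond_prob function; in a real system that function would be the parameterized translation model, and training would adjust its parameters to maximize this sum.

```python
import math

# Toy parallel corpus of (source, target) sentence pairs (hypothetical data).
corpus = [
    (["je", "suis", "étudiant"], ["i", "am", "a", "student"]),
    (["bonjour", "le", "monde"], ["hello", "world"]),
]

def cond_prob(target, source):
    """Stand-in for a parameterized model p(y | x).
    Returns a dummy constant per-token probability; a real NMT system
    would compute this with the trained network."""
    return 0.1 ** len(target)

# Training fits the model parameters to maximize this corpus log-likelihood.
log_likelihood = sum(math.log(cond_prob(y, x)) for x, y in corpus)
print(log_likelihood)
```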
Background: basic encoder–decoder
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors $\mathbf{x} = (x_1, \ldots, x_{T_x})$, into a vector $c$. The most common approach is to use an RNN such that
$h_t = f(x_t, h_{t-1})$ and $c = q(\{h_1, \ldots, h_{T_x}\})$,   (1)
where $h_t$ is the hidden state at time $t$, $c$ is a vector generated from the sequence of hidden states, and $f$ and $q$ are some nonlinear functions.
The decoder is often trained to predict the next word $y_{t'}$ given the context vector $c$ and all the previously predicted words $\{y_1, \ldots, y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf{y}$ by decomposing the joint probability into the ordered conditionals:
$p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c)$,   (2)
where $\mathbf{y} = (y_1, \ldots, y_{T_y})$. With an RNN, each conditional probability is modeled as
$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$,   (3)
where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
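A minimal numpy sketch of the encoder side of Eq. (1), assuming a vanilla tanh RNN cell and the common choice $q(\{h_1, \ldots, h_{T_x}\}) = h_{T_x}$; the dimensions and random weights are purely illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                                 # illustrative dimensions
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))    # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))   # hidden-to-hidden weights

def encode(x_seq):
    """Vanilla RNN encoder: h_t = tanh(W_x x_t + W_h h_{t-1}); c = h_{T_x}."""
    h = np.zeros(d_hid)
    for x_t in x_seq:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h  # fixed-length summary c of the whole sentence

source = [rng.normal(size=d_in) for _ in range(6)]  # toy embedded source sentence
c = encode(source)
print(c.shape)  # (8,) -- all information must fit in this single vector
```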
Learning to align and translate
decoder
We define each conditional probability
in Eq. (2) as:
$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$,   (4)
where $s_i$ is an RNN hidden state for time $i$, computed by $s_i = f(s_{i-1}, y_{i-1}, c_i)$.
Here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.
The context vector $c_i$ depends on a sequence of annotations $(h_1, \ldots, h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence.
$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$   (5)
and
$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$,   (6)
where $e_{ij} = a(s_{i-1}, h_j)$
is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$, Eq. (4)) and the
$j$-th annotation $h_j$ of the input sentence.
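The numpy sketch below works through Eqs. (5)-(6) for one decoding step, using the feedforward alignment model $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ that the paper parameterizes jointly with the rest of the network; the sizes and random weights here are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, d_ann, d_align, T_x = 8, 16, 10, 6            # illustrative sizes

W_a = rng.normal(scale=0.1, size=(d_align, d_hid))    # projects decoder state s_{i-1}
U_a = rng.normal(scale=0.1, size=(d_align, d_ann))    # projects annotation h_j
v_a = rng.normal(scale=0.1, size=d_align)

def attention_context(s_prev, annotations):
    """One decoding step: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j),
    alpha_i = softmax(e_i), c_i = sum_j alpha_ij h_j."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                               # Eq. (6): normalized alignment weights
    c = sum(a * h_j for a, h_j in zip(alpha, annotations))  # Eq. (5): expected annotation
    return c, alpha

annotations = [rng.normal(size=d_ann) for _ in range(T_x)]  # toy encoder annotations h_1..h_Tx
s_prev = rng.normal(size=d_hid)                             # toy previous decoder state
c_i, alpha_i = attention_context(s_prev, annotations)
print(c_i.shape, alpha_i.round(3))  # context vector and soft alignment over source positions
```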
encoder
We would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN.
A BiRNN consists of a forward and a backward RNN. The forward RNN reads the input sequence as it is ordered (from $x_1$ to $x_{T_x}$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x})$.
The backward RNN reads the sequence in the reverse order (from $x_{T_x}$ to $x_1$), resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x})$.
We obtain the annotation for each word $x_j$ by concatenating the forward and backward hidden states, $h_j = [\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top]^\top$. In this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words.
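A small sketch of the bidirectional encoder, assuming plain tanh RNN cells for both directions (the paper uses gated hidden units); each annotation is the concatenation of the forward and backward hidden states at that position. All weights and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid = 4, 8
Wf_x = rng.normal(scale=0.1, size=(d_hid, d_in))   # forward RNN weights
Wf_h = rng.normal(scale=0.1, size=(d_hid, d_hid))
Wb_x = rng.normal(scale=0.1, size=(d_hid, d_in))   # backward RNN weights
Wb_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def run_rnn(x_seq, W_x, W_h):
    """Vanilla tanh RNN returning the hidden state at every position."""
    h, states = np.zeros(d_hid), []
    for x_t in x_seq:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return states

def annotate(x_seq):
    forward = run_rnn(x_seq, Wf_x, Wf_h)               # reads x_1 ... x_Tx
    backward = run_rnn(x_seq[::-1], Wb_x, Wb_h)[::-1]  # reads x_Tx ... x_1, then re-aligned
    # h_j = [forward_j ; backward_j]: summary of the words before and after position j
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

source = [rng.normal(size=d_in) for _ in range(6)]
annotations = annotate(source)
print(len(annotations), annotations[0].shape)  # 6 annotations, each of size 2 * d_hid
```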