NOTES of NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE by Dzmitry Bahdanau et al. (2016)
Traditional
An encoder neural network reads and encodes a source sentence
into a fixed-length vector.
A decoder then outputs a translation from the encoded vector.
The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair,
is jointly trained to maximize the probability of a correct translation given a source sentence.
Issue: the neural network needs to compress all the necessary information of a source sentence into a fixed-length vector, which makes it difficult to cope with long sentences (especially ones longer than those in the training corpus).
Our work
Align and translate jointly.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
The model predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
It does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This allows the model to cope better with long sentences.
Problem formulation of translation
Translation is equivalent to finding a target sentence $\mathbf{y}$ that maximizes
the conditional probability of $\mathbf{y}$ given a source sentence $\mathbf{x}$, i.e., $\arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$.
In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus.
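As a toy illustration of this maximum-likelihood objective (not from the paper), the sketch below scores a tiny made-up parallel corpus with a stand-in cond_prob function; in a real system that function would be the parameterized translation model, and training would adjust its parameters to maximize this sum.

```python
import math

# Toy parallel corpus of (source, target) sentence pairs (hypothetical data).
corpus = [
    (["je", "suis", "étudiant"], ["i", "am", "a", "student"]),
    (["bonjour", "le", "monde"], ["hello", "world"]),
]

def cond_prob(target, source):
    """Stand-in for a parameterized model p(y | x).
    Returns a dummy constant per-token probability; a real NMT system
    would compute this with the trained network."""
    return 0.1 ** len(target)

# Training fits the model parameters to maximize this corpus log-likelihood.
log_likelihood = sum(math.log(cond_prob(y, x)) for x, y in corpus)
print(log_likelihood)
```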
Background: basic encoder–decoder
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors $\mathbf{x} = (x_1, \ldots, x_{T_x})$, into a vector $c$. The most common approach is to use an RNN such that
$h_t = f(x_t, h_{t-1})$ and $c = q(\{h_1, \ldots, h_{T_x}\})$,   (1)
where $h_t$ is the hidden state at time $t$, $c$ is a vector generated from the sequence of hidden states, and $f$ and $q$ are some nonlinear functions.
The decoder is often trained to predict the next word $y_{t'}$ given the context vector $c$ and all the previously predicted words $\{y_1, \ldots, y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf{y}$ by decomposing the joint probability into the ordered conditionals:
$p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c)$,   (2)
where $\mathbf{y} = (y_1, \ldots, y_{T_y})$. With an RNN, each conditional probability is modeled as
$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$,   (3)
where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
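A minimal numpy sketch of the encoder side of Eq. (1), assuming a vanilla tanh RNN cell and the common choice $q(\{h_1, \ldots, h_{T_x}\}) = h_{T_x}$; the dimensions and random weights are purely illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                                 # illustrative dimensions
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))    # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))   # hidden-to-hidden weights

def encode(x_seq):
    """Vanilla RNN encoder: h_t = tanh(W_x x_t + W_h h_{t-1}); c = h_{T_x}."""
    h = np.zeros(d_hid)
    for x_t in x_seq:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h  # fixed-length summary c of the whole sentence

source = [rng.normal(size=d_in) for _ in range(6)]  # toy embedded source sentence
c = encode(source)
print(c.shape)  # (8,) -- all information must fit in this single vector
```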
Learning to align and translate
decoder
We define each conditional probability
in Eq. (2) as:
$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$,   (4)
where $s_i$ is an RNN hidden state for time $i$, computed by $s_i = f(s_{i-1}, y_{i-1}, c_i)$.
Here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.
The context vector $c_i$ depends on a sequence of annotations $(h_1, \ldots, h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence.
$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$   (5)
and
$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$,   (6)
where $e_{ij} = a(s_{i-1}, h_j)$
is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$, Eq. (4)) and the
$j$-th annotation $h_j$ of the input sentence.
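The numpy sketch below works through Eqs. (5)-(6) for one decoding step, using the feedforward alignment model $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ that the paper parameterizes jointly with the rest of the network; the sizes and random weights here are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, d_ann, d_align, T_x = 8, 16, 10, 6            # illustrative sizes

W_a = rng.normal(scale=0.1, size=(d_align, d_hid))    # projects decoder state s_{i-1}
U_a = rng.normal(scale=0.1, size=(d_align, d_ann))    # projects annotation h_j
v_a = rng.normal(scale=0.1, size=d_align)

def attention_context(s_prev, annotations):
    """One decoding step: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j),
    alpha_i = softmax(e_i), c_i = sum_j alpha_ij h_j."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                               # Eq. (6): normalized alignment weights
    c = sum(a * h_j for a, h_j in zip(alpha, annotations))  # Eq. (5): expected annotation
    return c, alpha

annotations = [rng.normal(size=d_ann) for _ in range(T_x)]  # toy encoder annotations h_1..h_Tx
s_prev = rng.normal(size=d_hid)                             # toy previous decoder state
c_i, alpha_i = attention_context(s_prev, annotations)
print(c_i.shape, alpha_i.round(3))  # context vector and soft alignment over source positions
```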
encoder
We would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN.
A BiRNN consists of a forward and a backward RNN. The forward RNN reads the input sequence as it is ordered (from $x_1$ to $x_{T_x}$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x})$.
The backward RNN reads the sequence in the reverse order (from $x_{T_x}$ to $x_1$), resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x})$.
We obtain the annotation for each word $x_j$ by concatenating the forward and backward hidden states, $h_j = [\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top]^\top$. In this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words.
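A small sketch of the bidirectional encoder, assuming plain tanh RNN cells for both directions (the paper uses gated hidden units); each annotation is the concatenation of the forward and backward hidden states at that position. All weights and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid = 4, 8
Wf_x = rng.normal(scale=0.1, size=(d_hid, d_in))   # forward RNN weights
Wf_h = rng.normal(scale=0.1, size=(d_hid, d_hid))
Wb_x = rng.normal(scale=0.1, size=(d_hid, d_in))   # backward RNN weights
Wb_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def run_rnn(x_seq, W_x, W_h):
    """Vanilla tanh RNN returning the hidden state at every position."""
    h, states = np.zeros(d_hid), []
    for x_t in x_seq:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return states

def annotate(x_seq):
    forward = run_rnn(x_seq, Wf_x, Wf_h)               # reads x_1 ... x_Tx
    backward = run_rnn(x_seq[::-1], Wb_x, Wb_h)[::-1]  # reads x_Tx ... x_1, then re-aligned
    # h_j = [forward_j ; backward_j]: summary of the words before and after position j
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

source = [rng.normal(size=d_in) for _ in range(6)]
annotations = annotate(source)
print(len(annotations), annotations[0].shape)  # 6 annotations, each of size 2 * d_hid
```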