Reading Note: Gated Self-Matching Networks for Reading Comprehension and Question Answering
Abstract
The authors present gated self-matching networks for reading-comprehension-style question answering, which aims to answer questions from a given passage.
First, match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation.
Then, apply a self-matching attention mechanism to refine the representation by matching the passage against itself.
Finally, employ pointer networks to locate the answer positions in the passage.
Introduction
This model (R-Net) consists of four parts:
1. the recurrent network encoder (to build representations for the question and passage separately)
2. the gated matching layer (to match the question and passage)
3. the self-matching layer (to aggregate information from the whole passage)
4. the pointer network layer (to predict the answer boundary)
Three-fold key contributions:
1. propose a gated attention-based recurrent network, assigning different levels of importance to passage parts depending on their relevance to the question
2. introduce a self-matching mechanism, effectively aggregating evidence from the whole passage to infer the answer and dynamically refining passage representation with information from the whole passage
3. yield state-of-the-art results against strong baselines
Task Description
Given a passage $P$ and a question $Q$, predict an answer $A$ to question $Q$ based on information found in passage $P$.
Methods
Question and Passage Encoder
Consider a question $Q = \{w^Q_t\}_{t=1}^m$ and a passage $P = \{w^P_t\}_{t=1}^n$. First convert the words to word-level embeddings ($\{e^Q_t\}$ and $\{e^P_t\}$) and character-level embeddings ($\{c^Q_t\}$ and $\{c^P_t\}$); the character-level embedding of a token is generated by taking the final hidden states of a bi-directional recurrent neural network applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary tokens.
Then use a bi-directional RNN to produce the new representations $u^Q_t = \mathrm{BiRNN}_Q(u^Q_{t-1}, [e^Q_t, c^Q_t])$ and $u^P_t = \mathrm{BiRNN}_P(u^P_{t-1}, [e^P_t, c^P_t])$.
Here, the Gated Recurrent Unit (GRU) is used because it is computationally cheaper.
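Below is a minimal sketch of this encoder in PyTorch (an assumption; the paper does not tie the model to a framework). The class and parameter names, vocabulary sizes, and the 300-dimensional word embeddings are illustrative, not from the paper.

```python
# Minimal encoder sketch: word embeddings + character-level embeddings from the
# final states of a character bi-GRU, fed into a multi-layer bi-GRU.
# Assumes PyTorch; names and sizes are hypothetical.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, word_vocab=50000, char_vocab=100,
                 word_dim=300, char_dim=50, hidden=75):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)   # GloVe vectors, kept fixed in the paper
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # character-level token embedding = final hidden states of a bi-GRU over its characters
        self.char_rnn = nn.GRU(char_dim, hidden, bidirectional=True, batch_first=True)
        # 3-layer bi-GRU over the concatenated word- and character-level embeddings
        self.enc_rnn = nn.GRU(word_dim + 2 * hidden, hidden, num_layers=3,
                              bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, word_len)
        b, n, w = chars.shape
        e = self.word_emb(words)                              # (b, n, word_dim)
        _, h_char = self.char_rnn(self.char_emb(chars.view(b * n, w)))
        c = h_char.transpose(0, 1).reshape(b, n, -1)          # concat final fwd/bwd states
        u, _ = self.enc_rnn(torch.cat([e, c], dim=-1))        # u^Q_t or u^P_t: (b, n, 2*hidden)
        return u
```

The same encoder is applied to the question and the passage (with shared or separate weights; the sketch leaves that choice open).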
Gated Attention-based Recurrent Networks
Utilize a gated attention-based recurrent network (a variant of attention-based recurrent networks) to incorporate question information into passage representation.
Given $\{u^Q_t\}_{t=1}^m$ and $\{u^P_t\}_{t=1}^n$, generate the question-aware passage representation $\{v^P_t\}_{t=1}^n$ via soft alignment of the words in the question and passage:

$$v^P_t = \mathrm{RNN}(v^P_{t-1}, [u^P_t, c_t]^{*})$$

where $g_t = \mathrm{sigmoid}(W_g [u^P_t, c_t])$ is another gate applied to the input $[u^P_t, c_t]$ of the RNN:

$$[u^P_t, c_t]^{*} = g_t \odot [u^P_t, c_t]$$

and $c_t = \mathrm{att}(u^Q, [u^P_t, v^P_{t-1}])$ is an attention-pooling vector of the whole question which focuses on the relation between the question and the current passage word:

$$s^t_j = v^{\top} \tanh(W^Q_u u^Q_j + W^P_u u^P_t + W^P_v v^P_{t-1})$$
$$a^t_i = \exp(s^t_i) \Big/ \sum_{j=1}^{m} \exp(s^t_j)$$
$$c_t = \sum_{i=1}^{m} a^t_i u^Q_i$$

where the vector $v$ and all matrices $W$ contain weights to be learned.
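The following sketch implements one pass of this gated attention-based recurrent layer in PyTorch (an assumption). It is unidirectional for brevity, although the implementation details below use a bi-directional version; class and parameter names and dimensions are hypothetical.

```python
# Gated attention-based recurrent layer (unidirectional sketch):
# each passage word attends over the question, the attended vector is
# concatenated to the word, gated, and fed into a GRU cell.
import torch
import torch.nn as nn

class GatedAttnRNN(nn.Module):
    def __init__(self, hidden=75):
        super().__init__()
        d = 2 * hidden                                    # encoder outputs are bi-directional
        self.W_uQ = nn.Linear(d, hidden, bias=False)
        self.W_uP = nn.Linear(d, hidden, bias=False)
        self.W_vP = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * d, 2 * d, bias=False)   # W_g
        self.cell = nn.GRUCell(2 * d, hidden)

    def forward(self, uQ, uP):
        # uQ: (b, m, 2h) question encoding; uP: (b, n, 2h) passage encoding
        b, n, _ = uP.shape
        vP_prev = uP.new_zeros(b, self.cell.hidden_size)
        outputs = []
        for t in range(n):
            # attention over the whole question for passage word t
            s = self.v(torch.tanh(self.W_uQ(uQ)
                                  + self.W_uP(uP[:, t]).unsqueeze(1)
                                  + self.W_vP(vP_prev).unsqueeze(1))).squeeze(-1)
            a = torch.softmax(s, dim=-1)                  # a^t over question words
            c = torch.bmm(a.unsqueeze(1), uQ).squeeze(1)  # c_t: attention pooling of the question
            x = torch.cat([uP[:, t], c], dim=-1)          # [u^P_t, c_t]
            x = torch.sigmoid(self.gate(x)) * x           # gated input [u^P_t, c_t]*
            vP_prev = self.cell(x, vP_prev)               # v^P_t
            outputs.append(vP_prev)
        return torch.stack(outputs, dim=1)                # (b, n, hidden)
```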
Self-matching Attention
Self-matching attention addresses the problem that the question-aware passage representation $\{v^P_t\}$ has very limited knowledge of context. It dynamically
1. collects evidence from the whole passage for words in the passage
2. encodes the evidence relevant to the current passage word and its matching question information into the passage representation $h^P_t$:

$$h^P_t = \mathrm{BiRNN}(h^P_{t-1}, [v^P_t, c_t]^{*})$$
where $g_t = \mathrm{sigmoid}(W_g [v^P_t, c_t])$ is another gate applied to the input $[v^P_t, c_t]$ of the RNN, i.e. $[v^P_t, c_t]^{*} = g_t \odot [v^P_t, c_t]$,
and $c_t = \mathrm{att}(v^P, v^P_t)$ is an attention-pooling vector of the whole passage which focuses on the relation between the current passage word and the rest of the passage:

$$s^t_j = v^{\top} \tanh(W^P_v v^P_j + W^{\tilde P}_v v^P_t)$$
$$a^t_i = \exp(s^t_i) \Big/ \sum_{j=1}^{n} \exp(s^t_j)$$
$$c_t = \sum_{i=1}^{n} a^t_i v^P_i$$

where the vector $v$ and all matrices $W$ contain weights to be learned.
After the self-matching layer, the authors use a bi-directional GRU to deeply integrate the matching results before feeding them into the answer pointer layer. This helps to further propagate the information aggregated by self-matching across the passage.
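A compact sketch of this self-matching layer is shown below (PyTorch assumed). For brevity it computes the passage-to-passage attention for all positions in parallel and lets a single bi-directional GRU play the role of both the matching RNN and the aggregation layer, which slightly simplifies the paper's step-by-step formulation; names and dimensions are hypothetical.

```python
# Self-matching attention sketch: every passage position attends over the whole
# passage, the concatenated input is gated, and a bi-GRU aggregates the result.
import torch
import torch.nn as nn

class SelfMatching(nn.Module):
    def __init__(self, hidden=75):
        super().__init__()
        self.W_vP = nn.Linear(hidden, hidden, bias=False)    # keys (whole passage)
        self.W_vPt = nn.Linear(hidden, hidden, bias=False)   # query (current word)
        self.v = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * hidden, 2 * hidden, bias=False)
        self.agg = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, vP):
        # vP: (b, n, hidden) question-aware passage representation
        s = self.v(torch.tanh(self.W_vP(vP).unsqueeze(1)      # (b, 1, n, hidden)
                              + self.W_vPt(vP).unsqueeze(2))  # (b, n, 1, hidden)
                   ).squeeze(-1)                              # (b, n, n) matching scores
        a = torch.softmax(s, dim=-1)
        c = torch.bmm(a, vP)                                  # c_t for every position
        x = torch.cat([vP, c], dim=-1)                        # [v^P_t, c_t]
        x = torch.sigmoid(self.gate(x)) * x                   # gated input
        h, _ = self.agg(x)                                    # bi-GRU integration
        return h                                              # h^P_t: (b, n, 2*hidden)
```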
Output Layer
Use attention pooling over the question representation to generate the initial hidden vector for the pointer network, which predicts the start and end positions of the answer.
Given the passage representation $\{h^P_t\}_{t=1}^n$, the attention mechanism is utilized as a pointer to select the start position $p^1$ and end position $p^2$ from the passage:

$$s^t_j = v^{\top} \tanh(W^P_h h^P_j + W^a_h h^a_{t-1})$$
$$a^t_i = \exp(s^t_i) \Big/ \sum_{j=1}^{n} \exp(s^t_j)$$
$$p^t = \arg\max(a^t_1, \dots, a^t_n)$$

where $h^a_{t-1}$ represents the last hidden state of the pointer network (the answer recurrent network),
and the input of the pointer network is the attention-pooling vector $c_t$ based on the current predicted probability $a^t$:

$$c_t = \sum_{i=1}^{n} a^t_i h^P_i, \qquad h^a_t = \mathrm{RNN}(h^a_{t-1}, c_t)$$

When predicting the start position, the authors utilize the question vector $r^Q$ as the initial state of the pointer network, where $r^Q = \mathrm{att}(u^Q, V^Q_r)$ is an attention-pooling vector of the question based on the parameter $V^Q_r$:

$$s_j = v^{\top} \tanh(W^Q_u u^Q_j + W^Q_v V^Q_r)$$
$$a_i = \exp(s_i) \Big/ \sum_{j=1}^{m} \exp(s_j)$$
$$r^Q = \sum_{i=1}^{m} a_i u^Q_i$$
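A sketch of this output layer in PyTorch (an assumption) follows; it derives the initial state from question attention pooling and then runs two attention steps to produce the start and end distributions. Names and dimensions are hypothetical.

```python
# Pointer-network output layer sketch: attention pooling of the question gives
# the initial state r^Q, then two attention steps over the passage predict the
# start and end positions.
import torch
import torch.nn as nn

class PointerNet(nn.Module):
    def __init__(self, hidden=75):
        super().__init__()
        d = 2 * hidden
        self.V_rQ = nn.Parameter(torch.randn(1, 1, hidden))  # parameter V^Q_r
        self.W_uQ = nn.Linear(d, hidden, bias=False)
        self.W_vQ = nn.Linear(hidden, hidden, bias=False)
        self.W_hP = nn.Linear(d, hidden, bias=False)
        self.W_ha = nn.Linear(d, hidden, bias=False)
        self.v_q = nn.Linear(hidden, 1, bias=False)
        self.v_p = nn.Linear(hidden, 1, bias=False)
        self.cell = nn.GRUCell(d, d)                          # answer recurrent network

    def forward(self, uQ, hP):
        # uQ: (b, m, 2h) question encoding; hP: (b, n, 2h) final passage representation
        s = self.v_q(torch.tanh(self.W_uQ(uQ) + self.W_vQ(self.V_rQ))).squeeze(-1)
        rQ = torch.bmm(torch.softmax(s, dim=-1).unsqueeze(1), uQ).squeeze(1)
        h_a, probs = rQ, []                                   # r^Q is the initial state
        for _ in range(2):                                    # step 1: start, step 2: end
            s = self.v_p(torch.tanh(self.W_hP(hP)
                                    + self.W_ha(h_a).unsqueeze(1))).squeeze(-1)
            a = torch.softmax(s, dim=-1)                      # a^t over passage positions
            probs.append(a)
            c = torch.bmm(a.unsqueeze(1), hP).squeeze(1)      # c_t: attention pooling
            h_a = self.cell(c, h_a)                           # h^a_t
        return probs[0], probs[1]                             # p(start), p(end)
```

At prediction time the start and end positions are taken as the argmax of the two distributions, matching the $\arg\max$ in the equations above.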
Objective Function
To train the network, minimize the sum of the negative log probabilities of the ground-truth start and end positions under the predicted distributions:

$$\mathcal{L} = -\sum_{t=1}^{2} \log a^t_{y^t}$$

where $y^t$ denotes the ground-truth start ($t=1$) or end ($t=2$) position of the answer.
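As a small illustration, this objective can be computed from the two pointer distributions like this (PyTorch assumed; tensor names are hypothetical):

```python
# Sum of negative log probabilities of the gold start/end positions
# under the predicted start/end distributions, averaged over the batch.
import torch

def span_loss(p_start, p_end, y_start, y_end, eps=1e-12):
    # p_start, p_end: (batch, n) distributions from the pointer network
    # y_start, y_end: (batch,) gold start/end token indices (int64)
    nll_start = -torch.log(p_start.gather(1, y_start.unsqueeze(1)).squeeze(1) + eps)
    nll_end = -torch.log(p_end.gather(1, y_end.unsqueeze(1)).squeeze(1) + eps)
    return (nll_start + nll_end).mean()
```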
Implementation Details
- Use the tokenizer from Stanford CoreNLP to preprocess each passage and question
- Use the Gated Recurrent Unit
- Use pre-trained GloVe embeddings for questions and passages and keep them fixed during training
- Use zero vectors to represent all out-of-vocabulary words
- Use 1 layer of bi-directional GRU to compute character-level embeddings and 3 layers of bi-directional GRU to encode questions and passages
- Use bi-directional gated attention-based recurrent network
- Set hidden vector length to 75 for all layers
- Set hidden size to 75 for attention scores
- Set dropout rate to 0.2
- Use AdaDelta (an initial learning rate of 1, a decay rate $\rho$ of 0.95, and a constant $\epsilon$ of $10^{-6}$); see the sketch after this list
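The settings above can be collected into a small training configuration; the sketch below (PyTorch assumed) uses a placeholder module in place of the full model, and the AdaDelta arguments mirror the values listed in this section.

```python
# Hyperparameters from the implementation details, plus the AdaDelta optimizer.
# `model` is a placeholder standing in for the assembled R-Net modules.
import torch
import torch.nn as nn

config = {
    "hidden_size": 75,   # hidden vector length for all layers
    "attn_size": 75,     # hidden size for attention scores
    "dropout": 0.2,
}

model = nn.GRU(300, config["hidden_size"], bidirectional=True)  # placeholder
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.95, eps=1e-6)
```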