Reading Note: Gated Self-Matching Networks for Reading Comprehension and Question Answering
Abstract
The authors present gated self-matching networks for reading-comprehension-style question answering, which aims to answer questions from a given passage.
First, match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation.
Then, apply a self-matching attention mechanism to refine the representation by matching the passage against itself.
Finally, employ pointer networks to locate the answer positions in the passage.
Introduction
This model (R-Net) consists of four parts:
1. the recurrent network encoder (to build representations for the question and passage separately)
2. the gated matching layer (to match the question and passage)
3. the self-matching layer (to aggregate information from the whole passage)
4. the pointer network layer (to predict the answer boundary)
Three-fold key contributions:
1. propose a gated attention-based recurrent network, assigning different levels of importance to passage parts depending on their relevance to the question
2. introduce a self-matching mechanism, effectively aggregating evidence from the whole passage to infer the answer and dynamically refining passage representation with information from the whole passage
3. yield state-of-the-art results against strong baselines
Task Description
Given a passage $P$ and a question $Q$, predict an answer $A$ to question $Q$ based on information found in passage $P$.
Methods
Question and Passage Encoder
Consider a question $Q = \{w^Q_t\}_{t=1}^m$ and a passage $P = \{w^P_t\}_{t=1}^n$. First convert the words to word-level embeddings ($\{e^Q_t\}$ and $\{e^P_t\}$) and character-level embeddings ($\{c^Q_t\}$ and $\{c^P_t\}$); the character-level embedding of a token is generated by taking the final hidden states of a bi-directional recurrent neural network applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary tokens.
Then use a bi-directional RNN to produce the new representations $u^Q_t = \mathrm{BiRNN}_Q(u^Q_{t-1}, [e^Q_t, c^Q_t])$ and $u^P_t = \mathrm{BiRNN}_P(u^P_{t-1}, [e^P_t, c^P_t])$.
Here, the Gated Recurrent Unit (GRU) is used because it is computationally cheaper.
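Below is a minimal sketch of this encoder in PyTorch (an assumption; the paper does not tie the model to a framework). The class and parameter names, vocabulary sizes, and the 300-dimensional word embeddings are illustrative, not from the paper.

```python
# Minimal encoder sketch: word embeddings + character-level embeddings from the
# final states of a character bi-GRU, fed into a multi-layer bi-GRU.
# Assumes PyTorch; names and sizes are hypothetical.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, word_vocab=50000, char_vocab=100,
                 word_dim=300, char_dim=50, hidden=75):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)   # GloVe vectors, kept fixed in the paper
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # character-level token embedding = final hidden states of a bi-GRU over its characters
        self.char_rnn = nn.GRU(char_dim, hidden, bidirectional=True, batch_first=True)
        # 3-layer bi-GRU over the concatenated word- and character-level embeddings
        self.enc_rnn = nn.GRU(word_dim + 2 * hidden, hidden, num_layers=3,
                              bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, word_len)
        b, n, w = chars.shape
        e = self.word_emb(words)                              # (b, n, word_dim)
        _, h_char = self.char_rnn(self.char_emb(chars.view(b * n, w)))
        c = h_char.transpose(0, 1).reshape(b, n, -1)          # concat final fwd/bwd states
        u, _ = self.enc_rnn(torch.cat([e, c], dim=-1))        # u^Q_t or u^P_t: (b, n, 2*hidden)
        return u
```

The same encoder is applied to the question and the passage (with shared or separate weights; the sketch leaves that choice open).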
Gated Attention-based Recurrent Networks
Utilize a gated attention-based recurrent network (a variant of attention-based recurrent networks) to incorporate question information into passage representation.
Given $\{u^Q_t\}_{t=1}^m$ and $\{u^P_t\}_{t=1}^n$, generate the question-aware passage representation $\{v^P_t\}_{t=1}^n$ via soft alignment of the words in the question and passage:

$$v^P_t = \mathrm{RNN}(v^P_{t-1}, [u^P_t, c_t]^{*})$$

where $g_t = \mathrm{sigmoid}(W_g [u^P_t, c_t])$ is another gate applied to the input $[u^P_t, c_t]$ of the RNN:

$$[u^P_t, c_t]^{*} = g_t \odot [u^P_t, c_t]$$

and $c_t = \mathrm{att}(u^Q, [u^P_t, v^P_{t-1}])$ is an attention-pooling vector of the whole question which focuses on the relation between the question and the current passage word:

$$s^t_j = v^{\top} \tanh(W^Q_u u^Q_j + W^P_u u^P_t + W^P_v v^P_{t-1})$$
$$a^t_i = \exp(s^t_i) \Big/ \sum_{j=1}^{m} \exp(s^t_j)$$
$$c_t = \sum_{i=1}^{m} a^t_i u^Q_i$$

where the vector $v$ and all matrices $W$ contain weights to be learned.
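The following sketch implements one pass of this gated attention-based recurrent layer in PyTorch (an assumption). It is unidirectional for brevity, although the implementation details below use a bi-directional version; class and parameter names and dimensions are hypothetical.

```python
# Gated attention-based recurrent layer (unidirectional sketch):
# each passage word attends over the question, the attended vector is
# concatenated to the word, gated, and fed into a GRU cell.
import torch
import torch.nn as nn

class GatedAttnRNN(nn.Module):
    def __init__(self, hidden=75):
        super().__init__()
        d = 2 * hidden                                    # encoder outputs are bi-directional
        self.W_uQ = nn.Linear(d, hidden, bias=False)
        self.W_uP = nn.Linear(d, hidden, bias=False)
        self.W_vP = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * d, 2 * d, bias=False)   # W_g
        self.cell = nn.GRUCell(2 * d, hidden)

    def forward(self, uQ, uP):
        # uQ: (b, m, 2h) question encoding; uP: (b, n, 2h) passage encoding
        b, n, _ = uP.shape
        vP_prev = uP.new_zeros(b, self.cell.hidden_size)
        outputs = []
        for t in range(n):
            # attention over the whole question for passage word t
            s = self.v(torch.tanh(self.W_uQ(uQ)
                                  + self.W_uP(uP[:, t]).unsqueeze(1)
                                  + self.W_vP(vP_prev).unsqueeze(1))).squeeze(-1)
            a = torch.softmax(s, dim=-1)                  # a^t over question words
            c = torch.bmm(a.unsqueeze(1), uQ).squeeze(1)  # c_t: attention pooling of the question
            x = torch.cat([uP[:, t], c], dim=-1)          # [u^P_t, c_t]
            x = torch.sigmoid(self.gate(x)) * x           # gated input [u^P_t, c_t]*
            vP_prev = self.cell(x, vP_prev)               # v^P_t
            outputs.append(vP_prev)
        return torch.stack(outputs, dim=1)                # (b, n, hidden)
```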
Self-matching Attention
Self-matching attention addresses the problem that the question-aware passage representation $\{v^P_t\}$ has very limited knowledge of context. It dynamically
1. collects evidence from the whole passage for words in the passage
2. encodes the evidence relevant to the current passage word and its matching question information into the passage representation $h^P_t$:

$$h^P_t = \mathrm{BiRNN}(h^P_{t-1}, [v^P_t, c_t]^{*})$$
where $g_t = \mathrm{sigmoid}(W_g [v^P_t, c_t])$ is another gate applied to the input $[v^P_t, c_t]$ of the RNN, i.e. $[v^P_t, c_t]^{*} = g_t \odot [v^P_t, c_t]$,
and $c_t = \mathrm{att}(v^P, v^P_t)$ is an attention-pooling vector of the whole passage which focuses on the relation between the current passage word and the rest of the passage:

$$s^t_j = v^{\top} \tanh(W^P_v v^P_j + W^{\tilde P}_v v^P_t)$$
$$a^t_i = \exp(s^t_i) \Big/ \sum_{j=1}^{n} \exp(s^t_j)$$
$$c_t = \sum_{i=1}^{n} a^t_i v^P_i$$

where the vector $v$ and all matrices $W$ contain weights to be learned.
After the self-matching layer, the authors use a bi-directional GRU to deeply integrate the matching results before feeding them into the answer pointer layer. This helps to further propagate the information aggregated by self-matching across the passage.
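A compact sketch of this self-matching layer is shown below (PyTorch assumed). For brevity it computes the passage-to-passage attention for all positions in parallel and lets a single bi-directional GRU play the role of both the matching RNN and the aggregation layer, which slightly simplifies the paper's step-by-step formulation; names and dimensions are hypothetical.

```python
# Self-matching attention sketch: every passage position attends over the whole
# passage, the concatenated input is gated, and a bi-GRU aggregates the result.
import torch
import torch.nn as nn

class SelfMatching(nn.Module):
    def __init__(self, hidden=75):
        super().__init__()
        self.W_vP = nn.Linear(hidden, hidden, bias=False)    # keys (whole passage)
        self.W_vPt = nn.Linear(hidden, hidden, bias=False)   # query (current word)
        self.v = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * hidden, 2 * hidden, bias=False)
        self.agg = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, vP):
        # vP: (b, n, hidden) question-aware passage representation
        s = self.v(torch.tanh(self.W_vP(vP).unsqueeze(1)      # (b, 1, n, hidden)
                              + self.W_vPt(vP).unsqueeze(2))  # (b, n, 1, hidden)
                   ).squeeze(-1)                              # (b, n, n) matching scores
        a = torch.softmax(s, dim=-1)
        c = torch.bmm(a, vP)                                  # c_t for every position
        x = torch.cat([vP, c], dim=-1)                        # [v^P_t, c_t]
        x = torch.sigmoid(self.gate(x)) * x                   # gated input
        h, _ = self.agg(x)                                    # bi-GRU integration
        return h                                              # h^P_t: (b, n, 2*hidden)
```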
Output Layer
Use attention pooling over the question representation to generate the initial hidden vector for the pointer network, which predicts the start and end positions of the answer.
Given the passage representation $\{h^P_t\}_{t=1}^n$, the attention mechanism is utilized as a pointer to select the start position $p^1$ and end position $p^2$ from the passage:

$$s^t_j = v^{\top} \tanh(W^P_h h^P_j + W^a_h h^a_{t-1})$$
$$a^t_i = \exp(s^t_i) \Big/ \sum_{j=1}^{n} \exp(s^t_j)$$
$$p^t = \arg\max(a^t_1, \dots, a^t_n)$$

where $h^a_{t-1}$ represents the last hidden state of the pointer network (the answer recurrent network),
and the input of the pointer network is the attention-pooling vector $c_t$ based on the current predicted probability $a^t$:

$$c_t = \sum_{i=1}^{n} a^t_i h^P_i, \qquad h^a_t = \mathrm{RNN}(h^a_{t-1}, c_t)$$

When predicting the start position, the authors utilize the question vector $r^Q$ as the initial state of the pointer network, where $r^Q = \mathrm{att}(u^Q, V^Q_r)$ is an attention-pooling vector of the question based on the parameter $V^Q_r$:

$$s_j = v^{\top} \tanh(W^Q_u u^Q_j + W^Q_v V^Q_r)$$
$$a_i = \exp(s_i) \Big/ \sum_{j=1}^{m} \exp(s_j)$$
$$r^Q = \sum_{i=1}^{m} a_i u^Q_i$$
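A sketch of this output layer in PyTorch (an assumption) follows; it derives the initial state from question attention pooling and then runs two attention steps to produce the start and end distributions. Names and dimensions are hypothetical.

```python
# Pointer-network output layer sketch: attention pooling of the question gives
# the initial state r^Q, then two attention steps over the passage predict the
# start and end positions.
import torch
import torch.nn as nn

class PointerNet(nn.Module):
    def __init__(self, hidden=75):
        super().__init__()
        d = 2 * hidden
        self.V_rQ = nn.Parameter(torch.randn(1, 1, hidden))  # parameter V^Q_r
        self.W_uQ = nn.Linear(d, hidden, bias=False)
        self.W_vQ = nn.Linear(hidden, hidden, bias=False)
        self.W_hP = nn.Linear(d, hidden, bias=False)
        self.W_ha = nn.Linear(d, hidden, bias=False)
        self.v_q = nn.Linear(hidden, 1, bias=False)
        self.v_p = nn.Linear(hidden, 1, bias=False)
        self.cell = nn.GRUCell(d, d)                          # answer recurrent network

    def forward(self, uQ, hP):
        # uQ: (b, m, 2h) question encoding; hP: (b, n, 2h) final passage representation
        s = self.v_q(torch.tanh(self.W_uQ(uQ) + self.W_vQ(self.V_rQ))).squeeze(-1)
        rQ = torch.bmm(torch.softmax(s, dim=-1).unsqueeze(1), uQ).squeeze(1)
        h_a, probs = rQ, []                                   # r^Q is the initial state
        for _ in range(2):                                    # step 1: start, step 2: end
            s = self.v_p(torch.tanh(self.W_hP(hP)
                                    + self.W_ha(h_a).unsqueeze(1))).squeeze(-1)
            a = torch.softmax(s, dim=-1)                      # a^t over passage positions
            probs.append(a)
            c = torch.bmm(a.unsqueeze(1), hP).squeeze(1)      # c_t: attention pooling
            h_a = self.cell(c, h_a)                           # h^a_t
        return probs[0], probs[1]                             # p(start), p(end)
```

At prediction time the start and end positions are taken as the argmax of the two distributions, matching the $\arg\max$ in the equations above.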
Objective Function
To train the network, minimize the sum of the negative log probabilities of the ground-truth start and end positions under the predicted distributions:

$$\mathcal{L} = -\sum_{t=1}^{2} \log a^t_{y^t}$$

where $y^t$ denotes the ground-truth start ($t=1$) or end ($t=2$) position of the answer.
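As a small illustration, this objective can be computed from the two pointer distributions like this (PyTorch assumed; tensor names are hypothetical):

```python
# Sum of negative log probabilities of the gold start/end positions
# under the predicted start/end distributions, averaged over the batch.
import torch

def span_loss(p_start, p_end, y_start, y_end, eps=1e-12):
    # p_start, p_end: (batch, n) distributions from the pointer network
    # y_start, y_end: (batch,) gold start/end token indices (int64)
    nll_start = -torch.log(p_start.gather(1, y_start.unsqueeze(1)).squeeze(1) + eps)
    nll_end = -torch.log(p_end.gather(1, y_end.unsqueeze(1)).squeeze(1) + eps)
    return (nll_start + nll_end).mean()
```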
Implementation Details
- Use the tokenizer from Stanford CoreNLP to preprocess each passage and question
- Use the Gated Recurrent Unit
- Use pre-trained GloVe embeddings for questions and passages and keep them fixed during training
- Use zero vectors to represent all out-of-vocabulary words
- Use 1 layer of bi-directional GRU to compute character-level embeddings and 3 layers of bi-directional GRU to encode questions and passages
- Use bi-directional gated attention-based recurrent network
- Set hidden vector length to 75 for all layers
- Set hidden size to 75 for attention scores
- Set dropout rate to 0.2
- Use AdaDelta (an initial learning rate of 1, a decay rate $\rho$ of 0.95, and a constant $\epsilon$ of $10^{-6}$); see the sketch after this list
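The settings above can be collected into a small training configuration; the sketch below (PyTorch assumed) uses a placeholder module in place of the full model, and the AdaDelta arguments mirror the values listed in this section.

```python
# Hyperparameters from the implementation details, plus the AdaDelta optimizer.
# `model` is a placeholder standing in for the assembled R-Net modules.
import torch
import torch.nn as nn

config = {
    "hidden_size": 75,   # hidden vector length for all layers
    "attn_size": 75,     # hidden size for attention scores
    "dropout": 0.2,
}

model = nn.GRU(300, config["hidden_size"], bidirectional=True)  # placeholder
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.95, eps=1e-6)
```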