Modelling Series
Infinite response systems

- Jordan Network
  - The memory unit simply retains a running average of past outputs
  - The memory has a fixed structure; it does not "learn" what to remember
- Elman Networks
  - Separate the memory state from the output
  - Only the weights from the memory (context) units to the hidden units are learned
  - During training no gradient is backpropagated over the "1" link; the context is just a cloned copy of the previous hidden state
- Problem
  - These are "simple" (or partially recurrent) networks because, during learning, the current error does not actually propagate to the past (see the sketch after this list)
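To make the last point concrete, here is a minimal NumPy sketch of the two memory styles above; the weight names, the running-average factor `alpha`, and the tanh activation are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def jordan_step(x, m, Wx, Wm, Wy, alpha=0.9):
    # Jordan-style unit: the memory m is a fixed running average of past outputs.
    h = np.tanh(Wx @ x + Wm @ m)
    y = Wy @ h
    m_new = alpha * m + (1 - alpha) * y   # fixed structure; the memory itself is not learned
    return y, m_new

def elman_step(x, context, Wx, Wc, Wy):
    # Elman-style unit: the context is a clone of the previous hidden state.
    # During training the clone is treated as a constant input, so no gradient
    # flows back over the "1" (copy) link into earlier time steps.
    h = np.tanh(Wx @ x + Wc @ context)
    y = Wy @ h
    return y, h.copy()                    # the cloned state becomes the next context
```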
State-space model
$h_t = f(x_t, h_{t-1}), \qquad y_t = g(h_t)$
- $h_t$ is the state of the network
- The model directly embeds the memory in the state
- The state summarizes information about the entire past
- This is the recurrent neural network (spelled out for a single hidden layer below)
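Written out for the single-hidden-layer network used in the BPTT derivation below (input weights $W^{(1)}$, recurrent weights $W^{(11)}$, output weights $W^{(2)}$, activations $f_1$ and $f_2$), the state equations become:

$$Z^{(1)}(t) = W^{(1)} X(t) + W^{(11)} h(t-1), \qquad h(t) = f_1\big(Z^{(1)}(t)\big)$$
$$Z^{(2)}(t) = W^{(2)} h(t), \qquad Y(t) = f_2\big(Z^{(2)}(t)\big)$$

Here $h(t)$ plays the role of the state $h_t$ and $Y(t)$ the role of the output $y_t$.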
Variants
- All columns (time steps of the unrolled network) are identical: the same parameters are used at every time
- The simplest structures are the most popular
Recurrent neural network

Forward pass
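A minimal NumPy sketch of the forward pass for this network, caching the quantities ($Z^{(1)}(t)$, $h(t)$, $Z^{(2)}(t)$, $Y(t)$) that the backward pass will need; the tanh and identity activations, the argument names, and the function name `rnn_forward` are assumptions for illustration.

```python
import numpy as np

def rnn_forward(X, h_init, W1, W11, W2, f1=np.tanh, f2=lambda z: z):
    """Forward pass of a single-hidden-layer RNN over a sequence.

    X is a sequence of input vectors X(0)..X(T); h_init is h(-1).
    Returns the cached pre-activations and activations needed by BPTT,
    with H[t + 1] holding h(t) and H[0] holding h(-1).
    """
    Z1, H, Z2, Y = [], [h_init], [], []
    for x_t in X:
        z1 = W1 @ x_t + W11 @ H[-1]   # Z(1)(t) = W(1) X(t) + W(11) h(t-1)
        h = f1(z1)                    # h(t)    = f1(Z(1)(t))
        z2 = W2 @ h                   # Z(2)(t) = W(2) h(t)
        y = f2(z2)                    # Y(t)    = f2(Z(2)(t))
        Z1.append(z1); H.append(h); Z2.append(z2); Y.append(y)
    return Z1, H, Z2, Y
```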

Backward pass
- BPTT: Back Propagation Through Time
  - Define a divergence between the actual and desired output sequences
  - Backpropagate gradients over the entire chain of recursion (backpropagation through time)
  - Pool gradients with respect to individual parameters over time

Notation
- The divergence is computed between the sequence of outputs produced by the network and the desired sequence of outputs
- $DIV$ is a scalar function of a series of vectors
  - In general this is not just the sum of the divergences at individual times (see the example below)
- $Y(t)$ is the output at time $t$
- $Y_i(t)$ is the $i$-th output
- $Z^{(2)}(t)$ is the pre-activation value of the output-layer neurons at time $t$
- $h(t)$ is the output of the hidden layer at time $t$
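For instance (an illustrative example, not from the slides), the divergence could compare a time-averaged output against a single target, in which case the derivative at every time step depends on the outputs at all time steps:

$$DIV = Div\big(\bar{Y}, d\big), \quad \bar{Y} = \frac{1}{T+1}\sum_{t=0}^{T} Y(t) \;\;\Rightarrow\;\; \frac{dDIV}{dY_i(t)} = \frac{1}{T+1}\,\frac{dDiv}{d\bar{Y}_i} \ \text{ for every } t$$

Only in the special case $DIV = \sum_t Div\big(Y(t), d(t)\big)$, treated below, does the derivative at time $t$ depend on $Y(t)$ alone.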
BPTT
- $Y(t)$ is a column vector
- $DIV$ is a scalar
- $\frac{dDIV}{dY(t)}$ is a row vector
Derivative at time T
- Compute $\frac{dDIV}{dY_i(T)}$ for all $i$
  - In general we will be required to compute $\frac{dDIV}{dY_i(t)}$ for all $i$ and $t$, as we will see; this can be a source of significant difficulty in many scenarios
  - Special case: when the overall divergence is a simple sum of local divergences at each time, $\frac{dDIV}{dY_i(t)} = \frac{dDiv(t)}{dY_i(t)}$
- Compute $\nabla_{Z^{(2)}(T)} DIV$
  - $\nabla_{Z^{(2)}(T)} DIV = \nabla_{Y(T)} DIV \; \nabla_{Z^{(2)}(T)} Y(T)$
  - For a scalar (elementwise) output activation: $\frac{dDIV}{dZ_i^{(2)}(T)} = \frac{dDIV}{dY_i(T)} \frac{dY_i(T)}{dZ_i^{(2)}(T)}$
  - For a vector output activation (e.g., softmax): $\frac{dDIV}{dZ_i^{(2)}(T)} = \sum_j \frac{dDIV}{dY_j(T)} \frac{dY_j(T)}{dZ_i^{(2)}(T)}$
- Compute $\nabla_{h(T)} DIV$
  - Since $Z^{(2)}(T) = W^{(2)} h(T)$:
  - $\frac{dDIV}{dh_i(T)} = \sum_j \frac{dDIV}{dZ_j^{(2)}(T)} \frac{dZ_j^{(2)}(T)}{dh_i(T)} = \sum_j w_{ij}^{(2)} \frac{dDIV}{dZ_j^{(2)}(T)}$
  - $\nabla_{h(T)} DIV = \nabla_{Z^{(2)}(T)} DIV \; W^{(2)}$
- Compute $\nabla_{W^{(2)}} DIV$
  - $\frac{dDIV}{dw_{ij}^{(2)}} = \frac{dDIV}{dZ_j^{(2)}(T)} \, h_i(T)$
  - $\nabla_{W^{(2)}} DIV = h(T) \; \nabla_{Z^{(2)}(T)} DIV$
- Compute $\nabla_{Z^{(1)}(T)} DIV$
  - $\frac{dDIV}{dZ_i^{(1)}(T)} = \frac{dDIV}{dh_i(T)} \frac{dh_i(T)}{dZ_i^{(1)}(T)}$
  - $\nabla_{Z^{(1)}(T)} DIV = \nabla_{h(T)} DIV \; \nabla_{Z^{(1)}(T)} h(T)$
- Compute $\nabla_{W^{(1)}} DIV$
  - Since $Z^{(1)}(T) = W^{(1)} X(T) + W^{(11)} h(T-1)$:
  - $\frac{dDIV}{dw_{ij}^{(1)}} = \frac{dDIV}{dZ_j^{(1)}(T)} \, X_i(T)$
  - $\nabla_{W^{(1)}} DIV = X(T) \; \nabla_{Z^{(1)}(T)} DIV$
- Compute $\nabla_{W^{(11)}} DIV$
  - $\frac{dDIV}{dw_{ij}^{(11)}} = \frac{dDIV}{dZ_j^{(1)}(T)} \, h_i(T-1)$
  - $\nabla_{W^{(11)}} DIV = h(T-1) \; \nabla_{Z^{(1)}(T)} DIV$
- All of the above steps for the single time instant $T$ are collected into the code sketch below
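A minimal NumPy sketch of the steps above for the single time instant $T$ (the general loop over all $t$ appears under "Algorithm" below). The function name, the elementwise activation-derivative arguments `f1_prime`/`f2_prime`, and the choice to store each weight gradient in the same shape as its weight matrix (the transpose of the outer products $h\,\nabla_{Z}DIV$ written above, which index weights as $w_{ij}$ = from unit $i$ to unit $j$) are assumptions for illustration.

```python
import numpy as np

def bptt_step_T(dDIV_dY_T, X_T, h_Tminus1, Z1_T, h_T, Z2_T, W2, f1_prime, f2_prime):
    """Gradients at the final time instant T (no contribution arrives from the future).

    All gradients with respect to vectors are row vectors, as in the convention above.
    """
    dZ2 = dDIV_dY_T * f2_prime(Z2_T)   # dDIV/dZ(2)(T), scalar (elementwise) output activation
    dh = dZ2 @ W2                      # dDIV/dh(T) = dDIV/dZ(2)(T) . W(2)
    dZ1 = dh * f1_prime(Z1_T)          # dDIV/dZ(1)(T)
    # Weight gradients, stored in the shape of the weight matrices:
    # entry [j, i] corresponds to the slides' w_ij (weight from unit i to unit j).
    dW2 = np.outer(dZ2, h_T)           # dDIV/dW(2)
    dW1 = np.outer(dZ1, X_T)           # dDIV/dW(1)
    dW11 = np.outer(dZ1, h_Tminus1)    # dDIV/dW(11)
    return dZ1, dW1, dW11, dW2
```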
Derivative at time T−1
- Compute $\nabla_{Z^{(2)}(T-1)} DIV$
  - $\nabla_{Z^{(2)}(T-1)} DIV = \nabla_{Y(T-1)} DIV \; \nabla_{Z^{(2)}(T-1)} Y(T-1)$
  - For a scalar (elementwise) output activation: $\frac{dDIV}{dZ_i^{(2)}(T-1)} = \frac{dDIV}{dY_i(T-1)} \frac{dY_i(T-1)}{dZ_i^{(2)}(T-1)}$
  - For a vector output activation: $\frac{dDIV}{dZ_i^{(2)}(T-1)} = \sum_j \frac{dDIV}{dY_j(T-1)} \frac{dY_j(T-1)}{dZ_i^{(2)}(T-1)}$
- Compute $\nabla_{h(T-1)} DIV$ (now with a second term that arrives from time $T$ through the recurrent weights)
  - $\frac{dDIV}{dh_i(T-1)} = \sum_j w_{ij}^{(2)} \frac{dDIV}{dZ_j^{(2)}(T-1)} + \sum_j w_{ij}^{(11)} \frac{dDIV}{dZ_j^{(1)}(T)}$
  - $\nabla_{h(T-1)} DIV = \nabla_{Z^{(2)}(T-1)} DIV \; W^{(2)} + \nabla_{Z^{(1)}(T)} DIV \; W^{(11)}$
- Compute $\nabla_{W^{(2)}} DIV$ (accumulate: the same weights act at every time step)
  - $\frac{dDIV}{dw_{ij}^{(2)}} \mathrel{+}= \frac{dDIV}{dZ_j^{(2)}(T-1)} \, h_i(T-1)$
  - $\nabla_{W^{(2)}} DIV \mathrel{+}= h(T-1) \; \nabla_{Z^{(2)}(T-1)} DIV$
- Compute $\nabla_{Z^{(1)}(T-1)} DIV$
  - $\frac{dDIV}{dZ_i^{(1)}(T-1)} = \frac{dDIV}{dh_i(T-1)} \frac{dh_i(T-1)}{dZ_i^{(1)}(T-1)}$
  - $\nabla_{Z^{(1)}(T-1)} DIV = \nabla_{h(T-1)} DIV \; \nabla_{Z^{(1)}(T-1)} h(T-1)$
- Compute $\nabla_{W^{(1)}} DIV$
  - $\frac{dDIV}{dw_{ij}^{(1)}} \mathrel{+}= \frac{dDIV}{dZ_j^{(1)}(T-1)} \, X_i(T-1)$
  - $\nabla_{W^{(1)}} DIV \mathrel{+}= X(T-1) \; \nabla_{Z^{(1)}(T-1)} DIV$
- Compute $\nabla_{W^{(11)}} DIV$
  - $\frac{dDIV}{dw_{ij}^{(11)}} \mathrel{+}= \frac{dDIV}{dZ_j^{(1)}(T-1)} \, h_i(T-2)$
  - $\nabla_{W^{(11)}} DIV \mathrel{+}= h(T-2) \; \nabla_{Z^{(1)}(T-1)} DIV$
Back Propagation Through Time
$\frac{dDIV}{dh_i(-1)} = \sum_j w_{ij}^{(11)} \frac{dDIV}{dZ_j^{(1)}(0)}$
$\frac{dDIV}{dh_i^{(k)}(t)} = \sum_j w_{i,j}^{(k+1)} \frac{dDIV}{dZ_j^{(k+1)}(t)} + \sum_j w_{i,j}^{(k,k)} \frac{dDIV}{dZ_j^{(k)}(t+1)}$
$\frac{dDIV}{dZ_i^{(k)}(t)} = \frac{dDIV}{dh_i^{(k)}(t)} \, f_k'\!\left(Z_i^{(k)}(t)\right)$
$\frac{dDIV}{dw_{ij}^{(1)}} = \sum_t \frac{dDIV}{dZ_j^{(1)}(t)} \, X_i(t)$
$\frac{dDIV}{dw_{ij}^{(11)}} = \sum_t \frac{dDIV}{dZ_j^{(1)}(t)} \, h_i(t-1)$
Algorithm
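A minimal NumPy sketch of the complete backward pass implied by the equations above, looping from $t = T$ down to $0$ and pooling the weight gradients over time. It assumes the caching convention of the hypothetical `rnn_forward` sketch given earlier (`H[t + 1]` holds $h(t)$, with `H[0]` = $h(-1)$); `dDIV_dY[t]` is the row vector $\frac{dDIV}{dY(t)}$, however the overall divergence is defined.

```python
import numpy as np

def bptt_backward(dDIV_dY, X, H, Z1, Z2, W1, W11, W2, f1_prime, f2_prime):
    """BPTT for the single-hidden-layer RNN: gradients pooled over all time steps."""
    T = len(X) - 1
    dW1, dW11, dW2 = np.zeros_like(W1), np.zeros_like(W11), np.zeros_like(W2)
    dZ1_next = np.zeros(W11.shape[0])        # dDIV/dZ(1)(t+1); zero beyond t = T

    for t in range(T, -1, -1):
        dZ2 = dDIV_dY[t] * f2_prime(Z2[t])   # dDIV/dZ(2)(t), elementwise output activation
        dh = dZ2 @ W2 + dZ1_next @ W11       # dDIV/dh(t): output path + recurrence from t+1
        dZ1 = dh * f1_prime(Z1[t])           # dDIV/dZ(1)(t)

        dW2 += np.outer(dZ2, H[t + 1])       # pool over time; uses h(t)
        dW1 += np.outer(dZ1, X[t])           # uses X(t)
        dW11 += np.outer(dZ1, H[t])          # uses h(t-1)
        dZ1_next = dZ1

    dh_init = dZ1_next @ W11                 # dDIV/dh(-1)
    return dW1, dW11, dW2, dh_init
```

Note how the recurrent term $\nabla_{Z^{(1)}(t+1)} DIV \; W^{(11)}$ enters $\frac{dDIV}{dh(t)}$ for every $t < T$, and how the weight gradients are pooled over time with `+=`, exactly as in the $T$ and $T-1$ steps worked out above.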

Bidirectional RNN

- Two independent RNNs: one processes the input sequence forward in time, the other backward
- Clearly, this is not an online process: it requires the entire input sequence before the outputs can be computed
- It is easy to learn the two RNNs independently
- Forward pass: compute both the forward and backward networks, then the final output
- Backpropagation: a basic backprop routine (BPTT, as above) that we call twice, once per direction, inside a higher-level wrapper (sketched below)
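Finally, a hedged sketch of the wrapper structure described above, reusing the hypothetical `rnn_forward` and `bptt_backward` routines from the earlier sketches. Combining the two directions by simply summing their per-time outputs is an assumption for illustration; concatenating the two hidden sequences into a shared output layer is another common choice.

```python
def birnn_forward(X, fwd, bwd):
    """fwd and bwd hold the parameters (h_init, W1, W11, W2) of each directional RNN."""
    # The forward-time net reads X(0)..X(T); the backward-time net reads X(T)..X(0).
    Z1f, Hf, Z2f, Yf = rnn_forward(X, fwd["h_init"], fwd["W1"], fwd["W11"], fwd["W2"])
    Z1b, Hb, Z2b, Yb = rnn_forward(X[::-1], bwd["h_init"], bwd["W1"], bwd["W11"], bwd["W2"])
    Yb = Yb[::-1]                               # re-align backward outputs with time
    Y = [yf + yb for yf, yb in zip(Yf, Yb)]     # combine the two directions (sum assumed)
    return Y, ((Z1f, Hf, Z2f), (Z1b, Hb, Z2b))

def birnn_backward(dDIV_dY, X, caches, fwd, bwd, f1_prime, f2_prime):
    """Two calls to the basic BPTT routine, one per direction, inside one wrapper."""
    (Z1f, Hf, Z2f), (Z1b, Hb, Z2b) = caches
    grads_fwd = bptt_backward(dDIV_dY, X, Hf, Z1f, Z2f,
                              fwd["W1"], fwd["W11"], fwd["W2"], f1_prime, f2_prime)
    grads_bwd = bptt_backward(dDIV_dY[::-1], X[::-1], Hb, Z1b, Z2b,
                              bwd["W1"], bwd["W11"], bwd["W2"], f1_prime, f2_prime)
    return grads_fwd, grads_bwd
```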