Chapter 7 n-step Bootstrapping

The core idea is to take several more steps forward before bootstrapping.

7.1 n-step TD Prediction

Extending temporal-difference learning over n steps gives what are called n-step TD methods.

n-step returns

G_{t : t + n} ≐ R_{t + 1} + γ R_{t + 2} + \dots + γ^{n - 1} R_{t + n} + γ^{n} V_{t + n - 1} (S_{t + n})

where $V_{t} : \mathcal{S} \to \mathbb{R}$ is the estimate of $v_{π}$ at time t.

Because the return looks n steps ahead, the update can only be made once $R_{t + n}$ has been observed and $V_{t + n - 1}$ has been computed:

V_{t + n} (S_{t}) ≐ V_{t + n - 1} (S_{t}) + α [G_{t : t + n} - V_{t + n - 1} (S_{t})], 0 \leq t < T
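The return and the update above can be sketched in a few lines of Python. This is a minimal illustration, not the book's pseudocode: the function names, the dictionary-based value table, and the example numbers are all assumptions made for this sketch.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G_{t:t+n} = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).

    rewards is [R_{t+1}, ..., R_{t+n}]; bootstrap_value is V_{t+n-1}(S_{t+n}),
    or 0.0 if the episode terminated within the n steps.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value


def n_step_td_update(V, state, rewards, next_state, gamma, alpha):
    """Apply V(S_t) <- V(S_t) + alpha * [G_{t:t+n} - V(S_t)] in place."""
    bootstrap = V[next_state] if next_state is not None else 0.0
    g = n_step_return(rewards, bootstrap, gamma)
    V[state] += alpha * (g - V[state])
    return g


# Hypothetical 2-step example: start in "A", observe rewards 1 and 0, land in "C".
V = {"A": 0.0, "B": 0.0, "C": 1.0}
g = n_step_td_update(V, "A", rewards=[1.0, 0.0], next_state="C",
                     gamma=0.5, alpha=0.1)
```

Note that the update for `S_t` needs the reward list to be complete, which is why it runs n steps after the state was visited.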


error reduction property of n-step returns
the worst error of the expected n-step return is guaranteed to be less than or equal to $γ^{n}$ times the worst error under $V_{t + n - 1}$ :

max_{s} | E_{π} [G_{t : t + n} | S_{t} = s] - v_{π} (s) | \leq γ^{n} max_{s} | V_{t + n - 1} (s) - v_{π} (s) |

This shows that all n-step TD methods converge to the correct predictions under appropriate technical conditions.
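As a small numerical check of the error reduction property, the sketch below uses a hypothetical two-state Markov reward process. The transition matrix `P`, rewards `R`, and starting estimate `V` are made-up illustration values, with the policy already folded into `P`. It relies on the fact that the expected n-step return equals n applications of the expected one-step backup to V.

```python
def bellman_backup(V, P, R, gamma):
    """Expected one-step backup: R[s] + gamma * sum_s' P[s][s'] * V[s']."""
    n_states = len(V)
    return [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n_states))
            for s in range(n_states)]


def expected_n_step_return(V, P, R, gamma, n):
    """E[G_{t:t+n} | S_t = s] for every s, as n backups applied to V."""
    out = list(V)
    for _ in range(n):
        out = bellman_backup(out, P, R, gamma)
    return out


def max_error(V, v_pi):
    """Worst-case absolute error over states."""
    return max(abs(a - b) for a, b in zip(V, v_pi))


P = [[0.5, 0.5], [0.2, 0.8]]   # hypothetical transition probabilities
R = [1.0, 0.0]                 # hypothetical expected rewards
gamma, n = 0.9, 3

# Approximate v_pi by iterating the backup until it has converged.
v_pi = expected_n_step_return([0.0, 0.0], P, R, gamma, 2000)

V = [5.0, -3.0]                # an arbitrary, deliberately poor estimate
lhs = max_error(expected_n_step_return(V, P, R, gamma, n), v_pi)
rhs = gamma ** n * max_error(V, v_pi)
```

Here `lhs` is the left side of the inequality and `rhs` the right side; larger n shrinks the bound geometrically.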

7.2 n-step Sarsa

Compared with the Sarsa introduced earlier, the only change is that G becomes an n-step return:

G_{t : t + n} ≐ R_{t + 1} + γ R_{t + 2} + \dots + γ^{n - 1} R_{t + n} + γ^{n} Q_{t + n - 1} (S_{t + n}, A_{t + n}), n \geq 1, 0 \leq t < T - n

The update rule is essentially unchanged:

Q_{t + n} (S_{t}, A_{t}) ≐ Q_{t + n - 1} (S_{t}, A_{t}) + α [G_{t : t + n} - Q_{t + n - 1} (S_{t}, A_{t})], 0 \leq t < T

The n-step Expected Sarsa shown in the figure above is like n-step Sarsa, except that the last term of the return is different:

G_{t : t + n} ≐ R_{t + 1} + \dots + γ^{n - 1} R_{t + n} + γ^{n} {\bar{V}}_{t + n - 1} (S_{t + n}), t + n < T,

with

G_{t : t + n} ≐ G_{t} for t + n \geq T,

where ${\bar{V}}_{t} (s)$ is the expected approximate value of state s under the target policy:

{\bar{V}}_{t} (s) ≐ \sum_{a} π (a | s) Q_{t} (s, a), for all s \in S
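To make the difference concrete, the sketch below computes both targets from the same rewards; only the bootstrap term at the end changes. The helper names, action labels, and numbers are hypothetical and chosen only for illustration.

```python
def expected_value(pi_s, q_s):
    """V-bar(s) = sum_a pi(a|s) * Q(s, a) for a single state s."""
    return sum(pi_s[a] * q_s[a] for a in q_s)


def n_step_target(rewards, tail_value, gamma):
    """R_{t+1} + gamma R_{t+2} + ... + gamma^n * tail_value, built backwards."""
    g = tail_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical quantities at the final state S_{t+n}.
q_s = {"left": 1.0, "right": 3.0}       # Q_{t+n-1}(S_{t+n}, a)
pi_s = {"left": 0.25, "right": 0.75}    # pi(a | S_{t+n})
rewards = [0.0, 2.0]                    # R_{t+1}, R_{t+2}
gamma = 0.5

# n-step Sarsa bootstraps from the action actually taken (here "right").
sarsa_target = n_step_target(rewards, q_s["right"], gamma)

# n-step Expected Sarsa bootstraps from the expectation over all actions.
expected_sarsa_target = n_step_target(rewards, expected_value(pi_s, q_s), gamma)
```

Averaging over actions removes the randomness of the final action choice from the target, at the cost of one extra sum per update.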

7.3 n-step Off-policy Learning by Importance Sampling

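This section gives a good introduction to off-policy learning. Off-policy learning means learning the value of one policy $π$ while following experience generated by another policy b. Often $π$ is the greedy policy for the current action-value estimate, and b is a more exploratory policy, perhaps $ε$-greedy.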

This again requires the importance sampling ratio

ρ_{t : h} ≐ \prod_{k = t}^{min (h, T - 1)} \frac{π (A_{k} | S_{k})}{b (A_{k} | S_{k})}

The update rule is

V_{t + n} (S_{t}) ≐ V_{t + n - 1} (S_{t}) + α ρ_{t : t + n - 1} [G_{t : t + n} - V_{t + n - 1} (S_{t})], 0 \leq t < T
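The ratio and the weighted update can be sketched as follows. The function names and the per-step probability lists are assumptions of this sketch: `pi_probs[k]` and `b_probs[k]` hold the probability each policy assigns to the action actually taken at step k.

```python
def importance_ratio(pi_probs, b_probs, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi_probs[k] / b_probs[k]
    return rho


def off_policy_td_update(v_s, g, rho, alpha):
    """Return V(S_t) + alpha * rho * [G - V(S_t)]."""
    return v_s + alpha * rho * (g - v_s)


# Hypothetical 4-step episode (T = 4) with a greedy target policy.
pi_probs = [1.0, 0.5, 0.0, 1.0]
b_probs = [0.5, 0.5, 0.25, 0.5]
T = 4

rho_01 = importance_ratio(pi_probs, b_probs, t=0, h=1, T=T)   # 2.0 * 1.0
rho_02 = importance_ratio(pi_probs, b_probs, t=0, h=2, T=T)   # includes a zero factor
rho_3x = importance_ratio(pi_probs, b_probs, t=3, h=10, T=T)  # truncated at T - 1

new_v = off_policy_td_update(v_s=1.0, g=3.0, rho=rho_01, alpha=0.5)
```

A single action that $π$ would never take makes the whole ratio zero, so that n-step return is ignored entirely; actions that $π$ favors more than b does are weighted up.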

The off-policy form of n-step Sarsa is

Q_{t + n} (S_{t}, A_{t}) ≐ Q_{t + n - 1} (S_{t}, A_{t}) + α ρ_{t + 1 : t + n} [G_{t : t + n} - Q_{t + n - 1} (S_{t}, A_{t})], 0 \leq t < T

7.4 *Per-decision Off-policy Methods with Control Variates

A more sophisticated approach would use per-decision importance sampling ideas

The n-step return ending at horizon h can be written recursively as
$G_{t : h} = R_{t + 1} + γ G_{t + 1 : h}, t < h < T,$

off-policy definition of the n-step return ending at horizon h

G_{t : h} ≐ ρ_{t} (R_{t + 1} + γ G_{t + 1 : h}) + (1 - ρ_{t}) V_{h - 1} (S_{t}), t < h < T, \qquad (7.13)

with

G_{h : h} ≐ V_{h - 1} (S_{h})
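The recursion in (7.13) can be sketched directly. The function name, the list layout (`rewards[k]` is $R_{k}$, `rho[k]` is $ρ_{k}$, `v[k]` is $V_{h - 1} (S_{k})$), and the numbers are assumptions of this sketch, and it only covers the case h < T.

```python
def off_policy_state_return(t, h, rewards, rho, v, gamma):
    """G_{t:h} from (7.13), assuming h < T so the recursion ends at V(S_h)."""
    if t == h:
        return v[h]                      # G_{h:h} = V_{h-1}(S_h)
    g_next = off_policy_state_return(t + 1, h, rewards, rho, v, gamma)
    sampled = rho[t] * (rewards[t + 1] + gamma * g_next)
    fallback = (1.0 - rho[t]) * v[t]     # second term of (7.13)
    return sampled + fallback


# Hypothetical trajectory; index 0 of rewards is unused because rewards start at R_1.
rewards = [0.0, 1.0, 2.0]
rho = [2.0, 0.0]
v = [1.0, 3.0, 4.0]
gamma = 0.5

g_1 = off_policy_state_return(1, 2, rewards, rho, v, gamma)
g_0 = off_policy_state_return(0, 2, rewards, rho, v, gamma)
```

With `rho[1] == 0.0` the return from step 1 falls back to the current estimate `v[1]`, so the update target leaves that estimate unchanged instead of pulling it toward zero.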

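The second term in (7.13) is called a control variate.
The control variate does not change the expected update because, as introduced in Section 5.9, the importance sampling ratio has expected value 1.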
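The step behind this claim can be written out explicitly. The derivation below is a short sketch that assumes actions are drawn from b, that $b (a | S_{t}) > 0$ wherever $π (a | S_{t}) > 0$, and that the estimate $V_{h - 1} (S_{t})$ does not depend on the sampled action $A_{t}$.

```latex
\begin{aligned}
\mathbb{E}_{b} [\rho_{t} \mid S_{t}]
  &= \sum_{a} b (a \mid S_{t}) \frac{\pi (a \mid S_{t})}{b (a \mid S_{t})}
   = \sum_{a} \pi (a \mid S_{t}) = 1, \\
\mathbb{E}_{b} [(1 - \rho_{t}) V_{h - 1} (S_{t}) \mid S_{t}]
  &= (1 - \mathbb{E}_{b} [\rho_{t} \mid S_{t}]) \, V_{h - 1} (S_{t}) = 0 .
\end{aligned}
```

So the extra term contributes nothing in expectation; it only changes how much the target varies from sample to sample.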

An off-policy form with control variates

\begin{aligned} G_{t : h} & ≐ R_{t + 1} + γ (ρ_{t + 1} G_{t + 1 : h} + {\bar{V}}_{h - 1} (S_{t + 1}) - ρ_{t + 1} Q_{h - 1} (S_{t + 1}, A_{t + 1})) \\ & = R_{t + 1} + γ ρ_{t + 1} (G_{t + 1 : h} - Q_{h - 1} (S_{t + 1}, A_{t + 1})) + γ {\bar{V}}_{h - 1} (S_{t + 1}), \quad t < h \leq T . \end{aligned}

If $h < T$, the recursion ends with $G_{h : h} ≐ Q_{h - 1} (S_{h}, A_{h})$; if $h \geq T$, it ends with $G_{T - 1 : h} ≐ R_{T}$.

Control variates are simply a variance-reduction technique.

7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

This is an off-policy method that does not need importance sampling.

The general form of the tree-backup n-step return is

G_{t : t + n} ≐ R_{t + 1} + γ \sum_{a \neq A_{t + 1}} π (a | S_{t + 1}) Q_{t + n - 1} (S_{t + 1}, a) + γ π (A_{t + 1} | S_{t + 1}) G_{t + 1 : t + n}, t < T - 1, n \geq 2

The n = 1 case is the one-step Expected Sarsa return, except at the final step of the episode, where

G_{T - 1 : t + n} ≐ R_{T}

This return is then used as the target in the usual action-value update from n-step Sarsa:

Q_{t + n} (S_{t}, A_{t}) ≐ Q_{t + n - 1} (S_{t}, A_{t}) + α [G_{t : t + n} - Q_{t + n - 1} (S_{t}, A_{t})], 0 \leq t < T,
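A recursive sketch of the tree-backup return follows. The data layout is an assumption of this sketch: lists are indexed by time step, so `pi[k][a]` is $π (a | S_{k})$, `Q[k][a]` is the current estimate for $(S_{k}, a)$, `actions[k]` is $A_{k}$, and `rewards[k]` is $R_{k}$. All numbers are hypothetical.

```python
def tree_backup_return(t, h, T, rewards, actions, pi, Q, gamma):
    """Tree-backup return G_{t:h} with horizon h = t + n."""
    if t == T - 1:
        return rewards[T]                       # last step of the episode
    k = t + 1
    expected = sum(pi[k][a] * Q[k][a] for a in Q[k])
    if k == h:
        # One-step case: full expectation over actions at the leaf.
        return rewards[k] + gamma * expected
    a_taken = actions[k]
    # Actions not taken contribute their current estimates ...
    off_branch = expected - pi[k][a_taken] * Q[k][a_taken]
    # ... while the action taken contributes the rest of the tree.
    g_next = tree_backup_return(k, h, T, rewards, actions, pi, Q, gamma)
    return rewards[k] + gamma * off_branch + gamma * pi[k][a_taken] * g_next


# Hypothetical 3-step episode; entries at index 0 that are never read are placeholders.
T = 3
gamma = 0.5
rewards = [0.0, 1.0, 2.0, 8.0]
actions = [0, 1, 0]
pi = [{}, {0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}]
Q = [{}, {0: 2.0, 1: 4.0}, {0: 4.0, 1: 0.0}]

g_two_step = tree_backup_return(0, 2, T, rewards, actions, pi, Q, gamma)
g_to_end = tree_backup_return(0, 3, T, rewards, actions, pi, Q, gamma)
```

No behavior-policy probabilities appear anywhere: the target policy's own action probabilities do the weighting, which is why no importance sampling ratio is needed.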

7.6 *A Unifying Algorithm: n-step $Q (δ)$

As with the methods described above, only the way of looking ahead changes; everything else stays the same (see the figure below).

Rewrite the tree-backup return (7.16) in the following form:

\begin{aligned} G_{t : h} & = R_{t + 1} + γ \sum_{a \neq A_{t + 1}} π (a | S_{t + 1}) Q_{h - 1} (S_{t + 1}, a) + γ π (A_{t + 1} | S_{t + 1}) G_{t + 1 : h} \\ & = R_{t + 1} + γ {\bar{V}}_{h - 1} (S_{t + 1}) - γ π (A_{t + 1} | S_{t + 1}) Q_{h - 1} (S_{t + 1}, A_{t + 1}) + γ π (A_{t + 1} | S_{t + 1}) G_{t + 1 : h} \\ & = R_{t + 1} + γ π (A_{t + 1} | S_{t + 1}) (G_{t + 1 : h} - Q_{h - 1} (S_{t + 1}, A_{t + 1})) + γ {\bar{V}}_{h - 1} (S_{t + 1}), \end{aligned}

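The last line has the same form as the n-step Sarsa return with control variates, except that the action probability $π (A_{t + 1} | S_{t + 1})$ appears in place of the importance-sampling ratio $ρ_{t + 1}$. Sliding linearly between the two with a degree of sampling $δ_{t + 1} \in [0, 1]$ (written σ in the book) gives

G_{t : h} ≐ R_{t + 1} + γ (δ_{t + 1} ρ_{t + 1} + (1 - δ_{t + 1}) π (A_{t + 1} | S_{t + 1})) (G_{t + 1 : h} - Q_{h - 1} (S_{t + 1}, A_{t + 1})) + γ {\bar{V}}_{h - 1} (S_{t + 1})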

for $t < h \leq T$. If $h < T$, the recursion ends with $G_{h : h} ≐ Q_{h - 1} (S_{h}, A_{h})$; if $h = T$, it ends with $G_{T - 1 : T} ≐ R_{T}$.
