Summary of Reinforcement Learning: An Introduction, Chapter 11, "Off-policy Methods with Approximation"

A new colleague has joined our group and I need to help him get started with RL, so we are beginning with Silver's course.

For myself, I am adding the requirement of reading Reinforcement Learning: An Introduction carefully.

I did not read it very carefully the first time, so this time I hope to be more thorough and also write a brief summary of the key points of each chapter.

The tabular off-policy methods developed in Chapters 6 and 7 readily extend to semi-gradient algorithms, but these algorithms do not converge nearly as robustly as in the on-policy case. The on-policy distribution is special and is important to the stability of semi-gradient methods.

Methods developed in earlier chapters for the off-policy case extend readily to function approximation as semi-gradient methods. Although these methods may diverge, and in that sense are not sound, they are still often successfully used. 

Many of these algorithms use the per-step importance sampling ratio: 

$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$
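As a concrete reference for how this ratio enters the learning rule, here is a minimal sketch of linear semi-gradient off-policy TD(0), where the ordinary TD(0) update is simply scaled by $\rho_t$: $\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$. The environment and policy interfaces used below (env.reset, env.step, behavior_policy, target_policy.prob) are assumptions made for illustration, not APIs from the book.

```python
import numpy as np

def semi_gradient_off_policy_td0(env, behavior_policy, target_policy,
                                 num_features, alpha=0.01, gamma=0.99,
                                 num_episodes=100):
    """Sketch of linear semi-gradient off-policy TD(0).

    Assumed interfaces (illustrative, not from the book):
      behavior_policy(x)        -> (action, b_prob) with b_prob = b(action | state)
      target_policy.prob(x, a)  -> pi(a | state)
      env.reset()               -> feature vector x(S_0)
      env.step(a)               -> (x_next, reward, done)
    The value estimate is linear, v_hat(s, w) = w @ x(s), so its gradient is x(s).
    """
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            a, b_prob = behavior_policy(x)              # action from the behavior policy b
            x_next, reward, done = env.step(a)
            rho = target_policy.prob(x, a) / b_prob     # per-step importance sampling ratio
            v_next = 0.0 if done else w @ x_next        # bootstrap unless the episode ended
            delta = reward + gamma * v_next - (w @ x)   # TD error
            w += alpha * rho * delta * x                # semi-gradient update, scaled by rho
            x = x_next
    return w
```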

The semi-gradient methods described above can indeed be unstable and diverge in some settings, the classic example being Baird's Counterexample.

If we alter just the distribution of DP backups in Baird’s counterexample, from the uniform distribution to the on-policy distribution (which generally requires asynchronous updating), then convergence is guaranteed to a solution with error bounded by (9.14). 

The Baird’s Counterexample shows that even the simplest combination of bootstrapping and function approximation can be unstable if the backups are not done according to the on-policy distribution. There are also counterexamples similar to Baird’s showing divergence for Q-learning. 
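To make the instability concrete, here is a small simulation sketch of Baird's counterexample under semi-gradient off-policy TD(0). The seven states, the linear features, the dashed/solid behavior and target policies, $\gamma = 0.99$, $\alpha = 0.01$, and the initial weights follow the book's description of the example; the number of steps and the random seed are illustrative choices of mine.

```python
import numpy as np

np.random.seed(0)
GAMMA, ALPHA = 0.99, 0.01
N_STATES, N_FEATURES = 7, 8

# Linear features from the example: states 0..5 are represented as 2*w_i + w_7,
# the seventh state (index 6) as w_6 + 2*w_7.
X = np.zeros((N_STATES, N_FEATURES))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])  # initial weights used in the book's figure

s = np.random.randint(N_STATES)
for step in range(1000):
    # Behavior policy b: "dashed" with prob 6/7 (jump uniformly to states 0..5),
    # "solid" with prob 1/7 (go to state 6).  Target policy pi always takes "solid".
    if np.random.rand() < 1.0 / 7.0:
        s_next = 6
        rho = 1.0 / (1.0 / 7.0)    # pi(solid|s) = 1, b(solid|s) = 1/7  ->  rho = 7
    else:
        s_next = np.random.randint(6)
        rho = 0.0                  # pi(dashed|s) = 0  ->  this update is zeroed out
    delta = 0.0 + GAMMA * (w @ X[s_next]) - (w @ X[s])   # all rewards are zero
    w += ALPHA * rho * delta * X[s]                      # semi-gradient off-policy TD(0)
    s = s_next

print(np.abs(w).max())   # the weights grow without bound instead of converging
```

Setting `rho` to 1 and drawing the transitions from the target policy itself, i.e., training on the on-policy distribution, makes the same update stable, consistent with the bounded-error convergence mentioned above.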

It may be possible to guarantee convergence of Q-learning as long as the behavior policy (the policy used to select actions) is sufficiently close to the estimation policy (the policy used in GPI), for example, when it is the ε-greedy policy. To the best of our knowledge, Q-learning has never been found to diverge in this case, but there has been no theoretical analysis.

The danger of instability and divergence arises whenever we combine three things:

1. training on a distribution of transitions other than that naturally generated by the process whose expectation is being estimated (e.g., off-policy learning)

2. scalable function approximation (e.g., linear semi-gradient)

3. bootstrapping (e.g., DP, TD learning)

Also note that any two of these three is fine; the danger arises only in the presence of all three.
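As a tiny numerical illustration of how the three interact (a standard two-state setup, added here for concreteness rather than quoted from the chapter): let linear function approximation force two states to have estimated values $w$ and $2w$, with a reward-0 transition from the first to the second. Repeatedly applying the semi-gradient TD(0) update to that transition alone, rather than under the on-policy distribution, multiplies $w$ by $1 + \alpha(2\gamma - 1)$ each time, so for $\gamma > 0.5$ the parameter grows without bound no matter how small $\alpha$ is. Removing any one ingredient (training on-policy, giving each state its own tabular value, or targeting a Monte Carlo return instead of the bootstrapped estimate) removes the blow-up.

```python
GAMMA, ALPHA = 0.99, 0.1
w = 1.0                                 # shared parameter: v_hat(s1) = w, v_hat(s2) = 2 * w

for _ in range(100):
    # Only the transition s1 -> s2 (reward 0) is ever updated, bootstrapping on v_hat(s2).
    delta = 0.0 + GAMMA * (2 * w) - w   # TD error = (2*gamma - 1) * w, positive for gamma > 0.5
    w += ALPHA * delta * 1.0            # gradient of v_hat(s1) = w with respect to w is 1

print(w)   # about (1 + 0.1 * 0.98) ** 100 -- w has exploded instead of converging
```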