[Reinforcement Learning] Model-Free Prediction

Monte-Carlo Learning

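Monte-Carlo methods (MC), also called Monte-Carlo simulation, solve computational problems with random numbers (or, more commonly, pseudo-random numbers). In essence, they generate observed outcomes through behaviour that is as random as possible, and then use those outcomes to characterise the target system.

In the model-free setting, MC is used in reinforcement learning to estimate the value function. Its characteristics are:

MC learns from complete episodes (no bootstrapping)
MC estimates value as the mean return, i.e. value = mean(return)
MC applies only to episodic MDPs (every episode must terminate)

While collecting a complete episode, loops are likely: the same state may be visited several times. MC handles this in two ways: first-visit (N(s) += 1 only the first time s is visited in an episode) and every-visit (N(s) += 1 on every visit to s).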
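The difference between the two counting rules is easiest to see on an episode that revisits a state. The following is a minimal sketch, not a reference implementation: the function name `mc_prediction`, the episode layout (a list of `(state, reward)` pairs) and the toy episode are all assumptions made for illustration.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received on leaving that state (R_{t+1} for S_t).
    """
    n = defaultdict(int)        # N(s): number of counted visits
    total = defaultdict(float)  # sum of returns over counted visits
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()

        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue  # first-visit: ignore repeat visits in this episode
            seen.add(state)
            n[state] += 1
            total[state] += g
    return {s: total[s] / n[s] for s in n}

# Hypothetical episode that loops back to state "a": a -> b -> a -> terminal.
episodes = [[("a", 1.0), ("b", 0.0), ("a", 2.0)]]
v_first = mc_prediction(episodes, first_visit=True)   # counts "a" once
v_every = mc_prediction(episodes, first_visit=False)  # counts "a" twice
```

With first-visit counting, only the return from the first occurrence of "a" enters the average; with every-visit counting, both occurrences do, so the two estimates of V(a) differ.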


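V(s) can be seen as the average of all returns observed from visits to s. The average need not be computed by summing first and dividing afterwards; it can also be obtained incrementally, by adding a small correction to the current average. This is the left-hand update below, where $V(S_t)$ is the current average, $G_t$ is the newly sampled return, and $(G_t - V(S_t))$ is its difference from the current average.

Written this way, however, a counter N(s) has to be maintained for every state. Since the update only needs a direction in which to move the estimate, $1/N(S_t)$ can be replaced by a constant step size α in (0, 1), which gives the right-hand update below. In practice α acts as a forgetting factor: old samples are gradually forgotten, so not every sampled episode has to be remembered exactly.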

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr) \qquad\qquad V(S_t) \leftarrow V(S_t) + \alpha\bigl(G_t - V(S_t)\bigr)$$
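As a quick check of the two update rules, the sketch below applies each of them to a short list of returns. The function names and the sample returns are made up for illustration; the point is only that the 1/N update reproduces the ordinary mean, while a constant α weights recent returns more heavily.

```python
def incremental_mean(returns):
    """Running average via V <- V + (1/N) * (G - V)."""
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n
    return v

def constant_alpha(returns, alpha, v0=0.0):
    """Recency-weighted average via V <- V + alpha * (G - V)."""
    v = v0
    for g in returns:
        v += alpha * (g - v)
    return v

returns = [4.0, 2.0, 6.0, 8.0]                 # hypothetical sampled returns
batch_mean = sum(returns) / len(returns)       # sum first, divide afterwards
running = incremental_mean(returns)            # same value, no stored history
recent_biased = constant_alpha(returns, alpha=0.5)
```

Here `running` matches `batch_mean`, whereas `recent_biased` ends up closer to the latest returns because earlier ones have been partly forgotten.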

Temporal-Difference Learning

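Monte-Carlo sampling has one obvious drawback: a complete episode must be sampled before its return can be observed. TD(0) does not need this. It relies on the Bellman equation, under which the value of the current state depends only on the immediate reward $R_{t+1}$ and the value of the next state (see the update below). The quantity $R_{t+1} + \gamma V(S_{t+1})$ is the TD target, and the bracketed term multiplied by α is the TD error. TD(0) therefore samples only one step: it follows the current policy to the next state $S_{t+1}$, observes $R_{t+1}$, and plugs in the current estimate $V(S_{t+1})$. Updating an estimate from another estimate in this way is called bootstrapping (it updates a guess towards a guess), whereas MC averages actually observed returns and does not bootstrap. Because TD(0) only needs the next state $S_{t+1}$, it can also be used on non-terminating (incomplete) sequences.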

$$V(S_t) \leftarrow V(S_t) + \alpha\bigl(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr)$$
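The TD(0) update can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the transition format `(s, r, s_next, done)`, the function name `td0_episode` and the three-state deterministic chain are hypothetical, chosen so that the true value of every state is 1.

```python
from collections import defaultdict

def td0_episode(v, transitions, alpha=0.1, gamma=1.0):
    """Apply the TD(0) update online to each (s, r, s_next, done) transition."""
    for s, r, s_next, done in transitions:
        # Bootstrapped target: the terminal state has value 0 by definition.
        target = r + (0.0 if done else gamma * v[s_next])
        td_error = target - v[s]
        v[s] += alpha * td_error
    return v

# Hypothetical deterministic chain: s0 -> s1 -> s2 -> terminal, reward 1 at the end.
chain = [("s0", 0.0, "s1", False),
         ("s1", 0.0, "s2", False),
         ("s2", 1.0, None, True)]

v = defaultdict(float)          # all estimates start at 0
for _ in range(200):
    td0_episode(v, chain, alpha=0.5)
```

Each update happens as soon as one transition is available, so nothing in the loop depends on the episode having ended. With repeated episodes the reward at the end of the chain propagates backwards one state at a time.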

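Compared with MC, the TD(0) target is built from the current estimate $V(S_{t+1})$ rather than from an actually observed return, so it is biased, whereas the MC return is an unbiased sample of the true value. On the other hand, the TD target depends on a single random step instead of the whole remainder of the episode, so TD(0) estimates have lower variance than MC estimates.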

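In the AB example, state A appears in only one episode, where it is followed by B with reward 0 and the episode's return is 0. Both TD and MC estimate V(B) as the average return from B, 0.75. MC, however, gives V(A) = 0, because the single sampled return from A is 0, while TD(0) gives V(A) = 0.75, because the next state after A is B with V(B) = 0.75 and r = 0. In this sense TD makes better use of the Markov property. TD(0) also samples only the next state and does not have to wait for the episode to finish, so it is more efficient than MC. Because it bootstraps, however, it is more sensitive to the initial values, and it fits the observed returns less closely than MC.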
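The two answers for V(A) can be reproduced numerically. The sketch below assumes the data set is the classic eight-episode AB example (A,0,B,0 once; B,1 six times; B,0 once), which matches the numbers quoted above but is not spelled out in the text; the function names and the batch-updating scheme are likewise illustrative choices.

```python
from collections import defaultdict

# Assumed data: A,0,B,0 once; B,1 six times; B,0 once.
# Each episode is a list of (state, reward, next_state); None marks termination.
episodes = ([[("A", 0.0, "B"), ("B", 0.0, None)]]
            + [[("B", 1.0, None)]] * 6
            + [[("B", 0.0, None)]])

def batch_mc(episodes):
    """Average the undiscounted return observed from each visited state."""
    returns = defaultdict(list)
    for ep in episodes:
        rewards = [r for _, r, _ in ep]
        for t, (s, _, _) in enumerate(ep):
            returns[s].append(sum(rewards[t:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}

def batch_td0(episodes, alpha=0.1, sweeps=500):
    """Replay the same transitions repeatedly until TD(0) settles (gamma = 1)."""
    v = defaultdict(float)
    for _ in range(sweeps):
        delta = defaultdict(float)
        for ep in episodes:
            for s, r, s_next in ep:
                target = r + (v[s_next] if s_next is not None else 0.0)
                delta[s] += alpha * (target - v[s])
        for s, d in delta.items():   # apply the summed increments once per sweep
            v[s] += d
    return dict(v)

v_mc = batch_mc(episodes)   # V(A) = 0.0,  V(B) = 0.75
v_td = batch_td0(episodes)  # V(A) ~ 0.75, V(B) ~ 0.75
```

MC only looks at the return that actually followed A, while TD(0) routes A's value through its estimate for B, which is why the two methods disagree on V(A) but agree on V(B).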

Characteristics of temporal-difference (TD) methods:

TD can learn from incomplete episodes by bootstrapping
TD updates a guess towards a guess

If the TD target looks n steps ahead instead of one, then letting n run all the way to the end of the episode gives exactly MC. From this angle, MC is a special case of TD.


n-step return
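The n-step return is commonly defined as $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$. The sketch below computes it for one recorded episode, assuming that standard definition; the function name, the list-based episode layout and the sample numbers are made up for illustration.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return G_t^(n) for one recorded episode.

    rewards[k] is R_{k+1}, the reward after leaving state S_k;
    values[k] is the current estimate V(S_k). The episode has
    T = len(rewards) steps and the terminal state has value 0.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                      # bootstrap only if we stopped before the end
        g += gamma ** (end - t) * values[end]
    return g

rewards = [1.0, 2.0, 3.0]            # R_1, R_2, R_3 (hypothetical)
values = [0.5, 0.4, 0.3]             # V(S_0), V(S_1), V(S_2) (hypothetical)
td_target = n_step_return(rewards, values, t=0, n=1, gamma=0.9)  # R_1 + 0.9 * V(S_1)
mc_return = n_step_return(rewards, values, t=0, n=3, gamma=0.9)  # full discounted return
```

With n = 1 the function returns the TD(0) target, and once n reaches the remaining episode length it returns the plain MC return with no bootstrapped term.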


Characteristics of forward-view TD(λ):

Update value function towards the λ-return
Forward view looks into the future to compute $G_t^{\lambda}$
Like MC, can only be computed from complete episodes
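To make the λ-return concrete, the sketch below assumes its usual episodic form, $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$, where $G_t^{(n)}$ is the n-step return and $G_t$ the full return. That formula is not stated in the text above and is taken here as an assumption; the helper names and sample numbers are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return; rewards[k] is R_{k+1}, values[k] is V(S_k)."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + (gamma ** (end - t) * values[end] if end < T else 0.0)

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Geometrically weighted mix of all n-step returns from time t."""
    steps_left = len(rewards) - t
    g = 0.0
    for n in range(1, steps_left):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full (Monte-Carlo) return.
    g += lam ** (steps_left - 1) * n_step_return(rewards, values, t, steps_left, gamma)
    return g

rewards = [1.0, 2.0, 3.0]            # R_1, R_2, R_3 (hypothetical)
values = [0.5, 0.4, 0.3]             # V(S_0), V(S_1), V(S_2) (hypothetical)
g_td = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.9)   # equals the TD(0) target
g_mid = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.9)
g_mc = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.9)   # equals the MC return
```

Because the weights need every n-step return up to the end of the episode, this computation can only run once the episode is complete, which is the limitation noted in the list above.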

Backward View TD(λ)

Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences
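The usual mechanism behind the backward view is an eligibility trace per state. The sketch below assumes accumulating traces, $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$, which the text does not spell out; the function name, transition format and toy chain are hypothetical.

```python
from collections import defaultdict

def td_lambda_episode(v, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """One episode of backward-view TD(lambda) with accumulating traces."""
    e = defaultdict(float)                     # eligibility trace E(s), reset per episode
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * v[s_next])
        delta = target - v[s]                  # one-step TD error
        e[s] += 1.0                            # mark the current state as eligible
        for state in list(e):
            v[state] += alpha * delta * e[state]   # credit every eligible state
            e[state] *= gamma * lam                # decay its trace for the next step
    return v

# Hypothetical deterministic chain: s0 -> s1 -> s2 -> terminal, reward 1 at the end.
chain = [("s0", 0.0, "s1", False),
         ("s1", 0.0, "s2", False),
         ("s2", 1.0, None, True)]

v = defaultdict(float)
td_lambda_episode(v, chain, alpha=0.5, gamma=1.0, lam=0.5)

v0 = defaultdict(float)                        # lam = 0 reduces to TD(0)
td_lambda_episode(v0, chain, alpha=0.5, gamma=1.0, lam=0.0)
```

After a single episode with λ = 0.5, the final reward has already reached s1 and s0 in geometrically decreasing amounts, while with λ = 0 only s2 is updated. Every update uses just the current transition and the traces, so it can be applied online at each step.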


Pros and Cons of MC and TD

How they learn

TD can learn before the final outcome is known
- TD can learn online after every step
- MC must wait until end of episode before return is known
TD can learn when there is no final outcome at all (for example in continuing, non-terminating MDPs)
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments

Variance and bias

MC has high variance, zero bias
- Good convergence properties
- Not very sensitive to initial value
- Very simple to understand and use
TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_\pi(s)$
- More sensitive to initial value


Bootstrapping vs. Sampling

Bootstrapping: the update is based on an existing estimate

DP bootstraps
MC does not bootstrap
TD bootstraps

Sampling: the update uses sampled outcomes in place of the full expectation

DP does not sample (model-based methods do not need samples)
MC samples (model-free methods need samples)
TD samples (model-free methods need samples)

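Viewed at a high level, the basic RL methods differ along these two axes: DP bootstraps without sampling, MC samples without bootstrapping, and TD does both.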
