DRL Notes Series (Part 1)

References

Basic Concepts

  1. trial and error

  2. DRL=RL+deep_learning

  3. on-policy: all training data is generated by the current agent interacting with the environment; old data (data produced by earlier versions of the agent) is not reused during training

    • Drawback: these algorithms are weaker in terms of sample efficiency
    • Advantage: these algorithms directly optimize the objective you care about—policy performance—and it works out mathematically that you need on-policy data to calculate the updates
    • Summary: they trade sample efficiency for stability and reliability during training
  4. off-policy: old data can be reused. A Q-function is trained via Bellman's equations

    • Drawbacks:
      • there are no guarantees that doing a good job of satisfying Bellman’s equations leads to having great policy performance.
      • the absence of guarantees makes algorithms in this class potentially brittle and unstable.
    • Advantage: high sample efficiency

Basic Math Formulas

  • The log-derivative trick:
    \nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta\log P(\tau|\theta)

  • Expected Grad-Log-Prob (EGLP) lemma: suppose P_\theta is a parameterized probability distribution over a random variable x; then the expected gradient of \log P_\theta(x) is zero, i.e.
    E_{x\sim P_\theta}[\nabla_\theta\log P_\theta(x)]=0

    Proof: E_{x\sim P_\theta}[\nabla_\theta\log P_\theta(x)] = \int_x P_\theta(x)\,\nabla_\theta\log P_\theta(x) = \int_x \nabla_\theta P_\theta(x) = \nabla_\theta\int_x P_\theta(x) = \nabla_\theta 1 = 0
  • Law of iterated expectations (tower rule): let X, Y be random variables with distributions P_X, P_Y; then
    E_{x\sim P_X}[x] = E_{y\sim P_Y}\big[E_{x\sim P_{X|Y}}[x\,|\,y]\big]

    Proof: E_{y\sim P_Y}\big[E_{x\sim P_{X|Y}}[x\,|\,y]\big] = \int_y P_Y(y)\int_x x\,P_{X|Y}(x|y)\,dx\,dy = \int_x x\int_y P_{X,Y}(x,y)\,dy\,dx = \int_x x\,P_X(x)\,dx = E_{x\sim P_X}[x]

Core Concepts

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

  1. states: a state is a complete description of the state of the world. There is no information about the world which is hidden from the state.

    observations: an observation is a partial description of a state, which may omit information.

  2. environment: may be fully observed (the agent sees the complete state) or partially observed (the agent only receives observations)

  3. action spaces: the set of all valid actions in a given environment. Discrete action spaces, continuous action spaces

  4. policy: a policy is a rule used by an agent to decide what actions to take

    • deterministic: denoted by \mu, a_t = \mu(s_t)

    • stochastic: denoted by \pi, a_t \sim \pi(\cdot|s_t)

      • categorical policies: used in discrete action spaces. The input is a state and the output is a probability distribution over the actions

      • diagonal Gaussian policies: used in continuous action spaces

        Sampling: a = \mu_\theta(s) + \sigma_\theta(s) \circ z, where \mu_\theta(\cdot) is the mean action (a vector), \sigma_\theta(\cdot) gives the per-dimension standard deviations (equivalently, a covariance matrix with values only on the diagonal and zeros elsewhere), and z \sim \mathcal N(0, I) is a noise vector

    Because the policy is essentially the agent's decision rule, the word "policy" is often used in place of "agent"

    parameterized policies: the policy is modeled by a parameterized function, and an optimization algorithm adjusts the parameters to change the policy's behavior. The parameters are usually denoted \theta (a minimal sampling sketch for both stochastic families is given below):
    \begin{aligned} a_t &= \mu_\theta(s_t) \\ a_t &\sim \pi_\theta(\cdot|s_t) \end{aligned}
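    A minimal sketch of sampling from the two stochastic policy families above, assuming PyTorch; the network sizes and the helper mlp are illustrative, not taken from any reference implementation:

    ```python
    import torch
    import torch.nn as nn
    from torch.distributions import Categorical, Normal

    def mlp(in_dim, out_dim, hidden=64):
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, out_dim))

    obs_dim, n_actions, act_dim = 4, 2, 2
    obs = torch.randn(obs_dim)

    # Categorical policy: logits -> distribution over discrete actions
    logits_net = mlp(obs_dim, n_actions)
    dist = Categorical(logits=logits_net(obs))
    a = dist.sample()                      # discrete action index
    logp = dist.log_prob(a)                # log pi_theta(a|s), reused later by policy gradients

    # Diagonal Gaussian policy: mean from a network, log-std as a free parameter
    mu_net = mlp(obs_dim, act_dim)
    log_std = nn.Parameter(-0.5 * torch.ones(act_dim))
    dist = Normal(mu_net(obs), torch.exp(log_std))
    a = dist.sample()                      # equivalent to a = mu_theta(s) + sigma * z, z ~ N(0, I)
    logp = dist.log_prob(a).sum()          # sum over action dimensions for a diagonal Gaussian
    ```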

  5. trajectories: a sequence of states and actions, \tau = (s_0,a_0,s_1,a_1,\dots), where s_{t+1} \sim P(\cdot|s_t,a_t)

  6. reward: the feedback received at each step, r_t = R(s_t, a_t, s_{t+1}), often simplified to r_t = R(s_t) or r_t = R(s_t,a_t)

    return: the cumulative long-term reward over a trajectory

    • finite-horizon undiscounted return: R(\tau) = \sum_{t=0}^T r_t

    • infinite-horizon discounted return: R(\tau) = \sum_{t=0}^\infty \gamma^t r_t. The discount factor \gamma \in (0,1) is used mainly for two reasons:

      • it guarantees that the infinite sum converges to a finite value
      • rewards obtained sooner have a larger influence than rewards obtained later

      Note: although the line between these two formulations of return is quite stark in the RL formalism, deep RL practice tends to blur it: for example, algorithms are frequently set up to optimize the undiscounted return, while discount factors are used when estimating value functions

  7. RL problem: the goal of RL is to adjust the policy so as to maximize the expected cumulative return. Given the current policy \pi, the probability of a trajectory \tau of length T is P(\tau|\pi) = \rho_0(s_0) \displaystyle\prod_{t=0}^{T-1} P(s_{t+1}|s_t,a_t)\,\pi(a_t|s_t); the expected return of \pi is J(\pi) = \int_\tau P(\tau|\pi)\,R(\tau) = E_{\tau\sim\pi}[R(\tau)]; hence the RL optimization objective is \pi^* = \displaystyle\arg\max_\pi J(\pi)

  8. value functions

    • on-policy value function: V^{\pi}(s) = E_{\tau \sim \pi}[R(\tau)\,|\,s_0=s]

    • on-policy action-value function: Q^{\pi}(s,a) = E_{\tau\sim\pi}[R(\tau)\,|\,s_0=s,a_0=a]

    • optimal value function: V^*(s) = \max_\pi E_{\tau \sim \pi}[R(\tau)\,|\,s_0=s]

    • optimal action-value function: Q^*(s,a) = \max_\pi E_{\tau\sim\pi}[R(\tau)\,|\,s_0=s,a_0=a]

    • \begin{aligned} V^\pi(s) &= E_{a\sim\pi}[Q^\pi(s,a)] \\ V^*(s) &= \max_a Q^*(s,a) \end{aligned}

  9. Bellman equations: the basic idea is "reward plus next value", i.e. the value of your starting point is the reward you expect to get from being there plus the value of wherever you land next

    • Bellman equations for the on-policy value functions (a tiny tabular sketch of the optimal backup follows this item):
      • V^\pi(s) = E_{a\sim\pi,\,s'\sim P}\,[r(s,a) + \gamma\,V^\pi(s')]
      • Q^\pi(s,a) = E_{s' \sim P}[r(s,a) + \gamma\,E_{a'\sim\pi}[Q^\pi(s',a')]], where a\sim\pi is shorthand for a\sim\pi(\cdot|s), a'\sim\pi for a'\sim\pi(\cdot|s'), and s' \sim P for s' \sim P(\cdot|s,a)
    • Bellman equations for the optimal value functions:
      • V^*(s) = \max_a E_{s' \sim P}[r(s,a) + \gamma\,V^*(s')]
      • Q^*(s,a) = E_{s' \sim P}[r(s,a) + \gamma\,\max_{a'}Q^*(s',a')]
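    A tiny tabular sketch of the optimal-value backup V^*(s) = \max_a E_{s'}[r(s,a) + \gamma V^*(s')] (value iteration). The 2-state, 2-action MDP here is made up purely for illustration:

    ```python
    import numpy as np

    # Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    R = np.array([[1.0, 0.0],
                  [0.5, 2.0]])
    gamma, V = 0.9, np.zeros(2)

    for _ in range(200):
        # Bellman backup: V*(s) = max_a ( R[s,a] + gamma * sum_s' P[s,a,s'] * V[s'] )
        V = np.max(R + gamma * (P @ V), axis=1)
    ```
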
  10. advantage functions: sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.
    A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)
    where Q^\pi(s_t,a_t) can be estimated by the one-step bootstrap r_t + \gamma V^\pi(s_{t+1}); that is, Q does not need to be modeled by a separate neural network and only V is a NN (a short sketch of this estimate follows)
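    A short sketch of that one-step advantage estimate, assuming PyTorch; value_net is a hypothetical state-value network, not part of the original notes:

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical value network V(s); Q is bootstrapped from it rather than learned separately
    value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

    def one_step_advantage(obs, rews, next_obs, gamma=0.99):
        # A(s_t, a_t) ~= r_t + gamma * V(s_{t+1}) - V(s_t)
        with torch.no_grad():
            return rews + gamma * value_net(next_obs).squeeze(-1) - value_net(obs).squeeze(-1)
    ```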

  11. The goal of policy gradient methods is to push up the probability of actions that lead to high return and push down the probability of actions that lead to low return, repeating this update until the policy converges to an optimal one

RL Algorithms

  1. model-free/model-based: whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transitions and rewards. The main upside to having a model is that it allows the agent to plan by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. The main downside is that a ground-truth model of the environment is usually not available to the agent.
  2. model-free
    • policy optimization
      • Directly maximizes the expected return J(\pi_\theta) via gradient ascent on the policy parameters; because the objective of interest is optimized directly, these methods tend to be stable and reliable
      • on-policy
      • Advantage: the learned policy is stable and reliable; drawback: low sample efficiency during training
    • Q-learning
      • The goal is to learn an approximation Q_\theta(s,a) of the optimal action-value function Q^*(s,a)
      • The objective is based on the Bellman equations: training is essentially a regression toward Bellman targets, so the expected return is maximized only indirectly, which is why these methods can be brittle and unstable (a minimal sketch of the regression follows)
      • off-policy
      • The policy is obtained by taking the action with the largest Q value
      • Advantage: high sample efficiency; drawback: brittle and unstable
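      A minimal sketch of that Bellman regression, assuming PyTorch; q_net and q_target are hypothetical Q-networks used for illustration, not any specific algorithm's reference code:

      ```python
      import torch
      import torch.nn as nn

      obs_dim, n_actions = 4, 2
      q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
      q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

      def q_loss(obs, acts, rews, next_obs, gamma=0.99):
          # Regression target from the Bellman equation: r + gamma * max_a' Q(s', a')
          with torch.no_grad():
              target = rews + gamma * q_target(next_obs).max(dim=1).values
          q_sa = q_net(obs).gather(1, acts.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
          return ((q_sa - target) ** 2).mean()                        # fit Q to the Bellman target

      def greedy_action(obs):
          return q_net(obs).argmax(dim=-1)    # the policy: take the action with the largest Q value
      ```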

Mathematical Derivation of Policy Optimization

Directly maximize the expected return J(\pi_\theta) with gradient ascent, i.e.
\theta_{k+1} = \theta_k + \alpha\,\nabla_\theta J(\pi_\theta)\big|_{\theta_k}

  1. Probability of a trajectory:
    P(\tau|\theta)=\rho_0(s_0)\prod_{t=0}^T P(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t)

  2. The log-derivative trick:
    \nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta\log P(\tau|\theta)

  3. Log-probability of a trajectory:
    \log P(\tau|\theta) = \log\rho_0(s_0) + \sum_{t=0}^T\big(\log P(s_{t+1}|s_t,a_t) + \log\pi_\theta(a_t|s_t)\big)

  4. θ\theta 求导:
    θlogP(τθ)=θlogρ0(s0)+t=0T(θlogP(st+1st,at)+θlogπθ(atst)) =t=0Tθlogπθ(atst) \begin{aligned} \bigtriangledown_\theta\log P(\tau|\theta) &= \bigtriangledown_\theta\log\rho_0(s_0) + \sum_{t=0}^T(\bigtriangledown_\theta\log P(s_{t+1}|s_t,a_t) + \bigtriangledown_\theta\log\pi_\theta(a_t|s_t)) \\ ~& = \sum_{t=0}^T\bigtriangledown_\theta\log\pi_\theta(a_t|s_t) \end{aligned}

Variants of the Policy Gradient Objective

  1. The basic policy gradient
    \begin{aligned} \nabla_\theta J(\pi_\theta) &= \nabla_\theta E_{\tau\sim\pi_\theta}[R(\tau)] \\ &= \nabla_\theta\int_\tau P(\tau|\theta)\,R(\tau) \\ &= \int_\tau\nabla_\theta P(\tau|\theta)\,R(\tau) \\ &= \int_\tau P(\tau|\theta)\,\nabla_\theta\log P(\tau|\theta)\,R(\tau) \\ &= E_{\tau\sim\pi_\theta}[\nabla_\theta\log P(\tau|\theta)\,R(\tau)] \\ &= E_{\tau\sim\pi_\theta}\Big[R(\tau)\sum_{t=0}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\Big] \\ &\approx \frac{1}{|\mathcal D|} \sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}R(\tau)\,\nabla_\theta\log\pi_\theta(a_t|s_t) \end{aligned}
    Note: the EGLP lemma does not make this expectation vanish (R(\tau) is not a constant); its role comes later, in the baseline section, where it guarantees E_{\tau\sim\pi_\theta}[\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)]=0, so a baseline can be subtracted without biasing the gradient

    Code implementation: PyTorch / TensorFlow (a minimal sketch follows)
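    A minimal sketch of the estimator above, assuming PyTorch and a categorical policy; simple_pg_loss is an illustrative helper, not the reference implementation. The "loss" is just a quantity whose gradient equals the sample estimate of \nabla_\theta J(\pi_\theta), with a sign flip so that gradient descent performs gradient ascent on J:

    ```python
    import torch
    from torch.distributions import Categorical

    def simple_pg_loss(logits, actions, returns):
        """Pseudo-loss whose gradient is the basic policy-gradient estimator.

        logits:  (N, n_actions) policy-network outputs for the visited states
        actions: (N,) actions actually taken
        returns: (N,) R(tau) of the trajectory each step belongs to
        """
        logp = Categorical(logits=logits).log_prob(actions)   # log pi_theta(a_t|s_t)
        return -(logp * returns).mean()                       # minimize => ascend on J(pi_theta)
    ```

    Calling backward() on this loss and taking an optimizer step gives one gradient-ascent update on J(\pi_\theta).
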
  2. Reward-to-Go Policy Gradient: a serious problem with the basic policy gradient is that the same return R(\tau) is used to weight every \pi_\theta(a_t|s_t) in a trajectory. This is unreasonable: within a trajectory, only the rewards obtained at or after time t depend on \pi_\theta(a_t|s_t), while rewards obtained before time t have nothing to do with it. The weight should therefore be the reward-to-go from each state-action pair:
    \nabla_\theta J(\pi_\theta) \approx \frac{1}{|\mathcal D|} \sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t|s_t)\sum_{t'=t}^T R(s_{t'},a_{t'},s_{t'+1})
    Note: a key practical question for policy gradients is how many trajectories must be collected so that the sample estimate of the gradient has low variance and is therefore accurate. In the basic version, the rewards obtained before an action are pure noise with respect to that action, which inflates the variance of the training samples, and the only remedy is to collect more trajectories. Dropping these past rewards removes that noise, makes the sample estimate of the policy gradient more accurate, and reduces the number of trajectories needed

    Code implementation: PyTorch / TensorFlow (a reward-to-go sketch follows)
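    A minimal sketch of the reward-to-go weights; the function name is illustrative. With gamma=1.0 this reproduces the undiscounted sum above, and in practice a discount factor is often used:

    ```python
    import torch

    def reward_to_go(rews, gamma=1.0):
        """R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for a single trajectory."""
        rtg = torch.zeros(len(rews))
        running = 0.0
        for t in reversed(range(len(rews))):
            running = float(rews[t]) + gamma * running
            rtg[t] = running
        return rtg
    ```

    These per-step weights replace R(\tau) in the loss sketch from the previous item.
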
  3. Baselines in Policy Gradients
    \nabla_\theta J(\pi_\theta) = E_{\tau\sim\pi_\theta}\left[\sum_{t=0}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\left(\sum_{t'=t}^TR(s_{t'},a_{t'},s_{t'+1}) - b(s_t)\right)\right]
    where b(s) is called a baseline; it depends only on the state s and serves to reduce the variance of the sample estimate of the policy gradient. Empirically, b(s_t) is usually taken to be the on-policy value function V^\pi(s_t) = E_{a_t\sim\pi}[Q^\pi(s_t,a_t)]. Note that, by the EGLP lemma, the baseline term has expectation 0, so subtracting it does not change the expected gradient. In practice the value function is approximated by a neural network V_\phi(s_t), fitted by mean-squared error (a fitting sketch follows):
    \phi_k = \arg\min_\phi E_{s_t,\hat R_t \sim \pi_k} \left[\big(V_\phi(s_t) - \hat R_t\big)^2\right]
    where \pi_k is the policy from the k-th iteration and \hat R_t are the observed reward-to-go values collected under it
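    A minimal sketch of fitting that baseline, assuming PyTorch; value_net, v_optim and the sizes are illustrative:

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical value network V_phi(s) used as the baseline b(s_t)
    value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
    v_optim = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    def fit_baseline(obs, rtg, iters=80):
        """Regress V_phi(s_t) onto the observed reward-to-go targets."""
        for _ in range(iters):
            v_loss = ((value_net(obs).squeeze(-1) - rtg) ** 2).mean()
            v_optim.zero_grad(); v_loss.backward(); v_optim.step()

    # In the policy update, the per-step weight becomes rtg - value_net(obs).squeeze(-1).detach()
    ```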

  4. Other choices of the weight \Phi_t
    \nabla_\theta J(\pi_\theta) = E_{\tau\sim\pi_\theta}\left[\sum_{t=0}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\,\Phi_t\right]

    • \Phi_t=R(\tau): the basic policy gradient
    • \Phi_t = \sum_{t'=t}^T R(s_{t'},a_{t'},s_{t'+1}): the reward-to-go policy gradient
    • \Phi_t = \sum_{t'=t}^T R(s_{t'},a_{t'},s_{t'+1}) - b(s_t): the policy gradient with a baseline
    • \Phi_t = Q^{\pi_\theta}(s_t,a_t): the on-policy action-value function
    • \Phi_t = A^{\pi_\theta}(s_t,a_t) = Q^{\pi_\theta}(s_t,a_t) - V^{\pi_\theta}(s_t): the advantage function, which describes how much better or worse an action is than other actions on average (relative to the current policy)

Vanilla Policy Gradient (VPG)

  1. Properties

    • on-policy
    • works for both discrete and continuous action spaces
  2. Objective
    \nabla_\theta J(\pi_\theta) = E_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^T \nabla_\theta\log\pi_\theta(a_t|s_t)\,A^{\pi_\theta}(s_t,a_t) \right]
    The objective is optimized with stochastic gradient ascent (SGA):
    \theta_{k+1} = \theta_k + \alpha\,\nabla_\theta J(\pi_\theta)\big|_{\theta_k}
    In practice the advantage-function estimates are computed with the infinite-horizon discounted return, even though the rest of the formulation uses the finite-horizon undiscounted return (a minimal single-update sketch follows below)

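    A minimal single-update sketch of VPG under these definitions, assuming PyTorch. The networks, sizes, and the one-step advantage estimate are illustrative (reference implementations typically use GAE instead):

    ```python
    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    obs_dim, n_actions = 4, 2
    policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
    value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    pi_optim = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
    v_optim = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    def vpg_update(obs, acts, rews, next_obs, rtg, gamma=0.99):
        # Advantage estimate A_t ~= r_t + gamma * V(s_{t+1}) - V(s_t); no gradient through V here
        with torch.no_grad():
            adv = rews + gamma * value_net(next_obs).squeeze(-1) - value_net(obs).squeeze(-1)
        # Policy step: gradient ascent on E[ log pi_theta(a_t|s_t) * A_t ]
        logp = Categorical(logits=policy_net(obs)).log_prob(acts)
        pi_loss = -(logp * adv).mean()
        pi_optim.zero_grad(); pi_loss.backward(); pi_optim.step()
        # Value step: regress V(s_t) onto the reward-to-go targets
        v_loss = ((value_net(obs).squeeze(-1) - rtg) ** 2).mean()
        v_optim.zero_grad(); v_loss.backward(); v_optim.step()
    ```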

Trust Region Policy Optimization (TRPO)

  1. Properties

    • on-policy
    • works for both discrete and continuous action spaces
  2. The question it addresses: how can we take the largest possible improvement step on the policy using the data we currently have, without stepping so far that performance accidentally collapses?

  3. Core idea: update the policy by taking the largest step that improves performance, subject to the constraint that the new and old policies remain close, where closeness is measured by the KL divergence between the policies rather than by distance in parameter space (my understanding: this closeness constraint limits how fast the policy changes, preventing an update from collapsing policy performance and making training more stable)

  4. Theoretical objective
    \begin{aligned} \theta_{k+1} &= \arg\max_\theta \mathcal L(\theta_k,\theta),\;\text{s.t.}\,\overline D_{KL}(\theta\|\theta_k) \leq \delta \\ \mathcal L(\theta_k,\theta) &= E_{s,a\sim\pi_{\theta_k}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a) \right] \\ \overline D_{KL}(\theta\|\theta_k) &= E_{s\sim\pi_{\theta_k}}[D_{KL}(\pi_\theta(\cdot|s)\|\pi_{\theta_k}(\cdot|s))] \end{aligned}
    where \mathcal L(\theta_k,\theta) is the surrogate advantage, which measures how the new policy \pi_\theta performs relative to the old policy \pi_{\theta_k} using data collected from the old policy. Note that when \theta=\theta_k, \mathcal L(\theta_k,\theta)=0 and \overline D_{KL}(\theta\|\theta_k)=0

  5. The theoretical objective is inconvenient to optimize directly, so both the objective and the constraint are Taylor-expanded around \theta_k (the resulting approximate problem is solved in closed form below):
    \begin{aligned} \mathcal L(\theta_k,\theta) &\approx g^T(\theta - \theta_k) \\ \overline D_{KL}(\theta\|\theta_k) &\approx \frac{1}{2} (\theta-\theta_k)^T H (\theta-\theta_k) \end{aligned}
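    Here g is the gradient of the surrogate advantage and H is the Hessian of the average KL divergence, both evaluated at \theta_k. Under this quadratic approximation the constrained problem has a closed-form solution (the standard TRPO derivation; in practice H^{-1}g is approximated with conjugate gradient and a backtracking line search enforces the exact constraint):
    \theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}}\,H^{-1} g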

Proximal Policy Optimization (PPO)

  1. Two variants

    • PPO-Penalty: approximately solves a KL-constrained update like TRPO, but penalizes the KL divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately
    • PPO-Clip: has no KL-divergence term in the objective and no constraint at all; instead, it relies on specialized clipping of the objective function to remove any incentive for the new policy to move far away from the old policy
  2. Properties

    • on-policy
    • works for both discrete and continuous action spaces
  3. The goal is the same as TRPO's, but the optimization is simpler: the constraint in TRPO is dropped, and the PPO-Clip objective instead limits indirectly how far the new policy can move from the old one
    \begin{aligned} \theta_{k+1} &= \arg\max_\theta E_{s,a\sim\pi_{\theta_k}}[L(s,a,\theta_k,\theta)] \\ L(s,a,\theta_k,\theta) &= \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\;\text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},1-\epsilon,1+\epsilon\right)A^{\pi_{\theta_k}}(s,a) \right) \end{aligned}
    where \epsilon is a small positive hyperparameter that controls how far the new policy is allowed to move from the old policy.

    Note that here
    A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)
    where Q^\pi(s_t,a_t) is estimated as r_t + \gamma V^\pi(s_{t+1}), so Q does not need its own neural network; only V is a NN. PPO therefore updates only two networks, the policy \pi and the value function V. When the policy network \pi is updated, the gradient enters only through \pi_\theta(a|s) in the formula for L, while the value network V is optimized by regression onto the return. In other words, the training loss is composed of the policy network's L and the value network's squared error (a minimal sketch of the clipped loss follows at the end of this item).

    The L above has a simplified equivalent form:
    \begin{aligned} L(s,a,\theta_k,\theta) &= \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\, g(\epsilon,A^{\pi_{\theta_k}}(s,a)) \right) \\ g(\epsilon,A) &= \begin{cases} (1+\epsilon)A, &\text{if}\;A \geq 0 \\ (1-\epsilon)A, &\text{otherwise} \end{cases} \end{aligned}
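    A minimal sketch of the clipped objective L, assuming PyTorch; ppo_clip_loss is an illustrative helper, adv is an advantage estimate as in the earlier sketches, and the old log-probabilities are assumed to have been stored when the data was collected:

    ```python
    import torch

    def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
        """Negative PPO-Clip surrogate (to be minimized), following the formula for L above."""
        ratio = torch.exp(logp_new - logp_old)                # pi_theta(a|s) / pi_theta_k(a|s)
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv  # clip(ratio, 1-eps, 1+eps) * A
        return -torch.min(ratio * adv, clipped).mean()
    ```

    The value network is trained separately by regression onto the return, as described above; together the two losses make up a PPO update.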
