【Read an English Original in 5 Minutes】Sutton, the Father of Reinforcement Learning: A Quick Guide to the Concepts of Finite Markov Decision Processes
Here is how this all started..
My graduation project is about to plunge into reinforcement learning,
and my advisor strongly recommended starting from Reinforcement Learning: An Introduction by Sutton, the father of reinforcement learning.
Alright!! The dreaded English original!! Here I come T T
Now, let's get down to business.
MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards.
In other words, an MDP is the classical formalization of sequential decision making: an action affects not only the immediate reward but also the subsequent situations, or states, and through them all future rewards.
The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
In short: the learner and decision maker is called the agent.
Everything it interacts with, meaning everything outside the agent, is called the environment.
The two interact continually: the agent selects actions, the environment responds to those actions and presents new situations to the agent, and the environment also produces rewards, which the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, t=0,1,2,3,… At each time step, the agent receives some representation of the environment’s state,St ∈ S, and on that basis selects an action, At ∈ A(s).
More concretely, the agent and environment interact at a sequence of discrete time steps t = 0, 1, 2, 3, … At each time step the agent receives some representation of the environment's state St (the "state") and on that basis selects an action At (the "action").
One time step later, in part as a consequence of its action, the agent receives a numerical reward, Rt+1 ∈ R ⊂ ℝ, and finds itself in a new state, St+1. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:
S0,A0,R1,S1,A1,R2,S2,A2,R3,…
One time step later, partly as a consequence of its action, the agent receives a numerical reward Rt+1 (the "Reward") and finds itself in a new state St+1. The MDP and agent together thereby produce a sequence, or trajectory, like this:
S0,A0,R1,S1,A1,R2,S2,A2,R3,…
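To make this loop concrete, here is a minimal Python sketch (my own illustration with a made-up environment, not code from the book) of the agent-environment interaction producing a trajectory S0, A0, R1, S1, A1, R2, …:

```python
import random

class CoinFlipEnv:
    """A made-up environment: the state is 0 or 1 and may flip at each step."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The chosen action changes the chance of moving to the other state.
        flip_prob = 0.8 if action == "switch" else 0.2
        if random.random() < flip_prob:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

env = CoinFlipEnv()
state = env.reset()                               # S0
trajectory = [state]
for t in range(3):
    action = random.choice(["stay", "switch"])    # A_t, here picked at random
    next_state, reward = env.step(action)         # environment returns R_{t+1}, S_{t+1}
    trajectory += [action, reward, next_state]
    state = next_state
print(trajectory)  # [S0, A0, R1, S1, A1, R2, S2, A2, R3, S3]
```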
In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.
In other words, actions can be any decisions we want to learn how to make, and states can be anything we can know that might be useful in making them.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent.
The general rule is that anything the agent cannot change arbitrarily is considered outside of it and therefore part of its environment; however, this does not mean the agent knows nothing about the environment.
In fact, in some cases the agent may know everything about how its environment works and still face a difficult reinforcement learning task, just as we may know exactly how a puzzle like Rubik’s cube works, but still be unable to solve it. The agent–environment boundary represents the limit of the agent’s absolute control, not of its knowledge.
In fact, in some cases the agent may know everything about how its environment works and still face a difficult reinforcement learning task, just as we may know exactly how a Rubik's cube works and still be unable to solve it. The agent-environment boundary marks the limit of the agent's absolute control, not of its knowledge.
The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).
The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. Any such problem can be reduced to three signals passing back and forth between an agent and its environment: one signal representing the choices the agent makes (the actions), one representing the basis on which those choices are made (the states), and one defining the agent's goal (the rewards).
➡️ Concept definitions
(1) State-transition probabilities
That is, the probability of moving from the current state s to the next state s′, given that action a is taken.
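For reference, in the book's notation the dynamics of the MDP and the state-transition probabilities are written as:

p(s′, r | s, a) ≐ Pr{ S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a }
p(s′ | s, a) ≐ Pr{ S_t = s′ | S_{t−1} = s, A_{t−1} = a } = Σ_r p(s′, r | s, a)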
(2) Expected Return
If the sequence of rewards received after time step t is denoted Rt+1, Rt+2, Rt+3, …, then what precise aspect of this sequence do we wish to maximize? In general, we seek to maximize the expected return, where the return, denoted Gt, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards: Gt = Rt+1 + Rt+2 + Rt+3 + ··· + RT, where T is a final time step.
If the sequence of rewards received after time step t is written Rt+1, Rt+2, Rt+3, …, what exactly about this sequence do we want to maximize? In general we seek to maximize the expected return, where the return Gt is defined as some specific function of the reward sequence. In the simplest case the return is just the sum of the rewards:
Gt = Rt+1 + Rt+2 + Rt+3 + … + RT, where T is the final time step.
This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game, trips through a maze, or any sort of repeated interaction. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
This approach makes sense when there is a natural notion of a final time step, that is, when the agent-environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game, trips through a maze, or any kind of repeated interaction. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span. We call these continuing tasks. The return formulation Gt = Rt+1 + Rt+2 + Rt+3 + ··· + RT is problematic for continuing tasks because the final time step would be T = ∞, and the return, which is what we are trying to maximize, could itself easily be infinite. (For example, suppose the agent receives a reward of +1 at each time step.)
In other words, in many cases the agent-environment interaction does not break naturally into identifiable episodes but goes on continually without limit, for example an on-going process-control task or a robot with a long life span. We call these continuing tasks. The return formulation Gt = Rt+1 + Rt+2 + Rt+3 + ··· + RT is problematic here, because the final time step would be T = ∞ and the return, which is what we are trying to maximize, could itself easily be infinite. (For example, suppose the agent receives a reward of +1 at every time step.)
(3) Discounting
The additional concept that we need is that of discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses At to maximize the expected discounted return
The additional concept we need is discounting. Under this approach the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized; in particular, it chooses At to maximize the expected discounted return.
⬇️ The formula below defines the expected discounted return
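From the book, the discounted return is defined as:

G_t ≐ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ··· = Σ_{k=0}^{∞} γ^k R_{t+k+1},  where 0 ≤ γ ≤ 1 is the discount rate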
The discount rate determines the present value of future rewards: a reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately. If γ < 1, the infinite sum above has a finite value as long as the reward sequence {Rk} is bounded.
That is, the discount rate determines the present value of future rewards: a reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately. If γ < 1, the infinite sum above is finite as long as the reward sequence {Rk} is bounded.
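As a quick sanity check, here is a small Python sketch (my own illustration, not from the book) that computes a discounted return and shows that a constant +1 reward stream stays finite when γ < 1 but keeps growing when γ = 1:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate from the end using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A constant +1 reward stream, truncated at 1000 steps:
print(discounted_return([1.0] * 1000, gamma=0.9))  # about 10.0, i.e. close to 1 / (1 - gamma)
print(discounted_return([1.0] * 1000, gamma=1.0))  # 1000.0, which grows without bound as T increases
```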
Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning.
That is, the return at time t can be written in terms of the return at time t+1, a recursion that the theory and algorithms of reinforcement learning rely on heavily.
⬇️ The formula below is the recursive relation between returns at successive time steps; be careful to distinguish it from the definition above ⚠️
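From the book, the recursion is:

G_t ≐ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + ···
    = R_{t+1} + γ ( R_{t+2} + γ R_{t+3} + γ² R_{t+4} + ··· )
    = R_{t+1} + γ G_{t+1}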
(4) Policy and Value Functions
Almost all reinforcement learning algorithms involve estimating value functions—functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
That is, almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
Formally, a policy is a mapping from states to probabilities of selecting each possible action.
In other words, a policy is a mapping from states to the probabilities of selecting each possible action.
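To make this concrete, here is a tiny Python sketch (my own illustration, with made-up states and actions) that represents a stochastic policy π(a|s) as a dictionary and samples an action from it:

```python
import random

# A stochastic policy pi(a|s): for each state, a probability distribution over actions.
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Draw an action according to pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s1"))  # "right" about 90% of the time
```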
The value of a state s under a policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter. For MDPs, we can define vπ formally by
The value of a state s under a policy π, written vπ(s), is the expected return when starting in s and following π thereafter. For MDPs we can define vπ formally by:
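From the book, the formal definition is:

v_π(s) ≐ E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ],  for all s ∈ S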
⬇️ The formula below is the famous Bellman equation
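From the book, the Bellman equation for vπ reads:

v_π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ v_π(s′) ],  for all s ∈ S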
⚠️ Recall that in probability theory an expectation is a sum of probability × value; here that means the probability p(s′, r | s, a) that state s and action a lead to next state s′ with reward r, multiplied by the corresponding return.
The Bellman equation averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
That is, the Bellman equation averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
⬇️ The backup diagram in the book illustrates the Bellman equation nicely and is worth studying carefully; the small code sketch below walks through the same computation.
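Here is a minimal Python sketch (my own illustration, using a hypothetical two-state MDP) that evaluates vπ by repeatedly applying the Bellman equation until the values settle:

```python
# Hypothetical 2-state MDP: dynamics[s][a] is a list of (prob, next_state, reward) triples.
dynamics = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
policy = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 0.9, "go": 0.1}}
gamma = 0.9

# Iterative policy evaluation: sweep the Bellman equation for v_pi until convergence.
v = {s: 0.0 for s in dynamics}
for _ in range(1000):
    v = {
        s: sum(
            pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[s][a])
            for a, pi_a in policy[s].items()
        )
        for s in dynamics
    }
print(v)  # approximate v_pi(s) for each state
```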
Similarly, we define the value of taking action a in state s under a policy π, denoted qπ(s,a), as the expected return starting from s, taking the action a, and thereafter following policy π:
In other words, we define the value of taking action a in state s under a policy π, written qπ(s,a), as the expected return starting from s, taking the action a, and thereafter following policy π:
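From the book, the definition is:

q_π(s, a) ≐ E_π[ G_t | S_t = s, A_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]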
(5) Optimal Policies and Optimal Value Functions
For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states. They share the same state-value function, called the optimal state-value function, denoted v∗, and defined as
For finite MDPs we can define an optimal policy precisely as follows. Value functions define a partial ordering over policies: a policy π is better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states. There is always at least one policy that is better than or equal to all the others; these are the optimal policies, and they share the same state-value function, called the optimal state-value function, denoted v∗ and defined as:
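From the book:

v_∗(s) ≐ max_π v_π(s),  for all s ∈ S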
Optimal policies also share the same optimal action-value function, denoted q∗, and defined as
Optimal policies also share the same optimal action-value function, denoted q∗ and defined as:
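From the book:

q_∗(s, a) ≐ max_π q_π(s, a),  for all s ∈ S and a ∈ A(s)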
⬇️ Below is the Bellman optimality equation
Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
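From the book, the Bellman optimality equations for v∗ and q∗ are:

v_∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v_∗(s′) ]
q_∗(s, a) = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q_∗(s′, a′) ]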
Below is a passage the author wrote about the Bellman optimality equation that really helps build intuition.
If you have the optimal value function, v∗, then the actions that appear best after a one-step search will be optimal actions. Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function v∗ is an optimal policy. The term greedy is used in computer science to describe any search or decision procedure that selects alternatives based only on local or immediate considerations, without considering the possibility that such a selection may prevent future access to even better alternatives. Consequently, it describes policies that select actions based only on their short-term consequences. The beauty of v∗ is that if one uses it to evaluate the short-term consequences of actions—specifically, the one-step consequences—then a greedy policy is actually optimal in the long-term sense in which we are interested because v∗ already takes into account the reward consequences of all possible future behavior. By means of v∗, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.
In short: if you have the optimal value function v∗, then the actions that look best after a one-step search are in fact optimal actions. Put another way, any policy that is greedy with respect to v∗ is an optimal policy. "Greedy" is used in computer science for any search or decision procedure that selects alternatives based only on local or immediate considerations, without considering whether such a selection may prevent access to even better alternatives later; it therefore describes policies that choose actions based only on their short-term consequences. The beauty of v∗ is that if we use it to evaluate the short-term (one-step) consequences of actions, a greedy policy is actually optimal in the long-term sense we care about, because v∗ already accounts for the reward consequences of all possible future behavior. By means of v∗, the optimal expected long-term return becomes a quantity that is locally and immediately available in each state, so a one-step-ahead search yields the long-term optimal actions.
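Here is a small Python sketch of that one-step-ahead search (my own illustration, reusing the shape of the hypothetical MDP from the earlier sketch): given an estimate of v∗, the greedy action is the one that maximizes the expected one-step return:

```python
# Hypothetical 2-state MDP: dynamics[s][a] is a list of (prob, next_state, reward) triples.
dynamics = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}

def greedy_action(dynamics, v, state, gamma):
    """One-step lookahead: argmax_a of sum_{s',r} p(s',r|s,a) * (r + gamma * v[s'])."""
    def backup(action):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[state][action])
    return max(dynamics[state], key=backup)

# With v equal (or close) to v*, acting greedily in every state is an optimal policy.
v_star = {"s0": 12.0, "s1": 16.0}  # hypothetical values, for illustration only
print(greedy_action(dynamics, v_star, "s0", gamma=0.9))  # -> "go"
```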
Reading it through, the concept explanations in the original English book really are remarkably thorough; Sutton truly deserves the title of father of reinforcement learning!! Everything suddenly clicked hhh
I'll be posting blog entries from time to time to share what I learn about reinforcement learning; everyone is welcome to come and discuss.
If I've gotten anything wrong, please point it out so we can all improve together.
Reference: Reinforcement Learning: An Introduction, http://incompleteideas.net/book/bookdraft2017nov5.pdf